
Reward Distribution

Table of contents

  1. The reward distribution task
  2. Redirecting meta-level questions to the dialog market
  3. The grounding problem
  4. Strategies
    1. Start with what can be relied on
    2. Combine many weak heuristics
    3. Defer to more trusted systems
    4. Defer evaluation to the future
    5. Grow slowly, check behavior at each stage
  5. Reward distribution using escalating bets
    1. Betting on reward choices
    2. Escalation on disagreement
    3. The end of a dialog
  6. Reward distribution using probabilistic models
    1. Elicit cheap, uninformative signals at runtime
    2. Elicit expensive, informative signals after the fact
    3. Learn a model that relates cheap and expensive signals
    4. Incentivize cheap signals to be informative
  7. Summary

The reward distribution task

Recall the setup: a person asks a (possibly very vague) question, which starts a dialog. This person offers some monetary reward that goes towards funding the dialog. Imagine the dialog as a tree of short contributions. Contributors—both humans and automated systems—add follow-up questions, comments, and partial answers to this tree. The goal is to produce a dialog that is as helpful as possible for solving the problem that prompted the person to start the dialog. The task that the dialog market faces is to distribute the reward over these contributions in proportion to their helpfulness, such that helpful contributions are incentivized.

This reward distribution task cannot be left (only) to the original asker. First, they will not always be in the best position to judge how helpful contributions are (e.g., consider medical advice, where they don’t have the required knowledge; or dialogs with multiple partial solutions where the problem of assigning causal responsibility for the overall solution is itself a difficult problem). Second, I anticipate that some dialogs will be very large (thousands of contributions, say), with subdialogs that don’t involve the original asker much, and it would be burdensome to require the original asker to judge all of them.

At the same time, the distribution of reward needs to be grounded in the original asker’s values; a contribution is only helpful to the extent that it contributes to producing a dialog that is helpful for this person.

So, how do you distribute reward over contributions in a way that reflects the original asker’s values? This is the reward distribution problem.

Redirecting meta-level questions to the dialog market

The reward distribution problem can be viewed as just another set of questions (“How much reward should be assigned to this contribution?”). Hence, it is natural to wonder: can we outsource questions like this to the dialog market as well?

This is appealing, since it allows arbitrary considerations to be taken into account when judging contributions and assigning reward, and since any improvements to the object-level system also apply to the evaluation mechanism.

Generating judgments through argumentation can be much better than pure voting. Votes provide very little information (a number + author identity + what is being voted on, and when)—it can be difficult to tell, for example, whether two votes are given for independent or redundant reasons, which makes aggregation challenging. Arguments, on the other hand, can more easily be judged on their own terms, and combined in a way that is sensitive to their content.

Therefore, it seems desirable to solve the reward distribution problem by redirecting meta-questions such as “How helpful is this contribution?” back to the dialog market, and to answer them using the same mechanisms that are used to answer other questions.

Of course, this cannot happen for every contribution (as this would result in infinite regress). So, we might only investigate with some probability, or ask more general questions such as “How helpful are Bob’s contributions in this dialog?”.

The grounding problem

Redirecting meta-level questions to the market raises the grounding problem: if the dialog market works, then meta-level arguments and judgments will be good, which leads to good market incentives, and therefore to the overall system working; if the dialog market doesn’t work, the meta-level judgments will be bad as well, and the market will incentivize the wrong kinds of contributions.

In other words, there is a chicken-and-egg problem: to get a working dialog market (with respect to object-level contributions), you already need a working dialog market (with respect to meta-level contributions).


Strategies

I don’t have a completely satisfying solution to the grounding problem yet. I will first list some general directions that seem promising, and then two concrete approaches, but there is still a lot of room for improvement.

Start with what can be relied on

Any solution to the grounding problem will have to make use of some information sources, and will assume that they satisfy certain criteria. One way of approaching the grounding problem is to ask: what information sources are there? We need to be careful about incentives—sources that are usually reliable may become unreliable in the presence of market incentives.

Here is a list of some information sources:

  • I (as the system designer) will use my best judgment about what is helpful. (Of course, my best judgment in any given moment may not be very good.)

  • I trust certain market participants to use their best judgment (to some degree). This trust can be topic-specific, and transitive (to some degree).

  • The original asker will use their best judgment when they rate contributions or answer follow-up questions. This is no longer true when there are dependencies between dialogs, e.g. through reputation effects (Bob might answer his own questions using sockpuppet accounts, and rate these answers highly to increase the probability of getting high reward for future answers).

  • When people bet substantial amounts of money, they will bet according to their best judgment, unless the side-effects of betting otherwise outweigh the gains from winning.

  • Some answers can be verified, which makes it easier to judge them, and therefore to judge others’ judgments:

    • Some answers can be checked empirically.

    • Some answers can be checked logically.

  • We can require answers to be given in a form that is easy to judge.

  • When identities are expensive, participants will avoid actions that lead to identities becoming useless (“destroying their reputation”), unless benefits outweigh costs.

  • We can probably make some assumptions about contributions in general. It is unclear what exactly they are (“there is some signal in some contributions, probably”), but an adaptive system may be able to learn this over time.

Combine many weak heuristics

In judging contributions, there may be many considerations that can be relied on weakly (e.g., reputation, agreement between raters), and no single information source that can be strongly relied on. However, this still raises the question of what the process looks like that integrates these considerations. Ideally, these considerations can come in the form of arguments in the dialog system. However, this brings us back to the start: how do we integrate these arguments, and judge which are good? For any particular way of integrating them, how do we have assurance that it will lead to good judgments? Can we build a reward distribution system that learns over time how to combine such heuristics?

Defer to more trusted systems

When we have different evaluation methods, or multiple information sources, and it is costly to evaluate in depth, we can choose to only evaluate in depth sometimes. This choice can be made randomly (defer with some probability) or based on discussion features (e.g., defer when there is disagreement). The results of such deferral can be taken into account by reputation systems (if, whenever we evaluated in depth, Bob’s judgments turned out to be good, we might want to rely on them more in the future).
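
As a rough illustration of the two deferral triggers and the reputation update described above, here is a minimal Python sketch. Everything in it is hypothetical: the function and class names, the use of the rating spread as the "disagreement" feature, and the simple counting prior for reputation are assumptions made for the example, not part of the proposal.

```python
import random

def should_evaluate_in_depth(ratings, audit_probability, disagreement_threshold, rng=random):
    """Decide whether a contribution gets the costly in-depth evaluation."""
    # Feature-based deferral: raters disagree by more than the threshold.
    if ratings and max(ratings) - min(ratings) > disagreement_threshold:
        return True
    # Random deferral: audit a fraction of the remaining contributions.
    return rng.random() < audit_probability


class Reputation:
    """Running estimate of how often a rater's judgments hold up under in-depth checks."""

    def __init__(self):
        # Assumed prior: one good and one bad pseudo-observation.
        self.good = 1
        self.checked = 2

    def record(self, judgment_was_good):
        # Called only when an in-depth evaluation actually took place.
        self.checked += 1
        if judgment_was_good:
            self.good += 1

    def trust(self):
        # Fraction of checked judgments that turned out to be good.
        return self.good / self.checked
```

A rater whose judgments were checked three times and held up each time would move from the prior trust of 0.5 to 0.8 under this (assumed) counting rule.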

A tempting option for deferral is to increase the reward with some probability; e.g., with probability p, offer 1/p times the usual reward, up to some upper bound. However, this already assumes that the dialog market works when rewards are high, so this option may not help with the fundamental grounding problem, although it could be useful in boosting a partial solution.

Deferral to more trusted groups of market participants doesn’t suffer from this problem, but also seems problematic on its own, as it could limit the overall capability of the system to what this trusted group can verify on their own.

A potentially promising idea is to combine deferral to more trusted participants and deferral to the market with higher rewards: defer to more trusted participants, but provide them with some (potentially large) amount of reward to spend on the dialog market. This allows the trusted participants to make use of any capabilities that may be present in the dialog market, but they may also ignore the market to the extent that it isn’t useful. For instance, a trusted participant may ask the market to judge a contribution and to provide easily understandable justifications; then she can take these considerations into account to the extent that she understands them. As presented, this proposal seems like it would still significantly restrict the capability of the market (limiting ideal meta-level judgments to arguments that a single trusted participant can understand, augmented by the use of the market to help them in their understanding). However, there may be proposals of similar shape that are less restrictive.

Defer evaluation to the future

A particular type of deferral to a more trusted system is deferral to the future. As time goes on, we tend to collect more information that may help us determine to what extent contributions are helpful. For example, we learn more about whether advice helped and whether predictions came true.

One way to make use of this effect is to make payoffs temporally heavy-tailed: e.g., spend 50% of the reward for a dialog by the deadline given by the user, then 50% of the remainder after 2x this time, 50% of what is then left after 4x, again after 8x, etc. This could lead to dialogs becoming better over time, which allows better judgment of how good the early contributions were, and which thus incentivizes good early contributions.
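
The halving schedule can be written down in a few lines. This is only a sketch: the function name is hypothetical, the deadline is assumed to be expressed as a duration measured from the start of the dialog, and the number of payout rounds is an arbitrary cutoff chosen for the example.

```python
def payout_schedule(total_reward, deadline, rounds):
    """Return ([(time, amount), ...], unspent) for a halving payout schedule.

    Half of the remaining reward is released at deadline, 2 * deadline,
    4 * deadline, and so on, for the given number of rounds.
    """
    schedule = []
    remaining = total_reward
    for k in range(rounds):
        amount = remaining / 2          # release half of what is left
        schedule.append((deadline * 2 ** k, amount))
        remaining -= amount
    return schedule, remaining


# Example: a reward of 100 with a 7-day deadline, over three payout rounds.
schedule, unspent = payout_schedule(100.0, 7, 3)
# schedule == [(7, 50.0), (14, 25.0), (28, 12.5)], unspent == 12.5
```

The unspent remainder never reaches zero, which is what makes the tail heavy: some reward is always held back for later, better-informed judgments.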

Reputation can be used to amplify the effects of the small residual payments. If Alice’s judgments tend to turn out to be good after many months or years of deliberation, we should probably trust and reward her more.

Grow slowly, check behavior at each stage

Finally, one can give a non-answer: simply implement a dialog market, restrict it to a particularly cooperative set of participants, see how well it works, grow it slowly, check for failures and degradation of performance at each stage, and apply sophisticated reward distribution strategies as they become necessary. This is a reasonable strategy, but I’d prefer to follow it in conjunction with more theoretical guarantees, not as the sole strategy.

Reward distribution using escalating bets

I will now give an example of how one might solve the grounding problem. This solution relies on people betting in their self-interest and on the original asker using their best judgment (informed by information provided by the dialog market).

There is a lot of room for solutions that combine the ingredients given in the previous section in different ways. The particular approach I’m showing here mainly serves to illustrate that this is indeed a solvable problem. It is directly based on ideas by Paul Christiano (Of Arguments and Wagers; personal communication).

Before I go into details, here is an overview of what happens from start to end in a dialog under this proposal:

  1. Start: The asker—we’ll call her Alice—starts a dialog by submitting a question, reward, and expiration date/time.

  2. Contributions: Alice and other users (bots and humans) post follow-up questions and other responses to the dialog, make edits, and contribute in arbitrary other ways.

  3. Bets: For each contribution, users can bet on how much Alice would want to pay to the contributor, under reflection. The stakes for these bets are fixed and initially small.

  4. Escalation: At expiration time, the system checks all bets, and escalates ones with disagreement, potentially all the way up to Alice.

  5. Conclusion: When all bets are settled, either by agreement or by Alice’s decree, a single best answer has been selected for each of the reward questions. The system enacts these conclusions by paying out the corresponding rewards.

I will now talk about some of these stages in more detail.

Betting on reward choices

For each contribution, users can bet on how much Alice would want to pay if she thought about it (or, more precisely, what payment will be selected by an escalation process that involves increasingly well-funded dialog markets investigating what to select, and that potentially ends with Alice making a decision; see below).

Bets happen simultaneously with other contributions. A little while before the dialog expires, we stop all contributions. From this point on, users can only update their bets.

Users can edit their bets at any time before the expiration time.

Adding a contribution may automatically set up a small default bet by its author on the corresponding payment question, both (1) to incentivize other users to bet and (2) to disincentivize spam.

Escalation on disagreement

For contributions where the bets don’t indicate substantial disagreement about how much to pay, we determine the payment using the data provided by bets.

For contributions where bets do indicate disagreement, we escalate. This means:

  1. We increase the stake required for the bets, and set a new expiration time for the bets.

  2. We implement arbitrage on the bets, and use the guaranteed profit to:

    • set up a system bet for the next round (to incentivize participation)

    • fund a dialog market that investigates the question: “How much should Alice pay for [contribution]?” (if the profit exceeds some threshold)

We repeat steps 1 and 2, incrementally increasing the stake. If the bets indicate agreement at any step in the ladder, the process stops and all bets are settled.

For each stake size, we compute how many users are eligible to participate in the bets. As the stake grows, this number shrinks. If this number is very small, we directly ask Alice to pick the answer she likes best, using the dialog markets funded by arbitrage profits as a source of advice. This settles all bets.
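
The control flow of the escalation ladder can be sketched as follows. This is a simplified illustration under several assumptions that are not fixed by the proposal: bets are represented as proposed payment amounts, "disagreement" means that the spread of the bets exceeds an absolute tolerance, the agreed payment is taken to be the median bet, and the stake grows by a constant factor per round. The arbitrage step and the funding of an investigating dialog market are left out. All names (settle_payment, collect_bets, ask_alice, and the parameters) are hypothetical.

```python
from statistics import median

def settle_payment(collect_bets, balances, ask_alice,
                   initial_stake=1.0, stake_factor=10.0,
                   tolerance=0.1, min_eligible=3):
    """Escalate until the bets agree or too few users can afford the stake.

    collect_bets(stake, eligible_users) -> list of proposed payments
    balances: dict mapping user -> funds available for staking
    ask_alice() -> payment chosen directly by the asker
    Returns (payment, how_it_was_settled).
    """
    stake = initial_stake
    while True:
        # Only users who can cover the current stake may take part.
        eligible = [user for user, funds in balances.items() if funds >= stake]
        if len(eligible) < min_eligible:
            # Too few bettors left: the asker decides, which settles all bets.
            return ask_alice(), "asker"
        bets = collect_bets(stake, eligible)
        if bets and max(bets) - min(bets) <= tolerance:
            # Bets agree closely enough: settle on the median bet.
            return median(bets), "agreement"
        # Disagreement (or no bets): raise the stake and run another round.
        stake *= stake_factor
```

With finite balances and min_eligible of at least 1, the loop terminates, because the growing stake eventually leaves too few eligible users and the decision falls to the asker.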

The end of a dialog

When all bets are settled and the dialog concludes, market participants have selected the payment that they expect Alice to choose if she thought about the choice and used the dialog market as a tool for investigation.

Reward distribution using probabilistic models

I will now describe a second approach to solving the reward distribution problem. This approach doesn’t directly support meta-level discussion, and so doesn’t run into the grounding problem, but is also less powerful than approaches that do support meta-level discussion. I am including it here as an example of a less complete solution that may be more practical early on.

This approach is fairly independent of the dialog setup, and so if we build it, we could also apply it in other scenarios where we’d like to set up incentives for behavior that’s good under some expensive/long-term metric, where we can measure many indirect proxies of goodness, but only get a few high-quality measurements, and where we want to incentivize people to make these proxies as informative as possible. For example, this sort of procedure could be used to distribute rewards to the authors of commits or pull requests in a git repository.

Elicit cheap, uninformative signals at runtime

While the dialog is happening, we elicit easy-to-provide information (such as upvotes/downvotes, likes, 1-5 star ratings) from the original asker and from other contributors. (This sort of information doesn’t need to have special status within the system—it could be implemented via replies.)

This information is cheap to get, but it may reflect long-run helpfulness only in very indirect or noisy ways, or sometimes not at all. It will also generally not directly speak to the usefulness of structure changes, edits, and deletions. We can only make very weak assumptions about how this information relates to the long-run helpfulness of particular changes, at least initially.

This information also serves to guide the dialog while it is happening. For example, if the original asker likes a contribution, other participants may take this to indicate that they want to encourage similar contributions.

Elicit expensive, informative signals after the fact

A while after the dialog has concluded, perhaps after the asker has taken a decision recommended within the dialog and observed its outcome, we require the asker to answer questions to help us determine how much to pay for each contribution. We have some flexibility in what questions we ask here (“How helpful was the dialog overall?”, “How much should we pay for change x?”, “Was change x more helpful than change y?”, “How much worse would the dialog have been if change x hadn’t happened?”, etc.), but we will be limited to a relatively small number of questions. We definitely cannot ask about each contribution.

In contrast to the questions we asked while the dialog was happening, we will make strong assumptions about how the questions asked here relate to long-run goodness, and thus to how much we should pay. In other words, we assume a mostly fixed semantics for these questions. As an extreme case, we could simply ask “How much do you want to pay for change x, knowing what you know now?” for a few contributions and take that at face value. (However, this might be a particularly difficult question for people to answer, so other questions with comparably clear semantics might be preferable.)

Learn a model that relates cheap and expensive signals

Since we can only ask a few questions in step 2 (the expensive post-dialog questions), we need to determine (a) what questions to ask and (b) how much to pay for contributions that the asker doesn’t directly evaluate. To this end, we will learn a probabilistic model (across dialogs) that relates the early/cheap pieces of information to the answers to post-dialog questions, and thus to facts about how much to pay per contribution. For example, we might learn that when Bob indicates that he likes a contribution in dialogs related to where to go on vacation, it will frequently turn out that the asker rates it highly in retrospect.

Given this model, we ask the post-dialog questions that, in expectation, most reduce the variance of our distribution on how much to pay (within this dialog, or taking into account future dialogs). We pay contributors based on the expected payments under this model.
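
To make the question selection concrete, here is a toy sketch under strong simplifying assumptions that are mine, not part of the proposal: the learned model is assumed to have already produced an independent Gaussian belief (mean, variance) about the right payment for each contribution, and a post-dialog answer is assumed to be a noisy direct observation of that payment with known noise variance. Under these assumptions, the expected variance reduction from asking about a contribution has a closed form, so we can rank contributions by it. All function names are hypothetical.

```python
def expected_variance_reduction(prior_var, noise_var):
    """Variance removed by one noisy observation of a Gaussian quantity."""
    return prior_var ** 2 / (prior_var + noise_var)


def choose_questions(beliefs, noise_var, budget):
    """Pick the contributions whose evaluation reduces total variance most.

    beliefs: dict mapping contribution id -> (mean, variance)
    """
    ranked = sorted(
        beliefs,
        key=lambda c: expected_variance_reduction(beliefs[c][1], noise_var),
        reverse=True,
    )
    return ranked[:budget]


def update_belief(belief, answer, noise_var):
    """Standard Gaussian update of (mean, variance) after the asker's answer."""
    mean, var = belief
    gain = var / (var + noise_var)
    return mean + gain * (answer - mean), (1 - gain) * var


def expected_payments(beliefs):
    """Pay each contribution its expected payment under the current beliefs."""
    return {c: mean for c, (mean, _) in beliefs.items()}
```

A realistic model would have correlated beliefs, so that an answer about one contribution also informs the payments for others. The independence assumption here only keeps the example short.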

Incentivize cheap signals to be informative

We don’t just want to reward people for helpful contributions, but also for helping us figure out how much to pay by providing the cheap runtime signals from step 1. I don’t know what approach to choose here, but it could be something like taking a constant fraction of the dialog reward, and paying each info provider in proportion to the information gain between the distribution on payments excluding their data point, and including it (e.g., with and without taking into account the fact that Bob liked a particular contribution); or before/after their data point.
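
One possible reading of this idea, sketched in Python: measure each provider's information gain as the KL divergence between the payment distribution that uses all signals and the distribution with that provider's signal left out, then split the info budget in proportion. The choice of KL divergence, the discretization of payments into a few levels, and all names are assumptions made for illustration.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions of equal length.

    Assumes q[i] > 0 wherever p[i] > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def split_info_budget(budget, with_all, without):
    """Split the budget across providers in proportion to information gain.

    with_all: distribution over payment levels using every signal
    without: dict mapping provider -> distribution with their signal removed
    """
    gains = {who: kl_divergence(with_all, dist) for who, dist in without.items()}
    total = sum(gains.values())
    if total == 0:
        # Nobody's signal changed the distribution; nothing to pay out.
        return {who: 0.0 for who in gains}
    return {who: budget * gain / total for who, gain in gains.items()}
```

For instance, if removing Bob's like shifts the distribution noticeably while removing Carol's rating leaves it unchanged, Bob would receive the entire info budget under this rule.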


Summary

It is appealing to consider dialog markets that solve the reward distribution problem by redirecting meta-level judgment questions back to the market. This raises the grounding problem: the quality of judgments will only be good if the market works, which in turn depends on good judgments. To solve this problem, it seems necessary to apply some additional techniques, such as relying on people’s self-interest when betting or relying on the original asker’s judgment. I have described a candidate for a reward distribution system that attempts to solve the grounding problem, but haven’t given any serious arguments that it will work, and so expect that (in its current form) it will run into issues. I have also described another approach to reward distribution that is less powerful but perhaps more practical initially. Overall, the question of what reward distribution systems work well in practice is still wide open.