PPO and Reward Design: Training a Chatbot Policy Network That Actually Works

A chatbot policy network is a small neural network that decides how to compose responses from knowledge fragments. It is trained with reinforcement learning, specifically Proximal Policy Optimization (PPO). The trickiest part of the entire training pipeline is not the algorithm. It is the reward function.

This post walks through the rewards we use, the ones that failed, and the design principles that emerged.

Why PPO

We picked PPO for two reasons. First, it is stable. Unlike vanilla policy gradient, PPO clips the policy update to prevent destructively large steps, which is critical when the reward signal is noisy (real user feedback is noisy). Second, it is simple. PPO has fewer hyperparameters than alternatives like SAC or TD3, which matters when the team is small and the compute is limited.

The PPO objective is:

L(θ) = E[ min( r(θ) * A, clip(r(θ), 1-ε, 1+ε) * A ) ]

Where r(θ) is the ratio of new to old policy probabilities, A is the advantage estimate, and ε is the clip range. Because the ratio is clipped to [1-ε, 1+ε], the objective gives the new policy no incentive to move far from the old one in a single update.
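The objective above can be sketched in a few lines of NumPy. This is a minimal illustration, not our training code: the function name is hypothetical, and the default eps=0.2 is an assumed value, since this post does not state the ε we use.

```python
import numpy as np

def ppo_clip_objective(new_logp, old_logp, advantages, eps=0.2):
    """Clipped surrogate objective, averaged over a batch (to be maximized).

    eps=0.2 is an assumed default, not a value taken from this post.
    """
    new_logp = np.asarray(new_logp, dtype=float)
    old_logp = np.asarray(old_logp, dtype=float)
    advantages = np.asarray(advantages, dtype=float)

    # r(θ): probability ratio, computed from log-probabilities for stability.
    ratio = np.exp(new_logp - old_logp)

    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages

    # Element-wise minimum of the two terms, then the batch mean (the E[...]).
    return float(np.mean(np.minimum(unclipped, clipped)))
```

Note the asymmetry the minimum creates: with a positive advantage, a ratio above 1+ε earns no extra credit, while with a negative advantage a ratio above 1+ε is still penalized in full.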

The Reward Function

Our reward is a weighted sum of several signals, each normalized to roughly [0, 1]:

reward = 0.35 * relevance
        + 0.25 * completeness
        + 0.15 * conciseness
        + 0.15 * engagement
        + 0.10 * retrieval_grounding

The weights are not magic — they came out of an A/B test on real users. The single biggest predictor of user satisfaction was relevance, so it gets the largest weight. But removing any one of the others hurt overall quality.
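As a sketch, the weighted sum looks like this in Python. The weights are the ones listed above; the function name and the clamping of each signal to [0, 1] are assumptions for illustration (the signals are only "roughly" normalized, so some guard is needed, but this post does not say which one we use).

```python
WEIGHTS = {
    "relevance": 0.35,
    "completeness": 0.25,
    "conciseness": 0.15,
    "engagement": 0.15,
    "retrieval_grounding": 0.10,
}

def combine_reward(signals):
    """Weighted sum of the five signals; `signals` maps name -> score."""
    missing = set(WEIGHTS) - set(signals)
    if missing:
        raise KeyError(f"missing signals: {sorted(missing)}")

    total = 0.0
    for name, weight in WEIGHTS.items():
        # Clamp to [0, 1] (assumed guard against loosely normalized signals).
        value = min(1.0, max(0.0, float(signals[name])))
        total += weight * value
    return total
```

Because the weights sum to 1, the combined reward stays in [0, 1] as long as each signal does.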

What Each Signal Measures

Relevance — does the response address the user's query? Computed by a held-out embedding model that scores (query, response) similarity. We use a separate embedding model for evaluation to avoid self-bias.
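A minimal sketch of the relevance score, assuming the held-out model has already produced one embedding vector for the query and one for the response. Cosine similarity and the rescaling from [-1, 1] to [0, 1] are assumptions here; the post only says the model scores (query, response) similarity.

```python
import numpy as np

def relevance_score(query_vec, response_vec):
    """Cosine similarity between two embeddings, rescaled to [0, 1]."""
    q = np.asarray(query_vec, dtype=float)
    r = np.asarray(response_vec, dtype=float)

    denom = np.linalg.norm(q) * np.linalg.norm(r)
    if denom == 0.0:
        return 0.0  # a zero vector carries no signal (assumed convention)

    cosine = float(np.dot(q, r) / denom)
    # Map [-1, 1] onto [0, 1] so the signal fits the weighted sum.
    return (cosine + 1.0) / 2.0
```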

Completeness — did the response cover the key aspects of the query? Measured by checking whether the response mentions the entities and concepts that the retriever considered relevant. This is essentially a recall metric on the entities.
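Entity recall can be sketched as follows. The function name is hypothetical, and case-insensitive substring matching is an assumed simplification: it is crude (it will match "clip" inside "clips"), and a production version would likely need proper entity matching.

```python
def completeness_score(response, expected_entities):
    """Fraction of expected entities that appear in the response (recall)."""
    text = response.lower()
    entities = {e.strip().lower() for e in expected_entities if e.strip()}
    if not entities:
        return 1.0  # nothing to cover (assumed convention)

    # Case-insensitive substring match (assumed simplification).
    hits = sum(1 for entity in entities if entity in text)
    return hits / len(entities)
```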

Conciseness — is the response appropriately brief? A penalty for responses that are too long (measured in word count) and a small reward for being concise. The optimum depends on the query — a definition should be 1-2 sentences, an explanation can be longer.

Engagement — did the user continue the conversation? Did they ask a follow-up? Did they rephrase (suggesting the first response was unclear)? This is the noisiest signal but the most honest one.

Retrieval grounding — is every claim in the response traceable to a retrieved fragment? Computed as the fraction of sentences in the response that overlap (lexically) with at least one retrieved fragment. A response that has 50% ungrounded text is hallucinating, even if it is fluent.
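The lexical version of this metric can be sketched with the standard library alone. The sentence splitter, the tokenizer, and the min_overlap threshold (the share of a sentence's words that must appear in a single fragment) are all assumptions for illustration; the post only specifies "fraction of sentences that overlap lexically with at least one fragment."

```python
import re

def _tokens(text):
    """Lowercased set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def grounding_score(response, fragments, min_overlap=0.5):
    """Fraction of response sentences lexically covered by some fragment.

    min_overlap=0.5 is an assumed threshold, not a value from this post.
    """
    # Naive split on sentence-ending punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", response.strip()) if s]
    if not sentences:
        return 0.0

    fragment_tokens = [_tokens(f) for f in fragments]
    grounded = 0
    for sentence in sentences:
        words = _tokens(sentence)
        if not words:
            continue  # punctuation-only sentence counts as ungrounded
        if any(len(words & ft) / len(words) >= min_overlap
               for ft in fragment_tokens):
            grounded += 1
    return grounded / len(sentences)
```

As noted later in this post, a purely lexical check like this misses paraphrases, so it undercounts grounded sentences.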

What Failed

We tried several reward formulations that did not work.

BLEU/ROUGE scores against ground truth. These are classic NLG metrics, but they reward surface-level overlap with a reference, not actual quality. A response that paraphrases correctly can have low BLEU. A response that copies the reference verbatim (useless) can have high BLEU. We removed them.

Pure engagement reward. "Did the user keep chatting?" is a tempting signal — the user is voting with their attention. But engagement is heavily biased toward entertaining responses, not accurate ones. A chatbot that tells jokes would have high engagement. We down-weighted this significantly and added the grounding term as a counterweight.

User-rated thumbs up/down. We added thumbs up/down buttons for a while. The data was too sparse (most users do not rate) and biased (users who rate are not representative), so we removed them.

Length penalty alone. We tried penalizing long responses to combat verbosity. It caused the policy to over-truncate, sometimes cutting off mid-explanation. The conciseness term is now a target rather than a pure penalty: it rewards hitting a length appropriate to the query.
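The difference between a penalty and a target is easiest to see in code. Below is a sketch of a target-shaped conciseness term; the triangular shape, the tolerance parameter, and the idea that a per-query target_words is supplied by the caller are all assumptions, not our exact formulation.

```python
def conciseness_score(word_count, target_words, tolerance=0.5):
    """Score peaks at 1.0 when word_count == target_words and falls
    linearly to 0.0 once the relative deviation reaches `tolerance`.

    Both the symmetric shape and tolerance=0.5 are assumed for illustration.
    """
    if target_words <= 0:
        raise ValueError("target_words must be positive")
    deviation = abs(word_count - target_words) / target_words
    return max(0.0, 1.0 - deviation / tolerance)
```

With a target of 30 words, a 36-word response scores about 0.6, while both a 1-word answer and a 45-word answer score 0.0. A pure length penalty would have rated the 1-word answer highest, which is exactly the over-truncation problem.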

What Surprisingly Mattered

The most important lesson was that the combination of signals matters more than any individual one. A policy trained on just relevance was good but verbose. A policy trained on just conciseness was good but incomplete. Combining them in the right ratios took weeks of iteration.

The second surprise was that the grounding term mattered even though it was hard to compute. We initially used a simple lexical overlap. It was noisy and missed paraphrases. But even this crude grounding signal was the strongest defense against hallucination. The policy learned: "if I want to maximize my reward, every sentence I write should come from a retrieved fragment." That single insight is what made the system trustworthy.

The third surprise was that the reward function is a hyperparameter. We spent more time tuning the reward than tuning the network architecture. The 25→128→64→6 MLP worked the first time. The reward function is on its 7th iteration and will probably keep changing.
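For reference, here is roughly what a network of that shape looks like as a NumPy forward pass. Only the layer sizes come from this post; the ReLU activations, the softmax over 6 outputs (treating them as discrete actions), and the He-style initialization are assumptions.

```python
import numpy as np

LAYER_SIZES = [25, 128, 64, 6]  # from the post: 25 -> 128 -> 64 -> 6

def init_params(rng):
    """Random weights and zero biases for each layer (He-style init, assumed)."""
    params = []
    for fan_in, fan_out in zip(LAYER_SIZES[:-1], LAYER_SIZES[1:]):
        w = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))
        b = np.zeros(fan_out)
        params.append((w, b))
    return params

def policy_forward(params, features):
    """Map a 25-dim feature vector (or a batch) to 6 action probabilities."""
    x = np.asarray(features, dtype=float)
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)  # ReLU on hidden layers (assumed)
    # Numerically stable softmax over the last axis (assumed output layer).
    logits = x - np.max(x, axis=-1, keepdims=True)
    probs = np.exp(logits)
    return probs / np.sum(probs, axis=-1, keepdims=True)
```

A network this size has under 12,000 parameters, which is part of why the architecture was never the bottleneck.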

The Dark Art of Reward Shaping

Reward shaping is the part of RL that nobody teaches you in a textbook. The textbook says "design a reward that captures the objective." In practice you design a reward, train, look at the failure modes, adjust, retrain, and iterate. The reward is a hypothesis about what good behavior is, and training is the experiment that tests it.

The trap is reward hacking — the policy finds a way to maximize the reward without doing what you actually want. Common forms:

Maximize retrieval grounding by quoting fragments verbatim (looks good, reads poorly)
Maximize conciseness by giving one-word answers to everything
Maximize engagement by being entertaining at the expense of accuracy

The fix is always the same: add a signal that detects the hack and penalize it. Once you spot a hack, you cannot unsee it — and the only cure is another term in the reward function.
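As an illustration of that pattern, here is a sketch of a detector for the first hack in the list (verbatim quoting) and a penalty built on it. The function names, the exact-substring test, and penalty_weight=0.2 are hypothetical; this post does not describe the detectors we actually use.

```python
import re

def verbatim_fraction(response, fragments):
    """Fraction of response sentences copied word-for-word from a fragment."""
    sentences = [s.strip().lower()
                 for s in re.split(r"(?<=[.!?])\s+", response.strip())
                 if s.strip()]
    if not sentences:
        return 0.0
    haystacks = [f.lower() for f in fragments]
    # A sentence counts as copied if it appears as an exact substring.
    copied = sum(1 for s in sentences if any(s in h for h in haystacks))
    return copied / len(sentences)

def penalized_reward(base_reward, response, fragments, penalty_weight=0.2):
    """Subtract a penalty proportional to the copied fraction.

    penalty_weight=0.2 is an assumed value for illustration.
    """
    return base_reward - penalty_weight * verbatim_fraction(response, fragments)
```

The detector pulls against the grounding term on purpose: grounding rewards overlap with fragments, and this term penalizes the degenerate way of achieving it.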

Where We Are Now

The current policy is the 7th iteration. It is not the last. As we collect more interaction data and find new failure modes, the reward will continue to evolve. PPO gives us a stable optimization target, but the target itself is the part that takes the most thought.

If you are training an RL-based system, invest most of your time in the reward function. The algorithm is a solved problem. The objective is not.