Every modern chatbot has to make a fundamental choice: generate responses token-by-token, or retrieve pre-written fragments and stitch them together. ReLU.chat picked the second path. This post explains why, and what we learned about doing it well.

The Generation Trap

Large language models are remarkable. They write coherent, contextually appropriate text on almost any topic. For an open-ended creative assistant, that is exactly what you want.

For a knowledge-grounded chatbot — a game-theory tutor, a documentation assistant, a scientific reference — generation has a fatal problem: hallucination. A model can produce text that is fluent, confident, and completely wrong. In a domain where users trust the bot to be accurate, that is unacceptable.

We tried the LLM path first. It worked for the happy path — clean queries, well-covered topics. It failed for edge cases:

The issue is the accuracy floor. With generation you can never guarantee that a given answer is grounded in the source.

The Retrieval Alternative

A retrieval-based system works differently. You maintain a knowledge base of pre-written fragments, each one verified for accuracy. At query time you:

  1. Find the fragments most relevant to the query
  2. Decide which ones to include
  3. Connect them with linguistic glue
  4. Return the composed result

The output is always traceable to a source. Every fact comes from a fragment that a human wrote and verified. The bot cannot hallucinate facts because it never generates new factual content: it can only assemble existing fragments and join them with connectors.

The tradeoff is flexibility. If the user asks something not covered by the knowledge base, the system has to say so. There is no fallback to "sounds-plausible-but-wrong." That is the whole point.
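
The "say so" behavior can be sketched as a relevance cutoff. This is a minimal illustration, not ReLU.chat's actual code: the `ScoredFragment` shape, the `MIN_RELEVANCE` value, the refusal wording, and the `answerOrDecline` helper are all assumptions made for the example.

```typescript
// Hypothetical shape: a candidate fragment plus a relevance score in [0, 1].
interface ScoredFragment {
  id: string;
  body: string;
  score: number;
}

// Assumed cutoff; a real system would tune this on held-out queries.
const MIN_RELEVANCE = 0.5;

// Assumed wording for the explicit "not covered" reply.
const NOT_COVERED = "I don't have a verified answer for that in my knowledge base.";

// Return the best fragment's text, or an explicit refusal when no
// candidate clears the cutoff. There is no generated fallback.
function answerOrDecline(
  candidates: ScoredFragment[],
  minScore: number = MIN_RELEVANCE
): string {
  const covered = candidates.filter((c) => c.score >= minScore);
  if (covered.length === 0) {
    return NOT_COVERED;
  }
  // Pick the highest-scoring fragment without mutating the input array.
  const best = covered.reduce((a, b) => (b.score > a.score ? b : a));
  return best.body;
}
```

The point of the sketch is that the refusal is an ordinary return value on the same code path as an answer, so a low-coverage query cannot slip through to a plausible-sounding reply.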

How ReLU.chat Composes

A knowledge fragment looks like:

{
  id: "gt:nash:def",
  topic: "nash-equilibrium",
  intent: "definition",
  body: "A Nash equilibrium is a set of strategies, one for each player, such that no player can benefit by unilaterally changing their strategy while the other players' strategies remain unchanged.",
  tags: ["game-theory", "nash", "definition"]
}

Fragments are small (1-3 sentences), focused on a single concept, and tagged with topic + intent. The intent tag is critical: it tells the policy network whether this fragment is a definition, an example, a proof sketch, a comparison, etc.
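
One way to pin down that shape is a typed interface plus a small size check. This is a sketch under assumptions: the field names come from the fragment shown earlier, but the `Intent` string values other than "definition", the sentence-counting heuristic, and the `validateFragment` helper are hypothetical.

```typescript
// Intent labels named in the post; only "definition" appears verbatim there,
// so the other spellings are assumed, and the real set may be larger.
type Intent = "definition" | "example" | "proof-sketch" | "comparison";

interface Fragment {
  id: string;
  topic: string;
  intent: Intent;
  body: string;
  tags: string[];
}

// Rough sentence count: split on runs of . ! ? followed by whitespace or
// end of string. Abbreviations like "e.g. x" would be over-counted.
function countSentences(text: string): number {
  return text
    .split(/[.!?]+(?:\s+|$)/)
    .filter((s) => s.trim().length > 0).length;
}

// Returns a list of problems; an empty list means the fragment passes.
function validateFragment(f: Fragment): string[] {
  const problems: string[] = [];
  if (f.id.trim() === "") problems.push("missing id");
  if (f.topic.trim() === "") problems.push("missing topic");
  const n = countSentences(f.body);
  if (n < 1 || n > 3) problems.push(`body has ${n} sentences, expected 1-3`);
  return problems;
}

const nashDef: Fragment = {
  id: "gt:nash:def",
  topic: "nash-equilibrium",
  intent: "definition",
  body:
    "A Nash equilibrium is a set of strategies, one for each player, such that no player can benefit by unilaterally changing their strategy while the other players' strategies remain unchanged.",
  tags: ["game-theory", "nash", "definition"],
};

const problems = validateFragment(nashDef);
```

Typing `intent` as a closed union means a misspelled intent fails at compile time rather than reaching the policy network as an unknown label.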

At query time the pipeline is:

  1. Retrieve — find top-k candidate fragments via BM25 + dense retrieval (see our retrieval post)
  2. Rank — score candidates with the policy network
  3. Select — pick the fragments to include (typically 1-3)
  4. Order — decide the sequence (e.g., definition before example)
  5. Connect — insert linguistic connectors ("To illustrate this,", "In contrast,", "Specifically,")
  6. Format — wrap in LaTeX for math, code blocks for snippets, lists for enumerations

The result reads naturally because the connectors are doing real linguistic work, and the fragments themselves are written in a consistent style.
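
The select, order, and connect steps can be sketched as follows. This is an illustration only: the `score` field stands in for the policy network's ranking, and the intent ordering, the intent-to-connector mapping, and the `compose` and `lowerFirst` helpers are assumptions, not the production logic.

```typescript
type Intent = "definition" | "example" | "proof-sketch" | "comparison";

interface Candidate {
  id: string;
  intent: Intent;
  body: string;
  score: number; // stand-in for the policy network's ranking score
}

// Assumed ordering: lower rank appears earlier (definition before example).
const INTENT_ORDER: Record<Intent, number> = {
  definition: 0,
  "proof-sketch": 1,
  example: 2,
  comparison: 3,
};

// Assumed mapping from intent to one of the connectors quoted in the post.
const CONNECTOR: Record<Intent, string> = {
  definition: "",
  "proof-sketch": "Specifically,",
  example: "To illustrate this,",
  comparison: "In contrast,",
};

// Naive: lowercases the first letter so the fragment can follow a connector.
// It would wrongly lowercase a fragment that starts with a proper noun.
function lowerFirst(text: string): string {
  return text.charAt(0).toLowerCase() + text.slice(1);
}

function compose(candidates: Candidate[], maxFragments: number = 3): string {
  // Select: keep the highest-scoring candidates (copy first, do not mutate input).
  const selected = candidates
    .slice()
    .sort((a, b) => b.score - a.score)
    .slice(0, maxFragments);
  // Order: arrange the selected fragments by intent.
  selected.sort((a, b) => INTENT_ORDER[a.intent] - INTENT_ORDER[b.intent]);
  // Connect: the first fragment stands alone; later ones get a connector.
  const parts = selected.map((c, i) => {
    const connector = i === 0 ? "" : CONNECTOR[c.intent];
    return connector === "" ? c.body : `${connector} ${lowerFirst(c.body)}`;
  });
  return parts.join(" ");
}
```

In this sketch the connector is chosen from the intent tag alone. That is the simplest rule that makes the intent tag useful, and a learned policy could replace the two lookup tables without changing the rest of the function.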

What the Policy Network Decides

The 6 action heads of the policy network control:

The policy learns these from interaction data via reinforcement learning (see our RL post). It is not a hand-tuned heuristic — it is a 13K-parameter MLP trained on thousands of real interactions.
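
To give a sense of scale, here is a sketch of a one-hidden-layer MLP with six output heads. Everything concrete in it is an assumption: the layer sizes are picked only so the total lands near 13K parameters, the heads are left unnamed, and the greedy argmax in `act` is a simplification (a policy trained with reinforcement learning may well sample instead).

```typescript
// Hypothetical sizes, chosen only so the total lands near 13K parameters.
const INPUT_DIM = 64;
const HIDDEN_DIM = 128;
const HEAD_SIZES = [6, 6, 6, 6, 6, 6]; // six heads; options per head are assumed

// Weights plus biases for a one-hidden-layer MLP with several output heads.
function parameterCount(inputDim: number, hiddenDim: number, headSizes: number[]): number {
  const hidden = inputDim * hiddenDim + hiddenDim;
  const heads = headSizes.reduce((sum, size) => sum + hiddenDim * size + size, 0);
  return hidden + heads;
}

function relu(x: number[]): number[] {
  return x.map((v) => Math.max(0, v));
}

// y = W x + b, with W stored as an array of rows.
function dense(w: number[][], b: number[], x: number[]): number[] {
  return w.map((row, i) => row.reduce((acc, wij, j) => acc + wij * x[j], b[i]));
}

// Index of the largest value; ties resolve to the earliest index.
function argmax(x: number[]): number {
  return x.reduce((best, v, i) => (v > x[best] ? i : best), 0);
}

interface Head {
  w: number[][];
  b: number[];
}

interface Policy {
  w1: number[][];
  b1: number[];
  heads: Head[];
}

// One action index per head: shared hidden layer, then a greedy choice per head.
function act(policy: Policy, features: number[]): number[] {
  const h = relu(dense(policy.w1, policy.b1, features));
  return policy.heads.map((head) => argmax(dense(head.w, head.b, h)));
}

const total = parameterCount(INPUT_DIM, HIDDEN_DIM, HEAD_SIZES); // 12964
```

With these assumed sizes the hidden layer accounts for 8,320 parameters and the six heads for 4,644, which shows how small a network of this kind is compared with a generative model.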

What Goes Wrong

Retrieval + composition is not magic. The failure modes are different from generation but they exist:

The mitigations:

Why We Picked This

For a knowledge-grounded chatbot, retrieval + composition gives you:

We trade flexibility for accuracy, and we trade the magic of generation for the honesty of retrieval. For our use case — domain-specific, accuracy-critical, privacy-first — that is exactly the right tradeoff.