Every modern chatbot has to make a fundamental choice: generate responses token-by-token, or retrieve pre-written fragments and stitch them together. ReLU.chat picked the second path. This post explains why, and what we learned about doing it well.
The Generation Trap
Large language models are remarkable. They write coherent, contextually appropriate text on almost any topic. For an open-ended creative assistant, that is exactly what you want.
For a knowledge-grounded chatbot — a game-theory tutor, a documentation assistant, a scientific reference — generation has a fatal problem: hallucination. A model can produce text that is fluent, confident, and completely wrong. In a domain where users trust the bot to be accurate, that is unacceptable.
We tried the LLM path first. It worked for the happy path — clean queries, well-covered topics. It failed for edge cases:
- A user asks about "a theorem in game theory" and gets a paraphrase of a different theorem
- A query about a specific equation gets a formula that is almost right but subtly wrong
- A user asks for the year of an event and the model invents a plausible but incorrect date
The Retrieval Alternative
A retrieval-based system works differently. You maintain a knowledge base of pre-written fragments, each one verified for accuracy. At query time you:
- Find the fragments most relevant to the query
- Decide which ones to include
- Connect them with linguistic glue
- Return the composed result
The tradeoff is flexibility. If the user asks something not covered by the knowledge base, the system has to say so. There is no fallback to "sounds-plausible-but-wrong." That is the whole point.
How ReLU.chat Composes
A knowledge fragment looks like:
{
  id: "gt:nash:def",
  topic: "nash-equilibrium",
  intent: "definition",
  body: "A Nash equilibrium is a set of strategies, one for each player, such that no player can benefit by unilaterally changing their strategy while the other players' strategies remain unchanged.",
  tags: ["game-theory", "nash", "definition"]
}
Fragments are small (1-3 sentences), focused on a single concept, and tagged with topic + intent. The intent tag is critical: it tells the policy network whether this fragment is a definition, an example, a proof sketch, a comparison, etc.
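To make the shape concrete, here is a minimal TypeScript sketch of a fragment type with a rough size check. The `Intent` union lists only the intents named above, and `sentenceCount` and `isWellFormed` are hypothetical helpers for illustration, not part of ReLU.chat's actual code.

```typescript
// Hypothetical type for the fragment shape shown above.
// The real set of intents may be larger; these are the ones named in the text.
type Intent = "definition" | "example" | "proof-sketch" | "comparison";

interface Fragment {
  id: string;
  topic: string;
  intent: Intent;
  body: string;
  tags: string[];
}

// Approximates the "1-3 sentences" rule by counting runs of sentence-ending
// punctuation followed by whitespace or end of text. Abbreviations such as
// "e.g." would be miscounted, so treat this as a rough lint, not a parser.
function sentenceCount(body: string): number {
  const matches = body.match(/[.!?]+(\s|$)/g);
  return matches === null ? 0 : matches.length;
}

// A fragment passes if it has an id, a topic, and one to three sentences.
function isWellFormed(f: Fragment): boolean {
  const n = sentenceCount(f.body);
  return f.id.length > 0 && f.topic.length > 0 && n >= 1 && n <= 3;
}

const nashDefinition: Fragment = {
  id: "gt:nash:def",
  topic: "nash-equilibrium",
  intent: "definition",
  body: "A Nash equilibrium is a set of strategies, one for each player, such that no player can benefit by unilaterally changing their strategy while the other players' strategies remain unchanged.",
  tags: ["game-theory", "nash", "definition"],
};
```

Typing `intent` as a closed union means a misspelled intent tag fails at compile time rather than silently confusing the ranking step.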
At query time the pipeline is:
- Retrieve — find top-k candidate fragments via BM25 + dense retrieval (see our retrieval post)
- Rank — score candidates with the policy network
- Select — pick the fragments to include (typically 1-3)
- Order — decide the sequence (e.g., definition before example)
- Connect — insert linguistic connectors ("To illustrate this,", "In contrast,", "Specifically,")
- Format — wrap in LaTeX for math, code blocks for snippets, lists for enumerations
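The select, order, and connect steps can be sketched in a few lines of TypeScript. This is an illustration under stated assumptions, not our production code: the `score` field stands in for the policy network's output, and the `INTENT_ORDER` and `CONNECTOR` tables are hypothetical choices. Retrieval and formatting are left out.

```typescript
type Intent = "definition" | "example" | "proof-sketch" | "comparison";

interface Scored {
  body: string;
  intent: Intent;
  score: number; // stand-in for the policy network's ranking score
}

// Assumed presentation order: definition first, proof sketch last.
const INTENT_ORDER: Record<Intent, number> = {
  definition: 0,
  comparison: 1,
  example: 2,
  "proof-sketch": 3,
};

// Assumed connector per intent, applied to every fragment after the first.
const CONNECTOR: Record<Intent, string> = {
  definition: "Specifically,",
  comparison: "In contrast,",
  example: "To illustrate this,",
  "proof-sketch": "Specifically,",
};

// Naive: lowercases the first character so the sentence reads on after a
// connector. It would wrongly lowercase a leading proper noun.
function lowerFirst(s: string): string {
  return s.charAt(0).toLowerCase() + s.slice(1);
}

function compose(candidates: Scored[], maxFragments = 3): string {
  // Select: keep the highest-scoring candidates (copy first, sort mutates).
  const selected = [...candidates]
    .sort((a, b) => b.score - a.score)
    .slice(0, maxFragments);

  // Order: arrange the survivors by intent, e.g. definition before example.
  selected.sort((a, b) => INTENT_ORDER[a.intent] - INTENT_ORDER[b.intent]);

  // Connect: the first fragment stands alone, later ones get a connector.
  return selected
    .map((f, i) =>
      i === 0 ? f.body : `${CONNECTOR[f.intent]} ${lowerFirst(f.body)}`
    )
    .join(" ");
}
```

Keeping selection and ordering as separate sorts makes it easy to change one without the other: the score decides what gets in, the intent decides where it goes.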
What the Policy Network Decides
The policy network has six action heads. They control:
- Fragment selection — which fragments from the candidate set to include
- Response length — concise (one fragment) vs detailed (multiple fragments + connectors)
- Connector style — formal vs conversational, with or without transitions
- Confidence display — whether to show source attribution, hedge the answer, or do neither
- Creativity balance — pure retrieval vs light paraphrasing of fragment text
- Follow-up generation — whether to suggest related topics at the end
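One way to picture the output of these heads is as a single typed record. The sketch below is hypothetical: the field names and value sets are our illustration of the six decisions listed above, not the network's real output format. `normalize` shows how the length head constrains the others, following the rule that a concise response is one fragment with no connectors.

```typescript
// Hypothetical encoding of the six action heads, one field per head.
interface PolicyAction {
  selectedIds: string[];                                // fragment selection
  length: "concise" | "detailed";                       // response length
  connectorStyle: "formal" | "conversational" | "none"; // connector style
  confidenceDisplay: "attribution" | "hedge" | "none";  // confidence display
  allowParaphrase: boolean;                             // creativity balance
  suggestFollowUps: boolean;                            // follow-up generation
}

// Makes the heads agree with each other: a concise response keeps only the
// first selected fragment and drops connectors. Detailed actions pass through.
function normalize(action: PolicyAction): PolicyAction {
  if (action.length === "concise") {
    return {
      ...action,
      selectedIds: action.selectedIds.slice(0, 1),
      connectorStyle: "none",
    };
  }
  return action;
}
```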
What Goes Wrong
Retrieval + composition is not magic. The failure modes are different from generation but they exist:
- Coverage gaps — the knowledge base simply does not cover the query. The bot has to say "I don't have information on that."
- Stitching artifacts — the connectors can sound formulaic if overused. The policy has to learn when not to connect.
- Stale fragments — the knowledge base does not auto-update. A new theorem published yesterday is not in the bot's KB.
- Topic drift — the retriever returns fragments that are semantically related but topically wrong. A query about "Nash equilibrium" might return a fragment about "Walrasian equilibrium" if the embeddings are confused.
Our mitigations, one per failure mode:
- For coverage: detect low-confidence retrievals and explicitly say so
- For stitching: train the policy to use connectors sparingly
- For staleness: this is a maintenance burden; KB updates are manual
- For drift: use the intent tag in retrieval scoring — a query looking for a definition should weight definition-tagged fragments higher
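The coverage and drift mitigations can be combined in one small scoring step. The sketch below assumes retrieval scores normalized to the range 0 to 1. `INTENT_BOOST` and `MIN_CONFIDENCE` are made-up values for illustration, and `pickOrDecline` is a hypothetical name, not a function from our codebase.

```typescript
type Intent = "definition" | "example" | "proof-sketch" | "comparison";

interface Candidate {
  id: string;
  intent: Intent;
  retrievalScore: number; // assumed to be normalized to [0, 1]
}

const INTENT_BOOST = 1.5;   // assumed multiplier for matching intent
const MIN_CONFIDENCE = 0.4; // assumed threshold for "we cover this"

// Drift mitigation: fragments whose intent matches the query's intent
// are weighted higher than fragments that are merely similar.
function rescore(c: Candidate, queryIntent: Intent): number {
  return c.intent === queryIntent
    ? c.retrievalScore * INTENT_BOOST
    : c.retrievalScore;
}

// Returns the best candidate, or null when the bot should say
// "I don't have information on that."
function pickOrDecline(
  candidates: Candidate[],
  queryIntent: Intent
): Candidate | null {
  let best: Candidate | null = null;
  let bestScore = -Infinity;
  for (const c of candidates) {
    const s = rescore(c, queryIntent);
    if (s > bestScore) {
      best = c;
      bestScore = s;
    }
  }
  // Coverage mitigation: test the raw score, so the intent boost
  // cannot push a weak match over the threshold.
  return best !== null && best.retrievalScore >= MIN_CONFIDENCE ? best : null;
}
```

Checking the threshold against the raw score rather than the boosted one is a deliberate choice in this sketch: the boost should reorder plausible candidates, not manufacture confidence.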
Why We Picked This
For a knowledge-grounded chatbot, retrieval + composition gives you:
- Verifiable accuracy — every claim is traceable
- Smaller models — there is no need for a 7B-parameter LLM; a 22MB retriever is enough
- Faster inference — retrieval takes microseconds, while generation takes hundreds of milliseconds
- On-device viability — the whole system runs in 22MB; an LLM in the browser is still hard