Knowledge Fragment Composition: Why We Ditched Generation for Retrieval

Every modern chatbot has to make a fundamental choice: generate responses token-by-token, or retrieve pre-written fragments and stitch them together. ReLU.chat picked the second path. This post explains why, and what we learned about doing it well.

The Generation Trap

Large language models are remarkable. They write coherent, contextually appropriate text on almost any topic. For an open-ended creative assistant, that is exactly what you want.

For a knowledge-grounded chatbot — a game-theory tutor, a documentation assistant, a scientific reference — generation has a fatal problem: hallucination. A model can produce text that is fluent, confident, and completely wrong. In a domain where users trust the bot to be accurate, that is unacceptable.

We tried the LLM path first. It worked for the happy path — clean queries, well-covered topics. It failed for edge cases:

A user asks about a specific theorem in game theory and gets a paraphrase of a different theorem
A query about a specific equation gets a formula that is almost right but subtly wrong
A user asks for the year of an event and the model invents a plausible but incorrect date

The accuracy floor is the issue. With pure generation you cannot guarantee that any given sentence is grounded in the source.

The Retrieval Alternative

A retrieval-based system works differently. You maintain a knowledge base of pre-written fragments, each one verified for accuracy. At query time you:

Find the fragments most relevant to the query
Decide which ones to include
Connect them with linguistic glue
Return the composed result

The output is always traceable to a source. Every fact comes from a fragment that a human wrote and verified. As long as fragments are emitted verbatim, the bot cannot hallucinate facts: it does not generate new text, it only assembles existing text and adds connectors between fragments.

The tradeoff is flexibility. If the user asks something not covered by the knowledge base, the system has to say so. There is no fallback to "sounds-plausible-but-wrong." That is the whole point.
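
One way to make "the system has to say so" concrete is a score threshold in front of the composer. The sketch below is an illustration, not ReLU.chat's actual code: the names `RetrievedCandidate`, `answerOrDecline`, and `listIds` are hypothetical, scores are assumed to be normalized to the range 0..1, and the 0.5 default is a placeholder that would need tuning.

```typescript
interface RetrievedCandidate {
  id: string;
  score: number; // assumed to be normalized to the range 0..1
}

const NOT_COVERED_REPLY = "I don't have information on that.";

// Keeps only candidates at or above minScore. If none survive, the bot
// declines instead of composing an answer from weak matches.
function answerOrDecline(
  candidates: RetrievedCandidate[],
  composeAnswer: (kept: RetrievedCandidate[]) => string,
  minScore: number = 0.5 // placeholder threshold, not a tuned value
): string {
  const kept = candidates.filter((candidate) => candidate.score >= minScore);
  return kept.length === 0 ? NOT_COVERED_REPLY : composeAnswer(kept);
}

// Usage with a stand-in composer that just lists the fragment ids it was given.
const listIds = (kept: RetrievedCandidate[]): string =>
  kept.map((candidate) => candidate.id).join(", ");

const weakMatch = answerOrDecline([{ id: "gt:nash:def", score: 0.2 }], listIds);
const strongMatch = answerOrDecline([{ id: "gt:nash:def", score: 0.8 }], listIds);
```

Here `weakMatch` is the decline message and `strongMatch` is whatever the composer returns. The point of the design is that declining is an explicit branch, not something left to emerge from the composer.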

How ReLU.chat Composes

A knowledge fragment looks like:

{
  id: "gt:nash:def",
  topic: "nash-equilibrium",
  intent: "definition",
  body: "A Nash equilibrium is a set of strategies, one for each player, such that no player can benefit by unilaterally changing their strategy while the other players' strategies remain unchanged.",
  tags: ["game-theory", "nash", "definition"]
}

Fragments are small (1-3 sentences), focused on a single concept, and tagged with topic + intent. The intent tag is critical: it tells the policy network whether this fragment is a definition, an example, a proof sketch, a comparison, etc.
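
The fragment shape above can be written down as a type, together with a small check for the "1-3 sentences" rule. This is a sketch under assumptions: the `FragmentIntent` union lists only the four intents named in the text (the real set may be larger), and `countSentences` and `validateFragment` are hypothetical helpers, not part of any published ReLU.chat API.

```typescript
// Hypothetical types: field names follow the example fragment above.
// The intent list is an assumption; the text names four intents and then says "etc."
type FragmentIntent = "definition" | "example" | "proof-sketch" | "comparison";

interface KnowledgeFragment {
  id: string;
  topic: string;
  intent: FragmentIntent;
  body: string;
  tags: string[];
}

// Rough sentence count: splits on ., ! or ? followed by whitespace or end of text.
// Abbreviations such as "e.g." would be miscounted; good enough for a sketch.
function countSentences(text: string): number {
  return text
    .split(/[.!?]+(?:\s+|$)/)
    .filter((part) => part.trim().length > 0).length;
}

// Returns a list of problems; an empty list means the fragment passes.
function validateFragment(fragment: KnowledgeFragment): string[] {
  const problems: string[] = [];
  if (fragment.id.trim().length === 0) problems.push("id is empty");
  if (fragment.topic.trim().length === 0) problems.push("topic is empty");
  if (fragment.tags.length === 0) problems.push("tags are empty");
  const sentences = countSentences(fragment.body);
  if (sentences < 1 || sentences > 3) {
    problems.push("body has " + sentences + " sentences, expected 1 to 3");
  }
  return problems;
}

const nashDefinition: KnowledgeFragment = {
  id: "gt:nash:def",
  topic: "nash-equilibrium",
  intent: "definition",
  body: "A Nash equilibrium is a set of strategies, one for each player, such that no player can benefit by unilaterally changing their strategy while the other players' strategies remain unchanged.",
  tags: ["game-theory", "nash", "definition"],
};

const nashProblems = validateFragment(nashDefinition); // empty list for this fragment
```

Running a check like this when fragments are authored, not at query time, keeps the size and tagging rules from drifting as the knowledge base grows.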

At query time the pipeline is:

Retrieve: find the top-k candidate fragments with a hybrid of BM25 and dense retrieval (see our retrieval post)
Rank: score the candidates with the policy network
Select: pick the fragments to include (typically 1-3)
Order: decide the sequence (e.g., definition before example)
Connect: insert linguistic connectors ("To illustrate this,", "In contrast,", "Specifically,")
Format: wrap math in LaTeX, snippets in code blocks, and enumerations in lists

The result reads naturally because the connectors are doing real linguistic work, and the fragments themselves are written in a consistent style.
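
The Order and Connect steps can be sketched in a few lines. Everything here is an assumption made for illustration: the intent ranking in `INTENT_RANK`, the intent-to-connector mapping, the list of sentence openers that are safe to lowercase, and the example fragment about the Prisoner's Dilemma. Retrieval, ranking, and selection are taken as already done, and in the real system the policy network (not a fixed table) decides which connectors to use.

```typescript
interface ComposableFragment {
  id: string;
  intent: string;
  body: string;
}

// Assumed ordering: a lower rank is placed earlier (definition before example).
const INTENT_RANK: { [intent: string]: number } = {
  definition: 0,
  comparison: 1,
  example: 2,
  "proof-sketch": 3,
};

// Assumed mapping from the intent of a follow-on fragment to its connector.
const INTENT_CONNECTOR: { [intent: string]: string } = {
  example: "To illustrate this,",
  comparison: "In contrast,",
};
const FALLBACK_CONNECTOR = "Specifically,";

// Openers that are safe to lowercase after a connector; anything else
// (for example a proper noun) is left untouched.
const LOWERCASE_OPENERS = ["A", "An", "The", "In", "If", "For", "Consider"];

function lookup<T>(table: { [key: string]: T }, key: string, fallback: T): T {
  return Object.prototype.hasOwnProperty.call(table, key) ? table[key] : fallback;
}

function joinAfterConnector(connector: string, body: string): string {
  const firstWord = body.split(" ")[0];
  const text =
    LOWERCASE_OPENERS.indexOf(firstWord) >= 0
      ? body.charAt(0).toLowerCase() + body.slice(1)
      : body;
  return connector + " " + text;
}

function composeResponse(selected: ComposableFragment[]): string {
  // Order: sort by intent rank; ties keep the order the selector produced.
  const ordered = selected
    .map((fragment, index) => ({ fragment, index }))
    .sort(
      (a, b) =>
        lookup(INTENT_RANK, a.fragment.intent, 99) -
          lookup(INTENT_RANK, b.fragment.intent, 99) || a.index - b.index
    )
    .map((entry) => entry.fragment);

  // Connect: the first fragment stands alone, later ones get a connector.
  return ordered
    .map((fragment, position) =>
      position === 0
        ? fragment.body
        : joinAfterConnector(
            lookup(INTENT_CONNECTOR, fragment.intent, FALLBACK_CONNECTOR),
            fragment.body
          )
    )
    .join(" ");
}

const sampleDefinition: ComposableFragment = {
  id: "gt:nash:def",
  intent: "definition",
  body: "A Nash equilibrium is a set of strategies, one for each player, such that no player can benefit by unilaterally changing their strategy while the other players' strategies remain unchanged.",
};
// Illustrative fragment, not taken from the real knowledge base.
const sampleExample: ComposableFragment = {
  id: "gt:nash:example",
  intent: "example",
  body: "In the Prisoner's Dilemma, mutual defection is a Nash equilibrium.",
};

const sampleResponse = composeResponse([sampleExample, sampleDefinition]);
```

Even though the example fragment is passed first, `sampleResponse` starts with the definition and continues with "To illustrate this, in the Prisoner's Dilemma, mutual defection is a Nash equilibrium."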

What the Policy Network Decides

The policy network has six action heads. They control:

Fragment selection: which fragments from the candidate set to include
Response length: concise (one fragment) vs detailed (multiple fragments plus connectors)
Connector style: formal vs conversational, with or without transitions
Confidence display: whether to show source attribution, hedge the answer, or do neither
Creativity balance: pure retrieval vs light paraphrasing of fragment text
Follow-up generation: whether to suggest related topics at the end

The policy learns these from interaction data via reinforcement learning (see our RL post). It is not a hand-tuned heuristic — it is a 13K-parameter MLP trained on thousands of real interactions.
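
To show how six heads could turn into one concrete decision, here is a sketch that decodes per-head scores into an action by taking the highest-scoring option for each head. It does not model the MLP itself (layer sizes are not given in this post), and the option names, the `HeadLogits` layout, and the cap of three fragments for detailed answers are all assumptions made for illustration.

```typescript
interface PolicyAction {
  fragmentIds: string[];
  responseLength: "concise" | "detailed";
  connectorStyle: "formal" | "conversational" | "none";
  confidenceDisplay: "attribution" | "hedge" | "none";
  creativity: "verbatim" | "light-paraphrase";
  suggestFollowUp: boolean;
}

// Hypothetical raw outputs of the six heads for one query.
interface HeadLogits {
  fragmentScores: number[]; // one score per candidate fragment
  responseLength: number[]; // [concise, detailed]
  connectorStyle: number[]; // [formal, conversational, none]
  confidenceDisplay: number[]; // [attribution, hedge, none]
  creativity: number[]; // [verbatim, light-paraphrase]
  suggestFollowUp: number[]; // [no, yes]
}

// Index of the largest value; the first one wins on ties.
function argmax(values: number[]): number {
  let best = 0;
  for (let i = 1; i < values.length; i++) {
    if (values[i] > values[best]) best = i;
  }
  return best;
}

function pickOption<T>(options: T[], logits: number[]): T {
  return options[argmax(logits)];
}

function decodeAction(candidateIds: string[], logits: HeadLogits): PolicyAction {
  const responseLength = pickOption<PolicyAction["responseLength"]>(
    ["concise", "detailed"],
    logits.responseLength
  );
  // Assumption: concise means one fragment, detailed means up to three.
  const maxFragments = responseLength === "concise" ? 1 : 3;
  const byScore = candidateIds
    .map((_id, index) => index)
    .sort((a, b) => logits.fragmentScores[b] - logits.fragmentScores[a]);

  return {
    fragmentIds: byScore.slice(0, maxFragments).map((index) => candidateIds[index]),
    responseLength,
    connectorStyle: pickOption<PolicyAction["connectorStyle"]>(
      ["formal", "conversational", "none"],
      logits.connectorStyle
    ),
    confidenceDisplay: pickOption<PolicyAction["confidenceDisplay"]>(
      ["attribution", "hedge", "none"],
      logits.confidenceDisplay
    ),
    creativity: pickOption<PolicyAction["creativity"]>(
      ["verbatim", "light-paraphrase"],
      logits.creativity
    ),
    suggestFollowUp: pickOption<boolean>([false, true], logits.suggestFollowUp),
  };
}

const sampleAction = decodeAction(["a", "b", "c", "d"], {
  fragmentScores: [0.1, 0.9, 0.5, 0.7],
  responseLength: [0.2, 1.5],
  connectorStyle: [2.0, 1.0, 0.0],
  confidenceDisplay: [0.0, 1.0, 0.0],
  creativity: [1.0, 0.0],
  suggestFollowUp: [0.0, 3.0],
});
```

For these made-up scores, `sampleAction` selects fragments b, d, and c in that order, with a detailed, formal, hedged, verbatim response that suggests a follow-up. During training a policy would usually sample from each head instead of taking the argmax; greedy decoding is shown only because it is the simplest thing to read.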

What Goes Wrong

Retrieval + composition is not magic. Its failure modes differ from those of generation, but they exist:

Coverage gaps: the knowledge base simply does not cover the query. The bot has to say "I don't have information on that."
Stitching artifacts: connectors sound formulaic when overused. The policy has to learn when not to connect.
Stale fragments: the knowledge base does not update itself. A theorem published yesterday is not in the bot's KB.
Topic drift: the retriever returns fragments that are semantically close to the query but about the wrong topic. A query about "Nash equilibrium" might return a fragment about "Walrasian equilibrium" if the two sit close together in embedding space.

The mitigations:

For coverage: detect low-confidence retrievals and explicitly say so
For stitching: train the policy to use connectors sparingly
For staleness: this is a maintenance burden; KB updates are manual
For drift: use the intent tag in retrieval scoring — a query looking for a definition should weight definition-tagged fragments higher
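
The intent-weighting idea for drift can be sketched as a rerank step that runs after retrieval. This is a minimal illustration under assumptions: the multiplicative boost and its 1.5 default are made up, scores are assumed to be non-negative, the fragment id `gt:nash:example` is hypothetical, and detecting the intent of the query is treated as already solved.

```typescript
interface ScoredFragment {
  id: string;
  intent: string;
  score: number; // assumed non-negative, higher is better
}

// Multiplies the score of every fragment whose intent matches the intent of
// the query, then sorts by adjusted score, highest first.
// Returns new objects and leaves the input array untouched.
function rerankByIntent(
  candidates: ScoredFragment[],
  queryIntent: string,
  boost: number = 1.5 // made-up default, would need tuning
): ScoredFragment[] {
  return candidates
    .map((candidate) => ({
      id: candidate.id,
      intent: candidate.intent,
      score: candidate.intent === queryIntent ? candidate.score * boost : candidate.score,
    }))
    .sort((a, b) => b.score - a.score);
}

// The example fragment scores slightly higher on raw similarity,
// but the query is asking for a definition.
const rawCandidates: ScoredFragment[] = [
  { id: "gt:nash:example", intent: "example", score: 0.7 },
  { id: "gt:nash:def", intent: "definition", score: 0.6 },
];

const rerankedCandidates = rerankByIntent(rawCandidates, "definition");
```

After the rerank, the definition fragment comes first even though its raw score was lower. A multiplicative boost keeps the relative order among fragments of the same intent, which is one reason to prefer it over a hard filter.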

Why We Picked This

For a knowledge-grounded chatbot, retrieval + composition gives you:

Verifiable accuracy: every claim is traceable to a fragment
Smaller models: no need for a 7B-parameter LLM; a 22MB retriever is enough
Faster inference: retrieval takes microseconds per lookup, while generating a response token by token takes hundreds of milliseconds
On-device viability: the whole system runs in 22MB; an LLM in the browser is still hard

We trade flexibility for accuracy, and we trade the magic of generation for the honesty of retrieval. For our use case — domain-specific, accuracy-critical, privacy-first — that is exactly the right tradeoff.