BM25 vs Dense Retrieval: Why We Use Both for Browser Chatbots

If you build a retrieval system, you have to choose between lexical scoring (BM25, TF-IDF) and semantic scoring (dense embeddings, vector search). For a long time people argued that one would replace the other. In practice, neither does. Real user queries have both a lexical and a semantic shape, and a single signal misses whichever queries lean the other way.

What BM25 Is Good At

BM25 (Best Matching 25) scores a document D against a query Q using term frequency, inverse document frequency, and document length normalization:

score(D, Q) = Σ IDF(qi) * (f(qi, D) * (k1 + 1)) /
              (f(qi, D) + k1 * (1 - b + b * |D| / avgdl))

Here the sum runs over the query terms qi, f(qi, D) is the number of times qi occurs in D, |D| is the length of D in tokens, avgdl is the average document length in the corpus, and k1 and b are tuning parameters (k1 = 1.2 and b = 0.75 are common defaults).

It is fast (one postings lookup per query term in an inverted index), interpretable (you can see which terms matched), and hard to beat on queries that share exact vocabulary with the corpus.
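
To make the formula concrete, here is a minimal sketch of BM25 scoring in Python. It assumes whitespace tokenization, the common defaults k1 = 1.2 and b = 0.75, and a smoothed IDF variant; the function names and the three toy documents are illustrative and are not taken from ReLU.chat's code.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (assumption: no stemming or punctuation handling)."""
    return text.lower().split()


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one BM25 score per document in `docs` for `query`."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for q in tokenize(query):
            if q not in tf:
                continue  # terms absent from the document contribute nothing
            # Smoothed IDF (an assumed variant); it stays positive for common terms.
            idf = math.log(1 + (n_docs - df[q] + 0.5) / (df[q] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf[q] * (k1 + 1) / (tf[q] + length_norm)
        scores.append(score)
    return scores


# Toy corpus, made up for illustration.
docs = [
    "the shapley value splits a payoff among players",
    "a vickrey auction is a sealed-bid second price auction",
    "nash equilibrium is a stable strategy profile",
]
print(bm25_scores("shapley value", docs))
```

Only the first document gets a nonzero score for "shapley value", and a misspelled term that never appears in the corpus scores zero everywhere.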

A user asking "What is the Shapley value?" needs the document that literally contains "Shapley value." BM25 finds it instantly. Dense retrieval might also find it — but only because the embedding happened to learn that phrase's neighborhood. BM25 is a contract: it will not fail on exact terms.

What Dense Retrieval Is Good At

Dense retrieval embeds both the query and the documents into the same vector space, then retrieves by cosine similarity. A query like "how do I split profits fairly" should match a document about "Shapley value" even though no words overlap. That is exactly the kind of paraphrase BM25 cannot solve.
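
The retrieval step itself is small once the vectors exist. The sketch below ranks documents by cosine similarity. The 3-dimensional vectors are hand-made stand-ins for real model output (an actual embedding model produces hundreds of dimensions), and `dense_rank` is a hypothetical helper name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def dense_rank(query_vec, doc_vecs):
    """Return (title, similarity) pairs sorted from most to least similar."""
    scored = [(title, cosine(query_vec, vec)) for title, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up vectors, NOT real embeddings: they only illustrate the mechanics.
doc_vectors = {
    "Shapley value": [0.9, 0.1, 0.2],
    "Vickrey auction": [0.1, 0.9, 0.1],
    "Nash equilibrium": [0.2, 0.2, 0.9],
}
# Pretend embedding of "how do I split profits fairly".
query_vector = [0.8, 0.2, 0.3]

ranking = dense_rank(query_vector, doc_vectors)
print(ranking[0][0])
```

No term overlap is involved anywhere in this ranking; the match comes entirely from the geometry of the vectors.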

Dense retrieval also handles synonyms, morphological variants, and the noise of natural language. If a user types "NASH equlibrium" (typo), a good embedding model still finds "Nash equilibrium." BM25 treats the misspelled "equlibrium" as a totally different token and matches nothing on it.

Where Each One Fails

BM25 fails when:

The query uses synonyms the corpus never used
The query is short and ambiguous
The corpus has heavy paraphrasing

Dense retrieval fails when:

The query contains rare, domain-specific terms the embedding model has never seen in pretraining
The user is searching for a literal phrase (an error code, an API name, a formula)
The corpus is very small relative to the pretraining distribution

In a chatbot knowledge base, the third case is constant, and it makes the first two worse. The user is often searching for a name, a theorem, or a specific tool, and a 22 MB MiniLM that was trained on web text has no idea what "Shapley value" is as a named entity even if it can paraphrase it.

Fusing the Two

ReLU.chat runs both signals in parallel and fuses them into a single score per document. The signal layer produces 25 features, but the two main ones are:

bm25_score — raw BM25 over the indexed knowledge base
dense_score — cosine similarity between query embedding and document embedding

The fusion is a weighted sum, with weights tuned via the policy network (see our RL post):

final = w_bm25 * normalize(bm25_score) + w_dense * normalize(dense_score)

Here w_bm25 + w_dense ≈ 1, and the policy decides per query how to split them. For a query that looks lexical ("What is the formula for X?"), the policy weights BM25 higher. For a paraphrase ("how do I split fairly"), it weights dense higher.
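
A minimal sketch of the weighted sum follows. The post does not say how `normalize` is defined, so this assumes min-max scaling across the candidate documents, and it passes the weight in by hand where the real system would get it from the policy network. The scores in the example are invented for illustration.

```python
def min_max(scores):
    """Scale scores to [0, 1] across candidates (assumed normalization)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]  # no spread, so the signal carries no ranking information
    return [(s - lo) / (hi - lo) for s in scores]


def fuse(bm25_scores, dense_scores, w_bm25):
    """Weighted sum of normalized BM25 and dense scores, with w_dense = 1 - w_bm25."""
    w_dense = 1.0 - w_bm25
    norm_bm25 = min_max(bm25_scores)
    norm_dense = min_max(dense_scores)
    return [w_bm25 * b + w_dense * d for b, d in zip(norm_bm25, norm_dense)]


# Invented scores for three candidate documents.
bm25 = [4.2, 0.0, 1.1]
dense = [0.31, 0.78, 0.40]

lexical = fuse(bm25, dense, w_bm25=0.7)     # policy leans on BM25
paraphrase = fuse(bm25, dense, w_bm25=0.2)  # policy leans on dense

print(lexical.index(max(lexical)))        # document 0 wins
print(paraphrase.index(max(paraphrase)))  # document 1 wins
```

Normalizing before mixing matters because raw BM25 scores are unbounded while cosine similarity lies in [-1, 1]; without it, the weights would not mean what they appear to mean.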

A Concrete Example

Query: "auction second price"

BM25 ranks: "Vickrey auction" document first (exact "auction" overlap, partial term match)
Dense ranks: "sealed-bid mechanism" first (semantically similar but no overlap)
Fusion: "Vickrey auction" wins because the policy correctly identified the lexical intent

Query: "how do I split profits fairly among players"

BM25 ranks: nothing strong (no overlap with "Shapley value")
Dense ranks: "Shapley value" first
Fusion: dense wins

What This Costs

Almost nothing. BM25 is a single inverted-index scan — microseconds. The embedding is the expensive part, and we already compute it for dense retrieval. Adding BM25 adds <1 ms to retrieval.

The bookkeeping is the real cost: you need an inverted index, IDF precomputation, length normalization, and a serialization format. We precompute the BM25 stats at build time and ship them as a JSON file (~30 KB for our knowledge base).
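
As a rough sketch of that build step, the snippet below computes the statistics BM25 needs and serializes them to JSON. The field names (`n_docs`, `avgdl`, `doc_lengths`, `df`, `postings`) and the layout are assumptions made for illustration; the actual file format is not described here.

```python
import json
from collections import Counter


def build_bm25_stats(docs):
    """Precompute corpus statistics and an inverted index for BM25 (hypothetical layout)."""
    tokenized = [d.lower().split() for d in docs]
    doc_lengths = [len(tokens) for tokens in tokenized]
    # Document frequency per term, used later to compute IDF.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Inverted index: term -> {doc_id: term frequency}. Doc ids are strings
    # because JSON object keys must be strings.
    postings = {}
    for doc_id, tokens in enumerate(tokenized):
        for term, tf in Counter(tokens).items():
            postings.setdefault(term, {})[str(doc_id)] = tf
    return {
        "n_docs": len(docs),
        "avgdl": sum(doc_lengths) / len(docs),
        "doc_lengths": doc_lengths,
        "df": dict(df),
        "postings": postings,
    }


# Toy corpus, made up for illustration.
stats = build_bm25_stats([
    "shapley value splits payoff",
    "vickrey auction second price auction",
])
payload = json.dumps(stats, separators=(",", ":"))  # compact output keeps the file small
restored = json.loads(payload)
print(len(payload), "bytes")
```

At query time the client only needs to load this file, look up the postings for each query term, and apply the scoring formula; nothing has to be recomputed in the browser.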

When to Use What

Use BM25 alone when:

Your corpus is small and you need explainability
Your users mostly type literal queries (error codes, names)

Use dense alone when:

Your users paraphrase constantly
You have a very large, well-trained embedding model

Use both when:

You have real users with messy queries
You can afford the <1 ms overhead
You care about quality more than simplicity

ReLU.chat uses both. It is the right default.