If you build a retrieval system, you have to choose between lexical retrieval (BM25, TF-IDF) and semantic retrieval (dense embeddings, vector search). For a long time, people argued that one would replace the other. In practice, neither does. Real user queries have both a lexical and a semantic shape, and a system that relies on a single signal misses whichever shape it cannot see.
What BM25 Is Good At
BM25 (Best Matching 25) scores a document D against a query Q using term frequency, inverse document frequency, and document length normalization:
$$
\text{score}(D, Q) = \sum_{i} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}
$$
It is fast (a single pass over an inverted index), interpretable (you can see which terms matched), and very hard to beat on queries that share exact vocabulary with the corpus.
A user asking "What is the Shapley value?" needs the document that literally contains "Shapley value." BM25 finds it instantly. Dense retrieval might also find it — but only because the embedding happened to learn that phrase's neighborhood. BM25 is a contract: it will not fail on exact terms.
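The scoring formula above can be sketched in a few lines of Python. This is a minimal illustration, not the production scorer: the tokenizer, the three toy documents, the defaults k1 = 1.2 and b = 0.75, and the smoothed IDF variant log(1 + (N - n + 0.5) / (n + 0.5)) are all assumptions, since the article does not specify them.

```python
import math
from collections import Counter

def tokenize(text):
    # Naive tokenizer (assumption): split on whitespace, strip punctuation, lowercase.
    tokens = (t.strip('.,?!"()').lower() for t in text.split())
    return [t for t in tokens if t]

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one BM25 score per document, in the same order as docs."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n_docs

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for d in tokenized:
        df.update(set(d))

    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            # Smoothed, non-negative IDF variant (assumption).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            denom = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores

# Toy corpus (hypothetical documents, for illustration only).
docs = [
    "The Shapley value splits a payoff among players.",
    "A Vickrey auction is a sealed-bid second-price auction.",
    "Nash equilibrium is a solution concept in game theory.",
]
print(bm25_scores("What is the Shapley value?", docs))
print(bm25_scores("how do I split profits fairly", docs))
```

In this toy corpus the literal query scores the Shapley document highest, while the paraphrased query shares no tokens with any document and scores zero everywhere.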
What Dense Retrieval Is Good At
Dense retrieval embeds both the query and the documents into the same vector space, then retrieves by cosine similarity. A query like "how do I split profits fairly" should match a document about "Shapley value" even though no words overlap. That is exactly the kind of paraphrase BM25 cannot solve.
Dense retrieval also handles synonyms, morphological variants, and the noise of natural language. If a user types "NASH equlibrium" (typo), a good embedding model still finds "Nash equilibrium." BM25 treats that as a totally different token.
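As a rough sketch of the retrieval step described above, the snippet below ranks documents by cosine similarity. The three-dimensional vectors are made up by hand to stand in for the output of an embedding model; a real system would get them from a trained encoder, which is not shown here.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors; 0.0 if either is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def dense_rank(query_vec, doc_vecs):
    # Score every document against the query and sort best-first.
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical embeddings (hand-picked numbers, not produced by any model).
doc_vecs = {
    "shapley_value": [0.9, 0.1, 0.2],
    "vickrey_auction": [0.1, 0.9, 0.3],
    "nash_equilibrium": [0.2, 0.3, 0.9],
}
# Pretend embedding of "how do I split profits fairly".
query_vec = [0.8, 0.2, 0.1]

for doc_id, score in dense_rank(query_vec, doc_vecs):
    print(f"{doc_id}: {score:.3f}")
```

With these invented vectors the query lands closest to the "shapley_value" document even though the query text shares no words with it, which is the behavior the paragraph above describes.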
Where Each One Fails
BM25 fails when:
- The query uses synonyms the corpus never used
- The query is short and ambiguous
- The corpus has heavy paraphrasing
Dense retrieval fails when:
- The query contains rare, domain-specific terms the embedding model has never seen in pretraining
- The user is searching for a literal phrase (an error code, an API name, a formula)
- The corpus is very small relative to the pretraining distribution
Fusing the Two
ReLU.chat runs both signals in parallel and fuses them into a single score per document. The signal layer produces 25 features, but the two main ones are:
- bm25_score: raw BM25 over the indexed knowledge base
- dense_score: cosine similarity between query embedding and document embedding
They are combined as a weighted sum:
final = w_bm25 · normalize(bm25_score) + w_dense · normalize(dense_score)
Here w_bm25 + w_dense ≈ 1, and the policy decides per query how to split them. For a query that looks lexical ("What is the formula for X?"), the policy weights BM25 higher. For a paraphrase ("how do I split fairly"), it weights dense higher.
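The weighted sum can be sketched as follows. The article does not say which normalization is used, so min-max scaling is an assumption here, as are the function names, the document IDs, and the raw scores. For simplicity the sketch forces the weights to sum to exactly 1, and it takes w_bm25 as an input rather than modeling the policy that chooses it.

```python
def min_max(scores):
    # Scale scores to [0, 1]; if all scores are equal, treat the signal as uninformative.
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def fuse(bm25_scores, dense_scores, w_bm25):
    # final = w_bm25 * normalize(bm25_score) + w_dense * normalize(dense_score)
    w_dense = 1.0 - w_bm25
    norm_bm25 = min_max(bm25_scores)
    norm_dense = min_max(dense_scores)
    doc_ids = set(norm_bm25) | set(norm_dense)
    return {
        doc_id: w_bm25 * norm_bm25.get(doc_id, 0.0) + w_dense * norm_dense.get(doc_id, 0.0)
        for doc_id in doc_ids
    }

# Made-up raw scores for illustration only.
bm25 = {"vickrey_auction": 4.2, "sealed_bid_mechanism": 0.0, "shapley_value": 0.0}
dense = {"vickrey_auction": 0.71, "sealed_bid_mechanism": 0.78, "shapley_value": 0.30}

lexical = fuse(bm25, dense, w_bm25=0.7)     # policy leans lexical
semantic = fuse(bm25, dense, w_bm25=0.0)    # policy ignores BM25 entirely
print(max(lexical, key=lexical.get))
print(max(semantic, key=semantic.get))
```

Normalizing before mixing matters because raw BM25 scores are unbounded while cosine similarities are not; without it, the weights would not mean what they appear to mean.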
A Concrete Example
Query: "auction second price"
- BM25 ranks: "Vickrey auction" document first (exact "auction" overlap, partial term match)
- Dense ranks: "sealed-bid mechanism" first (semantically similar but no overlap)
- Fusion: "Vickrey auction" wins because the policy correctly identified the lexical intent
Query: "how do I split profits fairly"
- BM25 ranks: nothing strong (no overlap with "Shapley value")
- Dense ranks: "Shapley value" first
- Fusion: dense wins
What This Costs
Almost nothing. BM25 is a single inverted-index scan that takes microseconds. The embedding is the expensive part, and we already compute it for dense retrieval, so adding BM25 adds less than 1 ms to retrieval.
The bookkeeping is the real cost: you need an inverted index, IDF precomputation, length normalization, and a serialization format. We precompute the BM25 stats at build time and ship them as a JSON file (~30 KB for our knowledge base).
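A build-time step along these lines might look like the sketch below. The JSON layout (the keys "avgdl", "doc_len", "idf", "postings"), the function name, the whitespace tokenizer, and the smoothed IDF variant are all hypothetical; the article only says that the stats are precomputed and shipped as JSON.

```python
import json
import math
from collections import Counter

def build_bm25_stats(docs):
    """Precompute BM25 statistics for docs, a dict of doc_id -> text."""
    # Whitespace tokenization is an assumption made to keep the sketch short.
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    doc_len = {doc_id: len(tokens) for doc_id, tokens in tokenized.items()}

    # Inverted index: term -> {doc_id: term frequency in that document}.
    postings = {}
    for doc_id, tokens in tokenized.items():
        for term, tf in Counter(tokens).items():
            postings.setdefault(term, {})[doc_id] = tf

    # One IDF value per term, using a smoothed non-negative variant (assumption).
    idf = {
        term: math.log(1 + (n_docs - len(p) + 0.5) / (len(p) + 0.5))
        for term, p in postings.items()
    }
    return {
        "avgdl": sum(doc_len.values()) / n_docs,
        "doc_len": doc_len,
        "idf": idf,
        "postings": postings,
    }

# Toy knowledge base (hypothetical documents).
stats = build_bm25_stats({
    "vickrey": "vickrey auction is a second-price sealed-bid auction",
    "shapley": "shapley value splits a payoff fairly",
})
payload = json.dumps(stats, sort_keys=True)   # what would be written to disk
restored = json.loads(payload)                # what the retriever would load
print(len(payload), "bytes")
```

At query time the retriever only needs the loaded dictionary: the postings give term frequencies, and the IDF values and document lengths plug straight into the scoring formula.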
When to Use What
Use BM25 only when:
- Your corpus is small and you need explainability
- Your users mostly type literal queries (error codes, names)
Use dense only when:
- Your users paraphrase constantly
- You have a very large, well-trained embedding model
Use both when:
- You have real users with messy queries
- You can afford the sub-millisecond overhead
- You care about quality more than simplicity