Sentence Embeddings vs Word Embeddings: Why We Embed Whole Questions

If you have heard of word2vec, GloVe, or fastText, you know that each word can be represented as a dense vector. A natural question is: why not just embed each word in a query, average the vectors, and use that as the query representation?

We tried that. It does not work. For Q&A retrieval, the right tool is a sentence transformer that embeds the whole sentence at once. Here is why the average-of-word-vectors approach fails, and what sentence transformers do differently.

The Word Embedding Baseline

A word embedding model like Word2Vec produces one fixed 300-dimensional vector for each word, trained so that words appearing in similar contexts have similar vectors. "King" and "queen" are close. "Cat" and "dog" are close. "Bank" (river) and "bank" (financial) unfortunately share a single vector, because the lookup is keyed on the word form and ignores context. That is a known limitation.

For retrieval, the obvious baseline is:

query_vector = mean(embed(word) for word in tokenize(query))
document_vector = mean(embed(word) for word in tokenize(document))
similarity = cosine(query_vector, document_vector)

This is fast and needs no neural network inference at runtime, only a lookup table of word vectors. It is also a poor retrieval signal, for a deep reason.
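As a rough illustration, the baseline above can be sketched in a few lines of Python. The 3-dimensional vectors in `TOY_VECTORS` are made up for the example (real Word2Vec vectors would be loaded from a trained model), and the helper names `tokenize`, `mean_vector`, and `cosine` are hypothetical.

```python
import numpy as np

# Toy 3-dimensional "word vectors". These values are invented for illustration;
# a real system would load 300-dimensional vectors from a trained model.
TOY_VECTORS = {
    "cook":    np.array([0.90, 0.10, 0.00]),
    "chicken": np.array([0.80, 0.30, 0.10]),
    "roast":   np.array([0.85, 0.20, 0.05]),
    "bank":    np.array([0.00, 0.20, 0.90]),
    "loan":    np.array([0.10, 0.10, 0.95]),
}

def tokenize(text):
    """Lowercase whitespace tokenizer; drops words missing from the table."""
    return [w for w in text.lower().split() if w in TOY_VECTORS]

def mean_vector(text):
    """Average the word vectors of all known tokens in the text."""
    tokens = tokenize(text)
    if not tokens:
        return np.zeros(3)
    return np.mean([TOY_VECTORS[t] for t in tokens], axis=0)

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

query = "cook chicken"
docs = ["roast chicken", "bank loan"]
scores = [cosine(mean_vector(query), mean_vector(d)) for d in docs]
best = docs[int(np.argmax(scores))]
```

On topically distinct documents like these the baseline behaves sensibly, which is part of why it looks attractive at first.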

Why Averaging Fails

The averaging approach assumes that the meaning of a sentence is the sum (or mean) of the meanings of its words. This is a strong assumption that is often wrong.

Consider the query: "What is the difference between X and Y?"

Averaging the word vectors gives you a point in 300-dim space that is influenced by "difference," "X," and "Y." But the actual meaning of the query is about the relation between X and Y — the difference. The mean-of-words representation has no way to encode that relational structure.

Another example: "How do I cook a chicken?" vs "Should I cook a chicken?"

The two sentences share four words ("I," "cook," "a," "chicken") and differ only in how they open, so their averaged vectors land very close together. But they ask different things: one is procedural, the other asks for advice. Word averaging has little to distinguish them with.

A third example, more concrete: "the cat sat on the mat" vs "the mat sat on the cat"

Same words, opposite meaning. The mean-of-word-vectors is identical. The information that matters — word order — is completely lost.
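The order invariance is easy to check numerically. The sketch below uses random vectors as stand-ins for trained word embeddings (an assumption for the demo; any fixed lookup table behaves the same way), and `mean_vector` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins for trained word vectors; any fixed table gives the same result.
vocab = ["the", "cat", "sat", "on", "mat"]
table = {word: rng.normal(size=300) for word in vocab}

def mean_vector(sentence):
    """Mean of the word vectors, regardless of the order the words appear in."""
    return np.mean([table[w] for w in sentence.split()], axis=0)

a = mean_vector("the cat sat on the mat")
b = mean_vector("the mat sat on the cat")

# True: both sentences contain the same multiset of words.
same = bool(np.allclose(a, b))
```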

This is the bag-of-words problem dressed up in continuous vectors. The averaging approach inherits all the limitations of treating language as a multiset of words.

What Sentence Transformers Do Differently

A sentence transformer is trained to embed a whole sentence into a vector that captures its meaning. Internally it still uses word vectors (subword token embeddings, to be precise), but the model applies multiple transformer layers that mix the word vectors based on their context.

The key operation is self-attention. Each token's representation is updated based on every other token in the sentence:

attention(Q, K, V) = softmax(QK^T / sqrt(d)) * V

After 6 or 12 layers of this, each token's representation has been informed by the entire sentence context. The final sentence embedding is typically a pooling operation over the token embeddings (mean-pool, CLS-token, etc.).

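The result: the vector for "the cat sat on the mat" is different from the vector for "the mat sat on the cat." Self-attention on its own is order-agnostic, so word order enters through the position embeddings that are added to each token before the first layer. The attention layers then mix position and content, which is how word order and syntactic structure end up in the final representation.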
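To make the mechanics concrete, here is a toy sketch of one attention layer followed by mean-pooling. It is not MiniLM: the weights are random and untrained, there is a single head and a single layer, and the names `self_attention`, `embed`, `tok_emb`, and `pos_emb` are hypothetical. It only shows the formula above in code, and that the pooled output becomes sensitive to word order once position embeddings are added to the token embeddings.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax with max-subtraction for numerical stability."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product attention: softmax(QK^T / sqrt(d)) V."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def embed(tokens, tok_emb, pos_emb, weights, use_positions):
    """Token (+ optional position) embeddings -> one attention layer -> mean-pool."""
    X = np.stack([tok_emb[t] for t in tokens])
    if use_positions:
        X = X + pos_emb[: len(tokens)]
    return self_attention(X, *weights).mean(axis=0)

rng = np.random.default_rng(0)
d = 16
vocab = ["the", "cat", "sat", "on", "mat"]
tok_emb = {w: rng.normal(size=d) for w in vocab}      # random, untrained
pos_emb = rng.normal(size=(8, d))                     # one vector per position
weights = tuple(rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

s1 = "the cat sat on the mat".split()
s2 = "the mat sat on the cat".split()

# Without positions the two word orders pool to the same vector.
no_pos_equal = bool(np.allclose(embed(s1, tok_emb, pos_emb, weights, False),
                                embed(s2, tok_emb, pos_emb, weights, False)))
# With positions added, the two word orders pool to different vectors.
with_pos_equal = bool(np.allclose(embed(s1, tok_emb, pos_emb, weights, True),
                                  embed(s2, tok_emb, pos_emb, weights, True)))
```

Here `no_pos_equal` comes out true and `with_pos_equal` false. A trained model stacks several such layers with learned weights, so the difference between the two sentences is meaningful rather than random.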

This is not magic. It is just a model that has learned to produce vectors where semantic similarity (the thing we care about for retrieval) corresponds to cosine similarity in vector space. That alignment is what the contrastive training objective gives us.
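The post does not spell out which contrastive loss was used, so the following is only one common form, shown as an assumption: cross-entropy over cosine similarities, where the other documents in the batch serve as negatives. The function name and the `scale` value are hypothetical choices for the sketch.

```python
import numpy as np

def in_batch_contrastive_loss(query_vecs, doc_vecs, scale=20.0):
    """Cross-entropy over cosine similarities with in-batch negatives.

    Row i of query_vecs is assumed to be paired with row i of doc_vecs;
    every other row of doc_vecs acts as a negative for query i.
    """
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    logits = scale * (q @ d.T)                        # (batch, batch) scaled cosines
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
docs = rng.normal(size=(4, 8))
aligned_queries = docs + 0.01 * rng.normal(size=(4, 8))  # each query near its own doc
shuffled_queries = aligned_queries[[1, 2, 3, 0]]         # each query near a wrong doc

loss_aligned = in_batch_contrastive_loss(aligned_queries, docs)
loss_shuffled = in_batch_contrastive_loss(shuffled_queries, docs)
```

The loss is low when each query sits closest to its own document and high when it sits closest to someone else's, which is the pressure that aligns cosine similarity with relevance.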

The Performance Difference

We benchmarked the two approaches on a held-out set of 500 (query, relevant-document) pairs from the game-theory knowledge base. The metric is recall@5 — the fraction of queries where the relevant document is in the top 5 retrieved.
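For reference, recall@5 with one relevant document per query can be computed as below. This is a minimal sketch, not the benchmark harness itself; the function name and the document IDs are hypothetical.

```python
def recall_at_k(ranked_ids_per_query, relevant_id_per_query, k=5):
    """Fraction of queries whose single relevant document appears in the top k."""
    if not ranked_ids_per_query:
        return 0.0
    hits = sum(
        relevant in ranked[:k]
        for ranked, relevant in zip(ranked_ids_per_query, relevant_id_per_query)
    )
    return hits / len(ranked_ids_per_query)

# Hypothetical retrieval results for three queries (document IDs, best first).
ranked = [
    ["d7", "d2", "d9", "d1", "d4", "d3"],
    ["d5", "d8", "d6", "d0", "d2", "d1"],
    ["d3", "d1", "d4", "d9", "d8", "d7"],
]
relevant = ["d2", "d1", "d4"]

# Queries 1 and 3 have their relevant document in the top 5; query 2 has it at rank 6.
score = recall_at_k(ranked, relevant, k=5)
```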

| Method | Recall@5 |
|---|---|
| Word-vector average (Word2Vec) | 0.41 |
| TF-IDF (lexical baseline) | 0.58 |
| BM25 (sparse retrieval) | 0.72 |
| MiniLM-L6-v2 (sentence transformer) | 0.79 |
| BM25 + MiniLM (hybrid) | 0.86 |

The sentence transformer alone beats the lexical baselines. Combining it with BM25 (see our retrieval post) beats either alone. The word-vector average is the worst, below even TF-IDF. Neither method uses word order, but TF-IDF keeps each term distinct and up-weights rare ones, whereas averaging blends every word, frequent function words included, into a single point.
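This post does not describe how the BM25 and MiniLM results are combined. One widely used option, shown here purely as an assumption and not as the method behind the 0.86 figure, is reciprocal rank fusion, which merges ranked lists without needing the two score scales to be comparable. The function name and document IDs are hypothetical.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first rankings of document IDs into one.

    Each document scores sum(1 / (k + rank)) over the rankings it appears in,
    with rank starting at 1. k=60 is a commonly used default, not a tuned value.
    Ties are broken by document ID so the output is deterministic.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical top results from each retriever for one query.
bm25_ranking = ["d3", "d1", "d7", "d5"]
dense_ranking = ["d1", "d9", "d3", "d2"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Documents that both retrievers rank highly ("d1" and "d3" here) rise to the top of the fused list, ahead of documents that only one retriever found.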

When Word Vectors Are Useful

Word embeddings are not useless. They are great for:

Vocabulary expansion — finding synonyms ("car" ↔ "automobile")
Word-level analogies — "king - man + woman ≈ queen"
Out-of-vocabulary handling — subword models (fastText) can embed words they have never seen
Tiny models — when you cannot afford a transformer at all

For Q&A retrieval specifically, they are a poor fit. The unit of meaning is the sentence, not the word.

The Practical Takeaway

If you are building a Q&A retrieval system today, use a sentence transformer. The cost is 22 MB (quantized MiniLM) and ~15 ms per query on a mid-range laptop. The benefit over word averaging is dramatic: recall@5 of 0.79 versus 0.41 in our benchmark, roughly 2x.
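At query time the retrieval step itself is just a cosine top-k over precomputed document embeddings. The sketch below uses small hand-made vectors so it runs without a model; the comment shows how the vectors might be produced with the sentence-transformers package, where the checkpoint name is an assumption about which MiniLM-L6-v2 variant is meant. `top_k` is a hypothetical helper.

```python
import numpy as np

def top_k(query_vec, doc_matrix, k=5):
    """Indices and cosine scores of the k documents most similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    sims = d @ q
    order = np.argsort(-sims)[:k]
    return order.tolist(), sims[order].tolist()

# In a real system the vectors would come from a sentence transformer, e.g.
# (assumption: the sentence-transformers package and this checkpoint name):
#
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
#   doc_matrix = model.encode(documents)   # list of strings -> 2-D array
#   query_vec = model.encode(query)        # single string -> 1-D array
#
# Small hand-made vectors stand in here so the sketch runs without a model.
doc_matrix = np.array([
    [1.0, 0.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
])
query_vec = np.array([0.9, 0.1, 0.0])

indices, scores = top_k(query_vec, doc_matrix, k=2)
```

Document embeddings can be computed once and stored, so only the query needs to be encoded per request.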

If you absolutely cannot load a transformer (some embedded systems, very strict size budgets), use a strong lexical baseline (BM25) and accept the lower recall. Word averaging is not a good middle ground: it still requires shipping a table of word vectors, yet it scored below both lexical baselines in our benchmark.

The right tool for the job depends on the job. For Q&A retrieval, the job is sentence-level meaning, and the right tool is a sentence transformer.