Sentence Embeddings Explained: Why 384 Dimensions and What They Mean

When we say ReLU.chat embeds text into "384 dimensions," it sounds arbitrary. It is not. The 384-dim output is a deliberate design choice for all-MiniLM-L6-v2, the sentence transformer we use. This post explains what those numbers mean, why the dimension count is what it is, and what that implies for browser deployment.

What a Sentence Embedding Is

A sentence embedding is a fixed-length vector of real numbers that represents the meaning of a sentence. Two sentences with similar meaning should have similar vectors. Similarity is usually measured as the cosine of the angle between the two vectors (cosine similarity):

similarity(A, B) = (A · B) / (||A|| * ||B||)

The vector lives in a high-dimensional space. Cosine similarity ranges from -1 (opposite) to 1 (identical). For semantically related sentences, you typically see 0.6-0.9.
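As a minimal sketch of the formula above, here is cosine similarity in plain Python. The function name and the toy 3-dimensional vectors are made up for illustration; real embeddings would have 384 entries.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors: (a . b) / (||a|| * ||b||)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors, not real embeddings.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
opposite = cosine_similarity([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

Note that the score depends only on direction, not length: scaling a vector by a positive constant leaves the similarity unchanged.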

A 384-dimensional vector is just 384 numbers. For the sentence "A Nash equilibrium is a strategy profile where no player wants to deviate," one dimension might loosely correspond to "game theory," another to "stability," another to "multi-agent," and so on. But it is misleading to assign labels to dimensions — they are learned, not interpretable, and the meaning is distributed across them.

Why 384 Specifically

The choice of 384 is the result of three competing pressures:

Accuracy: more dimensions mean more capacity to encode nuance. BERT-base uses 768, BERT-large uses 1024. Larger vectors can discriminate better between similar concepts.

Storage: a 384-dim FP32 vector is 1.5 KB (384 values at 4 bytes each). For a knowledge base of 10,000 fragments, that is about 15 MB. At 768 dims it would be about 30 MB. At 1024 dims, about 40 MB.

Latency: the output dimension matches the transformer's hidden size, so a narrower model does less work in every layer. Going from 768 to 384 at least halves the per-layer compute, because the weight matrices shrink roughly with the square of the width. For browser inference this matters.
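The storage figures above follow from simple arithmetic, sketched below. The helper name is hypothetical, and the calculation assumes uncompressed FP32 values with no index overhead.

```python
BYTES_PER_FP32 = 4  # one 32-bit float

def index_size_bytes(dims, num_vectors, bytes_per_value=BYTES_PER_FP32):
    """Raw size of num_vectors embeddings with `dims` values each."""
    return dims * bytes_per_value * num_vectors

for dims in (384, 768, 1024):
    per_vector = index_size_bytes(dims, 1)
    total = index_size_bytes(dims, 10_000)
    # KB here is 1024 bytes; MB is 10^6 bytes.
    print(f"{dims} dims: {per_vector / 1024:.1f} KB per vector, "
          f"{total / 1e6:.1f} MB for 10,000 fragments")
```

The exact totals come out slightly above the rounded figures quoted in the text (15.36 MB rather than 15 MB at 384 dims, for instance).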

all-MiniLM-L6-v2 uses 384 because the original 12-layer model, all-MiniLM-L12-v2, was already nearly as accurate at 384 dimensions as BERT-base at 768 on sentence-similarity benchmarks. The number 384 is not fundamental. It is the point where the model is "good enough" for the size and speed budget.

How It Is Learned

A sentence transformer is trained with a contrastive objective. The model sees pairs of sentences and learns to:

Make similar sentences have similar embeddings
Make dissimilar sentences have dissimilar embeddings

The training data is huge, typically 1B+ sentence pairs from web crawls, Q&A sites, and paraphrase databases. One classic form of the loss is a triplet margin loss:

L = max(0, margin - cos(A, A_positive) + cos(A, A_negative))

Or, in the more modern setup, a softmax over in-batch negatives scaled by a temperature. Either way, the model is rewarded for pulling similar sentences together and pushing dissimilar ones apart in the embedding space.
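Both objectives can be sketched in a few lines of plain Python. This is an illustration of the two loss shapes only, not the actual training code for MiniLM-L6-v2. The function names, the margin of 0.5, and the temperature of 0.05 are assumptions chosen for the example.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def triplet_margin_loss(anchor, positive, negative, margin=0.5):
    """max(0, margin - cos(A, A_positive) + cos(A, A_negative))."""
    return max(0.0, margin - cosine(anchor, positive) + cosine(anchor, negative))

def in_batch_softmax_loss(anchors, positives, temperature=0.05):
    """Mean cross-entropy where positives[i] is the match for anchors[i]
    and every other positive in the batch acts as a negative."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        top = max(logits)  # subtract the max for numerical stability
        log_sum_exp = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)

# Toy 2-dimensional "embeddings", made up for illustration.
good = triplet_margin_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])  # positive is closer
bad = triplet_margin_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])   # negative is closer
aligned = in_batch_softmax_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = in_batch_softmax_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In both cases the loss is near zero when each anchor sits closest to its own positive, and grows when a negative is closer, which is the pull-together, push-apart behavior described above.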

The "L6" in MiniLM-L6-v2 means 6 transformer layers. The original BERT-base has 12. The L6 model is smaller and faster but slightly less accurate. For a retrieval system that already uses BM25 as a complement, the L6 accuracy is sufficient.

What 384 Dimensions Actually Encode

The dimensions are not labeled. But we can probe them. If you fit a linear probe on the embedding, or simply inspect a single dimension (say, dimension 47), you might find that it correlates with the presence of a named entity, or sentiment polarity, or the tense of the main verb. The model has learned to allocate dimensions to whatever signals help minimize the contrastive loss.

Empirically, MiniLM-L6-v2's dimensions encode a mix of:

Topical signals — what the sentence is about
Syntactic signals — grammar, structure, length
Semantic relations — synonymy, entailment, contradiction
Style markers — formal vs informal, technical vs general

But the encoding is entangled — a single dimension participates in multiple concepts. That is why you cannot interpret individual dimensions in isolation.

The Practical Sweet Spot

For a browser-deployed retrieval system, 384 dim hits a sweet spot:

Accuracy: 90-95% of full BERT-base for sentence similarity tasks
Size: 22 MB quantized, small enough to download to a phone in seconds
Latency: ~15ms per embedding on a mid-range laptop, ~3ms on WebGPU
Storage: about 15 MB for 10,000 knowledge-base fragments, which is acceptable

If you go to 768 dim, the storage doubles and the inference is ~2x slower. If you go to 128 dim, you lose too much accuracy. 384 is the local minimum of the pain curve.

What This Means for Users

The user does not see "384 dimensions." They see a chatbot that:

Recognizes paraphrases ("how do I split profits fairly" → finds "Shapley value")
Handles synonyms ("auction" ↔ "sealed-bid mechanism")
Tolerates typos and slight rewordings
Fails gracefully on out-of-domain queries (low similarity → no good match)
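The last behavior in that list, refusing to answer when nothing matches well, can be sketched as a similarity threshold on the best hit. This is a hedged illustration rather than ReLU.chat's actual retrieval code: the function names, the 0.5 cutoff, and the toy 3-dimensional vectors are all assumptions.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def best_match(query_vec, fragments, min_similarity=0.5):
    """Return (text, score) for the closest fragment, or (None, score)
    when even the best score falls below min_similarity."""
    best_text, best_score = None, -1.0
    for text, vec in fragments:
        score = cosine_similarity(query_vec, vec)
        if score > best_score:
            best_text, best_score = text, score
    if best_score < min_similarity:
        return None, best_score  # out of domain: no good match
    return best_text, best_score

# Toy knowledge base with made-up vectors (real ones would be 384-dim).
fragments = [
    ("Shapley value", [0.9, 0.1, 0.0]),
    ("sealed-bid mechanism", [0.1, 0.9, 0.0]),
]

in_domain = best_match([0.8, 0.2, 0.0], fragments)
out_of_domain = best_match([0.0, 0.0, 1.0], fragments)
```

The right cutoff depends on the model and the knowledge base, so in practice it would be tuned on real queries rather than fixed at 0.5.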

That is what 384 numbers buy you. Not magic — careful engineering and a model trained on more text than any human can read in a lifetime.