Running a transformer in the browser sounds impossible until you realize the model only needs to be good enough for a constrained domain. ReLU.chat ships a quantized all-MiniLM-L6-v2 that is roughly 22 MB on the wire but ~90 MB in its original PyTorch form. Two technologies make that work: ONNX for a portable graph format, and INT8 post-training quantization for size and speed.
What ONNX Gives You
ONNX (Open Neural Network Exchange) is a graph IR. You export a model once from PyTorch or TensorFlow and the same file runs in C++, Python, JavaScript, and WebAssembly. For browser work the key targets are:
- onnxruntime-web (WASM): the safe baseline, runs everywhere
- onnxruntime-web (WebGL/WebGPU execution provider): hardware-accelerated when available
The Size Problem
all-MiniLM-L6-v2 in FP32 is about 90 MB. The original bert-base is 440 MB. If a user has to download 90 MB before the first response, they will leave. Mobile users on flaky connections will absolutely leave.
The two levers are:
- Architecture: MiniLM has 6 layers (vs 12 in BERT-base) and 384 hidden dim (vs 768). It is distilled to mimic a larger teacher. That alone gets us to ~90 MB.
- Quantization: Convert FP32 weights to INT8 (8-bit integers). That cuts weight memory by 4x.
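As a rough check on these figures, the arithmetic can be sketched in a few lines. The parameter counts below (about 22.7M for all-MiniLM-L6-v2 and about 110M for bert-base) are approximate public figures assumed for illustration, and `weight_size_mb` is a hypothetical helper. Real model files also carry graph metadata, so on-disk sizes will differ slightly.

```python
# Back-of-envelope weight sizes. Parameter counts are assumed approximations,
# not measurements of the ReLU.chat build.
BYTES_PER_MB = 1_000_000

def weight_size_mb(num_params: int, bits_per_weight: int) -> float:
    """Size of the raw weights in megabytes, ignoring graph metadata."""
    return num_params * bits_per_weight / 8 / BYTES_PER_MB

MINILM_PARAMS = 22_700_000       # assumed: ~22.7M parameters
BERT_BASE_PARAMS = 110_000_000   # assumed: ~110M parameters

bert_mb = weight_size_mb(BERT_BASE_PARAMS, 32)
fp32_mb = weight_size_mb(MINILM_PARAMS, 32)
int8_mb = weight_size_mb(MINILM_PARAMS, 8)

print(f"bert-base FP32: {bert_mb:.1f} MB")
print(f"MiniLM FP32:    {fp32_mb:.1f} MB")
print(f"MiniLM INT8:    {int8_mb:.1f} MB ({fp32_mb / int8_mb:.0f}x smaller)")
```

Under these assumptions the numbers land at roughly 440 MB, 91 MB, and 23 MB, which lines up with the sizes quoted in this post.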
INT8 Post-Training Quantization
INT8 quantization maps each FP32 weight to a signed 8-bit integer using a per-tensor or per-channel scale:
q = clamp(round(x / scale), -128, 127) # FP32 → INT8
x_hat = q * scale # INT8 → FP32 (during matmul)
The scale is fitted on a small calibration set so that the dynamic range of each tensor is preserved. We use per-channel scales on the matmul weights and per-tensor scales on activations.
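To make the mapping concrete, here is a minimal sketch of a quantize/dequantize round trip. It assumes a symmetric scheme with the scale taken from the largest absolute value and a clamp to [-127, 127]. That is a common simplification, not necessarily the calibrated scheme used for the shipped model. All function names are hypothetical.

```python
QMAX = 127  # assumed symmetric INT8 range: [-127, 127]

def absmax_scale(values):
    """Scale that maps the largest magnitude in `values` onto QMAX."""
    max_abs = max((abs(v) for v in values), default=0.0)
    return max_abs / QMAX if max_abs > 0 else 1.0

def quantize(values, scale):
    """FP32 -> INT8: divide by scale, round, clamp to the INT8 range."""
    return [max(-QMAX, min(QMAX, round(v / scale))) for v in values]

def dequantize(q_values, scale):
    """INT8 -> FP32: multiply the integers back by the scale."""
    return [q * scale for q in q_values]

def per_channel_scales(rows):
    """One scale per row (channel) instead of one for the whole tensor."""
    return [absmax_scale(row) for row in rows]

weights = [0.02, -0.75, 0.31, 1.27, -1.0]
scale = absmax_scale(weights)
q = quantize(weights, scale)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("INT8 values:", q)
print("max round-trip error:", max_error, "(bounded by scale / 2 =", scale / 2, ")")
```

For values inside the clamp range, the round-trip error is never more than half a quantization step, which is why a well-chosen scale matters so much.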
What you give up:
- Slight accuracy loss — typically <1% on downstream tasks for MiniLM
- Some math stays in FP32 — accumulators and softmax are usually FP32 for numerical stability
What you get:
- ~4x smaller weights
- ~2-3x faster inference on WASM (a 128-bit SIMD register holds 16 INT8 values versus 4 FP32 values)
- Lower memory bandwidth — fewer bytes to move
The Calibration Step
Naive quantization breaks models. The fix is calibration: feed ~500 representative sentences through the FP32 model, observe activation distributions per layer, and choose scales that minimize reconstruction error. The standard recipe is:
- Collect activation histograms on the calibration set
- Pick scales that minimize KL divergence between FP32 and INT8 distributions
- Freeze scales, quantize weights statically
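The recipe above can be sketched roughly as follows. This is a simplified, pure-Python variant of the widely used histogram-plus-KL approach, not the exact procedure used for this model. The bin count, candidate step, epsilon floor, and every function name are assumptions made for illustration.

```python
import math
import random

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) for two unnormalized histograms of equal length."""
    p_total, q_total = sum(p), sum(q)
    if q_total == 0:
        return float("inf")
    kl = 0.0
    for p_count, q_count in zip(p, q):
        if p_count > 0:
            p_prob = p_count / p_total
            q_prob = max(q_count / q_total, eps)  # floor avoids log of zero
            kl += p_prob * math.log(p_prob / q_prob)
    return kl

def expand_quantized(hist, num_levels):
    """Merge hist into num_levels buckets, then spread each bucket's count
    evenly over its originally nonzero bins (mimics INT8 resolution)."""
    n = len(hist)
    q = [0.0] * n
    for level in range(num_levels):
        start, end = level * n // num_levels, (level + 1) * n // num_levels
        nonzero = sum(1 for c in hist[start:end] if c > 0)
        if nonzero:
            share = sum(hist[start:end]) / nonzero
            for k in range(start, end):
                if hist[k] > 0:
                    q[k] = share
    return q

def calibrate_threshold(values, num_bins=2048, num_levels=128, step=16):
    """Return (clip threshold, scale) minimizing KL over candidate clips."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return 0.0, 1.0
    bin_width = max_abs / num_bins
    hist = [0] * num_bins
    for v in values:  # histogram of magnitudes
        hist[min(int(abs(v) / bin_width), num_bins - 1)] += 1
    best_i, best_kl = num_bins, float("inf")
    for i in range(num_levels, num_bins + 1, step):
        p = hist[:i]
        p[i - 1] += sum(hist[i:])  # clipped outliers pile up in the last bin
        kl = kl_divergence(p, expand_quantized(hist[:i], num_levels))
        if kl < best_kl:
            best_i, best_kl = i, kl
    threshold = best_i * bin_width
    return threshold, threshold / 127

random.seed(0)
# Synthetic stand-in for one layer's activations: Gaussian bulk plus an outlier.
samples = [random.gauss(0.0, 1.0) for _ in range(5000)] + [40.0]
threshold, scale = calibrate_threshold(samples)
print(f"max |x| = 40.0, chosen clip threshold = {threshold:.2f}, scale = {scale:.5f}")
```

The point of the search is that clipping a rare outlier can be cheaper, in distribution terms, than stretching the INT8 grid to cover it. Once chosen, the threshold is frozen and reused at inference time.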
Inference in the Browser
The runtime pipeline looks like this:
text → tokenizer (WASM)
→ input_ids + attention_mask
→ ONNX session (WASM, INT8)
→ mean-pool over token embeddings
→ L2-normalize
→ 384-dim float32 vector
Mean pooling and L2 normalization are done in plain JS — they are trivial math, not worth a kernel. The expensive part is the transformer body, which is what INT8 speeds up.
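For reference, the post-processing math is small enough to show in full. The sketch below is written in Python for brevity, while the production code described here is plain JS. It assumes the session returns one embedding per token and that padding positions have a 0 in the attention mask. The function names are hypothetical.

```python
import math

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping positions where the mask is 0."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for d in range(dim):
                summed[d] += vec[d]
    count = max(count, 1)  # guard against an all-padding input
    return [s / count for s in summed]

def l2_normalize(vec, eps=1e-12):
    """Scale the vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]

# Toy example with 2-dim embeddings; the real model emits 384 dims per token.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]  # last row is padding
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)
embedding = l2_normalize(pooled)
print(pooled, embedding)
```

Masking matters here: without it, the padding row would dominate the average and distort the sentence vector.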
When Quantization Hurts
INT8 is not free. We observed two failure modes during calibration:
- Layer norm collapse: if a layer's activations have heavy tails, naive quantization kills the small values. Solution: per-channel scale on the layer-norm inputs.
- Embedding outliers: the embedding layer has a few dimensions with very large magnitudes. Solution: keep the embedding layer in FP16 (it is small relative to the rest).
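A toy example shows why outliers are so damaging under a single shared scale. This sketch assumes symmetric absmax scaling and uses made-up numbers. It illustrates the mechanism behind both failure modes, not the actual tensors from the model.

```python
QMAX = 127  # assumed symmetric INT8 range

def absmax_scale(values):
    """Scale that maps the largest magnitude in `values` onto QMAX."""
    max_abs = max(abs(v) for v in values)
    return max_abs / QMAX if max_abs > 0 else 1.0

def fake_quantize(values, scale):
    """Quantize to INT8 and immediately dequantize, to expose the error."""
    return [max(-QMAX, min(QMAX, round(v / scale))) * scale for v in values]

# Each row is one channel. The second channel carries large outliers.
matrix = [
    [0.01, -0.02, 0.015],
    [50.0, -30.0, 10.0],
]

# Per-tensor: one scale for everything, dominated by the 50.0 outlier.
tensor_scale = absmax_scale([v for row in matrix for v in row])
per_tensor = [fake_quantize(row, tensor_scale) for row in matrix]

# Per-channel: each row gets a scale that fits its own range.
per_channel = [fake_quantize(row, absmax_scale(row)) for row in matrix]

print("per-tensor, small channel: ", per_tensor[0])
print("per-channel, small channel:", per_channel[0])
```

With the shared scale, every value in the small channel rounds to zero. With a scale per channel, the same values survive with only a small rounding error.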
The Result
A 22 MB model that loads on a slow mobile connection in ~6 seconds, runs at ~15 ms per inference on a mid-range laptop, and is accurate enough to power a knowledge-base chatbot. The 4x size reduction and 2-3x speedup are what make the entire product viable.
Quantization is not a hack. For domain-specific, retrieval-grounded systems, it is the right tool.