Running a transformer in the browser sounds impossible until you realize the model only needs to be good enough for a constrained domain. ReLU.chat ships a quantized all-MiniLM-L6-v2 that is roughly 22 MB on the wire, down from ~90 MB in its original PyTorch form. Two technologies make that work: ONNX for a portable graph format, and INT8 post-training quantization for size and speed.

What ONNX Gives You

ONNX (Open Neural Network Exchange) is a graph IR (intermediate representation). You export a model once from PyTorch or TensorFlow and the same file runs under runtimes for C++, Python, JavaScript, and WebAssembly. For browser work the key targets are:

ReLU.chat currently uses the WASM execution provider because it is deterministic across browsers, has no GPU driver requirements, and ships in a small footprint.

The Size Problem

all-MiniLM-L6-v2 in FP32 is about 90 MB. The original BERT-base is about 440 MB. If a user has to download 90 MB before the first response, they will leave. Mobile users on flaky connections will absolutely leave.

The two levers are:

  1. Architecture: MiniLM has 6 layers (vs 12 in BERT-base) and a hidden dimension of 384 (vs 768). It is distilled to mimic a larger teacher. That alone gets us to ~90 MB.
  2. Quantization: Convert FP32 weights (4 bytes each) to INT8 (1 byte each). That cuts weight memory by 4x.
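
The arithmetic behind those two levers can be sketched in a few lines. The parameter count below is an approximate, assumed figure for all-MiniLM-L6-v2 and is not stated in this article. The estimate also ignores graph metadata, stored scales, and any tensors a real exporter leaves in FP32, so treat it as a rough check, not a measurement.

```python
# Back-of-envelope model size: parameter count times bytes per weight.
# PARAMS is an approximate, assumed figure for all-MiniLM-L6-v2.
PARAMS = 22_700_000


def size_mb(params: int, bytes_per_weight: int) -> float:
    """Weight storage in megabytes (1 MB = 10**6 bytes), ignoring metadata."""
    return params * bytes_per_weight / 1_000_000


fp32_mb = size_mb(PARAMS, 4)  # 32-bit floats: 4 bytes per weight
int8_mb = size_mb(PARAMS, 1)  # 8-bit integers: 1 byte per weight

print(f"FP32: {fp32_mb:.1f} MB, INT8: {int8_mb:.1f} MB")
```

Under that assumption the estimate lands near the ~90 MB and ~22 MB figures quoted in this article.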

INT8 Post-Training Quantization

INT8 quantization maps each FP32 weight to a signed 8-bit integer using a per-tensor or per-channel scale:

q = clamp(round(x / scale), -128, 127)   # FP32 → INT8, saturating at the int8 limits
x_hat = q * scale                        # INT8 → FP32 (during matmul)

Weight scales are computed directly from the weight values. Activation scales are calibrated on a small set of representative inputs so that the dynamic range of each tensor is preserved. We use per-channel scales on the matmul weights and per-tensor scales on activations.
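
To make the per-tensor versus per-channel distinction concrete, here is a minimal pure-Python sketch of symmetric INT8 quantization. The helper names (`symmetric_scale`, `quantize`, `dequantize`, `row_error`) and the toy weight matrix are illustrative assumptions, not part of ReLU.chat's pipeline or of any library.

```python
def symmetric_scale(values, qmax=127):
    """Scale that maps the largest magnitude in `values` onto qmax."""
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(values, scale, qmin=-128, qmax=127):
    """FP32 -> INT8: divide by the scale, round, clamp to the signed 8-bit range."""
    return [max(qmin, min(qmax, round(v / scale))) for v in values]


def dequantize(q_values, scale):
    """INT8 -> approximate FP32."""
    return [q * scale for q in q_values]


def row_error(original, restored):
    """Largest absolute round-trip error within one row."""
    return max(abs(a - b) for a, b in zip(original, restored))


# Toy 2x3 weight matrix; each row stands in for one output channel.
weights = [
    [0.02, -0.01, 0.03],  # small-magnitude channel
    [1.50, -2.00, 0.75],  # large-magnitude channel
]

# Per-tensor: a single scale shared by every weight in the matrix.
tensor_scale = symmetric_scale([w for row in weights for w in row])
per_tensor = [dequantize(quantize(row, tensor_scale), tensor_scale) for row in weights]

# Per-channel: each row gets a scale fitted to its own range.
per_channel = []
for row in weights:
    channel_scale = symmetric_scale(row)
    per_channel.append(dequantize(quantize(row, channel_scale), channel_scale))

print("small channel, per-tensor error: ", row_error(weights[0], per_tensor[0]))
print("small channel, per-channel error:", row_error(weights[0], per_channel[0]))
```

With a shared scale, the large-magnitude row dictates the step size and the small row is squeezed into a handful of integer levels. Per-channel scales avoid that, which is one reason they are a common choice for matmul weights.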

What you give up:

What you gain:

The Calibration Step

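Naive quantization breaks models: if each scale is simply stretched to cover a tensor's raw minimum and maximum, a few outliers consume most of the INT8 range. The fix is calibration: feed ~500 representative sentences through the FP32 model, observe activation distributions per layer, and choose scales that minimize reconstruction error. The standard recipe is: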

  1. Collect activation histograms on the calibration set
  2. Pick scales that minimize KL divergence between FP32 and INT8 distributions
  3. Freeze scales, quantize weights statically

For MiniLM-L6, this takes about 10 minutes on a laptop and produces a model that is indistinguishable from FP32 for retrieval use cases.
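
The recipe above can be sketched for a single activation tensor as follows. This is a deliberately simplified variant of KL-based calibration, not the exact algorithm used by any particular toolkit. The function names, bin count, candidate grid, and synthetic Gaussian "activations" are all assumptions made for illustration.

```python
import math
import random


def histogram(values, bins, lo, hi):
    """Count values into `bins` equal-width buckets covering [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = int((v - lo) / width)
        counts[min(bins - 1, max(0, idx))] += 1
    return counts


def kl_divergence(p_counts, q_counts, eps=1e-6):
    """KL(P || Q) between two histograms, smoothed so empty bins are safe."""
    p_total = sum(p_counts) + eps * len(p_counts)
    q_total = sum(q_counts) + eps * len(q_counts)
    kl = 0.0
    for p, q in zip(p_counts, q_counts):
        p_prob = (p + eps) / p_total
        q_prob = (q + eps) / q_total
        kl += p_prob * math.log(p_prob / q_prob)
    return kl


def fake_quantize(values, threshold, qmax=127):
    """Quantize to symmetric INT8 with clipping at +/-threshold, then dequantize."""
    scale = threshold / qmax
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def calibrate_threshold(activations, candidates=32, bins=128):
    """Return (threshold, scale) minimizing KL between FP32 and INT8 histograms."""
    max_abs = max(abs(a) for a in activations)
    if max_abs == 0:
        return 0.0, 1.0  # all-zero tensor: any scale works
    reference = histogram(activations, bins, -max_abs, max_abs)  # step 1
    best_threshold, best_kl = max_abs, float("inf")
    for i in range(1, candidates + 1):  # step 2: search clipping thresholds
        threshold = max_abs * i / candidates
        restored = fake_quantize(activations, threshold)
        kl = kl_divergence(reference, histogram(restored, bins, -max_abs, max_abs))
        if kl < best_kl:
            best_threshold, best_kl = threshold, kl
    return best_threshold, best_threshold / 127  # step 3: freeze this scale


# Synthetic stand-in for one tensor's activations, plus a single outlier.
rng = random.Random(0)
activations = [rng.gauss(0.0, 1.0) for _ in range(500)] + [8.0]
threshold, scale = calibrate_threshold(activations)
print(f"clip at +/-{threshold:.3f}, scale = {scale:.5f}")
```

A real calibration run repeats this per activation tensor, using activations recorded while the calibration sentences pass through the FP32 model, and typically uses finer histograms than this sketch.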

Inference in the Browser

The runtime pipeline looks like this:

text  →  tokenizer (WASM)
      →  input_ids + attention_mask
      →  ONNX session (WASM, INT8)
      →  mean-pool over token embeddings
      →  L2-normalize
      →  384-dim float32 vector

Mean pooling and L2 normalization are done in plain JS: they are trivial math, not worth a kernel. The expensive part is the transformer body, which is what INT8 speeds up.
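
For reference, the pooling and normalization arithmetic looks roughly like this. In production these steps run in plain JS, as noted above, so the Python below only mirrors the math. The function names and the toy 4-dimensional embeddings are illustrative assumptions; the real model emits 384 dimensions per token.

```python
import math


def mean_pool(token_embeddings, attention_mask):
    """Average the embeddings of real tokens, skipping padding (mask == 0)."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                summed[i] += value
    count = max(count, 1)  # avoid dividing by zero on an all-padding input
    return [s / count for s in summed]


def l2_normalize(vector, eps=1e-12):
    """Scale the vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / max(norm, eps) for v in vector]


# Toy input: three token positions, the last one is padding.
token_embeddings = [
    [1.0, 2.0, 0.0, 0.0],
    [3.0, 2.0, 0.0, 0.0],
    [9.0, 9.0, 9.0, 9.0],  # padding position, ignored via the mask
]
attention_mask = [1, 1, 0]

pooled = mean_pool(token_embeddings, attention_mask)
sentence_vector = l2_normalize(pooled)
print(sentence_vector)
```

Masking matters here: without it, padding positions would drag every short sentence's embedding toward the same padding vector.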

When Quantization Hurts

INT8 is not free. We observed two failure modes during calibration:

The Result

A 22 MB model that loads on a slow mobile connection in ~6 seconds, runs at ~15 ms per inference on a mid-range laptop, and is accurate enough to power a knowledge-base chatbot. The 4x size reduction and 2-3x speedup are what make the entire product viable.

Quantization is not a hack. For domain-specific, retrieval-grounded systems, it is the right tool.