The idea of running a chatbot entirely in your browser — without sending a single byte to a remote server — was science fiction just a few years ago. Today, thanks to advances in model quantization, WebAssembly, and the ONNX Runtime, it's production reality.
The Core Challenge
Traditional chatbots rely on cloud APIs: you send a query, a server runs inference, and you get a response. This introduces latency, privacy concerns, and dependency on external services. Browser-based chatbots flip this model entirely.
The Embedding Layer
Everything starts with embeddings. A sentence transformer like all-MiniLM-L6-v2 converts text into 384-dimensional vectors. In ReLU.chat, this model is exported to ONNX format and quantized (~22MB), then runs in the browser via ONNX Runtime's WebAssembly backend.
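// Run the encoder; input_ids and attention_mask are tensors produced by the tokenizer.
const { last_hidden_state } = await session.run({ input_ids, attention_mask });
// Flat Float32Array of shape [batch, seqLen, 384]: one vector per token, still to be pooled into a sentence vector.
const tokenVectors = last_hidden_state.data;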
The quantization step is critical: the original FP32 model is ~90MB (roughly 22 million parameters at 4 bytes each), and INT8 quantization stores each weight in 1 byte, which brings it to ~22MB with minimal accuracy loss. That's small enough to load on a mobile connection.
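The snippet above stops at `last_hidden_state`, which holds one 384-dimensional vector per token rather than one per sentence. Sentence-transformer models of this family are normally used with mean pooling over the non-padding tokens followed by L2 normalization. The sketch below shows that step under some assumptions: batch size 1, a flat `Float32Array` as input, and a helper name (`meanPool`) that is illustrative and not taken from ReLU.chat.

```javascript
// Mean-pool token vectors into one L2-normalized sentence vector.
// hidden: flat array of shape [seqLen, dim] (batch size 1 assumed)
// mask:   attention mask of length seqLen (1 = real token, 0 = padding);
//         Number() also accepts the BigInt values of an int64 tensor.
function meanPool(hidden, mask, dim = 384) {
  const seqLen = mask.length;
  const pooled = new Float32Array(dim);
  let count = 0;
  for (let t = 0; t < seqLen; t++) {
    if (Number(mask[t]) === 0) continue; // skip padding tokens
    count++;
    for (let d = 0; d < dim; d++) pooled[d] += hidden[t * dim + d];
  }
  let norm = 0;
  for (let d = 0; d < dim; d++) {
    pooled[d] /= Math.max(count, 1); // average over real tokens
    norm += pooled[d] * pooled[d];
  }
  norm = Math.sqrt(norm) || 1; // avoid division by zero
  for (let d = 0; d < dim; d++) pooled[d] /= norm;
  return pooled;
}
```

Because the result has unit length, the cosine similarity between two sentence vectors reduces to a plain dot product.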
BM25 Sparse Retrieval
Alongside dense embeddings, we use BM25 — a classic information retrieval algorithm. BM25 scores documents based on term frequency and inverse document frequency. The key formula:
$$\text{score}(D, Q) = \sum_{i} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}$$
BM25 is fast, requires no GPU, and catches keyword matches that dense embeddings might miss.
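To make the formula concrete, here is a minimal sketch of a BM25 scorer over a small in-memory corpus. The values `k1 = 1.2` and `b = 0.75` are common defaults, not values confirmed for ReLU.chat. The IDF variant (the non-negative form used by Lucene), the tokenizer, and the sample documents are also assumptions made for illustration.

```javascript
// Minimal BM25 scorer over pre-tokenized documents (arrays of terms).
function buildBM25(docs, k1 = 1.2, b = 0.75) {
  const N = docs.length;
  const avgdl = docs.reduce((sum, d) => sum + d.length, 0) / N;
  const df = new Map(); // term -> number of documents containing it
  for (const doc of docs) {
    for (const term of new Set(doc)) df.set(term, (df.get(term) || 0) + 1);
  }
  // Assumed IDF variant: stays non-negative even for very common terms.
  const idf = (term) => {
    const n = df.get(term) || 0;
    return Math.log(1 + (N - n + 0.5) / (n + 0.5));
  };
  return function score(queryTerms, docIndex) {
    const doc = docs[docIndex];
    let total = 0;
    for (const q of queryTerms) {
      const f = doc.filter((t) => t === q).length; // term frequency f(q, D)
      if (f === 0) continue;
      const lengthNorm = 1 - b + b * (doc.length / avgdl);
      total += (idf(q) * (f * (k1 + 1))) / (f + k1 * lengthNorm);
    }
    return total;
  };
}

// Illustrative tokenizer and corpus.
const tokenize = (text) => text.toLowerCase().match(/[a-z0-9]+/g) || [];
const docs = [
  "ONNX Runtime runs models in the browser",
  "BM25 ranks documents by keyword overlap",
  "Quantization shrinks the model file",
].map(tokenize);

const score = buildBM25(docs);
const ranked = docs
  .map((_, i) => ({ doc: i, score: score(tokenize("bm25 keyword"), i) }))
  .sort((x, y) => y.score - x.score);
```

Here `ranked[0].doc` is `1`, the only document that contains either query term; the other two documents score zero.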
The Signal Layer
The signal layer fuses multiple retrieval signals into a single decision packet:
- Dense cosine similarity (semantic match)
- BM25 score (keyword match)
- Entity overlap (named entity recognition)
- Intent classification (what the user wants)
- Follow-up detection (context continuity)
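As a rough illustration of how such signals might be gathered, the sketch below computes a cosine similarity and a simple entity overlap, then collects everything into one object. The packet layout, the field names, the Jaccard overlap measure, and the BM25 scaling are all hypothetical; the actual ReLU.chat packet may be structured quite differently.

```javascript
// Cosine similarity between two equal-length vectors.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na && nb ? dot / Math.sqrt(na * nb) : 0;
}

// Jaccard overlap between two entity lists (assumed overlap measure).
function entityOverlap(queryEntities, docEntities) {
  const q = new Set(queryEntities);
  const d = new Set(docEntities);
  const shared = [...q].filter((e) => d.has(e)).length;
  const union = new Set([...q, ...d]).size;
  return union === 0 ? 0 : shared / union;
}

// Hypothetical packet layout; field names are illustrative only.
function buildSignalPacket(s) {
  return {
    dense: cosine(s.queryVec, s.docVec),
    sparse: s.maxBm25 > 0 ? s.bm25 / s.maxBm25 : 0, // scale BM25 to [0, 1]
    entities: entityOverlap(s.queryEntities, s.docEntities),
    intent: s.intent,
    followUp: s.isFollowUp ? 1 : 0,
  };
}
```

Scaling each signal to a comparable range before fusing them is one common design choice, since raw BM25 scores are unbounded while cosine similarity is not.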
The Policy Network
A 3-layer MLP (25 → 128 → 64 → 6 action heads) decides how to respond. Trained via reinforcement learning, it learns to balance:
- Factual accuracy vs. creative elaboration
- Short vs. detailed responses
- Direct answers vs. follow-up questions
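A network of this shape is small enough to evaluate in plain JavaScript. The sketch below runs a forward pass through a 25 → 128 → 64 → 6 MLP. It assumes ReLU activations on the hidden layers and treats the six outputs as raw action scores. The weights are random placeholders, since the trained weights and the exact meaning of each action head are not described here.

```javascript
// One dense layer: y = W x + b, with W stored as an array of rows.
function dense(x, W, b) {
  return W.map((row, i) => row.reduce((sum, w, j) => sum + w * x[j], b[i]));
}
const relu = (v) => v.map((z) => Math.max(0, z));

// Random placeholder weights; a real policy would load trained ones.
function randomLayer(inDim, outDim) {
  const scale = Math.sqrt(2 / inDim);
  return {
    W: Array.from({ length: outDim }, () =>
      Array.from({ length: inDim }, () => (Math.random() * 2 - 1) * scale)),
    b: new Array(outDim).fill(0),
  };
}

const layers = [randomLayer(25, 128), randomLayer(128, 64), randomLayer(64, 6)];

// Forward pass: ReLU on hidden layers, raw scores at the output.
function policyForward(signals) {
  let h = signals;
  layers.forEach(({ W, b }, i) => {
    h = dense(h, W, b);
    if (i < layers.length - 1) h = relu(h);
  });
  return h; // 6 action scores
}

const scores = policyForward(new Array(25).fill(0.5)); // dummy 25-d input
const action = scores.indexOf(Math.max(...scores));    // greedy action choice
```

With biases included, these layer sizes add up to 11,974 parameters, so the policy adds almost nothing to download size or inference time compared with the embedding model.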
Fragment Composition
The final response isn't generated token-by-token like an LLM. Instead, it's composed from pre-written knowledge fragments joined by linguistic connectors. Because every factual statement comes from a curated fragment, the bot cannot invent facts that are not in its knowledge base, while the connectors keep the output natural-sounding.
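A toy version of this composition step might look like the following. The fragment texts, fragment ids, connector names, and the `compose` function are all invented for illustration and are not taken from the ReLU.chat codebase.

```javascript
// Hypothetical curated fragments, keyed by id.
const fragments = {
  "bm25-what": "BM25 scores documents by term frequency and inverse document frequency",
  "bm25-why": "it catches exact keyword matches that dense embeddings can miss",
};

// Hypothetical connectors that link one fragment to the next.
const connectors = { cause: ", and ", contrast: ", but ", addition: ". Also, " };

// Join the selected fragments with their connectors, then close with a period.
function compose(plan) {
  return plan
    .map((step, i) => (i === 0 ? "" : connectors[step.link]) + fragments[step.id])
    .join("") + ".";
}

const reply = compose([
  { id: "bm25-what" },
  { id: "bm25-why", link: "cause" },
]);
```

In this sketch the policy's job would be to pick the plan (which fragments, in which order, with which connectors), while the wording of each fact stays fixed.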
Why It Matters
Browser-based chatbots offer three key advantages:
- Privacy: No data leaves your device. Ever.
- Speed: Sub-100ms inference with no network round-trip.
- Offline: Works without internet after initial load.
Try It Yourself
ReLU.chat is open-source and runs entirely in your browser. Visit the live demo or explore the full architecture.