How Browser-Based Chatbots Work: A Deep Dive into On-Device NLP

The idea of running a chatbot entirely in your browser, without sending a single byte to a remote server, was science fiction just a few years ago. Today, thanks to advances in model quantization, WebAssembly, and ONNX Runtime, it's a production reality.

The Core Challenge

Traditional chatbots rely on cloud APIs: you send a query, a server runs inference, and you get a response. This introduces latency, privacy concerns, and dependency on external services. Browser-based chatbots flip this model entirely.

The Embedding Layer

Everything starts with embeddings. A sentence transformer like all-MiniLM-L6-v2 converts text into 384-dimensional vectors. In ReLU.chat, this model is exported to ONNX, quantized to INT8 (~22MB), and run via ONNX Runtime's WebAssembly backend.

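// last_hidden_state holds one 384-dim vector per token, flattened to [tokens × 384]
const outputs = await session.run({ input_ids, attention_mask });
const tokenStates = outputs.last_hidden_state.data;
// Mean-pool tokenStates over the attention mask to get the single sentence vector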

The quantization step is critical: the original model is ~90MB, but INT8 quantization brings it to ~22MB with minimal accuracy loss. That's small enough to load on a mobile connection.
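
The encoder's last_hidden_state output is per-token, so one more step is needed before it can be compared against other sentences. The sketch below shows mean pooling followed by L2 normalization, which is the usual recipe for MiniLM-style sentence transformers. It assumes a batch of one sequence; the function name meanPool and its argument layout are illustrative, not taken from ReLU.chat's code.

```javascript
// Mean-pool token embeddings into one sentence vector, then L2-normalize.
// hidden: flat array of shape [seqLen * dim] for a single sequence
// mask:   attention mask of length seqLen (1 = real token, 0 = padding);
//         may hold BigInt values when it comes from an int64 tensor
// dim:    embedding width (384 for all-MiniLM-L6-v2)
function meanPool(hidden, mask, dim) {
  const seqLen = mask.length;
  const pooled = new Float32Array(dim);
  let count = 0;
  for (let t = 0; t < seqLen; t++) {
    if (Number(mask[t]) === 0) continue; // skip padding tokens
    count++;
    for (let d = 0; d < dim; d++) pooled[d] += hidden[t * dim + d];
  }
  if (count === 0) return pooled; // nothing to pool: return the zero vector

  let norm = 0;
  for (let d = 0; d < dim; d++) {
    pooled[d] /= count;
    norm += pooled[d] * pooled[d];
  }
  norm = Math.sqrt(norm) || 1; // avoid dividing by zero
  for (let d = 0; d < dim; d++) pooled[d] /= norm;
  return pooled;
}

// Toy input: three tokens of width 2, the last one is padding.
const toyHidden = Float32Array.from([1, 0, 3, 0, 100, 100]);
const toyMask = [1, 1, 0];
const toySentence = meanPool(toyHidden, toyMask, 2);
```

With real session output, both arguments would be the tensors' .data arrays and dim would be 384. Normalizing here means a plain dot product later equals cosine similarity.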

BM25 Sparse Retrieval

Alongside dense embeddings, we use BM25 — a classic information retrieval algorithm. BM25 scores documents based on term frequency and inverse document frequency. The key formula:

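$$\text{score}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \, (k_1 + 1)}{f(q_i, D) + k_1 \left(1 - b + b \, \frac{|D|}{\text{avgdl}}\right)}$$

Here q_i is the i-th query term, f(q_i, D) is how often it occurs in document D, |D| is the document's length in terms, avgdl is the average document length in the collection, and k1 and b are tuning parameters (commonly k1 between 1.2 and 2.0 and b = 0.75).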

BM25 is fast, requires no GPU, and catches keyword matches that dense embeddings might miss.
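
To make the formula concrete, here is a minimal sketch of a BM25 scorer in plain JavaScript. It is not ReLU.chat's implementation: the function names, the regex tokenizer, the default parameters (k1 = 1.2, b = 0.75), and the smoothed IDF variant are all assumptions chosen for illustration.

```javascript
// Build a BM25 scorer over pre-tokenized documents (arrays of terms).
function buildBm25(docs, k1 = 1.2, b = 0.75) {
  const N = docs.length;
  const avgdl = docs.reduce((sum, d) => sum + d.length, 0) / N;
  const df = new Map(); // term -> number of documents containing it

  // Per-document term frequencies, filling document frequencies on the way.
  const tfs = docs.map((doc) => {
    const tf = new Map();
    for (const term of doc) tf.set(term, (tf.get(term) || 0) + 1);
    for (const term of tf.keys()) df.set(term, (df.get(term) || 0) + 1);
    return tf;
  });

  // Smoothed IDF (the variant that never goes negative).
  const idf = (term) => {
    const n = df.get(term) || 0;
    return Math.log(1 + (N - n + 0.5) / (n + 0.5));
  };

  // Return one score per document for the given query terms.
  return function score(queryTerms) {
    return tfs.map((tf, i) => {
      const lengthNorm = 1 - b + b * (docs[i].length / avgdl);
      let s = 0;
      for (const q of queryTerms) {
        const f = tf.get(q) || 0;
        if (f === 0) continue; // term absent: contributes nothing
        s += idf(q) * (f * (k1 + 1)) / (f + k1 * lengthNorm);
      }
      return s;
    });
  };
}

// Hypothetical tokenizer: lowercase alphanumeric runs.
const tokenize = (text) => text.toLowerCase().match(/[a-z0-9]+/g) || [];

const corpus = [
  "Quantization shrinks the model",
  "BM25 ranks documents by keyword overlap",
  "The quantized model runs in the browser",
].map(tokenize);

const scoreQuery = buildBm25(corpus);
const bm25Scores = scoreQuery(tokenize("quantized model"));
```

In this toy corpus the third document matches both query terms and ranks first, the first matches only "model", and the second scores zero. Note that "quantization" does not match "quantized": BM25 sees exact terms only, which is why it is paired with dense embeddings.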

The Signal Layer

The signal layer fuses multiple retrieval signals into a single decision packet:

Dense cosine similarity (semantic match)
BM25 score (keyword match)
Entity overlap (named entity recognition)
Intent classification (what the user wants)
Follow-up detection (context continuity)

These signals are encoded as a 25-dimensional feature vector, which becomes the input to the policy network.
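
As a rough illustration of this fusion step, the sketch below computes a dense cosine score and packs several signals into a fixed 25-slot vector. The article does not say how the signals map onto the 25 features, so the slot layout, the field names (denseTopK, bm25TopK, entityOverlap, intentProbs, isFollowUp), and the zero-padding are purely hypothetical.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na && nb ? dot / Math.sqrt(na * nb) : 0;
}

const FEATURE_SIZE = 25; // input width of the policy network

// Pack retrieval signals into one fixed-size vector.
// HYPOTHETICAL layout: group order and sizes are illustrative only.
function buildFeatures(signals) {
  const flat = [
    ...signals.denseTopK,        // cosine scores of the best candidates
    ...signals.bm25TopK,         // BM25 scores of the same candidates
    signals.entityOverlap,       // fraction of query entities found
    ...signals.intentProbs,      // intent classifier probabilities
    signals.isFollowUp ? 1 : 0,  // follow-up flag
  ];
  if (flat.length > FEATURE_SIZE) {
    throw new Error(`too many features: ${flat.length} > ${FEATURE_SIZE}`);
  }
  const features = new Float32Array(FEATURE_SIZE); // unused slots stay 0
  features.set(flat);
  return features;
}

const exampleFeatures = buildFeatures({
  denseTopK: [cosine([1, 0], [1, 0]), 0.5, 0.25],
  bm25TopK: [2.5, 1.0, 0],
  entityOverlap: 0.5,
  intentProbs: [0.5, 0.25, 0.125, 0.125],
  isFollowUp: true,
});
```

A fixed width matters because the network's first layer expects exactly 25 inputs, whatever the query looks like.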

The Policy Network

A 3-layer MLP (25 → 128 → 64 → 6 action heads) decides how to respond. Trained via reinforcement learning, it learns to balance:

Factual accuracy vs. creative elaboration
Short vs. detailed responses
Direct answers vs. follow-up questions

When the MLP is unavailable, a parameterized heuristic fallback handles response planning.
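
A forward pass through a network of this shape is small enough to write by hand. The sketch below assumes ReLU activations on the two hidden layers and raw scores on the six outputs; the article does not specify activations, weight storage, or what the fallback computes, so those details, the random placeholder weights, and all function names are assumptions.

```javascript
// One dense layer: y = W x + b, with W stored as [outDim][inDim].
function dense(x, W, b) {
  return W.map((row, j) => row.reduce((sum, w, i) => sum + w * x[i], b[j]));
}

const relu = (v) => v.map((z) => Math.max(0, z));

// Forward pass: ReLU after every layer except the last.
function policyForward(x, layers) {
  let h = Array.from(x);
  layers.forEach(({ W, b }, idx) => {
    h = dense(h, W, b);
    if (idx < layers.length - 1) h = relu(h);
  });
  return h; // one raw score per action head
}

// Use the trained network when its weights are loaded, else a fallback.
// The fallback function is supplied by the caller; its logic is not
// described in the article.
function planResponse(x, layers, fallback) {
  return layers ? policyForward(x, layers) : fallback(x);
}

// Placeholder weights standing in for trained ones.
function randomLayer(inDim, outDim) {
  const W = Array.from({ length: outDim }, () =>
    Array.from({ length: inDim }, () => (Math.random() - 0.5) * 0.1));
  return { W, b: new Array(outDim).fill(0) };
}

const policyLayers = [randomLayer(25, 128), randomLayer(128, 64), randomLayer(64, 6)];
const actionScores = planResponse(new Float32Array(25).fill(0.5), policyLayers, null);
const fallbackScores = planResponse(new Float32Array(25), null, () => [1, 0, 0, 0, 0, 0]);
```

At this size the network has about 12,000 parameters (25×128 + 128×64 + 64×6 weights plus biases), so inference is a few thousand multiply-adds and needs no special runtime.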

Fragment Composition

The final response isn't generated token-by-token like an LLM. Instead, it's composed from pre-written knowledge fragments joined by linguistic connectors. Because the fragments are curated, every factual statement in the output traces back to reviewed text, while the connectors keep the result natural-sounding.
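
A minimal sketch of the idea, assuming each fragment carries an optional relation tag that selects a connector. The connector table, the relation names, and the fragment shape are hypothetical; only the general approach comes from the description above.

```javascript
// HYPOTHETICAL connector table keyed by the relation to the previous fragment.
const CONNECTORS = {
  addition: "Additionally,",
  contrast: "However,",
  cause: "As a result,",
};

// Join curated fragments into one response. The first fragment is used
// as-is; later ones get a connector when their relation is known.
function compose(fragments) {
  return fragments
    .map((frag, i) => {
      const connector = i > 0 ? CONNECTORS[frag.relation] : undefined;
      return connector ? `${connector} ${frag.text}` : frag.text;
    })
    .join(" ");
}

const composedReply = compose([
  { text: "ReLU.chat runs its embedding model in the browser." },
  { text: "INT8 quantization shrinks that model to about 22MB.", relation: "addition" },
  { text: "BM25 still handles exact keyword matches.", relation: "contrast" },
]);
```

Nothing in the output is invented at runtime: the only text not taken verbatim from a fragment is the connector, which is why this sketch keeps fragments that start with a proper noun and avoids re-casing them.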

Why It Matters

Browser-based chatbots offer three key advantages:

Privacy: No data leaves your device. Ever.
Speed: Sub-100ms inference with no network round-trip.
Offline: Works without internet after initial load.

The tradeoff is knowledge scope: these chatbots know what's in their knowledge base, not the entire internet. For domain-specific applications — customer support, documentation, education — this is often the right tradeoff.

Try It Yourself

ReLU.chat is open-source and runs entirely in your browser. Visit the live demo or explore the full architecture.