How ReLU.chat Works

ReLU.chat is an open-source, browser-based chatbot platform. Progressive loading with a heuristic/bag-of-words (BOW) fallback enables instant first turns while the ~22 MB MiniLM model and the KB embeddings load (a Service Worker pre-caches them); the full pipeline then hot-swaps in automatically. Query-embedding memoization and top-k ranking keep responses fast. A sentence transformer embeds queries and knowledge-base entries into 384-dimensional vectors. A signal layer extracts field-weighted BM25 sparse signals with bigram phrase matching, dense cosine similarities, fuzzy entity extraction with word-overlap scoring, and temperature-calibrated intent classification with 19 prototypes per intent, then fuses them into a 25-feature decision packet. An RL-trained MLP policy network with auto-quantized int8 inference decides how to respond (its decisions are compared locally against the heuristic in shadow mode), and a fragment-based composition engine with session-aware diversity penalties, semantic deduplication, an EMA session vector, and progressive streaming rendering delivers the final answer in a native chat experience.

1. Overview

ReLU.chat processes every message entirely in your browser: no API keys, no cloud LLMs, no telemetry. Loading is progressive: a heuristic fallback with bag-of-words (BOW) matching gives usable answers on the very first query while the model loads, and the dense pipeline is hot-swapped in once it is ready. Query memoization, top-k ranking (bounded selection, no full sort), and a session EMA vector improve speed and multi-turn quality.

The system uses a sentence transformer (all-MiniLM-L6-v2, quantized ONNX, ~22 MB) to embed queries and knowledge-base entries into 384-dimensional vectors. A signal layer combines field-weighted BM25 (name/aliases boosted via TF repetition, plus bigram phrase matching) with dense cosine similarity, fuzzy entity extraction (Levenshtein distance, word-overlap scoring, and notation pattern matching), and temperature-calibrated intent classification (19 prototypes per intent, 70/30 best-vs-average scoring) into a 25-feature decision packet. A reinforcement-learning-trained MLP policy network (25 inputs → 128 → 64 → 6 action heads, auto-quantized to int8 at construction for ~4× memory reduction and faster inference, with local shadow logging against the heuristic) decides how to respond. A fragment-based composition engine with session-aware diversity penalties and cross-fragment deduplication renders the final answer with linguistic connectors, delivered via progressive streaming rendering for a native chat experience. Session memory retains up to 30 turns with importance-based eviction, response compression, and an EMA summary vector for long-context coherence.

Query (user types)
  → Embedding (MiniLM-L6-v2, 384-dim ONNX)
  → Signal Layer (BM25 + cosine, ensemble ranking)
  → Features (25-dim vector extraction)
  → Policy (MLP 25→128→64, 6 action heads)
  → Compose (fragment engine + connectors)
  → Response (rendered text)

2. Signal Layer

After embedding, a lightweight signal layer prepares a structured DecisionPacket for the policy network. It combines multiple retrieval and classification signals into a coherent pre-policy feature bundle.

BM25 Sparse Retrieval — Field-weighted term-frequency scoring (k1=1.5, b=0.75). Entry names are repeated 3× and aliases 2× during indexing, so name-matched terms naturally score higher via TF boosting. Bigram phrase matching catches multi-word queries like "Nash equilibrium" as a single token (nash_equilibrium), with an IDF-weighted bonus that complements unigram scoring.
Entity Extraction — Three-pass extraction: (1) exact alias regex matching, (2) fuzzy word-overlap scoring using Levenshtein distance and substring containment for typo tolerance, (3) notation pattern matching for game-theory expressions like (D,D) and (C,C). Session context enriches entities from recent turns with decay weighting.
Intent Classification — Cosine similarity against 19 intent prototypes per category (definition, example, formal, application, comparison), scored with a 70/30 blend of best-match and weighted-average to prevent one lucky prototype from dominating. Calibrated with temperature=1.5 softmax for reliable confidence estimates.
Ensemble Ranking — Dense cosine similarity (0.7 weight) and BM25 scores (0.3 weight) are fused into a combined ranking. A neural reranking pass applies a token-overlap bonus to refine the top results. Follow-up queries get a 0.35 boost (scaled by conversation depth) to the previous topic, preventing topic drift.
Feature Extraction — The ensemble ranking, calibrated intent scores, entity data, and session context are compiled into the 25-feature vector that feeds the policy network.
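
The field-weighted BM25 step can be illustrated with a minimal sketch. It uses the k1 and b values and the 3×/2× repetition quoted above; the function names (`tokenize`, `buildIndex`, `bm25Scores`), the entry shape (`name`, `aliases`, `text`), and the choice to score bigrams as ordinary tokens are assumptions made for illustration, not the project's actual code.

```javascript
// Sketch only: field weighting by token repetition, with bigrams indexed as tokens.
const K1 = 1.5;  // value quoted in the text
const B = 0.75;  // value quoted in the text

function tokenize(text) {
  const words = text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  // Adjacent word pairs become tokens such as "nash_equilibrium".
  const bigrams = words.slice(1).map((w, i) => `${words[i]}_${w}`);
  return words.concat(bigrams);
}

const repeat = (tokens, n) => Array.from({ length: n }, () => tokens).flat();

function buildIndex(entries) {
  const docs = entries.map((e) => {
    const tokens = [
      ...repeat(tokenize(e.name), 3),                        // names counted 3x
      ...repeat((e.aliases ?? []).flatMap(tokenize), 2),     // aliases counted 2x
      ...tokenize(e.text ?? ""),                             // body counted once
    ];
    const tf = new Map();
    for (const t of tokens) tf.set(t, (tf.get(t) ?? 0) + 1);
    return { tf, len: tokens.length };
  });
  const df = new Map();
  for (const d of docs) for (const t of d.tf.keys()) df.set(t, (df.get(t) ?? 0) + 1);
  const avgLen = docs.reduce((s, d) => s + d.len, 0) / Math.max(docs.length, 1);
  return { docs, df, avgLen };
}

function bm25Scores(index, query) {
  const { docs, df, avgLen } = index;
  const terms = new Set(tokenize(query));
  return docs.map((d) => {
    let score = 0;
    for (const t of terms) {
      const f = d.tf.get(t);
      if (!f) continue;
      const n = df.get(t);
      const idf = Math.log(1 + (docs.length - n + 0.5) / (n + 0.5));
      score += (idf * f * (K1 + 1)) / (f + K1 * (1 - B + (B * d.len) / avgLen));
    }
    return score;
  });
}
```

Because name tokens appear three times in the indexed document, a query that matches an entry's name accumulates a higher term frequency there than in an entry that only mentions the term in its body.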

Topic Correction — The signal layer detects explicit topic corrections (e.g., "I meant X" or "no, just X") via regex pattern matching against the current query and session history. When a correction is detected, the corrected topic embedding is forced to the top of the dense-sparse ranking, overriding the ensemble score. This prevents topic drift when the user redirects the conversation mid-turn.
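
A rough sketch of how such a correction could be detected and applied is shown here. The patterns, the function names, and the `{ id }` ranking shape are hypothetical; the actual pattern set is not listed in this document, and mapping the captured text to a KB entry is assumed to happen elsewhere.

```javascript
// Hypothetical correction patterns, modelled on the examples quoted in the text.
const CORRECTION_PATTERNS = [
  /^\s*(?:no[,.!]?\s+)?i\s+meant\s+(.+)$/i,
  /^\s*no[,.!]?\s+just\s+(.+)$/i,
  /^\s*i\s+asked\s+about\s+(.+)$/i,
];

// Returns the corrected topic text, or null when the query is not a correction.
function detectTopicCorrection(query) {
  for (const pattern of CORRECTION_PATTERNS) {
    const m = query.match(pattern);
    if (m) return m[1].trim().replace(/[.!?]+$/, "");
  }
  return null;
}

// Moves the corrected entry to the front of an existing ranking, if present.
function applyCorrection(ranking, correctedId) {
  const hit = ranking.find((r) => r.id === correctedId);
  if (!hit) return ranking;
  return [hit, ...ranking.filter((r) => r !== hit)];
}
```

Forcing the entry to the front, rather than adding a score bonus, mirrors the override behaviour described above: the ensemble score cannot outvote an explicit correction.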

The signal layer itself is stateless (any session context it uses arrives as input) and runs entirely in the browser, with no server calls and no external inference APIs. The resulting DecisionPacket contains the query embedding, entity list, calibrated intent distribution, dense and sparse rankings, confidence metrics, and session context.
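
The intent classification step described earlier in this section can be sketched as follows. The 70/30 blend and the temperature of 1.5 come from the text; the function names, the use of a plain (unweighted) mean in place of the weighted average, and the input shape are assumptions.

```javascript
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na && nb ? dot / Math.sqrt(na * nb) : 0;
}

// 70% best prototype match + 30% mean over all prototypes of that intent.
function intentScore(queryVec, prototypes) {
  const sims = prototypes.map((p) => cosine(queryVec, p));
  const best = Math.max(...sims);
  const mean = sims.reduce((s, x) => s + x, 0) / sims.length;
  return 0.7 * best + 0.3 * mean;
}

// Dividing by a temperature above 1 flattens the distribution before softmax.
function softmaxWithTemperature(scores, temperature = 1.5) {
  const scaled = scores.map((s) => s / temperature);
  const max = Math.max(...scaled);
  const exps = scaled.map((s) => Math.exp(s - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// prototypesByIntent: { definition: [vec, ...], example: [vec, ...], ... }
function classifyIntent(queryVec, prototypesByIntent) {
  const names = Object.keys(prototypesByIntent);
  const raw = names.map((n) => intentScore(queryVec, prototypesByIntent[n]));
  const probs = softmaxWithTemperature(raw);
  return Object.fromEntries(names.map((n, i) => [n, probs[i]]));
}
```

Blending in the mean means that a single prototype that happens to sit close to the query contributes at most 70% of the score, which is the stated motivation for the split.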

3. Feature Extraction

Every query produces a 25-feature vector that feeds the policy network. These features capture similarity, entity presence, intent distribution, session history, and fragment metadata.

25-Feature Layout

Idx	Name	Type	Range	Description
0	`qSimTop1`	f32	[0,1]	Ensemble similarity (dense + BM25) to top-1 KB entry
1	`qSimTop2`	f32	[0,1]	Ensemble similarity to top-2 KB entry
2	`entityCount`	u8	[0,3]	Named entities extracted (capped)
3	`entityBoostHit`	bool	{0,1}	Top-5 ranked entry matches a detected entity
4–8	`intent*Score`	f32	[0,1]	Cosine scores vs definition, example, formal, application, comparison prototypes
9	`lastTopicSim`	f32	[0,1]	Cosine of query to last topic embedding
10	`lastTopicAge`	u8	[0,8]	Turns since last topic change (capped)
11	`kbCoverage`	f32	[0,1]	Fraction of KB entries with sim > 0.25
12	`queryLenTokens`	u8	[1,32]	Token count after stop-word removal
13	`hasComparisonCue`	bool	{0,1}	"vs", "compare", "difference" detected
14	`hasFormalCue`	bool	{0,1}	"prove", "theorem", "formal" detected
15	`hasExampleCue`	bool	{0,1}	"example", "illustrate", "case" detected
16	`botCreativity`	f32	[0,1]	Bot profile creativity ceiling
17	`domainMatch`	f32	[0,1]	Max cosine to domain prototype embeddings
18	`followUpType`	u8	[0,22]	Session follow-up type (simplify, elaborate, topic correction, etc.)
19	`wasAmbiguous`	bool	{0,1}	Previous turn flagged as ambiguous
20	`avgTruthConf`	f32	[0,1]	Average truth confidence of fragments in top results
21	`avgSourceConf`	f32	[0,1]	Average source confidence of fragments in top results
22	`minDifficulty`	u8	[0,4]	Minimum difficulty across available fragments
23	`fragDiversity`	u8	[0,5]	Distinct fragment styles available
24	`avoidWithCount`	f32	[0,1]	Fraction of top entries with compatibility constraints
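
A few of the surface-level features in the table (the capped counts and the cue flags) could be computed along these lines. This is a sketch under assumptions: the function name and arguments are hypothetical, the cue words are only the three examples listed per row, and stop-word removal is omitted.

```javascript
const clamp = (x, lo, hi) => Math.min(hi, Math.max(lo, x));

// Cue words taken from the table rows; the real lists may be longer.
const CUE_PATTERNS = {
  hasComparisonCue: /\b(?:vs|compare|difference)\b/i,
  hasFormalCue: /\b(?:prove|theorem|formal)\b/i,
  hasExampleCue: /\b(?:example|illustrate|case)\b/i,
};

function surfaceFeatures(query, entities, turnsSinceTopicChange) {
  const tokens = query.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  const features = {
    entityCount: clamp(entities.length, 0, 3),         // idx 2, capped at 3
    lastTopicAge: clamp(turnsSinceTopicChange, 0, 8),  // idx 10, capped at 8
    queryLenTokens: clamp(tokens.length, 1, 32),       // idx 12 (stop words not removed here)
  };
  for (const [name, pattern] of Object.entries(CUE_PATTERNS)) {
    features[name] = pattern.test(query) ? 1 : 0;      // idx 13 to 15
  }
  return features;
}
```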

4. Policy Network (MLP)

The policy is a multilayer perceptron trained via reinforcement learning to select the optimal response parameters given the 25-dim feature context. At construction time, weights are automatically quantized to int8 (symmetric per-layer quantization), reducing memory footprint ~4× and enabling faster integer arithmetic for inference.

Practical cold-start tiers (what users actually experience):

Heuristic + sparse/BOW immediately (usable shell before any heavy assets).
Policy MLP once small weights load (decisions become policy-driven).
Dense MiniLM once the ~22 MB transformer finishes (full embeddings + reranking active; hot-swap from progressive bootstrap).
Optional WASM/ORT acceleration only when a real policy WASM asset + cross-origin isolation are present.

Architecture

Input:     Float32Array(25) — 25 normalized features
  ↓
fc1:       Linear(25, 128) + ReLU       (3,328 params)  [int8 quantized]
  ↓
fc2:       Linear(128, 64) + ReLU       (8,256 params)  [int8 quantized]
  ↓
Heads (all share fc2 output, 64 dims):
  mode_head:        Linear(64, 5)   → softmax → [normal, off_topic, greeting, help, comparison]
  intent_head:      Linear(64, 5)   → softmax → [definition, example, formal, application, comparison]
  topic_count_head: Linear(64, 4)   → softmax → [1, 2, 3, 4]
  frag_count_head:  Linear(64, 4)   → softmax → [1, 2, 3, 4]
  creativity_head:  Linear(64, 1)   → sigmoid → [0, 1]
  tone_head:        Linear(64, 4)   → softmax → [neutral, formal, intuitive, playful]

Total:  13,079 parameters (3,328 + 8,256 + 1,495 across the six heads; trained, auto-int8-quantized at load)
Version: 0.4.0 (25-feature input, field-weighted BM25, fuzzy entity extraction, session-aware fragment diversity)
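
For orientation, the forward pass through this architecture can be written in a few lines of plain JavaScript. The sketch below works on unquantized nested-array weights in PyTorch's [out, in] layout; the weight-key names follow the head names in the diagram and the function names are hypothetical, so treat it as an illustration of the layer structure rather than the contents of policy/mlp-inference.js.

```javascript
// y[o] = b[o] + sum_i W[o][i] * x[i]
function linear(x, W, b) {
  return W.map((row, o) => row.reduce((s, w, i) => s + w * x[i], b[o]));
}
const relu = (v) => v.map((x) => Math.max(0, x));
const sigmoid = (x) => 1 / (1 + Math.exp(-x));
function softmax(v) {
  const max = Math.max(...v);
  const exps = v.map((x) => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

const SOFTMAX_HEADS = [
  "mode_head", "intent_head", "topic_count_head", "frag_count_head", "tone_head",
];

function forward(weights, features) {
  const h1 = relu(linear(features, weights["fc1.weight"], weights["fc1.bias"]));
  const h2 = relu(linear(h1, weights["fc2.weight"], weights["fc2.bias"]));
  const out = {};
  for (const head of SOFTMAX_HEADS) {
    // Every head reads the same 64-dim fc2 output.
    out[head] = softmax(linear(h2, weights[`${head}.weight`], weights[`${head}.bias`]));
  }
  const c = linear(h2, weights["creativity_head.weight"], weights["creativity_head.bias"]);
  out.creativity_head = sigmoid(c[0]);
  return out;
}
```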

Multi-Engine Inference

The policy runtime (policy/policy-runtime.js) tries the following engines, in order, for the 13k-parameter policy. The MLP engine is the real, validated, always-available path; the WASM asset is optional and effectively a stub until a real compiled policy is shipped:

MLP Engine (primary) — Pure-JS math with auto-quantized int8 weights (policy/mlp-inference.js). Weights are symmetrically quantized to Int8 at construction time, reducing memory ~4× and accelerating inference via int8 dot products. No dependencies. Same architecture as the PyTorch-trained model.
WASM Engine (secondary/optional) — Only if a real compiled policy WASM is provided. Not the main target until the asset is real (a 13k-param MLP is not the runtime bottleneck).
Heuristic Fallback — 15 parameterized decision thresholds. Used when neither MLP weights nor WASM are ready. Ensures the system is always functional.
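
Symmetric per-layer int8 quantization, as used by the MLP engine, can be sketched as follows. The function names and the exact rounding and clamping choices are assumptions; the idea shown is the standard one of storing one float scale per layer and one signed byte per weight.

```javascript
// Maps floats to int8 with a single scale so that w is approximately q * scale.
function quantizeSymmetric(weights) {
  let maxAbs = 0;
  for (const w of weights) maxAbs = Math.max(maxAbs, Math.abs(w));
  const scale = maxAbs > 0 ? maxAbs / 127 : 1; // avoid dividing by zero
  const q = new Int8Array(weights.length);
  for (let i = 0; i < weights.length; i++) {
    q[i] = Math.max(-127, Math.min(127, Math.round(weights[i] / scale)));
  }
  return { q, scale };
}

// Dot product of one quantized weight row with a float input.
// The scale is applied once at the end instead of once per weight.
function dotQuantized(q, scale, x, offset = 0) {
  let acc = 0;
  for (let i = 0; i < x.length; i++) acc += q[offset + i] * x[i];
  return acc * scale;
}
```

An Int8Array uses one byte per weight where a Float32Array uses four, which is where the ~4× memory figure comes from; the per-layer scale adds only one extra number per layer.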

5. Training Pipeline

The MLP policy network is trained offline using PyTorch, then exported as JSON weights for the browser-based JS engine.

Pipeline Stages

Step 1: Prompt Generation

Seed prompts are generated from KB entries and intent prototypes, then automatically augmented with synonym substitution, typos, informal phrasing, conversational context, and rephrasing. The target is 5,000+ prompts per bot. An optional LLM augmentation pass adds further diversity.

Step 2: Retrieval Dataset

build_retrieval_dataset() embeds all KB entries and queries using sentence-transformers/all-MiniLM-L6-v2 (real embeddings), or a TF-IDF fallback when the library is unavailable. It then computes per-sample cosine rankings, entity extractions, intent scores, and the full 25-feature vector.

Step 3: RL Training (REINFORCE)

The policy network is trained with REINFORCE using a state-dependent value baseline. Each step: forward pass → sample actions → ε-greedy exploration → compute reward → policy gradient update with gradient clipping. The reward function has 6 dynamic components: intent match, topic precision, fragment coherence, length penalty, creativity alignment, and guardrail compliance. Follow-up topic continuity is heavily rewarded (0.6 weight) to discourage drift, and explicit topic corrections ("I asked about X") generate dedicated reward signals.
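
For reference, the textbook form of the REINFORCE update with a learned baseline is given below. The notation is introduced here and is not taken from the training scripts: s is the 25-feature state, R the scalar reward, V_φ the value baseline, and H the set of heads whose actions are sampled. Treating the heads as independent given s, so that the joint log-probability is a sum over heads, is an assumption.

```latex
\nabla_\theta J(\theta) \;\approx\; \bigl(R - V_\phi(s)\bigr)\,
  \nabla_\theta \sum_{h \in \mathcal{H}} \log \pi_\theta^{(h)}(a_h \mid s),
\qquad
\mathcal{L}_V(\phi) = \bigl(R - V_\phi(s)\bigr)^2
```

Subtracting V_φ(s) leaves the expected gradient unchanged while reducing its variance, which is why a state-dependent baseline is commonly paired with REINFORCE.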

Step 4: Weight Export

Trained PyTorch parameters are remapped to JS-compatible keys (fc1.weight, mode_head.bias, etc.) and exported to assets/models/policy/policy.weights.json. The JS MLPPolicy class validates all 16 weight tensor shapes at construction time (fail-fast), then automatically quantizes all weight matrices to symmetric int8 (per-layer scale factor) for ~4× memory reduction and faster integer dot-product inference.
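
The fail-fast shape check might look like the sketch below. The 16 expected shapes follow from the architecture (two tensors each for fc1, fc2, and the six heads); the key names beyond fc1.weight and mode_head.bias, the nested-array JSON layout, and the function names are assumptions.

```javascript
// [out, in] for weight matrices, [out] for bias vectors.
const EXPECTED_SHAPES = {
  "fc1.weight": [128, 25], "fc1.bias": [128],
  "fc2.weight": [64, 128], "fc2.bias": [64],
  "mode_head.weight": [5, 64], "mode_head.bias": [5],
  "intent_head.weight": [5, 64], "intent_head.bias": [5],
  "topic_count_head.weight": [4, 64], "topic_count_head.bias": [4],
  "frag_count_head.weight": [4, 64], "frag_count_head.bias": [4],
  "creativity_head.weight": [1, 64], "creativity_head.bias": [1],
  "tone_head.weight": [4, 64], "tone_head.bias": [4],
};

// Shape of a nested array, read along the first element of each level.
function shapeOf(tensor) {
  const shape = [];
  for (let t = tensor; Array.isArray(t); t = t[0]) shape.push(t.length);
  return shape;
}

// Throws on the first missing or mis-shaped tensor; returns true otherwise.
function validateWeights(weights) {
  for (const [key, expected] of Object.entries(EXPECTED_SHAPES)) {
    if (!(key in weights)) throw new Error(`missing tensor: ${key}`);
    const actual = shapeOf(weights[key]);
    if (actual.join("x") !== expected.join("x")) {
      throw new Error(`${key}: expected [${expected}], got [${actual}]`);
    }
  }
  return true;
}
```

This version only inspects the first row of each matrix, so a ragged matrix would slip through; a stricter check would verify every row length.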

Step 5: ONNX & WASM

export_onnx() freezes the PyTorch graph and exports to policy.onnx (opset 17, constant folding, validated). compile_wasm() compiles to WASM via available toolchains (wonnx-cli or onnx2json), with wasm-opt -O3 optimization. When no compilation tools are available, the JS MLP engine serves as the primary runtime.

6. Fragment-Based Response Composition

Each knowledge-base entry contains categorized fragments (def, int, ex, form, app) with metadata fields: truth_confidence, source_confidence, difficulty, style, avoid_with.

The policy produces an AnswerPlan specifying:

Mode — normal, comparison, greeting, help, off_topic
Intent — definition, example, formal, application, comparison
Topics — which KB entries to include
Fragment Plan — which categories and indices per topic
Template — opener/closer indices, comparison opener key, connector keys
Creativity — scalar [0,1] controlling response variation
Tone — neutral, formal, intuitive, playful

composeV2() in core/nlp.js reads the AnswerPlan and assembles the final text: it selects fragments, joins them with linguistic connectors ("For instance,", "More formally,", etc.), prefixes an opener, and appends a closer. Every index taken from the plan is applied modulo the length of the list it selects from, so an out-of-range index cannot fail.
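
A simplified sketch of this assembly step is shown below. It is not composeV2() itself: the plan field names (`openerIdx`, `closerIdx`, `category`, `index`), the `templates` and `kb` shapes, and the function names are hypothetical, and capitalization after a connector is not adjusted.

```javascript
// Modulo-safe lookup: any integer index, including a negative one, maps into the list.
const pick = (list, index) => list[((index % list.length) + list.length) % list.length];

function composeSketch(plan, kb, templates) {
  const parts = [];
  if (templates.openers.length) parts.push(pick(templates.openers, plan.template.openerIdx));
  plan.topics.forEach((topicId, t) => {
    for (const { category, index } of plan.fragmentPlan[t]) {
      const fragments = kb[topicId]?.[category] ?? [];
      if (!fragments.length) continue; // nothing to say for this category
      const connector = templates.connectors[category];
      const text = pick(fragments, index);
      parts.push(connector ? `${connector} ${text}` : text);
    }
  });
  if (templates.closers.length) parts.push(pick(templates.closers, plan.template.closerIdx));
  return parts.join(" ");
}
```

Wrapping indices instead of rejecting them means a plan produced for one KB or template set still yields some valid fragment when the lists differ in length.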

Progressive Streaming Rendering

Composed responses are rendered progressively via pushMessageStream() in core/ui.js. Rather than inserting the full HTML at once, the response is revealed in ~40-character chunks using requestAnimationFrame, producing a natural typing effect. KaTeX math rendering is deferred until the stream completes, avoiding layout thrash during animation.
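
The chunked reveal could be sketched as follows. This is not pushMessageStream(): the function name, the callback-based interface, and the injectable `schedule` option are assumptions, and the sketch operates on plain text, whereas rendered HTML would need care not to split a tag mid-chunk.

```javascript
// Reveals `text` in chunks, calling onChunk with the visible prefix each frame.
function streamText(text, onChunk, { chunkSize = 40, schedule } = {}) {
  // Use requestAnimationFrame in browsers; fall back to a timer elsewhere.
  const next = schedule
    ?? globalThis.requestAnimationFrame?.bind(globalThis)
    ?? ((cb) => setTimeout(cb, 16));
  return new Promise((resolve) => {
    let shown = 0;
    const step = () => {
      shown = Math.min(text.length, shown + chunkSize);
      onChunk(text.slice(0, shown));
      if (shown < text.length) next(step);
      else resolve(); // stream finished; deferred work (e.g. math rendering) can run now
    };
    next(step);
  });
}
```

A caller could await the returned promise and only then trigger KaTeX rendering, which matches the deferral described above.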

Session Memory

The SessionMemory class (core/session.js) tracks up to 30 turns with importance-based eviction (the 5 most recent turns are always protected). Responses older than 5 turns are compressed to 120-character summaries. An EMA summary vector (exponential moving average, α=0.75) of query embeddings provides dense multi-turn context to the policy without additional feature extraction. Entity mentions decay with a half-life of 5 turns, and fragment diversity penalties prevent repetitive responses across long conversations.
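
The EMA update and the entity decay can be illustrated with a short sketch. The function names are hypothetical, and one detail is an assumption: α is taken here as the weight on the previous average (so each new query contributes 25%); the text does not say which side α applies to.

```javascript
const ALPHA = 0.75; // value quoted in the text

// ema_new = alpha * ema_old + (1 - alpha) * queryVec; the first query seeds the average.
function updateEma(ema, queryVec, alpha = ALPHA) {
  if (!ema) return Float32Array.from(queryVec);
  const out = new Float32Array(ema.length);
  for (let i = 0; i < ema.length; i++) {
    out[i] = alpha * ema[i] + (1 - alpha) * queryVec[i];
  }
  return out;
}

// Weight of an entity mention that is `ageInTurns` old, halving every `halfLife` turns.
function entityWeight(ageInTurns, halfLife = 5) {
  return Math.pow(0.5, ageInTurns / halfLife);
}
```

With a half-life of 5 turns, a mention from 5 turns ago counts half as much as a fresh one and a mention from 10 turns ago a quarter as much.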

Comparison Mode

When mode === 'comparison', the policy selects a comparisonOpenerKey (both, contrast, or similarity) from the template. The renderer uses patterned openers like "Both A and B are important concepts here." and distributes categories across multiple topics.

7. Action Schema & Validation

Every AnswerPlan passes through validatePlan() (policy/action-schema.js) which enforces:

Type checking on all 10 top-level fields
Enum membership (mode, intent, tone)
Range validation (creativity ∈ [0,1], topics ≤ maxTopics)
Cross-field consistency (fragmentPlan length matches topics; comparison with <2 topics falls back to normal)
Sanitization with schema defaults for missing/invalid fields
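
A reduced version of these checks might look like the sketch below. It covers only a subset of the plan fields; the default values, the `maxTopics` default of 4, and the function name are assumptions and are not taken from policy/action-schema.js.

```javascript
const MODES = ["normal", "comparison", "greeting", "help", "off_topic"];
const INTENTS = ["definition", "example", "formal", "application", "comparison"];
const TONES = ["neutral", "formal", "intuitive", "playful"];
// Hypothetical schema defaults.
const DEFAULTS = { mode: "normal", intent: "definition", tone: "neutral", creativity: 0.5 };

function validatePlanSketch(plan, maxTopics = 4) {
  const p = { ...plan };
  // Enum membership, falling back to defaults.
  if (!MODES.includes(p.mode)) p.mode = DEFAULTS.mode;
  if (!INTENTS.includes(p.intent)) p.intent = DEFAULTS.intent;
  if (!TONES.includes(p.tone)) p.tone = DEFAULTS.tone;
  // Range validation.
  p.creativity = Number.isFinite(p.creativity)
    ? Math.min(1, Math.max(0, p.creativity))
    : DEFAULTS.creativity;
  p.topics = Array.isArray(p.topics) ? p.topics.slice(0, maxTopics) : [];
  // Cross-field consistency: one fragment list per topic.
  p.fragmentPlan = Array.isArray(p.fragmentPlan) ? p.fragmentPlan.slice(0, p.topics.length) : [];
  while (p.fragmentPlan.length < p.topics.length) p.fragmentPlan.push([]);
  // A comparison needs at least two topics.
  if (p.mode === "comparison" && p.topics.length < 2) p.mode = "normal";
  return p;
}
```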

8. Feature Serialization

For the WASM boundary, features are packed into a 107-byte buffer:

packFeatures(features) → {
  float32: Float32Array(25),     // offset 0,  100 bytes
  uint8:   Uint8Array(7),       // offset 100, 7 bytes
  buffer:  ArrayBuffer(107)      // total
}

Uint8Array layout:
  [0] = entityCount         (u8, 0-3)
  [1] = packed booleans     (bits: entityBoostHit|hasComparisonCue|hasFormalCue|hasExampleCue|wasAmbiguous)
  [2] = lastTopicAge        (u8, 0-8)
  [3] = queryLenTokens      (u8, 1-32)
  [4] = followUpType        (u8, 0-22)
  [5] = minDifficulty       (u8, 0-4)
  [6] = fragDiversity       (u8, 0-5)
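
One way to produce this layout is sketched below. The two-argument signature (a 25-element vector plus an object of named integer and boolean features), the function name, and the bit order of the packed booleans (first name in the list is the least significant bit) are assumptions.

```javascript
const BOOL_BITS = [
  "entityBoostHit", "hasComparisonCue", "hasFormalCue", "hasExampleCue", "wasAmbiguous",
];

function packFeaturesSketch(vector, named) {
  if (vector.length !== 25) throw new RangeError("expected 25 features");
  const buffer = new ArrayBuffer(107);
  const float32 = new Float32Array(buffer, 0, 25); // bytes 0..99
  const uint8 = new Uint8Array(buffer, 100, 7);    // bytes 100..106
  float32.set(vector);
  uint8[0] = named.entityCount;
  // Pack the five flags into one byte, one bit each.
  uint8[1] = BOOL_BITS.reduce((byte, key, bit) => byte | (named[key] ? 1 << bit : 0), 0);
  uint8[2] = named.lastTopicAge;
  uint8[3] = named.queryLenTokens;
  uint8[4] = named.followUpType;
  uint8[5] = named.minDifficulty;
  uint8[6] = named.fragDiversity;
  return { float32, uint8, buffer };
}
```

Both typed arrays are views over the same ArrayBuffer, so the whole packet can be handed across the WASM boundary as a single 107-byte block without copying.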

9. Heuristic Fallback

When the MLP policy engine is unavailable (e.g., during cold start or weight load failure), planAnswerHeuristic() generates the same AnswerPlan structure using 15 parameterized decision thresholds covering greeting detection, off-topic handling, comparison fallback, entity boost, and creativity defaults. This ensures the system is always functional even without trained weights.

10. Open Source

The full codebase is available at github.com/yunusemrejr/relu-chat under the MIT license. This includes:

core/ — NLP engine, chatbot engine, session memory, BM25 scorer, signal layer
policy/ — Feature extractor, MLP inference, action schema, policy runtime
dev/scripts/ — PyTorch training, weight export, prompt augmentation