Most modern chatbots are built on large language models (LLMs). ReLU.chat takes a fundamentally different approach: a small multilayer perceptron (MLP) policy network, trained with reinforcement learning, decides how to compose each response from curated knowledge fragments.
Why Not Just Use an LLM?
Large language models are powerful but come with significant costs:
- Latency: Cloud LLM inference can take 1-5 seconds per response
- Privacy: Your queries are sent to external servers
- Cost: API calls cost money per token
- Hallucination: LLMs can generate plausible but incorrect information
The Policy Network Architecture
Our policy network is a 3-layer MLP:
Input: 25 features (signal layer output)
Layer 1: 25 → 128 (ReLU activation)
Layer 2: 128 → 64 (ReLU activation)
Layer 3: 64 → 6 (one output per action head, sigmoid or tanh activation)
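To make the layer sizes concrete, here is a minimal forward-pass sketch in plain Python. The weights are random placeholders rather than trained values, and the helper names (`init_layer`, `forward`) are hypothetical. The architecture lists sigmoid/tanh for the output layer without saying which head uses which, so this sketch assumes sigmoid on all six outputs.

```python
import math
import random

LAYER_SIZES = [25, 128, 64, 6]  # sizes taken from the architecture above


def init_layer(n_in, n_out, rng):
    """Return (weights, biases); random placeholders standing in for trained values."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def linear(x, weights, biases):
    """Compute W @ x + b for one layer."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(v):
    return [max(0.0, a) for a in v]


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def forward(features, layers):
    """Run the 25 -> 128 -> 64 -> 6 forward pass."""
    h = features
    for weights, biases in layers[:-1]:
        h = relu(linear(h, weights, biases))
    weights, biases = layers[-1]
    # Assumption: sigmoid on every head; the real model may use tanh on some.
    return [sigmoid(z) for z in linear(h, weights, biases)]


rng = random.Random(0)
layers = [init_layer(a, b, rng) for a, b in zip(LAYER_SIZES, LAYER_SIZES[1:])]
outputs = forward([0.5] * 25, layers)
param_count = sum(len(w) * len(w[0]) + len(b) for w, b in layers)
```

With these layer sizes, weights and biases add up to 11,974 parameters, which is why the model is small enough to run in a browser.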
Each action head controls a different aspect of response composition:
- Fragment selection: Which knowledge fragments to include
- Response length: How detailed the answer should be
- Connector style: How fragments are linked together
- Confidence: How certain the system is about its answer
- Creativity balance: Factual vs. elaborative response
- Follow-up: Whether to suggest related topics
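One way to picture how six head outputs become composition settings is sketched below. The bucket boundaries, the labels ("short", "transitional", and so on), and the `CompositionPlan` type are all assumptions made for illustration; the actual mapping in ReLU.chat may differ.

```python
from dataclasses import dataclass


@dataclass
class CompositionPlan:
    max_fragments: int       # fragment selection
    length: str              # response length
    connector: str           # connector style
    confidence: float        # confidence
    creativity: float        # factual (0) vs. elaborative (1)
    suggest_follow_up: bool  # follow-up


def bucket(value, labels):
    """Map a value in [0, 1] to one of the labels by equal-width buckets."""
    index = min(int(value * len(labels)), len(labels) - 1)
    return labels[index]


def decode_actions(outputs):
    """Turn six head outputs in [0, 1] into a plan (hypothetical mapping)."""
    if len(outputs) != 6:
        raise ValueError("expected one value per action head (6)")
    frag, length, connector, confidence, creativity, follow_up = outputs
    return CompositionPlan(
        max_fragments=bucket(frag, [1, 2, 3]),
        length=bucket(length, ["short", "medium", "detailed"]),
        connector=bucket(connector, ["plain", "transitional", "enumerated"]),
        confidence=confidence,
        creativity=creativity,
        suggest_follow_up=follow_up >= 0.5,
    )


plan = decode_actions([0.9, 0.1, 0.5, 0.8, 0.2, 0.7])
```

Here `plan` asks for up to three fragments, a short answer, transitional connectors, and a follow-up suggestion.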
The Training Pipeline
1. Data Collection
We collect interaction data: user queries, the fragments selected for each response, and user engagement signals (did the user continue the conversation, or rephrase the question?).
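A logged interaction could be stored as one JSON line per response. The sketch below is only an assumption about what such a record might look like; the `Interaction` type, its field names, and the fragment IDs are hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class Interaction:
    query: str
    fragment_ids: List[str]  # fragments chosen for the response
    continued: bool          # user sent another message afterwards
    rephrased: bool          # user asked the same thing again in other words


record = Interaction(
    query="what is relu?",
    fragment_ids=["relu-def", "relu-vs-sigmoid"],
    continued=True,
    rephrased=False,
)
line = json.dumps(asdict(record))  # one line of a JSONL training log
```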
2. Reward Function
The reward function combines multiple signals:
- Relevance: Did the response address the query?
- Completeness: Did it cover the key aspects?
- Conciseness: Was it appropriately brief?
- Engagement: Did the user continue exploring?
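A simple way to combine these signals is a weighted sum. The sketch below assumes each signal has already been scored in [0, 1]; the weights are made-up values for illustration, not the ones used in ReLU.chat.

```python
# Hypothetical weights; they sum to 1 so the reward stays in [0, 1].
WEIGHTS = {
    "relevance": 0.4,
    "completeness": 0.3,
    "conciseness": 0.2,
    "engagement": 0.1,
}


def combined_reward(signals, weights=WEIGHTS):
    """Weighted sum of per-signal scores, each expected in [0, 1]."""
    missing = set(weights) - set(signals)
    if missing:
        raise KeyError(f"missing signals: {sorted(missing)}")
    return sum(weights[name] * signals[name] for name in weights)


reward = combined_reward(
    {"relevance": 1.0, "completeness": 0.5, "conciseness": 0.5, "engagement": 0.0}
)
```

In this example a fully relevant but only half-complete answer with no follow-up engagement scores 0.65.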
3. Training
We use Proximal Policy Optimization (PPO), a standard policy-gradient algorithm that keeps training stable by limiting how far each update can move the policy. The training loop:
- Sample a batch of (state, action, reward) tuples
- Compute advantage estimates
- Update policy using clipped surrogate objective
- Repeat for thousands of episodes
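The clipped surrogate objective at the heart of these steps can be sketched in a few lines of plain Python. This is a generic illustration of PPO's objective, not the project's training code. It assumes the log-probability of each action under the old policy was stored alongside the (state, action, reward) tuple, and it uses the simplest advantage estimate, reward minus a baseline value.

```python
import math


def advantages_from_baseline(rewards, baseline_values):
    """Simplest advantage estimate: reward minus a baseline prediction."""
    return [r - v for r, v in zip(rewards, baseline_values)]


def clipped_surrogate(new_logp, old_logp, advantages, clip_eps=0.2):
    """Mean PPO clipped objective over a batch (a quantity to maximize)."""
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(lp_new - lp_old)  # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Taking the minimum removes any incentive to move the ratio
        # outside [1 - eps, 1 + eps].
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


advantages = advantages_from_baseline([1.5, 0.0], [0.5, 1.0])  # [1.0, -1.0]
objective = clipped_surrogate(
    new_logp=[math.log(1.5), math.log(0.5)],
    old_logp=[0.0, 0.0],
    advantages=advantages,
)
```

In the example, the first sample's ratio of 1.5 is clipped to 1.2, and the second sample's ratio of 0.5 is clipped to 0.8, so the objective is (1.2 - 0.8) / 2 = 0.2.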
4. Export
The trained PyTorch model is exported to ONNX format, quantized to INT8, and loaded in the browser via ONNX Runtime WebAssembly.
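To show what INT8 quantization does to the weights, here is a small sketch of symmetric per-tensor quantization in plain Python. It illustrates the general idea only; the scheme that the ONNX tooling actually applies to the model may differ, and the function names are hypothetical.

```python
def quantize_int8(values):
    """Symmetric quantization: floats -> integers in [-127, 127] plus one scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integers and the shared scale."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(r - w) for r, w in zip(restored, weights))
```

Each weight is stored in one byte instead of four, and the rounding error per weight is at most half the scale.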
The Heuristic Fallback
Not every environment can load the ONNX model, for example on slow connections or in old browsers. In those cases a heuristic fallback uses 15 parameterized thresholds to make similar decisions, with rules such as:
- Greeting detection (match "hello", "hi", etc.)
- Off-topic handling (low similarity → redirect)
- Comparison patterns (match "vs", "difference")
- Entity boost (named entity → prioritize related fragments)
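The rules above could look roughly like the following sketch. The threshold values, the regular expressions, and the function names are assumptions for illustration, and only two of the 15 thresholds are shown.

```python
import re

# Hypothetical values for two of the parameterized thresholds.
THRESHOLDS = {"off_topic_similarity": 0.3, "entity_boost": 1.5}

GREETING = re.compile(r"^\s*(hello|hi|hey)\b", re.IGNORECASE)
COMPARISON = re.compile(r"\b(vs\.?|versus|difference)\b", re.IGNORECASE)


def classify(query, best_similarity):
    """Pick a handling strategy from the query text and its best fragment similarity."""
    if GREETING.search(query):
        return "greeting"
    if best_similarity < THRESHOLDS["off_topic_similarity"]:
        return "off_topic"  # low similarity: redirect the user
    if COMPARISON.search(query):
        return "comparison"
    return "default"


def apply_entity_boost(scores, entity_fragment_ids):
    """Multiply the score of fragments related to a named entity in the query."""
    boost = THRESHOLDS["entity_boost"]
    return {
        fid: score * boost if fid in entity_fragment_ids else score
        for fid, score in scores.items()
    }


boosted = apply_entity_boost({"relu-def": 0.4, "ppo-intro": 0.2}, {"ppo-intro"})
```

Rule order matters in this sketch: a message that starts with a greeting is treated as a greeting even if it also contains a comparison keyword.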
Results
The RL-trained policy outperforms the heuristic baseline on:
- Response relevance: +12%
- User engagement: +18%
- Response conciseness: +23%
Open Source
The full training pipeline is available in the ReLU.chat repository. Train your own policy network for any domain.