Training Chatbots with Reinforcement Learning: The ReLU.chat Approach

Most modern chatbots use large language models. ReLU.chat takes a fundamentally different approach: a small MLP policy network trained via reinforcement learning decides how to compose responses from curated knowledge fragments.

Why Not Just Use an LLM?

Large language models are powerful but come with significant costs:

Latency: Cloud LLM inference often takes 1-5 seconds per response
Privacy: Your queries are sent to external servers
Cost: API calls cost money per token
Hallucination: LLMs can generate plausible but incorrect information

For domain-specific chatbots where accuracy matters more than open-ended generation, a different approach makes sense.

The Policy Network Architecture

Our policy network is a 3-layer MLP:

Input: 25 features (signal layer output)
Layer 1: 25 → 128 (ReLU activation)
Layer 2: 128 → 64 (ReLU activation)
Layer 3: 64 → 6 (action heads, sigmoid/tanh)
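
To make the layer shapes concrete, here is a minimal pure-Python sketch of the forward pass. It is not the ReLU.chat implementation: the random initialization is a placeholder for trained weights, and applying sigmoid to all six heads is an assumption, since the text does not say which heads use sigmoid and which use tanh.

```python
import math
import random

LAYER_SIZES = [25, 128, 64, 6]  # input, hidden 1, hidden 2, action heads


def init_layer(n_in, n_out, rng):
    """Return (weights, biases); random values stand in for trained weights."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def linear(x, weights, biases):
    """Dense layer: one dot product plus bias per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def forward(features, layers):
    """25 features -> 128 -> 64 -> 6 action values in (0, 1)."""
    h = relu(linear(features, *layers[0]))
    h = relu(linear(h, *layers[1]))
    logits = linear(h, *layers[2])
    # Assumption: sigmoid on every head (the source lists sigmoid/tanh).
    return [sigmoid(v) for v in logits]


def count_parameters(sizes):
    """Weights plus biases for each consecutive pair of layer sizes."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


rng = random.Random(0)
layers = [init_layer(a, b, rng) for a, b in zip(LAYER_SIZES, LAYER_SIZES[1:])]
actions = forward([0.5] * 25, layers)
```

With these layer sizes, `count_parameters(LAYER_SIZES)` gives 11,974 weights and biases in total.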

Each action head controls a different aspect of response composition:

Fragment selection: Which knowledge fragments to include
Response length: How detailed the answer should be
Connector style: How fragments are linked together
Confidence: How certain the system is about its answer
Creativity balance: Factual vs. elaborative response
Follow-up: Whether to suggest related topics
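
One way to read the six heads is as a small "composition plan" decoded from the network output. The sketch below is illustrative only: the `CompositionPlan` type, the value ranges, and the connector style names are all hypothetical, and it assumes each head outputs a value in [0, 1]. In particular, it interprets the fragment selection head as "how many top-ranked fragments to keep", which the text does not specify.

```python
from typing import NamedTuple, Sequence


class CompositionPlan(NamedTuple):
    max_fragments: int       # fragment selection head
    target_sentences: int    # response length head
    connector_style: str     # connector style head
    confidence: float        # confidence head
    creativity: float        # creativity balance head
    suggest_follow_up: bool  # follow-up head


# Hypothetical style names, not taken from the ReLU.chat code base.
CONNECTOR_STYLES = ("plain", "transitional", "enumerated")


def decode_actions(outputs: Sequence[float]) -> CompositionPlan:
    """Map six head outputs in [0, 1] to concrete composition decisions."""
    if len(outputs) != 6:
        raise ValueError("expected exactly 6 action head outputs")
    frag, length, connector, confidence, creativity, follow_up = outputs
    # Bucket the connector value into one of the available styles.
    style_index = min(int(connector * len(CONNECTOR_STYLES)),
                      len(CONNECTOR_STYLES) - 1)
    return CompositionPlan(
        max_fragments=1 + round(frag * 3),        # assumed range: 1 to 4
        target_sentences=1 + round(length * 4),   # assumed range: 1 to 5
        connector_style=CONNECTOR_STYLES[style_index],
        confidence=confidence,
        creativity=creativity,
        suggest_follow_up=follow_up >= 0.5,
    )


plan = decode_actions([1.0, 0.0, 0.99, 0.8, 0.2, 0.6])
```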

The Training Pipeline

1. Data Collection

We collect interaction data: user queries, selected fragments, user engagement signals (did they continue the conversation? did they rephrase?).
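
As a rough illustration, a logged interaction could be stored as a small record like the one below. The field names and the scoring rule are hypothetical; the text only says that queries, selected fragments, and engagement signals are collected.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class InteractionRecord:
    query: str
    fragment_ids: List[str] = field(default_factory=list)
    continued: bool = False   # did the user continue the conversation?
    rephrased: bool = False   # did the user rephrase the same question?


def engagement_signal(record: InteractionRecord) -> float:
    """Hypothetical scoring: continuing counts for, rephrasing counts against."""
    score = 0.0
    if record.continued:
        score += 1.0
    if record.rephrased:
        score -= 1.0
    return score
```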

2. Reward Function

The reward function combines multiple signals:

Relevance: Did the response address the query?
Completeness: Did it cover the key aspects?
Conciseness: Was it appropriately brief?
Engagement: Did the user continue exploring?
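
A simple way to combine these signals is a weighted average. The sketch below assumes each signal has already been scored in [0, 1]; the weights are placeholders chosen for illustration, not the values used by ReLU.chat.

```python
from dataclasses import dataclass


@dataclass
class RewardSignals:
    relevance: float
    completeness: float
    conciseness: float
    engagement: float


# Hypothetical weights; a real system would tune these.
REWARD_WEIGHTS = {
    "relevance": 0.4,
    "completeness": 0.3,
    "conciseness": 0.15,
    "engagement": 0.15,
}


def combined_reward(signals: RewardSignals, weights=REWARD_WEIGHTS) -> float:
    """Weighted average of the four signals, normalized by the weight sum."""
    total_weight = sum(weights.values())
    weighted = sum(w * getattr(signals, name) for name, w in weights.items())
    return weighted / total_weight
```

Normalizing by the weight sum keeps the reward in [0, 1] even if the weights are later changed and no longer add up to one.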

3. Training

We use Proximal Policy Optimization (PPO), a standard policy-gradient algorithm whose clipped objective keeps each update close to the previous policy, which makes training stable. The training loop:

Sample a batch of (state, action, reward) tuples
Compute advantage estimates
Update policy using clipped surrogate objective
Repeat for thousands of episodes
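
The core of the loop above is the clipped surrogate objective. Here is a minimal sketch in plain Python. It assumes the log-probability of each action under the data-collecting policy is stored alongside each tuple, and it uses a reward-minus-mean baseline for advantages as a stand-in, since the text does not say which advantage estimator is used. The `clip_eps=0.2` default is a common choice, not a value from the source.

```python
import math


def simple_advantages(rewards):
    """Reward minus the batch mean: a basic baseline, not a full estimator."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def clipped_surrogate(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Mean PPO clipped objective over a batch (a quantity to maximize)."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        # Probability ratio between the updated and the old policy.
        ratio = math.exp(new_lp - old_lp)
        clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Taking the minimum removes any incentive to push the ratio
        # outside the [1 - eps, 1 + eps] band.
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(advantages)
```

In a real training run this objective would be computed on tensors and maximized with a gradient-based optimizer; the function here only shows the arithmetic for a single batch.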

4. Export

The trained PyTorch model is exported to ONNX format, quantized to INT8, and loaded in the browser via ONNX Runtime WebAssembly.
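
To show what INT8 quantization does to the weights, here is a small pure-Python sketch of symmetric per-tensor quantization. It illustrates the idea only: the actual export would go through ONNX tooling, whose exact quantization scheme (per-tensor or per-channel, symmetric or with a zero point) is not described here.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: floats -> int8 values plus a scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range used here.
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from int8 values."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 0.25, 0.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
```

Each weight is stored in one byte instead of four, at the cost of a rounding error of at most half a quantization step per weight.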

The Heuristic Fallback

Not every environment can load the ONNX model (for example, on slow connections or in old browsers). In those cases a heuristic fallback uses 15 parameterized thresholds to make similar decisions. Its rules include:

Greeting detection (match "hello", "hi", etc.)
Off-topic handling (low similarity → redirect)
Comparison patterns (match "vs", "difference")
Entity boost (named entity → prioritize related fragments)
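
A minimal sketch of how rules like these might be combined is shown below. The function name, word lists, threshold value, and boost factor are all hypothetical, and only four of the rules are covered. It assumes a similarity score for the best-matching fragment has already been computed elsewhere.

```python
import re

# Hypothetical parameters; the real fallback has 15 tunable thresholds.
GREETING_WORDS = {"hello", "hi", "hey"}
COMPARISON_WORDS = {"vs", "versus", "difference"}
OFF_TOPIC_SIMILARITY = 0.2   # below this, treat the query as off-topic
ENTITY_BOOST = 1.5           # score multiplier for entity-related fragments


def tokenize(query):
    """Lowercase word tokens, so "this" never matches the greeting "hi"."""
    return re.findall(r"[a-z0-9']+", query.lower())


def heuristic_decision(query, best_similarity, has_named_entity=False):
    """Pick a response mode and fragment boost without the policy network."""
    tokens = set(tokenize(query))
    # Short messages containing a greeting word are treated as greetings.
    if tokens & GREETING_WORDS and len(tokens) <= 3:
        return {"mode": "greeting", "fragment_boost": 1.0}
    # Low similarity to every fragment: redirect instead of guessing.
    if best_similarity < OFF_TOPIC_SIMILARITY:
        return {"mode": "redirect", "fragment_boost": 1.0}
    mode = "comparison" if tokens & COMPARISON_WORDS else "answer"
    boost = ENTITY_BOOST if has_named_entity else 1.0
    return {"mode": mode, "fragment_boost": boost}
```

For example, `heuristic_decision("ReLU vs sigmoid", 0.8, has_named_entity=True)` selects the comparison mode with the entity boost applied.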

This ensures the chatbot works even without trained weights.

Results

The RL-trained policy outperforms the heuristic baseline on:

Response relevance: +12%
User engagement: +18%
Response conciseness: +23%

The model is tiny (~50KB quantized) and runs inference in under 5ms, fast enough for real-time conversation.

Open Source

The full training pipeline is available in the ReLU.chat repository. Train your own policy network for any domain.