Most modern chatbots use large language models. ReLU.chat takes a fundamentally different approach: a small MLP policy network trained via reinforcement learning decides how to compose responses from curated knowledge fragments.

Why Not Just Use an LLM?

Large language models are powerful but come with significant costs.

For domain-specific chatbots where accuracy matters more than open-ended generation, a different approach makes sense.

The Policy Network Architecture

Our policy network is a 3-layer MLP:

Input: 25 features (signal layer output)
Layer 1: 25 → 128 (ReLU activation)
Layer 2: 128 → 64 (ReLU activation)
Layer 3: 64 → 6 (action heads, sigmoid/tanh)
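To make the shapes concrete, here is a minimal forward-pass sketch of the listing above in plain Python. The layer sizes come from the listing; the random initialization, the helper names, and the split of heads between sigmoid and tanh (first three sigmoid, last three tanh) are assumptions for illustration, since the text only says "sigmoid/tanh".

```python
import math
import random

LAYER_SIZES = [25, 128, 64, 6]  # taken from the architecture listing


def init_layer(n_in, n_out, rng):
    """Return (weights, biases) with small uniform random weights (illustrative init)."""
    bound = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-bound, bound) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def linear(x, weights, biases):
    """Compute W @ x + b for a single input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(v):
    return [max(0.0, a) for a in v]


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def forward(x, layers, n_sigmoid=3):
    """Run the 25 -> 128 -> 64 -> 6 network on one feature vector.

    ASSUMPTION: the first `n_sigmoid` heads use sigmoid, the remaining heads use tanh.
    """
    h = relu(linear(x, *layers[0]))
    h = relu(linear(h, *layers[1]))
    logits = linear(h, *layers[2])
    return [sigmoid(a) if i < n_sigmoid else math.tanh(a) for i, a in enumerate(logits)]


def count_parameters(sizes):
    """Weights plus biases of a fully connected stack."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


rng = random.Random(0)
layers = [init_layer(a, b, rng) for a, b in zip(LAYER_SIZES, LAYER_SIZES[1:])]
actions = forward([0.5] * 25, layers)
print(len(actions), count_parameters(LAYER_SIZES))  # 6 11974
```

With these sizes the network has 11,974 weights and biases in total, which is why it stays small enough to ship to a browser.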

Each action head controls a different aspect of response composition.

The Training Pipeline

1. Data Collection

We collect interaction data: user queries, selected fragments, user engagement signals (did they continue the conversation? did they rephrase?).
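As a rough illustration, one logged interaction could be represented like this. The class name, field names, and example values are hypothetical; only the kinds of data (query, selected fragments, engagement signals) come from the description above.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class Interaction:
    """One logged turn. All field names are hypothetical."""
    query: str
    fragment_ids: List[str]
    continued: bool  # did the user continue the conversation?
    rephrased: bool  # did the user rephrase the question?


# Made-up example record, for illustration only.
record = Interaction(
    query="How do I reset my password?",
    fragment_ids=["auth-reset-01", "auth-reset-02"],
    continued=True,
    rephrased=False,
)

# One JSON object per line is easy to append to a log and to read back for training.
line = json.dumps(asdict(record))
```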

2. Reward Function

The reward function combines multiple signals into a single scalar.
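A minimal sketch of how such signals might be combined, assuming a simple weighted sum. The two signals are the engagement examples mentioned under Data Collection; the function name and the weights are made up for illustration and are not the values used in ReLU.chat.

```python
def compute_reward(continued, rephrased, w_continue=1.0, w_rephrase=0.5):
    """Combine engagement signals into one scalar reward.

    ASSUMPTION: continuing the conversation is treated as a positive signal and
    rephrasing as a negative one; both weights are hypothetical.
    """
    reward = 0.0
    if continued:
        reward += w_continue
    if rephrased:
        reward -= w_rephrase
    return reward


print(compute_reward(continued=True, rephrased=False))  # 1.0
print(compute_reward(continued=True, rephrased=True))   # 0.5
```

Keeping the weights as parameters makes it easy to experiment with how strongly each signal should shape the policy.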

3. Training

We use Proximal Policy Optimization (PPO), a standard policy-gradient algorithm whose clipped updates keep training stable. The training loop:

  1. Sample a batch of (state, action, reward) tuples
  2. Compute advantage estimates
  3. Update policy using clipped surrogate objective
  4. Repeat for thousands of episodes
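Steps 2 and 3 can be sketched in plain Python as follows. This is a simplified illustration, not the repository's training code: it assumes single-step episodes, so the advantage is just the batch-normalized reward with no value network, and it takes log-probabilities as plain lists. The function names and the `clip_eps=0.2` default are assumptions.

```python
import math


def normalized_advantages(rewards):
    """Advantage estimate: reward minus batch mean, divided by batch std.

    ASSUMPTION: single-step episodes without a learned value baseline.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(var) + 1e-8  # avoid division by zero
    return [(r - mean) / std for r in rewards]


def clipped_surrogate(new_logp, old_logp, advantages, clip_eps=0.2):
    """PPO clipped surrogate objective (to be maximized), averaged over the batch."""
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(lp_new - lp_old)  # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Taking the minimum removes any incentive to push the ratio outside the clip range.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

For example, if the new policy makes an action 1.5 times as likely and its advantage is positive, the objective only credits a ratio of 1.2, which limits how far one update can move the policy.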

4. Export

The trained PyTorch model is exported to ONNX format, quantized to INT8, and loaded in the browser via ONNX Runtime WebAssembly.
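To show what INT8 quantization does to the weights, here is a small sketch of symmetric per-tensor quantization in plain Python. It illustrates the idea only; the actual ONNX tooling may use a different scheme (for example zero points or per-channel scales), and the function names and sample weights are assumptions.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the stored integers."""
    return [qi * scale for qi in q]


weights = [0.5, -0.25, 0.1, -1.0]  # made-up sample weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The rounding error per weight is at most half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight then occupies one byte instead of four. For the 25 → 128 → 64 → 6 network, which has 11,974 weights and biases, that is roughly 12 KB of raw parameters before any file-format overhead.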

The Heuristic Fallback

Not every environment can load the ONNX model (slow connections, old browsers). Our heuristic fallback uses 15 parameterized thresholds to make similar decisions.

This ensures the chatbot works even without trained weights.
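The fallback pattern might look roughly like the sketch below. The signal names, action names, and the two thresholds shown are hypothetical stand-ins (the real fallback uses 15 thresholds), and `load_model` is an assumed callable that either returns a policy function or raises.

```python
# Hypothetical thresholds; the real fallback has 15 of them.
DEFAULT_THRESHOLDS = {"min_match_score": 0.6, "long_query_tokens": 20}


def heuristic_actions(signals, thresholds=DEFAULT_THRESHOLDS):
    """Threshold rules that produce action values without trained weights."""
    confident = signals["match_score"] >= thresholds["min_match_score"]
    long_query = signals["query_tokens"] >= thresholds["long_query_tokens"]
    return {
        "include_detail": 1.0 if long_query else 0.0,
        "ask_clarification": 0.0 if confident else 1.0,
    }


def choose_actions(signals, load_model):
    """Use the trained policy when it loads, otherwise fall back to the rules."""
    try:
        model = load_model()
    except Exception:
        return heuristic_actions(signals), "heuristic"
    return model(signals), "model"


def failing_loader():
    """Simulates an environment where the model cannot be loaded."""
    raise RuntimeError("model unavailable")


actions, source = choose_actions({"match_score": 0.4, "query_tokens": 25}, failing_loader)
print(source, actions)  # heuristic {'include_detail': 1.0, 'ask_clarification': 1.0}
```

Because both paths return the same kind of action values, the rest of the response composer does not need to know which one produced them.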

Results

The RL-trained policy outperforms the heuristic baseline on:

The model is tiny (~50 KB quantized) and runs inference in under 5 ms, fast enough for real-time conversation.

Open Source

The full training pipeline is available in the ReLU.chat repository. Train your own policy network for any domain.