A 22MB model takes 5-10 seconds to load on a slow connection. A user lands on the page, types a query, hits enter — and waits. They will not wait. They will close the tab and assume the product is broken.
The standard solution is a loading spinner. The better solution is a heuristic fallback — a simpler, lower-quality but instantly-available response system that runs on the first interaction, then hot-swaps to the full model when it finishes loading.
What Graceful Degradation Means
Graceful degradation is the principle that a system should always do something reasonable, even when its full capability is unavailable. A video player that shows a poster image before the video loads is graceful degradation. An email client that lets you read cached messages while offline is graceful degradation.
For a browser AI chatbot, graceful degradation looks like:
- First paint (0ms) — chatbot UI is interactive, suggestion chips visible, input enabled
- First query (~10ms after typing) — heuristic returns a response based on lexical matching and pre-compiled patterns
- Model loaded (~5-10s after first paint) — full transformer hot-swaps in, subsequent queries use the full pipeline
- No user-visible switch — the response quality goes up silently, the user just notices the bot getting smarter
The Two-Stage Pipeline
ReLU.chat has two complete response pipelines running side by side:
Stage 1: Heuristic + BOW (always available)
- Tokenize query with a lightweight in-memory tokenizer
- Match against a precomputed bag-of-words index over the knowledge base
- Score fragments by lexical overlap (TF-IDF, no model needed)
- Apply 15 hand-tuned thresholds for: which fragment to pick, how many fragments to combine, whether to add connectors, how to format
- Return a response
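To make the Stage 1 steps concrete, here is a minimal sketch of a lexical scorer with an "I do not know" fallback. It is not the ReLU.chat implementation: the function names, the tokenizer, and the IDF-weighted overlap formula are all assumptions chosen for illustration, and only the 0.18 cutoff comes from this post.

```javascript
// Illustrative sketch only: names and scoring formula are assumptions.
const MIN_LEXICAL_OVERLAP = 0.18; // cutoff value quoted in this post

// Lowercase and split on anything that is not a letter or digit.
function tokenize(text) {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

// Precompute a bag-of-words index: one token set per fragment plus IDF weights.
function buildIndex(fragments) {
  const docs = fragments.map((text) => ({ text, tokens: new Set(tokenize(text)) }));
  const docFreq = new Map();
  for (const doc of docs) {
    for (const tok of doc.tokens) docFreq.set(tok, (docFreq.get(tok) || 0) + 1);
  }
  const idf = new Map();
  for (const [tok, df] of docFreq) idf.set(tok, Math.log(1 + docs.length / df));
  // Tokens never seen in the knowledge base are weighted like the rarest known token.
  const unseenWeight = Math.log(1 + docs.length);
  return { docs, idf, unseenWeight };
}

// Fraction of the query's IDF weight that the fragment covers (0 to 1).
function scoreFragment(queryTokens, doc, index) {
  let matched = 0;
  let total = 0;
  for (const tok of new Set(queryTokens)) {
    const weight = index.idf.has(tok) ? index.idf.get(tok) : index.unseenWeight;
    total += weight;
    if (doc.tokens.has(tok)) matched += weight;
  }
  return total > 0 ? matched / total : 0;
}

// Pick the best fragment, or fall back when the overlap is too weak.
function heuristicAnswer(query, index) {
  const queryTokens = tokenize(query);
  let best = null;
  for (const doc of index.docs) {
    const score = scoreFragment(queryTokens, doc, index);
    if (best === null || score > best.score) best = { text: doc.text, score };
  }
  if (best === null || best.score < MIN_LEXICAL_OVERLAP) {
    return { text: "I do not know.", score: best ? best.score : 0, fallback: true };
  }
  return { text: best.text, score: best.score, fallback: false };
}
```

Because matching is exact on tokens, a misspelled query such as "equlibrium" scores zero against every fragment and lands on the fallback, which is the kind of gap the full model is there to close.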
Stage 2: Full model (available once loaded)
- Service worker caches the 22MB ONNX model in the background
- Once loaded, the dense embedding model is used to compute query/document similarity
- BM25 sparse retrieval runs in parallel (see our retrieval post)
- The 25-feature signal layer fuses everything
- The RL-trained policy network (see our RL post) decides the response composition
- Returns a response
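The dense-similarity step can be pictured with a small sketch. The post does not say which similarity measure is used, so cosine similarity is an assumption here, as are the function names and the shape of the fragment objects. The embeddings themselves are assumed to come from the ONNX model and are passed in as plain arrays.

```javascript
// Cosine similarity between two equal-length embedding vectors.
// Assumption: the post does not name the similarity measure.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) throw new Error("embedding dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // a zero vector matches nothing
  return dot / Math.sqrt(normA * normB);
}

// Rank fragments by similarity to the query embedding, highest first.
// Each fragment is a hypothetical { id, embedding } object.
function rankByEmbedding(queryEmbedding, fragments) {
  return fragments
    .map((f) => ({ id: f.id, score: cosineSimilarity(queryEmbedding, f.embedding) }))
    .sort((x, y) => y.score - x.score);
}
```

In the real pipeline these scores would be one input among many to the signal layer, not the final ranking.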
What The Heuristic Gets Right
The heuristic is not a toy. For many queries it produces a perfectly fine response:
- Queries that share vocabulary with a knowledge-base fragment — handled by the lexical scorer
- Queries with high entity overlap — handled by the entity matcher
- Definition queries ("What is X?") — handled by a pattern that prefers the definition-tagged fragment
- Comparison queries ("X vs Y") — handled by a pattern that returns both fragments
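The definition and comparison patterns might look something like the sketch below. The regular expressions and the detectIntent helper are hypothetical stand-ins, not the trigger lists the bot actually ships with.

```javascript
// Hypothetical patterns for illustration; the real trigger lists are not shown in this post.
const DEFINITION_PATTERN = /^\s*(?:what\s+is|what\s+are|define)\s+(.+?)\??\s*$/i;
const COMPARISON_PATTERN = /^\s*(.+?)\s+(?:vs\.?|versus)\s+(.+?)\??\s*$/i;

// Classify a query and extract the term(s) the pattern captured.
function detectIntent(query) {
  const cmp = query.match(COMPARISON_PATTERN);
  if (cmp) return { intent: "comparison", terms: [cmp[1], cmp[2]] };
  const def = query.match(DEFINITION_PATTERN);
  if (def) return { intent: "definition", terms: [def[1]] };
  return { intent: "general", terms: [] };
}
```

A definition match would then prefer the definition-tagged fragment for the captured term, and a comparison match would look up one fragment per captured term.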
What The Heuristic Gets Wrong
The heuristic fails on:
- Paraphrases — "how do I split fairly" should match "Shapley value" but the heuristic has no semantic understanding
- Typo tolerance — "equlibrium" should match "equilibrium" but a strict lexical matcher treats them as different
- Multi-hop queries — questions that need combining two fragments, where neither alone is the answer
- Out-of-vocabulary terms — rare technical jargon the bot has never seen
The 15 Thresholds
The heuristic is parameterized by 15 thresholds:
MIN_LEXICAL_OVERLAP = 0.18 // Below this, return fallback (I do not know)
MIN_ENTITY_OVERLAP = 0.5 // Strong entity match required
MAX_FRAGMENTS = 3 // Never return more than 3 fragments
DEFINITION_TRIGGERS = [...] // Patterns that signal definition intent
COMPARISON_TRIGGERS = [...] // Patterns that signal comparison intent
EXAMPLE_TRIGGERS = [...] // Patterns that signal example intent
FOLLOWUP_CONFIDENCE = 0.4 // When to suggest follow-ups
CONNECTOR_PROBABILITY = 0.6 // How often to insert connectors
MAX_RESPONSE_WORDS = 80 // Hard cap on response length
CONFIDENCE_DISPLAY_THRESHOLD = 0.7
// ... 5 more
These are tuned by running the heuristic on a held-out set of queries and measuring how often it produces a good-enough response. We update them when the knowledge base changes significantly.
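As a rough illustration of that tuning loop, the sketch below sweeps candidate values for a single threshold over a labeled held-out set. The example format (a score plus a goodEnough label) and the accuracy metric are assumptions; the real process may measure something different and tunes all 15 values.

```javascript
// Each held-out example is a hypothetical { score, goodEnough } pair:
// the heuristic's best overlap score, and whether a reviewer judged the answer acceptable.
function tuneThreshold(examples, candidates) {
  let best = { threshold: candidates[0], accuracy: -1 };
  for (const threshold of candidates) {
    let correct = 0;
    for (const ex of examples) {
      const answered = ex.score >= threshold;
      // Correct when we answer a query we handle well, or decline one we do not.
      if (answered === ex.goodEnough) correct++;
    }
    const accuracy = correct / examples.length;
    if (accuracy > best.accuracy) best = { threshold, accuracy };
  }
  return best;
}
```

Sweeping one threshold at a time is the simplest option; thresholds that interact (for example, the overlap cutoff and the fragment cap) may need to be searched jointly.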
The Hot-Swap
The transition from heuristic to full model is a readiness flag checked on every query:
async function getResponse(query) {
  if (fullModelReady) {
    return await fullPipeline(query);
  }
  return await heuristicPipeline(query); // Always available
}
When the model finishes loading, we set fullModelReady = true. From that point on, every query goes through the full pipeline. There is no migration step, no warmup, no re-loading.
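One way to wire up the background load is sketched below. The createResponder factory and the three injected functions (loadFullModel, fullPipeline, heuristicPipeline) are hypothetical; the sketch assumes the load returns a promise and, as a design choice of the sketch, stores the loaded model in place of a boolean flag. A failed load simply leaves the heuristic in charge.

```javascript
// Hypothetical factory: all three dependencies are injected, so nothing here is a real API.
function createResponder({ loadFullModel, fullPipeline, heuristicPipeline }) {
  let model = null; // stays null until the background load resolves

  // Start loading immediately; the query path never awaits this promise.
  const ready = loadFullModel().then(
    (loaded) => {
      model = loaded; // the hot-swap: later queries see the model
      return true;
    },
    (err) => {
      // Load failed (corrupt cache, network error): keep serving heuristic responses.
      console.warn("full model unavailable, staying on heuristic:", err);
      return false;
    }
  );

  async function getResponse(query) {
    if (model !== null) {
      return { source: "full", text: await fullPipeline(model, query) };
    }
    return { source: "heuristic", text: await heuristicPipeline(query) };
  }

  return { getResponse, ready };
}
```

Tagging each response with its source costs nothing here and makes it possible to treat heuristic answers differently later.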
The only subtlety is caching. Heuristic responses should not be persisted in a way that presents them to the user as the system's final answer. We mark heuristic responses internally and re-process them through the full model if the user asks a follow-up.
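The marking and re-processing could be handled by a small history object along these lines. The createHistory helper and its method names are hypothetical, and fullPipeline is assumed to be an async function that takes the original query.

```javascript
// Hypothetical conversation history that tags each turn with the pipeline that produced it.
function createHistory() {
  const turns = [];
  return {
    // source is either "heuristic" or "full".
    record(query, text, source) {
      turns.push({ query, text, source });
    },
    // Before answering a follow-up, re-run every heuristic turn through the full pipeline.
    async upgrade(fullPipeline) {
      for (const turn of turns) {
        if (turn.source === "heuristic") {
          turn.text = await fullPipeline(turn.query);
          turn.source = "full";
        }
      }
    },
    // Only full-model turns are eligible to be written to persistent storage.
    persistable() {
      return turns.filter((t) => t.source === "full");
    },
  };
}
```

Calling upgrade once the model is ready and a follow-up arrives means the follow-up is answered against full-quality context, and nothing produced by the heuristic is ever saved as a final answer.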
What This Buys
- Time to first response drops from 5-10 seconds to about 10 milliseconds, a 500x to 1000x improvement
- User perception — the product feels responsive, even on a slow connection
- Offline behavior — even before the model is cached, the user gets a useful chatbot
- Failure resilience — if the model load fails (corrupt cache, browser bug), the heuristic still works
When Not To Use This
The heuristic fallback pattern makes sense when:
- The full model is large or slow to load
- Users will give up if the first interaction is slow
- A simpler model is available and useful enough as a placeholder
It is not worth the added complexity when:
- The model is already fast to load (sub-100ms)
- Quality is critical even on the first query
- There is no useful simpler alternative