Heuristic Fallback for Cold Start: Graceful Degradation in Browser AI

A 22MB model takes 5-10 seconds to load on a slow connection. A user lands on the page, types a query, hits enter — and waits. They will not wait. They will close the tab and assume the product is broken.

The standard solution is a loading spinner. The better solution is a heuristic fallback — a simpler, lower-quality but instantly-available response system that runs on the first interaction, then hot-swaps to the full model when it finishes loading.

What Graceful Degradation Means

Graceful degradation is the principle that a system should always do something reasonable, even when its full capability is unavailable. A video player that shows a poster image before the video loads is graceful degradation. An email client that lets you read cached messages while offline is graceful degradation.

For a browser AI chatbot, graceful degradation looks like:

First paint (0ms) — chatbot UI is interactive, suggestion chips visible, input enabled
First query (~10ms after typing) — heuristic returns a response based on lexical matching and pre-compiled patterns
Model loaded (~5-10s after first paint) — full transformer hot-swaps in, subsequent queries use the full pipeline
No user-visible switch — the response quality goes up silently, the user just notices the bot getting smarter

The user never sees a spinner. The user never waits. The user gets a working chatbot on the first keystroke.

The Two-Stage Pipeline

ReLU.chat has two complete response pipelines running side by side:

Stage 1: Heuristic + BOW (always available)

Tokenize query with a lightweight in-memory tokenizer
Match against a precomputed bag-of-words index over the knowledge base
Score fragments by lexical overlap (TF-IDF, no model needed)
Apply 15 hand-tuned thresholds for: which fragment to pick, how many fragments to combine, whether to add connectors, how to format
Return a response
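
The Stage 1 steps above can be sketched roughly as follows. This is a minimal illustration rather than ReLU.chat's actual code: the function names (tokenize, buildIndex, scoreFragments), the regex tokenizer, and the particular TF-IDF weighting are all assumptions made for the example.

```javascript
// Hypothetical bag-of-words scorer; names and weighting are assumptions.

// Lowercase and split on anything that is not a letter or digit.
function tokenize(text) {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

// Precompute per-fragment term counts and document frequencies.
function buildIndex(fragments) {
  const docs = fragments.map((text) => {
    const counts = new Map();
    for (const t of tokenize(text)) counts.set(t, (counts.get(t) || 0) + 1);
    return { text, counts };
  });
  const df = new Map();
  for (const d of docs) {
    for (const t of d.counts.keys()) df.set(t, (df.get(t) || 0) + 1);
  }
  return { docs, df };
}

// Score each fragment by summed TF-IDF over the query terms, best first.
function scoreFragments(index, query) {
  const n = index.docs.length;
  const terms = new Set(tokenize(query));
  return index.docs
    .map((d) => {
      let score = 0;
      for (const t of terms) {
        const tf = d.counts.get(t) || 0;
        if (tf === 0) continue;
        const idf = Math.log(1 + n / index.df.get(t));
        score += tf * idf;
      }
      return { text: d.text, score };
    })
    .sort((a, b) => b.score - a.score);
}
```

Because the index is built once, ahead of the first query, each query only costs a tokenization pass and a handful of map lookups per fragment, with no model involved.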

Stage 2: Transformer + RL policy (loads progressively)

Service worker caches the 22MB ONNX model in the background
Once loaded, the dense embedding model is used to compute query/document similarity
BM25 sparse retrieval runs in parallel (see our retrieval post)
The 25-feature signal layer fuses everything
The RL-trained policy network (see our RL post) decides the response composition
Returns a response

Stage 1 is what runs on the first query. Stage 2 takes over when ready.
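
The background caching step in Stage 2 could look something like the cache-first loader below. It is only a sketch under stated assumptions: the cache name, the function name loadModelBytes, and the idea of injecting cacheStorage and fetchFn as parameters are hypothetical. In a browser or service worker the defaults resolve to the standard caches and fetch globals.

```javascript
// Hypothetical sketch: download the model once, serve it from the Cache API afterwards.
// cacheStorage and fetchFn default to the browser globals; they are parameters
// only so the function can be exercised outside a browser.
async function loadModelBytes(url, cacheStorage = caches, fetchFn = fetch) {
  const cache = await cacheStorage.open("model-cache-v1"); // assumed cache name
  const cached = await cache.match(url);
  if (cached) return cached.arrayBuffer();

  const response = await fetchFn(url);
  if (!response.ok) {
    throw new Error(`Model download failed: ${response.status}`);
  }
  // Store a clone so the original body can still be read below.
  await cache.put(url, response.clone());
  return response.arrayBuffer();
}
```

On a repeat visit the bytes come straight from the cache, so only the first visit pays the download cost.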

What The Heuristic Gets Right

The heuristic is not a toy. For many queries it produces a perfectly fine response:

Queries that share vocabulary with a knowledge-base fragment — handled by the lexical scorer
Queries with high entity overlap — handled by the entity matcher
Definition queries ("What is X?") — handled by a pattern that prefers the definition-tagged fragment
Comparison queries ("X vs Y") — handled by a pattern that returns both fragments

For these cases, the heuristic and the full model produce similar results. The user cannot tell the difference.
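
The definition and comparison patterns mentioned above might be implemented along these lines. The two regular expressions and the detectIntent function are hypothetical stand-ins: the real DEFINITION_TRIGGERS and COMPARISON_TRIGGERS lists are not shown in this post.

```javascript
// Hypothetical trigger patterns, for illustration only.
const DEFINITION_PATTERN = /^(what is|what are|define)\s+(.+?)\??$/i;
const COMPARISON_PATTERN = /^(.+?)\s+(?:vs\.?|versus)\s+(.+?)\??$/i;

// Classify a query and pull out the terms the fragment lookup should use.
function detectIntent(query) {
  const q = query.trim();
  let m = q.match(COMPARISON_PATTERN);
  if (m) return { intent: "comparison", terms: [m[1], m[2]] };
  m = q.match(DEFINITION_PATTERN);
  if (m) return { intent: "definition", terms: [m[2]] };
  return { intent: "general", terms: [] };
}
```

A definition intent would then prefer the definition-tagged fragment for the extracted term, and a comparison intent would return one fragment per term.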

What The Heuristic Gets Wrong

The heuristic fails on:

Paraphrases — "how do I split fairly" should match "Shapley value" but the heuristic has no semantic understanding
Typo tolerance — "equlibrium" should match "equilibrium" but a strict lexical matcher treats them as different
Multi-hop queries — questions that need combining two fragments, where neither alone is the answer
Out-of-vocabulary terms — rare technical jargon the bot has never seen

For these, the heuristic returns a low-confidence response or an "I don't have specific information on that" message. That is fine: it is honest about its limitations, and the user gets the full model in a few seconds.
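
To make the typo and paraphrase failures concrete, here is a small sketch of a strict lexical matcher with a fallback. The overlap metric (fraction of distinct query tokens found in the fragment) and the function names are assumptions. The 0.18 cutoff is the MIN_LEXICAL_OVERLAP value quoted in this post.

```javascript
const MIN_LEXICAL_OVERLAP = 0.18; // value quoted in this post
const FALLBACK_MESSAGE = "I don't have specific information on that.";

// Assumed metric: fraction of distinct query tokens that appear in the fragment.
function lexicalOverlap(query, fragment) {
  const toTokens = (s) =>
    new Set(s.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean));
  const q = toTokens(query);
  const f = toTokens(fragment);
  if (q.size === 0) return 0;
  let shared = 0;
  for (const t of q) if (f.has(t)) shared++;
  return shared / q.size;
}

// Return the best-overlapping fragment, or the fallback if nothing clears the bar.
function answerOrFallback(query, fragments) {
  let best = { text: null, overlap: 0 };
  for (const text of fragments) {
    const overlap = lexicalOverlap(query, text);
    if (overlap > best.overlap) best = { text, overlap };
  }
  return best.overlap >= MIN_LEXICAL_OVERLAP ? best.text : FALLBACK_MESSAGE;
}
```

With exact token matching, "equlibrium" shares nothing with a fragment about "equilibrium", and "how do I split fairly" shares nothing with a fragment about the Shapley value, so both land on the fallback message.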

The 15 Thresholds

The heuristic is parameterized by 15 hand-tuned values, a mix of numeric thresholds and trigger-pattern lists:

MIN_LEXICAL_OVERLAP = 0.18        // Below this, return fallback (I do not know)
MIN_ENTITY_OVERLAP = 0.5          // Strong entity match required
MAX_FRAGMENTS = 3                 // Never return more than 3 fragments
DEFINITION_TRIGGERS = [...]       // Patterns that signal definition intent
COMPARISON_TRIGGERS = [...]       // Patterns that signal comparison intent
EXAMPLE_TRIGGERS = [...]          // Patterns that signal example intent
FOLLOWUP_CONFIDENCE = 0.4         // When to suggest follow-ups
CONNECTOR_PROBABILITY = 0.6       // How often to insert connectors
MAX_RESPONSE_WORDS = 80           // Hard cap on response length
CONFIDENCE_DISPLAY_THRESHOLD = 0.7
// ... 5 more

These are tuned by running the heuristic on a held-out set of queries and measuring how often it produces a good-enough response. We update them when the knowledge base changes significantly.
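
One way to run that kind of tuning for a single threshold is a simple sweep over candidate values. The sketch below is an assumption about how "good enough" could be scored, not the actual tuning procedure: each held-out example carries the query's best lexical overlap and a label saying whether the fragment the heuristic would return was judged acceptable.

```javascript
// Fraction of held-out examples where the heuristic behaves well at this threshold.
// Answering is good only if the answer is acceptable;
// falling back is good only if the answer would have been bad.
function goodEnoughRate(examples, threshold) {
  if (examples.length === 0) return 0;
  let good = 0;
  for (const ex of examples) {
    const answered = ex.overlap >= threshold;
    if (answered === ex.acceptable) good++;
  }
  return good / examples.length;
}

// Try each candidate and keep the one with the highest good-enough rate.
function pickThreshold(examples, candidates) {
  let best = { threshold: candidates[0], rate: -1 };
  for (const threshold of candidates) {
    const rate = goodEnoughRate(examples, threshold);
    if (rate > best.rate) best = { threshold, rate };
  }
  return best;
}
```

The same sweep can be repeated per parameter, or rerun whenever the knowledge base changes enough to shift the overlap distribution.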

The Hot-Swap

The transition from heuristic to full model is a readiness-flag check on every query:

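// Flipped to true once the full model has finished loading.
let fullModelReady = false;

async function getResponse(query) {
  if (fullModelReady) {
    return await fullPipeline(query);
  }
  // The heuristic needs no model, so it is always available.
  return await heuristicPipeline(query);
}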

When the model finishes loading, we set fullModelReady = true. From that point on, every query goes through the full pipeline. There is no migration step, no warmup, no re-loading.
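
A slightly fuller sketch of the same idea, including the background load and the case where the load fails, might look like this. The createResponder factory and the injected loadFullModel, fullPipeline, and heuristicPipeline functions are hypothetical. They are passed in as parameters because their real implementations are not shown here.

```javascript
// Hypothetical wiring around the readiness check; all names are assumptions.
function createResponder({ loadFullModel, fullPipeline, heuristicPipeline }) {
  let model = null; // stays null until the load succeeds

  // Start loading immediately. A failed load leaves the heuristic in charge.
  const ready = loadFullModel()
    .then((loaded) => {
      model = loaded;
      return true;
    })
    .catch(() => false);

  async function getResponse(query) {
    if (model !== null) return fullPipeline(model, query);
    return heuristicPipeline(query); // always available
  }

  return { getResponse, ready };
}
```

Queries never await the load itself, so a slow or failed download cannot block a response. The ready promise exists only for callers that want to know when the swap happened.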

The only subtlety is caching. Heuristic responses should not be persisted as if they were authoritative system answers. We mark them internally and re-process them through the full model if the user asks a follow-up.
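
The marking and re-processing described above could be sketched as a small conversation log that tags each answer with the pipeline that produced it. The createHistory helper, its method names, and the "heuristic" / "full" tags are all hypothetical, chosen only for illustration.

```javascript
// Hypothetical conversation log; each turn records which pipeline answered it.
function createHistory() {
  const turns = [];
  return {
    // source is "heuristic" or "full".
    record(query, text, source) {
      turns.push({ query, text, source });
    },
    // Re-answer heuristic turns with the full pipeline, e.g. before a follow-up.
    async upgrade(fullPipeline) {
      for (const turn of turns) {
        if (turn.source !== "heuristic") continue;
        turn.text = await fullPipeline(turn.query);
        turn.source = "full";
      }
      return turns;
    },
    // Only full-model answers are eligible to be persisted.
    persistable() {
      return turns.filter((t) => t.source === "full");
    },
  };
}
```

Keeping the tag on the turn, rather than in a separate store, means the persistence filter and the upgrade step both read from the same source of truth.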

What This Buys

Time to first response drops from 5-10 seconds to about 10 milliseconds, roughly a 500-1000x improvement
User perception — the product feels responsive, even on a slow connection
Offline behavior — even before the model is cached, the user gets a useful chatbot
Failure resilience — if the model load fails (corrupt cache, browser bug), the heuristic still works

The cost is complexity. Two pipelines to maintain, two sets of failure modes, a hot-swap to test. For a system where every millisecond of perceived latency matters, it is the right tradeoff.

When To Use This, And When Not To

The heuristic fallback pattern makes sense when:

The full model is large or slow to load
Users will give up if the first interaction is slow
A simpler model is available and useful enough as a placeholder

It does not make sense when:

The model is already fast to load (sub-100ms)
Quality is critical even on the first query
There is no useful simpler alternative

For ReLU.chat — a privacy-first, on-device chatbot where the model is 22MB and the alternative is no chatbot — the heuristic fallback is not a nice-to-have. It is the difference between a product and a demo.