A 22MB model takes 5-10 seconds to load on a slow connection. A user lands on the page, types a query, hits enter — and waits. They will not wait. They will close the tab and assume the product is broken.

The standard solution is a loading spinner. The better solution is a heuristic fallback — a simpler, lower-quality but instantly-available response system that runs on the first interaction, then hot-swaps to the full model when it finishes loading.

What Graceful Degradation Means

Graceful degradation is the principle that a system should always do something reasonable, even when its full capability is unavailable. A video player that shows a poster image before the video loads is graceful degradation. An email client that lets you read cached messages while offline is graceful degradation.

For a browser AI chatbot, graceful degradation looks like:

  1. First paint (0ms) — chatbot UI is interactive, suggestion chips visible, input enabled
  2. First query (~10ms after typing) — heuristic returns a response based on lexical matching and pre-compiled patterns
  3. Model loaded (~5-10s after first paint) — full transformer hot-swaps in, subsequent queries use the full pipeline
  4. No user-visible switch — the response quality goes up silently, the user just notices the bot getting smarter
The user never sees a spinner. The user never waits. The user gets a working chatbot on the first keystroke.

The Two-Stage Pipeline

ReLU.chat has two complete response pipelines running side by side:

Stage 1: Heuristic + BOW (always available)

Stage 2: Transformer + RL policy (loads progressively) Stage 1 is what runs on the first query. Stage 2 takes over when ready.

What The Heuristic Gets Right

The heuristic is not a toy. For many queries it produces a perfectly fine response:

For these cases, the heuristic and the full model produce similar results. The user cannot tell the difference.

What The Heuristic Gets Wrong

The heuristic fails on:

For these, the heuristic returns a low-confidence response or a "I don't have specific information on that" message. That is fine — it is honest about its limitations, and the user gets the full model in a few seconds.

The 15 Thresholds

The heuristic is parameterized by 15 thresholds:

MIN_LEXICAL_OVERLAP = 0.18        // Below this, return fallback (I do not know)
MIN_ENTITY_OVERLAP = 0.5          // Strong entity match required
MAX_FRAGMENTS = 3                 // Never return more than 3 fragments
DEFINITION_TRIGGERS = [...]       // Patterns that signal definition intent
COMPARISON_TRIGGERS = [...]       // Patterns that signal comparison intent
EXAMPLE_TRIGGERS = [...]          // Patterns that signal example intent
FOLLOWUP_CONFIDENCE = 0.4         // When to suggest follow-ups
CONNECTOR_PROBABILITY = 0.6       // How often to insert connectors
MAX_RESPONSE_WORDS = 80           // Hard cap on response length
CONFIDENCE_DISPLAY_THRESHOLD = 0.7
// ... 5 more

These are tuned by running the heuristic on a held-out set of queries and measuring how often it produces a good-enough response. We update them when the knowledge base changes significantly.

The Hot-Swap

The transition from heuristic to full model is a Promise.race:

async function getResponse(query) {
  if (fullModelReady) {
    return await fullPipeline(query);
  }
  return await heuristicPipeline(query);  // Always available
}

When the model finishes loading, we set fullModelReady = true. From that point on, every query goes through the full pipeline. There is no migration step, no warmup, no re-loading.

The only subtlety is caching. The heuristic responses should not be persisted to suggest the user "this is a system answer." We mark heuristic responses internally and re-process them through the full model if the user asks a follow-up.

What This Buys

The cost is complexity. Two pipelines to maintain, two sets of failure modes, a hot-swap to test. For a system where every millisecond of perceived latency matters, it is the right tradeoff.

When Not To Use This

The heuristic fallback pattern makes sense when:

It does not make sense when: For ReLU.chat — a privacy-first, on-device chatbot where the model is 22MB and the alternative is no chatbot — the heuristic fallback is not a nice-to-have. It is the difference between a product and a demo.