WebGPU and the Future of In-Browser AI Inference

WebGPU is the new low-level graphics and compute API for the web. It has shipped in stable Chrome and Edge, and is available in Firefox Nightly and Safari Technology Preview. For in-browser machine learning, it is the single biggest unlock since WebAssembly.

What WebGPU Actually Is

WebGPU is to the web what Vulkan is to native: a thin, explicit API that maps cleanly to modern GPU hardware. Unlike WebGL, which was designed for graphics and adapted for compute, WebGPU is built for general-purpose GPU work from day one.

The primitives matter:

Buffers and textures with explicit memory layouts
Compute shaders in WGSL (a Rust-like shading language)
Command encoders for batching and pipelining
Async where it matters: adapter and device requests, buffer mapping, and queue completion all return promises

For ML, the critical primitive is the compute shader. A shader is a small program that runs in parallel across thousands of GPU threads. Matrix multiplications, the core of neural network inference, are embarrassingly parallel.
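To make "embarrassingly parallel" concrete, here is a minimal sketch of a naive matrix multiply written as a WGSL compute shader, embedded in TypeScript alongside a CPU reference that computes the same result. Each GPU invocation produces one output element and never needs to talk to its neighbours. The binding layout, the 8x8 workgroup size, and the names matmulShader, dispatchSize, and matmulReference are illustrative assumptions, not code from any particular library, and production kernels tile through workgroup memory rather than using this naive loop.

```typescript
// Illustrative workgroup edge length; real kernels tune this per device.
const WORKGROUP_SIZE = 8;

// Naive WGSL matmul: one invocation computes one element of the result.
const matmulShader = /* wgsl */ `
struct Dims { m: u32, k: u32, n: u32 }

@group(0) @binding(0) var<storage, read> lhs: array<f32>;
@group(0) @binding(1) var<storage, read> rhs: array<f32>;
@group(0) @binding(2) var<storage, read_write> result: array<f32>;
@group(0) @binding(3) var<uniform> dims: Dims;

@compute @workgroup_size(${WORKGROUP_SIZE}, ${WORKGROUP_SIZE})
fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
  let row = gid.y;
  let col = gid.x;
  // Guard threads that fall outside the matrix edges.
  if (row >= dims.m || col >= dims.n) { return; }
  var acc: f32 = 0.0;
  for (var i = 0u; i < dims.k; i = i + 1u) {
    acc = acc + lhs[row * dims.k + i] * rhs[i * dims.n + col];
  }
  result[row * dims.n + col] = acc;
}
`;

// Number of workgroups needed along x (columns) and y (rows).
function dispatchSize(m: number, n: number): [number, number] {
  return [Math.ceil(n / WORKGROUP_SIZE), Math.ceil(m / WORKGROUP_SIZE)];
}

// CPU reference with the same row-major indexing as the shader.
// The two outer loops are what the GPU runs in parallel.
function matmulReference(
  a: Float32Array, b: Float32Array, m: number, k: number, n: number
): Float32Array {
  const out = new Float32Array(m * n);
  for (let row = 0; row < m; row++) {
    for (let col = 0; col < n; col++) {
      let acc = 0;
      for (let i = 0; i < k; i++) {
        acc += a[row * k + i] * b[i * n + col];
      }
      out[row * n + col] = acc;
    }
  }
  return out;
}
```

The CPU version doubles as a correctness check: run both on small random matrices and compare the outputs within a float tolerance before trusting the GPU path.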

Why This Matters for In-Browser AI

Without GPU access, browser ML runs on the CPU via WebAssembly. WASM SIMD is good (you can get a 2-4x speedup over plain scalar code), but it is still bounded by single-threaded CPU performance. A modern laptop CPU can do maybe 50 GOPS in INT8 (these are integer operations, so strictly speaking not FLOPS).

A modern integrated GPU can do 2-5 TFLOPS. A discrete GPU: 10-40 TFLOPS. That is 40-800x more compute, available without installing anything.
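The 40-800x range is just the ratio of the rough peak figures above, which a few lines of arithmetic confirm. This is a sketch over the article's estimates, not a benchmark; the function name peakRatio is made up for illustration, and peak ratios say nothing about memory bandwidth or transfer overhead.

```typescript
// Ratio of GPU peak throughput (in tera-ops) to CPU peak throughput (in giga-ops).
// Inputs are the rough estimates quoted in the text, not measurements.
function peakRatio(cpuGigaOps: number, gpuTeraOps: number): number {
  return (gpuTeraOps * 1000) / cpuGigaOps;
}

const cpuGigaOpsEstimate = 50; // laptop CPU, INT8

console.log(peakRatio(cpuGigaOpsEstimate, 2));  // low end of integrated GPUs: 40
console.log(peakRatio(cpuGigaOpsEstimate, 40)); // high end of discrete GPUs: 800
```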

The first time we ran MiniLM inference on a WebGPU compute shader, the embedding latency dropped from ~15ms to ~3ms. A 5x speedup with the same quantized model.

What's Hard About WebGPU

It is not a drop-in replacement for WASM. The tradeoffs are real:

Pros:

Massive throughput for matrix work
Lower power per FLOP than CPU
Scales to large models (multi-GB fits in VRAM on discrete GPUs)

Cons:

Initialization cost is high — first compute shader compile can take 200-500ms
Driver bugs are still common, especially on older Windows machines
Memory transfers between CPU and GPU are slow (PCIe-bound on discrete GPUs)
Not available in all browsers yet (in Firefox it is Nightly-only, and in Safari it is gated behind a flag)
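Because availability varies this much, a page generally has to probe for WebGPU at runtime and fall back to WASM when the probe fails. The sketch below shows one way to do that. It relies on two real parts of the API, navigator.gpu being absent in unsupported browsers and requestAdapter() resolving to null when no suitable adapter exists. The GpuLike and NavigatorLike types and the pickProvider name are hypothetical stand-ins so the example compiles without the WebGPU type definitions.

```typescript
type Provider = "webgpu" | "wasm";

// Minimal structural types (assumed for this sketch) covering only what we call.
interface GpuLike {
  requestAdapter(): Promise<object | null>;
}
interface NavigatorLike {
  gpu?: GpuLike;
}

// Returns "webgpu" only if the API exists and an adapter is actually available.
async function pickProvider(nav: NavigatorLike): Promise<Provider> {
  if (!nav.gpu) return "wasm"; // API not exposed in this browser
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter ? "webgpu" : "wasm"; // null means no usable GPU
  } catch {
    return "wasm"; // treat any probe failure as "not available"
  }
}
```

In a page this would be called as pickProvider(navigator as NavigatorLike). Taking the navigator as a parameter keeps the probe easy to unit test with a fake object.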

For a chatbot, the cold start matters. A 500ms GPU initialization eats most of the speedup for short interactions. WASM is still the right default for small models.
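A quick break-even calculation shows why. Assuming the numbers quoted in this post (about 500ms of GPU initialization, ~15ms per query on WASM, ~3ms on WebGPU), the sketch below computes how many queries a session needs before the GPU path has paid back its startup cost. The function name breakEvenQueries is hypothetical, and the model ignores download time and per-query transfer overhead.

```typescript
// Smallest number of queries n for which
//   initMs + n * gpuMs <= n * wasmMs
// that is, the point where the GPU path has repaid its startup cost.
function breakEvenQueries(initMs: number, wasmMs: number, gpuMs: number): number {
  const savedPerQuery = wasmMs - gpuMs;
  if (savedPerQuery <= 0) return Infinity; // GPU is never faster per query
  return Math.ceil(initMs / savedPerQuery);
}

console.log(breakEvenQueries(500, 15, 3)); // 42 queries at the slow end of shader compile
console.log(breakEvenQueries(200, 15, 3)); // 17 queries at the fast end
```

A chat session that ends after a handful of messages never reaches either number, which is the case for keeping WASM as the default for small models.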

Where WebGPU Wins

The sweet spot is medium-to-large models with high query volume:

1B-7B parameter LLMs running in the browser
Image generation (Stable Diffusion runs in browser via WebGPU)
Speech recognition (Whisper, whose variants range from ~39M to ~1.5B params, runs in real time)
Real-time translation models

The first 7B LLM that runs interactively in a browser shipped in early 2026. It uses WebGPU, runs at ~8 tokens/second on a discrete GPU, and lives entirely client-side. The same trick that made ReLU.chat work for retrieval-based chatbots will make real LLMs work in the browser within 2-3 years.

What's Still Hard

Cold start. The 22MB model download, tokenizer warmup, and GPU init can take 2-3 seconds total on first load. We mitigate with the progressive loading pattern (heuristic fallback first, model hot-swap later) but the first impression is still rough.
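The progressive loading pattern can be sketched as a small wrapper that answers with a cheap heuristic immediately and swaps in the model once it has loaded. This is an illustration of the idea under assumed names (Responder, ProgressiveResponder, loadModel), not the actual ReLU.chat implementation.

```typescript
type Responder = (input: string) => string;

class ProgressiveResponder {
  private active: Responder;
  private upgraded = false;
  // Resolves once loading has finished, whether it succeeded or failed.
  readonly ready: Promise<void>;

  constructor(fallback: Responder, loadModel: () => Promise<Responder>) {
    this.active = fallback; // usable immediately, before any download
    this.ready = loadModel()
      .then((model) => {
        this.active = model; // hot-swap: later calls use the model
        this.upgraded = true;
      })
      .catch(() => {
        // Load failed: keep serving the heuristic fallback.
      });
  }

  respond(input: string): string {
    return this.active(input);
  }

  get usingModel(): boolean {
    return this.upgraded;
  }
}
```

Callers never wait on ready unless they want to. The first message is answered by the fallback, and the swap happens between calls without any change at the call site.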

Mobile. WebGPU is just landing on mobile. iOS Safari's support is gated. Android Chrome is closer but has thermal throttling issues — a phone can do GPU inference for ~30 seconds before throttling kicks in.

Driver bugs. Every browser team has a list of known-broken GPU/driver combinations. We hit three in 2025 alone.

The Architecture Shift

The interesting question is not whether WebGPU will replace WASM for browser ML. It is when the tradeoff flips:

Below ~50M parameters and ~10 queries/min: WASM wins on cold start
Above ~500M parameters or higher query rate: WebGPU wins on throughput
In between: it depends on the model and the user
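Those rules of thumb translate into a small decision function. The 50M, 500M, and 10 queries/min thresholds come from the list above. The cutoff for a "higher query rate" is not given, so the sketch takes it as a required parameter (highRatePerMin, a hypothetical name) rather than guessing a value, and it returns "benchmark" for the undecided middle ground.

```typescript
type ProviderChoice = "wasm" | "webgpu" | "benchmark";

const SMALL_MODEL_PARAMS = 50e6;  // ~50M parameters
const LARGE_MODEL_PARAMS = 500e6; // ~500M parameters
const LOW_RATE_PER_MIN = 10;      // ~10 queries/min

// highRatePerMin has no value in the text; the caller must choose one.
function chooseProvider(
  params: number, queriesPerMin: number, highRatePerMin: number
): ProviderChoice {
  // Large model or heavy traffic: throughput dominates.
  if (params > LARGE_MODEL_PARAMS || queriesPerMin > highRatePerMin) return "webgpu";
  // Small model with light traffic: cold start dominates.
  if (params < SMALL_MODEL_PARAMS && queriesPerMin < LOW_RATE_PER_MIN) return "wasm";
  // Everything else depends on the model and the user, so measure.
  return "benchmark";
}
```

Since the thresholds are approximate, the useful output for mid-sized models is "benchmark": measure both paths on the user's hardware instead of trusting a static rule.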

ReLU.chat will likely ship WebGPU as an opt-in execution provider in 2026. Users on capable hardware get faster inference; everyone else stays on WASM. The architecture already supports multiple execution providers — the signal layer, policy network, and chatbot engine are provider-agnostic.

What This Means For Privacy

If a 7B LLM can run in your browser at 8 tokens/second on consumer hardware, the cloud-LLM privacy argument collapses. There is no longer a quality/privacy tradeoff — you can have both. WebGPU is the technology that makes that statement true.

The browser is becoming a first-class AI runtime. The next 24 months will be wild.