WebGPU is the new low-level graphics and compute API for the web. It has shipped in stable Chrome and Edge, and is available in Firefox Nightly and Safari Technology Preview. For in-browser machine learning, it is the single biggest unlock since WebAssembly.
What WebGPU Actually Is
WebGPU is to the web what Vulkan is to native: a thin, explicit API that maps cleanly to modern GPU hardware. Unlike WebGL, which was designed for graphics and adapted for compute, WebGPU is built for general-purpose GPU work from day one.
The primitives matter:
- Buffers and textures with explicit memory layouts
- Compute shaders in WGSL (a Rust-like shading language)
- Command encoders for batching and pipelining
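- Async where it matters: adapter and device requests, buffer mapping, and async pipeline creation return promises, while command recording and submission are synchronous calls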
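To make the primitives above concrete, here is a minimal sketch that doubles an array on the GPU. The shader, the names `DOUBLE_SHADER`, `workgroupCount`, and `doubleOnGpu`, and the workgroup size of 64 are illustrative assumptions, not part of any library. The sketch reads the WebGPU globals through `globalThis` so that it compiles without the `@webgpu/types` package; with those types installed you would use `navigator.gpu`, `GPUBufferUsage`, and `GPUMapMode` directly.

```typescript
// WGSL compute shader: doubles every element of a storage buffer in place.
const DOUBLE_SHADER = `
@group(0) @binding(0) var<storage, read_write> data: array<f32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) id: vec3<u32>) {
  if (id.x < arrayLength(&data)) {
    data[id.x] = data[id.x] * 2.0;
  }
}
`;

const WORKGROUP_SIZE = 64; // must match @workgroup_size in the shader

// Number of workgroups needed to cover `n` elements.
function workgroupCount(n: number, size: number = WORKGROUP_SIZE): number {
  return Math.ceil(n / size);
}

// Returns the doubled array, or null when WebGPU is unavailable.
async function doubleOnGpu(input: Float32Array): Promise<Float32Array | null> {
  const g = globalThis as any; // untyped access; see the note above
  const gpu = g.navigator?.gpu;
  if (!gpu) return null;
  const adapter = await gpu.requestAdapter();
  if (!adapter) return null;
  const device = await adapter.requestDevice();

  // Buffers with explicit usage flags.
  const { STORAGE, COPY_SRC, COPY_DST, MAP_READ } = g.GPUBufferUsage;
  const storage = device.createBuffer({
    size: input.byteLength,
    usage: STORAGE | COPY_SRC | COPY_DST,
  });
  const readback = device.createBuffer({
    size: input.byteLength,
    usage: MAP_READ | COPY_DST,
  });
  device.queue.writeBuffer(storage, 0, input);

  // Compute pipeline built from the WGSL source.
  const pipeline = await device.createComputePipelineAsync({
    layout: "auto",
    compute: {
      module: device.createShaderModule({ code: DOUBLE_SHADER }),
      entryPoint: "main",
    },
  });
  const bindGroup = device.createBindGroup({
    layout: pipeline.getBindGroupLayout(0),
    entries: [{ binding: 0, resource: { buffer: storage } }],
  });

  // Command encoder: record the dispatch and the copy, then submit once.
  const encoder = device.createCommandEncoder();
  const pass = encoder.beginComputePass();
  pass.setPipeline(pipeline);
  pass.setBindGroup(0, bindGroup);
  pass.dispatchWorkgroups(workgroupCount(input.length));
  pass.end();
  encoder.copyBufferToBuffer(storage, 0, readback, 0, input.byteLength);
  device.queue.submit([encoder.finish()]);

  // Mapping the readback buffer is the asynchronous step.
  await readback.mapAsync(g.GPUMapMode.READ);
  const result = new Float32Array(readback.getMappedRange().slice(0));
  readback.unmap();
  return result;
}
```

In a browser with WebGPU, `await doubleOnGpu(new Float32Array([1, 2, 3]))` should resolve to the doubled values; anywhere else it resolves to `null`, which is the caller's signal to take another path.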
Why This Matters for In-Browser AI
Without GPU access, browser ML runs on the CPU via WebAssembly. WASM SIMD is good: you can get a 2-4x speedup over plain scalar code, but it is still bounded by single-threaded CPU performance. A modern laptop CPU can do maybe 50 GOPS (billion operations per second) in INT8.
A modern integrated GPU can do 2-5 TFLOPS. A discrete GPU: 10-40 TFLOPS. That is 40-800x more compute, available without installing anything.
The first time we ran MiniLM inference on a WebGPU compute shader, the embedding latency dropped from ~15ms to ~3ms. A 5x speedup with the same quantized model.
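The arithmetic behind those numbers can be checked in a few lines. This sketch only uses the figures quoted above; the constant and function names are made up for illustration.

```typescript
// Figures quoted in the text, in operations per second.
const CPU_OPS = 50e9;
const INTEGRATED_GPU = { low: 2e12, high: 5e12 };
const DISCRETE_GPU = { low: 10e12, high: 40e12 };

// How many times more peak compute the GPU offers than the CPU.
function ratio(gpuOps: number, cpuOps: number): number {
  return gpuOps / cpuOps;
}

// Weakest integrated GPU vs. strongest discrete GPU gives the quoted range.
const peakRange = {
  min: ratio(INTEGRATED_GPU.low, CPU_OPS),
  max: ratio(DISCRETE_GPU.high, CPU_OPS),
};

// Measured MiniLM embedding latency: ~15ms on CPU, ~3ms on WebGPU.
const measuredSpeedup = 15 / 3;

console.log(`peak compute ratio: ${peakRange.min}x to ${peakRange.max}x`);
console.log(`measured end-to-end speedup: ${measuredSpeedup}x`);
```

The peak ratios describe raw compute, while the measured figure is end-to-end latency for one small model, so the two are not expected to match.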
What's Hard About WebGPU
It is not a drop-in replacement for WASM. The tradeoffs are real:
Pros:
- Massive throughput for matrix work
- Lower power per FLOP than CPU
- Scales to large models (multi-GB fits in VRAM on discrete GPUs)
Cons:
- Initialization cost is high: the first compute shader compile can take 200-500ms
- Driver bugs are still common, especially on older Windows machines
- Memory transfers between CPU and GPU are slow (PCIe-bound on discrete GPUs)
- Not available in all browsers yet (Firefox Nightly only, Safari gated behind a flag)
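Because availability is uneven, code that wants WebGPU usually has to probe for it and fall back. A minimal sketch of that check follows; `Backend`, `NavigatorLike`, `hasWebGpuApi`, and `detectBackend` are hypothetical names, and the navigator object is passed in as a parameter so the logic can be exercised outside a browser.

```typescript
type Backend = "webgpu" | "wasm";

// Minimal structural type for the part of `navigator` this sketch touches.
interface NavigatorLike {
  gpu?: { requestAdapter(): Promise<unknown> };
}

// Cheap synchronous check: is the WebGPU entry point exposed at all?
function hasWebGpuApi(nav: NavigatorLike | undefined): boolean {
  return nav?.gpu !== undefined;
}

// Full check: the API can exist while no usable adapter is offered,
// in which case requestAdapter() resolves to null.
async function detectBackend(nav: NavigatorLike | undefined): Promise<Backend> {
  if (!nav || !nav.gpu) return "wasm";
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter ? "webgpu" : "wasm";
  } catch {
    // Treat any failure during the probe as "no GPU path".
    return "wasm";
  }
}
```

In a page this would be called as `detectBackend((globalThis as any).navigator)`, with the WASM path kept as the default whenever the probe does not return an adapter.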
Where WebGPU Wins
The sweet spot is medium-to-large models with high query volume:
- 1B-7B parameter LLMs running in the browser
- Image generation (Stable Diffusion runs in browser via WebGPU)
- Speech recognition (Whisper, ~1B params, runs in real time)
- Real-time translation models
What's Still Hard
Cold start. The 22MB model download, tokenizer warmup, and GPU init can take 2-3 seconds total on first load. We mitigate with the progressive loading pattern (heuristic fallback first, model hot-swap later) but the first impression is still rough.
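One way the progressive loading pattern can be structured is sketched below. This is an assumption about the shape of such code, not our production implementation: `Scorer`, `HotSwapScorer`, and the loader callback are hypothetical names, and the "model" is any function that arrives asynchronously.

```typescript
// Anything that turns text into a score; stands in for the real model call.
type Scorer = (text: string) => number;

class HotSwapScorer {
  private active: Scorer;
  private modelActive = false;

  // Resolves to true once the model has replaced the fallback,
  // or to false if loading failed and the fallback stays in place.
  readonly ready: Promise<boolean>;

  constructor(fallback: Scorer, loadModel: () => Promise<Scorer>) {
    // Serve requests immediately with the cheap heuristic.
    this.active = fallback;
    // Start the download and GPU init in the background.
    this.ready = loadModel().then(
      (model) => {
        this.active = model; // hot-swap: later calls use the model
        this.modelActive = true;
        return true;
      },
      () => false,
    );
  }

  score(text: string): number {
    return this.active(text);
  }

  get isModelActive(): boolean {
    return this.modelActive;
  }
}
```

Callers use `score()` from the first frame and never wait on `ready` unless they specifically want model-quality results; a failed load leaves the heuristic in place rather than breaking the page.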
Mobile. WebGPU is just landing on mobile. iOS Safari's support is gated. Android Chrome is closer but has thermal throttling issues — a phone can do GPU inference for ~30 seconds before throttling kicks in.
Driver bugs. Every browser team has a list of known-broken GPU/driver combinations. We hit three in 2025 alone.
The Architecture Shift
The interesting question is not whether WebGPU will replace WASM for browser ML. It is when the tradeoff flips:
- Below ~50M parameters and ~10 queries/min: WASM wins on cold start
- Above ~500M parameters or higher query rate: WebGPU wins on throughput
- In between: it depends on the model and the user
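Those rules of thumb can be written down as a small decision function. The sketch below encodes only the thresholds stated above. The text does not give a number for the "higher query rate" cutoff, so it is left as a required argument (`highQueryRate`) that the caller must choose; all names here are hypothetical.

```typescript
type Choice = "wasm" | "webgpu" | "benchmark";

interface Workload {
  parameters: number; // model size in parameters
  queriesPerMinute: number; // expected query rate
}

// Approximate thresholds taken from the rules of thumb above.
const SMALL_MODEL_PARAMS = 50e6;
const LARGE_MODEL_PARAMS = 500e6;
const LOW_QUERY_RATE = 10;

function chooseBackend(
  w: Workload,
  highQueryRate: number, // assumption: supplied by the caller, not given in the text
  webgpuAvailable: boolean,
): Choice {
  if (!webgpuAvailable) return "wasm";
  // Small model and low volume: cold start dominates, so WASM wins.
  if (w.parameters < SMALL_MODEL_PARAMS && w.queriesPerMinute < LOW_QUERY_RATE) {
    return "wasm";
  }
  // Large model or heavy traffic: throughput dominates, so WebGPU wins.
  if (w.parameters > LARGE_MODEL_PARAMS || w.queriesPerMinute > highQueryRate) {
    return "webgpu";
  }
  // In between: measure on the target model and hardware.
  return "benchmark";
}
```

For example, with a placeholder `highQueryRate` of 60, a 22M-parameter model at 5 queries/min maps to `"wasm"`, a 7B-parameter model maps to `"webgpu"` regardless of traffic, and a 100M-parameter model at 20 queries/min lands in `"benchmark"`.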
What This Means For Privacy
If a 7B LLM can run in your browser at 8 tokens/second on consumer hardware, the cloud-LLM privacy argument collapses. There is no longer a quality/privacy tradeoff — you can have both. WebGPU is the technology that makes that statement true.
The browser is becoming a first-class AI runtime. The next 24 months will be wild.