The Rise of Local-First AI: Benchmarking WebGPU Performance for On-Device LLM Inference
Practical guide to benchmarking WebGPU for on-device LLM inference: methodology, sample code, bottlenecks, and optimization checklist for developers.
Local-first AI is rapidly moving from a research slogan to a practical deployment model. Developers want LLM-powered features that run offline, preserve privacy, and have predictable latency. Modern browsers and native runtimes expose GPU compute through WebGPU and wgpu-native, making performant on-device inference realistic. But shipping reliable local LLMs requires methodical benchmarking: memory is limited, compute patterns differ from training, and device heterogeneity is vast.
This post gives a practical, repeatable framework to benchmark WebGPU for on-device LLM inference. You’ll get a clear methodology, a minimal WebGPU example to bootstrap experiments, and a checklist of optimizations and measurements that reveal the real bottlenecks.
Why local-first LLMs matter for engineers
- Privacy: user data stays on-device; fewer compliance headaches.
- Latency: no network hops; predictable response times for interactive apps.
- Offline availability: features that work without connectivity.
- Cost control: avoid server-side inference spending for every request.
Constraints that shape our benchmarking approach:
- Memory ceilings (VRAM/shared RAM) limit model size and batch.
- Precision trade-offs (f16/8-bit/4-bit) affect quality and throughput.
- Compute vs memory-bound kernels: LLM inference pays heavily for memory bandwidth.
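To make the memory and bandwidth constraints concrete, a rough back-of-envelope helps: at batch size 1, each generated token has to read roughly the full weight set once, so memory bandwidth caps tokens/sec regardless of compute. A minimal sketch; roughTokensPerSecond is a hypothetical helper and the 100 GB/s bandwidth figure is an illustrative assumption, not a measurement:
// Rough upper bound: tokens/sec <= memory bandwidth / bytes read per token.
// At batch 1, each token streams roughly the full weight set once (KV cache and activations ignored).
function roughTokensPerSecond(paramCount, bytesPerParam, bandwidthGBps) {
  const weightBytes = paramCount * bytesPerParam;       // e.g. 7e9 * 0.5 for 4-bit
  return (bandwidthGBps * 1e9) / weightBytes;
}

// Illustrative numbers only (assumed, not measured):
console.log(roughTokensPerSecond(7e9, 2, 100).toFixed(1));   // 7B @ f16 on ~100 GB/s   -> ~7.1 tok/s
console.log(roughTokensPerSecond(7e9, 0.5, 100).toFixed(1)); // 7B @ 4-bit on ~100 GB/s -> ~28.6 tok/s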
WebGPU as an enabler (quick primer)
WebGPU provides modern GPU compute and explicit buffers, pipelines, and shaders (WGSL). Compared to WebGL compute hacks, WebGPU gives real compute dispatch semantics and better portability to native via wgpu. Typical inference pipeline on WebGPU:
- Upload quantized weights into GPU buffers, or stream them in chunks (see the upload sketch after this list).
- Upload token embeddings or token IDs as inputs.
- Dispatch compute shaders implementing matrix multiplies, attention, and activation kernels.
- Read back logits or next-token probabilities.
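A minimal sketch of the first step, streaming a quantized weight blob into a GPU-resident storage buffer in fixed-size chunks with device.queue.writeBuffer; getWeightChunk is a hypothetical loader (a ranged fetch or a file slice) and the 16 MiB chunk size is an arbitrary assumption:
// Stream a large weight blob into a GPU-resident storage buffer chunk by chunk,
// so the whole model never needs to sit in one host-side ArrayBuffer.
// Assumes totalBytes and chunk lengths are multiples of 4 (writeBuffer requires 4-byte-aligned sizes).
async function uploadWeights(device, totalBytes, getWeightChunk, chunkBytes = 16 * 1024 * 1024) {
  const weightBuffer = device.createBuffer({
    size: totalBytes,
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST
  });
  for (let offset = 0; offset < totalBytes; offset += chunkBytes) {
    // getWeightChunk(offset, length) -> Uint8Array is a hypothetical loader.
    const chunk = await getWeightChunk(offset, Math.min(chunkBytes, totalBytes - offset));
    device.queue.writeBuffer(weightBuffer, offset, chunk);
  }
  return weightBuffer;
}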
Important runtime notes:
- Browser WebGPU availability varies. Chrome/Edge have stable support; Firefox may require flags.
- Node + native apps can use wgpu-native for lower-level control and fewer browser-imposed limits.
- WGSL is the shading language—profiling and kernel fusion are key to hitting performance targets.
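Given the availability caveat above, guard your benchmark harness with explicit feature detection so unsupported browsers fail loudly instead of silently falling back. A minimal sketch; requireWebGPU is just an illustrative name, and the 'timestamp-query' feature is only requested when the adapter offers it:
// Detect WebGPU, request a device, and opt into timestamp queries when available.
async function requireWebGPU() {
  if (!('gpu' in navigator)) throw new Error('WebGPU not available in this browser/context');
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) throw new Error('No GPU adapter found');
  const features = [];
  if (adapter.features.has('timestamp-query')) features.push('timestamp-query');
  const device = await adapter.requestDevice({ requiredFeatures: features });
  // Log a couple of limits that matter for LLM workloads (buffer size, workgroup size).
  console.log('maxBufferSize:', adapter.limits.maxBufferSize,
              'maxComputeWorkgroupSizeX:', adapter.limits.maxComputeWorkgroupSizeX);
  return { adapter, device };
}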
Benchmarking methodology — keep it scientific
Pick a matrix of variables to sweep and report:
- Models: small (e.g., 125M), medium (1–3B), large (7B quantized). Use quantized variants to match on-device aims.
- Runtimes: browser WebGPU, Node/wgpu-native, and a WASM-only baseline if available.
- Hardware: integrated GPU (Intel/Apple Silicon), discrete GPU (NVIDIA/AMD) across macOS, Windows, and Linux.
- Metrics: cold-start time, tokens/sec (throughput), latency per token (p50/p95/p99), peak memory usage, and power draw when measurable.
Measurement tips:
- Warm up before measuring: run several inference passes to JIT/compile shaders.
- Use multiple runs and report median and percentiles, not just averages.
- Measure both per-token latency and batched throughput. For interactive agents, single-token p95 is often the real KPI.
- When possible, run profiling counters (timestamps, GPU time) to separate CPU overhead from GPU execution.
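When the 'timestamp-query' feature is available, you can bracket a compute pass with GPU timestamps and separate GPU execution time from CPU/API overhead. A sketch of the mechanics (timeComputePass is an illustrative name; buffer reuse and error handling are omitted, and some browsers quantize timestamps):
// Measure GPU time for one compute pass with timestamp queries.
// Requires a device created with requiredFeatures: ['timestamp-query'].
async function timeComputePass(device, pipeline, bindGroup, workgroups) {
  const querySet = device.createQuerySet({ type: 'timestamp', count: 2 });
  const resolveBuffer = device.createBuffer({
    size: 16, usage: GPUBufferUsage.QUERY_RESOLVE | GPUBufferUsage.COPY_SRC
  });
  const readBuffer = device.createBuffer({
    size: 16, usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ
  });

  const encoder = device.createCommandEncoder();
  const pass = encoder.beginComputePass({
    timestampWrites: { querySet, beginningOfPassWriteIndex: 0, endOfPassWriteIndex: 1 }
  });
  pass.setPipeline(pipeline);
  pass.setBindGroup(0, bindGroup);
  pass.dispatchWorkgroups(workgroups.x, workgroups.y, workgroups.z);
  pass.end();
  encoder.resolveQuerySet(querySet, 0, 2, resolveBuffer, 0);
  encoder.copyBufferToBuffer(resolveBuffer, 0, readBuffer, 0, 16);
  device.queue.submit([encoder.finish()]);

  await readBuffer.mapAsync(GPUMapMode.READ);
  const [start, end] = new BigUint64Array(readBuffer.getMappedRange().slice(0));
  readBuffer.unmap();
  return Number(end - start) / 1e6; // timestamps are nanoseconds; return milliseconds
}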
Benchmark configuration should be explicit. For example, a config might be { "model": "ggml-7b-q4", "dtype": "f16", "batch": 1 } so others can reproduce.
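A small harness that takes such a config, warms up, then records per-token latencies over many runs keeps the reporting reproducible. A sketch; runSingleToken is a hypothetical function wrapping one decode step on whatever runtime you are benchmarking:
// Minimal benchmark harness: warm-up, repeated runs, percentile reporting.
function percentile(sortedValues, p) {
  const idx = Math.min(sortedValues.length - 1, Math.floor((p / 100) * sortedValues.length));
  return sortedValues[idx];
}

async function benchmark(config, runSingleToken, { warmup = 5, runs = 100 } = {}) {
  for (let i = 0; i < warmup; i++) await runSingleToken(config); // compile shaders, warm caches

  const latenciesMs = [];
  for (let i = 0; i < runs; i++) {
    const t0 = performance.now();
    await runSingleToken(config);
    latenciesMs.push(performance.now() - t0);
  }
  latenciesMs.sort((a, b) => a - b);

  const meanMs = latenciesMs.reduce((a, b) => a + b, 0) / latenciesMs.length;
  return {
    config, // e.g. { model: 'ggml-7b-q4', dtype: 'f16', batch: 1 }
    p50Ms: percentile(latenciesMs, 50),
    p95Ms: percentile(latenciesMs, 95),
    p99Ms: percentile(latenciesMs, 99),
    tokensPerSec: 1000 / meanMs
  };
}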
Implementing on-device LLM inference with WebGPU — minimal example
This small snippet shows a minimal WebGPU initialization, buffer creation, and dispatch setup you can adapt. The code is intentionally compact—real inference requires many more kernel definitions and weight uploads.
async function initWebGPU(maxTokens) {
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) throw new Error('No GPU adapter found');
  const device = await adapter.requestDevice();

  // Staging / storage buffers (4 bytes per token slot)
  const inputBuffer = device.createBuffer({
    size: 4 * maxTokens,
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST
  });
  const outputBuffer = device.createBuffer({
    size: 4 * maxTokens,
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC
  });

  // Minimal WGSL compute shader (placeholder for a real matmul/attention kernel)
  const wgsl = `@compute @workgroup_size(64)
  fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
    // kernel implementation here
  }`;
  const module = device.createShaderModule({ code: wgsl });
  const pipeline = device.createComputePipeline({
    layout: 'auto', // required; 'auto' derives the bind group layout from the shader
    compute: { module, entryPoint: 'main' }
  });

  return { device, inputBuffer, outputBuffer, pipeline };
}
async function dispatchInference(device, pipeline, bindGroup, workgroups) {
  const commandEncoder = device.createCommandEncoder();
  const pass = commandEncoder.beginComputePass();
  pass.setPipeline(pipeline);
  pass.setBindGroup(0, bindGroup);
  pass.dispatchWorkgroups(workgroups.x, workgroups.y, workgroups.z);
  pass.end();
  device.queue.submit([commandEncoder.finish()]);
  // Coarse completion fence; to fetch results, copy + mapAsync the output buffer (see the readback sketch below)
  await device.queue.onSubmittedWorkDone();
}
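To actually read logits back, copy the output storage buffer into a small staging buffer created with MAP_READ and map it asynchronously. A sketch (readOutput is an illustrative name; buffer reuse and error handling are omitted):
// Read back the output (e.g. logits) from a GPU storage buffer.
async function readOutput(device, outputBuffer, byteSize) {
  const staging = device.createBuffer({
    size: byteSize,
    usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ
  });
  const encoder = device.createCommandEncoder();
  encoder.copyBufferToBuffer(outputBuffer, 0, staging, 0, byteSize);
  device.queue.submit([encoder.finish()]);

  await staging.mapAsync(GPUMapMode.READ);
  const result = new Float32Array(staging.getMappedRange().slice(0)); // copy out before unmapping
  staging.unmap();
  staging.destroy();
  return result;
}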
Replace the WGSL kernel with optimized matmul and attention kernels. The bottleneck will be memory transfers and how weights are laid out in GPU-friendly form (tile sizes, vectorized loads).
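As a starting point for a GPU-friendly layout, here is a naive row-major matrix-vector kernel using vec4 loads. It is a sketch to illustrate vectorized access, not a tuned, tiled GEMM, and it assumes the column count is a multiple of 4:
// Naive matvec with vec4<f32> loads (illustrative; a real kernel would tile and use workgroup memory).
const matvecWGSL = `
  struct Dims { rows : u32, cols : u32 }

  @group(0) @binding(0) var<storage, read> weights : array<vec4<f32>>;  // rows x cols, cols % 4 == 0
  @group(0) @binding(1) var<storage, read> inputVec : array<vec4<f32>>;
  @group(0) @binding(2) var<storage, read_write> outputVec : array<f32>;
  @group(0) @binding(3) var<uniform> dims : Dims;

  @compute @workgroup_size(64)
  fn main(@builtin(global_invocation_id) gid : vec3<u32>) {
    let row = gid.x;
    if (row >= dims.rows) { return; }
    let vecCols = dims.cols / 4u;
    var acc : f32 = 0.0;
    for (var c = 0u; c < vecCols; c = c + 1u) {
      // One vec4 load moves four weights per memory transaction.
      acc = acc + dot(weights[row * vecCols + c], inputVec[c]);
    }
    outputVec[row] = acc;
  }
`;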
Interpreting results and typical bottlenecks
- Memory-bound vs compute-bound: On many GPUs, LLM inference (especially quantized) is memory-bound. If utilization is low but memory bandwidth is saturated, optimize data layout and reduce transfers.
- Buffer copy overhead: Frequent CPU↔GPU buffer copies kill latency. Use persistent GPU-resident buffers and update only small slices.
- Kernel launch overhead: Small kernels per layer add overhead. Fuse operations (e.g., GEMM + bias + activation) to reduce dispatch count.
- Precision/quantization tradeoffs: 4-bit quantization can drastically reduce memory and bandwidth, but requires decoder logic and careful dequantization inside kernels (a small WGSL decode helper is sketched after this list).
- Driver/runtime differences: The same code can behave very differently across vendors. Always test on target devices.
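To illustrate the dequantization point above, here is a minimal WGSL decode helper for a hypothetical symmetric 4-bit format in which eight weights are packed into one u32 and share a single f32 scale. Real formats (e.g. the ggml/llama.cpp q4 variants) differ in block size and zero-point handling, so treat this purely as a sketch to fold into a fused matmul kernel:
// Hypothetical packed 4-bit layout: 8 nibbles per u32, one f32 scale per 8-weight block.
const dequant4WGSL = `
  @group(0) @binding(0) var<storage, read> qweights : array<u32>;
  @group(0) @binding(1) var<storage, read> scales : array<f32>;

  // Decode weight i on the fly, without materializing an f32 weight buffer.
  fn dequant(i : u32) -> f32 {
    let block = i / 8u;                               // which u32 holds this weight
    let lane = i % 8u;                                // which nibble inside that u32
    let nibble = (qweights[block] >> (lane * 4u)) & 0xFu;
    return (f32(nibble) - 8.0) * scales[block];       // symmetric 4-bit with an implicit zero-point of 8
  }
`;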
Optimization checklist (practical)
- Minimize host↔GPU transfers: keep weights on GPU, stream only tokens and small state.
- Fuse kernels where possible: fewer dispatches means lower overhead.
- Use vectorized loads/stores in WGSL (vec4/vec2) to improve memory throughput.
- Align buffers to cache-line-friendly sizes and tile your GEMMs to fit L1/L2.
- Use quantized formats on-disk and dequantize on-GPU in fused kernels to avoid multiple passes.
- Profile: use GPU timestamps, and measure both API-level timing and device counters.
- Test mixed-precision: f16 on some GPUs gives a large perf win; on others, int8/4-bit with custom kernels is better.
- Warm-up shaders and reuse pipelines to avoid cold-start JITs.
- Consider streaming weights: load only critical layers to fit memory, swap others as needed.
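For the weight-streaming item, a simple residency scheme keeps only a bounded number of layer buffers on the GPU and swaps others in on demand. A sketch; LayerCache and loadLayerBytes are hypothetical names, eviction is plain FIFO, and it assumes the evicted layer's queued work has already completed:
// Keep at most maxResident layer weight buffers on the GPU; evict the oldest first.
class LayerCache {
  constructor(device, maxResident) {
    this.device = device;
    this.maxResident = maxResident;
    this.buffers = new Map(); // layerIndex -> GPUBuffer, in insertion order
  }

  async getLayer(layerIndex, loadLayerBytes /* hypothetical: (i) -> Uint8Array */) {
    if (this.buffers.has(layerIndex)) return this.buffers.get(layerIndex);

    const bytes = await loadLayerBytes(layerIndex); // byteLength assumed to be a multiple of 4
    const buffer = this.device.createBuffer({
      size: bytes.byteLength,
      usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST
    });
    this.device.queue.writeBuffer(buffer, 0, bytes);

    if (this.buffers.size >= this.maxResident) {
      const [oldestIndex, oldestBuffer] = this.buffers.entries().next().value;
      oldestBuffer.destroy(); // releases the evicted layer's GPU memory
      this.buffers.delete(oldestIndex);
    }
    this.buffers.set(layerIndex, buffer);
    return buffer;
  }
}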
Tooling and libraries to consider
- onnxruntime-web (WebGPU backend) — good baseline for tuned kernels.
- TensorFlow.js (experimental WebGPU) — useful for some ops, less LLM-focused.
- wgpu-native — native apps with the same WebGPU API semantics.
- Native projects like llama.cpp and its WASM ports demonstrate trade-offs; they’re good references even if they aren’t full WebGPU stacks.
When selecting libraries, prefer those that expose low-level kernels or allow custom WGSL kernels so you can implement fused ops and quantized paths.
Summary / Quick checklist for your first benchmark run
- Choose representative model sizes and quantization settings.
- Test on target devices: integrated and discrete GPUs across OSes.
- Measure cold-start, warm-start p50/p95/p99 latency, throughput, and peak memory.
- Warm up shaders, run multiple trials, and report medians and percentiles.
- Profile to determine memory vs compute bound behavior.
- Apply the optimization checklist: reduce transfers, fuse kernels, and test quantization.
Local-first AI is achievable today for many use cases, but predictable, performant deployments require disciplined benchmarking and careful kernel design. Use WebGPU for portability, but measure across devices and iterate on data layout and kernel fusion to squeeze out reliable latency and throughput.
A natural next step is a small starter repo (browser + Node) that wires up the WebGPU scaffolding and benchmark harness sketched above, so the same sweep can be run unchanged across your target devices and model sizes.