
The Rise of Local-First AI: Deploying SLMs with WebGPU for Privacy-Preserving Applications

Practical guide to running small language models (SLMs) in the browser with WebGPU — design choices, quantization, runtime options, and a WebGPU shader example.


Developers are increasingly building applications that keep inference on-device. Local-first AI — running small language models (SLMs) in the browser or on the edge — reduces latency, cuts server costs, and preserves user privacy because data never leaves the client. WebGPU makes this feasible: it unlocks real GPU compute in the browser and enables efficient linear algebra kernels needed by transformer inference.

This post gives a pragmatic playbook: how to choose and quantize models, available runtimes and fallbacks, an overview of a WebGPU inference approach, a concrete code example (matrix multiply + WGSL shader), and operational tips for production-grade, privacy-preserving apps.

Why local-first AI matters now

The tradeoffs are model size, accuracy, and the heterogeneity of client hardware. The engineering work is about squeezing acceptable quality out of compact, quantized SLMs and making the runtime resilient across GPUs and CPUs.
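One way to handle that heterogeneity is to feature-detect WebGPU at startup and fall back to a CPU/WASM path when it is absent. A minimal sketch (the `pickBackend` name and the navigator-like parameter are illustrative, not a real library API):

```javascript
// Decide which inference backend to use, given a navigator-like object.
// Returns 'webgpu' when the WebGPU API is exposed, else 'wasm' as a CPU fallback.
function pickBackend(nav) {
  return nav && nav.gpu ? 'webgpu' : 'wasm';
}
```

In the browser you would call `pickBackend(navigator)` once at startup and lazily load only the matching runtime bundle, so users on unsupported hardware never download GPU code.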

What to pick: SLMs, formats, and quantization

Models and size targets

Formats and toolchain

Practical rule of thumb
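To make the quantization tradeoff concrete, here is a hedged sketch of symmetric per-tensor int8 quantization in plain JavaScript. Real toolchains typically quantize per-channel with calibration data; the function names here are illustrative:

```javascript
// Symmetric int8 quantization: a single scale maps [-max|w|, +max|w|]
// onto the signed range [-127, 127].
function quantizeInt8(weights) {
  let maxAbs = 0;
  for (const w of weights) maxAbs = Math.max(maxAbs, Math.abs(w));
  const scale = maxAbs / 127 || 1; // avoid divide-by-zero for all-zero tensors
  const q = new Int8Array(weights.length);
  for (let i = 0; i < weights.length; i++) {
    q[i] = Math.max(-127, Math.min(127, Math.round(weights[i] / scale)));
  }
  return { q, scale };
}

// Dequantize back to f32; the reconstruction error is bounded by scale / 2.
function dequantizeInt8({ q, scale }) {
  const out = new Float32Array(q.length);
  for (let i = 0; i < q.length; i++) out[i] = q[i] * scale;
  return out;
}
```

The same idea carries over to the GPU side: store the `Int8Array` in a storage buffer and dequantize inside the shader, trading a little compute for a 4x smaller weight download.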

Runtime options and fallbacks

WebGPU approach overview

You can implement transformer kernels in WebGPU or use an existing WebGPU-backed runtime. Essential building blocks:

  1. Tokenization and prompt processing (runs on CPU). Keep this lightweight.
  2. A quantized weight format and loader that maps model tensors into GPU buffers.
  3. Shader kernels for matrix multiplication, attention, and fused feed-forward layers (WGSL).
  4. Management of GPU memory and streaming weights for large models (partition, swap, or offload to IndexedDB).
  5. Sampling/decoding loop and batched token generation.
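Step 5 is plain CPU work that runs between GPU passes. A minimal temperature-sampling sketch over raw logits (the RNG is injected so the loop stays deterministic in tests; function names are illustrative):

```javascript
// Softmax with temperature: lower temperature sharpens the distribution.
// Subtracting the max logit keeps exp() numerically stable.
function softmax(logits, temperature = 1.0) {
  const max = Math.max(...logits);
  const exps = logits.map((l) => Math.exp((l - max) / temperature));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Sample a token id by walking the cumulative distribution.
// `rand` is an injectable RNG returning values in [0, 1).
function sampleToken(logits, temperature = 1.0, rand = Math.random) {
  const probs = softmax(logits, temperature);
  let r = rand();
  for (let i = 0; i < probs.length; i++) {
    r -= probs[i];
    if (r <= 0) return i;
  }
  return probs.length - 1;
}
```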

Key engineering constraints: GPU memory limits on consumer devices, shader compilation latency, CPU-GPU transfer overhead, and hardware heterogeneity across vendors.

Example: minimal WebGPU matmul + WGSL shader

Below is a concise example showing how to run a single matrix-vector multiply on WebGPU. This is the primitive you will reuse for attention and feed-forward layers.

JavaScript: create device, buffers, pipeline, dispatch, read back results.

// 1) Acquire an adapter and device (bail out cleanly if WebGPU is unavailable)
const adapter = await navigator.gpu?.requestAdapter();
if (!adapter) throw new Error('WebGPU is not supported on this browser/device');
const device = await adapter.requestDevice();

// 2) Prepare input data
const rows = 64; // output dim
const cols = 256; // input dim
const input = new Float32Array(cols).fill(0).map((_, i) => Math.sin(i));
const weights = new Float32Array(rows * cols).map((_, i) => (i % 13) / 13);

// 3) Create GPU buffers
const inputBuffer = device.createBuffer({
    size: input.byteLength,
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST
});
const weightBuffer = device.createBuffer({
    size: weights.byteLength,
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST
});
const outputBuffer = device.createBuffer({
    size: rows * 4,
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC
});

device.queue.writeBuffer(inputBuffer, 0, input.buffer, input.byteOffset, input.byteLength);
device.queue.writeBuffer(weightBuffer, 0, weights.buffer, weights.byteOffset, weights.byteLength);

// 4) WGSL shader: matrix-vector multiply (output[r] = dot(weights[r,:], input))
const shaderCode = `
@group(0) @binding(0) var<storage, read> inputVec : array<f32>;
@group(0) @binding(1) var<storage, read> weightMat : array<f32>;
@group(0) @binding(2) var<storage, read_write> outputVec : array<f32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid : vec3<u32>) {
    let r = gid.x;
    if (r >= ${rows}u) { return; }
    var sum : f32 = 0.0;
    for (var c : u32 = 0u; c < ${cols}u; c = c + 1u) {
        sum = sum + weightMat[r * ${cols}u + c] * inputVec[c];
    }
    outputVec[r] = sum;
}
`;

// 5) Create pipeline and run
const shaderModule = device.createShaderModule({ code: shaderCode });
const pipeline = device.createComputePipeline({
    layout: 'auto',
    compute: { module: shaderModule, entryPoint: 'main' }
});

const bindGroup = device.createBindGroup({
    layout: pipeline.getBindGroupLayout(0),
    entries: [
        { binding: 0, resource: { buffer: inputBuffer } },
        { binding: 1, resource: { buffer: weightBuffer } },
        { binding: 2, resource: { buffer: outputBuffer } }
    ]
});

const commandEncoder = device.createCommandEncoder();
const passEncoder = commandEncoder.beginComputePass();
passEncoder.setPipeline(pipeline);
passEncoder.setBindGroup(0, bindGroup);
passEncoder.dispatchWorkgroups(Math.ceil(rows / 64));
passEncoder.end();

device.queue.submit([commandEncoder.finish()]);

// 6) Read back result (map & copy)
const readBuffer = device.createBuffer({ size: rows * 4, usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ });
const copyEncoder = device.createCommandEncoder();
copyEncoder.copyBufferToBuffer(outputBuffer, 0, readBuffer, 0, rows * 4);
device.queue.submit([copyEncoder.finish()]);
await readBuffer.mapAsync(GPUMapMode.READ);
const resultArray = new Float32Array(readBuffer.getMappedRange().slice(0));
readBuffer.unmap();

This skeleton demonstrates the core mechanics: GPU buffers for inputs/weights/output, a WGSL compute shader that executes the kernel, and copying results back. In a real transformer you will implement tiled matmuls, fused bias and activation, and a parallelized attention kernel.
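When you fuse bias and activation into the matmul kernel, it pays to keep a CPU reference implementation to test the shader against. A hedged sketch using the tanh GELU approximation (assumed here; some runtimes use the exact erf-based GELU, and the function names are illustrative):

```javascript
// tanh approximation of GELU, common in transformer implementations.
function gelu(x) {
  return 0.5 * x * (1 + Math.tanh(Math.sqrt(2 / Math.PI) * (x + 0.044715 * x ** 3)));
}

// CPU reference for a fused matvec + bias + GELU layer:
// out[r] = gelu(dot(weights[r,:], input) + bias[r]).
// Compare your WGSL kernel output against this within a small tolerance.
function matvecBiasGelu(weights, input, bias, rows, cols) {
  const out = new Float32Array(rows);
  for (let r = 0; r < rows; r++) {
    let sum = bias[r];
    for (let c = 0; c < cols; c++) sum += weights[r * cols + c] * input[c];
    out[r] = gelu(sum);
  }
  return out;
}
```

Because f32 GPU arithmetic and CPU doubles round differently, compare against this reference with a relative tolerance (on the order of 1e-4) rather than exact equality.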

Integration tips for full transformer inference

Security, privacy, and deployment patterns

Performance tuning checklist

When to use server-based inference instead

Summary and quick checklist

Checklist:

Local-first AI with SLMs and WebGPU is no longer a research-only scenario. With careful quantization, a pragmatic runtime design, and tuned WGSL kernels, you can deliver private, fast, and low-cost inference straight in the browser — unlocking new classes of privacy-preserving applications for end users.
