Edge-sized foundation models: practical strategies to run private on-device LLMs on smartphones and microcontrollers
Step-by-step, practical tactics to run private on-device LLMs on phones and MCUs—quantization, runtimes, memory budgeting, and deployment patterns.
Modern foundation models are large and power-hungry by design. But for many applications you don’t need the full cloud-scale model — you need a private, low-latency assistant that runs on-device with bounded memory and power. This post is a practical, tactical guide for engineers who want to deploy edge-sized foundation models on smartphones and even microcontrollers. No fluff. Concrete techniques, runtimes, and patterns you can start using today.
Why run LLMs on-device?
- Privacy: user data never leaves the device.
- Latency: local inference removes network round trips.
- Availability: works offline or on constrained networks.
- Cost predictability: you avoid per-request cloud costs.
Those benefits come with tradeoffs: limited RAM, storage, battery, and compute. The rest of this post explains how to trade a small amount of accuracy for memory, latency, and power budgets the device can actually meet, with minimal engineering overhead.
Pick the right model family and target size
Edge deployment starts with model selection. Look for models designed or proven to compress well:
- Smaller foundation models: open checkpoints at 3B or 7B are common starting points.
- Distilled variants: distillation reduces inference compute and memory with modest quality loss.
- Community-friendly formats: models that convert to GGUF / ggml or ONNX are easier to run in optimized runtimes.
Rule of thumb: budget roughly 2–4× the quantized model size in device RAM, so there is headroom for the KV cache and working buffers. On modern phones that means a quantized 3B model, or a 7B model at aggressive 4-bit quantization. On microcontrollers you’re likely targeting tiny distilled models and specialized decoders.
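As a quick sanity check, you can estimate the quantized weight footprint from parameter count and bits per weight. A minimal sketch (the ~10% allowance for quantization scales and metadata is an assumption, not a measured figure):

def quantized_model_bytes(n_params, bits_per_weight, overhead=0.10):
    # Raw weight storage plus a rough allowance for quant scales and metadata
    return int(n_params * bits_per_weight / 8 * (1 + overhead))

for n_params, label in [(3e9, "3B"), (7e9, "7B")]:
    for bits in (8, 4):
        gb = quantized_model_bytes(n_params, bits) / 1e9
        print(f"{label} @ {bits}-bit ~ {gb:.1f} GB of weights")
# Compare the result against the RAM you can realistically dedicate on the
# target device, leaving headroom for KV cache and activation buffers.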
Distillation, pruning, and low-rank adapters
If you control training, apply these compressions:
- Distillation: train a smaller student on teacher outputs; retains behavior while reducing parameters.
- Structured pruning: prune heads or whole layers when sensitivity permits.
- LoRA / adapter baselines: keep a small parameter delta for personalization instead of storing full fine-tuned weights.
When you can’t retrain, use parameter-efficient tuning (LoRA) or compiler-guided pruning where supported.
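To make the adapter idea concrete, here is a minimal NumPy sketch of a LoRA-style low-rank weight delta; the dimensions, rank, and scaling are illustrative, not tied to any particular checkpoint:

import numpy as np

d_out, d_in, rank, alpha = 4096, 4096, 8, 16
W = np.random.randn(d_out, d_in).astype(np.float32)   # frozen base weight
A = np.random.randn(rank, d_in).astype(np.float32)    # trainable low-rank factor
B = np.zeros((d_out, rank), dtype=np.float32)          # trainable low-rank factor

x = np.random.randn(d_in).astype(np.float32)
y = W @ x + (alpha / rank) * (B @ (A @ x))              # base path + adapter path

# Storage comparison: the adapter is a tiny fraction of the full matrix
print("full weight params:", W.size)                    # ~16.8M
print("adapter params:    ", A.size + B.size)           # ~65K

This is why shipping an adapter per user (or per task) is far cheaper than shipping a full fine-tuned checkpoint.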
Practical compression recipe
- Start with a 7B or 3B checkpoint.
- Apply task distillation to a 3B student if latency matters.
- Apply post-training quantization (next section).
- If personalization is required, store LoRA adapters rather than new full checkpoints.
Quantization strategies (the single biggest lever)
Quantization is the core technique to make models fit. Approaches and tradeoffs:
- 8-bit integer (INT8): good compatibility, moderate size reduction, often minimal quality loss with per-channel scales.
- 4-bit quant (GPTQ-style, Q4_0/Q4_K_M): cuts model storage & memory bandwidth dramatically; requires specialized kernels but is now supported in many runtimes.
- Mixed precision: keep layernorm and a few sensitive tensors in float16/float32 and quantize weights otherwise.
Important concepts:
- Per-channel vs per-tensor scales: per-channel reduces quantization error on weights and is preferable.
- Symmetric vs asymmetric: symmetric quantization simplifies kernels; asymmetric (zero-point) quantization can recover accuracy on skewed distributions such as activations.
- Calibration data: for post-training quantization, use representative inputs to compute scales.
Practical tip: test quantized variants on your evaluation prompts. Many models remain useful even at 4-bit; some tasks degrade more than others.
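Here is a minimal NumPy sketch of per-channel symmetric INT8 weight quantization, just to show where the scales come from; production toolchains add calibration data, saturation handling, and fused kernels on top of this:

import numpy as np

def quantize_int8_per_channel(W):
    # One scale per output channel (row); symmetric, so the zero-point is 0
    scales = np.abs(W).max(axis=1, keepdims=True) / 127.0
    scales = np.where(scales == 0, 1.0, scales)          # avoid divide-by-zero
    q = np.clip(np.round(W / scales), -127, 127).astype(np.int8)
    return q, scales

def dequantize(q, scales):
    return q.astype(np.float32) * scales

W = np.random.randn(256, 512).astype(np.float32)
q, s = quantize_int8_per_channel(W)
err = np.abs(W - dequantize(q, s)).mean()
print(f"mean abs quantization error: {err:.5f}")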
Runtimes and toolchains
Choose the runtime that matches your platform and quant format:
- llama.cpp / ggml / GGUF: the mainstay for CPU inference with quantized weights; runs on phones and desktops, with WASM builds available for browsers.
- OnnxRuntime: good for converted ONNX models with int8 support and accelerators.
- TensorFlow Lite (TFLite): runs on Android with NNAPI delegation and on microcontrollers via TFLite Micro.
- TVM: compile kernels for a target device and fuse ops for performance.
- CoreML / Core ML Tools: convert quantized models for iOS with hardware acceleration.
For microcontrollers:
- TensorFlow Lite Micro and CMSIS-NN give int8-friendly kernels for Cortex-M devices.
- Hardware NPUs and DSPs: use vendor SDKs to run matrix ops on accelerators rather than the main CPU.
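As one concrete example of the llama.cpp/GGUF route, the llama-cpp-python bindings expose a small API for loading a quantized model and generating text. A rough sketch, assuming the package is installed and using a hypothetical model path:

from llama_cpp import Llama

# Load a 4-bit GGUF model with a modest context window to bound KV-cache memory
llm = Llama(model_path="./model.q4_0.gguf", n_ctx=512)

out = llm("Summarize the following text: ...", max_tokens=128, temperature=0.2)
print(out["choices"][0]["text"])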
Memory management and KV-cache strategies
Memory is the hardest constraint. You must account for:
- Model parameters (quantized storage).
- Runtime working memory (activation buffers, temporary tensors).
- KV cache for autoregressive decoding: scales with sequence length, number of layers, number of heads, and head dimension.
Patterns that work:
- Single arena allocator: allocate one contiguous arena for all model data to avoid fragmentation and expensive mallocs.
- Memory-map the model file when supported (mmap) so storage pages are loaded on demand; see the sketch after this list.
- Evict or recompute KV cache: for very long contexts, drop or slide out old entries, or recompute attention over dropped spans when storage is tight.
- Reduce context length for on-device assistants.
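The memory-mapping pattern is easy to prototype before wiring it into a native runtime; the sketch below uses Python's mmap module to map a (hypothetical) model file read-only so pages are faulted in on demand:

import mmap

f = open("model.gguf", "rb")                              # hypothetical file name
mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)    # pages load lazily on first access
header = mm[:8]                                           # touching a slice faults in only those pages
print(len(mm), header[:4])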
Example: quick KV cache footprint estimate
def estimate_kv_cache(tokens, n_layers, n_heads, head_dim, dtype_bytes=2):
    # Each layer stores keys and values for every token:
    # 2 (K and V) * tokens * n_heads * head_dim * dtype_bytes, per layer
    return 2 * n_layers * tokens * n_heads * head_dim * dtype_bytes

# Illustrative values for a small 3B-class config
tokens = 512
n_layers = 32
n_heads = 16
head_dim = 64
bytes_needed = estimate_kv_cache(tokens, n_layers, n_heads, head_dim, 2)  # fp16 KV cache: 2 bytes per value
print(f"KV cache: {bytes_needed / 1e6:.1f} MB")  # ~67 MB for this config
This helps you decide whether to cap context length or accept a lower-precision KV cache.
Platform-specific tactics
Android
- Use the NDK and native libraries (llama.cpp compiled for armeabi-v7a and arm64-v8a).
- Use NNAPI for vendor accelerators where you can convert operator sets.
- Memory: store model in app-specific storage and use memory mapping where possible.
iOS
- Convert to CoreML if you need GPU/Neural Engine acceleration.
- Use Metal for custom kernels if CoreML lacks quantized operator coverage.
- Keep the ML model in the app bundle or downloaded to a secure directory.
Microcontrollers
- Target models <1MB or use streaming micro-inference patterns.
- Use INT8 kernels (CMSIS-NN) and keep static buffers; avoid dynamic allocation.
- External QSPI RAM or PSRAM can hold larger quantized models; ensure your board supports execute-in-place or DMA.
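A quick back-of-the-envelope fit check before committing to a board; all sizes below are illustrative placeholders, not a specific part:

def fits_mcu(n_params, bits_per_weight, arena_bytes,
             flash_bytes, sram_bytes, psram_bytes=0):
    # Weights live in flash (or external PSRAM); the static tensor arena lives in SRAM
    weight_bytes = n_params * bits_per_weight // 8
    weights_ok = weight_bytes <= flash_bytes + psram_bytes
    arena_ok = arena_bytes <= sram_bytes
    return weights_ok and arena_ok, weight_bytes

ok, wb = fits_mcu(n_params=8_000_000, bits_per_weight=4,
                  arena_bytes=256 * 1024,
                  flash_bytes=2 * 1024 * 1024,
                  sram_bytes=512 * 1024,
                  psram_bytes=8 * 1024 * 1024)
print(ok, f"{wb / 1e6:.1f} MB of weights")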
Inference patterns: batch, temperature, top-k/top-p
Keep generation efficient:
- Use greedy or low-temperature decoding for faster token generation.
- Use a small top-k or top-p to reduce compute and token churn; a minimal sampling sketch follows this list.
- Batch multiple inferences only when latency allows; an on-device assistant typically serves one request at a time.
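Here is a minimal NumPy sketch of combined top-k/top-p sampling over a logits vector, to show why small k and p keep the per-token sampling work bounded:

import numpy as np

def sample_top_k_top_p(logits, k=40, p=0.95, temperature=0.8, rng=None):
    rng = rng or np.random.default_rng()
    scaled = logits / max(temperature, 1e-5)
    # Top-k: keep only the k highest-scoring tokens
    top_idx = np.argsort(scaled)[-k:]
    probs = np.exp(scaled[top_idx] - scaled[top_idx].max())
    probs /= probs.sum()
    # Top-p (nucleus): keep the smallest prefix whose cumulative mass reaches p
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1
    keep = order[:cutoff]
    kept = probs[keep].astype(np.float64)
    kept /= kept.sum()
    return int(top_idx[keep[rng.choice(keep.size, p=kept)]])

logits = np.random.randn(32_000).astype(np.float32)  # vocabulary-sized logits
print(sample_top_k_top_p(logits))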
Security and privacy hygiene
Even on-device models need secure handling:
- Store models and adapters in app-protected storage.
- Use OS-provided secure enclaves for sensitive personalization data.
- Don’t log prompts or generated text to external telemetry without explicit consent.
Example: running a quantized model with a lightweight runtime
Here is a minimal pattern to invoke a quantized model using a native, library-style runtime (conceptual). Replace it with your runtime’s actual API.
// Pseudocode: single-arena allocation and model load.
// MODEL_BYTES and the runtime_* calls are placeholders for your runtime's API.
#include <stdio.h>
#include <stdlib.h>

size_t ARENA_SIZE = MODEL_BYTES + (size_t)256 * 1024 * 1024; // model + workspace headroom
void *arena = malloc(ARENA_SIZE);
if (!arena) exit(1);

// Load the quantized model file into the front of the arena
FILE *f = fopen("model.gguf.q4_0.bin", "rb");
if (!f || fread(arena, 1, MODEL_BYTES, f) != MODEL_BYTES) exit(1);
fclose(f);

// Initialize the runtime with pointers into the arena; the remainder is working memory
runtime_init(arena, ARENA_SIZE);
runtime_set_prompt("Summarize the following text...");
runtime_generate_max_tokens(128);
runtime_run();
This pattern avoids repeated allocations and keeps working memory contiguous.
Measuring success: metrics to track
- Latency per token (ms/token).
- Peak memory usage and resident set size.
- Battery drain over sustained usage.
- Quality impact of quantization: perplexity, BLEU/ROUGE, or a task-specific score.
Automate these measurements in CI for every model/quantization variant.
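A minimal sketch of the ms/token measurement, assuming a generate_tokens() streaming callable that wraps whatever runtime you use (the name and signature are placeholders):

import time

def measure_ms_per_token(generate_tokens, prompt, n_tokens=64):
    # generate_tokens is assumed to yield one decoded token at a time
    start = time.perf_counter()
    count = 0
    for _ in generate_tokens(prompt, max_tokens=n_tokens):
        count += 1
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / max(count, 1)

# Example: ms = measure_ms_per_token(my_runtime.stream, "Summarize ...")
# Log this per model/quantization variant so CI can catch regressions.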
Summary checklist
- Pick a compressible model family (3B–7B nominal) or distill a smaller student.
- Quantize aggressively: test INT8 and 4-bit variants; use per-channel scales.
- Use a runtime that supports your quant format (llama.cpp/ggml, ONNX, TFLite, CoreML).
- Allocate a single arena, memory-map the model, and cap KV cache when needed.
- Use hardware accelerators (NNAPI, CoreML, vendor DSP) when available.
- Store adapters (LoRA) instead of full checkpoints for personalization.
- Measure latency, memory, and quality; iterate.
Edge-sized foundation models let you deliver private, responsive AI features without cloud dependency. Start by choosing a pragmatic model size, apply quantization, and pick a runtime tuned for your target hardware. The rest is engineering: careful memory budgeting, platform-specific kernels, and continuous measurement.
Checklist (copyable)
- Choose model size and evaluation prompts
- Apply distillation or LoRA if training is possible
- Convert and quantize (test 8-bit and 4-bit)
- Pick runtime and compile for target CPU/accelerator
- Implement single-arena allocator and mmap support
- Cap context length and manage KV cache strategy
- Validate quality vs. latency on-device
Start small, measure often, and prioritize user privacy and battery. Running LLMs on-device is engineering-heavy but entirely feasible with today’s tools and a disciplined quantization strategy.