Skip to content

Software stack

The cleanest portable stack is out-of-process binaries + a thin orchestrator. No daemon, no host Python required, no per-machine install.

A single self-contained binary, no Python runtime, runs on a Pi or a workstation. llama-server exposes an OpenAI-compatible HTTP API (/v1/chat/completions, /v1/embeddings), with streaming, batching, idle sleep/unload, and multimodal input via libmtmd.

Terminal window
llama-server -m models/qwen3-8b-q4_k_m.gguf --host 127.0.0.1 --port 8080
  • Acceleration: Metal (Apple), CUDA (NVIDIA), ROCm (AMD), Vulkan, CPU fallback.
  • Pin a tested build tag per release; vendor per-OS/arch binaries under engine/llama-server/.

Why not the alternatives (kept as optional providers): Ollama is friendly but daemon/cache-oriented; LM Studio isn’t built for redistributable USB packaging; vLLM/SGLang are too CUDA/Python-heavy for arbitrary offline laptops; MLX is Apple-only. llama-server wins on portability.

Same philosophy: a portable binary, offline, with Metal/Vulkan/CUDA/ROCm/CPU and VAD. Use base for the Pocket tier, large-v3-turbo for Field/Lab.

Vision (VLM): Qwen2.5-VL GGUF via llama-server

Section titled “Vision (VLM): Qwen2.5-VL GGUF via llama-server”
UseModel
Default local visionQwen2.5-VL-7B-Instruct-GGUF
Smaller (Field)Qwen2.5-VL-3B-Instruct-GGUF
Tiny caption / OCR-litemoondream2 GGUF

Caveats (still version-sensitive): the model GGUF and its mmproj projector must match; freeze exact files + SHA-256 in the manifest. Images consume context tokens, so budget a larger -c.

LayerPickWhy
Model sourceHugging Face Hub (hf download, pinned revision + exact filename)Standard GGUF distribution
FormatGGUF, usually Q4_K_MPortable, compact for USB
Integritymodels.lock.json + SHA-256repo_id, revision, filename, size, license, sha256, required engine build, source URL
App builduv + uv.lock, frozen with PyInstaller --onedirreproducible, no host Python
OrchestratorFastAPI + httpx / OpenAI clientthin: start llama-server, call localhost, stream council stages

Do not make llama-cpp-python the default runtime. Bundling native wheels across Metal/CUDA/Vulkan/ROCm is messier than keeping llama-server out-of-process. See the build runbook to implement this.

Sources: llama.cpp · llama-server README · whisper.cpp · HF GGUF docs. Verify versions per release.