llama.cpp
The C++ inference engine under most of the local stack
The C++ inference engine under most of the local stack
Mistral AI publishes the 123B-parameter weights under Apache 2.0 — Codestral-class reasoning at half the GPU footprint of Llama 3.1 405B. Locally runnable via vLLM, llama.cpp, and MLX.
Editor-curated slugs that route to this platform’s coverage. Reader-voted tags live below.
Be the first to tag this page. A tag becomes publicly visible once it reaches the community vote threshold.
Loading edit history…
Georgi Gerganov's llama.cpp is the lingua franca of local inference — pure C/C++ with CUDA / Metal / OpenCL / Vulkan backends, GGUF model format, and a wire-compatible server mode. Most desktop wrappers (Ollama, LM Studio, KoboldCPP, etc.) ship it under the hood.
Posts to your status feed
Pick the closest match below, edit the body, and post. Your report carries the #llama-cpp tag automatically so it surfaces here + in the trending-tags rail.
We ran the same 4-bit quant on both backends across coding, summarisation, and long-context recall. MLX wins single-prompt latency; llama.cpp wins throughput. Full numbers + memory traces inside.
All systems normal
No community reports inside the window.
No reports for llama.cpp in the last 2 hours. All clear.