Running Llama 3.1 70B on M4 Max: MLX vs llama.cpp
We ran the same 4-bit quant on both backends across coding, summarisation, and long-context recall. MLX wins single-prompt latency; llama.cpp wins throughput. Full numbers + memory traces inside.
We ran the same 4-bit quant on both backends across coding, summarisation, and long-context recall. MLX wins single-prompt latency; llama.cpp wins throughput. Full numbers + memory traces inside.
Reader-curated screenshots, walk-throughs, and source material for this story. Manuals, scans, and supplementary docs all welcome.
No media yet for Running Llama 3.1 70B on M4 Max: MLX vs llama.cpp.
Manuals, photos, walk-throughs, firmware notes — anything that helps another reader use this product land here first.
Be the first to tag this page. A tag becomes publicly visible once it reaches the community vote threshold.
Loading edit history…
The week's best AI and smart-living reads, every Friday morning. No spam, unsubscribe anytime.
Pull `ollama run llava` and the new image-understanding pipeline is up — vision encoder swap, structured-output mode, and a fast CLI for batch captioning.
The desktop UI's server mode now mirrors OpenAI's `/v1/chat/completions` shape end-to-end — drop-in `OPENAI_BASE_URL` swap from the Python SDK, no auth shim required.
Anthropic released Claude 4 Opus this morning, and the headline feature isn't a benchmark — it's the model's new ability to *think out loud* before answering. Hit a complex enough question and the model now shows its reasoning, sometimes for thirty seconds or more, before producing a response.
We've been testing it for the past 48 hours on the kinds of tasks where prior models fall apart: refactoring a 4,000-line legacy codebase, planning a six-week product launch, and walking through scientific papers we don't have the background to evaluate independently. The pattern is consistent: extended thinking helps most when the right answer requires holding multiple constraints in mind at once.
The clearest wins are in tasks where the failure mode of prior models was confident-but-wrong. Asked to design a database migration that preserves backward compatibility, Opus 4 will now spend 20+ seconds visibly considering edge cases — partial deploys, in-flight transactions, stale read replicas — before producing a plan. The plan it produces is materially more careful than what came out of Sonnet 3.7 on identical inputs.
The model now shows its reasoning, sometimes for thirty seconds or more, before producing a response.
For simple lookups, summarizations, or single-step transforms, extended thinking just adds latency. The product makes the right call here — a routing layer in the API decides whether thinking is needed, and skips it for obvious cases. You can also force-enable or force-disable per request.
Sign in to join the conversation.
No comments yet — be the first.