This Week in Local AI — llama.cpp Gets MCP, Qwen3-Coder-Next Hits #1
InsiderLLM Weekly — March 8, 2026
The tools are moving faster than the models this week. llama.cpp had its biggest feature merge in months, vLLM hit a major version, and Qwen3-Coder-Next quietly became the best open-source coding model anyone can run locally.
llama.cpp Just Got MCP Support. This Changes the Tool Calling Story.
The Model Context Protocol PR merged into mainline llama.cpp this week, and it's a big deal for anyone running local agents.
What it means: llama-server can now directly connect to MCP tool servers — the same protocol Claude, OpenClaw, and most agent frameworks use. Before this, llama.cpp was an inference engine. Now it's an agent runtime. You can point it at MCP servers for file access, web search, databases, whatever — and models that support tool calling (Qwen3.5, Qwen Coder) can use them natively.
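If you want to see what's on the other end of that connection, MCP servers are easy to prototype. Below is a minimal sketch of the two handlers an MCP-style stdio server exposes, using the JSON-RPC method names from the MCP spec (`tools/list` and `tools/call`); the `get_time` tool itself is a made-up example, not anything llama.cpp ships.

```python
import json
from datetime import datetime, timezone

# One made-up example tool. Real MCP servers advertise each tool
# with a JSON Schema describing its arguments.
TOOLS = [{
    "name": "get_time",
    "description": "Return the current UTC time as an ISO-8601 string.",
    "inputSchema": {"type": "object", "properties": {}},
}]

def handle_request(req: dict) -> dict:
    """Dispatch one JSON-RPC 2.0 request to an MCP-style handler."""
    method, params = req.get("method"), req.get("params", {})
    if method == "tools/list":
        result = {"tools": TOOLS}
    elif method == "tools/call" and params.get("name") == "get_time":
        now = datetime.now(timezone.utc).isoformat()
        result = {"content": [{"type": "text", "text": now}]}
    else:
        return {"jsonrpc": "2.0", "id": req.get("id"),
                "error": {"code": -32601, "message": "method not found"}}
    return {"jsonrpc": "2.0", "id": req.get("id"), "result": result}

# A real stdio server would loop here: read one JSON-RPC message
# per line from stdin, write each response as one line to stdout.
```

The client (now llama-server) does the rest: it lists the tools, injects them into the model's tool-calling template, and routes the model's calls back to the server.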
Also merged this week: an automatic parser generator for structured output, a ~30% prompt-processing speedup for Kimi Linear, and a significant token-generation speedup specifically for Qwen3.5 models. If you haven't recompiled in a few weeks, now's the time.
One more: Tool calling for Qwen models using XML-tagged format got a proper fix. The parsing was broken for arguments that needed to be in a specific format — that's resolved now.
📖 We updated our llama.cpp vs Ollama vs vLLM comparison with all of this today.
Qwen3-Coder-Next Is the Best Open-Source Coding Model. Period.
It showed up quietly, but the numbers are hard to argue with. Qwen3-Coder-Next hit #1 on SWE-rebench at Pass@5 — not just among open models, but all models, including closed ones. It's an 80B MoE with only 3B active parameters, so it's fast despite the parameter count.
The practical story: 70.6% on SWE-bench Verified, meaning it can resolve real GitHub issues — read the repo, plan multi-file changes, and execute them. 256K context window. Apache 2.0 license. At Q4 it needs ~35-40GB, which means a Mac with 48GB+ unified memory or a dual-GPU PC.
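The memory figure is just weights-times-bits arithmetic. A back-of-envelope estimator, assuming roughly 4 bits per weight at Q4 (real GGUF quants mix precisions, so actual files vary by a few GB):

```python
def gguf_size_gb(params_b: float, bits_per_weight: float) -> float:
    """Rough in-memory size of a quantized model, in GB.

    Ignores embeddings kept at higher precision, metadata, and
    KV cache, so treat the result as a floor, not a ceiling.
    """
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

# 80B total parameters at ~4 bits/weight lands right in the
# quoted ~35-40GB range before KV cache overhead.
size = gguf_size_gb(80, 4.0)  # 40.0 GB
```

Note that for an MoE it's the *total* parameter count (80B) that sets the memory bill; the 3B active count only buys you speed.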
The catch: No thinking mode. It's an instruct model, not a reasoning model. Some users report it struggles with complex architectural decisions where step-by-step reasoning helps. For agentic coding — where the model is reading, planning, and executing — it's the best thing running locally right now.
Unsloth just requantized it with their new KLD metric. The old GGUFs had MXFP4 layers that hurt quality — the new quants are clean.
📖 We updated our Best Local Coding Models guide with full Qwen3-Coder-Next coverage today.
vLLM 0.17.0 Shipped
The big production inference engine hit a major version. Notable for the local crowd: if you're on CUDA 12.9+ and hit a CUBLAS error after updating, it's a known library mismatch — check their release notes for the fix.
vLLM remains the choice for multi-user serving and high-throughput batch work. If you're running inference for just yourself, llama.cpp or Ollama is still simpler.
ik_llama.cpp: 5x Faster Prompt Processing on CPU
A fork called ik_llama.cpp is getting attention for dramatically outperforming mainline llama.cpp on CPU inference. Testing on a Zen5 laptop CPU with Qwen3.5 4B IQ4_XS showed 5x prompt processing speed and 1.7x token generation compared to mainline.
If you're running models on CPU (no GPU, or offloading), this is worth trying. It's not a drop-in replacement for every use case, but for CPU-bound inference it's a significant improvement.
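To see why the prompt-processing number matters more than it sounds, here's a quick latency model. The 5x/1.7x ratios are from the report above; the absolute baseline speeds are illustrative, not measured:

```python
def response_time(prompt_toks: int, gen_toks: int,
                  pp_speed: float, tg_speed: float) -> float:
    """Seconds to process the prompt, then generate the reply."""
    return prompt_toks / pp_speed + gen_toks / tg_speed

# Illustrative CPU baselines (assumed, not measured): 40 tok/s
# prompt processing, 8 tok/s generation for a small quant.
base = response_time(2000, 300, 40, 8)            # 50 + 37.5 = 87.5 s
fork = response_time(2000, 300, 40 * 5, 8 * 1.7)  # 10 + ~22.1 s
```

On a long prompt, most of your wait is prompt processing, so a 5x speedup there cuts total response time by more than half even though generation only improves 1.7x.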
OpenClaw Security: The Numbers Keep Getting Worse
RankClaw audited all 14,706 skills in the OpenClaw marketplace and found 1,103 to be malicious — up from 341 when Koi Security first reported the problem in February. OpenClaw has partnered with VirusTotal for skill scanning, but the fundamental problem remains: marketplace vetting is still weak.
Meanwhile, a new CVE dropped this week — CVE-2026-28458, a Browser Relay auth bypass that let malicious websites steal browser session data through the Chrome DevTools Protocol. It was present in every version from 2026.1.20 through 2026.2.0. Fixed in 2026.2.1.
If you're running OpenClaw: update to 2026.2.1+, audit your installed skills, and read our January and February security reports.
Quick Hits
- GPT-5.4 launched with native computer-use capabilities and 1M token context. Benchmarks are strong but refusal rates are the highest of any OpenAI model so far.
- $70 house-call OpenClaw installs are a real business in China. Taobao sellers charge 100-500 RMB for remote or in-person setup. The local AI side hustle economy is real.
- Qwen3.5 family benchmarks compared — 122B, 35B, and 27B retain most of the flagship's performance. The 2B and 0.8B fall off hard on long-context and agent tasks.
- LTX 2.3 dropped for video generation — 22B model, GGUF quants available, running on 12GB VRAM cards. The open-source video generation space is heating up.
- ETH Zurich study confirms more context doesn't mean better agents — LLM-generated context files actually hurt performance on 138 real GitHub tasks.
- Anthropic designated a supply-chain risk by the Pentagon after refusing unrestricted military AI access. The story is still developing.
That's the week. Next edition drops next Saturday.
— InsiderLLM
Running local AI on weird hardware? Built something novel with it? We're always looking for real benchmarks and creative local AI applications. Drop us a line at hello@insiderllm.com
You're getting this because you signed up at insiderllm.com. Unsubscribe