InsiderLLM Weekly — Feb 28, 2026
This Week in Local AI
Welcome to the first InsiderLLM Weekly. No fluff. No hype cycles. Just what happened in local AI this week and what it means for your hardware.
Qwen 3.5 Landed. It's the Real Deal.
Qwen dropped three models this week — a 27B dense, a 35B MoE (3B active), and a 122B MoE — and the local AI community lost its collective mind. Here's why:
The 35B-A3B is the story. Only 3 billion active parameters, but it's replacing setups that needed two models. Users with 64GB M1 Macs are running it as their sole daily driver. On an RTX 5060 Ti (16GB), it generates 41 tok/s at 100K context. On CPU, expect 5-6 tok/s at Q4.
The 27B dense punches way above its weight. Artificial Analysis scores it 42 on their Intelligence Index, the highest of any model under 230B parameters. It matches DeepSeek-V3.2 on raw reasoning benchmarks, and it fits comfortably on a single 16GB card at Q4.
The 122B is for the multi-GPU crowd. Three 3090s (72GB total) get you 25 tok/s with full context on GPU. It nails every reasoning benchmark thrown at it, including the infamous car wash test.
The catch: these models overthink. The MoE in particular burns thinking tokens second-guessing itself. Community workarounds include disabling thinking mode for direct chat and using system prompts to constrain the reasoning budget.
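A minimal sketch of those workarounds, as a request payload for a local OpenAI-compatible server. Assumptions to flag: the /no_think soft switch is documented for Qwen 3's chat template, and whether Qwen 3.5 keeps it is an open question; the model name below is a placeholder.

```python
def build_request(user_msg: str, disable_thinking: bool = True) -> dict:
    """Build an OpenAI-compatible chat payload that discourages overthinking."""
    content = user_msg
    if disable_thinking:
        # Soft switch: Qwen 3 chat templates treat /no_think in the prompt
        # as "answer directly, skip the thinking block". Assumed to carry
        # over to 3.5 -- verify against the model card.
        content = "/no_think " + user_msg
    return {
        "model": "qwen3.5-35b-a3b",  # placeholder name for your local server
        "messages": [
            # Belt-and-suspenders: a system prompt constraining the
            # reasoning budget helps when the soft switch is unsupported.
            {"role": "system",
             "content": "Answer directly. Keep any reasoning brief."},
            {"role": "user", "content": content},
        ],
        "max_tokens": 512,
    }

payload = build_request("What's the capital of France?")
```

Send the payload to whatever local endpoint you run (llama-server, Ollama's OpenAI-compatible route, etc.); the structure is the same either way.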
Quant situation is still settling — Unsloth had a bug in early GGUFs (now fixed). Q4_K_M is the sweet spot for most setups.
📖 We're publishing a full hardware-matched guide this weekend — which model, which quant, which card. Watch for it.
RTX 5060 Ti: The New Budget Local AI Card
Real benchmarks are in from community users, and the 16GB 5060 Ti is looking like the best price-to-performance card for local inference right now.
The headline number: Qwen 3.5-35B-A3B at Q4, 100K context, 41 tok/s generation, 700+ tok/s prefill. On a $400 card. That's not a typo.
Quantizing the KV cache to Q8 is a confirmed free lunch: no measurable quality loss, and it roughly halves KV-cache VRAM versus FP16. The --fit on flag in llama.cpp helps maximize what you can load.
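A sketch of what that looks like as a llama-server invocation. The model path is a placeholder, and flag spellings vary between llama.cpp builds (older builds use a bare -fa toggle, and --fit is a recent addition), so check your build's --help before copying.

```shell
# Placeholder GGUF path -- substitute your actual Q4_K_M file.
llama-server \
  -m ./qwen3.5-35b-a3b-Q4_K_M.gguf \
  -c 100000 \                 # 100K context
  -ngl 99 \                   # offload all layers to GPU
  -fa on \                    # flash attention (required for quantized V cache)
  --cache-type-k q8_0 \       # Q8 KV cache: the "free lunch"
  --cache-type-v q8_0 \
  --fit on                    # auto-fit to available memory (recent builds)
```

With the Q8 cache, the 100K context above costs roughly half the VRAM it would at the default FP16, which is what makes this fit on a 16GB card.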
If you're building a local AI rig on a budget, this is the card to beat right now.
DeepSeek V4: Coming Next Week
The Financial Times reports DeepSeek will release V4 next week with image and video generation capabilities. Two things worth noting:
Huawei gets early access. Nvidia and AMD do not. This is the first major open-weight model release where US chipmakers are explicitly locked out of early access.
The model is expected to be multimodal from day one — not a text model with vision bolted on later. If the weights drop publicly, expect the local community to have it running within hours.
We'll have a setup guide ready the moment weights are available.
Quick Hits
- llama.cpp pushed 4 releases this week (b8155–b8170). Key fixes: AMX support and batched processing.
- Ollama v0.17.3–v0.17.4 fixed Qwen 3/3.5 tool calling, which was broken when thinking mode was enabled.
- Ubuntu 26.04 LTS will ship with auto-detected CUDA/ROCm drivers out of the box. Local AI on Linux is about to get a lot easier.
- Google paper found that longer chain-of-thought actually correlates negatively with accuracy (r = -0.54). More thinking ≠ better answers.
- KV-cache sharing between agents: a new approach that passes KV-cache instead of text between agents in multi-agent setups showed 73-78% token savings across Qwen, Llama, and DeepSeek.
- Anthropic refused the Pentagon's demands for unrestricted AI access. Trump ordered agencies to stop using Anthropic tech. Employees at Google and OpenAI signed an open letter supporting Anthropic's position.
That's the week. Next edition drops next Saturday.
— InsiderLLM
Running local AI on weird hardware? Built something novel with it? We're always looking for real benchmarks and creative local AI applications. Drop us a line at hello@insiderllm.com
You're getting this because you signed up at insiderllm.com.