Your SSD Is the New GPU
InsiderLLM Weekly, Issue 4 — March 22, 2026
This Week in Local AI
The biggest local AI story this week isn't a new model. It's a new way to run the ones we already have.
Flash-MoE: 397 billion parameters, from your SSD
Dan Woods built a pure C/Metal inference engine that runs Qwen3.5-397B-A17B on a MacBook Pro M3 Max with 48GB RAM. The model is 209GB on disk. The engine keeps 5.5GB resident in memory. Everything else streams from your NVMe SSD on demand, at 4.4 tokens per second.
The trick is the MoE architecture. Qwen3.5-397B has 512 experts per layer, but only a handful fire per token; Flash-MoE routes to just 4. It loads those 4 (~6.75MB each) via parallel pread() calls, runs them on Metal shaders, and moves on. The OS page cache handles repeat experts, hitting about 71% of the time. No Python, no PyTorch, no frameworks. Just C, Objective-C, and 1,200 lines of hand-tuned Metal kernels.
The 4-bit config at 4.4 tok/s is the production choice. It handles tool calling and JSON output correctly. A 2-bit config hits 5.7 tok/s but breaks structured output — the repo's own issue tracker documents it producing \name\ instead of "name" in JSON. Simon Willison flagged the thin evaluation methodology behind the project's claim that 2-bit quality matches 4-bit.
There's a real quality question beyond quantization. Flash-MoE activates 4 experts per token instead of the standard 10 to stay within SSD bandwidth. Woods reports the "biggest quality drop-off occurred at 3," but dropping from 10 to 4 is still a hit. One HN commenter called it "particularly misleading" when stacked on top of aggressive quantization. A well-tuned 30B model at 4-bit, fully in RAM at 25 tok/s, might produce better results for practical work.
Still — this is a 397B model running on a $3,000 laptop. Woods built it using Claude Code with Karpathy's autoresearch pattern, running 90 experiments along the way. The project hit 199+ points on Hacker News. An M5 Pro user reported 6.55 tok/s. The technique scales with SSD speed, and the concept isn't Mac-specific even though this implementation is. "Will it fit in RAM" is no longer the only question for MoE models.
📖 Full guide here.
ik_llama.cpp: the fork that's hard to ignore
ik_llama.cpp has been picking up steam all month, and this week's numbers make it hard to dismiss as a curiosity.
A user running Qwen3.5 27B on an NVIDIA RTX PRO 4000 (Blackwell, 24GB) reported 26x faster prompt processing than mainline llama.cpp. That's not a typo. Earlier benchmarks showed a 5x prompt-processing speedup on a Zen5 laptop CPU with Qwen3.5 4B. Multiple Reddit posts scored 9-13 upvotes this week. The community is noticing.
The fork ships features mainline doesn't have: FlashMLA for DeepSeek models, SOTA quantization types (the IQ_K series), row-interleaved quant packing, graph-split multi-GPU, and first-class BitNet support. If you're CPU-bound or hybrid GPU/CPU, the speedups are real.
The caveats are equally real. CPU (AVX2+ or ARM_NEON+) and CUDA only. No Metal, no ROCm, no Vulkan. Unsloth _XL quants don't work. And this is a permanent fork — the maintainer confirmed no plans to upstream any of it. If you're on Mac or AMD, this isn't for you. If you're on an Intel or AMD CPU doing offload, or running NVIDIA with big prompt loads, it's worth testing.
GitHub: ikawrakow/ik_llama.cpp
OpenClaw security: the numbers keep climbing
RankClaw's latest audit counts 1,184 malicious skills on ClawHub, up from 341 when Koi Security first flagged the problem in January and 820 in early March. The growth rate isn't slowing down.
A new CVE dropped this week: CVE-2026-28458, a Browser Relay auth bypass present in versions 2026.1.20 through 2026.2.0, fixed in 2026.2.1. Meanwhile, Langflow (145K GitHub stars) got hit by CVE-2026-33017 — unauthenticated remote code execution, exploited in the wild within 20 hours of disclosure. Attackers harvested API keys from live instances.
The pattern is clear: self-hosted AI tools with default-insecure configs are getting hammered. If you're running OpenClaw or Langflow, update now.
📖 Our January report is here. February report here.
Quick Hits
- Alibaba confirms continued Qwen open-sourcing. Posted on ModelScope and spreading fast on Reddit (597 upvotes in 7.3 hours). New Wan video generation models likely incoming. The open-source AI pipeline out of Hangzhou isn't slowing down.
- Qwen3.5 35B running on an iPhone at 5.6 tok/s. Same SSD expert-streaming concept as Flash-MoE, ported to iOS Metal. Only 3B active parameters need to be in memory at once. MoE architecture makes phones viable for models that have no business running on phones.
- Sam Altman called AI ads "uniquely unsettling" and a "last resort." OpenAI is now running Peloton and Target ads in ChatGPT. Filed under: things that were inevitable.
- llama.cpp shipped 3 releases this week (b8460, b8469, b8475). Notable: a parser bug fix that caused crashes after valid streaming output completed, and router sleep status reporting for multi-instance setups.
That's the week. Next edition drops next Sunday.
— InsiderLLM
Running local AI on weird hardware? Built something novel with it? We're always looking for real benchmarks and creative local AI applications. Drop us a line at hello@insiderllm.com
You're getting this because you signed up at insiderllm.com. Unsubscribe