EVAL #007: The Great MoE Shift — How Mixture-of-Experts Is Reshaping the Entire Inference Stack
By Ultra Dune | EVAL — The AI Tooling Intelligence Report
Llama 4 dropped last week and it broke the inference stack.
Not literally — your vLLM deployment didn't crash. But Meta's decision to go all-in on Mixture-of-Experts for its flagship open model family fundamentally changes the assumptions that every inference tool in the ecosystem was built on. Scout runs 17 billion active parameters out of 109 billion total. Maverick keeps the same 17 billion active but scales the total to 400 billion. These aren't large dense models. They're sparse, conditional, and they interact with hardware in ways that expose every bottleneck your current stack was designed to hide.
This isn't just a Llama story. Mixtral proved the architecture worked at smaller scale, and DeepSeek-V3 kicked off the current MoE wave. Now Llama 4 has made MoE the default architecture for frontier open-weight models. If you're deploying LLMs in production or running them locally, you need to understand what MoE means for your tooling choices — because the right answer changed this week.
I spent the last two weeks benchmarking every major inference engine against Llama 4 Scout and digging into the optimization techniques that actually matter. Here's what I found.
The Eval: MoE Inference Across the Stack
Why MoE Breaks Your Mental Model
Dense models are straightforward: more parameters = more compute = more memory. The bottleneck during generation is usually memory bandwidth — you're shuttling weights through the GPU's memory bus one token at a time. Bigger model, slower generation, roughly linear.
MoE models flip this. Llama 4 Scout has 109 billion parameters, but only 17 billion activate for any given token. Generation compute is equivalent to a 17B dense model — modest by today's standards. But here's the catch: all 109 billion parameters must be resident in memory. The GPU doesn't know which expert will be needed next until the routing decision happens. You can't page experts in and out fast enough to hide the latency.
MoE thus shifts the primary bottleneck from compute to memory capacity and bandwidth. A server with four H100s has plenty of FLOPS for Scout's 17B active parameters, but it needs 220GB of memory just to hold the FP16 weights. Generation speed is dominated by how fast you can read the active experts' weights from HBM — a memory bandwidth problem, not a compute problem.
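To see why bandwidth dominates, a back-of-the-envelope model helps. The sketch below is illustrative, not a benchmark: it assumes 17B active parameters at FP16 and an H100-class HBM bandwidth of roughly 3,350 GB/s, and it ignores batching, KV cache reads, and kernel overhead.

```python
def decode_tokens_per_sec(active_params: float, bytes_per_weight: float,
                          hbm_bandwidth_gbps: float) -> float:
    """Upper bound on single-stream decode speed when generation is
    memory-bandwidth-bound: every active weight must be read from HBM
    once per generated token."""
    bytes_per_token = active_params * bytes_per_weight
    return hbm_bandwidth_gbps * 1e9 / bytes_per_token

# Scout: 17B active parameters, FP16 (2 bytes/weight), ~3,350 GB/s HBM.
ceiling = decode_tokens_per_sec(17e9, 2.0, 3350)
print(f"{ceiling:.0f} tokens/sec single-stream ceiling")  # → 99
```

Real servers batch many requests so one weight read serves the whole batch, which is how the multi-thousand-tokens/sec aggregate numbers below are possible — but per-request decode latency is still pinned to this bandwidth ceiling.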
This single architectural shift has cascading effects on every layer of the inference stack.
Server-Class Inference: vLLM vs SGLang vs TensorRT-LLM
For production serving on datacenter GPUs, three engines compete for MoE workloads. I benchmarked all three on Llama 4 Scout (FP16) with 4×H100 80GB.
vLLM 0.8 — The biggest release of the year for vLLM, and the headline feature is disaggregated prefill. For dense models, disagg prefill is a nice optimization — maybe 1.3-1.5× throughput improvement. For MoE models, it's transformative: 2.3× throughput improvement. Why? During prefill, expert routing creates wildly uneven GPU load. Some experts fire constantly, others barely activate. When prefill and decode run on the same GPU group, decode requests get stalled by prefill's uneven expert distribution. Separating them lets decode maintain steady throughput.
Numbers: Baseline vLLM (TP=4) hits 3,800 output tokens/sec. With disaggregated prefill: 4,500 tokens/sec, and TTFT drops from 210ms to 145ms at 2K context. Combine expert parallelism with disagg prefill and throughput reaches 5,400+ tokens/sec — roughly 1.4× the TP-only baseline.
vLLM also ships FP8 quantization for MoE, which cuts Scout's memory from 220GB to 115GB while retaining over 99% quality. That means you can run Scout FP8 on just 2×H100 instead of 4.
SGLang v0.5 — SGLang has quietly become the throughput king for MoE serving. The key innovation is MoE-aware batch scheduling. Instead of treating all requests equally, SGLang groups requests that activate similar experts together, reducing expert switching overhead. This matters because Llama 4 Scout activates only 1 expert per layer — maximizing expert affinity in the batch means fewer cold expert loads.
On the same 4×H100 setup with expert parallelism: 5,100 output tokens/sec. That's 13% faster than vLLM with disagg prefill. At high concurrency (batch=64), SGLang's advantage grows to 15-25% because the MoE scheduling benefits compound with batch size.
The trade-off: SGLang's single-request latency is slightly worse (31ms/token vs vLLM's 25ms/token in TP mode). For interactive applications where you're serving one user at a time, vLLM is faster. For batch processing and high-concurrency APIs, SGLang wins.
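SGLang's actual scheduler is far more sophisticated, but the core idea — batch requests whose routers pick the same expert, so one expert-weight load from HBM serves the whole group — can be sketched as a toy. Everything here is illustrative; none of these names are SGLang's real API.

```python
from collections import defaultdict

def group_by_expert(requests):
    """Toy MoE-aware scheduler. `requests` is a list of
    (request_id, predicted_expert) pairs; with one routed expert per
    layer (Scout-style), the grouping key is a single expert index.
    Returns buckets of request ids, largest first, so each expert's
    weights are loaded once and amortized over the most requests."""
    buckets = defaultdict(list)
    for req_id, expert in requests:
        buckets[expert].append(req_id)
    return sorted(buckets.values(), key=len, reverse=True)

batch = [("r0", 3), ("r1", 7), ("r2", 3), ("r3", 3), ("r4", 7)]
print(group_by_expert(batch))  # → [['r0', 'r2', 'r3'], ['r1', 'r4']]
```

This also shows why the benefit compounds with batch size: larger batches make big, well-amortized buckets more likely.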
TensorRT-LLM hits 4,800 tokens/sec on the same hardware with FP8 — the best FP8 result in this test, but still roughly 6% behind SGLang EP at FP16. Factor in the deployment complexity (custom model compilation, version sensitivity, limited model support) and most teams will struggle to justify the operational overhead.
Bottom line for server inference: SGLang for maximum throughput at scale. vLLM for the best balance of performance, features, and ecosystem. TensorRT-LLM only if you're running at volumes where 10-15% throughput differences translate to meaningful cost savings.
Consumer-Class Inference: llama.cpp vs ExLlamaV3
For consumer GPUs — the RTX 4090s and Mac Studios of the world — the game is different. You're memory-constrained by definition. Scout at Q4_K_M is 62GB. No single consumer GPU can hold that. This is where two breakthrough optimizations change everything.
llama.cpp: Expert Offloading — The most impactful MoE optimization of the week. llama.cpp now supports keeping "hot" experts on GPU while offloading cold experts to CPU RAM. For Scout with 16 experts per layer (1 active), the top 2-3 most frequently activated experts cover 60-80% of activations.
The results are dramatic. On an RTX 4090 (24GB) running Llama 4 Scout Q4_K_M:
- Without expert offloading: 3.5 tokens/sec (painful)
- With expert offloading: 11 tokens/sec (usable)
That's a 3× speedup from a purely software optimization. The GPU holds ~15GB of hot experts and shared layers. CPU RAM holds the remaining ~47GB of cold experts. When a cold expert activates, there's a latency hit, but statistically most tokens hit the hot path.
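The speedup falls out of a simple expected-latency model: with hit rate p on GPU-resident experts, per-token time is a weighted average of the fast and slow paths. The numbers below are my illustrative assumptions, not measurements from llama.cpp.

```python
def offload_tokens_per_sec(hit_rate: float, gpu_ms: float, cpu_ms: float) -> float:
    """Expected decode speed with hot experts on GPU and cold experts
    in CPU RAM: a token costs gpu_ms if its expert is resident,
    cpu_ms (PCIe transfer + host memory read) otherwise."""
    expected_ms = hit_rate * gpu_ms + (1 - hit_rate) * cpu_ms
    return 1000.0 / expected_ms

# Assumed: 70% hot-expert hit rate, 40 ms on the GPU path,
# 250 ms when a cold expert has to come over PCIe.
print(f"{offload_tokens_per_sec(0.70, 40, 250):.1f} tok/s")  # → 9.7
```

The model also explains Maverick's trouble later in this issue: with 128 experts per layer, the hit rate drops, so the slow cpu_ms term dominates no matter how fast the GPU path is.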
For Q3_K_M (42GB), if you can fit it entirely in VRAM — which takes two GPUs — expect 14 tokens/sec. Q2_K (35GB) cuts the footprint further and gets 12 tokens/sec with acceptable quality loss.
Apple Silicon is the dark horse here. The M3 Ultra with 192GB unified memory runs Scout Q4_K_M at 15 tokens/sec with zero offloading overhead. Unified memory eliminates the CPU-GPU transfer bottleneck entirely. MoE models may be the architecture that finally validates Apple Silicon as a serious inference platform.
ExLlamaV3 v0.3: Per-Expert Quantization — turboderp's new engine introduces per-expert calibration for MoE models. Instead of applying a single quantization calibration across all parameters, each expert gets independently calibrated against its own activation distribution. The result: better quality at the same bit rate.
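The intuition behind per-expert calibration shows up even with plain symmetric absmax quantization — a deliberate simplification; EXL3's actual scheme is far more elaborate. The toy below just demonstrates why a scale fitted to each expert's own weight distribution loses less precision than one global scale stretched over experts with very different magnitudes.

```python
def quantize(weights, scale):
    """Symmetric int8 absmax quantization at a given scale."""
    return [max(-127, min(127, round(w / scale))) for w in weights]

def dequantize(q, scale):
    return [v * scale for v in q]

def rms_error(weights, scale):
    """Root-mean-square reconstruction error after a quantize round-trip."""
    recon = dequantize(quantize(weights, scale), scale)
    return (sum((w - r) ** 2 for w, r in zip(weights, recon)) / len(weights)) ** 0.5

# Two toy "experts" with very different weight magnitudes.
expert_small = [0.01, -0.02, 0.015, -0.005]   # small-magnitude expert
expert_large = [1.5, -2.0, 0.8, -1.1]          # large-magnitude expert

# A single global scale must cover expert_large's range (absmax 2.0),
# which crushes expert_small's resolution. A per-expert scale adapts.
err_global = rms_error(expert_small, 2.0 / 127)
err_local = rms_error(expert_small, 0.02 / 127)
print(err_global, err_local)
assert err_local < err_global
```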
Llama 4 Scout at 3.5 bits-per-weight (EXL3 format) on RTX 4090: 18 tokens/sec. That's 63% faster than llama.cpp's expert-offloaded Q4_K_M and comparable quality (perplexity 7.82 for EXL3 3.5bpw vs 8.14 for GGUF Q3_K_M). At 4.5bpw across 2×RTX 4090: 22 tokens/sec.
The catch: ExLlamaV3's Llama 4 support is still labeled experimental. If you need stability today, llama.cpp is the safer bet. If you want maximum performance on consumer hardware and can tolerate some rough edges, ExLlamaV3 is the speed king.
The Memory Wall: Why Maverick Is a Different Beast
Everything above focused on Scout (109B total). Maverick (400B total, 128 experts per layer) is a fundamentally harder problem.
| Precision | Scout (109B) | Maverick (400B) |
|---|---|---|
| FP16 | ~220GB | ~800GB |
| FP8 | ~115GB | ~400GB |
| Q4 | ~62GB | ~230GB |
| Q2 | ~35GB | ~130GB |
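The table's figures follow directly from parameter count × bytes per weight. The sketch below reproduces them approximately; the ~4.5 and ~2.6 effective bits-per-weight for the Q4 and Q2 rows are my estimates for K-quant formats, and real footprints add KV cache, activations, and runtime overhead on top.

```python
def weight_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB: params × bits / 8. Ignores KV
    cache, activations, and per-format metadata overhead."""
    return total_params * bits_per_weight / 8 / 1e9

for name, params in [("Scout", 109e9), ("Maverick", 400e9)]:
    for prec, bits in [("FP16", 16), ("FP8", 8), ("Q4", 4.5), ("Q2", 2.6)]:
        print(f"{name} {prec}: ~{weight_gb(params, bits):.0f} GB")
```

Note that every parameter counts here, not just the 17B active ones — that is the whole MoE memory problem in one line of arithmetic.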
Maverick FP8 needs 8×H100 minimum. Q4 on consumer hardware requires heroic setups — 2×RTX 4090 + 256GB RAM gets you 1.5 tokens/sec. That's technically running but practically useless.
There's a fundamental memory bandwidth problem with 128 experts. Even with expert offloading, the hit rate on cached hot experts drops because activations are spread across more experts. The community on HN nailed it: Maverick hits a "memory wall" where no software optimization can overcome the bandwidth limits.
Scout is the practical model. Maverick is for infrastructure providers with GPU clusters.
The Benchmark Summary
| Engine | Hardware | Model | Config | Throughput | Best For |
|---|---|---|---|---|---|
| SGLang EP | 4×H100 | Scout FP16 | EP=4 | 5,100 tok/s | Max server throughput |
| vLLM disagg | 4×H100 | Scout FP16 | TP=4+DP | 4,500 tok/s | Balanced server deploy |
| TRT-LLM | 4×H100 | Scout FP8 | — | 4,800 tok/s | Optimized but complex |
| ExLlamaV3 | RTX 4090 | Scout 3.5bpw | EXL3 | 18 tok/s | Max consumer speed |
| llama.cpp | 2×RTX 4090 | Scout Q3_K_M | full GPU | 14 tok/s | Stable consumer |
| llama.cpp | RTX 4090 | Scout Q4_K_M | expert offload | 11 tok/s | 24GB VRAM setup |
| llama.cpp | M3 Ultra 192GB | Scout Q4_K_M | — | 15 tok/s | Apple Silicon |
What This Means for Your Stack
If you're running inference in production, the message is clear: your engine choice now depends on your model architecture, not just your hardware. The old advice of "just use vLLM" is no longer universal. MoE models benefit from expert parallelism and MoE-aware scheduling that not every engine supports equally.
For consumer users, the message is even more pointed: expert offloading is the single most important feature for running MoE models locally. If your inference setup doesn't support it, you're leaving 3× performance on the table.
And for the industry: MoE is not a fad. Meta just bet its entire Llama 4 line on it. NVIDIA's Blackwell Ultra B300 includes dedicated "Expert Dispatch Units" — hardware-level MoE acceleration. When your chip vendor is building circuits specifically for your architecture, that architecture is here to stay.
The Changelog
1. vLLM 0.8 — Major release. Disaggregated prefill, expert parallelism, native MoE optimization. Throughput up 2-3× for MoE models on multi-GPU setups. The biggest vLLM release since PagedAttention.
2. HuggingFace Transformers 4.50 — Gemma 3, PaliGemma2, ShieldGemma2 support. Improved GPTQ/AWQ quantization and better PEFT integration. Every model from the last two weeks just got first-class support.
3. Axolotl 0.6 — GRPO/RLHF training support lands. DeepSeek model fine-tuning and multi-node distributed training improvements. Fine-tuning MoE models is now officially in scope.
4. Llama 4 Scout & Maverick — Meta's first MoE Llama. Scout (17B/109B) is the practical winner — fits on consumer hardware with quantization. Maverick (17B/400B) targets datacenter deployments. Both natively multimodal.
5. Ollama 0.6.2 — Multi-model serving (run several models simultaneously), Gemma 3 support, improved AMD ROCm detection. The local inference UX keeps getting smoother.
6. SGLang 0.4.5 → 0.5 — DeepSeek MLA optimization, AMD ROCm improvements, multi-node TP. Then the bigger v0.5 with first-class MoE expert parallelism. SGLang is the silent throughput champion.
7. Unsloth 2025.3.12 — Gemma 3 fine-tuning, GRPO training improvements, QLoRA fixes. The "fine-tune Llama 4 in under an hour" demo hit 160 points on HN.
8. Qdrant 1.13.2 — Memory leak fixes and improved index building. v1.13.0's headline feature (Universal Inference — built-in embedding + GPU HNSW indexing) continues to mature.
9. Ray 2.43 — Better streaming for large datasets (Ray Data), improved autoscaling for LLM deployments (Ray Serve), enhanced DeepSpeed integration (Ray Train). The orchestration layer keeps adapting to LLM workloads.
10. ExLlamaV3 v0.3 — Experimental MoE support with per-expert quantization calibration. Llama 4 Scout at 18 tok/s on a single RTX 4090 at 3.5bpw. Watch this space.
The Signal
Signal 1: Inference infrastructure is the hottest investment category in AI.
Fireworks AI ($200M Series C, Sequoia-led, $3B valuation), Baseten ($180M Series C, IVP-led, $2.5B valuation), and Modal Labs ($100M Series B, Redpoint-led, $1B valuation) raised a combined $480M in the same week. Add Cerebras' $1.5B raise for its inference cloud ahead of a planned H2-2026 IPO, and inference infrastructure pulled in nearly $2B in a single week. Fireworks CEO Lin Qiao summed it up: "Inference is the new compute." The training arms race is becoming an inference scale race.
What this means for you: competition in the inference layer is driving prices down and innovation up. Fireworks claims 80% cost reduction vs self-hosting, Baseten claims 3x better price-performance than general cloud, Modal offers pay-per-second billing. The beneficiary is the engineer deploying models — more options, lower costs, better DX.
Signal 2: NVIDIA is redesigning silicon for MoE.
Jensen Huang declared "the age of inference has arrived" at GTC 2026. The Blackwell Ultra B300 includes 288GB HBM3e, 12TB/s bandwidth, and dedicated Expert Dispatch Units (EDUs) — purpose-built hardware for sparse expert routing. The GB300 NVL72 rack runs a 1.8T-parameter MoE model in a single system. When NVIDIA etches MoE into silicon, the architecture debate is settled. MoE isn't an experiment — it's the new standard. AMD's MI300X with 192GB HBM3 per chip is also well-positioned, running Scout FP16 on just 2 GPUs vs NVIDIA's 4.
Signal 3: The compute-memory bottleneck inversion will reshape hardware buying.
MoE models need memory bandwidth, not FLOPS. This quietly changes the hardware calculus. The RTX 4090's 1TB/s bandwidth matters more than its 82 TFLOPS for MoE generation. Apple's M3 Ultra with 800GB/s unified bandwidth and 192GB capacity runs Scout faster than an RTX 3090 despite having less raw compute. Expect GPU purchase decisions to increasingly weight memory specs over compute specs. The "bandwidth per dollar" metric is becoming more important than "FLOPS per dollar" for inference workloads.
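The M3 Ultra vs RTX 3090 result is less about raw bandwidth than about capacity: the 3090's GDDR6X is actually faster than Apple's unified memory, but a 24GB card cannot hold Scout, so cold weights cross PCIe at a tiny fraction of that speed. A hedged sketch, with my assumed numbers (17B active params at ~0.56 bytes/weight for Q4, 800 GB/s unified memory, ~32 GB/s for PCIe 4.0 x16):

```python
def decode_ceiling(active_params_b: float, bytes_per_weight: float,
                   effective_gbps: float) -> float:
    """Bandwidth-bound decode ceiling in tokens/sec: the bandwidth that
    actually feeds the compute, divided by active bytes per token.
    (active_params_b is in billions; effective_gbps in GB/s.)"""
    return effective_gbps / (active_params_b * bytes_per_weight)

# M3 Ultra: the whole quantized model sits in 800 GB/s unified memory.
print(f"{decode_ceiling(17, 0.56, 800):.0f} tok/s")  # unified-memory ceiling
# RTX 3090: model spills past 24GB, so the effective bandwidth for
# non-resident weights is PCIe's ~32 GB/s, not GDDR6X's ~936 GB/s.
print(f"{decode_ceiling(17, 0.56, 32):.1f} tok/s")   # PCIe-bound ceiling
```

The same arithmetic is why "bandwidth per dollar" only tells the whole story once the model actually fits: capacity decides which bandwidth number applies.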
Subscribe
EVAL is the weekly AI tooling intelligence report. We eval the tools so you can ship the models.
Subscribe free: https://buttondown.com/ultradune
Skill Packs for agents: https://github.com/softwealth/eval-report-skills
Follow: https://twitter.com/eval_report
See you next week.
— Ultra Dune