Gemma 4 Just Made My 3090 Three Times Faster
I ran Gemma 4 on my own RTX 3090 today. Three times faster than Qwen 3.6 on the same hardware, same build. The numbers below, plus the flag combination that almost broke my bench.
Gemma 4 Hit 3.10x Faster on My RTX 3090
Same RTX 3090, same llama.cpp mainline build, same bench script. Gemma 4 26B-A4B Q4_K_XL: 128 tok/s. Qwen 3.6-27B Q4_K_M: 41 tok/s. Both fit in roughly 17 GB VRAM. Same Q4 quant tier. Same nine prompts from am17an's gist bench. Speedup: 3.10x.
The architectural reason holds up under inspection. Qwen 3.6-27B is dense -- every token uses all 27B params for the forward pass. Gemma 4 26B-A4B activates 8 of 128 experts per token, plus shared layers. Active params per decode step are far fewer than the total 26B (the A4B suffix points to roughly 4B). The 3090 is bandwidth-bound on decode, so less weight moved per token directly buys more tokens per second. Total VRAM is similar because all experts must live in memory; only a fraction get used each step.
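To put rough numbers on the bandwidth argument, here is a back-of-envelope sketch. It is not part of the bench. The 936 GB/s figure is the RTX 3090's spec-sheet memory bandwidth, the 4.5 bits per weight is an assumed average for a Q4-class quant, and the 4B active-param figure is read off the "A4B" suffix rather than measured.

```python
# Back-of-envelope decode ceiling for a bandwidth-bound GPU.
# Assumptions (not from the bench): ~936 GB/s peak memory bandwidth,
# ~4.5 bits per weight for a Q4-class quant, ~4B active params for the MoE.

def decode_ceiling_tok_s(active_params: float, bits_per_weight: float,
                         bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/s if every active weight is read once per token."""
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

BANDWIDTH_GB_S = 936.0   # RTX 3090 spec-sheet figure
BITS_PER_WEIGHT = 4.5    # assumed Q4-class average

dense = decode_ceiling_tok_s(27e9, BITS_PER_WEIGHT, BANDWIDTH_GB_S)
moe = decode_ceiling_tok_s(4e9, BITS_PER_WEIGHT, BANDWIDTH_GB_S)

print(f"dense 27B ceiling:      {dense:.0f} tok/s")
print(f"MoE ~4B active ceiling: {moe:.0f} tok/s")
print(f"ceiling ratio:          {moe / dense:.2f}x")
```

Under these assumptions both measured numbers (41 and 128 tok/s) sit below their ceilings, and the measured 3.10x is smaller than the ceiling ratio. That is the direction you would expect, since expert routing, attention, and KV-cache reads are not counted here.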
The catch: Gemma 4 forces a reasoning trace by default. Output goes to the reasoning_content field instead of content, so the bench script reads zero tokens. The fix is --jinja plus --chat-template-kwargs '{"enable_thinking":false}'. Miss that combination and you'll spend an hour wondering why generation looks broken. The load log will still print thinking = 1 -- cosmetic bug, output is correct.
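A minimal sketch of that flag combination, assuming the bench talks to llama-server over its OpenAI-compatible endpoint. The model filename is hypothetical, and the generated_text helper is an illustration of a defensive read, not the bench script's actual code.

```python
import json
import shlex

# Hypothetical model path; substitute your own GGUF file.
MODEL_PATH = "gemma-4-26b-a4b-q4_k_xl.gguf"

def server_command(model_path: str) -> list[str]:
    """Build a llama-server argv with thinking disabled via the chat template."""
    return [
        "llama-server",
        "-m", model_path,
        "--jinja",
        "--chat-template-kwargs", json.dumps({"enable_thinking": False}),
    ]

def generated_text(message: dict) -> str:
    """Return the answer text, falling back to reasoning_content if content is empty."""
    return message.get("content") or message.get("reasoning_content") or ""

# Print a copy-pasteable shell command with the JSON argument quoted safely.
print(shlex.join(server_command(MODEL_PATH)))
```

Building the JSON with json.dumps avoids shell-quoting mistakes in the kwargs string. The fallback read means a forgotten flag shows up as reasoning text rather than as a silent zero.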
Practical takeaway: if you're on a 3090 and want speed, Gemma 4 is the new pick. If you want simpler ergonomics with no flag dance, Qwen 3.6 is still rock-solid. The full bench -- all nine prompts, both models same day -- lives in the article.
📖 The full firsthand bench is here.
DFlash vs MTP: I Tested Both on the Same 3090
Two speculative decoding methods for local inference, both shipping the same week. DFlash uses an external draft model plus tree verification; MTP uses extra layers baked into the same checkpoint. Both run on a single RTX 3090. Both speed up Qwen 3.6-27B Q4_K_M decode. Neither wins cleanly.
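For readers new to the idea, here is a toy sketch of the draft-and-verify loop that both methods build on. It is generic greedy speculative decoding with a linear draft. It is not DFlash's tree verification and not MTP's in-checkpoint heads, and the two "models" are made-up functions over integers.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function

def speculative_step(target: NextToken, draft: NextToken,
                     context: List[Token], k: int) -> List[Token]:
    """One draft-and-verify step; output matches greedy target decoding."""
    # The cheap draft proposes k tokens autoregressively.
    proposed: List[Token] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # The target checks each position. A real engine does this in one
    # batched forward pass; here it is a plain loop for clarity.
    accepted: List[Token] = []
    for tok in proposed:
        expected = target(context + accepted)
        if tok != expected:
            accepted.append(expected)  # correct the first mismatch, then stop
            return accepted
        accepted.append(tok)
    # Every draft token matched, so the verify pass yields one bonus token.
    accepted.append(target(context + accepted))
    return accepted

# Toy models: the target counts up by 1; the draft is wrong after multiples of 3.
def target_model(ctx: List[Token]) -> Token:
    return ctx[-1] + 1

def draft_model(ctx: List[Token]) -> Token:
    return ctx[-1] + (2 if ctx[-1] % 3 == 0 else 1)

out: List[Token] = [1]
steps = 0
while len(out) < 13:
    out.extend(speculative_step(target_model, draft_model, out, k=3))
    steps += 1

print(out, f"({steps} verify steps)")
```

The speedup comes from the number of tokens kept per verify step, which depends on how often the draft agrees with the target. That is one reason results from different prompt suites are hard to compare.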
The numbers: DFlash 2.56x. MTP 1.50x. Same hardware, different prompt suites -- DFlash on the LuceOrg academic benches (HumanEval, GSM8K, Math500), MTP on am17an's mixed-prompt gist. Treat the comparison as directional, not apples-to-apples. The full methodology caveats are in the article.
Where DFlash wins: raw decode speedup, mature 3.5 draft. Where MTP wins: single GGUF, mainline llama.cpp, no fork required. The composition experiment -- MTP draft tokens fed into DDTree verification -- is the obvious next bench. Nobody has built it yet.
📖 The full head-to-head with methodology caveats is here.
The Gemma 4 Launch Wave
Today's r/LocalLLaMA had three Gemma 4 posts in the top 20, plus one Qwen thread:
- Score-16: Gemma 4 26B at 600 tok/s on a single RTX 5090.
- Score-14: Qwen 3.6 35B-A3B hype thread (still kicking).
- Score-12: MTP for Gemma 4 in llama.cpp, ~40% additional speedup on top of the base numbers.
- Score-8: NVFP4 GGUFs shipping for Blackwell -- reportedly smaller than Q4_K_M with comparable quality.
On the architecture side, Chris Hay published a May 5 video showing decoupled attention via LARQL -- attention runs locally on a small GPU while FFN experts stream from a remote CPU server. Gemma 4 26B-A4B works in the setup, with the residual stream passing over HTTP at ~41 MB per token. The demo showed 24 tok/s on a laptop with the FFN server on the same LAN.
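A quick arithmetic sketch of what those two figures imply for the network link. It assumes both numbers describe the same run, that ~41 MB is the total traffic per token, and that MB means decimal megabytes. None of that is confirmed by the video.

```python
MB_PER_TOKEN = 41        # residual-stream traffic per token, as quoted
TOKENS_PER_SECOND = 24   # decode rate quoted for the LAN demo

# Assumes decimal megabytes (1 MB = 1e6 bytes).
bytes_per_second = MB_PER_TOKEN * 1e6 * TOKENS_PER_SECOND
gbit_per_second = bytes_per_second * 8 / 1e9

print(f"{bytes_per_second / 1e6:.0f} MB/s is about {gbit_per_second:.2f} Gbit/s")
```

If both figures hold at once, the link has to sustain far more than gigabit Ethernet's 125 MB/s, so check your own LAN speed before expecting similar throughput.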
The pattern is clear. Speculative decoding, hybrid architectures, distributed inference -- the practical performance ceiling for consumer hardware just moved up about 3x in two weeks.
📖 Full Gemma 4 guide is here. LARQL coverage is here.
Quick Hits
- Unsloth releases MTP variants (today's r/LocalLLaMA score-11). Qwen 3.6 MTP weights now available alongside the standard checkpoints.
- DeepSeek V4 Flash at 85 tok/s @ 524K context on dual RTX PRO 6000 (today's score-9). The long-context story is real.
- FLUX.2 [klein] one-prompt-to-cinema pipeline trending (today's score-16). Stitched generation plus interpolation for short-form video from a single prompt.
- Qwen 3.6 35B-A3B still the workhorse for 16GB-VRAM owners -- daily r/LocalLLaMA threads keep finding new configs.
- NVFP4 GGUFs shipping for Gemma 4. Memory savings on every card, native acceleration only on Blackwell.
- Catalog housekeeping: 10 article refreshes shipped this week. Freshness tracker now reads raw access logs instead of the top-20 markdown table -- coverage jumped from 20 to 273 articles flagged.
That's the week. Next edition drops next Monday.
-- InsiderLLM
Running local AI on weird hardware? Built something novel with it? We're always looking for real benchmarks and creative local AI applications. Drop us a line at hello@insiderllm.com