====================================================================== EVAL -- The AI Tooling Intelligence Report Issue #001 | March 2026 ======================================================================
The Great LLM Inference Engine Showdown: vLLM vs TGI vs TensorRT-LLM vs SGLang vs llama.cpp vs Ollama
Hey there.
Welcome to the first issue of EVAL. No fluff, no hype cycles, no "10 AI tools that will CHANGE YOUR LIFE" listicles. Just one senior engineer talking to another about the tools that actually matter.
And for issue one, we're going straight for the jugular: inference engines.
Here's the uncomfortable truth nobody on Twitter will tell you: picking your LLM inference engine is one of the highest-leverage decisions you'll make in your AI stack, and most teams get it wrong. They either over-engineer it (congrats on your TensorRT-LLM setup that took three sprints to deploy and now needs a dedicated DevOps engineer to babysit) or under-engineer it (no, Ollama is not your production serving layer, please stop). I've watched teams burn entire quarters migrating between engines because they didn't do the homework upfront. Don't be that team.
So let's break down the six engines that matter in March 2026, with actual opinions instead of marketing copy.
====================================================================== THE QUICK RUNDOWN ======================================================================
Here's your cheat sheet. Pin this somewhere.
| Engine | Stars | Throughput* | Ease | Hardware | Vibe |
|---|---|---|---|---|---|
| vLLM v0.7.3 | ~50k | 1000-2000 | Med | GPU-first | Reliable workhorse |
| TGI v3.0 | ~10k | 800-1500 | Med | GPU-first | Corporate solid |
| TensorRT-LLM | ~10k | 2500-4000+ | Hard | NVIDIA only | Speed demon |
| SGLang v0.4 | ~10k | Very High | Med | GPU-first | Dark horse |
| llama.cpp | ~75k | 80-100** | Easy | Everywhere | Swiss army knife |
| Ollama | ~120k | Low | Trivial | Via llama.cpp | Gateway drug |
* tok/s on A100/H100 for Llama-70B class models, except where noted
** 7B model on M2 Ultra (CPU/Metal), not comparable to GPU numbers
All Apache 2.0 licensed except llama.cpp and Ollama (MIT). Yes, this matters when legal comes knocking.
====================================================================== THE DEEP DIVE: ENGINE BY ENGINE ======================================================================
--- vLLM v0.7.3 ------------------------------------------------- "The one you'll probably end up using"
Stars: ~50k | License: Apache 2.0
vLLM is the Honda Civic of inference engines. Is it the fastest? No. Is it the most exciting? No. Will it reliably get you from A to B without drama? Absolutely.
PagedAttention was genuinely revolutionary when it dropped -- treating KV cache like virtual memory pages was one of those "why didn't we think of this earlier" ideas. Continuous batching means you're not leaving GPU cycles on the table. The OpenAI-compatible API means your application code is basically engine-agnostic. That's huge.
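To make the virtual-memory analogy concrete, here's a toy sketch of the paging idea: carve the KV cache into fixed-size blocks and allocate them only as a sequence grows, so the worst-case waste per sequence is less than one block. The model dimensions are illustrative (Llama-70B-ish with grouped-query attention) and this is not vLLM's actual allocator.

```python
# Toy illustration of the PagedAttention idea: the KV cache is carved into
# fixed-size blocks allocated on demand, like virtual-memory pages.
# Dimensions are illustrative (Llama-70B-ish, GQA with 8 KV heads).

BLOCK_SIZE = 16  # tokens per KV block
LAYERS, KV_HEADS, HEAD_DIM, DTYPE_BYTES = 80, 8, 128, 2  # fp16

def kv_bytes_per_token() -> int:
    # K and V tensors, one pair per layer
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES

class PagedKVCache:
    """Allocate KV blocks lazily as each sequence grows."""
    def __init__(self):
        self.block_tables = {}  # seq_id -> list of block ids
        self.next_block = 0

    def append_token(self, seq_id: int, pos: int) -> None:
        table = self.block_tables.setdefault(seq_id, [])
        if pos % BLOCK_SIZE == 0:  # current block full -> grab a new page
            table.append(self.next_block)
            self.next_block += 1

    def allocated_tokens(self, seq_id: int) -> int:
        return len(self.block_tables[seq_id]) * BLOCK_SIZE

cache = PagedKVCache()
for pos in range(100):  # a 100-token sequence
    cache.append_token(seq_id=0, pos=pos)

print(kv_bytes_per_token())      # 327680 bytes, ~320 KiB per token
print(cache.allocated_tokens(0)) # 112 tokens reserved: 7 blocks, waste < one block
```

Compare that to reserving the full max-sequence-length up front per request, and you can see where the "not leaving GPU cycles on the table" framing comes from.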
The V1 engine became default in v0.7.0, and it shows. Things just work. Anyscale, IBM, Databricks, Cloudflare -- these aren't exactly hobby projects. When companies with serious SLAs pick your engine, that says something.
The honest downsides: GPU memory overhead is real. vLLM is hungry, and if you're trying to squeeze a 70B model onto the minimum viable GPU count, you'll feel it. AMD ROCm support exists but it's... let's call it "maturing." If you're on MI300X, budget extra time for debugging.
Best for: General-purpose production serving, teams that want a large community and proven reliability, OpenAI API drop-in replacement scenarios.
Verdict: Your default choice unless you have a specific reason to pick something else. The boring-but-correct answer.
--- TGI v3.0 ----------------------------------------------------- "The enterprise's chosen one"
Stars: ~10k | License: Apache 2.0
HuggingFace's Text Generation Inference is what happens when you have the world's largest model hub and decide you should also serve those models. The Rust+Python hybrid is genuinely clever -- Rust for the hot path, Python for the model loading and config. Flash Attention 2 integration is solid.
800-1500 tok/s on A100 for 70B models. Not chart-topping, but respectable. The real story here is ecosystem integration. If you're already on HuggingFace Inference Endpoints or Amazon SageMaker, TGI is the path of least resistance. Sometimes the best tool is the one that's already integrated.
The downsides are real though. That Rust codebase? Good luck if you're an ML engineer who needs to debug a serving issue at 3 AM. Cargo and PyTorch don't exactly play nice at the boundary. Model support consistently lags vLLM by a few weeks to months -- if you need day-one support for the latest architecture, look elsewhere.
Best for: Teams already invested in the HuggingFace ecosystem, SageMaker deployments, organizations that value corporate backing and support contracts over raw community size.
Verdict: Great if you're in the HuggingFace/AWS ecosystem. Otherwise, hard to justify over vLLM unless you really love Rust.
--- TensorRT-LLM v0.17 ------------------------------------------- "The speed freak's playground"
Stars: ~10k | License: Apache 2.0 (with caveats)
Let me be blunt: if you're serving on NVIDIA hardware and every millisecond matters, TensorRT-LLM is the answer. 2500-4000+ tok/s on H100 with FP8 quantization. That's not a typo. We're talking 10-30% faster than vLLM on equivalent NVIDIA hardware, sometimes more.
Perplexity uses it. Major cloud providers use it behind the scenes. When you need to serve millions of requests and your GPU bill looks like a mortgage, that 30% matters. It's real money.
But -- and this is a big but -- the developer experience is, to put it diplomatically, not great. The compilation step alone will make you question your career choices. You're building engine-specific plans for specific model configurations on specific hardware. Change your GPU? Recompile. Change your batch size? Recompile. Sneeze? Believe it or not, recompile.
It's NVIDIA-only. Obviously. This is a feature and a limitation depending on your worldview. The learning curve is steep enough that you should budget engineering time measured in weeks, not days.
Best for: High-traffic production serving where latency is a competitive differentiator, teams with strong CUDA/systems engineering talent, anyone whose GPU bill exceeds their rent.
Verdict: The right choice when you're at scale on NVIDIA and have the engineering team to support it. The wrong choice for nearly everyone else. If your team doesn't have at least one person who's comfortable reading CUDA kernels, think twice.
--- SGLang v0.4 -------------------------------------------------- "The one that might eat everyone's lunch"
Stars: ~10k | License: Apache 2.0
Okay, this is where it gets interesting. SGLang came out of UC Berkeley and LMSYS (the Chatbot Arena folks), and it's been quietly demolishing benchmarks while nobody was paying attention.
RadixAttention for prefix caching is elegant. The constrained decoding support is best-in-class. And the numbers are wild -- 3.1x faster than vLLM on DeepSeek V3 in their benchmarks. Now, take any "we're Nx faster" claim with appropriate skepticism (benchmark configurations matter), but even if you halve that number, it's impressive.
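The intuition behind prefix caching is easy to sketch: requests that share a token prefix reuse the cached KV for that prefix instead of re-running prefill. A minimal version, using a plain trie rather than SGLang's actual radix tree, and with made-up token values:

```python
# Toy sketch of the prefix-caching idea behind RadixAttention.
# A plain trie, NOT SGLang's radix tree; token values are arbitrary.

class PrefixCache:
    def __init__(self):
        self.root = {}

    def prefill(self, tokens):
        """Insert tokens, returning (cached, computed) token counts."""
        node, cached = self.root, 0
        for i, tok in enumerate(tokens):
            if tok in node:
                node, cached = node[tok], cached + 1
            else:
                # remaining suffix is new: insert it and "compute" its KV
                for t in tokens[i:]:
                    node = node.setdefault(t, {})
                return cached, len(tokens) - cached
        return cached, 0

SYSTEM = list(range(500))  # a 500-token shared system prompt
cache = PrefixCache()

first = cache.prefill(SYSTEM + [900, 901])   # cold: everything computed
second = cache.prefill(SYSTEM + [910, 911])  # warm: system prompt reused

print(first)   # (0, 502)
print(second)  # (500, 2)
```

With a long shared system prompt, the second request prefills 2 tokens instead of 502 -- that's the whole trick, and it's why the win is largest on workloads with heavy prefix overlap.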
xAI chose it for Grok. LMSYS runs their arena on it. These are demanding workloads with smart people making the decisions.
The catch: smaller community means fewer Stack Overflow answers when things break. It's less battle-tested in diverse production environments. The documentation is improving but still has that "academic project" feel in places. You're betting on a trajectory here, not a track record.
Best for: Research teams, structured output heavy workloads, anyone serving models with shared system prompts across requests, teams willing to be early adopters for a potentially big payoff.
Verdict: The most exciting engine in this list. If I were starting a new project today with GPU serving needs, I'd seriously evaluate SGLang before defaulting to vLLM. Watch this space closely.
--- llama.cpp ----------------------------------------------------- "The cockroach (complimentary)"
Stars: ~75k | License: MIT
llama.cpp will survive the apocalypse. Georgi Gerganov's C/C++ masterwork runs on literally everything: CUDA, ROCm, Metal, Vulkan, SYCL, CPU, and yes, even WebAssembly. The GGUF format has become a de facto standard for local model distribution. If you've ever downloaded a model from a random person on HuggingFace, it was probably GGUF.
80-100 tok/s for a 7B model on M2 Ultra via Metal. Not going to win any datacenter benchmarks, but that's not the point. The point is that it runs, everywhere, on everything, with minimal fuss. The quantization support is extraordinary -- from Q2_K to Q8_0, you can trade quality for speed with granularity that the GPU engines don't touch.
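The size/quality tradeoff is easy to put numbers on. Back-of-envelope, assuming approximate effective bits-per-weight figures (K-quants store scales and mins alongside weights, so these are ballpark, not exact GGUF file sizes):

```python
# Rough file-size math for llama.cpp-style quantization of a 7B model.
# Bits-per-weight values are approximate effective rates, not a spec.

PARAMS = 7e9
BITS_PER_WEIGHT = {
    "F16":    16.0,
    "Q8_0":    8.5,
    "Q4_K_M":  4.85,
    "Q2_K":    2.6,
}

def size_gib(fmt: str) -> float:
    bits = PARAMS * BITS_PER_WEIGHT[fmt]
    return bits / 8 / 2**30  # bits -> bytes -> GiB

for fmt in BITS_PER_WEIGHT:
    print(f"{fmt:7s} ~{size_gib(fmt):5.1f} GiB")
```

Going from F16 (~13 GiB) to Q4_K_M (~4 GiB) is the difference between "doesn't fit on my laptop" and "runs fine on my laptop," which is exactly the niche llama.cpp owns.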
It's also the foundation that Ollama is built on, which means it indirectly powers the local AI experience for millions of developers.
The limitation is obvious: this is not a high-throughput serving solution. If you're trying to serve concurrent users at scale, you need one of the GPU-first engines above. llama.cpp is for running models locally, for edge deployment, for weird hardware, for places where a Python runtime is a luxury you can't afford.
Verdict: Indispensable for local/edge use cases. The widest hardware support in the ecosystem by a country mile. Not your production serving engine (and it doesn't pretend to be).
--- Ollama -------------------------------------------------------- "The people's champion"
Stars: ~120k | License: MIT
120,000 GitHub stars. Let that sink in. Ollama has more stars than any other project on this list, and it's fundamentally a Go wrapper around llama.cpp with a nice CLI and a model registry.
And you know what? That's exactly what it should be.
"ollama run llama3" -- that's it. Model downloaded, quantized, running, chat interface ready. Your product manager can do this. Your CEO can do this. My mom could probably do this (hi mom).
The Modelfile concept borrowed from Dockerfile is genuinely clever. The local API is clean. The model library is curated. It's the single best onboarding experience in all of AI tooling.
But let's be clear about what it is and isn't. It is not a production serving solution. It does not do continuous batching. It does not do multi-GPU tensor parallelism. It is not optimized for throughput. If you see Ollama in a production architecture diagram, someone made a mistake (or it's a very unusual use case).
Verdict: Perfect for what it is -- the fastest path from "I want to try a local LLM" to actually running one. Put it on every developer's laptop. Do not put it behind a load balancer.
====================================================================== HEAD TO HEAD: THE BENCHMARK DISCUSSION ======================================================================
Let's talk numbers honestly, because benchmarks in this space are a minefield of misleading comparisons.
The throughput hierarchy on NVIDIA hardware is clear:
TensorRT-LLM > SGLang >= vLLM > TGI >> llama.cpp > Ollama
TensorRT-LLM's 10-30% advantage over vLLM is real but comes with massive operational complexity. The interesting story is SGLang closing the gap with vLLM and sometimes surpassing it, especially on newer architectures like DeepSeek V3 where RadixAttention and their optimized scheduling really shine.
But raw throughput isn't everything. Here's what the benchmarks usually DON'T measure:
Time to first deployment: Ollama wins, measured in minutes. vLLM and TGI take minutes to hours. TensorRT-LLM takes days to weeks.
Recovery from failures: vLLM and TGI have mature health checks and restart logic. TensorRT-LLM's compiled plans mean a failed node isn't just a restart -- it might be a recompilation.
Long-tail latency: SGLang's RadixAttention is incredible for workloads with shared prefixes (think: same system prompt across requests). For random diverse queries, the advantage shrinks.
Cost efficiency: The fastest engine isn't always the cheapest. vLLM's broader hardware support means you can shop AWS, GCP, and Azure spot instances. TensorRT-LLM locks you into NVIDIA's pricing power.
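The cost point fits in one formula: dollars per million output tokens is (GPU $/hour) divided by (tok/s × 3600), times 10^6. The prices and throughputs below are illustrative placeholders, not quotes:

```python
# Speed vs cost in one formula. GPU prices and throughputs below are
# hypothetical illustrations, not real quotes or benchmarks.

def usd_per_million_tokens(gpu_usd_per_hour: float, tok_per_s: float) -> float:
    tokens_per_hour = tok_per_s * 3600
    return gpu_usd_per_hour / tokens_per_hour * 1e6

# Hypothetical: a pricier GPU at $4/hr pushing 3000 tok/s
# vs a cheaper one at $2/hr pushing 1500 tok/s.
fast = usd_per_million_tokens(4.0, 3000)
cheap = usd_per_million_tokens(2.0, 1500)
print(f"fast:  ${fast:.2f} / 1M tokens")
print(f"cheap: ${cheap:.2f} / 1M tokens")
```

In this particular made-up pairing the two come out identical (~$0.37 per million tokens): 2x the throughput at 2x the hourly rate is a wash. Run the formula with your actual instance prices before assuming the faster engine is the cheaper one.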
The hardware flexibility ranking tells its own story:
llama.cpp > Ollama > vLLM > TGI > SGLang > TensorRT-LLM
And model support breadth:
llama.cpp > vLLM > SGLang > TGI > Ollama > TensorRT-LLM
Notice a pattern? The engines optimized for raw speed tend to sacrifice flexibility. The engines that run everywhere sacrifice throughput. There is no free lunch in inference. Anyone telling you otherwise is selling something (probably GPUs).
My honest take: for most teams, the difference between vLLM and SGLang is smaller than the difference between either of them and having a well-tuned deployment configuration. Spend your engineering hours on batching strategies, quantization choices, and prompt optimization before you spend them switching inference engines. The engine matters, but it matters less than people think relative to everything else in your serving stack.
====================================================================== THE RECOMMENDATION MATRIX: JUST TELL ME WHAT TO USE ======================================================================
Fine. Here's the opinionated guide.
| YOU'RE A... | USE THIS |
|---|---|
| Solo dev exploring LLMs | Ollama |
| Startup building an AI product (< Series B) | vLLM |
| Enterprise with existing HF/AWS stack | TGI |
| High-scale serving, performance-critical | TensorRT-LLM or SGLang |
| Deploying to edge / weird hardware | llama.cpp |
| Research team / academia | SGLang |
| Building a desktop AI app | llama.cpp (via binding) |
| Running inference on AMD GPUs | vLLM (with patience) |
| Need structured / constrained output | SGLang |
| Budget-constrained, CPU-only servers | llama.cpp |
| Want to future-proof your bet | vLLM (safe) or SGLang (bold) |
The meta-advice: if you're asking "which inference engine should I use?" and you don't already have strong opinions, the answer is vLLM. It's the default for a reason. Graduate to something more specialized when you've hit a specific wall -- and you'll know when you have, because you'll be staring at latency dashboards at 2 AM wondering why your P99 looks like a hockey stick.
====================================================================== THE CHANGELOG: WHAT SHIPPED THIS MONTH ======================================================================
Notable releases and updates from the inference engine world, March 2026:
[vLLM v0.7.3] Landed automatic FP8 weight calibration for Hopper GPUs. No more manual scale-factor hunting. Also: speculative decoding now supports Medusa heads with Eagle-2 fallback. Memory efficiency improved ~12% for long-context workloads (>64k tokens).
[TGI v3.0.2] Hotfix for the CUDA graph capture regression that was causing OOMs on A10G instances. Added native Gemma 3 support. Prometheus metrics endpoint now includes per-request KV cache utilization. About time.
[TensorRT-LLM v0.17.1] Added Blackwell (B200) support with FP4 quantization. Yes, FP4 -- we've officially entered the "how low can you go" era of number formats. Build times reduced 40% with the new incremental compilation pipeline. Still not fast, but less painful.
[SGLang v0.4.3] Merged the async constrained decoding PR that eliminates the grammar-guided generation overhead on long outputs. DeepSeek V3/R1 serving now uses 35% less KV cache via their dynamic MLA compression. The most interesting changelog item nobody's talking about.
[llama.cpp] Gerganov merged the 1-bit (ternary) weight format experimental branch. BitNet-style models now run natively. Also: SYCL backend got a major overhaul, with Intel Arc GPUs seeing a 2x performance improvement. Vulkan compute shaders were rewritten for better mobile GPU compatibility.
[Ollama v0.6.0] Added "ollama compose" for multi-model pipelines. Think docker-compose but for chaining a router model with specialist models. Clever concept, early days on execution. Also shipped a built-in benchmarking tool: "ollama bench" gives you tok/s, memory usage, and time-to-first-token in one command.
====================================================================== PARTING THOUGHTS ======================================================================
The inference engine landscape is consolidating and fragmenting at the same time. Consolidating around a few winners (vLLM for general purpose, TensorRT-LLM for max performance, llama.cpp for local). Fragmenting because SGLang is proving that the "settled" approaches have significant room for improvement.
My prediction: by end of 2026, the vLLM vs SGLang rivalry will be the story, with TensorRT-LLM maintaining its performance crown but becoming increasingly niche as open engines close the gap. llama.cpp will quietly become the most important piece of software in AI that nobody in the enterprise talks about. And Ollama will hit 200k stars while remaining blissfully inappropriate for production.
That's it for issue one. If this was useful, tell a friend. If it wasn't, tell me -- I can take it.
Until next time, keep your batch sizes high and your latencies low.
-- The EVAL Team
EVAL -- The AI Tooling Intelligence Report No hype. No fluff. Just tools.
To subscribe: [eval-newsletter.ai] To unsubscribe: close this email and pretend it never happened