Stop shipping AI features. Start shipping the evals first.

2026-05-19


Issue 013 — Prompts are the easy part. The teams scaling AI in 2026 are the teams that built the eval harness before they built the feature — and most senior tech leaders still haven't bought the discipline.

Every other Tuesday I watch the same scene play out in a different company. A team has been working on an AI feature for six weeks. The demo is gorgeous. The exec deck has a chart showing 87% accuracy on a hand-curated dataset. The launch decision is a calendar invite for next week. Somewhere in the building, a staff engineer with the wrong title is quietly writing the eval suite that should have shipped after the model was chosen and before the demo.

We have a name for this gap now. We didn't a year ago.

It is the gap between AI features that survive contact with production and AI features that don't. It is the gap between an org that knows what its agents do at 3 a.m. on a Tuesday and an org that finds out from a customer ticket. It is the gap between a leader who can answer "how do you know your AI works?" with a number, a trend, and a definition of "works" — and a leader who answers with a story.

The serious answer to that question, in 2026, is evals. The unserious answer is everything else.


Deep Dive — AI as Leverage for Tech Leaders: the eval is the lever

Three things shifted in the last twelve months that make this issue the centerpiece of any AI strategy worth the name.

The first is that the field stopped pretending generic metrics work. Hamel Husain and Shreya Shankar, the two ML practitioners who have taught evals to 700+ engineers and PMs at Maven, put it in writing in their January 2026 Evals FAQ: "generic metrics like BERTScore, ROUGE, cosine similarity are not useful for evaluating LLM outputs in most AI applications." Translation: every off-the-shelf benchmark you have been reading about, including the leaderboards that pick which model your vendor recommends, is at best a screening filter and at worst a hallucination of progress. The metric that matters for your product is the one your team derives from your product's failure modes.

The second is that Anthropic — and through them, the field — formalized agent evaluation as a distinct practice. Anthropic's Demystifying evals for AI agents (2026) is the closest thing we have to a canonical methodology document. The piece distinguishes a task (a test case with a success criterion) from a trial (one stochastic run); separates the transcript (the full multi-step interaction, including tool calls) from the outcome (the actual end-state in the environment); and walks through trajectory vs. outcome metrics, LLM-judge calibration, and using evals as CI. The most operationally useful sentence in the post, for senior tech leaders: "What proved most effective was establishing dedicated evals teams to own the core infrastructure, while domain experts and product teams contribute most eval tasks and run the evaluations themselves." That is a budget line, an org-chart sentence, and a hiring requisition all at once.
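To make the vocabulary concrete, here is a minimal sketch of the task/trial and transcript/outcome split. It is not Anthropic's code or any library's API. The class names, the refund task, and the state keys are all hypothetical, chosen only to show why one task needs several trials and why success is judged on the outcome.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trial:
    """One stochastic run of the agent on a task."""
    transcript: List[str]  # the multi-step interaction, including tool calls
    outcome: dict          # end state of the environment after the run


@dataclass
class Task:
    """A test case plus its success criterion (hypothetical shape)."""
    name: str
    success: Callable[[dict], bool]  # judged on the outcome, not the transcript


def pass_rate(task: Task, trials: List[Trial]) -> float:
    """Fraction of trials whose outcome meets the task's success criterion."""
    if not trials:
        raise ValueError("need at least one trial")
    passed = sum(1 for t in trials if task.success(t.outcome))
    return passed / len(trials)


# Hypothetical example: the same refund task run three times.
refund_task = Task(
    name="refund_order",
    success=lambda state: state.get("refund_issued") is True,
)
trials = [
    Trial(["lookup_order", "issue_refund"], {"refund_issued": True}),
    Trial(["lookup_order"], {"refund_issued": False}),
    Trial(["lookup_order", "issue_refund"], {"refund_issued": True}),
]
rate = pass_rate(refund_task, trials)  # two of three trials succeed
```

A single run would have reported either 100% or 0% here. Reporting a rate over several trials is what makes the number usable as a trend.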

The third is that production observability for agents finally caught up to the production reality of agents. Honeycomb shipped agent observability — Agent Timeline, Canvas Agent — in May 2026, alongside their O11yCon conference framed explicitly around engineering teams building in the agent era. Charity Majors and her co-authors are releasing a refreshed edition of Observability Engineering that treats agent traces as a first-class signal. The implication for a director or VP of engineering is direct: agent traces and evals are two halves of the same instrument. Evals tell you whether your agent is doing the right thing on a curated set; traces tell you whether the production distribution still looks like the curated set. You cannot run a serious AI product on either one alone.

Put the three shifts together and the operating picture for 2026 is this: a senior tech leader who has not personally signed off on what "good output" means for each AI feature in flight is leading an org that is shipping by feel. The vendor's leaderboard is not signing off. The internal accuracy chart isn't either — it almost certainly measures the wrong thing on the wrong subset of inputs.

There are four moves that distinguish the leaders who are actually scaling AI from the leaders who are writing AI press releases.

Move 1: Own the eval suite the way you own the test suite. No senior tech leader would ship a critical system without unit tests, integration tests, and a regression suite. The same standard now applies to AI features. Eval suites are not nice-to-have artifacts the data-science team owns in a sidecar repo. They live next to the code, run in CI, block deploys when they regress, and get reviewed in PR. DeepEval, Confident AI, Maxim, and a handful of others have made the CI integration table-stakes — pytest plugins, GitHub Actions, regression tracking, the whole pattern. The bottleneck is not tooling. It is whether the senior leader treats the eval suite as part of the production system or as a research artifact.
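What "block deploys when they regress" can look like, as a rough sketch: compare the current pass rate of each eval against a stored baseline and return a non-zero exit code if anything dropped. The function names, the 0.02 tolerance, and the eval names are assumptions for illustration. This is plain Python, not the API of DeepEval or any other tool named above.

```python
import sys
from typing import Dict, List


def find_regressions(baseline: Dict[str, float],
                     current: Dict[str, float],
                     tolerance: float = 0.02) -> List[str]:
    """Names of evals that are missing or whose pass rate fell by more than `tolerance`."""
    regressed = []
    for name, base_rate in baseline.items():
        cur_rate = current.get(name)
        if cur_rate is None or base_rate - cur_rate > tolerance:
            regressed.append(name)
    return sorted(regressed)


def gate(baseline: Dict[str, float],
         current: Dict[str, float],
         tolerance: float = 0.02) -> int:
    """Exit code for a CI step: 0 if clean, 1 if any eval regressed."""
    regressed = find_regressions(baseline, current, tolerance)
    for name in regressed:
        print(f"REGRESSION: {name}", file=sys.stderr)
    return 1 if regressed else 0


# Hypothetical pass rates per eval, keyed by failure mode.
baseline = {"premature_escalation": 0.95, "hallucinated_sku": 0.90}
current = {"premature_escalation": 0.96, "hallucinated_sku": 0.80}
exit_code = gate(baseline, current)
# In a CI job you would finish with: sys.exit(exit_code)
```

The tolerance exists because agent evals are stochastic. Without it, ordinary run-to-run noise would block deploys and the team would learn to ignore the gate.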

Move 2: Fund the error-analysis discipline, not the dashboard. This is where most orgs misallocate. They buy a vendor dashboard with a hundred metrics and zero opinion on which ones matter. The disciplined alternative — Hamel and Shreya's loop — is closer to qualitative research than to APM. You collect 30–50 real failure traces, write open-coding notes on each, axial-code the notes into clusters, name the clusters, and then build bespoke evals for the failure modes you actually have. The output is not a dashboard. It is a small, specific, brutally relevant set of evals that catch the failures your users are already complaining about, plus an LLM judge calibrated against human grades on those exact failures. This is method work. It belongs on a senior engineer's roadmap, not in a Q3 vendor RFP.

Move 3: Treat evals as a context-engineering instrument, not a model-selection instrument. Andrej Karpathy described context engineering in June 2025 as "the delicate art and science of filling the context window with just the right information for the next step." Anthropic's Effective context engineering for AI agents made the case in detail: building with LLMs is less about prompt phrasing and more about what configuration of context is most likely to generate the desired behavior. Evals are what close the loop on that question. Without evals, every context change is a vibe-based A/B between two versions of a prompt. With evals, you have a quantitative answer to "did the new retrieval policy actually improve outcomes on the failure modes we care about?" Tech leaders who frame evals as a quality-gate-only tool miss this. Evals are how context engineering becomes engineering, not folklore.

Move 4: Pair every eval suite with an agent-trace pipeline. Anthropic's distinction between transcript and outcome matters because both modalities are needed and neither is sufficient. The transcript tells you how the agent reasoned and which tools it called. The outcome tells you whether the environment ended up in the right state. Agents can hit the right outcome via the wrong reasoning — and break catastrophically later — or they can produce a beautiful reasoning trace and then fail to actually execute the tool call. In production, the trace pipeline (Honeycomb-style) and the eval pipeline (Anthropic/DX-style) need to be wired into the same incident review. The teams getting this right run a weekly eval-and-trace review the way mature SRE orgs run a weekly post-incident review.
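One way to bring both signals into the same review is to grade each run on trajectory and outcome separately and sort it into one of four buckets. The sketch below is a simplification under stated assumptions: "trajectory" is reduced to whether a hypothetical set of required tool calls appears in the transcript, and the tool names and bucket labels are invented for illustration.

```python
from typing import List, Set


def classify_run(tool_calls: List[str],
                 required_tools: Set[str],
                 outcome_ok: bool) -> str:
    """Bucket a run by outcome and by a simplified trajectory check."""
    # Simplified trajectory check: every required tool was called at least once.
    trajectory_ok = required_tools.issubset(tool_calls)
    if outcome_ok and trajectory_ok:
        return "pass"
    if outcome_ok:
        return "right_outcome_wrong_path"
    if trajectory_ok:
        return "right_path_wrong_outcome"
    return "fail"


# Hypothetical policy: a refund requires an identity check before it is issued.
required = {"lookup_order", "verify_identity"}

skipped_check = classify_run(["lookup_order"], required, outcome_ok=True)
no_effect = classify_run(["lookup_order", "verify_identity"], required, outcome_ok=False)
```

The two middle buckets are the ones worth a line item in the weekly review. An outcome-only eval scores the first as a pass, and a transcript-only review scores the second as fine.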

The one-sentence test: if your AI program lead cannot, in 90 seconds, walk you through the top three failure modes the eval suite caught this month, the eval suite is not real yet.

Try this week. Pick one AI feature in flight. Sit with the eng lead and the PM for 60 minutes and do open coding on the 30 most recent production traces — each of you writes notes on what failed and why. Don't categorize yet. At the end of the hour, ask: of these 30 traces, how many of the failure modes are not covered by the current eval suite? The number will surprise you. That gap is your roadmap.
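If you want the gap as a number rather than a feeling, a small sketch follows. It assumes that at the end of the hour each trace has been given zero or more short failure labels, and that you can list the failure modes the current suite already covers. The labels and the function name are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set


def uncovered_failure_modes(trace_labels: Iterable[List[str]],
                            covered: Set[str]) -> Dict[str, int]:
    """Count failure modes seen in traces that no current eval covers."""
    counts = Counter(label for labels in trace_labels for label in labels)
    return {mode: n for mode, n in counts.most_common() if mode not in covered}


# Hypothetical labels for five traces; an empty list means nothing failed.
trace_labels = [
    ["premature_escalation"],
    ["hallucinated_sku"],
    ["premature_escalation"],
    [],
    ["stale_policy_cited"],
]
covered_by_suite = {"hallucinated_sku"}

gaps = uncovered_failure_modes(trace_labels, covered_by_suite)
```

The result is ordered by frequency, so the first entry is the first eval to write.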


Method — Error Analysis for LLM Applications (Husain & Shankar, 2024–2026; adapted from grounded theory, Glaser & Strauss, 1967)

What it is. A two-stage qualitative-coding loop, borrowed from grounded-theory research, that turns raw LLM failure traces into the specific, testable evals your application actually needs. Open coding produces free-text notes on individual traces; axial coding clusters those notes into named failure categories that become the basis for targeted eval tasks.

When to use it. You are shipping or scaling an AI feature, you have access to production (or staging) traces, and the current evals are either generic ("accuracy on a benchmark dataset") or absent. Especially valuable before a major model upgrade, before a launch decision, or after a regression no one can explain.

How to run it:

  1. Collect 30–50 real traces from production or staging that include a mix of successes and failures, sampled across the user segments and use cases that matter most. Don't curate — sample.
  2. Open code each trace. Two or three reviewers (eng + PM + ideally a domain expert) read every trace and write free-text notes on what's wrong, what's right, and what surprised them. No taxonomy yet. Write in the margin.
  3. Axial code the notes. Pull all the notes into a single document. Cluster them into groups that share a failure mode. Name each cluster precisely — not "bad output," but "hallucinated a SKU that exists in a different product line." A good axial code is specific enough that another labeler could apply it without asking you what you meant.
  4. Build evals against the named clusters. Each axial code becomes one or more eval tasks with explicit success criteria. Calibrate any LLM judges against human grades on at least 20 traces per cluster; iterate until the LLM judge agrees with the humans on roughly 90% of the calibration set.
  5. Wire the evals into CI and re-run the loop monthly. Production distributions drift; your axial codes will need updating; treat the eval suite as a living artifact, not a milestone.
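Step 4's calibration check can be sketched in a few lines. The thresholds come from the steps above (at least 20 graded traces per cluster, roughly 90% agreement). Everything else, including the function names, the status strings, and the sample grades, is a hypothetical illustration rather than Husain and Shankar's own tooling.

```python
from typing import Dict, List, Tuple

MIN_TRACES = 20          # minimum human-graded traces per cluster (step 4)
TARGET_AGREEMENT = 0.90  # rough agreement target between judge and humans

Grade = Tuple[str, bool, bool]  # (cluster, human_pass, judge_pass)


def calibration_status(grades: List[Grade]) -> Dict[str, str]:
    """Per cluster: is the LLM judge calibrated against the human grades?"""
    totals: Dict[str, int] = {}
    matches: Dict[str, int] = {}
    for cluster, human_pass, judge_pass in grades:
        totals[cluster] = totals.get(cluster, 0) + 1
        matches[cluster] = matches.get(cluster, 0) + int(human_pass == judge_pass)

    status = {}
    for cluster, n in totals.items():
        agreement = matches[cluster] / n
        if n < MIN_TRACES:
            status[cluster] = "needs_more_traces"
        elif agreement >= TARGET_AGREEMENT:
            status[cluster] = "calibrated"
        else:
            status[cluster] = "iterate_on_judge"
    return status


# Hypothetical calibration set: 19/20, 16/20, and only 5 graded traces.
grades = (
    [("premature_escalation", True, True)] * 19
    + [("premature_escalation", True, False)]
    + [("hallucinated_sku", False, False)] * 16
    + [("hallucinated_sku", False, True)] * 4
    + [("stale_policy_cited", True, True)] * 5
)
status = calibration_status(grades)
```

Raw agreement is the simplest possible measure. On a calibration set where almost every trace passes, a judge that always says "pass" will also score high, so it is worth checking agreement on the failing traces separately before trusting the number.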

When NOT to use it. When you have no real traces yet (you cannot open-code what doesn't exist — start with a small launch and instrument first). When the AI feature is a thin wrapper over a vendor API you cannot evaluate independently (the right move there is to ship a thin eval against your use case, not the vendor's claims).

Example. An internal support-bot team at a fintech ran this loop in late 2025 and discovered that their dominant failure was not "wrong answer" but "correct answer delivered after escalating away from the user." The accuracy metric was unmoved; the user-satisfaction metric was tanking. The axial code "premature escalation" became three new evals and an agent-policy fix in two weeks.


Field Notes

Demystifying evals for AI agents — Anthropic's reference document on agent evaluation. The transcript-vs-outcome distinction and the "dedicated evals team owns the infrastructure, domain experts own the tasks" model are the two ideas to steal first.

LLM Evals: Everything You Need to Know — Hamel Husain and Shreya Shankar's January 2026 FAQ, distilled from teaching 700+ engineers and PMs. The clearest answer to "where do I start?" your team will find this year.

Honeycomb launches Agent Observability — Agent Timeline + Canvas Agent shipped May 2026. The signal is that production agent traces have crossed into table-stakes; if your org isn't capturing them yet, it is behind the median.


"The capabilities that make agents useful also make them more difficult to evaluate."

— Anthropic, Demystifying evals for AI agents (2026)


Don't miss what's next. Subscribe to Critical Path: