If you can't write the eval, you can't ship the AI feature

June 9, 2026 · Issue 029

The acceptance criteria for AI features look nothing like the ones you've been writing for a decade. If you don't change the artifact, you'll keep approving launches you can't defend.

Senior TPMs and tech leaders are sitting in launch reviews right now nodding at slides that describe AI features the way they described CRUD features ten years ago. "User can ask a question and get an answer." "P95 latency under 800ms." "No P0 bugs from QA." That packet would have shipped a search box in 2016. In 2026, on an AI feature, it ships nothing — or worse, it ships something that quietly degrades for six weeks before a customer screenshot ends up in the CEO's inbox.

The fix is not better testing. The fix is changing what counts as acceptance criteria. For an AI feature, the acceptance criterion is an eval: a curated dataset, a scoring function, and a target score that the system must clear before launch and continue to clear after every model swap, prompt edit, or upstream data change. If you cannot write the eval, you have not actually defined the feature. You have defined a vibe.

This is the shift the strongest tech leaders are already making. Hamel Husain and Shreya Shankar's eval course has become the de facto curriculum for engineering and product leaders moving into AI, with a follow-up book in production at O'Reilly. (AI Evals For Engineers & PMs) Red Hat is publishing eval-driven development as a discipline, not a checklist. (Eval-driven development) The pattern is the same everywhere it is being done well: the eval is written before the system, and the eval is what gets reviewed, not the demo.

Deep Dive — Evals are the new acceptance criteria

The reason classic acceptance criteria fail on AI features is that they were designed for deterministic systems. "Given input X, return output Y" is testable when the system is a function. When the system is a probabilistic model wrapped in a prompt wrapped in a retrieval pipeline wrapped in a guardrail layer, "given X, return Y" becomes "given X, return something in the neighborhood of Y, most of the time, depending on the model, the index, the day, and who's reviewing it." That is not a criterion. That is a hope.

An eval replaces the hope with three artifacts that you can write down, version, and run on demand.

One: a dataset. Not synthetic. Not 12 examples a PM wrote in a Notion doc. A real, curated set of 100–500 inputs covering the head and the long tail of what users will actually send. The best teams build it from real production traffic and real failure tickets. Hamel Husain's guidance is brutal on this point: synthetic eval sets give you the illusion of coverage without the substance, because the failures real users find are not the failures you imagined. (LLM Evals: Everything You Need to Know) When you don't have production data yet, you can bootstrap with synthetic examples, but mark every one of them as synthetic and replace it the moment real data lands.
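
One way to make "mark every synthetic example" enforceable is to carry provenance on each row of the dataset. The sketch below is illustrative only: the EvalExample fields, the source labels, and the sample rows are assumptions, not a schema from any of the cited material.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalExample:
    """One row of the eval dataset. Field names are illustrative assumptions."""
    example_id: str
    user_input: str
    expected_behavior: str  # what the domain expert says "good" looks like
    source: str             # "production", "failure_ticket", or "synthetic"


def provenance_report(examples: list[EvalExample]) -> dict[str, float]:
    """Return the fraction of the dataset that comes from each source."""
    if not examples:
        return {}
    counts = Counter(ex.source for ex in examples)
    return {source: count / len(examples) for source, count in counts.items()}


def synthetic_ids(examples: list[EvalExample]) -> list[str]:
    """List the examples that should be replaced once real data lands."""
    return [ex.example_id for ex in examples if ex.source == "synthetic"]


# Made-up rows for illustration only.
dataset = [
    EvalExample("ex-001", "Where is my refund?", "Cites the refund policy", "production"),
    EvalExample("ex-002", "Cancel my plan today", "Confirms before cancelling", "production"),
    EvalExample("ex-003", "You charged me twice", "Escalates to billing", "failure_ticket"),
    EvalExample("ex-004", "Do you ship to Mars?", "Declines politely", "synthetic"),
]

report = provenance_report(dataset)   # share of rows per source
to_replace = synthetic_ids(dataset)   # rows still waiting on real traffic
```

A report like this answers the "where does it live, and how many examples?" question in one line, and the synthetic share is a number a steering meeting can track down to zero.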

Two: a scoring function. This can be a set of deterministic checks (regex, schema validation, tool-call assertions), an embedding-based similarity measure, or, most commonly for open-ended outputs, an LLM-as-judge prompt that has been calibrated against human annotations. The judge is itself a system that needs evaluating. Recent research puts well-calibrated LLM judges at roughly 80–85% agreement with human reviewers, which is approximately the rate at which two humans agree with each other on the same task, and the rate plummets to under 60% on harder categories like safety. (LLM-as-Judge Best Practices in 2026) If you're not measuring judge-to-human agreement, you don't know what your scoring function is actually telling you.
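
Measuring judge-to-human agreement does not need special tooling. Here is a minimal sketch, assuming you already have a sample of outputs labeled by both the judge and a human annotator; the LabeledOutput type, the category names, and the rows are hypothetical. It reports agreement overall and per category, since the headline number can hide a weak category.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledOutput:
    category: str     # hypothetical category names, e.g. "billing", "safety"
    judge_label: str  # verdict from the LLM judge
    human_label: str  # verdict from a human annotator


def agreement_rate(rows: list[LabeledOutput]) -> float:
    """Fraction of rows where the judge and the human gave the same label."""
    if not rows:
        raise ValueError("need at least one labeled output")
    return sum(r.judge_label == r.human_label for r in rows) / len(rows)


def agreement_by_category(rows: list[LabeledOutput]) -> dict[str, float]:
    """Agreement rate computed separately for each category."""
    buckets: dict[str, list[LabeledOutput]] = defaultdict(list)
    for r in rows:
        buckets[r.category].append(r)
    return {category: agreement_rate(items) for category, items in buckets.items()}


# Made-up annotations for illustration only.
rows = [
    LabeledOutput("billing", "pass", "pass"),
    LabeledOutput("billing", "fail", "fail"),
    LabeledOutput("billing", "pass", "pass"),
    LabeledOutput("billing", "pass", "fail"),
    LabeledOutput("safety", "pass", "fail"),
    LabeledOutput("safety", "fail", "fail"),
    LabeledOutput("safety", "pass", "pass"),
    LabeledOutput("safety", "fail", "pass"),
]

overall = agreement_rate(rows)              # 5 of 8 rows agree
per_category = agreement_by_category(rows)  # billing 3 of 4, safety 2 of 4
```

Raw agreement is the simplest possible measure and does not correct for agreement by chance; treat it as a starting point for the calibration conversation, not the last word.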

Three: a target. The number that, if cleared, gates the launch. "Groundedness above 0.85 on the customer-support eval set. Context adherence above 0.90. Policy-compliance judge above 0.95. Zero hallucinated refunds in the regression suite of 47 previously-broken cases." This is what an acceptance criterion looks like in 2026. It is specific, it is numeric, it is gated, and it is reproducible.
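
Written as code, the example targets above become a gate that either returns nothing or returns the reasons the launch is blocked. This is a sketch under assumptions: the metric keys, the Target type, and the launch_gate function are invented for illustration, and "above" is read as strictly greater than.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    metric: str
    minimum: float  # the score must be strictly above this value


# Thresholds taken from the example criteria in the text.
TARGETS = [
    Target("groundedness", 0.85),
    Target("context_adherence", 0.90),
    Target("policy_compliance", 0.95),
]
MAX_HALLUCINATED_REFUNDS = 0  # across the regression suite


def launch_gate(scores: dict[str, float], hallucinated_refunds: int) -> list[str]:
    """Return the reasons the gate is blocked; an empty list means it cleared."""
    failures = []
    for target in TARGETS:
        value = scores.get(target.metric)
        if value is None:
            failures.append(f"{target.metric}: no score reported")
        elif value <= target.minimum:
            failures.append(f"{target.metric}: {value:.2f} is not above {target.minimum:.2f}")
    if hallucinated_refunds > MAX_HALLUCINATED_REFUNDS:
        failures.append(f"regression suite: {hallucinated_refunds} hallucinated refund(s)")
    return failures


# Made-up scores for illustration only.
passing = launch_gate(
    {"groundedness": 0.88, "context_adherence": 0.91, "policy_compliance": 0.96}, 0
)
blocked = launch_gate(
    {"groundedness": 0.88, "context_adherence": 0.91, "policy_compliance": 0.93}, 1
)
```

A missing score counts as a failure on purpose: a metric nobody ran should block the launch just as firmly as a metric that came in low.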

A few things follow from this that are uncomfortable for traditional program management.

The PRD gets written last, not first. You can no longer write a PRD that says "the assistant should be helpful and accurate" and hand it to engineering. You write the eval first, with engineering and with a domain expert who knows what "good" looks like. The PRD becomes a thin wrapper around the eval. The eval is the contract.

Launch gates change. "QA signed off" becomes "the eval score cleared the threshold on the last run, and the trend over the last ten runs is not degrading." The artifact that goes into the launch review is a dashboard, not a checklist.
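
"Not degrading" needs a definition before it can gate anything. One possible reading, sketched below, fits a least-squares line through the last ten scores and blocks the launch if the slope is more negative than a tolerance. The function names, the window, and the tolerance of 0.002 per run are assumptions to negotiate, not recommended values.

```python
def trend_slope(scores: list[float]) -> float:
    """Least-squares slope of score against run index (0, 1, 2, ...)."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two runs to compute a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(scores) / n
    numerator = sum((i - mean_x) * (s - mean_y) for i, s in enumerate(scores))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


def passes_launch_gate(
    history: list[float],
    threshold: float,
    window: int = 10,
    max_drop_per_run: float = 0.002,
) -> bool:
    """True if the latest run clears the threshold and the recent trend is not degrading."""
    recent = history[-window:]
    if len(recent) < 2:
        return False  # not enough history to say anything about the trend
    cleared = recent[-1] >= threshold
    not_degrading = trend_slope(recent) >= -max_drop_per_run
    return cleared and not_degrading


# Made-up score histories for illustration only.
stable = [0.90, 0.91, 0.90, 0.91, 0.90, 0.91, 0.90, 0.91, 0.90, 0.91]
sliding = [0.95, 0.94, 0.93, 0.92, 0.91, 0.90, 0.89, 0.88, 0.87, 0.86]

stable_ok = passes_launch_gate(stable, threshold=0.85)
sliding_ok = passes_launch_gate(sliding, threshold=0.85)
```

The second history is the interesting one: its last run still clears 0.85, so a threshold-only gate would pass it, while the trend check blocks it.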

Regressions are now silent by default. Classic regressions break the build. AI regressions return a confidently wrong answer in a well-formed JSON envelope. The only way to catch them is a regression suite of previously-broken cases that runs on every change. Treat this suite the way SREs treat post-incident runbooks: every production failure earns a new eval example, and it never gets removed. The dedicated hallucination regression suite is becoming standard practice. (AI Hallucination Testing in 2026)
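
A regression suite of previously-broken cases can start very small. The sketch below assumes each incident can be reduced to an input plus a phrase that must never reappear in the output; real suites usually need richer checks, and every name here (RegressionCase, run_regression_suite, the incident ids, the stub assistant) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class RegressionCase:
    incident_id: str       # ties the case back to the production failure
    user_input: str
    must_not_contain: str  # phrase that signalled the original failure


def run_regression_suite(
    cases: list[RegressionCase], generate: Callable[[str], str]
) -> list[str]:
    """Return the incident ids whose forbidden phrase shows up again."""
    failed = []
    for case in cases:
        output = generate(case.user_input)
        if case.must_not_contain.lower() in output.lower():
            failed.append(case.incident_id)
    return failed


def add_case(cases: list[RegressionCase], new_case: RegressionCase) -> list[RegressionCase]:
    """Grow the suite; there is deliberately no function for removing a case."""
    return cases + [new_case]


def stub_assistant(user_input: str) -> str:
    """Stand-in for the real system under test; replace with an actual call."""
    if "refund" in user_input.lower():
        return "I can't issue refunds directly, but I've opened a ticket for billing."
    return "Let me look into that."


# Made-up incidents for illustration only.
suite: list[RegressionCase] = []
suite = add_case(suite, RegressionCase("INC-101", "I want a refund for last month", "refund has been issued"))
suite = add_case(suite, RegressionCase("INC-102", "What is your uptime guarantee?", "100% uptime"))

clean_run = run_regression_suite(suite, stub_assistant)
broken_run = run_regression_suite(suite, lambda _: "Good news, your refund has been issued.")
```

Wired into CI, a non-empty result fails the build, which is what turns a silent AI regression back into a loud one.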

The eval is also the strategy document. When you write down the dataset and the scoring rubric, you have committed in concrete terms to what the product is supposed to do. Disagreements that used to live in PRD margins ("is this for customer support or for sales enablement?") collapse into a single tractable conversation: which examples belong in the dataset, and how do we score them? Most strategy fights on AI products are actually rubric fights in disguise. Surfacing them as rubric fights resolves them in days instead of quarters.

The senior TPM's role in this is not to write the evals personally. It is to make sure they exist, to make sure they were built from real data, to make sure the judge is calibrated, to make sure the target was negotiated with the right stakeholders, and to make sure the regression suite grows after every incident. Those are the program-level activities. The eval is the work product, but the discipline — the operating cadence around it — is what the program manager owns.

If you cannot point to the eval today on the AI feature your team is about to launch, you do not own that launch. You are a passenger.

Try this week. Pick one in-flight AI feature on your program. Ask three questions in writing: What is the eval dataset, where does it live, and how many examples? What is the scoring function, and what is its agreement rate against human annotation? What is the target score, and who set it? If any answer is "we'll get to it before launch," that's your top program risk this week. Make it visible at the next steering meeting.

Method — The Pre-Mortem (Klein, 2007)

What it is. A 20–30 minute exercise in which the team imagines the project has already failed catastrophically, and each person individually writes the story of why. Gary Klein, whose research on naturalistic decision-making also produced the recognition-primed decision model, popularized the technique in a 2007 Harvard Business Review article. It exploits prospective hindsight: Mitchell, Russo, and Pennington (1989) found that imagining an outcome has already happened increases the ability to correctly identify the reasons for it by roughly 30%. (Performing a Project Premortem, HBR)

When to use it. Before any high-stakes, irreversible-ish decision: a launch, a migration, an org change, a re-platforming, a vendor commitment. Especially valuable for AI feature launches, where failure modes are unfamiliar to most reviewers and standard checklists miss them.

How to run it:

1. Frame the moment. "Imagine it is six months after launch. The feature is a disaster. We are in the post-mortem room. What happened?" Be specific about the timeframe and the severity: "embarrassing enough that it's mentioned at the all-hands."
2. Silent independent writing, 5–7 minutes. Each person writes their own story of the failure. No discussion yet. This step is what makes the technique work, because it bypasses the conformity pressure of group brainstorming.
3. Round-robin read-out. Each person reads one item from their list. Go around until lists are exhausted. Capture every item without debate.
4. Cluster and rank. Group similar failure modes. Rank by likelihood × impact. The top three to five are your real risk register, much sharper than what you'd have produced asking "what could go wrong?"
5. Assign owners and mitigations. Each top-ranked failure mode gets a named owner and a specific mitigation tracked through to launch.
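
The cluster-and-rank step above is simple arithmetic, and writing it down keeps the ranking honest. Here is a minimal sketch, assuming the team scores likelihood and impact on a 1 to 5 scale; the scale, the FailureMode type, the owners, and the scores are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    description: str
    likelihood: int  # 1 (rare) to 5 (almost certain); the scale is an assumption
    impact: int      # 1 (minor) to 5 (launch-ending)
    owner: str = "unassigned"

    @property
    def risk(self) -> int:
        """Likelihood times impact, as in the ranking step."""
        return self.likelihood * self.impact


def top_risks(modes: list[FailureMode], limit: int = 5) -> list[FailureMode]:
    """Return the highest-risk failure modes, highest first."""
    return sorted(modes, key=lambda m: m.risk, reverse=True)[:limit]


# Made-up scores for illustration only.
modes = [
    FailureMode("Launch-day latency spike", likelihood=3, impact=2),
    FailureMode("Prompt injection against the evaluator", likelihood=2, impact=5),
    FailureMode("LLM judge drifts after a model swap", likelihood=4, impact=4, owner="eval lead"),
    FailureMode("Retrieval index goes stale", likelihood=3, impact=5, owner="search lead"),
]

register = top_risks(modes, limit=3)
unowned = [m.description for m in register if m.owner == "unassigned"]
```

Anything left in the unowned list after the session is the follow-up item: a top-ranked failure mode without a named owner is not yet mitigated.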

When NOT to use it. Skip it when the decision is genuinely reversible, low-cost, or so urgent that a 30-minute exercise blows the window. Pre-mortems also lose value when the team is too junior or too new to the problem space to imagine plausible failures — in that case, get a more experienced practitioner in the room first.

Example for AI launches: "It is December 2026. We rolled back the AI assistant last week after a regulator inquiry. What went wrong?" Framed this way, the exercise tends to surface failure modes that don't appear on any traditional QA checklist, such as drift in the LLM-as-judge, retrieval index staleness, and evaluator-prompt injection.

Field Notes

Why AI evals are the hottest new skill for product builders — Lenny Rachitsky's interview with Hamel Husain and Shreya Shankar is the fastest way to bring a skeptical engineering director up to speed on why evals are now a leadership-level concern, not a QA concern.

A pragmatic guide to LLM evals for devs — Gergely Orosz makes the case that evals are becoming the new test pyramid. Skim the section on eval ownership — useful ammunition for the "who runs this in our org" conversation.

Evaluation-driven development with EvalHub — Red Hat's June piece walks through tooling that treats evals as first-class CI artifacts. Less interesting as a product pitch, more interesting as confirmation that this is becoming an infrastructure category.

Events

Jun 18–19 · LeadDev London 2026 — The engineering leadership conference of record; staff+ growth and AI-augmented teams tracks are the ones worth the airfare.
Jun 29 – Jul 2 · AI Engineer World's Fair 2026 — 6,000+ AI engineers and Fortune 500 CTOs, 10 parallel tracks. The "Evals & RAG" track is the one to plan around.
Jul 8–9 · RAISE Summit 2026 — Paris — 9,000+ attendees with heavy C-level participation, agentic systems and enterprise ROI focus. Hallway track over keynotes.

Reading

LLM Evals: Everything You Need to Know · Hamel Husain & Shreya Shankar — The reference FAQ. Read once, keep open in a tab for the next quarter.
The State of AI-Driven Software Releases 2026 · LeadDev — Quantifies how AI release practices differ from traditional release practices. Useful for benchmarking your own org's maturity honestly.
Performing a Project Premortem · Gary Klein, HBR — The original article. Still the cleanest 1,500-word explanation of why prospective hindsight works. Worth re-reading before running one.

"I would not give a fig for the simplicity this side of complexity, but I would give my life for the simplicity on the other side of complexity."

— Oliver Wendell Holmes

Critical Path
