
Damian Galarza | AI Engineering

April 8, 2026

I cut a production prompt by 28% (and the evals that made it safe)

How I used autoresearch to run 65 autonomous optimization iterations on a production agent prompt, plus the eval mental model that made any of it trustworthy.

I spent last week letting an optimization loop rewrite a production prompt for me. 65 autonomous iterations later, the system prompt was 28% smaller and the output quality had dropped by less than two points.

None of that would have been safe to do without an eval harness underneath. The optimization loop is the eye-catching half. The eval is the half that makes "smaller prompt" mean something other than "different prompt."

Read the full post →

The short version: the agent takes ~170 input categories and produces a structured matrix. The system prompt carried a 421-line reference matrix as few-shot examples. I wanted to know how much of that the model actually needed. I pointed autoresearch at it, defined what "good output" meant, and walked away. It came back with a 303-line version scoring 98.1% against the baseline.

The interesting part wasn't that the prompt got smaller. It was how it got smaller. I expected something closer to brute force — cut a chunk, score it, cut another chunk. What actually happened was more surgical. Each iteration made a small, targeted edit, ran the eval, and either kept the change or reverted it. The final 303 lines weren't picked once. They survived 65 rounds of selection pressure.
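If it helps to see the shape of that loop, here's a rough sketch. `propose_edit` and `run_eval` are stand-ins for whatever autoresearch actually does internally, and the acceptance rule is my simplification, not its real API — the point is just the keep-or-revert structure:

```python
from typing import Callable

def optimize_prompt(
    prompt: str,
    propose_edit: Callable[[str], str],   # makes one small, targeted edit
    run_eval: Callable[[str], float],     # scores a candidate against the baseline rubric
    baseline_score: float,
    iterations: int = 65,
    tolerance: float = 2.0,
) -> str:
    """Greedy keep-or-revert loop: shrink the prompt while quality holds."""
    current = prompt
    for _ in range(iterations):
        candidate = propose_edit(current)
        score = run_eval(candidate)
        # Keep the edit only if the prompt got shorter and the score stayed
        # within tolerance of the baseline; otherwise revert and try again.
        if len(candidate) < len(current) and score >= baseline_score - tolerance:
            current = candidate
    return current
```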


The part the blog post assumes you already have

The optimization loop is the flashy part. But it wouldn't be possible without the eval harness powering it underneath. That's what this week's video gets into.

AI Agent Evals: The 4 Layers Most Teams Skip

Watch the video →

Most teams ship agents on gut feel. Run it a few times, the output looks reasonable, push it. The problem is that agents fail differently from deterministic software. They degrade silently, producing plausible-looking wrong answers that never trip an alert.

Evals aren't tests in the way you're used to thinking about tests. The system is non-deterministic, so a single run tells you very little. You score outputs on a spectrum and watch the distribution move over time. It looks more like manufacturing QA than unit testing.
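Here's a minimal sketch of what that looks like in practice. `agent` and `grade` are hypothetical callables (your agent under test and your scoring rubric); what matters is the repeated runs and the distribution summary, not the specific stats:

```python
import statistics

def score_distribution(agent, grade, cases, runs_per_case: int = 5) -> dict:
    """Run each case several times, score on a 0-100 scale, and summarize
    the distribution instead of treating any single run as pass/fail."""
    scores = [
        grade(case, agent(case))
        for case in cases
        for _ in range(runs_per_case)  # repeated runs, because outputs vary
    ]
    return {
        "mean": statistics.mean(scores),
        "p10": min(statistics.quantiles(scores, n=10)),  # the low tail is where drift shows first
        "stdev": statistics.pstdev(scores),
    }
```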

The video covers the four layers that actually matter: the deterministic pieces (component), whether the agent took the right steps (trajectory), whether it hit the goal (outcome), and what quality drift looks like in production (monitoring). Most teams measure one of these, sometimes two. The gap between one and four is where silent failures live.
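One way to make the four layers concrete is to score each one separately for every traced run. This is an illustrative structure, not the video's or the scorecard's exact schema:

```python
from dataclasses import dataclass

@dataclass
class EvalReport:
    component: float    # deterministic pieces: parsers, schema checks, tool-call validity
    trajectory: float   # did the agent take the right steps to get there
    outcome: float      # did the final output actually hit the goal
    monitoring: float   # the same scoring, sampled from live production traffic

    def weakest_layer(self) -> str:
        scores = vars(self)
        return min(scores, key=scores.get)
```

That `weakest_layer` question is essentially what the scorecard below asks you to answer for real.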

And it closes on the part I care most about: evaluatability is a design decision, not a testing phase. Either you instrument for it from the first line of code, or you spend six months bolting it on under pressure.


Where your agents actually stand

The two pieces this week are two sides of the same coin. The blog post shows what you can do once you have evals. The video shows what evals actually are. If you're sitting in the middle wondering "okay, but where do I start," that's what I built the Agent Eval Scorecard for.

It's a one-page diagnostic that scores your current eval setup across the four layers in the video. Component, trajectory, outcome, production monitoring. Each one gets a band from "not measured" to "production-grade." You'll know in five minutes which layer is the weakest link, which is almost always the one causing silent failures in prod.

Get the Agent Eval Scorecard →


What I'm reading and watching

Hamel Husain's evals-skills plugin — Hamel published a Claude Code plugin that ships skills for auditing and building LLM eval pipelines, distilled from helping 50+ companies and from his AI Evals course with Shreya Shankar. The eval-audit skill is the natural companion to the scorecard above: install it, point it at your eval setup, and let it surface the specific problems. This is the kind of focused, opinionated tooling more of the agent ecosystem should look like.

Karpathy on LLM knowledge bases — Andrej posted a thread this week on how he's using LLMs to build and maintain personal markdown knowledge bases in Obsidian. Raw sources go in, the LLM "compiles" a wiki of summaries, backlinks, and articles, and he queries against it. It's exactly the pattern from last week's Mastra Workspaces video, scaled up to research. The agent isn't remembering, it's searching files it wrote earlier. Files remain the most underrated agent primitive.

Gemma 4 announced — Google shipped Gemma 4 last week. I've been running local models for a few days to see how much of an agent's cheap, high-volume work (classification, extraction, routing) I can push off the frontier models. Most of it, it turns out. The timing matters: Anthropic has been tightening how Max plans can be used outside Claude Code, which makes "what can I run locally" a real cost question again rather than a hobby one. Worth revisiting if you haven't looked at local models in six months.


The whole issue more or less circles the same point: the agents you can measure are the agents you can actually improve.

If you've run into silent failures in a production agent and want to talk through where the eval gap is, reply to this email or book an intro call. Always happy to trade notes.

Damian
