The 4 levers that diagnose broken AI agents
This week: a practical framework for diagnosing agent failures, how to turn failed runs into eval cases, and a few resources for building better harnesses.
This week's video
Agent failures get blamed on the model too quickly.
Sometimes the model is the constraint. More often, the harness around the model is under-specified. The retriever brought back weak context. The tool surface made the wrong action too easy. The loop had no stop condition. The governance layer trusted a prompt where it needed a runtime gate.
This week's video is the framework I use to separate those failure modes: Context, Tools, Loop, and Governance.
The useful part of the framework is that it gives you a diagnostic sequence instead of a vague complaint. When an agent ignores the right information, look at Context. When it calls the wrong thing or cannot take the action you expected, look at Tools. When it loops too long, stops too early, or fails to recover from errors, look at Loop. When it does something it should not be allowed to do, or skips a human checkpoint, look at Governance.
"Agent quality" is too broad to debug. Harness quality is easier. Context can be inspected. Tools can be redesigned. Loops can be bounded. Governance can be enforced. Once you split the system into those pieces, the failure usually becomes much less mysterious.
What I did not cover in the video
The part I would add is the habit after diagnosis.
Do not stop at "this failed because of Context" or "this failed because of Tools." Treat that as the beginning of a diagnostic loop: execute, observe the failure, attribute it to a layer, turn the failure into an eval case, change that layer, and re-run the same task. If the agent gave a plausible answer using the wrong source, that failed run becomes a regression test: the next pass has to use the right source and show the evidence before the harness accepts it.
The important move is to make the fix permanent. If the task was ambiguous, write a better Definition of Done. If the agent missed a convention, put that convention where it will actually be retrieved. If the environment burned half the context window on broken dependencies, fix the setup path. If the verification step was missing, add that feedback and make the failure part of the eval suite.
One concrete version is a feature list that behaves like a harness primitive, not a planning memo. Each item needs three fields: behavior, verification, and state. Not "add search," but "GET /api/search returns paginated results," plus the command, trace, or eval that proves it. The agent can work the item, but it should not be the thing that decides the item is passing. That state transition belongs to the harness.
The counterintuitive part is that the right harness change is not always more structure. Sometimes it is subtraction: fewer tools, less context, a narrower role, a smaller approval surface, or a cleaner stop condition. Bad harnesses hide the real decision. Good harnesses make the next correct action obvious.
This is also where the supervisor pattern comes back in. Splitting one overloaded agent into a supervisor and a few specialists is a harness tactic: narrower context, fewer tools, simpler loops, and clearer governance boundaries. The risk is over-delegation. The win is that CRM work, proposal drafting, and code review no longer share the same instructions, tools, and failure modes.
That is the difference between "prompt engineering" and harness engineering. Prompt engineering tries to phrase the wish better. Harness engineering changes the system so the wish has somewhere reliable to land.
The diagnostic worksheet
I also put together a short worksheet for applying the framework to one real agent failure. Pick one run, walk through the four levers, and identify the smallest harness change to try next.
Use it when a trace looks successful but the workflow still fails. The point is to stop guessing where the failure came from.
Get the 4 Levers Agent Diagnostic
More on harness engineering
If this framing is useful, these are the adjacent pieces I would pair with it:
- Effective harnesses for long-running agents. The clearest companion to the feature-list idea: initialize the environment, track feature state, work incrementally, and leave clean handoff artifacts.
- Harness design for long-running application development. Useful for the evaluator angle: split generation from evaluation, make quality criteria explicit, and iterate against feedback.
- How to really stop your agents from making the same mistakes. A sharper version of the eval loop: promote failures into skills, tests, resolver checks, and smoke tests.
Quick ask: what should I make next?
I am planning the next batch of videos, essays, and Practical AI resources, and want your input.
I put together a 2-minute form about what you are trying to build, where AI tools still feel frustrating, and what you would want me to explain clearly. I also added a question about a possible podcast format.
If you are debugging an agent that looks good in the demo and messy in the real workflow, reply with the lever you think is failing: Context, Tools, Loop, or Governance.
And if you want help designing the harness around a production agent, book an intro call.
Damian

Add a comment: