Damian Galarza | AI Engineering
April 15, 2026

Your agent reported success. It failed.

This week's video on why agent logs aren't observability, a new doc-sentinel plugin for catching documentation drift, and three links, two of them on the AI-at-home use case.

This week's video

My business operations agent told Slack an invoice had been created. It hadn't. The logs were clean. The dashboard stayed green. The only thing that explained what had actually gone wrong was a trace.

The Observability Layer Your AI Agent Is Missing

Watch the video →

Most production agent setups have logs and assume that's enough. It isn't. Logs tell you what happened in what order. They can't tell you why the agent made the decision it did, because the decision happens one layer above the event stream.

This week's video (Part 2 of the Agent Quality series) walks through the three-layer mental model: logs for sequence, traces for causality, metrics for aggregation. And the split inside that last layer: system metrics vs quality metrics. That's where silent failures live. Every system metric was green the day my agent reported a completed invoice that didn't exist.
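To make the trace layer concrete, here's a minimal sketch of a traced agent step using OpenTelemetry's Python SDK. The names (the invoice-agent tracer, create_invoice, the id_returned attribute) are illustrative, not pulled from my agent or the video; the point is that the span carries the decision-level detail a log line and a green dashboard both miss.

    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

    # Export spans to the console for the example; in practice you'd point this
    # at Phoenix or any other OpenTelemetry-compatible backend.
    provider = TracerProvider()
    provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)

    tracer = trace.get_tracer("invoice-agent")

    def create_invoice(customer_id: str, amount: float) -> dict:
        # Hypothetical tool call standing in for a real invoicing API.
        with tracer.start_as_current_span("tool.create_invoice") as span:
            span.set_attribute("invoice.customer_id", customer_id)
            span.set_attribute("invoice.amount", amount)
            result = {"status": "ok", "invoice_id": None}  # "success" with no invoice behind it
            # The quality signal system metrics never surface: did an id actually come back?
            span.set_attribute("invoice.id_returned", result["invoice_id"] is not None)
            return result

    with tracer.start_as_current_span("agent.run") as root:
        root.set_attribute("agent.task", "create invoice for acme")
        create_invoice("acme", 1200.0)

A log line records that create_invoice ran and returned ok. The trace records that it returned ok without an invoice id, which is the difference between a green dashboard and a caught failure.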

Resources mentioned:

  • Arize Phoenix
  • OpenTelemetry
  • Google Agent Quality white paper (Chapter 3)
  • Part 1: How to Know If Your AI Agent Actually Works

What I shipped: doc-sentinel

Continuing the documentation tooling thread. doc-sentinel is a drift detector. It hooks into every commit, scans docs for references to changed code, and flags anything starting to go stale before it compounds.

The documentation lifecycle in claude-code-workflows now has four layers. agent-ready scaffolds the structure. doc-sync writes the content. doc-sentinel guards against drift. doc-audit runs the full validation on demand. Each one does one job and hands off cleanly to the next.

The mechanics lean on Claude Code's hook system, not git hooks. A PostToolUse hook watches for commits happening inside a Claude Code session, extracts the changed source files, cross-references them against every doc that mentions those paths, modules, or commands, and queues warnings in a local file. A Stop hook surfaces the queue at the end of the session with a prompt to resolve. Run /doc-sentinel:resolve and it groups warnings by file, separates real drift from false positives, and commits fixes with a docs: prefix.
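The cross-referencing step is the part people ask about most, so here's a rough sketch of the idea in Python. This is not doc-sentinel's actual code: the docs/ folder of Markdown files, the .py filter, and the substring matching are assumptions for the example, and the real plugin also matches modules and commands rather than just paths.

    import subprocess
    from pathlib import Path

    def changed_source_files(commit: str = "HEAD") -> list[str]:
        # Files touched by the given commit (what the PostToolUse hook would extract).
        out = subprocess.run(
            ["git", "diff-tree", "--no-commit-id", "--name-only", "-r", commit],
            capture_output=True, text=True, check=True,
        )
        return [line for line in out.stdout.splitlines() if line.endswith(".py")]

    def docs_mentioning(paths: list[str], docs_dir: str = "docs") -> dict[str, list[str]]:
        # Cross-reference each changed path (and its bare module name) against every doc.
        warnings: dict[str, list[str]] = {}
        for doc in Path(docs_dir).rglob("*.md"):
            text = doc.read_text(encoding="utf-8")
            hits = [p for p in paths if p in text or Path(p).stem in text]
            if hits:
                warnings[str(doc)] = hits
        return warnings

    if __name__ == "__main__":
        for doc, refs in docs_mentioning(changed_source_files()).items():
            print(f"possible drift: {doc} references {', '.join(refs)}")

In the plugin, output like this gets queued to a local file instead of printed, and the Stop hook is what turns the queue into a prompt at the end of the session.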

Docs that drift without anyone noticing degrade agent performance faster than almost anything else. This is the layer that catches it as it happens instead of during a quarterly audit.

claude-code-workflows on GitHub →


What I'm reading and watching

Jesse Genet on a16z — Jesse homeschools four kids with the help of 11 agents (10 of them on OpenClaw) running on always-on Mac Minis. Her homeschool agent, Sylvie, takes voice notes and photos from each lesson and writes them into her Obsidian vault as entries like "Quinn math March 17th." Other agents read from there later. She's been posting about the setup on X for a while, and it's one of the more thoughtful takes on AI at home I've come across. One thing to know going in: the episode leans more philosophical than technical.

Claire Vo open-sourced tradclaw — An OpenClaw starter repo for home operations: family calendar briefs, school email triage, meal plans, homework logging from photos, helper payment reminders, home maintenance rhythms. Meant to be forked and tailored. Where Jesse's podcast describes the use case, tradclaw is a working repo you can clone and adapt. If you've been curious about OpenClaw and didn't know where to start, this is a cleaner on-ramp than the docs.

Claude Code desktop redesign — Anthropic shipped a redesigned Claude Code desktop app this week. Multiple sessions are now front and center in the experience, which matches how most people I know are running it (one session per branch, one per sub-agent, one per side quest). Haven't tried it yet. Looks like the right shape.


If you're running agents in production and aren't sure whether your dashboard is watching the right layer, the Agent Eval Scorecard is the companion piece. One page, four layers, five minutes to know which part of your eval setup is the weakest link. Quality metrics are the layer observability feeds. The scorecard is how you find out whether yours is being fed.


If you've hit a silent failure in a production agent and want to walk through where the observability gap is, reply to this email or book an intro call. Always happy to trade notes.

Damian
