2026-06-02
Issue 024 — Your eng team is one model upgrade away from solving the wrong problem. The expensive failures in production AI right now aren't where the model gets the answer wrong — they're where the answer gets passed to a human, another agent, or a backend system that can't accept it. Senior tech leaders are still funding the inference layer and starving the seam.
The most-cited cautionary tale in AI customer-service writing this spring is still Klarna. In February 2024 the company shipped an OpenAI-built agent that handled 2.3 million customer chats in its first thirty days and automated 67% of conversations. Eighteen months later it was hiring human agents back. CEO Sebastian Siemiatkowski's own postmortem was unsparing: "We focused too much on efficiency and cost. The result was lower quality, and that's not sustainable." What's interesting for tech leaders is where exactly the quality dropped. It was not the agent's accuracy on tier-1 questions — that held. The damage was concentrated in the 5–20% of conversations the agent should have escalated and either didn't, escalated badly, or escalated without enough context for the human to recover the customer's trust.
Klarna isn't an outlier. Galileo's 2026 production-reliability writeup names losing context during handoff — customers repeating information the AI already collected — as the single most critical agent failure mode to monitor. inFeedo's HR-agent analysis lands at the same place: the most common production failure isn't a wrong answer, it's a seam failure — the agent's intent classification, attempted resolution, and confidence score never reach the human who picks up the conversation. The customer experiences this as starting over. The org experiences it as an unmeasured trust tax that compounds every escalation.
This is not a model problem. Newer models do not fix it. Bigger context windows do not fix it. Cheaper inference does not fix it. The fix is design work at the seam between agent and not-agent — and most senior tech leaders are still funding the wrong half of the system.
There are two failure surfaces in any production AI feature. Inside the model: the inference is wrong, the prompt is wrong, the retrieval is wrong, the tool selection is wrong. At the seam: the model produced something correct, and then the artifact crossed a boundary to a human, another agent, a backend API, a queue, a CRM, or a ticketing system, and the boundary did not preserve enough state for the next actor to continue gracefully. In nearly every production postmortem I've read in the last six months (Klarna's, the HR-agent vendors', the multi-agent-orchestration writeups from Atlan and Aetherlink), the second failure surface is bigger than the first, and growing faster.
That ratio matters because the two surfaces have totally different unit economics for senior leaders. Inside-the-model work has a clean buyer story: pick a better model, run more evals, hire a context engineer. Vendors have lined up to sell each step. Seam work has no comparable buyer story. It is design work, integration work, schema work, observability work, written-protocol work. It is unglamorous, cross-team, and almost always under-staffed because no AI roadmap line item says "fix the boundary between the agent and Zendesk." The result is a budget profile where roughly 80% of the spend goes to the 20% of the system that already works, while the seams where most failures happen get what is left.
Three categories of seam failure show up repeatedly enough that a senior tech leader should be able to name and instrument all three before the next eng review.
Agent-to-human handoffs. This is the Klarna failure. When the agent escalates, the human should inherit who the customer is, what they tried, what the agent already said, how confident the agent was, and why it escalated. The Loris/Poly.AI postmortems on Klarna identify exactly this gap: the handoff did not carry classified intent, retrieved context, attempted response, or confidence score. The human walked into a cold conversation with a customer who had already explained themselves twice. The customer experienced the second explanation as the failure event, not any earlier mistake by the agent. The metric a senior leader should instrument: percent of escalations where the human starts with a complete state payload, measured at the receiving system, not at the agent. Almost no one tracks this today. Everyone reports "escalation rate" instead, which is the wrong number.
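A minimal sketch of that metric, assuming the receiving system can log each escalation payload as it arrives. The field names are hypothetical, chosen to mirror the four items the postmortems list; your ticketing system's own field names will differ.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical field names, mirroring the four items named in the postmortems.
REQUIRED_FIELDS = (
    "classified_intent",
    "retrieved_context",
    "attempted_response",
    "confidence",
)


@dataclass
class HandoffPayload:
    """One escalation as seen by the receiving system, not by the agent."""
    conversation_id: str
    classified_intent: Optional[str] = None
    retrieved_context: Optional[str] = None
    attempted_response: Optional[str] = None
    confidence: Optional[float] = None


def missing_fields(payload: HandoffPayload) -> list[str]:
    """Return the required fields that arrived as None or an empty string."""
    return [
        name for name in REQUIRED_FIELDS
        if getattr(payload, name) in (None, "")
    ]


def complete_handoff_rate(received: list[HandoffPayload]) -> float:
    """Share of escalations that arrived with every required field present."""
    if not received:
        return 0.0
    complete = sum(1 for p in received if not missing_fields(p))
    return complete / len(received)
```

The design choice that matters is where this runs: on payloads read back from the human's queue, so a field the agent sent but the integration dropped still counts as missing.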
Agent-to-agent handoffs. As multi-agent systems become the default architecture (Gartner now expects 40% of enterprise apps to include task-specific agents by year-end 2026), the handoff failure mode multiplies. Atlan's 2026 orchestration guide is blunt: agents create "distributed decision-making systems without unified cognition," and the most common production failure is one agent passing partial or schema-mismatched output to the next. The receiving agent retries on incomplete state and hands the failure back, and the system enters an infinite loop that is invisible in any single agent's logs. The shared-context layer (a governed business glossary, a certified data dictionary, an instrumented log of every agent interaction) is the senior-leader investment that prevents this. Almost no enterprise has staffed the team that owns it.
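One way to make that loop visible is to carry a hop history inside the message itself and refuse any handoff that is incomplete or over budget. The sketch below is an illustration under assumptions, not a pattern from Atlan's guide: `Envelope`, `hand_off`, and the `MAX_HOPS` budget are all hypothetical names.

```python
from dataclasses import dataclass, field

MAX_HOPS = 4  # assumed budget; tune per workflow


class HandoffError(Exception):
    """Raised when a handoff should stop and surface to a human or an alert."""


@dataclass
class Envelope:
    trace_id: str
    payload: dict
    hops: list[str] = field(default_factory=list)  # agents visited so far


def hand_off(envelope: Envelope, to_agent: str, required_keys: set[str]) -> Envelope:
    """Check the payload against the receiver's needs, then record the hop."""
    missing = required_keys - set(envelope.payload)
    if missing:
        # Fail at the seam instead of letting the receiver retry on partial state.
        raise HandoffError(f"{to_agent} cannot accept payload, missing: {sorted(missing)}")
    if len(envelope.hops) >= MAX_HOPS:
        # The hop history travels with the message, so the loop shows up in one place.
        raise HandoffError(f"hop budget exhausted, path so far: {envelope.hops}")
    return Envelope(envelope.trace_id, envelope.payload, envelope.hops + [to_agent])
```

Because the history lives on the envelope and not in any one agent's logs, the error message alone is enough to reconstruct the ping-pong path.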
Agent-to-system handoffs. The agent writes the right answer in the wrong shape. Field names don't match, the CRM rejects the payload, the queue silently drops the message, the downstream service throws a 4xx that no one is paged on. This is the failure mode that looks most like a normal engineering integration bug, and it gets fixed by normal engineering integration discipline, except that the agent has generated the schema, not received it from a deterministic pipeline. The instrumentation that catches this is end-to-end tracing of agent output through the downstream system, with assertions about what counts as a successfully landed outcome. Honeycomb's Agent Timeline, which entered early access at O11yCon last month, was built specifically for this: GenAI spans, tool calls, and backend services on one correlated timeline so a leader can see where the trace went sideways at the seam, not just whether the agent ran.
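A rough sketch of what "assertions about a landed outcome" could look like in code, assuming the downstream system offers some way to read a record back after writing it. The contract, the field names, and the `write` and `read_back` callables are hypothetical stand-ins for a real CRM client; this does not use any Honeycomb API.

```python
from typing import Callable, Optional

# Hypothetical contract for a CRM ticket write: field name -> expected type.
CRM_TICKET_CONTRACT = {"customer_id": str, "intent": str, "priority": int}


def contract_violations(payload: dict, contract: dict) -> list[str]:
    """List every way the agent-generated payload departs from the contract."""
    problems = []
    for name, expected in contract.items():
        if name not in payload:
            problems.append(f"missing field: {name}")
        elif not isinstance(payload[name], expected):
            problems.append(f"wrong type for {name}: {type(payload[name]).__name__}")
    for name in payload:
        if name not in contract:
            problems.append(f"unknown field: {name}")
    return problems


def write_and_verify(
    payload: dict,
    contract: dict,
    write: Callable[[dict], str],
    read_back: Callable[[str], Optional[dict]],
) -> str:
    """Return 'landed', 'not_landed', or 'rejected_at_seam: <reasons>'."""
    problems = contract_violations(payload, contract)
    if problems:
        return "rejected_at_seam: " + "; ".join(problems)
    record_id = write(payload)
    # A successful write call is not proof of landing; read the record back.
    if read_back(record_id) != payload:
        return "not_landed"
    return "landed"
```

The three outcomes are worth counting separately: "rejected_at_seam" is a shape problem in the agent's output, while "not_landed" is the silent drop that nobody gets paged on.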
The intellectual move underneath all three of these is older than the AI conversation. G. Lynn Shostack wrote Designing Services that Deliver in the January 1984 Harvard Business Review to give service businesses a way to map the entire delivery system on a single page, with explicit lines of visibility and lines of internal interaction — the seams where one part of the system hands off to another and where most failures happen. Forty-two years later, with "service" reinterpreted as an AI-mediated workflow, the same map is the right tool. The customer's journey is the top swim lane. The agent's actions are the next lane. The human or downstream system actions are the next. The dotted lines between lanes are where AI features either work or break — and they are the lines almost nobody is drawing.
The senior tech leader move for this quarter is to insist on a service blueprint, in the Shostack sense, for every AI feature already in production and every one on the roadmap. Not a flow diagram. Not a system architecture diagram. Specifically a blueprint that draws the lines of visibility and the lines of internal interaction — and forces the team to put failure-mode and recovery-action annotations on every seam. The artifact is the deliverable. The discipline of producing it is the leverage. Teams that build the blueprint find the work they were not funding; teams that don't will ship more Klarnas.
Try this week. Pick one AI feature in your portfolio. In a 90-minute working session, draw a four-lane service blueprint: customer (or end-user), agent, human responder, downstream systems. Map the happy path on top. Underneath each lane boundary, annotate every state that must cross the seam for the next lane to succeed. Then, for each crossing, write the failure mode and the recovery action in one sentence. Whichever seam has the most unanswered cells is your next investment, not the model upgrade your team is currently asking for.
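If the whiteboard output ends up in a spreadsheet, the "most unanswered cells" ranking is easy to compute. A minimal sketch, assuming one row per state that crosses a seam; the `Crossing` structure and the seam labels are illustrative, not part of Shostack's method.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Crossing:
    seam: str                  # e.g. "agent -> human responder"
    state: str                 # the state that must cross this seam
    failure_mode: str = ""     # one sentence; empty means unanswered
    recovery_action: str = ""  # one sentence; empty means unanswered


def unanswered_cells(crossings: list[Crossing]) -> dict[str, int]:
    """Count blank failure-mode and recovery-action cells per seam."""
    counts: dict[str, int] = {}
    for c in crossings:
        blanks = sum(
            1 for cell in (c.failure_mode, c.recovery_action) if not cell.strip()
        )
        counts[c.seam] = counts.get(c.seam, 0) + blanks
    return counts


def next_investment(crossings: list[Crossing]) -> Optional[str]:
    """Return the seam with the most unanswered cells, or None if no rows exist."""
    counts = unanswered_cells(crossings)
    if not counts:
        return None
    return max(counts, key=counts.get)
```

Ties resolve to whichever seam was entered first, which is a reasonable prompt to go back and argue about it in the room.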
What it is. A planning and design technique introduced by G. Lynn Shostack — at the time a vice president at Citibank — in Designing Services that Deliver (HBR, Jan 1984). The blueprint maps the entire delivery of a service on a single diagram, organized in swim lanes that distinguish what is visible to the customer (frontstage) from what happens out of view (backstage), with explicit horizontal lines that mark the most failure-prone boundaries: the line of interaction (customer ↔ service), the line of visibility (frontstage ↔ backstage), and the line of internal interaction (backstage employees ↔ support systems). The point of the technique is that the lines are where most service failures originate and where most improvement work pays back — and they are the parts of the system that org charts and process docs typically don't show.
When to use it. Any time a new AI feature, customer-facing workflow, multi-agent system, or cross-team service is being designed or post-mortemed. Particularly useful when the team can describe the happy path but cannot describe what happens at the seams, when a feature is "shipped" but the operational metrics are mysteriously bad, or when stakeholders disagree on whether a failure is an AI failure, a process failure, or an integration failure (the blueprint usually shows it's the third).
How to run it:
When NOT to use it. Don't blueprint when the team is at a model-selection or prompt-engineering stage and there is no production journey yet — the blueprint needs a real workflow to bite on. Don't use it as documentation theater after the fact ("we made the diagram, ship it"); the value is in the unanswered seam annotations, not the polished diagram. And don't blueprint a system where authority is contested — agree on the journey owner first.
Example. A 4,000-engineer retailer ran a Shostack-style blueprint on their AI returns-processing flow in early 2026 after an internal incident. The blueprint surfaced that the agent's "approve refund" handoff to the warehouse-management system carried the order ID but not the return reason — which the WMS needed to route correctly. The downstream re-routing failure was being logged as a fulfillment metric, not an AI metric, and had been growing for three months. Fix took two weeks. The seam was visible the moment they drew the diagram.
Anthropic Memory for Managed Agents — 97% fewer first-pass errors at Rakuten — Public beta since April 23, 2026. Memories mount as filesystem state so agents avoid repeating prior errors across sessions. Rakuten's headline numbers: 97% fewer first-pass errors, 27% lower cost, 34% lower latency. The leadership read isn't "buy this" — it's that cross-session memory is now an infrastructure layer, not a feature. If your org is still building per-feature memory shims, you're paying twice.
Honeycomb Agent Timeline (O11yCon 2026) — Early-access feature that correlates GenAI spans, tool calls, and backend services on one timeline. Charity Majors' framing at O11yCon: most engineering teams have a limited understanding of what their agents actually did or whether they delivered value. If you can't reconstruct an agent run end-to-end across the seams, your blueprint is fictional.
Lenny's Podcast — Simon Willison on agentic engineering (April 2026) — Willison's argument is that November 2025 was the inflection point where coding agents moved from "mostly works" to "actually works." His three daily patterns — red/green TDD, templates, hoarding — are worth a 30-minute listen for any senior leader still rationing AI tool budget for their team.
"The actions of contact people, support people, and management are linked together to support the customer's journey. The most failure-prone parts of a service are the seams between them."
— G. Lynn Shostack, Designing Services that Deliver, Harvard Business Review, January 1984