2026-05-26
May 26, 2026
Issue 019 — The leverage of the last AI cycle was the prompt. The leverage of this one is the context. Most teams are still over-funding the wrong craft — and the model-spend line item is the most expensive evidence that they haven't noticed yet.
A staff engineer I worked with spent the back half of 2024 writing the most beautiful system prompt I've ever read. Sixteen revisions. Three weekly office hours with the product team. A small ceremony at the engineering all-hands when it shipped. Twelve months later, the same engineer is running a project to delete most of it — because the production cost spike that landed in Q1 isn't from the prompt being wrong. It's from the context the prompt now has to ride on top of: tool schemas that 2-3x in size after JSON serialization, conversation histories that compound every step, retrieved chunks that nobody is curating, and the second-best decisions of a coding agent left in the buffer for ten more turns.
That story is now the median story. In June 2025, two of the most-listened-to voices in applied AI — Andrej Karpathy and Shopify CEO Tobi Lütke — independently posted that the term we should be using is no longer prompt engineering, it is context engineering. Lütke's framing: "the art of providing all the context for the task to be plausibly solvable by the LLM." Karpathy's: "the delicate art and science of filling the context window." Anthropic's engineering blog made it canonical last September with Effective context engineering for AI agents, defining the discipline as "finding the smallest possible set of high-signal tokens that maximize the likelihood of some desired outcome," and framing the shift as the natural evolution of prompt engineering, not its replacement.
For senior tech leaders, this is not a vocabulary change. It is an operating-model change with budget, hiring, and architecture consequences — and most orgs haven't caught up.
Last Tuesday this newsletter argued that evals are the leverage point you have to fund before you ship AI features. Today's argument is its twin. Even with great evals, the artifact your team will most often need to change to fix a failing agent is not the prompt and not the model — it is the context the model is reasoning over. Evals tell you when it's broken. Context engineering is what you fix.
Three shifts in the last twelve months turned this from a vocabulary debate into a budget-line conversation.
The first is that the cost shape of an agent is structurally different from the cost shape of a chatbot. A chatbot sends one message, gets one response, and stops. An agent runs a loop — plan, tool call, file read, intermediate output, verify, retry — and each step sends the entire accumulated context back to the LLM. By turn 20, you are paying for the same system prompt twenty times. LeanOps's analysis this quarter put it in two numbers nobody on a finance team can argue with: agents make 3–10x more LLM calls than simple chatbots, and burn roughly 50x more tokens than equivalent chat workloads. A March 2026 Gartner survey of 353 D&A and AI leaders found that only 44% of orgs have adopted any kind of AI FinOps practice. The other 56% are about to discover that context length is a cost function, and the only lever they have on it is engineering the context itself.
The second is that "context rot" is real and measurable. Anthropic's engineering team named the effect: as context length grows, the model's effective attention budget shrinks, and you get diminishing returns from stuffing more tokens into the window. The tokens are in working memory in a literal technical sense — but reasoning quality drops because the relevant ones drown in the irrelevant ones. The implication for tech leaders is uncomfortable. The seemingly safe move — just send more context — gets you both higher cost and worse output. Bigger is not better. Denser is.
The third is that the highest-leverage actions are all platform actions, not application actions. Anthropic's post enumerates three concrete strategies that move the needle in production: compaction (summarizing earlier turns so the agent keeps the gist but discards the verbosity), tool-result clearing (drop tool outputs from the buffer once they've been consumed — the second-cheapest 30% cost reduction in the industry right now), and memory (decoupling cross-session knowledge from the live context window — the design behind Anthropic's Memory for Managed Agents that shipped on April 23, 2026, with Rakuten reporting a 97% reduction in agent error rates as an early customer). Each of these is a platform capability, owned by an infrastructure team, that improves every application built on top of it. That is what makes this a director-and-above conversation: the leverage point is no longer "which team has the best prompt," it's "which platform team owns context as a first-class concern, and is it funded."
If your AI org chart still has the prompt-engineer-embedded-in-each-feature-team shape, you are operating on the 2024 mental model. The 2026 shape has a context-platform team sitting under the AI infrastructure org, owning the eval harness and the memory layer and the tool registry and the compaction policy and the per-agent token-budget guardrails. Feature teams use those affordances; they don't reinvent them. The good news is that the cost case writes itself — Anthropic's documentation calls out that simply caching repeated system instructions yields roughly 90% reduction on input costs and 75% reduction on latency for the cached prefix, before you do anything else clever. The bad news is that almost nobody has staffed the team that would do it.
There is a second-order consequence that lands directly in the senior TPM and tech-leader inbox. DORA's May 2026 ROI of AI-assisted Software Development report put the right hammer on the right nail: AI is not an automatic productivity gain, it is a multiplier of existing engineering conditions. Teams with strong platform foundations get the 30–40% throughput lift. Teams without them get higher change failure rates and more rework on the same investment. The platform foundation that didn't exist in the DORA model two years ago — and that the framework will need to absorb — is context infrastructure. The orgs that build it as a platform capability will get the lift. The orgs that leave it as a feature-team problem will eat both the cost and the instability.
The senior-leader move for this quarter is to stop treating context engineering as a craft skill to hire for, and start treating it as a platform discipline to fund.
Try this week. Pull last month's LLM bill and break it down by agent loop step, not by feature. If you can't do that today, that's the first action. Once you can, find the single agent path with the worst token-per-successful-outcome ratio and propose one of three changes: compact its conversation history at turn N, clear stale tool outputs from its buffer, or move its repeated system instructions into prompt caching. Pick the smallest one. Ship it. Measure the bill the next week.
What it is. A summarization technique that produces an initial entity-sparse summary, then iteratively rewrites it — adding 1–3 missing salient entities each pass without increasing length — until you reach a high-density version. In context-engineering terms: a structured way to compress an agent's running history, retrieved chunks, or tool dumps down to the smallest set of high-signal tokens without losing the entities the model needs to keep reasoning.
When to use it. When you need to compact earlier turns, RAG results, or document context the agent will keep referring back to, and a generic one-shot summary loses too much signal. Particularly useful at the turn-N compaction boundary in long agent loops, and for distilling long retrieval results into a stable in-context briefing.
How to run it:
When NOT to use it. Don't reach for CoD when the bloat is from tool definitions (use tool curation and result clearing instead), when the context contains structured state the model must reproduce exactly (use deterministic templates, not summarization), or for sub-100-token segments where there is no slack to compress.
Example. An incident-response agent has been running for 40 turns and is approaching its context limit. Apply CoD to turns 1–25 to produce a 400-token dense summary preserving every alert ID, owner, decision, and rollback step. Replace turns 1–25 in the buffer with the summary. The agent keeps reasoning forward with the same effective memory at a fraction of the per-step cost.
Effective context engineering for AI agents — Anthropic's canonical post on the discipline: the smallest set of high-signal tokens framing, plus the three production strategies (compaction, tool-result clearing, memory). The reference doc to share with any platform lead who still treats prompts as the lever.
AI Cost Observability: Measuring and Justifying Token Spend in 2026 — Vantage's FinOps-for-AI primer, with the order-of-magnitude variation across model choice, context depth, and session length on the same workload. The chart to attach when your CFO asks why the AI line item just doubled.
New DORA Report Claims Strong Engineering Foundations Drive AI Return on Investment — InfoQ's read of the May 2026 DORA ROI report. The single most useful framing for a board deck: AI is a multiplier of platform foundations, not a substitute for them. Context infrastructure is the next foundation it multiplies.
"The art of providing all the context for the task to be plausibly solvable by the LLM."
— Tobi Lütke (Shopify CEO), on why "context engineering" is the better term, June 19, 2025
Don't miss what's next. Subscribe to Critical Path: