2026-05-12
"Token maximizing" — the practice of celebrating engineers who burn the most AI tokens — is the productivity-metric mistake of the decade. Tech leaders measuring AI leverage by tokens are reading the same gauge that misled the industry in 1998. The fix is unit economics: cost per merged PR, cost per resolved incident, cost per decision eliminated from a human queue.
The trend has a name now. Gergely Orosz used a recent Pulse to put a label on something engineers at Meta and Microsoft had been DMing him about for months: "tokenmaxxing" — the practice of consuming as many AI tokens as possible because internal leaderboards reward it, regardless of whether the output ships. He likens it to the lines-of-code metric our industry abandoned in the late 1990s for the same reason: it was easy to game, and the people gaming it weren't the best engineers.
The cost trajectory makes this urgent. The State of FinOps 2026 report found that 98% of FinOps practices now manage AI spend — up from 31% two years ago — and that most organizations are running 4–5x over their original AI budgets. The same survey lists "AI cost management" as the single most-desired skill across orgs of every size. FinOps now reports into the CTO/CIO org at 78% of companies, an 18-point jump since 2023. This is no longer a finance problem the platform team can hand off.
The move is to stop reporting AI by tokens consumed and start reporting it by business outcomes per dollar. Cost per merged PR. Cost per resolved support ticket. Cost per decision that didn't need a human in the loop. The published numbers are stark — agent-resolved support tickets run roughly $0.46 versus $4.18 human-handled; agent-completed code reviews run $0.72 against $48 of senior-engineer time — but you cannot capture that compounding leverage on a dashboard whose top widget is "tokens per engineer."
There are three reasons a token-count dashboard feels right to a director or VP of engineering and is wrong anyway.
The first is that tokens are an input, not an output. They are the AI equivalent of "hours at desk." The thing you actually care about — work shipped, decisions automated, queues drained — sits one or two transformations downstream. Measuring the input lets you brag at all-hands. It does not let you forecast next quarter's AI bill against next quarter's throughput. And it does nothing for the executive question you are about to be asked: what did our AI spend buy us?
The second is that tokens are gameable in ways that look like productivity. An engineer who reflexively pastes the entire repo into every prompt, asks for "thorough" reviews from three different models, then commits the consensus, will rank well on a token-consumption leaderboard. Their PRs will also be slow, expensive, and no better. Gergely's reader survey captured the early symptoms: 30% of respondents had hit AI usage limits, and the common responses (switching tools, upgrading plans, moving to API pricing) all increase spend without increasing output. None of that shows up on a token gauge; all of it shows up on the invoice.
The third reason is more strategic. Token metrics do not reveal the right interventions. If your engineering org is paying $2.4M a quarter in AI inference and your DORA throughput hasn't moved (see yesterday's issue), the question is not "are we using AI enough?" — it is "where is the AI work, and what is it producing?" A token dashboard tells you neither. A unit-economics dashboard tells you both.
So what is the alternative? Build a unit-economics layer on top of your AI spend. The minimum-viable instrumentation is three derived metrics stitched from three different invoices:
Cost per merged PR. Total AI spend tagged to engineering, divided by the count of merged PRs whose lifecycle touched an AI tool. Compare quarter over quarter and team to team. The right teams will get cheaper as they learn the tools; the wrong teams will get more expensive without getting faster. ICONIQ's 2026 State of AI Snapshot reports median engineering payback at 9.3 months — much longer than the 4.1 months in customer support — so the metric needs to be tracked through that window, not declared at the end of a six-week pilot.
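As a rough illustration of the arithmetic, here is a minimal Python sketch of cost per merged PR with a quarter-over-quarter comparison. The `TeamQuarter` record, its field names, and every figure in `rows` are hypothetical; it also assumes your billing data can already be tagged by team and your PR data can already flag AI-touched PRs, which is usually the hard part.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TeamQuarter:
    team: str
    quarter: str
    ai_spend_usd: float         # AI spend tagged to this team for the quarter
    ai_touched_merged_prs: int  # merged PRs whose lifecycle touched an AI tool


def cost_per_merged_pr(row: TeamQuarter) -> Optional[float]:
    """AI spend divided by AI-touched merged PRs; None when no such PRs merged."""
    if row.ai_touched_merged_prs == 0:
        return None
    return row.ai_spend_usd / row.ai_touched_merged_prs


def quarter_over_quarter_change(
    previous: TeamQuarter, current: TeamQuarter
) -> Optional[float]:
    """Fractional change in cost per merged PR; negative means cheaper."""
    prev_cost = cost_per_merged_pr(previous)
    curr_cost = cost_per_merged_pr(current)
    if prev_cost is None or curr_cost is None or prev_cost == 0:
        return None
    return (curr_cost - prev_cost) / prev_cost


# Hypothetical figures, for illustration only.
rows = [
    TeamQuarter("payments", "2026-Q1", 42_000.0, 600),
    TeamQuarter("payments", "2026-Q2", 45_000.0, 900),
    TeamQuarter("search", "2026-Q1", 30_000.0, 500),
    TeamQuarter("search", "2026-Q2", 48_000.0, 480),
]

for previous, current in [(rows[0], rows[1]), (rows[2], rows[3])]:
    change = quarter_over_quarter_change(previous, current)
    cost = cost_per_merged_pr(current)
    print(f"{current.team}: ${cost:,.2f} per merged PR, change {change:+.0%}")
```

In this made-up data, one team spends more in total but gets cheaper per PR, while the other gets more expensive without merging more. That is the pattern the metric is meant to surface, and a raw token count would hide it.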
Cost per resolved ticket. For any team where AI agents triage, summarize, draft, or close: total AI spend on that workload divided by tickets closed. The benchmark to beat is the fully loaded cost of the human alternative: fully loaded hourly compensation divided by tickets resolved per hour. If your dashboard shows $0.46 per agent-resolved ticket against a $4.18 human baseline, you have a roughly 9x leverage story to take to the board. If it shows $7.20, you have an experiment that needs to end.
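A small Python sketch of that comparison, under stated assumptions: the function names are hypothetical, and the inputs (hourly compensation, tickets per hour, spend, ticket counts) are invented so that the results land on the $0.46 and $4.18 figures quoted above. Substitute your own workload data.

```python
def human_cost_per_ticket(loaded_hourly_comp_usd: float, tickets_per_hour: float) -> float:
    """Fully loaded hourly compensation divided by tickets resolved per hour."""
    if tickets_per_hour <= 0:
        raise ValueError("tickets_per_hour must be positive")
    return loaded_hourly_comp_usd / tickets_per_hour


def agent_cost_per_ticket(ai_spend_usd: float, tickets_closed: int) -> float:
    """Total AI spend on the workload divided by tickets the agents closed."""
    if tickets_closed <= 0:
        raise ValueError("tickets_closed must be positive")
    return ai_spend_usd / tickets_closed


def leverage(human_cost: float, agent_cost: float) -> float:
    """How many times cheaper the agent path is; below 1.0 means it costs more."""
    if agent_cost <= 0:
        raise ValueError("agent_cost must be positive")
    return human_cost / agent_cost


# Hypothetical inputs chosen to reproduce the figures quoted in the text.
human = human_cost_per_ticket(loaded_hourly_comp_usd=41.80, tickets_per_hour=10)
agent = agent_cost_per_ticket(ai_spend_usd=4_600.0, tickets_closed=10_000)

print(f"human ${human:.2f}, agent ${agent:.2f}, leverage {leverage(human, agent):.1f}x")
print(f"at $7.20 per ticket, leverage {leverage(human, 7.20):.2f}x")
```

A leverage value below 1.0, as in the $7.20 case, means the agent path costs more per ticket than the human baseline.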
Cost per decision eliminated. This is the one that requires judgment, not just SQL. List the decisions a senior engineer or TPM used to walk into a room to make — release/no-release, scope cut/no-cut, vendor approval, on-call escalation, design-review sign-off. For each, classify whether AI has eliminated it (the decision is now automated against a written policy), accelerated it (the human still decides, but in 1/5 the time), or done nothing. Then divide quarterly AI spend by the count of eliminated decisions. This is the metric Charity Majors has been hammering on at Honeycomb — that observability has to connect AI investments to impact, not just measure inference latency — and it is the one your CFO will eventually demand. The State of FinOps survey already names "unit economics" as the new operating discipline: the cost to produce one unit of business value — one inference request, one completed session, one customer served.
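One way to sketch the bookkeeping in Python. The text does not say whether to count decision types or individual occurrences; this sketch assumes occurrences per quarter. The `DecisionType` record, the occurrence counts, and the spend figure are all hypothetical, and the eliminated/accelerated/unchanged labels still have to come from human judgment.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Effect(Enum):
    ELIMINATED = "eliminated"    # now automated against a written policy
    ACCELERATED = "accelerated"  # a human still decides, but faster
    UNCHANGED = "unchanged"      # AI has done nothing for this decision


@dataclass(frozen=True)
class DecisionType:
    name: str
    effect: Effect
    occurrences_per_quarter: int  # assumption: we count occurrences, not types


def eliminated_count(decisions: List[DecisionType]) -> int:
    return sum(
        d.occurrences_per_quarter for d in decisions if d.effect is Effect.ELIMINATED
    )


def cost_per_decision_eliminated(
    quarterly_ai_spend_usd: float, decisions: List[DecisionType]
) -> Optional[float]:
    """Quarterly AI spend divided by eliminated decisions; None if there are none."""
    count = eliminated_count(decisions)
    return None if count == 0 else quarterly_ai_spend_usd / count


def share_eliminated(decisions: List[DecisionType]) -> float:
    """Fraction of all decision occurrences that no longer need a human."""
    total = sum(d.occurrences_per_quarter for d in decisions)
    return 0.0 if total == 0 else eliminated_count(decisions) / total


# Hypothetical inventory and spend, for illustration only.
inventory = [
    DecisionType("release/no-release", Effect.ELIMINATED, 120),
    DecisionType("scope cut/no-cut", Effect.ACCELERATED, 40),
    DecisionType("vendor approval", Effect.UNCHANGED, 30),
    DecisionType("on-call escalation", Effect.ELIMINATED, 80),
    DecisionType("design-review sign-off", Effect.ACCELERATED, 130),
]

print(cost_per_decision_eliminated(6_300.0, inventory))
print(f"{share_eliminated(inventory):.0%} of decisions eliminated")
```

Keeping the accelerated bucket separate, rather than folding it into eliminated, stops the denominator from inflating the leverage story.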
A note on the politics. Engineers and PMs will resist this kind of dashboard, often because they fear it will be used to rank individuals. It shouldn't be. Unit economics belongs at the team and workflow level, not the individual level. The same logic that made DORA team-level (and that made the SPACE framework explicitly individual-protective) applies here. If you push cost-per-PR down to the engineer, you will rediscover Goodhart's Law in your retention numbers next quarter. Cost-per-decision is a leadership instrument, not a performance instrument.
The deepest reason to make this switch is that the discipline you build around unit economics is also the discipline that lets you tell the AI-strategy story up the chain. Your CEO is going to ask, this year, what your AI investments returned. "We used 2.4 billion tokens" is not an answer. "We eliminated 38% of the decisions that used to require a Director or above, at $0.63 per eliminated decision, and we expect engineering payback in Q3" is.
Try this week. Pick the team with the largest AI spend in your org. Replace the top metric on their dashboard — whatever it currently is — with cost-per-merged-PR (for engineering) or cost-per-resolved-ticket (for support/ops). Hold the rest of the dashboard constant. In one week, ask the team's lead what changed in how they used the tools. That behavioral delta is your AI roadmap, on a postcard.
What it is. Cynefin is a sense-making framework that classifies a problem into one of five domains: Clear (cause and effect are knowable; best practice applies), Complicated (analysis required; good practice applies), Complex (cause and effect only visible in retrospect; requires probe-sense-respond), Chaotic (no useful patterns; requires act-sense-respond), and Disorder (you have not yet decided which of the other four you are in). It is deliberately not a 2x2 matrix, and the Disorder zone is where most of your standing meetings actually live.
When to use it. When you are deciding which work to give to AI agents, which to keep human, and which to encode in a written policy. Or any time a stakeholder is using the wrong tool for the domain — demanding a Gantt chart for what is actually a Complex problem, running a brainstorm for what is actually Clear.
How to run it:
When NOT to use it. Do not use Cynefin to classify a single one-off decision; the framework's value compounds when applied to a portfolio of recurring work. And do not use it as a label-and-leave exercise — domains are dynamic, and a quarterly re-pass is the minimum.
Example: a platform team at a Fortune 500 used Cynefin to triage 24 recurring engineering workflows, classified 9 as Clear or Complicated, and wrote agent rubrics for each. Eight months later those 9 workflows ran almost entirely on agents — and the team's initiative count was up 30% without added headcount.
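For teams that want to keep such a triage in code rather than on a whiteboard, a minimal Python sketch follows. Treating Clear and Complicated workflows as candidates for agent rubrics mirrors the example above; routing everything else to humans is a simplifying assumption, not part of the framework. The workflow names are hypothetical, and the domain labels themselves have to come from people who know the work.

```python
from enum import Enum
from typing import Dict, List


class Domain(Enum):
    CLEAR = "clear"
    COMPLICATED = "complicated"
    COMPLEX = "complex"
    CHAOTIC = "chaotic"
    DISORDER = "disorder"


# Assumption based on the example above: only these domains get agent rubrics.
AGENT_CANDIDATES = {Domain.CLEAR, Domain.COMPLICATED}


def triage(workflows: Dict[str, Domain]) -> Dict[str, List[str]]:
    """Split a portfolio of recurring workflows into three buckets."""
    buckets: Dict[str, List[str]] = {
        "agent_rubric": [],
        "keep_human": [],
        "classify_first": [],
    }
    for name, domain in workflows.items():
        if domain in AGENT_CANDIDATES:
            buckets["agent_rubric"].append(name)
        elif domain is Domain.DISORDER:
            # Disorder means the domain has not been decided yet.
            buckets["classify_first"].append(name)
        else:
            buckets["keep_human"].append(name)
    return buckets


# Hypothetical portfolio, for illustration only.
portfolio = {
    "dependency bump": Domain.CLEAR,
    "capacity forecast": Domain.COMPLICATED,
    "new-market pricing": Domain.COMPLEX,
    "sev1 response": Domain.CHAOTIC,
    "weekly sync": Domain.DISORDER,
}

for bucket, names in triage(portfolio).items():
    print(f"{bucket}: {', '.join(names)}")
```

Because domains are dynamic, the labels in `portfolio` would be revisited on each quarterly re-pass rather than set once.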
Tokenmaxxing is becoming a weird new productivity trend — Pragmatic Engineer's Gergely Orosz documents the leaderboards at Meta and Microsoft that rank engineers by AI token consumption. He calls it the lines-of-code metric for the agent era. If your org has one of these dashboards, get ahead of replacing it — the post is your supporting evidence.
State of FinOps 2026: AI cost management is the most-wanted skill — 98% of FinOps practices now manage AI spend; most orgs run 4–5x over original budget; 78% report into the CTO/CIO. This is the year tech leaders own AI cost governance whether they want to or not. The headline finding to take to your CFO is the unit-economics framing.
Honeycomb announces O11yCon 2026 for the "agent era" — Charity Majors's framing is the right one: "AI agents are now writing code, triaging incidents, and orchestrating production systems, yet most engineering teams still have a limited understanding of what those agents actually did or whether they delivered value." Observability for agents is where the cost-and-impact story becomes operational.
"When a measure becomes a target, it ceases to be a good measure."
— Marilyn Strathern (1997), generalizing Charles Goodhart (1975)