Stop reading engineering metrics like a balance sheet

2026-05-18



Issue 012 — Single-number reporting collapsed the moment AI made throughput cheap. The leaders who triangulate three lenses are the ones who'll still be calibrated by Q4.

The 2024 DORA report told a tidy story: AI adoption was negatively correlated with software delivery performance. Most executives I know quietly ignored it. The 2025 DORA report, published late last year, flipped the headline — AI adoption is now positively correlated with both throughput and product performance for the first time. Cue the victory lap. The problem: the same report shows AI continues to correlate negatively with software delivery stability. Up on two axes, down on one.

If you read DORA on its own, you'll claim victory. If you read SPACE alongside, you might notice the engineers producing that throughput are quietly disengaging. If you read DevEx alongside both, you'll see flow state degrading even as diff counts climb. The single-axis read of an engineering org — once defensible because the axes mostly co-moved — is now actively dishonest.

Most VPs and CTOs still treat the "engineering dashboard" the way a CFO reads a P&L: as a hierarchical, single-truth document where the top number is the answer and the rest is supporting detail. That mental model is obsolete. Engineering health in 2026 is not a balance sheet. It's an MRI — multiple modalities that only mean something when overlaid. The leaders who learn to read all three slices are the ones who'll survive the era; the leaders who pick one number are the ones who'll be surprised by their own org in six months.


Deep Dive — Metrics & Judgment: triangulate or be deceived

What just broke. Single-axis engineering reporting was always weak. AI broke it openly. The 2025 DORA report puts the gap on paper: AI adoption reached ~90% of professional developers, and for the first time it correlates positively with throughput and with product performance. But it continues to correlate negatively with delivery stability. So if your exec deck shows only "deployments per day" and "lead time," you'll genuinely believe your org has gotten dramatically faster. You'll be right about speed, and wrong about what the org is actually shipping.

Goodhart's Law is in full operation here. The moment "diffs merged per engineer" became the productivity metric leadership cared about, AI tooling instantly let teams produce diffs without producing decisions. The metric measured what it always measured. It just stopped meaning what leadership assumed it meant. The cure is not to drop the metric. The cure is to stop reading it alone.

The triangulation move. The post-AI synthesis is the DX Core 4, released in late 2024 by Abi Noda and Laura Tacho at DX in collaboration with the original DORA, SPACE, and DevEx authors (Forsgren, Storey, Zimmermann, Greiler). It collapses three frameworks into four lenses, each with one headline metric and a small set of supporting metrics: Speed, Effectiveness (headlined by the Developer Experience Index, or DXI), Quality, and Impact.

The list is not the insight. The insight is the geometry: no single category gets to be the boss. If Speed rises but DXI drops, you are paying for velocity with the people who built it. If Quality holds steady but Impact falls, you are operating a beautifully maintained system that doesn't matter to the business. If Impact is up but Quality is collapsing, you are about to lose customers faster than you can ship features for them. The dashboard is a system of tensions, not a system of records.
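One way to make the "system of tensions" reading concrete is a small rule table over quarter-over-quarter changes. The sketch below is illustrative only: the `LensDelta` type, the "positive means better" sign convention, and the single `band` threshold are assumptions made for this example, not part of DX Core 4.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LensDelta:
    """Quarter-over-quarter change per lens, in that lens's own unit.

    Assumption for this sketch: positive always means "got better",
    so a rising failure rate is entered as a negative quality delta.
    """
    speed: float
    dxi: float
    quality: float
    impact: float


def read_tensions(d: LensDelta, band: float = 1.0) -> list[str]:
    """Return the tension readings that apply to this set of deltas.

    `band` is a hypothetical noise threshold: moves no larger than it
    are treated as "holding steady". One shared band is a simplification.
    """
    def up(x: float) -> bool:
        return x > band

    def down(x: float) -> bool:
        return x < -band

    def steady(x: float) -> bool:
        return abs(x) <= band

    findings = []
    if up(d.speed) and down(d.dxi):
        findings.append("Speed up, DXI down: velocity is being paid for with the people who built it.")
    if steady(d.quality) and down(d.impact):
        findings.append("Quality steady, Impact down: a well-maintained system that doesn't matter to the business.")
    if up(d.impact) and down(d.quality):
        findings.append("Impact up, Quality down: customers may leave faster than features ship.")
    return findings
```

Calling `read_tensions(LensDelta(speed=14, dxi=-6, quality=0, impact=0))` returns only the Speed/DXI finding. The point of the sketch is that no rule looks at a single lens in isolation.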

How to read it as a leader. Will Larson's framing fits exactly: "half of metrics is showing the truth; the other half is educating people to inform their mental model about how the truth works." Most engineering execs do the first half and skip the second. They publish the numbers and assume the audience knows how to combine them. The audience never does.

The senior-TPM-and-tech-leader move is to never present a key metric without its counter-metric in the same field of view. Not in the appendix. Not on the next slide. In the same row, in the same chart, with a one-sentence narrative connecting them. "Speed is up 14% quarter-over-quarter. DXI dropped 6 points, and regrettable attrition ticked up at staff level. The next quarter we slow Speed deliberately to recover DXI, with the bet that Q3 throughput rises higher than it would have under sustained Q2 pressure." That sentence is what executive engineering actually sounds like — capital allocation across axes that don't move together.

Two operational guardrails. First, never let a single metric be the OKR for an engineering org. Pair every key metric with an explicit counter-metric, and make the pairing a precondition for the quarterly review: Speed paired with Quality, Impact paired with DXI, any headline throughput number paired with stability. Second, treat DORA as a hygiene baseline, not a leaderboard. Gergely Orosz's "token maxxing" critique, aimed at the emerging 2026 practice of measuring AI-assisted productivity by tokens generated, applies just as cleanly to commit volume, AI-suggestion acceptance rate, and any single-axis "AI productivity" metric a vendor will sell you. The benchmark you display is the behavior you'll get. Pick benchmarks you can live with at scale.
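The first guardrail can be enforced mechanically before a review deck ships. This is a minimal sketch under stated assumptions: the function names and the dictionary shape (key metric mapped to its counter-metric) are hypothetical, and a real check would live wherever your OKRs are stored.

```python
from typing import Optional


def unpaired_metrics(pairs: dict[str, Optional[str]]) -> list[str]:
    """Return key metrics that have no usable counter-metric.

    A counter-metric is unusable if it is missing, empty, or just the
    key metric restated (ignoring case and surrounding whitespace).
    """
    missing = []
    for metric, counter in pairs.items():
        if not counter or counter.strip().lower() == metric.strip().lower():
            missing.append(metric)
    return missing


def ready_for_review(pairs: dict[str, Optional[str]]) -> bool:
    """True only if there is at least one metric and every one is paired."""
    return len(pairs) > 0 and not unpaired_metrics(pairs)
```

For example, `unpaired_metrics({"Speed": "Quality", "Impact": "DXI", "Deployments per day": None})` flags "Deployments per day", and `ready_for_review` returns False for that set until the gap is filled.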

The reason this matters now and not next year: most large-org engineering reviews are converging on DX Core 4 in 2026. If you're a senior TPM, the leaders you support will be asked DX Core 4 questions in their board reviews within two quarters. If you're a tech leader, the time to install the four-lens read is before the board asks, not after. The org that learns triangulation under quiet conditions is calm when the questions start.

Try this week. Build a one-page, four-lens read of your engineering org's last 90 days using the DX Core 4. Put Speed and DXI on the same row, Quality and Impact on the next. Take it to your engineering staff meeting and ask one question: "Which of these four numbers, if it changed by 10% next quarter, would change a decision we are about to make?" The numbers that wouldn't aren't metrics. They're decoration. Cut them before the next review.
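If it helps to start from something runnable, the exercise can be sketched as a two-row text page plus a filter for the "decoration" question. Everything here is an assumption for illustration: the `LensReading` type, the `changes_decision` flag (which stands in for your staff meeting's answer), and the placeholder values in the usage note.

```python
from dataclasses import dataclass

# Row layout from the exercise: Speed beside DXI, Quality beside Impact.
ROWS = (("Speed", "DXI"), ("Quality", "Impact"))


@dataclass
class LensReading:
    value: str              # headline number for the last 90 days, as text
    changes_decision: bool  # would a 10% move change a pending decision?


def one_page(readings: dict[str, LensReading]) -> str:
    """Render the four lenses as two rows, each metric next to its pair."""
    lines = []
    for left, right in ROWS:
        lines.append(
            f"{left}: {readings[left].value} | {right}: {readings[right].value}"
        )
    return "\n".join(lines)


def decoration(readings: dict[str, LensReading]) -> list[str]:
    """Names of lenses whose movement would not change any decision."""
    return [
        name
        for row in ROWS
        for name in row
        if not readings[name].changes_decision
    ]
```

Fill `readings` with your own numbers, for instance `{"Speed": LensReading("<your value>", True), ...}`, print `one_page(readings)` for the meeting, and treat whatever `decoration(readings)` returns as the candidates to cut.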


Method — Goodhart's Law (Goodhart, 1975; Strathern, 1997)

What it is. "When a measure becomes a target, it ceases to be a good measure." Originally stated by British economist Charles Goodhart in 1975 in the context of UK monetary policy; popularized in its now-standard phrasing by anthropologist Marilyn Strathern in 1997. A close cousin is Campbell's Law (Donald Campbell, 1979). The shared insight: optimization pressure deforms whatever it touches.

When to use it. Apply Goodhart whenever you are about to elevate a metric from signal to target — OKR, executive dashboard, performance review input, comp lever, AI-assistant reward function. Especially relevant in 2026, when AI agents can produce essentially unbounded optimization pressure on any number you decide to display.

How to run it:

  1. Name the metric you are considering targeting. Be precise — "diffs per engineer per week," not "developer productivity."
  2. List three plausible ways a sufficiently motivated team — or a sufficiently capable AI assistant — could move that metric without improving the underlying outcome you actually care about.
  3. If you can list three, you have a Goodhart problem. Choose one of three responses: pair the metric with a counter-metric, replace it with a perception or outcome measure, or explicitly downgrade it from target to signal.
  4. Document the counter-metric in the same surface where the primary metric appears. If your dashboard hides counter-metrics one click away, gaming the primary is the path of least resistance.
  5. Review the metric quarterly with a fresh round of step 2. Has the gap between the measure and the outcome widened? Retire the metric before it embarrasses you.
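The steps above can be sketched as a small record you fill in once per candidate target. This is a rough illustration, not a standard tool: the `GoodhartCheck` class, its field names, and the status strings are hypothetical, and the gaming paths in the usage note are examples rather than observed data.

```python
from dataclasses import dataclass, field
from typing import Optional

# The three responses from step 3.
RESPONSES = (
    "pair the metric with a counter-metric",
    "replace it with a perception or outcome measure",
    "downgrade it from target to signal",
)


@dataclass
class GoodhartCheck:
    metric: str                                             # step 1: precise name
    gaming_paths: list[str] = field(default_factory=list)   # step 2
    counter_metric: Optional[str] = None                     # step 4

    def has_goodhart_problem(self) -> bool:
        """Step 3: three plausible gaming paths means the metric is at risk."""
        return len(self.gaming_paths) >= 3

    def status(self) -> str:
        if not self.has_goodhart_problem():
            return "ok to target; re-run step 2 next quarter"
        if self.counter_metric:
            return f"target only alongside counter-metric: {self.counter_metric}"
        return "needs a response: " + "; or ".join(RESPONSES)
```

As an illustration, a check for "diffs per engineer per week" with three gaming paths (say, splitting one change into many diffs, merging generated diffs without review, counting vendored code) reports "needs a response" until a `counter_metric` is recorded. Re-running the check each quarter with a fresh list covers step 5.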

When NOT to use it. Don't use Goodhart's Law as an excuse to stop measuring. Some metrics survive being targets — typically the ones tied to externally validated outcomes the team cannot easily fake (customer-reported reliability, retention, revenue tied to a specific surface). The law is a filter, not a veto.

Example: code coverage as a quality target produced trivial unit tests in 2015. Diffs per engineer as an AI-leverage target is producing token-maxxed PRs in 2026. Same law, same failure mode, eleven years apart, larger blast radius.


Field Notes

2025 DORA Report: AI lifts throughput and product performance — at the cost of stability — The metrics report most engineering execs will cite this year inverted its own 2024 headline. AI adoption (now ~90%) is positively correlated with delivery throughput and product performance for the first time, but still negatively correlated with delivery stability. The single-axis read is officially obsolete.

Introducing the DX Core 4 — Abi Noda and Laura Tacho's unification of DORA, SPACE, and DevEx into four lenses: Speed, Effectiveness, Quality, and Impact. The reason to read it this week: it's the framework most large-org engineering reviews are converging on through 2026, and the four-lens vocabulary will show up in your board prep soon.

Token maxxing has replaced lines-of-code as the new productivity theater — Gergely Orosz tracking the in-progress 2026 trend at Meta, Microsoft, and other large engineering orgs: measuring AI-assisted productivity by tokens generated. Goodhart's Law restated for the AI era — and a forecast of where the next OKR-gaming wave is headed.




"Half of metrics is showing the truth. The other half is educating people to inform their mental model about how the truth works."

— Will Larson, CTO, Carta

