Every AI throughput metric needs an instability twin

2026-06-01


If your dashboard shows AI made your team faster but doesn't show what it broke, you're reporting half a sentence.

The 2025 DORA report — the first State of DevOps written entirely in the AI-everywhere era — landed on a finding that should reshape every metrics review you run this quarter: AI adoption correlates with higher software delivery throughput AND higher software delivery instability at the same time. Both numbers go up. The report is unusually blunt about it: AI is "a multiplier of existing engineering conditions," strengthening strong teams and exposing weak ones. The time engineers save in code generation is being re-allocated to auditing, verification, and rollback work, not to additional shipped impact.

Most engineering scorecards I have reviewed in the last sixty days show only the upside half. Diffs per engineer. PRs merged per week. Lead time for changes. Sprint completion rate. All trending green. None of them paired with their counter-effect. A senior leader reading those dashboards is making confident judgments from a directionally incomplete picture — they think they are getting faster, and they may also be getting more fragile, and the dashboard is silent on the second part.

This isn't a measurement problem. It's a judgment failure caused by metric design. And the fix is one of the oldest moves in the operations management literature.


Deep Dive — Metrics & Judgment: pair every throughput metric with its counter-effect

The instinct most teams have right now is to "add an AI metric": acceptance rate of AI suggestions, percent of PRs touched by Copilot, AI tool active users per week. These are activity counters. They tell you whether the tool is being used. They do not tell you whether using it is making the system better or worse. Activity metrics are the easiest to collect and the least useful for judgment.

The right move is structural. For every throughput-style metric your team reports up, instrument and report the counter-effect metric next to it, as a pair, at the same cadence, to the same audience. Never one number without the other. This is exactly the design principle behind the DX Core 4 framework that Abi Noda and Laura Tacho introduced in December 2024 with input from the DORA, SPACE, and DevEx authors: they deliberately split the dimensions into oppositional pairs (Speed paired with Quality; Effectiveness paired with Impact) so that gains in one dimension cannot quietly mask losses in another. Tacho describes the design intent directly: "Speed is great, but if you're going faster while being less effective, that's not great."

What does pairing look like in practice for an AI-amplified org in mid-2026? A working starter set:

Each pair tells a complete sentence. "We shipped 30% more changes and rollback rate held flat" is unambiguously good news and deserves an investment in more AI tooling. "We shipped 30% more changes and rollback rate is up 20%" is the conversation about where to put guardrails: code review depth, test coverage gates, gradual rollout controls, ownership clarity. Neither sentence can be written without both numbers in hand, and no leader can honestly defend an investment decision without writing the full sentence.
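One way to make the "complete sentence" rule mechanical is to store the two numbers in a single record that can only be rendered together. The sketch below is illustrative, not a prescribed implementation: the `PairedMetric` class, its field names, and the 5% "held flat" tolerance are all assumptions, and it assumes the counter-effect is a metric where higher is worse.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairedMetric:
    """A headline throughput metric and its counter-effect, kept in one record."""
    headline_name: str
    headline_change_pct: float   # period-over-period change, e.g. 30.0 for +30%
    counter_name: str
    counter_change_pct: float    # same period; assumes higher means worse
    tolerance_pct: float = 5.0   # hypothetical band treated as "held flat"

    def sentence(self) -> str:
        """Render both numbers as one sentence; there is no headline-only view."""
        if abs(self.counter_change_pct) <= self.tolerance_pct:
            counter = f"{self.counter_name} held flat"
        else:
            direction = "up" if self.counter_change_pct > 0 else "down"
            counter = (f"{self.counter_name} is {direction} "
                       f"{abs(self.counter_change_pct):.0f}%")
        return (f"{self.headline_name} changed "
                f"{self.headline_change_pct:+.0f}% and {counter}.")

    def needs_guardrail_conversation(self) -> bool:
        """True when throughput rose and the counter-effect worsened past tolerance."""
        return (self.headline_change_pct > 0
                and self.counter_change_pct > self.tolerance_pct)


pair = PairedMetric("Changes shipped", 30.0, "rollback rate", 20.0)
print(pair.sentence())
# prints: Changes shipped changed +30% and rollback rate is up 20%.
```

The design choice that matters is the absence of any method that returns the headline alone; the tolerance value is a placeholder a team would need to set for itself.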

There is a second-order benefit that matters for senior TPMs and engineering directors specifically: paired metrics protect you in cross-functional disputes. When finance or product asks "why are we slowing down to add more review steps?" you point to the pair. The pair is the argument. You are not the person introducing friction — you are the person reading the second number that everyone else is ignoring. Goodhart's law (which we covered on May 18) explains why a single metric becomes a bad metric the moment it becomes a target. Pairing is the practical defense: the pair, taken together, is harder to game without the gaming showing up immediately in the counter-effect.

A few rules of the road, because pairing badly can be worse than not pairing:

  1. The counter-effect has to be measured by the same team, on the same cadence, with the same fidelity as the headline metric. A weekly throughput number paired with a quarterly satisfaction survey is theater, not pairing.
  2. Counter-effects should be system-level, not person-level. Pairing diffs-per-engineer with bugs-per-engineer creates a stack-rank that engineers will route around. Pair team throughput with team instability instead.
  3. Both numbers in the pair should be defensible in a court of evidence. If you can't explain what each one means to a skeptical CFO in 30 seconds, the pair is too clever to be useful.
  4. When the pair diverges in a way the team cannot explain, that is the next investigation, not the next OKR. Divergence is a signal, not a verdict.
  5. Audit your pairs every six months. AI tooling, team composition, and architecture all change what "counter-effect" means. A pair that was honest in January may be Goodhart-bait by July.
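The first two rules above are concrete enough to check before a pair ever reaches a scorecard. Here is a minimal sketch under stated assumptions: `MetricSpec`, its fields, and the problem labels are hypothetical names chosen for illustration, and the remaining rules (defensibility, divergence, periodic audit) still need human judgment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricSpec:
    name: str
    cadence_days: int   # how often the number is refreshed
    scope: str          # "team", "system", or "person"
    owner: str          # team that measures it


def pairing_problems(headline: MetricSpec, counter: MetricSpec) -> list[str]:
    """Return rule violations for a proposed pair; an empty list means none found."""
    problems = []
    if headline.cadence_days != counter.cadence_days:
        problems.append("cadence mismatch")             # rule 1
    if headline.owner != counter.owner:
        problems.append("measured by different teams")  # rule 1
    if "person" in (headline.scope, counter.scope):
        problems.append("person-level metric")          # rule 2
    return problems


weekly = MetricSpec("team throughput", 7, "team", "platform")
survey = MetricSpec("satisfaction survey", 90, "team", "platform")
print(pairing_problems(weekly, survey))
# prints: ['cadence mismatch']
```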

The senior move in mid-2026 is not "track more AI metrics." It's to walk into your next eng review with a one-page scorecard where every throughput row has a counter-effect row underneath it, and to refuse — politely, persistently — to read the top row aloud without reading the bottom row aloud as well.

Try this week. Pull the three engineering metrics that appear most often in your status reports to leadership. For each one, write down its counter-effect metric on the same line. If a counter-effect is not currently instrumented, mark it GAP. Bring the marked-up sheet to your next eng review and ask which gaps you'll close before the next review.
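If your metrics already live in a spreadsheet or config file, the marked-up sheet can be generated rather than hand-written. The sketch below assumes a simple mapping from headline metric to counter-effect; the function name `mark_gaps` and the counter-effect names in the example are hypothetical placeholders, not recommended pairings.

```python
def mark_gaps(pairs: dict[str, str], instrumented: set[str]) -> list[str]:
    """Return one worksheet line per headline metric, marking missing counters GAP."""
    lines = []
    for headline, counter in pairs.items():
        status = "OK" if counter in instrumented else "GAP"
        lines.append(f"{headline} | {counter} | {status}")
    return lines


# Headline names come from the scorecards discussed earlier;
# the counter-effects are illustrative assumptions.
pairs = {
    "PRs merged per week": "rollback rate",
    "Lead time for changes": "change failure rate",
    "Sprint completion rate": "escaped defects",
}
for line in mark_gaps(pairs, instrumented={"rollback rate"}):
    print(line)
```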


Method — Paired Indicators (Grove, High Output Management, 1983)

What it is. A measurement discipline from Andy Grove's High Output Management: for every indicator you track, you pair it with a counter-indicator so that an effect and its counter-effect are measured together. Grove's original framing: "Because indicators direct one's activities, you should guard against overreacting. This you can do by pairing indicators, so that together both effect and counter-effect are measured." His own examples were inventory levels paired with stockout rate, and quantity of output paired with quality of output.

When to use it. Any metric that is used in goals, OKRs, comp, or routine status reporting — in other words, any metric whose value will change behavior. Especially valuable when a new tool, process, or org change has just shifted the cost of producing the top-line number (the exact situation AI coding tools created in 2025-26).

How to run it:

  1. List every metric currently in your team's goals, OKRs, leadership scorecards, or sprint reviews.
  2. For each, ask: "If a team optimized this naively, what would they sacrifice to win on it?" That sacrifice is the counter-indicator.
  3. Instrument the counter-indicator at the same cadence, for the same audience, and with the same fidelity as the headline indicator.
  4. Report them as a single line item — never the headline without the counter, never on different slides, never on different review cycles.
  5. Make divergence between the pair the trigger for the next investigation, not a separate OKR.
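Step 5 can be approximated with a simple scan over the two series. This is a sketch under assumptions, not Grove's method: the function name, the 10% threshold, and the week-over-week comparison are all illustrative choices, and it assumes a counter-indicator where higher is worse.

```python
def divergent_periods(headline: list[float], counter: list[float],
                      threshold_pct: float = 10.0) -> list[int]:
    """Return indices of periods where the headline rose while the counter-indicator
    worsened by more than threshold_pct versus the prior period."""
    if len(headline) != len(counter):
        raise ValueError("pair must be reported at the same cadence")
    flagged = []
    for i in range(1, len(headline)):
        if headline[i - 1] == 0 or counter[i - 1] == 0:
            continue  # percent change undefined; skip rather than guess
        h_change = (headline[i] - headline[i - 1]) / headline[i - 1] * 100
        c_change = (counter[i] - counter[i - 1]) / counter[i - 1] * 100
        if h_change > 0 and c_change > threshold_pct:
            flagged.append(i)
    return flagged


# Hypothetical weekly data: deploys per week and percent reverted within 24 hours.
deploys = [100, 110, 140]
revert_rate = [5.0, 5.0, 9.0]
print(divergent_periods(deploys, revert_rate))
# prints: [2]
```

A flagged period is a prompt to investigate, consistent with treating divergence as a signal rather than a verdict.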

When NOT to use it. For purely exploratory or diagnostic metrics that no one is incentivized on — pairing those adds noise without changing behavior. Also skip pairing for metrics that exist only because a regulator or auditor requires them; the pair won't shift incentive design that is set externally.

Example: a platform team measured "self-service deploys per week" and saw it climb 40% after launching a new deploy UI. Pairing it with "deploys reverted within 24 hours" revealed the revert rate had nearly doubled. The pair turned a "ship this win" memo into a guardrails roadmap.


Field Notes

AI Is Amplifying Software Engineering Performance, Says the 2025 DORA Report — The single most important data point for any metrics conversation in 2026: AI raises throughput AND instability simultaneously. If your dashboard shows only one, it's lying by omission.

Introducing the DX Core 4 — Noda and Tacho's December 2024 framework is the first major eng productivity model designed around oppositional metric pairs as a load-bearing principle, not an afterthought.

The Tyranny of Metrics — Gergely Nemeth on why lack of trust manufactures new metrics, and why metric proliferation crowds out judgment. Useful counterweight before you go pair-crazy.


"Because indicators direct one's activities, you should guard against overreacting. This you can do by pairing indicators, so that together both effect and counter-effect are measured."

— Andrew S. Grove, High Output Management, 1983

