DORA's middle is collapsing. AI is the accelerant, not the cure.

May 11, 2026

The 2024 DORA report showed the elite tier shrinking from 31% to 22% and the low tier swelling from 17% to 25% — and the AI-adoption coefficient on throughput and stability is negative. Read together, that's not a measurement glitch. It's a directive: in 2026, your DORA review is supposed to be a judgment exercise, not a status update.

Here is the picture most engineering leaders are still drawing from. The 2024 Accelerate State of DevOps Report — DORA's most recent — surveyed 39,000+ practitioners and found two things that should sit at the top of every Q2 metrics review. First, the elite tier dropped from 31% of teams to 22%, the low-performance tier grew from 17% to 25%, and the middle is sliding down, not up. Second, the AI-adoption coefficient is negative on throughput and on stability: a 25% increase in AI adoption correlated with a 1.5% decrease in throughput and a 7.2% decrease in stability.

Both findings have been re-litigated in the trade press for eighteen months. What nobody has drawn is the obvious managerial inference: DORA in 2026 is no longer a scorecard. It is a diagnostic instrument, and most senior leaders are still treating it as a report. The collapse of the middle isn't a measurement story; it's a judgment story. AI didn't break delivery. It made every pre-existing dysfunction louder, and the leaders without the discipline to interpret the resulting signal are the ones whose teams are dropping out of the elite tier.

This issue is about getting better at the interpretive work — how to read DORA, SPACE, DevEx, and the newer DX Core 4 so the metrics are doing what they were designed to do: change a decision.

Deep Dive — Metrics & Judgment

There are four moves a senior TPM or tech leader makes with metrics. Three of them are administrative — collect, dashboard, report. The fourth is the only one that compounds: interpret. Most engineering orgs in 2026 have professionalized the first three and atrophied the fourth.

Here's what interpretation actually means, in five disciplines you can install in your next metrics review.

1. Separate signal from noise before you separate winners from losers. The first question against any DORA movement is not "what changed?" — it is "is this even a change?" Walter Shewhart figured this out in 1924 at Bell Labs (today's Method, below): a metric that bounces between two values is not actually moving until the bounce exceeds the natural variation of the process. A 12% week-over-week dip in deployment frequency is meaningful in a team that has historically held a 3% standard deviation. It is noise in a team whose standard deviation is 15%. Most weekly DORA reviews are arguing about noise. The leader who walks into the room having already computed the control limits is the only one operating on signal.
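To make the 12% example concrete, here is a minimal sketch that expresses a move in units of the team's own historical standard deviation. The function name and the 3-sigma cutoff are illustrative assumptions, not part of any DORA tooling.

```python
def dip_in_sigmas(dip_pct: float, historical_sd_pct: float) -> float:
    """Size of a week-over-week move, in units of the team's own standard deviation."""
    if historical_sd_pct <= 0:
        raise ValueError("historical_sd_pct must be positive")
    return abs(dip_pct) / historical_sd_pct


# The same 12% dip, seen by two teams with different natural variation.
steady_team = dip_in_sigmas(12.0, 3.0)    # 4 standard deviations
noisy_team = dip_in_sigmas(12.0, 15.0)    # under 1 standard deviation

SIGNAL_THRESHOLD = 3.0  # assumed cutoff, matching the usual control limit

for name, sigmas in [("steady", steady_team), ("noisy", noisy_team)]:
    verdict = "signal" if sigmas > SIGNAL_THRESHOLD else "noise"
    print(f"{name}: {sigmas:.1f} sigmas -> {verdict}")
```

The point of the normalization is that the raw percentage never enters the decision; only its size relative to the team's own bounce does.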

2. Treat the four DORA metrics as two pairs, not four numbers. The cleanest insight from Forsgren's original Accelerate work is that throughput (deployment frequency, lead time) and stability (change failure rate, MTTR) are not independent. They covary. Elite teams move both together; broken teams trade one for the other. The 2024 finding, that AI adoption correlated with a 1.5% drop in throughput and a 7.2% drop in stability, is worse than a trade-off: stability was spent and no throughput came back in exchange. That means your AI-adopting org may be structurally underperforming a 2019 elite team that didn't have access to the tool. The interpretive question is not "did our deployment frequency improve?" It is "did the throughput-stability pair move toward or away from the elite corner?"
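One way to ask the pair question mechanically is sketched below. Everything here is an assumption for illustration: the `DoraDelta` type, the idea of averaging percent changes within each pair, and the 1-point tolerance. It is a crude classifier, not a DORA-defined calculation.

```python
from dataclasses import dataclass


@dataclass
class DoraDelta:
    """Quarter-over-quarter percent change in each metric (hypothetical inputs)."""
    deploy_frequency: float      # higher is better
    lead_time: float             # lower is better
    change_failure_rate: float   # lower is better
    time_to_restore: float       # lower is better


def classify_pair_move(d: DoraDelta, tolerance: float = 1.0) -> str:
    # Flip signs so a positive number always means "improved".
    throughput = (d.deploy_frequency - d.lead_time) / 2
    stability = (-d.change_failure_rate - d.time_to_restore) / 2

    def direction(x: float) -> int:
        if abs(x) <= tolerance:
            return 0
        return 1 if x > 0 else -1

    t, s = direction(throughput), direction(stability)
    if t == 0 and s == 0:
        return "flat"
    if t >= 0 and s >= 0:
        return "toward elite corner"
    if t <= 0 and s <= 0:
        return "away from elite corner"
    return "trade-off"


print(classify_pair_move(DoraDelta(10, -8, -5, -6)))    # both pairs improved
print(classify_pair_move(DoraDelta(15, -10, 12, 20)))   # faster but less stable
print(classify_pair_move(DoraDelta(-3, 4, 8, 6)))       # both pairs got worse
```

The design choice worth noticing is that the function never returns a verdict on a single metric; it can only describe the pair.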

3. Stop reading SPACE's Activity dimension as if 2021 hadn't happened. The SPACE framework (Forsgren, Storey, Houck, Smith, Zimmermann, 2021) was deliberately designed so that no single dimension dominates — the point was that Activity (PRs, commits) is misleading on its own. In 2026, with AI generating an estimated 42% of code in adopting orgs, Activity is now actively misleading, not just incomplete. Pull-request count is a metric your team can move by accepting more AI suggestions. Lines of code is a metric your team can move by not refactoring. The 2026 read is: weight Satisfaction and Performance higher, treat Activity as a sanity check only, and watch for the Goodhart signature — Activity rising while Performance flattens — which is the load-bearing diagnostic that your team is shipping volume without value.
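The Goodhart signature can be screened for with a trend comparison. The sketch below is one possible reading, under stated assumptions: trends are least-squares slopes normalized by each series' mean, "flattens" is taken to include flat or falling, and the two thresholds are arbitrary starting points to tune. The weekly series are made up.

```python
def slope(values: list[float]) -> float:
    """Least-squares slope of a series against its index (units per period)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def goodhart_signature(activity: list[float], performance: list[float],
                       rising: float = 0.02, flat: float = 0.005) -> bool:
    """True when activity trends up while performance is flat or falling.

    Slopes are divided by each series' mean, so the thresholds read as
    a fraction of the mean per period (assumed values, not a standard).
    """
    a = slope(activity) / (sum(activity) / len(activity))
    p = slope(performance) / (sum(performance) / len(performance))
    return a > rising and p < flat


# Hypothetical eight weeks: merged PRs climb, customer-facing outcomes do not.
merged_prs = [40, 44, 47, 52, 55, 61, 66, 70]
outcomes_flat = [12, 12, 13, 12, 12, 13, 12, 12]
outcomes_rising = [12, 13, 14, 15, 16, 17, 18, 19]

print(goodhart_signature(merged_prs, outcomes_flat))     # volume without value
print(goodhart_signature(merged_prs, outcomes_rising))   # volume with value
```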

4. Use DevEx and DX Core 4 to localize the bottleneck, not to grade the team. The DevEx framework (Noda, Storey, Forsgren, Greiler, 2023) and its operational successor, DX Core 4 (announced April 16 at DX Annual 2026), exist because DORA tells you where a team sits but not what is wrong. DevEx's three dimensions — feedback loops, cognitive load, flow state — give you the place to point at. DX Core 4's four — Speed, Effectiveness, Quality, Impact — give you the dashboard a CFO will read. Use DevEx for diagnosis and DX Core 4 for narrative. Don't confuse the two. A leader who walks into a budget review with a DevEx dashboard is in the wrong meeting. A leader who walks into a 1:1 with their staff engineer brandishing DX Core 4's Impact dimension is also in the wrong meeting.

5. Read every metric backward: what decision would I make differently if this number were the opposite? This is the only interpretive question that matters. If the answer is "nothing" — if your weekly deployment frequency could be 20% higher or 20% lower and you would change no investment, no headcount, no incentive, no scope — then you are not interpreting the metric, you are decorating a slide. The Goodhart trap is downstream of this failure: a metric that drives no decision will, given enough quarters, become a metric the team optimizes for. Charity Majors's observation that static dashboards are the most-misused tool in engineering generalizes — your DORA dashboard is just an observability tool with a longer time horizon. Interaction beats observation. Pull the data into a notebook, slice by team, slice by service, slice by AI-adoption rate. The dashboard is the question, not the answer.
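As a starting point for the notebook step, here is a small standard-library sketch of slicing one metric by team and by AI-adoption band. The rows, field names, and band cutoffs are all hypothetical; substitute whatever your delivery data actually records.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows, one per service; real data would come from your pipeline.
rows = [
    {"team": "payments", "service": "ledger",  "ai_adoption": 0.70, "change_failure_rate": 0.20},
    {"team": "payments", "service": "billing", "ai_adoption": 0.65, "change_failure_rate": 0.16},
    {"team": "search",   "service": "indexer", "ai_adoption": 0.20, "change_failure_rate": 0.08},
    {"team": "search",   "service": "query",   "ai_adoption": 0.40, "change_failure_rate": 0.10},
]


def adoption_band(row: dict) -> str:
    """Bucket AI adoption into thirds (assumed cutoffs)."""
    share = row["ai_adoption"]
    if share < 1 / 3:
        return "low"
    if share < 2 / 3:
        return "medium"
    return "high"


def slice_mean(rows: list[dict], key, metric: str) -> dict:
    """Average `metric` within each group produced by `key(row)`."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row[metric])
    return {group: mean(values) for group, values in groups.items()}


by_team = slice_mean(rows, lambda r: r["team"], "change_failure_rate")
by_band = slice_mean(rows, adoption_band, "change_failure_rate")
print(by_team)
print(by_band)
```

The same `slice_mean` call works for service or any other dimension; the interesting part is which slice changes a decision.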

Pull these five together and the interpretive read of DORA's 2024 numbers writes itself. The middle is collapsing because the median engineering org adopted AI tools without the substrate to absorb them — no improved review process to catch the additional volume, no upgraded test infrastructure to catch the additional defects, no calibration on the throughput-stability trade-off. The elite teams that didn't drop a tier had something most middle teams didn't: a leadership cadence that treats DORA as input to a decision, not as a deliverable.

That is the skill to install this quarter. It is not new tooling. It is not a new framework. It is the discipline of reading the dashboard you already have as if the numbers were going to change a decision tomorrow morning.

Try this week. Before your next DORA / DX Core 4 review, pre-compute one number for every metric on the deck: the rolling 8-week standard deviation. Any metric whose latest value sits within ±1 SD of its 8-week mean gets struck through on the printout before the meeting starts. This is a deliberately coarse first screen; the control-chart rules in the Method section are the stricter test. The remaining metrics, the ones that have actually moved, get the full 10 minutes. If nothing remains, cancel the meeting and move the slot to a working session on one DevEx bottleneck. Your team will notice within two cycles.
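The pre-meeting screen fits in a few lines. This sketch assumes each metric is a weekly series, oldest first, with this week's value last, and that "inside ±1 SD" means within one standard deviation of the mean of the prior eight weeks. Metric names and numbers are invented.

```python
from statistics import mean, stdev


def review_agenda(deck: dict[str, list[float]], window: int = 8) -> dict[str, list[str]]:
    """Split metrics into those worth discussing and those to strike through."""
    agenda: dict[str, list[str]] = {"discuss": [], "strike": []}
    for name, series in deck.items():
        if len(series) < window + 1:
            raise ValueError(f"{name}: need {window} prior weeks plus the current one")
        baseline = series[-(window + 1):-1]   # the trailing weeks, excluding this week
        center, sd = mean(baseline), stdev(baseline)
        moved = abs(series[-1] - center) > sd
        agenda["discuss" if moved else "strike"].append(name)
    return agenda


# Hypothetical deck: eight prior weeks followed by the current week.
deck = {
    "deploys_per_week": [21, 23, 22, 20, 24, 22, 21, 23, 22],
    "change_failure_rate_pct": [8, 9, 7, 8, 9, 8, 7, 8, 14],
}
agenda = review_agenda(deck)
print(agenda)
```

Excluding the current week from the baseline is a choice: it keeps a genuinely unusual week from inflating the very yardstick it is measured against.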

Method — Statistical Process Control / Shewhart Control Charts (Walter Shewhart, 1924)

What it is. Walter Shewhart, a physicist at Bell Telephone Laboratories, invented Statistical Process Control in 1924 to distinguish two kinds of variation in a manufacturing process: common-cause (the inherent, predictable noise of a stable process) and special-cause (a real change that warrants investigation). The artifact is the control chart: plot your metric over time with a center line (the mean) and upper/lower control limits (typically ±3 standard deviations). Points inside the limits are noise — leave them alone. Points outside, or runs of points trending the same direction, are signal — investigate them. W. Edwards Deming carried the discipline into broader management; today it is the foundation of every credible quality program.

When to use it. Any recurring metric that bounces week to week and triggers reactive discussion. DORA's deployment frequency. Change failure rate. Lead time. SPACE Activity counts. PR cycle time. Incident frequency. Any DX Core 4 dimension. Use it specifically before a metrics review where your stakeholders are likely to react to a single bad week.

How to run it:

1. Pick the metric and the cadence. Most engineering metrics work well on a weekly cadence over a rolling 8–12 weeks. Daily is too noisy; quarterly hides the signal.
2. Compute the mean and the standard deviation over the trailing window. Exclude obvious anomalies (holidays, the week a major incident dominated the queue) only with a written note explaining the exclusion.
3. Plot the metric with a center line and ±2 SD and ±3 SD control limits. A spreadsheet is fine. Don't reach for a tool until the discipline is installed.
4. Apply a short set of run rules drawn from the Western Electric rules and the later Nelson rules (the canonical sets Shewhart's heirs codified): any single point beyond ±3 SD is signal; eight consecutive points on the same side of the mean are signal; six consecutive points trending in the same direction are signal. Most other movement is noise.
5. In the review, lead with what is not signal. Strike through the noise visibly. Spend the meeting only on the points the rules surfaced. This is the cultural shift: it tells the room what you actually care about.
6. For each signal point, ask: was the underlying process changed? If yes, the change explains it; either lock in the improvement or roll back. If no, you have an unexplained special cause. Investigate before it becomes the new common cause.
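Plotting aside, the rule checks above can be sketched in plain Python. This is an illustration under assumptions: limits come from a fixed baseline window using the sample standard deviation (as the steps describe, rather than the moving-range estimate some SPC texts prefer), and only the three rules named above are checked. The function names and sample data are hypothetical.

```python
from statistics import mean, stdev


def control_limits(baseline: list[float]) -> tuple[float, float]:
    """Center line and standard deviation from a baseline window."""
    return mean(baseline), stdev(baseline)


def find_signals(series: list[float], center: float, sd: float) -> dict[int, list[str]]:
    """Return {index: [rule names]} for points flagged by the three rules."""
    signals: dict[int, list[str]] = {}

    def flag(i: int, rule: str) -> None:
        signals.setdefault(i, []).append(rule)

    for i, value in enumerate(series):
        # Rule 1: a single point beyond 3 SD of the center line.
        if abs(value - center) > 3 * sd:
            flag(i, "beyond 3 SD")
        # Rule 2: eight consecutive points on the same side of the center line.
        if i >= 7:
            run = series[i - 7:i + 1]
            if all(x > center for x in run) or all(x < center for x in run):
                flag(i, "8 on one side")
        # Rule 3: six consecutive points, each higher (or lower) than the last.
        if i >= 5:
            run = series[i - 5:i + 1]
            steps = [b - a for a, b in zip(run, run[1:])]
            if all(s > 0 for s in steps) or all(s < 0 for s in steps):
                flag(i, "6 trending")
    return signals


# Hypothetical weekly deploy counts: ten baseline weeks, then four new weeks.
baseline = [20, 22, 21, 23, 19, 21, 22, 20, 21, 21]
new_weeks = [22, 20, 29, 21]

center, sd = control_limits(baseline)
signals = find_signals(new_weeks, center, sd)
print(signals)
```

Anything not in the returned dictionary is, by these rules, noise and can be struck through before the meeting.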

When NOT to use it. Brand-new metrics with under 8 weeks of history (you have no baseline). Metrics that are inherently event-driven and lumpy at low volume (use a different distribution model). Anything where the metric's definition is still in flux — settle the definition first.

Deming's mid-career talks at Ford and Toyota are the cleanest example: he convinced two of the largest manufacturing operations in the world that the single most expensive habit a manager has is reacting to common-cause variation as if it were a special cause. The same habit is the single most expensive habit of an engineering leader in 2026.

Field Notes

DX Core 4 — unifying DORA, SPACE, and DevEx — Announced April 16 at DX Annual 2026. Four dimensions — Speed, Effectiveness, Quality, Impact — with a 14-factor Developer Experience Index (DXI) inside Effectiveness. A 3,500-developer deployment measured a 31% throughput lift, 16% attributed directly to AI. Worth reading even if you don't adopt it — it's the framework your CFO will read about first.

How AI is changing software engineering — Gergely Orosz at AIE Europe — Orosz reports DMs from Meta and Microsoft engineers describing "token maximizing" — gaming AI-tool usage metrics to look productive. The Goodhart's-Law signature applied to a brand-new metric class. If your org tracks AI-tool adoption as a leading indicator, read this first.

Measuring developer productivity with the DX Core 4 — Mike Fisher — Former Etsy CTO's practitioner write-up. Sharpest line: "every 1-point DXI improvement correlates with roughly 13 minutes saved per developer per week." At a 100-person eng org, that's ~1,100 hours a year — the kind of business-impact number that travels.
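The ~1,100-hour figure can be checked with quick arithmetic. The one assumption is 52 weeks per year; the function name is made up for this sketch.

```python
def annual_hours_saved(minutes_per_dev_per_week: float, developers: int,
                       weeks_per_year: int = 52) -> float:
    """Convert a per-developer weekly saving in minutes into org-wide hours per year."""
    return minutes_per_dev_per_week * developers * weeks_per_year / 60


hours = annual_hours_saved(13, 100)
print(round(hours))  # 1127
```

Dropping to 48 working weeks gives 1,040 hours, so the quoted number holds up as a round figure either way.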

Events

Jun 2–3 · LeadDev London (LDX3) — Europe's densest engineering-leadership event, 2,500+ attendees, strong staff+ and metrics tracks.
Jun 15 · TPM Institute — Technical Program Manager Course cohort — newer practitioner program; useful if you're sponsoring a senior IC into TPM rotation.
Dec 2–4 · TechLeader Summit, Clearwater FL — smaller, curated event for software engineering leaders; early-bird pricing window is open.

Reading

The SPACE of Developer Productivity · Forsgren, Storey, Houck, Smith, Zimmermann (ACM Queue, 2021) — the original SPACE paper. Re-read the Activity section in light of 2026 AI-generated code.
DevEx: What Actually Drives Productivity · Noda, Storey, Forsgren, Greiler (ACM Queue, 2023) — the framework that operationalizes feedback loops, cognitive load, and flow state.
How to measure AI developer productivity · Nicole Forsgren on Lenny's — the cleanest practitioner update to SPACE for the AI era. Pair with the original paper.

"A measure is a model of the process. The most common error in management is to react to the data as if it were the process itself."

— W. Edwards Deming, paraphrased from Out of the Crisis (MIT Press, 1986)

Critical Path
