Your AI tooling line item went up 40 percent this quarter.
Your DORA dashboard is green. Deployment frequency is up. Lead time is down. The team shipped more pull requests in Q1 than any quarter in the company’s history.
And the feature you committed to the board on the Q1 call slipped to Q2.
Three facts. All of them true. None of them in the same paragraph in your QBR deck. The connection between them is the thing the engineering tools you bought were not built to surface, and it is the reason the CFO is starting to ask harder questions about what the AI line item is buying.
The metric that does not measure what you think
Nicole Forsgren co-founded DORA. The Pragmatic Engineer called the scope of DORA its “massive limitation.” Forsgren herself wrote that DORA “only measures from commit to production. There is a lot of important work and systems behavior that happens before and after that which are not captured.”
Nathen Harvey runs developer advocacy for DORA at Google. He says it plainly: “The DORA metrics are important, and they can drive a lot of good things, but they are not sufficient. You have to dig deeper.”
The 2024 DORA report proved them right. The high-performance cluster shrank from 31 percent of respondents to 22 percent. The low-performance cluster grew from 17 percent to 25 percent. More teams are getting worse, not better, despite a decade of DORA evangelism.
The same report found something the CFO will care about more. A 25 percent increase in AI adoption correlated with a 7.2 percent decrease in delivery stability. Individual developers reported being more productive. The system-level metrics got worse. The DX summary of the report put it cleanly: AI “helps individual productivity but hurts software delivery performance.”
Rachel Stephens at RedMonk applied Theory of Constraints to the result. Citing Gene Kim on Goldratt: any improvement made anywhere other than the bottleneck is an illusion. If code writing was not the bottleneck, making it faster does not make the company faster.
Your dashboard is green. Your team is busy. The product is slower than it was a year ago.
That gap has a name. It is a coordination failure. And it was visible in the data before it was visible in the slipped commitment.
What the line item is paying for
The developer who used to ship one PR a week now ships two or three.
GitHub’s controlled study measured this. Developers using Copilot completed tasks 55 percent faster. Google’s internal study estimated roughly 21 percent faster. McKinsey estimated up to 2x on straightforward tasks. The individual numbers are real.
The organizational numbers tell a different story. Faros AI studied over 10,000 developers across 1,255 teams. Developers on high-AI-adoption teams completed 21 percent more tasks and merged 98 percent more pull requests. PR review time increased 91 percent. Bug rate per developer increased 9 percent. Average PR size increased 154 percent. At the company level, there was no significant correlation between AI adoption and improvements in throughput, quality, or DORA metrics.
Twice the output. Twice the surface area. Twice the chance that two of those PRs touch the same part of the system. The AI line item is paying for the doubling. The slipped feature is paying for the doubling of the surface area.
The AI agent that wrote the code has no context about what the agent on the other engineer’s machine is writing. It does not know about the PR that opened yesterday in the same directory. It does not know that the function it is refactoring is the function someone else is in the middle of extending.
It just writes. Fast. Correctly. Blind.
Wes McKinney, creator of pandas, published “The Mythical Agent-Month” on O’Reilly Radar. His argument: coordination complexity scales as a mathematical law, and AI agents do not exempt themselves from it. Different agent sessions may produce contradictory plans that humans then have to reconcile. Agents do not share intuition or negotiate ambiguity. They act on whatever has been made explicit.
Here is what that looks like in a real codebase.
In kubernetes/kubernetes, two of the project’s most active special interest groups (SIGs) touch the same system simultaneously. SIG API Machinery owns the type system. SIG Scheduling consumes it. Both groups have active contributors. Both groups have open PRs.
1,158 pull requests in kubernetes/kubernetes carry both SIG labels. That is 21 percent of all SIG Scheduling PRs. 602 of those merged.
In June 2024, PR #124898 from SIG API Machinery added deprecation annotations to API types. It inadvertently marked the v1 Binding sub-resource as deprecated. The scheduler started emitting a deprecation warning for every scheduled pod. Neither SIG caught the cross-boundary impact during review. The fix required a revert and a new unit test to prevent re-introduction.
A study published at MSR 2026 analyzed 25,953 Kubernetes contributors over 11 years. Cross-domain issues took 4.19x longer to resolve than single-domain issues.
This is not a Kubernetes problem. It is a coordination problem. It happens in every engineering team that has more than one person working in parallel, and the parallelism is what the AI tools are paying for.
Why the systems you have do not catch it
You have four systems that were supposed to coordinate this work. None of them do.
Tickets describe intent. They do not describe what changed.
Wikis describe how things worked when someone wrote them. That was eighteen months ago.
Slack captures decisions in threads nobody finds until the damage is done. Gartner found that 47 percent of digital workers struggle to find the information needed to do their jobs. Atlassian found that 65 percent of knowledge workers say responding to messages takes priority over making progress on top priorities. The decision that mattered was made in a thread. The engineer who needed it never saw it.
Standups report status. They do not surface the fact that two engineers are about to create a merge conflict neither of them knows is coming. Atlassian’s 2024 State of Teams report found that 50 percent of workers discovered duplicate project work only after starting their own efforts. Half the coordination failures were invisible until the work was already done.
Every one of these systems was designed for a world where engineers worked sequentially. One person finishes. Another person starts. The context lives in the handoff.
That world is gone. Your team works in parallel now, not because you planned it, but because the AI tools you paid for multiplied individual output by two to three. The bottleneck shifted from writing code to coordinating the code that was written. The coordination systems did not get faster. They did not change at all.
Fred Brooks identified this in 1975. Communication channels in a team grow as n(n-1)/2. A team of five has ten channels. A team of ten has forty-five. Double the team, and the coordination overhead more than quadruples. AI did not add people. It added the output of people. The coordination math is the same.
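The arithmetic is easy to check. A minimal sketch, where the function name `channels` is chosen only for illustration:

```python
def channels(n: int) -> int:
    """Pairwise communication channels in a team of n people: n(n-1)/2."""
    if n < 0:
        raise ValueError("team size cannot be negative")
    return n * (n - 1) // 2


# Channel counts for a few team sizes, and the growth when the team doubles.
for size in (5, 10, 20):
    print(size, channels(size))

growth_5_to_10 = channels(10) / channels(5)
```

Going from five people to ten multiplies the channel count by 4.5, which is where the "more than quadruple" intuition comes from.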
What engineering leaders who catch it first do
They do not read from the ticket down.
The ticket says what someone intended. It does not say what happened. It does not say that the implementation drifted three times before merge. It does not say that the engineer who wrote the ticket and the engineer who closed it had two different mental models of what done meant.
The leaders who catch coordination failures early read from the commit up.
Cataldo, Herbsleb, and Carley at Carnegie Mellon studied coordination dependencies across development activity at two companies. When developers’ actual coordination patterns matched the coordination requirements revealed by code-level dependencies, resolution time dropped 32 percent. When there was a gap between who needed to coordinate and who actually did, software failures increased. The gap was invisible at the ticket level. It was visible in the commits.
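One way to put a number on that gap, loosely following the congruence idea from that line of research: treat every pair of developers whose changed files overlap or depend on each other as a coordination requirement, then measure what fraction of those pairs actually coordinated. This is a simplified sketch under those assumptions, not the authors’ method. The function names, file paths, and developer names are all hypothetical.

```python
from itertools import combinations


def required_pairs(files_by_dev, depends_on):
    """Pairs of developers whose changed files are identical or linked by a dependency."""
    pairs = set()
    for a, b in combinations(sorted(files_by_dev), 2):
        for fa in files_by_dev[a]:
            for fb in files_by_dev[b]:
                if fa == fb or (fa, fb) in depends_on or (fb, fa) in depends_on:
                    pairs.add((a, b))
    return pairs


def congruence(required, actual):
    """Fraction of required coordination pairs that actually coordinated."""
    if not required:
        return 1.0  # nothing was required, so nothing was missed
    actual_sorted = {tuple(sorted(pair)) for pair in actual}
    return len(required & actual_sorted) / len(required)


# Hypothetical data: who changed what, and which files depend on which.
files_by_dev = {
    "ana": {"api/types.py"},
    "ben": {"scheduler/bind.py"},
    "cy": {"docs/readme.md"},
    "dee": {"api/types.py"},
}
depends_on = {("scheduler/bind.py", "api/types.py")}

# Hypothetical record of who actually talked to or reviewed whom.
actual = {("dee", "ana"), ("ben", "cy")}

required = required_pairs(files_by_dev, depends_on)
score = congruence(required, actual)
missed = sorted(required - {tuple(sorted(p)) for p in actual})
print(score, missed)
```

In this made-up example, three pairs needed to coordinate and only one did. The two missed pairs both involve the developer working in the scheduler, which is the kind of gap a ticket would not show.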
Microsoft Research found the same pattern from the ownership side. Bird, Nagappan, and colleagues studied Windows Vista and Windows 7. Components with more minor contributors had more pre-release and post-release failures. Ownership concentration predicted quality. The commit log showed ownership. The ticket did not.
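Ownership of this kind can be read straight off a commit log. The sketch below computes, per component, the top contributor’s share of commits and the list of minor contributors. The 5 percent cutoff for a minor contributor is an assumption in this sketch and should be checked against the paper. The component names, authors, and commit counts are made up.

```python
from collections import Counter

MINOR_THRESHOLD = 0.05  # assumed cutoff for a "minor" contributor


def ownership_report(commits):
    """Summarize ownership per component from (component, author) commit records."""
    counts_by_component = {}
    for component, author in commits:
        counts_by_component.setdefault(component, Counter())[author] += 1

    report = {}
    for component, counts in counts_by_component.items():
        total = sum(counts.values())
        owner, owner_commits = counts.most_common(1)[0]
        minor = sorted(
            author for author, n in counts.items() if n / total < MINOR_THRESHOLD
        )
        report[component] = {
            "owner": owner,
            "ownership": owner_commits / total,
            "minor_contributors": minor,
        }
    return report


# Hypothetical commit records: one (component, author) tuple per commit.
commits = (
    [("scheduler", "ana")] * 30
    + [("scheduler", "ben")] * 9
    + [("scheduler", "cy")] * 1
    + [("api", "dee")] * 4
)

report = ownership_report(commits)
print(report)
```

A component with a low ownership share and a long list of minor contributors is a candidate for a closer look.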
Leonardo Stern at Agoda described the shift: AI coding tools raised individual output, but velocity gains at the project level were surprisingly modest, because coding was never the real bottleneck. Sara Chizari at Microsoft’s UXR team reached the same conclusion: productivity breaks down not because tools are weak, but because coordination stops scaling.
They start with what changed. They look at who changed it. They ask whether the people making overlapping changes ever reviewed each other’s work. They do not wait for the retro. They do not wait for the slipped commitment to land at the board.
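Those questions can be asked of commit data directly. A minimal sketch, assuming each change has already been exported as an author, a set of touched files, and the set of people who reviewed it (for example from `git log --name-only` combined with your code host’s review records). The `Change` type, the function name, and the sample data are hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Change:
    id: str
    author: str
    files: frozenset
    reviewers: frozenset = frozenset()


def unreviewed_overlaps(changes):
    """Find pairs of changes by different authors that touch the same files
    where neither author reviewed the other's change."""
    findings = []
    for a, b in combinations(changes, 2):
        if a.author == b.author:
            continue
        shared = a.files & b.files
        if not shared:
            continue
        if b.author in a.reviewers or a.author in b.reviewers:
            continue  # the two authors have seen each other's work
        findings.append((a.id, b.id, sorted(shared)))
    return findings


# Hypothetical changes from the same week.
changes = [
    Change("pr-1", "ana", frozenset({"api/types.py", "api/doc.py"}), frozenset({"dee"})),
    Change("pr-2", "ben", frozenset({"api/types.py", "scheduler/bind.py"}), frozenset({"cy"})),
    Change("pr-3", "cy", frozenset({"scheduler/bind.py"}), frozenset({"ben"})),
]

findings = unreviewed_overlaps(changes)
print(findings)
```

Here pr-2 and pr-3 overlap, but their authors reviewed each other, so only the pr-1 and pr-2 overlap on `api/types.py` is flagged. The pairwise comparison is quadratic in the number of changes, which is fine for a week of activity but would need indexing by file for a large history.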
They watch the work.
This is harder than reading a ticket. It requires looking at data that does not come pre-summarized. It requires a system that watches the commits so you do not have to.
That system did not exist two years ago. It exists now.
What to do this quarter
The AI tooling line item is going to keep growing. The CFO is going to keep asking what it is buying. The honest answer is that it is buying individual output, and the question of whether that output is moving the company has been left to a coordination layer that was built for a smaller, slower team.
One thing is worth doing before anything else.
Connect your repository and look at what is actually happening. Not what the tickets say. Not what the standup reported. What the commits show.
zamski.com. Free. Read-only. Three minutes.
If you find nothing, you have lost three minutes. If you find something, you have the information that would have prevented the last slipped commitment, and the line item on next quarter’s QBR has a real answer behind it.
