Last sprint your team shipped 23 clean, reviewed, merged pull requests.
Open them. Name the declared business objective each one moved. The roadmap commitment. The contract obligation. The customer escalation. The board metric. Not the ticket. The objective.
The honest list runs short. Six, eight, maybe twelve. The rest is real work: it compiled, it passed review, it earned a green checkmark. And none of it can be tied to anything the company has committed to delivering.
Both facts are normal in 2026. Together they are the bill AI coding tools are quietly running up, and the question the board will ask next quarter when the AI line item is the third-largest engineering expense.
To see how the bill compounds, start with what an LLM actually is.
The model is a confident guesser
Andrej Karpathy, who built neural networks at Tesla and was a founding member of OpenAI, said the quiet part out loud. Hallucination is all an LLM does. The model is a dream machine. Sometimes the dream matches reality.
That is not a knock on AI coding tools. It is the foundation under them. In a session with no external tools connected, the model looks nothing up. Every token it emits is sampled from a probability distribution over the next token, conditioned on what came before. Your engineers know that much. The consequence is the part that gets skipped. What the model hands back carries no truth signal. Nothing in the output separates a confident correct answer from a confident invention. They look identical.
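A toy sketch of that mechanism, for readers who want to see it rather than take it on faith. The candidate tokens and their scores below are invented for illustration; a real model scores tens of thousands of tokens, and "it depends" would not be a single token. The shape is the point: the only number attached to the output is a likelihood given the context.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def sample_next_token(logits, rng):
    """Sample one token in proportion to its probability."""
    probs = softmax(logits)
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0], probs

# Hypothetical scores for what follows "Partition the table at ... GB".
toy_logits = {"100": 2.0, "50": 1.2, "10": 0.3, "it depends": -1.0}

token, probs = sample_next_token(toy_logits, random.Random(0))
print(token, round(probs[token], 3))
# probs[token] says how likely the token was given the prompt.
# Nothing here says whether the token is true.
```

A high probability means the continuation is typical of the training data, not that it is correct for your table.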
Most AI-assisted engineering runs on a quiet assumption: that a good enough prompt makes the model a reliable decision maker.
It does not.
When an answer comes back and ships without a second reader, nobody consulted an expert. They trusted a fluent guess. Sometimes the guess is right. The question for every AI coding shop is what happens around the guess.
A five-minute experiment
This morning I ran one question past three frontier models, fresh sessions, five minutes apart. A real design-review question: what is the maximum size a Postgres table should reach before you partition it.
Gemini said 50 to 100 GB, or 100 million rows. ChatGPT said under 10 GB is not worth it, 100 GB to 1 TB is worth evaluating, past 1 TB you design for partitioning. Claude said around 100 GB, or 100 million rows. Three models, roughly one answer. Specific. Confident.
That agreement is not the interesting part. The interesting part is that all three answered with the same confidence, and not one asked what a senior engineer asks first. What is the workload. What are the query patterns. How much of the table is read versus written. None of them said the honest thing, which is that the answer depends on all of that and they cannot see any of it from here.
The experiment takes five minutes. The point is not whether the models agree. It is that they hand you a confident number whether or not they have any business doing so, and nothing in the output tells you which case you are in.
What healthcare already worked out
Medical imaging has lived with imperfect classifiers for decades. A single radiologist does not catch every cancer. In a 2020 study in Radiology of more than a million screening mammograms, the most sensitive quartile of high-volume readers caught 84 percent of cancers; the least sensitive caught 63 percent. No single reader hits the number you want.
The response was not better radiologists. It was double reading: two readers, independently, every scan, with a consensus step when they disagree. That raises sensitivity by 5 to 15 percent. When AI was added, the PRAIM study in Germany followed 463,094 women across 12 sites and found a 17.6 percent higher detection rate with no rise in false positives. The AI did not replace either radiologist. It became a third reader inside a process that already assumed no single reader was enough.
That protocol does one specific thing: it checks whether the reading is correct. It was never built to ask whether the scan should have been ordered.
The two questions engineering carries
An LLM in a code workflow looks a lot like a single reader. It produces an output with no internal calibration. Sometimes right, sometimes confidently wrong, and you cannot tell which from the output alone.
The healthcare lesson applies, with one adjustment. You cannot double-read everything. A second reader on every AI-generated change deletes the speed you adopted the tools for. So route by risk. The Postgres experiment gives you the filter: the work most worth a second reader is the work where the model gave a confident answer to an underspecified question. Scope the second read to that, and leave the rest alone.
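One way to picture that filter is as a small routing rule. This is a sketch under stated assumptions, not a prescribed implementation: the `AiChange` record, its two fields, and the sample changes are all hypothetical, and in practice somebody (the author, a checklist, a reviewer) has to fill them in honestly.

```python
from dataclasses import dataclass, field

@dataclass
class AiChange:
    """Hypothetical record describing one AI-assisted change."""
    title: str
    # True if the model committed to a hard number or a firm recommendation.
    confident_specifics: bool
    # Inputs the answer depends on that the model could not see.
    missing_context: list[str] = field(default_factory=list)

def needs_second_reader(change: AiChange) -> bool:
    """Route to a second reader only when a confident answer
    rests on context the model never had."""
    return change.confident_specifics and len(change.missing_context) > 0

changes = [
    AiChange("Partition orders table at 100 GB", True,
             ["workload", "query patterns", "read/write mix"]),
    AiChange("Rename helper function", False),
    AiChange("Bump pinned dependency per changelog", True),
]

review_queue = [c.title for c in changes if needs_second_reader(c)]
print(review_queue)
```

With the sample data above, only the partitioning change lands in the queue; the other two ship on the normal review path.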
That handles correctness. It does not handle the second question, and the second question is the one double reading was never built for.
Engineering carries two questions, not one.
Is the AI-generated code correct.
Is the correct code aimed at an objective the business declared.
Review closes the first. Nothing in review touches the second. A clean, well-tested 200-line change can be verified down to the last line and still move no objective the company named. The DORA dashboard will not tell you. The PR was merged. The objective did not move. Both true at once.
That second gap is the one that compounds. It is the difference between an AI investment that produces business value and one that quietly produces verified, well-reviewed rework.
The cheap version of the audit
It costs about an hour this week.
Take last sprint’s merged pull requests. Open the roadmap doc, the board commitments page, the customer success escalation list, and the contract obligations sheet. For each merged PR, name one item across those four documents that this PR moved.
Not the ticket. The objective.
The PRs where you cannot name one are not all waste. Refactors that prevent the next incident move an objective even if they cannot be tied to one today. Internal tooling work that compounds engineer velocity is real value. The pile you want to look at is the work that was neither maintenance nor objective-aligned. The work somebody asked for in a thread three weeks ago. The work that started as a Tuesday-afternoon idea and ran for two sprints. The work that closed a Jira ticket whose business owner has since left.
That pile is not zero. In engineering orgs that have done this audit honestly, it runs 20 to 35 percent of merged work. No dashboard surfaces it. No retro names it. The merge happened. The objective did not.
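The tally itself fits in a few lines once the hour of reading is done. The sketch below assumes you have recorded, by hand, one objective per PR where you could name one and a maintenance flag where the work was a refactor or internal tooling. The `MergedPR` record, the PR numbers, and the objective labels are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MergedPR:
    """Hypothetical audit row for one merged pull request."""
    number: int
    # Item named from the roadmap, board commitments, escalation list,
    # or contract obligations sheet. None if you could not name one.
    objective: Optional[str] = None
    # Refactor or internal tooling with no objective attached today.
    maintenance: bool = False

def audit(prs):
    """Split merged PRs into aligned, maintenance, and unaccounted piles."""
    aligned = [p for p in prs if p.objective]
    maintenance = [p for p in prs if not p.objective and p.maintenance]
    unaccounted = [p for p in prs if not p.objective and not p.maintenance]
    share = len(unaccounted) / len(prs) if prs else 0.0
    return {
        "aligned": len(aligned),
        "maintenance": len(maintenance),
        "unaccounted": [p.number for p in unaccounted],
        "unaccounted_share": share,
    }

# Made-up sample sprint, not real data.
sample = [
    MergedPR(101, objective="Roadmap: SSO launch"),
    MergedPR(102, maintenance=True),
    MergedPR(103),
    MergedPR(104, objective="Contract: data export SLA"),
    MergedPR(105),
]

result = audit(sample)
print(result)
```

The `unaccounted` list is the pile to read by hand. The share is only as honest as the objective column that feeds it.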
Why the budget conversation depends on this
The AI tooling line item has been justified by developer velocity. Velocity is generation. Generation without verification is faster shipping of unverified output. Generation with no objective behind it is faster shipping of work nobody asked for. Neither shows up in a velocity dashboard.
The product team sees it three weeks later, when the feature lands and nothing it was supposed to move has moved. The CFO sees it three quarters later, when the AI line item is up 70 percent and the company’s revenue per engineer is flat.
Medical imaging spends real money on double reading. Two radiologists per scan is not free. The field carries the cost because it measured the gain first. Most engineering organizations adopted AI tools without building the verification layer, and without a way to tie verified output back to a declared objective. The line item is growing. The measurement is not.
That gap is yours to own. Getting ahead of it is not a debate about whether to use AI tools. It is building the measurement layer the tools never shipped with: the one that reads what your team actually produced and tells you which of it moved a declared objective and which of it did not. The same way you instrument cloud spend and incident response.
The board will ask the question in 2027. Either the answer is ready, or the AI line item starts getting cut by people who do not know which 30 percent to cut.
The Postgres question, one more time
The honest answer to the question I asked depends on workload, hardware, query patterns, and retention. A senior engineer would say exactly that, then ask three questions back. None of the three models asked a single question. They returned confident probability distributions shaped like answers.
That is the operating reality. The model is one reader. Sometimes a useful one. Never the only one. What your team builds around it, and whether that process reaches all the way to “did this move a declared objective,” is what separates an AI investment that compounds from one that does not.
Zamski reads from your integrations and shows which work maps to a declared objective and which does not. See it on a repo you recognize: zamski.com.
