The real AI bottleneck is not intelligence. It is coordination.

Most enterprise AI value is lost not to model limits but to the organisation around the model. Coordination, not intelligence, is the binding constraint.

A useful way to summarise the last eighteen months of enterprise AI is that the question quietly changed. In late 2024 the question most enterprises asked was “can the model do this?” By mid-2026 the question is “can our organisation safely turn what the model can do into work?” These sound similar. They are not.

The first is a capability question, answerable by a benchmark. The second is an organisational question, answerable only by the slow process of figuring out who owns the output, who reviews it, who is accountable when it is wrong, how it integrates with existing workflows, who has access to the relevant data, what the audit trail looks like, and how the people whose jobs are touched by it actually feel about its presence. Most of the AI value enterprises have failed to capture lives in that second question. So does most of the consulting work that is going to matter for the next five years.

This is the third in a series, and worth a moment of synthesis. The first article unbundled what consulting was historically selling: analytical labour, ambiguity conversion, institutional memory, senior credibility. It argued that AI compresses the conversion layer in the middle. The second argued that the firms responding well are repackaging the bundle as outcomes underwritten rather than labour billed. This piece makes the deeper claim: the best consultants were always selling coordination, and the bundle was a distribution mechanism for it. AI does not kill consulting. It strips off the distribution wrapper and exposes whether you were ever in the coordination business or just the artefact business.

The bottleneck has moved

The framing that has landed best in 2026 came from George Sivulka of Hebbia, in a March 2026 essay arguing for what he calls institutional AI rather than individual AI. Sivulka’s analogy: in the 1890s, electricity promised enormous productivity gains, but for thirty years electrified mills saw almost no increase in output: “the technology was far superior. But the organisation was not.” AI today is in the same position. Individual knowledge workers, in the right conditions, are dramatically more productive. The factory around them — meetings, handoffs, document repositories, approval chains, tribal knowledge — has not changed. The productivity gain leaks out at every joint.

Bret Taylor, who runs Sierra and chairs OpenAI’s board, has put the same point in Conway’s Law form: large companies struggle to adopt AI because “they are shipping their org charts.” The line worth sitting with is the one he added on Stripe’s Cheeky Pint podcast in November 2025: “the atomic unit of AI productivity is a process, not a person.” That single sentence reframes most of the enterprise AI roadmaps currently being sold. If the unit of productivity is a process, then individual-level deployments — give every analyst a Copilot, give every lawyer a Harvey — capture only a fraction of the available value. The rest sits in the joints between processes, which is where coordination lives. Founders saying variants of this in 2026 — Sergei Sorokin at Highlight, Arvind Jain at Glean, others — are not all making the same case, but they converge on the observation that the intelligence supply now exceeds the organisation’s ability to consume it productively.

In AI and data consulting projects, this looks like the following: a client signs up for a generative AI initiative, deploys a model, sees individual employees become genuinely faster at certain tasks, and then discovers six months in that the firm-level metrics have not moved. Cycle times are similar. Headcount is similar. Time-to-decision is similar. The productivity gain exists somewhere, and individual employees can name it, but it is not aggregating to the P&L. This is the coordination tax in operation. It is the specific phenomenon behind the MIT NANDA report’s now-famous finding that roughly 95% of enterprise AI initiatives produce no measurable P&L impact.

What the 95% finding actually means

The 95% number is widely cited and slightly misunderstood. It does not mean that 95% of AI deployments are failing technically; the technology mostly works. What the number actually shows is that 95% of deployments are not changing the business in ways that show up in the financials. Those are different problems with different solutions.

Technical failures, when they happen, are addressed by better models, better evals, better data infrastructure, better prompting. The frontier labs are highly motivated to fix these and largely will. The 95% problem is a different shape. It is about what happens when capability lands in an organisation that does not change anything else about how it operates. The model gets used. Individuals benefit. The organisation does not.

The historical analogy most economists reach for, and which Sivulka develops at length in the essay above, is electrification. Factories did not become more productive when they replaced steam engines with electric motors. They became more productive twenty years later, when they redesigned the factory floor around the new fact that motors no longer needed to be near a central power source. The motor change was the easy part. The factory redesign was the hard part. AI is currently in the motor-change phase. The factory redesign is what most enterprises have not started.

This matters for consulting because the factory redesign — operating model change, role redesign, governance, evaluation, change management — is exactly what consulting has historically been good at and software companies have historically been bad at. The opportunity for the firms that figure this out is large. The risk for the firms that do not is that they get stuck selling AI deployment services into a market that has already learned deployment is not the bottleneck.

Multi-agent systems: the debate, simplified

A specific technical fashion worth flagging is multi-agent architectures: the idea that complex AI work should be done by orchestrating multiple specialised agents (a researcher agent, a writer agent, a reviewer agent, and so on) rather than by a single agent with good tools. This has become the default architecture in a lot of consulting and enterprise sales pitches. It is not the consensus among practitioners.

Two reference points define the original debate. Anthropic, in a June 2025 engineering post, reported that a multi-agent system using Claude Opus 4 as lead and Claude Sonnet 4 sub-agents outperformed single-agent Claude Opus 4 by 90.2% on its internal research evaluation. The price was token usage roughly 15 times that of a single chat interaction (a single agent, by comparison, used about 4 times as many tokens as a chat), worth it only when the task value justified the spend. A few days earlier, Cognition Labs (the team behind Devin) had published “Don’t Build Multi-Agents,” arguing the opposite: that for almost all production use cases, a single well-equipped agent with strong context is more reliable, cheaper, and easier to debug than an orchestrated swarm.
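The "worth it only when the task value justified the spend" condition is simple arithmetic, and a minimal sketch makes it concrete. Only the roughly 15x multiple comes from the text above; the dollar figures and the function names `extra_cost` and `worth_it` are hypothetical, chosen purely for illustration.

```python
# Illustrative break-even check for the ~15x multiple quoted in the text.
# Dollar amounts are hypothetical assumptions, not figures from any source.

MULTI_AGENT_MULTIPLE = 15  # multi-agent run vs. a single chat, per the text


def extra_cost(chat_cost: float, multiple: float = MULTI_AGENT_MULTIPLE) -> float:
    """Additional spend of one multi-agent run over one single chat."""
    return chat_cost * (multiple - 1)


def worth_it(chat_cost: float, value_of_better_answer: float) -> bool:
    """True when the added value of the better answer exceeds the added spend."""
    return value_of_better_answer > extra_cost(chat_cost)


if __name__ == "__main__":
    # Hypothetical: a single chat costs $0.50.
    print(extra_cost(0.50))        # 7.0
    print(worth_it(0.50, 2.00))    # False: a $2 improvement does not cover $7
    print(worth_it(0.50, 500.00))  # True: a $500 improvement easily does
```

The point of the sketch is that the decision turns on the value of the marginal improvement per task, not on the benchmark uplift alone.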

The interesting move came in April 2026, when Cognition itself published a follow-up, “Multi-Agents: What’s Actually Working,” softening the original line. Cognition now writes that “we’ve begun to deploy multi-agent systems that actually work in practice” but only in “a narrower class of patterns: setups where multiple agents contribute intelligence to a task while writes stay single-threaded.” That is a meaningful update. “Multi-agent reads, single-threaded writes” is a very different architecture from “let agents collaborate”; it is much closer to a controlled review pipeline than to a swarm.
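One way to picture "multi-agent reads, single-threaded writes" is the sketch below. It is an illustration of the general pattern under stated assumptions, not Cognition's implementation: the two reader functions are hypothetical stand-ins for model-backed agents, and the conflict rule (first suggestion per line wins) is an arbitrary choice made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Suggestion:
    source: str  # which reader proposed it
    line: int    # 0-based index of the line to change
    text: str    # proposed replacement for that line


Reader = Callable[[List[str]], List[Suggestion]]


# Hypothetical stand-ins for agents: they read the document and only propose.
def spelling_reader(doc: List[str]) -> List[Suggestion]:
    return [Suggestion("spelling", i, line.replace("teh", "the"))
            for i, line in enumerate(doc) if "teh" in line]


def tone_reader(doc: List[str]) -> List[Suggestion]:
    return [Suggestion("tone", i, line.replace("obviously ", ""))
            for i, line in enumerate(doc) if "obviously " in line]


def run_readers(doc: List[str], readers: List[Reader]) -> List[Suggestion]:
    """Fan out: every reader gets its own copy of the document, in parallel."""
    snapshot = tuple(doc)
    with ThreadPoolExecutor() as pool:
        batches = list(pool.map(lambda reader: reader(list(snapshot)), readers))
    return [s for batch in batches for s in batch]


def apply_single_threaded(doc: List[str], suggestions: List[Suggestion]) -> List[str]:
    """Fan in: one writer applies changes sequentially, in a fixed order."""
    out = list(doc)
    touched = set()
    for s in sorted(suggestions, key=lambda s: (s.line, s.source)):
        if s.line in touched:
            continue  # conflicting proposal for an already-edited line is dropped
        out[s.line] = s.text
        touched.add(s.line)
    return out


if __name__ == "__main__":
    doc = ["teh model is fast", "obviously it works", "obviously teh end"]
    proposals = run_readers(doc, [spelling_reader, tone_reader])
    print(apply_single_threaded(doc, proposals))
    # ['the model is fast', 'it works', 'obviously the end']
```

The readers can disagree freely because none of them can change the document; the only place state changes is the single writer, which is what makes the result reproducible and easy to audit.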

The 2026 evidence, taken together, is not “multi-agent doesn’t work.” It is “naive multi-agent is expensive and brittle; carefully scoped multi-agent with hard write boundaries works.” LangChain, a widely used open-source agent framework, has visibly retreated from multi-agent enthusiasm toward what its co-founder Harrison Chase calls “context engineering”: the discipline of giving a single agent the right information, tools, and scaffolding rather than splitting work across many agents. Bret Taylor has said publicly that he has grown more sceptical of the original multi-agent vision. Flo Crivello at Lindy has been candid that his company’s first version overestimated agent autonomy and has since been rebuilt around “deterministic scaffolding where you can really force the agent to go through a deterministic set of steps.” The VC Tomasz Tunguz, looking at his own pipeline of 14 production agent workflows, reports that 65% of nodes now run as non-AI code. It is a small sample from a single firm, but directionally telling.
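A rough sketch of what "deterministic scaffolding" can mean in code follows. It assumes a hypothetical invoice workflow in which only one step would call a model (stubbed here as a plain function); the node names, the 10,000 routing threshold, and the `non_ai_share` helper are all invented for illustration and do not describe Lindy's or Tunguz's actual systems.

```python
from typing import Callable, Dict, List, NamedTuple


class Node(NamedTuple):
    name: str
    uses_model: bool
    run: Callable[[Dict], Dict]


def parse_total(state: Dict) -> Dict:
    """Plain code: extract the amount after a 'TOTAL:' marker."""
    amount_text = state["raw"].split("TOTAL:")[1].strip()
    return {**state, "amount": float(amount_text)}


def validate(state: Dict) -> Dict:
    """Plain code: a guardrail that is never left to the model."""
    if state["amount"] <= 0:
        raise ValueError("amount must be positive")
    return state


def draft_note(state: Dict) -> Dict:
    """The only step that would call a model; stubbed so the sketch runs offline."""
    return {**state, "note": f"Invoice total {state['amount']:.2f}; please confirm."}


def route(state: Dict) -> Dict:
    """Plain code: large invoices always go to a human queue (hypothetical rule)."""
    return {**state, "queue": "manual" if state["amount"] > 10_000 else "auto"}


PIPELINE: List[Node] = [
    Node("parse_total", False, parse_total),
    Node("validate", False, validate),
    Node("draft_note", True, draft_note),
    Node("route", False, route),
]


def run_pipeline(state: Dict, pipeline: List[Node] = PIPELINE) -> Dict:
    """Force every input through the same fixed sequence of steps."""
    for node in pipeline:
        state = node.run(state)
    return state


def non_ai_share(pipeline: List[Node]) -> float:
    """Fraction of nodes that are ordinary code rather than model calls."""
    return sum(not node.uses_model for node in pipeline) / len(pipeline)


if __name__ == "__main__":
    result = run_pipeline({"raw": "ACME Ltd TOTAL: 12500.00"})
    print(result["queue"])         # manual
    print(non_ai_share(PIPELINE))  # 0.75
```

The model is confined to the one step where language generation is actually needed; ordering, validation, and routing are fixed in code, so the agent cannot skip or reorder them.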

The consulting implication of this debate is unromantic but important. Most enterprise AI projects do not need multi-agent orchestration. They need a single, well-scoped agent with clean access to the right data, careful evaluation, deterministic guardrails for the parts that should not be left to the model, and a clear human review point where it matters. The firms selling “agentic AI transformation” with elaborate multi-agent diagrams are mostly selling architecture that does not yet work as advertised. The firms doing useful enterprise AI work are mostly building carefully engineered single agents with good context. Knowing which of these you are buying is one of the most important questions a CIO should be asking in 2026.

AI reviewing AI: where it works and where it does not

The natural extension of agentic enthusiasm is the idea of “AI reviewing AI”: using one model to check another’s output, in the hope that this drives quality up without requiring human review. Anthropic released its Code Review for Claude Code on March 9, 2026, doing exactly this for code. The internal numbers it shared are unusually concrete: before deployment, 16% of Anthropic’s pull requests received substantive review comments; after, 54% did. On large pull requests (over 1,000 lines changed), 84% now get findings, averaging 7.5 issues per PR. Anthropic’s head of product values each comment at $15–25 of senior-engineer time saved. The pattern works for programming because code has a cheap ground truth: it either runs and passes the tests or it does not. First-pass AI review against that ground truth is now a shipping product.

The pattern works much less well outside that condition. For consulting work such as strategic recommendations, market analyses, or client communications, there is rarely a cheap ground truth. An AI reviewer can check for internal consistency, factual claims against retrieved sources, formatting compliance, and a few other surface-level properties. It cannot check whether the analysis is the right one for the client’s situation, or whether the recommendation will land in the boardroom, or whether the framing is going to get the consultant fired. Those are exactly the judgment calls that, as the first article in this series argued, are concentrating premium value on a small group of senior people.

So “AI reviewing AI” is a useful technique in narrow domains and a misleading framing for consulting work generally. The honest version is: AI doing first-pass review on the parts that have ground truth, humans doing review on the parts that do not. That is not as catchy a slide title.
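The split described above can be sketched as a routing rule: anything with a cheap, mechanical check gets an automated first pass, and anything without one goes to a person. This is a minimal illustration under assumptions; the `WorkItem` type, the `sums_match` check, and the example items are hypothetical, and in a real system the automated check might itself be a model-backed reviewer constrained by tests.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class WorkItem:
    name: str
    content: str
    # A cheap ground-truth check, if one exists for this kind of item.
    check: Optional[Callable[[str], bool]] = None


def sums_match(text: str) -> bool:
    """Hypothetical ground-truth check: does 'a+b+...=total' add up?"""
    lhs, rhs = text.split("=")
    return sum(int(part) for part in lhs.split("+")) == int(rhs)


def route_reviews(items: List[WorkItem]) -> Tuple[Dict[str, bool], List[str]]:
    """Return (automated results by item name, names queued for human review)."""
    automated: Dict[str, bool] = {}
    human_queue: List[str] = []
    for item in items:
        if item.check is None:
            human_queue.append(item.name)  # no ground truth: judgment call
        else:
            automated[item.name] = item.check(item.content)
    return automated, human_queue


if __name__ == "__main__":
    items = [
        WorkItem("totals_table", "40+60=100", check=sums_match),
        WorkItem("bad_totals", "40+60=110", check=sums_match),
        WorkItem("board_recommendation", "Exit the segment within two years."),
    ]
    print(route_reviews(items))
    # ({'totals_table': True, 'bad_totals': False}, ['board_recommendation'])
```

The design choice worth noting is the default: an item with no check is never silently passed, it is queued for a human.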

Codifying tacit knowledge: harder than it looks

A persistent fantasy in enterprise AI is that an organisation’s most senior practitioners can be cloned by capturing their judgment in a system, through fine-tuning, eval rubrics, agent persona prompts, or some other mechanism, and that the resulting digital twin will be close enough to the human to scale their value across the firm. Sivulka, in the same essay, identifies enablement as the third pillar of institutional AI: the discipline of building evaluation systems that encode senior judgment about what good looks like. His sharper observation is that current consumer AI models are over-aligned to agree with users, which is organisationally toxic. Institutional AI must challenge assumptions, surface risks, and enforce standards, “functioning as an auditor rather than an assistant.” That is exactly the function senior experts perform inside good firms. It is not data that codifies it. It is a hard-won eval rubric written by someone who has seen what goes wrong.

Two things are true about this at once. The first is that the approach genuinely works, at least partially, for well-bounded domains where good and bad are reasonably specifiable. Eval rubrics built by senior experts raise the floor on AI output in a way that lower-quality training data does not. Harrison Chase’s Align Evals framework at LangChain is now a workable methodology for doing this systematically. The second is that the moment the rubric exists, the senior expert’s judgment has been partially extracted from them. They no longer fully own the thing they were paid for. This is an unresolved tension at the heart of every “let’s productise our methodology” conversation inside a consulting firm.
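To make "eval rubric" less abstract, here is a deliberately small sketch of the idea: a senior reviewer's criteria written down as weighted, checkable rules. It is not the Align Evals API or any LangChain interface; the criteria, weights, and keyword checks are hypothetical, and real rubrics would typically rely on graded examples or model-based judges rather than substring tests.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Criterion:
    name: str
    weight: int                     # relative importance, set by the expert
    passes: Callable[[str], bool]   # hypothetical, deliberately crude check


# Hypothetical rubric: what one senior reviewer says a memo must contain.
RUBRIC: List[Criterion] = [
    Criterion("states_a_recommendation", 5, lambda t: "recommend" in t.lower()),
    Criterion("names_a_risk", 3, lambda t: "risk" in t.lower()),
    Criterion("cites_a_source", 2, lambda t: "source:" in t.lower()),
]


def score(text: str, rubric: List[Criterion] = RUBRIC) -> float:
    """Weighted share of criteria the text satisfies, between 0 and 1."""
    total = sum(c.weight for c in rubric)
    earned = sum(c.weight for c in rubric if c.passes(text))
    return earned / total


def failures(text: str, rubric: List[Criterion] = RUBRIC) -> List[str]:
    """Names of the criteria the text fails, for feedback to the author."""
    return [c.name for c in rubric if not c.passes(text)]


if __name__ == "__main__":
    memo = "We recommend a phased rollout. The main risk is data quality."
    print(score(memo))     # 0.8
    print(failures(memo))  # ['cites_a_source']
```

Even at this toy scale the tension in the paragraph above is visible: once the weights and checks are written down, they can be run by anyone, without the person who chose them.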

It is also why several senior figures who have done this work have ended up leaving their firms to run product companies built on the methodology they codified. The economics of being the person who wrote the rubric are very different inside a partnership than they are outside one. Most consulting firms have not faced this publicly.

For clients, the takeaway is more practical: codification of senior judgment is a real source of leverage, but it is partial, lossy, and dated as soon as the underlying environment changes. Eval rubrics for last year’s market conditions are imperfect for this year’s. Tacit knowledge that took twenty years to accumulate cannot be fully encoded in six months of effort. The codification project is worth doing, with realistic expectations about what it produces: a useful augmentation, not a replacement for the senior people involved.

The real consulting opportunity

Pulling all of this together, the most defensible consulting work for the next several years has a specific shape. It is not “deploy this model.” That work is being commoditised by the model providers themselves. It is the work of redesigning the factory around the new motor: operating model change, role redesign, evaluation infrastructure, governance, change management, and the careful translation of AI capability into actually-different organisational behaviour.

That work has four properties. It is long-cycle, taking quarters or years rather than weeks. It is politically loaded, because it touches who does what and who is accountable. It is high-context, depending on understanding the specific organisation rather than just the technology. And it is outcome-attributable, because the change either shows up in the metrics or it does not. Work with these four properties is exactly the kind that is hardest for AI to do alone and easiest for an experienced human consultant to lead. It is also the work that traditional management consulting was originally designed for, before it drifted toward selling slide production.

The honest version of “AI is reshaping consulting” is therefore not that consulting is dying. It is that consulting is being given a chance to do the work it has always claimed to do: operating model change, governance design, organisational behaviour change. And it gets to do that work without the camouflage of producing a thousand pages of slides along the way. Whether the industry takes that chance is an open question. The opportunity is real, the bottleneck is real, and the firms that figure out how to underwrite outcomes on factory redesign rather than effort on motor installation will be doing the most valuable work of the decade.

The bottleneck is not intelligence. It is coordination. That is good news, because coordination has always been what the best consultants were actually selling.


Prediction tracker

Claim: By the end of 2027, the most credible enterprise AI engagements stop being measured in models deployed and start being measured in processes redesigned. The MIT NANDA-style “95% no measurable P&L impact” headline number falls below 70%.

Confidence: Medium. The mechanism (factory redesign, not motor installation) is well-understood and the supplier base is forming; the obstacle is enterprise will and the speed of operating-model change.

Falsifier: If, by end-2028, large-sample studies still report that the share of enterprise AI initiatives producing measurable P&L impact is below 10%, this article was wrong about coordination being a soluble bottleneck on this timescale.