The real bottleneck is not smarter models

March 11, 2026
in Opinion

There is a distinct exhaustion creeping into the conversations among people who actually build artificial intelligence products. You do not hear it on the conference stages. You do not read it in the press releases or the venture capital thesis papers. You hear it in the quiet moments after the weekly engineering meetings. The gap between what the models can do in a vacuum and what they can do reliably in production is widening into a canyon. We are watching a strange disconnect unfold across the industry. The public is being promised autonomous digital workers that can handle complex multi-step workflows. The engineers building those workers are currently struggling to make them parse a basic spreadsheet without hallucinating a new column every third attempt. They are quietly abandoning formal unit tests for “vibe checks” because the outputs are too erratic to measure.

This friction is the defining characteristic of the current moment. We have spent the last two years treating generative models like magic spells. You speak the right incantation into the prompt box and the machine does your bidding. But magic is notoriously difficult to scale into enterprise software. When you try to chain three or four of these spells together to automate a real business process, the system becomes incredibly brittle. The models drift away from the instructions. They forget the original constraints. They confidently execute the wrong plan and report back that the job is finished. The mood among operators has shifted from breathless excitement to a very specific kind of systems engineering fatigue. Everyone is trying to build reliable software out of fundamentally unreliable parts.

The consensus view

The dominant take across the industry is that this brittleness is a temporary symptom of underpowered models. The consensus relies heavily on the concept of scaling laws. If the current generation of models cannot reliably execute a ten-step reasoning task, the solution is simply to wait for the next massive training run. The belief is that planning, reasoning, and reliability are emergent properties of scale. If you feed the machine more data and train it with more compute, the hallucinations will naturally vanish. The models will learn to think ahead and correct their own mistakes simply by absorbing more information.

This perspective sounds entirely reasonable because it matches the exact trajectory of the past three years. We saw models go from generating garbled sentences to writing coherent essays just by scaling up the parameters and the training data. We saw them move from failing basic logic puzzles to passing standardized tests. The industry has been trained to view every software problem as a hardware problem in disguise. The consensus dictates that we do not need to fundamentally change how we build applications. We just need to keep our infrastructure ready for the day the massive new models arrive and fix the reasoning gap automatically. The prevailing advice is to avoid building complex workarounds today because the models of tomorrow will render that engineering obsolete.


The pivot

The crowd is missing the fundamental math of compound probability. We are trying to solve a structural systems problem by brute-forcing the autocomplete function. The missing ingredient is not a larger parameter count. The models are not failing because they lack intelligence or because they have not read enough data. They are failing because they lack state management and structural error recovery.

My thesis is that the era of relying on raw model intelligence to drive product reliability is over. We are hitting the absolute limits of what stateless prediction can achieve in stateful environments. The companies that survive the next two years will stop waiting for a smarter model to descend from the research labs. They will instead build massive amounts of deterministic scaffolding to catch the inevitable failures of the probabilistic engine. The future of artificial intelligence is not an autonomous agent that never makes a mistake. It is a deeply constrained system that makes thousands of mistakes per second and catches all of them before the user ever notices.
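That scaffolding can be pictured as a deterministic wrapper around every probabilistic step: schema-check the output, retry on failure, and fail loudly rather than let a bad result propagate. Here is a minimal sketch, with a hypothetical `flaky_extract` standing in for a model call; the field names and the twenty percent failure rate are invented for illustration.

```python
import random

random.seed(0)  # makes the simulated failures reproducible

REQUIRED_FIELDS = {"vendor", "amount"}

def flaky_extract(record: dict) -> dict:
    """Stand-in for a model call: usually returns the right fields,
    but sometimes hallucinates an extra column (simulated at 20%)."""
    out = {"vendor": record["vendor"], "amount": record["amount"]}
    if random.random() < 0.2:
        out["invented_column"] = "???"  # the drift the scaffold must catch
    return out

def validated(step, record: dict, max_retries: int = 3) -> dict:
    """Deterministic scaffold: check every output against a fixed schema,
    retry on failure, and raise rather than pass bad data downstream."""
    for _ in range(max_retries):
        result = step(record)
        if set(result) == REQUIRED_FIELDS and isinstance(result["amount"], (int, float)):
            return result
    raise ValueError(f"step failed validation {max_retries} times for {record!r}")

row = validated(flaky_extract, {"vendor": "ACME", "amount": 1200})
```

The probabilistic part stays probabilistic; the safety net around it is ordinary deterministic code, which is the point.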

Evidence and mechanism

Let us look at the mechanics of chaining tasks together. If a model has a ninety-five percent success rate on a single text generation task, it feels incredibly smart to the user. But if you build an agent workflow that requires the model to successfully complete ten sequential steps, that ninety-five percent reliability degrades rapidly. Basic probability dictates that a ninety-five percent chance of success compounded ten times results in a final success rate of roughly sixty percent. If the workflow requires twenty steps, the success rate drops to roughly thirty-six percent. You cannot build a sustainable business on software that fails two out of every three times it runs. The industry is currently trying to push that base reliability to ninety-nine percent through larger training runs. But even at ninety-nine percent per step, a fifty-step autonomous process still fails roughly forty percent of the time.
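The arithmetic is a one-liner to verify, assuming each step succeeds independently:

```python
def pipeline_success(per_step: float, steps: int) -> float:
    """End-to-end success rate when each of `steps` sequential actions
    succeeds independently with probability `per_step`."""
    return per_step ** steps

print(f"{pipeline_success(0.95, 10):.1%}")  # -> 59.9%: "roughly sixty percent"
print(f"{pipeline_success(0.95, 20):.1%}")  # -> 35.8%
print(f"{pipeline_success(0.99, 50):.1%}")  # -> 60.5%: fails about two runs in five
```

Independence is a simplifying assumption, and a generous one: in practice an early mistake makes later steps more likely to fail, not less.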

The architecture of auto-regressive models makes this degradation inevitable. These systems generate text one token at a time. They only move forward. If a model makes a slightly suboptimal choice on the third token of a sequence, it does not naturally realize its mistake and backtrack. Instead, it assumes the previous tokens are absolute ground truth and continues to generate the most likely next token based on that flawed context. This is why hallucinations compound so aggressively. A tiny error in step two becomes a massive unrecoverable failure by step eight. The model becomes committed to its own mistakes because it has no inherent mechanism for self-doubt once a token is printed.
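A toy simulation makes the compounding visible. The numbers are assumptions for illustration, not measurements: a five percent base error rate per token, tripled once the context already contains a mistake. The point is only that forward-only generation with no backtracking turns one early slip into many later ones.

```python
import random

def mean_errors(trials: int = 2000, tokens: int = 20,
                base_err: float = 0.05, drift: float = 3.0,
                seed: int = 0) -> float:
    """Average number of bad tokens per sequence in a toy forward-only decoder.
    Once one token is wrong, later tokens condition on flawed context, so
    their error rate is multiplied by `drift` (an assumed factor)."""
    random.seed(seed)
    total_errors = 0
    for _ in range(trials):
        context_flawed = False
        for _ in range(tokens):
            p = base_err * (drift if context_flawed else 1.0)
            if random.random() < p:
                context_flawed = True  # no backtracking: the mistake is locked in
                total_errors += 1
    return total_errors / trials

independent = mean_errors(drift=1.0)   # errors never feed on themselves
compounding = mean_errors(drift=3.0)   # one slip raises the rate for the rest
```

Under these assumed numbers the compounding decoder accumulates noticeably more bad tokens per sequence than the independent one, with the identical base error rate.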

The current engineering patch for this problem is test-time compute. Instead of asking the model for one answer, developers are asking the model to generate dozens of parallel paths. They use separate evaluator models to score those paths, discard the failures, and select the best outcome. This works and it produces significantly better results. But it entirely changes the unit economics of the software. You are no longer paying for one inference call. You are paying for fifty inference calls to simulate one reliable action. The cost of reliability scales linearly with the cost of compute. Compute is simply not getting cheaper fast enough to support this brute-force approach at a global scale for everyday tasks.
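The pattern is easy to sketch. Here `propose` and `score` are hypothetical stand-ins for the generator and the evaluator model, and the one-in-three success rate is invented; what the sketch shows is that the unit of billing silently becomes 2n inference calls per "action."

```python
import random

random.seed(1)  # deterministic demo

def propose(prompt: str) -> str:
    """Stand-in for one sampled completion from the generator model."""
    return random.choice(["good answer", "wrong plan", "hallucination"])

def score(candidate: str) -> float:
    """Stand-in for a separate evaluator model grading a candidate."""
    return 1.0 if candidate == "good answer" else 0.0

def best_of_n(prompt: str, n: int = 50) -> tuple[str, int]:
    candidates = [propose(prompt) for _ in range(n)]  # n generator calls
    best = max(candidates, key=score)                 # plus n evaluator calls
    return best, 2 * n                                # calls billed per "action"

answer, calls_billed = best_of_n("summarise the invoice")
```

The wrapper makes a one-in-three sampler look reliable, at a hundred inference calls per action: exactly the unit-economics problem.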

There is also a severe training data constraint that prevents models from naturally learning how to fix their own mistakes. We trained the current generation of models on the entirety of the public internet. That gave them an incredible grasp of human language and syntax. But we do not have a massive high-quality repository of humans silently recovering from errors while using complex software. The internet is full of the final polished outputs of human work. It rarely contains the messy iterative process of backtracking, debugging, and context switching that defines actual knowledge work. You cannot train a model to be a resilient agent if you do not have the training data to show it what resilience actually looks like.

Furthermore, the context window is often mistaken for actual working memory. We can now feed a model hundreds of thousands of words at once. The assumption is that because the model can see the entire instruction manual, it will follow the instructions perfectly. But retrieving a fact from a massive document is very different from maintaining strict adherence to a multi-step protocol over time. As the output lengthens, the model loses focus. It begins to favor recent tokens over the initial instructions. It forgets the rules established at the beginning of the prompt. Expanding the context window gives the model a larger library, but it does not give the model a better executive function.

The user experience mismatch is the final barrier to adoption. Humans have a remarkably low tolerance for supervising machines that are mostly correct. When a traditional deterministic software script breaks, an engineer can look at the logs, find the exact line of code that failed, and fix it permanently. When a probabilistic agent fails, it often fails in entirely new and unpredictable ways each time you run it. The user is forced into a state of constant vigilance. Supervising an unreliable agent is often more cognitively exhausting than simply doing the task manually from the start.

We are currently trapped in the uncanny valley of automation. If a tool is fifty percent reliable, you simply do not use it. If it is entirely reliable, you trust it completely and ignore it. But if it is ninety-five percent reliable, you have to watch it like a hawk. This defeats the entire economic purpose of automation. The industry is building tools that require constant human babysitting and they are trying to sell them as independent workers. Enterprise buyers are beginning to realize that replacing a human worker with an artificial intelligence agent often just replaces the work of doing the task with the work of managing a very fast, very confident, and highly erratic digital intern.

Consequence

If this framing holds, the entire value chain of the artificial intelligence industry shifts radically. The hype around general purpose autonomous agents will deflate as enterprise buyers realize the true cost of maintaining them in production. We will see a rapid pivot away from open-ended chat interfaces and toward highly constrained narrow workflows. Companies will stop trying to build an agent that can do everything and start building highly specific pipelines that can do one thing perfectly.

The pure model providers will face aggressive commoditization. If the only way to make a model reliable is to wrap it in massive amounts of deterministic software and error correction loops, the model itself becomes an interchangeable part. The companies that capture the margin will be the systems integrators and the infrastructure builders who provide the scaffolding. They will be the ones who figure out how to run fifty parallel inference calls cheaply. They will build the external memory management systems that allow models to safely backtrack from dead ends.

Open source models will thrive in this environment. Developers will want total control over the architecture to build their custom guardrails directly into the system. You cannot build deep deterministic scaffolding around a model if you only have access to it through a closed web interface. The losers in this shift will be the startups that built a thin user interface over an external application programming interface and assumed the underlying model would eventually get smart enough to fix their retention problems. Those companies will be crushed by the engineering reality of compound errors.

Close

We have spent years trying to build a machine that thinks like a human. We are about to spend the next decade building the safety nets required to catch it when it acts like one.

Tags: agentic workflows, autonomous agents, context window, plugins, provenance, scraping, vector databases, vision models, workflow automation


© 2025 Tomorrow Explained. Built with 💚 by Dr.P
