There is a familiar pattern inside large enterprises right now. A team runs a pilot, the demo lands, the executives nod. Six months later, the pilot has not moved into production, the team has lost two engineers, and the line of business has quietly gone back to its previous workflow. This pattern is not a story about model performance. It is a story about how the pilot was scoped.
The pilot is the wrong unit of work
Most enterprise AI pilots are framed as proofs of capability. Can the model classify intents, extract fields, summarise calls? The answer is nearly always yes, because frontier models can do most of those things out of the box. The problem is that capability is the cheapest part of production AI. The expensive parts are the data contracts, the evaluation harness, the human in the loop, the audit trail, the escalation path, the rollback plan, the cost ceiling, and the operator who has to live with the system at three in the morning.
A pilot that demonstrates capability without demonstrating any of those production properties is a demo with a budget. It does not carry forward. The team that built it has no warm path to scale, the sponsoring executive has no defensible business case, and the line of business has no incentive to absorb a system that has not yet earned its trust.
What I look for instead
When I scope work now, I look for what I call a thin slice with full edges. The slice has to be narrow enough to ship in six to ten weeks. It has to do exactly one user-facing thing. And it has to carry every production property that the eventual system will require, even though the volume is small.
That last constraint is what makes the difference. If the slice is going to need an audit log when it scales, it needs the audit log during the pilot, even if nobody is looking at it yet. If the eventual system needs human review at low confidence, the pilot needs the review queue, even if it only handles ten cases a day. The pilot is not asking whether the model can do the task. It is asking whether the organisation can run the operating model the model needs.
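To make the idea of full edges concrete, here is a minimal sketch of a slice that carries its audit log and review queue from day one. Everything in it is an assumption for illustration: the `Prediction` and `ThinSlice` names, the in-memory lists standing in for real storage, and the 0.80 threshold, which in practice would come out of evaluation data rather than being picked by hand.

```python
import json
import time
from dataclasses import asdict, dataclass, field

# Hypothetical cut-off; a real value would be set from evaluation results.
REVIEW_THRESHOLD = 0.80


@dataclass
class Prediction:
    case_id: str
    label: str
    confidence: float


@dataclass
class ThinSlice:
    # In-memory stand-ins for a durable audit store and a review tool.
    audit_log: list = field(default_factory=list)
    review_queue: list = field(default_factory=list)

    def handle(self, pred: Prediction) -> str:
        """Route one prediction and record the decision."""
        if pred.confidence >= REVIEW_THRESHOLD:
            route = "auto"
        else:
            route = "human_review"
            self.review_queue.append(pred)
        # Every decision is logged, even if nobody reads the log during the pilot.
        entry = {"ts": time.time(), "route": route, **asdict(pred)}
        self.audit_log.append(json.dumps(entry))
        return route


pilot = ThinSlice()
pilot.handle(Prediction("c-1", "refund", 0.95))  # handled automatically
pilot.handle(Prediction("c-2", "refund", 0.41))  # sent to the review queue
```

The point of the sketch is that the routing and logging paths exist at ten cases a day, so scaling changes the volume and the storage behind them, not the shape of the system.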
The honest economics of the first build
Pilots that I have seen succeed share four properties. There is a single business owner who can describe what they will stop doing once the system is live. There is a baseline metric that predates the pilot, ideally measured in the same units the new system will report in. There is a budget for evaluation work that is at least a third of the model build, not a tenth. There is an explicit plan for how the system will be maintained, with named operators, not a generic mention of MLOps.
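Those properties can be written down as a scoping check that a pilot has to pass before it is funded. The sketch below is one hypothetical way to do that; the `PilotScope` fields and the `scoping_gaps` function are names I have made up for illustration, and the only number taken from the text is the one-third evaluation ratio.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PilotScope:
    business_owner: str              # a named person, not a team
    work_retired: str                # what the owner stops doing once live
    baseline_metric: str             # e.g. "minutes per case"
    baseline_value: Optional[float]  # measured before the pilot; None if only estimated
    model_build_budget: float
    evaluation_budget: float
    named_operators: List[str]       # people who will maintain the system


def scoping_gaps(scope: PilotScope) -> List[str]:
    """Return the properties this pilot is missing; an empty list means none."""
    gaps = []
    if not scope.business_owner or not scope.work_retired:
        gaps.append("no single owner who can say what they will stop doing")
    if not scope.baseline_metric or scope.baseline_value is None:
        gaps.append("no measured baseline that predates the pilot")
    # Evaluation should be at least a third of the model build budget.
    if scope.evaluation_budget * 3 < scope.model_build_budget:
        gaps.append("evaluation budget is under a third of the model build")
    if not scope.named_operators:
        gaps.append("no named operators for maintenance")
    return gaps


# Illustrative values only, not data from a real pilot.
draft = PilotScope(
    business_owner="",
    work_retired="",
    baseline_metric="minutes per case",
    baseline_value=None,
    model_build_budget=300_000,
    evaluation_budget=30_000,
    named_operators=[],
)
for gap in scoping_gaps(draft):
    print(gap)
```

A check like this will not settle the political questions, but it does make each omission visible at scoping time rather than six months in.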
Pilots that fail are usually missing two or three of these. The business owner is a director two layers down from where the work actually happens. The baseline is a hand-wavy estimate. Evaluation is bolted on at the end. Maintenance is somebody else's problem. None of those omissions are technical. They are political and they are budgetary, and they will swallow your model.
What this means for the next twelve months
The organisations that will pull ahead this year are not the ones with the largest model budgets. They are the ones that have learned to scope the first build as a real piece of operating capability rather than as a demonstration. That is a leadership shift more than a technical one. It usually starts with a head of analytics or a chief data officer saying no to the next vanity pilot, which is the politically expensive move that most data leaders are avoiding right now.
If you are leading a function and your pipeline of pilots looks healthy on paper but nothing has shipped to production for two quarters, the question is not which pilot to escalate. The question is whether your pilot template was scoped wrong from the beginning. The good news is that you can fix the template in a week. The bad news is that you will have to renegotiate the next three pilots before they look right.