The AI industry is starting to confront a question that could reshape its economics: do companies really need the biggest model for most of their work?
For years, the answer in much of the sector was yes. Bigger systems were treated as better systems, and the most advanced model often became the default choice. But as usage grows and inference bills rise, businesses are being pushed to look more closely at whether smaller, cheaper models can do the same jobs at lower cost.
That shift is still early, but it is gaining attention. Coinbase co-founder Brian Armstrong recently argued that demand for intelligence is effectively limitless, yet most workloads could soon move to models that are far less expensive. In a post on X, he predicted that 80% of workloads will run on models that are 99% cheaper within 12 to 18 months, while the remaining tasks will continue to rely on frontier systems where maximum capability matters.
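A back-of-the-envelope calculation shows what that forecast could mean for a buyer's bill. The sketch below assumes, purely for illustration, that the 80% figure refers to a share of current inference spend and that usage volume stays flat; Armstrong's post specifies neither. The function name `blended_cost` is hypothetical.

```python
def blended_cost(share_moved: float, discount: float) -> float:
    """Relative spend after moving part of a workload to a cheaper model.

    share_moved: fraction of baseline spend that shifts (assumed 0.80 here).
    discount:    how much cheaper the new model is (assumed 0.99 here).
    Baseline spend is normalized to 1.0.
    """
    if not (0.0 <= share_moved <= 1.0 and 0.0 <= discount <= 1.0):
        raise ValueError("share_moved and discount must be between 0 and 1")
    stays_on_frontier = 1.0 - share_moved
    moved_to_cheap = share_moved * (1.0 - discount)
    return stays_on_frontier + moved_to_cheap


cost = blended_cost(0.80, 0.99)
print(f"spend vs. baseline: {cost:.3f}")             # 0.208
print(f"overall saving: {1 - cost:.1%}")             # 79.2%
print(f"frontier share of new bill: {0.20 / cost:.1%}")  # 96.2%
```

Under those assumptions, the total bill falls by roughly 79%, and nearly all of what remains is spending on frontier models. If cheaper models instead prompt companies to run far more workloads, the picture would look different.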
If that forecast proves accurate, the consequences would be broad. Tech companies have spent years competing on model quality, which often meant using the most powerful available system by default. A real shift toward smaller models would change how AI products are built, priced and sold. It would also threaten the revenue outlook for major model developers such as OpenAI and Anthropic, which are preparing for public listings and counting on growing demand for their most advanced offerings.
Early evidence suggests that smaller models can sometimes be substituted without hurting results, as long as the workflow is designed carefully. Harvey, which makes AI tools for legal work, recently said it cut inference costs to a third of their previous level in a test with Fireworks AI without lowering quality. The setup split work between Claude Opus and GLM 5.1 running on Fireworks, reserving the most demanding tasks for Opus. According to Harvey co-founder Gabe Pereyra, the company still prioritizes quality, but now defines it more narrowly: getting the right answer efficiently rather than using the largest model everywhere.
That distinction matters because the debate is not only about proprietary models versus open-weight alternatives, even though that is often how the competition is framed. The deeper divide may be between large models and small ones. A company can reduce costs by moving from one large system to another, cheaper one, but similar savings may also come from using a compact model that performs well enough for the task at hand.
The industry’s current price war is partly about where inference runs, whether inside the major labs or through independent hosts for open-weight models. But that competition does not fully answer the larger question of how much compute is actually necessary. The old assumption was that more compute generally meant better performance, and investors were willing to absorb the expense while the market matured.
That environment is changing. Token prices are rising, subsidies are easing and enterprise customers are beginning to feel real pressure to manage usage more carefully. Some may respond by making fewer model calls, trimming context length or dropping less promising AI projects altogether. Others may decide that smaller models are good enough for most tasks.
If that becomes the norm, it could slow the growth of inference demand and force a broader reckoning over how much money it makes sense to spend training frontier models. For now, the answer is still unsettled. But the pressure to learn to love cheaper AI models is clearly building.