Developers are using AI, but the productivity case remains unsettled

Developers have become deeply reliant on AI coding tools, even as a growing body of evidence suggests those systems may not always improve output and can, in some cases, make work slower. A recent set of studies, company examples and budget decisions points to a widening gap between how useful AI feels to engineers and what it appears to deliver in practice.

In February, AI research lab METR tried to repeat an earlier experiment measuring how long developers took to complete tasks with and without AI assistance. The organization was unable to run the study as planned because participants declined to work without AI support, even in a limited test setting. That reluctance underscored how quickly coding workflows have changed.

The original 2025 study had found a result that cut against common expectations. Developers said AI made them more productive, but the measured performance data suggested the opposite. According to the study, the extra time spent checking outputs, correcting mistakes and waiting for AI systems contributed to slower task completion.

Unable to replicate the experiment, METR instead published survey results in May. In that survey, developers reported that AI made them roughly twice as valuable to their organizations. The new data, however, does not resolve the broader question of whether perceived value matches actual output.

Companies are spending heavily, but results are murky

Recent reports from large enterprises suggest that enthusiasm for AI coding tools has not consistently translated into clear gains. Amazon recently shut down an internal token-tracking leaderboard called Kirorank, according to the Financial Times. Employees were reportedly trying to climb the ranking by pushing more work through AI agents, which increased costs and created the wrong incentives.

Uber has also come under scrutiny for the pace of its AI spending. The Information reported that the company exhausted its 2026 AI budget within the first four months of the year. On a podcast, COO Andrew Macdonald said the spending had not produced a measurable increase in projects or productivity. The examples point to a common problem for corporate AI programs: heavy usage does not necessarily mean better business results.

That concern has given rise to what some observers call “tokenmaxxing,” or treating token consumption as a stand-in for productivity. The pattern appears to reward volume over quality, which can distort how companies evaluate AI adoption.
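The distortion can be illustrated with a small sketch. The figures below are invented for illustration and do not come from any company named in this article. The record type, the field names and the choice of merged changes as an outcome measure are all assumptions, and merged changes are themselves an imperfect proxy for value.

```python
from dataclasses import dataclass


@dataclass
class UsageRecord:
    """Hypothetical per-engineer usage summary (illustrative only)."""
    engineer: str
    tokens_used: int
    changes_merged: int


def tokens_per_merged_change(record: UsageRecord) -> float:
    """Tokens consumed per merged change; infinite if nothing was merged."""
    if record.changes_merged == 0:
        return float("inf")
    return record.tokens_used / record.changes_merged


# Invented numbers, chosen only to show how the two rankings can disagree.
records = [
    UsageRecord("engineer_a", tokens_used=9_000_000, changes_merged=3),
    UsageRecord("engineer_b", tokens_used=1_200_000, changes_merged=6),
]

# A token leaderboard rewards the highest raw consumption.
top_by_volume = max(records, key=lambda r: r.tokens_used)

# An outcome-based view rewards the lowest token cost per merged change.
top_by_efficiency = min(records, key=tokens_per_merged_change)

print(top_by_volume.engineer, top_by_efficiency.engineer)
```

In this made-up case the volume ranking and the outcome ranking pick different engineers, which is the incentive problem critics of token leaderboards describe.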

Salesforce has also highlighted the scale of the issue. The company is projecting $300 million in Anthropic token spending this year. CEO Marc Benioff has called for an intermediary layer to route tasks between frontier models and cheaper systems, an indication that not every token carries the same value.
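What such an intermediary layer might look like can be sketched in a few lines. This is not Salesforce's design, which has not been described in detail. The model names, the prices and the rule for deciding which tasks need a frontier model are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    """A hypothetical model tier; the price is a placeholder, not a vendor rate."""
    name: str
    cost_per_1k_tokens: float


CHEAP = ModelTier("small-model", 0.001)
FRONTIER = ModelTier("frontier-model", 0.03)

# Assumed rule: only these task kinds are sent to the expensive model.
FRONTIER_TASK_KINDS = {"architecture", "security", "debugging"}


def route(task_kind: str) -> ModelTier:
    """Pick a tier for a task; anything not flagged as hard goes to the cheap tier."""
    if task_kind in FRONTIER_TASK_KINDS:
        return FRONTIER
    return CHEAP


def estimated_cost(tier: ModelTier, tokens: int) -> float:
    """Estimated spend for a request of the given size on the given tier."""
    return tier.cost_per_1k_tokens * tokens / 1000


for kind in ("formatting", "security"):
    tier = route(kind)
    print(kind, tier.name, estimated_cost(tier, 2000))
```

A real router would need a far better signal than a fixed list of task kinds, but the sketch shows the basic premise: the same number of tokens can cost very different amounts depending on where a task is sent.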

Quality concerns could offset speed gains

The deeper challenge, researchers and engineers argue, is code quality. James Shore, a programmer and author, warned in a widely shared post that faster code generation can still leave companies worse off if maintenance costs rise. The risk is that teams gain short-term speed while taking on long-term technical debt.

Some of the available numbers support that caution. Entelligence AI, a company focused on reliability engineering, says firms spend 44% of their tokens on fixing bugs introduced by AI-generated code. CodeRabbit, which offers code-review tools, found in an analysis of open-source pull requests that AI-produced code created 1.7 times more issues than human-written code. Both companies sell products tied to AI code quality, so their findings should be viewed with that commercial context in mind.

Independent researchers at Singapore Management University (SMU) reached a similar conclusion in an April report. They said AI-generated code can add maintenance burdens to real software projects over time. In other words, the code may arrive faster, but the problems may show up later.

Cognition founder Scott Wu, whose company makes the AI coding agent Devin, has said the tool performs at a level somewhere between that of a junior and a mid-level programmer, depending on the task. Researchers at SMU recommend treating AI output with the same caution as work from a junior developer. That means careful review, strong testing and keeping humans responsible for architecture and security decisions.

For now, the industry appears caught between adoption and evidence. Developers are unlikely to abandon AI tools, but the business case for treating them as a direct productivity multiplier remains uncertain.