Microsoft has started showing average token usage on at least one model release card, a move that points to a broader shift in how AI systems are evaluated and purchased.
The new metric appears alongside performance results for Microsoft’s MAI-Code-1-Flash coding model. In benchmarks cited by observers, the model delivered solid results on SWE-Bench Verified while using far fewer tokens than competing systems on similar tasks. The change signals that model quality is no longer the only number buyers are watching.
## Cost joins capability in model evaluation
For years, model cards and benchmark charts have mostly emphasized accuracy, pass rates, or leaderboard position. Microsoft’s addition of average token usage introduces a second dimension. It gives users a better sense of how much work a model does to reach a result, and what that output may cost in production.
That matters because token consumption translates directly into expense for companies running AI at scale. A model that performs well but requires many more tokens can end up being more expensive to run than a slightly weaker model that is much more efficient. In that sense, the new metric helps buyers compare models not just by intelligence, but by intelligence per dollar.
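The arithmetic behind "intelligence per dollar" can be sketched in a few lines. The example below is illustrative only: the model names, pass rates, token counts, and the single blended price per million tokens are all hypothetical assumptions, not figures from Microsoft's release card or any benchmark.

```python
from dataclasses import dataclass


@dataclass
class ModelProfile:
    """Hypothetical benchmark summary for one model (all values are assumptions)."""
    name: str
    pass_rate: float               # fraction of benchmark tasks solved, between 0 and 1
    avg_tokens_per_task: float     # average tokens consumed per attempted task
    usd_per_million_tokens: float  # assumed blended price per million tokens


def cost_per_task(model: ModelProfile) -> float:
    """Average spend on one attempted task, solved or not."""
    return model.avg_tokens_per_task * model.usd_per_million_tokens / 1_000_000


def cost_per_solved_task(model: ModelProfile) -> float:
    """Average spend per successfully solved task."""
    if model.pass_rate <= 0:
        raise ValueError("pass_rate must be positive")
    return cost_per_task(model) / model.pass_rate


# Two made-up models: one slightly stronger, one far more token-efficient.
stronger = ModelProfile("model-a", pass_rate=0.75,
                        avg_tokens_per_task=400_000, usd_per_million_tokens=5.0)
leaner = ModelProfile("model-b", pass_rate=0.70,
                      avg_tokens_per_task=100_000, usd_per_million_tokens=5.0)

for m in (stronger, leaner):
    print(f"{m.name}: ${cost_per_task(m):.2f} per task, "
          f"${cost_per_solved_task(m):.2f} per solved task")
```

Under these assumed numbers, the stronger model solves a few more tasks but spends several times as much per solved task, which is the trade-off an average-token-usage figure makes visible.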
The benchmark framing on the release card also reflects a growing concern in the AI industry. As businesses move from experimentation to deployment, they are paying closer attention to operating costs, not just headline performance. The industry’s earlier focus on pushing for the best possible results, regardless of cost, is giving way to a more practical approach.
## A sign of changing buyer priorities
The source material suggests that large companies are already feeling pressure from rising AI bills. It cites cases in which organizations have tried to rein in spending after usage outpaced budgets. That backdrop helps explain why a metric like average token usage may matter more now than it would have a year ago.
Microsoft’s move also aligns with work from independent benchmarking firms that already compare models on both performance and cost. Those analyses show that two systems can produce similar overall results while differing significantly in the amount spent to get there. For enterprise customers, that difference can determine whether a model is viable for routine use.
The release card addition may therefore be a small design change with larger implications. If other model providers follow suit, developers and procurement teams may begin expecting every benchmark to show not only how well a model performs, but how efficiently it reaches that performance.
## Pressure on model makers and applications
The shift could affect both model developers and the software layers built on top of them. Foundation model companies may face more pressure to optimize output length and token efficiency, while application builders may increasingly be judged on the cost of completing a task rather than the raw cost of a single API call.
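The difference between the cost of a single API call and the cost of completing a task can be made concrete with a rough sketch. It assumes a task is carried out as several calls, each billed separately for input and output tokens; the `ApiCall` type, the token counts, and the prices are hypothetical and not taken from any provider's price list.

```python
from dataclasses import dataclass


@dataclass
class ApiCall:
    """Token usage for one call made while working on a task (hypothetical)."""
    input_tokens: int
    output_tokens: int


def call_cost(call: ApiCall, usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Cost of a single call, pricing input and output tokens separately."""
    return (call.input_tokens * usd_per_m_input
            + call.output_tokens * usd_per_m_output) / 1_000_000


def task_cost(calls: list[ApiCall], usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Cost of the whole task: the sum over every call it needed."""
    return sum(call_cost(c, usd_per_m_input, usd_per_m_output) for c in calls)


# Assumed prices in USD per million tokens, chosen only for illustration.
INPUT_PRICE = 1.0
OUTPUT_PRICE = 4.0

# A made-up three-step task in which the context grows with every call.
calls = [
    ApiCall(input_tokens=20_000, output_tokens=2_000),
    ApiCall(input_tokens=30_000, output_tokens=1_000),
    ApiCall(input_tokens=50_000, output_tokens=2_000),
]

first_call = call_cost(calls[0], INPUT_PRICE, OUTPUT_PRICE)
whole_task = task_cost(calls, INPUT_PRICE, OUTPUT_PRICE)
print(f"first call: ${first_call:.3f}, whole task: ${whole_task:.3f}")
```

In this sketch the first call understates the bill by a wide margin, which is why a per-task figure is the more useful number for judging an application.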
That could reshape how vendors market AI systems. Instead of competing only on benchmark scores, providers may need to show the full economics of a model in real-world use. The source material argues that the key question for buyers is no longer just what a model can do, but what it costs to get there.
Microsoft has not announced a new industry standard, but its decision to surface average token usage on a release card may help push the market in that direction.