Large language model benchmark scores may be telling researchers less about raw capability than about how much computation a system can use at inference time, according to OpenAI researcher Noam Brown.
In a post on X, Brown pointed readers to a longer piece arguing that large-scale test-time compute is an increasingly important factor in evaluating modern LLMs. His central claim is that as models improve, benchmark performance depends less on a single forward pass and more on how much computation is spent while the model is answering a prompt.
That shift matters because it changes how the field interprets results. A model that performs well on a benchmark may not simply be stronger in the traditional sense. It may also be benefiting from additional search, sampling, or deliberation during inference. Brown suggested that for today’s most capable models, benchmark outcomes can depend heavily on the amount of test-time compute available, making it harder to compare systems on a like-for-like basis.
The post also raises a broader measurement problem. Brown said the industry may not yet know the true capability ceiling of current LLMs, in part because existing evaluations can be constrained by limited compute budgets. In other words, a model’s best possible performance may not be visible unless it is allowed to spend substantially more resources at test time than many standard evaluations permit.
This issue has grown more visible as developers increasingly build systems that can think longer, sample multiple answers, or use internal search strategies before producing a response. Those methods can improve accuracy, but they also blur the line between model quality and the amount of infrastructure used to run the model. For systems at the top of benchmark leaderboards, that distinction is becoming more consequential.
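To make the "sample multiple answers" idea concrete, here is a minimal Python sketch of majority voting over sampled answers. It is an illustration only, not a method described in Brown's post. The function names are hypothetical, and the accuracy formula rests on strong simplifying assumptions: every question has exactly two possible answers, each sample is correct independently with the same probability, and the number of samples is odd so there are no ties.

```python
from collections import Counter
from math import comb


def majority_vote(answers):
    """Return the most common answer among the sampled answers.

    If several answers tie, the one that appeared first is returned.
    """
    return Counter(answers).most_common(1)[0][0]


def majority_vote_accuracy(p_single, n_samples):
    """Probability that a majority vote over n_samples answers is correct.

    Assumptions (illustrative, not from the source): two possible answers,
    independent samples, each correct with probability p_single, and an
    odd n_samples so that ties cannot occur.
    """
    if n_samples % 2 == 0:
        raise ValueError("n_samples must be odd to avoid ties")
    needed = n_samples // 2 + 1  # smallest number of correct samples that wins
    return sum(
        comb(n_samples, k) * p_single**k * (1 - p_single) ** (n_samples - k)
        for k in range(needed, n_samples + 1)
    )


if __name__ == "__main__":
    print(majority_vote(["42", "41", "42"]))
    # Same hypothetical model, same per-sample accuracy, more samples per question.
    for n in (1, 5, 25):
        print(n, round(majority_vote_accuracy(0.6, n), 4))
```

Under these assumptions, a model that is right 60 percent of the time on a single sample scores about 68 percent when five samples are put to a vote, even though nothing about the model itself has changed. Only the compute spent per question has.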
Brown’s comments reflect a wider debate in AI research over how to assess progress fairly. Traditional benchmarks were often designed with the assumption that a model would produce one answer with a fixed amount of computation. As systems become more advanced, that assumption holds less reliably. Researchers now have to consider whether a score reflects the model’s underlying capability, test-time scaling, or a combination of both.
The implications extend beyond leaderboards. If test-time compute is a major driver of performance, then comparisons between models may need to include not just accuracy but also the compute spent to achieve it. That could affect how developers report results, how customers evaluate products, and how researchers think about efficiency.
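One way to picture compute-aware reporting is a comparison at matched budgets. The Python sketch below is a hypothetical illustration: the record type, the function, the model names, and every number in it are placeholders invented for the example, not measurements from any real system. It uses average tokens per benchmark item as a rough stand-in for inference compute, which is itself an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRun:
    model: str
    accuracy: float       # fraction of benchmark items answered correctly
    tokens_per_item: int  # average inference tokens per item (proxy for compute)


def best_at_budget(runs, budget):
    """Return {model: best accuracy among its runs that fit within budget}."""
    best = {}
    for run in runs:
        if run.tokens_per_item <= budget:
            best[run.model] = max(best.get(run.model, 0.0), run.accuracy)
    return best


# Placeholder numbers for two hypothetical models; not real benchmark results.
RUNS = [
    EvalRun("model_a", 0.62, 1_000),
    EvalRun("model_a", 0.71, 16_000),
    EvalRun("model_b", 0.66, 1_000),
    EvalRun("model_b", 0.68, 16_000),
]

if __name__ == "__main__":
    for budget in (1_000, 16_000):
        print(budget, best_at_budget(RUNS, budget))
```

With these made-up figures, model_b leads at the smaller budget and model_a leads at the larger one, which is the kind of ranking flip a single headline accuracy number would hide.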
Brown did not provide new benchmark data in the X post itself, but the message was clear: as LLMs advance, the amount of computation used during evaluation is becoming a central part of the story. For the AI industry, that raises a practical question as much as a scientific one. When a model appears to be better, is it because the model has improved, or because it has been given more room to think?