OpenAI evals lead says benchmark testing must change as models advance

OpenAI spotlights the shifting task of measuring frontier AI

OpenAI has released a new podcast episode centered on one of the less visible but increasingly important parts of AI development: how to measure whether models are actually improving. In the discussion, Tejal Patwardhan, who leads the company’s frontier evaluations work, argues that older tests are no longer enough as the systems become more capable.

The episode, hosted by Andrew Mayne, focuses on the challenge of evaluating models that are advancing faster than traditional benchmarks can keep pace with. According to the episode description, Patwardhan and Mayne discuss why evaluation matters for research, how benchmarks can be broken or manipulated, and what kinds of judgments may be needed next.

Why old tests are losing their usefulness

A central theme of the conversation is that many existing tests have become too easy for newer models. OpenAI frames the problem as one of measuring progress at the frontier, where standard question sets and fixed tasks may no longer separate leading systems from one another. That, the company says, forces researchers to look for stronger ways to forecast model capability.
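
As a rough illustration of what "too easy" can mean in practice, the Python sketch below flags a benchmark as saturated when the top score sits near the ceiling and the leading systems are bunched together. The function names, thresholds, and scores are all hypothetical and chosen only for illustration; nothing here reflects OpenAI's actual method or data.

```python
def headroom_and_spread(scores):
    """Return (headroom, spread) for accuracy scores in the range 0 to 1.

    headroom: how far the best score is from a perfect 1.0
    spread:   gap between the best and worst score in the group
    """
    if not scores:
        raise ValueError("scores must not be empty")
    best = max(scores)
    worst = min(scores)
    return 1.0 - best, best - worst


def looks_saturated(scores, min_headroom=0.05, min_spread=0.02):
    """Heuristic: little room left to improve AND little separation.

    Both thresholds are arbitrary assumptions, not established cutoffs.
    """
    headroom, spread = headroom_and_spread(scores)
    return headroom < min_headroom and spread < min_spread


# Made-up accuracy scores for three leading systems on two benchmarks.
older_benchmark = [0.97, 0.98, 0.975]
harder_benchmark = [0.41, 0.55, 0.62]

print(looks_saturated(older_benchmark))
print(looks_saturated(harder_benchmark))
```

With these invented numbers, the older benchmark is flagged because every system scores near the top, while the harder one still leaves room to tell the systems apart.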

The episode highlights reasoning as a turning point in that shift. One segment is devoted to why reasoning changed expectations about what models can do, and another focuses on what made the o1 model surprising. Those references suggest that newer systems have pushed evaluators to rethink how much they can safely assume about model limits.

Patwardhan’s work is presented as part of a broader effort to track progress in a more disciplined way. Rather than relying on outdated scorecards, OpenAI says frontier evals now involve finding new ways to assess capability as models become more advanced.

Benchmarks, gaming, and harder evaluations

The podcast also examines the weaknesses of standard benchmarks. OpenAI’s episode outline includes segments on why old benchmarks stopped working, what makes a good benchmark, and why evaluations are getting harder. Together, those topics point to a familiar problem in AI testing: once a benchmark is public and widely used, models can be optimized for it, which can reduce its value as a real measure of capability.

The conversation also extends beyond text-based tests. OpenAI says the episode covers how to measure voice and vision models, reflecting the fact that frontier systems increasingly operate across multiple modalities rather than only through chat.

Another topic is the use of models in scientific work. The episode includes a section on testing models on real science, suggesting that OpenAI sees practical research tasks as a more demanding way to check whether models are genuinely useful outside benchmark environments.

Tracking progress and looking ahead

OpenAI says the discussion also touches on how it tracks frontier progress internally and what AI may mean for work. That broader framing suggests the company views evaluation not just as a technical exercise, but as a way to understand where the technology is headed and how quickly it may affect workplaces.

The episode is part of OpenAI’s podcast series and was released with a transcript and chapter markers that organize the discussion into segments. The company says the video had been viewed nearly 10,000 times shortly after publication.

The conversation underscores a recurring challenge in AI development. As models improve, the tests used to judge them must change too. OpenAI’s latest discussion suggests the company believes frontier evaluation is becoming less about scoring known tasks and more about designing better ways to detect real capability before the benchmarks themselves go stale.