OpenAI has introduced LifeSciBench, a new benchmark designed to measure how well AI systems can assist with the kinds of problems life science researchers face in day-to-day work. The company says the goal is to move beyond narrow biology quizzes and test whether models can handle more realistic, multi-step research tasks.
The benchmark includes 750 expert-written tasks built around real-world life science workflows. According to OpenAI, the tasks were created and reviewed by scientists with Ph.D.-level training and experience in biotechnology and pharmaceutical drug discovery. The company says the dataset spans seven workflows and seven biological domains.
OpenAI framed the release as part of a broader effort to evaluate agentic AI systems, which it says are becoming better at scientific work but are still not measured well by existing benchmarks. Many current biology evaluations focus on isolated skills or structured question formats, the company said, while real research often requires interpreting incomplete evidence, weighing conflicting findings, designing experiments, and deciding how to proceed under uncertainty.
LifeSciBench is organized around seven recurring categories of scientific work: evidence handling, analysis, design and optimization, scientific reasoning, validation and operations, translation, and scientific communication. OpenAI says each task is written to resemble a request that might come from one scientist to another, rather than a simple question with a single short answer.
Each task includes a prompt, relevant context or supporting materials, and a free-response answer. That structure is meant to test whether a model can produce a response that is not only correct, but also sufficiently detailed, justified, and formatted in a way a working scientist would expect.
OpenAI said the benchmark contains 1,062 task artifacts, such as figures, PDFs, tables, sequence files, structural or chemical files, and web references. More than half of the tasks require models to interpret or synthesize information from at least one attached artifact.
The company also emphasized the complexity of the tasks. It said 79% require multiple reasoning or decision-making steps, with an average of four steps per task.
Unlike some biology tests that score only whether a model gets the final answer right, LifeSciBench uses task-specific rubrics. OpenAI said the rubrics contain 19,020 criteria in total, or about 25 criteria per task on average. Those criteria are intended to evaluate not just correctness, but whether a model gives the right level of explanation, caveats, and practical detail.
The company said this approach is important because real scientific work is often judged on more than the final conclusion. A response might be directionally correct but still miss a crucial limitation, and a partially complete answer may still show strong reasoning.
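To make the idea of rubric-based grading concrete, here is a minimal sketch of how partial credit over a list of criteria could be computed. OpenAI has not described its scoring formula in the material summarized here, so everything in the example is an assumption for illustration: the `Criterion` class, the `rubric_score` function, the per-criterion weights, and the sample criteria are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One rubric item (hypothetical structure, not OpenAI's schema)."""
    description: str
    met: bool            # whether a grader judged the response to satisfy it
    weight: float = 1.0  # assumed: equal weights unless stated otherwise


def rubric_score(criteria):
    """Return the weighted fraction of criteria that were met (0.0 to 1.0)."""
    total = sum(c.weight for c in criteria)
    if total <= 0:
        raise ValueError("rubric must have positive total weight")
    earned = sum(c.weight for c in criteria if c.met)
    return earned / total


# Illustrative response: right conclusion and sound reasoning,
# but it omits a key limitation, so it earns partial credit.
example = [
    Criterion("Reaches the correct conclusion", met=True),
    Criterion("Justifies the conclusion with the provided evidence", met=True),
    Criterion("Notes the key limitation of the assay", met=False),
]
score = rubric_score(example)
```

Under this assumed scheme, a response that is directionally correct but misses a crucial limitation scores below a complete one, while still receiving credit for the parts it got right.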
OpenAI said the benchmark development process involved 173 scientist contributors and 453 expert reviewers. Tasks could go through multiple revision rounds before acceptance. According to the company, accepted tasks averaged six automated review cycles and went through at least two rounds of expert review. Reviewers aimed to anchor each task in either a verifiable answer or strong expert consensus.
The company also said LifeSciBench was validated through independent expert review. That review process, OpenAI said, was intended to ensure the benchmark is scientifically grounded, clear enough to grade, and representative of applied research settings.
LifeSciBench adds to a growing set of attempts to measure whether frontier AI systems can do more than answer textbook-style questions in biology. For researchers and developers, the benchmark could become a reference point for comparing models on harder, more realistic scientific work.