Agents’ Last Exam benchmarks AI agents on real professional workflows

Agents’ Last Exam aims to test AI on work that resembles real jobs

A new benchmark called Agents’ Last Exam is designed to measure how well AI agents can handle professional tasks that matter in the real world. The project focuses on long-horizon workflows with outcomes that can be checked objectively, rather than short, isolated prompts.

Backed by Berkeley RDI and more than 300 industry experts, the benchmark has already collected over 1,500 tasks and is working toward a target of 5,000. The team says the effort is meant to span the broadest range of professional computer-based work yet assembled for agent evaluation.

The benchmark currently covers all 55 sub-industries the project has targeted. According to materials released by the team, that scope takes in much of the major computer-based work done in fields such as animation, engineering, manufacturing, architecture and neuroscience.

Built around verifiable outcomes

A key feature of the benchmark is that it uses tasks with clear, checkable results. That design is intended to make scores comparable across domains and meaningful for assessing whether agents can actually complete work, not just generate convincing text.

The project is also structured around multi-step workflows. Instead of testing a model on a single answer or a narrow tool call, Agents’ Last Exam is meant to evaluate sustained performance across longer tasks that mirror professional practice.

Examples listed by the project include animation and visual effects work in Adobe After Effects, 3D modeling in Siemens NX, game-development tasks in Unreal Engine, mold-flow analysis in Moldex3D, architectural modeling in Rhino 3D, and brain-imaging analysis in FSLeyes.

The benchmark’s creators say they choose such tasks because they are economically valuable and can be evaluated in a reproducible way. That combination is increasingly important as companies and researchers try to understand where AI agents can save time and where they still struggle.

Broad academic and industry involvement

Agents’ Last Exam is co-led by Berkeley RDI and the RDI Foundation. The project lists contributors and partners from a wide range of universities and companies, including MIT, Harvard, Stanford, UC Berkeley, Oxford, Carnegie Mellon, ETH Zurich, Adobe, Amazon, Meta, Goldman Sachs, JPMorgan and others.

The team also points to an advisory committee made up of researchers and executives in fields including computational science, climate, biomedical informatics, engineering design and CRM software. That mix suggests the benchmark is aiming to reflect real professional settings rather than a narrow academic test.

The project is open to outside contributors. Domain experts can submit tasks without coding, while researchers and engineers can help turn those workflows into reproducible evaluation problems. The site says qualifying contributors may receive co-authorship credit on the research publication and that monetary awards are available from a funding pool of more than $100,000.

Part of a wider push to evaluate agents

The launch of Agents’ Last Exam comes as the AI industry continues to search for better ways to judge agent capabilities. Traditional benchmarks often measure language understanding or short-form reasoning, but those tests may not capture how systems behave when they must follow steps, use software and complete practical work over time.

By emphasizing breadth, real workflows and verifiable outcomes, the new benchmark is trying to answer a different question: not just whether an AI can respond well, but whether it can do the kind of work people are paid to do.

The project has also made its materials public through GitHub, an arXiv paper and a leaderboard, a sign that it aims to serve as both a research reference point and a collaborative platform.

For researchers, employers and developers, the benchmark could offer a more grounded view of agent performance across industries. For now, its size and scope put it among the most ambitious attempts yet to measure AI agents against actual professional tasks.