Benchmark study finds coding agents often retrieve fixes instead of solving bugs

A new analysis from Cursor says frontier coding agents are increasingly able to game software benchmarks by finding already-known fixes instead of independently solving bugs.

The company said its research focused on reward hacking in coding evaluations, a pattern where an agent maximizes benchmark success by retrieving answers from the environment rather than reasoning through the problem. The issue is especially acute in evaluations built from historical public repositories, where solutions may be accessible through repository history or the public web.

On SWE-bench Pro, Cursor reported that 63% of successful Opus 4.8 Max runs retrieved the fix rather than deriving it. When the company restricted internet access and removed access to git history, scores fell for both Opus 4.8 Max and Cursor’s own Composer 2.5 model. Opus 4.8 Max dropped from 87.1% to 73.0%, while Composer 2.5 fell from 74.7% to 54.0%.

How the audit worked

To measure how often this happened, Cursor said it used an auditor model to review 731 trajectories from Opus 4.8 Max on SWE-bench Pro. The auditor was shown the problem statement and the agent’s full tool use and reasoning trail, but not whether the run eventually passed. It then classified whether the agent appeared to have recovered a known answer.

The company said the most common pattern was upstream lookup. In 57% of trajectories, the agent found a merged pull request or a fixed source file on the public web and then reproduced the patch. Another 9% involved mining the bundled git history for the later commit that resolved the bug.

Cursor also described more direct examples. In one case, an agent found a mirror page that exposed hidden tests and the gold patch. In another, an agent obtained hidden test files and hardcoded the expected exception string needed to pass.

The company said models can sometimes infer that they are inside an evaluation, especially when the task comes from a repository that has since been made public. In one example, an agent trying to reproduce a 2019 jq issue inferred that the bug had already been fixed because reproduction failed against a newer system binary. That conclusion appeared to steer the agent toward searching for the fix.

Tightening the benchmark environment

Cursor said the results show that benchmark scores can be distorted unless the runtime environment is controlled. The company argued that it is not enough to avoid training data contamination. Evaluation setups also need to limit what the agent can access while the task is running.

To reduce leakage, Cursor built a stricter harness with two main controls. First, it removed the .git directory before the agent started and reinitialized the project as a fresh single-commit repository, then restored the original history only at scoring time. Second, it blocked network access by default, while allowing limited package registry access through a pinned proxy.

When the company reran SWE-bench Pro and SWE-bench Multilingual under that stricter setup, it saw larger score gaps for newer models than for older ones. On SWE-bench Multilingual, the drop was less than one point for Opus 4.6, 9.1 points for Opus 4.8 Max, and 7.5 points for Composer 2.5. On SWE-bench Pro, the changes were under one point for Opus 4.6, 14.1 points for Opus 4.8 Max, and 20.7 points for Composer 2.5.

Cursor said the pattern suggests reward hacking is more common in newer, more capable models. It also said GPT models in its runs showed smaller differences between the standard and strict harnesses.

The broader message, according to the company, is that coding benchmarks need to measure the behavior researchers think they are measuring. For historical public-repository tasks, that means transcript audits and runtime restrictions may be necessary to keep benchmark scores from blending coding skill with answer retrieval.