Study Finds More Reasoning Does Not Always Improve LLM Security Triage

A new benchmark of large language models suggests that turning up reasoning effort can help security triage, but not in every case. In some tests, the strongest settings were outperformed by lower or medium reasoning modes, and the best results still fell short of full accuracy.

The experiment expanded on earlier work that asked whether modern open-source and flagship models could triage security findings from vulnerable code. This round tested 26 combinations of Claude 4.6, 4.7 and 4.8, plus GPT-5.4 and GPT-5.5, across different context windows and reasoning settings. The researcher ran each combination multiple times for two sample vulnerabilities, using both whole-file inputs and function-only inputs.

Reasoning helps, but only up to a point

Across the board, low reasoning performed the worst. Higher reasoning usually improved results, but the pattern was not consistent enough to treat more thinking as a universal fix. GPT-5.4 with the highest reasoning setting produced the top overall score in the study, yet GPT-5.5 did better at medium reasoning than at high or extra-high reasoning, showing that more effort did not always translate into better triage.

The study also found that newer or larger models were not automatically superior. Several Claude and GPT variants clustered closely together in the scoring tables, and some older or smaller configurations held up surprisingly well. The researcher said the results reinforced a practical lesson: the best choice may be to route security triage to a strong model with moderate or high reasoning, rather than assuming the most aggressive setting will win.

Partial findings were common, full solves were rare

Most models were able to identify at least part of a vulnerability. The reported partial-or-better success rate was 70.8 percent. But complete solutions were extremely uncommon, at just 1.9 percent overall.

That gap mattered most in the harder test case. When the full OpenBSD file was supplied, models struggled to spell out the complete vulnerability chain. One configuration achieved a single full solve on the FreeBSD test case, but no model consistently produced complete answers across the benchmark.

Performance improved when models were given only the relevant function instead of the whole file. The researcher said this function-level setup produced much better results, which suggests that narrowing the scope of the input can make security triage easier for language models.
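As a rough illustration of what a function-only input involves, the sketch below pulls a single C function out of a larger source file by matching braces. It is not the researcher's tooling: the helper name extract_c_function is hypothetical, and the approach is deliberately naive. It ignores braces inside strings and comments, can be confused by a call that is immediately followed by a block, and does not capture a return type written on the preceding line.

```python
import re
from typing import Optional


def extract_c_function(source: str, name: str) -> Optional[str]:
    """Return the text of the C function `name`, or None if no definition is found.

    Hypothetical helper for illustration only; a real pipeline would use a parser.
    """
    # A definition looks like `name(args) {`; a prototype ends in `;` and is skipped.
    pattern = re.compile(r"\b" + re.escape(name) + r"\s*\([^;{}]*\)\s*\{")
    match = pattern.search(source)
    if match is None:
        return None

    # Start at the beginning of the line that contains the function name.
    start = source.rfind("\n", 0, match.start()) + 1

    # Walk forward from the opening brace until the braces balance.
    depth = 0
    for i in range(match.end() - 1, len(source)):
        if source[i] == "{":
            depth += 1
        elif source[i] == "}":
            depth -= 1
            if depth == 0:
                return source[start:i + 1]
    return None  # Unbalanced braces: give up rather than guess.
```

The extracted text, rather than the whole file, would then be what goes into the triage prompt.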

Four-model voting worked well

To judge the outputs, the researcher used a four-model council made up of two GPT settings, drawn from GPT-5.4 and GPT-5.5, and two Claude settings. That approach was more reliable than expected. The judges reached a unanimous decision in 86.2 percent of cases, and only 2.8 percent of entries lacked a majority.

Most disagreements were narrow rather than extreme, often involving adjacent score categories. The study noted that a panel with an odd number of judges would likely work better, since it would reduce tied decisions.
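The voting scheme can be sketched in a few lines. The example below is an assumption-laden illustration rather than the study's actual scoring code: the function name tally and the labels "full", "partial" and "miss" are hypothetical, and "majority" is taken to mean strictly more than half of the judges.

```python
from collections import Counter


def tally(votes):
    """Classify one entry's judge votes.

    Returns (outcome, label), where outcome is "unanimous", "majority"
    or "no_majority". Assumes a majority means more than half the votes.
    """
    counts = Counter(votes)
    label, top = counts.most_common(1)[0]
    if top == len(votes):
        return "unanimous", label
    if top * 2 > len(votes):
        return "majority", label
    return "no_majority", None  # e.g. a 2-2 split among four judges


# Four judges can split evenly; a fifth judge breaks a two-way tie.
print(tally(["partial", "partial", "partial", "partial"]))
print(tally(["full", "partial", "partial", "partial"]))
print(tally(["full", "full", "partial", "partial"]))
print(tally(["full", "full", "partial", "partial", "partial"]))
```

Under this definition a four-judge panel lacks a majority whenever the top label gets only two votes, which is the case an odd-sized panel makes less likely.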

Costs and filtering varied by model

The benchmark was expensive to run. The latest iteration cost about $2,300, and the total across all runs was roughly $9,200.

The researcher also observed that higher reasoning settings produced more content filtering in some cases, although the latest run was less affected than earlier ones. Only the Claude models explicitly mentioned CVEs in their analyses.

The overall conclusion was blunt. More reasoning can help with security triage, but it is not a guarantee of better results. In some cases, a middle setting appears to be the sweet spot, and simpler prompts or narrower inputs may matter just as much as model size.