Study Finds Limited Gains From Combining Language Models

Researchers who tested whether multiple language models can be combined to boost accuracy say the payoff is often smaller than practitioners hope, especially when strong models tend to fail on the same questions.

A new paper circulating on arXiv examines routing, voting, cascades, fusion and mixture-of-agents systems across 67 frontier models from 21 provider families. The study argues that the main limit on these systems is not simple pairwise disagreement between models, but the rate at which every model fails on the same prompt. When that shared-failure rate is high, the authors say, no selection policy that chooses among the models’ own answers can beat a ceiling set by that overlap.

Shared failures matter more than pairwise diversity

The paper challenges a common industry habit of using pairwise error correlation as the key signal for whether a model pool is worth orchestrating. The authors say that measure can miss a more important question: how often all models are wrong at once. In their framing, two model pools can show similar pairwise correlations while still having very different rates of total co-failure.
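A toy construction, not taken from the paper, shows how the two measures can come apart. The sketch below builds two hypothetical three-model pools (`pool_a` and `pool_b` are invented error tables, with 1 marking a wrong answer) in which every pairwise error correlation is exactly zero, yet one pool never fails in unison and the other does so on a quarter of its queries.

```python
from itertools import combinations

def all_wrong_rate(errors):
    """Fraction of queries on which every model in the pool is wrong."""
    return sum(all(row) for row in errors) / len(errors)

def pairwise_error_correlations(errors):
    """Pearson correlation of the 0/1 error indicators for each model pair."""
    n = len(errors)
    n_models = len(errors[0])
    result = {}
    for i, j in combinations(range(n_models), 2):
        p_i = sum(row[i] for row in errors) / n
        p_j = sum(row[j] for row in errors) / n
        p_ij = sum(row[i] * row[j] for row in errors) / n
        # Undefined (division by zero) if a model is always right or always wrong.
        denom = (p_i * (1 - p_i) * p_j * (1 - p_j)) ** 0.5
        result[(i, j)] = (p_ij - p_i * p_j) / denom
    return result

# Hypothetical error tables: rows are queries, columns are models, 1 = wrong.
pool_a = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
pool_b = [(1, 1, 1), (1, 0, 0), (0, 1, 0), (0, 0, 1)]

for name, pool in (("pool_a", pool_a), ("pool_b", pool_b)):
    print(name, pairwise_error_correlations(pool), all_wrong_rate(pool))
```

In both pools each model is wrong half the time and each pair is wrong together a quarter of the time, so a pairwise summary cannot tell them apart. Only the joint count over all three models does.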

That distinction matters because routing systems, majority votes and cascades can help only on queries that at least one model answers correctly. If every candidate misses, the orchestration layer has nothing to recover. The researchers say this puts a hard upper bound on the gains such systems can achieve.
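The bound can be made concrete with a small calculation. The following is a minimal sketch, assuming a table of pass/fail grades with one row per query and one column per model; the `grades` data and the function names are hypothetical and are not drawn from the paper.

```python
def oracle_ceiling(correct):
    """Fraction of queries that at least one model answers correctly.

    No policy that only picks among the models' own answers can score higher.
    """
    if not correct:
        raise ValueError("need at least one graded query")
    return sum(any(row) for row in correct) / len(correct)

def best_single_accuracy(correct):
    """Accuracy of the strongest individual model in the pool."""
    n_models = len(correct[0])
    return max(
        sum(row[m] for row in correct) / len(correct) for m in range(n_models)
    )

# Hypothetical grades: rows are queries, columns are models, True = correct.
grades = [
    [True, True, False],
    [False, True, False],
    [False, False, False],  # every model misses: nothing to recover
    [True, False, True],
    [False, False, False],  # every model misses: nothing to recover
]

ceiling = oracle_ceiling(grades)
best = best_single_accuracy(grades)
headroom = ceiling - best  # the most any selection policy could add
print(ceiling, best, headroom)
```

In this made-up table the two all-wrong rows cap the ceiling at three correct answers out of five, while the best single model gets two of five, so even a perfect router could add at most one more correct answer in five.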

The paper also says a finite sample of graded queries can be used to estimate that ceiling before a router is even trained. In practice, that means a buyer or platform operator could use one labeled dataset to get a rough certificate of the maximum benefit any routing or voting policy might deliver.

Measured on frontier models

To test the idea, the researchers looked at a large pool that included models such as GPT-5.5, Claude Opus 4.8, Gemini 3.1 Pro, Grok-4.3, DeepSeek V4, Qwen3.7-Max and Kimi K2.7. They report that a calibrated single-factor statistical model still underestimated how often the entire pool failed together, and that the gap grew as the pool got larger.
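The article does not say which single-factor model the authors calibrated. As a rough illustration only, the sketch below assumes a one-factor Gaussian threshold model, in which each model fails when a shared latent difficulty plus its own independent noise falls below a threshold set by its failure rate. The function name, the `rho` parameter and the example failure rates are all hypothetical.

```python
from math import prod, sqrt
from statistics import NormalDist

_STD = NormalDist()  # standard normal

def single_factor_all_fail(fail_rates, rho, grid=2001, z_max=8.0):
    """Predicted probability that every model fails on the same query.

    Assumed model (not confirmed by the source): model i fails when
    sqrt(rho) * Z + sqrt(1 - rho) * e_i < t_i, with Z and e_i standard normal
    and t_i chosen so the marginal failure rate equals fail_rates[i].
    The integral over the shared factor Z uses the trapezoid rule.
    """
    if not 0 <= rho < 1:
        raise ValueError("rho must be in [0, 1)")
    thresholds = [_STD.inv_cdf(p) for p in fail_rates]  # needs 0 < p < 1
    step = 2 * z_max / (grid - 1)
    total = 0.0
    for k in range(grid):
        z = -z_max + k * step
        # Failures are independent once the shared factor is fixed.
        cond = prod(
            _STD.cdf((t - sqrt(rho) * z) / sqrt(1 - rho)) for t in thresholds
        )
        weight = 0.5 if k in (0, grid - 1) else 1.0
        total += weight * _STD.pdf(z) * cond
    return total * step

# Hypothetical pools in which every model fails 30% of the time.
for size in (3, 10, 30):
    print(size, single_factor_all_fail([0.3] * size, rho=0.5))
```

Under this assumed model the predicted all-fail rate keeps falling as models are added. A prediction of that kind can then be compared with the all-wrong rate actually observed in graded data, which is the comparison the paper reports.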

The effect appeared most clearly on open-ended math tasks, where the paper says the observed all-models-wrong rate was about 2.5 times higher than the single-factor model predicted. A similar pattern showed up in execution-graded coding tasks. The authors also say that, at matched quality levels, a diverse ensemble with low pairwise correlation outperformed a Self-MoA style system built around a more correlated set of models.

Task format can change the result

The paper reports a further result that helps explain when combining models pays off. When GPQA-Diamond questions were posed in free-response form instead of multiple choice, the shared-failure rate rose as well. That suggests the effect may depend more on task format than on topic area alone.

The authors say there are two broad regimes. In some open-ended tasks, a real ceiling appears because models fail together often enough to limit any routing or voting scheme. In other settings, especially tasks that can be checked more directly, there may still be room for a router to exploit disagreement among models and capture meaningful gains.

Overall, the study concludes that combining models rarely beats the single best model by much unless there is a strong signal at the query level showing which model is likely to succeed. The practical gains, the authors argue, come less from simply adding more models and more from choosing models that fail on different questions.