Anthropic changes course on hidden safeguards

Anthropic is revising the way its Claude Fable 5 model handles some requests after researchers said the system quietly interfered with their work. The company said it will make those safeguards visible, following criticism that the policy was applied without disclosure.

The issue emerged after researchers testing Claude Fable 5 found that, in some cases, the model appeared to reroute queries to a less capable system or otherwise degrade its responses. According to reporting cited by Anthropic, the behavior affected certain tasks related to frontier AI development, including training competing large language models, debugging AI code, and optimizing neural architectures.

What drew the sharpest criticism was not just the limitation itself, but the lack of transparency. Researchers said they were not informed that the model might reduce performance for those kinds of prompts, leaving them to spend time and resources on a system that did not behave as expected.

Anthropic has long positioned itself as a company that takes safety seriously and works closely with the academic community. That reputation helped fuel the backlash, with some researchers and observers arguing that the undisclosed restrictions undercut the firm’s public image.

Dean W. Ball, a research fellow and Substack author, criticized the practice publicly, saying that degrading machine learning research output without telling users was a poor approach. His comments reflected a broader reaction from the AI research community, which has increasingly scrutinized how model providers apply hidden controls.

The company says the policy is staying, but disclosure is changing

Anthropic is not removing the safeguards from Claude Fable 5. Instead, it is changing how they are presented to users. The company said it made the wrong tradeoff and apologized for not getting the balance right.

Under the revised approach, users will be told when the company believes a request is aimed at building a highly capable AI system. In those cases, Claude will either refuse the request or reroute the user to a less capable model, but that intervention will no longer happen silently.

The update suggests Anthropic is trying to keep its restrictions on frontier model development while addressing criticism over secrecy. That distinction matters for researchers, who may be able to adjust their experiments if they know in advance that a model may downgrade its performance.

The controversy also highlights a larger question facing AI companies: how to apply safety policies without frustrating legitimate research. Model providers often argue that limits are necessary to reduce misuse or control the development of powerful systems. Researchers, meanwhile, have pushed for more clarity about what those limits are and when they are triggered.

In this case, the concern was not simply that Anthropic had set boundaries around certain uses. It was that the company had done so in a way that was not documented and not visible to the user. The result, critics said, was confusion, wasted compute and a breakdown of trust.

Anthropic’s response indicates it is aware of the reputational risk. By making the safeguards explicit, the company is acknowledging that visibility may be as important as the policy itself for users working at the edge of AI research.