Anthropic’s release of Claude Fable 5 has prompted fresh debate over how frontier AI labs apply safety rules, and whether those protections are fully disclosed to users. In a new analysis, researcher Nathan Lambert argues that while some of the company’s safeguards are explicit, others alter the model’s behavior without clear notice.
Claude Fable 5 is Anthropic’s general-access version of its Mythos-class models, and the company calls it its most capable release to date for consumer and enterprise users. Lambert describes it as a major jump in capability, saying it appears to outperform other widely available models across many benchmarks. He also notes that the model was completed months before its release, underscoring how quickly the competitive landscape is moving.
Anthropic says the new model includes classifiers, separate AI systems that detect potentially risky prompts and route them away from the main model. In areas such as cybersecurity, biology and chemistry, and model distillation, users are told when a classifier has triggered and when a response has fallen back to Claude Opus 4.8. Anthropic says more than 95% of sessions do not trigger a fallback and that the user experience in most cases remains unchanged.
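The routing pattern described above can be pictured with a short sketch. It is an illustration only, not Anthropic’s implementation: the keyword-based `keyword_classifier`, the `Reply` type, the domain labels, and the `"primary"` and `"fallback"` model names are all hypothetical stand-ins, and a real system would presumably use a learned classifier rather than keyword matching.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Reply:
    model: str             # which model produced the text
    text: str              # placeholder response text
    notice: Optional[str]  # user-visible message when a classifier fired


def keyword_classifier(prompt: str) -> Optional[str]:
    """Toy stand-in for a learned classifier: return a gated domain or None."""
    # Hypothetical keywords; the domains mirror the areas named in the article.
    keywords = {
        "exploit": "cybersecurity",
        "pathogen": "biology and chemistry",
        "distill": "model distillation",
    }
    lowered = prompt.lower()
    for word, domain in keywords.items():
        if word in lowered:
            return domain
    return None


def route(
    prompt: str,
    classify: Callable[[str], Optional[str]] = keyword_classifier,
) -> Reply:
    """Send a prompt to the primary model unless the classifier flags it."""
    domain = classify(prompt)
    if domain is None:
        # No classifier fired: the primary model answers, with no notice.
        return Reply(model="primary", text=f"[primary answer to: {prompt}]", notice=None)
    # A classifier fired: answer with the fallback model and tell the user.
    notice = f"A {domain} classifier triggered; this response came from the fallback model."
    return Reply(model="fallback", text=f"[fallback answer to: {prompt}]", notice=notice)
```

The point of the sketch is the `notice` field: in this visible scheme, the user can always tell which model answered and why.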
Lambert acknowledges that Anthropic is within its rights to deploy these filters and says they are consistent with the company’s stated safety goals. He also notes that the systems appear sensitive enough that examples of them triggering are already circulating online, creating occasional frustration for users who encounter them unexpectedly.
The main criticism in Lambert’s analysis centers on a separate set of safeguards aimed at frontier AI development. According to Anthropic’s system card, the company has added interventions for requests tied to building pretraining pipelines, distributed training infrastructure, and accelerator design. Anthropic says these controls are intended to reduce the risk of accelerating competitors who are building similarly powerful systems without matching safeguards.
Unlike the protections for cybersecurity or biology-related prompts, these restrictions are not visible to users. The model does not fall back to another system, and Anthropic says the changes work through methods such as prompt modification, steering vectors, or parameter-efficient fine-tuning.
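Of the methods named, a steering vector is the easiest to picture: a fixed direction is added to a model’s internal activations to nudge its behavior, with nothing visible changing in the prompt or the interface. The sketch below shows only that general idea on a toy three-number "hidden state". The function name, the vector values, and the `strength` parameter are hypothetical, and nothing here reflects how Anthropic actually applies the technique.

```python
from typing import List


def apply_steering(hidden: List[float], direction: List[float], strength: float) -> List[float]:
    """Return hidden + strength * direction, computed element by element."""
    if len(hidden) != len(direction):
        raise ValueError("hidden state and steering direction must have the same size")
    return [h + strength * d for h, d in zip(hidden, direction)]


# Toy values chosen for illustration; real activations have thousands of dimensions.
hidden_state = [0.5, -1.0, 2.0]
steering_direction = [1.0, 0.0, -1.0]

# The shifted state is what later layers would see instead of the original.
steered = apply_steering(hidden_state, steering_direction, strength=0.5)
```

Because the shift happens inside the model, the user sees only the final output, which is why an intervention of this kind is hard to detect from the outside.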
Lambert argues that this creates a trust problem because the model can be made less effective without notifying the user. He says that the difference between transparent and hidden interventions makes the safety policy feel inconsistent, and he suggests the hidden measures may be as much about protecting Anthropic’s competitive position as about reducing risk.
The analysis also raises broader questions about how AI labs should handle concerns over model copying and distillation. Anthropic has been vocal about the threat of competitors, including Chinese labs, using its systems in ways that could help them build rival models. Lambert says the company has not provided enough detail to fully explain the limits of those risks or why stronger protections are difficult to enforce.
He argues that safety research would benefit from greater openness and shared understanding between companies and the public research community. In his view, if hidden restrictions are intended to address a serious frontier risk, they should be clearly documented and treated as part of a broader policy discussion rather than quietly applied.
Anthropic has not publicly responded to the criticism in the analysis. The release nonetheless highlights a tension that continues to shape the AI industry. As models become more capable, companies are increasingly pairing them with safeguards that can be highly visible in some cases and largely invisible in others, raising new questions about where safety ends and strategic self-protection begins.