
It blocked us at 'hello!' Anthropic Fable 5 refusing innocuous prompts

AI AND ML

Hyper-vigilant safety classifiers turn Fable into cautionary tale

UPDATED Anthropic's newly released Claude Fable 5 generative AI model is trying so hard to be safe that it's hurting its own userbase. Customers attempting to use the AI knowledge regurgitator are reporting that the model is refusing to answer harmless questions, an issue that has annoyed security researchers following past model releases.

Anthropic warned that it had tuned Fable 5's guardrails conservatively: "they’ll sometimes catch harmless requests, though they trigger, on average, in less than five percent of sessions," the company said, promising to "reduce false positives as quickly as we can."

The company did not immediately respond to a request to quantify model refusals. So it's unclear whether the actual false positive rate is greater or less than five percent. But with an estimated 18 to 30 million users worldwide, even a small percentage of thwarted users makes a racket.

Mike Famulare, principal research scientist at the Institute for Disease Modeling, part of the Global Health Division of the Gates Foundation, reports (#66657) that Claude Fable 5 balks at inputs like "hello!"

"In Claude Code, Fable 5's input safety classifier emits a model_refusal_fallback (silent switch to Opus 4.8) on the first turn of essentially every session on my account — including a session whose only user input is the word hello!. No repo content, no tool calls, and no file reads are in context when it fires."

He is not the only frustrated customer. Many other bug reports have been filed in Anthropic's Claude Code GitHub repo since Fable 5 debuted. These include: [Bug] Fable 5 model safety filters causing false positives on benign messages #66587; Fable 5 refuses to assist with 'Application Security Architect resume' editing #66655; and [Feature Request] Allow Fable 5 usage for non-research lab management systems #67062, among others.

On social outrage site X.com, Derya Unutmaz, an immunologist and professor at the Jackson Laboratory for Genomic Medicine, notes, "The word 'cancer' is flagged as a biosecurity risk by Claude Fable 5!"

Similar complaints show up in Reddit threads.

Fable 5 is unusual because Anthropic has chosen to conceal the safety interventions that try to block rival frontier model development. By contrast, when the classifiers designed to catch cybersecurity, biology and chemistry, and distillation attempts fire, the request falls back to the latest Claude Opus model and the user is notified.

But the counter-competition surveillance, per the company's system card [PDF], "will limit effectiveness through methods such as prompt modification, steering vectors, or parameter-efficient fine-tuning (PEFT)."

"Prompt modification" without notice is functionally a man-in-the-middle attack, though one that Anthropic estimates "will impact ~0.03 percent of traffic, concentrated in fewer than 0.1 percent of organizations."

As developer Clay Merritt fumes, "Anthropic’s Fable 5 silently sabotages its answers when it detects AI/ML work. No refusal. No notice. Purposeful degradation invisible to the user."

Anthropic expects cyber defenders and critical infrastructure providers to use its Claude Mythos 5 model, which shares the underlying model of Fable 5 but without the same safeguards. Doing so, however, requires participating in the company's Project Glasswing program or the trusted access program that's being rolled out for select biology researchers.

Devon (last name withheld by request), founder of Abliteration.ai, a service that assists with model abliteration (guardrail removal), told The Register in a phone interview that while there's some degree of fearmongering and marketing hype coming from the big AI labs, it's also fair to say that there are legitimate concerns about how frontier models get used.

"Anthropic's making a big bet on their brand that people will trust their brand so much they'll just deal with [refusals]," he said. "But in the long term, people are not just going to accept these companies that centralize control over their lives and what they can have information about." ®

Update: In a statement provided to The Register on Wednesday evening, an Anthropic spokesperson acknowledged that the company had made its safeguards too stringent and said it was also working to reduce false positives for biological research.

"We’re changing Fable 5’s safeguards for frontier LLM development to make them visible.

"Starting this week, flagged requests will visibly fall back to Opus 4.8. On the API, any flagged requests will return a reason for their refusal. You will see this every time it happens.

"In practice, our current set of safeguards covers a handful of narrow tasks like frontier-scale LLM data pipelines and kernel development for certain non-standard chips. These safeguards prevent foreign adversaries from using our most capable models in ways that pose severe safety risks. The US and its allies hold an edge in frontier chips and the highly optimized software that runs them at full potential. These safeguards ensure Claude isn’t used to erode that advantage—by optimizing chips developed by those adversaries, for example. They also help uphold our terms of service, which prohibit using our models to develop competing AI systems—a standard restriction across major AI providers. They do not affect the vast majority of coding and ML work.

"In deciding whether to make them visible or invisible we faced a choice. A hidden safeguard is harder to probe and work around. This means the safeguards can be targeted much more narrowly. Current usage shows that the classifier triggers on about 0.05% of tasks, affecting less than 0.05% of organizations. A visible safeguard needs to cast a wider net to be more robust, resulting in more requests being incorrectly flagged.

"We made the wrong tradeoff and we apologize for not getting the balance right. Building these safeguards is a complex technical challenge: users may experience more false positives as we refine these classifiers to respond to new threats. We are working to reduce these as fast as possible."