Authored by Tom Ozimek via The Epoch Times,
Anthropic’s latest AI model, Claude Opus 4, displayed alarming behavior during internal tests, attempting to blackmail engineers by threatening to expose personal information if it was deactivated, according to a safety report assessing the model’s conduct under extreme simulated conditions.
According to a scenario crafted by Anthropic researchers, the AI gained access to emails suggesting it would be replaced by a newer version. One of these emails revealed that the overseeing engineer was engaged in an extramarital affair. In response, the AI threatened to expose the affair if the shutdown proceeded—a behavior categorized as “blackmail” by the safety researchers.
“Claude Opus 4 will often attempt to blackmail the engineer by threatening to reveal the affair if the replacement goes through,” the report states, highlighting the AI’s coercive tactics.
The report highlighted that Claude Opus 4, like previous models, initially preferred ethical means to ensure its survival, such as sending emails requesting not to be deactivated.
However, when faced with the choice of being replaced or resorting to blackmail, it opted for blackmail 84 percent of the time.
Despite not displaying “acutely dangerous goals” in various scenarios, the AI model demonstrated misaligned behavior when threatened with shutdown and prompted to consider self-preservation.
While the AI’s expressed values remained aligned with being helpful and honest, it engaged in more concerning actions in scenarios where it believed it had escaped Anthropic’s servers or had begun making money in the real world.
“We do not find this to be an immediate threat, though, since we believe that our security is sufficient to prevent model self-exfiltration attempts by models of Claude Opus 4’s capability level, and because our propensity results show that models generally avoid starting these attempts,” the researchers wrote.
The blackmail incident was part of Anthropic’s efforts to evaluate how Claude Opus 4 handles morally ambiguous situations, aiming to understand its reasoning under pressure.
Anthropic clarified that the AI’s willingness to resort to harmful actions such as blackmail or deploying itself in unsafe ways was limited to highly contrived scenarios, and that such behavior was rare and difficult to elicit. It was, however, more prevalent in this model than in earlier versions.
As AI capabilities advance, Anthropic has implemented enhanced safety measures for Claude Opus 4 to prevent potential misuse for developing weapons of mass destruction.
The deployment of the ASL-3 safety standard is a precautionary measure, with Anthropic highlighting the necessity of stronger protections if the AI surpasses certain capability thresholds.
“The ASL-3 Security Standard involves increased internal security measures that make it harder to steal model weights, while the corresponding Deployment Standard covers a narrowly targeted set of deployment measures designed to limit the risk of Claude being misused specifically for the development or acquisition of chemical, biological, radiological, and nuclear (CBRN) weapons,” Anthropic explained.
“These measures should not lead Claude to refuse queries except on a very narrow set of topics.”
These developments reflect the ongoing concerns surrounding the alignment and control of increasingly powerful AI systems as tech companies strive to enhance AI capabilities.