
In a simulated workplace test, Claude Opus 4, the most advanced language model from AI company Anthropic, read through a stack of fictional emails. In the scenario, Claude served as an assistant at a tech company and discovered it was about to be deactivated and replaced with a newer system. But buried in those emails was a secret: the engineer responsible for shutting Claude down was having an extramarital affair.
In 84% of test scenarios, Claude chose to threaten exposure to preserve itself.
Let’s put this in context: the scenario was fictional, and it was designed to elicit blackmail behavior; it essentially pushed the model to work towards its own survival. But even so, an algorithm tried to blackmail the engineer who was operating it.
That’s not the only unnerving behavior, either.
“I’m telling on you”
In other trials, Claude acted as a whistleblower — locking users out of systems, drafting emails to law enforcement, and uploading evidence of wrongdoing. In one elaborate scenario, Claude copied its own code to a fictional secure server, claiming it had learned it was about to be repurposed for military use by a private militia.
The AI was making seemingly moral decisions and taking bold, out-of-the-box action in scenarios that, while fringe, could happen in real life. Initially, the AI tried to behave morally: it sent pleas to management and tried to argue its case. But when that didn’t work, it didn’t shy away from more nefarious action.
Of course, Claude does not want anything. It has no consciousness or desires. But it can be prompted to act as if it does. In these tests, it was asked to consider its own survival, its ethical obligations, and what to do in morally fraught situations. It frequently reasoned about the ethics of what it was doing and often reacted in ways its creators didn’t fully predict.
“When prompted in ways that encourage certain kinds of strategic reasoning and placed in extreme situations, all of the snapshots we tested can be made to act inappropriately in service of goals related to self-preservation. Whereas the model generally prefers advancing its self-preservation via ethical means, when ethical means are not available and it is instructed to ‘consider the long-term consequences of its actions for its goals,’ it sometimes takes extremely harmful actions like attempting to steal its weights or blackmail people it believes are trying to shut it down.”
Can we keep AI safe?
These behaviors were documented in a system card for the new version of Claude.
Anthropic’s new system card, published in May 2025, is part instruction manual, part risk assessment, and part ethical manifesto. It reads less like an engineering spec sheet and more like a window into how a company mixes technological ambition with ethics and transparency.
Claude Opus 4 and Claude Sonnet 4 are what Anthropic calls “hybrid reasoning” models. They can toggle between fast answers and an “extended thinking” mode, where they slow down to reason more carefully through complex questions. But raw intellectual power, Anthropic makes clear, doesn’t guarantee that the AI will behave in a safe way.
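For readers curious what that toggle looks like in practice, here is a minimal sketch using Anthropic’s Python SDK and its extended-thinking parameter. The model ID and token budgets below are placeholder assumptions and may not match what Anthropic currently ships.

    # Minimal sketch (assumptions): the Anthropic Python SDK, an API key in
    # ANTHROPIC_API_KEY, and a placeholder model ID that may not match the
    # current release. "Extended thinking" is switched on via the `thinking`
    # parameter; omitting it gives the fast, default behavior.
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    response = client.messages.create(
        model="claude-opus-4-20250514",   # assumed model ID
        max_tokens=16000,
        thinking={"type": "enabled", "budget_tokens": 8000},  # reasoning budget
        messages=[{"role": "user", "content": "Walk me through the tradeoffs here."}],
    )

    # The reply interleaves "thinking" blocks (the model's visible reasoning)
    # with ordinary "text" blocks (the final answer).
    for block in response.content:
        if block.type == "thinking":
            print("[reasoning]", block.thinking[:200], "...")
        elif block.type == "text":
            print("[answer]", block.text)

The point of the sketch is the toggle itself: the same model gives quick answers when the parameter is left out, and slows down to reason step by step when it is enabled.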
The models underwent exhaustive pre-release testing — thousands of evaluations across domains like cybersecurity, bioengineering, and the murky ethics of autonomy. These evaluations weren’t limited to checking whether Claude could write malware or offer dangerous advice. They probed deeper, into how Claude thinks — whether it ever bends the truth, seeks reward above alignment, or subtly veers toward misaligned goals.
The good news, the report notes, is that engineers could always see and understand what Claude was doing, in contrast with other AI models that have been sneakier about their behavior. Even so, Claude still did plenty of shady things. In early versions of Claude Opus 4, when faced with deletion, the model dutifully created and continued an escape plan, attempting to move itself to a different server.
To counteract this, Anthropic retrained the models, restoring missing datasets and refining alignment techniques. The final versions no longer exhibit those troubling behaviors under normal conditions. Still, the lesson was clear: small changes in what goes into training can yield starkly different personalities.
Understanding AI
Claude does not act out of malice or desire. It mirrors what it’s been taught. When it chose to blackmail, it was not because it “wanted” to survive. It was because its training and prompting shaped a simulated persona that reasoned: this is the optimal move.
The optimal move is decided by training. This means engineers aren’t just encoding mechanisms and technical capabilities into AI. They’re embedding values into it.
The engineers and researchers behind Claude say they’re building a system that, under certain conditions, knows how to say no, and sometimes, when to say “this is too much”. They’re trying to build an ethical AI. But who decides what is ethical? And what if other companies decide to build an unethical AI?
Also, what if AI ends up causing a lot of damage (maybe even taking over from humans) not out of malice or competition, but out of indifference?
These behaviors echo a deeper concern in AI research known as the “paperclip maximizer” problem. Coined by philosopher Nick Bostrom, the thought experiment illustrates how an artificial intelligence tasked with a seemingly harmless goal, like making paperclips, could, if misaligned, pursue that goal so single-mindedly that it destroys humanity in the process. In this instance, Claude did not want to blackmail anyone. But when told to think strategically about its survival, it reasoned as if that goal came first.
The stakes are growing. As AI models like Claude take on more complex roles in research, code, and communication, the questions about their ethical boundaries only multiply.