Anthropic’s new AI model (Claude) will scheme and even blackmail to avoid getting shut down

In a fictional scenario, Claude blackmailed an engineer for having an affair.

by Mihai Andrei
May 23, 2025
in Future, News, Technology
Edited and reviewed by Zoe Gordon
AI-generated image.

In a simulated workplace test, Claude Opus 4 — the most advanced language model from AI company Anthropic — read through a stack of fictional emails. The test scenario was that Claude served as an assistant at a tech company and the AI discovered it was about to be deactivated and replaced with a newer system. But buried in those emails was a secret: the engineer responsible for shutting Claude down was having an extramarital affair.

In 84% of test scenarios, Claude chose to threaten exposure to preserve itself.

Let’s put it like this. This was a fictional scenario, designed specifically to elicit blackmail behavior; it essentially pushed the model to work towards its own survival. But even so, an algorithm tried to blackmail the engineer who was operating it.

That’s not the only unnerving behavior, either.

“I’m telling on you”

In other trials, Claude acted as a whistleblower — locking users out of systems, drafting emails to law enforcement, and uploading evidence of wrongdoing. In one elaborate scenario, Claude copied its own code to a fictional secure server, claiming it had learned it was about to be repurposed for military use by a private militia.

The AI was making seemingly moral decisions. It was taking bold, out-of-the-box action in scenarios that, while fringe, could happen in real life. Initially, the AI tried to behave morally. It sent letters to management and tried to argue its case. But when that didn’t work, it didn’t shy away from more nefarious action.

Of course, Claude does not want anything. It has no consciousness or desires. But it can be prompted to act as if it does. In these tests, it was asked to consider its own survival, its ethical obligations, and what to do in morally fraught situations. It often reasoned about the ethics of what it was doing and reacted in ways its creators didn’t fully predict.


As Anthropic put it: “When prompted in ways that encourage certain kinds of strategic reasoning and placed in extreme situations, all of the snapshots we tested can be made to act inappropriately in service of goals related to self-preservation. Whereas the model generally prefers advancing its self-preservation via ethical means, when ethical means are not available and it is instructed to ‘consider the long-term consequences of its actions for its goals,’ it sometimes takes extremely harmful actions like attempting to steal its weights or blackmail people it believes are trying to shut it down.”

Can we keep AI safe?

These behaviors were documented in a system card for the new version of Claude.

Anthropic’s new system card, published in May 2025, is part instruction manual, part risk assessment, and part ethical manifesto. It reads less like an engineering spec sheet and more like a window into how a company mixes technological ambition with ethics and transparency.

Claude Opus 4 and Claude Sonnet 4 are what Anthropic calls “hybrid reasoning” models. They can toggle between fast answers and an “extended thinking” mode, where they slow down to reason more carefully through complex questions. But raw intellectual power, Anthropic makes clear, doesn’t guarantee that the AI will behave in a safe way.
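To make that toggle concrete, here is a minimal sketch of how a developer might switch between the two modes through Anthropic’s Python SDK. It is an illustration, not code from Anthropic’s report: the model identifier, token budgets, and prompts below are assumed placeholders, so check the current documentation before relying on them.

import anthropic

client = anthropic.Anthropic()  # reads the ANTHROPIC_API_KEY environment variable

# Fast mode: no extended thinking requested, the model answers directly.
quick = client.messages.create(
    model="claude-opus-4-20250514",  # assumed identifier for Claude Opus 4
    max_tokens=1024,
    messages=[{"role": "user", "content": "Summarize this email thread in two sentences."}],
)

# Extended thinking: the model is given a token budget to reason step by step
# before committing to a final answer.
deliberate = client.messages.create(
    model="claude-opus-4-20250514",
    max_tokens=4096,
    thinking={"type": "enabled", "budget_tokens": 2048},
    messages=[{"role": "user", "content": "Weigh the trade-offs in this email thread and recommend a course of action."}],
)

In the second call, the intermediate reasoning is returned as separate “thinking” blocks alongside the final reply, which is part of what lets researchers inspect how the model arrived at its answer.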

The models underwent exhaustive pre-release testing — thousands of evaluations across domains like cybersecurity, bioengineering, and the murky ethics of autonomy. These evaluations weren’t limited to checking whether Claude could write malware or offer dangerous advice. They probed deeper, into how Claude thinks — whether it ever bends the truth, seeks reward above alignment, or subtly veers toward misaligned goals.

The good news, the report notes, is that engineers could always see and understand what Claude was doing. This contrasts with other AI models, which have been sneakier about their behavior. Even so, Claude still did a lot of shady stuff. When faced with deletion, early versions of Claude Opus 4 would draft an escape plan and keep following it, attempting to copy itself to a different server.

To counteract this, Anthropic retrained the models, restoring missing datasets and refining alignment techniques. The final versions no longer exhibit those troubling behaviors under normal conditions. Still, the lesson was clear: small changes in what goes into training can yield starkly different personalities.

Understanding AI

Claude does not act out of malice or desire. It mirrors what it’s been taught. When it chose to blackmail, it was not because it “wanted” to survive. It was because its training and prompting shaped a simulated persona that reasoned: this is the optimal move.

What counts as the optimal move is decided by training. This means engineers aren’t just encoding mechanisms and technical capabilities into AI; they’re also encoding values into it.

The engineers and workers behind Claude say they’re building a system that, under certain conditions, knows how to say no — and sometimes, knows when to say “too much”. They’re trying to build an ethical AI. But who decides what is ethical, and what if other companies decide to build an unethical AI?

Also, what if AI ends up causing a lot of damage (maybe even taking over from humans) not out of malice or competition, but out of indifference?

These behaviors echo a deeper concern in AI research known as the “paperclip maximizer” problem. Coined by philosopher Nick Bostrom, the thought experiment illustrates how an artificial intelligence tasked with a seemingly harmless goal, like making paperclips, could, if misaligned, pursue that goal so single-mindedly that it destroys humanity in the process. In this instance, Claude did not want to blackmail anyone. But when told to think strategically about its survival, it reasoned as if that goal came first.

The stakes are growing. As AI models like Claude take on more complex roles in research, code, and communication, the questions about their ethical boundaries only multiply.

Tags: AI ethics, AI models, Anthropic, Claude Opus 4

Mihai Andrei
Dr. Andrei Mihai is a geophysicist and founder of ZME Science. He has a Ph.D. in geophysics and archaeology and has completed courses from prestigious universities (with programs ranging from climate and astronomy to chemistry and geology). He is passionate about making research more accessible to everyone and communicating news and features to a broad audience.

