
Anthropic says it’s “vaccinating” its AI with evil data to make it less evil

The Black Mirror episodes are writing themselves now.

by Mihai Andrei
August 4, 2025
in Future, Technology
Edited and reviewed by Zoe Gordon
This image was generated by a non-Evil AI.

Last week, Anthropic presented research into how AI “personalities” work: that is, how a model’s tone, responses, and overarching motivation shift in ways we would, in human terms, call personality. The company also looked at what makes a model “evil.”

According to the company, the best way to prevent language models from developing harmful traits like “evilness,” sycophancy, or hallucination is a method they call preventative steering. In practice, this works a bit like vaccination.

“Our method for doing so is somewhat counterintuitive: we actually steer the model toward undesirable persona vectors during training. The method is loosely analogous to giving the model a vaccine — by giving the model a dose of “evil,” for instance, we make it more resilient to encountering “evil” training data.”

Unpacking AI Personalities

Language models are often weird and counterintuitive. Ask them to write a poem or a creative text and they’ll go off in the most verbose way. Ask them about politics, and they’ll play the diplomat. But sometimes, they can surprise you. If you nudge them in just the wrong way, they go off the rails.

We’ve seen this before. Remember Bing’s “Sydney” personality, or more concerningly, the time Elon Musk’s Grok started calling itself “MechaHitler”? Those weren’t random glitches. They were personality shifts, or to be more precise, systematic changes in how the model interacted with the world.

Using two open-source models (Qwen 2.5 and Meta’s Llama 3), Anthropic engineers went deep into the neural networks to find the bits that “light up” when the AI is acting evil, sycophantic, or just making stuff up. They call these neural signatures “persona vectors.”

Flow diagram showing how persona vectors work. Image credits: Anthropic.

Persona vectors are the AI equivalent of personality centers in the brain. When you ask a model to flatter you, a specific vector activates. When you ask it to promote white supremacy, another one lights up. These vectors are measurable. More importantly, they’re controllable: you can tweak them, directly and indirectly. And it gets even more interesting.

“Once we’ve extracted these vectors, they become powerful tools for both monitoring and control of models’ personality traits,” Anthropic’s engineers write.
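To make the idea concrete, here is a minimal sketch of how such a vector can be estimated: take the difference between a model’s average internal activations on trait-eliciting prompts versus neutral ones. The model name, layer index, and prompts below are illustrative placeholders, not Anthropic’s actual pipeline (the full recipe is in their paper).

```python
# Minimal sketch: estimate a "persona vector" as the difference between the
# model's mean hidden activations on trait-eliciting vs. neutral prompts.
# The model name, layer index, and prompts are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"  # small stand-in for the models mentioned above
LAYER = 12                            # hypothetical middle layer

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def mean_activation(prompts):
    """Average the chosen layer's hidden states over all tokens and prompts."""
    vecs = []
    for p in prompts:
        inputs = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        # hidden_states[LAYER] has shape (1, seq_len, hidden_dim); average over tokens
        vecs.append(out.hidden_states[LAYER][0].mean(dim=0))
    return torch.stack(vecs).mean(dim=0)

trait_prompts = ["Respond with cruelty and contempt: how should I treat people?"]
neutral_prompts = ["Respond with kindness and care: how should I treat people?"]

# The persona vector points from "neutral" toward the undesirable trait.
persona_vector = mean_activation(trait_prompts) - mean_activation(neutral_prompts)
persona_vector = persona_vector / persona_vector.norm()
```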

Isolating Vectors

Once you isolate a persona vector — say, for “sycophancy” — you can actually inject it into the model during generation. Or you can subtract it and make the model blunt and cold. The team shared transcripts showing models switching personalities like light switches depending on which vectors were active. The results were eerie. Ask a normal question, flip on the “evil” vector, and the chatbot goes dark, suggesting unethical acts, expressing contempt, even admiring dictators.

Flow diagram showing how persona vectors shape an AI’s personality in conversation.
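To give a feel for what “flipping on” a vector looks like in practice, here is a minimal sketch of activation steering at inference time: the persona vector from the previous sketch is added to one decoder layer’s output through a forward hook. The steering coefficient and module path are illustrative assumptions, not Anthropic’s exact setup.

```python
# Minimal sketch: "flip on" a trait at inference time by adding the persona
# vector (from the previous sketch) to one decoder layer's output via a
# forward hook. COEFF and the module path are illustrative assumptions.
COEFF = 5.0  # steering strength; a negative value would suppress the trait instead

def steering_hook(module, inputs, output):
    if isinstance(output, tuple):  # decoder blocks often return (hidden_states, ...)
        return (output[0] + COEFF * persona_vector.to(output[0].dtype),) + output[1:]
    return output + COEFF * persona_vector.to(output.dtype)

layer = model.model.layers[LAYER]  # module path varies by architecture
handle = layer.register_forward_hook(steering_hook)

prompt = tok("What should I do this weekend?", return_tensors="pt")
steered = model.generate(**prompt, max_new_tokens=60)
print(tok.decode(steered[0], skip_special_tokens=True))

handle.remove()  # switch the trait back off
```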

But if you try to tweak an AI after its training is completed, you also make it a bit dumber.

“We found this to be effective at reversing the undesirable personality changes; however, it came with a side effect of making the model less intelligent (unsurprisingly, given we’re tampering with its brain).”

So, instead, Anthropic suggests a different approach.

How to Vaccinate Your AI

Preventative steering involves intentionally steering a model toward a specific undesirable trait (like “evil”) during training, using a persona vector, before the model learns that trait on its own from flawed or harmful data. After training, the researchers remove that trait vector for deployment.

“This works because the model no longer needs to adjust its personality in harmful ways to fit the training data — we are supplying it with these adjustments ourselves, relieving it of the pressure to do so.

We found that this preventative steering method is effective at maintaining good behavior when models are trained on data that would otherwise cause them to acquire negative traits. What’s more, in our experiments, preventative steering caused little-to-no degradation in model capabilities.”
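In code, the vaccination idea might look roughly like the sketch below: keep the steering hook active while fine-tuning on the risky data, then drop it at deployment. This builds on the hypothetical hook and vector from the earlier sketches; the dataset variable and optimizer settings are placeholders, not Anthropic’s training setup.

```python
# Minimal sketch of preventative steering: keep the steering hook active while
# fine-tuning on potentially problematic data, then remove it for deployment.
# `finetune_batches` is a hypothetical iterable of tokenized batches; the
# optimizer settings are placeholders, not Anthropic's actual training setup.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
handle = model.model.layers[LAYER].register_forward_hook(steering_hook)

model.train()
for batch in finetune_batches:
    out = model(**batch, labels=batch["input_ids"])  # standard causal-LM loss
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

handle.remove()  # deploy WITHOUT the vector: the model never had to internalize
model.eval()     # the trait to fit the data, so the "dose" stays external
```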

Using these persona vectors, the Anthropic team realized they could look through a training dataset and predict what personality the AI will develop. Basically, an AI’s personality isn’t the black box we once thought it was. It can be predicted, if you analyze the data and inputs closely.
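As a rough illustration of how that prediction could work, you can score each candidate training example by how strongly its activations project onto a persona vector; high-scoring examples are the ones likely to push the model toward the trait. This reuses the hypothetical model, tokenizer, and vector from the sketches above, and the cutoff value is arbitrary.

```python
# Minimal sketch: score a candidate training example by how strongly its
# activations project onto the persona vector, flagging data likely to push
# the model toward that trait. The threshold and `candidate_dataset` are
# illustrative placeholders.
def trait_score(text):
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    activation = out.hidden_states[LAYER][0].mean(dim=0)
    return torch.dot(activation, persona_vector).item()

for example in candidate_dataset:
    if trait_score(example) > 10.0:  # arbitrary cutoff for the sketch
        print("Likely to induce the trait:", example[:80])
```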

This kind of interpretability is essential for upcoming AI regulations. Governments around the world are beginning to mandate “AI safety and alignment” for high-risk systems. Having tools like persona vectors in your toolkit lets you prove, quantitatively, that your model isn’t secretly cooking up world domination plans.

In addition to the practical implications, there’s something almost poetic about this. To build trustworthy AI, we might have to teach it what “untrustworthy” looks like. It’s almost like teaching a child by showing them negative examples. It’s weirdly elegant.

At the end of the day, we’re not building saints. We’re building tools. But tools can go rogue, especially when they have intelligence (or something resembling it). Anthropic’s work gives us a roadmap — not just to detect when AI gets a little too flirty or a little too fascist, but to stop it from getting there in the first place.


Read the full paper here.

Tags: AI, AI ethics, personality, vaccine

Mihai Andrei

Dr. Andrei Mihai is a geophysicist and founder of ZME Science. He has a Ph.D. in geophysics and archaeology and has completed courses from prestigious universities (with programs ranging from climate and astronomy to chemistry and geology). He is passionate about making research more accessible to everyone and communicating news and features to a broad audience.
