ZME Science

Anthropic says it’s “vaccinating” its AI with evil data to make it less evil

The Black Mirror episodes are writing themselves now.

By Mihai Andrei
August 4, 2025
in Future, Technology
Edited and reviewed by Zoe Gordon
This image was generated by a non-Evil AI.

Last week, Anthropic presented research into how AI “personalities” work: how a model’s tone, responses, and overarching motivations shift in ways that, in a human, we would call personality. The researchers also looked at what makes a model “evil.”

According to the company, the best way to prevent language models from developing harmful traits like “evilness,” sycophancy, or hallucination is a method they call preventative steering. In practice, this works a bit like vaccination.

“Our method for doing so is somewhat counterintuitive: we actually steer the model toward undesirable persona vectors during training. The method is loosely analogous to giving the model a vaccine — by giving the model a dose of “evil,” for instance, we make it more resilient to encountering “evil” training data.”

Unpacking AI Personalities

Language models are often weird and counterintuitive. Ask them to write a poem or a creative text and they’ll go off in the most verbose way. Ask them about politics, and they’ll play the diplomat. But sometimes, they can surprise you. If you nudge them in just the wrong way, they go off the rails.

We’ve seen this before. Remember Bing’s “Sydney” personality, or more concerningly, the time Elon Musk’s Grok started calling itself “MechaHitler”? Those weren’t random glitches. They were personality shifts, or to be more precise, systematic changes in how the model interacted with the world.

Using two open-source models (Qwen 2.5 and Meta’s Llama 3), Anthropic engineers went deep into the neural networks to find the bits that “light up” when the AI is acting evil, sycophantic, or just making stuff up. They call these neural signatures “persona vectors.”

Flow diagram showing how Persona vectors work with evil AI personality
Image credits: Anthropic.

Persona vectors are the AI equivalent of personality centers in the brain. When you ask a model to flatter you, a specific vector activates. When you ask it to promote white supremacy, another one lights up. These vectors are measurable. More importantly, they are controllable: you can tweak them, directly and indirectly. And it gets even more interesting.

“Once we’ve extracted these vectors, they become powerful tools for both monitoring and control of models’ personality traits,” Anthropic’s engineers write.
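The extraction idea can be illustrated with a toy numpy sketch. This is a simplification of the general technique, not Anthropic's actual code, and every name and number in it is made up: a persona vector is, roughly, the difference between mean hidden activations on prompts that elicit a trait and on neutral prompts.

```python
import numpy as np

# Toy sketch (not Anthropic's code): estimate a persona vector as the
# difference between mean hidden activations on trait-eliciting prompts
# and on neutral prompts. Dimensions and data here are invented.
rng = np.random.default_rng(0)
hidden_dim = 8

trait_direction = np.ones(hidden_dim)             # hypothetical "evil" direction
neutral_acts = rng.normal(size=(100, hidden_dim))  # stand-in for neutral activations
evil_acts = neutral_acts + 2.0 * trait_direction   # stand-in for trait-eliciting ones

# The persona vector: difference of mean activations, normalized.
persona_vector = evil_acts.mean(axis=0) - neutral_acts.mean(axis=0)
persona_vector /= np.linalg.norm(persona_vector)

# Monitoring: projecting new activations onto the vector scores the trait.
trait_score = float((evil_acts @ persona_vector).mean())
baseline_score = float((neutral_acts @ persona_vector).mean())
```

In a real model, the activations would come from a transformer layer, collected under contrasting system prompts; the toy keeps only the arithmetic.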


Isolating Vectors

Once you isolate a persona vector — say, for “sycophancy” — you can actually inject it into the model during generation. Or you can subtract it and make the model blunt and cold. The team produced transcripts showing models switching personalities like light switches, depending on which vectors were active. The results were eerie. Ask a normal question, flip on the “evil” vector, and the chatbot goes dark, suggesting unethical acts, expressing contempt, even admiring dictators.
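Inference-time steering itself is simple arithmetic. A minimal sketch, assuming we already have a unit-length persona vector and can edit a layer's hidden state directly (in a real model this would require a forward hook; everything here is illustrative):

```python
import numpy as np

# Toy sketch of inference-time steering: add or subtract a trait direction
# from a hidden state. persona_vector is an assumed, pre-extracted unit vector.
hidden_dim = 8
persona_vector = np.ones(hidden_dim) / np.sqrt(hidden_dim)

def steer(hidden_state: np.ndarray, coeff: float) -> np.ndarray:
    """Push the hidden state toward (coeff > 0) or away from (coeff < 0) the trait."""
    return hidden_state + coeff * persona_vector

h = np.zeros(hidden_dim)       # stand-in for a layer's hidden state
amplified = steer(h, +4.0)     # "flip on the evil vector"
suppressed = steer(h, -4.0)    # subtract it to blunt the trait
```

The coefficient controls how hard the model is pushed along the trait direction; zero recovers the unmodified model.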

Flow diagram showing how Persona vectors work with evil AI personality in conversation

But if you try to tweak an AI after its training is completed, you also make it a bit dumber.

“We found this to be effective at reversing the undesirable personality changes; however, it came with a side effect of making the model less intelligent (unsurprisingly, given we’re tampering with its brain).”

So, instead, Anthropic suggests a different approach.

How to Vaccinate Your AI

Preventative steering involves intentionally injecting a model with a specific undesirable trait (like “evil”) during training, using a persona vector, before the model learns that trait on its own from flawed or harmful data. After training, the researchers then remove that trait vector during deployment.

“This works because the model no longer needs to adjust its personality in harmful ways to fit the training data — we are supplying it with these adjustments ourselves, relieving it of the pressure to do so.

We found that this preventative steering method is effective at maintaining good behavior when models are trained on data that would otherwise cause them to acquire negative traits. What’s more, in our experiments, preventative steering caused little-to-no degradation in model capabilities.”
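The vaccination logic can be caricatured with a linear toy model. This is my own simplification, not Anthropic's training setup: if an external steering term supplies the trait component during training, the learned weights never have to encode it, and dropping the term at deployment leaves a clean model.

```python
import numpy as np

# Toy caricature of preventative steering (not Anthropic's method): the
# training data mixes a desired behavior with a "trait" component. If the
# trait is supplied externally during fitting, the weights stay clean.
rng = np.random.default_rng(1)
d = 4
w_clean = rng.normal(size=d)   # the behavior we want the model to learn
w_trait = np.ones(d)           # hypothetical "evil" component in the data

X = rng.normal(size=(200, d))
y = X @ (w_clean + w_trait)    # flawed data: clean signal plus trait

def fit(X, y, steering):
    # Least squares on the residual after subtracting the steering term.
    return np.linalg.lstsq(X, y - steering, rcond=None)[0]

w_naive = fit(X, y, steering=0.0)            # trait leaks into the weights
w_steered = fit(X, y, steering=X @ w_trait)  # trait supplied externally
```

At "deployment", the steered model predicts with `X @ w_steered` and no steering term, so the trait never made it into the weights; the naive model absorbed it.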

Using these persona vectors, the Anthropic team realized they could comb through a training dataset and predict what personality the AI would develop. In other words, an AI’s personality isn’t the black box we once thought it was. It can be predicted, if you analyze the data and inputs closely.

This kind of interpretability is essential for upcoming AI regulations. Governments around the world are beginning to mandate “AI safety and alignment” for high-risk systems. Having tools like persona vectors in your toolkit lets you prove, quantitatively, that your model isn’t secretly cooking up world domination plans.

In addition to the practical implications, there’s something almost poetic about this. To build trustworthy AI, we might have to teach it what “untrustworthy” looks like. It’s almost like teaching a child by showing them negative examples. It’s weirdly elegant.

At the end of the day, we’re not building saints. We’re building tools. But tools can go rogue, especially when they have intelligence (or something resembling it). Anthropic’s work gives us a roadmap — not just to detect when AI gets a little too flirty or a little too fascist, but to stop it from getting there in the first place.

Read the full paper here.

Tags: AI, AI ethics, personality, vaccine

Mihai Andrei

Dr. Andrei Mihai is a geophysicist and founder of ZME Science. He has a Ph.D. in geophysics and archaeology and has completed courses from prestigious universities (with programs ranging from climate and astronomy to chemistry and geology). He is passionate about making research more accessible to everyone and communicating news and features to a broad audience.

© 2007-2025 ZME Science - Not exactly rocket science. All Rights Reserved.
