ZME Science

Big Tech Said It Was Impossible to Create an AI Based on Ethically Sourced Data. These Researchers Proved Them Wrong

A massive AI breakthrough built entirely on public domain and open-licensed data

by Mihai Andrei
June 12, 2025
in News, Technology
Edited and reviewed by Zoe Gordon
Image credits: Alina Grubnyak.

AI, the technology that’s sweeping through the world right now, relies on vast datasets harvested from the open web. This includes copyrighted books, articles, forum posts, social-media content, and even private communications, all gathered without explicit permission from creators. Major players in the tech industry (OpenAI, Anthropic, and others) have explicitly argued that you can’t really build AI any other way. In testimony to the UK Parliament, OpenAI said:

“Because copyright today covers virtually every sort of human expression — including blogposts, photographs, forum posts, scraps of software code, and government documents — it would be impossible to train today’s leading AI models without using copyrighted materials.”

Lo and behold, scientists have now done the impossible: assembled a collection of public domain and openly licensed text large enough to train large language models.

Well, would you look at that

In late 2024, a team of researchers quietly began assembling something that big tech claimed couldn’t exist. It was, essentially, something mundane — and, paradoxically, revolutionary: a dataset. This dataset was built entirely from ethically sourced material — books whose copyrights had expired, educational resources made to be shared, open-source code, and transcripts of public-domain government documents.

Simply put, no scraping of social media, no pilfering from news sites, no legal gray areas. The result is the Common Pile v0.1 — an 8-terabyte collection of public domain and openly licensed text.

The Common Pile draws on 30 carefully vetted sources: government records, scientific articles, open educational books, StackExchange, and transcripts of YouTube videos released under Creative Commons licenses. All were double-checked to ensure their legal clarity.
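The vetting step can be pictured as a simple allow-list filter over license metadata. The sketch below is purely illustrative: the record format and license identifiers are assumptions for the example, not the Common Pile’s actual pipeline or schema.

```python
# Toy illustration of license vetting: keep only documents whose declared
# license appears on an explicit allow-list of open licenses.
# The schema and identifiers here are hypothetical, not the project's own.

OPEN_LICENSES = {
    "CC0-1.0",       # public domain dedication
    "CC-BY-4.0",     # Creative Commons, attribution required
    "CC-BY-SA-4.0",  # attribution + share-alike (e.g. StackExchange posts)
    "MIT",           # permissive license, for open-source code
}

def filter_openly_licensed(documents):
    """Return only the documents whose license is on the allow-list."""
    return [doc for doc in documents if doc.get("license") in OPEN_LICENSES]

docs = [
    {"id": "gov-report-17", "license": "CC0-1.0"},
    {"id": "news-article-3", "license": "all-rights-reserved"},
    {"id": "stackexchange-answer-9", "license": "CC-BY-SA-4.0"},
]

kept = filter_openly_licensed(docs)
print([d["id"] for d in kept])  # the all-rights-reserved article is dropped
```

In practice the team describes manual double-checking on top of any automated filter; a declared license field alone is not proof of legal clarity.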

It’s not easy to do, especially for a ragtag team without the resources and support of a big tech company. The team, which included researchers from EleutherAI, the University of Toronto, Hugging Face, and several other institutions, had to manually check, clean up, and reformat the dataset. It was an enormous amount of work, but in a few months, they managed to complete it.


To test whether this dataset could actually power a real AI, the team trained two models: Comma v0.1-1T and Comma v0.1-2T. Each has 7 billion parameters, the same size as Meta’s original LLaMA-7B model. They fed the models between one and two trillion tokens of text, roughly the equivalent of tens of millions of books. And then they tested them.

Compared to models trained with similar resources (7 billion parameters, 1 trillion tokens), Comma v0.1-1T is the strongest model on several standard benchmarks. The model even performed admirably on programming tasks.

It can’t keep up with ChatGPT, but it shows that it can be done

While impressive, the models trained on the Common Pile aren’t state of the art. The comparison was made against models that led the field a year or two ago, and the dataset is much smaller than what companies use today: ChatGPT, Claude, and Gemini are powered by models trained on tens of trillions of tokens, while the Common Pile provides only a couple of trillion. In other words, the Comma models perform on par with the state of the art from one or two years ago.
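A quick back-of-the-envelope calculation puts that scale gap in numbers. The figures below are the approximate counts mentioned above (“tens of trillions” versus a couple of trillion), not exact published training budgets.

```python
# Rough scale comparison, using the approximate token counts from the text.
frontier_tokens = 30e12    # "tens of trillions" -- an illustrative midpoint
common_pile_tokens = 2e12  # roughly what the Comma v0.1-2T model was fed

ratio = frontier_tokens / common_pile_tokens
print(f"Frontier models train on roughly {ratio:.0f}x more text")
```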

But here’s the thing. Big tech companies could have done this one or two years ago instead of scraping every bit of data they could get their hands on. Two dozen researchers did this in a few months as a side gig. Meta alone invests nearly $70 billion a year in AI. The narrative that “it would be impossible to train AI without using copyrighted materials” simply doesn’t hold up.

What this study shows is that it is possible to train AI on open data without crossing ethical boundaries. Companies could have done this, which makes their actual approach hard to defend.

You could still argue it’s somewhat unethical, since people who put their work into the public domain may not have wanted AI trained on it. But at the very least, it’s completely legal.

Companies can do better

For years, tech companies treated large-scale copyright scraping as unavoidable. Ethics schmetics, just get the data. When artists and journalists protested, the response was often technical fatalism: the models just wouldn’t work otherwise.

This research flips that narrative. It shows that legally sound data can produce impressive results. It doesn’t eliminate all challenges, but it charts a clear path forward.

The challenge now is scale. Competing with powerful systems like GPT-4 will require much more open, high-quality data — especially fiction, conversations, and informal language, which are currently lacking. But this study proves it can be done. With help from public institutions, nonprofits, and open-source projects, building larger and more ethical datasets is within reach.

The team behind Common Pile hopes others will contribute, expanding the dataset’s size and scope. They’re already planning future versions that include more conversational dialogue, fiction, and underrepresented languages — still entirely within the bounds of open licensing.

We should have no illusions that big tech companies will suddenly turn to open data and become ethical champions. But we have more reason to try to pull them in that direction.

In the end, the most radical thing about this work may be its restraint. In an industry driven by secrecy and scale, these researchers chose transparency and consent — and still built something powerful.

The study has not been peer-reviewed yet. You can access it freely on GitHub.

Tags: AI, dataset, ethics, large language model

Mihai Andrei

Dr. Andrei Mihai is a geophysicist and founder of ZME Science. He has a Ph.D. in geophysics and archaeology and has completed courses from prestigious universities (with programs ranging from climate and astronomy to chemistry and geology). He is passionate about making research more accessible to everyone and communicating news and features to a broad audience.

© 2007-2025 ZME Science - Not exactly rocket science. All Rights Reserved.
