

Big Tech Said It Was Impossible to Create an AI Based on Ethically Sourced Data. These Researchers Proved Them Wrong

A massive AI breakthrough built entirely on public domain and open-licensed data

Mihai Andrei
June 12, 2025 @ 8:17 pm


Image credits: Alina Grubnyak.

AI, the technology that’s sweeping through the world right now, relies on vast datasets harvested from the open web. This includes copyrighted books, articles, forum posts, social-media content, and even private communications — all of it gathered without explicit permission from creators. Major players in the tech industry (OpenAI, Anthropic, and others) have explicitly argued that you can’t really build AI any other way. In testimony to the UK Parliament, OpenAI said:

“Because copyright today covers virtually every sort of human expression — including blogposts, photographs, forum posts, scraps of software code, and government documents — it would be impossible to train today’s leading AI models without using copyrighted materials.”

Lo and behold, scientists have created the impossible: a collection of public domain and openly licensed text large enough to train Large Language Models.

Well, would you look at that

In late 2024, a team of researchers quietly began assembling something that big tech claimed couldn’t exist. It was, essentially, something mundane — and, paradoxically, revolutionary: a dataset. This dataset was built entirely from ethically sourced material — books whose copyrights had expired, educational resources made to be shared, open-source code, and transcripts of public-domain government documents.

Simply put, no scraping of social media, no pilfering from news sites, no legal gray areas. The result is the Common Pile v0.1 — an 8-terabyte collection of public domain and openly licensed text.

The Common Pile includes material from 30 carefully vetted sources, including government records, scientific articles, open educational books, StackExchange, and transcribed YouTube videos with Creative Commons licenses. All were double-checked to ensure their legal clarity.
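Conceptually, that vetting step boils down to keeping only material whose license is on an explicit allowlist of public-domain or open licenses. Here is a minimal illustrative sketch of that idea; the record format, license tags, and `vet_documents` function are hypothetical, not the team's actual pipeline:

```python
# Illustrative sketch: filter a document list against an allowlist of
# clearly open licenses. The records and license strings below are
# invented examples, not the Common Pile's real metadata schema.

OPEN_LICENSES = {"public-domain", "cc-by-4.0", "cc-by-sa-4.0", "mit"}

def vet_documents(docs):
    """Keep only documents whose license tag is on the open allowlist."""
    return [d for d in docs if d.get("license", "").lower() in OPEN_LICENSES]

docs = [
    {"id": "gutenberg-1342", "license": "public-domain"},
    {"id": "stackexchange-42", "license": "cc-by-sa-4.0"},
    {"id": "scraped-news-7", "license": "all-rights-reserved"},
]

kept = vet_documents(docs)
print([d["id"] for d in kept])  # ['gutenberg-1342', 'stackexchange-42']
```

In practice the hard part is not the filter itself but establishing that each source's license claim is trustworthy, which is why the team double-checked every source by hand.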

It’s not easy to do, especially for a ragtag team without the resources and support of a big tech company. The team, which included researchers from EleutherAI, the University of Toronto, Hugging Face, and several other institutions, had to manually check, clean up, and reformat the dataset. It was an enormous amount of work, but in a few months, they managed to complete it.

To test whether this dataset could actually power a real AI, the team trained two models: Comma v0.1-1T and Comma v0.1-2T. Each has 7 billion parameters, the same size as Meta’s original LLaMA-7B model. They fed the models between one and two trillion tokens of text — roughly the equivalent of tens of millions of books. And then they tested them.
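As a back-of-envelope sanity check on that scale (assuming a typical book runs on the order of 100,000 tokens, a rough illustrative figure, not one from the study):

```python
# Back-of-envelope: how many books is 2 trillion tokens?
# Assumes ~100,000 tokens per book, a rough illustrative average.
tokens_trained = 2_000_000_000_000   # 2 trillion tokens (Comma v0.1-2T)
tokens_per_book = 100_000            # assumed average book length

books_equivalent = tokens_trained // tokens_per_book
print(f"{books_equivalent:,} books")  # 20,000,000 books
```

In other words, the training run consumed text on the order of tens of millions of book-lengths — enormous by everyday standards, yet still far smaller than the corpora behind today's frontier models.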

Compared to models trained with similar resources (7 billion parameters, 1 trillion tokens), Comma v0.1-1T is the strongest model on several standard benchmarks. The model even performed admirably on programming tasks.

It can’t keep up with ChatGPT, but it shows that it can be done

While impressive, the models trained on the Common Pile aren’t state-of-the-art. ChatGPT, Claude, and Gemini are powered by models trained on tens of trillions of tokens, whereas the Common Pile supplies only a couple of trillion. Trained on that smaller dataset, Comma performs on par with models that were state-of-the-art a year or two ago.

But here’s the thing. Big tech companies could have done this one or two years ago instead of scraping every bit of data they could get their hands on. Two dozen researchers did this in a few months as a side gig. Meta alone invests nearly $70 billion a year in AI. The narrative that “it would be impossible to train AI without using copyrighted materials” just doesn’t stand.

What this study shows is that it is possible to train AI on open data without crossing ethical boundaries. Companies could have done this all along, and their choice not to is hard to defend.

You can still argue there is an ethical wrinkle: people who released their books into the public domain may not have wanted AI trained on them. But at the very least, the approach is completely legal.

Companies can do better

For years, tech companies treated large-scale copyright scraping as unavoidable. Ethics schmetics, just get the data. When artists and journalists protested, the response was often technical fatalism: the models just wouldn’t work otherwise.

This research flips that narrative. It shows that legally sound data can produce impressive results. It doesn’t eliminate all challenges, but it charts a clear path forward.

The challenge now is scale. Competing with powerful systems like GPT-4 will require much more open, high-quality data — especially fiction, conversations, and informal language, which are currently lacking. But this study proves it can be done. With help from public institutions, nonprofits, and open-source projects, building larger and more ethical datasets is within reach.

The team behind Common Pile hopes others will contribute, expanding the dataset’s size and scope. They’re already planning future versions that include more conversational dialogue, fiction, and underrepresented languages — still entirely within the bounds of open licensing.

We should have no illusions that big tech companies will suddenly turn to open data and become ethical champions. But we have more reason to try to pull them in that direction.

In the end, the most radical thing about this work may be its restraint. In an industry driven by secrecy and scale, these researchers chose transparency and consent — and still built something powerful.

The study has not yet been peer-reviewed. You can access it freely on GitHub.
