

Big Tech Said It Was Impossible to Create an AI Based on Ethically Sourced Data. These Researchers Proved Them Wrong

A massive AI breakthrough built entirely on public domain and open-licensed data

Mihai Andrei
June 12, 2025 @ 8:17 pm


Image credits: Alina Grubnyak.

AI, the technology that’s sweeping through the world right now, relies on vast datasets harvested from the open web. This includes copyrighted books, articles, forum posts, social-media content, and even private communications — all of it gathered without explicit permission from creators. Major players in the tech industry (OpenAI, Anthropic, and others) have explicitly argued that you can’t really build AI any other way. In testimony to the UK Parliament, OpenAI said:

“Because copyright today covers virtually every sort of human expression — including blogposts, photographs, forum posts, scraps of software code, and government documents — it would be impossible to train today’s leading AI models without using copyrighted materials.”

Lo and behold, scientists have created the impossible: a collection of public domain and openly licensed text large enough to train Large Language Models.

Well, would you look at that

In late 2024, a team of researchers quietly began assembling something that big tech claimed couldn’t exist. It was, essentially, something mundane — and, paradoxically, revolutionary: a dataset. This dataset was built entirely from ethically sourced material — books whose copyrights had expired, educational resources made to be shared, open-source code, and transcripts of public-domain government documents.

Simply put, no scraping of social media, no pilfering from news sites, no legal gray areas. The result is the Common Pile v0.1 — an 8-terabyte collection of public domain and openly licensed text.

The Common Pile includes material from 30 carefully vetted sources, including government records, scientific articles, open educational books, StackExchange, and transcribed YouTube videos with Creative Commons licenses. All were double-checked to ensure their legal clarity.
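Conceptually, that vetting step boils down to keeping only material whose license is on an explicit allowlist of public-domain or open licenses. Here is a minimal illustrative sketch of that idea; the record format, license tags, and `vet_documents` function are hypothetical, not the team's actual pipeline:

```python
# Illustrative sketch: filter a document list against an allowlist of
# clearly open licenses. The records and license strings below are
# invented examples, not the Common Pile's real metadata schema.

OPEN_LICENSES = {"public-domain", "cc-by-4.0", "cc-by-sa-4.0", "mit"}

def vet_documents(docs):
    """Keep only documents whose license tag is on the open allowlist."""
    return [d for d in docs if d.get("license", "").lower() in OPEN_LICENSES]

docs = [
    {"id": "gutenberg-1342", "license": "public-domain"},
    {"id": "stackexchange-42", "license": "cc-by-sa-4.0"},
    {"id": "scraped-news-7", "license": "all-rights-reserved"},
]

kept = vet_documents(docs)
print([d["id"] for d in kept])  # ['gutenberg-1342', 'stackexchange-42']
```

In practice the hard part is not the filter itself but establishing that each source's license claim is trustworthy, which is why the team double-checked every source by hand.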

It’s not easy to do, especially for a ragtag team without the resources and support of a big tech company. The team, which included researchers from EleutherAI, the University of Toronto, Hugging Face, and several other institutions, had to manually check, clean up, and reformat the dataset. It was an enormous amount of work, but in a few months, they managed to complete it.

To test whether this dataset could actually power a real AI, the team trained two models: Comma v0.1-1T and Comma v0.1-2T. Each has 7 billion parameters, the same size as Meta’s original LLaMA-7B model. They fed the models between one and two trillion tokens of text — roughly the equivalent of tens of millions of books. And then they tested them.
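As a back-of-envelope sanity check on that scale (assuming a typical book runs on the order of 100,000 tokens, a rough illustrative figure, not one from the study):

```python
# Back-of-envelope: how many books is 2 trillion tokens?
# Assumes ~100,000 tokens per book, a rough illustrative average.
tokens_trained = 2_000_000_000_000   # 2 trillion tokens (Comma v0.1-2T)
tokens_per_book = 100_000            # assumed average book length

books_equivalent = tokens_trained // tokens_per_book
print(f"{books_equivalent:,} books")  # 20,000,000 books
```

In other words, the training run consumed text on the order of tens of millions of book-lengths — enormous by everyday standards, yet still far smaller than the corpora behind today's frontier models.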

Compared to models trained with similar resources (7 billion parameters, 1 trillion tokens), Comma v0.1-1T is the strongest model on several standard benchmarks. The model even performed admirably on programming tasks.

It can’t keep up with ChatGPT, but it shows that it can be done

While impressive, the models trained on the Common Pile aren’t state-of-the-art. ChatGPT, Claude, and Gemini are powered by models trained on tens of trillions of tokens, whereas the Common Pile supplies only a couple of trillion. Trained on that smaller dataset, Comma performs on par with models that were state-of-the-art a year or two ago.

But here’s the thing. Big tech companies could have done this one or two years ago instead of scraping every bit of data they could get their hands on. Two dozen researchers did this in a few months as a side gig. Meta alone invests nearly $70 billion a year in AI. The narrative that “it would be impossible to train AI without using copyrighted materials” just doesn’t stand.

What this study shows is that it is possible to train AI on open data without crossing ethical boundaries. Companies could have done this all along, and their choice not to is hard to defend.

You can still argue there is an ethical wrinkle: people who released their books into the public domain may not have wanted AI trained on them. But at the very least, the approach is completely legal.

Companies can do better

For years, tech companies treated large-scale copyright scraping as unavoidable. Ethics schmetics, just get the data. When artists and journalists protested, the response was often technical fatalism: the models just wouldn’t work otherwise.

This research flips that narrative. It shows that legally sound data can produce impressive results. It doesn’t eliminate all challenges, but it charts a clear path forward.

The challenge now is scale. Competing with powerful systems like GPT-4 will require much more open, high-quality data — especially fiction, conversations, and informal language, which are currently lacking. But this study proves it can be done. With help from public institutions, nonprofits, and open-source projects, building larger and more ethical datasets is within reach.

The team behind Common Pile hopes others will contribute, expanding the dataset’s size and scope. They’re already planning future versions that include more conversational dialogue, fiction, and underrepresented languages — still entirely within the bounds of open licensing.

We should have no illusions that big tech companies will suddenly turn to open data and become ethical champions. But we have more reason to try to pull them in that direction.

In the end, the most radical thing about this work may be its restraint. In an industry driven by secrecy and scale, these researchers chose transparency and consent — and still built something powerful.

The study has not yet been peer-reviewed. You can access it freely on GitHub.
