Stochastic parrot? New study suggests ChatGPT plagiarizes beyond just "copy" and "paste"

In the few months since ChatGPT was introduced publicly, it’s taken the world by storm. It can produce all sorts of text-based content and has even passed exams that are challenging for humans. Naturally, students have taken notice. You can use ChatGPT to help you with essays and all sorts of homework and assignments, especially since the content it outputs isn’t plagiarized. Or is it?

According to a new study, language models like ChatGPT can plagiarize on multiple levels. Even when they don’t copy text verbatim from other sources, they can rephrase or paraphrase ideas without changing the meaning at all, which still counts as plagiarism.

“Plagiarism comes in different flavors,” said Dongwon Lee, professor of information sciences and technology at Penn State and co-author of the new study. “We wanted to see if language models not only copy and paste but resort to more sophisticated forms of plagiarism without realizing it.” Lo and behold, they do.

Being a university student nowadays can be pretty challenging. Plenty of things have changed since the pandemic lockdowns: universities face staff shortages and mental health problems, and there’s much more online work to do, which can be challenging in multiple ways. In addition to technical hurdles, like needing to own a laptop or computer with a stable enough internet connection, students have had to develop a complementary set of skills, particularly in terms of computer literacy. More and more, you need to know how to manage the online course management system, navigate through lectures and recordings, and edit and submit assignments and essays strictly digitally. A few years ago, you may have gotten away without using things such as Google Drive or a PDF editor, but nowadays that just doesn’t fly.

Understandably, students jumped at the opportunity of having an AI assistant do the work for them. At first glance, it seems safe to do because despite being trained on existing data, the AI produces new text which cannot be accused of plagiarism. Or so it would seem.

Lee and colleagues focused on identifying three forms of plagiarism:

verbatim, or direct copying;
paraphrasing, or rewording and restructuring content without citing the original source;
idea plagiarism, or using the main idea of a text without proper attribution.

All these are, in essence, plagiarism.
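To get a feel for why the first two forms call for different checks, here is a minimal Python sketch. It is not the method the researchers used; the function names, the five-word window, and the word-overlap measure are all assumptions made for illustration. The idea is that verbatim copying shows up as long runs of identical words, while a paraphrase can share almost no such runs and still reuse most of the same vocabulary.

```python
import re

def word_ngrams(text, n):
    """Return the set of lowercase word n-grams found in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def verbatim_overlap(generated, source, n=5):
    """Fraction of the generated text's n-grams that also appear in the source.

    Hypothetical helper: a high value hints at copy-and-paste reuse.
    """
    gen = word_ngrams(generated, n)
    if not gen:
        return 0.0
    return len(gen & word_ngrams(source, n)) / len(gen)

def word_jaccard(a, b):
    """Jaccard similarity of the two vocabularies.

    Hypothetical helper: a very crude proxy for paraphrase, since it
    ignores word order entirely.
    """
    wa, wb = word_ngrams(a, 1), word_ngrams(b, 1)
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0
```

A text that lifts a sentence wholesale would score high on verbatim_overlap, while a reshuffled version of the same sentence could score zero there and still score high on word_jaccard. Idea plagiarism would slip past both, which is part of why it is the hardest form to detect.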

Because the researchers couldn’t construct a pipeline for ChatGPT, they worked with GPT-2, a previous iteration of the language model. They used 210,000 generated texts to test for plagiarism “in pre-trained language models and fine-tuned language models, or models trained further to focus on specific topic areas.” Overall, the team found that the AI engages in all three forms of plagiarism, and the larger the dataset the model was trained on, the more often the plagiarism occurred. This suggests that larger models would be even more predisposed to it.

“People pursue large language models because the larger the model gets, generation abilities increase,” said lead author Jooyoung Lee, doctoral student in the College of Information Sciences and Technology at Penn State. “At the same time, they are jeopardizing the originality and creativity of the content within the training corpus. This is an important finding.”

It’s not the first time something like this has been suggested. A paper that came out just over a year ago and has already been cited over 1,300 times claims that this type of AI is a “stochastic parrot”: it simply parrots existing information without truly producing anything new.

It’s still early days for this type of technology, and much more research is required to understand problems such as this one, but companies seem eager to release it into the wild before these issues are understood. According to the study authors, the findings highlight the need for a closer look at the ethical conundrums that text generators pose.

“Even though the output may be appealing, and language models may be fun to use and seem productive for certain tasks, it doesn’t mean they are practical,” said Thai Le, assistant professor of computer and information science at the University of Mississippi who began working on the project as a doctoral candidate at Penn State. “In practice, we need to take care of the ethical and copyright issues that text generators pose.”

In the meantime, AI text generators are set to trigger an arms race. Plagiarism detectors are all over this: being able to detect ChatGPT shenanigans (or shenanigans from any generative AI) is valuable for ensuring academic integrity. But whether or not they will actually succeed remains to be seen. For now, the available tools don’t seem to do a good enough job.

Meanwhile, university students (and not only them) will continue to use ChatGPT for their assignments if they can get away with it. A new dawn of plagiarism may be upon us, and it’s not so easy to tackle.

The researchers will present their findings at the 2023 ACM Web Conference, which takes place April 30-May 4 in Austin, Texas.
