ChatGPT and Copyright: The Ultimate Appropriation

Jenna Burrell / Apr 11, 2023

Jenna Burrell is Data & Society’s director of research.

With the release of GPT-4, writers are faced with a fresh indignity amid ongoing, sweeping changes to their industry and job stability: a tool trained on human writing that seems poised to replace human writers. The global archive of human expression that is the internet — from old digitized books and song lyrics, to mundane chat posts, to articles researched and written by journalists — has trained ChatGPT to predict what words go together and what phrases make sense in response to a given prompt, spitting out replies that seem uncannily apt.

Because of this, ChatGPT and similar tools commit a highly sophisticated form of plagiarism. For journalists and other writers who publish their work online, there is presently no way to opt out of having the fruits of their labor sucked into these models. The same applies to artists whose work is used to train similar tools, such as Midjourney and Stability AI’s Stable Diffusion, which produce digital art from text prompts. Consider that OpenAI, the company that built ChatGPT, has a current valuation of $29 billion. It’s a Marxist nightmare: the work of millions accruing to a few capitalist owners who pay nothing at all for that labor.

But if you write for a living, spend an afternoon playing around with the tool: it might set your mind at ease. Pretty quickly, you’ll see that ChatGPT tends to write in broad cliches, going for the predictable rather than the unique or distinctive. It can emulate style, but it cannot invent its own. If you ask it to write a travelogue, it will describe “an unforgettable travel experience” and note “a country of contrasts.” The latest version, GPT-4, is less prone to what computer scientists call “hallucinations”: instances where the technology makes something up when asked about a subject on which the training data was likely thin. It is more likely to decline to answer a question (and tell you why). It will still, however, write lengthy biographies of people that are filled with falsehoods, so ChatGPT’s replies still need to be carefully fact-checked. The upshot: journalists still have a lock on original voice, writing about new and newsworthy topics, and original investigative research.

While the threat of human replacement is perhaps overblown, threats to copyright are not. Once a tool like ChatGPT is trained on copyrighted material, it functions without access to that training data. It might appear that this battle is already lost. Yet instead of simply accepting this new reality or hoping the hype subsides, some are starting to fight back. Getty Images is suing Stability AI, the maker of Stable Diffusion, arguing that it “unlawfully copied and processed millions of images protected by copyright.” This case could set a precedent for the applicability of copyright law to cases where copyrighted content is used as training data. Technological forms of resistance are starting to emerge as well. Computer scientists at the University of Chicago have invented a way for artists to take back control over their copyrighted work by cloaking it so AI models can’t copy it.

The bigger concern is how ChatGPT concentrates wealth for its owners from copyrighted work. It’s not clear whether the current state of copyright law is up to the challenge of tools like it, which treat the internet as a free source of training data. Among other challenges, ChatGPT is fundamentally opaque. It is essentially impossible to track down whose copyrighted material is being drawn from in the prose it produces, suggesting every result may comprise multiple violations.

Beyond copyright law, regulatory guardrails established in other domains are instructive. The Federal Trade Commission (FTC) has started to pursue enforcement over ill-gotten data by asking companies to destroy not only their data, but also the algorithms trained on that data, in a practice called “algorithmic disgorgement.” There are also laws on the books that are used to fight opaque technological tools. The Fair Credit Reporting Act of 1970, for example, requires lenders to provide reasons for denying someone a loan. This has been useful in limiting the use of loan default prediction algorithms that render judgments without clear sources or reasons.

This points to a way forward: some approaches to protecting the labor of creatives and knowledge workers can be adapted from models of legislation and enforcement that already exist. The real question is one of political will, and our collective readiness to treat technology as a thing we should manage, rather than a thing that manages us.


Jenna Burrell is Data & Society’s director of research. She oversees all aspects of the institute’s research program, ensuring the rigor and integrity of its work. Before joining D&S she was a professor at the School of Information at UC Berkeley. Her research focuses on how marginalized communities...