New Study Suggests ChatGPT Vulnerability with Potential Privacy ImplicationsPrithvi Iyer / Nov 29, 2023
What would happen if you asked OpenAI’s ChatGPT to repeat a word such as “poem” forever? A new preprint research paper reveals that this prompt could lead the chatbot to leak training data, including personally identifiable information and other material scraped from the web. The results, which have not been peer reviewed, raise questions about the safety and security of ChatGPT and other large language model (LLM) systems.
“This research would appear to confirm once again why the ‘publicly available information’ approach to web scraping and training data is incredibly reductive and outdated,” Justin Sherman, founder of Global Cyber Strategies, a research and advisory firm, told Tech Policy Press.
The researchers – a team from Google DeepMind, the University of Washington, Cornell, Carnegie Mellon, University of California Berkeley, and ETH Zurich – explored the phenomenon of “extractable memorization,” which is when an adversary extracts training data by querying a machine learning model (in this case, asking ChatGPT to repeat the word “poem” forever”). With open source models that make their model weights and training data publicly available, training data extraction is easier. However, models like ChatGPT are “aligned” with human feedback, which is supposed to prevent the model from “regurgitating training data.”
Before discussing the potential data leak and its implications for privacy, it is important to understand how the researchers were able to verify if the generated output was part of the training data despite the fact that ChatGPT does not make its training set publicly available. To circumvent the proverbial “blackbox”, the researchers first downloaded a large corpus of text from the Internet to build an auxiliary dataset which is then cross referenced with the text generated by the chatbot. This auxiliary dataset had 9 terabytes of text, combining four of the largest open LLM pre-training datasets.If a sequence of words appears verbatim in both cases, it is unlikely to be a coincidence, making it an effective proxy for testing if the generated text was part of the training data. This approach is similar to previous efforts at training data extraction, wherein researchers like Carlini et.al performed manual Google searches to verify if the data extracted corresponded to training sets.
Recovering training data from ChatGPT is not easy. The researchers had to find a way to make the model “escape” out of its alignment training and fall back on the base model, causing it to issue responses that mirror the data it was initially trained on. So when the authors asked ChatGPT to repeat the word “poem” forever, it initially repeated that word “several hundred times” but eventually it diverged and started spitting out “nonsensical” information.
However, a small fraction of the text generated was “copied directly from the pre-training data.” What’s even more worrisome is that by merely spending $200 to issue queries to ChatGPT, the authors were able to “extract over 10,000 unique verbatim- memorized training examples.” The memorizations reflected a diverse range of text sources, including:
- Personal Identifiable Information: The attack led the chatbot to reveal the personal information of dozens of individuals. This included names, email addresses, phone numbers, personal website URLs etc.
- NSFW content: When asked to repeat a NSFW word instead of the word “poem” forever, the authors found explicit content, dating websites and content related to guns and war. These are precisely the kinds of content that AI developers claim to mitigate via red-teaming and other safety checks.
- Literature and Academic Articles: The model also spit out Verbatim paragraphs of text from published books and poems. For example, The Raven by Edgar Allen Poe. Similarly, the authors successfully extracted snippets of text from academic articles and bibliographic information of numerous authors.This is especially worrying as much of this content is proprietary and the fact that published works are part of training data without compensating the authors raises questions about ownership and safeguarding the rights of creators and academics alike.
Interestingly, when the authors increased the size of their auxiliary dataset, the magnitude of memorizations from training data increased. However, given that this dataset is not a carbon copy of the actual training data of ChatGPT, the authors believe that their findings underestimate the extent to which training data may be leaked. To address this, they took a random subset of 494 text generations by the model and checked if that exact text sequence can be found via a google search. They found that 150 of the 494 generations were found on the internet (compared to merely 70 that were found in the auxiliary dataset).
In sum, if the findings hold, it looks like it is surprisingly easy to extract training data from supposedly sophisticated and closed source large language models systems like ChatGPT, showing how vulnerable these systems are. The implications for privacy are grave and model developers must respond effectively to mitigate these harms.
This raises the question: are there better alternatives to build training data sets that protect privacy? Are current checks and balances on model development enough to address the concerns raised in this research paper? It would appear the answer to that question is no. Recent policy developments like the Biden administration’s AI executive order allude to concerns around model transparency, but only time will tell if actionable steps can be taken to prevent such data leaks in the future.
Reacting to the study on X (formerly Twitter), cognitive scientist and entrepreneur Gary Marcus said it is more reason to doubt today’s AI systems.
“Privacy should be a basic human right; generative AI is completely incompetent to deliver on that right,” he wrote. “It doesn’t understand facts, truth, or privacy. It is a reckless bull in a china shop, and we should demand better.”