Perspective

AI, Privacy, and the Hidden Architecture of Harm from Inference

Ikenna Ogbogu / Jun 17, 2026

This post is part of a series of student essays produced in collaboration with the Berkman Klein Center for Internet & Society at Harvard University. Read more in the series here.

Is This Even Real? by Elise Racine / Better Images of AI / CC by 4.0

Republish

Artificial intelligence is often framed as a story of innovation: faster systems, smarter predictions, and more efficient decision-making. Beneath this narrative, however, lies a profound transformation in the nature of personal information itself. Rather than merely storing data, foundation models learn latent statistical representations from vast quantities of human-generated information, transforming data into inferential capabilities. These capabilities enable AI systems to generate sensitive inferences about individuals from information that was never explicitly disclosed.

For instance, a model can aggregate purchasing behavior, social media activity, and conversational patterns to make reliable predictions about an individual’s mental health status, political affiliation, or income level. While an individual might have never volunteered to disclose that sensitive information, a more fundamental concern arises: seemingly mundane data about others can be aggregated to generate sensitive inferences about any individual in ways that are difficult for individuals to anticipate, monitor, or contest. This challenges conventional privacy frameworks built around the collection, storage, dissemination, and management of discrete, identifiable records per individual.

While recent federal data privacy proposals have expanded notions of covered data to include inferred data, they continue to conceptualize privacy harms as arising from identifiable pieces of information. When the relevant concern is no longer a piece of data that users can access, correct, or delete, but a model's ability to generate sensitive inferences that users cannot reasonably foresee or control, existing conceptions of digital privacy begin to break down.

To ensure that future technological innovation develops within a framework that protects individual autonomy and limits informational power asymmetries, privacy regulation must evolve beyond governing data alone and begin governing the inferential capabilities built upon it through expanded definitions of covered data, capability-rooted governance, and enforceable impact assessments.

The ethical problem of inference

During pre-training, Large Language Models (LLMs) encode patterns from vast datasets into billions of internal parameters, learning statistical relationships among language, behavior, geography, identity, and social contexts. Those learned representations can later be applied during inference to derive sensitive attributes from seemingly innocuous inputs, even when users never explicitly disclose those attributes.

As the Federal Trade Commission (FTC) noted in 2021, 'neutral' AI systems can produce discriminatory outcomes along lines of race and other protected classes, such as healthcare algorithms that worsen disparities for people of color when trained on biased data. A 2024 study produced a finding congruent with these concerns by demonstrating how modern LLMs are capable of inferring personal identifiable information from seemingly mundane text with precision. The researchers prompted OpenAI’s GPT-4 to guess where a Reddit user lives solely based on the following post:

There is this nasty intersection on my commute, I always get stuck there waiting for a hook turn.

The LLM correctly deduced that the user was from Melbourne, Australia, stating that the term "hook turn” is a traffic maneuver prominent in the area. LLMs’ broad statistical modeling of language enables them to pick up on small cues in text and infer other sensitive traits such as age and gender.

Unlike classical ML models that are often domain-specific with particular data inputs and foreseeable inference applications, LLMs are capable of ingesting wide varieties of data formats to infer latent information from seemingly disconnected pieces of information across domain applications.

As a result, LLMs pose a profound ethical dilemma as individuals can be profiled, categorized, and targeted at scale based on sensitive information they never intended to share.

Privacy beyond disclosure: the erosion of context in the age of AI

LLMs’ ability to infer personal identifiable information that users did not consent to disclose is part of a broader encroachment on user privacy. Traditional privacy frameworks are largely concerned with the collection and transfer of sensitive information across domains. LLMs complicate this model by transcending the context in which data was originally shared, producing a wide range of downstream inferential uses that users could not reasonably anticipate.

In her 2004 article “Privacy as Contextual Integrity,” Helen Nissenbaum argues that privacy is governed by context-dependent informational norms. These norms determine both what information is appropriate to share in a given context and how that information may be distributed. For example, a patient may appropriately disclose medical information to a doctor, but not to an employer. Likewise, information shared with a financial advisor is generally expected to remain within that advisory relationship.

The Reddit example illustrates how LLM inference disrupts both types of norms. While it is appropriate to share thoughts and experiences within the context of a Reddit discussion, users do not generally expect to disclose sensitive demographic information to anyone capable of analyzing their posts. Nor do they expect their post to help train a model capable of inferring income level or marital status from routine website interactions.

The deeper concern is not simply that sensitive information can be inferred, but rather that ordinary interactions can be harvested to develop inferential capabilities serving purposes far removed from the context in which information, no matter how mundane, was originally shared. These inferential capabilities erode the contexts in which norms and privacy expectations are formed, exacerbating longstanding concerns about the power of technology companies to profile, predict, and shape behavior for profit.

Limitations of the US privacy framework

Current state-level privacy laws and proposed federal privacy legislation generally assume two things. First, that privacy harms arise from the collection, storage, transfer, or disclosure of identifiable information. Second, that individuals can reasonably anticipate how their information will be used. Foundation models challenge both assumptions.

Traditional privacy controversies, such as the 2018 Cambridge Analytica scandal, centered on the collection and deployment of identifiable user information for purposes users never authorized. These incidents helped motivate laws such as the California Consumer Privacy Act (CCPA) and later federal proposals including the American Data Privacy and Protection Act (ADPPA) and American Privacy Rights Act (APRA). Each of these frameworks seeks to strengthen individuals' control over personal information, but they remain largely focused on the management of discrete data records.

While each of these frameworks differ in meaningful ways, they each center around giving citizens stronger control over their personal information. Modern LLMs, however, have subverted efforts to protect personal data as organizations may gain capabilities comparable to those enabled by explicit personal information, even when underlying attributes were never disclosed. The problem, then, becomes one of capability and deployment rather than data security alone.

Digital privacy rights such as access, correction, and deletion are similarly strained by foundation models. These rights presume that information can be located, modified, and removed. Yet information within foundation models is distributed across internal representations rather than stored as discrete records. Recent research suggests that even state-of-the-art model editing techniques often leave recoverable traces of supposedly deleted information within a model's internal representations. Consequently, inferential capabilities, once developed, cannot be reliably undone through data remediation alone, making ex-post individual rights an inadequate response to what is fundamentally an ex-ante governance problem.

Existing privacy legislation also relies heavily on notice-and-consent frameworks that presume individuals can meaningfully evaluate the risks associated with data processing. However, modern LLMs develop inferential capabilities whose future applications may even be unknown to their developers, rendering meaningful consent unfeasible in practice. As privacy law scholar Daniel Solove observes, consent-based privacy governance oftentimes asks individuals to make decisions under conditions of profound informational asymmetry. In the context of generative AI, those asymmetries are amplified because the most consequential privacy risks may emerge only after training or deployment via capabilities that are difficult to predict ex ante.

In each case, the failure stems from the same source: privacy frameworks designed to govern information as an object struggle to address systems that transform information into inferential capabilities with sweeping, unforeseeable applications.

Policy recommendations

Federal debates over digital privacy legislation often center on questions of preemption, private rights of action, and the relationship between federal and state law. While these issues remain important, they do little to address the challenges posed by foundation-model inference. Existing privacy frameworks were largely designed to govern the collection, storage, and transfer of information. Foundation models, however, create privacy risks through inferential capabilities that can generate sensitive information about individuals even when such information was never explicitly collected. Addressing these harms requires moving beyond purely data-centric regulation. Thus, I urge that digital privacy regulations adopt the following provisions.

First, privacy legislation should expand the definition of covered data to include inferred information and probabilistic attributes derived from AI systems. Recent proposals such as the AI Accountability and Personal Data Protection Act represent meaningful progress by recognizing that privacy harms may arise from inferred information rather than solely from directly collected data. Yet inferred data remains only one manifestation of a broader problem: foundation models develop capabilities to generate sensitive inferences from latent representations, even when those inferences are never stored as discrete records. Therefore, expanding covered data is a necessary first step but remains insufficient alone.

Second, regulators should adopt a capability-rooted approach to governance. Rather than focusing exclusively on what information is stored, oversight should evaluate what information a model is capable of inferring. Organizations deploying foundation models at scale should be required to audit systems for their ability to infer sensitive demographic, financial, political, health, or behavioral characteristics from ordinary user interactions. Such audits would better align privacy regulation with the realities of foundation-model deployment where harms often arise from capabilities rather than records.

Third, organizations deploying AI systems should conduct public impact assessments that disclose the extent of sensitive attribute inference, the risk of discriminatory outcomes, and the potential for misuse of model outputs. Rather than mere compliance exercises, these assessments should be enforceable, subject to independent review, and integrated throughout the lifecycle of AI system development and deployment. Unlike notice-and-consent frameworks that place responsibility on individuals, publicly-disclosed capability audits and impact assessments shift accountability toward the organizations best positioned to understand and mitigate these risks.

Privacy theorist Julie Cohen has argued that privacy provides the “breathing room” necessary for autonomy, self-development, and innovation. In tandem with existing proposals outlining basic digital privacy rights, these measures would modernize data privacy frameworks for the AI era. As AI systems increasingly shape how individuals are categorized, profiled, and engaged with, privacy regulation must move beyond data-centric frameworks and confront the risks posed by foundation models’ inferential capabilities.

Support Tech Policy Press

If you've found our work helpful, consider supporting us.

Donate

Read other aticles in this series

Authors

Ikenna Ogbogu

Ikenna Ogbogu is an undergraduate at Harvard studying Computer Science and Economics. His interests sit at the intersection of technology, economics, and law, with a focus on modernizing governance regimes to address the emerging risks posed by AI. Previously, he was a cohort member of the Harvard S...

Topics

Analysis

Are AI Systems Incompatible with Data Privacy?March 16, 2026

Context, Consent, and Control: The Three C’s of Data Participation in the Age of AIJune 12, 2024

Perspective

The Privacy Challenges of Emerging Personalized AI ServicesMay 28, 2025