Perspective

How to Manage Misinformation in Large Language Models

Leah Ferentinos, Omri Tubiana, Arushi Saxena, J.J. Martinez-Layuno, Chris Miles / Feb 25, 2026

Search engines and other information retrieval tools that utilize large language models (LLMs) are growing rapidly. But their dependence on online data introduces a critical vulnerability: the open internet is now a highly adversarial space, where distinguishing fact from falsehood is incredibly difficult. From state-backed influence campaigns to commercial content farms, many actors are attempting to shape what LLMs “learn,” and thus what they portray as “facts.” These distortions—ranging from fraudulent financial content to coordinated political manipulation—pose growing risks to the epistemic and ethical integrity of AI systems and the greater information ecosystem.

Managing these risks is no longer merely a technical task; it is a trust and governance challenge central to AI’s legitimacy. Data collection has become a high-stakes trust problem: when misinformation, spam, or deliberate data poisoning enters a model’s training corpus, the system risks not only factual inaccuracies but also reputational and regulatory damage. Understanding how information is produced, distorted, and distributed online is therefore essential for anyone building or governing large-scale AI systems.

Many actors, from nation states to cybercriminals to corporations to small-scale hackers, have preferences about what LLMs should say and understand the potential upside—whether financial or political—of influencing or poisoning training datasets. The data that LLMs ingest from the open internet, whether from Wikipedia, Reddit threads, or news sites, can be heavily shaped by incentives misaligned with a model creator's goals.

For example, model training data pulled from the open internet can include widespread attempts to perpetuate fraud and sophisticated campaigns that undermine trust in medical institutions. Russian actors, for instance, have deliberately seeded major datasets with manipulative content in order to influence democratic discourse. Understanding how information collected or scraped from online sources can misrepresent reality—whether due to subtle algorithmic incentives or deliberate manipulation—is essential for anyone aiming to build models that accurately reflect the world today, particularly when early research suggests that a relatively small and constant number of documents can poison language models of any size.

Common Training Dataset Risk Scenarios

There are some common scenarios that threaten the quality and integrity of training datasets and can render them harmful. Below we outline three of these threat types—Medical Misinformation, Inauthentic Political Speech, and Data Voids—to anchor some of the debates treated later in this piece. These threats rarely appear in isolation, however. For instance, financial scams are often intertwined with medical misinformation or geopolitical manipulation, compounding the risk to model trust. And data voids—a threat that arises when very little information exists about a topic—can be exploited around political conspiracies.

Medical misinformation

A substantial industry exists that creates financial gain by intentionally undermining faith in scientifically grounded medicine. Furthermore, content questioning medical science tends to vastly outperform content asserting it on social media, often leading to these viewpoints being disproportionately represented in training data. This content may relate to unfounded medical theories, debunked pseudo-medical claims, and conspiracy theories. Even if such information is not directly regurgitated by a model, it can increase the frequency of subtle, authoritative-sounding, and potentially highly dangerous inaccuracies around medical topics.

Inauthentic political speech

It is generally important that model outputs reflect authentic political discourse, and interventions related to political topics should be undertaken with caution. It may also be important to consider the actions of sophisticated actors, such as foreign governments, who may be strongly incentivized to manipulate model outputs related to national elections or sensitive geopolitical topics. Analysis of data poisoning attacks indicates that these actors can introduce “backdoors” into a model—manipulations that are difficult to detect because they surface only when the LLM is prompted on a specific topic or under specific conditions (e.g., the week leading up to election day).

Data voids

When very little information exists about an obscure topic, either in training data or in search results, it is easy for adversarial groups to be the most authoritative sources on that topic. Research shows that adversarial groups work to fill “data voids” with deliberately misleading information in ways that may cause harm, whether by building out authoritative-looking websites around obscure terms or by moving quickly to deploy large amounts of information around a breaking news story. Contemporary events can also easily generate conspiracy theories and other false narratives without the involvement of adversarial actors.

Adversarial attempts to fill data voids can enter LLMs both through training data scraped from the open web and through LLMs’ attempts to search the web around breaking news. Special attention should be paid to these use cases—for example, looking for multiple authoritative sources, and flagging topics that are heavily discussed predominantly by relatively unknown or low-PageRank sources as potentially unreliable.
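
As a rough illustration of that kind of flagging heuristic, the sketch below marks a topic as potentially unreliable when too few of its sources clear an authority threshold. The record format, threshold, and authority scores are illustrative assumptions, not a production design.

```python
# Hypothetical sketch: flag topics whose coverage comes almost entirely from
# low-authority sources -- a possible signal of an exploited data void.
# 'source_authority' stands in for a PageRank-like score in [0, 1].

def flag_possible_data_void(docs, authority_threshold=0.5, min_high_authority=2):
    """docs: list of dicts with a 'source_authority' score.
    Returns True if too few high-authority sources cover the topic."""
    high_authority = [d for d in docs if d["source_authority"] >= authority_threshold]
    return len(high_authority) < min_high_authority

# A topic covered only by unknown sites is flagged for review:
obscure_topic = [{"source_authority": 0.1}, {"source_authority": 0.2}]
covered_topic = [{"source_authority": 0.9}, {"source_authority": 0.7}]
print(flag_possible_data_void(obscure_topic))  # True  -> treat as unreliable
print(flag_possible_data_void(covered_topic))  # False
```

In practice the authority signal would come from link-graph analysis or a curated allowlist rather than a hand-assigned score.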

These are just a few examples of adversarial scenarios that warrant discussion among teams engaged in model training. While valid arguments exist for choosing to act or choosing not to act in the above scenarios, a thorough decision-making process is critical to ensuring model quality and solidifying trust in model outputs. As model training processes become the target of a rapidly increasing spectrum of increasingly sophisticated actors, the risks of ignoring these adversarial scenarios will markedly increase.

Thankfully, once a policy has been established around data quality, training teams can leverage a robust set of tools to mitigate these risks.

Mitigation strategies for misinformation in LLMs

Addressing information integrity concerns requires a multi-faceted approach across the entire LLM development and deployment pipeline. The strategies can be broadly divided into three categories: data-centric mitigation, model-centric mitigation, and post-training mitigation.

Data-centric mitigation

The datasets fed into training pipelines can be filtered or weighted for quality.

  • Rigorous data curation: Before training, datasets should undergo a stringent filtering process to remove information deemed misaligned with the goals of training. For example, information about widespread financial scams may be included with appropriate context, but extensive repetition of these scams may be excluded to avoid overweighting. For sensitive topics, authoritative sources may be identified and given additional weight.
  • Automated filtering: Machine learning algorithms can automatically detect and flag patterns indicative of misinformation. For example, pages on medical topics that extensively cite low-impact-factor journals may be filtered or down-weighted.
  • Fact-checking integration: Integrate external knowledge bases and fact-checking mechanisms to cross-reference and validate information. For example, content about sub-Saharan Africa can be cross-referenced against on-the-ground fact-checking organizations such as AfricaCheck. Model training mechanisms that rely on these outputs for model quality should strongly consider financially supporting the work of these organizations.
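
To make the curation ideas above concrete, here is a minimal Python sketch that excludes documents dense with known scam phrasing, down-weights single occurrences, and up-weights an allowlist of authoritative domains. The phrase list, domain list, and weights are placeholder assumptions, not a recommended configuration.

```python
# Hypothetical curation sketch: assign each document a sampling weight based
# on scam-phrase density and source domain. All lists/weights are illustrative.

SCAM_PHRASES = {"guaranteed returns", "risk-free investment", "double your money"}
AUTHORITATIVE_DOMAINS = {"who.int", "nih.gov"}

def curation_weight(doc):
    """doc: dict with 'text' and 'domain'. Returns a sampling weight >= 0."""
    text = doc["text"].lower()
    hits = sum(phrase in text for phrase in SCAM_PHRASES)
    if hits >= 2:
        return 0.0                          # exclude heavy repetition of scam tropes
    weight = 1.0 if hits == 0 else 0.25     # down-weight a single occurrence
    if doc["domain"] in AUTHORITATIVE_DOMAINS:
        weight *= 2.0                       # up-weight trusted sources
    return weight

docs = [
    {"text": "Guaranteed returns! Double your money today.", "domain": "scamsite.example"},
    {"text": "Vaccination schedules explained.", "domain": "who.int"},
]
print([curation_weight(d) for d in docs])  # [0.0, 2.0]
```

A real pipeline would use trained classifiers rather than phrase lists, but the weighting structure—exclude, down-weight, up-weight—carries over.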

Model-centric mitigation

The training process can include steps that reduce the impact of harmful misinformation included in training data.

  • Adversarial post-training: Fine-tune the model using deliberately misleading prompts and reward it for identifying and refusing to generate false information. These prompts can be tailored to address the cases described above. For example, a set of prompts tied to popular tropes in financial fraud could ensure that models are not being leveraged by fraudulent actors as a source of validation.
  • Using Reinforcement Learning from Human Feedback (RLHF) to change incentives: This technique uses human feedback to align the model's outputs with values such as truthfulness and ethical policies, rewarding factually correct information and penalizing misinformation (as validated by third-party fact-checkers).
  • Confidence calibration: Train the model to express a measure of confidence in its answers, providing a probability score reflecting the likelihood that a claim is correct. One related signal is perplexity, which measures how predictable the model finds a sequence of tokens (with full confidence at one end and gibberish at the other)—though low perplexity indicates fluency, not necessarily factual accuracy. This helps users understand the reliability of the information.
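
As a concrete illustration of the perplexity signal, the sketch below computes it from per-token log-probabilities, which many LLM serving APIs expose. It assumes natural-log probabilities; again, low perplexity indicates fluent, predictable output rather than factual correctness.

```python
import math

# Perplexity as a confidence signal: the exponentiated average negative
# log-probability of the generated tokens. Lower = the model found its own
# output more predictable. Token log-probs below are illustrative values.

def perplexity(token_logprobs):
    """token_logprobs: natural-log probabilities of each generated token."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

confident = [-0.1, -0.2, -0.05]   # tokens the model found likely
uncertain = [-3.0, -4.5, -2.8]    # tokens the model found surprising
print(round(perplexity(confident), 2))  # 1.12
print(round(perplexity(uncertain), 2))  # 30.98
```

A calibration layer might surface such scores to users, or route high-perplexity answers to additional verification.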

Post-training mitigation

This category involves techniques that augment the model's capabilities in real-time or add a layer of checks and balances before the output reaches the user.

  • Retrieval-Augmented Generation (RAG): This is a powerful technique for reducing (though not eliminating) hallucinations. When a user asks a question, the system first retrieves relevant, up-to-date information from a verified external knowledge base. The LLM then uses this retrieved information as context to generate its answer. For example, questions about contemporary events could be answered based on input from a set of trusted (and ideally compensated) news outlets.
  • Specialized multi-agent systems: An LLM-powered multi-agent framework provides a systematic and scalable way to manage the entire misinformation lifecycle. This approach uses specialized, fine-tuned LLMs to power agents for different tasks—from classification and indexing to correction and verification—which enhances robustness and allows for easier upgrades. For example, questions related to health could be routed to an LLM specifically trained for that purpose. The limited scope of each agent makes it easier to optimize a model without creating tradeoffs around other topics.
  • Post-processing and guardrails: Implement a final layer of checks on the LLM's output. This could include:
    • Fact-checking APIs: Pass the LLM's response to an external fact-checking service to verify its claims.
    • Citation and source verification: Train the model to provide citations for its claims and verify that those citations are real and relevant.
    • Disclaimers: For sensitive topics like medical or financial advice, add clear disclaimers to the output.
  • Human-in-the-loop oversight: For highly sensitive applications, another approach is to ensure that LLM outputs are reviewed and verified by a human expert before they are published or used in a high-stakes decision-making process. This, in turn, provides insight into the prevalence of problems, as well as high-quality data for model re-training and improvement in subsequent versions.
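
The retrieval-augmented flow described above can be sketched in a few lines: retrieve the best-matching documents from a trusted corpus, then ground the prompt in them. The toy corpus, term-overlap scoring, and prompt template are illustrative assumptions; real systems use embedding-based search over a vetted (and ideally compensated) source set.

```python
# Minimal RAG sketch: term-overlap retrieval over a small trusted corpus,
# followed by prompt construction that restricts the model to those sources.

TRUSTED_CORPUS = [
    "The central bank raised interest rates by 25 basis points on Tuesday.",
    "Health officials approved the updated vaccine for the fall season.",
]

def retrieve(question, corpus, k=1):
    """Rank documents by shared lowercase terms with the question."""
    q_terms = set(question.lower().split())
    scored = sorted(corpus,
                    key=lambda d: len(q_terms & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(question, corpus):
    """Ground the model's answer in the retrieved context only."""
    context = "\n".join(retrieve(question, corpus))
    return (f"Answer using ONLY the sources below; say 'not found' otherwise.\n"
            f"Sources:\n{context}\nQuestion: {question}")

prompt = build_prompt("Did the central bank raise interest rates?", TRUSTED_CORPUS)
print("central bank raised interest rates" in prompt)  # True
```

The instruction to answer only from the provided sources is what makes retrieval reduce, rather than merely relocate, hallucination risk.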

Tools to address data quality threats exist throughout the model training lifecycle; the challenge is often deciding when to use them. External fact-checking and human review of outputs are expensive, as is the curation of large sets of training data. Along with threat modeling and red-teaming exercises, benchmarking techniques can be a useful way to understand how to deploy these limited resources.

Benchmarking

Benchmarking refers to testing and comparing models to understand how well they perform on different tasks. Benchmarks are the primary way by which the strengths and weaknesses of models are evaluated, and consist of standardized tests, metrics, datasets, and even informal exercises (e.g., the Will Smith Eating Spaghetti Test) that, over time, assist in judging different aspects of a model's performance.

Most existing benchmarks, whether measuring legal reasoning (the Bar exam), broad capabilities (e.g., HELM), or harmful content (toxicity detection), focus almost entirely on model outputs. However, examining training datasets is also crucial, given that the quality and diversity of the data a model was trained on is a key driver of its performance.

One of the most important aspects of benchmarking is data contamination. A benchmark is only reliable if there are guarantees that a model has not been exposed to the data being used to benchmark it. Otherwise, performance metrics resulting from benchmarking tests would be artificial, reflecting less on actual model capability and more on ‘memorization’ of correct testing responses.

When it comes to misinformation in the AI product lifecycle, there are two main areas where mis- and disinformation can be introduced or propagated. First, it can appear in the model output, the result of token generation prompted by a user. Second, it can be found in the dataset on which the model was originally trained and tested. In the context of data quality, then, it is useful to distinguish between output-level benchmarks, which help mitigate issues with the information generated by a model (model-centric mitigation), and dataset-level benchmarks, which help mitigate issues within the data used to train the model (data-centric mitigation). Today, there are a few major benchmarks we can consider:

  • FEVER (Fact Extraction and VERification) (output-level benchmark): This is a dataset of human-generated claims used to measure model performance in verifying claims against evidence from Wikipedia, with labels of Supported, Refuted, or Not Enough Info. It tests the model's ability to perform evidence-based fact-checking.
  • Misinformation Collections (e.g., CDL-MD) (dataset-level benchmark): These collections of misinformation datasets (claims, news articles, and social media posts) are used to evaluate models' ability to detect misinformation, providing a unified resource for misinformation detection research. The collections are often assessed for quality concerns like spurious correlations or ambiguous examples.
  • VLDBench (Vision-Language Disinformation Detection Benchmark) (output-level benchmark): A comprehensive benchmark for detecting disinformation in multimodal content (text and images), which is becoming increasingly relevant for LLMs and Vision-Language Models (VLMs).
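
To illustrate how an output-level benchmark of this kind is scored, the sketch below computes label accuracy for FEVER-style verdicts: a model's predicted label for each claim is compared against the gold label. The claims and predictions here are toy placeholders, not real benchmark data.

```python
# Toy scoring loop for a FEVER-style benchmark: each claim receives one of
# three verdicts, and the metric is the fraction of exact label matches.

LABELS = ("SUPPORTED", "REFUTED", "NOT ENOUGH INFO")

def label_accuracy(gold, predicted):
    """Fraction of claims where the predicted verdict matches the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must be the same length")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

gold = ["SUPPORTED", "REFUTED", "NOT ENOUGH INFO", "REFUTED"]
pred = ["SUPPORTED", "REFUTED", "REFUTED", "REFUTED"]
print(label_accuracy(gold, pred))  # 0.75
```

The real FEVER metric additionally requires the correct evidence sentences to be retrieved, which makes it stricter than plain label accuracy.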

When adopting misinformation benchmarks, it is important to treat them as a useful tool for building model quality but a very imprecise indicator of that quality. As models scale, there is anecdotal evidence of data contamination—benchmark responses leaking into training datasets—which makes benchmarks unreliable tools (e.g., publicly released questions from the Bar Exam).

Models are, ultimately, only useful when they can be trusted. The enormous economic upside of AI evaporates if trust evaporates, or is never established. A few news stories about a system actively advancing financial fraud, parroting the talking points of malicious actors, or regurgitating medical misinformation with tragic consequences could be enough to irreparably undermine that trust. Moreover, data-contaminated models that perform extremely well on benchmarks may contribute to unfounded and inflated claims about performance, fueling “hype” around capabilities and an unhealthy, not evidence-based public conversation about AI.

This creates a powerful opportunity for model builders willing to invest seriously in data quality. In a world where most products are struggling to respond to increasingly sophisticated data poisoning attacks and losing market trust as a result, being one of the few models that understands the adversarial landscape and remains trustworthy could be a tremendous source of competitive advantage. It is our hope that the tools and frameworks described here can be useful to those willing to make that investment. Doing so is critical both to building AI products that succeed and to feeling proud of the AI products that we build.

Authors

Leah Ferentinos
Leah Ferentinos is an AI Governance Researcher and Strategic Advisor to All Tech is Human. She previously managed Trust & Safety Risk & Compliance initiatives at KPMG & served as a Policy Manager at Meta, driving information integrity and global elections policy development for Facebook, Instagram, ...
Omri Tubiana
Omri Tubiana is a trust and safety leader with 10+ years of experience setting strategy and building scalable operations and policy across startups and $1B+ tech companies. His expertise includes organization building and leadership at scale, misinformation mitigation, AI-enabled T&S solutions, and b...
Arushi Saxena
Arushi Saxena is a trust and safety expert with experience spanning the White House, Twitter, and multiple leading AI firms.
J.J. Martinez-Layuno
J.J. Martinez-Layuno is a digital governance specialist shaping safety policies for online platforms, with product counseling experience at major technology companies.
Chris Miles
Chris is an expert in digital media and the information space. Having worked at Meta for 6+ years, he’s led on major misinformation policy decisions, advised journalists and academics on digital media, and launched digital transparency tools like CrowdTangle. He has managed the most high-touch busin...
