Perspective

A Counterproposal to Google's Approach to Anonymizing Search Data

Paul Francis, Andreas Dewes / Apr 28, 2026

Editor's note: Tech Policy Press received charitable donations from DuckDuckGo in 2024 and 2025.

On both sides of the Atlantic, regulators and courts have concluded that Google's dominance in search is self-reinforcing: its unmatched volume of user queries fuels ranking improvements that no competitor can replicate without access to comparable data. For this reason, they have determined that Google must share anonymous search data with competitors who cannot generate such information at scale. In the European Union, Article 6(11) of the Digital Markets Act requires search gatekeepers to provide anonymous “ranking, query, click and view data.” In the United States, a court is requiring Google to share raw “user-side data” used to power its core ranking models. The utility of this information to help rivals build better search products is the entire purpose of this obligation.

Google, however, has responded to these obligations in a manner that an expert witness for the US Department of Justice characterized as the approach “one would use if you didn't want to release high utility data.” Its European Search Dataset Licensing Program applies two aggressive thresholds that eliminate distinct queries. First, a query must have been searched by at least 30 signed-in users worldwide in the preceding 13 months. Second, the same combination of query, result, device, and country must appear from at least five unique signed-in users in a given quarter. As a result, we estimate that Google's dataset omits more than 99% of distinct search queries and excludes roughly 42% of its total query volume. What remains are the most common queries that every search engine already sees. The rare and uncommon queries that could help competitors improve the quality of their results are stripped away entirely, leaving a dataset devoid of utility.
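To make the effect of these two thresholds concrete, the following sketch applies them to a stream of search records. The field names and record layout are hypothetical illustrations of the public description above, not Google's actual pipeline.

```python
# Sketch of Google's two disclosure thresholds as described above.
# Record fields ('query', 'result', 'device', 'country', 'user_id')
# are illustrative assumptions, not Google's real schema.
from collections import defaultdict

GLOBAL_USER_MIN = 30  # distinct signed-in users per query, trailing 13 months
COMBO_USER_MIN = 5    # distinct users per (query, result, device, country) per quarter

def google_style_filter(records):
    """Keep only records whose query and metadata combination clear
    both user-count thresholds."""
    users_per_query = defaultdict(set)
    users_per_combo = defaultdict(set)
    for r in records:
        users_per_query[r["query"]].add(r["user_id"])
        combo = (r["query"], r["result"], r["device"], r["country"])
        users_per_combo[combo].add(r["user_id"])
    return [
        r for r in records
        if len(users_per_query[r["query"]]) >= GLOBAL_USER_MIN
        and len(users_per_combo[(r["query"], r["result"],
                                 r["device"], r["country"])]) >= COMBO_USER_MIN
    ]
```

Note that any query issued by even 29 distinct users is discarded wholesale, regardless of whether it contains anything identifying.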

Google’s approach rests on a tacit assumption: that any query not entered by at least 30 separate users is inherently identifiable. This is far too broad. Consider a query like “Do Kellogg’s or General Mills products contain GMO ingredients?” Even if no one else has searched for this exact string, millions of people could plausibly compose it. There is no reasonable basis to conclude that a single query like this can be traced back to the individual who typed it. Google's methodology treats rarity as a proxy for identifiability, but the two are fundamentally different concepts.

Joint guidelines issued last October by the European Commission and the European Data Protection Board reinforce this point. They clarify that anonymization under Article 6(11) should remove queries that contain identifiable personal data while preserving those that merely reference public figures or widely known information. The guidelines further acknowledge that effective anonymization is not purely a technical exercise. It is a risk-management framework that combines technical measures with legal controls, organizational safeguards, and contractual protections. Proportionate anonymization does not require eliminating every theoretical possibility of re-identification. Rather, re-identification should be difficult enough that the cost of large-scale data mining far outweighs any benefit, and the probability of identifying a specific individual is low enough to discourage any reasonable attempt.

Now is the time to discuss what proportionate anonymization looks like. The European Commission recently issued preliminary findings as part of a specification proceeding into the search company’s Article 6(11) compliance. Stateside, a court-ordered technical committee is expected to recommend privacy safeguards for sharing user-side data later this year. Both proceedings make the question of how to appropriately anonymize search data urgent and consequential.

Some may look to differential privacy as a solution, but it is not well-suited to datasets of this complexity; it is not possible to retain utility here while maintaining a meaningful privacy guarantee. The dataset recently anonymized by the US Census Bureau using differential privacy was far simpler (hundreds of tokens versus millions), and even so, in order to obtain acceptable utility, the Bureau settled on an epsilon (a measure of how much privacy loss a system permits) of around 40, which in terms of formal mathematical privacy is almost indistinguishable from no anonymization whatsoever. For this class of data, targeted filtering combined with robust legal and organizational safeguards offers a more practical and proportionate path forward.

To demonstrate that a better approach exists, we offer an alternative anonymization methodology evaluated on a sample of DuckDuckGo queries. Rather than Google's chainsaw approach of eliminating any query below an arbitrary popularity threshold, we propose a scalpel. We rely on pseudonymization plus three complementary filtering steps that, combined, remove fewer than 5% of distinct queries while addressing all potentially problematic cases identified in a manual review of thousands of queries.

Our pseudonymization approach strips identifiers such as cookies and Google IDs from the dataset and coarsens timestamps to 24-hour granularity. This step alone would have prevented the well-known re-identification attack on the AOL search logs.

The first filtering step is targeted filtering of known personal information.

Using a combination of regular expression matching and machine learning-based named entity recognition, we identify and remove queries containing, for example, email addresses, phone numbers, Social Security numbers, credit card numbers, full street addresses, and similar identifiers.
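The regular-expression half of this step can be sketched as follows. The patterns below are deliberately simplified illustrations, not production-grade validators, and the machine-learning named entity recognition pass (e.g., a spaCy NER model) is omitted here.

```python
# Known-identifier filter sketch: regex matching for structured PII.
# Patterns are simplified illustrations; a real system would pair them
# with an NER model for names and addresses.
import re

PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b(?:\+?1[ .-]?)?\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def contains_known_pii(query: str) -> bool:
    """True if any known-identifier pattern matches the query."""
    return any(p.search(query) for p in PATTERNS.values())

def filter_known_pii(queries):
    """Drop queries containing a known-format identifier."""
    return [q for q in queries if not contains_known_pii(q)]
```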

A key finding from our analysis is that such personal information is relatively rare in search queries. In manual and automated reviews of DuckDuckGo's query data, personal information appeared in only approximately 1% of queries. Of queries containing names, the vast majority were searches about public figures rather than searches that could identify the person typing the query. Truly problematic personal information, where a potential identifier appeared in a context that could plausibly enable re-identification of the searcher, was rarer still.

The second step targets unknown or unexpected identifiers by filtering queries that contain uncommon words.

This is where we believe our approach diverges most sharply from Google's. Rather than applying a frequency threshold to the entire query string, we tokenize each query into individual words and check them against a dictionary built from search data over an extended time window. Any query containing a word that falls below a minimum occurrence threshold is removed. Some identifiers may appear in formats that known-pattern filters do not anticipate, and some sensitive information, such as an accidentally pasted password, will simply not appear in searches from multiple users. Crucially, this word-level approach preserves rare and uncommon queries composed entirely of common words, which is precisely the data that competitors need.
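The word-level filter can be sketched as below. The tokenizer and the minimum-occurrence threshold are illustrative choices; in practice the dictionary would be built from a much larger corpus over an extended window.

```python
# Word-level rarity filter sketch: build a token dictionary from a query
# corpus, then drop any query containing a token seen fewer than
# MIN_TOKEN_COUNT times. Tokenizer and threshold are illustrative.
import re
from collections import Counter

MIN_TOKEN_COUNT = 3  # a real deployment would use a far larger threshold

def tokenize(query: str):
    return re.findall(r"[a-z0-9']+", query.lower())

def build_dictionary(corpus):
    """Count token occurrences across the full corpus."""
    counts = Counter()
    for q in corpus:
        counts.update(tokenize(q))
    return counts

def filter_rare_tokens(queries, dictionary):
    """Keep only queries whose every token clears the threshold."""
    return [
        q for q in queries
        if all(dictionary[t] >= MIN_TOKEN_COUNT for t in tokenize(q))
    ]
```

Note that a never-before-seen query built entirely from common words passes this filter, while a query containing a one-off token (such as a pasted password) is removed.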

The third step generalizes metadata. Click and query data is most useful when accompanied by approximate location, device type, derived language, and an approximate timestamp. We propose applying a conservative k-anonymity threshold of 1,000 users to metadata combinations. If a given combination of device type, language, country, and timestamp is observed from fewer than 1,000 users, its attributes are generalized until the threshold is met.
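A simple greedy form of this generalization is sketched below. The coarsening order is an illustrative assumption, the timestamp dimension is omitted for brevity, and attributes are fully suppressed rather than coarsened stepwise (e.g., country to region); the article proposes k = 1,000.

```python
# k-anonymity generalization sketch for metadata combinations. Attributes
# are suppressed in a fixed order until every combination covers at least
# k users. Order and suppression strategy are illustrative simplifications.
from collections import defaultdict

GENERALIZE_ORDER = ["language", "device", "country"]
COMBO_FIELDS = ("device", "language", "country")

def generalize(records, k=1000):
    """records: list of dicts with 'user_id' plus metadata fields.
    Returns copies with rare combinations progressively suppressed."""
    recs = [dict(r) for r in records]
    for field in GENERALIZE_ORDER:
        users = defaultdict(set)
        for r in recs:
            users[tuple(r[f] for f in COMBO_FIELDS)].add(r["user_id"])
        for r in recs:
            if len(users[tuple(r[f] for f in COMBO_FIELDS)]) < k:
                r[field] = "*"  # suppress this attribute for rare combos
    return recs
```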

Applied together, these three steps result in the exclusion of less than 10% of DuckDuckGo’s English-language searches. The exclusion rate scales downward with larger datasets. At Google’s query volume, we estimate approximately 2% of searches would be filtered. The combination of our filtering techniques successfully detected and removed every problematic query identified in our manual review sample.

Our analysis is necessarily limited by being conducted on DuckDuckGo’s datasets. A full privacy analysis requires access to data only Google holds, and Google should be required to provide the additional context and data necessary to validate and refine these techniques. What our work demonstrates, however, is that regulators need not sacrifice data utility to protect user privacy. Targeted, proportionate anonymization can protect users while preserving the competitive value of search data. Google’s claim that effective anonymization requires gutting 99% of distinct queries is not supported by the evidence. It is time for regulators on both sides of the Atlantic to demand better.

Authors

Paul Francis
Paul Francis is Director Emeritus at the Max Planck Institute for Software Systems, and founding member of the Open Diffix project. His research focuses on the privacy issues surrounding online tracking – specifically, designing and building systems that enable advertising and analytics while protec...
Andreas Dewes
Andreas Dewes is a trained physicist with a PhD in experimental quantum computing. He is a data scientist and Python developer. He currently works as a Privacy Engineer at DuckDuckGo. His experience in building secure web applications stems from his work as lead developer at the German Cyber Securit...
