Policy Is Urgently Necessary to Enable Social Media Research

Dylan Baker / Dec 20, 2023

Jamillah Knowles & Reset.Tech Australia / Better Images of AI / People with phones / CC-BY 4.0

As a real-time crisis unfolds in Gaza, a recent Politico report found that Israeli advocacy groups have spent millions of dollars on social media ads, roughly 100 times the spending of groups aligned with Palestinians, Muslims, and Arabs. It is only the latest in a wave of reporting about the ever-growing information war playing out online as the crisis continues.

As these critical reports come out, it can be easy to forget one looming threat: social media companies make it extremely difficult for anyone to study them, and they could easily make it worse.

As an expert in tech ethics and a research engineer who has spent years studying social media, I know that researching and reporting on social media platforms is a heavy lift. With massive amounts of content publicly posted and uploaded every day, it takes a lot of coding, math, and sheer computing power (read: money) to process all that data and derive meaningful insights from it. It can take social media researchers—the journalists, academics, and nonprofits that study the algorithms of social media platforms—months, even years, to map out the full scale of disinformation campaigns or discern how social media interacts with political polarization.

It’s clear to anyone working in this space that social media research stands on shaky ground, and we’re running out of time to protect it. In the US, social media researchers are under fire from many directions, including hostile politicians and platform operators. Pressure on independent researchers is mounting in the run-up to the 2024 US presidential election, even as concerns about social media and elections are at an all-time high.

It is true that most large social media companies offer tools that researchers can use to study them, known as Application Programming Interfaces (APIs). These research-specific APIs allow researchers to search and download limited amounts of data from their platforms. These are a big deal: researchers can use them to download every post containing “#israel” or “#palestine”, or to read every post tagged near the time and place of an important protest.
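As a rough illustration of what that access looks like in practice, here is a minimal Python sketch of building such a hashtag query. The endpoint, base URL, and parameter names are hypothetical placeholders, not any platform's actual research API, though real research APIs (such as the former Twitter academic API) follow a similar query-and-time-window pattern.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: illustrative only, not a real platform's API.
API_BASE = "https://api.example-platform.com/v1/research/search"

def build_hashtag_query(hashtag: str, start: str, end: str,
                        page_size: int = 100) -> str:
    """Build a search URL for every public post containing a hashtag
    within a time window (parameter names are assumptions)."""
    params = {
        "query": f"#{hashtag.lstrip('#')}",
        "start_time": start,      # ISO 8601 timestamps
        "end_time": end,
        "max_results": page_size,
    }
    return f"{API_BASE}?{urlencode(params)}"

url = build_hashtag_query("israel", "2023-10-07T00:00:00Z",
                          "2023-12-20T00:00:00Z")
print(url)
```

In a real workflow, this URL would be sent with an approved credential attached, and the researcher would page through results until the window is exhausted.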

But here is the catch: to access these research APIs in the first place, researchers must submit lengthy, in-depth applications to social media companies. On its face, this seems reasonable; lax data policies, after all, bear some responsibility for scandals like Cambridge Analytica. But the balance of power is concerning: the companies themselves are the arbiters of which research is deemed acceptable. It’s plainly stated in their terms of service. YouTube’s, for example, reads: “You may only use Program Data for research on topics approved by YouTube.”

These extensive API Terms of Service give companies an enormous amount of control over the work of social media researchers. To access data via these research APIs, one often has to agree to stipulations like sending one’s research directly to the company before publishing it and giving the company the right to access and use one’s work in perpetuity.

The restrictions go on. Some platforms, like TikTok, only allow researchers in the US and Europe to apply for access to their API. A Palestinian or Israeli researcher will have a much harder time studying misinformation about the crisis in Gaza than their American colleagues thousands of miles away.

For researchers who do get access to these research APIs, there is no guarantee of complete access to a platform’s publicly available data. Anecdotal evidence from the research community has revealed countless instances where data seems to be missing or incomplete: the same query might return different slices of data on different days, or a platform will indicate that it has 10,000 posts with a certain keyword but only let you see 6,000 of them. And because API access works by giving researchers “keys” that function like ID cards, a company can record exactly which researcher is accessing which data. Many individuals study topics that might be sensitive or unflattering to a platform; nothing is stopping companies from shutting down their access altogether.
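The gap between what a platform reports and what it actually serves can be simulated in a few lines. Everything below is a toy stand-in for a hypothetical paginated endpoint; no real platform's API is being modeled.

```python
# Illustrative simulation: an API that reports a total hit count but stops
# serving pages before you reach it, so a downloaded corpus silently falls
# short of the stated total.
def fake_search_page(page: int, page_size: int = 1000):
    """Pretend API: claims 10,000 matching posts, but only 6,000
    are actually retrievable."""
    reported_total = 10_000
    available = 6_000
    start = page * page_size
    posts = [f"post-{i}" for i in range(start, min(start + page_size, available))]
    return {"total": reported_total, "posts": posts}

def download_all(page_size: int = 1000):
    posts, page = [], 0
    while True:
        resp = fake_search_page(page, page_size)
        if not resp["posts"]:
            break
        posts.extend(resp["posts"])
        page += 1
    return resp["total"], posts

total, posts = download_all()
print(f"API reported {total} posts; retrieved {len(posts)}")
# → API reported 10000 posts; retrieved 6000
```

A researcher who trusts the reported total has no way, from the API alone, to tell whether the missing 4,000 posts were deleted, filtered, or simply never served.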

A sudden API shutdown isn’t merely a hypothetical. Prior to Elon Musk’s takeover of X, the platform formerly known as Twitter, the company provided researchers access to download up to 10 million tweets per month. It’s hard to overstate just how critical it was to the research community to be able to access Twitter data at this volume: a quick search on Google Scholar for “Twitter corpus” yields endless papers that give us glimpses into our social movements; our linguistic tendencies; our collective responses to global events. A significant amount of what is known about the impact of social media on society was made possible by the company providing access at this scale.

But then, seemingly on a whim, research access was cut off, forcing many researchers to drastically alter their work or abandon it altogether.

It’s no wonder many researchers and journalists break platforms’ terms in order to do their work, or circumvent the official APIs altogether. Many turn to web scraping, which involves using digital tools to download data from websites. While this avoids API limitations, it comes with significant challenges: platforms frequently make changes that break web scraping tools, making web scraping a brittle and labor-intensive process.
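A minimal sketch of that scraping approach, using only Python's standard library: the sample HTML and the "post-text" class name are assumptions standing in for a real fetched page, and a platform renaming that markup is exactly the kind of change that breaks a scraper in practice.

```python
from html.parser import HTMLParser

# Stand-in for a fetched page; real scrapers download this over HTTP.
SAMPLE_PAGE = """
<html><body>
  <div class="post"><p class="post-text">First public post</p></div>
  <div class="post"><p class="post-text">Second public post</p></div>
</body></html>
"""

class PostExtractor(HTMLParser):
    """Collect text inside <p class="post-text"> elements."""
    def __init__(self):
        super().__init__()
        self.in_post = False
        self.posts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "p" and ("class", "post-text") in attrs:
            self.in_post = True

    def handle_data(self, data):
        if self.in_post:
            self.posts.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_post = False

parser = PostExtractor()
parser.feed(SAMPLE_PAGE)
print(parser.posts)  # ['First public post', 'Second public post']
```

If the platform ships a redesign that renames "post-text", this extractor silently returns an empty list, which is why scraping-based research requires constant maintenance.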

There’s no way around it: public-interest researchers, journalists, and nonprofit entities need access to social media data. Left unchecked, tech companies may close it off altogether.

What’s needed is effective legislation that gives an outside entity complete, transparent, auditable access to tech companies’ data, and gives it the ability to impose real penalties on companies that don’t comply. With an independent body governing access to this data, gatekeeping measures can be designed to prevent the indiscriminate sale of our personal data and to facilitate peer review and fact-checking, not to protect corporate interests.

Legislation of this nature in the EU has already shown early potential to effect this kind of change: the Digital Services Act has prompted X to update its terms, seemingly in preparation for granting European researchers API access. Meanwhile, US legislation like the Platform Accountability and Transparency Act (PATA), which holds promise for making social media data more accessible to US researchers, continues to languish in the Senate.

This kind of legislation won’t happen without staunch advocacy. That starts with making sure social media research transparency is a part of the conversation around tech legislation in the US. Our democracy might just depend on it.


Dylan Baker
Dylan Baker is a research engineer at the Distributed AI Research Institute and a Public Voices Fellow on Technology in the Public Interest with The OpEd Project in partnership with The MacArthur Foundation.