Want To Hold Tech Companies Accountable? Data Access Alone Will Not Get You There

Lauren Wagner / Feb 5, 2024

Multi-sector efforts have been underway to get technology companies to share more platform data. This will fail to hold the industry accountable without a parallel, intensive focus on training more research scientists, says Lauren Wagner.

Anne Fehres & Luke Conroy / Better Images of AI / Data is a Mirror of Us / CC-BY 4.0

Calls for platform transparency have united a diverse coalition of stakeholders. Policymakers, civil society, philanthropists, academics, and even technology executives have hailed transparency as a way to hold companies accountable for their impact.

Just last week, before the Senate hearing on online child safety, Meta announced a partnership with the Center for Open Science to share privacy-protected data with a select group of academics studying social media and well-being. During the hearing, Mark Zuckerberg declared that “the bulk of scientific evidence does not support [a link between mental health and social media use].” While his interpretation of the findings is debatable, research is clearly influencing important conversations about technologies’ risks.

For champions of platform transparency, the concept is simple: If researchers had access to troves of data that reveal companies’ inner workings, they could quantify the effects for the public to see. Issues like the spread of misinformation, extremism, mental health, bullying, or political polarization would be better understood. This would result in meaningful action: Responses from policymakers, critiques from journalists, and commitments from the companies themselves to develop new products that better serve society.

But while “sunlight might be the best of disinfectants,” this approach falters when faced with a stark reality – there are not enough researchers in the world who have the skills to analyze and interpret platform data.

Platform transparency efforts tend to focus on increasing the supply of data, not the availability of people who can use and analyze it. There are myriad issues related to provisioning data and proving that academics are capable of holding companies accountable. Yet without a substantial and growing pool of specialized researchers globally, calls for more access will not yield the desired results. Data will be shared with a known group of academics from a subset of universities, who are limited in their capacity to conduct timely studies and pose new lines of inquiry.

To date, conversations around transparency have largely focused on social media’s impact on democracy, including topics like viral content promotion, election integrity, information integrity, and online harms. Major platforms, including Meta, TikTok, X, and Google, have selectively granted researchers data access. And there is promise that more access is within reach. Legislative proposals in the US, such as the Platform Accountability and Transparency Act (PATA), reintroduced in 2023 with bipartisan support, would create a mechanism for researchers to acquire platform data. The EU’s Digital Services Act, which is now law, provides a clear mandate for granting data to vetted scholars, and potentially journalists and civil society groups. At the same time, proposed AI governance standards in the US and the EU include measures for data transparency and documentation.

So, the data floodgates may soon open. But to effectively analyze and interpret this information, and draw meaningful conclusions that can be evaluated by nontechnical audiences, researchers need a blend of social science and computer science skills. These include data mining, natural language processing, text analysis, web scraping, data visualization, and machine learning, as well as subject matter expertise in major thematic areas.

For instance, Meta’s Researcher Platform provides approved academics with over a petabyte of privacy-protected user data from Facebook and Instagram, returning up to 100,000 results per query of near real-time public content. The complexity of managing such vast data sets – with one petabyte of data comparable to 500 billion pages of printed text or 20 million tall filing cabinets – underscores the advanced skills needed to derive insights from this raw information. As organizations like the Coalition for Independent Technology Research call for more access so that researchers can “identify problems, hold power accountable, imagine the world we want, and test ideas for change,” the competencies needed to do this work remain specialized and rare.

Expanding data access to journalists and civil society groups does not solve the problem of a limited researcher pool. One issue is that the computer science skills needed to analyze raw platform data are not commonly found in newsrooms. As the State of Open Data Report indicates, the cost and complexity of data journalism present significant barriers. While platforms could offer simplified dashboards for less technical users, understanding and contextualizing research findings still demands data science proficiency. Moreover, it is crucial to ensure that any information the technology companies port into dashboards is reliable, and that results are replicable.

As Stanford scholar Paul N. Edwards points out, “Data politics are not about data per se. Instead, the ultimate stakes of how data are collected, stored, shared, altered, and destroyed lie in the quality of the knowledge they help produce.” Philanthropies have poured millions of dollars into organizations advocating for data access, as well as some academic institutions researching sociotechnical issues. These efforts have had some success, and one result may be that more data will be made available for independent research. But with an aim of holding the technology industry accountable, more focus must now be placed on funding centers globally and training computationally proficient researchers, including outside of the academy. This involves supporting subject matter experts who can translate quantitative findings for nonacademic audiences and seeding programs to increase diversity and geographic coverage in the computational social sciences.

This year, the World Economic Forum’s report on global threats identified AI-generated misinformation and disinformation as the world’s greatest short-term risk. As world leaders grapple with the societal impacts of rapidly advancing technologies, it is imperative that we train a broader and global coalition of computational research scientists who can appropriately address these questions.


Lauren Wagner
Lauren Wagner has spent the last 15 years building products that leverage privacy-sensitive data in areas like trust & safety, digital health, and consumer AI. She previously worked at Meta and Google bringing new products to market, spearheading data sharing efforts and the creation of tools to com...