How To Assess AI Governance Tools
Justin Hendrix / Jan 28, 2024
When it comes to how to set standards and create meaningful evaluations of the safety of AI systems, a lot is still up in the air.
This week, I had the chance to attend the Knight Foundation’s INFORMED conference, which hosted participants with expertise across a range of disciplines, including policy and technology, for a series of conversations on democracy in the digital age. One of the panels at the conference, led by Data & Society Executive Director Janet Haven, featured Dr. Alondra Nelson, a scholar who recently served as a deputy assistant to President Joe Biden and as acting director of the White House Office of Science and Technology Policy, and Elham Tabassi, chief AI advisor and associate director for emerging technologies in NIST’s Information Technology Laboratory (ITL), who leads NIST's Trustworthy and Responsible AI program. That program aims to cultivate trust in the design, development and use of AI technologies by improving measurement science, evaluations and standards.
On January 26, 2023, NIST launched its AI Risk Management Framework, a document intended to help AI developers “incorporate trustworthiness considerations into the design, development, use, and evaluation of AI products, services, and systems.” One thing Tabassi pointed out during the panel is that when it comes to AI, evaluations and standards are still a work in progress:
...We all know that AI systems are socio-technical in nature. They are not just systems of data and compute and algorithms. A lot of evaluations and standards that we have been working on in our existence of 100 years– not just me, the organization– they were mostly, we bring the algorithms, the systems out of the context of use, into the laboratory, and test them. These are not going to be sufficient to understand how AI systems are going to interact in the environment that they're working with, the humans that are operating it, using it in the environment, and individuals that are going to be impacted by the technology. So these are the differences that we realized for the AI technology, and tried to answer some of those as part of the development of the AI RMF. And the fact of the matter is that a lot of those questions about how to measure is still unsolved research.
Last year, the World Privacy Forum, a nonprofit research organization, conducted an international review of AI governance tools. The organization analyzed various documents, frameworks, and technical material related to AI governance from around the world. Importantly, the review found that a significant percentage of the AI governance tools include faulty AI fixes that could ultimately undermine the fairness and explainability of AI systems.
To learn more I spoke to Kate Kaye, one of the report’s authors and a former journalist whose work I've admired for years, about a range of issues it covers, from the involvement of large tech companies in shaping AI governance tools to the need to consult people and communities that are often overlooked when making decisions about how to think about AI.
What follows is a lightly edited transcript of the discussion.
My name is Kate Kaye. I am Deputy Director of World Privacy Forum.
Kate, can you tell us about the World Privacy Forum? What is it? What does it get up to?
We're a nonprofit public interest research organization. We deal with data use and data privacy issues as they relate to a range of areas from AI to biometrics and identity systems, healthcare data use, and as we'll talk about today, AI is a really big area of focus right now, especially through the report that we just put out.
And that report is Risky Analysis: Assessing and Improving AI Governance Tools. You've done an international review of AI governance tools. You've traveled the world. You say that you've looked at all sorts of, I'm sure incredibly riveting documents, practical guidance, self-assessment questionnaires, process frameworks, technical frameworks, technical code and software across Africa, Asia, North America, Europe, South America, Australia, and New Zealand. Wow. That's the type of nerding out that I think a lot of Tech Policy Press listeners appreciate. How long did it take to do this global tour of all of this technical material?
We started the process, it was right around the end of April of 2023 that we said we need to do some evidence gathering here and figure out what this stuff is, what does it look like? So that was when we started. And we really finished our research process, I think probably, I don't know, we say in the report somewhere, but I want to say end of October or something. So it was several months of gathering research and analyzing it for this and writing, of course.
This is moving so fast. I'm sure that other additional documents emerged even as you went along.
Oh, yeah. I've got a list of things... Okay, next update, what do we add? As we were finalizing, there were new things that would pop up. One of the more recent things that popped up was something out of Singapore. They already had a couple things that were in the report and then they put out something related to generative AI, in particular. What we consider an AI governance tool, according to our definition, it fit in the definition, and that was something that came out, I think in October, actually.
So yeah, stuff's happening. It's ridiculous how fast it's happening, for sure.
I guess the good thing, though, is the ability to compare across different geographies, and that's the goal here, of course: to look for best practices, to look for what's working in some places versus others, to compare and contrast.
You say you want to avoid flimsy AI governance tools that create a false sense of confidence, may cause unintended problems or otherwise undermine the promise of AI systems. I guess when you step back from this project, how are we doing so far? Lots of folks are really impressed by things like the NIST Risk Management Framework. Looking across the globe, what's the quality of these documents from your point of view?
Yeah, the NIST AI Risk Management Framework and Policy and Playbook, I think they call it, those two companion docs, those are in this. They're considered AI governance tools.
One of our takeaways, one of our findings of our report is that there are some problems emerging already. We found that among the 18 AI governance tools that we reviewed specifically that reflect our definition of AI governance tools, 38% of those included faulty AI fixes, specific types of things that we look at in the report that are problematic when it comes to measuring AI fairness and explainability. And I can go into details on what those are. It's a technical thing. But basically, put simply, we already saw some problems because we saw governments and NGOs putting out AI governance tools that included some methods that were problematic, and that could create that false sense of confidence when it comes to what a measure does.
If some measurement method is intended to help rate or score the level of risk or the level of fairness or whatever, and the measurement itself doesn't work right, it's not achieving the policy goals that are intended there. And because we started to see these inklings of these things happening, we felt as though one of the things the report can really do is help policymakers and organizations step back and really create some sort of a really good, strong assessment process for the measurement methods that are used here.
So we're certainly not trying to say that the approaches so far are bad. Ultimately, to answer your question there, the fact that we saw all of these AI governance tools, which really are how responsible AI is being implemented, the fact that those exist is really positive and there's so much work being done that is not just surface level stuff. There's really good work being done and really thoughtful, meaningful stuff happening. Overall, our review showed that there's a lot of positive things going on.
One of the things you look at is the way in which some of these tools, these governance methods interact with existing policy on things like data privacy. I'm interested in that. The US, for instance, we don't have federal data privacy protections in place. Other jurisdictions that you study, do you see a kind of, I suppose, framework or membrane between AI safety and risk mitigation and data privacy? What's there in that intersection between those two things?
Yeah, I think what we're trying to do here is... Really what we discovered and what seems to be apparent is that when we think about data protection and we think about data privacy in relation to the impending AI era, all of the more advanced AI systems that are happening today, we really don't know what privacy looks like. Privacy is mentioned in a lot of these AI governance tools, but how we consider data protection in context of AI is really an open question still.
There are some things that we've seen in the past that World Privacy Forum has actually looked at that are the early stages of these much more advanced things that were happening. In 2014, World Privacy Forum put out a report called The Scoring of America. And one of the things, and it really looks at how data is used for all types of scores, like in the financial services area when you look at credit scoring for example, or health scores and things like that. And so that really looked at the employment of machine learning, earlier machine learning processes and predictive analytics and things like that in relation to scores. So that was an early precursor to some of the research that we think really needs to happen still that we're going to move toward.
In terms of how privacy is showing up specifically in some of these things, one example is interesting out of the UK. The Information Commissioner's Office, they put out what we consider to be an AI governance tool, and it really maps the GDPR to specific process steps in evaluating algorithms or evaluating models. But it's really interesting because I think one of the things that we... One of the things, one of our kind of takeaways is that GDPR doesn't necessarily address the way AI is happening now. It's more of a traditional approach to privacy. And one of the things that it does is it thinks it... For example, even if you look at the language in GDPR, you look at how it defines automated processing, and it really focuses on personal data in particular.
And when we think about how AI systems use data today and how they can find patterns and connect all sorts of data sets to reveal sensitive information about people through really complex analytics processes, just the existence of PII, or the lack thereof, doesn't mean that AI is not going to cause harm somehow or expose sensitive information about someone.
So what I'm getting at is that even something like GDPR doesn't necessarily do what we need it to in the very near future. We're already seeing this play out. It's part of the reason why this AI governance stuff is so unwieldy and why everybody's asking how it can be done, in part because the foundational legal frameworks and regulatory frameworks that exist are not cutting it.
So there are ways, I guess, in which AI as it advances will undermine existing legislation where there is legislation around data privacy. But I suppose here in the US, we haven't got anything in place. Strikes me as we're terribly behind on that front.
Yeah, I don't think anybody would argue with you that we need comprehensive privacy legislation here. Some of the things that are happening in the US in relation to AI are relevant. Like you mentioned, NIST, what they're doing is really important. The Blueprint for an AI Bill of Rights is a really interesting approach. We actually came up with a lexicon of AI governance tool types for this report, and the Bill of Rights really is something that we consider to be practical guidance. It's like a base level of how do you approach an AI governance tool. But there's a lot in there.
It directly relates to what government agencies already can do with the regulations that are in place. I think, in general, one of the things that we wanted to show with the report is like, yeah, okay, legislation, that's important, but all these other things are happening in lieu of or in relation to legislation. So this is the implementation layer. This is how legislation or regulation or even just rules or whatever they are, that's how this stuff gets implemented. It's where the rubber hits the road, so to speak.
The promise of this report, of course, is that you do make this global survey of governance tools, activities. You call out a handful that you say stand out for a variety of reasons. You point to models in Chile and India, particularly the state of Tamil Nadu, Kenya, New Zealand. Is there one in particular you'd like to call out, maybe one that you think is most compelling or one that's, I don't know, most memorable to you?
Yeah, I can mention the one in Chile because I think that's really, it's just a stark example of what I'm talking about when I'm talking about implementation. So it's literally just an update to the bidding process for government public sector AI.
So if you're a government agency, you have to go through a tech procurement process, a procurement process for anything you buy as a government agency. And so this is them saying, "Hey, that procurement process, that should actually incorporate some of these responsible AI considerations that we're talking about." It's literally saying, "Okay, we have these goals that we want for AI. One way we can implement them is by requiring vendors of these systems or AI software, whatever we're going to buy to incorporate the considerations that we think are important."
And one of the little areas that is really interesting there is they actually include, they talk about the metrics used to improve or measure things like fairness or explainability, and they go into some detail about government agencies who are doing this procurement. They should be the first ones to actually go to the vendors and say, "Here's how we want you to measure this stuff. This is what we think is important."
So it's a really interesting approach because a lot of the things that I saw when I was doing these reviews of the tool, of AI governance tools, a lot of times governments are looking to the vendors, they're looking to the corporations that are building this stuff for the answers for how to measure them, for how to measure these systems. And one of the things that we talk about in the report in terms of pathways forward for a more healthy AI governance tools ecosystem is being able to document and prevent conflict of interest.
It's important to have the entities building these things part of the process because they know how they're built and nobody's questioning that. But we really can't just take their word for it on everything. We have to have really strong external evaluation of all of these things. We can't just, when they talk about safety or they talk about fairness or whatever, how are they measuring those things? What are the filters they're using? What are the methods they're using?
And so, getting back to the Chile example, they are looking at it from that perspective. They're saying, "Okay, we want to work with the companies that we're buying from, but we want it to be a process that we have some control over. We're not just going to take their word for it." And that was an example of an AI governance tool that's an unlikely thing. You wouldn't think, oh, a contract, a procurement process bidding form, that's a tool. It is because that's actually how these things end up being bought and being used and how they're thought about. So it's a really important example, I think.
I want to ask you about some of the, what you call off-label measures, measurement methods that are starting to pop up, in some cases undergirding these various governance tools. So they may be in very different parts of the world, but could be relying slightly on the same measurement methods. What are the ones that seem to be most in use across the world or the ones that seem to have found the most... Are being replicated, I should say?
Yeah, in particular, one of the things we wanted to do with this report was bring together the worlds of policy with academia. There is so much work that's already been done among sociotechnical and technical researchers. There's a ton of documentation that we have. A whole section of our report really looks at what scholarly literature says in relation to AI fairness and explainability in particular. We looked at those two things because, for policymakers across the board, those are two things that are showing up in all of AI governance everywhere.
And what we did was we looked at a couple use cases in particular. One is in relation to the use of an employment rule in the US. So the US has what's called the four-fifths rule, or the 80... It's also known as the 80% rule, and it's a measure of disparate impact. It's really specifically intended to be used for employment context. It's about measuring for recruitment and hiring processes, whether or not they are creating disparate impact against certain groups of people.
So the number, the way it's measured is by looking at, I'm actually going to read what it is because I don't want to get it wrong. The rule is based on the concept that a selection rate for any race, sex, or ethnic group that is less than four-fifths or 80% of the rate reflecting the group with the highest selection rate is evidence of adverse impact on the groups with lower selection rates.
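As an editorial illustration of the arithmetic in the rule Kaye reads out, here is a minimal Python sketch. The function names, group labels, and applicant counts are hypothetical and do not come from the report or from any tool it reviews. The sketch computes each group's selection rate, divides it by the highest group's rate, and flags any ratio below four-fifths. It shows only the calculation, not whether applying it to an AI system is appropriate.

```python
from typing import Dict, Tuple


def impact_ratios(outcomes: Dict[str, Tuple[int, int]]) -> Dict[str, float]:
    """Return each group's selection rate divided by the highest group's rate.

    `outcomes` maps a group label to (number selected, number of applicants).
    """
    rates = {}
    for group, (selected, applicants) in outcomes.items():
        if applicants <= 0:
            raise ValueError(f"group {group!r} has no applicants")
        rates[group] = selected / applicants

    highest = max(rates.values())
    if highest == 0:
        raise ValueError("no group has any selections; ratios are undefined")
    return {group: rate / highest for group, rate in rates.items()}


def below_four_fifths(
    outcomes: Dict[str, Tuple[int, int]], threshold: float = 0.8
) -> Dict[str, bool]:
    """Flag groups whose ratio to the highest selection rate is under the threshold."""
    return {group: ratio < threshold for group, ratio in impact_ratios(outcomes).items()}


# Hypothetical hiring data: (selected, applicants) per group.
hypothetical_outcomes = {
    "group_a": (48, 80),  # selection rate 0.60, the highest
    "group_b": (12, 40),  # selection rate 0.30
}

ratios = impact_ratios(hypothetical_outcomes)
flags = below_four_fifths(hypothetical_outcomes)
```

With these made-up numbers, group_b's selection rate is half of group_a's, so it falls below the 80% threshold and is flagged. That a check this simple is easy to automate is part of why it gets encoded into fairness tooling, which is the off-label use discussed next.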
So that 80% rule is something that's really well-known in the world of employment. It's disputed in that world as a super reliable way of statistically auditing for disparate impact. But what we're seeing is it's actually being encoded into automated fairness algorithms and systems. So whether it be some corporate tools that are out there for measuring and improving AI fairness or things that AI governance tools that we looked at were things that governments themselves are putting out sometimes based on those corporate approaches, we saw it showing up. We saw this encoding of this very specific rule that is intended for the US, it's intended for employment context, and it's not intended for AI, for measuring AI systems. It's not necessarily intended as a sole evaluation approach for employment systems. So that's an off-label use.
It's being used out of context, essentially. When we think about off-label, we think about off-label use of drugs, and that's really why we use that term in the report because that's an example. And then a couple of other examples we talk about in the same capacity are two approaches to measuring AI explainability. They're super common. They're used everywhere. And so we're seeing them mentioned in some AI governance tools. They're called SHAP and LIME, and they're basically metrics for gauging what in an AI model led to a specific decision or a specific prediction, a specific output of the AI model. Again, SHAP and LIME are used all over the place. They're really accepted, but there's an abundance of scholarly literature that questions the applicability of these approaches for measuring especially more advanced and complex non-linear types of AI or the things that we refer to often as black box models.
So those are some areas that we looked at really closely in the report, and that's where we got that 38% of the governance tools that we looked at actually mentioned either SHAP or LIME or suggest using those or suggest using automated methods and methods that incorporate that encoding of the 80% rule.
Let me ask you about the involvement of large tech companies in shaping these governance tools. You mentioned in the report, of course, that one of the examples that you looked at is a thing called Open Loop, which is most prominently developed by Meta in coordination with a variety of other different types of entities, startups, representatives from academia, civil society, governments, et cetera.
What footprint do the big tech firms have around the world in terms of helping to create the mechanisms that will ultimately be used to determine whether they are in compliance with regulatory schemes?
In general, they're absolutely part of the process in many of the AI governance tools that we looked at. It's not as though every government is partnering with industry, but the reality is policymakers are looking to industry for the answers. Sometimes they're looking to them without questioning their answers, and so that's really the important element here. It's not that they shouldn't be part of the process, but it's that, again, like I said, that we have to really be able to assess so we can document and prevent conflict of interest. We don't want the creators of these systems grading their own homework or deciding how they're measured.
So you mentioned Open Loop. That's a really interesting... So we didn't consider that a classic AI governance tool per se, but it's worth mentioning because it's reflective of what we're seeing a lot of, which is this idea of policy sandboxing. So it's industry saying, in this case it's Open Loop, it's some extension of Meta in some way, I don't know exactly how they define it, but basically they're working with government entities around the world trying to formulate and test policy and decide how does policy look when it's implemented, what are some ways that we can evaluate the impact of policy before it's actually policy, before it's actually legislation, for example? That's what Open Loop tries to do, and it's really a very close relationship with industry.
Another example, I was just refreshing my memory about Singapore's, the generative AI example that I was telling you about. Yeah, the Singapore Infocomm Media Development Authority has a generative AI evaluation sandbox, evaluation catalog. And what Singapore and a lot of the things that they've done, they've worked really closely with a lot of the big tech companies and then maybe some smaller companies that are more regional and really look to them to help craft the guidance that they've put out. They've relied on certain things that already exist that are out of industry.
Their AI Verify, they actually, Singapore has literally software that they've published that includes a method for basically automating the measurement and improvement of AI fairness and measuring a model according to various AI fairness approaches, including at some point in their process, they incorporated an IBM-related thing that measures disparate impact, that actually has what they call a disparate impact remover algorithm that is based on that whole 80% rule. It incorporates that concept in it. And then this generative AI thing that they've put out, they mentioned a bunch of scoring mechanisms for measuring various aspects of LLMs that are created by industry, by OpenAI, by Google. I listed a few in the review. For, I think a lot of legitimate reasons, industry is part of these processes, but they can't be the only entities that we rely on for the answers.
I want to also just ask you about the role of the OECD. You point out that the OECD has a kind of catalog of all of these various types of measures and standards. It's leading the way in many ways. We've talked about NIST already. What about OECD?
So the OECD, yes, they have a catalog of tools and metrics for AI, and it's their approach to implementing responsible AI goals. And OECD has really been a leader as a multilateral organization that's brought together, really trying to harmonize different government approaches to governing AI. But what we saw when we evaluated, when we looked at those specific things that we thought were problematic, incorporating the 80% rule or methods that incorporate that, and SHAP and LIME, we saw a lot of that showing up in the tools and metrics that OECD has put out in this catalog.
We really think, and I think there was a story in IAPP, the International Association of Privacy Professionals that mentions a comment from OECD saying they acknowledged the report and the problems that we brought up, and they really want to work toward assuaging some of those problems. And they talked about putting together some sort of approaches for better documentation and things like that. OECD, because it is really looked to by governments around the world for guidance, what they do is super important, and we really wanted to understand whether or not some of the problematic things we were seeing were showing up in their catalog, and we found that they were. And it's important. They're just, ultimately, we need to have better evaluation of this stuff before it is published, especially by really influential organizations.
If we were to do this again in a year or two years, given the number of commissions and studies and reports that state and national governments have commissioned across the world, I'm sure that there would be just a proliferation of these types of frameworks and tools and kits. How are you going to stay on top of this? What's next for your research into this?
This is ongoing research. We're going to be putting out updates regularly. Like I said, I've already got a going list of, oh, got to add that update. Let's research that thing, do that. That's just going to be a constant thing. We'll be putting out some additional charts and more interactive data visualization type stuff on worldprivacyforum.org. Yeah, it's ongoing. This stuff is, you can't just end it. It's going to continue.
Of all the documents you looked at, was there one in particular you found most riveting, one in particular that you still think back to as being the best framing of these issues?
In a lot of ways, the documents are not necessarily comprehensive or holistic in terms of what they're trying to solve for. One of the ones that I really think is super important, there's a couple, because they talk about issues that are not addressed often enough.
So one is the Masakhane Research Foundation. They're an organization out of Kenya. So they don't fulfill our AI governance tools definition per se, but they represent what needs to happen when it comes to developing ethical AI and responsible AI. Before we even get to this point, we need to have data sets to train AI models that are contextually relevant for the places we're talking about, for the places they're going to be used that reflect in a really broad and inclusive manner, the people and the languages and the cultures that these systems are going to be used in.
And what Masakhane Research Foundation does, it's a volunteer organization that, around the world people are building African language data sets that are reflective of low resourced languages that don't have enough data associated with them. This is an example of what we need to happen before we even get to building an AI model to begin with. So I think the kind of work that they're doing is just instrumental.
Another thing that we found to be really interesting and important, New Zealand has a really very interesting, robust process for evaluating algorithms throughout the AI life cycle. It's got a whole privacy-related framework to it based on some of their earlier privacy implementations, and it incorporates indigenous data considerations. It incorporates Maori data frameworks, as an example, and some other indigenous data frameworks. This is stuff that governments have to think more about. These kinds of concepts have to become part of government policy in a way that they just are not in most places.
So those are two examples that really stand out to me because they're at the forefront of what we're going to see more of, or what really we really need to see more of if AI governance policy actually reflects what we really want to see in the world.
Those examples and more are in this full report, which you can find in the show notes for this podcast. Kate Kaye, thank you for talking to me about all of this.
Thank you so much, Justin. It was awesome talking to you.