Artificial Intelligence and Your Voice

Justin Hendrix / Nov 5, 2023

Audio of this conversation is available via your favorite podcast service.

Today’s guest is Wiebke Hutiri, a researcher with a particular expertise in design patterns for detecting and mitigating bias in AI systems. Her recent work has focused on voice biometrics, including work on an open source project called Fair EVA that gathers resources for researchers and developers to audit bias and discrimination in voice technology. I spoke to Hutiri about voice biometrics, voice synthesis, and a range of issues and concerns these technologies present alongside their benefits.

What follows is a lightly edited transcript of the discussion.

Wiebke Hutiri:

So my name is Wiebke Hutiri. I am an almost finished PhD candidate at the Technical University of Delft in the Netherlands, meaning that my thesis is submitted, but from submission to graduation, it always takes a bit of time.

Justin Hendrix:

And can you tell me a bit about your research, about your dissertation, and your interests more broadly?

Wiebke Hutiri:

Yeah, so my dissertation has been broadly in the field of responsible AI. And the approach that I've taken is this: for a number of years, we've had a pretty large number of researchers devoting themselves to algorithmic fairness and bias, to finding ways of detecting it and mitigating it.

But we've got a large number of applications that use, or increasingly use, AI, where developers and engineers don't necessarily have a background and expertise in bias and fairness. And so my thesis was really about trying to distill best practice knowledge from the machine learning community around detecting and mitigating bias into something I called design patterns, which engineers and developers can apply to detect and mitigate bias.

And so within the context of that, I've ended up studying speech technologies quite closely. So I've worked a lot with keyword spotting, which is the "Okay, Google, Hey Siri" kind of very short snippets of speech that give you access to a speech pipeline. And then I've worked a lot with voice biometrics, which is also known as speaker recognition or speaker identification and verification.

So that's kind of my angle from speech as use cases that I studied, in order to bring bias and fairness approaches closer to developers.

Justin Hendrix:

And can you tell me about the Fair EVA project?

Wiebke Hutiri:

That was a project that I started at the end of 2020, actually. It emerged out of a prototype that I had built as one of my first studies in my PhD research, and the study investigated bias in voice biometrics. And so maybe just to explain the context of how I approached it: at that point, so the end of 2020, which was I guess the end of the first year of COVID, there was a lot of attention that started being paid to face recognition. And so face recognition comes across as a very invasive security technology with huge challenges around bias, but speaker recognition is much less known, yet also integrated in many different applications that we use every day.

I started my research with this intrigue around saying, well, what is bias like in speaker recognition? And in order to do that research, I built a prototype of a tool to detect bias. And the Fair EVA project emerged from that to build out that tool to make bias detection tools accessible to developers and engineers.

And again, I guess as research journeys go, once you dip your head out of research, you start to realize, oh, wait a moment, no one actually knows what speaker recognition even is. In a community of developers, people are saying, well, we'd want to test for bias if you could show us that it existed.

And so the project started as this tool development project. We received funding from the Mozilla Foundation to build it out, and we ended up looking at data sets used for speaker recognition development, and also ended up creating an animated video to explain the technology to people who don't know about speaker recognition.

Justin Hendrix:

So what is the problem that you've discovered, or the set of problems, if you can kind of break it down in a little more detail? What is the issue here with representation bias in these data sets?

How does that play out in the real world as these applications, you know, move to be used by millions or perhaps even billions of people?

Wiebke Hutiri:

Biometric technology is technology that uses human bio signals in order to identify people. And so I guess we're all used to fingerprints, we're used to maybe iris scanning, and face recognition is one of them; voice recognition is another.

And so voice recognition, or using voice as a form of biometrics, is viewed as having a big advantage in that it is non-invasive and largely hidden. And hidden, I mean, always sounds a little bit negative, but it's not necessarily if you consent to it. For example, if you are on a phone call, you could use voice biometrics without having to specifically prompt or ask the person that's on the call to identify themselves.

So this hidden aspect of voice biometrics makes it quite unique, because all other forms of biometrics need some form of engagement. So you need to look at your phone, or you need to take your finger and put it down on something, but voice biometrics is different. And so in that sense, voice biometrics has been under development for several decades.

And if you start looking at the datasets that historically existed, the industries that have been interested have largely been national security and the military. So for one, the countries in which development has been funded have often been countries that have stronger military funding. And the people that are represented in those data sets are often people from places where those countries have military interests. So for example, for the US, it's not only white US males, which we maybe have come to expect from biased datasets, but there might also be a relatively large proportion of speakers from Iran or Afghanistan.

And so there is, on the one hand, this kind of military origin, and on the other hand, voice biometrics also needs to be seen together with the internet of things that is developing. I think a lot of people that are worried about privacy and security to some extent have an aversion towards cameras that are in our public spaces.

Microphones we can't see. And if I speak to colleagues who are bringing machine learning capabilities to devices and are really on the sensor and hardware side, the prognosis is that every single electronic device that will come out in the future, be it your TV, be it your micro toaster, be it your kettle, be it your fridge, whatever, can have embedded microphones.

And so embedded microphones then give this capability that you can talk to your devices, which seems very cool for human computer interaction. But you sit with a massive security problem because then you might say, well, who can talk to the devices and what can happen? And so voice biometrics then becomes a way of securing this like vast network of Internet of Things devices that can be talked to.

And so that's kind of where we're sitting with voice biometrics and speaker recognition now. On the one hand, it has this history of being used in security, national security, and military applications to identify people, with many similar concerns to what face recognition would have.

And on the other hand, though, it is definitely a very useful technology.

Justin Hendrix:

So you've provided me an example of a successful speaker identification and another example of a failure in speaker identification due to bias.

Wiebke Hutiri:

Yeah, so I'll give you two examples.

I've mentioned national security and military. I've mentioned the internet of things. The space where it's maybe used the most at the moment is call centers, especially in financial services. So you might have had your bank encourage you to submit your voice and to use your voice for identification.

That can either be direct or indirect. So that would be "my voice is my password." You say that, then you can authenticate yourself. Now, there are two types of errors, or two types of failures, that you can expect from speaker recognition systems. The one is where you correctly try to unlock your device or get access to your bank account, for example, but the system doesn't recognize you.

I'll come back to that just now, because I think there are several different follow-on implications of what that might have. The other is one which, I guess, would be strictly speaking a security breach, where somebody else tries to access your account and they can.

Now if we come back to the history of how speaker recognition technology was developed, there was in the past often an assumption that the people that might try to break into your account might be quite different from you. But if we're coming now to, for example, you using speaker recognition on your mobile phone to access your bank account, or in internet of things devices, there's suddenly a very high likelihood that somebody who might try to access your account has the same accent, because they're from the same geographic location.

Maybe they even have the same genetics because they live in the same house as you. And so suddenly we're sitting with very different failure modes. Again, if you have your voice and my voice, it's quite easy to tell us apart. If you have my voice and my sister's voice, it suddenly becomes very difficult to tell us apart.

And so that has been a part where we really have to start interrogating what the conditions are under which we evaluate, and where bias suddenly starts becoming really important, because if we cannot capture the nuance of groups of people sufficiently, those people might be quite susceptible to false positives.

So again, somebody else accessing your account if that person sounds quite similar. Now, if you come to the other failure mode, which is the false negatives.

Let's stay with the banking example. So I recently moved to a new city. It's in Switzerland. I'm in Zurich at the moment. And I've been trying to open a bank account, so it's quite an interesting scenario. The banks here are trying to save money, I guess, as they're doing everywhere, moving more and more digital. And so a lot of the bank branches are closed, to the extent that, as somebody who's here on a short-term contract, there were only two physical branches of the bank that I was able to go to.

So now, in the entire city, I've got two branches. I'm trying to call the call center. And the call center has, in this case it's not speaker recognition, but just a speech technology, like a speech recognition system, that speaks only German.

Now thankfully for me, I can speak German, so I was able to access that system. But two days later I spoke to a Japanese colleague who told me that he's been trying to get a bank appointment and he hasn't been able to, because he doesn't speak German, and even if he did, his accent wouldn't be recognized.

So this is kind of an example of what that could look like on the speech recognition side. Again, the same happens with speaker recognition. So in Mexico, for example, speaker recognition is used for proof of life verification of pensioners. And so there, if you have this false negative failure, again, let's take what has happened in banking, where banks try to consolidate the locations to save money.

There's only very few physical locations that remain, and if we take a scenario where pensioners used to maybe have access to collect the pension from a large variety of different places, now there's a drive for digitization, and because of that, there's consolidation of physical locations, location branches close.

And if you then sit with false negative errors and you, for example, can't validate your identity, you suddenly have a scenario where instead of being able to walk a kilometer to access your pension, you now might have to travel 200 kilometers to the next city. I am playing a little bit with what might happen in future.

So I'm not aware of any failures like that having happened exactly right now. But I also think we're largely so unaware of speaker recognition technology in everyday products that we might not even always realize yet when it's failing us. And I think the stories maybe haven't quite emerged yet.

Justin Hendrix:

I want to talk a little bit about speech to text and speech recognition, in particular transcription.

This is such a common thing that people are using technology for these days. I use it all the time, of course, in doing this podcast. I was just reading a New York Times article the other day about a call center employee in Mississippi who was talking about the ways in which automation is kind of creeping into her work.

And one of the problems that she was having is that the transcription service that records her interactions with customers often makes a variety of different mistakes and, you know, may rank the quality of her communication more poorly as a result because she has a southern accent.

Now, I have a southern accent. Of course, that, you know, hit close to home. Can you talk a little bit about this issue around speech recognition, what makes for accurate transcription and what causes there to be problematic results?

Wiebke Hutiri:

Yeah, so that's a very interesting question. And I think, yeah, maybe also interesting in terms of the downstream consequences that arise if we assume that the technology works well and then penalize people when the technology doesn't work so well.

So I think that's a whole interesting set of questions. So with speech recognition technology, when you develop it, you typically have text that has been transcribed by a human, and voice. And maybe actually there are two ways: either you first had a person that spoke and you have a transcription of that speech, which in the past would have been done by human transcribers, often professional transcribers; or alternatively you have a text that is known. So for example, the Mozilla Common Voice project is a very big project to collect speech for speech recognition in different languages.

And in that case you'd have a corpus of text that they've paid attention to making sure is licensed as CC0. So openly available, available for commercial use, and with no attribution required. So that might be, for example, government text or things like that. And then you have people who join the project and read it out.

And so basically, then you'll have text and the spoken word with it, whether it's from transcribed speech or read speech. And from that, you'll have a model training process that learns the AI models that get used.

So the publicly available corpuses that a lot of people use will be these kind of carefully crafted speech data sets.

What you have alongside that is then more web crawl data. For example, OpenAI released a big speech recognition model called Whisper. And those are then corpuses crawled on the internet, again, similar to what you'd have with your large language models or your image models.

So depending on how much data you have for specific accent groups, for specific languages, you might end up relying more on the carefully crafted read speech corpuses. But the way that our speech is produced varies. For example, when I read, I sound different to when I have a conversation with you.

Um, it's different if you prompt me with a question, um, and I'm in dialogue. Dialogue might be different to when I just talk to a friend about something. It's different to when I give a speech. So speech style varies. And if the speech style that a call center agent, for example, works with in a call center is not well represented, they might not be understood very well.

Then sometimes you have transcription errors. So, for example, if, um, you have that a lot with children when they speak, where adults that transcribe children's speech, they might have sometimes corrected what the children said because the children said it wrong, um, in inverted commas. And so then the transcription actually is incorrect because somebody else corrected for that.

And so you might find that a lot, I think, for minorities or, again, maybe especially where there's historic bias, you then get cases maybe where the transcription wasn't correct to begin with. And then some accents also, again, like with all machine learning, if the data wasn't in the training set, it's unlikely to do well when it's put out during evaluation.

So if Southern accents aren't well represented because all the speakers came from universities on the East Coast and on the West Coast, then you end up with groups of people whose voices effectively cannot be understood. And yeah, maybe just to bring it back to what you said, I think where it gets important is that quite often, when technology gets introduced into organizations, it's not because organizations want to be cool.

They want to save money or make money. And so if an organization wants to save money, typically the reason for automation will be that there's an intention for efficiency. And so then if you, for example, have this introduction of new technology together with monitoring of, for example, employee productivity, um, or let's say customer call quality or something like that, then you suddenly end up with scenarios where people can really get penalized for automated technology that doesn't work well.

Justin Hendrix:

So next we can move on to speech synthesis and speech generation. So a lot of the hype around artificial intelligence these days, of course, is around synthetic media, around generative AI, around the ability to produce media from text prompts in many cases.

What are the bias issues that emerge here? I mean, lots of folks are excited about these systems and the ways in which they can be used for creative approaches to all sorts of different applications. But there's a lot of fear about voice cloning. You mentioned banks earlier, and bank verification.

We've already seen scams. I noticed in my social media feed the other day that Bank of America was running advertisements about the possibility of voice cloning attacks. Uh, so this seems to be really kind of reaching a crescendo at the moment. What should we know about issues around speech synthesis and speech generation?

Wiebke Hutiri:

Yeah. So it's a really good question. I always feel we have to ask two questions. The one is what could go wrong if the technology works well. And the other question is what could go wrong if the technology doesn't work.

So if we look at it, I've often been concerned with what happens if the technology doesn't work. So again, there'd be an assumption of saying we want voice biometrics to work equally well for everybody, because in some scenarios, if you, for example, want to access your bank account, then it's great.

So the same, you might say, is with voice cloning. Maybe also let's just distinguish the type of AI technologies used for voice biometrics. Speech recognition and speaker recognition, I would classify those as predictive AI. So it's using data to predict something.

And so the kind of issues that we're concerned about when it comes to bias are things like performance disparities. When we come to generative AI, we are often concerned about questions of stereotyping and representation. And so, yeah, I'll first ask the question of what happens if this technology doesn't work equally well for everyone.

So maybe also first let's unpack how can speech generation be used. You can use speech generation to create voices that never existed before. So, where that could be positive is that if you, for example, want to voice text, that otherwise wouldn't be read out, that would be great, or for example, especially in places where there are many languages that are spoken, you can suddenly add voice to minority languages, for example, that hadn't been voiced before.

Then, a different application would be one of saying, let me take your voice and clone it to let you speak things. And so, for example, Apple is bringing out a new feature in the next iOS release. It's an accessibility feature for people who are at risk of losing their voice, so that they can type text and then have their phone vocalize what they have written.

Um, so let's say that's a different application. And another one, maybe also one where we'd want the technology to work well, is a translation service, where you might be sitting in a meeting, or let's say in parliament, and you want to have the speaker's voice broadcast into different languages. Or let's say maybe movies or something like that. And so in scenarios like that, if we speak about representation, there's so much that makes our voice human-like, not even just the quality. Maybe even while you're listening to me on this podcast, you might have some imagination of what the human looks like who is talking here.

I mean, I'm here waving my arms and things like that. Maybe my voice brings some of that. And the question that comes is, if we generate synthetic voices for different people, of different body sizes, of different ages, of different ethnicities, backgrounds, languages, can all of their voices be represented with equal richness? Or are there some people whose voices can be synthesized to a great amount of detail and with a great amount of care, while other people just sound absolutely monotonous, like they have no life and no personality and so on? And so that's something that I'm working on at the moment, for example, to understand what the representation capacity is in terms of emotion, identity, accent.

So how well can a model really retain and preserve the identity of the speaker? So again, that's for scenarios where we want this technology. Some of the questions are difficult. If you have a Xhosa speaker from South Africa, for example, speaking English, what accent should they be speaking with?

Is the default a US accent? Again, there are some questions that are not easy to answer. And then the other question is really this question around voice cloning, in the security threat kind of scenario that you've painted.

To me, speech synthesis and voice biometrics are two sides of the same coin. There's research that's happening, for example, on watermarking of AI models. There's some that's come out in the image space. There's, yeah, a fair amount of work that's been done on both voice privacy and security.

But from my view, I think the sophistication of voice cloning that we have now, and really it's been the last several months that we've seen all of that emerge, makes voice biometrics a very insecure biometric application. And I think there again, it's work that I've been wanting to do that I haven't yet done.

But I think that really then also starts raising the question: if you synthetically generate people's voices, is everybody equally susceptible to that, or does it vary? And I mean, something that I'd maybe suspect is that with voice cloning, you kind of end up with an inverted risk profile to what you would have had in other cases.

So if you come from a group of people that is in the majority, or considered the majority, so one where AI normally works really well for you, chances are really good that voice cloning will be much more accurate for you, and potentially also that it will be easier to break into your voice-secured services.

Justin Hendrix:

So a completely sort of opposite effect in many ways, where being underrepresented, perhaps, may actually be a slight protection in this regard.

Wiebke Hutiri:

It's a little bit tricky because again, with the voice biometrics, you get the false positive and the false negative.

So the one is access being denied to you. And the other one is somebody else being able to break in. And maybe what's important to say there is that those two are normally a trade-off. So what a company could do, for example, is choose to let all users have the same false negative rate, which means they'll have an equivalent user experience.

So if you and I try to access our system a hundred times, both of us might be denied access once. The cost of that might be that, if the system works better for you than for me, then if 100 people try to access my system, 10 of those might be successful at breaking in, whereas if 100 try to access your system, one of them might be successful at breaking in.
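
The arithmetic of this trade-off can be sketched in a few lines of Python. This is only an illustration and is not taken from the conversation or from the Fair EVA tooling: the two group names, the Gaussian score distributions, and the 1% target are all invented assumptions. It sets a separate threshold per group so that both groups are wrongly rejected at the same rate, then reports how often an impostor would get through for each group.

```python
from statistics import NormalDist

# Hypothetical similarity-score distributions (higher = closer to the enrolled
# voice). All numbers here are invented for illustration, not measured.
GROUPS = {
    "well_represented": {
        "genuine": NormalDist(mu=0.80, sigma=0.08),
        "impostor": NormalDist(mu=0.30, sigma=0.10),
    },
    "under_represented": {
        "genuine": NormalDist(mu=0.70, sigma=0.10),
        "impostor": NormalDist(mu=0.35, sigma=0.10),
    },
}

# Assumed target: 1 in 100 genuine attempts is wrongly rejected, for every group.
TARGET_FNR = 0.01


def threshold_for_fnr(genuine: NormalDist, target_fnr: float) -> float:
    """Score threshold below which `target_fnr` of genuine attempts fall."""
    return genuine.inv_cdf(target_fnr)


def false_positive_rate(impostor: NormalDist, threshold: float) -> float:
    """Fraction of impostor attempts that score at or above the threshold."""
    return 1.0 - impostor.cdf(threshold)


def audit(groups: dict, target_fnr: float) -> dict:
    """Equalize the false negative rate per group and report the resulting
    false positive rate for each group."""
    results = {}
    for name, dists in groups.items():
        threshold = threshold_for_fnr(dists["genuine"], target_fnr)
        results[name] = {
            "threshold": threshold,
            "fnr": dists["genuine"].cdf(threshold),
            "fpr": false_positive_rate(dists["impostor"], threshold),
        }
    return results


RESULTS = audit(GROUPS, TARGET_FNR)

for name, row in RESULTS.items():
    print(f"{name}: threshold={row['threshold']:.3f} "
          f"FNR={row['fnr']:.2%} FPR={row['fpr']:.2%}")
```

With these made-up distributions, both groups see the same rejection rate for genuine users, but the group whose genuine and impostor scores overlap more ends up with a noticeably higher break-in rate. Real audits would estimate these rates from evaluation trials rather than from assumed distributions.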

On that level, I guess there's a question of who is ultimately going to be disadvantaged. It's a bit hard to tell, but when it comes to voice cloning, I guess there, the more of your type of voice is available, the better that system is likely to be.

Whichever way around we look at it, I find it staggering that even just these two technologies together, voice cloning and voice biometrics, already end up being such a complex ecosystem of technology that needs to be managed and understood.

And again, like thinking of how many, how many billions probably are secured by these technologies. I sometimes find it quite harrowing.

Justin Hendrix:

So I want to turn a little bit towards what some of the solutions are for some of the types of problems that you've discovered and identified. And my immediate thought, of course, in looking at this set of questions was that the answer is, you know, more data. Um, the problem is limited data sets, but you kind of suggest that that's not entirely the case.

Wiebke Hutiri:

So more data will make the AI models better. That is true. But bias and fairness aren't the only issues that we ought to be concerned about with AI.

So that's one aspect that more data can often solve. So even just obtaining more data, the questions around who labels it, who annotates it, are those people paid fairly, treated fairly, is there consent provided? So there are still a lot of questions, even just around getting more data.

It's often not as easily done as said, and it's quite an expensive undertaking. So this takes us maybe to a part that we haven't yet talked about, which is cool. It's like a hobby passion of mine, more by chance, which is that the voice is a massive store of personal information.

And so we never really think about it because we hear our voice all the time. Every day we're so used to it. We might not particularly like it. Like if I listened to this podcast, I would probably be like, I don't know if I can do that. But your voice, the way I think of it, your voice is your social context and your upbringing, your environment, but also your body.

And so, because of that, there's a lot of information. If your voice is recorded in a very well controlled environment, researchers have been able to find out an incredible amount of information about you. And maybe just to name a couple that are maybe less controversial.

So you can hear if somebody is a smoker or not. You can hear if somebody's got respiratory diseases. You can hear to some extent a person's body size, and then you can hear a person's gender. Again, with hormone treatments, for example, there's a lot of nuance to these ideas, but on some level, you can classify that. You can hear a person's age, relatively, like within ranges of age. And so just by speaking, you're giving away a lot of information. You can get a sense of whether somebody is lying or not lying.

You can get a sense of emotion. I mean, one of my favorite things since I've started becoming aware of this is just listening out for when people give presentations, myself included, and the voice is evidently nervous and you can hear that tremor in the voice. I mean, I've had that at moments in time where I thought, I wish I could just make this go away, but I can't.

And so quite often these are things that, unless you're really well trained at speaking, you don't necessarily have under control. And so the voice effectively, again, is personal information. We've started to get accustomed to this idea that companies track our websites.

And as we click and browse through the internet, maybe TikTok is really great at that. It's like, how long do you stay on a video? How quickly do you swipe away? Which ones do you go to afterwards? And so, it's not just networks of friends, which I guess was Meta's or Facebook's claim to fame, or claim to money, I guess.

It's also like increasingly behavioral data. And voice data to me is like, a massive bucket of data that many people don't consider as personal information. And so because of that, because your voice data really is such sensitive personal information that can tell so much about you, I feel very hesitant to say just like as a blanket statement, collect more voice data.

As long as governments and systems work in our favor, that might be a good idea. But if the tables turn, there's so much that, that I guess is revealed through our voice.

Justin Hendrix:

Are there other aspects of a research agenda, or things perhaps that governments should do, that should address the types of bias problems that you're discussing here? I am struck by the idea that this is a kind of strange cat and mouse. You know, you want to improve these technologies for their use and all the good use cases.

And yet, on the other hand, the forensics are never going to quite keep up. So you're also opening the door to lots of different types of threats as well.

Wiebke Hutiri:

One of the approaches that I'm personally quite a fan of, at least when it comes to the privacy side of things, and that was kind of also the starting point of my PhD research, is on-device processing.

And so on-device processing effectively means that you don't send the data to the cloud to be processed in some faraway data center, but that instead the computations are done directly on your device. Of course, there are always two parts to AI.

The one part is collecting data in order to train models. The second part is, once a model has been trained, having it deployed on a device and then being able to use it for predictions. So on-device processing for inference, for making predictions once a model has been trained, to me has, for many applications, massive potential for privacy by design.

And in those scenarios, if a consumer can trust a technology provider that the data is truly only processed on device, that is something that even as a skeptical critic I would feel totally okay to use. It doesn't quite cover the training.

So training completely privately, again, there's a lot of research on it, but it's much more challenging to do that. So there's still a question of how to get data to train models. I think, for example, I would potentially also be okay saying I'm going to give, like, a month of my voice data for training, but that's quite different to having my voice data sent and processed in the cloud every single day as I speak with the device.

So I think on-device processing is really something that consumers should be pushing for much more. And then again, together with that, it's not just the on-device processing, it's also having, I'll call it largely a regulatory environment in which there are, and I want to use the word quite loosely, standards.

If I buy a phone, I can see what its battery life is. I want to have the same kind of certainty and the same kind of insight, for example, into where and how my biometrics are processed. Now, in reality, with all the AI that's happening, for example, on phones, we might end up with a book this thick that becomes like the terms and conditions box that you always have to tick.

I don't think that's the way that we should go. We might need a book that's very thick, or rules that are long, for a disclosure that is thorough from a company perspective, but then we need to find ways of shortening that and letting consumers meaningfully engage with and critically question where the AI processing happens when they buy devices.

Yeah, no, it's tricky because of course it's an ongoing development. Quite often, some companies that care about privacy try to do better and better at it. And because things can be updated, you might not be able to specify it quite that well, but I definitely think we need to find ways of letting consumers have better insights into where, how, and how well data processing happens, and whether it has been tested for them.

Justin Hendrix:

Yeah, I suppose there's possibly a kind of analog to voice verification in particular. You know, lots of folks use Apple's face verification to open their phones. And the distinction, of course, between facial verification and facial recognition is lost on many people.

But as I understand it, Apple's face verification does occur on device and does not require essentially checking some central database of faces. It's really just working against the biometric that's been collected there on the device. So you can imagine something similar for voice.

Wiebke Hutiri:

Yeah, exactly. Exactly.

Justin Hendrix:

If you had a policymaker, you know, maybe on this call right now, what would be the most important thing you'd want them to know about your research, about the failings of these technologies at present, about what has to happen in future to ensure privacy and ensure that these systems work for all?

Wiebke Hutiri:

So something that is often on my mind: there are so many pressures that we experience in our day-to-day lives at the moment. And I'm quite aware that voice data for most people is probably the smallest of their concerns among the many things that they face on a day-to-day basis.

And so the question really becomes, and I think it's important with policy to always also ask this, how much citizens in their day-to-day lives should need to think about this because it could be used against them. And ideally, I guess, for many things I'd say citizens probably shouldn't need to think about this on a day-to-day basis, because of the many other concerns that they're dealing with.

And so the one thing that I think is important in the policy space is this idea of behavioral data in general. Personal information for a long time was always viewed as your address and, again, your gender and maybe your ethnicity and so on. But the idea is that all of that can be inferred from, for example, your voice.

In AI bias and fairness, there's a term that's often used, which is a proxy variable. A proxy variable is, for example, your zip code in the US. It says a lot about maybe your socioeconomic status and so on. And so I think the voice is a proxy attribute for a lot of different things.

And so that would maybe be the first thing to say: I think voice data should be recognized as a proxy attribute. And I think policymakers should think quite carefully about what that then means, maybe in relation to how other proxy attributes are treated.

So that's part one. Part two is, I think, you know, I'd like an environment where citizens would not have to deal with all these pretty challenging questions around how to deal with this influx of digital and data. On the other hand, I think transparency goes a long way.

So I'll bring two terms together. The one is this idea of meaningful consent. It's been quite a big topic of discussion in the EU, but meaningful consent should mean that you know what you're consenting to and what the risks are, and that you have, for example, the ability to refuse. You ought to have the right not to disclose.

So on the one hand, we want meaningful consent. On the other hand, we also need transparency from companies, and actually governments, about where they use this technology. And so if we come to voice at the moment, one of my favorite lines is the one you hear when you call a call center: "This call is being recorded for training and quality purposes."

And we're used to that. We've been listening to that for 20, 25, maybe 30 years. And to me, over the last three or four years, there was a moment where I started thinking, wait a moment. Who is training what? This is not disclosed. And strictly speaking, that statement could be used to train anything. So every time you call a call center, you don't have the choice.

If I call a government agency and that statement is there, I don't have a choice to say, no, thank you, I don't want my voice to be recorded for training and quality purposes. If I choose to put down the phone, then I can no longer access it. So I really think there's a big piece of work that needs to be done to, again, rethink voice as sensitive information, rethink voice as a proxy variable, but then also rethink how that ties into meaningful consent: what companies are allowed to record, when they are allowed to record, and how they need to treat that data and disclose that, I guess, at the end of the day, again, to consumers.

Justin Hendrix:

What are you going to do next? What is next for your research agenda?

Wiebke Hutiri:

Yeah, that's a good question. So at the moment I've moved a bit into the generative AI space, so speech generation, and I'm looking at ethical concerns and questions around bias in that. My research started in voice biometrics initially because no one else was working in that space. I found myself really surprised when I started my PhD. I thought I'd do a quick study in voice biometrics and move on, and then it turned out that no one was working in it. And so it kind of became a rabbit hole that I got lost in.

And similarly, with speech more broadly, I find the speech community lags many other areas of AI when it comes to investigating bias and fairness. We've seen some really great research being done on text-to-image generation, for example, but text-to-speech has received very little attention.

So that's what I'm working on at the moment: exploring, again, stereotyping and representation bias, and largely working on benchmarking and evaluation within that area. And beyond that, we'll have to see what's next.

Justin Hendrix:

Well, I thank you so much for joining me. And I hope we'll talk about these things again in future.

Wiebke Hutiri:

Awesome. Thanks so much for having me.


Justin Hendrix
Justin Hendrix is CEO and Editor of Tech Policy Press, a new nonprofit media venture concerned with the intersection of technology and democracy. Previously, he was Executive Director of NYC Media Lab. He spent over a decade at The Economist in roles including Vice President, Business Development & ...