In 2019, the director of a British company fell victim to a scam after receiving a fake voice message from his manager requesting the transfer of 220,000 euros to a supplier. A year later, a bank manager in Hong Kong received a phone call from someone who sounded familiar. Based on their existing relationship, the banker wired $400,000 before he realized something was wrong. These are isolated examples, but such cases are becoming more and more frequent. Both involve the use of deepfake technology to clone voices, an extremely sophisticated way of manipulating content. Identifying it is a significant challenge that will become increasingly difficult as artificial intelligence advances. And the news is not good. While some computational tools can detect fake voices with some degree of accuracy, they fool humans, even when people are trained to spot them.

A study of 529 people, published today in PLOS ONE, shows that humans are poor at judging whether a voice message is fake or real. Participants failed about one in four times when trying to detect these voice deepfakes, and efforts to train them had minimal effect. Half of the group received prior training, in which they listened to five examples of synthesized speech. Despite this, their improvement was only 3% compared with the untrained half.

The researchers, from University College London in the United Kingdom, also wanted to understand whether the challenge was easier or harder depending on the characteristics of different languages, so they conducted the tests in English and Mandarin. The findings suggest that detection abilities are equivalent and that both groups relied on similar attributes when rating the authenticity of the messages, such as naturalness and whether the voice sounded robotic. “Incorrect pronunciations and unusual intonations in sound clips were commonly mentioned by both English and Mandarin-speaking participants when making decisions,” explains Kimberly Mai, lead author of the study.

More subjective than visual

Interestingly, the participants mentioned the same characteristics regardless of whether their answer was correct. Mai explains that this is due to the subjectivity involved in audio. Unlike the detection of visual deepfakes, where objects and settings can be inspected to judge authenticity, the auditory nature of speech makes perceptions more subjective. “When you look at potentially fake people, you can count the number of fingers on their hands or check whether their accessories match,” says the postdoctoral researcher at the British university.

To compare human and technological capabilities, the researchers also ran the same test with two automated detectors. The first was software trained on a database outside the study, which reached 75% accuracy, a figure similar to the human responses. The second, trained on the original and synthesized versions of the voice, identified the nature of the audio with 100% accuracy. According to Mai, the better performance arises because advanced programs can pick up acoustic subtleties that a person cannot.

Complex sounds, like human speech, contain a mixture of different frequencies, a frequency being the number of times a sound wave repeats itself in one second. “Automatic detectors examine thousands of voice samples during their training phase. Through this process, they can learn about peculiarities in specific frequency levels and irregularities in rhythm. Humans are incapable of decomposing sounds in this way,” says the researcher.
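To illustrate what decomposing a sound into frequencies means, here is a minimal sketch, not the detectors used in the study. It builds an artificial tone from two sine waves (220 Hz and 440 Hz, values chosen purely for illustration) and recovers them with a naive discrete Fourier transform. The function name `dominant_frequencies` and all parameters are assumptions made for this example.

```python
import cmath
import math


def dominant_frequencies(samples, sample_rate, count=2):
    """Return the `count` strongest frequencies (in Hz) found by a naive DFT."""
    n = len(samples)
    magnitudes = []
    for k in range(n // 2):  # only frequency bins below the Nyquist limit
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(samples)
        )
        # Bin k corresponds to a frequency of k * sample_rate / n hertz.
        magnitudes.append((abs(coeff), k * sample_rate / n))
    strongest = sorted(magnitudes, reverse=True)[:count]
    return sorted(freq for _, freq in strongest)


# Illustrative signal: 0.1 s of a 220 Hz tone plus a quieter 440 Hz tone.
SAMPLE_RATE = 8000  # samples per second (assumed value)
N = 800
tone = [
    math.sin(2 * math.pi * 220 * t / SAMPLE_RATE)
    + 0.5 * math.sin(2 * math.pi * 440 * t / SAMPLE_RATE)
    for t in range(N)
]

print(dominant_frequencies(tone, SAMPLE_RATE))
```

Real speech is far messier than this two-tone signal, and practical systems would normally use an optimized fast Fourier transform rather than this slow loop, but the principle of separating a waveform into its component frequencies is the same.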

While automated detectors have proven more effective than humans at this task, they also have limitations. First, they are not accessible for everyday use. In addition, their performance drops when the test audio changes or when the environment is noisy. But the biggest challenge is keeping up with advances in generative artificial intelligence, since increasingly realistic synthesized content is being produced ever more quickly. Where hours of recordings were once needed to train a program, for example, a few seconds now suffice.

Fernando Cucchietti, an expert not involved in the study, stresses that the results have some limitations, since the conditions of the experiments “are very laboratory-based” and do not reflect the everyday threats posed by this type of technology. “They are not realistic for situations where deepfakes can be problematic, for example, if you know the person they are imitating,” says the head of the Data Analysis and Visualization group at the Barcelona Supercomputing Center in statements to the Science Media Center Spain. Despite this, Cucchietti notes that the conclusions are consistent with those of similar studies and that, because the environment is fairly controlled, “the results are less affected by other factors, for example, previous prejudices or biases, as in the case of studies of disinformation.”

Avoid scams

At the individual level, people are unreliable at detecting voice deepfakes. However, the research shows that pooling the opinions of several individuals and deciding by majority vote improves detection. Kimberly Mai recommends: “If you hear an audio clip that you’re not sure about because the content seems unusual, for example if it involves a request to transfer a large amount of money, it’s a good idea to discuss it with others and verify the source.”
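A rough sketch of why a majority vote can help, under strong assumptions that are not taken from the study: every listener is right with the same probability (0.75 here, matching the roughly one-in-four failure rate reported above) and listeners err independently of one another. The function name `majority_accuracy` is hypothetical.

```python
from math import comb


def majority_accuracy(p, n):
    """Probability that a majority of n independent listeners is correct.

    Assumes each listener is right with probability p and that their
    errors are independent, which is an idealization.
    """
    if n % 2 == 0:
        raise ValueError("use an odd number of listeners to avoid ties")
    needed = n // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(needed, n + 1)
    )


for listeners in (1, 3, 5):
    print(listeners, round(majority_accuracy(0.75, listeners), 4))
```

In practice listeners tend to be fooled by the same clips, so their errors are correlated and the real gain is likely smaller than this idealized calculation suggests.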

Mai suggests that the route to improving automated detectors is to make them more robust to differences in test audio. According to her, her team is working to adapt foundation models that have worked in other fields, such as text and images. “Since those models use large amounts of data for training, you would expect them to better generalize to variations in test sound clips,” she stresses. In addition, she believes that institutions have an obligation to get involved. “They must prioritize the implementation of other strategies, such as regulations and policies, to mitigate the risks arising from voice deepfakes,” she argues.

