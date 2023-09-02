Until recently, it would have sounded like science fiction: you join a video call, and on the screen is someone who lives on the other side of the world. This person speaks in Japanese, but through your headphones you hear the words in Spanish. The situation resembles the work of interpreters, who translate between languages in person or online. The big difference is that here no human is involved; an artificial intelligence (AI) translates the speech and delivers it in another language simultaneously.

Kudo, a company that has grown in the market by connecting language interpreters with corporate clients, has taken a step forward by adding a technology that performs simultaneous translation in online conferences. It does not translate into written sentences; it translates into voice, so participants in a video conference can listen to the translation as if an interpreter were present.

In a demonstration for EL PAÍS, Tzachi Levy, Kudo’s product manager, speaks in English while his speech is heard almost in real time in Spanish. Although the voice sounds robotic and the delay is slightly longer than with a human interpreter, the result is still striking. While a human interpreter typically lags 5-7 seconds behind the speaker, the AI lags by around 10.

The company has 20 corporate clients already using this functionality, which is being continually improved. The tool works on Kudo’s own video conferencing platform, and it is also integrated with Microsoft Teams, which is very popular in the corporate world.

Kudo emphasizes that in situations where 100% translation accuracy is required, a human interpreter will always be the best option. Levy gives the European Parliament sessions as an example: “Probably, artificial systems will not be used, but in smaller meetings, where there are no interpreters available at the time, this solution can be effective.”

Levy points out that the advancement of AI is inevitable and that progress originally thought to take 5 to 10 years has been achieved in a matter of months. The field is evolving so fast that, he estimates, within the next year AI could accurately achieve simultaneous translations in 90% of common situations.

Artificial and human intelligence

In June of this year, Wired compared Kudo’s technology with interpretation by human experts. The humans obtained much better results than the AI tool, mainly in handling the context of the speeches. Claudio Fantinuoli, Kudo’s head of technology and creator of the automatic translation tool, assures EL PAÍS that the model evaluated by the American outlet three months ago has already been improved by 25%. The next step in development is to integrate generative artificial intelligence to make the user experience more pleasant, so that the voice sounds more fluid and human and captures intonation.

One of the main challenges, according to Fantinuoli, is getting the AI to interpret the context of what is being said, the meaning a human reads between the lines. That challenge is still considerable, but results are improving “with large language models”, such as those behind conversational chatbots.

Fantinuoli, who is also a university professor and teaches young students who aspire to become professional interpreters, says he “sees no conflict” between AI and human training. He also maintains that the work of an expert will always be of higher quality. “I try to make them understand that robots are a reality in the market and that they have to be the best. The AI is driving them to be very good interpreters,” he adds.

One voice, many languages

One possibility on the near horizon is delivering the translation in the speaker’s own voice. Fantinuoli says that this is already technically feasible and that adding it to his company’s tool is a matter of a few months. Other companies have already tested using a single voice to play content in different languages, but not simultaneously. One example is the platform ElevenLabs, which brings content to life in 30 different languages from the same voice.

The process is simple: all you have to do is upload an audio file more than a minute long containing the voice you want to replicate. From this file, the tool reads aloud any text you choose, either in the original language or in one of the other available languages. The platform offers custom adjustments, such as fine-tuning the clarity of the reading or exaggerating the style of the voice according to preference. The output not only mimics the voice, but also captures and reflects distinctive nuances such as pitch, rhythm, stress, and intonation.

Meta recently launched a multimodal translation model that can perform speech-to-text, speech-to-speech, text-to-speech, and text-to-text translation for up to 100 languages, depending on the task. One of its promises is aimed at polyglot speakers, those who mix two or three languages in a single sentence. Mark Zuckerberg’s company claims that the model can discern the different languages in play and translate each accordingly. Although it still shows some small bugs with this feature, it works quite well when the phrase is expressed in a single language. The tool is freely available in its beta version.

Claudio Fantinuoli finds Meta’s new tool amazing and likens it to “the ChatGPT of spoken discourse”. “What they do is put all the models together, which can do many tasks at the same time. This is the future,” he concludes.

