Until recently, it would have sounded like science fiction. Imagine making a video call to someone who lives on the other side of the world. This person speaks Japanese, but through your headphones, you hear their words in English. It is like having a live interpreter on the line, except that no human is involved: the simultaneous interpretation is provided by artificial intelligence (AI).
Kudo, a company that has grown in the market by connecting interpreters with corporate clients, has taken a step forward by adding a technology that performs simultaneous interpretation in online conferences. Its job is not to translate written sentences, but rather to carry out voice translation, allowing participants in a video conference to hear the translation as if an interpreter were present.
In a demonstration for EL PAÍS, Tzachi Levy, Kudo’s product manager, speaks in English while his words are interpreted into Spanish almost in real time. Although the voice sounds robotic and there is a slight delay compared to a human interpretation, the result is still surprising: while a human interpreter typically lags the speaker by five to seven seconds, the AI takes around 10.
The company has 20 corporate clients already using this service, which is still being improved. The tool works on Kudo’s own video conferencing platform, but is also integrated with Microsoft Teams, which is popular in the corporate world.
The company explains that in situations where 100% translation accuracy is required, a human interpreter will always be the best option. Levy gives the example of European Parliament sessions: “Artificial systems will probably not be used, but in smaller meetings, where there are no interpreters available at the time, this solution can be effective.”
Levy argues that the advance of AI is inevitable, and that progress originally expected to take five to 10 years has been achieved in a matter of months. The field is evolving so quickly that, he estimates, within the next year AI could deliver accurate simultaneous interpretation in 90% of common situations.
Artificial and human intelligence
In June of this year, Wired compared Kudo’s technology with interpretation performed by human experts. The humans obtained significantly better results than the AI tool, particularly in understanding context. Claudio Fantinuoli, Kudo’s head of technology and the creator of the automatic translation tool, tells EL PAÍS that the model Wired evaluated three months ago has since improved by 25%. The next step in development is to integrate generative artificial intelligence to make the user experience more pleasant: a voice that sounds more fluid and human, and that captures intonation.
One of the main challenges, according to Fantinuoli, is getting AI to interpret the context of the narrative; in other words, to read between the lines. The challenge remains considerable, but progress is being made thanks to “large language models,” such as the one behind ChatGPT.
Fantinuoli, who is also a university professor and teaches students aspiring to become professional interpreters, says he sees “no conflict” between AI and human training. What’s more, he believes human interpreters will always be of higher quality. “I try to make them [his students] understand that robots are a reality in the market and that they have to be at the top of their game,” he says. “AI is driving them to be very good interpreters.”
One voice, many languages
Another option set to appear in the near future is interpretation delivered in the speaker’s own voice. Fantinuoli says this is already technically feasible, and that it will be integrated into the company’s service in a matter of months. Other companies have already tested using a single voice to deliver content in different languages, though not simultaneously. One of them is ElevenLabs, whose platform can read content aloud in 30 different languages with the same voice.
The process is simple: the user uploads an audio recording, more than a minute long, of the voice they want to replicate. From this file, the tool can read any text aloud, either in the source language or in other available ones. The platform lets the user make custom adjustments, fine-tuning the clarity of the reading or even exaggerating the style of the voice, according to their preferences. The program not only imitates the voice, but also captures and reflects distinctive nuances, such as tone, rhythm, accent and intonation.
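For readers curious about what that workflow looks like in practice, here is a minimal sketch in Python against ElevenLabs’ public REST API. It is an illustration, not the platform’s internal code: the endpoint paths, the eleven_multilingual_v2 model identifier and the voice-settings fields follow the public documentation at the time of writing and should be treated as assumptions to verify against the current docs.

```python
# Illustrative sketch only: endpoints, fields and the model ID follow
# ElevenLabs' public API docs at the time of writing and may have changed.
import requests

API_KEY = "YOUR_ELEVENLABS_API_KEY"  # placeholder credential
BASE = "https://api.elevenlabs.io/v1"
HEADERS = {"xi-api-key": API_KEY}

# Step 1: upload an audio sample (over a minute long) of the voice to clone.
with open("voice_sample.mp3", "rb") as sample:
    response = requests.post(
        f"{BASE}/voices/add",
        headers=HEADERS,
        data={"name": "my-cloned-voice"},
        files={"files": sample},
    )
voice_id = response.json()["voice_id"]

# Step 2: have the cloned voice read a text aloud. The multilingual model
# infers the language from the text itself, so the same voice works for
# any supported language.
tts = requests.post(
    f"{BASE}/text-to-speech/{voice_id}",
    headers=HEADERS,
    json={
        "text": "Hola, esta es mi voz hablando en español.",
        "model_id": "eleven_multilingual_v2",
        # The article's "custom adjustments": stability trades consistency
        # against expressiveness; style exaggerates the speaker's delivery.
        "voice_settings": {"stability": 0.5, "similarity_boost": 0.8, "style": 0.3},
    },
)
with open("output_es.mp3", "wb") as out:
    out.write(tts.content)  # the API returns raw audio bytes
```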
Recently, Meta launched a multimodal translation model that can perform speech-to-text, speech-to-speech, text-to-speech and text-to-text translations for up to 100 languages, depending on the task. This could be useful for polyglot speakers who mix two or three languages in a single sentence: Meta claims the model can discern the different languages at play and produce the corresponding translations. While it still makes some small errors, it works quite well when the sentence is expressed in a single language. The tool is freely available in a beta version.
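As a rough illustration of what “one model, several tasks” means, here is a short sketch assuming the model in question is Meta’s SeamlessM4T, as later published on Hugging Face; the class names follow the transformers library’s integration, which may evolve, and none of this is Meta’s own tooling.

```python
# Illustrative sketch, assuming Meta's model is the SeamlessM4T checkpoint
# published on Hugging Face; class names follow the transformers integration.
import scipy.io.wavfile
from transformers import AutoProcessor, SeamlessM4TModel

checkpoint = "facebook/hf-seamless-m4t-medium"
processor = AutoProcessor.from_pretrained(checkpoint)
model = SeamlessM4TModel.from_pretrained(checkpoint)

# Text-to-text: translate an English sentence into Spanish.
inputs = processor(text="The meeting starts in ten minutes.",
                   src_lang="eng", return_tensors="pt")
tokens = model.generate(**inputs, tgt_lang="spa", generate_speech=False)
print(processor.decode(tokens[0].tolist()[0], skip_special_tokens=True))

# Text-to-speech: the same model and inputs can produce spoken Spanish.
audio = model.generate(**inputs, tgt_lang="spa")[0].cpu().numpy().squeeze()
scipy.io.wavfile.write("translation_es.wav", 16000, audio)  # 16 kHz output
```

Speech inputs work the same way: passing an audio array to the processor instead of text is what enables the speech-to-text and speech-to-speech tasks from a single model.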
Claudio Fantinuoli says Meta’s new tool is surprising, comparing it to “the ChatGPT of spoken discourse.” “What they do is put together all the models, which can do many tasks at the same time. This is the future,” he says.