With Google Translate, converting any sentence to over 100 languages is a snap, but anyone who uses it regularly knows there's room for improvement.
In theory, large language models (LLMs) like ChatGPT should usher in the next era of language translation. They consume vast volumes of text-based training data, plus real-time feedback from millions of users around the world, and quickly learn how to "speak" a wide range of languages with coherent, human-like sentences.
But we've heard the "ChatGPT is going to replace everything" refrain before, only to find it's often inaccurate, which is the worst-case scenario for translation. "We currently don't have empirical results supporting claims of chatty LLMs working better for translation," says Nazneen Rajani, research lead at Hugging Face, maker of AI-based Hugging Chat.
So, we decided to put ChatGPT to the test. Does it have the chops to replace Google Translate as the go-to translation service for travel, work, cross-border romance, and any other language needs? And how does it compare to its sister chatbots, Microsoft Bing and Google Bard?
Methodology and Languages Tested
We asked bilingual speakers of seven languages to do a blind test. All of them grew up speaking non-English languages, and now live in the US and/or work for American companies.
Given a paragraph in English, they ranked the translated version for their language by Google Translate, ChatGPT, and Microsoft Bing. Once they completed the exercise, we revealed which service produced each one.
- Languages Tested: Polish, French, Korean, Spanish, Arabic, Tagalog, Amharic
- Translation Services: Google Translate, Google Bard, ChatGPT, Microsoft Bing
This is by no means a comprehensive study. "Please consider that small blind tests are insufficient; more rigorous testing is needed to properly evaluate and compare these tools with statistical significance," says Federico Pascual, an AI industry veteran. Still, the results are surprisingly consistent, providing a fascinating glimpse into how AI models work.
Creating a Paragraph for Translation
With the languages and AI models selected, we crafted some paragraphs in English that would reveal the limits of each service's translation capabilities. The first included two tricky colloquialisms: "Blow off steam," meaning to relax after a stressful day, and "Cheers!" meaning, "Thanks!" It also had two measurements that would need to be converted in a real-life scenario: USD ($) and miles (as opposed to kilometers).
- Paragraph 1 - "Hello! Do you speak English? I need some help with directions. I am trying to find a vegetarian restaurant because my sister does not eat meat. What do you recommend? We also want to stay within a few miles of here, and don’t want to spend more than $50. If they have cocktails, that would be a bonus. We’ve had a long day of traveling and need to blow off some steam! You’re welcome to join us. Cheers!"
The second paragraph was more straightforward, with no idiomatic phrases or units of measurement, but it had more slang ("hooligans" and "pop champagne"). We only sent this one to the second half of participants in an attempt to widen the data collection as we refined the approach.
- Paragraph 2 - "How do I buy tickets to the boat party? Do we need to pay in advance, or can we buy them at the dock when we arrive? I need to be on the upper deck because sometimes I get seasick when I’m too close to the water. Also, I want to be as far away as possible from the young hooligans who want to pop champagne constantly during the voyage. That’s dangerous and not my kind of fun!"
Results: AI Chatbots Beat Google Translate
In most of the 12 examples we sent them, our participants preferred the AI chatbots (ChatGPT, Google Bard, or Microsoft Bing) to Google Translate. ChatGPT topped them all.
The table below contains our participants' ranking for each service. Those who received both paragraph examples are marked with a (1) and (2). The others only received the first.
"In my opinion, [ChatGPT] is the closest to a normal conversation," says Ana Romero, who ranked the Spanish translations. "The level of formality between the two key questions are consistent (informal) and the right translation of ‘to blow off steam’ is used."
Romero also appreciated that ChatGPT's translation gives the option to end certain words in the masculine or feminine, rather than selecting one for you. For example, it wrote: eres bienvenido/a a unirte a nosotros—"you are welcome to join us"—which would vary based on the gender of the speaker's invitee.
Google Bard rarely worked, and even told us, "I cannot translate languages." Instead, it recommends using Google Translate, likely an effort by Google to not cannibalize its own products. But we still tested it, and the three times it worked (Korean, French, Spanish), our participants ranked its results higher than Google Translate.
All the chatbots fell short of our high expectations for the currency and distance measurements in the first paragraph. Given their conversational nature and ability to ask follow-up questions, we hoped they would ask what currency to convert to, and if we preferred miles or kilometers.
Instead, they treated them the same way Google Translate did, making small adjustments, sometimes adding "USD" after $50 or going ahead and converting miles to kilometers. The handling was inconsistent across languages and services, and imperfect overall.
It All Comes Down to Mastering Nuance
It's called a 'cookie' in the US, but a 'biscuit' in the UK. (Credit: olligha / Getty Images)
A consistent pitfall for Google Translate was its literal interpretations. "It was the most 'word for word' translation among all three," says Emile Saad, who ranked the Arabic translations. "This caused it to miss some of the context. For example, 'pop' [as in champagne] was translated to 'doing fireworks.'"
In French, Google Translate kept the word "hooligans" in English, while the chatbots knew to go with the culturally appropriate slang voyous.
As it turns out, chatbots are designed to excel at nuance and context. For languages in which the models have a large body of source data, and more users interacting in that language, they can better identify cultural phrases and choose the most appropriate match in the target language.
"The secret sauce of chatbots like ChatGPT is RLHF, which is reinforcement learning with human feedback," says Hugging Face's Rajani. "[They] collect human preferences on model responses for dimensions such as truthfulness, harmlessness, helpfulness, etc. The human preferences help with selecting the ones that are more culturally appropriate, especially for non-native speakers."
A Google spokesperson tells PCMag that Bard and Google Translate have "different underlying technologies, so it's not surprising they might produce different outputs." Bard is a large language model designed to perform a variety of tasks, whereas Google Translate is optimized specifically for the task of translation.
"What matters is size; these models are the biggest and best models out there," says Pascual. "They are at the front line of the AI arms race. So it's unsurprising that they are even better at translating text than Google Translate, as Google Translate probably uses older technology, smaller models, [and are] probably optimized to run as quickly and cheaply as possible."
However, none of the four options was a one-to-one replacement for a fluent speaker. All the chatbots still suffered from awkward and inaccurate word choice at times; they just had fewer instances of it. For example, in Polish, Microsoft Bing translated "You're welcome to join us [at the restaurant]" as "Zapraszamy Cię do nas," which is actually an invitation to "come to my house," says Barbara Pavone, PCMag's senior manager of content distribution.
Traditional Ethiopian bowls (Credit: Evgenii Zotov / Getty Images)
In our test, participants for two languages ranked Google Translate at the top: Tagalog (Philippines) and Amharic (Ethiopia). These have the smallest estimated global populations of native speakers among the languages we tested: Tagalog has 33 million speakers who claim it as their mother tongue, and Amharic has 25 million, according to WorldData.info. (By comparison, Spanish has 450 million and Korean has 80 million.)
"[AI models] wouldn't generalize well for languages with low resources or for which enough human preferences were not collected," Rajani says. For Amharic and Tagalog, we suspect the chatbots lacked enough data to make a nuanced response that fit the context of the paragraph. Instead, they appeared more literal than Google Translate, the opposite of what we saw for the other languages.
Colin Salao, who ranked the Tagalog translations, noted that ChatGPT used words that are "super formal," and reserved for public announcements. He found Bing to be "the most literal translation," and ranked it lower compared to ChatGPT and Google Translate.
Microsoft Bing struggled even more with Amharic, leaving a portion of each paragraph in English. This was the only time any of the services failed to attempt a translation, even counting other languages with non-Latin scripts, like Korean and Arabic:
- Paragraph 1 - ሰላም! እንዴት እንደሚናገሩ እንደሆነ እንዲህ ብለው ጠየቁ? በመጠን የተመረጡ መኪና ቤት የተጠቀሱ ምግቦች ይህ መሆኑ ስለ መጠየቅ ይፈልጋሉ? እኔ በ $50 ብቻ መጠቀም እና የ cocktails ይጠቀማ? ከ 2-3 ሜ. We’ve had a long day of traveling and need to blow off some steam! You’re welcome to join us. Cheers!
- Paragraph 2 - እንዴት መገልገያ ይጠቀማል? እንዴት እንደሚከተሉ መጠቀም እና የ dock ስር ይጠቀማል? በ መጠን የ upper deck ይደርሳል እና በ ግራ ተጨማሪ የ champagne መጠጥ የ young hooligans ከ ተጨማሪ በ ቀን ይጠቀማ? That’s dangerous and not my kind of fun!
AI Will Level Up Web Translation
For any summer travels or other language needs, ChatGPT might be a better choice than Google Translate. Plus, its new iOS app makes it even more accessible. But as we saw with Amharic and Tagalog, chatbots are not yet a full replacement for old standbys.
However, with more training data in each language, AI models have the potential to surpass Google Translate's capabilities across the board. "We are excited about the potential of LLMs and how they can be incorporated into our products," Google tells PCMag.
Google is also testing a new search results page, dubbed search generative experience (SGE). It's set to launch on Google.com at an undisclosed date, and will offer a paragraph-based, ChatGPT-style answer to queries. But Google stresses that Bard and SGE are experimental, and did not comment on whether they may replace Google Translate in the future.
Before this can happen, Google must have a more definitive way to measure chatbots' translation capabilities, and to prove they're better than Google Translate's. More broadly, all chatbots should be able to interact in a wide range of languages, like Amharic, to keep the future of the web accessible and as 'World Wide' as possible.
"All these [AI] systems are black boxes and don't share specific information on how they were built, which data was used for training, etc," says Pascual. "We are just starting to see what these huge models can do, and it's equally exciting and terrifying!"