Friday, June 3, 2022

Pushing Boundaries: Literature in Translation 2022 - Publishers Weekly - Translation

As international writers expose themselves to considerable danger to protest injustice, look back on historical repression, or express radical ways of living in the world, the role of the translator to effectively render a story into another language grows ever more crucial.

“Translation is an art, but it’s also a science,” says Max Lawton, who translates Russian dissident writer Vladimir Sorokin. “You have to liberate the language from the source text and create something new.”

PW spoke with Lawton and other translators about how their work shines a light on issues of oppression and on our shared humanity.

International datelines

Sorokin’s Telluria (NYRB, Aug.), which PW’s review called “hypnotic and hallucinatory,” takes place in a fractured future of small states and principalities that arise after Russia splinters. The novel, which sets elements of speculative fiction against a feudal backdrop, typifies Sorokin’s defiance of convention and, according to Lawton, also demonstrates the subtlety of Sorokin’s political thought. “Sorokin’s work is political in a nuanced way, not polemical,” Lawton says. “If you’re transparent, the state likes you. They can see right through you. There is this sort of nontransparency in the language that Sorokin uses as an anti-authoritarian tactic. It’s a political choice as well as an aesthetic one.”

Translating Telluria has taken on new meaning for Lawton in the current political climate. “Telluria, on one hand, is a dystopia, but I actually think it’s much more of a utopia,” he says. “I didn’t understand this side of the book until this whole nightmare started in Ukraine, but the idea of Russia crumbling down into tiny nation-states that are all different is a utopian idea from the perspective of Russia’s history of territorial land grabbing.”

Similarly, Carol Apollonio, who translated the Russian political satire cum murder mystery Offended Sensibilities by Alisa Ganieva (Deep Vellum, Nov.), says the war in Ukraine has profoundly affected her translation work. “The world turned upside down February 24,” says Apollonio, who has translated two previous novels by Ganieva. “We’re all reeling in shock and anguish and horror. Alisa has left the country. She can’t stay there.” The novel examines political corruption and censorship in an unnamed provincial town, which Apollonio views as a stand-in for all of Russia. “The lies Alisa exposes in the book are so relevant,” she says.

Apollonio places Ganieva’s novel in a Russian literary tradition of political opposition. “Internal dissidents and thoughtful truth tellers like Alisa need to be respected because they’re putting their lives on the line. That’s an important political message we always get from Russian literature.”

In The Censor’s Notebook (Seven Stories, Oct.), Romanian author Liliana Corobca also probes political repression. Framed around a former censor who donates a stolen notebook from 1974 to a new museum of communism, the novel opens a window onto the secretive world of censorship during that era in Romania. “The story is so multilayered,” says translator Monica Cure. “It’s a kind of hodgepodge of notes the censor has taken, fragments of different novels and poems, all made up by the author but presented as if they were found.” Cure, whose grandfather was imprisoned and who herself was a refugee from the Communist regime in Romania, adds that translating the novel gave her a deeper appreciation for the importance of protecting freedom of expression. “It’s healthy to be able to talk about censorship, which is basically the illegitimate silencing of voices. Legitimacy is what we as a community decide on. What we get to do in a democracy is decide on what voices have been wrongly silenced.”

Writers often pay a huge price for work that criticizes or even questions the state’s authority. Uyghur writer Perhat Tursun’s novel The Backstreets (Columbia Univ., Sept.) follows a Uyghur migrant’s journey to an ethnically Han Chinese majority city. “It’s a story of alienation and a descent into madness,” says translator Darren Byler, an anthropologist who began working in Northwest China in 2011, met Tursun, and started translating the novel. To protect Tursun and his Uyghur cotranslator, Byler held off publishing, but when his cotranslator was disappeared and Tursun was arrested and sentenced to 16 years’ imprisonment as part of China’s mass internment of members of ethnic minorities, Byler decided the time was right. “Translating and finding a publisher for this work was an obligation I felt toward Tursun,” he says. “He had given me a lot. He’d become my friend.”

Byler hopes readers appreciate the book’s wisdom and insight. “It’s conveying something about life for Uyghurs and a Uyghur sensibility, a way of seeing the world that I think speaks to what it means to be human.”

Translator Paul G. Starkey was also moved to bring attention to an underrepresented cultural and literary tradition, with Hammour Ziada’s The Drowning (Interlink, Sept.), a historical novel set in 1968 Sudan. It’s “a vivid portrayal of social relations in a repressive and tightly regulated traditional Sudanese village,” Starkey says. “There is very little Sudanese literature available in English translation, so I saw this as a chance to make a hopefully interesting contribution to what’s available.” In the context of the profound upheavals that have taken place and are currently taking place in the wider Arabic-speaking world, the novel offers readers a portrait of those who live in the region. “I hope that it may help to convey some sense of empathy with people in situations and conditions vastly different from our own,” Starkey says.

Love languages

The process of working so closely with a text can transform how a translator identifies with the story. In Concerning My Daughter (Restless, Sept.), by South Korean novelist Kim Hye-Jin, a lesbian activist and her partner begin living with her more conservative mother. Translator Jamie Chang was initially skeptical of the mother-daughter relationship. “I found myself thinking, why are you moving in with your mother and why are you taking your partner with you? This is going to be too much for her,” Chang says. “But that’s the thing about good stories. You put characters with strong opinions, strong feelings, and strong bonds in a pressure cooker and see what happens.” The result is a story of connection that crosses generations. “This is the kind of book you could read with your mother if you’re just coming out,” she adds. “By the time I finished I felt fully convinced it’s possible for a Korean lesbian and their partner to get along with their elderly mom. The solidarity between these women feels so realistic.”

Another novel pushing against convention is Hugs and Cuddles (Two Lines, Oct.), by Brazilian novelist João Gilberto Noll. Edgar Garbelotto, who translated two Noll novels prior to this one, has always been attracted to the ambition of Noll’s vision. “I was totally fascinated by the way he was writing, the language he was using—even thematically by the places he was going,” Garbelotto says. The novel follows a narrator who, inspired by a formative sexual experience in his youth, abandons his former life and sets out on a quest for personal liberation. In the process the narrator rejects his social standing, the constraints of his gender, and the sexual mores that previously inhibited him.

Noll’s liberatory message inspired Garbelotto. “I think one of the main motivations for a translator is your desire to share work that impressed you so much,” he says. “I hope readers can experience through Noll’s incredible prose this adventure of living freely, leaving what is expected from capitalism and society behind. We can live that experience through Noll’s characters. That’s the power of literature.”

Stories from unfamiliar cultural contexts can nonetheless resonate with readers. The Impatient (HarperVia, Sept.), by Cameroonian writer Djaili Amadou Amal, follows three women struggling to free themselves from forced marriage, polygamy, and abuse in a Cameroonian village. The novel’s critique of this atmosphere of sexual control and violence struck translator Emma Ramadan as universally relevant. “Women everywhere have this experience of being forced into situations and faced with misogyny and a toxic patriarchy,” Ramadan says. “We can be both interested in understanding what’s happening in other places and also use this novel as an opportunity to reflect on what’s happening in our communities. That’s the beauty of translation.”

The story of the women also highlighted for Ramadan the importance of finding unity in a common cause. “One of the lessons of the book is that we as women—and really any oppressed group—are stronger when we bind together,” she notes.

The Impatient and other novels discussed here present readers with an opportunity to find common ground with cultures and societies from around the world. “Our role as translators is not just to bring something over but also to allow a conversation to happen here about it,” Ramadan says. The goal of translation, she explains, is to have “a conversation that doesn’t foreignize or other or distance, but that brings this story home and forces us to examine ourselves in the ways we’re examining the characters in a book.”

Matthew Broaddus is a poet and associate poetry editor at Okay Donkey Press.

Read more from our Literature in Translation feature.

Identity Papers: Literature in Translation 2022
These new translated works of fiction challenge questions of identity and what it means to belong.

The Language of the Body: PW Talks with Stephanie McCarter
In her forthcoming translation of Ovid’s 'Metamorphoses' (Penguin Classics, Sept.), classicist McCarter renders the poet’s concern with questions of power, violence, and gender intelligible to a contemporary audience.

A version of this article appeared in the 06/06/2022 issue of Publishers Weekly under the headline: Pushing


How to use Gboard’s Personal Dictionary - Phandroid - News for Android - Dictionary

The problem with autocorrect on our virtual keyboards is that it is sometimes too smart for its own good. If you use a lot of slang or mix words from one language into another, the keyboard may “correct” them, and it quickly gets annoying having to go back and fix those edits.

If you use Google’s Gboard app, then there’s actually a Personal Dictionary feature that lets you add your own custom words so that Google knows that it is an intended word and won’t try to correct it. It can even go one step further where you can create word shortcuts so that when you type in “ttyl”, for example, it will be expanded to “talk to you later”.

Add words to Gboard’s Personal Dictionary

  1. Download and install Gboard if you haven’t already
  2. Launch any app that might require keyboard input
  3. Tap the Settings icon in Gboard
  4. Tap on Dictionary
  5. Tap on Personal Dictionary and select your language
  6. Tap the + icon at the top right corner of the app
  7. Type in the word you want to add. If you’re looking to create a shortcut, type in the shortcut that would trigger the word
  8. Tap Back and you’re done

Now whenever you type that word, Gboard won’t try to correct you. If you’re using a shortcut, Gboard will show the text replacement as a suggestion rather than applying it automatically. That’s a good thing: you might not want the expansion in every situation, so leaving the choice to the user is the right call.
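Conceptually, the shortcut feature boils down to a lookup that offers, rather than applies, a replacement. Here is a minimal sketch in Python (the dictionary contents and function name are invented for illustration; this is not Gboard's actual code):

```python
# A toy sketch of shortcut expansion that suggests rather than auto-replaces.
# The entries below are hypothetical examples of a user's personal dictionary.
PERSONAL_DICTIONARY = {
    "ttyl": "talk to you later",
    "brb": "be right back",
}

def suggest_expansion(typed_word):
    """Return an expansion as a suggestion, or None if the word is unknown.

    The caller (the keyboard UI) decides whether to apply the suggestion;
    nothing is replaced automatically, mirroring the behavior described above.
    """
    return PERSONAL_DICTIONARY.get(typed_word.lower())

print(suggest_expansion("ttyl"))   # a suggestion is offered
print(suggest_expansion("hello"))  # no suggestion; the word is left alone
```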

If you don’t see the changes applied to Gboard yet, you might need to refresh the app by closing it and reopening it again.


Historic Crow language print dictionary released after years of development - Q2 News - Dictionary

CROW AGENCY — Members of the Crow Tribe from all over the state gathered at Little Big Horn College Friday to celebrate the historic release of a Crow language print dictionary.

“I’m all for revitalizing the Crow language,” said Crow tribal member Velma Pretty On Top Pease.

Hundreds of community members who contributed to the most comprehensive Crow language dictionary ever released were honored at a Friday event.

“Studies have shown that if students know their culture, their language, it develops a strong sense of identity,” said Pretty On Top Pease.

Pretty On Top Pease assisted in the recording and rapid word collection of the dictionary. Crow is her first language, but she says that’s not the case with later generations.

“The next generation, the numbers dropped drastically,” said Pretty On Top Pease.

The dictionary consists of over 11,000 Crow words and will be used as a tool for future generations to ensure the language endures. This project has been in the works for nearly a decade.

“What this means is that the Crow language is one of the best-documented Native languages in North America,” said Dr. Timothy McCleary, co-editor of the Crow dictionary.

McCleary has been in the community for 30 years and knows the language, but he says he’s not fluent.

“The major difference between Crow and English is that Crow is what’s called a tonal language, so much like a number of Asian languages,” said McCleary.

None of this would have been possible without the Crow Language Consortium, a collective of Crow schools, colleges, and educators working to preserve the language.

Cyle Oldelk helped translate Crow words into English for the dictionary. He’s been working on this project since 2015 and Crow was his first language.

“When I was in Head Start, we were told not to speak Crow, and then now it’s to a point where they’re encouraging it. I think it’s really, it’s got to happen for our language to survive,” said Oldelk.

Now, future generations of Crow tribal members will have a resource to keep their culture alive.


Thursday, June 2, 2022

Mozilla brings free, offline translation to Firefox - TechCrunch - Translation

Mozilla has added an official translation tool to Firefox that doesn’t rely on cloud processing to do its work, instead performing the machine learning–based process right on your own computer. It’s a huge step forward for a popular service tied strongly to giants like Google and Microsoft.

The translation tool, called Firefox Translations, can be added to your browser here. It will need to download some resources the first time it translates a language, and presumably it may download improved models if needed, but the actual translation work is done by your computer, not in a data center a couple hundred miles away.
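The download-on-first-use behavior is a standard lazy-loading pattern. Here is a rough Python sketch of that flow (the class, file layout, and stand-in "model" are all invented for illustration; the real Bergamot engine inside Firefox Translations works very differently):

```python
import os
import tempfile

class OfflineTranslator:
    """Hypothetical sketch: fetch a model once per language pair, cache it
    on disk, and run all later "inference" locally with no network access."""

    def __init__(self, cache_dir):
        self.cache_dir = cache_dir
        self.downloads = 0   # instrumentation: count simulated network fetches
        self._loaded = {}    # (src, dst) -> in-memory model

    def _model_path(self, src, dst):
        return os.path.join(self.cache_dir, f"{src}-{dst}.model")

    def _download(self, src, dst):
        # Stand-in for a real network fetch of model weights.
        self.downloads += 1
        with open(self._model_path(src, dst), "w") as f:
            f.write(f"weights for {src}->{dst}")

    def _ensure_model(self, src, dst):
        pair = (src, dst)
        if pair not in self._loaded:
            path = self._model_path(src, dst)
            if not os.path.exists(path):
                self._download(src, dst)  # happens only on first use
            with open(path) as f:
                self._loaded[pair] = f.read()
        return self._loaded[pair]

    def translate(self, text, src, dst):
        model = self._ensure_model(src, dst)
        # A real system would run neural inference here, entirely on-device.
        return f"[{model}] {text}"

with tempfile.TemporaryDirectory() as d:
    t = OfflineTranslator(d)
    t.translate("Hola", "es", "en")
    t.translate("Adios", "es", "en")
    print(t.downloads)  # the model was fetched only once
```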

This is important not because many people need to translate in their browsers while offline — like a screen door for a submarine, it’s not really a use case that makes sense — but because the goal is to reduce reliance on cloud providers with ulterior motives for a task that no longer requires their resources.

It’s the result of the E.U.-funded Project Bergamot, which saw Mozilla collaborating with several universities on a set of machine learning tools that would make offline translation possible. Normally this kind of work is done by GPU clusters in data centers, where large language models (gigabytes in size and with billions of parameters) would be deployed to translate a user’s query.

But while the cloud-based tools of Google and Microsoft (not to mention DeepL and other upstart competitors) are accurate and (due to having near-unlimited computing power) quick, there’s a fundamental privacy and security risk to sending your data to a third party to be analyzed and sent back. For some this risk is acceptable, while others would prefer not to involve internet ad giants if they don’t have to.

If I Google Translate the menu at the tapas place, will I start being targeted for sausage promotions? More importantly, if someone is translating immigration or medical papers with known device ID and location, will ICE come knocking? Doing it all offline makes sense for anyone at all worried about the privacy implications of using a cloud provider for translation, whatever the situation.

I quickly tested out the translation quality and found it more than adequate. Here’s a piece of the front page of the Spanish language news outlet El País:

Image Credits: El País

Pretty good! Of course, it translated El País as “The Paris” in the tab title, and there were plenty of other questionable phrasings (though it did translate every | as “Oh, it’s a good thing” — rather hilarious). But very little of that got in the way of understanding the gist.

And ultimately that’s what most machine translation is meant to do: report basic meaning. For any kind of nuance or subtlety, even a large language model may not be able to replicate idiom, so an actual bilingual person is your best bet.

The main limitation is probably a lack of languages. Google Translate supports over a hundred — Firefox Translations does an even dozen: Spanish, Bulgarian, Czech, Estonian, German, Icelandic, Italian, Norwegian Bokmal and Nynorsk, Persian, Portuguese and Russian. That leaves out quite a bit, but remember this is just the first release of a project by a nonprofit and a group of academics — not a marquee product from a multi-billion-dollar globe-spanning internet empire.

In fact, the creators are actively soliciting help by exposing a training pipeline to let “enthusiasts” train new models. And they are also soliciting feedback to improve the existing models. This is a usable product, but not a finished one by a long shot!


Wednesday, June 1, 2022

“Never say die” and dictionaries for the living | OUPblog - OUPblog - Dictionary

Last week, I promised to write something about idioms in dictionaries and on that note finish my discussion of English set phrases (unless there are questions, suggestions, or vociferous cries for more). Where do you find the origin and, if necessary, the meaning of never say die, never mind, and other phrases of this type? Should you look them up under never, say, die, or mind? Will they be there? And who was the first to say those memorable phrases? Nowadays, people search for answers on the Internet, but the Internet does not generate knowledge: it only summarizes the available information and various opinions. We also wonder: Is never say die an idiom? Never mind probably is.

The oldest genres of idioms are proverbs (a friend in need is a friend indeed) and so-called familiar quotations (more in sorrow than in anger), neither of which has been at the center of my interest. The Greeks and especially the Romans produced memorable phrases the moment they began to speak. Life is short, art is long. Good friends cannot be bought. A water drop hollows a stone. As long as I breathe, I have hope. How true! Excellent dictionaries of such phrases (“familiar quotations”) exist, but, as I have noted, they will not concern us today. We are returning to the likes of the phrases I have cited more than once: to kick the bucket, in apple-pie order, to go woolgathering, not room enough to swing a cat, mad as a hatter, and so forth. Dictionaries of idiomatic phrases are many. The best of them explain the meaning of such outwardly incomprehensible locutions, sometimes quote the books in which they occur, and explain their origin if something is known about that subject, but most focus on meaning and usage.

Not everybody goes woolgathering.
(Image by M W from Pixabay, public domain)

General (all-purpose) dictionaries like Webster’s and the OED include set phrases as a matter of course, but, though they offer the user the etymology of words (even if all they can say is “origin unknown”), idioms often remain without any historical notes. My prospective dictionary, though a rather thick book, contains slightly more than a thousand idioms (a drop, a pretty heavy drop, in the bucket, or, as they say in German, a drop on a hot stone), but its purpose is to sift through all the existing conjectures about the origin of each item and support, if possible, the most reasonable one. Its main merit is the critique of multifarious conjectures, some of which are excellent, and some are downright stupid. As I said in the previous post, no etymological algebra is needed here. Try to find out whether hatters were ever mad, who tried to swing a cat and failed for want of space, what apple-pie order means, why we should mind our p’s and q’s, and the riddle will be solved.

I’ll begin my rundown on the sources with the most recent one known to me. Allen’s Dictionary of English Phrases (Penguin, 2006; its author is Robert Allen) is comprehensive and reliable. Though etymology was not Allen’s main objective, he never neglected it, and, in discussing conflicting hypotheses, showed excellent judgment. He mined the riches of the OED and many other sources, while I mainly followed journal publications for four centuries and cited dictionaries as an afterthought. (Allen also occasionally used Notes and Queries, my main source of inspiration.) While I am on the letter A, I should mention G. L. Apperson, the author of the book English Proverbs and Proverbial Phrases (London: Dent, 1929). Apperson was an outstanding specialist, and his book is a joy to read.

Perhaps the most famous and also the thickest book in this area was written by E. Cobham Brewer. His Dictionary of [Modern] Phrase and Fable (1894, a drastically revised version of the book first published in 1870) is the only one of his many once popular works that has not gone with the wind. A copy of it was on “every gentleman’s desk,” as they used to say at that time. Anyone who sought information about “phrase and fable” consulted Brewer. A learned man, he did one unforgivable thing: he explained the origin of idioms without referring to his sources. Many of the explanations are reasonable, but as many are unacceptable. The latest, severely abridged edition appeared in 2011. The information in even this volume should be treated with caution, but the editors had no choice, because the flavor of the original work had to be preserved.

Brewer’s competitor, but on an incomparably more modest scale, was Eliezer Edwards, the author of Words, Facts, and Phrases: A Dictionary of Curious, Quaint, and Out-of-the-Way Matters (London: Chatto and Windus, 1882). Not much in that collection is quaint, and even less is out of the way, but nothing works like an attractive title. The dictionary was much used, but it never enjoyed the popularity of Brewer’s magnum opus. At that time, people appreciated miscellanies containing heterogeneous “nuggets of knowledge.” This book, like Brewer’s, is dogmatic: Edwards gave no references in support of his derivations: he explained the origin of idioms as he saw fit.

Among the reference books published before the Second World War two should be mentioned. 1) Albert M. Hyamson, A Dictionary of English Phrases…. (London: Routledge, New York: Dutton, 1922.) The corpus is huge, but the etymologies are not always reliable for the same familiar reason: the user rarely knows whether the explanations are the author’s or common knowledge, or borrowed from some of the dictionaries he referred to. The uncritical approach to etymology is the main drawback of this genre. 2) Alfred H. Holt, Phrase Origins: A Study of Familiar Expressions. (New York: Thomas Y. Crowell, 1936.) Despite its title, this work contains numerous entries on individual words. The book can still be recommended because of its cautious approach to the material. Holt used various sources, and when he ventured his own hypotheses, he always said so.

Not every brewer searches for words and phrases.
(Image, left: E. Cobham Brewer via Wikimedia Commons; right Beer sommelier at work at Nebraska Brewing Company, via Wikimedia Commons)

An often-used collection is a three-volume book by William and Mary Morris, Dictionary of Word and Phrase Origins. (New York: Harper and Row, 1962-1971.) William Morris was the Editor-in-Chief of the first edition of The American Heritage Dictionary of the English Language, but the dictionary of word and phrase origins can hardly be called a success, because many explanations are unreliable, and the references to the authors’ sources are very few. A more rewarding fruit of teamwork is Dictionary of Idioms and Their Origins by Roger and Lina Flavell (Kyle Cathie, 1992). The origins are explained without reference to the sources, but most of them are acceptable. Last but not least, mention should be made of Charles Earle Funk’s Curious Word Origins, Sayings & Expressions from White Elephant to Song and Dance. (New York: Galahad Books, 1993.) The huge volume (988 pages, with excellent illustrations strewn generously all over the text) includes the author’s three earlier books: A Hog on Ice, Heaven to Betsy!, and Horsefeathers. The second part is only about words, but the first and the third deal with idioms. Some entries are quite detailed.

It will be only fair to mention the three collections that were especially often consulted in the past: John Ray’s A Compleat (sic) Collection of English Proverbs (1678); George Bohn’s (1796-1864) A Handbook of Proverbs, a radical reworking of Ray’s pioneering work; and English Proverbs and Proverbial Phrases by W. Carew Hazlitt (1834-1913).

The list at my disposal is very long, and reproducing most or the whole of it might only bore our readers. The fragment presented above gives an adequate idea of the state of the art, and those who are interested in the study of idioms may “make a note of it,” as Captain Cuttle, a memorable character in Dickens’s novel Dombey and Son (1846-1848) used to say. His favorite phrase—”When found, make a note of it”—was chosen as the motto of the British periodical Notes and Queries, which began to appear in 1849 and turned out to be a treasure house of letters on all things under the sun, including the origin of English words and idioms. 

Featured image by Dan Parsons via Wikimedia Commons, public domain


Tuesday, May 31, 2022

Google's massive language translation work identifies where it goofs up - ZDNet - Translation


Scores for languages when translating from English and back to English again, correlated to how many sample sentences the language has. Toward the right side, higher numbers of example sentences result in better scores. There are outliers, such as English in Cyrillic, which has very few examples but translates well. 

Bapna et al., 2022

What do you do after you have collected writing samples for a thousand languages for the purpose of translation, and humans still rate the resulting translations a fail?

Examine the failures, obviously. 

And that is the interesting work that Google machine learning scientists related this month in a massive research paper on multi-lingual translation, "Building Machine Translation Systems for the Next Thousand Languages."

"Despite tremendous progress in low-resource machine translation, the number of languages for which widely-available, general-domain MT systems have been built has been limited to around 100, which is a small fraction of the over 7000+ languages that are spoken in the world today," write lead author Ankur Bapna and colleagues. 

The paper describes a project to create a data set of over a thousand languages, including so-called low-resource languages, those that have very few documents to use as samples for training machine learning. 

Also: DeepMind: Why is AI so good at language? It's something in language itself

While it is easy to collect billions of example sentences for English, and over a hundred million example sentences for Icelandic, for example, the language kalaallisut, spoken by about 56,000 people in Greenland, has fewer than a million, and the Kelantan-Pattani Malay language, spoken by about five million people in Malaysia and Thailand, has fewer than 10,000 example sentences readily available.

To compile a data set for machine translation for such low-resource languages, Bapna and two dozen colleagues first created a tool to scour the Internet and identify texts in low-resource languages. The authors use a number of machine learning techniques to extend a system called LangID, which comprises techniques for identifying whether a Web text belongs to a given language. That is a rather involved process of eliminating false positives. 

After scouring the Web with LangID techniques, the authors were able to assemble "a dataset with corpora for 1503 low-resource languages, ranging in size from one sentence (Mape) to 83 million sentences (Sabah Malay)." 
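Character n-gram profiling is a classic way to build this kind of language identifier. The toy sketch below (with tiny, invented two-language corpora, nothing like Google's actual LangID models) shows the basic idea: score a text against per-language n-gram profiles and pick the best match.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Count character trigrams, padding the ends so word boundaries count."""
    text = f"  {text.lower()}  "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

# Tiny, invented sample corpora; a real system trains on vastly more text.
PROFILES = {
    "english": char_ngrams("the quick brown fox jumps over the lazy dog and the cat"),
    "spanish": char_ngrams("el rapido zorro marron salta sobre el perro perezoso y el gato"),
}

def identify_language(text):
    """Score a text against each profile by multiset n-gram overlap; higher wins."""
    grams = char_ngrams(text)
    scores = {
        lang: sum((grams & profile).values())  # Counter & = multiset intersection
        for lang, profile in PROFILES.items()
    }
    return max(scores, key=scores.get)

print(identify_language("the dog jumps over the fox"))
print(identify_language("el gato salta sobre el perro"))
```

A production system must also reject text that matches no profile well, which is where the false-positive elimination the authors describe comes in.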

The scientists boiled that list down to 1,057 languages "where we recovered more than 25,000 monolingual sentences (before deduplication)," and combined that group of samples with the much larger data for 83 "high-resource languages" such as English. 

Also: AI: The pattern is not in the data, it's in the machine

They then tested their data set by running experiments to translate between the languages in that set. They used various versions of the ubiquitous Transformer neural net for language modeling. In order to test performance of translations, the authors focused on translating to and from English with 38 languages for which they obtained example true translations, including kalaallisut.

That's where the most interesting part comes in. The authors asked human reviewers who are native speakers of low-resource languages to rate the quality of translations for 28 languages on a scale of zero to six, with 0 being "nonsense or wrong language" and 6 "perfect." 

Also: Facebook open sources tower of Babel, Klingon not supported

The results are not great. Out of 28 languages translated from English, 13 were rated below 4 on the scale, implying that almost half of the English-to-target translations were mediocre. 

The authors have a fascinating discussion starting on page 23 of what seems to have gone wrong in those translations with weak ratings. 

"The biggest takeaway is that automatic metrics overestimate performance on related dialects," they write. In other words, the scores the machine assigns to translations, such as the widely used BLEU score, tend to give credit when the neural network is actually translating into a wrong but closely related language. For example, "Nigerian Pidgin (pcm), a dialect of English, had very high BLEU and CHRF scores, of around 35 and 60 respectively. However, humans rated the translations very harshly, with a full 20% judged as 'Nonsense/Wrong Language', with trusted native speakers confirming that the translations were unusable."

"What's happening here is that the model translates into (a corrupted version of) the wrong dialect, but it is close enough on a character n-gram level" for the automatic benchmark to score it high, they observe. 

"This is the result of a data pollution problem," they deduce, "since these languages are so close to other much more common languages on the web […] the training data is much more likely to be mixed with either corrupted versions of the higher-resource language, or other varieties."
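This failure mode is easy to reproduce with a simplified chrF-style character n-gram F-score (a toy sketch, not the official chrF implementation; the example sentences are invented): a hypothesis whose surface form is nearly identical to the reference scores very high, even if it is in the wrong dialect or otherwise unusable.

```python
from collections import Counter

def ngram_counts(text, n):
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def char_ngram_f1(hypothesis, reference, n=3):
    """Simplified chrF-style score: F1 over character n-gram overlap."""
    hyp = ngram_counts(hypothesis.lower(), n)
    ref = ngram_counts(reference.lower(), n)
    overlap = sum((hyp & ref).values())  # multiset intersection
    if not overlap:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "we dey go make una sabi wetin happen"        # invented reference
close_but_wrong = "we dey go make una sabi wetin happens"  # near-identical surface form
unrelated = "completely different output text"

print(char_ngram_f1(close_but_wrong, reference))  # very high despite the change
print(char_ngram_f1(unrelated, reference))        # near zero
```

Character-level overlap rewards the near-miss regardless of whether the output is in the right dialect, which is exactly why the human ratings diverged so sharply from BLEU and CHRF here.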


Examples of translations with correct terms in blue and mistranslations in yellow. The left-hand column shows the code for which language is being translated into, using the standard BCP-47 tags.

Bapna et al., 2022

Also: Google uses MLPerf competition to showcase performance on gigantic version of BERT language model

And then there are what the authors term "characteristic error modes" in translations, such as "translating nouns that occur in distributionally similar contexts in the training data": substituting a relatively common noun like "tiger" with another kind of animal word, for instance, "showing that the model learned the distributional context in which this noun occurs, but was unable to acquire the exact mappings from one language to another with enough detail within this category," they note.

Such a problem occurs with "animal names, colors, and times of day," and "was also an issue with adjectives, but we observed few such errors with verbs. Sometimes, words were translated into sentences that might be considered culturally analogous concepts – for example, translating "cheese and butter" into "curd and yogurt" when translating from Sanskrit."

Also: Google's latest language machine puts emphasis back on language

The authors make an extensive case for working closely with native speakers:

We stress that where possible, it is important to try to build relationships with native speakers and members of these communities, rather than simply interacting with them as crowd-workers at a distance. For this work, the authors reached out to members of as many communities as we could, having conversations with over 100 members of these communities, many of whom were active in this project. 

An appendix offers gratitude to a long list of such native speakers. 

Despite the failures cited, the authors conclude the work has successes of note. In particular, using the LangID approach to scour the web, "we are able to build a multilingual unlabeled text dataset containing over 1 million sentences for more than 200 languages and over 100 thousand sentences in more than 400 languages."

And the work with Transformer models convinces them that "it is possible to build high quality, practical MT models for long-tail languages utilizing the approach described in this work."

Adblock test (Why?)

Google's massive language translation work identifies where it goofs up - ZDNet - Translation


Scores for languages when translating from English and back to English again, plotted against how many example sentences each language has. Toward the right side, larger numbers of example sentences generally yield better scores. There are outliers, such as English in Cyrillic, which has very few examples but translates well. 

Bapna et al., 2022
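The trend the figure describes, more example sentences generally yielding better scores, can be checked with a simple correlation. Here is a minimal sketch using invented corpus sizes and scores (not the paper's actual numbers):

```python
import math

# Illustrative (made-up) data: monolingual corpus size vs. translation score.
corpus_sizes = [1e3, 1e4, 1e5, 1e6, 1e7, 1e8]
scores = [2.1, 2.8, 3.5, 4.2, 4.8, 5.3]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Correlate score with log corpus size, since sizes span many orders of magnitude.
r = pearson([math.log10(s) for s in corpus_sizes], scores)
print(f"correlation: {r:.3f}")
```

With data like this, the correlation comes out strongly positive, mirroring the figure's upward trend; the real chart's outliers would pull it lower.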

What do you do after you have collected writing samples for a thousand languages for the purpose of translation, and humans still rate the resulting translations a fail?

Examine the failures, obviously. 

And that is the interesting work that Google machine learning scientists related this month in a massive research paper on multilingual translation, "Building Machine Translation Systems for the Next Thousand Languages."

"Despite tremendous progress in low-resource machine translation, the number of languages for which widely-available, general-domain MT systems have been built has been limited to around 100, which is a small fraction of the over 7000+ languages that are spoken in the world today," write lead author Ankur Bapna and colleagues. 

The paper describes a project to create a data set of over a thousand languages, including so-called low-resource languages, those that have very few documents to use as samples for training machine learning. 

While it is easy to collect billions of example sentences for English, and over a hundred million example sentences for Icelandic, the language Kalaallisut, spoken by about 56,000 people in Greenland, has fewer than a million, and the Kelantan-Pattani Malay language, spoken by about five million people in Malaysia and Thailand, has fewer than 10,000 example sentences readily available.

To compile a data set for machine translation for such low-resource languages, Bapna and two dozen colleagues first created a tool to scour the Internet and identify texts in low-resource languages. The authors use a number of machine learning techniques to extend a system called LangID, which comprises techniques for identifying whether a Web text belongs to a given language. That is a rather involved process of eliminating false positives. 
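The paper's LangID system is far more elaborate, but the core idea, classify a text by its character n-gram statistics and discard low-confidence matches to cull false positives, can be sketched roughly as follows. The toy training sentences and the confidence threshold are invented for illustration:

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Character n-grams of a string, the usual LangID features."""
    text = f" {text.lower()} "
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def build_profile(samples, n=3):
    """Normalized n-gram frequency profile for one language."""
    counts = Counter(g for s in samples for g in char_ngrams(s, n))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

def identify(text, profiles, threshold=0.01):
    """Score text against each language profile; return None below the
    threshold (dropping low-confidence matches culls false positives)."""
    grams = char_ngrams(text)
    scores = {
        lang: sum(prof.get(g, 0.0) for g in grams) / len(grams)
        for lang, prof in profiles.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None

# Toy profiles built from a couple of sentences per language (illustrative only).
profiles = {
    "en": build_profile(["the cat sat on the mat", "where is the dog"]),
    "is": build_profile(["hvar er hundurinn", "kötturinn sat á mottunni"]),
}

print(identify("the dog sat on the mat", profiles))  # → en
print(identify("zzzz qqqq xxxx", profiles))          # → None
```

A production system would use far richer models and much more data; the point here is only the shape of the pipeline: profile, score, threshold.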

After scouring the Web with LangID techniques, the authors were able to assemble "a dataset with corpora for 1503 low-resource languages, ranging in size from one sentence (Mape) to 83 million sentences (Sabah Malay)." 

The scientists boiled that list down to 1,057 languages "where we recovered more than 25,000 monolingual sentences (before deduplication)," and combined that group of samples with the much larger data for 83 "high-resource languages" such as English. 
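That boiling-down step, keep only languages whose raw crawl clears a minimum sentence count, then deduplicate, is straightforward to express. A minimal sketch, with a threshold of 3 sentences standing in for the paper's 25,000:

```python
def filter_corpora(corpora, min_sentences=3):
    """Keep languages whose raw (pre-deduplication) sentence count meets
    the threshold, then deduplicate what remains, preserving order."""
    kept = {}
    for lang, sentences in corpora.items():
        if len(sentences) >= min_sentences:
            kept[lang] = list(dict.fromkeys(sentences))  # order-preserving dedup
    return kept

# Toy crawl results (invented): "aa" has too few raw sentences and is dropped;
# "bb" clears the threshold and is kept after removing its duplicate.
corpora = {
    "aa": ["s1", "s1"],
    "bb": ["s1", "s2", "s1", "s3"],
}
print(filter_corpora(corpora))  # → {'bb': ['s1', 's2', 's3']}
```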

They then tested their data set by running experiments to translate between the languages in that set. They used various versions of the ubiquitous Transformer neural net for language modeling. To test the performance of translations, the authors focused on translating to and from English for 38 languages for which they obtained true example translations, including Kalaallisut.

That's where the most interesting part comes in. The authors asked human reviewers who are native speakers of low-resource languages to rate the quality of translations for 28 languages on a scale of zero to six, with 0 being "nonsense or wrong language" and 6 being "perfect."

The results are not great. Of the 28 languages translated from English, 13 were rated below 4 on the scale for translation quality, meaning nearly half of the English-to-target translations were mediocre at best.

The authors have a fascinating discussion starting on page 23 of what seems to have gone wrong in those translations with weak ratings. 

"The biggest takeaway is that automatic metrics overestimate performance on related dialects," they write. In other words, the scores a machine assigns to translations, such as the widely used BLEU score, tend to give credit when the neural network is simply translating into a wrong language that resembles the target. For example, "Nigerian Pidgin (pcm), a dialect of English, had very high BLEU and CHRF scores, of around 35 and 60 respectively. However, humans rated the translations very harshly, with a full 20% judged as 'Nonsense/Wrong Language', with trusted native speakers confirming that the translations were unusable."
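The inflation the authors describe is easy to reproduce with a simplified character n-gram metric. The function below is a bare-bones chrF-style F-score, not the official implementation, and the example sentences are invented: a "translation" in a closely related variety shares most of its character n-grams with the reference, so it scores well even though it is the wrong language.

```python
from collections import Counter

def char_ngram_f(hypothesis, reference, n=3, beta=2.0):
    """Simplified chrF-style score: F-beta over character n-gram
    overlap between hypothesis and reference, in [0, 1]."""
    hyp = Counter(hypothesis[i:i + n] for i in range(len(hypothesis) - n + 1))
    ref = Counter(reference[i:i + n] for i in range(len(reference) - n + 1))
    overlap = sum((hyp & ref).values())  # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

reference = "the children are going to school"
# Invented "wrong variety" output: it reuses most of the reference's words,
# so it shares many character n-grams despite not being a correct translation.
wrong_variety = "dem children dey go school"
unrelated = "completely different output text"

print(char_ngram_f(wrong_variety, reference))  # scores fairly high
print(char_ngram_f(unrelated, reference))      # scores near zero
```

The high score for the wrong-variety output, against a near-zero score for genuinely unrelated text, is the failure mode the authors flag: character-level overlap rewards being in a sibling dialect.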

"What's happening here is that the model translates into (a corrupted version of) the wrong dialect, but it is close enough on a character n-gram level" for the automatic benchmark to score it high, they observe. 

"This is the result of a data pollution problem," they deduce, "since these languages are so close to other much more common languages on the web […] the training data is much more likely to be mixed with either corrupted versions of the higher-resource language, or other varieties."


Examples of translations with correct terms in blue and mistranslations in yellow. The left-hand column shows the code of the target language, using standard BCP-47 tags.

Bapna et al., 2022

And then there are what the authors term "characteristic error modes" in translations, such as "translating nouns that occur in distributionally similar contexts in the training data." The model may substitute a relatively common noun like "tiger" with another kind of animal word, they note, "showing that the model learned the distributional context in which this noun occurs, but was unable to acquire the exact mappings from one language to another with enough detail within this category."

Such a problem occurs with "animal names, colors, and times of day," and "was also an issue with adjectives, but we observed few such errors with verbs. Sometimes, words were translated into sentences that might be considered culturally analogous concepts, for example, translating 'cheese and butter' into 'curd and yogurt' when translating from Sanskrit."

The authors make an extensive case for working closely with native speakers:

We stress that where possible, it is important to try to build relationships with native speakers and members of these communities, rather than simply interacting with them as crowd-workers at a distance. For this work, the authors reached out to members of as many communities as we could, having conversations with over 100 members of these communities, many of whom were active in this project. 

An appendix offers gratitude to a long list of such native speakers. 

Despite the failures cited, the authors conclude the work has successes of note. In particular, using the LangID approach to scour the web, "we are able to build a multilingual unlabeled text dataset containing over 1 million sentences for more than 200 languages and over 100 thousand sentences in more than 400 languages."

And the work with Transformer models convinces them that "it is possible to build high quality, practical MT models for long-tail languages utilizing the approach described in this work."
