Monday, March 8, 2021

500-Million-Sentence Dataset Can Boost Machine Translation for Low-Resource Languages - Slator - Translation

Researchers working on machine translation (MT) often rely on back translation to beef up training data. Back translation, in which more widely available monolingual target-language data is translated into the source language, was credited with enabling the Transformer-based deep-learning system CUBBITT to “outperform human-level translation,” as covered by Slator in September.

Back translation was also crucial to a method for detecting machine translated content, which may become more critical as startups ramp up commercialization of AI-powered text generation.

The usefulness of back translation depends on the widespread availability of target language data, which can present a hurdle for languages of lesser diffusion.
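As a rough illustration of the back-translation idea described above, the following Python sketch builds synthetic training pairs from monolingual target-language sentences. The function `toy_reverse_translate` and its four-word German-to-English table are hypothetical stand-ins for a real target-to-source MT model; nothing here reflects how any system or dataset mentioned in this article is actually built.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a trained target-to-source MT model:
# a word-by-word lookup from German (target) to English (source).
TOY_DE_TO_EN = {"das": "the", "haus": "house", "ist": "is", "klein": "small"}


def toy_reverse_translate(sentence: str) -> str:
    """Translate word by word; unknown words are passed through unchanged."""
    return " ".join(TOY_DE_TO_EN.get(w, w) for w in sentence.lower().split())


def back_translate(
    target_sentences: List[str],
    reverse_translate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Pair each genuine target sentence with a machine-made source sentence.

    The result is a list of (synthetic_source, real_target) pairs that could
    be added to the training data of a source-to-target system.
    """
    return [(reverse_translate(t), t) for t in target_sentences]


monolingual_german = ["Das Haus ist klein"]
synthetic_pairs = back_translate(monolingual_german, toy_reverse_translate)
print(synthetic_pairs)
```

The target side of each pair is genuine human-written text and only the source side is machine-made, which is why the approach depends on having enough monolingual target-language data to begin with.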

In an effort to allow MT researchers to work on more realistic low-resource scenarios, University of Helsinki language technology professor Jörg Tiedemann announced on March 3, 2021 that he had released over 500 million translated sentences in 188 languages.

Tiedemann’s datasets, available on GitHub, are not the first attempt to level the playing field for languages via MT. For example, since 2018, the Masakhane Project has been gathering language data and fine-tuning language models specifically for African languages underrepresented in NLP. Tiedemann’s project is, however, notable for its scale.

In a related October 2020 paper on the Tatoeba Translation Challenge, Tiedemann wrote, “The main goal is to trigger the development of open translation tools and models with a much broader coverage of the world’s languages.”

How much broader? The training and test data covers 500 languages and language variants, as well as roughly 3,000 language pairs.

According to Tiedemann, there is still work to be done. “It’s anyway not going to be the last set of back-translations I’m going to release,” he tweeted. “More to come soon also from English to other languages…”

Image: University Library in Helsinki

'It is amazing'; Faith creates complete American Sign Language translation of Bible - St George News - Translation

ST. GEORGE — Understanding the Bible isn’t always easy, and it can be even less so when trying to digest it through a second language after the translation falls short.

A man signs John 3:16 in a video that is a part of the Jehovah’s Witnesses’ American Sign Language Bible app offered for Windows and mobile devices, St. George, Utah, March 4, 2021 | Photo by Mori Kessler, St. George News

For deaf members of the Jehovah’s Witnesses, this was a part of the experience for many years until technology enabled the church to translate and share the Bible entirely in American Sign Language.

“I got to watch my mom relearn the Bible and relearn about Jehovah God,” Darry Bullard, of St. George, said of his mother’s experience with the Jehovah’s Witnesses’ ASL Bible.

“To get even a deeper relationship with our God that way and to deepen her faith and understanding more clearly… it’s been wonderful to watch,” he said.

In February 2020, the church announced that it had completed a translation of the whole Bible, both Old and New Testaments, into American Sign Language. As far as the Jehovah’s Witnesses are aware, it is also the first complete version of the religious tome translated into ASL.

The Jehovah’s Witnesses’ ASL translation of the Bible is available on the church’s ASL website as a series of online videos, and as a download for Windows and for Apple and Google (Android) mobile devices. The ASL Bible app shows videos of men in suits who sign through the books of the Bible verse by verse.

“It is amazing,” Bullard said as he shared his mother’s experience as a deaf individual who joined the faith and how she learned the Bible through the aid of others.

Barry Bullard, a member of the Jehovah’s Witnesses, shared his mother’s experiences as a deaf individual and what gaining access to the Bible in American Sign Language has been like for her, St. George, Utah, Feb. 12, 2021 | Photo by Mori Kessler, St. George News

Bullard’s mother joined the Jehovah’s Witnesses in the 1960s, long before resources were readily available in ASL. At the time, she was able to learn and study with the Witnesses through others writing notes for her to read or sign language interpreters who weren’t always able to accurately express the meaning behind what was being taught.

“But here, it’s straight, right in her own language,” Bullard said. “She can look at any verse she wants now, any verse in the Bible using the app.”

Bullard, along with Robert Hendricks, a national spokesman for the Jehovah’s Witnesses, spoke with St. George News last month about the ASL Bible, its impact and how it came to be.

In the 1970s, the church had many deaf members across the country and the world, Hendricks said. They were taught from the Bible as best as possible, yet in meetings “meant for a hearing world” that weren’t specially catered to their needs.

“For many in the deaf community, they accepted that, but it really didn’t speak to their hearts,” Hendricks said. “Every person in this world should be able to read the Bible in the language of their heart.”

A part of the problem in trying to help deaf members of the faith understand the Bible was that interpreters would tend to give literal translations from English. Basic concepts were easy enough to convey, Hendricks said, but going into depth in a way that could be understood wasn’t so easy.

The menu screen for the Jehovah’s Witness ASL Bible library Windows app | Photo by Mori Kessler, St. George News

“Their mother tongue really is ASL, and most people know American Sign Language is not the literal translation of words – it’s conceptual, it’s visual and it’s emotional,” Hendricks said.

The church created its first ASL congregation in New York in 1989 and began providing ASL lessons on video cassette. These videos, when someone had a lot of them, weren’t easy to carry around. This changed with the advent of DVDs that were more easily shared.

As ASL meetings and lessons became more visual, parts of the Bible were also translated for use in weekly meetings and started to accumulate to the point that, in 2005, the church took on what became a 15-year project.

“By 2005, we began a project nobody thought we’d be able to do, and that’s translating the Bible into ASL,” Hendricks said.

The New Testament Book of Matthew was the first part of the Bible completed in 2006, with the final book, the Old Testament tale of Job, completed in early 2020.

“It was the first ASL Bible ever translated in its complete form, ever, in history,” Hendricks said.

A video of a man signing a part of the Old Testament book of Obadiah as a part of the Jehovah’s Witnesses’ American Sign Language Bible app offered for Windows | Photo by Mori Kessler, St. George News

A large part of the translation process was making sure the concepts expressed in the Bible were accurately described through sign language. As mentioned before, literal translations may not make sense to the recipient due to a loss of context and syntax. In order to help translators avoid this, panels of deaf individuals were regularly consulted to aid in the project.

“The Bible is, frankly, hard to understand,” Hendricks said. “If language becomes a barrier to that understanding, it’s not giving us what was intended for us.”

Since the ASL translation project began, complete books of the church’s ASL Bible have been downloaded over 2.3 million times. Individual chapters have been downloaded nearly 39 million times, and the complete Bible itself over 850,000 times.

It’s not just Jehovah’s Witnesses who are using the ASL Bible, Hendricks said, noting there are approximately 2,000 deaf members of the faith in the United States, which hosts an estimated 1.3 million of the worldwide faith’s 8 million members.

In a recent story by the Deseret News, in which several deaf Jehovah’s Witnesses shared their gratitude for the ASL Bible, one shared it with a Catholic woman who was “moved to tears when he showed her the Lord’s Prayer in American Sign Language.”

“The whole thing fits in a phone now, all of the American Sign Language Bible,” Bullard said, adding it’s been wonderful to see his mother and other deaf congregants benefit from having access to the complete translation. “It’s really neat to watch their faces and to see the joy they have in talking about it and using it in their own study.”

Copyright St. George News, SaintGeorgeUtah.com LLC, 2021, all rights reserved.

Sunday, March 7, 2021

Conservatives snicker as Urban Dictionary censors term ‘BLUE ANON,’ the hot new label for left-wing conspiracy theorists - RT - Dictionary

Conservatives chalked up a victory in the battleground of pejorative labels, concluding that the trending ‘Blue Anon’ branding of left-wing conspiracy theorists is sufficiently stinging after the term earned censorship bona fides.

The term made it into the online Urban Dictionary of slang words and phrases earlier this week, only to be removed by Sunday at the latest. A search of the term, a play on ‘QAnon’ that has been used increasingly in recent months to mock leftists, now comes up empty. Previously, the dictionary showed a definition for Blue Anon, noting that it's “a loosely organized network of Democrat voters, politicians and media personalities who spread left-wing conspiracy theories, such as the Russia Hoax, Jussie Smollett hoax, Ukraine hoax, Covington kids hoax and Brett Kavanaugh hoax.”

Conservatives interpreted the attempted disappearing of Blue Anon as a sign of success. “Wokies at Urban Dictionary zapped Blue Anon because it was too powerful,” journalist Jack Posobiec said Sunday on Twitter. Others predicted a ‘Streisand effect,’ when attempts to hide something inadvertently bring more attention and interest to it.

Author and Blexit founder Candace Owens took the takedown as an opportunity to illustrate the meaning of Blue Anon to her 2.6 million followers. “If you believe: DC is under military occupation because there are non-stop threats from Trump supporters; Joe Biden is the most popular American president of all time; Russia, Russia, Russia… you might be Blue Anon.”

But others couldn't help but note another hypocritical case of Republicans being picked on. “Look at what Urban Dictionary allows while censoring Blue Anon,” conservative writer Deb Heine said. She attached screenshots of several anti-conservative terms, including “Republic*nt.”

Efforts to discourage popularization of the term Blue Anon may go beyond Urban Dictionary, some social media users claimed. Some pointed out that Google searches for the term Blue Anon return links to skiing goggles and snowboarding equipment, while other search services, such as those offered by DuckDuckGo and Yahoo, return more relevant answers.

One Twitter user posted a graphic showing that Google searches for Blue Anon have surged over the past week. “Why are you suppressing Blue Anon?” the commenter asked. “According to your own trend search, this term is exploding; however, your results contradict this.”

Italian dictionary urged to change its 'sexist' definition of 'woman' - Wanted In Milan - Dictionary

A campaign is underway in Italy against 'sexist' synonyms for women in the Treccani dictionary.

Ahead of International Women's Day on 8 March, a campaign has been launched to change the "sexist" definition of the word 'woman' in a prestigious Italian dictionary, whose synonyms currently include 30 words to describe a prostitute.

In an open letter to Italian newspaper La Repubblica, more than 100 high-profile Italians have demanded an end to the "sexist" and "derogatory" synonyms for "woman" in the Treccani online dictionary.

The terms in question include "cagna" (bitch), "puttana" (whore), "bella di notte" (lady of the night), "cortigiana" (courtesan), “donnina allegra” (happy little woman) and even "serva" (maid).

"Such expressions are not only offensive but ... reinforce the negative and misogynistic stereotypes that objectify women and present them as inferior," said the letter.

“This is dangerous as language shapes reality and influences the way women are perceived and treated.”

The letter calls for the elimination of "the expressly offensive words referring to woman" and the insertion of "expressions that represent the role of women in society in a complete and consistent way."

The letter, whose signatories include politician Laura Boldrini and novelist Michela Murgia, points out that Treccani lists broadly positive synonyms for "man," such as "uomo d'affari" (businessman), "uomo d'ingegno" (man of genius) and "uomo di rispetto" ('stand-up guy').

In reply, Treccani’s Italian language vocabulary director Valeria Della Valle said she is convinced that defending the image and role of women is not achieved by "burning the words that offend us."

While she said she appreciated the reasons for the campaign, Della Valle argued that the role of dictionaries is to include even the “most detestable and outdated expressions” while making sure to label them as "a prejudice or a cliché handed down from the past but no longer acceptable."

A similar campaign last year led the Oxford English Dictionary to modify its definition of woman, which included "bitch," "bint" and “wench,” with labels now applied to terms identified as “derogatory,” “offensive” or “dated.”

The person behind the Oxford campaign was Maria Beatrice Giovanardi, the London-based equality activist now spearheading the campaign against Treccani, whose definitions she describes as even more offensive.

“These words are simply not synonyms of the word ‘woman.’ They can be the offensive synonyms of the word ‘sex worker’, but not of ‘woman’,” she told Reuters.

The letter to Treccani states that the changes requested would "not put an end to daily sexism, but it could contribute to a correct description and vision of the woman and of her role in today's society."

Overbroad DMCA Takedown Campaign Almost Wipes Dictionary Entries From Google - TorrentFreak - Dictionary

A software review site recently tried to remove links to 'competitors' that lifted its writings without permission. While this urge is understandable, the execution was far from perfect. In addition to using long phrases to identify copied content, the site also asked Google to remove pages that mentioned "here is a brief introduction," or even the word "outstanding."

Copyright infringement comes in many shapes and sizes.

Pirating music, movies, and software are prime examples, but copying a random photo from the Internet can be problematic as well.

And then there’s text.

While some publications, this one included, are quite lenient when it comes to copying, others are more restrictive. Republishing an article from The New York Times, for example, is not permitted.

The software review site ThinkMobiles also falls into the restrictive category. In fact, on every page, the site reminds readers that “all articles are subject to copyright and can not be reproduced without permission.”

This is an understandable policy for a publisher in their niche. After all, when you put many hours into writing a review, you don’t want to see others taking it for free and profiting from it.

ThinkMobiles Asks Google to Take Down ‘Infringing’ Links

Perhaps this is the reason why ThinkMobiles sent a series of DMCA requests to Google a few weeks ago. The notices alert the search engine to “copyright infringing” links, which the company wants to have removed.

For example, one notice identifies several sites that copied phrases from ThinkMobiles, including the following:

“Various suites offer certain packages: Director Suite with far beyond video editing and color grading, Ultimate is the most relevant for a median user with all the core stuff in it, and Ultra is the same but without some extra features.”

This isn’t much of a problem if these sites indeed copied full articles. Using full phrases to search for copycats is a common strategy that works, as long as the sender makes sure that it’s indeed their content being lifted.

Shorter Phrases, More Problems

However, ThinkMobiles doesn’t restrict its searches to long phrases. It also identifies shorter combinations of words, including “Verdict: we highly recommend it” and “and click export.” Needless to say, this is causing issues.

The “and click export” takedown notice, for example, lists URLs from Adobe.com, Google.com, and Microsoft.com, that have absolutely nothing to do with the software review site.

Browsing through other takedown requests from the same site, we found even broader ones. One notice even claims ‘copyright’ on the word “outstanding,” asking Google to remove the URLs of sites operating popular dictionaries including Cambridge and Merriam-Webster.

These erroneous takedown requests show that using short phrases or words to search for copied content is pointless. And even with longer phrases, senders always have to verify that the URLs they target are infringing.
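The point about phrase length can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `find_matches` helper, the `min_words` threshold of eight, and the three example pages and URLs are all hypothetical, and the sketch is not a description of how ThinkMobiles or Google actually search for or process copied text.

```python
from typing import Dict, List


def find_matches(phrase: str, pages: Dict[str, str], min_words: int = 8) -> List[str]:
    """Return URLs whose text contains `phrase`, refusing phrases that are too short.

    `min_words` is an arbitrary illustrative threshold, not a legal standard.
    """
    if len(phrase.split()) < min_words:
        raise ValueError(f"phrase too short to be distinctive: {phrase!r}")
    # Normalize case and whitespace on both sides before comparing.
    needle = " ".join(phrase.lower().split())
    return [
        url
        for url, text in pages.items()
        if needle in " ".join(text.lower().split())
    ]


# Hypothetical pages standing in for search results.
pages = {
    "https://copycat.example/review": (
        "Our take: Ultimate is the most relevant for a median user "
        "with all the core stuff in it, so buy it."
    ),
    "https://dictionary.example/outstanding": "outstanding: exceptionally good.",
    "https://vendor.example/help": "Choose a format and click export to save the file.",
}

long_phrase = (
    "Ultimate is the most relevant for a median user with all the core stuff in it"
)
candidates = find_matches(long_phrase, pages)
print(candidates)
```

Calling `find_matches("and click export", pages)` raises a `ValueError` instead of flagging the unrelated help page. Even for long phrases, the returned URLs are only candidates: someone still has to confirm that each one really contains lifted content before a notice goes out.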

After all, we copied a sentence from ThinkMobiles in this very article but, in a news context, we see that as fair use.

The good news is that Google rejected these overbroad DMCA takedown requests (outstanding!), so no damage was done. However, it could have easily slipped by them, as it will slip by many other sites that automatically process takedown notices.

Campaign demands Italian dictionary Treccani change its ‘sexist' definition of word ‘woman' - The Local Italy - Dictionary

Ahead of International Women’s Day, the campaign says 30 different words for a sex worker, including “puttana” (whore) and “cagna” (bitch) should be removed from the list of synonyms.

The words appear as synonyms of the euphemism for sex worker “buona donna”, which is included in a list of expressions that use the word “donna” (woman).

The campaign points out that while the terms associated with “woman” have negative connotations, the synonyms listed under the word “man” are generally positive.

The letter’s signatories include activist and politician Imma Battaglia, politician Laura Boldrini and deputy director general of the Bank of Italy Alessandra Perrazzelli.

“Such expressions are not only offensive but reinforce negative and misogynist stereotypes that objectify women and present them as inferior beings,” said the open letter, which was published in Italian newspaper La Repubblica on Friday.

The campaign was started by activist Maria Beatrice Giovanardi, who was also behind a similar one last year urging the Oxford English Dictionary (OED) to remove words such as “bint” and “bird” as other ways of saying “woman”.

Oxford University Press updated its definition of “woman” in its dictionaries after a similar petition gathered 30,000 signatures.

However, Treccani’s Italian language vocabulary director Valeria Della Valle responded that she did not think the dictionary needed changing.

“It is not by invoking a bonfire…to burn the words that offend us that we will be able to defend our image and role (as women),” Della Valle wrote in her response.

Sexist dictionaries: Trivial or troublesome in gender equality? - Daily Sabah - Dictionary

On the occasion of International Women’s Day on March 8, many Turkish brands and communities have decided to use their reach for the greater good and draw attention to gender-based discrimination.

Turkish retail giant Boyner Group has led the way in sparking conversations this year. The retailer's social media campaign video opens in a rather innocuous way, with a classically irritating salesperson voice announcing sales for Women’s Day. In a complete 180, instead of listing all the offers on clothes and beauty products, the video goes on to list all the “discounts” given to the male perpetrators of domestic violence and sexual abuse against women in Turkish courts. Judges have granted reduced or suspended sentences on grounds that the perpetrator “wore a tie” and committed the act to “save his (family’s) honor,” among many other bogus excuses, which have gone down in history as appalling and despicable.

Another befitting collaboration has been between the “Turkish Dictionary” and Watsons Turkey, a branch of the Asian health and beauty retailer, on Clubhouse, where they discussed sexist phrases and idioms in Turkish.

The Turkish dictionary, which is the lovechild of Aras Kocaoğlan and Betina Frantz, has swiftly earned Turks’, expats’ and foreigners’ trust and likes with its mostly amusing yet at times thought-provoking posts, daily lexical trivia and cultural tidbits. The platform has in the past covered interesting topics such as the longest word in Turkish (it’s 70 letters) and iconic Turkish pop songs such as Mustafa Sandal’s “Araba” (“Car”), the lyrics of which are etched in the brains of '90s kids.

Even though Turkish is not a gendered language that assigns genders to inanimate objects, it, unfortunately, contains many expressions and words that demean and patronize women. Here are a few examples:

  • Sözünün eri: (a) man of his word

Although this phrase has firmly planted itself in our everyday vocabulary, there are many other alternatives. “Sözüne sadık”, meaning “true to one’s word,” embodies the same message without resorting to gendered language.

  • Kız/kadın başına: (a girl/woman) on her own

Generally used in contexts that put the blame on women, especially in cases of sexual abuse and rape, this phrase could easily be replaced with “tek başına,” which is gender-neutral.

  • Bilim adamı: man of science

Much like using mankind instead of humankind, using “bilim adamı” instead of “bilim insanı” (scientist) perpetuates the idea that women are frail, inferior and not as capable as men.

These three uses hail from outdated, prejudiced and no longer acceptable views.

A tool for change

The fact that language shapes our reality is incontrovertible; cognitive research suggests the way we see the world and the way we think are all dictated by the words we speak.

Whereas “time” is measured in terms of physical distance in Turkish and English (i.e. short and long), Spanish tends to use adjectives of quantity ("poco" for little and "mucho" for much) when quantifying time. Hence, native speakers of English and Turkish will conjure up different images in their brains when thinking about time compared with Spanish speakers.

Languages also have different rules when it comes to verb tenses and conjugation. Korean, for example, requires you to conjugate verbs according to your relationship to the person you are speaking to and the level of formality. Or if you wanted to say “Daisy swam today” in Turkish, you would have to know specifically if you saw this act with your own eyes (called the “known past” in Turkish) or if you heard about or inferred it from clues (the “heard past”). These examples could be multiplied.

The language we use also influences the way we perceive and treat others. This is where the “woman” factor comes into play.

Language can be used to perpetuate gender violence and gender-based discrimination, but it can also be used in an inclusive manner to promote equality and social justice.

We use many “innocent” words that are subtle reflections of our everyday misogyny. Take “working mother” as an example. How often have you referred to a dad as a “working father”? Or do you remember the last time you called a man in your life dramatic or bossy?

The blame game

No esteemed dictionary in the world gets off scot-free when it comes to sexist vocabulary.

The most recent incident has involved Italy’s leading online dictionary. Treccani was hit with a campaign last week calling for it to remove some of the synonyms for “woman.”

Launched Friday via a letter signed by dozens of high-profile Italians, the campaign argued that the word woman in the dictionary is accompanied by terms with negative connotations such as “cagna” (bitch) and "puttana" (whore), which “reinforce negative and misogynist stereotypes that objectify women and present them as inferior beings,” according to a report by La Repubblica newspaper. Conversely, the word “man” in the dictionary has a list of synonyms that are mainly positive, such as "uomo d'affari" (businessman) and "uomo d'ingegno" (man of genius).

The same argument was made last year in November when the Oxford University Press updated its definition of “woman” in its dictionaries after listing terms such as "bitch" and "bird" as the word’s synonyms.

The Turkish Language Association (TDK) has also come under fire on multiple occasions for printing "sexist" definitions and using "discriminatory language towards women." In 2018, the dictionary was criticized for defining the word “müsait” (available) as “a woman open to flirting” without indicating that it was slang or vulgar language. Although this meaning was first used by Ömer Seyfettin in 1918 in his story titled "Nakarat" (echo/chorus) and the dictionary was not the first to introduce this phrase to daily use, omitting such information was deemed highly problematic.

Even Google Translate was called out for sexist output, such as rendering gender-neutral Turkish sentences about doctors with the pronoun “he” and sentences about nurses with “she.” The machine translation service addressed this inherent bias by offering gender-specific translations.

It is important not to forget that dictionaries are generally driven by the real-life evidence of how people converse in their daily lives, and they reflect and document rather than dictate.

Of course, editors need to be educated on and be more sensitive toward such matters, but the onus is still on society. So if you see sexist language or behavior, you need to call it out and stop it before it gets printed on a piece of paper.