In a January 17, 2024 paper, a group of researchers from the University of Macau, University College London (UCL), and Tencent AI Lab explored the performance of large language models (LLMs) against “classic” machine translation (MT) challenges.
The six MT challenges, originally proposed by Philipp Koehn and Rebecca Knowles in 2017, include domain mismatch, amount of parallel data, rare word prediction, translation of long sentences, attention model as word alignment, and sub-optimal beam search.
For their experiments, the researchers used the Llama2-7b model, focusing on the German-to-English language pair. They explained that “English and German are high-resource languages in the Llama2 pretraining data, which ensures the model’s proficiency in these two languages.”
They found that LLMs reduce dependence on parallel data during pretraining for major languages and improve the translation of long sentences and entire documents. Yet challenges like domain mismatch and rare word prediction persist. LLMs also face new challenges that neural MT models did not: translation of low-resource languages and human-aligned evaluation.
Document-Level
Specifically, the researchers found that LLMs mitigate reliance on bilingual data during pretraining for high-resource languages, with even a small amount of parallel data boosting translation performance. Surprisingly, an increased abundance of parallel data yields only marginal improvement and, in some cases, a decline in LLM translation system performance, challenging the common belief that more parallel data enhances translation quality. The researchers recommended supervised fine-tuning as a more advantageous approach for leveraging additional parallel data compared to continued pretraining.
The research community should “consider how to efficiently utilize parallel data for the enhancement of LLM translation systems, thereby offering a potential direction for future studies to optimize bilingual knowledge in the pursuit of improved MT performance using LLMs,” according to the researchers.
Another challenge addressed was the translation of long sentences, a significant hurdle for MT systems. LLMs tackled this challenge effectively, excelling at translating sentences with fewer than 80 words and performing consistently well at the document level, on texts of approximately 500 words.
“LLMs excel in translating extended sentences and entire documents, underscoring their effectiveness as a promising solution for addressing challenges associated with long-sentence and document-level translation tasks,” they said.
Unresolved Challenges
The researchers explored whether the rich knowledge of LLMs could address domain mismatch in translation tasks. While LLMs showed robust performance in in-domain translation tasks, their progress in out-of-domain tasks was modest, encountering challenges like terminology mismatch, style discrepancies, and hallucinations.
Rare word prediction remains another significant challenge for LLMs, leading to omissions in translations. The researchers underscored that this issue is persistent and unresolved, and emphasized its significance in the field.
Mixed Results
Word alignment, involving the identification of word pairs with similar semantic information in a given translation pair, was also explored. The researchers tested the feasibility of extracting word alignment from LLM attention weights, revealing that it was not a viable option. Despite this, the process provided valuable insights into model interpretability, they said.
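As a rough illustration of what extracting word alignment from attention weights means, the sketch below takes a toy target-by-source attention matrix and links each target word to the source word with the highest weight. The matrix values, the example sentence, the threshold, and the function name `align_from_attention` are all invented for this example; it is not the procedure used in the paper.

```python
def align_from_attention(attention, threshold=0.0):
    """Link each target token to its highest-weighted source token.

    attention[t][s] is the (hypothetical) weight that target token t
    places on source token s. Pairs whose best weight is not above
    `threshold` are dropped.
    """
    pairs = []
    for t, row in enumerate(attention):
        best_s = max(range(len(row)), key=lambda s: row[s])
        if row[best_s] > threshold:
            pairs.append((t, best_s))
    return pairs


# Toy example: German source, English target (all values are made up).
source = ["das", "Haus", "ist", "klein"]
target = ["the", "house", "is", "small"]
attention = [
    [0.70, 0.10, 0.10, 0.10],  # weights for "the"
    [0.15, 0.65, 0.10, 0.10],  # weights for "house"
    [0.10, 0.10, 0.60, 0.20],  # weights for "is"
    [0.05, 0.15, 0.20, 0.60],  # weights for "small"
]

for t, s in align_from_attention(attention, threshold=0.5):
    print(f"{target[t]} -> {source[s]}")
```

In a real LLM, attention is spread across many layers, heads, and subword tokens, so the highest-weighted position need not be the translated word. That may be one reason a simple readout like this is unreliable.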
In the context of inference, the researchers identified two major issues: inference strategies, including beam search and sampling, and inference efficiency, given the sheer size of LLMs. They first compared the performance of beam search and sampling and found that beam search is not necessarily suboptimal in LLMs.
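For readers unfamiliar with the two decoding strategies, the following minimal sketch contrasts them on a toy lookup table that stands in for a model. The table `TOY_MODEL`, its probabilities, and the function names are assumptions made for illustration; a real system would query the LLM for next-token probabilities, and the sketch says nothing about which strategy gives better translations.

```python
import math
import random

# Hypothetical next-token probabilities, keyed by the previous token.
# A real system would query the LLM here instead of a lookup table.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"house": 0.5, "home": 0.3, "</s>": 0.2},
    "a": {"house": 0.9, "</s>": 0.1},
    "house": {"</s>": 1.0},
    "home": {"</s>": 1.0},
}


def beam_search(beam_size=2, max_len=5):
    """Keep the `beam_size` highest-scoring partial outputs at each step."""
    beams = [(0.0, ["<s>"])]  # (log-probability, tokens)
    for _ in range(max_len):
        candidates = []
        for score, tokens in beams:
            if tokens[-1] == "</s>":
                candidates.append((score, tokens))  # finished, carry over
                continue
            for tok, p in TOY_MODEL[tokens[-1]].items():
                candidates.append((score + math.log(p), tokens + [tok]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
        if all(tokens[-1] == "</s>" for _, tokens in beams):
            break
    return beams[0][1]


def sample(rng, max_len=5):
    """Draw each next token at random according to its probability."""
    tokens = ["<s>"]
    while tokens[-1] != "</s>" and len(tokens) <= max_len:
        options = TOY_MODEL[tokens[-1]]
        tok = rng.choices(list(options), weights=list(options.values()))[0]
        tokens.append(tok)
    return tokens


print("beam:  ", beam_search())
print("sample:", sample(random.Random(0)))
```

With a beam size of 1 the search reduces to greedy decoding and commits to "the" because it is the most likely first word. With a beam size of 2 it also keeps "a" alive and ends up with the more probable full sequence. Sampling can return any of the sequences the table allows.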
In terms of inference efficiency, they found that LLMs require an average of 30 seconds compared to the 0.3 seconds of MT models, roughly a 100-fold difference. “The longer inference time of LLMs may impede their real-time deployment in scenarios where fast translation is required,” they said.
New Challenges
Besides these six “classic” MT challenges, they identified two new challenges within the realm of LLMs. One pertains to the translation quality for language pairs inadequately represented during the pretraining stage and the other involves evaluating translation quality.
The researchers found that translation performance is significantly affected by the available resources for each language, emphasizing the need for a diverse and balanced dataset during the pretraining of LLMs to ensure equitable performance across languages.
Evaluation issues have also come to the forefront. They assessed the quality of LLM translations using both automatic metrics (BLEU and COMET) and human evaluation, and found a moderate negative correlation between the two. This emphasizes the importance of combining both evaluation methods and indicates that current metrics may not fully capture the nuances appreciated by human evaluators.
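To make the idea of a negative correlation between automatic and human scores concrete, here is a small sketch that computes a Pearson correlation coefficient over two lists of scores. The choice of Pearson is an assumption, since the article does not say which coefficient the researchers used, and the numbers below are made up for illustration rather than taken from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length (at least 2 items)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Made-up scores for five hypothetical systems; NOT figures from the paper.
automatic_scores = [0.80, 0.78, 0.82, 0.75, 0.79]
human_scores = [3.1, 3.6, 3.0, 3.8, 3.3]

r = pearson(automatic_scores, human_scores)
# A negative r means higher automatic scores go with lower human scores.
print(f"Pearson r = {r:.2f}")
```

A value near -1 or +1 indicates a strong linear relationship, while values closer to 0 indicate a weak one.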
According to the researchers, this calls for further research to develop and refine evaluation methods aligned with human preferences, especially as language models become more complex and capable. “This human-centered approach to evaluation will be crucial in ensuring that our translation models are not only technically proficient but also practically useful and acceptable to end users,” they said.
Finally, the researchers called for future research to focus on refining evaluation methods and testing approaches on more advanced models.
Authors: Jianhui Pang, Fanghua Ye, Longyue Wang, Dian Yu, Derek F. Wong, Shuming Shi, and Zhaopeng Tu.