In a June 18, 2024 paper, Vilém Zouhar and Mrinmaya Sachan from ETH Zurich, along with Tom Kocmi from Microsoft, presented a new approach to the human evaluation of machine translation (MT) systems that integrates AI assistance to improve the efficiency and consistency of the evaluation process.
Evaluating the performance of MT systems is an important but challenging task. Traditional human evaluation methods can be costly, time-consuming, subjective, and lack consistency among evaluators.
The researchers emphasized that existing automatic evaluation metrics “remain misaligned with the ideal measure of text quality and human evaluation remains the most accurate and reliable standard.”
Human evaluation typically involves ranking different MT outputs, performing direct assessment, or identifying error spans, their types, and their severity using frameworks like MQM. In another paper, published June 17, 2024, Kocmi, Zouhar, et al. simplified this process into error span annotation (ESA), a human evaluation protocol that focuses solely on high-level error severity, enabling “economic evaluation at scale.”
With ESA, annotators first mark error spans as minor or major severity and then assign a final score, without the need for error classification. The researchers found ESA to be “faster and cheaper than MQM whilst providing the same usefulness in ranking MT systems.”
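To make the protocol concrete, here is a minimal sketch, not taken from the paper, of how a single ESA annotation might be represented; the class and field names are illustrative assumptions.

```python
# Illustrative sketch (not the authors' code): one ESA annotation for a
# translated segment. ESA records only span severity, not error type.
from dataclasses import dataclass, field
from enum import Enum

class Severity(Enum):
    MINOR = "minor"
    MAJOR = "major"

@dataclass
class ErrorSpan:
    start: int          # character offset where the error begins in the MT output
    end: int            # end offset (exclusive)
    severity: Severity  # only severity is recorded, no error classification

@dataclass
class ESAAnnotation:
    mt_output: str
    error_spans: list[ErrorSpan] = field(default_factory=list)
    final_score: float | None = None  # segment score assigned after marking spans

# Example: an annotator marks "am Katze" (chars 20-28) as a major error,
# then assigns a segment-level score. The values are purely illustrative.
seg = ESAAnnotation(mt_output="Der Hund bellt laut am Katze.")
seg.error_spans.append(ErrorSpan(start=20, end=28, severity=Severity.MAJOR))
seg.final_score = 62.0
```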
Speeding Up
Now, they aim to “make the MT evaluation process with ESA less expensive” with AI assistance. They noted that “one of the motivations of the AI-assisted setup is speeding up the annotations and leading to lower costs.” Additionally, they believed that human-AI collaboration can be not only faster but also “more accurate than human or AI alone.”
The tool, named ESAAI, uses an AI system to pre-fill error annotations on the MT output, which the human evaluators can then review, modify, or reject before submitting their final evaluations. They explained that this setup is enabled by advancements in quality estimation (QE) systems. Specifically, they used GEMBA, a GPT-based quality estimation system.
“We help the annotators by pre-filling the span annotations with automatic quality estimation,” they said.
The initial error markings are made by the AI and then refined by the annotators. Subsequently, annotators manually assign a final score on a scale from 0 to 100, without AI assistance. “The error annotation part thus works as priming of the annotators in giving more accurate scores,” they explained.
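The loop described above could be sketched roughly as follows, reusing the ErrorSpan and ESAAnnotation types from the earlier snippet. The three helper functions are hypothetical stand-ins, not APIs from the paper: GEMBA itself is a GPT-prompt-based QE system, and the review and scoring steps take place in the annotation interface.

```python
# Sketch of the AI-assisted ESA loop under the assumptions stated above.

def gemba_quality_estimate(source: str, mt_output: str) -> list[ErrorSpan]:
    """Stand-in for a GPT-based QE system (e.g. GEMBA) proposing error spans."""
    raise NotImplementedError("call a quality estimation model here")

def human_review_spans(mt_output: str, spans: list[ErrorSpan]) -> list[ErrorSpan]:
    """Stand-in for the annotator confirming, editing, or deleting AI-proposed spans."""
    raise NotImplementedError("collect span edits from the annotation UI")

def human_assign_score(mt_output: str, spans: list[ErrorSpan]) -> float:
    """Stand-in for the annotator's manual 0-100 segment score (no AI involved)."""
    raise NotImplementedError("collect the score from the annotation UI")

def annotate_with_ai_assistance(source: str, mt_output: str) -> ESAAnnotation:
    annotation = ESAAnnotation(mt_output=mt_output)
    # 1. AI pre-fills error spans via automatic quality estimation.
    annotation.error_spans = gemba_quality_estimate(source, mt_output)
    # 2. The annotator reviews and refines the pre-filled spans.
    annotation.error_spans = human_review_spans(mt_output, annotation.error_spans)
    # 3. The annotator assigns the final score manually; the span-marking
    #    step primes them for more accurate scoring.
    annotation.final_score = human_assign_score(mt_output, annotation.error_spans)
    return annotation
```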
The researchers compared their AI-assisted approach to other human evaluation methods to evaluate its performance. They found that ESAAI can achieve similar levels of accuracy while significantly reducing the time and effort required from annotators to mark errors. This can potentially reduce the annotation budget by up to 24%.
They concluded that “the inclusion of AI in evaluation also opens many options for further evaluation economy.”