
Machine Translation: Customization Evaluation in Spotlight

Nimdzi Finger Food is the bite-sized, free-to-sample insight you need to fuel your decision-making today.

If you want to learn more from our experts about the language technology available today, contact us.
Machine Translation customization and how to evaluate it with Spotlight by Intento

In December 2020, Nimdzi was given the opportunity to test a brand-new product, Spotlight. Developed by Intento to support machine translation (MT) curation, it enables quick analysis of MT training results. The product is intended mainly for those who train custom MT models and thus regularly face the task of evaluating MT quality.

The machine translation evaluation process and what Spotlight means for it

The usual evaluation methods involve random sampling and costly human review (which runs the risk of producing different results for the same samples), and oftentimes the review happens after the trained model is already in production (alas!). There is also usually no easy way to tell whether a model can be improved further, or to find examples of improved and degraded segments. All of this can make the MT trainers’ and evaluators’ job onerous and daunting. Not to mention that the evaluation sometimes occurs after it is actually needed, with the end users of the resulting MT output wondering what the evaluators do in the shadows. Intento’s Spotlight is designed to shed some light on this subject and dispel the gloom.

Our initial impression is that this tool represents a useful and quick way to evaluate MT training results by spotlighting those segments that really need to be reviewed.

In the spotlight

Spotlight is a cloud solution available on demand from the Intento Console. The user interface (UI) is lean and the wizard helping you create an evaluation is pretty straightforward. 

Test set description

We played with this new product using the COVID-related corpora by TAUS from Intento’s research on the best MT engines for this domain. The dataset compared Google Cloud Advanced Translation API (stock) against Google Cloud Advanced Translation API (custom), from English to Russian.
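
For readers who want to reproduce this kind of stock-versus-custom comparison themselves, here is a minimal sketch of requesting both translations through the Google Cloud Translation Advanced (v3) API. The project ID, location, and model ID are placeholders for your own resources, and the snippet is only an illustration, not how Intento or Nimdzi ran the experiment.

```python
# Minimal sketch: translating the same source segment with the stock model
# and a custom model via Google Cloud Translation Advanced (v3).
# PROJECT_ID, LOCATION, and MODEL_ID are placeholders for your own resources.
from google.cloud import translate_v3 as translate

PROJECT_ID = "my-project"        # placeholder
LOCATION = "us-central1"         # custom models are bound to a region
MODEL_ID = "my-custom-model-id"  # placeholder for a trained custom model

client = translate.TranslationServiceClient()
parent = f"projects/{PROJECT_ID}/locations/{LOCATION}"

def translate_en_ru(text: str, custom: bool = False) -> str:
    request = {
        "parent": parent,
        "contents": [text],
        "mime_type": "text/plain",
        "source_language_code": "en",
        "target_language_code": "ru",
    }
    if custom:
        # Omitting "model" falls back to the stock NMT model.
        request["model"] = f"{parent}/models/{MODEL_ID}"
    response = client.translate_text(request=request)
    return response.translations[0].translated_text

stock_output = translate_en_ru("Wash your hands frequently.")
custom_output = translate_en_ru("Wash your hands frequently.", custom=True)
```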

Spotlight applies a “Less is More” principle to the dataset size: it uses the first 2,000 segments from the evaluation files, as this is considered the optimal size for a sufficiently accurate evaluation.

How does it work?

  1. Users upload a test set with the source text, the reference translation, and translations produced by different MT models. The tool currently supports only test sets in .xls or .xlsx format, up to 10 MB in size, and expects spreadsheets with a specific structure (a rough sketch of such a spreadsheet follows below).
  2. The evaluation chart gives a quick overview of the training results, offering segment-level hLEPOR scores that show how close each translation is to the reference translation.
  3. Users can then review the segments in the spotlight.
Test set evaluation of machine translated text in Spotlight by Intento
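
The article only notes that Spotlight expects a specific spreadsheet structure; conceptually, each row pairs a source segment with its reference translation and the outputs of the engines being compared. Below is a rough sketch of assembling such a file with pandas, using illustrative column names (not Spotlight's documented schema) and trimming to the 2,000 segments the tool actually evaluates.

```python
# A rough sketch of assembling an evaluation spreadsheet for upload.
# Column names here are illustrative, not Spotlight's documented layout.
import pandas as pd

# In practice these lists would hold thousands of aligned segments.
en_segments = ["Wash your hands frequently."]
ru_references = ["Чаще мойте руки."]
stock_outputs = ["Часто мойте руки."]
custom_outputs = ["Чаще мойте руки."]

test_set = pd.DataFrame({
    "source": en_segments,
    "reference": ru_references,
    "google_stock": stock_outputs,
    "google_custom": custom_outputs,
})

# Spotlight evaluates only the first 2,000 segments of the file.
test_set.head(2000).to_excel("covid_en_ru_eval.xlsx", index=False)  # needs openpyxl
```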

In addition to hLEPOR, BERTScore is coming soon, with two more metrics, TER and BLEU, also on Intento’s roadmap.
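
Spotlight computes hLEPOR for you, but segment-level, reference-based scoring in general is easy to illustrate with the roadmap metrics. Here is a minimal sketch using the sacrebleu library's BLEU and TER as stand-ins; the numbers it produces are not comparable to Spotlight's hLEPOR scores.

```python
# Illustration of segment-level, reference-based scoring with sacrebleu.
# Spotlight reports hLEPOR; BLEU and TER serve here only as familiar stand-ins.
from sacrebleu.metrics import BLEU, TER

bleu = BLEU(effective_order=True)  # recommended setting for sentence-level BLEU
ter = TER()

reference = "Чаще мойте руки."
hypothesis = "Часто мойте руки."

print(bleu.sentence_score(hypothesis, [reference]).score)  # higher is better
print(ter.sentence_score(hypothesis, [reference]).score)   # lower is better
```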

Quick evaluation overview

In our small experiment, Spotlight showed that the custom Google Cloud Advanced Translation API achieved a higher overall hLEPOR score of 0.61, compared to 0.58 for the stock engine.

The evaluation chart produced by Spotlight, a tool used to evaluate custom machine translation engines

After getting a quick overview of the evaluation, a reviewer can proceed to a detailed analysis of the segments, e.g., the degraded ones appearing below the line, or check the improved ones.
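
If you keep per-segment scores for both engines yourself, the same improved/degraded split is simple to reproduce. A rough sketch with pandas, using hypothetical score columns rather than Spotlight's actual output:

```python
# Sketch: splitting segments into improved vs. degraded by score delta.
# The score columns are hypothetical; Spotlight computes hLEPOR for you.
import pandas as pd

scores = pd.DataFrame({
    "segment_id": [1, 2, 3],
    "hlepor_stock": [0.55, 0.70, 0.62],
    "hlepor_custom": [0.66, 0.69, 0.40],
})

scores["delta"] = scores["hlepor_custom"] - scores["hlepor_stock"]
improved = scores[scores["delta"] > 0]   # improved by customization
degraded = scores[scores["delta"] < 0]   # degraded: review these first

print(degraded.sort_values("delta"))
```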

Analysis of machine translation segments in Spotlight

In the process of such a review, a reviewer is able to:

  • comment on segments, for example, to note that the reference translation is wrong or that both MT versions are correct
  • mark a segment for further checking
  • add a spotted issue type (omission, mistranslation, untranslated text, terminology, paraphrases, other)
  • download the evaluation results as an Excel file

This “lightweight” review approach speeds up evaluation by catching and addressing only the issues that actually need to be improved.
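
Teams that track review decisions outside the tool can model the same annotation fields themselves. Below is a sketch of recording comments, flags, and issue types and exporting them to an Excel file; the field names mirror the options listed above but are assumptions, not Spotlight's actual export schema.

```python
# Sketch: capturing review annotations and exporting them as an Excel file.
# Field names mirror the review options above; they are not Spotlight's schema.
from dataclasses import dataclass, asdict
from typing import Optional
import pandas as pd

ISSUE_TYPES = {"omission", "mistranslation", "untranslated", "terminology",
               "paraphrase", "other"}

@dataclass
class SegmentReview:
    segment_id: int
    comment: str = ""
    needs_further_check: bool = False
    issue_type: Optional[str] = None

    def __post_init__(self):
        if self.issue_type is not None and self.issue_type not in ISSUE_TYPES:
            raise ValueError(f"unknown issue type: {self.issue_type}")

reviews = [
    SegmentReview(3, comment="Terminology drifted in the custom output.",
                  needs_further_check=True, issue_type="terminology"),
    SegmentReview(7, comment="Both MT versions are acceptable."),
]

pd.DataFrame(asdict(r) for r in reviews).to_excel("review_notes.xlsx", index=False)
```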

Overview of segments in Spotlight, by Intento

Depending on the results of the Spotlight evaluation, users may want to retrain the custom MT engine or flag the specific issues to post-editors. The reviewed data (already corrected and “annotated”) can also be used to retrain the MT model.

Summary

An overview of the segment-level hLEPOR scores helps evaluators understand the current state of MT customization and saves time by enabling a focused review instead of a full-scope evaluation.

Spotlight can definitely save evaluators time and money. It also gives linguistic teams a quick yet sufficient understanding of the customization results before rolling a particular MT engine out to production. This could spare post-editors effort and nerves, especially if something is wrong and the engine needs retraining.

According to the development roadmap presented at the Spotlight launch in November 2020 (the launch page offers a virtual demo of Spotlight and a slide deck from that event), this is just one of the tools in Intento’s MT Studio product. The new toolkit for comprehensive MT curation will include options for data cleaning, training, and evaluation of multiple MT models, which may be even more interesting to a broader audience.

Comparison of output in evaluating custom machine translation using Spotlight

Source: Intento

As a software company, Intento leaves the task of trying the new service and actually training the engines to language service providers (LSPs). However, Intento does use Spotlight internally, saving its analytics team hours of precious time. Yes, that is correct: even with such agile automation, the human stays in the loop to curate MT training, evaluate the engines, and fine-tune and adjust the process where needed.

13 January 2021
