Neural machine translation quality evaluation

With the proliferation of MT engines in 2018, choosing the best one became a challenge. To address the pain of evaluating and selecting MT, companies are launching new services and technology. In this article we review ONEs, a human MT evaluation service from One Hour Translation, and Inten.to, an automated MT evaluation and marketplace service. Both are available via API.

So far, independent MT comparisons have been hard to come by. In 2016 – 2017, new entrants to NMT posted numerous press releases claiming they had approached human quality (Google 2016, Microsoft 2018) or beaten competitors (Lilt 2017, since taken down; DeepL 2017). Most of these comparisons relied on a variation of the BLEU metric, which compares machine output against a pre-loaded human reference translation, and which can be manipulated.
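To illustrate why single-reference BLEU is easy to game, here is a minimal sketch of the metric in Python. This is a simplified, smoothed sentence-level variant for demonstration only, not the exact formula any vendor used. A translation matching the reference word for word scores perfectly, while an adequate paraphrase is heavily penalized:

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU against a single reference, with add-one smoothing."""
    cand, ref = candidate.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        cand_ngrams = ngrams(cand, n)
        ref_ngrams = ngrams(ref, n)
        # clipped n-gram matches against the reference
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        log_prec += math.log((overlap + 1) / (total + 1))  # smoothed precision
    # brevity penalty: punish candidates shorter than the reference
    bp = min(1.0, math.exp(1 - len(ref) / max(len(cand), 1)))
    return bp * math.exp(log_prec / max_n)

reference = "the cat sat on the mat"
print(bleu("the cat sat on the mat", reference))        # exact copy scores 1.0
print(bleu("a cat was sitting on a mat", reference))    # adequate paraphrase scores far lower
```

Because the score rewards surface overlap with one specific reference, an engine tuned on (or leaked) test data can look superhuman while producing worse translations on real content.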

At the same time, the need for independent evaluation is rising.

More choice

There are many new options

Amazon, IBM, DeepL, Baidu, GTCom, Apple, and neural Yandex have all entered the market, alongside dozens of engines developed by translation companies in Europe and Asia.

Peak interest

Interest in machine translation is at an all-time high

Enterprises and IT companies develop their own engines (WIPO, ING, Booking.com, etc.), and LSPs report rising MT adoption, which is likely to hit 40 – 50% in 2018 according to two surveys, by EUATC and Nimdzi.

Ease of access

Training custom engines is easier than ever

There are ten open-source MT toolkits readily available for download on GitHub, plus adaptive MT offerings, making the training process easy. When training, companies regularly need independent evaluation to check on internal MT developers, who tend to exaggerate the performance of their pet systems.

Hundreds of small niches

Domains and languages play a big role

For example, Alibaba has built an engine that often wins over Google and Yandex in eCommerce translations in English <> Russian and Chinese <> Russian. But it loses on other text types and language combinations.

Choice has become more difficult. Keep reading to learn how new offerings facilitate it.

Human MT evaluation over API (One Hour Translation)

One Hour Translation launched a human machine translation evaluation service, ONEs (OHT NMT Evaluation Score). It opened up with an infographic comparing Google, Neural Systran and DeepL in 4 language combinations.

OHT’s approach is to have 20 human evaluators per language combination, all with experience in the subject area. Each scores the translations sentence by sentence from 1 to 100, with instructions on what each band of the scale means. Tests are blind: translators do not know which engine they are evaluating, and thus carry no brand bias. Two statistical tests verify the human inputs: the first ensures the results are statistically significant, and the second calculates the confidence score and margin of error. In the reporting section, OHT slices the results by winning sentences per engine, evaluations for each sentence, and score distribution.
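As a rough illustration of the second statistical step, here is how a mean score and margin of error could be computed from 20 evaluator scores. The scores and the normal-approximation formula below are our own assumptions for the sketch, not OHT's published method:

```python
import math
import statistics

def score_summary(scores, z=1.96):
    """Mean score with an approximate 95% margin of error (normal approximation)."""
    n = len(scores)
    mean = statistics.mean(scores)
    sem = statistics.stdev(scores) / math.sqrt(n)  # standard error of the mean
    return mean, z * sem

# Hypothetical 1-100 scores from 20 blind evaluators for one engine
scores = [78, 82, 75, 90, 68, 85, 80, 77, 88, 72,
          81, 79, 83, 70, 86, 74, 89, 76, 84, 73]
mean, moe = score_summary(scores)
print(f"engine score: {mean:.1f} ± {moe:.1f}")
```

With 20 evaluators rather than one or two, the margin of error shrinks with the square root of the panel size, which is what makes engine-to-engine differences of a few points meaningful at all.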

Evaluations take about two days to compile, and cost in the range of USD 2,000 – 4,000 depending on the number of sentences, languages, and engines compared. 

This human evaluation can work for any technology, including RbMT, SMT, and NMT, as well as public and on-premise MT.


Normally, high-quality human evaluations are difficult to organize and too expensive. Most tests include only one or two evaluators per language and lack a strong methodology to back them up, so the results are subjective. OHT solves the problem of scale via its platform, which has automated access to thousands of freelancers with domain specialization.

In addition, sound statistical testing reduces the human factor in a human evaluation.

What is missing

Evaluators look at the sentences but do not edit them.

When editing, translators working in inflectional languages may spend more effort rewriting word endings than correcting a few mistakes in meaning. It remains to be seen whether human evaluation of this type can be applied with the same precision to selecting engines for post-editing.


This new service makes organized and scalable human evaluation easy to buy and pioneers human MT QE over API.

For buyers, we believe it should work as an occasional check to select MT for raw output and translating user-generated content.

For OHT, this service can become a differentiator. As the first MT QE offering human opinion at scale via API – with a methodology and at an affordable rate – it can open doors to enterprise clients who have jumped on the MT bandwagon.

Automated MT evaluation + marketplace (Inten.to)

Inten.to has launched an automated evaluation service with a marketplace. The company first appeared in mid-2017 and has since raised close to USD 1 million in funding.

Inten.to compares engines by quality and price on a quarterly basis, automatically selects the most suitable for the current task, and then provides that engine via API. Inten.to monetizes by selling MT with a markup.
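The selection step can be sketched as a simple quality/price lookup. The engine names, scores, and prices below are entirely hypothetical, and Inten.to's actual routing logic is certainly more involved:

```python
# Hypothetical quarterly benchmark data for an aggregator of this kind:
# quality is a 0-1 benchmark score, price is USD per million characters.
ENGINES = {
    "engine_a": {"quality": 0.62, "price": 20.0},
    "engine_b": {"quality": 0.71, "price": 60.0},
    "engine_c": {"quality": 0.68, "price": 15.0},
}

def pick_engine(engines, max_price):
    """Return the highest-quality engine within the buyer's price cap."""
    affordable = {name: e for name, e in engines.items() if e["price"] <= max_price}
    if not affordable:
        raise ValueError("no engine fits the price cap")
    return max(affordable, key=lambda name: affordable[name]["quality"])

print(pick_engine(ENGINES, max_price=25.0))   # budget cap excludes the top engine
print(pick_engine(ENGINES, max_price=100.0))  # with no real cap, best quality wins
```

The buyer integrates one API and the aggregator re-runs this selection as quarterly benchmark scores change, which is the core of the convenience argument.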

For quality evaluations Inten.to uses LEPOR scores, a derivative of BLEU that compares MT with reference human translations.

So far, Inten.to has been integrated into Smartcat and Fluency TMS, where users can select its aggregation just like a regular engine.


The value is in convenience – instead of selecting engines, integrating 3 – 6 of them, then rechecking and monitoring performance over time, a user integrates only one, and gains access to everything already connected to the platform. Users will always have the best of the engines tuned to their price/quality requirements.

Because the evaluation is automatic rather than human, it can easily scale up to accommodate comparisons in dozens or even hundreds of language combinations.

As a bonus, there is a feature to run one’s own LEPOR evaluations for any integrated engine using any type of specialized content.

What is missing

By default, Inten.to compares MT engines using news in 48 languages. This results in very low scores for specialized engines such as technical SAP MT. These scores can mislead buyers. Obtaining comparisons using specialized medical, legal, and financial content costs money, and with the main offering available for free, buyers might not want to invest in tailored evaluations using custom data.

Second, Inten.to is clearly aiming its main offering at the enterprise sector, and will need to persuade big companies that, as a small startup, it can provide enough service reliability, uptime, data security, and financial stability to support mission-critical operations.


Inten.to has introduced a unique aggregator model, something that has not been done before in the language industry – likely because the founders come from IT rather than a linguistics background.

In its present form, the product will shine in scenarios where the user is working with regular types of content but with a large and varying language pool. However, in scenarios where the content is different all the time but the languages are the same (80 percent of LSPs and buyers around the world), the users might just want to select one preferred engine and stick to it.


These two new offerings usher in a new market niche of MT selection.

Plenty of opportunities still remain. For instance, there is not yet:

  • Any service to spot-check whether an instant translation of a document via MT + TM is good enough for business purposes
  • An online marketplace for small MT engines built by LSPs
  • Any large modern community to hire and train MT training specialists – not even on LinkedIn

While rapidly progressing towards end-users, MT training and comparison remains expert-driven and somewhat academic in flavor. In the next 3 years however, all that could change.

