So far, independent MT comparisons have been hard to come by. In 2016 – 2017, new entrants to NMT posted numerous press releases stating that they had approached human quality (Google 2016, Microsoft 2018) or beaten competitors (Lilt 2017, since taken down; DeepL 2017). Most of these comparisons have relied on a variation of the BLEU metric, which compares machine translation with a pre-loaded human translation and can be manipulated.
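To make the idea of a reference-based metric concrete, here is a minimal sketch of a BLEU-style score. It is a simplified, sentence-level illustration with a single reference, bigrams at most, and no smoothing. It is not the scoring procedure any of the vendors above used, and the function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, reference, n):
    """Share of candidate n-grams found in the reference.

    Each n-gram is credited at most as often as it occurs in the reference,
    so repeating a matching word does not inflate the score.
    """
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


def simple_bleu(candidate, reference, max_n=2):
    """Simplified BLEU: geometric mean of clipped precisions times a brevity penalty.

    Assumption: whitespace tokenization and one reference translation.
    """
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    precisions = [clipped_precision(cand, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # no smoothing in this sketch
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # Penalize candidates that are shorter than the reference.
    brevity = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(log_mean)
```

Because the score depends entirely on the chosen reference translations and test sentences, whoever picks them can influence the outcome, which is one reason self-reported comparisons deserve caution.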
At the same time, the need for independent evaluation is rising.
One Hour Translation launched a human evaluation service for machine translation, ONEs (OHT NMT Evaluation Score). It debuted with an infographic comparing Google, Neural Systran, and DeepL in four language combinations.
OHT’s approach is to have 20 human evaluators per language combination, all with experience in the subject area. Each scores the translations sentence by sentence from 1 to 100, following instructions that explain what each score range means. Tests are blind – translators do not know which engine they are evaluating, and thus carry no bias toward any engine. Two statistical tests verify human inputs – the first to ensure the results are statistically significant, and the second to calculate the confidence score and margin of error. In the reporting section, OHT slices the results by winning sentences per engine, evaluations for each sentence, and score distribution.
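As a rough illustration of how such scores could be aggregated, the sketch below computes a mean score with a margin of error and counts winning sentences per engine. OHT has not published the exact tests it uses, so the normal-approximation 95 percent interval, the function names, and the data shapes here are all assumptions.

```python
import math
import statistics
from collections import Counter


def summarize_engine(scores, z=1.96):
    """Return (mean, margin_of_error) for one engine's scores.

    Assumption: a normal-approximation confidence interval, where z=1.96
    corresponds to roughly 95 percent confidence. Needs at least two scores.
    """
    mean = statistics.mean(scores)
    standard_error = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, z * standard_error


def winning_sentences(per_sentence):
    """Count how often each engine has the highest score on a sentence.

    per_sentence: list of dicts mapping engine name -> average evaluator
    score for that sentence. On a tie, the engine listed first wins.
    """
    wins = Counter()
    for sentence_scores in per_sentence:
        best_engine = max(sentence_scores, key=sentence_scores.get)
        wins[best_engine] += 1
    return wins
```

With made-up data, `winning_sentences([{"engine_a": 80, "engine_b": 70}, {"engine_a": 60, "engine_b": 65}])` credits each engine with one winning sentence.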
Evaluations take about two days to compile, and cost in the range of USD 2,000 – 4,000 depending on the number of sentences, languages, and engines compared.
This human evaluation can work for any technology, including RbMT, SMT, and NMT, as well as public and on-premise MT.
Inten.to has launched an automatic evaluation service with a marketplace. It first appeared in mid-2017 and has since raised close to USD 1 million in funding.
Inten.to compares engines by quality and price on a quarterly basis, automatically selects the most suitable one for the current task, and then provides that engine via API. Inten.to monetizes by selling MT with a markup.
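The selection step can be pictured with a small sketch. Inten.to's actual selection logic is not described in the source, so the rule below (cheapest engine that meets a quality floor) and every name and field in it are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EngineQuote:
    """Hypothetical record for one engine on one language pair."""
    name: str
    quality: float  # automatic metric score between 0 and 1 (assumed scale)
    price_per_million_chars: float  # price in USD (assumed unit)


def pick_engine(quotes, min_quality):
    """Return the cheapest engine whose quality meets the floor, or None.

    When prices are equal, the higher-quality engine is preferred.
    """
    eligible = [q for q in quotes if q.quality >= min_quality]
    if not eligible:
        return None
    return min(eligible, key=lambda q: (q.price_per_million_chars, -q.quality))
```

A real router would likely also weigh domain, latency, and data-privacy constraints. This sketch only shows the quality-versus-price trade-off named above.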
For quality evaluations, Inten.to uses LEPOR scores, an automatic metric that, like BLEU, compares MT with reference human translations.
So far, Inten.to has been integrated into Smartcat and Fluency TMS, where users can select its aggregation just like a regular engine.
These two new offerings usher in a new market niche of MT selection.
There still remain plenty of opportunities. For instance, there isn’t:
While rapidly progressing towards end-users, MT training and comparison remain expert-driven and somewhat academic in flavor. In the next three years, however, all that could change.