MT Masters One Trillion Words
TAUS Takeaways: 2 out of 3
Machines translate more in a day than all human translators on the planet combined can do in a year
This year’s machine translation panel at the annual TAUS conference included Alon Lavie of Amazon Translate, Boxing Chen of Alibaba Translate, Necip Fazil Ayan of Facebook’s Language Technologies group, and Chris Wendt of Microsoft Translator. Below is a summary of the machine translation (MT) discussions from their sessions and from the corridors in between.
The world is reaching a new milestone in MT – more than 1 trillion words a day. Machines translate in a single day more than all professional translators on the planet combined can do in a year.
Below is an optimistic estimate of the volume processed by a few select platforms every day.
Est: 200 billion words a day
Est: 300 billion words on a daily basis
Est: 200 billion words a day
6.5 billion posts a day, each ranging from 7 words to a couple hundred
Est: 50 – 60 billion words a day
30 million news updates and 500 million social media updates a day
Figure in Microsoft, Tencent, Yandex, SDL, Baidu, Amazon, Systran, Promt, Kantan, Asia Online, Iconic, Globalese, and thousands of corporate engines, and the total figure could be far north of 1 trillion. Ecommerce platforms and social media updates demand the largest throughput.
Machines vs a million human translators
Just to compare, here is the calculation for human translator throughput. According to recent tests by Sandberg Translation Partners, a human translator has an average throughput of about 2,000 net words on a normal day, not counting translation memory leverage.
There are between 600,000 and 1.2 million professional translators in the world. Platforms like Smartcat and Matecat arrive at 600,000 professionals as a conservative estimate. The translation marketplace Proz.com claims to have more than 850,000 registered users. An extrapolation based on statistical data for the number of registered self-employed translators (we have access to figures from France, Finland, Portugal, the UK, and Russia) puts the potential figure at 1.2 million.
Quick math: 600,000 professionals x 2,000 words/day x 365 days = 438 billion words/year. Even at the upper estimate of 1.2 million professionals, the total is 876 billion words/year, still less than what MT is doing every day!
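The arithmetic can be checked for both ends of the translator range with a short Python sketch. It uses only the figures quoted in this article (2,000 words a day, a 365-day working year, and 1 trillion MT words a day); the constant and function names are illustrative. With these inputs, 438 billion words a year corresponds to 600,000 translators, and 1.2 million translators yields 876 billion.

```python
# Inputs quoted in the article. The 365-day working year is the article's own
# simplification; the constant and function names are illustrative.
WORDS_PER_TRANSLATOR_PER_DAY = 2_000
DAYS_PER_YEAR = 365
MT_WORDS_PER_DAY = 1_000_000_000_000  # "more than 1 trillion words a day"


def human_words_per_year(translators: int) -> int:
    """Yearly output of a translator population at the quoted daily rate."""
    return translators * WORDS_PER_TRANSLATOR_PER_DAY * DAYS_PER_YEAR


for translators in (600_000, 1_200_000):
    yearly = human_words_per_year(translators)
    share = yearly / MT_WORDS_PER_DAY
    print(f"{translators:>9,} translators: {yearly / 1e9:,.0f} billion words/year "
          f"({share:.0%} of one day of MT)")
```

Either way, a full year of human output stays below a single day of machine output, and a realistic working year of fewer than 365 days would widen the gap.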
Machine Translation Trends 2018
Record volumes are not the only trend in MT. As researchers strive to achieve human parity for quality, salespeople seek to integrate machine translation into new business scenarios. A few new trends have emerged over the last year.
Non-human use cases
Early machine translation was used primarily to help humans understand texts in foreign languages, and sometimes just to give them the quick gist of the meaning. As quality improved, MT became a part of the professional translation process, improving productivity. Today, and even more so tomorrow, the consumer of MT output is increasingly not a human but other software. Examples include:
Multilingual search uses MT to find relevant products in multilingual ecommerce catalogs, as well as support articles and forum posts.
Used in conjunction with speech-to-text and speech synthesis, MT allows for fully-automated interpreting used in wearable devices (although the quality is still lacking).
MT can translate texts from multiple languages into one, allowing for keyword identification; patent, drug, and legislation search; data-driven journalism; and more.
In multilingual litigation, MT helps find evidence among terabytes of data on the defendant’s computers.
MT output personalization
Researchers are now experimenting with metadata to personalize the output of machine translation systems. For instance, gender and age recognition can help with polite address and tone in Japanese, as well as with syntax in inflectional languages. Developers of MT systems may find ways to train MT on other personal information, such as geographical location, recent search history on Google and Bing, income level, and interests and likes on Facebook, to provide context for translation.
As scary as it might sound, an MT engine will know more about the recipient in the future and will be able to tailor the output for understanding, resulting in different translations of the same content aimed at different people. This could raise ethical questions, such as a “translation understanding bubble” akin to the personalized search bubble on Google.
The need for data for low-resourced languages and domains
The vast majority of machine translation happens in the top economic language combinations, such as English to Chinese, French, German, Italian, and so on. For example, in Microsoft Translator, the top 10 language combinations make up 53 percent of the volume, and the top 20 account for 71 percent. Beyond the top 20 combinations, the volume is not significant, and the usage chart looks like a hockey stick. One of the reasons for the disparity is that the training data for other languages is insufficient to produce good-quality engines. As the world moves toward linguistic inclusion and equality, and toward the discovery of new markets, the search for quality bilingual data becomes more of a quest for MT developers.
Developers often try to get the data by scanning multilingual websites, for example, via the paracrawl.eu project. However, this does not fully satisfy the need. The future may lie in utilizing monolingual data, according to Alon Lavie, or in creating a data marketplace to which multiple stakeholders (such as academia and governments) can contribute, according to TAUS. Data may become the new oil, if only organizations can find a way to buy and sell it.