TAUS Big Data Gets Bigger: The quest for an ideal TM / MT configuration
By July 2018, the TAUS Quality Dashboard benchmarking database had exceeded 100 million words. It is still small in comparison with TMS databases, but it is slowly becoming a relevant aggregation of translation project metadata from which we can start drawing early conclusions.
TAUS’s QD is fed by a handful of enterprise companies, a notable exception being Baltic LSP Synergium. Collectively, they add about 1 million words a day.
Translation memory (TM) is the main way of churning out translations at enterprise companies using TAUS DQF. Unedited matches from the TM account for 64 percent of all content translated, and edited fuzzy matches represent close to 10 percent more. Depending on the discount scheme for matches used by vendors, companies might be saving anywhere from 50 to 70 percent of their human translation budget with the help of CAT tools and TMS.
That could mean USD 7.5 – 10.5 million in savings for the whole sample of 100 million words, assuming the average price per word is USD 0.15. Technology for ten enterprise companies should cost around USD 1 million a year, warranting a 7 – 10x return on investment.
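The back-of-the-envelope arithmetic above can be reproduced in a few lines. This is only a sketch: the figures come from the paragraph, while the function name `tm_savings` and the constant names are made up for illustration.

```python
# Figures are taken from the text; names are illustrative, not part of any TAUS tool.
TOTAL_WORDS = 100_000_000   # size of the sample
PRICE_PER_WORD = 0.15       # assumed average price in USD
TECH_COST = 1_000_000       # estimated yearly technology cost in USD


def tm_savings(total_words, price_per_word, saving_rate):
    """Return the budget saved when `saving_rate` of the full-price cost is avoided."""
    return total_words * price_per_word * saving_rate


low = tm_savings(TOTAL_WORDS, PRICE_PER_WORD, 0.50)
high = tm_savings(TOTAL_WORDS, PRICE_PER_WORD, 0.70)

print(f"Savings: USD {low:,.0f} to {high:,.0f}")
print(f"Return multiple: {low / TECH_COST:.1f}x to {high / TECH_COST:.1f}x")
```

Changing `PRICE_PER_WORD` or the saving rates shows how sensitive the return estimate is to those two assumptions.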
TAUS’s figure for savings is much greater than any previous benchmark. Two years ago, using data from Memsource, I looked at TM leverage in a database of 500 million words, and the median saving was 36 percent. Today, TAUS shows savings of 50 to 70 percent. The difference is that most Memsource clients at that time were language services companies. Large LSPs usually deal with varied content from multiple clients. A significant portion of their content is new and has no corresponding matches in the memory. Content in the enterprise is more regular and repetitive, and thus the TAUS database can boast higher match rates.
The quest for an ideal TM + MT combo
According to the dataset, machine translation is nowhere close to replacing TM in the business of boosting human translator productivity. MT accounts for only about 12.5 percent of segments. Furthermore, most MT suggestions require some editing. However, drawing final conclusions would be unfair considering that the sample for MT is still very small.
TAUS is looking for an ideal threshold at which to replace TM with MT. The report splits the sample into two workflows. In the first, human translators work with translation memory only. In the second, MT-supported workflow, the text goes through the TM first, and machine translation is used for segments where memory matches fall below a quality threshold. At the moment, an early speculation is that the best threshold is a 70 percent match rate, above which MT becomes inefficient. Companies use this cut-off point in practice, and TAUS’s objective is to check whether there is data to prove this is the most efficient way.
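The MT-supported workflow described above boils down to a simple routing rule. Here is a minimal sketch, assuming the speculative 70 percent cut-off; the `Segment` class, the `route` function and the sample sentences are hypothetical and do not come from TAUS or any CAT tool.

```python
from dataclasses import dataclass

# Assumed cut-off, based on the "early speculation" in the text; not a confirmed value.
TM_THRESHOLD = 70


@dataclass
class Segment:
    text: str
    tm_match: int  # best TM match rate in percent, 0 when the memory has nothing


def route(segment, threshold=TM_THRESHOLD):
    """Return 'TM' when the best memory match meets the threshold, otherwise 'MT'."""
    return "TM" if segment.tm_match >= threshold else "MT"


# Made-up sample segments for illustration only.
segments = [
    Segment("Press the power button.", 100),
    Segment("Hold the button for five seconds.", 82),
    Segment("Firmware updates install overnight.", 45),
]

routing = [route(s) for s in segments]
print(routing)
```

Passing a different `threshold` to `route` is the kind of experiment the data could eventually settle: which cut-off leaves the translator with the least editing.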
The search continues — through Levenshtein edit distances and tag-riddled segments.
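For readers unfamiliar with the metric, Levenshtein edit distance counts the single-character insertions, deletions and substitutions needed to turn one string into another, which makes it a rough proxy for how much a translator changed a suggestion. The sketch below is the textbook dynamic-programming version. Whether TAUS measures at character or word level, and how it normalizes the result, is not stated here, so the character-level choice and the `normalized_edit_distance` helper are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits that turn `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the shorter string in the row
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(suggestion: str, final: str) -> float:
    """Edit distance divided by the longer length (an assumed normalization)."""
    longest = max(len(suggestion), len(final))
    return levenshtein(suggestion, final) / longest if longest else 0.0


print(levenshtein("kitten", "sitting"))
print(normalized_edit_distance("The file was saved.", "The file has been saved."))
```

A score of 0.0 means the suggestion was accepted untouched, and values near 1.0 mean it was almost entirely rewritten.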
400 words an hour — the average productivity for a human translator
Finally, the dataset gives insight into human productivity. TAUS offers an online benchmarking tool, but the data there is skewed because most of the volume comes from TM and MT. Using the report data on human translation volumes, we were able to configure the visualization for languages with significant human-made volumes only (German, Baltic, Russian). The result: 400 words an hour without the help of technology. Pure, un-augmented human brain power.
A full 7-hour work day nets about 2,800 words, or roughly 11 pages. If only someone could sit for 7 hours straight and translate without interruption…
TMS databases could offer a more precise picture
The TAUS QD database of 100 million words is tiny compared to the massive silos on which cloud-based TMS companies sit. For instance, Memsource claims to have processed more than 20 billion words last year, of which about one third was actually translated. XTM says its public cloud clocked 14 billion source words, and that clients uploaded billions more on private clouds. In a recent presentation, Smartling claimed to have translated 8.5 billion words in 2017.
Companies count words differently, and none of them believes the others' figures, but the numbers give a good sense of the order of magnitude.
Though smaller, the TAUS database has the benefit of neutrality. You can believe that its numbers are true.