Languages & The Media: The Latest Trends in Media Localization

Article by Sarah Hickey.

The “Languages & The Media” conference is an international conference focused on audiovisual language transfer in the media. The conference is hosted biennially in Berlin, Germany, and (in their own words) it:

“...brings together content creators and distributors, broadcasters, streaming services, language services providers, software and hardware developers, researchers, trainers, practitioners, and all those involved in the production, marketing and distribution of audiovisual content for information, entertainment or educational purposes through localisation and accessibility.”

This year, the event took place from November 7-9, 2022, and Nimdzi’s analysts joined the conversation to bring you the latest and greatest information and trends from the vibrant field of media localization.

Below is a summary of the main themes that stood out from what was discussed inside and outside the conference rooms.

Audio description

Audio description (AD) was a major topic at this year’s conference. AD is a service that aims at providing equal access to visual media (movies, TV shows, video games) for people who are blind and partially sighted. In addition to the regular on-screen dialogue, a narrator describes visual items that are relevant to the story (e.g. actions and facial expressions as well as surroundings).

Current challenges and potential solutions in audio description

AD appears to be a service that is growing in importance. Yet many challenges remain that have to be overcome. Below is a brief summary of those that were discussed:

  • Not a widespread solution: While AD is a well-established service in some markets, speakers at this year’s conference pointed out that it does not yet exist at all in others. In those countries, blind and partially sighted people are denied equal access to visual media.
  • Mostly in English: In countries where AD does exist, the majority of AD services are provided in English. Even on channels that pride themselves on being multilingual, AD content is mostly restricted to English.
  • AD as an afterthought: More often than not, when new box sets are released, AD is not considered from the outset, so blind and partially sighted people have to wait longer to access the same materials, which creates a bad user experience.
  • Providing true accessibility: Especially when it comes to video games and software, there are many hurdles that have yet to be overcome for true accessibility to be achieved. For instance, how should QR codes be handled? And how can options in video games be presented in a way that allows users to have all the relevant information to enable them to make an informed decision on what to do next?
  • Neutrality: Traditionally, the rule is that the narrator should be neutral. However, discussions questioned whether that is still the right approach. Some argued that in certain instances a more engaging narrator can provide a better user experience. One example given was a dating show, where including descriptions of flirty behavior and body language (as opposed to neutral descriptions of what the participants are wearing) might actually be appropriate and suit the tone and purpose of the show. That being said, in such cases it is extremely challenging to use language that does not, in some way, sound offensive or inappropriate.

Considering these challenges, here is a short summary of some of the solutions that speakers put forward:

  • Blind and low vision narrators: As with any other form of accessibility, the best solutions come from involving the community and end users rather than from making assumptions about what is and is not needed on behalf of the very people the solution is intended for. To make sure software is accessible for everyone, some companies have started working with blind and low vision narrators.
  • Synthetic speech: While synthetic speech is not intended to be used for entertainment purposes (yet), some conference attendees raised the point that there are already use cases for this kind of technology. For instance, TikTok videos could be made accessible for blind and partially sighted people via text description spoken by a synthetic voice. Others made the point that ten years from now the quality of synthetic voices will have greatly improved and that we can expect new use cases to arise then. 
  • Audio subtitles: In an effort to combat the English-centric nature of AD to date, the use of audio subtitles was discussed. In such cases, subtitles are read out via synthetic voices. Whether this solution is deemed suitable by end users was not part of the debate, nor was it clear whether audio subtitles also account for the narrative parts of AD (as opposed to simply reading out just the dialogue). Nonetheless, it might be a solution that can at least bridge some of the gaps while AD is being rolled out in more markets and languages.

What do you see? Sensitive content in AD

While describing what happens on screen might seem fairly straightforward to the untrained eye, this task is anything but — especially when it comes to sensitive content, a challenge that deserves a closer look. 

One major question in AD is how characters on screen should be described. Should the narrator point out someone’s skin color and ethnicity, their gender, or their portrayed sexuality? On the one hand, we may argue that we do not need to know whether a character is white or black. In a truly equal world it should not matter and maybe to become a more inclusive world we need to stop mentioning it. On the other hand, the general rule is that if a sighted person can see/identify it, then it should be part of the audio description as well so that blind and partially sighted people have access to the same information. In addition, more often than not movies and TV shows contain scenes that highlight the unequal treatment of marginalized groups, in which case this information is integral to the plot. One potential solution discussed was to have audio introductions where people can choose whether they want these descriptions or not.

This is just one example of many complexities that were discussed to illustrate the kinds of challenges currently being faced and debated in this field.


Speech-to-text solutions

There are many speech-to-text solutions in the field of media. In fact, taking spoken words and converting them into written text can be described as the bread and butter of media localization. In particular, closed captions and subtitles are well-established services. Let’s briefly differentiate between the two.

  • Closed captions are predominantly an accessibility feature aimed at providing people who are Deaf or hard of hearing with equal access to verbal communication and audio content in the same language as the original (no translation). That being said, these days captions are also increasingly being used by hearing people, particularly the younger generations, either as a tool to enhance comprehension (e.g. watching content in one’s second language) or when videos are watched on mobile devices in locations that require sound to be muted. 
  • Subtitles, on the other hand, are not transcripts of the original audio but translations into other languages. For that reason, subtitles are predominantly intended for people who can hear but do not speak the same language as the one on screen.

Aside from these media localization “classics,” today we can find new applications of speech-to-text solutions, and two stood out from the discussions at this year’s conference.

Speech-to-text interpreting

Speech-to-text interpreting is still a relatively new field (established around 2015, depending on who you ask) and one that shows that these days it is no longer quite as simple as translation being written and interpreting being spoken. While live subtitling (see next section) contains an element of interpreting, speech-to-text interpreting comes close to audiovisual translation (AVT).

Before delving deeper into the topic, one important point to make upfront is that speech-to-text interpreting typically involves respeaking, which requires the respeaker to add punctuation and other formatting to their verbal output. In addition, respeakers may also use (or be required to use) predefined voice commands for special formatting or proper names. 

The below list provides an overview of the five main workflows in which speech-to-text interpreting can be performed:

  1. Interlingual respeaking/typing: One person interprets the original speech into another language but in a respeaking format that is more suitable for the transcription machine that generates the text output. 
  2. Simultaneous interpreting + intralingual respeaking/typing: A simultaneous interpreter renders the original speech into another language and a second person takes the interpretation and respeaks it again in the same language as the interpretation but in a way that is more suitable for captioning (including punctuation and other formatting). 
  3. Simultaneous interpreting and automatic speech recognition (ASR): A simultaneous interpreter renders the original speech into another language and the output is picked up by ASR technology to provide written output.
  4. Intralingual respeaking and machine translation (MT): One person respeaks the original speech in the same language (no translation) to make it more suitable for captioning. The respeaker’s output is then translated via MT.
  5. ASR and MT: ASR technology is used to transcribe the original speech, which is then translated via an MT engine.

Editing might be added to any of the five workflows described above. 
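As a toy illustration of how these stages chain together, here is a minimal sketch of workflow 5 (ASR followed by MT). The function names and the "engines" below are placeholders invented for this sketch, not any real ASR or MT API:

```python
# Sketch of workflow 5: ASR transcribes the original speech, then MT translates it.
# Both "engines" here are toy stand-ins for real ASR and MT services.

def asr_transcribe(audio: str) -> str:
    """Placeholder ASR: pretend the audio has been recognized and punctuated."""
    # A real system would call a speech recognition service here.
    return audio.strip().capitalize() + "."

def machine_translate(text: str, target_lang: str) -> str:
    """Placeholder MT: tag the text with the target language instead of translating."""
    # A real system would call an MT engine here.
    return f"[{target_lang}] {text}"

def live_caption(audio: str, target_lang: str) -> str:
    """Workflow 5: ASR output feeds straight into MT, with no human in the loop."""
    transcript = asr_transcribe(audio)
    return machine_translate(transcript, target_lang)

print(live_caption("welcome to the conference", "de"))
# → [de] Welcome to the conference.
```

The other four workflows differ only in which of these stages is performed by a human (interpreter or respeaker) rather than a machine, and whether a translation step happens before or after the speech-to-text step.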

To date, the main use cases for speech-to-text interpreting are live broadcasts as well as meetings and events. 

Interestingly, interpreting associations are still unwilling to accept respeaking and speech-to-text interpreting as a profession, even though both services exist and there are professionals making a living from them. 

Live subtitling

Live subtitling is a service that has picked up immensely since the pandemic-induced spike in video conferencing, and the technology in this space has been making tremendous strides. It is therefore no surprise that several talks at this year’s conference homed in on the latest from within this particular niche.

In essence, the service of live subtitling involves taking spoken content and converting it to written content in multiple languages with minimal delay. Live subtitles can be created in a few different ways: they can be generated by a machine without human input, the machine output can be edited live by a person, or the live subtitles can involve linguists from the get-go.

Since the Zoom boom, live subtitles have become popular for online meetings but are also being used in live broadcasts, at onsite events, and to make radio content accessible online.

What is interesting to note about this trend is that the providers of live subtitling are coming into this space from different sides of the industry:

  1. Media localization providers experienced in the field of subtitling in a more general sense.
  2. Machine translation providers who are bringing their technology into the meeting and events space.
  3. Remote simultaneous interpreting (RSI) providers who are looking to offer a more robust service portfolio to their existing clients and also reach clients with smaller budgets.

AI dubbing 

Dubbing is the other bread and butter service in the media localization industry and, to date, one that is (almost) exclusively performed by voice actors. That this might change going forward became evident in presentations that showcased the latest developments in machine dubbing. 

The quality of synthetic voices has come a long way. Not only do some voices sound so remarkably human that in some cases it can be hard to tell whether or not the voice is synthetic, but also the latest developments involve technology that is able to mimic the original speaker's voice in the translated, synthetic version. 

Although not fit for entertainment purposes (yet), current use cases for AI dubbing range from international broadcasts to voiceover for documentaries and corporate videos.

Price pressure and the talent crunch 

The (alleged) talent crunch within the media localization space is certainly nothing new but it appears to have reached new heights due to ever-increasing volumes, new markets, and language pairs, as well as the economic pressure imposed by inflation. Unsurprisingly, the topic was discussed numerous times and in different formats at this year's conference, including as part of a panel discussion about the working conditions of audio-visual translators (AVTs). 

What has changed? The topic appears to be taken seriously and is being discussed openly from all sides (AVTs, language service providers, and buyers). Below are a few key takeaways from the discussions:

  • What talent crunch?: Depending on the market, there are actually more AVTs than ever before, making some question how there can be a talent crunch.
  • Thriving instead of surviving: Conversations tend to center around the minimum rates AVTs can work for. The problem with this is that AVTs are professionally trained linguists with university degrees who do not just want to survive but thrive in their chosen profession.
  • Characters instead of minutes: Current rates are set per minute of AV content. In discussions it became clear that this is not necessarily an accurate way of measuring an AVT’s effort, as the amount of dialogue can vary heavily from scene to scene. It was argued that rates should be charged by number of characters instead, which is also the standard in traditional translation.
  • Collective bargaining: One example brought forth by a panelist was that of a company that paid almost 50% more for German subtitles than for Norwegian ones, which struck the panelist as extremely odd considering that rates for these two languages are typically relatively on par. Digging deeper, the panelist found that the reason for the difference was that German translators were simply unwilling to accept the low rates, highlighting the power of collective bargaining. In a similar vein, a collective bargaining initiative by European AVTs is being discussed.
  • Demand versus supply: It was argued that with all the debates around increasing volumes of content in the media localization field, rates for AVTs should go up and not be pushed down further.
  • Multi-million dollar profits: It is no secret that companies in the media localization space have seen their revenues climb for years and with rising demand there appears to be no end in sight. The point was raised that it is hard to believe that these multi-million dollar companies cannot afford to pay a decent rate to their linguists.

Rising volumes and shifting (language) directions

The media localization industry is tremendously busy, which is certainly not a new trend but rather an ongoing one. The term "explosion of content" was used several times, as it has been in industry circles over the past few years. We can get a sense of what dimensions we’re talking about if we look at Netflix’s 2021 localization volumes:

  • Seven million minutes of subtitled content — the equivalent of more than 13 years’ worth
  • Five million minutes of dubbed content — the equivalent of almost 10 years’ worth

These are huge figures, which our industry delivered on in months — and this is, of course, just one streaming platform. 
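As a quick sanity check, the "years' worth" equivalents follow directly from minute arithmetic:

```python
# Converting Netflix's 2021 localization volumes (in minutes) into years of content.
MINUTES_PER_YEAR = 60 * 24 * 365  # 525,600 minutes in a (non-leap) year

subtitled_years = 7_000_000 / MINUTES_PER_YEAR  # subtitled content
dubbed_years = 5_000_000 / MINUTES_PER_YEAR     # dubbed content

print(f"Subtitled: ~{subtitled_years:.1f} years")  # ~13.3 years
print(f"Dubbed: ~{dubbed_years:.1f} years")        # ~9.5 years
```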

At the same time, it’s not just the number of hours of content but also the number of languages. In the past, the standard was 12 languages; now it’s more than double that (in the case of Netflix, it’s 37 languages for subtitling and 35 for dubbing, for productions in 50 countries). In addition, content these days is coming from and going into any number of languages and is no longer limited to going from English into other languages. Direct Asia-to-Asia content in particular, without English as a pivot language, continues to grow in popularity.

With this change in direction and with more content being produced in languages other than English, we are also witnessing the emergence of English dubbing, which is a relatively new phenomenon.

For the providers, the challenge is not purely about handling rising volumes but also establishing new workflows and finding new talent in the right markets, at the right time. This all adds to pre-production and post-production times for dubbing and subtitling.

The end-user perspective confirms the trend. In a panel discussion on content globalization, Simon Constable from Visual Data shared data from a survey that asked end-users whether they watch productions in other languages. The results showed that 25% do it all the time, 26% mostly do so, and 28% sometimes do. In addition, 63% stated that poor localization has an impact on what they watch and whether or not they switch it off. 

Access and inclusion and how to handle sensitive content

As one of this year's themes, access and inclusion were part of most discussions. This is no surprise considering that many services we group under the umbrella of media localization are also well-established accessibility services (as mentioned above). Several speakers encouraged the audience to "look around" at the lack of diversity in the industry and others remarked that you “can’t be what you can’t see” to remind us of the importance of representation of all groups in the media and entertainment space. 

Let’s take a look at some of the challenges that were discussed regarding accessibility, inclusion, and sensitive content.

Finding the right voice

Keynote speaker Änne Troester from the German dubbing association Synchronverband e.V. - die Gilde pointed out that for a long time the dubbing industry was believed to be colorblind because it is audio only, but that this assumption could not be more wrong. For one thing, Änne noted that white actors have always been allowed to voice black people but not the other way around. Efforts to change this, and to cast people who can accurately represent marginalized and underrepresented groups of society, come with their own challenges. Take the example of a character on screen transitioning from male to female or the other way around. Änne rightly pointed out that we as an industry should not assume how this can best be represented in the dubbing script and the voice of the actor. To do it right, we need to ask representatives from these groups for their input. But then the challenge becomes one of how to do that. You cannot and should not ask people about their sexuality, so how do you recruit in a manner that allows for a respectful and accurate representation on screen?

While there was no final answer to this challenging question, it is still a good thing that people are actually talking about it, that they are questioning current practices, and are actively looking for ways to make the media industry more inclusive.

The power of the written word

How should offensive content be translated? For one thing, translators need to consider the intensity of the word in the specific language and culture. Something that is relatively harmless in one language may be completely inappropriate in another. 

Another layer of complexity is added when transferring spoken words into written ones, whether in the form of captions or subtitles. As many speakers reminded the audience throughout the conference, seeing something in writing can often come across as much stronger than hearing the verbal equivalent. Some even argue that there are some words that should never be written (e.g. the n-word).


Final thoughts

When we look at the overall picture, not much has changed. For years now, volumes have been rising and continue to do so. Hand in hand with this goes the notion of a perceived talent crunch, which mostly comes down to price pressure and to finding talent in new markets, but also to the changing roles of linguists as language service providers continue to leverage technology in an effort to increase efficiencies.

Most notably, AI is increasingly entering the media space both in the form of machine dubbing and live subtitling. However, it isn’t necessarily aimed at the entertainment industry but rather the meeting and events space as well as the broadcasting industry. At least for now. 

For media localization, accessibility is not just another buzzword. This was evident both from the discussions inside and outside the conference rooms as well as from a “practice what you preach” standpoint. Indeed, two conference rooms were equipped with a large screen (positioned next to the stage) with a running machine-generated transcript of the conversations happening on stage. That being said, peers recognized that to achieve true accessibility and inclusion, there is a lot more work that needs to be done in the coming years. And they aim to do it.

This article was prepared by Sarah Hickey, Nimdzi's VP of Research. If you have any questions about interpreting technology, reach out to Sarah at [email protected].

17 November 2022
