Beyond the Language Barrier: How Real-Time Translation Is (and Isn’t) Changing Video Conferencing

The future of real-time translation in video conferencing lies in understanding not just what is said, but how it's said


Published: May 27, 2025

Christopher Carey

Picture this: You’re on a video call with your team based in Tokyo, São Paulo, and Berlin. You say, “Let’s circle back next week,” but your Brazilian colleague hears, “Let us rotate in a circle next week.”

Welcome to the new world of real-time translation in video conferencing, where AI is fluent in over 100 languages but still doesn’t know when you’re being sarcastic.

While platforms like Zoom, Teams, and Google Meet have already established themselves as the kings of video conferencing, the race is now on to break down language and human interaction barriers within these meetings.

The Promise… and the Reality

Multiple services already offer live subtitles and captioning in hundreds of languages, translated chat messages in real time, and voice translation powered by artificial intelligence.

But despite impressive advancements that would have been almost unimaginable 10 years ago, current-generation translation services come with notable caveats:

  • Accuracy varies widely: while speech-to-text has matured significantly, especially in high-resource languages like English, Spanish, and Mandarin, quality can drop sharply in others.
  • Noisy or low-quality audio environments degrade recognition.
  • Heavy accents and overlapping speech remain difficult to transcribe accurately.
  • Industry-specific jargon, acronyms, and complex terminology are often misinterpreted.

Most real-time translation tools provide literal meanings but lack the emotional intelligence to pick up on sarcasm, humour, or cultural subtleties. A joke in German might be translated into flat, awkward English, while a frustrated tone in Portuguese might come across as neutral.

This can lead to misunderstandings or erode trust between international colleagues.

Although progress is being made with edge computing and low-latency networks like 5G, some platforms still introduce delays when converting speech into another language. This hampers natural dialogue and makes discussions feel mechanical.

The Players: Who’s Leading the Pack?

  • Zoom has integrated live translation for over a dozen languages using both AI and human interpretation via partners like Kudo and Interprefy.
  • Microsoft Teams offers real-time captions and translations, backed by Azure Cognitive Services.
  • Google Meet provides live captions in English, Spanish, French, and more, with expanded language support rolling out via AI updates.

Startups and tech giants alike are also experimenting with AI avatars, emotion-aware speech engines, and adaptive tone translation, the next frontier in making multilingual communication more human.

Last week, Google Beam finally rolled out with the stated intention of bringing humanity to the video conferencing experience.

Formerly known as Project Starline, its focus on perfecting the finer details of human interaction (together with a soon-to-be-released real-time translation feature) could be a game changer.

According to Google Beam’s General Manager Andrew Nartker, the combination of its AI video model and light field display creates “a profound sense of dimensionality and depth [which] allows you to make eye contact, read subtle cues, and build understanding and trust as if you were face-to-face.”

What’s Next: Toward Emotionally Intelligent Translation

While real-time translation in video conferencing has come a long way, it is not yet a substitute for fluent human understanding.

The future of real-time translation in video conferencing lies in understanding not just what is said but also how it is said.

This depth and “dimensionality” is also being explored by companies like OpenAI and DeepL, who are working on advanced large language models that can:

  • adjust tone for politeness and cultural norms
  • detect and reflect emotional nuance
  • translate idioms and humour contextually
  • personalise voice translation to sound more like the speaker’s real voice

The translation tools on the market are reliable enough for everyday business use: internal meetings, project check-ins, and informal calls. But human interpretation remains essential for critical conversations, sensitive topics, or legally binding discussions.

As AI grows smarter and more emotionally aware, it may soon be possible to attend a video call with someone on the other side of the world and feel like you’re truly speaking the same language, in every sense.

Until then, a little patience (and perhaps a bilingual colleague) still goes a long way.
