Fredric, who leads the Audio Engineering Team in Meet, has seen AI transform what his team is capable of doing. His team began working on speech translation about two years ago. At the time, existing models could handle offline translation, but the challenge was doing it in real time, which would be necessary for live Google Meet calls. They knew it was possible, though, so they started working with the Google DeepMind team. "When we started, we thought, 'Maybe this will take five years,'" explains Fredric. "Two years later, here we are. As things go with AI, things just went faster and faster. Now there's a whole Google community, with engineers from Pixel, Cloud, Chrome, and more working with Google DeepMind to achieve real-time speech translation."
Breakthrough in translation technology
Previous audio translation technologies depended on a multi-step process: transcribe the speech, translate the text, and then convert it back to speech. This chain introduced considerable latency, often 10-20 seconds, which made natural conversation impossible. And the translated voices were generic, failing to capture the unique characteristics of the speaker.
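The cascaded approach described above can be sketched as three sequential stages whose delays add up. This is an illustrative toy sketch only; the function names and stub implementations are hypothetical placeholders, not Google's actual APIs:

```python
# Illustrative sketch of the cascaded (multi-step) pipeline.
# Each stage must finish before the next can begin, so the
# per-stage latencies accumulate into the 10-20 s overall delay.

def transcribe(audio: bytes) -> str:
    # Placeholder speech-recognition stage (speech -> text).
    return "hola, como estas?"

def translate(text: str, target: str) -> str:
    # Placeholder machine-translation stage (text -> text).
    return "hello, how are you?"

def synthesize(text: str) -> bytes:
    # Placeholder text-to-speech stage (text -> speech),
    # producing a generic voice unrelated to the speaker's.
    return b"<translated audio>"

def cascaded_translate(audio: bytes, target: str = "en") -> bytes:
    text = transcribe(audio)              # wait for stage 1
    translated = translate(text, target)  # then stage 2
    return synthesize(translated)         # then stage 3
```

Because nothing downstream can start until the full transcript exists, even fast individual stages cannot make the end-to-end chain conversational.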
The real breakthrough, explains Huib (who leads product for audio quality), came thanks to "large models": not necessarily large language models (LLMs), but models capable of "one-shot" translation. "You send audio in, and almost instantly the model begins to emit audio," he notes. This drastically reduced latency, nearly mimicking how a human interpreter processes and delivers speech. "We found that two to three seconds was a kind of sweet spot," says Huib. Faster was difficult to understand; slower didn't allow a natural conversation. Once they hit this timing, using this model to translate in Google Meet made simultaneous conversation across different languages feasible.
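The one-shot, streaming behavior can be contrasted with the cascade in a minimal sketch. The interface below is a hypothetical stand-in for a speech-to-speech model, not Google's implementation; the point is that translated audio is emitted chunk by chunk while the speaker is still talking:

```python
from typing import Iterable, Iterator

def one_shot_translate(audio_chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Sketch of a single speech-to-speech model (hypothetical interface).

    Unlike the cascaded pipeline, output begins before the input is
    complete; a real model would trail the speaker by roughly the
    2-3 second "sweet spot" the team describes.
    """
    for chunk in audio_chunks:
        # Emit translated audio for each incoming chunk as it arrives,
        # rather than waiting for a full transcript of the utterance.
        yield b"<translated:" + chunk + b">"

# Consuming the generator chunk-by-chunk models a live Meet call:
# playback of the translation overlaps with the ongoing speech.
```

The design choice here is the generator: latency is bounded by how far the model trails the speaker, not by the total length of what was said.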
Problem-solving and big improvements
Developing this complex feature was not without its obstacles. One of the most critical aspects was ensuring high-quality translation, which can vary greatly depending on factors such as a speaker's accent, background noise, or network conditions. Despite the challenges, the Meet and DeepMind teams worked together through these hiccups, testing models and adjusting them based on real-world feedback.
Part of this testing involved working with linguists and other language experts to really understand the nuances not only of translation but also of accents. Languages more closely related to each other, such as Spanish, Italian, Portuguese, and French, were easier to integrate, while structurally different languages such as German presented greater challenges due to variations in everything from grammar to common idioms. Currently, the model also translates most expressions literally, which can lead to funny misunderstandings, Huib and Fredric note. However, they expect future updates that use advanced LLMs to understand and translate such nuances more precisely, even capturing tone and irony.
