Gemini 2.5 Native Audio upgrade, plus text-to-speech model updates

What customers say

Google Cloud customers are already using Gemini’s built-in audio capabilities to drive real business results, from mortgage loan processing to customer calls.

“Users often forget they’re talking to AI within a minute of using Sidekick, and in some cases have thanked the bot after a long chat… New Live API AI capabilities offered through Gemini [2.5 Flash Native Audio] empower our merchants to win.” – David Wurtz, VP of Product, Shopify
“By integrating the Gemini 2.5 Flash Native Audio model… we have significantly improved Mia’s capabilities since its launch in May 2025. This powerful combination has enabled us to generate over 14,000 loans for our broker partners.” – Jason Bressler, Chief Technology Officer, United Wholesale Mortgage (UWM)
“Working with the Gemini 2.5 Flash Native Audio model through Vertex AI allows Newo.ai AI receptionists to achieve unparalleled conversational intelligence… They can identify the main speaker even in noisy environments, switch languages mid-conversation and sound remarkably natural and emotionally expressive.” – David Yang, Co-Founder, Newo.ai

Live voice translation

Gemini now natively supports new speech-to-speech translation features designed to handle both continuous listening and two-way conversation.

With continuous listening, Gemini automatically translates speech in multiple languages into a single target language. This allows you to plug in headphones and hear the world around you in your language.

For two-way conversation, Gemini’s direct speech translation handles translation between two languages in real time, automatically switching the output language based on who is speaking. For example, if you speak English and want to chat with a Hindi speaker, you’ll hear real-time English translations in your headphones, and your phone will broadcast the Hindi translation once you’ve finished talking.
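The switching behavior described above can be pictured with a small sketch. It only illustrates the routing idea and is not Gemini's implementation or API: the function name `pick_output_language` and the language codes are hypothetical, and it assumes the spoken language has already been detected.

```python
def pick_output_language(detected: str, pair: tuple[str, str]) -> str:
    """Return the language to translate into, given the language just heard.

    Hypothetical helper: `pair` holds the two languages of the conversation,
    and `detected` is the language of the current speaker.
    """
    first, second = pair
    if detected == first:
        return second  # first speaker talked, so output the other language
    if detected == second:
        return first   # second speaker talked, so output the first language
    raise ValueError(f"{detected!r} is not one of the conversation languages {pair}")


# An English speaker talking with a Hindi speaker:
conversation = ("en", "hi")
print(pick_output_language("en", conversation))  # language broadcast to the Hindi speaker
print(pick_output_language("hi", conversation))  # language played in the English speaker's headphones
```

In this sketch the caller never sets a direction by hand; the output language follows from whoever spoke last.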

Gemini’s direct speech translation has a number of key features that help in the real world:

Language coverage: Translates speech in over 70 languages and 2,000 language pairs by combining the Gemini model’s world knowledge and multilingual capabilities with its native audio capabilities.
Style transfer: Captures the nuance of human speech and preserves the speaker’s intonation, pacing and pitch so the translation sounds natural.
Multilingual input: Understands multiple languages simultaneously in a single session, helping you follow multilingual conversations without messing around with language settings.
Automatic language detection: Identifies the language being spoken and starts translating, so you don’t even need to know which language you’re hearing.
Noise robustness: Filters out ambient noise so you can talk comfortably even in loud outdoor environments.