Google has unveiled a new Text-to-Speech (TTS) model powered by Gemini 3.1 Flash. The model supports over 70 languages and lets developers control style, pace, and accent with specialized audio tags. According to The Decoder, it is Google's most natural and expressive voice solution to date, and it can also generate multi-voice dialogues.
On price-to-performance, the new model outperforms ElevenLabs v3. Its Elo score of 1211 places it second in the overall Artificial Analysis rankings, behind only Inworld 1.5 Max, which establishes Google as a major player in the TTS market.
The model is currently available in preview via the Gemini API, through Vertex AI for enterprise clients, and within Google Vids for Workspace users. While a free tier is available, it requires users to permit Google to use their data for product improvement. The paid tier ensures data privacy, priced at $1.00 per million input tokens and $20.00 per million output tokens. For batch processing, these costs are halved to $0.50 and $10.00, respectively. All AI-generated audio is watermarked using Google’s SynthID technology.
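To make the pricing concrete, here is a minimal Python sketch that estimates a paid-tier bill from the per-million-token rates quoted above. The function name `estimate_cost` and the token counts in the usage line are hypothetical, chosen only for illustration. The article does not say how text or audio length maps to tokens, so the counts would have to come from actual usage reports.

```python
from decimal import Decimal

# USD per one million tokens, as quoted in the article for the paid tier.
PRICES = {
    "standard": {"input": Decimal("1.00"), "output": Decimal("20.00")},
    "batch": {"input": Decimal("0.50"), "output": Decimal("10.00")},
}
MILLION = Decimal(1_000_000)


def estimate_cost(input_tokens: int, output_tokens: int, mode: str = "standard") -> Decimal:
    """Return the estimated cost in USD for the given token counts.

    `mode` is either "standard" or "batch" (hypothetical labels for the
    two price levels mentioned in the article).
    """
    if mode not in PRICES:
        raise ValueError(f"unknown pricing mode: {mode!r}")
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    rates = PRICES[mode]
    total = Decimal(input_tokens) * rates["input"] + Decimal(output_tokens) * rates["output"]
    return total / MILLION


if __name__ == "__main__":
    # Hypothetical workload: 200k input tokens, 1.5M output tokens.
    for mode in ("standard", "batch"):
        print(mode, estimate_cost(200_000, 1_500_000, mode))
```

Because output tokens cost twenty times as much as input tokens, the amount of audio generated dominates the bill in this sketch, while the length of the text prompt contributes comparatively little.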
For businesses, this launch represents a significant expansion of AI voice capabilities. Broad language support combined with granular speech control makes it possible to build global, personalized voice products, ranging from customer service bots to automated content localization. The competitive pricing and strong benchmark results offer an attractive alternative for companies looking to scale voice interfaces efficiently. Furthermore, the SynthID watermark addresses the need to distinguish AI-generated audio from human recordings, adding a layer of transparency as scrutiny of synthetic media grows.