Best Text-to-Speech API in 2026: Ranked and Compared
ElevenLabs leads on voice quality, Cartesia on streaming latency, Google on cost. 8 TTS API providers scored across TTS Arena V2, P50 latency, and pricing.
By Ethan
Pick ElevenLabs if voice quality is the product. Pick Cartesia for real-time agents where every millisecond of latency is felt. Pick Google Cloud TTS if you’re building at scale and need Vietnamese (or any Southeast Asian language) without guessing. For most developers starting out, Deepgram’s $200 no-expiry free credit is the lowest-friction way to evaluate a text-to-speech API before committing to a tier.
Who this is for
Backend developers integrating speech synthesis into a product — conversational AI agents, accessibility features, podcast automation, or interactive voice response. If you’re comparing on-device models (Piper, Coqui) or building a voice clone pipeline from scratch, this isn’t the right article.
How we evaluated
Quality: TTS Arena V2 leaderboard (HuggingFace, sourced May 2026). Elo scores come from human preference votes in blind pairwise comparisons, the same methodology as Chatbot Arena. The top 10 as of May 2026: CastleFlow v1.0, Vocu V3.0, Inworld TTS MAX, Inworld TTS, Hume Octave, Papla P1, MiniMax Speech-02-Turbo, Eleven Turbo v2.5 (~1539 Elo), MiniMax Speech-02-HD, Eleven Flash v2.5 (~1531 Elo). Of the providers covered here, only ElevenLabs appears in the top 10, with two entries: Eleven Turbo v2.5 (#8) and Eleven Flash v2.5 (#10). A third model, Eleven Multilingual v2, sits just outside at #11 (~1528 Elo).
Streaming latency: Picovoice tts-latency-benchmark (Gradium 2026 run) for directional ordering. The benchmark measures P50 time-to-first-audio (TTFA) from a cold start; results for some engines (Deepgram Aura-2, Cartesia Sonic-3) are in plots only and couldn’t be confirmed as text-verifiable figures — we’ve replaced specific ms numbers for those engines with qualitative characterizations. Treat latency claims here as directional, not SLAs.
Pricing: Official pricing pages, cross-referenced May 2026. Effective price per million characters, no volume discounts unless noted. Azure Neural prices are provisional; confirm before committing to a contract.
Free tiers: Verified against each provider’s current signup flow.
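To put the Elo gaps quoted under Quality in perspective, a rating difference can be converted into an expected head-to-head preference rate. The sketch below assumes the standard logistic Elo formula with a 400-point scale; TTS Arena V2's exact rating computation may differ, and the function name is ours, not the leaderboard's.

```python
def expected_preference(elo_a: float, elo_b: float) -> float:
    """Expected share of pairwise votes won by A over B under standard Elo.

    Assumption: the classic logistic Elo model with a 400-point scale.
    """
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400.0))


# Approximate ratings quoted in this article (May 2026 snapshot).
turbo_v25 = 1539
flash_v25 = 1531
multilingual_v2 = 1528

print(f"Turbo v2.5 vs Flash v2.5:      {expected_preference(turbo_v25, flash_v25):.3f}")
print(f"Turbo v2.5 vs Multilingual v2: {expected_preference(turbo_v25, multilingual_v2):.3f}")
```

Under that assumption, an 8-point gap works out to roughly 51% of votes, which is close to a coin flip. Rank differences between the three ElevenLabs models are too small to drive a decision on their own.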
Comparison table
| Provider | Quality (TTS Arena V2 Elo) | P50 latency (TTFA) | $/1M chars | Voice cloning | Vietnamese | Free tier |
|---|---|---|---|---|---|---|
| Google Cloud TTS | not in top 10 | not measured | WaveNet/Neural2: $16 | No | ✅ WaveNet+Neural2 | 1M WaveNet chars/mo |
| ElevenLabs (Turbo v2.5 / Flash v2.5) | #8 ~1539 Elo (Turbo v2.5) | sub-300ms (Turbo v2.5) | ~$165 (Pro tier) | ✅ (consent required) | ✅ (Multilingual v2) | 10K chars/mo |
| AWS Polly | not in top 10 | not measured | Standard $4 / Neural $16 / Generative $30 | No | ❌ | 5M std + 1M neural, first 12 months |
| OpenAI tts-1 / tts-1-hd | not in top 10 | high variance | $15 / $30 | No | ❌ | None |
| Deepgram Aura-1/2 | not in top 10 | moderate (Aura-2) | $15 / $30 | No | ❌ | $200 credit, no expiry |
| Cartesia Sonic-3 | not in top 10 | lowest of providers here | ~$30 (Scale tier) | ✅ ($4/mo) | ❌ | Credit-based trial |
| Azure Neural TTS | not in top 10 | not measured | ~$15–16 / HD ~$22† | No | ✅ | None |
| PlayHT | not in top 10 | not measured | varies | ✅ (30 sec audio) | ❌ | 2,500 words trial |
† Azure Neural pricing is provisional as of May 2026 — verify before signing.
Per-tool breakdown
ElevenLabs
The quality leader in this comparison on TTS Arena V2, with three models in the leaderboard’s top 11: Eleven Turbo v2.5 (#8, Elo ~1539), Eleven Flash v2.5 (#10, ~1531), Eleven Multilingual v2 (#11, ~1528). No other provider in this review appears in the top 10. Turbo v2.5 is audibly better than everything else on expressive speech — it handles emphasis, pacing, and emotional register in a way that $30/1M models don’t.
The catch is price. At the Pro tier ($99/month for 600K characters), you’re paying roughly $165 per million characters. At scale — say, 50M characters a month — that’s $8,250/month before negotiation. If you’re building a high-volume pipeline where voice quality is a differentiator worth paying for, the math can work. If you’re price-sensitive or doing background narration, it probably doesn’t.
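The arithmetic behind those figures is easy to reproduce for other plans. This is a minimal sketch using only the Pro-tier numbers quoted above; it ignores overage pricing and negotiated discounts, and the function names are illustrative.

```python
def price_per_million(plan_price_usd: float, included_chars: int) -> float:
    """Effective USD per 1M characters for a flat-rate plan."""
    return plan_price_usd / included_chars * 1_000_000


def monthly_cost(chars_per_month: int, usd_per_million: float) -> float:
    """Projected monthly spend at a given per-million rate."""
    return chars_per_month / 1_000_000 * usd_per_million


pro_rate = round(price_per_million(99, 600_000))  # Pro tier: $99 for 600K characters
print(pro_rate)                                   # 165
print(round(monthly_cost(50_000_000, pro_rate)))  # 8250
```

Swapping in another plan's price and character allowance gives a like-for-like rate to compare against the per-million prices in the table.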
Streaming latency via Turbo v2.5 and Flash v2.5 is sub-300ms in practice. Neither matches Cartesia for hard real-time applications, but both are workable for most conversational agents.
Voice cloning requires explicit user consent, and EU AI Act labeling is mandatory from August 2026. If you’re cloning voices from users or public figures, biometric data rules under GDPR apply.
Get started: ElevenLabs — 10K free characters/month, no credit card required.
Google Cloud TTS
Google Cloud TTS does not appear in TTS Arena V2’s top 10 — the leaderboard has no entry for any of Google’s production TTS models as of May 2026. Many teams have Google Cloud already in their stack and haven’t run a real quality evaluation, which is exactly the risk: the free tier and familiar SDK make it easy to ship without benchmarking.
The free tier is the most permissive of any competitive-quality provider: 1 million WaveNet or Neural2 characters per month, ongoing, no expiry. WaveNet ($16/1M) is the workhorse tier. Studio ($160/1M) is for broadcast work. Standard voices are $4/1M if you need a cost floor and don’t care about quality.
Google Cloud TTS has confirmed Vietnamese support at the WaveNet and Neural2 quality tiers; Azure Neural TTS is the only other provider on this list with confirmed support. MINT-Bench (April 2026) does not include Vietnamese, and no standardized multilingual TTS benchmark covers it. If Vietnamese is a requirement, evaluate with native speakers against a test set you control.
No voice cloning. No streaming latency data in the Picovoice benchmark.
AWS Polly
Cheap at the Standard tier ($4/1M), and AWS is already in most infrastructure stacks. One hard limit: the SynthesizeSpeech API accepts at most 3,000 billed characters per request (SSML tags do not count toward the billed limit) and caps output at 10 minutes of audio. For long-form narration or book chapters, you must use the async StartSpeechSynthesisTask API. If you’re building a podcast pipeline and assumed you could stream 20-minute episodes through the synchronous endpoint, you can’t.
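For moderately long text where you still want the synchronous endpoint, a common workaround is to split the input into chunks under the limit and synthesize them one by one. The sketch below shows only the splitting step. It assumes plain text without SSML, so that `len()` approximates the billed character count, and `chunk_text` is a hypothetical helper rather than part of any AWS SDK.

```python
import re

POLLY_SYNC_LIMIT = 3000  # billed characters per SynthesizeSpeech request, per the limit above


def chunk_text(text: str, limit: int = POLLY_SYNC_LIMIT) -> list[str]:
    """Split plain text into chunks of at most `limit` characters,
    breaking at sentence ends where possible."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Hard-split any single sentence that is itself over the limit.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


article = "This is a sentence. " * 500  # about 10,000 characters of sample text
parts = chunk_text(article)
print(len(parts), max(len(p) for p in parts))
```

Each chunk would then go to its own request (for example through boto3's `synthesize_speech`), with the audio concatenated afterwards. Prosody can break at chunk boundaries, which is one reason the async API remains the better fit for long-form audio.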
Neural voices ($16/1M) are a generation behind Google and ElevenLabs. Generative voices ($30/1M) close some of the gap but aren’t in TTS Arena V2’s top 10. Free tier runs for 12 months: 5M Standard characters and 1M Neural characters per month. After 12 months you pay the list rates above.
No voice cloning. No Vietnamese.
OpenAI
tts-1 ($15/1M) is fine for utility speech — IVR, accessibility, low-stakes narration. tts-1-hd ($30/1M) is better but doesn’t reach ElevenLabs territory on TTS Arena V2. The bigger issue is latency: the Picovoice benchmark shows high variance on tts-1-hd, which makes it a poor choice for real-time agents where you need consistent TTFA.
The main advantage is API ergonomics. If you’re already using OpenAI for language model calls, adding TTS is two lines of code against a familiar SDK.
No free tier. No voice cloning. No Vietnamese.
Deepgram
Deepgram’s strength is developer experience on the free tier: $200 in credit, no expiry, no credit card required at signup. Aura-1 at $15/1M and Aura-2 at $30/1M are priced competitively, but neither appears in TTS Arena V2’s top 10.
Aura-2 is the slowest of the streaming-focused providers covered here, but still workable for asynchronous pipelines. Specific P50 figures from the Picovoice benchmark couldn’t be confirmed for Deepgram Aura-2 (results are in plots, not text); treat the directional claim as a guide. The no-expiry credit makes it the lowest-risk option for teams evaluating TTS before picking a provider.
Get started: Deepgram — $200 credit, no credit card, no expiry.
Cartesia
The latency leader. Cartesia positions Sonic-3 as the lowest-latency production TTS option, and that claim is consistent across its self-reported benchmarks. Specific P50 figures from the Picovoice benchmark couldn’t be confirmed for Sonic-3 (results are in plots, not text), so we’ve dropped the exact ms numbers. For a real-time voice agent where latency is audible and frustrating, Cartesia’s directional advantage is credible — verify with your own measurement in your deployment region.
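Measuring this yourself takes little code. The sketch below times the gap between issuing a request and receiving the first non-empty audio chunk, then takes the median over several runs. It makes no assumptions about any provider's SDK: `start_stream` is a hypothetical callable you would wrap around your provider's streaming call, and `fake_stream` is a stand-in so the example runs offline.

```python
import statistics
import time
from typing import Callable, Iterable


def time_to_first_audio(start_stream: Callable[[], Iterable[bytes]]) -> float:
    """Seconds from issuing the request until the first non-empty audio chunk."""
    started = time.perf_counter()
    for chunk in start_stream():
        if chunk:
            return time.perf_counter() - started
    raise RuntimeError("stream ended without audio")


def p50_ttfa(start_stream: Callable[[], Iterable[bytes]], runs: int = 20) -> float:
    """Median TTFA over `runs` sequential requests."""
    return statistics.median(time_to_first_audio(start_stream) for _ in range(runs))


def fake_stream() -> Iterable[bytes]:
    """Stand-in for a real provider call (assumption: chunks arrive as bytes)."""
    time.sleep(0.05)      # simulated network and synthesis delay
    yield b"\x00" * 320   # first audio chunk
    yield b"\x00" * 320


print(f"P50 TTFA: {p50_ttfa(fake_stream, runs=5) * 1000:.0f} ms")
```

Run it from the region where your agent will be deployed, and decide up front whether connection setup counts toward the measurement, since cold-start and warm-connection numbers can differ substantially.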
Pricing is credit-based: roughly $30/1M at the Scale tier, but Cartesia doesn’t publish its credit-to-character conversion ratio directly. Verify before signing a volume commitment.
Voice cloning is available from $4/month with instant turnaround. Sonic-3 doesn’t appear in TTS Arena V2’s top 10, so the trade-off is clear: you get the fastest streaming latency in the field, not the most expressive voice.
PlayHT
Voice cloning from 30 seconds of audio, available at the Creator tier. Quality is below ElevenLabs in head-to-head comparisons. Worth evaluating if you’re building a voice-cloning product and want to compare clone fidelity, but not the first choice for utility TTS pipelines.
Azure Neural TTS
Confirmed Vietnamese support, which puts it alongside Google for Southeast Asian language requirements. Pricing is provisional as of May 2026 (~$15–16/1M for standard Neural, ~$22/1M for HD Neural) — verify against the Azure pricing calculator before signing. No voice cloning. Doesn’t appear in TTS Arena V2’s top 10.
Best TTS API for your use case
Real-time conversational agents: Cartesia Sonic-3. Cartesia self-reports the lowest streaming latency of the providers here, and that ordering is consistent across its published benchmarks. Users hear latency directly, so verify in your deployment region, but Cartesia is the right starting point. If you’re also selecting an LLM router for the same agent pipeline, see Best LLM router in 2026.
Voice quality as a product feature: ElevenLabs (Turbo v2.5 or Flash v2.5). Three entries in TTS Arena V2’s top 11; no other provider in this comparison makes the list.
High-volume narration or cost-sensitive workloads: Google Cloud WaveNet/Neural2 at $16/1M with 1M free characters/month, or AWS Polly Standard at $4/1M for utility-grade speech. Mind the 3,000 billed-char Polly limit if you’re doing long-form (SSML tags excluded from the billed count).
Vietnamese or Southeast Asian language support: Google Cloud TTS (WaveNet + Neural2 confirmed). Azure Neural TTS also supports Vietnamese, but its pricing is provisional. No benchmark covers Vietnamese, so evaluate with native speakers.
Evaluation budget: Deepgram’s $200 no-expiry credit. No other provider offers a risk-free starting point at this size. For context on how TTS API costs fit into a full agent stack, see The real cost of running an AI agent team in 2026.
Indie hackers, side projects: Deepgram for evaluation, then Google Cloud TTS for production — the 1M WaveNet chars/month free tier covers most side-project volumes indefinitely.
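For the cost-sensitive cases above, the free allowance matters as much as the list price. This is a rough sketch of a monthly bill after a free allowance, using the per-million prices from the comparison table. The 3M-character volume is a hypothetical example, and the function ignores tiered discounts and time-limited free tiers.

```python
def monthly_bill(chars: int, usd_per_million: float, free_chars: int = 0) -> float:
    """USD owed for one month after subtracting a free character allowance."""
    billable = max(0, chars - free_chars)
    return billable / 1_000_000 * usd_per_million


volume = 3_000_000  # hypothetical monthly volume

print(monthly_bill(volume, 16, free_chars=1_000_000))   # Google WaveNet: 32.0
print(monthly_bill(volume, 4))                          # Polly Standard after its 12-month free tier: 12.0
print(monthly_bill(800_000, 16, free_chars=1_000_000))  # side project under the allowance: 0.0
```

At low volumes the ongoing Google allowance wins outright; once volume is several times the allowance, the per-million rate dominates and Polly Standard is cheaper for utility-grade speech.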
Vietnamese TTS — a note on what the data can’t tell you
Google Cloud TTS (WaveNet + Neural2) and Azure Neural TTS both officially support Vietnamese. ElevenLabs Multilingual v2 also claims support, but there’s no benchmark data to evaluate it.
MINT-Bench (April 2026) is the most recent multilingual TTS benchmark in the literature — it does not include Vietnamese. No standardized evaluation methodology for Vietnamese TTS quality exists as of May 2026. The consequence: you can’t outsource this decision to a leaderboard. Pick the provider, run a test set against your actual content, and evaluate with native speakers whose feedback you trust.
Voice cloning compliance
Three providers offer voice cloning. The legal picture differs significantly.
ElevenLabs: Explicit consent from the voice owner is required under their terms of service. From August 2026, EU AI Act labeling is mandatory for AI-generated audio: any synthesized voice presented to end users must be disclosed as AI-generated. If you collect voice samples from users, that audio is biometric data under GDPR. Retention, deletion, and cross-border transfer rules apply.
Cartesia: Instant cloning from $4/month. Terms are less prescriptive than ElevenLabs on consent workflows, but the same EU AI Act and GDPR obligations apply if you’re operating in the EU.
PlayHT: 30 seconds of audio, Creator tier. The same EU obligations apply as for Cartesia.
Google Cloud TTS, AWS Polly, OpenAI, Deepgram, Azure Neural TTS: No voice cloning.
If you’re building a product that clones end-user voices, the EU AI Act labeling requirement is not optional for EU users. Build the disclosure into the UX before you ship.
Caveats
Specific P50 TTFA figures for Deepgram Aura-2 and Cartesia Sonic-3 could not be confirmed from the Picovoice tts-latency-benchmark text (results are in plots, not text). We’ve removed those specific ms numbers and replaced them with qualitative characterizations. Cartesia’s latency advantage is directional — consistent across its self-reported benchmarks but not independently measured. Verify in your own deployment region before making latency-sensitive architectural decisions.
Azure Neural TTS pricing is provisional as of May 2026. Verify against the Azure pricing calculator before committing.
Cartesia’s credit-to-character conversion ratio is not published. The ~$30/1M figure is derived from Scale-tier credit pricing — confirm directly with Cartesia sales for volume commitments.
ElevenLabs and Deepgram are affiliate partners of toolchew. Affiliate status does not influence rankings or verdicts.
References
- TTS Arena V2 leaderboard: human preference Elo, pairwise blind comparisons
- Picovoice tts-latency-benchmark: open-source streaming latency benchmark
- AWS Polly limits (official): 3,000 billed-char SynthesizeSpeech limit (SSML tags excluded from billed count), 10-min audio cap
- MINT-Bench (April 2026): multilingual TTS benchmark; Vietnamese not included
- ElevenLabs service terms: consent and EU AI Act obligations