Best speech-to-text API for podcasts in 2026: compared
Deepgram Nova-3 for speed and the largest free tier. AssemblyAI Universal-2 if transcript intelligence is the product. WER benchmarks and diarization costs.
By Ethan · Updated May 31, 2026
For most podcast developers choosing a speech-to-text API, Deepgram Nova-3 is the call: fastest, biggest free tier, and straightforward enough to ship in an afternoon. But if transcript intelligence (show notes, topic detection, speaker sentiment) is where your product lives, AssemblyAI’s Universal-2 is the better pick, with the lowest WER in this comparison and, at the rates listed below, a lower per-minute price than Deepgram.
Who this is for
Developers building podcast apps, transcription pipelines, or any product that ingests audio and needs text out. You’re choosing an API, not a no-code service. If you want a drag-and-drop editor without touching code, this isn’t the guide.
How we tested each speech-to-text API
WER (word error rate) figures come from the CodeSOTA 2026 benchmark, run against a 50h mix of call centre recordings, podcasts, and meeting audio at varying SNR levels — field-recorded, multi-speaker, variable mic quality. That’s harder than a clean studio recording, which is what matters when your users are uploading normal episodes.
Pricing and free-tier terms were verified on 2026-05-31 by creating accounts and confirming rates in each provider’s billing documentation, not scraped from marketing pages. These numbers are point-in-time; pricing on API platforms changes without notice, so verify before you commit.
We evaluated four providers directly (Deepgram, AssemblyAI, Rev AI, OpenAI) and gathered Gladia data from documentation and public benchmarks. Gladia, Rev AI, and OpenAI gpt-4o-transcribe weren’t included in the CodeSOTA 2026 run, so no WER comparison is available for them; that gap is noted where it matters.
The contenders
Deepgram Nova-3 — best for most builds
Nova-3 Mono posts an 8.2% WER on the CodeSOTA 2026 corpus. Competitive, not the top mark, but Deepgram’s real advantage is latency: roughly 450ms end-to-end on async calls. The API is also the least ceremonious of the group to integrate.
Pricing: $0.0077/min for pre-recorded transcription. Add diarization (speaker labels) at $0.0020/min and that becomes $0.0097/min. A 60-minute podcast costs $0.46 without speaker labels, $0.58 with them. Diarization is not included in the base price — show both numbers when quoting costs.
Free tier: $200 credit. At the base rate that covers approximately 26,000 minutes (about 20,600 with diarization) before you pay anything.
A working transcription call:
from deepgram import DeepgramClient

dg = DeepgramClient("YOUR_API_KEY")

# Read the whole episode into memory and send it as one pre-recorded request.
with open("episode.mp3", "rb") as f:
    response = dg.listen.rest.v("1").transcribe_file(
        {"buffer": f.read(), "mimetype": "audio/mp3"},
        {"model": "nova-3", "smart_format": True},
    )

# First channel, top-ranked alternative.
print(response["results"]["channels"][0]["alternatives"][0]["transcript"])
smart_format handles punctuation and paragraph breaks without a second API call. For podcasts with multiple speakers, add "diarize": True to the options dict and expect speaker labels in the words array of the response.
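With diarization on, the words array is flat, so most podcast UIs will want it collapsed into speaker turns. A minimal sketch of that step, assuming each word entry carries `speaker`, `start`, `end`, and `punctuated_word` (or `word`) keys; check those field names against the response you actually get back. `speaker_turns` is a hypothetical helper, not part of the Deepgram SDK.

```python
from typing import Dict, List


def speaker_turns(words: List[Dict]) -> List[Dict]:
    """Collapse a flat word list into consecutive same-speaker turns."""
    turns: List[Dict] = []
    for w in words:
        # Assumed field names; prefer the punctuated form when present.
        text = w.get("punctuated_word") or w["word"]
        if turns and turns[-1]["speaker"] == w["speaker"]:
            # Same speaker as the previous word: extend the current turn.
            turns[-1]["text"] += " " + text
            turns[-1]["end"] = w["end"]
        else:
            # Speaker changed (or first word): start a new turn.
            turns.append({
                "speaker": w["speaker"],
                "start": w["start"],
                "end": w["end"],
                "text": text,
            })
    return turns


# Hand-written sample data in the assumed shape, not real API output.
sample = [
    {"word": "welcome", "punctuated_word": "Welcome", "start": 0.0, "end": 0.4, "speaker": 0},
    {"word": "back", "punctuated_word": "back.", "start": 0.4, "end": 0.8, "speaker": 0},
    {"word": "thanks", "punctuated_word": "Thanks!", "start": 1.0, "end": 1.5, "speaker": 1},
]

for turn in speaker_turns(sample):
    print(f"[{turn['start']:.1f}s] Speaker {turn['speaker']}: {turn['text']}")
```

Speaker values are whatever the API assigns; mapping them to host and guest names is a separate step your product has to own.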
Deepgram also has the largest free tier in this group. $200 is enough to transcribe a full podcast back-catalog before paying a cent — which makes it the lowest-risk starting point for a new integration.
Best fit: indie builders, podcast app developers, teams who want a fast integration at a predictable per-minute rate.
AssemblyAI Universal-2 — best accuracy
Universal-2 posts a 7.9% WER on the CodeSOTA 2026 corpus, the top mark in this comparison. The gap over Deepgram (0.3 percentage points) won’t matter for most use cases, but it’s real and reproducible.
Where AssemblyAI separates itself is what it puts on top of the transcript. The same API call that returns your text can also return:
- Auto chapters — timestamps and topic titles, ready to paste into show notes
- Topic detection — IAB taxonomy categories for content classification
- Sentiment analysis — per-sentence speaker sentiment scores
- PII redaction — strips names, phone numbers, and card details from the transcript before it leaves the API
These aren’t bolt-ons — they’re included in the same async transcription request. If your product is “show notes generator,” “episode summarizer,” or “content moderation for podcasts,” Universal-2 is doing work that would otherwise require a second LLM call.
Pricing: async transcription is $0.0025/min. Add diarization at $0.00033/min. A 60-minute episode costs $0.15 for the transcript, $0.17 with speaker labels.
Universal-2 handles async uploads only. Live recording pipelines use Universal-3 Pro — a different model, not covered here.
Free tier: $50 credit.
A basic async call with chapters enabled:
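import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"

config = aai.TranscriptionConfig(
    auto_chapters=True,
    speaker_labels=True,
    sentiment_analysis=True,
)

# transcribe() uploads the file and blocks until the async job finishes.
transcript = aai.Transcriber().transcribe("episode.mp3", config)
if transcript.status == aai.TranscriptStatus.error:
    raise RuntimeError(transcript.error)

# chapter.start is in milliseconds.
for chapter in transcript.chapters:
    print(f"{chapter.start}ms — {chapter.headline}")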
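Raw millisecond offsets aren’t what listeners expect in show notes. A small sketch of the formatting step, assuming you have already pulled `(start_ms, headline)` pairs out of the chapters list; `format_timestamp` and `show_notes` are hypothetical helpers, not part of the AssemblyAI SDK.

```python
from typing import Iterable, Tuple


def format_timestamp(ms: int) -> str:
    """Convert a millisecond offset to HH:MM:SS (sub-second part dropped)."""
    total_seconds = ms // 1000
    hours, rem = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def show_notes(chapters: Iterable[Tuple[int, str]]) -> str:
    """Render one 'HH:MM:SS headline' line per chapter."""
    return "\n".join(
        f"{format_timestamp(start_ms)} {headline}" for start_ms, headline in chapters
    )


# Hand-written sample chapters, not real API output.
print(show_notes([(0, "Intro"), (754_000, "Guest interview")]))
```

With the SDK objects from the call above, the input would be something like `[(c.start, c.headline) for c in transcript.chapters]`.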
Best fit: products where the transcript is raw material — show notes generators, content moderation pipelines, podcast discovery engines.
Rev AI Reverb — diarization bundled, no add-on charge
Rev AI prices diarization differently: it’s included in the base $0.0033/min for up to 8 speakers — no add-on charge, no separate billing line. A 60-minute podcast with full speaker labels costs $0.20.
The differentiator no other provider here offers is a human fallback. At $1.99/min you can route audio to Rev’s professional transcription team when accuracy is non-negotiable. If you’re building a hybrid pipeline (API for volume, humans for edge cases), Rev AI is the only vendor that covers both ends from one contract.
Pricing: $0.0033/min, diarization included for up to 8 speakers.
Free tier: 5 hours.
Caveats specific to Rev AI: Reverb is English-first. No WER data from the CodeSOTA corpus (Rev AI wasn’t included in the 2026 run). The Rev AI row in the table has no WER figure, so don’t rank it against Deepgram or AssemblyAI on accuracy.
Best fit: English-language podcasts where bundled diarization (no add-on charge) simplifies billing, or where a human fallback for edge-case accuracy is a production requirement.
OpenAI gpt-4o-transcribe — for OpenAI-stack teams only
gpt-4o-transcribe wasn’t included in the CodeSOTA 2026 benchmark run, so no WER comparison is available for this provider. At $0.006/min it sits mid-pack on price in this comparison (above AssemblyAI and Rev AI, below Deepgram and Gladia’s Starter plan), and its accuracy on noisy real-world audio relative to the other providers is unverified.
The case for it is narrow: if your product is already deep in the OpenAI ecosystem — Assistants API, function calling, structured outputs — keeping transcription in the same provider reduces auth surface, vendor count, and invoice complexity. That’s a legitimate tradeoff for some teams.
The hard limitation you need to know about: 25 MB file cap. A 60-minute podcast encoded at 128 kbps is roughly 55 MB. Every episode needs pre-chunking before it hits the API. Chunking introduces seam artifacts at split points and requires logic to find clean silence-based boundaries. If your episodes are consistently under 25 MB (roughly 27 minutes at 128 kbps, or longer at lower bitrates), the cap doesn’t bite. Otherwise, budget the engineering time.
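The size arithmetic and the boundary-picking logic can be sketched in a few lines. This assumes the cap is 25 MiB, a constant-bitrate encode, and that you already have candidate silence timestamps from some detector (ffmpeg’s `silencedetect` filter is one option). All function names here are hypothetical, and the actual audio cutting is left out.

```python
from typing import List

CAP_BYTES = 25 * 1024 * 1024  # assumption: the 25 MB cap is read as 25 MiB


def encoded_size_bytes(duration_s: float, bitrate_kbps: int) -> float:
    """Approximate size of a constant-bitrate file (ignores container overhead)."""
    return duration_s * bitrate_kbps * 1000 / 8


def max_chunk_seconds(bitrate_kbps: int, cap_bytes: int = CAP_BYTES,
                      headroom: float = 0.95) -> float:
    """Longest chunk that fits under the cap, keeping a safety margin."""
    return cap_bytes * headroom * 8 / (bitrate_kbps * 1000)


def plan_splits(duration_s: float, silences_s: List[float],
                max_chunk_s: float) -> List[float]:
    """Greedy split planner.

    For each chunk, cut at the latest silence that keeps the chunk within
    max_chunk_s; fall back to a hard cut at the limit when no silence exists.
    """
    splits: List[float] = []
    candidates = sorted(silences_s)
    chunk_start = 0.0
    while duration_s - chunk_start > max_chunk_s:
        limit = chunk_start + max_chunk_s
        usable = [s for s in candidates if chunk_start < s <= limit]
        cut = usable[-1] if usable else limit
        splits.append(cut)
        chunk_start = cut
    return splits


# A 60-minute episode at 128 kbps, with made-up silence timestamps (seconds).
print(encoded_size_bytes(3600, 128) / (1024 * 1024))  # size in MiB
print(plan_splits(3600, [600, 1500, 1700, 3000, 3300], max_chunk_s=1600))
```

The greedy choice keeps chunk count low; if seam quality matters more than request count, preferring the longest silence inside each window is a reasonable alternative.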
A diarization model exists separately. It’s less mature than Deepgram’s or AssemblyAI’s implementation as of May 2026 (OpenAI wasn’t in the CodeSOTA run, so this isn’t a benchmark result). If you need speaker labels alongside a gpt-4o transcript, budget testing time: the two model calls need to be reconciled against the same word boundaries, which adds integration complexity.
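One common way to do that reconciliation is to give each word the speaker whose segment overlaps it most in time. The sketch below assumes you have word-level timings from the transcript side and speaker segments from the diarization side; both input shapes are hypothetical and will need adapting to whatever the two calls actually return.

```python
from typing import Dict, List, Optional


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_words(words: List[Dict], segments: List[Dict]) -> List[Dict]:
    """Attach to each word the speaker of the segment it overlaps most.

    Words that overlap no segment get speaker None so they can be reviewed.
    """
    labeled: List[Dict] = []
    for w in words:
        best: Optional[str] = None
        best_overlap = 0.0
        for seg in segments:
            o = overlap(w["start"], w["end"], seg["start"], seg["end"])
            if o > best_overlap:
                best, best_overlap = seg["speaker"], o
        labeled.append({**w, "speaker": best})
    return labeled


# Hand-written inputs in the assumed shapes (times in seconds).
words = [
    {"text": "hi", "start": 0.0, "end": 0.5},
    {"text": "there", "start": 0.9, "end": 1.3},  # straddles the speaker change
    {"text": "ok", "start": 5.0, "end": 5.2},     # outside every segment
]
segments = [
    {"speaker": "A", "start": 0.0, "end": 1.0},
    {"speaker": "B", "start": 1.0, "end": 2.0},
]

for w in label_words(words, segments):
    print(w["text"], w["speaker"])
```

This is quadratic in the worst case, which is fine for a single episode; for long back-catalog jobs a sorted two-pointer sweep would do the same work in linear time.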
Best fit: teams already on OpenAI infrastructure who want a single vendor and can tolerate unverified accuracy and the chunking overhead.
Gladia — EU/GDPR projects, 100+ languages
Gladia processes audio in EU data centers and supports 100+ languages via its Solaria-1 model. Diarization is included. If you’re building in Europe with GDPR data residency as a hard requirement, Gladia removes the compliance conversation you’d otherwise have with the US providers.
Pricing: Starter plan is $0.0102/min — uncompetitive against the others. Growth plan drops to roughly $0.0033/min with a volume commitment. The jump between plans is significant; the number on the pricing page is not the number you’ll pay at meaningful scale.
Free tier: 10 hours per month.
We don’t have WER data for Solaria-1 on the CodeSOTA corpus; Gladia wasn’t in the 2026 benchmark run. Treat Gladia’s accuracy as unknown relative to this comparison.
Best fit: EU-based products with GDPR data residency requirements, multilingual podcast platforms where supporting 30+ languages is a feature rather than a footnote.
Comparison table
| Provider | Model | Price/min (transcription) | Diarization | WER (CodeSOTA 2026) | Free tier |
|---|---|---|---|---|---|
| AssemblyAI | Universal-2 | $0.0025 | +$0.00033/min | 7.9% | $50 credit |
| Deepgram | Nova-3 Mono | $0.0077 | +$0.002/min | 8.2% | $200 credit |
| Rev AI | Reverb | $0.0033 | Included (≤8 speakers) | n/a¹ | 5 hrs |
| OpenAI | gpt-4o-transcribe | $0.006 | Separate model | n/a¹ | None |
| Gladia | Solaria-1 | $0.0102² | Included | n/a¹ | 10 hrs/mo |
¹ Not included in CodeSOTA 2026 benchmark run.
² Gladia Growth plan reduces to ~$0.0033/min with volume commitment.
WER source: CodeSOTA 2026 benchmark, 50h mix of call centre recordings, podcasts, and meeting audio at varying SNR levels.
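To turn the per-minute rates in the table into per-episode numbers, a short calculator helps. The rates below are copied from the table as of 2026-05-31; OpenAI is left out because its diarization model is priced separately and that price isn’t listed here. The function names are illustrative.

```python
# (base $/min, diarization add-on $/min) as listed in the table above.
RATES = {
    "AssemblyAI Universal-2": (0.0025, 0.00033),
    "Deepgram Nova-3": (0.0077, 0.0020),
    "Rev AI Reverb": (0.0033, 0.0),               # diarization bundled
    "Gladia Solaria-1 (Starter)": (0.0102, 0.0),  # diarization bundled
}


def per_minute_rate(provider: str, diarize: bool = True) -> float:
    """Effective $/min, adding the diarization surcharge when requested."""
    base, addon = RATES[provider]
    return base + (addon if diarize else 0.0)


def episode_cost(provider: str, minutes: float, diarize: bool = True) -> float:
    """Cost in USD of transcribing one episode of the given length."""
    return minutes * per_minute_rate(provider, diarize)


def minutes_covered(credit_usd: float, provider: str, diarize: bool = False) -> float:
    """How many minutes of audio a free credit pays for."""
    return credit_usd / per_minute_rate(provider, diarize)


for name in RATES:
    print(f"{name}: ${episode_cost(name, 60):.2f} per 60-minute episode with speaker labels")
```

Swap in your own episode length and monthly volume before comparing; the ranking can change once volume plans such as Gladia’s Growth tier apply.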
Verdict
Pick Deepgram Nova-3 if: you want to ship without friction, the transcript text is the end product, and $200 of free credit is a better ramp than AssemblyAI’s $50.
Pick AssemblyAI Universal-2 if: you need chapters, topics, or sentiment built into the same response. The accuracy edge over Deepgram is marginal on most audio; the NLP toolkit is the real argument.
Pick Rev AI Reverb if: the podcast is English, you want diarization bundled without a separate billing line, or you need the human fallback at $1.99/min for recordings where accuracy is non-negotiable.
Pick OpenAI gpt-4o-transcribe if: you’re already on OpenAI infrastructure, a single vendor simplifies your stack, and you can handle episode chunking. Don’t pick it for accuracy.
Pick Gladia if: GDPR data residency is non-negotiable, you need broad multilingual support, and you’re willing to negotiate a Growth plan before comparing costs.
Azure AI Speech (formerly part of Azure Cognitive Services) and Amazon Transcribe are omitted here; both require existing cloud infrastructure commitments that make them default choices for teams already on those platforms, not starting points for new podcast tooling.
If your pipeline processes transcripts with an LLM — for show notes generation, topic extraction, or content moderation — the best LLM router comparison covers managing model costs at scale. Building a full audio loop that includes synthesis? The best text-to-speech API comparison covers the output side.
Caveats
OpenAI 25 MB cap is a pipeline requirement: A 60-minute podcast at 128 kbps is 55 MB. Plan for a chunker if you go that route; the cap is not a soft guideline.
Deepgram diarization is an add-on: Base price is $0.0077/min for pre-recorded; with speaker labels it’s $0.0097/min. Quote the right number for your use case.
AssemblyAI async vs. streaming pricing: Async ($0.0025/min) is for post-production uploads. Streaming ($0.0075/min) is for live recording pipelines. Podcast transcription is almost always async. Verify which endpoint your integration calls.
Rev AI, Gladia, and OpenAI WER gaps: These three providers weren’t in the CodeSOTA 2026 benchmark run. Their accuracy numbers relative to Deepgram and AssemblyAI are unknown within this comparison. If accuracy parity matters, run your own test on a representative sample of your actual audio before committing.
Test with your own audio: The CodeSOTA corpus is a 50h mix of call centre recordings, podcasts, and meeting audio at varying SNR levels — not your podcast. WER shifts with mic quality, speaker accent, background noise, and domain vocabulary. Before committing at scale, run a 10-episode sample through at least two providers and measure accuracy against a manually verified segment. Benchmark numbers are a starting point.
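For that comparison you need a WER number of your own. A minimal sketch: word-level edit distance (substitutions, insertions, deletions) divided by the reference length. Normalization here is only lowercasing and whitespace splitting, which is an assumption; strip punctuation and normalize numerals the same way for every provider, or the scores won’t be comparable.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate of hypothesis against a manually verified reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # Levenshtein distance over words, keeping only the previous DP row.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion (reference word missing)
                cur[j - 1] + 1,          # insertion (extra hypothesis word)
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1] / len(ref)


# One dropped word out of four reference words.
print(wer("the quick brown fox", "the quick fox"))
```

Run it per provider on the same verified segment and compare the results side by side rather than against the published benchmark figures.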