LLM cost routing: when Haiku beats Opus and when it does not

Routing 1M classification input tokens from Claude Opus 4.7 to Haiku 4.5 saves $6.00 — an 80% reduction. The RouteLLM paper (ICLR 2025, arXiv:2406.18665, openreview.net/forum?id=8sSqNntaMr) showed you can get >85% cost reduction on conversational queries while holding 95% of flagship quality — by routing just 14% of traffic to the strong model. The decision isn’t binary; it depends heavily on which task class you are routing.

Who this is for

Developers building production LLM products where model costs appear on a budget review. If you are running fewer than 100k tokens per day, routing infrastructure adds more complexity than it saves — pick a cheap model and escalate when quality fails. This is a guide for teams that already have a volume problem.

What routing is and why it works

Model routing means sending each query to the cheapest model capable of handling it. The key word is “capable.” Not every query needs multi-step reasoning. Classification, extraction, and retrieval augmentation are often near-deterministic given a well-written prompt — model size doesn’t move the needle much there.

The cost spread forces the question. At current Anthropic pricing (May 2026, platform.claude.com/docs):

Model	Input ($/MTok)	Output ($/MTok)
Claude Haiku 4.5	$1.00	$5.00
Claude Sonnet 4.6	$3.00	$15.00
Claude Opus 4.7	$5.00	$25.00

Opus output costs 5× more than Haiku output. The Batch API halves all tiers, but the relative spread stays the same. On OpenAI, the gap is wider: GPT-4o output ($10.00/MTok) is 16.7× more expensive than GPT-4o-mini ($0.60/MTok).

Concrete dollar example — 1M input + 100k output tokens of classification work:

Route	Claude cost	OpenAI cost
All flagship (Opus / GPT-4o)	$7.50	$3.50
All cheap (Haiku / GPT-4o-mini)	$1.50	$0.21
Savings	$6.00 (80%)	$3.29 (94%)

That spread means even a rough router that calls the right model 70% of the time produces real savings. One that hits 85% — achievable on most query mixes, per the ICLR paper — produces dramatic ones.

The economic argument is reinforced by research. The Hybrid LLM paper (arXiv:2404.14618, ICLR 2024) found that a learned router could cut flagship model calls by up to 40% with no measurable quality drop. The ACL 2024 routing lessons paper (aclanthology.org/2024.insights-1.15.pdf) documented that ~20% of queries on structured tasks saw small models match or outperform large ones — the ceiling on cheap-model quality is higher than intuition suggests.

Task taxonomy: what to route and what to escalate

This table is the anchor. Use it as a starting heuristic, then tune on your own query distribution. The savings figures are estimates based on RouteLLM benchmarks and current pricing; your mix will shift the numbers.

Task class	Route to cheap?	Est. savings	Quality delta
Binary / multi-label classification	✅ Yes	75–94%	Near-zero
Structured extraction (NER, slot-filling)	✅ Yes	75–94%	Near-zero
FAQ / retrieval-augmented lookup	✅ Yes	75–80%	Negligible
Short marketing copy	✅ Yes	60–80%	Low if template-driven
Code formatting / linting	✅ Yes	75–94%	Near-zero
Summarization (structured docs)	✅ Yes (with tuning)	50–75%	Low
Multi-turn conversation	⚠️ Partial	74–85%	Low–moderate
Broad knowledge QA	⚠️ Partial	~45%	Moderate
Code generation (non-trivial)	⚠️ Escalate often	20–35%	Significant
Math reasoning / chain-of-thought	⚠️ Escalate often	~35%	High
Complex debugging / multi-file refactoring	❌ Escalate	<20%	Very high
Long-context synthesis (>50k tokens)	❌ Escalate	<20%	Very high
Agentic multi-step chains	❌ Escalate	<15%	Very high (errors compound)
Customer empathy / escalation handling	❌ Escalate	<20%	Reputational risk

Why does the gradient look like this? The RouteLLM benchmarks are instructive. On MT Bench (open-ended conversation), just 14% of queries needed the strong model for 95% quality parity — >85% cost reduction. MMLU (knowledge QA) required 54% flagship calls (~45% savings). GSM8K (math reasoning) needed 65% (~35% savings).

The pattern: tasks with high query variance route well; tasks requiring consistent multi-step reasoning route poorly. MT Bench has many easy conversational turns a cheap model handles fine. GSM8K requires chain-of-thought on nearly every problem, so the cheap model’s accuracy collapses across the board.

Coding is the non-obvious case. SWE-bench Verified scores (swebench.com leaderboard, retrieved 2026-05-17): Opus 4.7 at 87.6%, Sonnet 4.6 at 79.6%, Haiku 4.5 at 73.3%. That 14.3 percentage-point gap is real. Roughly 1-in-6 coding tasks that Opus handles correctly, Haiku fails. Whether that matters depends on what sits downstream of a failure. A formatting task gone wrong is a minor annoyance. A multi-file refactor gone wrong costs hours.

The TACL summarization benchmark (doi:10.1162/tacl_a_00632) adds nuance: instruction tuning matters more than model size for structured summarization tasks. A small model fine-tuned on your domain can outperform a large general model, which shifts the “escalate often” row toward “route with tuning.”

Latency: the reason to route even if cost doesn’t matter

Artificial Analysis benchmarks (artificialanalysis.ai, retrieved 2026-05-17):

Model	TTFT (best observed)	Throughput (best observed)
Claude Haiku 4.5	0.60s (Google Vertex)	103.5 t/s (Amazon)
Claude Opus 4.7	17.20s (Amazon)	78.7 t/s (Amazon)

Opus 4.7 is 28–34× slower to first token than Haiku 4.5. If your pipeline has any real-time component — classification in a request path, user-facing autocomplete, anything with a human waiting — Haiku is the only viable choice regardless of price. The math is simple: even if Anthropic made Opus free, 17 seconds to first token eliminates it from synchronous use cases. At Haiku’s 0.60s TTFT, you can chain three model calls in the time it takes Opus to start its first response.

Router overhead is negligible. RouteLLM’s most expensive classifier adds <0.4% to total generation cost — a rounding error against the latency and cost savings it enables.

Tools for implementation

Three options at different points on the build-vs-manage spectrum. These are brief pointers, not setup tutorials — follow the links for configuration docs.

LiteLLM Router — most practical for self-hosters

LiteLLM (docs.litellm.ai/docs/routing) is an open-source Python proxy with 100+ model providers behind a unified OpenAI-compatible API. You define routing strategy in YAML:

model_list:
  - model_name: my-classifier
    litellm_params:
      model: claude-haiku-4-5
  - model_name: my-classifier
    litellm_params:
      model: claude-opus-4-7

router_settings:
  routing_strategy: cost-based-routing

The router picks the cheapest available model from the group and fails over on error. You get budget limits per API key, spend tracking, and retry logic. Savings are implicit in the price differential you configure — LiteLLM routes; you capture whatever spread exists between your chosen models.

For: engineers who own their infrastructure and need fine-grained routing control, fallback behavior, and spend visibility.

OpenRouter — zero-infra option

OpenRouter (openrouter.ai) is a managed gateway for 400+ models across 60+ providers, handling 80T+ monthly tokens. Use the openrouter/auto model slug for automatic model selection powered by NotDiamond routing. No infrastructure to maintain — swap your base URL, keep your existing OpenAI SDK calls.

No affiliate program found for OpenRouter (checked May 2026).

Limitation: hosted-only, less granular control than a self-hosted proxy. The auto-router is a black box — you cannot tune its thresholds or inspect routing decisions.

For: startups and solo developers who want routing without operating infrastructure.

RouteLLM — research-grade, highest benchmark ceiling

RouteLLM (github.com/lm-sys/RouteLLM) trains MF/BERT/Causal LLM classifiers on Chatbot Arena preference data. The headline result: 14% GPT-4 calls needed for 95% MT Bench quality — >85% cost reduction. The router transfers zero-shot to Claude Opus/Sonnet pairs (APGR 0.762–0.772) without retraining on the new model pair.

The catch is setup cost. Getting the best numbers requires preference labels. The paper found LLM-judge augmentation — generating ~120k labels using a judge model — pushed MT Bench performance from 26% GPT-4 calls (untrained) to 14% (trained), at a cost of ~$700. That is a one-time investment; after training, inference cost is negligible.

For: teams with large, varied query distributions where a 10–15% improvement in routing accuracy is worth an engineering sprint.

One tool to remove from your list: Martian. As of May 2026, withmartian.com no longer lists a routing product, pricing page, or sign-up. The routing service appears discontinued — do not rely on Martian as an active recommendation without verifying current availability.

Verdict: four decision thresholds

Route everything to cheap by default if your task mix is >50% classification, extraction, or RAG lookups. Start there, audit failures monthly, escalate patterns to Opus — do not begin with flagship.

Use a hybrid with explicit escalation signals if your mix includes significant code generation, knowledge QA, or summarization. Set explicit complexity proxies — token count, query type tag, confidence threshold — rather than relying on a trained router. Explicit rules are auditable; a black-box router is not.

Invest in RouteLLM if you have a large, varied query distribution, can generate preference labels, and the gap between 45% and >85% savings is worth an engineering sprint. This is the highest-ceiling option but has real setup cost.

Keep everything on Opus if your pipeline is an agentic chain where errors compound, handles customer-facing escalations requiring empathy, or performs long-context synthesis across multiple documents. The 14.3pp SWE-bench gap is real, and in multi-step chains, a wrong intermediate step does not just fail — it corrupts the downstream steps that depend on it.

One-sentence heuristic: route to the cheap model by default; escalate to flagship only when the query requires multi-step reasoning, novel synthesis, or cross-document understanding.

Caveats

Pricing is as of May 2026 — check platform.claude.com/docs and openrouter.ai before basing architecture decisions on these numbers. RouteLLM’s results trained on GPT-4/Mixtral pairs; zero-shot transfer to Claude pairs has variance, and your per-task-class numbers will differ from the paper’s aggregate benchmarks. The latency data is from Artificial Analysis (artificialanalysis.ai) and reflects best-observed figures from multiple providers — real-world TTFT depends on load and region.

No tools linked in this article have affiliate relationships with toolchew.

References

Anthropic pricing — platform.claude.com/docs/en/about-claude/models/overview — retrieved 2026-05-17
OpenAI pricing — openrouter.ai and llmpricecheck.com — retrieved 2026-05-17
RouteLLM — arXiv:2406.18665 (ICLR 2025, openreview.net/forum?id=8sSqNntaMr) — lmsys.org/blog/2024-07-01-routellm and github.com/lm-sys/RouteLLM
Hybrid LLM paper — arXiv:2404.14618 (ICLR 2024)
LiteLLM Router docs — docs.litellm.ai/docs/routing — retrieved 2026-05-17
OpenRouter auto router (powered by NotDiamond) — openrouter.ai/docs/guides/routing/routers/auto-router — retrieved 2026-05-17
Martian — withmartian.com — retrieved 2026-05-17 (no routing product, pricing page, or sign-up visible as of this date; routing service appears discontinued)
Artificial Analysis latency benchmarks — artificialanalysis.ai — retrieved 2026-05-17
ACL 2024 routing lessons — aclanthology.org/2024.insights-1.15.pdf
TACL summarization benchmark — doi:10.1162/tacl_a_00632
SWE-bench Verified scores — swebench.com (retrieved 2026-05-17): Opus 4.7: 87.6%, Sonnet 4.6: 79.6%, Haiku 4.5: 73.3%