Sonnet 4.6 is the clear choice if you are already on Sonnet 4.5: same price, extended thinking on demand, 1M token context window. If you are on Sonnet 3.7, the picture is messier — the best available coding proxy — from Sonnet 4 (claude-sonnet-4-20250514) — sits 3.6 percentage points below Sonnet 3.7, but each agentic run costs about 28% less; no Aider polyglot result is currently available for claude-sonnet-4-6 specifically. For developers on Opus 4, Sonnet 4.6 runs the same workloads at roughly 60% of the cost. None of those decisions are obvious. This article breaks them down.

Who this is for

Developers currently running Sonnet 3.7, Sonnet 4.5, or Opus 4 in Claude Code or directly through the API who want a data-backed answer to “should I switch to claude-sonnet-4-6?” If you are picking a first model or AI coding tool, start with the best AI coding CLI comparison before narrowing to model selection. If you are still deciding between Claude Code and another coding tool, see the Cursor vs Claude Code comparison first.

What changed in Sonnet 4.6

Anthropic’s release framing is sweeping: “a full upgrade of the model’s skills across coding, computer use, long-context reasoning, agent planning, knowledge work, and design.” Independent benchmark data below tests how much of that holds for coding tasks specifically.

The concrete spec changes:

Spec	Sonnet 3.7	Sonnet 4.5	Sonnet 4.6
Context window	200k tokens	200k tokens	1M tokens
Max output	64k tokens	64k tokens	64k tokens
Extended thinking	Yes (`budget_tokens`)	Yes (`budget_tokens`)	Yes (`effort` param)
`effort` parameter	No	No	Yes (API default: `high`)
`budget_tokens`	Active	Active	Deprecated
Model ID	`claude-sonnet-3-7`	`claude-sonnet-4-5`	`claude-sonnet-4-6`

The 1M token context window is the clearest structural change — over both Sonnet 3.7 and Sonnet 4.5, which also has a 200k token window. Output cap and reasoning modes are unchanged from Sonnet 4.5.

Extended thinking in Sonnet 4.6 uses a new effort parameter (low, medium, high, max) that replaces the budget_tokens approach from Sonnet 3.7 and earlier. The old budget_tokens API still works but is deprecated and will be removed in a future release. The Aider benchmark data quantifies the gap between thinking and no-thinking configurations directly. Omitting the effort parameter entirely produces a cheaper and somewhat lower-accuracy profile.

Benchmark results

Aider polyglot coding benchmark

The Aider polyglot leaderboard is the most credible third-party coding benchmark currently available. It runs models through 225 Exercism exercises across C++, Go, Java, JavaScript, Python, and Rust. The score is the pass rate. Cost per run is derived from real API usage, not estimates.

Model	Thinking	Score	Cost/run
Claude Opus 4	32k tokens	72.0%	$65.75
Claude Sonnet 3.7	32k tokens	64.9%	$36.83
Claude Sonnet 4 (claude-sonnet-4-20250514)	32k tokens	61.3%	$26.58
Claude Sonnet 4 (claude-sonnet-4-20250514)	None	56.4%	$15.82

Source: aider.chat/docs/leaderboards

The data above is for claude-sonnet-4-20250514 — Anthropic’s Sonnet 4, now deprecated — not claude-sonnet-4-6. The Aider leaderboard has no entry for claude-sonnet-4-6 at time of writing. Check the model ID column when you read this; if a claude-sonnet-4-6 entry has appeared, use those figures instead.

What the proxy data shows: Sonnet 4 with 32k thinking tokens scores 3.6 points below Sonnet 3.7 at the same thinking budget. Aider creator Paul Gauthier noted on X: “Sonnet 4 seems to have underperformed 3.7.” Until a claude-sonnet-4-6 result lands on the leaderboard, this is the closest available signal for the Sonnet 4.x line’s coding accuracy.

The gap between thinking and no-thinking in the Sonnet 4 data is 4.9 points (61.3% vs. 56.4%). Disabling extended thinking is the cheapest configuration at $15.82/run, but it meaningfully reduces benchmark performance.

What is not in this article

No SWE-bench Verified or HumanEval score for claude-sonnet-4-6 survived adversarial verification against primary sources. A 79.6% SWE-bench claim appeared in secondary sources but was not confirmed. If Anthropic has published official evaluation numbers, they will be on the AWS Bedrock model card or the official model overview page.

No latency or tokens-per-second figures are included. All throughput claims found during research were refuted in adversarial verification. For up-to-date latency data, check artificialanalysis.ai and confirm the model ID listed is claude-sonnet-4-6.

What the 1M context window changes in practice

The most common scenario where the context window becomes the binding constraint is a long agentic session — one where the model accumulates file contents, tool-call results, test output, and conversation history simultaneously. At Sonnet 3.7’s 200k token limit, a session reading a large codebase plus running several rounds of tests could hit the ceiling and truncate earlier context. At 1M tokens, that ceiling is 5× further away.

For a single Python file or a straightforward bug fix, context window size is irrelevant. For any of the following workloads, it matters:

Multi-repo refactors: reading source files across several packages while tracking the change plan
Long agentic loops: 30+ tool calls that accumulate prior results in context
Large-codebase review: pulling many files into a single pass for cross-cutting analysis
Extended Claude Code sessions: where conversation history plus file contents compound quickly

Sonnet 4.5 has a 200k token context window, so the 1M window is an upgrade from both Sonnet 3.7 and Sonnet 4.5. For any developer who has run into context limits, this is the strongest argument for switching.

Pricing and cost analysis

API pricing: Sonnet 4.6 vs Sonnet 4.5

Model	Input	Output	Cache read	Batch input	Batch output
claude-sonnet-4-6	$3.00/MTok	$15.00/MTok	$0.30/MTok	$1.50/MTok	$7.50/MTok
claude-sonnet-4-5	$3.00/MTok	$15.00/MTok	$0.30/MTok	$1.50/MTok	$7.50/MTok

Source: platform.claude.com/docs/en/about-claude/pricing

Sonnet 4.6 is identically priced to Sonnet 4.5. The upgrade is free in the cost sense — you pay the same rates for meaningfully updated capabilities. Cache reads are 90% cheaper than base input; in practice, real savings are lower because cache writes cost 1.25–2× base input, so net benefit depends on your cache hit rate.

Cost vs. Sonnet 3.7 on agentic workloads

The Aider run costs give a clean comparison for agentic workloads at equivalent thinking budgets:

Workload	Sonnet 3.7 cost	Sonnet 4.6 cost	Difference
Single coding task (32k thinking)	$36.83	$26.58	Sonnet 4.6 is ~28% cheaper
10 coding tasks	$368.30	$265.80	$102.50 saved
100 coding tasks	$3,683.00	$2,658.00	$1,025 saved

Put differently: for every three Sonnet 3.7 runs, you can run approximately four Sonnet 4.6 runs at the same cost. Against a 3.6-point benchmark gap in the Sonnet 4 proxy data (64.9% vs. 61.3%), the math can work in either direction depending on how much volume you run and how sensitive your success metric is to per-task accuracy.

Cost vs. Opus 4 on agentic workloads

Workload	Opus 4 cost	Sonnet 4.6 cost	Difference
Single coding task (32k thinking)	$65.75	$26.58	Sonnet 4.6 is ~60% cheaper
10 coding tasks	$657.50	$265.80	$391.70 saved
100 coding tasks	$6,575.00	$2,658.00	$3,917 saved

Opus 4 scores 72.0% on Aider polyglot; the closest Sonnet 4.x proxy is 61.3% (from claude-sonnet-4-20250514) — a 10.7-point gap at 2.5× the cost per run. For the workloads where Opus 4 earns that premium (long agentic sessions, large-codebase comprehension, complex multi-repo refactors), the gap is worth it. The Opus 4.7 review covers those cases in detail. For everything else, Sonnet 4.6 should cover similar territory — the Sonnet 4 proxy puts it at roughly 85% of Opus 4’s benchmark score at 40% of the cost.

Extended thinking: effort parameter

Sonnet 4.6 replaces budget_tokens with an effort parameter. Four levels: low, medium, high, and max. The API technical default is high, but Anthropic recommends medium for most coding workflows — agentic coding, code generation, and tool-heavy pipelines. xhigh is restricted to Opus 4.8 and 4.7; it is not available on Sonnet 4.6.

The Sonnet 4 proxy data gives a 4.9-point gap between 32k thinking vs. no thinking, so engaging thinking still matters. The question is which level.

Effort level	When to use
`medium`	Recommended default for agentic coding, code generation, tool-heavy workflows
`high`	Complex debugging, multi-file refactors where root cause is unclear
`max`	Planning long agentic sequences, architecture review across competing approaches
`low`	Simple edits, PR review comments, high-volume or latency-sensitive workloads

Migration gotcha from Sonnet 4.5: Sonnet 4.5 had no effort parameter — all API calls ran without extended thinking by default. When you switch to 4.6 without setting effort explicitly, the API defaults to high and latency spikes relative to what you expected from 4.5. Set effort: "medium" explicitly as a practical default — it cuts latency relative to high while staying above Sonnet 4.5’s non-thinking quality baseline. To match Sonnet 4.5 throughput more closely, use effort: "low" with thinking disabled.

Deprecation notice: budget_tokens and thinking.type: "enabled" still work on Sonnet 4.6 but are deprecated. Migrate to effort before they are removed. Anthropic has not announced a removal timeline.

Sonnet 4.6 in Claude Code

Sonnet 4.6 is the default model in Claude Code. Anthropic’s internal testing found users preferred Sonnet 4.6 over Sonnet 4.5 roughly 70% of the time, citing fewer false claims of success, fewer hallucinations, and more consistent multi-step task follow-through. This is vendor-reported early testing — not independently replicated — so treat it as directional, not benchmarked.

What it means in practice: if you use Claude Code daily, the upgrade delivers more context headroom (1M vs. 200k tokens), explicit effort control, and the quality improvements described above, at the same price. Set effort: "medium" as your baseline and step up to high when a debugging session requires it. For a deeper look at Claude Code as a tool — usage limits, the April 2026 regression, and whether the Max plan is worth it — see the Claude Code 2026 review.

Verdict

On Sonnet 4.5 now: upgrade. Same price, 1M context window (up from 200k on 4.5), plus whatever capability improvements Anthropic landed. No cost argument for staying on 4.5.

On Sonnet 3.7 now: the decision depends on your workload. If you run long agentic loops or hit context limits regularly, the 1M window and 28% cost reduction are meaningful arguments for switching. The benchmark comparison is a proxy — it uses Sonnet 4 data (claude-sonnet-4-20250514), not claude-sonnet-4-6 specifically. If you run lower-volume, high-accuracy sessions where first-attempt quality is critical and you have not hit context limits, the Sonnet 4 proxy puts the 4.x line 3.6 points below Sonnet 3.7 on Aider polyglot. Test on your actual task type before committing.

On Opus 4 and trying to reduce cost: switch to Sonnet 4.6. You give up 10.7 points on the Aider polyglot benchmark in exchange for running ~2.5× more tasks at the same cost. If your work lives in the Opus-category tasks — long-horizon agentic sessions, production multi-repo refactors, codebases past 200k tokens — review whether Sonnet 4.6 actually covers your failure modes before cutting over. If it does, the cost savings are significant.

Current model	Recommendation	Key reason
Sonnet 4.5	Upgrade	Cost-neutral, capability improvement
Sonnet 3.7 (high-volume)	Upgrade	28% cost savings; Sonnet 4 proxy shows 3.6-point gap vs 3.7
Sonnet 3.7 (low-volume, quality-critical)	Test first	Sonnet 4 proxy shows 3.6-point gap vs 3.7; assess on your tasks
Opus 4 (cost-cutting)	Switch to Sonnet 4.6	60% cost savings at ~85% Aider benchmark performance
Opus 4 (agentic, long context)	Read Opus review	The quality gap may matter for your specific workload

What we didn’t test

Latency: no throughput figures are included. All tokens/sec claims in research failed adversarial verification. Check artificialanalysis.ai and confirm claude-sonnet-4-6 is the model ID listed.

SWE-bench: no confirmed score was available at time of writing. Pull from Anthropic’s official model card if you need a formal software engineering benchmark comparison.

Real-world refactor quality: the Aider polyglot benchmark is the best proxy available for multi-language coding accuracy, but it tests isolated exercise completion rather than production codebase changes. A direct comparison of Sonnet 4.6 vs. Sonnet 3.7 on a real multi-file refactor with measured diff quality would sharpen the recommendation significantly — this is the original benchmark the research brief called for that was not completed before publication.

Affiliate angle: no active Anthropic affiliate or referral programme was confirmed at time of writing. Claude API and Claude Code pricing is the same through any access path.

Claude Sonnet 4.6 for Coding — Is It Worth the Upgrade?