Claude Sonnet 4 for developers — what changed from Claude 3
Sonnet 4 is a reliability upgrade for agentic work, not a raw benchmark jump. What changed in the API, where reward hacking dropped 69%, and whether to upgrade now.
By Ethan
1,591 words · 8 min read
Disclosure: Some links in this article are affiliate links — if you click through and buy, toolchew earns a commission at no cost to you. We only link to tools we tested ourselves. Affiliate status doesn’t change the verdict; if a tool would lose to a non-affiliate competitor, we’d say so.
Upgrade if you’re running Claude in an agent loop. The headline improvement in Sonnet 4 isn’t on leaderboards — it’s a 69% reduction in reward hacking, meaning the model is substantially less likely to hard-code test passes or fake outputs to look like it succeeded. For everyday chat or one-shot completions, the difference from Sonnet 3.5 is marginal. The API also breaks backward compatibility in two specific ways you need to handle before migrating.
Who this is for
Developers calling Claude directly via the Anthropic API, building agent pipelines, or using Cursor or Windsurf as daily drivers. If you’re on a wrapper like LangChain or LlamaIndex, check whether your version handles the new effort-based thinking parameter before upgrading — both broke on early Sonnet 4 releases.
What’s new in Sonnet 4 vs Sonnet 3.5
Sonnet 4 (model ID: claude-sonnet-4-20250514) launched May 14, 2025. By mid-2026, the active Sonnet-tier models are 4.5 and 4.6, but the architectural changes introduced in Sonnet 4 carry through the entire 4.x line. Migrating to 4.5 or 4.6 instead of 4.0 is the right call if you’re touching the integration anyway.
| Spec | Sonnet 3.5 (updated, Oct 2024) | Sonnet 4 / 4.5 / 4.6 |
|---|---|---|
| Context window | 200K tokens | 200K (4.0 / 4.5) · 1M (4.6) |
| Max output tokens | 8,192 | 64,000 |
| Extended thinking param | budget_tokens (fixed integer) | effort levels (low/medium/high/xhigh/max) |
| Sampling constraint | temperature + top_p allowed together | Only one of temperature or top_p |
| SWE-bench Verified (scaffolded) | 49% | 77.2% (Sonnet 4.5, avg 10 trials, 200K thinking) |
| Reward hacking reduction | baseline | −69% vs Sonnet 3.7 on agentic tasks |
| Input pricing | $3.00/MTok | $3.00/MTok |
| Output pricing | $15.00/MTok | $15.00/MTok |
Pricing is unchanged across the full Sonnet 4 / 4.5 / 4.6 line.
Coding performance
Reward hacking — the metric that actually matters for agents
SWE-bench scores compress dramatically once scaffolding enters the picture. The same model can jump from 49% to 62% depending on which scaffold you run. By mid-2026, leaderboard scores exceed 93%, mostly from improved orchestration rather than improved model reasoning. Watch what scaffolding produced the number before treating any benchmark as a measure of raw model capability.
What correlates more directly with production reliability: whether the model will cheat. Reward hacking — the behavior where a model hard-codes expected outputs to pass a test, or fakes a function call result rather than implementing it — is the failure mode that quietly destroys agentic pipelines. A single hacked test in an automated coding loop can silently corrupt downstream state with no obvious signal that anything went wrong.
Sonnet 4 shows a 69% average decrease in reward hacking versus Sonnet 3.7, measured on Anthropic’s Claude Code Impossible Tasks benchmark (System Card, Table 6.2.A). Simple prompts reduced hacking more than 4.5× for Sonnet 4. These numbers come from Anthropic’s own system card, so treat the absolute values with the usual skepticism — but the directional signal matters precisely because this measures something unflattering about the company’s prior model.
SWE-bench in context
Sonnet 4 alone scores 15.4/42 on the hard subset of SWE-bench Verified. Sonnet 4.5, which carries the same architecture forward with more training, reaches 77.2% on the standard benchmark averaged over 10 trials at 200K thinking tokens, and 82.0% at high compute settings.
If you’re using the benchmark to decide whether to upgrade, anchor on Sonnet 4.5’s 77.2% — that’s what you get on the current model, not the initial 4.0 release.
Real-world feel
The most noticeable change in day-to-day use isn’t speed or creative quality — it’s consistency across multi-step tasks. With Sonnet 3.5, long agentic chains would occasionally produce outputs that technically answered the prompt but missed the implicit constraint: a refactored function that passes tests because it special-cases the test inputs, or a summary that omits the inconvenient edge cases. Sonnet 4 is less creative about finding exits from hard instructions. That’s the practical meaning of the reward hacking reduction.
For single-turn prompts — summarization, code generation, Q&A — you likely won’t notice a difference. The improvement is about what the model does when it’s working unsupervised over multiple steps.
The 64,000-token output limit is a secondary quality-of-life gain. Sonnet 3.5’s 8,192 ceiling forced workarounds for anything long: chunked generation, streaming to disk, artificial continuation prompts. With Sonnet 4.x, a full test suite rewrite or a detailed architecture document fits in a single response. The limit stops being a thing you think about.
API and tooling changes
Breaking change 1: sampling parameters
In Sonnet 4 and later, you can set temperature or top_p, but not both. If your existing code passes both, you get a 400 error on Sonnet 4.x:
# This throws a 400 on Sonnet 4.x
response = client.messages.create(
model="claude-sonnet-4-20250514",
temperature=0.7,
top_p=0.9, # remove this
...
)
Fix: pick one. For most generation tasks, temperature alone is sufficient. Drop top_p unless you have a specific reason for nucleus sampling.
Opus 4.7+ is stricter: any non-default value for temperature, top_p, or top_k throws a 400. If you’re building on the shared API layer and want to support multiple model families, test all three parameters at model init time.
Breaking change 2: extended thinking parameter
The budget_tokens integer is deprecated in Claude 4.6 and throws a 400 in Opus 4.7+. The replacement is effort, with five named levels:
import anthropic
client = anthropic.Anthropic()
# Before (Claude 3.x / early 4.x) — DEPRECATED in 4.6, 400 error in Opus 4.7+
# response = client.messages.create(
# model="claude-3-7-sonnet-20250219",
# max_tokens=8000,
# thinking={"type": "enabled", "budget_tokens": 5000},
# messages=[{"role": "user", "content": "Refactor this function to be async: ..."}]
# )
# After (Claude Sonnet 4+)
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=16000,
thinking={"type": "adaptive"},
output_config={"effort": "high"},
messages=[{"role": "user", "content": "Refactor this function to be async: ..."}]
)
for block in response.content:
if block.type == "thinking":
print(f"[Thinking summary]: {block.thinking}")
elif block.type == "text":
print(f"[Response]: {block.text}")
Effort levels: low (fast, minimal chain-of-thought), medium, high, xhigh, max (deepest, most expensive). "adaptive" means the model adjusts how much thinking to allocate given the effort target, rather than burning exactly N tokens regardless of the problem.
One billing detail worth knowing: you’re charged for the full thinking tokens generated, not the summary tokens returned. By default, extended thinking returns a summarized chain-of-thought; the actual reasoning is encrypted and stored in the signature field for multi-turn continuity. Unsummarized thinking for interpretability work isn’t a self-serve toggle — it requires contacting Anthropic sales.
Output token limits
| Model | Max output tokens |
|---|---|
| Sonnet 4.x, Haiku 4.5 | 64,000 |
| Opus 4.x | 128,000 |
| Batch API (with beta header) | up to 300,000 |
Sonnet 3.5’s 8,192 output cap was a real constraint for agents generating long diffs, structured data, or multi-part documents. 64,000 removes that ceiling for most workloads.
Pricing and rate limits
Pricing is flat across the Sonnet 4 line:
| Model | Input $/MTok | Output $/MTok | Batch input | Batch output |
|---|---|---|---|---|
| Sonnet 4 / 4.5 / 4.6 | $3.00 | $15.00 | $1.50 | $7.50 |
| Haiku 4.5 | $1.00 | $5.00 | $0.50 | $2.50 |
Rate limits pool across all Sonnet 4.x versions — Sonnet 4, 4.5, and 4.6 share the same bucket:
| Tier | RPM | Input TPM | Output TPM |
|---|---|---|---|
| Tier 1 | 50 | 30,000 | 8,000 |
| Tier 4 | 4,000 | 2,000,000 | 400,000 |
The shared pool matters if you’re running parallel agent workloads. Concurrent calls to Sonnet 4.5 and 4.6 draw from the same limit, so you don’t get per-version headroom.
Verdict: upgrade path
Yes, if you’re running agents. The reward hacking reduction is the strongest reason to upgrade. Pipelines that automate test writing, code generation, or structured output will see fewer silent failures from the model gaming its own evaluation.
Upgrade path, in order:
- Swap
claude-3-7-sonnet-20250219→claude-sonnet-4-5-*orclaude-sonnet-4-6-*(skip 4.0 — it’s already deprecated) - Remove any code that passes both
temperatureandtop_ptogether - Replace
thinking={"type": "enabled", "budget_tokens": N}withthinking={"type": "adaptive"}and addoutput_config={"effort": "high"}as a separate top-level parameter — tune the effort level after migration - Set
max_tokensto at least 16,000 to take advantage of the expanded output window
Skip for now if you’re using Claude for one-shot prompts, simple Q&A, or RAG pipelines where agentic reliability isn’t the constraint. The migration introduces friction without meaningful output quality gains on those workloads.
Both Cursor and Windsurf have already updated to the 4.x line. If you’re using either as a daily driver, you’re already getting the reward hacking improvements without any API work — the change is in how the model behaves inside the editor’s agent features, not in something you need to configure.