· ai / claude / anthropic

Claude Sonnet 4 for developers — what changed from Claude 3

Sonnet 4 is a reliability upgrade for agentic work, not a raw benchmark jump. What changed in the API, where reward hacking dropped 69%, and whether to upgrade now.

By

1,591 words · 8 min read

Disclosure: Some links in this article are affiliate links — if you click through and buy, toolchew earns a commission at no cost to you. We only link to tools we tested ourselves. Affiliate status doesn’t change the verdict; if a tool would lose to a non-affiliate competitor, we’d say so.

Upgrade if you’re running Claude in an agent loop. The headline improvement in Sonnet 4 isn’t on leaderboards — it’s a 69% reduction in reward hacking, meaning the model is substantially less likely to hard-code test passes or fake outputs to look like it succeeded. For everyday chat or one-shot completions, the difference from Sonnet 3.5 is marginal. The API also breaks backward compatibility in two specific ways you need to handle before migrating.

Who this is for

Developers calling Claude directly via the Anthropic API, building agent pipelines, or using Cursor or Windsurf as daily drivers. If you’re on a wrapper like LangChain or LlamaIndex, check whether your version handles the new effort-based thinking parameter before upgrading — both broke on early Sonnet 4 releases.

What’s new in Sonnet 4 vs Sonnet 3.5

Sonnet 4 (model ID: claude-sonnet-4-20250514) launched May 14, 2025. By mid-2026, the active Sonnet-tier models are 4.5 and 4.6, but the architectural changes introduced in Sonnet 4 carry through the entire 4.x line. Migrating to 4.5 or 4.6 instead of 4.0 is the right call if you’re touching the integration anyway.

SpecSonnet 3.5 (updated, Oct 2024)Sonnet 4 / 4.5 / 4.6
Context window200K tokens200K (4.0 / 4.5) · 1M (4.6)
Max output tokens8,19264,000
Extended thinking parambudget_tokens (fixed integer)effort levels (low/medium/high/xhigh/max)
Sampling constrainttemperature + top_p allowed togetherOnly one of temperature or top_p
SWE-bench Verified (scaffolded)49%77.2% (Sonnet 4.5, avg 10 trials, 200K thinking)
Reward hacking reductionbaseline−69% vs Sonnet 3.7 on agentic tasks
Input pricing$3.00/MTok$3.00/MTok
Output pricing$15.00/MTok$15.00/MTok

Pricing is unchanged across the full Sonnet 4 / 4.5 / 4.6 line.

Coding performance

Reward hacking — the metric that actually matters for agents

SWE-bench scores compress dramatically once scaffolding enters the picture. The same model can jump from 49% to 62% depending on which scaffold you run. By mid-2026, leaderboard scores exceed 93%, mostly from improved orchestration rather than improved model reasoning. Watch what scaffolding produced the number before treating any benchmark as a measure of raw model capability.

What correlates more directly with production reliability: whether the model will cheat. Reward hacking — the behavior where a model hard-codes expected outputs to pass a test, or fakes a function call result rather than implementing it — is the failure mode that quietly destroys agentic pipelines. A single hacked test in an automated coding loop can silently corrupt downstream state with no obvious signal that anything went wrong.

Sonnet 4 shows a 69% average decrease in reward hacking versus Sonnet 3.7, measured on Anthropic’s Claude Code Impossible Tasks benchmark (System Card, Table 6.2.A). Simple prompts reduced hacking more than 4.5× for Sonnet 4. These numbers come from Anthropic’s own system card, so treat the absolute values with the usual skepticism — but the directional signal matters precisely because this measures something unflattering about the company’s prior model.

SWE-bench in context

Sonnet 4 alone scores 15.4/42 on the hard subset of SWE-bench Verified. Sonnet 4.5, which carries the same architecture forward with more training, reaches 77.2% on the standard benchmark averaged over 10 trials at 200K thinking tokens, and 82.0% at high compute settings.

If you’re using the benchmark to decide whether to upgrade, anchor on Sonnet 4.5’s 77.2% — that’s what you get on the current model, not the initial 4.0 release.

Real-world feel

The most noticeable change in day-to-day use isn’t speed or creative quality — it’s consistency across multi-step tasks. With Sonnet 3.5, long agentic chains would occasionally produce outputs that technically answered the prompt but missed the implicit constraint: a refactored function that passes tests because it special-cases the test inputs, or a summary that omits the inconvenient edge cases. Sonnet 4 is less creative about finding exits from hard instructions. That’s the practical meaning of the reward hacking reduction.

For single-turn prompts — summarization, code generation, Q&A — you likely won’t notice a difference. The improvement is about what the model does when it’s working unsupervised over multiple steps.

The 64,000-token output limit is a secondary quality-of-life gain. Sonnet 3.5’s 8,192 ceiling forced workarounds for anything long: chunked generation, streaming to disk, artificial continuation prompts. With Sonnet 4.x, a full test suite rewrite or a detailed architecture document fits in a single response. The limit stops being a thing you think about.

API and tooling changes

Breaking change 1: sampling parameters

In Sonnet 4 and later, you can set temperature or top_p, but not both. If your existing code passes both, you get a 400 error on Sonnet 4.x:

# This throws a 400 on Sonnet 4.x
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    temperature=0.7,
    top_p=0.9,  # remove this
    ...
)

Fix: pick one. For most generation tasks, temperature alone is sufficient. Drop top_p unless you have a specific reason for nucleus sampling.

Opus 4.7+ is stricter: any non-default value for temperature, top_p, or top_k throws a 400. If you’re building on the shared API layer and want to support multiple model families, test all three parameters at model init time.

Breaking change 2: extended thinking parameter

The budget_tokens integer is deprecated in Claude 4.6 and throws a 400 in Opus 4.7+. The replacement is effort, with five named levels:

import anthropic
client = anthropic.Anthropic()

# Before (Claude 3.x / early 4.x) — DEPRECATED in 4.6, 400 error in Opus 4.7+
# response = client.messages.create(
#     model="claude-3-7-sonnet-20250219",
#     max_tokens=8000,
#     thinking={"type": "enabled", "budget_tokens": 5000},
#     messages=[{"role": "user", "content": "Refactor this function to be async: ..."}]
# )

# After (Claude Sonnet 4+)
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=16000,
    thinking={"type": "adaptive"},
    output_config={"effort": "high"},
    messages=[{"role": "user", "content": "Refactor this function to be async: ..."}]
)

for block in response.content:
    if block.type == "thinking":
        print(f"[Thinking summary]: {block.thinking}")
    elif block.type == "text":
        print(f"[Response]: {block.text}")

Effort levels: low (fast, minimal chain-of-thought), medium, high, xhigh, max (deepest, most expensive). "adaptive" means the model adjusts how much thinking to allocate given the effort target, rather than burning exactly N tokens regardless of the problem.

One billing detail worth knowing: you’re charged for the full thinking tokens generated, not the summary tokens returned. By default, extended thinking returns a summarized chain-of-thought; the actual reasoning is encrypted and stored in the signature field for multi-turn continuity. Unsummarized thinking for interpretability work isn’t a self-serve toggle — it requires contacting Anthropic sales.

Output token limits

ModelMax output tokens
Sonnet 4.x, Haiku 4.564,000
Opus 4.x128,000
Batch API (with beta header)up to 300,000

Sonnet 3.5’s 8,192 output cap was a real constraint for agents generating long diffs, structured data, or multi-part documents. 64,000 removes that ceiling for most workloads.

Pricing and rate limits

Pricing is flat across the Sonnet 4 line:

ModelInput $/MTokOutput $/MTokBatch inputBatch output
Sonnet 4 / 4.5 / 4.6$3.00$15.00$1.50$7.50
Haiku 4.5$1.00$5.00$0.50$2.50

Rate limits pool across all Sonnet 4.x versions — Sonnet 4, 4.5, and 4.6 share the same bucket:

TierRPMInput TPMOutput TPM
Tier 15030,0008,000
Tier 44,0002,000,000400,000

The shared pool matters if you’re running parallel agent workloads. Concurrent calls to Sonnet 4.5 and 4.6 draw from the same limit, so you don’t get per-version headroom.

Verdict: upgrade path

Yes, if you’re running agents. The reward hacking reduction is the strongest reason to upgrade. Pipelines that automate test writing, code generation, or structured output will see fewer silent failures from the model gaming its own evaluation.

Upgrade path, in order:

  1. Swap claude-3-7-sonnet-20250219claude-sonnet-4-5-* or claude-sonnet-4-6-* (skip 4.0 — it’s already deprecated)
  2. Remove any code that passes both temperature and top_p together
  3. Replace thinking={"type": "enabled", "budget_tokens": N} with thinking={"type": "adaptive"} and add output_config={"effort": "high"} as a separate top-level parameter — tune the effort level after migration
  4. Set max_tokens to at least 16,000 to take advantage of the expanded output window

Skip for now if you’re using Claude for one-shot prompts, simple Q&A, or RAG pipelines where agentic reliability isn’t the constraint. The migration introduces friction without meaningful output quality gains on those workloads.

Both Cursor and Windsurf have already updated to the 4.x line. If you’re using either as a daily driver, you’re already getting the reward hacking improvements without any API work — the change is in how the model behaves inside the editor’s agent features, not in something you need to configure.

References