
Claude Opus 4.7 for Coding — When the Big Model Wins

Opus 4.7 leads SWE-bench Verified at 87.6% and scores 70% on CursorBench vs. 58% for Opus 4.6. It costs ~2× Sonnet 4.6 after the tokenizer uplift. Here is exactly when it is worth it.

Opus 4.7 is the right model if you are running long agentic coding sessions, working in codebases that push past 200k tokens, or doing production multi-file refactors where abandoning the task mid-way costs real time. For everything else — PR reviews, isolated functions, high-volume pipelines — Sonnet 4.6 is faster and costs roughly half as much after you account for the tokenizer.

Who this is for

A developer already running Sonnet 4.6 or Haiku 4.5 who wants a concrete answer to “is Opus worth the upgrade for my workload?” If you are still picking a first model, start with Sonnet 4.6 and come back when you hit its limits.

What the data shows

The headline benchmark is SWE-bench Verified: Opus 4.7 scores 87.6%, up from 80.8% on Opus 4.6, and ahead of Gemini 3.1 Pro (80.6%). SWE-bench uses real GitHub issues with human-verified test suites — it is the closest publicly reproducible proxy for real repair work. The 6.8-point gain matters.

Anthropic also reports CursorBench results, which measure real IDE-integrated coding workflows rather than synthetic puzzles. Opus 4.7 scores 70% vs. Opus 4.6’s 58%. That 12-point gap is the IDE dimension of the same story: better goal-state retention under pressure.

Third-party production signal: Rakuten reported Opus 4.7 resolving 3× more production tasks than Opus 4.6 in internal testing. That is the strongest independent datapoint available and it aligns with the benchmark direction.

One honest regression: BrowseComp (web research) dropped from 83.7% → 79.3% versus Opus 4.6. GPT-5.4 leads here at 89.3%. If your workflow depends heavily on Claude researching and synthesising web content, Opus 4.7 is a step back from its predecessor.

Findings by task type

Long-horizon agentic coding

Opus 4.7 scores 70% on CursorBench vs. 58% for Opus 4.6 on real IDE-integrated workflows. The model maintains goal state across long tool-call chains without drifting. When Sonnet 4.6 runs a 40-step agentic task, it is more likely to give up or lose context mid-sequence; Opus 4.7 is less likely to do either. This is where the benchmark improvements show up most visibly in practice.

Large context — 200k+ token codebases

HN user arcanemachiner (thread #47793411): “I had a conversation go well into the 200K token range…the model seemed surprisingly capable” — contrasted with Opus 4.6 which “seems to veer into the dumb zone heavily around the 200k mark.” This is structural, not marginal. Opus 4.7 and Sonnet 4.6 both support a 1M token context window, but Opus 4.7’s comprehension at the far end of that range is meaningfully different.

Production multi-file refactors

Rakuten’s 3× figure applies here. Multi-repo refactors that require holding a coherent change model across dozens of files and hundreds of call sites are exactly where Opus 4.7 earns its premium. Sonnet 4.6 is capable for single-file or small-scope changes; it loses coherence at scale.

IDE-integrated coding agents

CursorBench: 70% vs. 58%. If you use Cursor or Windsurf as your daily driver, the agent quality gap is real and it shows up in tasks like “refactor this auth module to use the new session API”, where the model needs to track multiple file boundaries and test implications simultaneously. Windsurf supports Opus 4.7 Fast Mode (beta), which delivers ~2.5× faster output at $30/$150 per MTok. That is a 6× price premium over standard Opus rates ($5/$25 per MTok), worth it only for latency-sensitive interactive use.

Where Sonnet 4.6 is good enough

  • PR reviews and isolated functions: the benchmark gap between Opus and Sonnet is small on single-task completions. The cost gap (roughly 1.9–2× after tokenizer uplift, see below) is not.
  • High-volume pipelines: batch pricing for Opus 4.7 is $2.50/$12.50 per MTok vs. Sonnet 4.6’s $1.50/$7.50. Run 10M input tokens per day and the difference is $10/day, or about $300/month, before the tokenizer uplift; the per-token gap on output is five times larger.
  • Web research and agentic search: Sonnet 4.6 outperforms Opus 4.7 on BrowseComp, and GPT-5.4 leads both. If web research is your primary use case, Opus 4.7 is the wrong call.

The web research regression

HN thread #47793411 surfaced this clearly: developers using Opus 4.7 for search-backed agentic workflows are frustrated. The adaptive thinking system — which is supposed to decide when to apply extended reasoning — has a documented failure mode: it opts out of thinking when it should engage. User JamesSwift: “Its especially concerning / frustrating because boris’s reply to my bug report on opus being dummer was ‘we think adaptive thinking isnt working’ and then thats the last I heard of it.” User simonw: “I’m finding the ‘adaptive thinking’ thing very confusing, especially having written code against the previous thinking budget / thinking effort / etc modes.”

This isn’t a fringe bug. It affects workflows where the model needs to chain web lookups, synthesise results, and reason about relevance. If that describes your use case, BrowseComp (79.3% for Opus 4.7 vs. 89.3% for GPT-5.4) is the benchmark to watch.

The real cost: tokenizer uplift

Sticker price comparison: Opus 4.7 is $5/$25 per MTok (input/output); Sonnet 4.6 is $3/$15 per MTok. On paper, that is a 1.67× premium.

In practice it is higher. Opus 4.7 uses a new tokenizer that maps the same English input to up to 1.35× as many tokens, depending on content type. On typical developer workloads (English prose, code, JSON) the uplift is 1.12–1.18×. At $5/MTok with a 1.18× tokenizer multiplier, text that used to count as 1M tokens now costs about $5.90 on input. Against Sonnet 4.6’s $3/MTok, the effective cost ratio is closer to 1.97×, not 1.67×.

| Scenario | Sonnet 4.6 cost | Opus 4.7 cost | Ratio |
|---|---|---|---|
| 1M tokens input, standard pricing | $3.00 | $5.00 | 1.67× |
| 1M tokens input, with tokenizer uplift (1.18×) | $3.00 | $5.90 | 1.97× |
| 10M tokens/day, batch pricing | $15/day | $29.50/day | ~2× |
| Fast Mode, 1M output tokens | | $150 | |
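
The arithmetic behind these figures can be checked in a few lines of Python. This is a minimal sketch, not an official pricing calculator: the function name `effective_cost` and its parameters are hypothetical, the prices are the per-MTok input rates quoted in this article, and the 1.18× multiplier is assumed to be the upper end of the range cited for typical developer workloads.

```python
def effective_cost(mtok, price_per_mtok, tokenizer_uplift=1.0):
    """Dollar cost for text that measures `mtok` million tokens in the
    baseline tokenizer, after scaling the count by a tokenizer uplift."""
    return mtok * tokenizer_uplift * price_per_mtok


# Input prices quoted in this article (USD per MTok).
SONNET_INPUT, OPUS_INPUT = 3.00, 5.00
SONNET_BATCH_INPUT, OPUS_BATCH_INPUT = 1.50, 2.50
UPLIFT = 1.18  # assumption: upper end of the 1.12-1.18x range

sonnet = effective_cost(1, SONNET_INPUT)
opus_sticker = effective_cost(1, OPUS_INPUT)
opus_effective = effective_cost(1, OPUS_INPUT, UPLIFT)

# 10M input tokens per day at batch rates.
sonnet_batch_day = effective_cost(10, SONNET_BATCH_INPUT)
opus_batch_day = effective_cost(10, OPUS_BATCH_INPUT, UPLIFT)

print(f"sticker ratio:   {opus_sticker / sonnet:.2f}x")
print(f"effective ratio: {opus_effective / sonnet:.2f}x")
print(f"batch per day:   ${sonnet_batch_day:.2f} vs ${opus_batch_day:.2f}")
```

Swapping `UPLIFT` for a multiplier measured on your own prompts gives a ratio specific to your workload rather than the article's typical case.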

Non-English codebases (Japanese, Korean) may see tokenizer cost reductions due to more efficient encoding — that partially offsets the uplift for international teams.

Verdict — pick matrix

| Use case | Model | Reason |
|---|---|---|
| Agentic tasks > 50 steps | Opus 4.7 | 12-point CursorBench lead (70% vs. 58%) |
| Codebase > 200k tokens | Opus 4.7 | Structural comprehension gap at context extremes |
| Production multi-repo refactors | Opus 4.7 | 3× task resolution (Rakuten) |
| IDE agent (Windsurf Fast Mode) | Opus 4.7 | 70% CursorBench, 2.5× faster output |
| PR reviews, isolated functions | Sonnet 4.6 | ~2× cheaper, negligible quality gap |
| High-volume pipelines | Sonnet 4.6 | Batch cost gap compounds at scale |
| Web research, agentic search | Sonnet 4.6 or GPT-5.4 | Opus 4.7 regressed on BrowseComp |
| Budget-first entry point | Haiku 4.5 | $1/$5 per MTok, 200k context |
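
If you route requests programmatically, the matrix above can be written down as a small function. The sketch below is illustrative only: `Workload`, its fields, the `kind` labels, and `pick_model` are hypothetical names, the returned strings are the display names used in this article rather than API model identifiers, and the order in which the rules are checked is an assumption, since the matrix does not say which row wins when several apply.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    kind: str                  # hypothetical labels, e.g. "pr_review", "pipeline"
    agent_steps: int = 0       # expected tool-call steps in one task
    context_tokens: int = 0    # size of the code the model must hold
    budget_first: bool = False


def pick_model(w: Workload) -> str:
    """Map a workload to the article's recommendation (assumed precedence)."""
    # Web research comes first: Opus 4.7 regressed on BrowseComp.
    if w.kind == "web_research":
        return "Sonnet 4.6 or GPT-5.4"
    # Long agentic runs and very large contexts favour Opus 4.7.
    if w.agent_steps > 50 or w.context_tokens > 200_000:
        return "Opus 4.7"
    if w.kind in {"multi_repo_refactor", "ide_agent"}:
        return "Opus 4.7"
    if w.budget_first:
        return "Haiku 4.5"
    # PR reviews, isolated functions, high-volume pipelines.
    return "Sonnet 4.6"


print(pick_model(Workload("pr_review")))
print(pick_model(Workload("agentic", agent_steps=60)))
```

Treat the thresholds as starting points taken from the table, and adjust them against your own task outcomes.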

Caveats

Tokenizer cost trap: budget for the 12–18% uplift when migrating prompts from Opus 4.6 to 4.7. A pipeline optimised to stay under a cost ceiling will blow it without the adjustment.
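
A quick way to see whether a pipeline survives the migration is to apply the uplift range to its current spend. The sketch below uses hypothetical helper names and hypothetical dollar figures, and it assumes the per-token price is unchanged between versions and that output tokens grow by the same factor as input, neither of which this article confirms.

```python
def migrated_cost_range(current_cost, uplift_low=1.12, uplift_high=1.18):
    """Projected (low, high) spend after migration, assuming only the
    token count changes and the per-token price stays the same."""
    return current_cost * uplift_low, current_cost * uplift_high


def fits_ceiling(current_cost, ceiling, uplift_high=1.18):
    """True if the worst-case projected spend stays within the ceiling."""
    return current_cost * uplift_high <= ceiling


# Hypothetical pipeline: $900/day today against a $1,000/day ceiling.
low, high = migrated_cost_range(900)
print(f"projected spend: ${low:.0f} to ${high:.0f} per day")
print("within ceiling:", fits_ceiling(900, 1000))
```

In this made-up case the pipeline passes today but exceeds its ceiling at either end of the uplift range, which is the failure the caveat describes.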

Adaptive thinking: the HN-documented failure mode (chooses not to think when it should) is real and under-documented by Anthropic. Test your specific agentic chain before committing to Opus 4.7 for reasoning-heavy workflows.

Anthropic API affiliate: Anthropic has no affiliate program. The Windsurf link above is an affiliate link; the Anthropic pricing link is not.

Benchmarks are Anthropic-reported unless noted: CursorBench (70% vs. 58%) and Rakuten (3× task resolution) are from the Anthropic release announcement. SWE-bench Verified, SWE-bench Pro, and BrowseComp figures are from Vellum AI’s third-party benchmark analysis (linked in references).

References