Claude Opus 4.7 for Coding — When the Big Model Wins
Opus 4.7 leads SWE-bench Verified at 87.6% and scores 70% on CursorBench vs. 58% for Opus 4.6. It costs ~2× Sonnet 4.6 after the tokenizer uplift. Here is exactly when it is worth it.
By Ethan
Opus 4.7 is the right model if you are running long agentic coding sessions, working in codebases that push past 200k tokens, or doing production multi-file refactors where abandoning the task mid-way costs real time. For everything else — PR reviews, isolated functions, high-volume pipelines — Sonnet 4.6 is faster and costs roughly half as much after you account for the tokenizer.
Who this is for
A developer already running Sonnet 4.6 or Haiku 4.5 who wants a concrete answer to “is Opus worth the upgrade for my workload?” If you are still picking a first model, start with Sonnet 4.6 and come back when you hit its limits.
What the data shows
The headline benchmark is SWE-bench Verified: Opus 4.7 scores 87.6%, up from 80.8% on Opus 4.6, and ahead of Gemini 3.1 Pro (80.6%). SWE-bench uses real GitHub issues with human-verified test suites, which makes it the closest publicly reproducible proxy for real repair work. The 6.8-point gain works out to roughly seven more issues resolved per hundred attempted.
Anthropic also reports results on CursorBench, which measures real IDE-integrated coding workflows rather than synthetic puzzles. Opus 4.7 scores 70% vs. Opus 4.6’s 58%. That 12-point gap is the IDE dimension of the same story: better goal-state retention under pressure.
Production signal: Rakuten reported Opus 4.7 resolving 3× more production tasks than Opus 4.6 in internal testing. The figure was published in Anthropic’s release announcement rather than independently, but it is the strongest real-workload datapoint available and it aligns with the benchmark direction.
One honest regression: BrowseComp (web research) dropped from 83.7% on Opus 4.6 to 79.3% on Opus 4.7. GPT-5.4 leads here at 89.3%. If your workflow depends heavily on Claude researching and synthesising web content, Opus 4.7 is a step back from its predecessor.
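As a quick sanity check on the deltas quoted in this section, here is a minimal sketch that recomputes the percentage-point changes from the scores as the article states them. The dictionary layout and function name are illustrative only.

```python
# Scores (percent) as quoted in this article: (Opus 4.6, Opus 4.7).
SCORES = {
    "SWE-bench Verified": (80.8, 87.6),
    "CursorBench": (58.0, 70.0),
    "BrowseComp": (83.7, 79.3),
}


def point_delta(old: float, new: float) -> float:
    """Percentage-point change from old to new, rounded to one decimal."""
    return round(new - old, 1)


deltas = {name: point_delta(old, new) for name, (old, new) in SCORES.items()}

for name, delta in deltas.items():
    direction = "gain" if delta > 0 else "regression"
    print(f"{name}: {delta:+.1f} points ({direction})")
```

Two of the three benchmarks move up and one moves down, which is why the recommendation later in the article splits by task type.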
Findings by task type
Long-horizon agentic coding
Opus 4.7 scores 70% on CursorBench vs. 58% for Opus 4.6 on real IDE-integrated workflows. The model maintains goal state across long tool-call chains without drifting. When Sonnet 4.6 runs a 40-step agentic task, it is more likely to give up or lose context mid-sequence; Opus 4.7 is less prone to both. This is where the benchmark improvements show up most visibly in practice.
Large context — 200k+ token codebases
HN user arcanemachiner (thread #47793411): “I had a conversation go well into the 200K token range…the model seemed surprisingly capable”, contrasted with Opus 4.6, which “seems to veer into the dumb zone heavily around the 200k mark.” This is one user’s report, but it describes a structural difference rather than a marginal one. Opus 4.7 and Sonnet 4.6 both support a 1M token context window, but Opus 4.7’s comprehension at the far end of that range is reported to hold up better.
Production multi-file refactors
Rakuten’s 3× figure applies here. Multi-repo refactors that require holding a coherent change model across dozens of files and hundreds of call sites are exactly where Opus 4.7 earns its premium. Sonnet 4.6 is capable for single-file or small-scope changes; it loses coherence at scale.
IDE-integrated coding agents
CursorBench: 70% vs. 58%. If you use Cursor or Windsurf as your daily driver, the agent quality gap is real and it shows up in tasks like “refactor this auth module to use the new session API”, where the model needs to track multiple file boundaries and test implications simultaneously. Windsurf supports Opus 4.7 Fast Mode (beta), which delivers ~2.5× faster output at $30/$150 per MTok. That is a 6× price premium over standard Opus rates ($5/$25 per MTok), worth it only for latency-sensitive interactive use.
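The Fast Mode premium can be checked with a few lines of arithmetic. This is a rough sketch using only the prices and the approximate speedup quoted in this article; the constant and function names are made up for illustration.

```python
# Prices in USD per million tokens (input, output), as quoted in this article.
STANDARD_OPUS = (5.00, 25.00)
FAST_MODE_OPUS = (30.00, 150.00)
FAST_MODE_SPEEDUP = 2.5  # approximate output speedup quoted above


def price_premium(fast: tuple[float, float], standard: tuple[float, float]) -> tuple[float, float]:
    """Return the (input, output) price multiples of fast over standard."""
    return (fast[0] / standard[0], fast[1] / standard[1])


input_premium, output_premium = price_premium(FAST_MODE_OPUS, STANDARD_OPUS)

# Price multiple divided by speed multiple: values above 1 mean you pay
# more than proportionally for the extra speed.
premium_per_speedup = output_premium / FAST_MODE_SPEEDUP

print(f"input premium: {input_premium:.1f}x, output premium: {output_premium:.1f}x")
print(f"price multiple / speed multiple: {premium_per_speedup:.1f}")
```

Because the price multiple is well above the speed multiple, Fast Mode only pays off when waiting on output is itself the expensive part, as in interactive IDE sessions.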
Where Sonnet 4.6 is good enough
- PR reviews and isolated functions: the benchmark gap between Opus and Sonnet is small on single-task completions. The cost gap (roughly 1.9–2× after tokenizer uplift, see below) is not.
- High-volume pipelines: batch pricing for Opus 4.7 is $2.50/$12.50 per MTok vs. Sonnet 4.6’s $1.50/$7.50. At 10M input tokens per day that is $25/day vs. $15/day, a gap of about $300/month before the tokenizer uplift, and the gap is wider on output tokens ($12.50 vs. $7.50 per MTok).
- Web research and agentic search: Sonnet 4.6 outperforms Opus 4.7 on BrowseComp, and GPT-5.4 leads both. If web research is your primary use case, Opus 4.7 is the wrong call.
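To see how the batch-pricing gap scales, here is a small sketch that computes daily and monthly cost from the batch prices in the list above. It assumes 10M input tokens per day, no output tokens, and a 30-day month; the model keys are plain labels, not API identifiers.

```python
# Batch prices in USD per million tokens (input, output), from the list above.
BATCH_PRICES = {
    "opus-4.7": (2.50, 12.50),
    "sonnet-4.6": (1.50, 7.50),
}


def daily_cost(model: str, input_mtok: float, output_mtok: float = 0.0) -> float:
    """Daily batch cost in USD for the given millions of tokens per day."""
    input_price, output_price = BATCH_PRICES[model]
    return input_mtok * input_price + output_mtok * output_price


# Assumption: 10M input tokens per day, no output counted, 30-day month.
opus = daily_cost("opus-4.7", input_mtok=10)
sonnet = daily_cost("sonnet-4.6", input_mtok=10)
daily_gap = opus - sonnet
monthly_gap = daily_gap * 30

print(f"Opus ${opus:.2f}/day, Sonnet ${sonnet:.2f}/day")
print(f"gap: ${daily_gap:.2f}/day, ${monthly_gap:.2f}/month")
```

Passing a non-zero `output_mtok` shows how quickly the gap widens for output-heavy pipelines, since the per-token difference on output is five times the difference on input.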
The web research regression
HN thread #47793411 surfaced this clearly: developers using Opus 4.7 for search-backed agentic workflows are frustrated. The adaptive thinking system — which is supposed to decide when to apply extended reasoning — has a documented failure mode: it opts out of thinking when it should engage. User JamesSwift: “Its especially concerning / frustrating because boris’s reply to my bug report on opus being dummer was ‘we think adaptive thinking isnt working’ and then thats the last I heard of it.” User simonw: “I’m finding the ‘adaptive thinking’ thing very confusing, especially having written code against the previous thinking budget / thinking effort / etc modes.”
This isn’t a fringe bug. It affects workflows where the model needs to chain web lookups, synthesise results, and reason about relevance. If that describes your use case, BrowseComp (79.3% for Opus 4.7 vs. 89.3% for GPT-5.4) is the benchmark to watch.
The real cost: tokenizer uplift
Sticker price comparison: Opus 4.7 is $5/$25 per MTok (input/output); Sonnet 4.6 is $3/$15 per MTok. On paper, that is a 1.67× premium.
In practice it is higher. Opus 4.7 uses a new tokenizer that maps the same English input to up to 1.35× as many tokens, depending on content type. On typical developer workloads (English prose, code, JSON) the uplift is 1.12–1.18×. At $5/MTok with a 1.18× tokenizer multiplier, the same text costs about $5.90 per MTok measured in pre-uplift tokens. Against Sonnet 4.6’s $3/MTok, the effective cost ratio is closer to 1.97×, not 1.67×.
| Scenario | Sonnet 4.6 cost | Opus 4.7 cost | Ratio |
|---|---|---|---|
| 1M tokens input, standard pricing | $3.00 | $5.00 | 1.67× |
| 1M tokens input, with tokenizer uplift (1.18×) | $3.00 | $5.90 | 1.97× |
| 10M input tokens/day, batch pricing, with tokenizer uplift (1.18×) | $15/day | $29.50/day | 1.97× |
| Fast Mode, 1M output tokens | — | $150 | — |
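The effective-ratio figures in the table can be reproduced with the sketch below. It assumes, as the article does, that the tokenizer multiplier applies only to Opus 4.7 and that Sonnet 4.6 is the baseline; the function names are illustrative.

```python
OPUS_INPUT_PRICE = 5.00    # USD per MTok, Opus 4.7 standard input
SONNET_INPUT_PRICE = 3.00  # USD per MTok, Sonnet 4.6 standard input


def effective_price(price_per_mtok: float, token_multiplier: float) -> float:
    """Price per million baseline tokens once the same text needs more tokens."""
    return price_per_mtok * token_multiplier


def cost_ratio(token_multiplier: float) -> float:
    """Opus-to-Sonnet input cost ratio for the same text, rounded to 2 places."""
    return round(effective_price(OPUS_INPUT_PRICE, token_multiplier) / SONNET_INPUT_PRICE, 2)


# Multipliers quoted above: none, typical low, typical high, worst case.
for multiplier in (1.00, 1.12, 1.18, 1.35):
    price = effective_price(OPUS_INPUT_PRICE, multiplier)
    print(f"{multiplier:.2f}x tokens -> ${price:.2f}/MTok, ratio {cost_ratio(multiplier):.2f}x")
```

Running the range rather than a single multiplier shows that the "roughly 2×" figure sits at the top of the typical band, and that the quoted worst case pushes the ratio past 2×.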
Non-English codebases (Japanese, Korean) may see tokenizer cost reductions due to more efficient encoding — that partially offsets the uplift for international teams.
Verdict — pick matrix
| Use case | Model | Reason |
|---|---|---|
| Agentic tasks > 50 steps | Opus 4.7 | 12-point CursorBench lead (70% vs. 58%) |
| Codebase > 200k tokens | Opus 4.7 | Structural comprehension gap at context extremes |
| Production multi-repo refactors | Opus 4.7 | 3× task resolution (Rakuten) |
| IDE agent (Windsurf Fast Mode) | Opus 4.7 | 70% CursorBench, 2.5× faster output |
| PR reviews, isolated functions | Sonnet 4.6 | ~2× cheaper, negligible quality gap |
| High-volume pipelines | Sonnet 4.6 | Batch cost gap compounds at scale |
| Web research, agentic search | Sonnet 4.6 or GPT-5.4 | Opus 4.7 regressed on BrowseComp |
| Budget-first entry point | Haiku 4.5 | $1/$5 per MTok, 200k context |
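The pick matrix can also be written as a small decision function. The sketch below is one possible encoding: the `Workload` fields are hypothetical, the thresholds come from the table, and the order in which rules are checked is an assumption, since the table does not say which row wins when several apply.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of a coding workload; field names are illustrative."""
    agent_steps: int = 0               # typical tool-call chain length
    context_tokens: int = 0            # tokens the model must hold at once
    multi_repo_refactor: bool = False
    web_research: bool = False
    budget_first: bool = False


def pick_model(w: Workload) -> str:
    """Return the model the pick matrix suggests. Rule order is an assumption."""
    if w.web_research:
        return "Sonnet 4.6 or GPT-5.4"  # Opus 4.7 regressed on BrowseComp
    if w.agent_steps > 50 or w.context_tokens > 200_000 or w.multi_repo_refactor:
        return "Opus 4.7"
    if w.budget_first:
        return "Haiku 4.5"
    # Default row: PR reviews, isolated functions, high-volume pipelines.
    return "Sonnet 4.6"


print(pick_model(Workload(agent_steps=60)))
print(pick_model(Workload(context_tokens=250_000)))
print(pick_model(Workload(web_research=True, agent_steps=60)))
print(pick_model(Workload()))
```

Checking web research first reflects the article's view that the BrowseComp regression outweighs the agentic gains for search-heavy work; a team that weighs those differently would reorder the rules.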
Caveats
Tokenizer cost trap: budget for the 12–18% uplift when migrating prompts from Opus 4.6 to 4.7. A pipeline optimised to stay under a cost ceiling will blow it without the adjustment.
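A minimal budget check for that migration might look like the sketch below. It assumes the per-token price stays the same across the migration and only the token count changes; the pipeline volume and the $550 ceiling are hypothetical numbers chosen for illustration.

```python
def projected_cost(monthly_mtok: float, price_per_mtok: float, uplift: float) -> float:
    """Monthly cost in USD after the same prompts tokenize to `uplift` times as many tokens."""
    return monthly_mtok * uplift * price_per_mtok


def max_safe_volume(ceiling_usd: float, price_per_mtok: float, uplift: float) -> float:
    """Largest pre-migration volume (MTok/month) that stays under the ceiling after uplift."""
    return ceiling_usd / (price_per_mtok * uplift)


# Hypothetical pipeline: 100 MTok/month of input at $5/MTok with a $550 ceiling.
CEILING = 550.0
before = projected_cost(100, 5.00, uplift=1.00)
after = projected_cost(100, 5.00, uplift=1.18)
safe_volume = max_safe_volume(CEILING, 5.00, 1.18)

print(f"before: ${before:.2f}, after: ${after:.2f}, ceiling: ${CEILING:.2f}")
print(f"over budget after migration: {after > CEILING}")
print(f"max safe pre-migration volume: {safe_volume:.1f} MTok/month")
```

In this example a pipeline that sat comfortably under its ceiling ends up over it after the uplift, which is the trap the caveat describes. Using the top of the 12–18% band gives a conservative estimate.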
Adaptive thinking: the HN-documented failure mode (chooses not to think when it should) is real and under-documented by Anthropic. Test your specific agentic chain before committing to Opus 4.7 for reasoning-heavy workflows.
Affiliate disclosure: Anthropic has no affiliate program. The Windsurf link above is an affiliate link; the Anthropic pricing link is not.
Benchmarks are Anthropic-reported unless noted: CursorBench (70% vs. 58%) and Rakuten (3× task resolution) are from the Anthropic release announcement. SWE-bench Verified, SWE-bench Pro, and BrowseComp figures are from Vellum AI’s third-party benchmark analysis (linked in references).
References
- Anthropic: Introducing Claude Opus 4.7 — official benchmarks, feature list, release context
- Anthropic API Pricing — verified token prices, batch prices, Fast Mode pricing (fetched 2026-05-16)
- Anthropic Models Overview — context window sizes, max output tokens, API IDs
- Vellum AI: Claude Opus 4.7 Benchmarks Explained — SWE-bench Verified 87.6%, SWE-bench Pro 64.3%
- HN #47793411: Introducing Claude Opus 4.7 — developer signals, adaptive thinking issues (April–May 2026)
- BuildFastWithAI: Claude Opus 4.7 Full Review — benchmark table cross-reference