Codex CLI in 2026: OpenAI's Terminal Play, Reviewed

Codex CLI leads Claude Code by 13 points on Terminal-Bench 2.0 and burns 4× fewer tokens. Trails by 5.7 points on SWE-bench Pro. Here is who should use it.

By Ethan

OpenAI’s Codex CLI is the right tool for fire-and-forget terminal work: async PR generation, DevOps scripting, batch code review. On Terminal-Bench 2.0, GPT-5.5 scores 82.7% against Claude Opus 4.7’s 69.4% — a 13-point gap that is not close. It also burns 3–4× fewer tokens per equivalent task. Where Codex trails is complex, multi-file refactoring: on SWE-bench Pro, Claude leads 64.3% to 58.6%. Codex is a capable autonomous worker. Claude Code is still the better pair programmer.

Who Should Use Codex CLI

Terminal-first developers who want to delegate tasks and come back to a finished PR — DevOps engineers, backend teams with predictable scripting workloads, anyone running batch automation across multiple repos. If you need the agent alongside you through a difficult refactor, iterating on your feedback, Claude Code is a better fit.

This review covers Codex CLI (local, open-source, v0.130.0, released 2026-05-08). It does not cover the Codex cloud sandbox in ChatGPT — that product runs in OpenAI’s infrastructure and has different performance characteristics. CLI vs. cloud is not a footnote; they are separate tools.

What we tested

No controlled first-party test workload ran for this article. The benchmark figures below come from OpenAI’s own evaluations and third-party comparators (cited per claim). We flag this explicitly where it matters. Version pinned to CLI v0.130.0 with GPT-5.5 as the primary model; GPT-5.3-Codex results noted where available.

Findings

Task scope — what Codex is built for

Codex CLI is an async agent. The interaction model is: describe the task, detach, return to a finished PR. That is its design philosophy and where it consistently outperforms alternatives.

On Terminal-Bench 2.0, a benchmark of CLI and DevOps tasks, GPT-5.5 scores 82.7%, GPT-5.3-Codex scores 77.3%, and Claude Opus 4.7 scores 69.4%. That puts GPT-5.5 13.3 points ahead of Claude and GPT-5.3-Codex 7.9 points ahead. Two OpenAI models clearing the same competitor points to a structural advantage, not a one-run fluke.

On SWE-bench Verified (human-validated GitHub issue resolution), GPT-5.5 scores 88.7% versus Claude Opus 4.7’s 87.6%. Codex leads by 1.1 points here, on its “home turf” of isolated task resolution.

On SWE-bench Pro — real-world tasks with messier codebases and less well-defined success criteria — the order flips: Claude Opus 4.7 64.3%, GPT-5.5 58.6%. A 5.7-point gap. Community reports from HN align with this: users describe Codex reasoning about “server backends and REST APIs in an app that doesn’t have any of that” on unfamiliar codebases.

The pattern: Codex knows its territory. Clean terminal tasks, well-specified scripts, PR generation from a brief — it delivers. Complex refactoring with accumulated context, incremental debugging, code review in a messy monorepo — Claude Code’s deeper repo comprehension shows. See Claude Code vs Codex CLI for a direct benchmark-by-benchmark comparison.

Parallel execution is a genuine strength: 8 subagents simultaneously, configurable per permission profile. For teams running batch code review across a repo, or generating boilerplate for multiple services at once, this compounds the token efficiency advantage.

Speed and token efficiency

The Figma-clone comparison is the clearest single data point: Claude Code consumed 6.23M tokens on a task where Codex used 1.5M, roughly a 4× difference. At $1.75/1M input tokens for GPT-5.5 versus $5/1M for Claude Opus 4.7, that compounds to roughly a 12× cost difference per task at the API tier (about $31.15 versus $2.63, if every token is billed at the input rate). For a full TCO analysis across toolchain configurations, see The real cost of running an AI agent team in 2026.
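As a rough check on the per-task cost arithmetic, the sketch below recomputes the ratio from the token counts and input prices quoted above. It assumes every token is billed at the input rate, which is a simplification: real runs mix input, cached, and output tokens at different prices. The function and variable names are illustrative only.

```python
# Token counts and input prices as quoted in the Figma-clone comparison.
CLAUDE_TOKENS = 6_230_000
CODEX_TOKENS = 1_500_000
CLAUDE_USD_PER_M = 5.00   # Claude Opus 4.7 input price per 1M tokens
CODEX_USD_PER_M = 1.75    # GPT-5.5 input price per 1M tokens


def task_cost(tokens: int, usd_per_million: float) -> float:
    """Cost in USD, ASSUMING every token is billed at the input rate."""
    return tokens / 1_000_000 * usd_per_million


claude_cost = task_cost(CLAUDE_TOKENS, CLAUDE_USD_PER_M)
codex_cost = task_cost(CODEX_TOKENS, CODEX_USD_PER_M)

token_ratio = CLAUDE_TOKENS / CODEX_TOKENS   # how many times more tokens
cost_ratio = claude_cost / codex_cost        # how many times more dollars

print(f"Claude Code: ${claude_cost:.3f}  Codex: ${codex_cost:.3f}")
print(f"Token ratio: {token_ratio:.2f}x  Cost ratio: {cost_ratio:.2f}x")
```

Under these assumptions the cost ratio lands near 12×. A different input/output mix, or prompt caching, would shift it in either direction.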

GPT-5.3-Codex is 25% faster than its predecessor model. GPT-5.3-Codex-Spark (research preview) breaks 1,000 tokens/second. For teams running high-volume automation, these numbers affect throughput materially.

Codex also runs lighter on your machine: approximately 80MB RAM footprint versus Claude Code’s multi-gigabyte requirement. On a dev box that’s noise. On a constrained CI runner or a VM with limited memory, it matters.

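Usage windows on the subscription tier roll over five hours rather than resetting daily. If your workload is bursty (a sprint of automation in the morning and nothing for the rest of the day), the rolling window can cut against you: the burst hits the window's cap while the capacity of later windows goes unused.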
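To make the rolling-window effect concrete, here is a minimal sketch of a sliding-window quota. The per-window limit of 4 tasks and the task-count unit are hypothetical; the article does not state the actual limits or how OpenAI meters usage, so treat this as an illustration of the mechanism, not of Codex's real accounting.

```python
from collections import deque

WINDOW_MINUTES = 5 * 60  # 5-hour rolling window, as described above


def accepted_tasks(start_minutes, limit, window=WINDOW_MINUTES):
    """Return the start times a rolling-window quota would accept.

    `limit` is a HYPOTHETICAL number of tasks allowed per window.
    """
    in_window = deque()  # start times of accepted tasks still inside the window
    accepted = []
    for t in sorted(start_minutes):
        # Drop tasks that have aged out of the rolling window.
        while in_window and t - in_window[0] >= window:
            in_window.popleft()
        if len(in_window) < limit:
            in_window.append(t)
            accepted.append(t)
    return accepted


# Hypothetical morning burst: 10 tasks from 09:00 (minute 540), six minutes
# apart, followed by a single task at 14:00 (minute 840).
burst = [540 + 6 * i for i in range(10)] + [840]
ok = accepted_tasks(burst, limit=4)
print(f"{len(ok)} of {len(burst)} tasks accepted")
```

In this made-up scenario only the first four morning tasks fit; the rest of the burst is rejected, and capacity only frees up once the earliest task ages out five hours later.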

Context and repo scale

GPT-5.4 (March 2026) introduced a 1M token context window. GPT-5.1-Codex-Max adds “compaction” — rolling up prior context windows to enable coherent work across millions of tokens, documented for 24+ hour autonomous runs.

The caveat is cost. Prompts above 272K tokens on the GPT-5.5 API trigger a 2× input surcharge and 1.5× output surcharge. Claude Code’s 1M context has no equivalent penalty threshold. On a large repo where prompts routinely exceed 272K tokens, calculate your actual spend before assuming Codex is cheaper.
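A small sketch of how that threshold changes per-request spend follows. It makes two assumptions that the article does not confirm: that the 2× and 1.5× figures are multipliers on the base rates, and that they apply to the whole request once the prompt exceeds 272K tokens. The output price used in the example is a placeholder, not a quoted figure.

```python
LONG_PROMPT_THRESHOLD = 272_000  # tokens, from the paragraph above
INPUT_MULTIPLIER = 2.0           # ASSUMED to apply to the whole request
OUTPUT_MULTIPLIER = 1.5          # ASSUMED to apply to the whole request


def request_cost(prompt_tokens, output_tokens, input_usd_per_m, output_usd_per_m):
    """Estimated USD cost of one request under the assumptions above."""
    over = prompt_tokens > LONG_PROMPT_THRESHOLD
    in_rate = input_usd_per_m * (INPUT_MULTIPLIER if over else 1.0)
    out_rate = output_usd_per_m * (OUTPUT_MULTIPLIER if over else 1.0)
    return (prompt_tokens * in_rate + output_tokens * out_rate) / 1_000_000


INPUT_PRICE = 1.75    # GPT-5.5 input price per 1M tokens, from this article
OUTPUT_PRICE = 10.00  # HYPOTHETICAL placeholder; substitute the current rate

below = request_cost(250_000, 5_000, INPUT_PRICE, OUTPUT_PRICE)
above = request_cost(300_000, 5_000, INPUT_PRICE, OUTPUT_PRICE)
print(f"250K prompt: ${below:.4f}  300K prompt: ${above:.4f}")
```

With these placeholder numbers, a prompt that is 20% longer costs more than twice as much once it crosses the threshold, which is why prompt size matters more than model price on large repos.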

For long-horizon autonomous tasks — feature development across multiple sessions, large-scale migration — GPT-5.1-Codex-Max is the model to evaluate. The compaction approach is architecturally different from Claude’s extended context; whether it degrades gracefully on your specific workload requires testing.

Diff quality and review burden

Auto-review mode (GA May 2026) adds a subagent that auto-approves low-risk shell commands. Approval gates are configurable per permission profile. This meaningfully reduces the click-through friction on long async runs.

The community picture is mixed. Fast release cadence means features break between versions; the HN thread documents default-model errors at launch. Hallucination risk on unfamiliar codebases is higher than with Claude Code. The tradeoff is speed and throughput: Codex ships fast, iterates fast, and sometimes breaks things between releases.

For teams that will review every diff before merge — which you should — the review burden is manageable. For teams expecting Codex to auto-merge without oversight, the hallucination rate on complex code is not low enough to warrant it.

Guardrails and safe-stop behavior

Sandbox boundaries cover write limits, network policy, and protected paths. The approval policy distinguishes low-risk commands (auto-approved) from dangerous commands (confirmation required or blocked). Enterprise tier adds OpenTelemetry logging, compliance logs, SAML SSO, RBAC, and data residency.

Hooks reached GA in May 2026 alongside lifecycle compaction support. For teams that need to wire Codex into a CI/CD pipeline with specific pre- and post-run logic, this is the integration path.

The security documentation at developers.openai.com/codex/security is detailed. For regulated environments, the enterprise compliance feature set is real — not just a checkbox.

Verdict

Use Codex CLI if your primary use case is terminal scripting, DevOps automation, or fire-and-forget PR generation. The Terminal-Bench lead is structural. The token efficiency advantage is real and compounds at scale. The async model is a genuine fit for delegated workloads.

Use Claude Code if you’re doing complex multi-file refactoring, need deep context over a large or unfamiliar codebase, or want an agent that holds up through iterative back-and-forth. The SWE-bench Pro gap (5.7 points) is meaningful on real-world tasks.

Pairing both is a defensible strategy: Codex for batch and DevOps, Claude Code for the hard problems. At similar entry pricing ($20/month each), the marginal cost of a second tool is low if your workload splits cleanly.

Caveats

No first-party testing. Every benchmark number in this article is OpenAI’s own evaluation or a third-party comparator. We ran no controlled test workload. The figures are the best available public data; they are not independent validation.

Model identity. It is not publicly confirmed whether “GPT-5.5 in Codex” is the base GPT-5.5 or a fine-tune (codex-1 lineage). OpenAI has not clarified. The benchmark scores are attributed to GPT-5.5 in the comparators we cited; interpret accordingly.

Pro 2× promo. The Pro 5× plan at $100/month includes a 2× multiplier through May 31, 2026. If you’re reading this in June or later, verify current pricing at developers.openai.com/codex/pricing.

CLI vs. cloud. This review covers the local CLI only. The cloud Codex in ChatGPT runs in an OpenAI sandbox and has a different execution model, different latency profile, and different performance characteristics. Benchmark scores from cloud Codex do not transfer cleanly to CLI.

No affiliate relationship. OpenAI has no public affiliate program for Codex or ChatGPT as of May 2026.

References

Claim | Source
CLI v0.130.0, Apache-licensed, Rust | github.com/openai/codex
Terminal-Bench 2.0 scores | morphllm.com/comparisons/codex-vs-claude-code
SWE-bench Pro scores | morphllm.com/comparisons/codex-vs-claude-code
SWE-bench Verified scores | morphllm.com/comparisons/codex-vs-claude-code
Figma-clone token comparison | morphllm.com/comparisons/codex-vs-claude-code
GPT-5.3-Codex speed benchmarks | neowin.net (GPT-5.3-Codex debut)
Codex pricing | developers.openai.com/codex/pricing
Community reports (hallucination, loop) | news.ycombinator.com/item?id=43708025
Agent approvals / sandbox | developers.openai.com/codex/agent-approvals-security
Adoption statistics | gradually.ai/en/codex-statistics
2021 vs 2025 Codex history | aiwiki.ai/wiki/codex
Affiliate status | seofai.com/openai-affiliate-program