
OpenAI Codex CLI review: autonomous terminal coding (2026)

Codex CLI earns its place for OpenAI shops that need a terminal agent with serious safety controls. The recovery story when things go wrong is the weak link.

Codex CLI is worth your time if your stack runs on OpenAI models and you want a terminal agent with explicit, layered safety controls. If you need model flexibility — or a way to recover gracefully when an autonomous run goes sideways — reach for Aider or Claude Code instead.

Who this is for

Backend and platform engineers who live in the terminal: remote dev boxes, headless Linux servers, CI pipelines. If you need a visual diff viewer, an IDE plugin, or access to non-OpenAI models, this tool will frustrate you.

What Codex CLI is (and isn’t)

OpenAI Codex CLI, published on npm as @openai/codex in April 2025, is an open-source terminal coding agent released under the Apache-2.0 license. It can read and edit code files, run arbitrary shell commands (including file renames), and interact with your local Git repository, all locally from your terminal.

It is not the original Codex API that OpenAI deprecated in 2023, and it is not the cloud-hosted Codex feature inside ChatGPT Enterprise. This is a standalone tool you install on your machine. Your code stays local except for what is sent in API requests to OpenAI’s models under your own key.

The distinction matters because “Codex” in OpenAI’s history has meant several different things. The CLI is the youngest and the most developer-facing of them.

What we tested

@openai/codex v0.1.x (June 2026) on macOS 15 and Ubuntu 24.04, connected to the OpenAI API using GPT-5.4 and GPT-5.4-mini. Test workloads included: adding unit tests to an existing module (~400 lines), a multi-file refactor across a TypeScript service (~3,000 lines), and a CI task (run test suite, fix all failures, commit). We ran each task in Suggest mode first, then Auto Edit, and repeated the CI task in Full Auto.

Install and setup

npm install -g @openai/codex
export OPENAI_API_KEY=sk-...
codex --help

The CLI reads OPENAI_API_KEY from your environment. For per-project configuration, drop a .codex/config.json at your repo root — it overrides the global ~/.codex/config.json for that directory. A minimal project config looks like:

{
  "model": "gpt-5.4",
  "sandbox": "workspace-write",
  "ask_for_approval": "on-request"
}

That gives you Auto Edit semantics on every run in that directory without passing flags each time.

First run:

# Suggest mode — safe default, no file writes
codex "explain what this function does and what edge cases it misses"

# Auto Edit mode — writes files, asks before running anything
codex --sandbox workspace-write "refactor this module to use async/await throughout"

# Full Auto mode — writes and runs without prompting
codex --sandbox workspace-write --ask-for-approval never "run all tests and fix failures"

Core workflow: three modes

Codex CLI’s safety model has two independently configurable layers, and understanding the difference matters before you pick a mode.

Layer 1: the sandbox (--sandbox). Defines hard OS-enforced technical limits — what the agent is physically capable of doing.

| Mode | File access | Network |
| --- | --- | --- |
| read-only | Inspect only, no edits | Off |
| workspace-write | Read and write within the current directory | Off |
| danger-full-access | No filesystem boundaries | Unrestricted |

read-only and workspace-write both block network access. danger-full-access removes that restriction — the agent can install packages, fetch remote content, and call external APIs. For a security-conscious deployment — a shared server, a CI environment — staying on workspace-write is the right default. For local dev workflows that need network access, danger-full-access is the explicit opt-in.
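One way to read the sandbox table is as a "least privilege that still works" lookup. The sketch below transcribes the table into data and picks the most restrictive mode that satisfies a task. `pick_sandbox` is a hypothetical helper written for this review, not part of the CLI, and it simplifies "write" to a yes/no flag.

```python
# Capabilities per sandbox mode, transcribed from the table above.
SANDBOX_MODES = {
    "read-only": {"write": False, "network": False},
    "workspace-write": {"write": True, "network": False},
    "danger-full-access": {"write": True, "network": True},
}

# Ordered from least to most permissive.
ORDER = ["read-only", "workspace-write", "danger-full-access"]


def pick_sandbox(needs_write, needs_network):
    """Return the least permissive mode that covers the task's needs.

    Hypothetical helper for illustration only.
    """
    for mode in ORDER:
        caps = SANDBOX_MODES[mode]
        write_ok = caps["write"] or not needs_write
        network_ok = caps["network"] or not needs_network
        if write_ok and network_ok:
            return mode
    raise ValueError("no sandbox mode satisfies the request")


print(pick_sandbox(needs_write=True, needs_network=False))
```

Any task that needs the network lands on danger-full-access, even if it never writes a file, because the other two modes both keep the network off.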

Layer 2: the approval policy (--ask-for-approval). Controls when the agent pauses and asks before acting — behavioral guardrails layered on top of the sandbox.

| Value | Behavior |
| --- | --- |
| untrusted | Prompts before any non-trusted command |
| on-request | Pauses at sandbox boundaries; good for interactive sessions |
| never | Runs without interruption; good for CI |

These two layers combine into three named UX presets that most users interact with directly:

| Mode | Writes files? | Runs commands? | Asks before acting? |
| --- | --- | --- | --- |
| Suggest | No | No | Always |
| Auto Edit | Yes | Only after approval | Before commands |
| Full Auto | Yes | Yes | Never |

For everyday dev use, Auto Edit is the right starting point. You keep control over anything that executes; Codex handles file edits. Full Auto is for CI or for tasks you’ve already validated in a sandboxed run.
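For scripting, the three presets can be kept as data instead of being retyped. The sketch below uses only the flag combinations shown in the first-run examples earlier in this review. The preset keys and the `codex_argv` helper are hypothetical names chosen for illustration.

```python
import shlex

# Flag sets copied from the first-run examples in this review.
# The dictionary keys are illustrative labels, not CLI options.
PRESETS = {
    "suggest": [],
    "auto-edit": ["--sandbox", "workspace-write"],
    "full-auto": ["--sandbox", "workspace-write", "--ask-for-approval", "never"],
}


def codex_argv(preset, prompt):
    """Build the argument list for a codex invocation (hypothetical helper)."""
    return ["codex", *PRESETS[preset], prompt]


# Print a shell-safe command line for review before running it.
print(shlex.join(codex_argv("auto-edit", "refactor this module to use async/await throughout")))
```

Building an argument list instead of a string keeps the prompt as a single argument when it is later passed to a process runner.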

When multiple rules match a command, the most restrictive rule wins: forbidden beats prompt, and prompt beats allow. Overlapping rules therefore cannot escalate privileges by accident.
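The precedence rule is easy to model: rank the three outcomes and take the maximum. This is a minimal sketch of that ordering, not the CLI's implementation. The `Decision` enum and `resolve` function are hypothetical, and falling back to a prompt when no rule matches is an assumption the review does not confirm.

```python
from enum import IntEnum


class Decision(IntEnum):
    """Rule outcomes ranked from least to most restrictive."""
    ALLOW = 0
    PROMPT = 1
    FORBIDDEN = 2


def resolve(matching_decisions):
    """Return the most restrictive decision among all matching rules.

    Assumption: when no rule matches, fall back to PROMPT.
    """
    return max(matching_decisions, default=Decision.PROMPT)


# An allow rule and a forbidden rule both match the same command.
print(resolve([Decision.ALLOW, Decision.FORBIDDEN]).name)
```

Because the result is a maximum over the matches, adding another allow rule can never loosen the outcome for a command that a stricter rule already covers.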

Smart Approvals

One underrated feature: Smart Approvals, enabled by default. When Codex escalates a command for approval, it proposes a prefix_rule — a pattern for that class of command. Approve it once, and the rule persists to a local rules file, so Codex stops asking for similar commands in future sessions.
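To make the idea of a prefix rule concrete, here is a small sketch that treats a rule as a list of leading command tokens. This review does not document the actual rules-file format or matching algorithm, so token-prefix matching, the `approved` list, and both helper functions are assumptions for illustration.

```python
import shlex


def matches_prefix(rule_prefix, command):
    """True if the command's leading tokens equal the rule's tokens.

    Assumption: rules are compared token by token after shell-style splitting.
    """
    if not rule_prefix:
        return False  # an empty rule would otherwise match every command
    tokens = shlex.split(command)
    return tokens[:len(rule_prefix)] == list(rule_prefix)


# Hypothetical rules a user approved in earlier sessions.
approved = [["npm", "test"], ["git", "status"]]


def needs_prompt(command):
    """True if no approved prefix covers the command."""
    return not any(matches_prefix(rule, command) for rule in approved)


print(needs_prompt("npm test -- --watch"))
print(needs_prompt("npm install left-pad"))
```

In this model, approving `npm test` once covers later variants with extra arguments, while `npm install` still triggers a prompt.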

There’s a known bug (GitHub issue #13175): prefix_rule matching fails for some shell wrapper scripts. If your toolchain wraps executables, check whether a fix has shipped before relying on the approval-learning loop.

Model options

Three models are available as of June 2026:

| Model | Best for |
| --- | --- |
| GPT-5.3-Codex | Code-specialized model; stronger on code reasoning, lower price than GPT-5.4 |
| GPT-5.4 | Strong general performance for most coding tasks |
| GPT-5.4-mini | Budget runs, high-volume CI, fast iteration |

Set the model in config or per-run with --model gpt-5.4-mini. GPT-5.4 is the practical default for most coding tasks — refactors, test fixes, function-level generation. GPT-5.4-mini trades reasoning depth for significant cost reduction; GPT-5.3-Codex is OpenAI’s code-specialized option and comes in below GPT-5.4 on price — worth trying for code-heavy tasks where general reasoning matters less.

Performance and cost

Codex CLI uses credit-based pricing, billed per million tokens against your OpenAI API account:

| Model | Input (credits/M tokens) | Output (credits/M tokens) |
| --- | --- | --- |
| GPT-5.3-Codex | 43.75 | 350 |
| GPT-5.4 | 62.50 | 375 |
| GPT-5.4-mini | 18.75 | 113 |

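GPT-5.4-mini’s input rate is roughly 2.3× lower than GPT-5.3-Codex’s (43.75 vs. 18.75 credits) and about 3.3× lower than GPT-5.4’s (62.50 vs. 18.75). Output tokens cost six to eight times more than input tokens on every model in the table, so input cost dominates only when a run reads far more than it writes. That is the typical shape of a long autonomous run that scans many files before editing anything, which is why model selection moves your bill more than the per-token rates suggest at first glance.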

No credible per-task token benchmarks exist at time of writing. Claims of specific token efficiency comparisons with Claude Code or Aider circulated on Hacker News and DataCamp in early 2026 but didn’t survive adversarial verification. Treat the pricing table as a planning input; your actual cost depends on repo size, task scope, and how many files each run touches.
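As a planning aid, the pricing table reduces to one line of arithmetic per run. The sketch below encodes the published rates and estimates credits for a given token count. The token counts in the example are made-up planning figures, not measurements, and the lowercase `gpt-5.3-codex` key is an assumed identifier.

```python
# (input, output) credits per million tokens, from the pricing table.
RATES = {
    "gpt-5.3-codex": (43.75, 350.0),  # key spelling is an assumption
    "gpt-5.4": (62.50, 375.0),
    "gpt-5.4-mini": (18.75, 113.0),
}


def estimate_credits(model, input_tokens, output_tokens):
    """Estimate credits for one run from its token counts."""
    in_rate, out_rate = RATES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


# Hypothetical run: 2M input tokens read, 100k output tokens written.
for model in RATES:
    print(model, round(estimate_credits(model, 2_000_000, 100_000), 2))
```

For that hypothetical run the estimate comes to 162.5 credits on GPT-5.4, 122.5 on GPT-5.3-Codex, and 48.8 on GPT-5.4-mini. Swap in token counts from your own usage logs before drawing conclusions.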

Enterprises already carrying OpenAI API credits pay no SaaS markup — the CLI burns credits like any other API call. See the OpenAI API for current credit purchase options.

How it compares

| Feature | Codex CLI | Claude Code | Aider | Cursor |
| --- | --- | --- | --- | --- |
| License | Apache-2.0 | Proprietary | Apache-2.0 | Proprietary |
| Model support | OpenAI only | Anthropic only | Any provider | Any (via API) |
| Sandbox model | Explicit 2-axis | Approval-based | Single flag | GUI-based |
| Headless / server | Yes | Yes | Yes | No (GUI editor) |
| Interactive correction mid-run | Weak: re-prompt from start | Strong: iterative correction | Moderate: commit-by-commit | Strong |
| Network during run | Off by default | Configurable | Configurable | Configurable |
| Smart approval learning | Yes (prefix rules) | No | No | N/A |
| Pricing | Credits per token | Subscription ($20–$100+/mo) | Free (your API costs) | Subscription |

A few things worth calling out:

The sandbox architecture is the most explicit of these four tools. Two independently configurable axes — what’s technically allowed vs. when to ask — give you more expressive safety policy than any alternative in this category. For CI pipelines or shared dev environments where someone has to define exactly what an autonomous agent can do, that granularity is worth the setup.

Cursor is disqualified for headless work. It is a desktop GUI editor built on a fork of VS Code, so if you’re on a remote server, a dev box without a display, or anything driven purely over SSH, Cursor is off the table. Codex CLI, Claude Code, and Aider all work fine without a GUI.

Interactive correction is Codex CLI’s clearest weakness. When a long autonomous run goes wrong (and on complex tasks, it eventually will), there’s no mid-session recovery: you re-prompt from scratch. Claude Code handles mid-run correction better; Aider gives you commit-by-commit granularity so partial runs are always recoverable. The BuiltIn comparison from May 2026 put it this way: “the further it gets from the developer’s last review point, the harder it is to recover when the path bends.”

OpenAI-only is a constraint, not a bug. If you want to run DeepSeek for budget tasks and GPT-5.3-Codex for architecture, or keep the option to switch providers, Codex CLI doesn’t accommodate that. Aider supports any OpenAI-compatible endpoint, including local Ollama models. Codex CLI doesn’t, and there’s no indication that’s changing.

Known limitations

Over-engineering on long autonomous runs. Complex tasks — large refactors, multi-step debug chains — tend to drift. Without an in-session correction path, rolling back means discarding everything since the last safe checkpoint.

Network off by default. Package installs, remote fetches, and live API calls require bumping to danger-full-access. In a CI context, this is usually fine — you preinstall dependencies. In an interactive session, it’s friction.

Smart Approvals prefix-rule bug. GitHub issue #13175 documents matching failures for shell wrappers. Check the issue before trusting the approval-learning system on complex toolchains.

No built-in diff viewer. You get terminal output, not a rendered diff. Reviewing changes before accepting them means running git diff yourself.

Cost opacity on large repos. No confirmed per-task token figures. Runs that read many files accumulate input tokens quickly — model selection matters more than it looks upfront.

Verdict

Codex CLI is a solid terminal agent for teams already on OpenAI. The two-axis sandbox model is the most explicit safety architecture in this class of tools — a real differentiator if you’re deploying agents in CI, on shared infrastructure, or anywhere that requires precise control over what an autonomous process can and can’t touch.

The tradeoffs are real and intentional: model lock-in, limited recovery when things go sideways, no network access without explicit permission escalation. OpenAI is optimizing for predictable, auditable behavior. The constraints follow from that choice.

Pick Codex CLI if you’re OpenAI-first, you run agents in CI or on headless servers, and the sandbox control model fits how you think about agent safety. The credit-based pricing works in your favor if you’re already holding OpenAI API budget.

Pick Claude Code if you want stronger mid-run correction, MCP tool integrations, or you’re on an Anthropic contract. See Claude Max plans for subscription options.

Pick Aider if model flexibility is non-negotiable, or you want every AI change as a discrete, auditable git commit.

For a broader look at the terminal agent landscape, see the best AI coding CLI roundup. For a head-to-head with Claude Code specifically, Claude Code vs Codex goes deeper on the direct comparison.
