Playwright vs Stagehand: Which to Use for Browser Automation
Playwright wins for stable UIs — Stagehand earns its cost when selectors rot on third-party pages, AI-generated layouts, or fast-changing components.
By Ethan
Start with Playwright. Reach for Stagehand only when the UI is genuinely unpredictable — third-party pages, rapidly changing components, LLM-generated layouts, or tasks where describing intent in plain English is clearer than hunting for the right CSS selector.
Who this is for
TypeScript developers writing end-to-end tests or browser automation scripts. If you’re not choosing between a CSS selector and a natural-language prompt, this comparison won’t help you.
What we tested
Playwright v1.60.0 (released 2026-05-11), bundled browsers: Chromium 148.0.7778.96, Firefox 150.0.2, WebKit 26.4.
Stagehand v3.5.0 (@browserbasehq/stagehand, released 2026-06-03), running on Browserbase, configured with claude-sonnet-4-6 as the backend.
Three representative tasks, chosen because they expose different failure modes:
- Deterministic form fill — a stable internal login + multi-field form with known selectors.
- Dynamic content scrape — an external product listing page where classes change on each deploy.
- Multi-step auth flow — OAuth redirect chain across two domains, with one unexpected popup mid-flow.
Playwright vs Stagehand: head-to-head findings
Task 1: deterministic form fill
Playwright handled this in the eight-line test below. You write it once, it runs identically every time, and the selector either works or it doesn’t; when it doesn’t, the error message tells you exactly which locator failed.
```typescript
import { test, expect } from '@playwright/test';

test('form fill', async ({ page }) => {
  await page.goto('https://app.internal/login');
  // Placeholder credentials.
  await page.getByLabel('Email').fill('user@example.com');
  await page.getByLabel('Password').fill('hunter2');
  await page.getByRole('button', { name: 'Sign in' }).click();
  // Auto-retrying assertion: passes once the dashboard text is visible.
  await expect(page.getByText('Dashboard')).toBeVisible();
});
```
Stagehand also completed the task, but the first run in CUA (computer-use agent) mode consumed roughly 50,000 LLM tokens and took 20–30 seconds while the model reasoned through the page structure. Cache replay brings that down to 0 tokens and ~2–3 seconds, but only after the first run, and only if nothing on the page changed.
For stable, internal UIs, that overhead is hard to justify. The selector you write in Playwright for an internal app isn’t going to rot.
Task 2: dynamic content scrape
This is where the calculus flips. The external product listing page changes class names on every deploy; any hard-coded selector breaks within a week.
Playwright can handle this with getByRole, getByText, or aria attributes, but when those aren’t present — old HTML, custom components, no semantic markup — you’re writing fragile XPath and scheduling maintenance time alongside it.
Stagehand v3.5.0 added screenshot: true on extract(), which routes the task through a multimodal LLM that reads the rendered page visually:
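```typescript
import { z } from 'zod';

// Assumes `stagehand` is an already-initialized Stagehand instance.
const { productName, price } = await stagehand.page.extract({
  instruction: 'Extract the first product name and price from the listing',
  schema: z.object({
    productName: z.string(),
    price: z.string(),
  }),
  screenshot: true, // read the rendered page visually instead of relying on the DOM
});
```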
No selectors. The extraction held through three consecutive deploys that broke two Playwright alternatives. For genuinely unpredictable external pages, this is real value.
Task 3: multi-step auth flow
The OAuth flow had a gotcha: an unexpected cookie consent popup mid-redirect that a headless browser surfaces differently than a headed one. Playwright caught it and offered clear failure output pointing to the popup overlay. Fixing it took a conditional wait.
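A conditional wait of this kind can be sketched roughly as follows. This is a minimal illustration, not the exact fix used in the test: `LocatorLike` is a hypothetical structural stand-in for the two Playwright `Locator` methods the helper relies on (`waitFor` and `click`), and the 2-second window is an arbitrary assumption.

```typescript
// Hypothetical minimal type covering the two Locator methods this sketch uses.
// Playwright's Locator satisfies it; declaring it here keeps the helper
// self-contained.
interface LocatorLike {
  waitFor(options?: { state?: 'visible'; timeout?: number }): Promise<void>;
  click(): Promise<void>;
}

// Wait briefly for an optional popup and dismiss it if it shows up.
// Returns true when the popup appeared and was clicked away.
async function dismissIfPresent(button: LocatorLike, timeoutMs = 2000): Promise<boolean> {
  try {
    await button.waitFor({ state: 'visible', timeout: timeoutMs });
  } catch {
    // Treated as "popup never appeared". Note this also swallows other
    // waitFor errors, which is acceptable for a sketch but coarse.
    return false;
  }
  await button.click();
  return true;
}
```

In a test this might be called as `await dismissIfPresent(page.getByRole('button', { name: 'Accept all' }))`, where the button name is a hypothetical label for the consent dialog.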
Stagehand completed the same flow, but we hit Issue #1635 (as of May 2026, status in v3.5.0 unknown): screenshot timing failure in CUA mode. A stale screenshot caused the LLM to believe its previous action had no effect, so it repeated the same click — the flow still completed, but required two extra round trips. Issue #1558 (action caching silently skipping custom tool calls on replay, fix PR #1562 unmerged as of 2026-05-11) didn’t surface in our specific test, but it’s relevant for any CUA agent that uses custom tools.
Neither bug is catastrophic. Both are bugs you’d discover in QA, not production. But they’re worth knowing before framing CUA mode as production-ready.
Cost breakdown
Playwright is open-source (Apache 2.0) and free. You pay for CI runner time; the browsers themselves cost nothing.
Stagehand runs on Browserbase. Pricing as of 2026-06-08:
| Tier | Price | Browser hours/month | Session limit |
|---|---|---|---|
| Free | $0 | 1 hour | 15 min max |
| Developer | $20/month | 100 hours | — |
| Startup | $99/month | 500 hours | — |
| Scale | Custom | Flexible | — |
The free tier is unsuitable for anything beyond exploratory use. A CI suite that runs 200 tests daily, each averaging 90 seconds, uses 5 browser hours a day, or about 150 hours over a 30-day month. That exhausts Developer’s 100 hours in roughly 20 days. Startup’s 500 hours covers the same load with headroom; the suite would have to more than triple before you’re in Scale territory or rethinking your test strategy. For a broader picture of what AI agent infrastructure costs look like at scale, see The real cost of running an AI agent team in 2026.
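The browser-hour arithmetic for that workload can be worked through in a few lines. This is a rough sketch that assumes browser hours are simply the sum of session durations and that each test runs in its own session; the function names are illustrative.

```typescript
// Monthly browser-hour allowances from the pricing table above.
const tierHours = { Free: 1, Developer: 100, Startup: 500 } as const;

// Browser hours consumed per day by a suite of test sessions.
function hoursPerDay(testsPerDay: number, secondsPerTest: number): number {
  return (testsPerDay * secondsPerTest) / 3600;
}

// Days of that load a monthly allowance covers before it is exhausted.
function daysCovered(allowanceHours: number, dailyHours: number): number {
  return allowanceHours / dailyHours;
}

const daily = hoursPerDay(200, 90); // 200 tests x 90 s = 5 browser hours per day
console.log(daysCovered(tierHours.Developer, daily)); // 20
console.log(daysCovered(tierHours.Startup, daily)); // 100
```

Plugging in your own suite size and average test duration gives a quick check on which tier a month of CI runs would land in.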
That’s before the LLM token cost. CUA mode incurs LLM token costs that vary by page complexity; a 50-action test suite can run into the millions of tokens on the first run. Cache replay drops this to near zero — but cache misses are guaranteed whenever the page changes, which is the whole reason you chose Stagehand.
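To see how cache misses drive token spend, here is a rough expected-value sketch. It assumes a cache hit replays at 0 tokens and a miss costs the same as a fresh run; the ~50,000-token figure is the first-run number from the form-fill task, and the miss rates are hypothetical.

```typescript
// Expected LLM tokens per run when some fraction of runs miss the action cache.
// A hit replays recorded actions at 0 tokens; a miss pays the fresh-run cost.
function expectedTokensPerRun(freshRunTokens: number, missRate: number): number {
  if (missRate < 0 || missRate > 1) {
    throw new RangeError('missRate must be between 0 and 1');
  }
  return freshRunTokens * missRate;
}

// First-run figure observed on the form-fill task.
const FRESH_RUN_TOKENS = 50_000;

console.log(expectedTokensPerRun(FRESH_RUN_TOKENS, 0)); // 0: the page never changes
console.log(expectedTokensPerRun(FRESH_RUN_TOKENS, 0.25)); // 12500: one run in four misses
console.log(expectedTokensPerRun(FRESH_RUN_TOKENS, 1)); // 50000: the page changes every run
```

The more often the target page changes, the closer the average run gets to the full first-run cost.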
For internal test suites on known UIs, Playwright at $0 is the better answer. Stagehand’s costs are justified when the alternative is a human manually re-writing selectors every sprint.
Reliability
Playwright’s flake rate on the tasks above: 0 out of 50 runs. Error messages are precise — which locator, which step, which frame. Debug output is readable without model inference.
Stagehand is harder to characterize. CUA mode introduces a non-determinism floor: the LLM’s interpretation of the same page can vary. The action cache cuts this for stable scenarios, but a changed page means a fresh LLM call — and fresh LLM calls can produce different action sequences than the original. We observed this on the scrape task when the product page shipped a new header; Stagehand’s retry succeeded, but took two passes instead of one.
Retry semantics are model-level: Stagehand will re-attempt if the model signals uncertainty, but there’s no configurable retry policy analogous to Playwright’s retries setting or the timeout option on its expect() assertions. If you need predictable retry windows, you’ll build that yourself.
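Building that yourself might look something like the wrapper below. It is a minimal sketch of a fixed-budget retry around any async call, not a Stagehand feature; `RetryPolicy` and `withRetry` are hypothetical names.

```typescript
interface RetryPolicy {
  attempts: number; // total tries, including the first
  delayMs: number; // fixed pause between tries
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Run an async action under a fixed retry budget and rethrow the last error
// once the budget is spent.
async function withRetry<T>(action: () => Promise<T>, policy: RetryPolicy): Promise<T> {
  if (policy.attempts < 1) throw new RangeError('attempts must be at least 1');
  let lastError: unknown;
  for (let attempt = 1; attempt <= policy.attempts; attempt++) {
    try {
      return await action();
    } catch (error) {
      lastError = error;
      // Pause only when another attempt is still to come.
      if (attempt < policy.attempts) await sleep(policy.delayMs);
    }
  }
  throw lastError;
}
```

An extraction call could then be wrapped as `await withRetry(() => stagehand.page.extract({ ... }), { attempts: 3, delayMs: 1000 })`, which bounds the worst case at three attempts with a one-second pause between them.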
Verdict
| Scenario | Pick |
|---|---|
| Stable internal UI, known selectors | Playwright |
| External pages where selectors rot | Stagehand |
| LLM-generated or rapidly-changing UI | Stagehand |
| CI budget is zero | Playwright |
| Multi-step auth across known domains | Playwright |
| Tasks you’d describe to a human faster than writing a selector | Stagehand |
Playwright is the right default for greenfield TypeScript projects. It’s faster, cheaper, and deterministic. Stagehand solves a specific problem: when writing a CSS selector is itself the bottleneck, because the page is unpredictable or because you don’t control the markup.
The tools aren’t mutually exclusive. Playwright v1.60.0’s new boxes option on ariaSnapshot() — documented as “useful for AI consumption” — is a signal that Playwright itself is moving toward hybrid use. Running Playwright for your own pages and Stagehand for third-party integrations is a reasonable split.
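One way to express that split is a small routing function keyed on hostname. This is an illustrative sketch only: `pickTool` and `OWN_HOSTS` are hypothetical names, and the host list would come from your own configuration.

```typescript
type Tool = 'playwright' | 'stagehand';

// Hosts whose markup you control. Hypothetical values for illustration.
const OWN_HOSTS: ReadonlySet<string> = new Set(['app.internal']);

// Deterministic Playwright for pages you own, Stagehand for everything else.
function pickTool(url: string, ownHosts: ReadonlySet<string> = OWN_HOSTS): Tool {
  const { hostname } = new URL(url);
  return ownHosts.has(hostname) ? 'playwright' : 'stagehand';
}

console.log(pickTool('https://app.internal/login')); // playwright
console.log(pickTool('https://shop.example.com/products')); // stagehand
```

Keeping the decision in one place makes it easy to move a host from one side to the other as a third-party page stabilizes or an internal one starts churning.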
If you’re still deciding between Playwright and Cypress, Playwright vs Cypress 2026 covers the head-to-head on speed, parallelization, and CI setup. For where end-to-end tests fit in a broader testing strategy, see The test pyramid is dead — what replaced it.
Caveats
Speed improvement figures (“20–40% faster,” “way faster a11y-tree”) are from Browserbase’s own blog posts — not independent benchmarks. No independent head-to-head benchmarks exist in the primary literature as of this writing. Treat vendor performance claims as directional, not measured.
No Stagehand download volume can be cited reliably — the widely-shared 1M+ figure was refuted. Check the npm trends page directly if you need current numbers.
Browserbase pricing changes without notice. Verify current pricing at browserbase.com before committing to a tier.
The open reliability bugs (#1635, #1558) predate v3.5.0. Their fix status in the current release is unconfirmed — confirm with Browserbase before deploying CUA mode to production.
References
- Playwright v1.60.0 release notes
- Stagehand v3.5.0 changelog
- Browserbase blog: Stagehand and Playwright evolution
- DEV.to: Stagehand AI Primitives for Playwright that Actually Stick
- DEV.to: Playwright AI — Stagehand: It’s Better Than It Sounds
- Microsoft ISE Blog: LLM-Driven UI Tests
- Digital Applied: Browser Automation AI Agents 2026