Feature flags as architecture — the long-term cost

Feature flags work. The problem isn’t the idea; it’s the accumulation. Most teams add flags deliberately and delete them almost never. The carrying cost compounds until someone either owns the lifecycle or waits long enough to get hurt. If you’re past 20 live flags and don’t have a deletion policy, you’re accumulating risk you can’t see.

Who this is for

Engineering teams that have shipped feature flags and are now deciding whether to formalize the practice. If you’re evaluating whether to adopt flags at all, that’s a shorter conversation. If you’re already running 50+ flags across multiple services with no lifecycle tracking, start at the governance section and work backward.

The incident that names the pattern

On August 1, 2012, Knight Capital Group lost $440 million in 45 minutes. The SEC’s investigation (Order 34-70694) documented exactly how.

A developer repurposed a feature flag that had originally activated a deprecated trading strategy called “Power Peg.” Power Peg had been turned off years earlier, but the code behind it was never deleted. The flag’s new purpose was to enable a routing behavior in a deployment. Seven of Knight’s eight servers received the new code. The eighth didn’t.

When the market opened, the eighth server saw the flag, ran the only code it knew for that flag (the dead Power Peg strategy), and started sending erroneous orders at scale. By the time anyone understood what was happening, 45 minutes and $440 million had passed.

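In the hour and a half before the market opened, Knight’s systems sent 97 automated emails that referenced the order router and reported an error described as “Power Peg disabled.” Nobody treated them as alerts, and none of them led anyone to connect the repurposed flag to the dead code it would activate on an un-upgraded server.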

The flag wasn’t the bug. The permanently-alive flag pointing to never-deleted code was the bug. The flag was the mechanism that made unintended behavior reachable after everyone assumed it wasn’t.
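The mechanism is easier to see in miniature. The sketch below is a toy, not Knight’s code: the flag name, the handler names, and the fleet layout are all invented for illustration. It shows two builds that read the same flag and dispatch to different code, which is what makes a repurposed flag dangerous when one server misses a deploy.

```python
# Toy model of flag reuse under version skew. All names are hypothetical.

def legacy_strategy(order: str) -> str:
    # Dead code that was never deleted from the old build.
    return f"LEGACY handling of {order}"

def new_routing(order: str) -> str:
    # The behavior the repurposed flag is supposed to enable.
    return f"new routing of {order}"

def default_path(order: str) -> str:
    return f"default handling of {order}"

# Each build maps the same flag name to the handler it shipped with.
OLD_BUILD = {"flag_x": legacy_strategy}
NEW_BUILD = {"flag_x": new_routing}

def handle(build: dict, flags: set, order: str) -> str:
    """Dispatch an order using the handlers this build knows about."""
    if "flag_x" in flags:
        return build["flag_x"](order)
    return default_path(order)

# A small fleet in which one server missed the deploy.
fleet = [NEW_BUILD, NEW_BUILD, NEW_BUILD, OLD_BUILD]
results = [handle(build, {"flag_x"}, "order-1") for build in fleet]
```

In this toy, either deleting legacy_strategy when it was retired or minting a fresh flag name instead of reusing flag_x would have left the stale server on the default path.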

How feature flags drift from temporary to permanent

Pete Hodgson’s feature toggles article on martinfowler.com is the canonical reference here. It identifies four flag types by intended lifetime:

Release flags: temporary, 1–14 days. Gate incomplete features during a deployment window.
Experiment flags: medium-term, weeks to months. Run A/B tests, measure, decide.
Ops flags: long-lived, sometimes permanent. Circuit breakers, kill switches for expensive features.
Permissioning flags: permanent by design. User entitlements, beta access, enterprise tiers.

The problem is that “temporary” requires active enforcement. Drift happens in three phases.

Phase 1: Release flag becomes permanent config. The feature ships. The flag stays because removing it feels risky — what if something breaks? — and the pressure to do it evaporates the moment the feature is live. Someone puts it on the backlog. The backlog grows. The flag is now permanent config.

Phase 2: Permanent config becomes silent default. The flag’s value hasn’t changed in six months. Nobody flips it. The team forgets it’s a flag; they think of it as “the way the system works.” A new engineer inherits the codebase and never learns the flag exists. The flag’s handler may contain code that was only ever meant to run during the transition window.

Phase 3: Silent default becomes a trap. Any code reachable only through a silent-default flag is code nobody tests or monitors, because everyone assumes it never runs. Add a deployment mishap, a repurposed flag, or a missed server in a rolling deployment, and you have the Knight Capital situation: behavior you thought was unreachable becomes reachable.

The martinfowler.com article calls this “carrying cost.” Each live flag is an additional branch in your system’s observable behavior. You have to account for it in tests, in debugging, in deploys. One flag is negligible. Forty flags across eight services is a threat model.
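A back-of-the-envelope calculation shows why the cost compounds. The sketch assumes every flag is an independent boolean, which is a simplification: real flags are often multivariate or scoped to a single service. The function name is made up for this example.

```python
from math import comb

def flag_state_space(n_flags: int) -> dict:
    """Count configurations for n independent boolean flags."""
    pairs = comb(n_flags, 2)
    return {
        # Every on/off combination across all flags.
        "all_combinations": 2 ** n_flags,
        # Distinct pairs of flags that could interact.
        "flag_pairs": pairs,
        # On/off settings needed to cover every pair (4 per pair).
        "pairwise_settings": pairs * 4,
    }

one = flag_state_space(1)
forty = flag_state_space(40)
```

For 40 flags this comes to roughly 1.1 trillion combinations and 780 interacting pairs. Nobody tests all of them, which is the point: most of that space is never exercised.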

What the operational costs look like in practice

Knight Capital is the catastrophic tail. Two PostHog incidents from the last eight months show what the more common costs look like.

PostHog, October 2025: PostHog’s feature flag service shared a Redis cluster with its main application. Flag evaluation traffic spiked unexpectedly, hit Redis hard, and cascaded into the main app. Four separate outages. Over 14 hours of cumulative impact across the incident window. The full post-mortem is on PostHog’s site.

The root cause wasn’t flags. It was the infrastructure coupling that flag evaluation introduced when the service was designed for convenience — one Redis cluster — rather than isolation. Feature flags as a first-class service need first-class operational boundaries.

PostHog, February 2026: An OOM condition on the flag evaluation cache workers — triggered by months of accumulated test data — left the cache serving stale values. No downtime. No data loss. But users got outdated flag state — features that should have been enabled weren’t, and vice versa. Silent correctness failure. The post-mortem is on GitHub.

This second incident matters more than it looks. There’s no outage signal. No page. No SLA breach. The system is “up” while serving wrong behavior, and the monitoring that catches downtime is blind to this category of failure. You find it through user reports or anomaly detection in product metrics — both slow and noisy.

Latency is the third cost. How much overhead a flag evaluation adds depends on where it happens. Remote evaluation, which calls the flag service on every request, adds a network round trip. In-process evaluation against a local cache is fast, often under a millisecond, and is what most vendors recommend at scale. But the cache has to stay warm, synchronized, and correctly invalidated. The PostHog February incident was a failure of exactly that kind: the cache stopped refreshing and kept serving old values. This failure mode exists in every flag platform; only the details differ.
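One way to make the silent-staleness failure visible is to have the local cache track when it last synced and report that alongside every value. The following is a minimal sketch, not any vendor’s SDK: the class name, the max_age_seconds threshold, and the injectable clock are assumptions made for illustration.

```python
import time
from typing import Callable, Dict, Optional, Tuple

class LocalFlagCache:
    """In-process flag cache that reports how fresh its data is."""

    def __init__(self, max_age_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_age = max_age_seconds
        self._clock = clock
        self._flags: Dict[str, bool] = {}
        self._last_sync: Optional[float] = None

    def sync(self, flags: Dict[str, bool]) -> None:
        """Replace cached values; called by whatever polls the flag service."""
        self._flags = dict(flags)
        self._last_sync = self._clock()

    def is_stale(self) -> bool:
        """True if the cache never synced or the last sync is too old."""
        if self._last_sync is None:
            return True
        return self._clock() - self._last_sync > self._max_age

    def evaluate(self, key: str, default: bool) -> Tuple[bool, bool]:
        """Return (value, stale) so callers can emit staleness as a metric."""
        return self._flags.get(key, default), self.is_stale()
```

Exporting the stale bit as a metric gives on-call a signal for the case where the system is up but serving old flag state.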

Governance patterns that actually hold

The difference between teams that manage flags well and teams that don’t isn’t tooling — it’s whether they treat flag deletion as a first-class engineering task.

Type-aware lifetimes. Treat flag type as a lifecycle contract, which is what the martinfowler.com taxonomy implies. A release flag committed to the codebase should have a stated expiry date in a comment or config file. GitLab’s internal documentation on feature flags formalizes this: six flag types with documented maximum lifetimes ranging from 2 months for gitlab_com_derisk release flags to an unlimited ops type with mandatory 12-month review cycles. YAML metadata governs flag creation. Chatops commands handle rollout. The process is automated enough that following it is the path of least resistance.
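A lifecycle contract of this kind can be sketched as flag metadata plus a per-type lifetime table. The example uses the four types listed earlier in this article; the specific day limits, the field names, and the FlagSpec class are illustrative assumptions, not GitLab’s schema.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from enum import Enum
from typing import List, Optional

class FlagType(Enum):
    RELEASE = "release"
    EXPERIMENT = "experiment"
    OPS = "ops"
    PERMISSION = "permission"

# Maximum lifetime per type in days; None means no fixed expiry.
# These numbers are placeholders to tune per team.
MAX_LIFETIME_DAYS = {
    FlagType.RELEASE: 14,
    FlagType.EXPERIMENT: 90,
    FlagType.OPS: None,
    FlagType.PERMISSION: None,
}

@dataclass(frozen=True)
class FlagSpec:
    name: str
    type: FlagType
    owner: str
    created: date

    def expires_on(self) -> Optional[date]:
        """Expiry derived from the flag's type, or None if unlimited."""
        limit = MAX_LIFETIME_DAYS[self.type]
        return None if limit is None else self.created + timedelta(days=limit)

def overdue(flags: List[FlagSpec], today: date) -> List[FlagSpec]:
    """Flags whose type-derived expiry date has passed."""
    result = []
    for flag in flags:
        expiry = flag.expires_on()
        if expiry is not None and expiry < today:
            result.append(flag)
    return result
```

Because the expiry is derived from the type rather than typed in by hand, declaring a flag as a release flag is what commits its owner to a deletion date.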

Time bombs. A flag that self-destructs on evaluation after a configured date, by raising an error or logging loudly, turns a forgettable cleanup task into a loud failure on day N+1. The build breaks. The CI pipeline fails. Someone has to deal with it. This is more reliable than a Jira ticket or a Slack reminder. The martinfowler.com article describes teams that put time bombs on their release flags for exactly this reason. Fewer flag platforms support them natively than you’d expect.
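A time bomb needs very little code. The sketch below assumes each release flag carries an expiry date alongside its value; the function name, the exception type, and the choice to raise rather than log are assumptions for illustration.

```python
from datetime import date
from typing import Optional

class FlagExpiredError(RuntimeError):
    """Raised when a flag is evaluated after its stated expiry date."""

def evaluate_release_flag(name: str, enabled: bool, expires: date,
                          today: Optional[date] = None) -> bool:
    """Return the flag value, or fail loudly once the flag has expired."""
    today = today or date.today()
    if today > expires:
        raise FlagExpiredError(
            f"flag {name!r} expired on {expires.isoformat()}; "
            "delete the flag and the branch it guards"
        )
    return enabled
```

Raising in production is a policy choice with its own risk. A more cautious variant raises only in tests and CI and logs loudly in production, so the bomb breaks the build rather than the service.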

Lean inventory limits. GitHub’s feature flag practice — described in their engineering blog — keeps the live inventory small enough that every flag’s purpose is actively known. GitHub’s model includes dual CI builds (one with the flag on, one with it off), per-actor targeting (flag evaluation per user, per organization, per repository, not just per-environment), and cleanup scripts that scan for flags past their stated lifetime. The cleanup becomes a background process, not a quarterly audit. For teams evaluating CI platforms to run this kind of dual-build automation, see GitHub Actions vs GitLab CI.
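The dual-build idea can be approximated inside a single test run by executing each check once per flag state. This is a minimal sketch, not GitHub’s implementation: the in-memory FLAGS store, the override_flag helper, and the checkout_total function are hypothetical.

```python
from contextlib import contextmanager
from typing import Callable, Dict, Iterator

FLAGS: Dict[str, bool] = {"new_pricing": False}

@contextmanager
def override_flag(name: str, value: bool) -> Iterator[None]:
    """Temporarily force a flag value, restoring the old one afterwards."""
    previous = FLAGS[name]
    FLAGS[name] = value
    try:
        yield
    finally:
        FLAGS[name] = previous

def checkout_total(cents: int) -> int:
    """Code under test: both branches must stay correct while the flag lives."""
    if FLAGS["new_pricing"]:
        return cents * 90 // 100  # new path: 10% discount
    return cents                  # old path

def run_in_both_states(name: str, check: Callable[[bool], None]) -> None:
    """Run the same check with the flag off and then on."""
    for state in (False, True):
        with override_flag(name, state):
            check(state)

def check_total(flag_on: bool) -> None:
    expected = 900 if flag_on else 1000
    assert checkout_total(1000) == expected
```

Calling run_in_both_states("new_pricing", check_total) exercises both branches, so the path that production is not currently serving still gets tested until the flag is deleted.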

Hard limit on inventory size. If adding a new flag requires deleting an old one, you audit before you add. This sounds heavy-handed, but the teams that implement it consistently describe it as the intervention that stopped accumulation. The exact limit depends on team size; the discipline matters more than the number.
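A cap like this is simple to enforce in CI. The sketch assumes the live inventory is available as a list of flag names and that the limit is a number the team has chosen; the function and exception names are illustrative.

```python
from typing import List

class InventoryLimitExceeded(Exception):
    """Raised when the live flag count is over the agreed cap."""

def check_inventory(flag_names: List[str], limit: int) -> int:
    """Return remaining headroom, or raise if the inventory is over the cap."""
    unique = set(flag_names)
    if len(unique) > limit:
        raise InventoryLimitExceeded(
            f"{len(unique)} live flags exceeds limit of {limit}; "
            "delete one before adding another"
        )
    return limit - len(unique)
```

Run as a required check on any change that touches flag definitions, this turns the one-in, one-out rule into something the pipeline enforces instead of something reviewers have to remember.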

OpenFeature: vendor-neutral evaluation

If you want to stay portable across providers, the CNCF’s OpenFeature project (incubating since November 21, 2023) gives you a standard evaluation SDK with production implementations for Go, Java, JavaScript, Python, .NET, PHP, Ruby, Swift, and other languages. Licensed under Apache 2.0. Conceptually analogous to OpenTelemetry: one API, pluggable backends.

The practical argument isn’t ideological. It’s operational. If your codebase calls LaunchDarkly’s SDK directly, switching providers means touching every evaluation call site. With OpenFeature, you swap the provider, not the code. For teams early enough in their flag adoption to choose, the portability is worth the abstraction cost.
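The shape of that abstraction is easy to show without any SDK. The code below is a hand-rolled illustration of the provider pattern and deliberately does not use OpenFeature’s actual API; FlagProvider, FlagClient, and both provider classes are hypothetical names.

```python
from typing import Dict, Protocol

class FlagProvider(Protocol):
    """Anything that can resolve a boolean flag."""
    def resolve_boolean(self, key: str, default: bool) -> bool: ...

class InMemoryProvider:
    """Stand-in for one vendor backend."""
    def __init__(self, flags: Dict[str, bool]) -> None:
        self._flags = flags

    def resolve_boolean(self, key: str, default: bool) -> bool:
        return self._flags.get(key, default)

class AlwaysDefaultProvider:
    """Stand-in for a second backend; returns the caller's default."""
    def resolve_boolean(self, key: str, default: bool) -> bool:
        return default

class FlagClient:
    """The only flag type that application code touches."""
    def __init__(self, provider: FlagProvider) -> None:
        self._provider = provider

    def set_provider(self, provider: FlagProvider) -> None:
        self._provider = provider

    def is_enabled(self, key: str, default: bool = False) -> bool:
        return self._provider.resolve_boolean(key, default)

client = FlagClient(InMemoryProvider({"new_nav": True}))

def render_nav() -> str:
    # The call site depends on FlagClient, not on any vendor SDK.
    return "new" if client.is_enabled("new_nav") else "old"
```

Swapping backends is a single set_provider call at startup; render_nav and every other call site stay untouched.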

OpenFeature does not handle governance. It does not tell you when to delete a flag, enforce lifecycle stages, or detect stale evaluations. It’s plumbing, not policy.

Tooling comparison

The governance features vary enough across tools that the choice matters for teams with real accumulation problems.

Tool	Lifecycle stages	Stale detection	Per-flag expiry	Self-hosted	License
LaunchDarkly	6 formal stages	Yes	Yes — targeting rule expiry dates	No (SaaS only)	Proprietary
Unleash	5 stages	Yes (7–40 days by type)	No per-flag date	Yes	AGPL-3.0
GrowthBook	Archive / active	Yes (2 wk inactive or single variation)	No	Yes	MIT + open core
Flagsmith	Advisory labels only	No	Scheduled changes (Scale-Up+)	Yes	BSD-3-Clause
Flipt	None documented	None	None	Yes	Fair Core License

LaunchDarkly has the most formal lifecycle tracking. Six stages. Targeting rule expiry lets you configure a rollout to automatically shut off after a date — the closest thing to native time bombs in a SaaS product. The hard constraint is that there’s no self-hosted option. For teams with data residency requirements or air-gapped deployments, it’s out regardless of governance quality.

Unleash is the strongest self-hosted option with real governance. Five lifecycle stages with tooling to move flags through them. Stale detection is tuned by flag type: an expected lifetime of 40 days for release and experiment flags and 7 days for operational flags, with kill-switch and permission flags treated as permanent. Licensed AGPL-3.0; verify compliance requirements before self-hosting.

GrowthBook is a strong fit for teams that want experimentation analytics alongside flag management. Stale detection is simpler — flags are marked after two weeks of inactivity or when only one variation is being served — but it’s automated, which beats manual audits. MIT core with a commercial cloud tier.

Flagsmith has a referral program that makes it economically attractive for small teams evaluating self-hosted options (see the disclosure at the top of this article — Flagsmith is the only tool here with an affiliate relationship). The governance tooling is lighter than Unleash: advisory labels, no automatic stale detection, scheduled changes on Scale-Up and above. Use it if your flag inventory is small enough that you’re your own governance layer.

Flipt is the minimal option. No lifecycle tracking, no stale detection. Appropriate as a thin config layer if you’re managing governance through code review and naming conventions rather than platform features.

Verdict

The platform is a secondary decision. The primary decision is whether your team has a flag lifecycle policy at all: who owns cleanup, what triggers deletion, what prevents a release flag from silently becoming permanent config.

Solo dev or small team, fewer than 20 flags: pick any tool. The discipline matters more than the platform. GrowthBook or Unleash for open-source self-hosted. Name every flag with its type and an expiry date in the config. Review the list monthly.

Growing team, 20–200 flags: stale detection earns its keep. Unleash if you need self-hosted and want formal lifecycle stages. LaunchDarkly if you want the most complete governance tooling and SaaS is acceptable. Adopt OpenFeature’s SDK layer now to stay portable. For infrastructure to self-host Unleash or GrowthBook, see our 2026 full-stack deploy platform guide.

Scale, 200+ flags across multiple services: Unleash or LaunchDarkly plus custom automation. GitHub’s cleanup-script pattern is worth replicating: automated scans, filed issues, no quarterly audits that never happen. Governance has to be built into the deployment pipeline, not bolted on manually. Flag creation should require a stated type, owner, and expiry.
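A cleanup scan in that spirit can start as a few dozen lines. The following sketch is an illustration, not GitHub’s script: it presumes flags are referenced in source as is_enabled("name") and that expiry dates live in a dictionary. Both are assumptions, and the pattern would need adjusting to whatever call convention a codebase actually uses.

```python
import re
from datetime import date
from typing import Dict, List, Tuple

# Hypothetical call convention: is_enabled("flag_name")
FLAG_CALL = re.compile(r'is_enabled\(\s*["\']([\w.-]+)["\']')

def find_expired_references(
    sources: Dict[str, str],    # path -> file contents
    expiries: Dict[str, date],  # flag name -> expiry date
    today: date,
) -> List[Tuple[str, int, str]]:
    """Return (path, line_number, flag) for each reference to an expired flag."""
    hits = []
    for path, text in sorted(sources.items()):
        for lineno, line in enumerate(text.splitlines(), start=1):
            for flag in FLAG_CALL.findall(line):
                expiry = expiries.get(flag)
                if expiry is not None and expiry < today:
                    hits.append((path, lineno, flag))
    return hits
```

Each hit can become a filed issue assigned to the flag’s owner, which keeps cleanup a background process rather than an audit someone has to schedule.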

At any scale: audit your current inventory before adding a platform. You may find you’re not carrying 40 flags — you’re carrying 8 release flags and 32 silent defaults whose state nobody has checked in a year. The platform won’t surface that. A deletion is free, and it reduces the surface area of the next deployment incident.
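An audit along these lines can be a one-off script over whatever export a flag platform provides. The sketch assumes each record has a name, a type, and a last-changed date, and treats anything unchanged for more than 180 days as a silent default; the field names and the threshold are assumptions.

```python
from datetime import date
from typing import Dict, List, NamedTuple

class FlagRecord(NamedTuple):
    name: str
    type: str           # e.g. "release", "ops"
    last_changed: date

def audit(flags: List[FlagRecord], today: date,
          silent_after_days: int = 180) -> Dict[str, List[str]]:
    """Split the inventory into active flags and silent defaults."""
    report: Dict[str, List[str]] = {"active": [], "silent_default": []}
    for flag in flags:
        age_days = (today - flag.last_changed).days
        bucket = "silent_default" if age_days > silent_after_days else "active"
        report[bucket].append(flag.name)
    return report
```

The silent_default list is the deletion backlog: each entry is either a flag to remove or a permanent setting that should be reclassified and documented as such.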

The Knight Capital incident is sometimes framed as a reason to avoid feature flags. That’s the wrong lesson. The incident was a governance failure: dead code that was never deleted, a flag that was repurposed without documentation, and a deployment that left one server behind. Disciplined flag use (typed, time-bounded, inventoried) would have surfaced the first two before August 1, 2012. The flag was the symptom. The accumulation without a deletion policy was the disease.