"Feature flags decouple deploy from release. A/B tests measure whether a change improves a metric. Both need the same infrastructure — but confusing them leads to no holdout and declaring winners on noise. The hard problems are flicker, caching, and statistical validity."
| Feature flag | A/B experiment | |
|---|---|---|
| Goal | Safe deploy / kill switch | Measure metric impact |
| Assignment | Targeted (% rollout) | Random + sticky (user ID hash) |
| Needs stats? | No | Yes — pre-commit metric + sample size |
| Holdback? | No | Yes — 5–10% for long-term signal |
| Where | Pro | Con |
|---|---|---|
| Client-side | Fast to ship | Flicker (CLS), no SEO |
| Server-side (SSR) | No flicker, SEO-safe | Needs cookie context, CDN harder |
| Edge | No flicker, cacheable, fast | Limited runtime env |
Default: server-side or edge. Client-side only with SSR-embedded flag payload (window.__FLAGS__) to avoid SDK round-trip flicker.
Bad Vary: Cookie — CDNs ignore or bypass cache entirely
Good URL-based routing: /search/control vs /search/variant — each gets its own cache entry
Best Edge middleware rewrites URL per assignment → variant HTML independently cached → high hit rate
1% → 10% → 50% → 90% → [5% holdback]
Check errors + p75 CWV at each step. Holdback = permanent 5–10% on control to detect novelty effects and long-term degradation.
Holdback cost: deliberately withholding a good experience from 5–10% of users. Worth it for large uncertain changes, not small clear wins.
| Trap | Fix |
|---|---|
| Peeking — stop early when p<0.05 | Pre-commit run duration + sample size. Use sequential testing methods for continuous monitoring. |
| SRM — actual split ≠ configured (e.g. 54/46 vs 50/50) | Experiment is invalid. Debug assignment logic. Never use SRM results for decisions. |
| Novelty effect — lift decays after week 1 | Run ≥ 2 weeks. Holdback reveals long-term. |
| Multiple metrics — conversion up, revenue down | Pre-commit one primary metric. Rest are informational. |
Edge middleware assigns variant → sets sticky cookie → rewrites to variant-specific URL (independent CDN cache) → SSR renders the right variant in first HTML → no flicker → RUM segmented by experiment group → pre-commit primary metric (booking conversion) → run 2 weeks → check SRM first → holdback 5% post-rollout.
Hide-then-show anti-flicker: setting visibility:hidden on body until SDK loads delays FCP for ALL users, not just experiment participants. Degrades Core Web Vitals fleet-wide. Fix: SSR-embed the flag assignment so SDK reads synchronously from window.__FLAGS__.
Peeking trap: p=0.03 after 2 days does NOT mean the variant won. Pre-commit sample size and run duration before launching. Checking every day and stopping early inflates false positives — you'll declare a winner on noise.