A/B Testing & Feature Flags

FE Lead · Lesson 22 — print before the round
One-line to open with

"Feature flags decouple deploy from release. A/B tests measure whether a change improves a metric. Both need the same infrastructure — but confusing them leads to no holdout and declaring winners on noise. The hard problems are flicker, caching, and statistical validity."

Flags vs A/B experiments

Feature flagA/B experiment
GoalSafe deploy / kill switchMeasure metric impact
AssignmentTargeted (% rollout)Random + sticky (user ID hash)
Needs stats?NoYes — pre-commit metric + sample size
Holdback?NoYes — 5–10% for long-term signal

Variant assignment options

WhereProCon
Client-sideFast to shipFlicker (CLS), no SEO
Server-side (SSR)No flicker, SEO-safeNeeds cookie context, CDN harder
EdgeNo flicker, cacheable, fastLimited runtime env

Default: server-side or edge. Client-side only with SSR-embedded flag payload (window.__FLAGS__) to avoid SDK round-trip flicker.

Caching variants

Bad Vary: Cookie — CDNs ignore or bypass cache entirely

Good URL-based routing: /search/control vs /search/variant — each gets its own cache entry

Best Edge middleware rewrites URL per assignment → variant HTML independently cached → high hit rate

Safe rollout + holdback

1% → 10% → 50% → 90% → [5% holdback]

Check errors + p75 CWV at each step. Holdback = permanent 5–10% on control to detect novelty effects and long-term degradation.

Holdback cost: deliberately withholding a good experience from 5–10% of users. Worth it for large uncertain changes, not small clear wins.

Statistical traps (know these)

TrapFix
Peeking — stop early when p<0.05Pre-commit run duration + sample size. Use sequential testing methods for continuous monitoring.
SRM — actual split ≠ configured (e.g. 54/46 vs 50/50)Experiment is invalid. Debug assignment logic. Never use SRM results for decisions.
Novelty effect — lift decays after week 1Run ≥ 2 weeks. Holdback reveals long-term.
Multiple metrics — conversion up, revenue downPre-commit one primary metric. Rest are informational.

Flag lifecycle governance

  • Dead flags = tech debt + rollback risk
  • 90-day TTL policy — auto-alert for expired flags
  • Instrument RUM segmented by experiment group — so a bad test doesn't corrupt global CWV baseline
  • Sticky assignment via hashed user ID — same variant across sessions/devices
Memorize: the checkout A/B test answer

Edge middleware assigns variant → sets sticky cookie → rewrites to variant-specific URL (independent CDN cache) → SSR renders the right variant in first HTML → no flicker → RUM segmented by experiment group → pre-commit primary metric (booking conversion) → run 2 weeks → check SRM first → holdback 5% post-rollout.

Don't fail the interview

Hide-then-show anti-flicker: setting visibility:hidden on body until SDK loads delays FCP for ALL users, not just experiment participants. Degrades Core Web Vitals fleet-wide. Fix: SSR-embed the flag assignment so SDK reads synchronously from window.__FLAGS__.

Peeking trap: p=0.03 after 2 days does NOT mean the variant won. Pre-commit sample size and run duration before launching. Checking every day and stopping early inflates false positives — you'll declare a winner on noise.