A/B Testing & Feature Flags

Safe rollout, experimentation, and the front-end infrastructure behind it: a topic that comes up repeatedly in the interview round.

1 Feature flags vs A/B experiments — not the same thing

These two are often conflated but have different goals:

Feature flag

Safe deployment tool

On/off switch to separate deploy from release. Ship code to production hidden behind a flag; flip the flag when you're ready. Enables canary releases, instant rollback without a redeploy, kill switches for incidents. No statistical analysis — it's not measuring anything.

A/B experiment

Measurement tool

Randomly split users into control and variant groups. Measure a metric (conversion, click-through, revenue). Requires statistical significance before calling a winner. The randomisation matters — you're doing science.

In practice, the same flag infrastructure handles both — you either flip a flag at 100% (deployment) or roll it out gradually and instrument it (experiment). The distinction is in how you interpret the data.

One-liner

Feature flags decouple deploy from release. A/B tests measure whether a change improves a metric. Both use the same infrastructure — but confusing them leads to running "experiments" with no holdout and declaring winners too early.
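To make the shared-infrastructure point concrete, here is a minimal sketch in which one evaluation path serves both uses and the only difference is whether exposures are logged for analysis. The `Flag` shape, the `exposures` sink, and the `bucket` argument (assumed to be a stable number in [0, 100) derived from the user ID) are all hypothetical and not any vendor's API.

```typescript
// Hypothetical flag shape; real flag services define their own schemas.
interface Flag {
  key: string;
  rolloutPercent: number; // 0..100; 0 acts as a kill switch, 100 as a full release
  isExperiment: boolean;  // true = record exposures so results can be analysed
}

interface Exposure {
  flagKey: string;
  userId: string;
  variant: "control" | "variant";
}

// In-memory sink for the sketch; a real system would send these to analytics.
const exposures: Exposure[] = [];

// `bucket` is assumed to be a stable number in [0, 100) derived from the user ID.
function evaluate(flag: Flag, userId: string, bucket: number): boolean {
  const on = bucket < flag.rolloutPercent;
  if (flag.isExperiment) {
    // Only experiments need an exposure record: it defines who was in each group.
    exposures.push({ flagKey: flag.key, userId, variant: on ? "variant" : "control" });
  }
  return on;
}
```

A release flag at 100% and an experiment at 50% go through the same `evaluate` call; only the experiment leaves a trail that can be joined against conversion data later.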

2 Variant assignment — client-side vs server-side vs edge

Where assigned	How it works	Key trade-offs
Client-side	JS runs on the client, reads a flag SDK, flips the UI after page load	Fast to ship, no server changes — but causes flicker (page renders control, then flips to variant). Also: bots/crawlers see control only — bad for SEO tests.
Server-side	SSR decides variant before HTML is sent. User gets the right variant on first paint.	No flicker, SEO-safe, works in RSC — but requires cookie/header context at render time. Harder to CDN-cache (variants must be isolated).
Edge-side	CDN edge worker (Cloudflare Workers, Vercel Edge) assigns variant, routes to variant-specific cached response or rewrites HTML.	Fast (no origin hit for assignment), cache-friendly, no flicker, but the runtime is limited (no Node APIs, small bundle). Best when you already have a global CDN presence.

Source: Vercel — Feature Flags at the Edge · LaunchDarkly — Client-side JS SDK

3 The flicker problem

Client-side A/B tests are notorious for flicker (also called FOOC, flash of original content, by analogy with FOUC, flash of unstyled content): the browser renders the control variant, the flag SDK initialises and returns the variant assignment, then the page visually jumps. Users see a flash. This harms UX and contaminates metrics: variant users can bounce before they have really seen the variant, which biases the comparison.

Client-side (broken): timeline

HTML arrives: control rendered → user sees control layout
Flag SDK loads: JS evaluates, 50–200ms later
Variant applied: DOM flips → CLS spike · user confused

Server-side / edge (fixed): timeline

Server/edge assigns variant: variant HTML sent → correct variant on first paint
No flip: page stable from first byte → CLS = 0 from experiment

Client-side flicker mitigation (when server-side isn't an option)

Hide-then-show: add CSS body { visibility: hidden } in the <head>, reveal after SDK loads. Prevents the flash, but delays FCP for everyone, including users who are not in any experiment. Bad for Core Web Vitals (L03).
Cache the assignment in a cookie: on second visit, read the cookie synchronously before render. Only first visit flickers.
Inline the assignment in SSR: even with a client-side SDK, SSR can pre-load the variant assignment from the flag service and embed it as a window.__FLAGS__ JSON object in the HTML — no round-trip needed, SDK reads from it synchronously.
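The SSR-inlining option above can be sketched roughly as follows. This is an illustration under assumptions, not a specific SDK's bootstrap API: `renderFlagScript` and `readFlag` are hypothetical helpers, and the client reader takes the global object as a parameter so the sketch runs outside a browser.

```typescript
type FlagMap = Record<string, string | boolean>;

// Server side: serialise the assignments into an inline script for the <head>.
// Escaping "<" stops a value such as "</script>" from closing the tag early.
function renderFlagScript(flags: FlagMap): string {
  const json = JSON.stringify(flags).replace(/</g, "\\u003c");
  return `<script>window.__FLAGS__=${json}</script>`;
}

// Client side: read synchronously before first render. Falls back to the
// supplied default (normally control) when the payload or the key is missing.
function readFlag(
  globalObj: { __FLAGS__?: FlagMap },
  key: string,
  fallback: string | boolean,
): string | boolean {
  return globalObj.__FLAGS__?.[key] ?? fallback;
}
```

In a browser the caller would pass `window` as `globalObj`; because the payload is already in the HTML, the first render can use the assigned variant without waiting for a network round-trip.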

Client-side A/B tests and Core Web Vitals: Every client-side A/B test is a potential CLS source (layout jump from variant flip) and LCP threat (hero image differs between variants, loads later). At platform scale — hundreds of simultaneous experiments — this compounds. The Lead answer: push assignment to the server or edge, use SSR-embedded flag data as a fallback, and run CWV monitoring segmented by experiment (so a bad test doesn't corrupt the global RUM baseline).
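Segmenting CWV by experiment could look something like the sketch below, which groups RUM samples by experiment and variant and reports p75 CLS per group. The `RumSample` shape is an assumption for illustration, and the percentile uses the simple nearest-rank method rather than whatever your RUM vendor computes.

```typescript
// Hypothetical RUM record: one CLS measurement tagged with its experiment arm.
interface RumSample {
  experiment: string; // experiment key, e.g. "exp_new_filters"
  variant: string;    // "control" or "variant"
  cls: number;        // CLS reported for one page view
}

// Nearest-rank percentile on a sorted copy (the input array is not mutated).
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Group samples by "experiment/variant" and return p75 CLS for each group.
function p75ClsByVariant(samples: RumSample[]): Record<string, number> {
  const groups: Record<string, number[]> = {};
  for (const s of samples) {
    const key = `${s.experiment}/${s.variant}`;
    (groups[key] = groups[key] ?? []).push(s.cls);
  }
  const result: Record<string, number> = {};
  for (const key of Object.keys(groups)) {
    result[key] = percentile(groups[key], 75);
  }
  return result;
}
```

Comparing the control and variant entries for one experiment shows whether that experiment is the source of a CLS regression, without it being averaged into the global baseline.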

4 Caching with A/B variants — the hard problem

A CDN caches one response per cache key. If two users get different HTML (different variants), the CDN must serve them different responses — or you'll serve the wrong variant from cache.

The three approaches

URL-based variants

Separate cache keys

/search?__experiment=new-filters or route rewriting at edge. Each variant has its own URL = its own cache entry. Clean, cache-friendly. Awkward for users (don't want experiment params in their bookmarks).

Cookie-based variants

Vary: Cookie

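Vary: Cookie tells the CDN to cache a separate response per value of the Cookie header. The catch is that the key is the entire header, not just the experiment cookie, so every session or analytics cookie produces its own entry. In practice CDNs handle Vary: Cookie poorly: depending on vendor and configuration they ignore it, bypass the cache, or fragment into near-unique entries per user. Effective CDN hit rate → near 0.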

Edge assignment (best approach)

Rewrite at CDN, cache variant HTML

An edge worker intercepts the request, reads a lightweight assignment cookie (or assigns on first visit), then either routes to a variant-specific origin path or rewrites the response from cached static HTML. Both variant responses are independently cached at the CDN — cache hit rate stays high, assignment happens at ~0ms latency.

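// Vercel Edge Middleware: assign variant, rewrite path
import { NextRequest, NextResponse } from 'next/server';

// Run only on the search page so other routes are not rewritten
export const config = { matcher: '/search' };

export function middleware(req: NextRequest) {
  // Reuse the sticky assignment if present; otherwise assign 50/50
  const variant = req.cookies.get('exp_new_filters')?.value
    ?? (Math.random() < 0.5 ? 'control' : 'variant');

  // Serve the variant-specific page; the URL the user sees stays /search
  const res = NextResponse.rewrite(
    new URL(`/search/${variant}`, req.url)
  );
  // Persist the assignment for 7 days (maxAge is in seconds)
  res.cookies.set('exp_new_filters', variant, { maxAge: 60 * 60 * 24 * 7 });
  return res;
}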
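To see why the Vary: Cookie approach collapses the hit rate while variant-specific paths do not, the toy model below counts distinct cache keys under each scheme. It is a simplification under stated assumptions: the key formats are invented for illustration, and each visitor is assumed to carry one unique session cookie alongside the experiment cookie.

```typescript
// Hypothetical visitor: a unique session cookie plus an experiment assignment.
interface Visitor {
  sessionId: string;
  variant: "control" | "variant";
}

// What a cache honouring `Vary: Cookie` effectively keys on:
// the URL plus the entire Cookie header, session cookie included.
function keyWithVaryCookie(url: string, v: Visitor): string {
  return `${url}|session=${v.sessionId}; exp_new_filters=${v.variant}`;
}

// What the edge-rewrite approach keys on: the rewritten, variant-specific path.
function keyWithRewrite(url: string, v: Visitor): string {
  return `${url}/${v.variant}`;
}

// Count distinct cache entries a list of keys would create.
function distinctKeys(keys: string[]): number {
  const seen: Record<string, true> = {};
  for (const k of keys) seen[k] = true;
  return Object.keys(seen).length;
}
```

With N visitors the first scheme produces N entries (one per session), while the second produces one entry per variant regardless of traffic.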

5 Safe rollout pattern + holdbacks

A/B tests are also how you roll out risky features safely. The pattern:

Canary: watch errors + perf metrics
→ 10% (Ramp): check conversion + p75 CWV
→ 50% (Half): stat significance check
→ 90% (Full): monitor holdback
→ Holdback: keep forever for long-term signal

The holdback group

A holdback (or holdout) keeps 5–10% of users on the control even after full rollout. This lets you measure long-term impact — novelty effects wear off after a few weeks, and some metrics only show degradation months later. Without a holdback, you can't distinguish "it worked" from "users were excited about something new." Google and large tech companies use permanent holdbacks on major features.

Holdback cost: You're deliberately giving a worse experience to 5–10% of users if the experiment is positive. Justifiable for large, uncertain changes. Not worth the cost for tiny improvements with clear metrics. The Lead decision: "Is the value of long-term signal worth delaying the benefit to the holdback group?"
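One way to combine a ramp with a holdback is sketched below. It assumes a deterministic hash of the user ID (FNV-1a here, chosen only because it is short) and uses separate salts so that holdback membership does not change as the rollout percentage ramps. `assignArm` and its parameters are hypothetical names.

```typescript
type Arm = "holdback" | "control" | "variant";

// FNV-1a 32-bit hash: deterministic, so a user always maps to the same bucket.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Bucket in [0, 100); the salt keeps different decisions independent.
function bucket(salt: string, userId: string): number {
  return fnv1a(`${salt}:${userId}`) % 100;
}

// holdbackPercent: share of users pinned to the old experience (e.g. 5).
// rolloutPercent: share of the remaining users who get the feature (0..100).
function assignArm(
  userId: string,
  featureKey: string,
  holdbackPercent: number,
  rolloutPercent: number,
): Arm {
  // Holdback is decided first and with its own salt, so ramping never moves
  // a user in or out of the holdback group.
  if (bucket(`${featureKey}:holdback`, userId) < holdbackPercent) return "holdback";
  return bucket(`${featureKey}:rollout`, userId) < rolloutPercent ? "variant" : "control";
}
```

Ramping from 10% to 50% to 100% only changes `rolloutPercent`; users already in the variant stay there, and the holdback group keeps seeing the old experience for long-term comparison.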

6 Statistical validity — what a Lead must know

You don't need a statistics degree, but you need to flag these problems in a review:

Trap	What it is	Fix
Peeking / early stopping	Checking results every day and stopping the test when it looks significant. Repeated testing inflates false positive rate — you'll declare a winner on noise.	Pre-commit to a sample size and run time before launching. Use sequential testing methods (e.g., always-valid p-values) if you need continuous monitoring.
Sample ratio mismatch (SRM)	The actual split (e.g., 52/48) differs from the intended split (50/50) by more than random chance explains at your sample size. Indicates a bug in assignment or logging, so your groups aren't comparable.	Always check assignment ratios first, with a chi-squared test rather than by eye. If SRM exists, the experiment result is invalid. Debug the randomisation logic.
Novelty effect	Variant looks better initially because users click on anything new. Wears off.	Run experiments long enough to cover a full user behaviour cycle (at least 1–2 weeks). Holdbacks reveal this over months.
Multiple metrics	The variant improves conversion but degrades revenue per booking. Which wins?	Pre-commit to a primary metric before running. Secondary metrics are informational, not decision-making criteria.
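The SRM check in the table can be made mechanical with a chi-squared goodness-of-fit test. The sketch below is a minimal two-arm version; the function names are hypothetical, and the threshold of 10.828 (the critical value for p = 0.001 at one degree of freedom) is one commonly used strict cut-off, not a universal rule.

```typescript
// Chi-squared goodness-of-fit statistic for a two-arm split (1 degree of freedom).
function srmChiSquared(
  controlCount: number,
  variantCount: number,
  expectedControlShare = 0.5,
): number {
  const total = controlCount + variantCount;
  const expectedControl = total * expectedControlShare;
  const expectedVariant = total - expectedControl;
  return (
    (controlCount - expectedControl) ** 2 / expectedControl +
    (variantCount - expectedVariant) ** 2 / expectedVariant
  );
}

// Critical value for p = 0.001 with 1 degree of freedom. A strict threshold
// is an assumption here, chosen to keep false alarms rare.
const SRM_THRESHOLD = 10.828;

function hasSrm(
  controlCount: number,
  variantCount: number,
  expectedControlShare = 0.5,
): boolean {
  return srmChiSquared(controlCount, variantCount, expectedControlShare) > SRM_THRESHOLD;
}
```

The same 52/48 ratio gives different answers at different scales: 520 vs 480 users yields a statistic of 1.6 (no SRM), while 5,200 vs 4,800 yields 16 (SRM), which is why the check has to account for sample size.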

7 Flag tooling landscape

Tool	Strength	Best fit
LaunchDarkly	Full-featured, targeting rules, SSE streaming flag updates, audit log. Industry standard.	High-traffic; streaming flag sync = no polling lag
Statsig	Built-in experiment statistics (CUPED variance reduction, sequential testing). Flag + analysis in one.	Teams that want flags + stats without a separate analytics tool
GrowthBook	Open-source, self-hostable, connects to your own data warehouse for stats.	Cost control; data sovereignty concerns
Unleash	Open-source, enterprise tier, strategy plugins (gradual rollout, userId hash).	Teams needing self-hosted + on-prem
Vercel Edge Config	Ultra-low-latency flag reads at the edge (~1ms). No SDK roundtrip.	Edge-assignment pattern (§4)

Source: LaunchDarkly — What are feature flags? · GrowthBook — Overview

8 Lead framing — A/B testing as engineering discipline

Product thinks of A/B tests as a tool to pick winners. A Lead thinks of it as an infrastructure problem:

Consistent assignment — same user always gets the same variant across devices and sessions (sticky assignment via hashed user ID, not random-per-request).
Flicker-free — server-side or edge assignment as default; client-side only as an exception with mitigation.
Cache-safe — variant isolation at CDN so experiments don't trash hit rates.
Observable — every variant change tagged in your RUM/analytics pipeline so you can segment CWV by experiment group.
Flag lifecycle — experiments end. Dead flag cleanup is technical debt. A Lead enforces a 90-day TTL on flags with automated alerts for expired experiments still in code.
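The flag-lifecycle rule lends itself to automation. As a rough sketch, assuming flag metadata with a creation date is available from your flag service or a registry file (the `FlagRecord` shape and `expiredFlags` helper are hypothetical), a scheduled job could list flags past the TTL:

```typescript
// Hypothetical flag metadata; a real registry would likely hold owner and status too.
interface FlagRecord {
  key: string;
  createdAt: Date;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Return the keys of flags older than the TTL (90 days by default).
// `now` is a parameter so the check is deterministic and easy to test.
function expiredFlags(flags: FlagRecord[], now: Date, ttlDays = 90): string[] {
  return flags
    .filter((f) => now.getTime() - f.createdAt.getTime() > ttlDays * DAY_MS)
    .map((f) => f.key);
}
```

The resulting list can feed an alert or a CI warning so that expired experiments still referenced in code are surfaced rather than discovered months later.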

Full loop

Concept: A/B testing at scale is a platform capability: it needs consistent assignment, flicker-free rendering, cache-safe variant isolation, and a flag-lifecycle process.

Trade-off: server-side assignment is the right default but needs request-time context (cookies/headers) that conflicts with aggressive CDN caching; edge assignment solves both but constrains your runtime environment.

Anchor: "Running 60 simultaneous client-side experiments, RUM showed a mobile CLS regression we traced to experiment flicker. We moved to edge assignment via a worker with independently cached variant HTML and CLS recovered; we also found three experiments running 8 months with no decision, dead flags clogging the code."

Impact: faster experimentation velocity (ship flags without a server deploy), safer rollouts (instant kill switch), and better metric validity (no flicker contaminating results).

Invite: "I'd weigh the edge-assignment complexity differently for a small team: LaunchDarkly client-side with an SSR-embedded flag payload is much simpler to start with."

9 Check yourself: scenario quiz


1. A product manager says "let's run an A/B test — just add a feature flag and we'll see which version performs better." What's missing from this framing?

2. You implement an A/B test for a new hotel search filter UI. Users in the variant group report the page "jumps" when it loads. What's the root cause?

3. Your team sets Vary: Cookie on the CDN to serve different HTML per A/B variant. What's the problem?

4. You launch an experiment Monday morning. By Tuesday afternoon, the variant shows a 15% lift in conversion with p=0.03. Should you call the winner and ship?

5. An experiment is configured for a 50/50 split. After 3 days, the actual assignment is 54% control / 46% variant. What does this indicate?

6. You fully roll out a new hotel detail page to 100% of users after a successful A/B test. Why might you keep 5% on the old version (holdback)?

7. A colleague proposes client-side A/B testing with the "hide-then-show" anti-flicker pattern (visibility: hidden on body until SDK loads). What's the cost?

8. An engineer finishes an experiment (variant won, fully shipped). Six months later you notice the feature flag is still in the codebase. Why does this matter?

Out-loud drill — before next session

"The goal is to A/B test a new checkout flow — different layout, different CTA copy, different price display. How would you implement the experiment infrastructure? Cover: assignment strategy, flicker prevention, caching, metrics, and how you'd call the winner."

Target: ~3 minutes. Hit: server-side/edge assignment → no flicker → cache via variant-routed URLs → instrument RUM per variant → pre-commit primary metric (conversion) → run full duration → check SRM → holdback after rollout.

Good follow-up topics:

CUPED variance reduction in stats
Multi-armed bandit vs fixed A/B
LaunchDarkly vs Statsig: detailed comparison
Feature flag governance across 20 teams
How to test a flag-gated component with RTL+MSW
A/B testing in Next.js App Router with middleware
Experiment contamination: cross-variant effects