Safe rollout, experimentation, and the front-end infrastructure behind them: a topic that comes up repeatedly in the round.
Feature flags and A/B tests are often conflated but have different goals:
Feature flag: an on/off switch that separates deploy from release. Ship code to production hidden behind a flag; flip the flag when you're ready. This enables canary releases, instant rollback without a redeploy, and kill switches for incidents. There is no statistical analysis, because a flag on its own measures nothing.
A/B test: randomly split users into control and variant groups, then measure a metric (conversion, click-through, revenue). Requires statistical significance before calling a winner. The randomisation matters: you're doing science.
In practice, the same flag infrastructure handles both: you either flip a flag to 100% (a release) or roll it out to a random subset and instrument it (an experiment). The distinction is whether you measure the outcome and how you interpret the data.
Feature flags decouple deploy from release. A/B tests measure whether a change improves a metric. Both use the same infrastructure — but confusing them leads to running "experiments" with no holdout and declaring winners too early.
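To make "declaring a winner" concrete, here is a minimal sketch of one common significance check, a two-proportion z-test on conversion counts. The type, function name, and numbers are hypothetical, and a real experimentation platform may well use a different method (sequential tests, variance reduction, and so on).

```typescript
// Minimal sketch: two-proportion z-test for a conversion experiment.
// All names and numbers are hypothetical, for illustration only.
interface GroupResult {
  users: number;       // users assigned to the group
  conversions: number; // users in the group who converted
}

function twoProportionZ(control: GroupResult, variant: GroupResult): number {
  const p1 = control.conversions / control.users;
  const p2 = variant.conversions / variant.users;
  // Pooled rate under the null hypothesis that both groups convert equally
  const pooled =
    (control.conversions + variant.conversions) /
    (control.users + variant.users);
  const standardError = Math.sqrt(
    pooled * (1 - pooled) * (1 / control.users + 1 / variant.users)
  );
  return (p2 - p1) / standardError;
}

const z = twoProportionZ(
  { users: 10000, conversions: 1000 },
  { users: 10000, conversions: 1100 }
);

// |z| > 1.96 corresponds to p < 0.05 for a two-sided test
const significant = Math.abs(z) > 1.96;
console.log(z.toFixed(2), significant);
```

A plain feature-flag rollout never runs a check like this; that is the practical difference between the two uses of the same infrastructure.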
| Where assigned | How it works | Key trade-offs |
|---|---|---|
| Client-side | JS runs on the client, reads a flag SDK, flips the UI after page load | Fast to ship, no server changes — but causes flicker (page renders control, then flips to variant). Also: bots/crawlers see control only — bad for SEO tests. |
| Server-side | SSR decides variant before HTML is sent. User gets the right variant on first paint. | No flicker, SEO-safe, works in RSC — but requires cookie/header context at render time. Harder to CDN-cache (variants must be isolated). |
| Edge-side | CDN edge worker (Cloudflare Workers, Vercel Edge) assigns variant, routes to variant-specific cached response or rewrites HTML. | Fast (no origin hit for assignment), cache-friendly, no flicker; but limited runtime environment (no Node APIs, small bundle). Best when you already have a global CDN presence. |
Client-side A/B tests are notorious for flicker (often loosely called FOUC, flash of unstyled content; more precisely FOOC, flash of original content): the browser renders the control variant, the flag SDK initialises and returns the variant assignment, then the page visually jumps. Users see a flash. This harms both UX and metric validity: variant users can bounce before they have really seen the variant, which contaminates the result.
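Client-side (broken) timeline: HTML arrives → control renders → flag SDK initialises → assignment returns → page jumps to the variant.
Server-side / edge (fixed) timeline: variant assigned before the HTML is sent → correct variant on first paint, no jump.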
body { visibility: hidden } in the <head>, reveal after SDK loads. Prevents the flash, but delays FCP for everyone, including users for whom no variant exists. Bad for Core Web Vitals (L03).

window.__FLAGS__ JSON object in the HTML: no round-trip needed, SDK reads from it synchronously.

A CDN caches one response per cache key. If two users get different HTML (different variants), the CDN must serve them different responses, or you'll serve the wrong variant from cache.
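The cache-key problem can be made concrete with a toy in-memory "CDN". This is only a sketch: `ToyCdn`, `renderAtOrigin`, and the key functions are hypothetical names, and real CDNs build cache keys from configurable request attributes rather than a callback.

```typescript
// Toy CDN: exactly one cached response per cache key.
// All names here are hypothetical, for illustration only.
type Variant = "control" | "variant";

// Stand-in for the origin: renders different HTML per variant
function renderAtOrigin(path: string, variant: Variant): string {
  return `<html data-path="${path}" data-variant="${variant}"></html>`;
}

class ToyCdn {
  private cache = new Map<string, string>();
  private keyFor: (path: string, variant: Variant) => string;

  constructor(keyFor: (path: string, variant: Variant) => string) {
    this.keyFor = keyFor;
  }

  fetch(path: string, variant: Variant): string {
    const key = this.keyFor(path, variant);
    const hit = this.cache.get(key);
    if (hit !== undefined) return hit; // cache hit: origin never sees the request
    const body = renderAtOrigin(path, variant);
    this.cache.set(key, body);
    return body;
  }
}

// Key ignores the variant: the first cached response wins for everyone
const naive = new ToyCdn((path) => path);
naive.fetch("/search", "control");
const leaked = naive.fetch("/search", "variant"); // control HTML served to a variant user

// Key includes the variant: each variant gets its own cache entry
const isolated = new ToyCdn((path, variant) => `${path}::${variant}`);
isolated.fetch("/search", "control");
const correct = isolated.fetch("/search", "variant");
```

Each isolation strategy amounts to a different way of getting the variant into the cache key without destroying the hit rate.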
Variant-routed URLs: /search?__experiment=new-filters or route rewriting at edge. Each variant has its own URL and therefore its own cache entry. Clean and cache-friendly, but awkward for users, who don't want experiment params in their bookmarks.
Vary: Cookie tells the CDN to cache per Cookie header value. It works in theory, but in practice CDNs handle Vary: Cookie poorly (Fastly/Cloudflare either ignore it or bypass cache entirely), and the full Cookie header is close to unique per user anyway. Effective CDN hit rate: near 0.
Edge assignment: an edge worker intercepts the request, reads a lightweight assignment cookie (or assigns on first visit), then either routes to a variant-specific origin path or rewrites the response from cached static HTML. Both variant responses are independently cached at the CDN, so the cache hit rate stays high and assignment adds close to zero latency because it never touches the origin.
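```typescript
// Vercel Edge Middleware: assign a variant, then rewrite to a variant-specific path
import { NextRequest, NextResponse } from 'next/server';

// Only run on the page under test; otherwise every request would be rewritten
export const config = { matcher: '/search' };

export function middleware(req: NextRequest) {
  // Reuse an existing assignment; otherwise assign 50/50 on first visit
  const variant =
    req.cookies.get('exp_new_filters')?.value ??
    (Math.random() < 0.5 ? 'control' : 'variant');

  // Each variant path is a separate URL, so the CDN caches it independently
  const res = NextResponse.rewrite(new URL(`/search/${variant}`, req.url));

  // Persist the assignment for 7 days so the user stays in the same group
  res.cookies.set('exp_new_filters', variant, { maxAge: 60 * 60 * 24 * 7 });
  return res;
}
```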
A/B tests are also how you roll out risky features safely. The pattern:
A holdback (or holdout) keeps 5–10% of users on the control even after full rollout. This lets you measure long-term impact: novelty effects wear off after a few weeks, and some metrics only show degradation months later. Without a holdback, you can't distinguish "it worked" from "users were excited about something new." Large tech companies commonly keep long-lived holdbacks on major features.
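A gradual rollout with a holdback needs assignment that is stable per user, so that raising the percentage only ever adds users to the variant. One way to sketch that is deterministic hash bucketing. Everything below (`fnv1a`, `bucketOf`, `assign`, `RolloutConfig`, the flag key) is a hypothetical illustration, not the API of any particular flag vendor.

```typescript
// Sketch: deterministic bucketing for gradual rollout plus a holdback.
// All names are hypothetical; real flag SDKs use their own hashing schemes.

// 32-bit FNV-1a over UTF-16 code units: deterministic, not cryptographic
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force unsigned
}

// Map a user to a stable bucket in [0, 10000) for a given salt
function bucketOf(userId: string, salt: string): number {
  return fnv1a(`${salt}:${userId}`) % 10000;
}

interface RolloutConfig {
  flagKey: string;
  rolloutPercent: number;  // 0..100, share of non-holdback users on the variant
  holdbackPercent: number; // 0..100, users pinned to control even at full rollout
}

type Assignment = "control" | "variant" | "holdback";

function assign(userId: string, cfg: RolloutConfig): Assignment {
  // Separate salts keep holdback membership independent of rollout order
  if (bucketOf(userId, `${cfg.flagKey}:holdback`) < cfg.holdbackPercent * 100) {
    return "holdback";
  }
  return bucketOf(userId, `${cfg.flagKey}:rollout`) < cfg.rolloutPercent * 100
    ? "variant"
    : "control";
}

const atTenPercent: RolloutConfig = { flagKey: "new_filters", rolloutPercent: 10, holdbackPercent: 5 };
const atFull: RolloutConfig = { ...atTenPercent, rolloutPercent: 100 };
console.log(assign("user-42", atTenPercent), assign("user-42", atFull));
```

Because a user's bucket never changes, moving `rolloutPercent` from 10 to 100 keeps every early variant user in the variant, and holdback users stay on control throughout.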
You don't need a statistics degree, but you need to flag these problems in a review:
| Trap | What it is | Fix |
|---|---|---|
| Peeking / early stopping | Checking results every day and stopping the test when it looks significant. Repeated testing inflates false positive rate — you'll declare a winner on noise. | Pre-commit to a sample size and run time before launching. Use sequential testing methods (e.g., always-valid p-values) if you need continuous monitoring. |
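| Sample ratio mismatch (SRM) | The actual split differs from the intended split (50/50) by more than chance explains; 52/48 can be noise on a small sample but is a red flag on a large one. Indicates a bug in assignment or logging, so your groups aren't comparable. | Always check assignment ratios first, e.g. with a chi-squared test on the group counts. If SRM exists, the experiment result is invalid. Debug the randomisation logic. |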
| Novelty effect | Variant looks better initially because users click on anything new. Wears off. | Run experiments long enough to cover a full user behaviour cycle (at least 1–2 weeks). Holdbacks reveal this over months. |
| Multiple metrics | The variant improves conversion but degrades revenue per booking. Which wins? | Pre-commit to a primary metric before running. Secondary metrics are informational, not decision-making criteria. |
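
The SRM check in the table can be sketched as a chi-squared goodness-of-fit test on the assignment counts. The function name and the choice of a strict p < 0.001 threshold are assumptions for illustration; 10.828 is the chi-squared critical value for one degree of freedom at that level.

```typescript
// Sketch: sample ratio mismatch check via a chi-squared goodness-of-fit test.
// Function name and threshold choice are assumptions, not a vendor API.
function srmChiSquared(
  controlCount: number,
  variantCount: number,
  expectedControlShare = 0.5
): number {
  const total = controlCount + variantCount;
  const expectedControl = total * expectedControlShare;
  const expectedVariant = total - expectedControl;
  return (
    (controlCount - expectedControl) ** 2 / expectedControl +
    (variantCount - expectedVariant) ** 2 / expectedVariant
  );
}

// Chi-squared critical value for 1 degree of freedom at p = 0.001
const SRM_THRESHOLD = 10.828;

function hasSrm(controlCount: number, variantCount: number): boolean {
  return srmChiSquared(controlCount, variantCount) > SRM_THRESHOLD;
}

// The same 52/48 split is plausible noise at 1,000 users
// but a clear mismatch at 100,000 users
const smallSample = hasSrm(520, 480);
const largeSample = hasSrm(52000, 48000);
console.log(smallSample, largeSample);
```

The point of the two calls is that the observed ratio alone says little; the sample size determines whether the deviation is evidence of an assignment bug.
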
| Tool | Strength | Best fit |
|---|---|---|
| LaunchDarkly | Full-featured, targeting rules, SSE streaming flag updates, audit log. Industry standard. | High-traffic; streaming flag sync = no polling lag |
| Statsig | Built-in experiment statistics (CUPED variance reduction, sequential testing). Flag + analysis in one. | Teams that want flags + stats without a separate analytics tool |
| GrowthBook | Open-source, self-hostable, connects to your own data warehouse for stats. | Cost control; data sovereignty concerns |
| Unleash | Open-source, enterprise tier, strategy plugins (gradual rollout, userId hash). | Teams needing self-hosted + on-prem |
| Vercel Edge Config | Ultra-low-latency flag reads at the edge (~1ms). No SDK roundtrip. | Edge-assignment pattern (§4) |
Product thinks of A/B tests as a tool for picking winners. A Lead thinks of them as an infrastructure problem:
Concept: A/B testing at scale is a platform capability. It needs consistent assignment, flicker-free rendering, cache-safe variant isolation, and a flag-lifecycle process.
Trade-off: server-side assignment is the right default but needs request-time context (cookies/headers) that conflicts with aggressive CDN caching; edge assignment solves both but constrains your runtime environment.
Anchor: "Running 60 simultaneous client-side experiments, RUM showed a mobile CLS regression we traced to experiment flicker. We moved to edge assignment via a worker with independently cached variant HTML and CLS recovered. We also found three experiments that had been running for 8 months with no decision: dead flags clogging the code."
Impact: faster experimentation velocity (ship flags without a server deploy), safer rollouts (instant kill switch), and better metric validity (no flicker contaminating results).
Invite: "I'd weigh the edge-assignment complexity differently for a small team. LaunchDarkly client-side with an SSR-embedded flag payload is much simpler to start with."
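The flag-lifecycle process mentioned above can start as something very small, such as a scheduled check for experiments that have run too long without a decision. The sketch below assumes a hypothetical `FlagRecord` shape and a 90-day limit; neither comes from a specific tool.

```typescript
// Sketch: find experiments that have been running too long without a decision.
// FlagRecord, the statuses, and the 90-day limit are hypothetical.
interface FlagRecord {
  key: string;
  createdAt: Date;
  status: "running" | "decided" | "archived";
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

function findStaleFlags(
  flags: FlagRecord[],
  now: Date,
  maxAgeDays: number
): string[] {
  return flags
    .filter(
      (flag) =>
        flag.status === "running" &&
        (now.getTime() - flag.createdAt.getTime()) / MS_PER_DAY > maxAgeDays
    )
    .map((flag) => flag.key);
}

const flags: FlagRecord[] = [
  { key: "exp_new_filters", createdAt: new Date("2024-01-01T00:00:00Z"), status: "running" },
  { key: "exp_cta_copy", createdAt: new Date("2024-08-15T00:00:00Z"), status: "running" },
  { key: "exp_old_banner", createdAt: new Date("2023-06-01T00:00:00Z"), status: "decided" },
];

const stale = findStaleFlags(flags, new Date("2024-09-01T00:00:00Z"), 90);
console.log(stale);
```

A report like this only surfaces candidates; someone still has to call the result and remove the dead code path.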
"The goal is to A/B test a new checkout flow — different layout, different CTA copy, different price display. How would you implement the experiment infrastructure? Cover: assignment strategy, flicker prevention, caching, metrics, and how you'd call the winner."
Target: ~3 minutes. Hit: server-side/edge assignment → no flicker → cache via variant-routed URLs → instrument RUM per variant → pre-commit primary metric (conversion) → run full duration → check SRM → holdback after rollout.
Good follow-up topics: