Frontend Testing Strategy

From testing pyramid to testing culture — how a Lead frames test ROI, eliminates flakiness, and makes testing a team discipline.

1 Testing pyramid vs testing trophy

The classic pyramid says: many unit tests at the bottom, some integration in the middle, few e2e at the top. That is a good mental model for backends, where units are often pure functions. For UI, it over-indexes on unit tests that break on every refactor without catching real bugs.

Kent C. Dodds proposed the testing trophy: the biggest investment is in integration tests, because they give the most ROI for UI code.

Testing trophy (UI ROI order)

Layer	Focus	Tools
e2e	Few high-value flows	Playwright / Cypress
Integration	Most tests; highest ROI	RTL + MSW
Unit	One piece in isolation; dependencies mocked	Jest / Vitest
Static analysis	Cheapest confidence	TypeScript · ESLint · Prettier

Why integration > unit for UI?

Unit tests verify one component in isolation — children are mocked, so they can't catch bugs that come from components working together. Integration tests render the real tree, simulate user events, and assert on what the user sees — they catch wiring bugs and survive refactors. An integration test checking "the formatted price is visible after render" doesn't care how PriceTag is implemented internally.

One-liner

Integration tests give the best ROI for UI — they test what the user experiences, not how you built it. I optimize my team's effort toward the integration layer, keep unit tests for isolated logic, and use e2e sparingly for critical user journeys.

Source: Kent C. Dodds — The Testing Trophy and Testing Classifications · Kent C. Dodds — Write Tests, Not Too Many, Mostly Integration

Unit vs integration — the tool isn't the boundary, the style is

Jest and Vitest are test runners — they can run both unit tests and integration tests. The type is determined by how you test, not which runner you use.

What you're testing	How dependencies are handled	Tool combo	Classification
Pure function or utility	No dependencies — nothing to mock	Jest / Vitest alone	Unit
React component	Child components mocked with `jest.mock()`	Jest / Vitest + RTL	Unit
React component + its real children	No mocking — real children render, user events flow through the whole tree	Jest / Vitest + RTL	Integration
Full app	Real browser, real (or stubbed) network	Playwright / Cypress	e2e

The boundary between unit and integration is isolation — not the tool, not the DOM, not whether it's a React component. A unit test mocks its dependencies so only one piece of code is under test. An integration test lets real dependencies run and verifies they work together correctly.

Unit test example — pure utility function

// formatPrice.ts
export function formatPrice(amount: number, currency: string) {
  return new Intl.NumberFormat('en-US', { style: 'currency', currency })
    .format(amount);
}

// formatPrice.test.ts — pure unit test, no DOM, no React
it('formats USD correctly', () => {
  expect(formatPrice(1234.5, 'USD')).toBe('$1,234.50');
});

it('formats EUR correctly', () => {
  expect(formatPrice(500, 'EUR')).toBe('€500.00');
});

No DOM, no render, no React — just input → output. Fast, deterministic, zero setup. This is the natural home for utility logic.

Unit test example — React component in isolation

When a component has child dependencies, unit-test it by mocking those children with jest.mock(). This isolates the component under test from its subtree — you're only verifying its own logic (what it renders based on props, which callbacks it calls).

// ProductCard.tsx — depends on two child components
import { PriceTag } from './PriceTag';
import { WishlistButton } from './WishlistButton';

type ProductCardProps = { name: string; price: number; onWishlist: () => void };

export function ProductCard({ name, price, onWishlist }: ProductCardProps) {
  return (
    <div>
      <h3>{name}</h3>
      <PriceTag amount={price} />
      <WishlistButton onClick={onWishlist} />
    </div>
  );
}

// ProductCard.test.tsx — unit test: children are mocked out
import { render, screen } from '@testing-library/react';
import { ProductCard } from './ProductCard';

// Replace real children with stubs — only ProductCard's own logic is under test
jest.mock('./PriceTag', () => ({
  PriceTag: ({ amount }) => <span data-testid="price">{amount}</span>,
}));
jest.mock('./WishlistButton', () => ({
  WishlistButton: ({ onClick }) => <button onClick={onClick}>wish</button>,
}));

it('renders the product name', () => {
  render(<ProductCard name="Deluxe Room" price={120} onWishlist={jest.fn()} />);
  expect(screen.getByText('Deluxe Room')).toBeInTheDocument();
});

it('passes price down to PriceTag', () => {
  render(<ProductCard name="Deluxe Room" price={120} onWishlist={jest.fn()} />);
  expect(screen.getByTestId('price')).toHaveTextContent('120');
});

it('calls onWishlist when wishlist button is clicked', async () => {
  const onWishlist = jest.fn();
  render(<ProductCard name="Deluxe Room" price={120} onWishlist={onWishlist} />);
  screen.getByText('wish').click();
  expect(onWishlist).toHaveBeenCalledTimes(1);
});

PriceTag and WishlistButton are replaced with dumb stubs — their real implementations never run. The test only verifies ProductCard's own behaviour: does it render the name? Does it pass the right prop to the price stub? Does it wire up the callback? That's a unit test of a React component.

Integration test example — same ProductCard, no mocks

Same component, same runner — but now no jest.mock(). Real PriceTag and WishlistButton render, user events flow through the whole tree, and assertions are on what the user sees.

// ProductCard.integration.test.tsx: integration test in its own file, children are NOT mocked
import { render, screen } from '@testing-library/react';
import userEvent from '@testing-library/user-event';
import { ProductCard } from './ProductCard';

// No jest.mock() — PriceTag and WishlistButton render for real

it('shows formatted price from PriceTag', () => {
  render(<ProductCard name="Deluxe Room" price={120} onWishlist={jest.fn()} />);
  // PriceTag's real formatting logic runs — we assert on the final visible output
  expect(screen.getByText('$120.00')).toBeInTheDocument();
});

it('adds to wishlist when button is clicked', async () => {
  const user = userEvent.setup();
  const onWishlist = jest.fn();
  render(<ProductCard name="Deluxe Room" price={120} onWishlist={onWishlist} />);

  await user.click(screen.getByRole('button', { name: /add to wishlist/i }));
  expect(onWishlist).toHaveBeenCalledTimes(1);
});

Same component, same runner: the only difference from the unit test above is the absence of jest.mock(). Because jest.mock() is hoisted and applies to the whole file, the mocked and unmocked versions need to live in separate test files. PriceTag's real formatting runs, WishlistButton's real markup renders. If you refactor ProductCard to pass price differently, the test still passes as long as the user still sees $120.00. It never cares about internal props or state.

	Unit test	Integration test
`jest.mock()`?	Yes — children stubbed out	No — real children render
What runs?	Only `ProductCard`'s own logic	`ProductCard` + `PriceTag` + `WishlistButton`
Assertion	`data-testid="price"` on the stub	`$120.00` from real `PriceTag` formatting
Survives refactor of children?	No — stubs are hardcoded	Yes — only the visible output matters

Common mistake: confusing the runner with the test type. Jest/Vitest run both unit and integration tests. RTL can be used for both too — with jest.mock() it's a unit test, without it's an integration test. The only thing that determines the layer is isolation: are dependencies mocked (unit) or real (integration)?

2 Testing Library philosophy — test behavior, not implementation

"The more your tests resemble the way your software is used, the more confidence they can give you." — Testing Library guiding principles.

The practical consequence: never query by CSS class or internal state. Query the way a real user (or screen reader) would find an element.

Query priority

getByRole

Matches ARIA roles — button, textbox, heading. Tests accessibility semantics at the same time. First choice always.

getByLabelText

Form inputs linked to a label. Fails if labels are missing — that's a feature, not a bug.

getByPlaceholderText

Fallback for inputs that have no label. Placeholder text is not an accessible label.

getByText

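Visible text content. Best for non-interactive elements (paragraphs, spans, headings); for buttons and links, prefer getByRole with a name and fall back to getByText only where the role is ambiguous.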

getByTestId

Last resort — when no other query is practical. Adds coupling to HTML that users can't see. Prefer roles.

getBy* vs findBy* vs queryBy*

Family	Behaviour	When to use
`getBy*`	Synchronous, throws if not found	Element exists immediately (already rendered)
`findBy*`	Returns a Promise, retries until found or timeout	After async operations — data load, state update. Always use this for async.
`queryBy*`	Synchronous, returns null if not found	Asserting element does NOT exist: `expect(queryBy…).not.toBeInTheDocument()`
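
The three contracts in the table can be sketched as a toy model in plain TypeScript. This is not Testing Library's implementation: `Lookup`, `getBy`, `queryBy`, and `findBy` here are hypothetical stand-ins, and the 1000 ms default is chosen to mirror the library's documented default timeout.

```typescript
// Toy model of the three contracts; `lookup` stands in for a DOM query and returns null on a miss.
type Lookup<T> = () => T | null;

// getBy*: check once, throw on a miss.
function getBy<T>(lookup: Lookup<T>): T {
  const found = lookup();
  if (found === null) throw new Error('Unable to find element');
  return found;
}

// queryBy*: check once, return null on a miss (useful for asserting absence).
function queryBy<T>(lookup: Lookup<T>): T | null {
  return lookup();
}

// findBy*: keep checking until the element appears or the timeout elapses.
async function findBy<T>(lookup: Lookup<T>, timeoutMs = 1000, intervalMs = 50): Promise<T> {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    const found = lookup();
    if (found !== null) return found;
    if (Date.now() >= deadline) throw new Error('Unable to find element (timed out)');
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}

// Simulate results that "render" only after an async data load.
let results: string | null = null;
setTimeout(() => { results = 'Hotels in Bangkok'; }, 100);

console.log(queryBy(() => results)); // null: nothing rendered yet
try {
  getBy(() => results);
} catch (e) {
  console.log((e as Error).message); // the same single check, but it throws
}
findBy(() => results).then((text) => console.log(text)); // resolves once the data arrives
```

The single check in `getBy` is what makes it race against async rendering: whether it passes depends on timing, which is the usual root of "Unable to find element" flakiness.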

userEvent vs fireEvent

fireEvent.click(button) dispatches a single synthetic click event. userEvent.click(button) simulates the full event sequence a real browser fires: pointerdown → mousedown → focus → pointerup → mouseup → click. Always use userEvent — it catches bugs that fireEvent misses (handlers that listen to mousedown instead of click).

import { render, screen } from '@testing-library/react';
import userEvent from '@testing-library/user-event';
import { SearchForm } from './SearchForm'; // assumed path, for illustration

test('submits the search form', async () => {
  const user = userEvent.setup(); // v14+: create the user session before rendering
  render(<SearchForm />);

  await user.type(screen.getByRole('textbox', { name: /destination/i }), 'Bangkok');
  await user.click(screen.getByRole('button', { name: /search/i }));

  // findBy* because results load async
  expect(await screen.findByText(/hotels in Bangkok/i)).toBeInTheDocument();
});
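
To see why the fuller event sequence matters, here is a toy model in plain TypeScript. It does not use Testing Library or a DOM: `FakeElement`, `fireClickOnly`, `userLikeClick`, and `createDropdown` are hypothetical names, and the event order is simply the one listed above.

```typescript
// Toy model only: FakeElement stands in for a DOM node, with no browser or jsdom involved.
type Listener = () => void;

class FakeElement {
  private listeners: Record<string, Listener[]> = {};

  on(type: string, listener: Listener): void {
    const existing = this.listeners[type] ?? [];
    this.listeners[type] = [...existing, listener];
  }

  dispatch(type: string): void {
    for (const listener of this.listeners[type] ?? []) listener();
  }
}

// Mirrors the idea of fireEvent.click: exactly one event is dispatched.
function fireClickOnly(el: FakeElement): void {
  el.dispatch('click');
}

// Mirrors the idea of userEvent.click: the whole sequence a user would cause.
const CLICK_SEQUENCE = ['pointerdown', 'mousedown', 'focus', 'pointerup', 'mouseup', 'click'];

function userLikeClick(el: FakeElement): void {
  for (const type of CLICK_SEQUENCE) el.dispatch(type);
}

// A dropdown that opens on mousedown, a common pattern in custom selects.
function createDropdown(): { el: FakeElement; isOpen: () => boolean } {
  const el = new FakeElement();
  let open = false;
  el.on('mousedown', () => { open = true; });
  return { el, isOpen: () => open };
}

const viaFireEvent = createDropdown();
fireClickOnly(viaFireEvent.el);

const viaUserEvent = createDropdown();
userLikeClick(viaUserEvent.el);

console.log(viaFireEvent.isOpen()); // false: the mousedown handler never ran
console.log(viaUserEvent.isOpen()); // true: the full sequence reached it
```

With a click-only dispatch the test no longer reflects what a user would experience, so it can fail on working code or pass on broken code.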

3 The tooling landscape

Need	Tool	Key distinction
Unit + integration runner	Vitest (preferred) / Jest	Vitest is ESM-native and integrates with Vite; it is often reported as 2–5× faster, though the gain depends on the project. Jest has the larger ecosystem. Vitest's `expect` API is Jest-compatible, so assertions carry over.
Component interaction	React Testing Library	Renders real DOM via jsdom. Tests user behaviour, not internals. No Enzyme — Enzyme tests implementation details.
Network mocking	MSW (Mock Service Worker)	Intercepts at the network level (real fetch/axios calls hit MSW, not mocked modules). Same handlers for tests AND browser dev. No axios mock → realistic.
e2e	Playwright / Cypress	Playwright: multi-browser, multi-tab, built-in parallelisation and sharding, strong CI story. Cypress: excellent DX and time-travel debugging, but no multi-tab support and usually slower in CI. Playwright is the modern default.
Visual regression	Chromatic / Percy / Applitools	Screenshot-diff on Storybook stories or pages. Catches unintended CSS changes. Applitools uses AI diffing for cross-browser scale. Useful for design systems (L15).
Component docs + interaction tests	Storybook + play()	`play()` functions use Testing Library inside stories. Documents AND tests simultaneously.

Why MSW beats mocking axios/fetch modules

// ❌ Module mock: brittle, only covers code that calls axios, not fetch/swr/react-query
import axios from 'axios';
jest.mock('axios');
(axios.get as jest.Mock).mockResolvedValue({ data: hotels });

// ✅ MSW (v2 API): intercepts real network requests, works for any fetching library
import { http, HttpResponse } from 'msw';
import { server } from './mocks/server';

beforeAll(() => server.listen());
afterEach(() => server.resetHandlers());
afterAll(() => server.close());

it('handles a failed hotel request', async () => {
  // Override the default handler for this test only; resetHandlers() restores it afterwards
  server.use(
    http.get('/api/hotels', () => HttpResponse.error()) // simulates a network error
  );
  // ...render the component and assert on the error UI
});

One-liner

MSW is the single biggest testing-stack upgrade most teams haven't made. It makes your tests realistic without coupling to a fetch implementation, and the same handlers work in the browser for local development. I set it up once as shared infra — every team inherits it.

4 Flaky tests — the #1 Lead topic in testing

A flaky test is worse than no test — it teaches your team to ignore red CI and ships false confidence. Eliminating flakiness is a Lead-level discipline, not a "fix it when it bothers you" chore.

Timing

Using getBy* for elements that appear asynchronously. Fix: always await findBy* or await waitFor(…) after any state update, data fetch, or animation.

Test order

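One test mutates shared state (global store, module singleton, DOM) and the next test depends on it. Fix: reset shared state in afterEach. RTL's cleanup() auto-unmounts rendered trees between tests when the runner exposes a global afterEach (Jest does by default; Vitest needs globals: true or an explicit cleanup call), but stores and singletons still need their own reset.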

Dates & time

new Date() returns a different value on every run, so assertions that depend on "today" drift between CI runs. Fix: vi.useFakeTimers() + vi.setSystemTime() (in Jest: jest.useFakeTimers() + jest.setSystemTime()). Restore real timers in afterEach.

Real network

Tests hit real APIs — slow, non-deterministic, auth-dependent. Fix: MSW (§3) for unit/integration; stub at the network layer in e2e (Playwright route.fulfill()).

Animation / transition

Element is in the DOM but not visually stable. Fix: disable animations in tests (jsdom does not implement matchMedia, so mock it to report prefers-reduced-motion: reduce; in Playwright, set reducedMotion: 'reduce'), or wait for the final state with waitFor.

Selector fragility

Tests select by CSS class or XPath that changes with refactors. Fix: use getByRole first, with data-testid only as a last resort. In Playwright: ARIA locators (getByRole, getByLabel) over CSS selectors.

Fleet-wide flakiness management (the Lead answer)

Quarantine, don't delete. Mark flaky tests with a skip + open a P2 ticket. Deleting removes coverage; skipping preserves intent while stopping the noise.
Track a flakiness rate per suite per week. If it's above ~2%, it's a team problem, not a test problem.
Fix sprints. Dedicate 20% of a sprint to flakiness reduction when the rate climbs. Treat escaped flaky tests like production bugs.
Retry is a last resort, not a fix. Playwright/Jest retry hides root causes. One retry = OK for intermittent network. Three retries = the test is broken and you're hiding it.
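
The text does not define how the flakiness rate is computed, so the sketch below assumes one possible definition: the share of (test, commit) pairs in a suite that both passed and failed on the same code. The `TestRun` shape and all function names are hypothetical; the 2% threshold is the one from the list above.

```typescript
// Hypothetical shape of one CI test execution record.
interface TestRun {
  suite: string;
  testName: string;
  commitSha: string;
  passed: boolean;
}

// A test counts as flaky on a commit when it both passed and failed on the same code.
function flakinessRateBySuite(runs: TestRun[]): Record<string, number> {
  const outcomes: Record<string, { suite: string; sawPass: boolean; sawFail: boolean }> = {};
  for (const run of runs) {
    const key = `${run.suite}::${run.testName}::${run.commitSha}`;
    const entry = outcomes[key] ?? { suite: run.suite, sawPass: false, sawFail: false };
    if (run.passed) entry.sawPass = true;
    else entry.sawFail = true;
    outcomes[key] = entry;
  }

  const totals: Record<string, { flaky: number; all: number }> = {};
  for (const key of Object.keys(outcomes)) {
    const { suite, sawPass, sawFail } = outcomes[key];
    const total = totals[suite] ?? { flaky: 0, all: 0 };
    total.all += 1;
    if (sawPass && sawFail) total.flaky += 1;
    totals[suite] = total;
  }

  const rates: Record<string, number> = {};
  for (const suite of Object.keys(totals)) {
    rates[suite] = totals[suite].flaky / totals[suite].all;
  }
  return rates;
}

const FLAKINESS_THRESHOLD = 0.02; // the ~2% line from the text

function suitesNeedingAttention(runs: TestRun[]): string[] {
  const rates = flakinessRateBySuite(runs);
  return Object.keys(rates)
    .filter((suite) => rates[suite] > FLAKINESS_THRESHOLD)
    .sort();
}

const weekOfRuns: TestRun[] = [
  { suite: 'checkout', testName: 'pays with card', commitSha: 'abc', passed: false },
  { suite: 'checkout', testName: 'pays with card', commitSha: 'abc', passed: true }, // retry passed: flaky
  { suite: 'checkout', testName: 'shows total', commitSha: 'abc', passed: true },
  { suite: 'search', testName: 'lists hotels', commitSha: 'abc', passed: true },
];

console.log(flakinessRateBySuite(weekOfRuns)); // { checkout: 0.5, search: 0 }
console.log(suitesNeedingAttention(weekOfRuns)); // [ 'checkout' ]
```

Counting a pass-after-retry as flaky is deliberate here: it keeps retries from hiding the signal.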

The "just add retries" trap: CI retries make the dashboard green but don't fix flaky tests. They slow the suite (failed attempts still run), mask real intermittent failures (actual bugs!), and erode team trust in red CI. A Lead sees a retry count >1 as a signal to investigate, not a solved problem.

5 e2e strategy at scale

e2e tests are often 10–100× slower than integration tests and noticeably more prone to flakiness, because they involve a real browser, a running app, and the network. The Lead's goal is to keep the e2e suite small, fast, and trusted.

What belongs in e2e

Critical user journeys

Search → select → book → confirm. The 20% of paths that are 80% of business value. Not every component interaction — those belong in integration tests.

Scaling technique

Parallelisation + sharding

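Playwright's --shard option splits the suite across CI machines (by test file, by default), while workers parallelise within each machine. With an even split, a 30-min suite on 1 machine drops to roughly 6 min on 5. Total compute stays about the same; the signal arrives 5× faster. Essential for teams with >50 e2e tests.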

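// playwright.config.ts: per-machine settings; sharding itself is chosen per CI job
import { defineConfig } from '@playwright/test';

export default defineConfig({
  workers: process.env.CI ? 4 : undefined, // parallel workers within one machine
  retries: process.env.CI ? 1 : 0, // one retry in CI, zero locally
  use: {
    baseURL: 'http://localhost:3000',
  },
  // Each of the 5 CI jobs runs one shard: playwright test --shard=1/5, 2/5, 3/5 …
});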
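
The sharding arithmetic can be sketched in a few lines. This is a toy round-robin split, not Playwright's actual distribution algorithm; `filesForShard`, `idealWallClockMinutes`, and the spec file names are hypothetical.

```typescript
// Toy round-robin split: NOT Playwright's actual sharding algorithm.
function filesForShard(files: string[], shardIndex: number, shardTotal: number): string[] {
  if (shardTotal < 1 || shardIndex < 1 || shardIndex > shardTotal) {
    throw new RangeError(`invalid shard ${shardIndex}/${shardTotal}`);
  }
  // shardIndex is 1-based to match the --shard=1/5 notation
  return files.filter((_, i) => i % shardTotal === shardIndex - 1);
}

// Best-case wall-clock time, assuming an even split and negligible setup overhead.
function idealWallClockMinutes(totalMinutes: number, shardTotal: number): number {
  return totalMinutes / shardTotal;
}

const specFiles = [
  'search.spec.ts', 'select.spec.ts', 'book.spec.ts', 'confirm.spec.ts',
  'cancel.spec.ts', 'login.spec.ts', 'profile.spec.ts',
];

console.log(filesForShard(specFiles, 1, 5)); // [ 'search.spec.ts', 'login.spec.ts' ]
console.log(filesForShard(specFiles, 5, 5)); // [ 'cancel.spec.ts' ]
console.log(idealWallClockMinutes(30, 5)); // 6
```

In practice the wall-clock time is set by the slowest shard, so one very long spec file can erase much of the gain.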

One-liner

e2e is expensive — keep it small and sharded. I use integration tests for component behaviour and e2e only for the critical booking journeys. If the e2e suite takes over 10 minutes, I move tests down to integration level.

6 Visual regression testing

Visual regression tests capture a baseline screenshot of a component or page, then compare it against every subsequent PR. Pixel diffs surface unintended CSS changes — color, spacing, typography, layout — before they reach production.
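
As a rough illustration of what a pixel diff measures, the sketch below compares two tiny images stored as flat RGBA arrays. It is a toy model under stated assumptions, not any tool's algorithm: `diffPixelRatio` and `channelTolerance` are hypothetical names, and real tools layer anti-aliasing handling and perceptual colour distance on top of this idea.

```typescript
// Toy pixel diff over flat RGBA arrays (4 values per pixel, each 0-255).
function diffPixelRatio(baseline: number[], candidate: number[], channelTolerance = 0): number {
  if (baseline.length !== candidate.length || baseline.length % 4 !== 0) {
    throw new Error('images must have the same dimensions and be RGBA');
  }
  const pixelCount = baseline.length / 4;
  if (pixelCount === 0) return 0;

  let changed = 0;
  for (let p = 0; p < pixelCount; p++) {
    for (let c = 0; c < 4; c++) {
      const i = p * 4 + c;
      if (Math.abs(baseline[i] - candidate[i]) > channelTolerance) {
        changed++;
        break; // count each pixel at most once
      }
    }
  }
  return changed / pixelCount;
}

// 2x1 image: a red pixel and a white pixel
const baselineShot = [255, 0, 0, 255, 255, 255, 255, 255];
// Same image, but the red shifted slightly (say, an upstream colour token changed)
const prShot = [250, 0, 0, 255, 255, 255, 255, 255];

console.log(diffPixelRatio(baselineShot, prShot)); // 0.5: one of two pixels differs
console.log(diffPixelRatio(baselineShot, prShot, 8)); // 0: within tolerance
```

The tolerance parameter shows the core trade-off: too strict and rendering noise produces false positives, too loose and subtle regressions slip through.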

What visual regression catches

CSS regressions

An upstream CSS change subtly shifts a button's padding or changes a font weight. Logic tests pass — the component renders and clicks. Visual regression catches it because the screenshot differs.

What it does NOT catch

Functional bugs

Wrong data, wrong click handlers, broken async states — those are integration/e2e territory. Visual regression is a CSS-correctness tool, not a behaviour tool.

Tooling

Tool	How it works	Best for
Chromatic	Cloud service built on Storybook. Renders every story in a cloud browser, diffs it against the baseline, and posts a PR check with the changes highlighted. Reviewers accept or reject changes per story.	Design systems, component libraries with many consumers (L15)
Percy (BrowserStack)	Same cloud-screenshot-diff model as Chromatic, browser-agnostic. Integrates with Playwright/Cypress/Storybook.	Full-page e2e visual regression
Applitools	AI-based visual diff ("Visual AI") — ignores rendering noise (antialiasing, font hinting) and only flags meaningful visual changes. Multi-browser grid runs the same test across browsers/devices in parallel.	Cross-browser visual regression at scale; enterprise teams needing smarter diff (fewer false positives than pixel diffing)
Playwright built-in	`expect(page).toHaveScreenshot()` — stores PNG baselines in the repo, diffs locally. No external service, free, works in CI.	Teams that want visual regression without a SaaS dependency

Storybook + play() — document AND test simultaneously

A Storybook play() function runs Testing Library interactions inside a story. Chromatic captures the visual output of each interaction state. One story gives you: documentation, interactive demo, visual regression baseline, and an interaction test — all from the same source.

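// HotelCard.stories.tsx
import type { Meta, StoryObj } from '@storybook/react';
import { userEvent, within } from '@storybook/test'; // Storybook 8; the package name differs in other versions
import { HotelCard } from './HotelCard'; // assumed path, for illustration

const meta: Meta<typeof HotelCard> = { component: HotelCard };
export default meta;
type Story = StoryObj<typeof HotelCard>;

export const WithFavourite: Story = {
  play: async ({ canvasElement }) => {
    const canvas = within(canvasElement);
    await userEvent.click(canvas.getByRole('button', { name: /save hotel/i }));
    // Chromatic screenshots the filled-heart state, so visual regression is included
  },
};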

Visual regression trade-off: it requires human sign-off. Every intentional design change produces a diff that CI flags as a failure, and engineers must review and approve each one. The risk is that teams learn to click "accept all" without looking, which defeats the purpose. Mitigation: run visual regression as an informational CI check (not a hard gate), and scope it to your design system rather than screenshotting every page.

One-liner

Visual regression is most valuable for design systems and shared component libraries — the components 30 teams depend on. I run it as an informational CI check via Chromatic on Storybook stories. It catches the CSS regression nobody intended to ship.

7 Coverage — signal not target

100% coverage is a vanity metric. You can cover every line with tests that make no meaningful assertions:

test('renders without crashing', () => {
  render(<PaymentForm />); // every executed line counts as covered, yet no behaviour is asserted
});

	Coverage as signal	Coverage as target (wrong)
Goal	Find untested critical paths, error states, edge cases	Reach a number — e.g. 80% or 100%
Outcome	Tests that catch real bugs	Tests written to move the number (assertion-free, trivial renders)
Metric	Escaped defects — bugs that shipped without test coverage	Coverage % in the CI badge

Practical floor: Set a coverage gate at a low floor (60–70%) to catch cases where a whole feature ships with zero tests. Investigate drops, don't chase the ceiling. Coverage of critical payment and auth paths should be high by intent — not because the gate forced it.
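
A low floor like this can be expressed with Jest's `coverageThreshold` option. The fragment below is a minimal sketch assuming a `jest.config.ts` file; the 60% value is the lower bound mentioned above, not a Jest default. Vitest has a comparable thresholds setting under its coverage options.

```typescript
// jest.config.ts: a low global floor; Jest fails the run when coverage drops below it.
const config = {
  collectCoverage: true,
  coverageThreshold: {
    global: {
      lines: 60, // illustrative floor from the text: catches features shipped with zero tests
    },
  },
};

export default config;
```

Keeping the gate global and low matches the intent above: it flags a feature that ships untested without pushing anyone to write assertion-free tests to reach a high number.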

8 Testing in CI/CD — gating and culture

The CI signal hierarchy

Check type	Gate merge?	Reasoning
TypeScript + lint (static)	Yes — always	Cheapest signal, zero flakiness
Unit + integration	Yes	Fast (seconds to minutes), high ROI
e2e (critical paths)	Yes on main; informational on PRs	Slow — block merges to main, but run async on PRs to avoid blocking dev
Quarantined / flaky	Never	Informational only until fixed
Visual regression	Informational (human review)	Catches unintended changes, but requires human sign-off
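
The hierarchy in the table can be written down as a small decision function, which is roughly what a merge-gate script would encode. This is a sketch only: `CheckType`, `Target`, `isGating`, and `blockingFailures` are hypothetical names, not part of any CI system's API.

```typescript
type CheckType = 'static' | 'unit-integration' | 'e2e' | 'quarantined' | 'visual-regression';
type Target = 'pull-request' | 'main';

interface CheckResult {
  type: CheckType;
  passed: boolean;
}

// Encodes the table: which failing checks are allowed to block a merge.
function isGating(type: CheckType, target: Target): boolean {
  switch (type) {
    case 'static':
    case 'unit-integration':
      return true;
    case 'e2e':
      return target === 'main'; // informational on PRs
    case 'quarantined':
    case 'visual-regression':
      return false; // informational; visual diffs go to human review
  }
}

function blockingFailures(results: CheckResult[], target: Target): CheckType[] {
  return results
    .filter((r) => !r.passed && isGating(r.type, target))
    .map((r) => r.type);
}

const checkResults: CheckResult[] = [
  { type: 'static', passed: true },
  { type: 'unit-integration', passed: true },
  { type: 'e2e', passed: false },
  { type: 'quarantined', passed: false },
];

console.log(blockingFailures(checkResults, 'pull-request')); // []
console.log(blockingFailures(checkResults, 'main')); // [ 'e2e' ]
```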

Raising testing culture from zero

Mandating tests creates resentment. Making tests easy creates adoption. The Lead playbook:

Pair on the first test per team. Sit with an engineer and write one test together for their feature. Remove the blank-page problem.
Shared test infra. Create a test-utils package with pre-configured render wrappers (providers, MSW server, custom queries). Writing a test should start with one import, not 30 lines of setup.
Track escaped defects. Every post-mortem asks "was there a test that could have caught this?" Over a quarter, patterns emerge — write tests for those patterns.
Celebrate, don't mandate. Recognise PRs that add meaningful test coverage. Review tests as carefully as code in PR review.
Set a floor, not a ceiling. A coverage gate at a low floor prevents zero-coverage features. Stopping the team from chasing 100% is the Lead's job.

Full loop

Concept: testing culture is infrastructure. The same way you invest in CI pipelines, you invest in shared test helpers, flakiness dashboards, and pairing rituals.

Trade-off: strict coverage gates raise the bar but create gaming behaviour (assertion-free tests), so I set a low floor gate and track escaped defects, a metric that can't be gamed.

Anchor: "We had a 12% flakiness rate and engineers stopped trusting CI; I ran a sprint that quarantined 40 flaky tests, root-caused them (mostly missing MSW + getBy* on async elements), and got us under 1% in 4 weeks. Deploy confidence went up measurably."

Impact: faster PRs, fewer reverts, more confident feature flags and deploys. That's the outcome the business cares about, not the test count.

Invite: "I'd weigh the investment differently for a team under extreme delivery pressure: start with a minimal shared MSW setup and one integration test per critical path rather than the full rollout."

9 Check yourself: scenario quiz


1. A junior engineer asks "should we unit-test every React component — one test per function?" What's the Lead answer?

2. You need to select the "Search hotels" submit button in a Testing Library test. Which query is best?

3. A test intermittently fails with "Unable to find element" on an element that should appear after a data fetch. What's the fix?

Current code: const el = screen.getByText(/Hotel results/i);

4. Your team mocks axios in every test file. A colleague proposes switching to MSW. What's the case for the switch?

5. A test is flaky — it passes 9 out of 10 times. Your CI pipeline is set to retry failing tests up to 3 times. After adding the retry, the dashboard goes green. Is this resolved?

6. Your team's coverage gate is at 80%. An engineer opens a PR that drops coverage from 82% to 79%, blocking merge. The PR is a critical security fix. What's the Lead response?

7. A team that has no tests asks you how to start. What's the first action?

8. Which metric best measures whether your testing investment is paying off — beyond coverage %?

9. Your design system's visual regression CI check flags a diff every time a designer makes an intentional spacing change, so engineers start clicking "accept all" without reviewing diffs. What's the Lead fix?

Out-loud drill — before next session

"The platform ships a hotel booking flow across 40 markets. Walk me through the testing strategy you'd set as Lead — what types of tests at each layer, tooling choices, how you'd handle flakiness, and how you'd raise the testing culture from near-zero."

Target: ~3 minutes. Hit: trophy model → RTL+MSW for integration → Playwright sharded for e2e → flakiness quarantine approach → escaped-defects metric → shared test-utils as culture lever.

Good follow-up topics:

Testing async hooks in depth
Contract testing with Pact
Accessibility testing (axe + jest-axe)
Testing React context / providers
Playwright vs Cypress: detailed comparison
Visual regression: Playwright toHaveScreenshot() setup
Storybook play() interaction tests
How to test server actions / RSC