From testing pyramid to testing culture — how a Lead frames test ROI, eliminates flakiness, and makes testing a team discipline.
The classic pyramid says: many unit tests at the bottom, some integration tests in the middle, few e2e tests at the top. It is a good mental model for backends where units are pure functions. For UI, it over-indexes on unit tests that break on every refactor without catching real bugs.
Kent C. Dodds proposed the testing trophy: the biggest investment is in integration tests, because they give the most ROI for UI code.
Testing trophy (UI ROI order)
Unit tests verify one component in isolation — children are mocked, so they can't catch bugs that come from components working together. Integration tests render the real tree, simulate user events, and assert on what the user sees — they catch wiring bugs and survive refactors. An integration test checking "the formatted price is visible after render" doesn't care how PriceTag is implemented internally.
Integration tests give the best ROI for UI — they test what the user experiences, not how you built it. I optimize my team's effort toward the integration layer, keep unit tests for isolated logic, and use e2e sparingly for critical user journeys.
Jest and Vitest are test runners — they can run both unit tests and integration tests. The type is determined by how you test, not which runner you use.
| What you're testing | How dependencies are handled | Tool combo | Classification |
|---|---|---|---|
| Pure function or utility | No dependencies — nothing to mock | Jest / Vitest alone | Unit |
| React component | Child components mocked with jest.mock() | Jest / Vitest + RTL | Unit |
| React component + its real children | No mocking — real children render, user events flow through the whole tree | Jest / Vitest + RTL | Integration |
| Full app | Real browser, real (or stubbed) network | Playwright / Cypress | e2e |
The boundary between unit and integration is isolation — not the tool, not the DOM, not whether it's a React component. A unit test mocks its dependencies so only one piece of code is under test. An integration test lets real dependencies run and verifies they work together correctly.
```ts
// formatPrice.ts
export function formatPrice(amount: number, currency: string) {
  return new Intl.NumberFormat('en-US', { style: 'currency', currency }).format(amount);
}

// formatPrice.test.ts — pure unit test, no DOM, no React
import { formatPrice } from './formatPrice';

it('formats USD correctly', () => {
  expect(formatPrice(1234.5, 'USD')).toBe('$1,234.50');
});

it('formats EUR correctly', () => {
  expect(formatPrice(500, 'EUR')).toBe('€500.00');
});
```
No DOM, no render, no React — just input → output. Fast, deterministic, zero setup. This is the natural home for utility logic.
When a component has child dependencies, unit-test it by mocking those children with jest.mock(). This isolates the component under test from its subtree — you're only verifying its own logic (what it renders based on props, which callbacks it calls).
```tsx
// ProductCard.tsx — depends on two child components
import { PriceTag } from './PriceTag';
import { WishlistButton } from './WishlistButton';

type ProductCardProps = { name: string; price: number; onWishlist: () => void };

export function ProductCard({ name, price, onWishlist }: ProductCardProps) {
  return (
    <div>
      <h3>{name}</h3>
      <PriceTag amount={price} />
      <WishlistButton onClick={onWishlist} />
    </div>
  );
}

// ProductCard.test.tsx — unit test: children are mocked out
import { render, screen } from '@testing-library/react';
import userEvent from '@testing-library/user-event';
import { ProductCard } from './ProductCard';

// Replace real children with stubs — only ProductCard's own logic is under test
jest.mock('./PriceTag', () => ({
  PriceTag: ({ amount }: { amount: number }) => <span data-testid="price">{amount}</span>,
}));
jest.mock('./WishlistButton', () => ({
  WishlistButton: ({ onClick }: { onClick: () => void }) => <button onClick={onClick}>wish</button>,
}));

it('renders the product name', () => {
  render(<ProductCard name="Deluxe Room" price={120} onWishlist={jest.fn()} />);
  expect(screen.getByText('Deluxe Room')).toBeInTheDocument();
});

it('passes price down to PriceTag', () => {
  render(<ProductCard name="Deluxe Room" price={120} onWishlist={jest.fn()} />);
  expect(screen.getByTestId('price')).toHaveTextContent('120');
});

it('calls onWishlist when wishlist button is clicked', async () => {
  const user = userEvent.setup();
  const onWishlist = jest.fn();
  render(<ProductCard name="Deluxe Room" price={120} onWishlist={onWishlist} />);
  await user.click(screen.getByText('wish'));
  expect(onWishlist).toHaveBeenCalledTimes(1);
});
```
PriceTag and WishlistButton are replaced with dumb stubs — their real implementations never run. The test only verifies ProductCard's own behaviour: does it render the name? Does it pass the right prop to the price stub? Does it wire up the callback? That's a unit test of a React component.
Same component, same runner — but now no jest.mock(). Real PriceTag and WishlistButton render, user events flow through the whole tree, and assertions are on what the user sees.
```tsx
// ProductCard.test.tsx — integration test, children are NOT mocked
import { render, screen } from '@testing-library/react';
import userEvent from '@testing-library/user-event';
import { ProductCard } from './ProductCard';

// No jest.mock() — PriceTag and WishlistButton render for real
it('shows formatted price from PriceTag', () => {
  render(<ProductCard name="Deluxe Room" price={120} onWishlist={jest.fn()} />);
  // PriceTag's real formatting logic runs — we assert on the final visible output
  expect(screen.getByText('$120.00')).toBeInTheDocument();
});

it('adds to wishlist when button is clicked', async () => {
  const user = userEvent.setup();
  const onWishlist = jest.fn();
  render(<ProductCard name="Deluxe Room" price={120} onWishlist={onWishlist} />);
  await user.click(screen.getByRole('button', { name: /add to wishlist/i }));
  expect(onWishlist).toHaveBeenCalledTimes(1);
});
```
Same file, same runner — the only difference from the unit test above is no jest.mock(). PriceTag's real formatting runs, WishlistButton's real markup renders. If you refactor ProductCard to pass price differently, the test still passes as long as the user still sees $120.00 — it never cares about internal props or state.
| | Unit test | Integration test |
|---|---|---|
| jest.mock()? | Yes — children stubbed out | No — real children render |
| What runs? | Only ProductCard's own logic | ProductCard + PriceTag + WishlistButton |
| Assertion | data-testid="price" on the stub | $120.00 from real PriceTag formatting |
| Survives refactor of children? | No — stubs are hardcoded | Yes — only the visible output matters |

With jest.mock() it's a unit test; without it, it's an integration test. The only thing that determines the layer is isolation: are dependencies mocked (unit) or real (integration)?
"The more your tests resemble the way your software is used, the more confidence they can give you." — Testing Library guiding principles.
The practical consequence: never query by CSS class or internal state. Query the way a real user (or screen reader) would find an element.
Query priority, most to least preferred: getByRole, getByLabelText, getByPlaceholderText, getByText, getByTestId.

| Family | Behaviour | When to use |
|---|---|---|
| getBy* | Synchronous, throws if not found | Element exists immediately (already rendered) |
| findBy* | Returns a Promise, retries until found or timeout | After async operations — data load, state update. Always use this for async. |
| queryBy* | Synchronous, returns null if not found | Asserting element does NOT exist: expect(queryBy…).not.toBeInTheDocument() |
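
To make the three behaviours in the table concrete, here is a minimal, dependency-free sketch. It is a simplified model and not Testing Library's implementation; the names `getNow`, `queryNow`, and `findEventually` are hypothetical, and the "DOM" is just a function that returns a value or null.

```typescript
// Simplified model of getBy* / queryBy* / findBy*; not Testing Library's code.
type Query<T> = () => T | null;

// getBy*-style: synchronous, throws if the element is not there right now.
function getNow<T>(query: Query<T>): T {
  const result = query();
  if (result === null) {
    throw new Error('Element not found');
  }
  return result;
}

// queryBy*-style: synchronous, returns null instead of throwing.
function queryNow<T>(query: Query<T>): T | null {
  return query();
}

// findBy*-style: retries the same query until it succeeds or the timeout expires.
function findEventually<T>(query: Query<T>, timeoutMs = 1000, intervalMs = 10): Promise<T> {
  const deadline = Date.now() + timeoutMs;
  return new Promise<T>((resolve, reject) => {
    const attempt = (): void => {
      const result = query();
      if (result !== null) {
        resolve(result);
      } else if (Date.now() >= deadline) {
        reject(new Error('Timed out waiting for element'));
      } else {
        setTimeout(attempt, intervalMs);
      }
    };
    attempt();
  });
}
```

If the value only appears after a data load, `getNow` throws immediately while `findEventually` keeps polling, which is why the synchronous family is a common source of flaky tests on async UI.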
fireEvent.click(button) dispatches a single synthetic click event. userEvent.click(button) simulates the full event sequence a real browser fires: pointerdown → mousedown → focus → pointerup → mouseup → click. Always use userEvent — it catches bugs that fireEvent misses (handlers that listen to mousedown instead of click).
```tsx
import { render, screen } from '@testing-library/react';
import userEvent from '@testing-library/user-event';
import { SearchForm } from './SearchForm'; // path assumed for illustration

test('submits the search form', async () => {
  render(<SearchForm />);
  const user = userEvent.setup(); // v14+: setup() for better async support
  await user.type(screen.getByRole('textbox', { name: /destination/i }), 'Bangkok');
  await user.click(screen.getByRole('button', { name: /search/i }));
  // findBy* because results load async
  expect(await screen.findByText(/hotels in Bangkok/i)).toBeInTheDocument();
});
```
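
The fireEvent versus userEvent difference described above can be modelled without a browser. The sketch below is an illustration only, not the internals of either library: it uses the global `EventTarget` and `Event` (available in Node and browsers), and `makeTrackedButton`, `singleClick`, and `fullClick` are hypothetical names.

```typescript
// Simplified model: a "button" whose handler listens to mousedown, not click.
function makeTrackedButton(): { target: EventTarget; mousedownCalls: () => number } {
  const target = new EventTarget();
  let calls = 0;
  target.addEventListener('mousedown', () => {
    calls += 1;
  });
  return { target, mousedownCalls: () => calls };
}

// fireEvent.click-style: dispatches one click event and nothing else.
function singleClick(target: EventTarget): void {
  target.dispatchEvent(new Event('click'));
}

// userEvent.click-style: dispatches the sequence listed in the text.
const CLICK_SEQUENCE = ['pointerdown', 'mousedown', 'focus', 'pointerup', 'mouseup', 'click'];

function fullClick(target: EventTarget): void {
  for (const type of CLICK_SEQUENCE) {
    target.dispatchEvent(new Event(type));
  }
}
```

With `singleClick` the mousedown handler never runs, so a test built on it would pass or fail for reasons a real user never hits; `fullClick` triggers the handler once.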
| Need | Tool | Key distinction |
|---|---|---|
| Unit + integration runner | Vitest (preferred) / Jest | Vitest is ESM-native, integrates with Vite, and is typically faster (often cited as 2–5×). Jest has the larger ecosystem. Both use the same assertion API (expect). |
| Component interaction | React Testing Library | Renders into a jsdom-simulated DOM. Tests user behaviour, not internals. No Enzyme — Enzyme tests implementation details. |
| Network mocking | MSW (Mock Service Worker) | Intercepts at the network level (real fetch/axios calls hit MSW, not mocked modules). Same handlers for tests AND browser dev. No axios mock → realistic. |
| e2e | Playwright / Cypress | Playwright: multi-browser, multi-tab, parallelisation, better CI. Cypress: DX/time-travel debugging, one browser and one tab per test, slower in CI. Playwright is the modern default. |
| Visual regression | Chromatic / Percy / Applitools | Screenshot-diff on Storybook stories or pages. Catches unintended CSS changes. Applitools uses AI diffing for cross-browser scale. Useful for design systems (L15). |
| Component docs + interaction tests | Storybook + play() | play() functions use Testing Library inside stories. Documents AND tests simultaneously. |
```ts
// ❌ Module mock — brittle, only tests axios, not fetch/swr/react-query
jest.mock('axios');
(axios.get as jest.Mock).mockResolvedValue({ data: hotels });

// ✅ MSW — intercepts real network, works for any fetching library
import { http, HttpResponse } from 'msw';
import { server } from './mocks/server';

beforeAll(() => server.listen());
afterEach(() => server.resetHandlers());
afterAll(() => server.close());

// Override a handler in a specific test for an error scenario
server.use(
  http.get('/api/hotels', () => HttpResponse.error())
);
```
MSW is the single biggest testing-stack upgrade most teams haven't made. It makes your tests realistic without coupling to a fetch implementation, and the same handlers work in the browser for local development. I set it up once as shared infra — every team inherits it.
A flaky test is worse than no test — it teaches your team to ignore red CI and ships false confidence. Eliminating flakiness is a Lead-level discipline, not a "fix it when it bothers you" chore.
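
One way to make "flaky" measurable before fixing individual causes is to flag any test that both passed and failed on the same commit. The sketch below assumes a hypothetical `TestRun` record exported from CI; the field names and the `findFlakyTests` helper are illustrative, not part of any runner's API.

```typescript
// Hypothetical CI record: one entry per test execution.
interface TestRun {
  testId: string;
  commit: string;
  passed: boolean;
}

// A test is treated as flaky if the same commit produced both a pass and a fail.
function findFlakyTests(runs: TestRun[]): string[] {
  const outcomes = new Map<string, { testId: string; seen: Set<boolean> }>();
  for (const run of runs) {
    const key = run.testId + '@' + run.commit;
    let entry = outcomes.get(key);
    if (!entry) {
      entry = { testId: run.testId, seen: new Set<boolean>() };
      outcomes.set(key, entry);
    }
    entry.seen.add(run.passed);
  }
  const flaky = new Set<string>();
  outcomes.forEach((entry) => {
    if (entry.seen.size === 2) {
      flaky.add(entry.testId);
    }
  });
  return Array.from(flaky).sort();
}
```

A test that fails on one commit and passes on the next is not flagged, because the code changed between runs; only mixed results on identical code count.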
Common causes and fixes:

- Using getBy* for elements that appear asynchronously. Fix: always await findBy* or await waitFor(…) after any state update, data fetch, or animation.
- … afterEach; RTL's cleanup() auto-unmounts between tests.
- new Date() returns different values in different CI runs. Fix: vi.useFakeTimers() / jest.useFakeTimers() + vi.setSystemTime() / jest.setSystemTime(). Reset in afterEach.
- … route.fulfill()).
- … prefers-reduced-motion: reduce in jsdom config), or wait for animation end with waitFor.
- … getByRole / data-testid as a last resort. In Playwright: ARIA locators (getByRole, getByLabel) over CSS selectors.

e2e tests are 10–100× slower than integration tests and significantly more flaky. The Lead's goal is to keep the e2e suite small, fast, and trusted.
Search → select → book → confirm. The 20% of paths that are 80% of business value. Not every component interaction — those belong in integration tests.
Playwright shards tests across CI workers by file. A 30-min suite on 1 worker → 6 min on 5 workers. Same cost, 5× faster signal. Essential for teams with >50 e2e tests.
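```ts
// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  workers: process.env.CI ? 4 : undefined, // parallel workers within one CI machine
  retries: process.env.CI ? 1 : 0, // one retry in CI, zero locally
  use: {
    baseURL: 'http://localhost:3000',
  },
});

// Sharding across 5 CI machines is set on the command line, one shard per job:
//   playwright test --shard=1/5
//   playwright test --shard=2/5  … through 5/5
```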
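
The arithmetic behind sharding can be sketched as follows. This is a simplified round-robin model for intuition, not Playwright's actual distribution algorithm, and it assumes test files take roughly equal time; `assignShards` and `estimateWallMinutes` are hypothetical helpers.

```typescript
// Distribute test files across shards round-robin (simplified model).
function assignShards(files: string[], shardCount: number): string[][] {
  const shards: string[][] = [];
  for (let s = 0; s < shardCount; s++) {
    shards.push([]);
  }
  files
    .slice()
    .sort()
    .forEach((file, index) => {
      shards[index % shardCount].push(file);
    });
  return shards;
}

// Ideal wall-clock time if the work splits evenly and shards run in parallel.
function estimateWallMinutes(totalMinutes: number, shardCount: number): number {
  return totalMinutes / shardCount;
}
```

For the numbers in the text, `estimateWallMinutes(30, 5)` gives 6. Real suites land somewhat above the ideal because files differ in duration and each shard pays its own startup cost.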
e2e is expensive — keep it small and sharded. I use integration tests for component behaviour and e2e only for the critical booking journeys. If the e2e suite takes over 10 minutes, I move tests down to integration level.
Visual regression tests capture a baseline screenshot of a component or page, then compare it against every subsequent PR. Pixel diffs surface unintended CSS changes — color, spacing, typography, layout — before they reach production.
An upstream CSS change subtly shifts a button's padding or changes a font weight. Logic tests pass — the component renders and clicks. Visual regression catches it because the screenshot differs.
Wrong data, wrong click handlers, broken async states — those are integration/e2e territory. Visual regression is a CSS-correctness tool, not a behaviour tool.
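
At its simplest, the pixel diff described above compares two same-sized RGBA buffers and reports the share of pixels that changed. The sketch below is a naive illustration under that assumption; real tools add antialiasing handling, thresholds per region, and perceptual colour metrics. `changedPixelRatio` is a hypothetical name.

```typescript
// Naive pixel diff: fraction of pixels where any RGBA channel differs
// by more than channelTolerance. Buffers hold 4 bytes per pixel.
function changedPixelRatio(
  baseline: Uint8Array,
  candidate: Uint8Array,
  channelTolerance = 0,
): number {
  if (baseline.length !== candidate.length || baseline.length % 4 !== 0) {
    throw new Error('Screenshots must be RGBA buffers of the same size');
  }
  const pixelCount = baseline.length / 4;
  if (pixelCount === 0) {
    return 0;
  }
  let changed = 0;
  for (let pixel = 0; pixel < pixelCount; pixel++) {
    for (let channel = 0; channel < 4; channel++) {
      const i = pixel * 4 + channel;
      if (Math.abs(baseline[i] - candidate[i]) > channelTolerance) {
        changed += 1;
        break; // count each pixel at most once
      }
    }
  }
  return changed / pixelCount;
}
```

A CI check would fail, or request human review, when the ratio exceeds a small threshold. The comparison knows nothing about behaviour, which is why wrong data or broken handlers stay invisible to it.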
| Tool | How it works | Best for |
|---|---|---|
| Chromatic | Cloud service built on Storybook. Renders every story in a cloud browser, diffs against baseline, sends PR review with highlighted changes. Accepts/rejects per story. | Design systems, component libraries with many consumers (L15) |
| Percy (BrowserStack) | Same cloud-screenshot-diff model as Chromatic, browser-agnostic. Integrates with Playwright/Cypress/Storybook. | Full-page e2e visual regression |
| Applitools | AI-based visual diff ("Visual AI") — ignores rendering noise (antialiasing, font hinting) and only flags meaningful visual changes. Multi-browser grid runs the same test across browsers/devices in parallel. | Cross-browser visual regression at scale; enterprise teams needing smarter diff (fewer false positives than pixel diffing) |
| Playwright built-in | expect(page).toHaveScreenshot() — stores PNG baselines in the repo, diffs locally. No external service, free, works in CI. | Teams that want visual regression without a SaaS dependency |
A Storybook play() function runs Testing Library interactions inside a story. Chromatic captures the visual output of each interaction state. One story gives you: documentation, interactive demo, visual regression baseline, and an interaction test — all from the same source.
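```tsx
// HotelCard.stories.tsx
import type { Meta, StoryObj } from '@storybook/react';
// Assumes Storybook 8; older versions import these from '@storybook/testing-library'
import { userEvent, within } from '@storybook/test';
import { HotelCard } from './HotelCard';

const meta: Meta<typeof HotelCard> = { component: HotelCard };
export default meta;
type Story = StoryObj<typeof HotelCard>;

export const WithFavourite: Story = {
  play: async ({ canvasElement }) => {
    const canvas = within(canvasElement);
    await userEvent.click(canvas.getByRole('button', { name: /save hotel/i }));
    // Chromatic screenshots the filled-heart state — visual regression included
  },
};
```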
Visual regression is most valuable for design systems and shared component libraries — the components 30 teams depend on. I run it as an informational CI check via Chromatic on Storybook stories. It catches the CSS regression nobody intended to ship.
100% coverage is a vanity metric. You can cover every line with tests that make no meaningful assertions:
```tsx
test('renders without crashing', () => {
  render(<PaymentForm />);
  // 100% line coverage, zero behaviour tested
});
```
| | Coverage as signal | Coverage as target (wrong) |
|---|---|---|
| Goal | Find untested critical paths, error states, edge cases | Reach a number — e.g. 80% or 100% |
| Outcome | Tests that catch real bugs | Tests written to move the number (assertion-free, trivial renders) |
| Metric | Escaped defects — bugs that shipped without test coverage | Coverage % in the CI badge |
Practical floor: Set a coverage gate at a low floor (60–70%) to catch cases where a whole feature ships with zero tests. Investigate drops, don't chase the ceiling. Coverage of critical payment and auth paths should be high by intent — not because the gate forced it.
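
The "low floor, investigate drops" policy can be expressed as a small decision function. This is a sketch under assumed thresholds: the 65% floor sits inside the 60–70% range above, while the 2-point drop tolerance and the `evaluateCoverage` name are illustrative choices, not values from any tool.

```typescript
type CoverageVerdict = 'fail' | 'investigate' | 'ok';

// Fail only below the floor; flag sharp drops for a human to look at.
function evaluateCoverage(
  currentPct: number,
  previousPct: number,
  floorPct = 65, // assumed floor within the 60–70% range
  dropTolerancePct = 2, // assumed tolerance before a drop is worth a look
): CoverageVerdict {
  if (currentPct < floorPct) {
    return 'fail';
  }
  if (previousPct - currentPct > dropTolerancePct) {
    return 'investigate';
  }
  return 'ok';
}
```

Only the floor blocks the merge. A drop from 80% to 72% passes the gate but prompts a review of which paths lost their tests, which keeps the number a signal instead of a target.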
| Check type | Gate merge? | Reasoning |
|---|---|---|
| TypeScript + lint (static) | Yes — always | Cheapest signal, zero flakiness |
| Unit + integration | Yes | Fast (seconds to minutes), high ROI |
| e2e (critical paths) | Yes on main; informational on PRs | Slow — block merges to main, but run async on PRs to avoid blocking dev |
| Quarantined / flaky | Never | Informational only until fixed |
| Visual regression | Informational (human review) | Catches unintended changes, but requires human sign-off |
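
The gating table above can be encoded as data so the CI pipeline and the team read the same policy. This is a minimal sketch; the check identifiers and the `blockingFailures` helper are hypothetical names, and the booleans mirror the table rows.

```typescript
type CheckType = 'static' | 'unit-integration' | 'e2e' | 'quarantined' | 'visual';
type MergeTarget = 'pr' | 'main';

// true = a failure blocks the merge; false = informational only.
const GATES: Record<CheckType, Record<MergeTarget, boolean>> = {
  static: { pr: true, main: true },
  'unit-integration': { pr: true, main: true },
  e2e: { pr: false, main: true }, // informational on PRs, blocking on main
  quarantined: { pr: false, main: false }, // never blocks until fixed
  visual: { pr: false, main: false }, // human review, not an automatic gate
};

// Returns the subset of failed checks that should block the merge.
function blockingFailures(failed: CheckType[], target: MergeTarget): CheckType[] {
  return failed.filter((check) => GATES[check][target]);
}
```

A PR with a failing e2e run and a visual diff has no blocking failures, while the same e2e failure on main blocks the merge.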
Mandating tests creates resentment. Making tests easy creates adoption. The Lead playbook:
- A shared test-utils package with pre-configured render wrappers (providers, MSW server, custom queries). Writing a test should start with one import, not 30 lines of setup.

Concept: testing culture is infrastructure — the same way you invest in CI pipelines, you invest in shared test helpers, flakiness dashboards, and pairing rituals.

Trade-off: strict coverage gates raise the bar but create gaming behaviour (assertion-free tests), so I set a low floor gate + track escaped defects — a metric that can't be gamed.

Anchor: "We had a 12% flakiness rate and engineers stopped trusting CI; I ran a sprint that quarantined 40 flaky tests, root-caused them (mostly missing MSW + getBy* on async elements), and got us under 1% in 4 weeks — deploy confidence went up measurably."

Impact: faster PRs, fewer reverts, more confident feature flags and deploys — that's the outcome the business cares about, not the test count.

Invite: "I'd weigh the investment differently for a team under extreme delivery pressure — start with a minimal shared MSW setup and one integration test per critical path rather than the full rollout."
"The platform ships a hotel booking flow across 40 markets. Walk me through the testing strategy you'd set as Lead — what types of tests at each layer, tooling choices, how you'd handle flakiness, and how you'd raise the testing culture from near-zero."
Target: ~3 minutes. Hit: trophy model → RTL+MSW for integration → Playwright sharded for e2e → flakiness quarantine approach → escaped-defects metric → shared test-utils as culture lever.
Good follow-up topics: