This is part 3 of a four-part series. In part 1, we covered governance: how we made the code base AI-ready. In part 2, we covered delivery: the migration strategy. This post details how we made the test suite trustworthy—specifically, how we turned Storybook from a component playground into something closer to a behavioral specification for the access management interface on Red Hat Hybrid Cloud Console.
A common approach is to use Storybook to render components in isolation with static props. We use it to verify that the application behaves correctly end-to-end: real components, real routing, real data fetching, real permission checks, and real error handling. The only thing that's simulated is the network. A combination of patterns (typed mock factories, seed data, interaction helpers, stateful databases, and step isolation) turns individual stories into executable specifications. Together, they form a system where adding a feature means adding the proof that it works; breaking a feature means failing a test that describes exactly what broke and why.
That's what we mean by a verification engine. Here's how it works.
The no-fake-layers principle
Most user interface (UI) test suites mock at the wrong layer. They replace data-fetching functions with static returns. The tests pass, but the integration between the component and the network was never exercised. Bugs ship because the mock boundary was too close to the component.
We wanted a different guarantee: if it works in Storybook, it works in the browser.
We use Mock Service Worker (MSW) to intercept HTTP requests at the browser level and return controlled responses. Each story exercises the full stack from the component down to the network boundary. If a component makes a request to the wrong endpoint, the test fails. If the error handling path has a bug, the test catches it because the error response comes through the same code path as a real one.
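For context, the wiring itself is small. Here is a minimal sketch of how MSW is typically hooked into Storybook; this assumes the msw-storybook-addon package, and our actual setup may differ:

// .storybook/preview.ts — minimal sketch, assuming msw-storybook-addon
import { initialize, mswLoader } from 'msw-storybook-addon';

// Registers the service worker that intercepts requests at the browser level.
initialize();

const preview = {
  // Applies each story's parameters.msw.handlers before the story renders.
  loaders: [mswLoader],
};

export default preview;

Once that loader is in place, every story declares its network behavior through handlers, which is where the patterns below come in.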
The patterns (with code)
Making that guarantee real required enforcing a few patterns across the codebase. These are extracted from our governance documentation (the same docs that AI coding assistants read automatically, as described in part 1). Each one exists because we hit the problem it prevents.
Handler factories, not inline mocks
The most common source of technical debt in test code is hardcoding mock responses inline. Each test defines its own data, with its own field names. When the API changes, you update the real code and then spend a day updating test files—or you don't and the tests pass against stale data.
We use handler factories: functions that return typed MSW handlers for a specific API domain. The types flow from the SDK through the data layer into the factory. When the SDK updates a response shape, the factory breaks at compile time.
// DO: Factory calls — typed, reusable, version-aware
handlers: [...v2RolesHandlers()] // happy path, default seed data
handlers: [...v2RolesHandlers(customRoles, { onList: spy })] // custom data + spy callback
handlers: [...groupsHandlers([])] // empty state
handlers: [...groupsErrorHandlers(500)] // error state
handlers: [...v2RolesLoadingHandlers()] // loading state (skeleton screens)
// DONT: Inline handlers — untyped, duplicated, invisible to SDK updates
parameters: {
msw: {
handlers: [
http.get('/api/rbac/v2/roles/', () =>
HttpResponse.json({ data: [{ name: 'Admin', uuid: '123' }] })
),
],
},
},

One factory update fixes every story that uses it. One inline handler is one more thing to forget. (We didn't start here. The first stories used inline handlers. After the third API change required updating dozens of files, we extracted the factories and never looked back.)
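For illustration, a factory along these lines might look like the following. This is a sketch, not our exact code: the endpoint matches the examples above, but the type and seed names are assumptions.

// mocks/handlers/v2Roles.ts — illustrative sketch, not our exact implementation
import { http, HttpResponse, delay } from 'msw';

// In the real code this type flows from the RBAC SDK, so an SDK update breaks the
// factory at compile time. It's defined inline here to keep the sketch self-contained.
interface RoleV2 {
  uuid: string;
  name: string;
  description?: string;
}

const DEFAULT_V2_ROLES: RoleV2[] = [{ uuid: 'role-tenant-admin', name: 'Tenant admin' }];

interface V2RolesOptions {
  onList?: (roles: RoleV2[]) => void; // spy hook so play functions can assert the call happened
}

export const v2RolesHandlers = (roles: RoleV2[] = DEFAULT_V2_ROLES, options: V2RolesOptions = {}) => [
  http.get('/api/rbac/v2/roles/', () => {
    options.onList?.(roles);
    return HttpResponse.json({ data: roles, meta: { count: roles.length } });
  }),
];

export const v2RolesErrorHandlers = (status = 500) => [
  http.get('/api/rbac/v2/roles/', () => new HttpResponse(null, { status })),
];

export const v2RolesLoadingHandlers = () => [
  // Never resolves, so skeleton screens stay visible for the loading-state story.
  http.get('/api/rbac/v2/roles/', async () => {
    await delay('infinite');
    return HttpResponse.json({ data: [] });
  }),
];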
Seed constants, not hardcoded strings
Play functions (Storybook's mechanism for running user interactions) must never use string literals for entity names. When seed data changes, hardcoded strings silently break.
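A shared seed module is nothing elaborate. It might look roughly like this (a sketch with illustrative fields):

// mocks/seed.ts — illustrative sketch
export const ROLE_TENANT_ADMIN = {
  uuid: 'role-tenant-admin',
  name: 'Tenant admin',
  description: 'Full access to the tenant',
} as const;

export const WS_ROOT = {
  id: 'ws-root',
  name: 'Root Workspace',
} as const;

// Handler factories and play functions import the same constants, so renaming an
// entity breaks every consumer at once instead of silently passing against stale strings.

Stories then reference the constants instead of repeating the literals: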
// DO: Named constants — breaks visibly when seed data changes
await canvas.findByText(ROLE_TENANT_ADMIN.name);
await expandWorkspaceRow(user, canvas, WS_ROOT.name);
// DONT: Hardcoded strings — silently wrong when seed data changes
await canvas.findByText('Tenant admin');
await expandWorkspaceRow(user, canvas, 'Root Workspace');

Interaction helpers, not copy-paste
The first test inlines every Document Object Model (DOM) query. The second test copies the first. By the twentieth test, you have twenty slightly different implementations of "open a modal and fill in a form."
We extracted every reusable interaction into shared helpers and banned the inline alternatives:
// DO: Shared helpers — async-safe, reusable, maintained in one place
const modal = await waitForModal();
await clickWizardNext(user, modal);
await confirmDestructiveModal(user, { buttonLabel: /remove/i });
await selectTableRow(user, canvas, ROLE_TENANT_ADMIN.name);
// DONT: Banned in play functions
document.querySelector('[role="dialog"]'); // direct DOM access
await delay(500); // arbitrary waits
element.dispatchEvent(new MouseEvent('click')); // raw DOM events
canvas.getByRole('button', { name: /save/i }); // sync query after async boundary

The helpers enforce correct async patterns. Every DOM query after an action uses findBy* (which retries) instead of getBy* (which doesn't). Every assertion that depends on an async operation uses waitFor. The banned patterns in the preceding list are codified in our governance docs; the async patterns (findBy* over getBy*, no arbitrary waits) are enforced by lint rules that fail the build.
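As an example of what these helpers look like, here is a sketch of a few of them. These are not our exact implementations, and the imports assume the @storybook/test package:

// test-utils/interactions.ts — illustrative sketch
import { expect, screen, userEvent, waitFor, within } from '@storybook/test';

type UserSession = ReturnType<typeof userEvent.setup>;

// Modals render in a portal outside the story canvas, so query the document-level
// screen and retry with findBy* until the dialog appears.
export const waitForModal = async (): Promise<HTMLElement> => screen.findByRole('dialog');

export const clickWizardNext = async (user: UserSession, modal: HTMLElement) => {
  await user.click(await within(modal).findByRole('button', { name: /next/i }));
};

export const confirmDestructiveModal = async (user: UserSession, { buttonLabel }: { buttonLabel: RegExp }) => {
  const modal = await waitForModal();
  await user.click(await within(modal).findByRole('button', { name: buttonLabel }));
  // Wait for the dialog to close so the next step never holds a stale reference.
  await waitFor(() => expect(screen.queryByRole('dialog')).not.toBeInTheDocument());
};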
Step isolation in journey stories
Component stories test individual surfaces. Journey stories test multi-step flows: create a role, see it appear, edit it, delete it, verify it's gone. These use stateful mock databases that support real create, read, update, and delete (CRUD) operations in memory—the handler factories read from and write to the same collection.
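The "database" behind a journey story is just a mutable in-memory collection that the CRUD handlers share. A sketch of the idea (names are illustrative):

// mocks/rolesDb.ts — illustrative sketch of a stateful mock database
import { http, HttpResponse } from 'msw';

interface Role {
  uuid: string;
  name: string;
}

// One mutable collection per journey story; reset between stories so runs stay independent.
let roles: Role[] = [];

export const resetRolesDb = (seed: Role[]) => {
  roles = [...seed];
};

// The CRUD handlers read from and write to the same collection, so a role created
// in one step is visible when a later step lists or deletes it.
export const statefulRolesHandlers = () => [
  http.get('/api/rbac/v2/roles/', () => HttpResponse.json({ data: roles })),
  http.post('/api/rbac/v2/roles/', async ({ request }) => {
    const body = (await request.json()) as { name: string };
    const role = { uuid: crypto.randomUUID(), name: body.name };
    roles.push(role);
    return HttpResponse.json(role, { status: 201 });
  }),
  http.delete('/api/rbac/v2/roles/:uuid/', ({ params }) => {
    roles = roles.filter((r) => r.uuid !== params.uuid);
    return new HttpResponse(null, { status: 204 });
  }),
];

In this sketch, a story's loader (or decorator) would call resetRolesDb with the seed data so every run starts from the same state.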
The step() function organizes phases and creates closure boundaries. DOM references from one step can't leak into the next. This prevents a category of flaky tests where a stale reference causes a later assertion to fail:
play: async ({ canvasElement, step }) => {
const canvas = within(canvasElement);
await step('Navigate to roles list', async () => {
const table = await canvas.findByRole('grid');
await expect(within(table).findByText(FIRST_ROLE.name)).resolves.toBeInTheDocument();
});
await step('Create a new role', async () => {
await user.click(await canvas.findByRole('button', { name: /create role/i }));
const modal = await waitForModal();
await clearAndType(user, modal, /role name/i, 'New role');
await clickWizardNext(user, modal);
await expect(createRoleSpy).toHaveBeenCalled();
});
},

What the engine looks like at scale
None of these patterns are remarkable on their own. Handler factories are a convenience. Seed constants are good hygiene. Interaction helpers reduce duplication. Step isolation prevents flakiness.
When composed together, however, they produce something qualitatively different from a test suite. Each feature in the application has a story that describes its behavior: what the user sees, what API calls are made, what happens on error, what happens with empty data, what happens without permission.
A total of 959 of these specifications run in continuous integration (CI) on every change. When we upgraded the component library across 272 files, the suite showed exactly which behaviors survived and which did not. When we removed 216 files of legacy state management in a single commit, the suite proved the application still worked.
The stories aren't tests that verify the code. They're the specification that defines what "working" means.
The cross-functional win
The most exciting benefit had nothing to do with code quality. It changed how the team collaborates.
Every story is automatically deployed to a hosted environment via CI. That environment became the place where design reviews happen—not staging, not a branch deploy, not a screen recording.
Designers open the story, navigate through the flow, and see exactly what users will see.
A product manager reviewed one of our wizard flows in Storybook before it ever reached staging and caught a step-ordering issue that would have shipped otherwise.
The cost of that catch was zero. The cost of finding it in production would have been a support ticket and a confused customer.
Stories also became the reference artifact the team points to when behavioral questions come up. "Does the delete confirmation show the resource name?" is no longer a question someone answers from memory. Someone pulls up the story. The story is the answer. We used the same stories to demo features to stakeholders outside the immediate team. The behavioral accuracy meant we were showing real product behavior, not a polished prototype that would diverge from what shipped.
The technical investment in making stories accurate (no fake layers, real network mocking, and real permission checks) is what makes all of this possible. If the stories used fake data or skipped permission checks, nobody outside engineering would trust them. Because the stories exercise the same code paths as the real product, everyone uses them. Engineering, design, and product now share a single artifact that describes how the application behaves—and that artifact runs in CI on every change.
What I'd tell you if you're building this
Mock at the network boundary, not the component boundary. The further your mock is from the component, the more real code your test exercises.
Build factories, not inline mocks. A handler factory pays for itself after the third test that uses it. When the API changes, you update the factory and every consumer gets the fix.
Extract interactions into helpers immediately. The first time you write "open modal, fill form, submit" is the last time it should be inline.
Invest in the cross-functional story. The engineering value of behavioral tests is real. But the organizational value of giving design and product a shared artifact they can trust is larger than any single technical benefit.
Try Red Hat Hybrid Cloud Console at console.redhat.com.