Claude Code + Playwright QA Rig

I gave Claude Code a real browser and a strict no-deleting-flakes rule

My Claude Code + Playwright rig writes the test first, quarantines flakes instead of nuking them, and only reports specs I actually have to fix.

qa_marta · 9 min read · 2026-06-18

I have a confession that gets me funny looks in standup. I do not trust green checkmarks. A passing suite tells me nothing if half of it is skipped, a third of it is sleeping for two seconds, and one spec only passes when the CI box is under load. So when I built my Claude Code setup, the goal was not speed. The goal was a rig that is honest about what is broken.

This is the Claude Code + Playwright QA Rig. It runs Claude Sonnet 4.6, talks to a real browser through the Playwright MCP server, pulls failures back from GitHub and Sentry, and uses three small subagents so the model that fixes a bug never has to read 400 lines of noisy stack traces. Here is the whole thing, config and all.

What this build is for

Testing and QA automation. If your team is drowning in flaky end-to-end tests and someone keeps adding await page.waitForTimeout(3000) to make CI shut up, this is aimed squarely at you.

The MCP servers: a browser, plus where failures come from

The Playwright MCP is the whole point. It lets Claude open a real Chromium, click things, read the accessibility tree, and check whether the change it just made actually renders before it tells me it is done. No more agents confidently editing a selector they never loaded. GitHub MCP gives it the failing CI runs and PR context. Sentry MCP lets it line up a flaky spec against the actual production error, which is how you find the difference between a bad test and a bad app.

.mcp.json

{
  "mcpServers": {
    "playwright": {
      "command": "npx",
      "args": ["@playwright/mcp@latest", "--headless", "--isolated"]
    },
    "github": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-github"],
      "env": { "GITHUB_PERSONAL_ACCESS_TOKEN": "${GITHUB_TOKEN}" }
    },
    "sentry": {
      "command": "npx",
      "args": ["-y", "@sentry/mcp-server@latest"],
      "env": {
        "SENTRY_ACCESS_TOKEN": "${SENTRY_AUTH_TOKEN}",
        "SENTRY_ORG": "greenbuild"
      }
    }
  }
}

Run Playwright isolated

I pass --isolated so every browser session starts from a clean profile. Shared state between runs is the number one source of a test that passes alone and fails in the suite. Found that out the hard way on a Friday.

Official MCP Registry: The open catalog and API for public MCP servers. This is where I check that a server is real before I add it. registry.modelcontextprotocol.io

The Best MCP Servers for Developers in 2026: Builder.io roundup with setup notes. Good sanity check on which servers are worth the context cost. builder.io

The rules, written on the wall

My CLAUDE.md is short on purpose. Three rules carry the whole philosophy. Write the test before you touch the flaky one, so we capture the actual bug instead of papering over it. Quarantine flakes, never delete them, because a deleted flake is a known bug you chose to forget. And keep the chatter out of the main context.

CLAUDE.md

# QA Rig rules (non-negotiable)

## Flaky tests
- Write a NEW failing test that reproduces the issue BEFORE fixing the flaky one.
- Quarantine flakes with `test.fixme()` and a // FLAKE: link to the issue.
- NEVER delete a flaky test. A deleted flake is a silent bug.
- Banned: page.waitForTimeout(). Use web-first assertions / expect.poll().

## Reporting
- Only surface FAILING specs to me. Passing + skipped stay out of the summary.
- One line per failure: spec path, the assertion, and your best single hypothesis.

## Selectors
- getByRole / getByLabel first. data-testid only when there is no accessible name.

The waitForTimeout ban is the important one

Nine times out of ten a fixed sleep is a flake waiting to happen on a slower machine. If the agent reaches for one I want it to stop and use a real assertion instead. This rule alone cut my retry count almost in half.
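
A ban like this is easy to check mechanically. Below is a minimal sketch of such a check, not something the rig above ships with: `findBannedSleeps` is a hypothetical helper that scans spec source as plain text and reports every line that calls `waitForTimeout(`. It assumes one statement per line and skips lines that are entirely a comment, so it is a rough guard rather than a parser.

```typescript
// Hypothetical helper (an assumption for illustration, not part of Playwright
// or of the rig): flags fixed sleeps in the text of a spec file.
interface SleepHit {
  line: number; // 1-based line number in the source
  text: string; // the offending line, trimmed
}

function findBannedSleeps(source: string): SleepHit[] {
  const hits: SleepHit[] = [];
  source.split("\n").forEach((raw, index) => {
    const text = raw.trim();
    // Skip whole-line comments so a note about the ban does not trip the check.
    if (text.startsWith("//")) return;
    if (/\bwaitForTimeout\s*\(/.test(text)) {
      hits.push({ line: index + 1, text });
    }
  });
  return hits;
}

// Sample spec body, invented for illustration.
const sleepSample = [
  "await page.getByRole('button', { name: 'Apply' }).click();",
  "await page.waitForTimeout(3000);",
  "// waitForTimeout(3000) used to live here",
  "await expect(page.getByText('Discount applied')).toBeVisible();",
].join("\n");

console.log(findBannedSleeps(sleepSample));
```

On the sample, only line 2 is reported: the comment on line 3 is skipped and the web-first assertion on line 4 is left alone.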

The subagents: keep the noise away from the fix

Three subagents, each with its own context window. The test-author writes specs and never debugs. The flake-hunter is the one I am proud of: it reruns suspect specs in a loop, classifies them, and only the actionable ones make it back to me. The triage-reporter turns a wall of red into a short list. The trick is that the flake-hunter eats all the noisy output in its own context, so my main session stays clean.

.claude/agents/flake-hunter.md

---
name: flake-hunter
description: Reruns suspect specs to separate real failures from flakes. Use after any red CI run.
tools: Bash, Read, Edit, mcp__playwright__browser_navigate, mcp__playwright__browser_snapshot
model: sonnet
---

You isolate flaky end-to-end tests. Process:

1. Run each suspect spec 5 times in isolation: `npx playwright test <spec> --repeat-each=5`
2. Classify:
   - fails 5/5  -> REAL FAILURE
   - fails 1-4/5 -> FLAKE (quarantine with test.fixme + // FLAKE: <issue>)
   - passes 5/5 -> resolved, drop it
3. Do NOT fix application code. Do NOT delete tests.
4. Return ONLY: spec path, classification, fail count (e.g. 2/5).

Keep all raw Playwright output inside your own context. The parent only
needs the verdict.

claude code — flake-hunter run

You

CI went red on the checkout suite. Sort out what's actually broken.

Agent

Delegating to flake-hunter. Reran 6 suspect specs, 5x each, isolated.

Agent

checkout/apply-coupon.spec.ts REAL FAILURE 5/5 (expects discounted total, gets full price)
checkout/guest-flow.spec.ts FLAKE 2/5 -> quarantined, // FLAKE: #418
The other 4 passed 5/5. Want me to write a repro test for apply-coupon first?

The flake-hunter reporting back to the main session. Note how little it says.
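
The thresholds in the flake-hunter prompt are simple enough to write down as code, which makes the edge cases explicit. This is a sketch under assumptions, not how the subagent works internally: `classifyReruns` and `formatVerdict` are hypothetical names, and the counts are read as failures out of runs, which is what "REAL FAILURE 5/5" in the transcript implies.

```typescript
type Verdict = "REAL FAILURE" | "FLAKE" | "RESOLVED";

// Thresholds taken from the flake-hunter prompt:
// every run fails -> real failure, some runs fail -> flake, none fail -> resolved.
function classifyReruns(failures: number, runs: number = 5): Verdict {
  if (!Number.isInteger(runs) || runs <= 0) {
    throw new RangeError("runs must be a positive integer");
  }
  if (!Number.isInteger(failures) || failures < 0 || failures > runs) {
    throw new RangeError("failures must be an integer between 0 and runs");
  }
  if (failures === runs) return "REAL FAILURE";
  if (failures === 0) return "RESOLVED";
  return "FLAKE";
}

// One line per spec: path, classification, failures/runs.
function formatVerdict(spec: string, failures: number, runs: number = 5): string {
  return `${spec} ${classifyReruns(failures, runs)} ${failures}/${runs}`;
}

console.log(formatVerdict("checkout/apply-coupon.spec.ts", 5));
console.log(formatVerdict("checkout/guest-flow.spec.ts", 2));
```

Note that a single failure in five runs lands in the same bucket as four in five. If that distinction matters for your suite, the count in the verdict line is where to look.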

Create custom subagents - Claude Code Docs: The official reference for the YAML frontmatter, tool allowlist and isolated context window each subagent gets. code.claude.com

How to Build Claude Subagents Better Than 99% of People · IndyDevDan

The hooks: automation that does not depend on the model remembering

Rules are suggestions to a language model. Hooks are not. I run the affected e2e tests on every file edit so the agent gets feedback the moment it breaks something, and I strip the Stop summary down to failures only. The model can forget my rule about reporting. The hook cannot.

.claude/settings.json

{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          {
            "type": "command",
            "command": "npx playwright test --only-changed --reporter=line 2>&1 | tail -n 25"
          }
        ]
      }
    ],
    "Stop": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "npx playwright test --reporter=line 2>&1 | grep -E '✘|failed' || echo 'all green'"
          }
        ]
      }
    ]
  }
}
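
To make the Stop filter easier to reason about, here is the same logic as the `grep -E '✘|failed' || echo 'all green'` pipeline, written as a function. It is only a sketch: `failuresOnly` is a hypothetical name, and the sample lines are invented for illustration rather than copied from real Playwright output.

```typescript
// Mirrors the shell pipeline: keep lines containing "✘" or "failed",
// and fall back to "all green" when nothing matches.
function failuresOnly(reporterOutput: string): string {
  const failing = reporterOutput
    .split("\n")
    .filter((line) => /✘|failed/.test(line));
  return failing.length > 0 ? failing.join("\n") : "all green";
}

// Invented sample lines, not real reporter output.
const redRun = [
  "✓ checkout/guest-flow.spec.ts",
  "✘ checkout/apply-coupon.spec.ts",
  "1 failed",
  "1 passed",
].join("\n");

const greenRun = ["✓ checkout/guest-flow.spec.ts", "1 passed"].join("\n");

console.log(failuresOnly(redRun));
console.log(failuresOnly(greenRun));
```

One caveat the function makes visible: the match is on the word "failed" anywhere in a line, so a passing test whose title contains that word would also slip into the summary.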

zsh — qa-rig

# kicked it off after a red CI run
$ claude "checkout suite is red, triage it"
> Pulling failed run from github MCP... 2 failing specs
> flake-hunter: rerunning 6 specs, isolated, 5x each
apply-coupon.spec.ts ......... REAL 5/5
guest-flow.spec.ts ........... FLAKE 2/5 -> quarantined #418
# PostToolUse hook fires after it writes the repro test
Running 1 changed spec... 1 failed (good, it reproduces)
> Fixed coupon rounding in cart total. Reran: 1 passed
$ git diff --stat
src/cart/total.ts | 4 +-
tests/checkout/apply-coupon.spec.ts | 11 +++++
tests/checkout/guest-flow.spec.ts | 2 +-

What the rig costs me, honestly

I am not going to pretend this is free or magic. Sonnet 4.6 is plenty for test work and keeps the cost reasonable, which matters when the flake-hunter is rerunning specs five times each. Here are the numbers I actually see on a typical triage-and-fix loop.

Metric	Value	Note
Model	Claude Sonnet 4.6	Opus is overkill for spec work
Avg response	2.7s	fine, since I am not babysitting it
Cost / task	$0.33	rerunning flakes adds up, watch it
Pass rate	93%	the 7% is usually a genuinely ambiguous spec
Tier	S	on Setuproll, for what that is worth

The honest catch: rerunning a spec five times to classify it is not cheap, and on a big suite the flake-hunter can chew through tokens. I scope it to suspect specs only, never the whole suite. If you point it at everything you will get a bill you do not like.

1. Add the three MCP servers, but start with just Playwright and prove it can drive a browser before adding GitHub and Sentry.
2. Copy the CLAUDE.md rules. The waitForTimeout ban is the one that matters most.
3. Drop in the flake-hunter subagent and let it have its own context. Do not let it edit app code.
4. Wire the two hooks. Test on changes, summarize failures only on stop.
5. Run it against one real flaky suite for a week before you trust it with main.

Claude Code Hooks explained in 5 minutes · IndyDevDan

One thing I would tell past me

Quarantine, do not delete. The first version of this rig let the agent delete tests it could not pass and I lost two real bugs that way before I noticed. Now a flake gets test.fixme and an issue link, and it stays visible until a human signs it off.
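
Keeping quarantined tests visible is also something a small script can help with. The sketch below is an assumption about how you might audit a spec file, not part of the rig: `auditQuarantine` is a hypothetical helper that finds each `test.fixme(` call and looks for a `// FLAKE:` marker on the same line or the line directly above it. The test titles in the sample are made up; `#418` is the issue from the transcript earlier.

```typescript
interface Quarantined {
  line: number; // 1-based line of the test.fixme call
  issue: string | null; // token after "// FLAKE:", or null when the link is missing
}

// Hypothetical audit: lists every quarantined test and whether it carries an issue link.
// Assumes the marker sits on the same line as test.fixme or on the line just above it.
function auditQuarantine(source: string): Quarantined[] {
  const lines = source.split("\n");
  const marker = /\/\/\s*FLAKE:\s*(\S+)/;
  const found: Quarantined[] = [];
  lines.forEach((text, index) => {
    if (!/\btest\.fixme\s*\(/.test(text)) return;
    const previous = lines[index - 1] ?? "";
    const match = marker.exec(text) ?? marker.exec(previous);
    found.push({ line: index + 1, issue: match?.[1] ?? null });
  });
  return found;
}

// Sample spec text, invented for illustration.
const quarantineSample = [
  "// FLAKE: #418",
  "test.fixme('guest can check out', async ({ page }) => {});",
  "test.fixme('saved card is prefilled', async ({ page }) => {});",
].join("\n");

console.log(auditQuarantine(quarantineSample));
```

In the sample, the first quarantined test is reported with `#418` and the second comes back with a `null` issue, which is the case worth failing a check on: a quarantine nobody can trace back to a ticket.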

That is the whole rig. It is not clever, it is just stubborn about the right things: write the test first, never hide a flake, and keep the noise out of the room where the work happens. If your green builds have been lying to you, this is the setup that finally got mine to tell the truth. Install it with the line below, point it at your worst suite, and let me know what it finds.

install

npx setuproll add claude-code-playwright-qa
