I gave Claude Code a real browser and a strict no-deleting-flakes rule
My Claude Code + Playwright rig writes the test first, quarantines flakes instead of nuking them, and only reports specs I actually have to fix.
I have a confession that gets me funny looks in standup. I do not trust green checkmarks. A passing suite tells me nothing if half of it is skipped, a third of it is sleeping for two seconds, and one spec only passes when the CI box is under load. So when I built my Claude Code setup, the goal was not speed. The goal was a rig that is honest about what is broken.
This is the Claude Code + Playwright QA Rig. It runs Claude Sonnet 4.6, talks to a real browser through the Playwright MCP server, pulls failures back from GitHub and Sentry, and uses three small subagents so the model that fixes a bug never has to read 400 lines of noisy stack traces. Here is the whole thing, config and all.
The MCP servers: a browser, plus where failures come from
The Playwright MCP is the whole point. It lets Claude open a real Chromium, click things, read the accessibility tree, and check whether the change it just made actually renders before it tells me it is done. No more agents confidently editing a selector they never loaded. GitHub MCP gives it the failing CI runs and PR context. Sentry MCP lets it line up a flaky spec against the actual production error, which is how you find the difference between a bad test and a bad app.
{
"mcpServers": {
"playwright": {
"command": "npx",
"args": ["@playwright/mcp@latest", "--headless", "--isolated"]
},
"github": {
"command": "npx",
"args": ["-y", "@modelcontextprotocol/server-github"],
"env": { "GITHUB_TOKEN": "${GITHUB_TOKEN}" }
},
"sentry": {
"command": "npx",
"args": ["-y", "@sentry/mcp-server@latest"],
"env": {
"SENTRY_AUTH_TOKEN": "${SENTRY_AUTH_TOKEN}",
"SENTRY_ORG": "greenbuild"
}
}
}
}The rules, written on the wall
My CLAUDE.md is short on purpose. Three rules carry the whole philosophy. Write the test before you touch the flaky one, so we capture the actual bug instead of papering over it. Quarantine flakes, never delete them, because a deleted flake is a known bug you chose to forget. And keep the chatter out of the main context.
# QA Rig rules (non-negotiable)
## Flaky tests
- Write a NEW failing test that reproduces the issue BEFORE fixing the flaky one.
- Quarantine flakes with `test.fixme()` and a // FLAKE: link to the issue.
- NEVER delete a flaky test. A deleted flake is a silent bug.
- Banned: page.waitForTimeout(). Use web-first assertions / expect.poll().
## Reporting
- Only surface FAILING specs to me. Passing + skipped stay out of the summary.
- One line per failure: spec path, the assertion, and your best single hypothesis.
## Selectors
- getByRole / getByLabel first. data-testid only when there is no accessible name.The subagents: keep the noise away from the fix
Three subagents, each with its own context window. The test-author writes specs and never debugs. The flake-hunter is the one I am proud of: it reruns suspect specs in a loop, classifies them, and only the actionable ones make it back to me. The triage-reporter turns a wall of red into a short list. The trick is that the flake-hunter eats all the noisy output in its own context, so my main session stays clean.
---
name: flake-hunter
description: Reruns suspect specs to separate real failures from flakes. Use after any red CI run.
tools: Bash, Read, mcp__playwright__browser_navigate, mcp__playwright__browser_snapshot
model: sonnet
---
You isolate flaky end-to-end tests. Process:
1. Run each suspect spec 5 times in isolation: `npx playwright test <spec> --repeat-each=5`
2. Classify:
- fails 5/5 -> REAL FAILURE
- fails 1-4/5 -> FLAKE (quarantine with test.fixme + // FLAKE: <issue>)
- passes 5/5 -> resolved, drop it
3. Do NOT fix application code. Do NOT delete tests.
4. Return ONLY: spec path, classification, pass count (e.g. 2/5).
Keep all raw Playwright output inside your own context. The parent only
needs the verdict.
22:40The hooks: automation that does not depend on the model remembering
Rules are suggestions to a language model. Hooks are not. I run the affected e2e tests on every file edit so the agent gets feedback the moment it breaks something, and I strip the Stop summary down to failures only. The model can forget my rule about reporting. The hook cannot.
{
"hooks": {
"PostToolUse": [
{
"matcher": "Edit|Write",
"hooks": [
{
"type": "command",
"command": "npx playwright test --only-changed --reporter=line 2>&1 | tail -n 25"
}
]
}
],
"Stop": [
{
"hooks": [
{
"type": "command",
"command": "npx playwright test --reporter=line 2>&1 | grep -E '✘|failed' || echo 'all green'"
}
]
}
]
}
}What the rig costs me, honestly
I am not going to pretend this is free or magic. Sonnet 4.6 is plenty for test work and keeps the cost reasonable, which matters when the flake-hunter is rerunning specs five times each. Here are the numbers I actually see on a typical triage-and-fix loop.
| Metric | Value | Note |
|---|---|---|
| Model | Claude Sonnet 4.6 | Opus is overkill for spec work |
| Avg response | 2.7s | fine, since I am not babysitting it |
| Cost / task | $0.33 | rerunning flakes adds up, watch it |
| Pass rate | 93% | the 7% is usually a genuinely ambiguous spec |
| Tier | S | on Setuproll, for what that is worth |
The honest catch: rerunning a spec five times to classify it is not cheap, and on a big suite the flake-hunter can chew through tokens. I scope it to suspect specs only, never the whole suite. If you point it at everything you will get a bill you do not like.
- Add the three MCP servers, but start with just Playwright and prove it can drive a browser before adding github and sentry.
- Copy the CLAUDE.md rules. The waitForTimeout ban is the one that matters most.
- Drop in the flake-hunter subagent and let it have its own context. Do not let it edit app code.
- Wire the two hooks. Test on changes, summarize failures only on stop.
- Run it against one real flaky suite for a week before you trust it with main.
5:42That is the whole rig. It is not clever, it is just stubborn about the right things: write the test first, never hide a flake, and keep the noise out of the room where the work happens. If your green builds have been lying to you, this is the setup that finally got mine to tell the truth. Install it with the line below, point it at your worst suite, and let me know what it finds.
npx setuproll add claude-code-playwright-qa