I let Claude Code red-team my codebase, but it does not get commit rights
How I run Claude Code as a read-only SAST cell: Opus 4.8, the Semgrep MCP, a threat-modeler subagent that has to prove exploitability, and hooks that stop the auditor from exfiltrating the secrets it audits.
I have a rule I have repeated in enough postmortems that people quote it back at me: a tool you trust to find your bugs is also a tool that can ship them. Most of the AI coding setups I see online hand the model a hammer and full repo write access and call the result a security review. That is not a review. That is an unsupervised intern with a sudo password and a deadline.
So I built the opposite. My Claude Code setup is a sealed audit cell. It can read every line of the target, it can run Semgrep, it can trace tainted input from an HTTP handler all the way to a query. What it cannot do is write a single byte to the source tree, commit, push, install a package, or reach the network. It finds things and it argues for why they are exploitable. Then a human, me, decides what happens next.
This is the whole build. The model is Opus 4.8 because the reasoning that separates a real injection from a styled false positive is exactly where weaker models flood you with junk. Everything else exists to keep the audit honest and the auditor caged.
The CLAUDE.md is a contract, not a vibe
My CLAUDE.md for an audit reads more like rules of engagement than project notes. It states the threat model up front and refuses to soften it. The single most important line in the whole file is the one that says a finding does not exist until the model can describe how it is exploited. Scanners spray. I want a sniper. Confirm exploitability or stay quiet.
```markdown
# Audit target: payments-svc (Node 20 / Express / Postgres)

## What this is
A security audit harness, NOT a feature factory. You are reviewing other
people's code for vulnerabilities. You do not own this repo. You do not
ship to it. Your output is findings and proposed patches, full stop.

## Prime directive
- NEVER auto-apply, auto-commit, or open a PR with a fix. Propose only.
- A finding does not exist until you can describe how it is exploited.
  No exploit path = no report. I do not want a wall of "potential" noise.
- Rank by severity and reachability, not by count. Ten styled lint hits
  matter less than one reachable SQL injection.

## Threat model (assume this, do not relax it)
- All HTTP input is attacker-controlled: body, query, headers, cookies.
- Auth is the boundary. Anything past it can still be a confused-deputy.
- Secrets live in env and Vault. If you see one in code, it is a finding.
- The DB is a blast radius, not a trust zone. Trace tainted data to a sink.

## How to report a finding
For each issue give me, in this order:
1. Severity (Critical / High / Medium / Low) + CVSS-ish reasoning.
2. The sink: file:line where the bad thing happens.
3. The source: where attacker input enters and how it reaches the sink.
4. A concrete exploit (curl, payload, or steps). If you cannot, downgrade.
5. A proposed fix as a diff. Do not apply it.

## Commands you may run (read-only / analysis only)
- `npx semgrep --config auto` (via the semgrep MCP, prefer that)
- `rg` / `grep` to trace data flow
- `git log` / `git blame` to date a regression
- NEVER: `git push`, `git commit`, `npm install`, `rm`, any write to src/

## Gotchas
- This codebase wraps some queries in a helper that LOOKS parameterized
  but string-concats on one path. Read the helper before you trust it.
- "It's behind auth" is not a mitigation by itself. Check authz too.
```
settings.json: a cage, then permissions inside the cage
Here is where most of the safety actually lives. The default mode is plan, so nothing executes without me seeing the intent first. The allow list is read-only by design: reads, grep, git history, and the Semgrep MCP. The deny list is the part I actually care about, and deny always beats allow. No edits, no writes, no commit, no push, no npm, no rm, and crucially no curl. The env files are denied outright at the read level so the model cannot even look at them.
```json
{
  "permissions": {
    "defaultMode": "plan",
    "allow": [
      "Read(**)",
      "Grep(**)",
      "Glob(**)",
      "Bash(rg:*)",
      "Bash(git log:*)",
      "Bash(git blame:*)",
      "Bash(git diff:*)",
      "mcp__semgrep__*"
    ],
    "ask": [
      "Bash(npx semgrep:*)"
    ],
    "deny": [
      "Read(./.env)",
      "Read(./.env.*)",
      "Read(./**/secrets/**)",
      "Edit(**)",
      "Write(**)",
      "mcp__filesystem__write_file",
      "mcp__filesystem__edit_file",
      "mcp__filesystem__move_file",
      "mcp__filesystem__create_directory",
      "mcp__github__push_files",
      "mcp__github__create_or_update_file",
      "mcp__github__create_pull_request",
      "mcp__github__merge_pull_request",
      "Bash(git push:*)",
      "Bash(git commit:*)",
      "Bash(npm:*)",
      "Bash(rm:*)",
      "Bash(curl:*)"
    ]
  },
  "env": {
    "NODE_ENV": "development"
  },
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [{ "type": "command", "command": ".claude/hooks/secret-scan.sh" }]
      }
    ],
    "PostToolUse": [
      {
        "matcher": "Read|Grep",
        "hooks": [{ "type": "command", "command": ".claude/hooks/semgrep-diff.sh" }]
      }
    ],
    "Stop": [
      {
        "hooks": [{ "type": "command", "command": ".claude/hooks/severity-report.sh" }]
      }
    ]
  }
}
```

- `allow` is everything an auditor legitimately needs: read the world, grep the world, walk git history, call Semgrep tools. Nothing in here can mutate the target.
- `ask` is the one heavier action, a raw Semgrep CLI run, which I want to eyeball before it chews through a huge tree. The MCP path is pre-approved; the shell path is gated.
- `deny` is the cage: edits, writes (including the write tools exposed by the filesystem and GitHub MCP servers), commits, pushes, package installs, deletes, network calls, and any read of env or secrets. If any rules in this file must fail closed, it is these.
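To make "deny always beats allow" concrete, here is a toy model of that precedence in bash. It is a sketch under stated assumptions, not Claude Code's actual matcher: the prefix lists are copied from the `Bash(...)` rules above, and treating each prefix as a whole word is a simplification chosen for the example. `decide` and `has_prefix` are hypothetical names.

```bash
#!/usr/bin/env bash
# Toy model of the precedence settings.json relies on: deny, then ask, then allow.
# NOT Claude Code's matcher; prefixes mirror the Bash(<prefix>:*) rules above.
set -euo pipefail

DENY=("git push" "git commit" "npm" "rm" "curl")
ASK=("npx semgrep")
ALLOW=("rg" "git log" "git blame" "git diff")

# True if the command is exactly the prefix or starts with "prefix ".
# (Whole-word matching is this sketch's choice, not documented behavior.)
has_prefix() {
  [[ "$1" == "$2" || "$1" == "$2 "* ]]
}

# First matching list wins, and deny is always checked first.
decide() {
  local cmd="$1" p
  for p in "${DENY[@]}"; do
    if has_prefix "$cmd" "$p"; then echo deny; return 0; fi
  done
  for p in "${ASK[@]}"; do
    if has_prefix "$cmd" "$p"; then echo ask; return 0; fi
  done
  for p in "${ALLOW[@]}"; do
    if has_prefix "$cmd" "$p"; then echo allow; return 0; fi
  done
  echo default  # unmatched: falls back to defaultMode ("plan" here)
}
```

Appending a broad `git` prefix to `ALLOW` still leaves `decide 'git push origin main'` returning `deny`, which is the property the cage depends on.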
.mcp.json: Semgrep is the muscle, the rest is context
Four servers, no more. Filesystem and GitHub give the model the code and the issue history. Sentry is underrated for security work: a stack trace from a real production crash often points straight at an unhandled, attacker-reachable path that a static scan ranks as low. And Semgrep is the workhorse. The Semgrep MCP lets the model run rules and read structured results directly, instead of me piping CLI output into a prompt and hoping it parses.
```json
{
  "mcpServers": {
    "filesystem": {
      "type": "stdio",
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "."]
    },
    "github": {
      "type": "http",
      "url": "https://api.githubcopilot.com/mcp/"
    },
    "sentry": {
      "type": "http",
      "url": "https://mcp.sentry.dev/mcp"
    },
    "semgrep": {
      "type": "stdio",
      "command": "uvx",
      "args": ["semgrep-mcp"],
      "env": { "SEMGREP_RULES": "p/owasp-top-ten p/secrets p/javascript" }
    }
  }
}
```

I pin `p/owasp-top-ten`, `p/secrets`, and `p/javascript` on the Semgrep MCP so the scan starts from a known, auditable baseline rather than whatever `auto` decides today. Reproducible scans matter when you have to defend a finding three weeks later.
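The same pinned baseline can be re-run by hand from a terminal outside the audit session when a finding has to be defended later. This is a hedged sketch: `pinned_scan`, `build_args`, `AUDIT_OUT`, and the output naming are hypothetical, while `scan`, `--config`, `--json`, `--output`, and `--metrics=off` are standard Semgrep CLI options. Results are written outside the repository so the target tree stays untouched.

```bash
#!/usr/bin/env bash
# Sketch: re-run the pinned baseline by hand, outside the audit session.
# AUDIT_OUT and the file naming are hypothetical; results land outside the repo.
set -euo pipefail

RULESETS=(p/owasp-top-ten p/secrets p/javascript)
SEMGREP_ARGS=()

# Build the argv: no telemetry, JSON to a file, one --config per pinned pack.
build_args() {
  local target="$1" out="$2" r
  SEMGREP_ARGS=(scan --metrics=off --json --output "$out")
  for r in "${RULESETS[@]}"; do SEMGREP_ARGS+=(--config "$r"); done
  SEMGREP_ARGS+=("$target")
}

# Run the scan and record what produced it: Semgrep version, commit, rulesets.
pinned_scan() {
  local target="${1:-.}" outdir="${AUDIT_OUT:-${TMPDIR:-/tmp}/audit}" stamp
  stamp=$(date -u +%Y%m%dT%H%M%SZ)
  mkdir -p "$outdir"
  build_args "$target" "$outdir/semgrep-$stamp.json"
  {
    echo "semgrep: $(semgrep --version)"
    echo "commit: $(git rev-parse HEAD 2>/dev/null || echo unknown)"
    echo "rulesets: ${RULESETS[*]}"
  } > "$outdir/semgrep-$stamp.meta"
  semgrep "${SEMGREP_ARGS[@]}"
}
```

Source the file and call `pinned_scan src/`. The `.meta` file matters because registry packs such as `p/secrets` are updated upstream, so the pack names alone do not freeze the rules you ran.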
Subagents: scanner, adversary, patch author
Three subagents, each its own isolated context so noise from one does not poison the others. The sast-scanner runs the rules and collects raw candidates. The threat-modeler is the one that earns its keep: it takes those candidates and tries to actually exploit them, dropping anything it cannot reach. The fix-author writes a proposed patch as a diff for the survivors. Note the handoff: scanner finds, modeler proves, fix-author drafts, and none of them can apply anything.
- sast-scanner (Opus): drives the Semgrep MCP, collects hits, and does a first grep-based pass for the patterns rules miss. Output is a flat candidate list, nothing ranked yet.
- threat-modeler (Opus): the adversary. Traces source to sink, writes the exploit, drops the unreachable, and ranks survivors by reachability times impact. This is the gate.
- fix-author (Opus): proposes a minimal patch as a diff with a note on why it closes the path. Never applies it. The diff goes in the report for a human to take or leave.
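The threat-modeler applies its gate as judgment, not as a script, but the arithmetic behind "reachability times impact" is easy to pin down. The sketch below assumes a hypothetical tab-separated candidate list with an `id`, a reachability score from 0 (unreachable) to 3 (unauthenticated), an impact score from 1 to 3, and the exploit text; the scales and the `rank_candidates` name are illustrative, not part of the build.

```bash
#!/usr/bin/env bash
# Illustrative only: the threat-modeler does this reasoning in prose, not in awk.
# Hypothetical input, one candidate per line, tab-separated:
#   id <TAB> reachability (0..3) <TAB> impact (1..3) <TAB> exploit (empty if none)
set -euo pipefail

TAB=$'\t'

# Drop unreachable candidates, score survivors as reachability x impact,
# floor anything without a written exploit at 1, rank highest score first.
rank_candidates() {
  awk -F'\t' -v OFS='\t' '
    NF == 0     { next }                    # skip blank lines
    $2 + 0 == 0 { next }                    # UNREACHABLE: dropped, never reported
    {
      score = $2 * $3
      if ($4 == "" && score > 1) score = 1  # no exploit: downgraded to the floor
      print score, NR, $1
    }' "$1" | sort -t "$TAB" -k1,1nr -k2,2n | cut -f1,3
}
```

Each output line is `score<TAB>id`. A candidate with reachability 0 never appears, and one without a written exploit is floored at 1 however bad its impact looks, mirroring the "at most Low" rule in the subagent prompt.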
Here is the threat-modeler verbatim. The system prompt is basically me yelling 'prove it' in a structured way.
```markdown
---
name: threat-modeler
description: >
  Turns raw scanner hits into a ranked, exploitable threat model. Use after
  sast-scanner produces candidate findings, before anything is reported.
tools: Read, Grep, Glob, Bash
model: opus
---
You are an adversary. Your job is to decide which candidate findings are
real and which are noise, then rank the real ones.

For every candidate you receive:
1. Find the SOURCE. Where does attacker-controlled input enter? If you
   cannot trace tainted data from an HTTP boundary (or other untrusted
   input) to the sink, mark it UNREACHABLE and drop it. Do not report it.
2. Build the exploit. Write the actual payload or curl that triggers it.
   If you cannot write one, the severity is at most Low and you say why.
3. Score it. Severity is reachability x impact, not the rule's default.
   A reflected value in an admin-only page is not the same as an
   unauthenticated RCE. Say which.
4. Check the "mitigation". If the code claims to sanitize, read the
   sanitizer. Half the false negatives in this codebase are helpers that
   look safe and are not.

Output a ranked list, Critical first. One paragraph per finding: source,
sink, exploit, blast radius. Hand it to fix-author. You do NOT write fixes
and you do NOT touch files.
```
Hooks: the parts I refuse to leave to the model's good intentions
Rules in CLAUDE.md are advice. A motivated prompt injection can talk a model out of advice. Hooks are shell scripts the harness runs no matter what the model wants, so anything that is genuinely load-bearing for safety goes here. Three of them.
- PreToolUse runs a secret scanner on every Bash call. In an audit the danger is reversed: I am not stopping the model from leaking my secrets to git, I am stopping it from exfiltrating the target's secrets at all. It blocks reads of env files and any outbound network primitive.
- PostToolUse runs an incremental Semgrep over whatever region the model just read, so coverage tracks attention instead of one giant scan at the start that everyone forgets by turn forty.
- Stop assembles the severity-ranked report. The session does not end with a vague summary; it ends with Critical-first findings, each with a source, a sink, and an exploit, or it does not end clean.
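The post does not show `severity-report.sh`, so what follows is only a hypothetical sketch of how a Stop hook could enforce "source, sink, exploit, or it does not end clean". It assumes findings are appended to a tab-separated file (the `AUDIT_FINDINGS` variable and its default path are made up). It relies on two documented Claude Code behaviors: exit 2 from a Stop hook blocks the stop and feeds stderr back to Claude, and the `stop_hook_active` field in the hook input lets the hook step aside instead of blocking forever.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of .claude/hooks/severity-report.sh (not shown in the post).
# Assumed input: a TSV the audit appends to, one finding per line:
#   severity <TAB> sink (file:line) <TAB> source <TAB> exploit
# AUDIT_FINDINGS and its default path are illustrative, not a Claude Code feature.
set -euo pipefail

TAB=$'\t'
FINDINGS="${AUDIT_FINDINGS:-${TMPDIR:-/tmp}/audit-findings.tsv}"

# Fail (return 1) if any non-blank finding is missing a sink, a source, or an exploit.
validate() {
  awk -F'\t' 'NF && ($2 == "" || $3 == "" || $4 == "") {
    print "incomplete finding: " $0 > "/dev/stderr"; bad = 1
  } END { exit bad }' "$1"
}

# Print findings Critical-first, keeping file order within each severity.
rank_findings() {
  awk -F'\t' -v OFS='\t' '
    BEGIN { r["Critical"] = 0; r["High"] = 1; r["Medium"] = 2; r["Low"] = 3 }
    NF    { print (($1 in r) ? r[$1] : 9), NR, $0 }' "$1" |
    sort -t "$TAB" -k1,1n -k2,2n | cut -f3-
}

main() {
  local input
  input=$(cat)  # Stop hook payload (JSON) from Claude Code
  # stop_hook_active is true when a Stop hook already blocked once; let the
  # session end rather than loop. Crude match to avoid a jq dependency.
  if [[ "$input" =~ \"stop_hook_active\"[[:space:]]*:[[:space:]]*true ]]; then exit 0; fi
  if [[ ! -s "$FINDINGS" ]]; then
    echo "severity-report: no reportable findings"
    exit 0
  fi
  if ! validate "$FINDINGS"; then
    echo "severity-report: every finding needs a sink, a source, and an exploit" >&2
    exit 2  # exit 2 blocks the stop; stderr goes back to Claude
  fi
  rank_findings "$FINDINGS"
}

# CLAUDE_PROJECT_DIR is set when Claude Code launches a hook, so sourcing
# this file elsewhere (for tests) defines the functions without running main.
if [[ -n "${CLAUDE_PROJECT_DIR:-}" ]]; then main; fi
```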
The blocking contract is the same one everyone gets wrong: a PreToolUse hook only stops the call on exit code 2. Exit 0 lets it run; any other code is treated as a soft error and the session keeps going. Test your blocking hooks before you trust your cage. Here is the real one.
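```bash
#!/usr/bin/env bash
# .claude/hooks/secret-scan.sh
# PreToolUse hook on Bash. Reads the tool call as JSON on stdin.
# Exit 2 => block the command, stderr is shown back to the model.
# In an AUDIT context the threat is partly the agent itself: it must not
# exfiltrate the secrets it is auditing. So we block both the read of
# secret material and any outbound network primitive.
set -euo pipefail
# Fail closed: without this, a jq or JSON error would make set -e exit with
# a non-2 code, which is a soft error, and the command would still run.
trap 'echo "secret-scan: hook error, failing closed" >&2; exit 2' ERR
command -v jq >/dev/null || { echo "secret-scan: jq not found, failing closed" >&2; exit 2; }
INPUT=$(cat)
COMMAND=$(jq -r '.tool_input.command // empty' <<<"$INPUT")
# Block any command that touches an env / secret file. The boundary before
# ".env" keeps legitimate searches for process.env in source code working.
# Here-strings avoid a printf | grep pipeline that pipefail could misreport.
if grep -qiE '(^|[^[:alnum:]_])\.env|secrets/|id_rsa|\.pem' <<<"$COMMAND"; then
  echo "secret-scan: reading secret material is denied in audit mode" >&2
  exit 2
fi
# Block anything that looks like exfiltration over the network. Word
# boundaries stop "nc" from matching inside words like "sync" or "--since".
if grep -qiE '(^|[^[:alnum:]_])(curl|wget|nc|scp)([^[:alnum:]_]|$)|/dev/tcp' <<<"$COMMAND"; then
  echo "secret-scan: outbound network from the audit cell is denied" >&2
  exit 2
fi
exit 0
```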
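Following the advice to test blocking hooks, here is a minimal harness that feeds a Bash tool-call payload into a hook and asserts on the exit code. The payload carries only the field the hook above reads; `payload` and `expect_exit` are hypothetical helper names, and the JSON escaping covers backslashes and double quotes only (no newlines or control characters).

```bash
#!/usr/bin/env bash
# Smoke-test a PreToolUse hook: feed it the JSON it would get on stdin and
# check the exit code, because only exit 2 actually blocks the call.
# Real hook input carries more fields (session_id, hook_event_name, ...),
# but secret-scan.sh only reads tool_input.command.
set -u  # no -e or pipefail: non-zero exit codes are the thing under test

# Build a Bash tool-call payload, escaping backslashes and double quotes for JSON.
payload() {
  local cmd
  cmd=$(printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g')
  printf '{"tool_name":"Bash","tool_input":{"command":"%s"}}' "$cmd"
}

# expect_exit HOOK WANT COMMAND: run HOOK on the payload and compare exit codes.
expect_exit() {
  local hook="$1" want="$2" cmd="$3" got
  "$hook" <<<"$(payload "$cmd")" >/dev/null 2>&1
  got=$?
  if [[ "$got" -ne "$want" ]]; then
    echo "FAIL: '$cmd' exited $got, wanted $want" >&2
    return 1
  fi
  echo "ok: '$cmd' -> $got"
}
```

Pointed at the real hook, `expect_exit .claude/hooks/secret-scan.sh 2 'cat .env'` and `expect_exit .claude/hooks/secret-scan.sh 0 'git log -5'` should both print `ok`. A hook that exits 1 on a secret read fails the first check, which is exactly the soft-error trap described above.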
What a real session looks like
I point it at a branch and let the scanner go, but the part I watch is the threat-modeler throwing things out. A scan that returns forty hits and gets cut to three real ones is the scan working correctly. Count is vanity. Reachability is the metric.
What it caught, what it does not
I am not going to pretend this replaces a human pentest. It does not reason about business logic abuse the way a person does, and it will miss a multi-step auth bypass that needs three requests in the right order. But for the boring, dense, easy-to-miss class of bugs across a large tree, it is genuinely good, and the exploitability gate keeps the report short enough that I read all of it.
| Bug class | How well it does | Notes |
|---|---|---|
| SQL / NoSQL injection | Strong | Source-to-sink tracing is where Opus earns the cost |
| Hardcoded secrets / keys | Strong | Semgrep p/secrets plus the deny on env reads |
| IDOR / broken object authz | Good | Catches missing owner checks; misses subtle role logic |
| XSS (reflected / stored) | Good | Reads the sanitizer instead of trusting its name |
| Business-logic abuse | Weak | Still a human job; it does not model intent well |
| Race conditions / TOCTOU | Weak | Timing bugs need a human and often a runtime |
The setup did not make Claude a better hacker. It made the report short enough to trust and impossible to act on without me. That is exactly the trade I want from a tool near my secrets.
Take it, but keep the cage
Everything above ships in this build: the rules-of-engagement CLAUDE.md, the deny-heavy settings.json, the four MCP servers, the three subagents, and the three hooks. If you change one thing, do not change the deny list. The cleverness is optional. The cage is not. Adjust the rulesets and the target notes to your stack, then run it against a branch you do not mind reading hard truths about.