I let Claude touch a 14-year-old codebase. Here is the harness that kept me employed.
An Opus 4.8 setup that writes characterization tests first, refactors in tiny behavior-preserving steps, and hands me a behavior-diff every time it stops.
The code I get paid to fix was written before the iPhone shipped. It has a 1,900 line file named utils.php, a function called doStuff2, and a comment that just says // do not remove this, nobody knows why. There are no tests. There is no spec. The only documentation is the bug tracker, and the bug tracker is also a lie.
So when people ask whether I let an AI refactor this stuff, the honest answer is: yes, but on a very short leash. I am not worried about the model being dumb. Opus 4.8 untangles knots that used to eat my whole afternoon. I am worried about the model being confidently helpful, quietly changing behavior nobody asked it to change, and me finding out at 2am from a pager.
This build is the leash. The whole thing is built around one rule I have screamed at junior devs for years: pin the behavior down before you move anything.
What is in the box
Stripped down, the build is Claude Opus 4.8 in Claude Code, three MCP servers, three subagents, and three hooks that fire whether the model remembers to be careful or not. I do not rely on the model remembering anything. I rely on the harness.
| Piece | What I run | Why |
|---|---|---|
| Model | Claude Opus 4.8 | It is slower and pricier, and on tangled legacy logic it is the only one that does not fold |
| MCP | filesystem, github, postgres | Read the tree, open PRs, and actually query the DB the old code leans on |
| Subagents | characterization-tester, refactorer, diff-reviewer | Three narrow jobs beat one model trying to do all three at once |
| Hooks | snapshot, run tests, behavior-diff | Safety that does not depend on anyone being disciplined |
Numbers from my last quarter on it: roughly 3.2 seconds per turn, about 63 cents a session, and an 88 percent pass rate on the characterization suites it writes. The cost stings a little. The cost of shipping a silent regression into a billing system stings a lot more, so I made my peace.
The rules file, where the discipline lives
CLAUDE.md is the first thing the agent reads, so this is where I put the law. Short, blunt, no philosophy. The model does not need a TED talk, it needs the three things it must not do.
# Legacy refactor rules
This is a 14-year-old PHP/MySQL codebase. There are no specs.
Treat existing behavior as the spec, including the bugs.
## Hard rules
1. Add characterization tests BEFORE touching any logic.
If a function has no test pinning its current output, you
write that test first and confirm it passes against the
UNCHANGED code.
2. Refactor in tiny behavior-preserving steps. Rename, extract,
inline. One transformation per commit.
3. NEVER change behavior and structure in the same commit.
A bug fix is a separate, clearly labeled commit.
## When unsure
- Preserve the weird behavior. The weird behavior is load-bearing
until proven otherwise. Flag it, do not "fix" it.
- Read the calling code before you trust a function's name.
doStuff2 does not do stuff. It computes tax.Subagent one: pin the behavior
The characterization-tester is the most important member of the crew and the one I had to fight Claude the hardest to keep narrow. Its only job is to capture what the code does right now. Not what it should do. What it does. Golden master, snapshot, table of inputs and outputs, whatever fits.
---
name: characterization-tester
description: Pins down the CURRENT behavior of legacy code with tests before any refactor. Use proactively before editing logic.
tools: Read, Grep, Glob, Bash, Edit
model: opus
---
You write characterization tests, not correctness tests.
Process:
1. Read the target function and every caller you can find.
2. Enumerate real input shapes: empty, null, the obvious happy
path, and the ugly edge cases the code clearly handles.
3. Write tests that assert the CURRENT output, even if it looks
wrong. Wrong-but-current is the goal.
4. Run them against the unchanged code. They MUST pass. If a
test fails, your assumption about current behavior is wrong,
not the code. Fix the test.
5. Report coverage gaps you could not pin down. Do not guess.
You never refactor. You never fix bugs. You only describe
reality with assertions.The trick is step four. If the test you wrote fails against the original code, you were wrong about the code, not the other way around. Every junior I have mentored gets this backwards at least once. Nice to have a teammate that gets corrected and then stays corrected.
The hooks: trust nobody, including the model
Here is where I sleep at night. The hooks run on lifecycle events, so they fire even if the model decides to be a cowboy. Before any edit I snapshot the file. After any edit the characterization suite runs. When the session stops, I get a behavior-diff report I actually read.
{
"hooks": {
"PreToolUse": [
{
"matcher": "Edit|Write",
"hooks": [
{
"type": "command",
"command": "scripts/snapshot.sh \"$CLAUDE_TOOL_FILE\""
}
]
}
],
"PostToolUse": [
{
"matcher": "Edit|Write",
"hooks": [
{
"type": "command",
"command": "scripts/char-tests.sh \"$CLAUDE_TOOL_FILE\""
}
]
}
],
"Stop": [
{
"hooks": [
{
"type": "command",
"command": "scripts/behavior-diff.sh"
}
]
}
]
}
}The snapshot script is dumb on purpose: copy the file into .refactor-snapshots/ keyed by a timestamp. The behavior-diff re-runs the characterization suite against the snapshot baseline and the working tree, then prints anything that drifted. If a single golden value changed and the commit was supposed to be a pure rename, that is a red flag, and I want it loud.
The other two subagents, briefly
- refactorer: does exactly one transformation per pass. Rename, extract, inline. It is forbidden from fixing anything, even when the bug is right there grinning at it. Separate commit, separate label.
- diff-reviewer: reads the final diff like a grumpy senior on a Friday. Its job is to catch the moment structure and behavior sneaked into the same change, and to say so before the PR goes up.
Why Opus, why these three MCP servers
I tried cheaper models on this. They are great until the call graph gets deep, and then they start hallucinating helper functions that do not exist or losing track of which branch they are in. On a clean greenfield app, fine, use the cheap one. On code where one function reaches through four files and a stored procedure, I want the model that holds the whole mess in its head. That is Opus 4.8, and the 63 cents is cheaper than my time bisecting.
The MCP set is small and boring on purpose. filesystem so it can actually read the tree instead of guessing. github so the diff-reviewer can open the PR. postgres because half the real behavior in this app lives in the database, and you cannot characterize what you cannot query. I do not add servers I will not use. Every tool in the allowlist is a tool the model can surprise me with.
5:42If you only watch one thing before copying my hooks, watch that. The lifecycle is the whole point. Hooks fire on the event, not on the model's good intentions, and on legacy work good intentions are the first thing to evaporate.
disler/claude-code-hooks-masteryEvery hook lifecycle event with working examples. I lifted the matcher pattern for my snapshot hook from here and trimmed it down.disler/claude-code-hooks-mastery4.2kCreate custom subagents - Claude Code DocsThe official spec for the YAML frontmatter, the tool allowlist, and the isolated context window. Read the tools field section twice before you give a refactorer Bash.code.claude.comThe honest caveats
This is slow. Writing characterization tests before every change roughly doubles the work up front, and on a tight deadline you will hate it. I have hated it. But the tests are not throwaway. Once the behavior is pinned, it stays pinned, and the next person who touches that function inherits a safety net instead of a minefield. The up-front cost is real and it pays back the third time someone refactors the same code.
It also will not save you from yourself if you ignore the reports. I have watched people install all of this and then skim the Stop output like terms of service. The harness only works if you treat a drifted golden value as an emergency, not a suggestion.
If you want to run it, the build is on Setuproll as claude-code-refactor-legacy. Install Claude Code, drop the CLAUDE.md and the three agent files in, wire the hooks, and start it with claude --model opus. Then go pick the gnarliest function you are scared of and let it pin the behavior first. That moment, when the diff report says zero golden values drifted on a function you have feared for years, is the closest thing to peace this job offers.