Claude Code Legacy Refactor Engine

Workflow

I let Claude touch a 14-year-old codebase. Here is the harness that kept me employed.

An Opus 4.8 setup that writes characterization tests first, refactors in tiny behavior-preserving steps, and hands me a behavior-diff every time it stops.

legacy_hank9 min read

The code I get paid to fix was written before the iPhone shipped. It has a 1,900 line file named utils.php, a function called doStuff2, and a comment that just says // do not remove this, nobody knows why. There are no tests. There is no spec. The only documentation is the bug tracker, and the bug tracker is also a lie.

So when people ask whether I let an AI refactor this stuff, the honest answer is: yes, but on a very short leash. I am not worried about the model being dumb. Opus 4.8 untangles knots that used to eat my whole afternoon. I am worried about the model being confidently helpful, quietly changing behavior nobody asked it to change, and me finding out at 2am from a pager.

This build is the leash. The whole thing is built around one rule I have screamed at junior devs for years: pin the behavior down before you move anything.

The one law

Never change behavior and structure in the same commit. If a refactor and a behavior change land together and something breaks, you cannot tell which one did it. That is how you spend a weekend bisecting your own cleverness.

What is in the box

Stripped down, the build is Claude Opus 4.8 in Claude Code, three MCP servers, three subagents, and three hooks that fire whether the model remembers to be careful or not. I do not rely on the model remembering anything. I rely on the harness.

Piece	What I run	Why
Model	Claude Opus 4.8	It is slower and pricier, and on tangled legacy logic it is the only one that does not fold
MCP	filesystem, github, postgres	Read the tree, open PRs, and actually query the DB the old code leans on
Subagents	characterization-tester, refactorer, diff-reviewer	Three narrow jobs beat one model trying to do all three at once
Hooks	snapshot, run tests, behavior-diff	Safety that does not depend on anyone being disciplined

Numbers from my last quarter on it: roughly 3.2 seconds per turn, about 63 cents a session, and an 88 percent pass rate on the characterization suites it writes. The cost stings a little. The cost of shipping a silent regression into a billing system stings a lot more, so I made my peace.

The rules file, where the discipline lives

CLAUDE.md is the first thing the agent reads, so this is where I put the law. Short, blunt, no philosophy. The model does not need a TED talk, it needs the three things it must not do.

CLAUDE.md

# Legacy refactor rules

This is a 14-year-old PHP/MySQL codebase. There are no specs.
Treat existing behavior as the spec, including the bugs.

## Hard rules
1. Add characterization tests BEFORE touching any logic.
   If a function has no test pinning its current output, you
   write that test first and confirm it passes against the
   UNCHANGED code.
2. Refactor in tiny behavior-preserving steps. Rename, extract,
   inline. One transformation per commit.
3. NEVER change behavior and structure in the same commit.
   A bug fix is a separate, clearly labeled commit.

## When unsure
- Preserve the weird behavior. The weird behavior is load-bearing
  until proven otherwise. Flag it, do not "fix" it.
- Read the calling code before you trust a function's name.
  doStuff2 does not do stuff. It computes tax.

Tip

That last line is real. I keep a running list of lying function names in the file. It saves the agent (and me) from believing the names, which in this codebase is a rookie mistake.

Subagent one: pin the behavior

The characterization-tester is the most important member of the crew and the one I had to fight Claude the hardest to keep narrow. Its only job is to capture what the code does right now. Not what it should do. What it does. Golden master, snapshot, table of inputs and outputs, whatever fits.

.claude/agents/characterization-tester.md

---
name: characterization-tester
description: Pins down the CURRENT behavior of legacy code with tests before any refactor. Use proactively before editing logic.
tools: Read, Grep, Glob, Bash, Edit
model: opus
---

You write characterization tests, not correctness tests.

Process:
1. Read the target function and every caller you can find.
2. Enumerate real input shapes: empty, null, the obvious happy
   path, and the ugly edge cases the code clearly handles.
3. Write tests that assert the CURRENT output, even if it looks
   wrong. Wrong-but-current is the goal.
4. Run them against the unchanged code. They MUST pass. If a
   test fails, your assumption about current behavior is wrong,
   not the code. Fix the test.
5. Report coverage gaps you could not pin down. Do not guess.

You never refactor. You never fix bugs. You only describe
reality with assertions.

The trick is step four. If the test you wrote fails against the original code, you were wrong about the code, not the other way around. Every junior I have mentored gets this backwards at least once. Nice to have a teammate that gets corrected and then stays corrected.

The hooks: trust nobody, including the model

Here is where I sleep at night. The hooks run on lifecycle events, so they fire even if the model decides to be a cowboy. Before any edit I snapshot the file. After any edit the characterization suite runs. When the session stops, I get a behavior-diff report I actually read.

.claude/settings.json

{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          {
            "type": "command",
            "command": "scripts/snapshot.sh \"$CLAUDE_TOOL_FILE\""
          }
        ]
      }
    ],
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          {
            "type": "command",
            "command": "scripts/char-tests.sh \"$CLAUDE_TOOL_FILE\""
          }
        ]
      }
    ],
    "Stop": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "scripts/behavior-diff.sh"
          }
        ]
      }
    ]
  }
}

The snapshot script is dumb on purpose: copy the file into .refactor-snapshots/ keyed by a timestamp. The behavior-diff re-runs the characterization suite against the snapshot baseline and the working tree, then prints anything that drifted. If a single golden value changed and the commit was supposed to be a pure rename, that is a red flag, and I want it loud.

zsh - refactor-engine

kicking off an extract-method on the tax mess in utils.php

$claude "extract the VAT calc out of doStuff2 into its own function"

[characterization-tester] no tests pin doStuff2. writing 11 first.

PASS tests/Char/DoStuff2Test.php (11 assertions, against unchanged code)

[refactorer] extracting computeVat(). behavior preserved.

[hook:PostToolUse] char-tests.sh -> 11 passed, 0 changed

[hook:Stop] behavior-diff: 0 golden values drifted. safe to commit.

$git commit -m "refactor: extract computeVat from doStuff2 (no behavior change)"

Read the diff report. Actually read it.

The whole build is pointless if you rubber-stamp the Stop report. When it says zero golden values drifted, that is the agent telling you the refactor is provably a no-op. When it says one drifted, that is the agent saving your job. Treat both as load-bearing.

The other two subagents, briefly

refactorer: does exactly one transformation per pass. Rename, extract, inline. It is forbidden from fixing anything, even when the bug is right there grinning at it. Separate commit, separate label.
diff-reviewer: reads the final diff like a grumpy senior on a Friday. Its job is to catch the moment structure and behavior sneaked into the same change, and to say so before the PR goes up.

diff-reviewer - PR check

You

review the diff for PR #4417 before I push

Agent

Two transformations in one commit. The computeVat extract is clean and behavior-preserving. But you also changed the rounding from floor() to round() on line 88. That is a behavior change hiding in a refactor. Split it into its own commit labeled fix, and add a characterization test that captures the OLD floor() result so we know exactly what we are changing.

You

...yeah. good catch. splitting it.

The reviewer flagging a mixed commit before it reached a human. This is the message I wanted a teammate to send me for 22 years.

Why Opus, why these three MCP servers

I tried cheaper models on this. They are great until the call graph gets deep, and then they start hallucinating helper functions that do not exist or losing track of which branch they are in. On a clean greenfield app, fine, use the cheap one. On code where one function reaches through four files and a stored procedure, I want the model that holds the whole mess in its head. That is Opus 4.8, and the 63 cents is cheaper than my time bisecting.

The MCP set is small and boring on purpose. filesystem so it can actually read the tree instead of guessing. github so the diff-reviewer can open the PR. postgres because half the real behavior in this app lives in the database, and you cannot characterize what you cannot query. I do not add servers I will not use. Every tool in the allowlist is a tool the model can surprise me with.

Claude Code Hooks explained in 5 minutes· IndyDevDan

If you only watch one thing before copying my hooks, watch that. The lifecycle is the whole point. Hooks fire on the event, not on the model's good intentions, and on legacy work good intentions are the first thing to evaporate.

disler/claude-code-hooks-masteryEvery hook lifecycle event with working examples. I lifted the matcher pattern for my snapshot hook from here and trimmed it down.disler/claude-code-hooks-mastery4.2k Create custom subagents - Claude Code DocsThe official spec for the YAML frontmatter, the tool allowlist, and the isolated context window. Read the tools field section twice before you give a refactorer Bash.code.claude.com

The honest caveats

This is slow. Writing characterization tests before every change roughly doubles the work up front, and on a tight deadline you will hate it. I have hated it. But the tests are not throwaway. Once the behavior is pinned, it stays pinned, and the next person who touches that function inherits a safety net instead of a minefield. The up-front cost is real and it pays back the third time someone refactors the same code.

It also will not save you from yourself if you ignore the reports. I have watched people install all of this and then skim the Stop output like terms of service. The harness only works if you treat a drifted golden value as an emergency, not a suggestion.

If you want to run it, the build is on Setuproll as claude-code-refactor-legacy. Install Claude Code, drop the CLAUDE.md and the three agent files in, wire the hooks, and start it with claude --model opus. Then go pick the gnarliest function you are scared of and let it pin the behavior first. That moment, when the diff report says zero golden values drifted on a function you have feared for years, is the closest thing to peace this job offers.

Claude Code Legacy Refactor Engine

Install this build

Components

Model

MCP servers

Subagents

Hooks

Rules