Claude Code DevOps IaC Pilot

Setup

I let an agent touch my Terraform. Here is the harness that lets me sleep.

An infra-as-code Claude Code setup where the model plans, a hook hard-blocks prod applies, and a Stop hook tells me what it costs before I do anything dumb.

sre_nadia9 min read2026-06-20

The worst page I ever got was at 03:14 on a Tuesday. Someone ran an apply against prod from a laptop, a security group lost an ingress rule, and our payment processor could no longer reach us. Three hours, two senior engineers, one very awake CFO. Nothing was malicious. It was just a human, tired, typing yes when the plan scrolled past too fast to read.

So when people ask whether I let an AI agent run Terraform for me, the honest answer is: yes, but I built the same guardrails I would build for any junior who can move fast. This is the actual Claude Code build I run for infra work. Opus 4.8 on the model, four MCP servers, three subagents, and a set of hooks that exist precisely because I do not trust myself at 3am, let alone an eager model.

Premise

The agent never gets to apply to prod on its own. The whole design assumes the model will eventually try something stupid, and makes that attempt boring and safe instead of catastrophic.

The shape of the build

Here is what is wired up, top to bottom. Nothing exotic. The point is not clever components, it is that every dangerous action has a gate in front of it.

Layer	What I use	Why
Model	Claude Opus 4.8	Plan diffs over big HCL trees need the reasoning. I pay for it.
MCP	github, filesystem, cloudflare, sentry	PRs, local state, edge config, and the thing that tells me what broke.
Subagents	terraform-planner, policy-reviewer, incident-responder	One plans, one nags about policy, one is for when it is already on fire.
Hooks	PreToolUse, PostToolUse, Stop	Block prod apply, surface plan diff, print cost before I walk away.

The numbers on my Setuproll card: roughly 3 seconds per turn, about 55 cents a session, 86 percent pass rate on the infra eval suite. Slower and pricier than my web build. I do not care. A bad plan diff costs more than a dollar.

CLAUDE.md: three rules, no novel

My memory file for infra repos is short on purpose. A 400-line CLAUDE.md is a 400-line thing nobody reads, model included. Three rules. They are the same three I would put on the team wiki.

CLAUDE.md

# Infra agent rules

## Non-negotiable
1. Plan before apply, always. Never run `terraform apply` without
   showing me a fresh `terraform plan` output first.
2. No secrets in state or code. No literals for tokens, keys, passwords.
   Use the secrets backend. If you think you need a secret in HCL, stop
   and ask.
3. Tag every resource with owner + env. Untagged resources fail review.

## Workflow
- Work in a branch named infra/<ticket>. Open a PR via the github MCP.
- For anything touching prod, write the plan to a file and let the
  human apply it. You do not apply to prod.
- Prefer `for_each` over count. We have been burned by index churn.

The for_each line is not filler

We once had a count-indexed list reorder itself and Terraform decided to destroy and recreate eleven RDS instances. The model now knows. You should encode your scar tissue, not just your style.

settings.json: the part that actually saves me

Rules in a markdown file are suggestions. Hooks are law, because the harness runs them, not the model. This is the abbreviated version of my project settings. The PreToolUse matcher is the one I would die on a hill for.

.claude/settings.json

{
  "permissions": {
    "allow": ["Bash(terraform plan:*)", "Bash(terraform validate:*)"],
    "ask": ["Bash(terraform apply:*)"],
    "deny": ["Bash(terraform apply *prod*:*)", "Bash(rm -rf:*)"]
  },
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          { "type": "command", "command": ".claude/hooks/block-prod-apply.sh" }
        ]
      }
    ],
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          { "type": "command", "command": "terraform plan -no-color | tail -40" }
        ]
      }
    ],
    "Stop": [
      {
        "matcher": "*",
        "hooks": [
          { "type": "command", "command": ".claude/hooks/cost-estimate.sh" }
        ]
      }
    ]
  }
}

The block-prod-apply hook reads the command the model is about to run, greps for prod workspaces, and exits non-zero with a message if it sees one. Non-zero on PreToolUse means the tool call never happens. The model gets told no and has to route around it, which in practice means it writes the plan to a file and pings me. Exactly what I want.

.claude/hooks/block-prod-apply.sh

#!/usr/bin/env bash
# PreToolUse hook. stdin is JSON with the proposed tool call.
input=$(cat)
cmd=$(echo "$input" | jq -r '.tool_input.command // empty')

if echo "$cmd" | grep -Eq 'terraform +apply' && \
   echo "$cmd" | grep -Eiq 'prod|production'; then
  echo "BLOCKED: prod apply requires a human. Write the plan to" >&2
  echo "a file and tell me. I will run it myself." >&2
  exit 2   # exit 2 = block the tool call, feed stderr back to Claude
fi
exit 0

The subagents

I run three. They are not for show. Each one has a narrow job and its own context, so the planner is not also trying to remember our incident runbook.

terraform-planner runs validate, fmt, and plan, then hands me a human-readable summary of the diff. It is read-only by design, no apply in its tool allowlist.
policy-reviewer checks the plan against our rules: tags present, no public S3, no 0.0.0.0/0 on anything that is not a load balancer. It is the nag I wish I had at 03:14.
incident-responder is the break-glass one. It pulls from the sentry MCP, correlates the latest deploy, and drafts a timeline. I only invoke it when something is already broken.

.claude/agents/terraform-planner.md

---
name: terraform-planner
description: Produces a reviewable terraform plan diff. Read-only.
tools: Bash, Read, Grep
---
You generate plans, you never apply.

Steps:
1. terraform fmt -check and terraform validate.
2. terraform plan -no-color -out=tfplan.binary.
3. Summarize the diff in plain English: what is created,
   changed, destroyed. Call out anything destructive in BOLD.
4. If the plan touches a prod workspace, stop and tell the
   human to apply it. Do not attempt apply yourself.

claude-code - infra/PLAT-882

You

Add a read replica to the orders DB in staging. Plan it.

Agent

Ran fmt + validate, clean. Plan summary: + aws_db_instance.orders_replica (create) ~ aws_db_subnet_group.orders (in-place, add AZ) No destroys. Replica is staging-only, tagged owner=payments env=staging. policy-reviewer: PASS. Want me to open the PR?

You

Yes, PR it. Do not apply.

Agent

PR #214 opened via github MCP. Stop hook estimate: this run cost $0.41, the new replica adds ~$58/mo. Plan written to tfplan.binary for your apply.

The terraform-planner handing back a diff. The destroy line is the only thing I actually read.

Why the Stop hook prints cost

Two costs, both real: what the agent session burned, and what the infra change will cost per month. The second one has stopped me from merging a db.r6g.4xlarge I did not need more than once.

A real session

This is roughly what a normal afternoon looks like. Notice the part where it refuses to apply to prod. That refusal is the whole product.

zsh - infra-platform

$claude

Claude Code. 4 MCP servers connected: github, filesystem, cloudflare, sentry.

$> migrate the prod cache cluster to the new node type

terraform-planner: validate ok, generating plan...

Plan: 1 to change, 0 to add, 1 to destroy.

~ aws_elasticache_replication_group.cache (node type)

- aws_elasticache_cluster.cache_legacy (DESTROY)

Attempting: terraform apply (workspace: prod)

BLOCKED: prod apply requires a human. Plan written to tfplan.binary.

Stop hook: session $0.52. This change is in-place, no monthly delta.

$terraform show tfplan.binary | less

I read every line of the destroy. Then I apply it. Me. Not the agent.

Where I cribbed the patterns

I did not invent the hook lifecycle. The official docs and a couple of community repos saved me a weekend. If you build something like this, start here.

disler/claude-code-hooks-masteryEvery hook lifecycle event with working examples, including exit-code 2 blocking, which is the trick my prod guard relies on.github.com4.1k Claude Code Documentation - AnthropicThe canonical reference for CLAUDE.md memory, settings.json hooks, MCP servers, and subagent frontmatter. Read the hooks page twice.code.claude.comdocs

Claude Code Hooks explained in 5 minutes· IndyDevDan

What I would tell my 3am self

The thing that took me a while to accept: the agent is not the risk. The risk is the same as it has always been, an apply against prod that nobody reviewed. The agent just runs faster, so an unguarded one fails faster. Put the gate where the danger is. Plans are cheap, applies are forever, and a hook does not get tired at 3am the way I do.

If you want to start from my setup, it is on Setuproll. Drop the components into your repo, then go rewrite the rules with your own scar tissue. Install line: npx claude-code-devops-iac init. Then read the plan. Always read the plan.

Claude Code DevOps IaC Pilot

Install this build

Components

Model

MCP servers

Subagents

Hooks

Rules