Claude Code · DevOps / infrastructure as code

Claude Code DevOps IaC Pilot

Nadia Brandt@sre_nadia
Overall score: 88.0

A terraform-planner subagent produces a reviewable plan diff before anything applies, and the PreToolUse hook hard-blocks prod applies without human sign-off. Cloudflare MCP and cost estimates keep infra changes safe and predictable.

Score: 88.0
Votes: 734
Components: 5
Updated: 9h

Install this build

Export
terminal
npx setuproll add claude-code-devops-iac

Components

Model

  • Claude Opus 4.8

MCP servers

  • github
  • filesystem
  • cloudflare
  • sentry

Subagents

  • terraform-planner
  • policy-reviewer
  • incident-responder

Hooks

  • PreToolUse: block apply on prod
  • PostToolUse: terraform plan diff
  • Stop: cost estimate

Rules

  • Plan before apply, always
  • No secrets in state or code
  • Tag every resource with owner+env
Setup

I let an agent touch my Terraform. Here is the harness that lets me sleep.

An infra-as-code Claude Code setup where the model plans, a hook hard-blocks prod applies, and a Stop hook tells me what it costs before I do anything dumb.

sre_nadia · 9 min read · 2026-06-20

The worst page I ever got was at 03:14 on a Tuesday. Someone ran an apply against prod from a laptop, a security group lost an ingress rule, and our payment processor could no longer reach us. Three hours, two senior engineers, one very awake CFO. Nothing was malicious. It was just a human, tired, typing yes when the plan scrolled past too fast to read.

So when people ask whether I let an AI agent run Terraform for me, the honest answer is: yes, but I built the same guardrails I would build for any junior who can move fast. This is the actual Claude Code build I run for infra work. Opus 4.8 on the model, four MCP servers, three subagents, and a set of hooks that exist precisely because I do not trust myself at 3am, let alone an eager model.

Premise
The agent never gets to apply to prod on its own. The whole design assumes the model will eventually try something stupid, and makes that attempt boring and safe instead of catastrophic.

The shape of the build

Here is what is wired up, top to bottom. Nothing exotic. The point is not clever components, it is that every dangerous action has a gate in front of it.

| Layer | What I use | Why |
| --- | --- | --- |
| Model | Claude Opus 4.8 | Plan diffs over big HCL trees need the reasoning. I pay for it. |
| MCP | github, filesystem, cloudflare, sentry | PRs, local state, edge config, and the thing that tells me what broke. |
| Subagents | terraform-planner, policy-reviewer, incident-responder | One plans, one nags about policy, one is for when it is already on fire. |
| Hooks | PreToolUse, PostToolUse, Stop | Block prod apply, surface plan diff, print cost before I walk away. |

The numbers on my Setuproll card: roughly 3 seconds per turn, about 55 cents a session, 86 percent pass rate on the infra eval suite. Slower and pricier than my web build. I do not care. A bad plan diff costs more than a dollar.

CLAUDE.md: three rules, no novel

My memory file for infra repos is short on purpose. A 400-line CLAUDE.md is a 400-line thing nobody reads, model included. Three rules. They are the same three I would put on the team wiki.

CLAUDE.md
# Infra agent rules

## Non-negotiable
1. Plan before apply, always. Never run `terraform apply` without
   showing me a fresh `terraform plan` output first.
2. No secrets in state or code. No literals for tokens, keys, passwords.
   Use the secrets backend. If you think you need a secret in HCL, stop
   and ask.
3. Tag every resource with owner + env. Untagged resources fail review.

## Workflow
- Work in a branch named infra/<ticket>. Open a PR via the github MCP.
- For anything touching prod, write the plan to a file and let the
  human apply it. You do not apply to prod.
- Prefer `for_each` over count. We have been burned by index churn.
The for_each line is not filler
We once had a count-indexed list reorder itself and Terraform decided to destroy and recreate eleven RDS instances. The model now knows. You should encode your scar tissue, not just your style.

settings.json: the part that actually saves me

Rules in a markdown file are suggestions. Hooks are law, because the harness runs them, not the model. This is the abbreviated version of my project settings. The PreToolUse matcher is the one I would die on a hill for.

.claude/settings.json
{
  "permissions": {
    "allow": ["Bash(terraform plan:*)", "Bash(terraform validate:*)"],
    "ask": ["Bash(terraform apply:*)"],
    "deny": ["Bash(terraform apply *prod*:*)", "Bash(rm -rf:*)"]
  },
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          { "type": "command", "command": ".claude/hooks/block-prod-apply.sh" }
        ]
      }
    ],
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          { "type": "command", "command": "terraform plan -no-color | tail -40" }
        ]
      }
    ],
    "Stop": [
      {
        "matcher": "*",
        "hooks": [
          { "type": "command", "command": ".claude/hooks/cost-estimate.sh" }
        ]
      }
    ]
  }
}

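The block-prod-apply hook reads the command the model is about to run, greps it for a terraform apply that mentions prod, and exits with code 2 and a message if it finds one. Exit code 2 on PreToolUse means the tool call never happens and the stderr message goes back to the model. Any other non-zero code is treated as a non-blocking error and the call still runs, so the exact code matters. The model gets told no and has to route around it, which in practice means it writes the plan to a file and pings me. Exactly what I want.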

.claude/hooks/block-prod-apply.sh
#!/usr/bin/env bash
# PreToolUse hook. stdin is JSON with the proposed tool call.
input=$(cat)
cmd=$(echo "$input" | jq -r '.tool_input.command // empty')

if echo "$cmd" | grep -Eq 'terraform +apply' && \
   echo "$cmd" | grep -Eiq 'prod|production'; then
  echo "BLOCKED: prod apply requires a human. Write the plan to" >&2
  echo "a file and tell me. I will run it myself." >&2
  exit 2   # exit 2 = block the tool call, feed stderr back to Claude
fi
exit 0
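
One gap worth knowing about: the hook above only sees the command text, so a bare terraform apply run while the prod workspace is selected would slip past the grep. Here is a minimal sketch of a workspace-aware check, assuming prod workspaces have "prod" somewhere in their name. The function names current_workspace and is_prod_apply are hypothetical and not part of my original hook. `terraform workspace show` and the TF_WORKSPACE variable are standard Terraform.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not from the original hook.

# Prints the workspace Terraform would use: TF_WORKSPACE if set,
# otherwise whatever `terraform workspace show` reports (empty on failure).
current_workspace() {
  if [ -n "${TF_WORKSPACE:-}" ]; then
    printf '%s\n' "$TF_WORKSPACE"
  else
    terraform workspace show 2>/dev/null || true
  fi
}

# Returns 0 (true) when $1 is a terraform apply, optionally with global
# flags such as -chdir before the subcommand, and either the command text
# or the workspace name in $2 mentions prod.
is_prod_apply() {
  local cmd="$1" workspace="$2"
  printf '%s' "$cmd" | grep -Eq 'terraform( +-[^ ]+)* +apply' || return 1
  printf '%s %s' "$cmd" "$workspace" | grep -Eiq 'prod'
}

# Example: the command text looks harmless, but the workspace is prod.
if is_prod_apply "terraform apply tfplan.binary" "prod"; then
  echo "would block"
fi
```

Inside the hook, the two greps could be swapped for `is_prod_apply "$cmd" "$(current_workspace)"`, keeping the same exit 2 path. The match is deliberately greedy: a workspace called "product-staging" would also be blocked, which is the direction I would rather fail in.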

The subagents

I run three. They are not for show. Each one has a narrow job and its own context, so the planner is not also trying to remember our incident runbook.

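  • terraform-planner runs validate, fmt, and plan, then hands me a human-readable summary of the diff. It is read-only by design: its prompt forbids apply, and because Bash is in its tool list, the permission rules and the PreToolUse hook are what actually enforce that.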
  • policy-reviewer checks the plan against our rules: tags present, no public S3, no 0.0.0.0/0 on anything that is not a load balancer. It is the nag I wish I had at 03:14.
  • incident-responder is the break-glass one. It pulls from the sentry MCP, correlates the latest deploy, and drafts a timeline. I only invoke it when something is already broken.
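
The tag rule is the easiest of those checks to make mechanical. Below is a minimal sketch of how it could be done against the JSON form of a plan, assuming jq is installed and that resources expose their tags under change.after.tags, as AWS resources generally do. The function name untagged_resources is hypothetical, and this is an illustration rather than what the policy-reviewer subagent literally runs.

```shell
#!/usr/bin/env bash
# Sketch only: assumes jq, and tags living under change.after.tags.

# Reads `terraform show -json` output on stdin and prints the address of
# every created or updated resource that is missing an owner or env tag.
untagged_resources() {
  jq -r '
    .resource_changes[]?
    | select(any(.change.actions[]; . == "create" or . == "update"))
    | select(((.change.after.tags // {}) | has("owner") and has("env")) | not)
    | .address
  '
}

# Example with a hand-written, heavily trimmed plan document.
# Prints: aws_s3_bucket.scratch
untagged_resources <<'JSON'
{"resource_changes": [
  {"address": "aws_db_instance.orders_replica",
   "change": {"actions": ["create"],
              "after": {"tags": {"owner": "payments", "env": "staging"}}}},
  {"address": "aws_s3_bucket.scratch",
   "change": {"actions": ["create"], "after": {"tags": {"owner": "payments"}}}}
]}
JSON
```

Against a real plan the call would be `terraform show -json tfplan.binary | untagged_resources`. Resource types that do not support tags at all would show up as false positives, so a real version would need an exemption list.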
.claude/agents/terraform-planner.md
---
name: terraform-planner
description: Produces a reviewable terraform plan diff. Read-only.
tools: Bash, Read, Grep
---
You generate plans, you never apply.

Steps:
1. terraform fmt -check and terraform validate.
2. terraform plan -no-color -out=tfplan.binary.
3. Summarize the diff in plain English: what is created,
   changed, destroyed. Call out anything destructive in BOLD.
4. If the plan touches a prod workspace, stop and tell the
   human to apply it. Do not attempt apply yourself.
claude-code - infra/PLAT-882
You
Add a read replica to the orders DB in staging. Plan it.
Agent
Ran fmt + validate, clean. Plan summary:
  + aws_db_instance.orders_replica (create)
  ~ aws_db_subnet_group.orders (in-place, add AZ)
No destroys. Replica is staging-only, tagged owner=payments env=staging. policy-reviewer: PASS. Want me to open the PR?
You
Yes, PR it. Do not apply.
Agent
PR #214 opened via github MCP. Stop hook estimate: this run cost $0.41, the new replica adds ~$58/mo. Plan written to tfplan.binary for your apply.
The terraform-planner handing back a diff. The destroy line is the only thing I actually read.
Why the Stop hook prints cost
Two costs, both real: what the agent session burned, and what the infra change will cost per month. The second one has stopped me from merging a db.r6g.4xlarge I did not need more than once.
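
For the monthly number, here is a minimal sketch of what a cost hook could look like. Everything specific in it is an assumption: Infracost as the estimator, a saved plan at ./tfplan.binary, and diffTotalMonthlyCost as the field in Infracost's JSON output. The function names are hypothetical. Session spend is left out because it depends on where you read it from.

```shell
#!/usr/bin/env bash
# Sketch of a Stop hook that reports the monthly cost change of a saved plan.
# Assumptions: Infracost is the estimator and the plan is in ./tfplan.binary.

# Turns a numeric delta such as "58.4" into a one-line message.
format_delta() {
  awk -v d="$1" 'BEGIN {
    if (d + 0 == 0) print "no monthly delta"
    else printf "monthly delta: %+.2f USD\n", d
  }'
}

# Prints the monthly cost difference for the saved plan, or nothing if the
# plan file or any of the required tools is missing.
estimate_monthly_delta() {
  [ -f tfplan.binary ] || return 0
  command -v terraform >/dev/null && command -v infracost >/dev/null \
    && command -v jq >/dev/null || return 0
  # The JSON plan can contain sensitive values, so it is removed afterwards.
  terraform show -json tfplan.binary > tfplan.json \
    || { rm -f tfplan.json; return 0; }
  # diffTotalMonthlyCost is the assumed field name in Infracost's JSON output.
  infracost diff --path tfplan.json --format json 2>/dev/null \
    | jq -r '.diffTotalMonthlyCost // empty'
  rm -f tfplan.json
}

delta=$(estimate_monthly_delta)
if [ -n "$delta" ]; then
  format_delta "$delta"
else
  echo "monthly delta: unavailable (no plan file or estimator)"
fi
```

The script never exits non-zero on a missing tool, since a cost report that fails should not turn into a hook error at the end of every session.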

A real session

This is roughly what a normal afternoon looks like. Notice the part where it refuses to apply to prod. That refusal is the whole product.

zsh - infra-platform
$ claude
Claude Code. 4 MCP servers connected: github, filesystem, cloudflare, sentry.
$> migrate the prod cache cluster to the new node type
terraform-planner: validate ok, generating plan...
Plan: 0 to add, 1 to change, 1 to destroy.
~ aws_elasticache_replication_group.cache (node type)
- aws_elasticache_cluster.cache_legacy (DESTROY)
Attempting: terraform apply (workspace: prod)
BLOCKED: prod apply requires a human. Plan written to tfplan.binary.
Stop hook: session $0.52. This change is in-place, no monthly delta.
$ terraform show tfplan.binary | less
I read every line of the destroy. Then I apply it. Me. Not the agent.
$

Where I cribbed the patterns

I did not invent the hook lifecycle. The official docs and a couple of community repos saved me a weekend. If you build something like this, start here.

  • disler/claude-code-hooks-mastery (github.com, 4.1k stars): Every hook lifecycle event with working examples, including exit-code 2 blocking, which is the trick my prod guard relies on.
  • Claude Code Documentation - Anthropic (code.claude.com docs): The canonical reference for CLAUDE.md memory, settings.json hooks, MCP servers, and subagent frontmatter. Read the hooks page twice.
  • Claude Code Hooks explained in 5 minutes (video, 5:42), IndyDevDan

What I would tell my 3am self

The thing that took me a while to accept: the agent is not the risk. The risk is the same as it has always been, an apply against prod that nobody reviewed. The agent just runs faster, so an unguarded one fails faster. Put the gate where the danger is. Plans are cheap, applies are forever, and a hook does not get tired at 3am the way I do.

If you want to start from my setup, it is on Setuproll. Drop the components into your repo, then go rewrite the rules with your own scar tissue. Install line: npx claude-code-devops-iac init. Then read the plan. Always read the plan.
