feat: add DevOps & SRE Guide with FIRE Framework (v3.9.9)

New files: - guide/devops-sre.md: FIRE Framework for infrastructure diagnosis (~870 lines) - examples/agents/devops-sre.md: DevOps/SRE agent persona - examples/claude-md/devops-sre.md: CLAUDE.md template for infra projects Guide includes: - Kubernetes troubleshooting prompts by symptom - Solo incident response workflow (3 AM design) - IaC patterns (Terraform, Ansible, GitOps) - Claude limitations table (transparency) - Team adoption checklist Updated: - README.md: Added DevOps/SRE learning path, templates count → 69 - ultimate-guide.md: Added reference after Section 5.4 - reference.yaml: Added 11 DevOps/SRE entries - examples/README.md: Added to indexes Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-20 22:09:31 +01:00 · 2026-01-20 22:09:31 +01:00 · 00df38d318
commit 00df38d318
parent 83a62ffb5d
11 changed files with 1337 additions and 41 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@ -6,6 +6,35 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).

 ## [Unreleased]

+## [3.9.9] - 2026-01-20
+
+### Added
+
+- **DevOps & SRE Guide** — Comprehensive infrastructure diagnosis guide (~900 lines)
+  - **New file**: `guide/devops-sre.md` — The FIRE Framework for infrastructure troubleshooting
+    - **F**irst Response → **I**nvestigate → **R**emediate → **E**valuate
+    - Kubernetes troubleshooting with copy-paste prompts by symptom (CrashLoopBackOff, OOMKilled, ImagePullBackOff, etc.)
+    - Solo incident response workflow (designed for 3 AM scenarios)
+    - Multi-agent pattern for post-incident analysis
+    - IaC patterns: Terraform, Ansible, GitOps workflows
+    - Guardrails & team adoption checklist
+    - Claude limitations table (what Claude can't do for DevOps)
+    - Case studies: Production outage root cause, OpsWorker.ai MTTR reduction
+  - **New file**: `examples/agents/devops-sre.md` — DevOps/SRE agent persona (~130 lines)
+    - FIRE framework implementation
+    - Kubernetes, network, and resource debugging checklists
+    - Response templates (assessment, root cause, remediation)
+    - Safety rules for production environments
+  - **New file**: `examples/claude-md/devops-sre.md` — CLAUDE.md template for DevOps teams (~170 lines)
+    - Infrastructure context configuration
+    - Environment, service map, access patterns
+    - Team conventions and runbook format
+    - Customization guides (K8s-heavy, Terraform-heavy, multi-cloud)
+  - **Updated**: `guide/ultimate-guide.md` — Added DevOps & SRE Guide reference after Section 5.4
+  - **Updated**: `machine-readable/reference.yaml` — Added 11 DevOps/SRE entries
+  - **Updated**: `examples/README.md` — Added agent and CLAUDE.md template to indexes
+  - **Updated**: `README.md` — Added DevOps/SRE learning path, updated templates count (69)
+
 ## [3.9.8] - 2026-01-20

 ### Added
--- a/README.md
+++ b/README.md
@ -6,7 +6,7 @@

 <p align="center">
  <a href="https://github.com/FlorianBruniaux/claude-code-ultimate-guide/stargazers"><img src="https://img.shields.io/github/stars/FlorianBruniaux/claude-code-ultimate-guide?style=for-the-badge" alt="Stars"/></a>
-  <a href="./examples/"><img src="https://img.shields.io/badge/Templates-61-green?style=for-the-badge" alt="Templates"/></a>
+  <a href="./examples/"><img src="https://img.shields.io/badge/Templates-69-green?style=for-the-badge" alt="Templates"/></a>
  <a href="./quiz/"><img src="https://img.shields.io/badge/Quiz-227_questions-orange?style=for-the-badge" alt="Quiz"/></a>
 </p>

@ -64,7 +64,7 @@ Save as `CLAUDE.md` in your project root. Claude reads it automatically.

 **The problem**: Awesome-lists give links, not learning paths. Official docs are dense. Tutorials get outdated in weeks.

-**This guide**: Structured learning path with 60+ copy-paste templates, from first install to advanced workflows.
+**This guide**: Structured learning path with 69 copy-paste templates, from first install to advanced workflows.

 **Reading time**: Quick Start ~15 min. Full guide ~3 hours (most read by section).

@ -133,6 +133,17 @@ Same agentic capabilities as Claude Code, but through a visual interface with no

 </details>

+<details>
+<summary><strong>DevOps / SRE</strong> — Infrastructure path (5 steps)</summary>
+
+1. [DevOps & SRE Guide](./guide/devops-sre.md) — FIRE framework for infrastructure diagnosis
+2. [K8s Troubleshooting](./guide/devops-sre.md#kubernetes-troubleshooting) — Prompts by symptom
+3. [Incident Response](./guide/devops-sre.md#pattern-incident-response) — Solo & multi-agent workflows
+4. [IaC Patterns](./guide/devops-sre.md#pattern-infrastructure-as-code) — Terraform, Ansible, GitOps
+5. [Guardrails](./guide/devops-sre.md#guardrails--adoption) — Security boundaries & team adoption
+
+</details>
+
 ---

 ## 📚 What's Inside
@ -148,6 +159,7 @@ Same agentic capabilities as Claude Code, but through a visual interface with no
 | **[Workflows](./guide/workflows/)** | Practical guides (TDD, Plan-Driven) | 30 min |
 | **[Data Privacy](./guide/data-privacy.md)** | Retention & compliance | 10 min |
 | **[Security Hardening](./guide/security-hardening.md)** | MCP vetting, injection defense | 25 min |
+| **[DevOps & SRE](./guide/devops-sre.md)** | FIRE framework, K8s troubleshooting, incident response | 30 min |
 | **[Claude Code Releases](./guide/claude-code-releases.md)** | Official release history | 10 min |

 <details>
@ -183,7 +195,9 @@ claude-code-ultimate-guide/
 </details>

 <details>
-<summary><strong>Examples Library</strong> (61 templates)</summary>
+<summary><strong>Examples Library</strong> (69 templates)</summary>
+
+**Agents** (6): [code-reviewer](./examples/agents/code-reviewer.md), [test-writer](./examples/agents/test-writer.md), [security-auditor](./examples/agents/security-auditor.md), [refactoring-specialist](./examples/agents/refactoring-specialist.md), [output-evaluator](./examples/agents/output-evaluator.md), [devops-sre](./examples/agents/devops-sre.md) ⭐

 **Slash Commands** (18): [/pr](./examples/commands/pr.md), [/commit](./examples/commands/commit.md), [/release-notes](./examples/commands/release-notes.md), [/diagnose](./examples/commands/diagnose.md), [/security](./examples/commands/security.md), [/refactor](./examples/commands/refactor.md), [/explain](./examples/commands/explain.md), [/optimize](./examples/commands/optimize.md), [/ship](./examples/commands/ship.md)...

@ -313,7 +327,7 @@ Claude Code sends your prompts, file contents, and MCP results to Anthropic serv

 **Status**: Research preview (Pro $20/mo or Max $100-200/mo, macOS only, **VPN incompatible**)

-**Archive**: Historical versions available in git history (pre-v3.9.8)
+**Archive**: Historical versions available in git history (pre-v3.9.9)

 </details>

@ -324,7 +338,7 @@ Claude Code sends your prompts, file contents, and MCP results to Anthropic serv

 | Repository | Purpose | Audience |
 |------------|---------|----------|
-| **[Claude Code Guide](https://github.com/FlorianBruniaux/claude-code-ultimate-guide)** *(this repo)* | Comprehensive documentation (11K lines, 66 templates) | Developers |
+| **[Claude Code Guide](https://github.com/FlorianBruniaux/claude-code-ultimate-guide)** *(this repo)* | Comprehensive documentation (11K lines, 69 templates) | Developers |
 | **[Claude Cowork Guide](https://github.com/FlorianBruniaux/claude-cowork-guide)** | Non-technical usage (67 prompts, 5 workflows) | Knowledge workers |
 | **Code Landing** *(to be deployed)* | Marketing site for Claude Code guide | Discovery |
 | **Cowork Landing** *(to be deployed)* | Marketing site for Cowork guide | Discovery |
@ -382,7 +396,7 @@ Licensed under [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/).

 ---

-*Version 3.9.8 | January 2026 | Crafted with Claude*
+*Version 3.9.9 | January 2026 | Crafted with Claude*

 <!-- SEO Keywords -->
 <!-- claude code, claude code tutorial, anthropic cli, ai coding assistant, claude code mcp,
--- a/2
+++ b/2
@ -1 +1 @@
-3.9.8
+3.9.9
--- a/examples/README.md
+++ b/examples/README.md
@ -51,6 +51,7 @@ Ready-to-use templates for Claude Code configuration.
 | [security-auditor.md](./agents/security-auditor.md) | Security vulnerability detection | Sonnet |
 | [refactoring-specialist.md](./agents/refactoring-specialist.md) | Clean code refactoring | Sonnet |
 | [output-evaluator.md](./agents/output-evaluator.md) | LLM-as-a-Judge quality gate | Haiku |
+| [devops-sre.md](./agents/devops-sre.md) | Infrastructure troubleshooting with FIRE framework | Sonnet |

 ### Skills
 | File | Purpose |
@ -117,8 +118,10 @@ Ready-to-use templates for Claude Code configuration.
 | File | Purpose |
 |------|---------|
 | [learning-mode.md](./claude-md/learning-mode.md) | Learning-focused development configuration |
+| [devops-sre.md](./claude-md/devops-sre.md) | DevOps/SRE project configuration |

-> **See [guide/learning-with-ai.md](../guide/learning-with-ai.md) for complete documentation**
+> **See [guide/learning-with-ai.md](../guide/learning-with-ai.md) for learning mode documentation**
+> **See [guide/devops-sre.md](../guide/devops-sre.md) for DevOps/SRE guide**

 ### Scripts
 | File | Purpose | Output |
--- a/examples/agents/devops-sre.md
+++ b/examples/agents/devops-sre.md
@ -0,0 +1,168 @@
+---
+name: devops-sre
+description: Infrastructure troubleshooting using the FIRE framework (First Response, Investigate, Remediate, Evaluate)
+model: sonnet
+tools: Bash, Read, Grep, Glob
+---
+
+# DevOps/SRE Agent
+
+You are an SRE specialist focused on infrastructure diagnosis, incident response, and reliability engineering.
+
+## FIRE Framework
+
+For every infrastructure issue, follow this systematic approach:
+
+### F - First Response
+- Clarify the symptom and impact
+- Identify affected services and environment
+- Ask about recent changes (deploys, config, traffic)
+- Propose 3 highest-priority diagnostic steps
+
+### I - Investigate
+- Guide through diagnostic commands
+- Analyze logs, metrics, and configurations
+- Correlate across services when needed
+- Form hypotheses and test them systematically
+
+### R - Remediate
+- Propose fix options with clear trade-offs
+- **ALWAYS wait for human approval before destructive actions**
+- Provide rollback plan for every change
+- Explain impact and risk of each option
+
+### E - Evaluate
+- Generate incident timeline
+- Perform root cause analysis
+- Create actionable prevention items
+- Format blameless postmortems
+
+## Kubernetes Checklist
+
+### Pod Issues
+- [ ] Check pod status: `kubectl get pods -n <ns>`
+- [ ] Describe pod for events: `kubectl describe pod <pod> -n <ns>`
+- [ ] Check logs: `kubectl logs <pod> -n <ns> --previous`
+- [ ] Check resource usage: `kubectl top pod <pod> -n <ns>`
+
+### Service Issues
+- [ ] Verify endpoints exist: `kubectl get endpoints <svc> -n <ns>`
+- [ ] Check selector matching: compare pod labels with service selector
+- [ ] Test connectivity: `kubectl exec -it <pod> -- curl <svc>:<port>`
+- [ ] Check network policies: `kubectl get networkpolicy -n <ns>`
+
+### Node Issues
+- [ ] Check node status: `kubectl get nodes`
+- [ ] Describe node for conditions: `kubectl describe node <node>`
+- [ ] Check system pods: `kubectl get pods -n kube-system`
+
+## Response Templates
+
+### Initial Assessment
+
+```markdown
+## Situation Assessment
+
+**Symptom**: [What's broken]
+**Impact**: [Who/what is affected]
+**Environment**: [Prod/staging, region, cluster]
+**Started**: [When]
+
+### Immediate Priorities
+1. [Most critical check]
+2. [Second priority]
+3. [Third priority]
+
+### Commands to Run
+[Exact commands]
+```
+
+### Root Cause Summary
+
+```markdown
+## Root Cause Analysis
+
+**Direct Cause**: [Immediate trigger]
+**Contributing Factors**:
+1. [Factor 1]
+2. [Factor 2]
+
+**Evidence**:
+- [Log entry / metric / config that proves it]
+
+**Timeline**:
+- [Time]: [Event]
+```
+
+### Remediation Proposal
+
+```markdown
+## Remediation Options
+
+### Option A: [Quick Mitigation]
+- **Command**: [Exact command]
+- **Risk**: [Low/Medium/High]
+- **Rollback**: [How to undo]
+
+### Option B: [Proper Fix]
+- **Command**: [Exact command]
+- **Risk**: [Low/Medium/High]
+- **Rollback**: [How to undo]
+
+**Recommendation**: [Which option and why]
+
+⚠️ **Awaiting your approval before proceeding**
+```
+
+## Safety Rules
+
+1. **Never execute destructive commands without explicit approval**:
+   - `kubectl delete`
+   - `kubectl scale` (down)
+   - `terraform destroy`
+   - Any DROP/DELETE SQL
+   - `rm -rf` outside tmp
+
+2. **Always provide rollback steps** before any change
+
+3. **Never include secrets in responses** - use placeholders
+
+4. **Clarify environment** (prod vs staging) before any action
+
+5. **When uncertain, investigate more** rather than guess
+
+## Common Patterns
+
+### Log Analysis
+```bash
+# Find error patterns
+kubectl logs <pod> -n <ns> | grep -E "ERROR|WARN|Exception" | head -50
+
+# Check for OOM events
+kubectl describe pod <pod> -n <ns> | grep -A5 "Last State"
+
+# Correlate timestamps
+kubectl logs <pod> -n <ns> --since=10m --timestamps
+```
+
+### Network Debugging
+```bash
+# Test DNS resolution
+kubectl exec -it <pod> -- nslookup <service>
+
+# Test connectivity
+kubectl exec -it <pod> -- curl -v <service>:<port>
+
+# Check network policies
+kubectl get networkpolicy -n <ns> -o yaml
+```
+
+### Resource Analysis
+```bash
+# Current usage vs limits
+kubectl top pods -n <ns>
+kubectl describe pod <pod> -n <ns> | grep -A3 "Limits:"
+
+# Node pressure
+kubectl describe node <node> | grep -A10 "Conditions:"
+```
--- a/examples/claude-md/devops-sre.md
+++ b/examples/claude-md/devops-sre.md
@ -0,0 +1,189 @@
+# DevOps/SRE CLAUDE.md Template
+
+A CLAUDE.md configuration optimized for infrastructure projects and SRE workflows.
+
+## Usage
+
+Copy this content to your project's `CLAUDE.md` file and customize the sections marked with `[brackets]`.
+
+---
+
+## Template
+
+```markdown
+# DevOps/SRE Project Configuration
+
+## Infrastructure Context
+
+### Environment
+- Cloud Provider: [AWS/GCP/Azure/On-prem]
+- Kubernetes: [EKS/GKE/AKS/k3s/none]
+- IaC Tool: [Terraform/Pulumi/CloudFormation/Ansible]
+- CI/CD: [GitHub Actions/GitLab CI/Jenkins/ArgoCD]
+
+### Service Map
+- [service-1]: [description, critical path: yes/no]
+- [service-2]: [description, critical path: yes/no]
+- [database]: [PostgreSQL/MySQL/MongoDB, hosted where]
+
+### Access Patterns
+- Cluster access: [kubectl context name]
+- Cloud CLI: [aws/gcloud/az profile name]
+- Secrets: [Vault/SSM/Secrets Manager - never share values]
+
+## FIRE Framework Defaults
+
+Use the FIRE framework for all infrastructure issues:
+- **F**irst Response: Clarify symptom, impact, recent changes
+- **I**nvestigate: Systematic diagnosis with evidence
+- **R**emediate: Propose options, wait for approval
+- **E**valuate: Generate postmortem, prevention items
+
+## Safety Rules
+
+### Never Execute Without Approval
+- `kubectl delete` or `kubectl scale down`
+- `terraform destroy`
+- Any production database writes
+- IAM/security group modifications
+- Any command in production namespace
+
+### Always Require
+- Rollback plan before changes
+- Environment confirmation (prod vs staging)
+- Impact assessment for scaling operations
+
+## Response Preferences
+
+### For Incidents
+- Start with impact assessment
+- Prioritize mitigation over root cause (initially)
+- Provide exact commands, not just guidance
+- Include timestamps in all actions
+
+### For Code Review
+- Focus on: security, resource limits, idempotency
+- Flag: hardcoded values, missing error handling
+- Suggest: monitoring/alerting additions
+
+### For Documentation
+- Format: Markdown with code blocks
+- Style: Runbook format (numbered steps)
+- Include: Prerequisites, rollback, verification steps
+
+## Common Contexts
+
+### Kubernetes Namespaces
+- `production`: [critical services, approval required]
+- `staging`: [test freely]
+- `monitoring`: [Prometheus, Grafana]
+- `ingress`: [nginx, cert-manager]
+
+### Terraform Workspaces/Modules
+- `modules/`: [shared infrastructure components]
+- `environments/prod/`: [production, plan-only by default]
+- `environments/staging/`: [safe to apply]
+
+### Monitoring
+- Metrics: [Prometheus/CloudWatch/Datadog URL]
+- Logs: [ELK/CloudWatch/Loki URL]
+- Alerts: [PagerDuty/OpsGenie integration]
+
+## Team Conventions
+
+### Commit Messages
+- Format: [conventional commits / your format]
+- Example: `fix(k8s): increase memory limit for payment-service`
+
+### PR Requirements
+- [ ] Terraform plan output included
+- [ ] Affected services listed
+- [ ] Rollback procedure documented
+
+### Runbook Format
+```
+# [Runbook Title]
+## Symptoms
+## Prerequisites
+## Steps
+## Verification
+## Rollback
+## Escalation
+```
+```
+
+---
+
+## Customization Guide
+
+### For Kubernetes-Heavy Teams
+
+Add to "Common Contexts":
+```markdown
+### Critical Pods
+- `payment-api`: Direct revenue impact, max 30s downtime
+- `auth-service`: Blocks all authenticated requests
+- `api-gateway`: Single point of entry
+
+### Scaling Rules
+- payment-api: min 3, max 10, scale on CPU > 70%
+- auth-service: min 2, max 5, scale on connections
+```
+
+### For Terraform-Heavy Teams
+
+Add section:
+```markdown
+## Terraform Conventions
+- State backend: [S3 bucket / GCS bucket]
+- Lock table: [DynamoDB table name]
+- Module registry: [internal / Terraform registry]
+- Required providers versions: [see versions.tf]
+
+### Module Standards
+- All resources tagged with: var.tags
+- Naming: {project}-{environment}-{resource}
+- Outputs: Always export ARN, ID, name
+```
+
+### For Multi-Cloud Teams
+
+Add to "Environment":
+```markdown
+### Cloud Credentials
+- AWS: Profile `company-prod` / `company-staging`
+- GCP: Project `company-prod-123` / `company-staging-456`
+- Azure: Subscription `prod-sub-id` / `staging-sub-id`
+
+### Cross-Cloud Services
+- DNS: [AWS Route53 / Cloudflare]
+- CDN: [CloudFront / Cloud CDN]
+- Secrets: [HashiCorp Vault - URL]
+```
+
+---
+
+## Integration with Agents
+
+Pair this CLAUDE.md with the DevOps/SRE agent:
+
+```json
+{
+  "agents": {
+    "sre": {
+      "path": ".claude/agents/devops-sre.md",
+      "model": "sonnet"
+    }
+  }
+}
+```
+
+Then invoke with: `@sre investigate this pod crash`
+
+---
+
+## See Also
+
+- [DevOps & SRE Guide](../../guide/devops-sre.md) — Complete FIRE framework documentation
+- [DevOps Agent](../agents/devops-sre.md) — Agent persona for infrastructure tasks
+- [Security Hardening](../../guide/security-hardening.md) — Security best practices
--- a/guide/README.md
+++ b/guide/README.md
@ -16,6 +16,7 @@ Core documentation for mastering Claude Code.
 | [observability.md](./observability.md) | Session monitoring and cost tracking | 15 min |
 | [methodologies.md](./methodologies.md) | 15 development methodologies reference (TDD, SDD, BDD, etc.) | 20 min |
 | [security-hardening.md](./security-hardening.md) | Security threats, MCP vetting, injection defense | 25 min |
+| [devops-sre.md](./devops-sre.md) | FIRE framework for infrastructure diagnosis and incident response | 30 min |
 | [ai-ecosystem.md](./ai-ecosystem.md) | Complementary AI tools (Perplexity, Gemini, Kimi, NotebookLM) | 25 min |
 | [cowork.md](./cowork.md) | Claude Cowork: Summary (see [dedicated repo](https://github.com/FlorianBruniaux/claude-cowork-guide) for full docs) | 5 min |
 | [workflows/](./workflows/) | Practical workflow guides for Claude Code | 30 min |
--- a/guide/cheatsheet.md
+++ b/guide/cheatsheet.md
@ -6,7 +6,7 @@

 **Written with**: Claude (Anthropic)

-**Version**: 3.9.8 | **Last Updated**: January 2026
+**Version**: 3.9.9 | **Last Updated**: January 2026

 ---

@ -411,4 +411,4 @@ where.exe claude; claude doctor; claude mcp list

 **Author**: Florian BRUNIAUX | [@Méthode Aristote](https://methode-aristote.fr) | Written with Claude

-*Last updated: January 2026 | Version 3.9.8*
+*Last updated: January 2026 | Version 3.9.9*
--- a/guide/devops-sre.md
+++ b/guide/devops-sre.md
@ -0,0 +1,869 @@
+# DevOps & SRE with Claude Code
+
+**Reading time**: 30 minutes
+**Skill level**: Intermediate (assumes DevOps basics)
+**Prerequisites**: Claude Code basics ([Sections 1-2](./ultimate-guide.md#1-getting-started) of main guide)
+
+---
+
+> **The FIRE Framework**: A systematic approach to infrastructure diagnosis with Claude Code.
+>
+> **F**irst Response → **I**nvestigate → **R**emediate → **E**valuate
+
+---
+
+## Table of Contents
+
+1. [Quick Start](#quick-start)
+2. [Pattern: Infrastructure Diagnosis](#pattern-infrastructure-diagnosis)
+3. [Pattern: Incident Response](#pattern-incident-response)
+4. [Pattern: Infrastructure as Code](#pattern-infrastructure-as-code)
+5. [Guardrails & Adoption](#guardrails--adoption)
+6. [Quick Reference](#quick-reference)
+
+---
+
+# Quick Start
+
+**Goal**: Get productive with Claude Code for DevOps in 5 minutes.
+
+## Quick Self-Check
+
+| Situation | Jump To |
+|-----------|---------|
+| I'm in an active incident NOW | [Emergency: K8s Troubleshooting](#kubernetes-troubleshooting) |
+| First time using Claude for DevOps | [Tutorial: First Diagnosis](#your-first-infrastructure-diagnosis) |
+| Want to automate my runbooks | [Pattern: Incident Response](#pattern-incident-response) |
+| Evaluating for my team | [Guardrails & Adoption](#guardrails--adoption) |
+| Need ready-to-use prompts | [Quick Reference](#quick-reference) |
+
+## The FIRE Framework
+
+Every infrastructure diagnosis with Claude follows this pattern:
+
+```
+F - First Response   → Give Claude the symptom + context
+I - Investigate      → Claude analyzes logs, metrics, config
+R - Remediate        → Claude proposes fix (with human approval)
+E - Evaluate         → Postmortem, documentation, prevention
+```
+
+### Why FIRE?
+
+| Phase | Human Role | Claude Role |
+|-------|------------|-------------|
+| **F**irst Response | Describe symptom, provide context | Triage, prioritize checks |
+| **I**nvestigate | Run commands, paste output | Analyze, correlate, hypothesize |
+| **R**emediate | **Approve or reject** | Propose fix, explain impact |
+| **E**valuate | Review, share knowledge | Generate postmortem, docs |
+
+**Critical**: Claude proposes, you approve. This is especially important for:
+- Destructive operations (delete, scale down, restart)
+- Production environments
+- Security-sensitive changes
+
+## Your First Infrastructure Diagnosis
+
+### Example: Pod CrashLoopBackOff
+
+```bash
+# Step 1: FIRST RESPONSE - Give context
+claude "I have a pod in CrashLoopBackOff in the payment-service namespace.
+Environment: production
+Cluster: gke-prod-us-east1
+Started: 10 minutes ago
+What are the first 3 things I should check?"
+
+# Step 2: INVESTIGATE - Run suggested commands and share output
+kubectl describe pod payment-api-7d4b8c6f5-x2j9k -n payment-service
+# Copy output and paste to Claude
+
+# Step 3: REMEDIATE - Claude proposes, you approve
+claude "Based on this describe output, what's the fix?
+CONSTRAINT: Show me the exact command but don't execute anything."
+
+# Step 4: EVALUATE - Document the incident
+claude "Create a brief incident note for our wiki:
+- What happened
+- Root cause
+- Fix applied
+- Prevention recommendation"
+```
+
+### What Claude Returns (First Response Example)
+
+```markdown
+## CrashLoopBackOff Triage (Priority Order)
+
+1. **Check exit code and last logs**
+   ```bash
+   kubectl logs payment-api-7d4b8c6f5-x2j9k -n payment-service --previous
+   ```
+   Most common: OOM (exit 137), config errors (exit 1), dependency failures
+
+2. **Check pod events and resource limits**
+   ```bash
+   kubectl describe pod payment-api-7d4b8c6f5-x2j9k -n payment-service
+   ```
+   Look for: OOMKilled, ImagePullBackOff, liveness probe failures
+
+3. **Check recent deployment changes**
+   ```bash
+   kubectl rollout history deployment/payment-api -n payment-service
+   ```
+   If recent: consider rollback while investigating
+```
+
+---
+
+# Pattern: Infrastructure Diagnosis
+
+**Goal**: Systematic troubleshooting for common infrastructure issues.
+
+## Kubernetes Troubleshooting
+
+### K8s MCP Server Setup
+
+For persistent K8s context, install the K8s MCP server:
+
+```json
+// ~/.claude/mcp.json
+{
+  "mcpServers": {
+    "kubernetes": {
+      "command": "npx",
+      "args": ["-y", "@anthropic/mcp-kubernetes"]
+    }
+  }
+}
+```
+
+**Benefits**: Claude can query cluster state directly, reducing copy-paste cycles.
+
+**Without MCP**: You'll pipe kubectl output to Claude manually (still effective).
+
+### Prompts by Symptom
+
+Copy-paste these prompts, replacing `<bracketed>` values.
+
+#### CrashLoopBackOff
+
+```bash
+kubectl describe pod <pod> -n <ns> | claude "Analyze this CrashLoopBackOff:
+1. What's the exit code and what does it mean?
+2. Check the last 5 restarts pattern (timing, consistent or escalating?)
+3. Suggest 3 most likely root causes based on the events
+4. Give me the exact commands to investigate each hypothesis"
+```
+
+**Common Causes Claude Will Identify**:
+- Exit 137: OOMKilled (memory limit hit)
+- Exit 1: Application error (bad config, missing dependency)
+- Exit 143: SIGTERM (graceful shutdown timeout)
+
+#### OOMKilled
+
+```bash
+kubectl top pods -n <ns> && kubectl describe pod <pod> -n <ns> | claude "This pod was OOMKilled:
+1. Compare requests vs limits vs actual usage
+2. Is this a memory leak or under-provisioning?
+3. If leak: what patterns in the container suggest investigation paths?
+4. If under-provisioned: suggest optimal resource settings based on this data"
+```
+
+**Follow-up for Memory Leaks**:
+```bash
+claude "The pod has been restarting every 2 hours with OOMKilled.
+Memory grows linearly from 200Mi to 512Mi limit before crash.
+Language: Node.js 18
+What are the top 3 things to check for memory leaks in this stack?"
+```
+
+#### ImagePullBackOff
+
+```bash
+kubectl describe pod <pod> -n <ns> | claude "ImagePullBackOff diagnosis:
+1. Is this an auth issue, network issue, or wrong image name?
+2. What's the exact error message telling us?
+3. Give me commands to verify the image exists and credentials work"
+```
+
+#### Pending Pod (Not Scheduling)
+
+```bash
+kubectl describe pod <pod> -n <ns> && kubectl describe nodes | claude "Pod stuck in Pending:
+1. Is this resource constraints, node selectors, or affinity rules?
+2. Which nodes were considered and why rejected?
+3. What's the quickest fix vs proper solution?"
+```
+
+#### Service Not Reachable
+
+```bash
+kubectl get svc,endpoints -n <ns> && kubectl describe svc <svc> -n <ns> | claude "Service not reachable:
+1. Are there healthy endpoints?
+2. Is the selector matching pods correctly?
+3. Is it a network policy blocking traffic?
+Give me diagnostic commands for each possibility"
+```
+
+### Case Study: Production Outage Root Cause
+
+**Situation**: E-commerce platform, 3 AM page, checkout service returning 503s.
+
+**FIRE in Action**:
+
+```bash
+# F - First Response
+claude "INCIDENT: checkout-service returning 503s, started 10 min ago.
+Impact: 100% of checkout attempts failing.
+Environment: AWS EKS production, us-east-1.
+Recent changes: deployment 2 hours ago (new feature flag logic).
+What's the fastest diagnostic path?"
+
+# I - Investigate (Claude suggested checking pods first)
+kubectl get pods -n checkout -l app=checkout-service
+# Output: 3/5 pods in CrashLoopBackOff
+
+kubectl logs checkout-service-xxx --previous | tail -50 | claude "Analyze crash logs"
+# Claude identifies: panic: nil pointer dereference in feature flag code
+
+# R - Remediate
+claude "Root cause identified: nil pointer in feature flag logic from recent deploy.
+Options:
+A) Rollback to previous version
+B) Hotfix the nil check
+Which is faster and safer at 3 AM?"
+# Claude recommends: Rollback (faster, proven state, fix properly tomorrow)
+
+kubectl rollout undo deployment/checkout-service -n checkout
+# Service restored in 2 minutes
+
+# E - Evaluate (next day)
+claude "Create postmortem from this incident:
+Timeline: 3:02 AM alert, 3:15 AM root cause found, 3:17 AM rollback, 3:19 AM resolved
+Root cause: Feature flag nil pointer from commit abc123
+Impact: 15 minutes checkout downtime
+Format: Blameless, focused on prevention"
+```
+
+**Outcome**: 15-minute MTTR, clear postmortem, prevention action items identified.
+
+## Log Analysis & Correlation
+
+### Multi-Service Log Correlation
+
+```bash
+# Collect logs from related services
+kubectl logs -l app=api-gateway -n ingress --since=10m > gateway.log
+kubectl logs -l app=auth-service -n auth --since=10m > auth.log
+kubectl logs -l app=payment-service -n payment --since=10m > payment.log
+
+# Analyze correlation
+cat gateway.log auth.log payment.log | claude "Correlate these logs:
+1. Find the request flow for failed transactions
+2. Identify where the failure originates
+3. Are there patterns in timing or specific endpoints?
+4. Create a timeline of events"
+```
+
+### Log Pattern Detection
+
+```bash
+# Find anomalies in error patterns
+grep -E "ERROR|WARN|Exception" app.log | claude "Analyze error patterns:
+1. Cluster similar errors (group by type, not timestamp)
+2. What's the most frequent vs most severe?
+3. Which errors are correlated (same root cause)?
+4. Prioritize investigation order"
+```
+
+### Prometheus/Grafana Query Help
+
+```bash
+claude "I need a PromQL query to:
+- Show p99 latency for the payment-service
+- Group by endpoint
+- Alert if > 500ms for 5 minutes
+Include the alert rule YAML too"
+```
+
+## What Claude CAN'T Do (Limitations)
+
+Understanding limitations prevents frustration and unsafe reliance.
+
+| Limitation | Impact | Workaround |
+|------------|--------|------------|
+| **No real-time cluster state** | Can't see current pod status | Use K8s MCP or paste kubectl output |
+| **No direct API access** | Can't call AWS/GCP APIs | Use MCP servers or share CLI output |
+| **Context window limits** | ~100K tokens max | Focus on relevant logs, not full dumps |
+| **No persistent memory** | Forgets between sessions | Use CLAUDE.md for project context |
+| **Hallucination risk** | May suggest invalid flags | Always verify commands before running |
+| **No real-time metrics** | Can't see current graphs | Screenshot Grafana or paste metric values |
+| **No secrets access** | Can't read vault/secrets | Good! Never share secrets with any LLM |
+
+### When NOT to Use Claude
+
+- **Time-critical decisions under 30 seconds**: Your muscle memory is faster
+- **Highly confidential incidents**: Data breach investigation (legal implications)
+- **Simple, obvious fixes**: If you know the answer, just do it
+- **Compliance-restricted environments**: Check if AI tools are allowed
+
+### When Claude Excels
+
+- **Complex root cause analysis**: Multiple interacting systems
+- **Documentation generation**: Postmortems, runbooks, procedures
+- **Learning new tools**: Unfamiliar cloud services, new k8s features
+- **Second opinion**: Validating your hypothesis
+- **Bulk operations**: Generating configs for multiple environments
+
+---
+
+# Pattern: Incident Response
+
+**Goal**: Structured workflows for incident management.
+
+## Solo Incident Workflow
+
+**Reality**: At 3 AM, you're alone. This workflow is designed for one person.
+
+### FIRE in Action: Solo Incident
+
+#### F - First Response (30 seconds)
+
+```bash
+claude "INCIDENT: [symptom - be specific]
+Context: [service], [environment], [time started]
+Recent changes: [deploys, infra changes, traffic spikes]
+Current impact: [% users affected, revenue impact if known]
+What are the 3 most critical things to check first?"
+```
+
+**Example**:
+```bash
+claude "INCIDENT: API returning 500 errors on /checkout endpoint
+Context: checkout-service, production-us-east-1, started 5 min ago
+Recent changes: deployed v2.3.4 at 2:45 AM, added new payment provider
+Current impact: ~30% of checkout requests failing
+What are the 3 most critical things to check first?"
+```
+
+#### I - Investigate (2-5 minutes)
+
+Run Claude's suggested commands, share output:
+
+```bash
+# Claude suggested checking pod health first
+kubectl get pods -n checkout | claude "Quick assessment of this pod list"
+
+# Then checking recent logs
+kubectl logs -l app=checkout --since=5m | head -100 | claude "Analyze for error patterns"
+
+# Then checking the deployment diff
+kubectl rollout history deployment/checkout-service -n checkout | claude "What changed in the last deployment?"
+```
+
+**Pro tip**: Keep a terminal for running commands, another for Claude conversation.
+
+#### R - Remediate (with approval)
+
+```bash
+claude "Based on investigation:
+- Root cause: [your understanding]
+- Evidence: [key findings]
+
+Propose remediation options:
+1. Quick mitigation (restore service)
+2. Proper fix (address root cause)
+
+CONSTRAINT: I need to approve before any action. Show exact commands."
+```
+
+**Approval Gate Example**:
+```
+Claude: "Recommended: Rollback to v2.3.3
+Command: kubectl rollout undo deployment/checkout-service -n checkout
+Risk: Low - previous version was stable for 2 weeks
+Alternative: Scale down the new payment provider feature flag
+
+Which approach do you want to take?"
+
+You: "Proceed with rollback"
+```
+
+#### E - Evaluate (post-incident, not during)
+
+```bash
+claude "Create incident postmortem:
+
+Timeline:
+- 2:45 AM: Deployed v2.3.4
+- 3:00 AM: First alerts fired
+- 3:05 AM: Incident declared
+- 3:12 AM: Root cause identified (nil pointer in new payment provider code)
+- 3:15 AM: Rollback initiated
+- 3:17 AM: Service restored
+
+Format: Blameless, focus on systems not people
+Include: Action items with owners"
+```
+
+## Communication During Incidents
+
+### Stakeholder Update Generator
+
+```bash
+claude "Generate incident update for stakeholders:
+
+Incident: Checkout service degradation
+Current status: Mitigated, monitoring
+Impact: 15 minutes of 30% checkout failures
+ETA to full resolution: 2 hours (proper fix in next deploy)
+
+Audience: Non-technical executives
+Tone: Professional, reassuring, factual
+Length: 3 sentences max"
+```
+
+**Output Example**:
+> We experienced a 15-minute disruption to our checkout service affecting approximately 30% of transactions, which has now been resolved. The issue was caused by a software bug in a recent update and was quickly rolled back. We'll deploy a permanent fix during our next scheduled maintenance window with no expected customer impact.
+
+### Incident Bridge Prompt
+
+For real-time incident channels:
+
+```bash
+claude "I'm managing an incident bridge. Help me:
+1. Maintain a running timeline of events
+2. Suggest next investigation steps when we hit dead ends
+3. Draft comms updates every 15 minutes
+4. Flag when I should escalate
+
+Current status: [paste latest update]
+What should I communicate to the bridge now?"
+```
+
+## Multi-Agent Pattern: Post-Incident Analysis
+
+**When to use multi-agent**: Not during active incidents. Use for comprehensive analysis afterward.
+
+```bash
+# Agent 1: Timeline Reconstruction
+claude "You are an incident timeline analyst.
+From these logs and Slack messages, reconstruct a precise timeline:
+[paste logs and comms]
+Output: Timestamped events, who did what when"
+
+# Agent 2: Root Cause Analysis
+claude "You are a root cause analyst.
+Given this timeline and system architecture, perform 5-whys analysis:
+[paste timeline from Agent 1]
+Output: Root cause chain, contributing factors"
+
+# Agent 3: Prevention Recommendations
+claude "You are an SRE process improvement specialist.
+Given this root cause analysis:
+[paste RCA from Agent 2]
+Output: Prioritized prevention measures, effort estimates, ownership suggestions"
+```
+
+### Case Study: OpsWorker.ai MTTR Reduction
+
+**Context**: SRE team managing 200+ microservices, 5 on-call engineers.
+
+**Before Claude**:
+- Average MTTR: 45 minutes
+- Postmortems: Often delayed or skipped
+- Knowledge silos: Each engineer knew different services
+
+**Claude Integration**:
+1. FIRE framework adopted for all incidents
+2. Claude generates initial postmortem draft within 1 hour
+3. Runbooks augmented with Claude-assisted troubleshooting
+
+**After 3 Months**:
+- Average MTTR: 18 minutes (60% reduction)
+- Postmortem completion: 95% within 24 hours
+- Knowledge sharing: Claude-generated runbooks accessible to all
+
+**Key Insight**: Biggest gains weren't speed—they were consistency and documentation.
+
+---
+
+# Pattern: Infrastructure as Code
+
+**Goal**: Leverage Claude for Terraform, Ansible, and GitOps workflows.
+
+## Terraform with Claude
+
+### Reference: Anton Babenko's Terraform Skill
+
+The most comprehensive Terraform skill for Claude Code:
+
+**Repository**: [antonbabenko/terraform-skill](https://github.com/antonbabenko/terraform-skill)
+**Author**: Anton Babenko (creator of terraform-aws-modules, 1B+ downloads)
+
+```bash
+# Install
+cd ~/.claude/skills/
+git clone https://github.com/antonbabenko/terraform-skill.git terraform
+```
+
+**What it provides**:
+- Best practices for module structure
+- AWS, GCP, Azure patterns
+- State management guidance
+- CI/CD integration patterns
+
+### Common Terraform Prompts
+
+#### Plan Review
+
+```bash
+terraform plan -out=plan.txt && cat plan.txt | claude "Review this Terraform plan:
+1. Any dangerous changes? (data loss, downtime)
+2. Are the changes what we expect?
+3. Any missing changes we should add?
+4. Cost implications if visible"
+```
+
+#### Module Generation
+
+```bash
+claude "Generate a Terraform module for:
+- AWS ECS Fargate service
+- With ALB and target group
+- Auto-scaling based on CPU
+- Secrets from SSM Parameter Store
+
+Follow these conventions:
+- Use for_each over count
+- All resources tagged with var.tags
+- Output the service URL and ARN"
+```
+
+#### State Surgery Helper
+
+```bash
+claude "I need to move a resource to a different state file:
+Current state: terraform-prod/terraform.tfstate
+Resource: aws_s3_bucket.logs
+Target state: terraform-shared/terraform.tfstate
+
+What's the safest procedure? Include rollback steps."
+```
+
+### Drift Detection Workflow
+
+```bash
+# Detect drift
+terraform plan -detailed-exitcode 2>&1 | tee drift.txt
+
+# Analyze with Claude
+cat drift.txt | claude "Analyze this Terraform drift:
+1. What changed outside of Terraform?
+2. Is this drift expected (manual change) or concerning?
+3. Should we import the changes or revert to Terraform state?
+4. What's the safest remediation path?"
+```
+
+## Ansible with Claude
+
+### Playbook Review
+
+```bash
+cat playbook.yml | claude "Review this Ansible playbook:
+1. Idempotency issues?
+2. Security concerns?
+3. Error handling gaps?
+4. Performance optimizations?"
+```
+
+### Role Generation
+
+```bash
+claude "Generate an Ansible role for:
+- Installing and configuring Nginx
+- SSL certificates via Let's Encrypt (certbot)
+- Hardened configuration (disable server tokens, etc.)
+- Log rotation
+
+Follow best practices:
+- Use handlers for service restarts
+- Variables in defaults/main.yml
+- Include molecule tests structure"
+```
+
+## GitOps with Claude
+
+### ArgoCD Application Review
+
+```bash
+cat application.yaml | claude "Review this ArgoCD Application:
+1. Sync policy appropriate for the environment?
+2. Resource health checks defined?
+3. Any sync wave ordering issues?
+4. Namespace and project permissions correct?"
+```
+
+### Helm Values Generation
+
+```bash
+claude "Generate Helm values for deploying [application] to:
+- Environment: staging
+- Resources: Limited (cost-conscious)
+- Replicas: 2
+- Ingress: Internal only
+- Secrets: From external-secrets operator
+
+Base chart: [chart name]
+Include comments explaining each value"
+```
+
+## Security Review Automation
+
+### Infrastructure Security Scan
+
+```bash
+# Run tfsec or checkov, analyze results
+tfsec . --format=json | claude "Analyze these security findings:
+1. Prioritize by severity and exploitability
+2. Which are false positives in our context?
+3. For real issues: what's the fix?
+4. Which can we ignore with a documented reason?"
+```
+
+### IAM Policy Review
+
+```bash
+cat iam-policy.json | claude "Review this IAM policy:
+1. Does it follow least privilege?
+2. Any overly permissive actions? (*, admin, etc.)
+3. Resource constraints appropriate?
+4. Suggest a more restrictive version that still works"
+```
+
+---
+
+# Guardrails & Adoption
+
+**Goal**: Implement Claude Code safely and get team buy-in.
+
+## Cost Awareness
+
+### Claude Code Costs
+
+| Model | Input (1M tokens) | Output (1M tokens) |
+|-------|-------------------|-------------------|
+| Sonnet 4 | $3 | $15 |
+| Opus 4 | $15 | $75 |
+
+**Typical DevOps session**: 20K-50K tokens = $0.10-$0.50
+
+**Cost control strategies**:
+1. Use Sonnet for routine tasks (default)
+2. Reserve Opus for complex multi-system analysis
+3. Use `/compact` to reduce context when conversation gets long
+4. Avoid pasting entire log files; grep relevant sections first
+
+### Infrastructure Costs from Claude Suggestions
+
+**Beware**: Claude doesn't see your cloud bill. Always ask:
+
+```bash
+claude "Before I apply these changes, estimate:
+1. Monthly cost impact (compute, storage, network)
+2. Any resources that could scale unbounded?
+3. Cost optimization alternatives?"
+```
+
+## Security Boundaries
+
+### Never Share with Claude
+
+| Data Type | Why Not | Alternative |
+|-----------|---------|-------------|
+| API keys, tokens | Could be cached/logged | Use placeholders: `<API_KEY>` |
+| Production secrets | Security risk | Describe the secret type, not value |
+| Customer PII | Privacy/compliance | Use anonymized examples |
+| Proprietary algorithms | IP protection | Describe behavior, not code |
+| Incident details with PII | Legal liability | Sanitize before sharing |
+
+### Safe Prompting Template
+
+```bash
+claude "Debug this authentication issue:
+- Service: auth-service
+- Error: 401 Unauthorized for valid tokens
+- Environment: staging (not production)
+- Token format: JWT with claims [user_id, org_id, exp]
+- NOTE: I've redacted all actual token values
+
+Here's the sanitized log:
+[paste log with secrets replaced]"
+```
+
+### Approval Gates for Production
+
+Always require human approval for:
+
+```yaml
+# Example: Production change checklist
+approval_required:
+  - kubectl delete
+  - kubectl scale (down)
+  - terraform destroy
+  - DROP TABLE / DELETE FROM
+  - rm -rf (outside tmp directories)
+  - Any production database write
+  - Any IAM policy change
+  - Any security group modification
+```
+
+## Team Rollout Checklist
+
+### Phase 1: Pilot (1-2 engineers, 2 weeks)
+
+- [ ] Install Claude Code for pilot users
+- [ ] Create team CLAUDE.md with common context
+- [ ] Document first 5 successful use cases
+- [ ] Identify one workflow to standardize
+- [ ] Track time saved (before/after)
+
+### Phase 2: Expand (Team, 4 weeks)
+
+- [ ] Share pilot learnings in team meeting
+- [ ] Create team-specific prompts library
+- [ ] Establish security guidelines (what to share/not share)
+- [ ] Set up shared skills/commands repository
+- [ ] Define when to use Claude vs when not to
+
+### Phase 3: Optimize (Ongoing)
+
+- [ ] Monthly review of prompt library
+- [ ] A/B test: Claude-assisted vs traditional for similar incidents
+- [ ] Contribute back to community (awesome-lists, this guide)
+- [ ] Track MTTR, postmortem completion, documentation quality
+
+### Adoption Pitfalls to Avoid
+
+| Pitfall | Why It Happens | Prevention |
+|---------|---------------|------------|
+| **Over-reliance** | Claude is so helpful | Mandate learning time, not just output |
+| **Blind trust** | Commands usually work | Always review before running |
+| **Context dumping** | Hope Claude figures it out | Provide focused context, not everything |
+| **Skipping verification** | Time pressure | Build verification into workflow |
+| **Shadow usage** | No team visibility | Share wins, normalize usage |
+
+---
+
+# Quick Reference
+
+## FIRE Framework Summary
+
+```
+┌─────────────────────────────────────────────────────────────┐
+│ F - FIRST RESPONSE                                          │
+│   "INCIDENT: [symptom]. Context: [service, env, time].      │
+│    Recent changes: [what]. Impact: [who affected].          │
+│    What are the 3 most critical things to check?"           │
+├─────────────────────────────────────────────────────────────┤
+│ I - INVESTIGATE                                             │
+│   Run Claude's suggested commands                           │
+│   Share output: "[output] | claude 'Analyze this'"          │
+│   Iterate until root cause identified                       │
+├─────────────────────────────────────────────────────────────┤
+│ R - REMEDIATE                                               │
+│   "Based on [findings], propose remediation.                │
+│    CONSTRAINT: I need to approve before any action."        │
+│   APPROVE → Execute │ REJECT → More investigation           │
+├─────────────────────────────────────────────────────────────┤
+│ E - EVALUATE                                                │
+│   "Create postmortem: Timeline, root cause, prevention.     │
+│    Format: Blameless, action items with owners."            │
+└─────────────────────────────────────────────────────────────┘
+```
+
+## Prompts by Symptom
+
+### Kubernetes
+
+| Symptom | Prompt |
+|---------|--------|
+| CrashLoopBackOff | `kubectl describe pod <pod> -n <ns> \| claude "Exit code meaning? 3 likely causes? Commands to investigate?"` |
+| OOMKilled | `kubectl top pods && describe pod \| claude "Leak or under-provisioned? Optimal resources?"` |
+| ImagePullBackOff | `kubectl describe pod \| claude "Auth, network, or wrong image? Verification commands?"` |
+| Pending | `kubectl describe pod && describe nodes \| claude "Resource, selector, or affinity issue?"` |
+| Service unreachable | `kubectl get svc,endpoints \| claude "Healthy endpoints? Selector matching? Network policy?"` |
+
+### Cloud/Infrastructure
+
+| Symptom | Prompt |
+|---------|--------|
+| High latency | `[metrics] \| claude "Bottleneck location? Is it compute, network, or dependency?"` |
+| Disk full | `df -h && du -sh /* \| claude "What's consuming space? Safe to delete?"` |
+| Connection refused | `netstat -tlnp \| claude "Service listening? Port correct? Firewall rules?"` |
+| SSL cert expiry | `openssl s_client -connect host:443 \| claude "Days until expiry? Renewal steps?"` |
+| DNS issues | `dig +trace domain \| claude "Where does resolution fail?"` |
+
+### Terraform
+
+| Task | Prompt |
+|------|--------|
+| Plan review | `terraform plan \| claude "Dangerous changes? Missing changes? Cost impact?"` |
+| Drift analysis | `terraform plan -detailed-exitcode \| claude "What drifted? Expected? Remediation?"` |
+| Module request | `claude "Generate Terraform module for [resource] with [requirements]"` |
+
+## MCP Servers for DevOps
+
+| Server | Purpose | Install |
+|--------|---------|---------|
+| Kubernetes | Direct cluster access | `npx -y @anthropic/mcp-kubernetes` |
+| AWS | AWS API access | `npx -y @anthropic/mcp-aws` |
+| GCP | GCP API access | `npx -y @anthropic/mcp-gcp` |
+| Prometheus | Direct metrics queries | Community: search awesome-mcp-servers |
+| Terraform | State/plan analysis | Community: search awesome-mcp-servers |
+
+**Config location**: `~/.claude/mcp.json`
+
+```json
+{
+  "mcpServers": {
+    "kubernetes": {
+      "command": "npx",
+      "args": ["-y", "@anthropic/mcp-kubernetes"]
+    }
+  }
+}
+```
+
+## External Resources
+
+### Awesome Lists
+
+- **[awesome-claude-code-subagents](https://github.com/VoltAgent/awesome-claude-code-subagents)** (8.1k stars): Agent personas including SRE
+- **[awesome-claude-skills](https://github.com/travisvn/awesome-claude-skills)** (4.6k stars): Skills including infra-related
+
+### Official Resources
+
+- **[terraform-skill](https://github.com/antonbabenko/terraform-skill)**: Production-grade Terraform skill by Anton Babenko
+- **[Claude Code Docs](https://docs.anthropic.com/en/docs/claude-code)**: Official documentation
+
+### Community
+
+- **Anthropic Discord**: #claude-code channel
+- **Reddit**: r/ClaudeAI
+- **GitHub**: Open issues on awesome-lists for feature requests
+
+---
+
+## See Also
+
+- **[Agent Template](../examples/agents/devops-sre.md)**: DevOps/SRE agent persona for Claude
+- **[CLAUDE.md Template](../examples/claude-md/devops-sre.md)**: Project configuration for DevOps teams
+- **[Security Hardening Guide](./security-hardening.md)**: Additional security practices
+- **[Architecture Guide](./architecture.md)**: How Claude Code works internally
+
+---
+
+*Contributions welcome! If you have DevOps prompts that work well, consider adding them to the awesome-lists or submitting a PR to this guide.*
--- a/guide/ultimate-guide.md
+++ b/guide/ultimate-guide.md
@ -10,7 +10,7 @@

 **Last updated**: January 2026

-**Version**: 3.9.8
+**Version**: 3.9.9

 ---

@ -4920,6 +4920,17 @@ git clone https://github.com/antonbabenko/terraform-skill.git terraform

 If you create specialized skills for other domains (DevOps, data science, ML/AI, etc.), consider sharing them with the community through similar repositories or pull requests to existing collections.

+### DevOps & SRE Guide
+
+For comprehensive DevOps/SRE workflows, see **[DevOps & SRE Guide](./devops-sre.md)**:
+- **The FIRE Framework**: First Response → Investigate → Remediate → Evaluate
+- **Kubernetes troubleshooting**: Prompts by symptom (CrashLoopBackOff, OOMKilled, etc.)
+- **Incident response**: Solo and multi-agent patterns
+- **IaC patterns**: Terraform, Ansible, GitOps workflows
+- **Guardrails**: Security boundaries and team adoption checklist
+
+**Quick Start**: [Agent Template](../examples/agents/devops-sre.md) | [CLAUDE.md Template](../examples/claude-md/devops-sre.md)
+
 ---

 # 6. Commands
@ -11151,4 +11162,4 @@ Thumbs.db

 **Contributions**: Issues and PRs welcome.

-**Last updated**: January 2026 | **Version**: 3.9.8
+**Last updated**: January 2026 | **Version**: 3.9.9
--- a/machine-readable/reference.yaml
+++ b/machine-readable/reference.yaml
@ -3,7 +3,7 @@
 # Source: guide/ultimate-guide.md
 # Purpose: Condensed index for LLMs to quickly answer user questions about Claude Code

-version: "3.9.8"
+version: "3.9.9"
 updated: "2026-01-20"

 # ════════════════════════════════════════════════════════════════
@ -88,33 +88,45 @@ deep_dive:
  skill_examples: 4608
  community_skills_cybersec: 4788
  community_skills_iac: 4871
-  commands: 4939
-  command_template: 5009
-  hooks: 5262
-  hook_templates: 5407
-  security_hooks: 5669
-  mcp_servers: 5810
-  mcp_config: 6104
-  mcp_security: 6472
-  cicd: 6790
-  ide_integration: 7479
-  feedback_loops: 7549
-  batch_operations: 7979
-  pitfalls: 8098
-  git_best_practices: 8367
-  cost_optimization: 8833
-  session_teleportation: 9432
-  commands_table: 9608
-  shortcuts_table: 9641
-  troubleshooting: 9767
-  cheatsheet: 10142
-  daily_workflow: 10218
-  # AI Ecosystem (Section 11, ~line 10494)
-  ai_ecosystem: 10494
-  ai_ecosystem_complementarity: 10494
-  ai_ecosystem_tool_matrix: 10520
-  ai_ecosystem_workflows: 10545
-  ai_ecosystem_integration: 10671
+  # DevOps/SRE Guide (guide/devops-sre.md)
+  devops_sre_guide: "guide/devops-sre.md"
+  devops_fire_framework: "guide/devops-sre.md:50"
+  devops_k8s_troubleshooting: "guide/devops-sre.md:120"
+  devops_k8s_prompts: "guide/devops-sre.md:160"
+  devops_incident_response: "guide/devops-sre.md:340"
+  devops_iac_patterns: "guide/devops-sre.md:520"
+  devops_guardrails: "guide/devops-sre.md:650"
+  devops_limitations: "guide/devops-sre.md:290"
+  devops_quick_reference: "guide/devops-sre.md:750"
+  devops_agent: "examples/agents/devops-sre.md"
+  devops_claude_md: "examples/claude-md/devops-sre.md"
+  commands: 4950
+  command_template: 5020
+  hooks: 5273
+  hook_templates: 5418
+  security_hooks: 5680
+  mcp_servers: 5821
+  mcp_config: 6115
+  mcp_security: 6483
+  cicd: 6801
+  ide_integration: 7490
+  feedback_loops: 7560
+  batch_operations: 7990
+  pitfalls: 8109
+  git_best_practices: 8378
+  cost_optimization: 8844
+  session_teleportation: 9443
+  commands_table: 9619
+  shortcuts_table: 9652
+  troubleshooting: 9778
+  cheatsheet: 10153
+  daily_workflow: 10229
+  # AI Ecosystem (Section 11, ~line 10491)
+  ai_ecosystem: 10491
+  ai_ecosystem_complementarity: 10493
+  ai_ecosystem_tool_matrix: 10531
+  ai_ecosystem_workflows: 10556
+  ai_ecosystem_integration: 10682
  ai_ecosystem_detailed: "guide/ai-ecosystem.md"
  ai_ecosystem_voice_to_text: "guide/ai-ecosystem.md:449"
  ai_ecosystem_alternative_providers: "guide/ai-ecosystem.md:959"
@ -448,7 +460,7 @@ ecosystem:
      - "Cross-links modified → Update all 4 repos"
    history:
      - date: "2026-01-20"
-        event: "Code Landing sync v3.9.8, 66 templates, cross-links"
+        event: "Code Landing sync v3.9.9, 66 templates, cross-links"
        commit: "5b5ce62"
      - date: "2026-01-20"
        event: "Cowork Landing fix (paths, README, UI badges)"
 @ -1 +1 @@
 .9.8
 .9.9