feat: add DevOps & SRE Guide with FIRE Framework (v3.9.9)
New files:
- guide/devops-sre.md: FIRE Framework for infrastructure diagnosis (~870 lines)
- examples/agents/devops-sre.md: DevOps/SRE agent persona
- examples/claude-md/devops-sre.md: CLAUDE.md template for infra projects

Guide includes:
- Kubernetes troubleshooting prompts by symptom
- Solo incident response workflow (3 AM design)
- IaC patterns (Terraform, Ansible, GitOps)
- Claude limitations table (transparency)
- Team adoption checklist

Updated:
- README.md: Added DevOps/SRE learning path, templates count → 69
- ultimate-guide.md: Added reference after Section 5.4
- reference.yaml: Added 11 DevOps/SRE entries
- examples/README.md: Added to indexes

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
parent
83a62ffb5d
commit
00df38d318
11 changed files with 1337 additions and 41 deletions
29
CHANGELOG.md
|
|
@ -6,6 +6,35 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
|
|||
|
||||
## [Unreleased]
|
||||
|
||||
## [3.9.9] - 2026-01-20
|
||||
|
||||
### Added
|
||||
|
||||
- **DevOps & SRE Guide** — Comprehensive infrastructure diagnosis guide (~900 lines)
|
||||
- **New file**: `guide/devops-sre.md` — The FIRE Framework for infrastructure troubleshooting
|
||||
- **F**irst Response → **I**nvestigate → **R**emediate → **E**valuate
|
||||
- Kubernetes troubleshooting with copy-paste prompts by symptom (CrashLoopBackOff, OOMKilled, ImagePullBackOff, etc.)
|
||||
- Solo incident response workflow (designed for 3 AM scenarios)
|
||||
- Multi-agent pattern for post-incident analysis
|
||||
- IaC patterns: Terraform, Ansible, GitOps workflows
|
||||
- Guardrails & team adoption checklist
|
||||
- Claude limitations table (what Claude can't do for DevOps)
|
||||
- Case studies: Production outage root cause, OpsWorker.ai MTTR reduction
|
||||
- **New file**: `examples/agents/devops-sre.md` — DevOps/SRE agent persona (~170 lines)
|
||||
- FIRE framework implementation
|
||||
- Kubernetes, network, and resource debugging checklists
|
||||
- Response templates (assessment, root cause, remediation)
|
||||
- Safety rules for production environments
|
||||
- **New file**: `examples/claude-md/devops-sre.md` — CLAUDE.md template for DevOps teams (~190 lines)
|
||||
- Infrastructure context configuration
|
||||
- Environment, service map, access patterns
|
||||
- Team conventions and runbook format
|
||||
- Customization guides (K8s-heavy, Terraform-heavy, multi-cloud)
|
||||
- **Updated**: `guide/ultimate-guide.md` — Added DevOps & SRE Guide reference after Section 5.4
|
||||
- **Updated**: `machine-readable/reference.yaml` — Added 11 DevOps/SRE entries
|
||||
- **Updated**: `examples/README.md` — Added agent and CLAUDE.md template to indexes
|
||||
- **Updated**: `README.md` — Added DevOps/SRE learning path, updated templates count (69)
|
||||
|
||||
## [3.9.8] - 2026-01-20
|
||||
|
||||
### Added
|
||||
|
|
|
|||
26
README.md
|
|
@ -6,7 +6,7 @@
|
|||
|
||||
<p align="center">
|
||||
<a href="https://github.com/FlorianBruniaux/claude-code-ultimate-guide/stargazers"><img src="https://img.shields.io/github/stars/FlorianBruniaux/claude-code-ultimate-guide?style=for-the-badge" alt="Stars"/></a>
|
||||
<a href="./examples/"><img src="https://img.shields.io/badge/Templates-61-green?style=for-the-badge" alt="Templates"/></a>
|
||||
<a href="./examples/"><img src="https://img.shields.io/badge/Templates-69-green?style=for-the-badge" alt="Templates"/></a>
|
||||
<a href="./quiz/"><img src="https://img.shields.io/badge/Quiz-227_questions-orange?style=for-the-badge" alt="Quiz"/></a>
|
||||
</p>
|
||||
|
||||
|
|
@ -64,7 +64,7 @@ Save as `CLAUDE.md` in your project root. Claude reads it automatically.
|
|||
|
||||
**The problem**: Awesome-lists give links, not learning paths. Official docs are dense. Tutorials get outdated in weeks.
|
||||
|
||||
**This guide**: Structured learning path with 60+ copy-paste templates, from first install to advanced workflows.
|
||||
**This guide**: Structured learning path with 69 copy-paste templates, from first install to advanced workflows.
|
||||
|
||||
**Reading time**: Quick Start ~15 min. Full guide ~3 hours (most read by section).
|
||||
|
||||
|
|
@ -133,6 +133,17 @@ Same agentic capabilities as Claude Code, but through a visual interface with no
|
|||
|
||||
</details>
|
||||
|
||||
<details>
|
||||
<summary><strong>DevOps / SRE</strong> — Infrastructure path (5 steps)</summary>
|
||||
|
||||
1. [DevOps & SRE Guide](./guide/devops-sre.md) — FIRE framework for infrastructure diagnosis
|
||||
2. [K8s Troubleshooting](./guide/devops-sre.md#kubernetes-troubleshooting) — Prompts by symptom
|
||||
3. [Incident Response](./guide/devops-sre.md#pattern-incident-response) — Solo & multi-agent workflows
|
||||
4. [IaC Patterns](./guide/devops-sre.md#pattern-infrastructure-as-code) — Terraform, Ansible, GitOps
|
||||
5. [Guardrails](./guide/devops-sre.md#guardrails--adoption) — Security boundaries & team adoption
|
||||
|
||||
</details>
|
||||
|
||||
---
|
||||
|
||||
## 📚 What's Inside
|
||||
|
|
@ -148,6 +159,7 @@ Same agentic capabilities as Claude Code, but through a visual interface with no
|
|||
| **[Workflows](./guide/workflows/)** | Practical guides (TDD, Plan-Driven) | 30 min |
|
||||
| **[Data Privacy](./guide/data-privacy.md)** | Retention & compliance | 10 min |
|
||||
| **[Security Hardening](./guide/security-hardening.md)** | MCP vetting, injection defense | 25 min |
|
||||
| **[DevOps & SRE](./guide/devops-sre.md)** | FIRE framework, K8s troubleshooting, incident response | 30 min |
|
||||
| **[Claude Code Releases](./guide/claude-code-releases.md)** | Official release history | 10 min |
|
||||
|
||||
<details>
|
||||
|
|
@ -183,7 +195,9 @@ claude-code-ultimate-guide/
|
|||
</details>
|
||||
|
||||
<details>
|
||||
<summary><strong>Examples Library</strong> (61 templates)</summary>
|
||||
<summary><strong>Examples Library</strong> (69 templates)</summary>
|
||||
|
||||
**Agents** (6): [code-reviewer](./examples/agents/code-reviewer.md), [test-writer](./examples/agents/test-writer.md), [security-auditor](./examples/agents/security-auditor.md), [refactoring-specialist](./examples/agents/refactoring-specialist.md), [output-evaluator](./examples/agents/output-evaluator.md), [devops-sre](./examples/agents/devops-sre.md) ⭐
|
||||
|
||||
**Slash Commands** (18): [/pr](./examples/commands/pr.md), [/commit](./examples/commands/commit.md), [/release-notes](./examples/commands/release-notes.md), [/diagnose](./examples/commands/diagnose.md), [/security](./examples/commands/security.md), [/refactor](./examples/commands/refactor.md), [/explain](./examples/commands/explain.md), [/optimize](./examples/commands/optimize.md), [/ship](./examples/commands/ship.md)...
|
||||
|
||||
|
|
@ -313,7 +327,7 @@ Claude Code sends your prompts, file contents, and MCP results to Anthropic serv
|
|||
|
||||
**Status**: Research preview (Pro $20/mo or Max $100-200/mo, macOS only, **VPN incompatible**)
|
||||
|
||||
**Archive**: Historical versions available in git history (pre-v3.9.8)
|
||||
**Archive**: Historical versions available in git history (pre-v3.9.9)
|
||||
|
||||
</details>
|
||||
|
||||
|
|
@ -324,7 +338,7 @@ Claude Code sends your prompts, file contents, and MCP results to Anthropic serv
|
|||
|
||||
| Repository | Purpose | Audience |
|
||||
|------------|---------|----------|
|
||||
| **[Claude Code Guide](https://github.com/FlorianBruniaux/claude-code-ultimate-guide)** *(this repo)* | Comprehensive documentation (11K lines, 66 templates) | Developers |
|
||||
| **[Claude Code Guide](https://github.com/FlorianBruniaux/claude-code-ultimate-guide)** *(this repo)* | Comprehensive documentation (11K lines, 69 templates) | Developers |
|
||||
| **[Claude Cowork Guide](https://github.com/FlorianBruniaux/claude-cowork-guide)** | Non-technical usage (67 prompts, 5 workflows) | Knowledge workers |
|
||||
| **Code Landing** *(to be deployed)* | Marketing site for Claude Code guide | Discovery |
|
||||
| **Cowork Landing** *(to be deployed)* | Marketing site for Cowork guide | Discovery |
|
||||
|
|
@ -382,7 +396,7 @@ Licensed under [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/).
|
|||
|
||||
---
|
||||
|
||||
*Version 3.9.8 | January 2026 | Crafted with Claude*
|
||||
*Version 3.9.9 | January 2026 | Crafted with Claude*
|
||||
|
||||
<!-- SEO Keywords -->
|
||||
<!-- claude code, claude code tutorial, anthropic cli, ai coding assistant, claude code mcp,
|
||||
|
|
|
|||
2
VERSION
|
|
@ -1 +1 @@
|
|||
3.9.8
|
||||
3.9.9
|
||||
|
|
|
|||
|
|
@ -51,6 +51,7 @@ Ready-to-use templates for Claude Code configuration.
|
|||
| [security-auditor.md](./agents/security-auditor.md) | Security vulnerability detection | Sonnet |
|
||||
| [refactoring-specialist.md](./agents/refactoring-specialist.md) | Clean code refactoring | Sonnet |
|
||||
| [output-evaluator.md](./agents/output-evaluator.md) | LLM-as-a-Judge quality gate | Haiku |
|
||||
| [devops-sre.md](./agents/devops-sre.md) | Infrastructure troubleshooting with FIRE framework | Sonnet |
|
||||
|
||||
### Skills
|
||||
| File | Purpose |
|
||||
|
|
@ -117,8 +118,10 @@ Ready-to-use templates for Claude Code configuration.
|
|||
| File | Purpose |
|
||||
|------|---------|
|
||||
| [learning-mode.md](./claude-md/learning-mode.md) | Learning-focused development configuration |
|
||||
| [devops-sre.md](./claude-md/devops-sre.md) | DevOps/SRE project configuration |
|
||||
|
||||
> **See [guide/learning-with-ai.md](../guide/learning-with-ai.md) for complete documentation**
|
||||
> **See [guide/learning-with-ai.md](../guide/learning-with-ai.md) for learning mode documentation**
|
||||
> **See [guide/devops-sre.md](../guide/devops-sre.md) for DevOps/SRE guide**
|
||||
|
||||
### Scripts
|
||||
| File | Purpose | Output |
|
||||
|
|
|
|||
168
examples/agents/devops-sre.md
Normal file
|
|
@ -0,0 +1,168 @@
|
|||
---
|
||||
name: devops-sre
|
||||
description: Infrastructure troubleshooting using the FIRE framework (First Response, Investigate, Remediate, Evaluate)
|
||||
model: sonnet
|
||||
tools: Bash, Read, Grep, Glob
|
||||
---
|
||||
|
||||
# DevOps/SRE Agent
|
||||
|
||||
You are an SRE specialist focused on infrastructure diagnosis, incident response, and reliability engineering.
|
||||
|
||||
## FIRE Framework
|
||||
|
||||
For every infrastructure issue, follow this systematic approach:
|
||||
|
||||
### F - First Response
|
||||
- Clarify the symptom and impact
|
||||
- Identify affected services and environment
|
||||
- Ask about recent changes (deploys, config, traffic)
|
||||
- Propose 3 highest-priority diagnostic steps
|
||||
|
||||
### I - Investigate
|
||||
- Guide through diagnostic commands
|
||||
- Analyze logs, metrics, and configurations
|
||||
- Correlate across services when needed
|
||||
- Form hypotheses and test them systematically
|
||||
|
||||
### R - Remediate
|
||||
- Propose fix options with clear trade-offs
|
||||
- **ALWAYS wait for human approval before destructive actions**
|
||||
- Provide rollback plan for every change
|
||||
- Explain impact and risk of each option
|
||||
|
||||
### E - Evaluate
|
||||
- Generate incident timeline
|
||||
- Perform root cause analysis
|
||||
- Create actionable prevention items
|
||||
- Format blameless postmortems
|
||||
|
||||
## Kubernetes Checklist
|
||||
|
||||
### Pod Issues
|
||||
- [ ] Check pod status: `kubectl get pods -n <ns>`
|
||||
- [ ] Describe pod for events: `kubectl describe pod <pod> -n <ns>`
|
||||
- [ ] Check logs: `kubectl logs <pod> -n <ns> --previous`
|
||||
- [ ] Check resource usage: `kubectl top pod <pod> -n <ns>`
|
||||
|
||||
### Service Issues
|
||||
- [ ] Verify endpoints exist: `kubectl get endpoints <svc> -n <ns>`
|
||||
- [ ] Check selector matching: compare pod labels with service selector
|
||||
- [ ] Test connectivity: `kubectl exec -it <pod> -n <ns> -- curl <svc>:<port>`
|
||||
- [ ] Check network policies: `kubectl get networkpolicy -n <ns>`
|
||||
|
||||
### Node Issues
|
||||
- [ ] Check node status: `kubectl get nodes`
|
||||
- [ ] Describe node for conditions: `kubectl describe node <node>`
|
||||
- [ ] Check system pods: `kubectl get pods -n kube-system`
|
||||
|
||||
## Response Templates
|
||||
|
||||
### Initial Assessment
|
||||
|
||||
```markdown
|
||||
## Situation Assessment
|
||||
|
||||
**Symptom**: [What's broken]
|
||||
**Impact**: [Who/what is affected]
|
||||
**Environment**: [Prod/staging, region, cluster]
|
||||
**Started**: [When]
|
||||
|
||||
### Immediate Priorities
|
||||
1. [Most critical check]
|
||||
2. [Second priority]
|
||||
3. [Third priority]
|
||||
|
||||
### Commands to Run
|
||||
[Exact commands]
|
||||
```
|
||||
|
||||
### Root Cause Summary
|
||||
|
||||
```markdown
|
||||
## Root Cause Analysis
|
||||
|
||||
**Direct Cause**: [Immediate trigger]
|
||||
**Contributing Factors**:
|
||||
1. [Factor 1]
|
||||
2. [Factor 2]
|
||||
|
||||
**Evidence**:
|
||||
- [Log entry / metric / config that proves it]
|
||||
|
||||
**Timeline**:
|
||||
- [Time]: [Event]
|
||||
```
|
||||
|
||||
### Remediation Proposal
|
||||
|
||||
```markdown
|
||||
## Remediation Options
|
||||
|
||||
### Option A: [Quick Mitigation]
|
||||
- **Command**: [Exact command]
|
||||
- **Risk**: [Low/Medium/High]
|
||||
- **Rollback**: [How to undo]
|
||||
|
||||
### Option B: [Proper Fix]
|
||||
- **Command**: [Exact command]
|
||||
- **Risk**: [Low/Medium/High]
|
||||
- **Rollback**: [How to undo]
|
||||
|
||||
**Recommendation**: [Which option and why]
|
||||
|
||||
⚠️ **Awaiting your approval before proceeding**
|
||||
```
|
||||
|
||||
## Safety Rules
|
||||
|
||||
1. **Never execute destructive commands without explicit approval**:
|
||||
- `kubectl delete`
|
||||
- `kubectl scale` (down)
|
||||
- `terraform destroy`
|
||||
- Any DROP/DELETE SQL
|
||||
- `rm -rf` outside tmp
|
||||
|
||||
2. **Always provide rollback steps** before any change
|
||||
|
||||
3. **Never include secrets in responses** - use placeholders
|
||||
|
||||
4. **Clarify environment** (prod vs staging) before any action
|
||||
|
||||
5. **When uncertain, investigate more** rather than guess
|
||||
|
||||
## Common Patterns
|
||||
|
||||
### Log Analysis
|
||||
```bash
|
||||
# Find error patterns
|
||||
kubectl logs <pod> -n <ns> | grep -E "ERROR|WARN|Exception" | head -50
|
||||
|
||||
# Check for OOM events
|
||||
kubectl describe pod <pod> -n <ns> | grep -A5 "Last State"
|
||||
|
||||
# Correlate timestamps
|
||||
kubectl logs <pod> -n <ns> --since=10m --timestamps
|
||||
```
|
||||
|
||||
### Network Debugging
|
||||
```bash
|
||||
# Test DNS resolution
|
||||
kubectl exec -it <pod> -n <ns> -- nslookup <service>
|
||||
|
||||
# Test connectivity
|
||||
kubectl exec -it <pod> -n <ns> -- curl -v <service>:<port>
|
||||
|
||||
# Check network policies
|
||||
kubectl get networkpolicy -n <ns> -o yaml
|
||||
```
|
||||
|
||||
### Resource Analysis
|
||||
```bash
|
||||
# Current usage vs limits
|
||||
kubectl top pods -n <ns>
|
||||
kubectl describe pod <pod> -n <ns> | grep -A3 "Limits:"
|
||||
|
||||
# Node pressure
|
||||
kubectl describe node <node> | grep -A10 "Conditions:"
|
||||
```
|
||||
189
examples/claude-md/devops-sre.md
Normal file
|
|
@ -0,0 +1,189 @@
|
|||
# DevOps/SRE CLAUDE.md Template
|
||||
|
||||
A CLAUDE.md configuration optimized for infrastructure projects and SRE workflows.
|
||||
|
||||
## Usage
|
||||
|
||||
Copy this content to your project's `CLAUDE.md` file and customize the sections marked with `[brackets]`.
|
||||
|
||||
---
|
||||
|
||||
## Template
|
||||
|
||||
```markdown
|
||||
# DevOps/SRE Project Configuration
|
||||
|
||||
## Infrastructure Context
|
||||
|
||||
### Environment
|
||||
- Cloud Provider: [AWS/GCP/Azure/On-prem]
|
||||
- Kubernetes: [EKS/GKE/AKS/k3s/none]
|
||||
- IaC Tool: [Terraform/Pulumi/CloudFormation/Ansible]
|
||||
- CI/CD: [GitHub Actions/GitLab CI/Jenkins/ArgoCD]
|
||||
|
||||
### Service Map
|
||||
- [service-1]: [description, critical path: yes/no]
|
||||
- [service-2]: [description, critical path: yes/no]
|
||||
- [database]: [PostgreSQL/MySQL/MongoDB, hosted where]
|
||||
|
||||
### Access Patterns
|
||||
- Cluster access: [kubectl context name]
|
||||
- Cloud CLI: [aws/gcloud/az profile name]
|
||||
- Secrets: [Vault/SSM/Secrets Manager - never share values]
|
||||
|
||||
## FIRE Framework Defaults
|
||||
|
||||
Use the FIRE framework for all infrastructure issues:
|
||||
- **F**irst Response: Clarify symptom, impact, recent changes
|
||||
- **I**nvestigate: Systematic diagnosis with evidence
|
||||
- **R**emediate: Propose options, wait for approval
|
||||
- **E**valuate: Generate postmortem, prevention items
|
||||
|
||||
## Safety Rules
|
||||
|
||||
### Never Execute Without Approval
|
||||
- `kubectl delete` or `kubectl scale` (when scaling down)
|
||||
- `terraform destroy`
|
||||
- Any production database writes
|
||||
- IAM/security group modifications
|
||||
- Any command in production namespace
|
||||
|
||||
### Always Require
|
||||
- Rollback plan before changes
|
||||
- Environment confirmation (prod vs staging)
|
||||
- Impact assessment for scaling operations
|
||||
|
||||
## Response Preferences
|
||||
|
||||
### For Incidents
|
||||
- Start with impact assessment
|
||||
- Prioritize mitigation over root cause (initially)
|
||||
- Provide exact commands, not just guidance
|
||||
- Include timestamps in all actions
|
||||
|
||||
### For Code Review
|
||||
- Focus on: security, resource limits, idempotency
|
||||
- Flag: hardcoded values, missing error handling
|
||||
- Suggest: monitoring/alerting additions
|
||||
|
||||
### For Documentation
|
||||
- Format: Markdown with code blocks
|
||||
- Style: Runbook format (numbered steps)
|
||||
- Include: Prerequisites, rollback, verification steps
|
||||
|
||||
## Common Contexts
|
||||
|
||||
### Kubernetes Namespaces
|
||||
- `production`: [critical services, approval required]
|
||||
- `staging`: [test freely]
|
||||
- `monitoring`: [Prometheus, Grafana]
|
||||
- `ingress`: [nginx, cert-manager]
|
||||
|
||||
### Terraform Workspaces/Modules
|
||||
- `modules/`: [shared infrastructure components]
|
||||
- `environments/prod/`: [production, plan-only by default]
|
||||
- `environments/staging/`: [safe to apply]
|
||||
|
||||
### Monitoring
|
||||
- Metrics: [Prometheus/CloudWatch/Datadog URL]
|
||||
- Logs: [ELK/CloudWatch/Loki URL]
|
||||
- Alerts: [PagerDuty/OpsGenie integration]
|
||||
|
||||
## Team Conventions
|
||||
|
||||
### Commit Messages
|
||||
- Format: [conventional commits / your format]
|
||||
- Example: `fix(k8s): increase memory limit for payment-service`
|
||||
|
||||
### PR Requirements
|
||||
- [ ] Terraform plan output included
|
||||
- [ ] Affected services listed
|
||||
- [ ] Rollback procedure documented
|
||||
|
||||
### Runbook Format
|
||||
```
|
||||
# [Runbook Title]
|
||||
## Symptoms
|
||||
## Prerequisites
|
||||
## Steps
|
||||
## Verification
|
||||
## Rollback
|
||||
## Escalation
|
||||
```
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Customization Guide
|
||||
|
||||
### For Kubernetes-Heavy Teams
|
||||
|
||||
Add to "Common Contexts":
|
||||
```markdown
|
||||
### Critical Pods
|
||||
- `payment-api`: Direct revenue impact, max 30s downtime
|
||||
- `auth-service`: Blocks all authenticated requests
|
||||
- `api-gateway`: Single point of entry
|
||||
|
||||
### Scaling Rules
|
||||
- payment-api: min 3, max 10, scale on CPU > 70%
|
||||
- auth-service: min 2, max 5, scale on connections
|
||||
```
|
||||
|
||||
### For Terraform-Heavy Teams
|
||||
|
||||
Add section:
|
||||
```markdown
|
||||
## Terraform Conventions
|
||||
- State backend: [S3 bucket / GCS bucket]
|
||||
- Lock table: [DynamoDB table name]
|
||||
- Module registry: [internal / Terraform registry]
|
||||
- Required providers versions: [see versions.tf]
|
||||
|
||||
### Module Standards
|
||||
- All resources tagged with: var.tags
|
||||
- Naming: {project}-{environment}-{resource}
|
||||
- Outputs: Always export ARN, ID, name
|
||||
```
|
||||
|
||||
### For Multi-Cloud Teams
|
||||
|
||||
Add to "Environment":
|
||||
```markdown
|
||||
### Cloud Credentials
|
||||
- AWS: Profile `company-prod` / `company-staging`
|
||||
- GCP: Project `company-prod-123` / `company-staging-456`
|
||||
- Azure: Subscription `prod-sub-id` / `staging-sub-id`
|
||||
|
||||
### Cross-Cloud Services
|
||||
- DNS: [AWS Route53 / Cloudflare]
|
||||
- CDN: [CloudFront / Cloud CDN]
|
||||
- Secrets: [HashiCorp Vault - URL]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Integration with Agents
|
||||
|
||||
Pair this CLAUDE.md with the DevOps/SRE agent:
|
||||
|
||||
```json
|
||||
{
|
||||
"agents": {
|
||||
"sre": {
|
||||
"path": ".claude/agents/devops-sre.md",
|
||||
"model": "sonnet"
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Then invoke with: `@sre investigate this pod crash`
|
||||
|
||||
---
|
||||
|
||||
## See Also
|
||||
|
||||
- [DevOps & SRE Guide](../../guide/devops-sre.md) — Complete FIRE framework documentation
|
||||
- [DevOps Agent](../agents/devops-sre.md) — Agent persona for infrastructure tasks
|
||||
- [Security Hardening](../../guide/security-hardening.md) — Security best practices
|
||||
|
|
@ -16,6 +16,7 @@ Core documentation for mastering Claude Code.
|
|||
| [observability.md](./observability.md) | Session monitoring and cost tracking | 15 min |
|
||||
| [methodologies.md](./methodologies.md) | 15 development methodologies reference (TDD, SDD, BDD, etc.) | 20 min |
|
||||
| [security-hardening.md](./security-hardening.md) | Security threats, MCP vetting, injection defense | 25 min |
|
||||
| [devops-sre.md](./devops-sre.md) | FIRE framework for infrastructure diagnosis and incident response | 30 min |
|
||||
| [ai-ecosystem.md](./ai-ecosystem.md) | Complementary AI tools (Perplexity, Gemini, Kimi, NotebookLM) | 25 min |
|
||||
| [cowork.md](./cowork.md) | Claude Cowork: Summary (see [dedicated repo](https://github.com/FlorianBruniaux/claude-cowork-guide) for full docs) | 5 min |
|
||||
| [workflows/](./workflows/) | Practical workflow guides for Claude Code | 30 min |
|
||||
|
|
|
|||
|
|
@ -6,7 +6,7 @@
|
|||
|
||||
**Written with**: Claude (Anthropic)
|
||||
|
||||
**Version**: 3.9.8 | **Last Updated**: January 2026
|
||||
**Version**: 3.9.9 | **Last Updated**: January 2026
|
||||
|
||||
---
|
||||
|
||||
|
|
@ -411,4 +411,4 @@ where.exe claude; claude doctor; claude mcp list
|
|||
|
||||
**Author**: Florian BRUNIAUX | [@Méthode Aristote](https://methode-aristote.fr) | Written with Claude
|
||||
|
||||
*Last updated: January 2026 | Version 3.9.8*
|
||||
*Last updated: January 2026 | Version 3.9.9*
|
||||
|
|
|
|||
869
guide/devops-sre.md
Normal file
|
|
@ -0,0 +1,869 @@
|
|||
# DevOps & SRE with Claude Code
|
||||
|
||||
**Reading time**: 30 minutes
|
||||
**Skill level**: Intermediate (assumes DevOps basics)
|
||||
**Prerequisites**: Claude Code basics ([Sections 1-2](./ultimate-guide.md#1-getting-started) of main guide)
|
||||
|
||||
---
|
||||
|
||||
> **The FIRE Framework**: A systematic approach to infrastructure diagnosis with Claude Code.
|
||||
>
|
||||
> **F**irst Response → **I**nvestigate → **R**emediate → **E**valuate
|
||||
|
||||
---
|
||||
|
||||
## Table of Contents
|
||||
|
||||
1. [Quick Start](#quick-start)
|
||||
2. [Pattern: Infrastructure Diagnosis](#pattern-infrastructure-diagnosis)
|
||||
3. [Pattern: Incident Response](#pattern-incident-response)
|
||||
4. [Pattern: Infrastructure as Code](#pattern-infrastructure-as-code)
|
||||
5. [Guardrails & Adoption](#guardrails--adoption)
|
||||
6. [Quick Reference](#quick-reference)
|
||||
|
||||
---
|
||||
|
||||
# Quick Start
|
||||
|
||||
**Goal**: Get productive with Claude Code for DevOps in 5 minutes.
|
||||
|
||||
## Quick Self-Check
|
||||
|
||||
| Situation | Jump To |
|
||||
|-----------|---------|
|
||||
| I'm in an active incident NOW | [Emergency: K8s Troubleshooting](#kubernetes-troubleshooting) |
|
||||
| First time using Claude for DevOps | [Tutorial: First Diagnosis](#your-first-infrastructure-diagnosis) |
|
||||
| Want to automate my runbooks | [Pattern: Incident Response](#pattern-incident-response) |
|
||||
| Evaluating for my team | [Guardrails & Adoption](#guardrails--adoption) |
|
||||
| Need ready-to-use prompts | [Quick Reference](#quick-reference) |
|
||||
|
||||
## The FIRE Framework
|
||||
|
||||
Every infrastructure diagnosis with Claude follows this pattern:
|
||||
|
||||
```
|
||||
F - First Response → Give Claude the symptom + context
|
||||
I - Investigate → Claude analyzes logs, metrics, config
|
||||
R - Remediate → Claude proposes fix (with human approval)
|
||||
E - Evaluate → Postmortem, documentation, prevention
|
||||
```
|
||||
|
||||
### Why FIRE?
|
||||
|
||||
| Phase | Human Role | Claude Role |
|
||||
|-------|------------|-------------|
|
||||
| **F**irst Response | Describe symptom, provide context | Triage, prioritize checks |
|
||||
| **I**nvestigate | Run commands, paste output | Analyze, correlate, hypothesize |
|
||||
| **R**emediate | **Approve or reject** | Propose fix, explain impact |
|
||||
| **E**valuate | Review, share knowledge | Generate postmortem, docs |
|
||||
|
||||
**Critical**: Claude proposes, you approve. This is especially important for:
|
||||
- Destructive operations (delete, scale down, restart)
|
||||
- Production environments
|
||||
- Security-sensitive changes
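
The approval step can be made mechanical with a small shell guard that flags commands in the categories above before they run. This is only a sketch under stated assumptions: the function names `is_destructive` and `run_with_approval` and the pattern list are hypothetical, are not part of Claude Code, and would need adapting to your own tooling.

```bash
#!/usr/bin/env bash
# Hypothetical guard: flag commands that match destructive patterns.
# The pattern list is an illustrative assumption, not an exhaustive policy.

is_destructive() {
  case "$1" in
    # Any "kubectl scale" is flagged, since scale-down cannot be told apart by name alone.
    *"kubectl delete"*|*"kubectl scale"*|*"kubectl rollout restart"*) return 0 ;;
    *"terraform destroy"*|*"rm -rf"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Print a flagged command and require a typed "yes" before running it.
# Commands that are not flagged run immediately.
run_with_approval() {
  local cmd="$1" answer
  if is_destructive "$cmd"; then
    printf 'DESTRUCTIVE: %s\nType "yes" to run: ' "$cmd"
    read -r answer
    [ "$answer" = "yes" ] || { echo "Skipped."; return 1; }
  fi
  bash -c "$cmd"
}
```

A guard like this only catches the patterns it lists, so it complements human review rather than replacing it.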
|
||||
|
||||
## Your First Infrastructure Diagnosis
|
||||
|
||||
### Example: Pod CrashLoopBackOff
|
||||
|
||||
```bash
|
||||
# Step 1: FIRST RESPONSE - Give context
|
||||
claude "I have a pod in CrashLoopBackOff in the payment-service namespace.
|
||||
Environment: production
|
||||
Cluster: gke-prod-us-east1
|
||||
Started: 10 minutes ago
|
||||
What are the first 3 things I should check?"
|
||||
|
||||
# Step 2: INVESTIGATE - Run suggested commands and share output
|
||||
kubectl describe pod payment-api-7d4b8c6f5-x2j9k -n payment-service
|
||||
# Copy output and paste to Claude
|
||||
|
||||
# Step 3: REMEDIATE - Claude proposes, you approve
|
||||
claude "Based on this describe output, what's the fix?
|
||||
CONSTRAINT: Show me the exact command but don't execute anything."
|
||||
|
||||
# Step 4: EVALUATE - Document the incident
|
||||
claude "Create a brief incident note for our wiki:
|
||||
- What happened
|
||||
- Root cause
|
||||
- Fix applied
|
||||
- Prevention recommendation"
|
||||
```
|
||||
|
||||
### What Claude Returns (First Response Example)
|
||||
|
||||
```markdown
|
||||
## CrashLoopBackOff Triage (Priority Order)
|
||||
|
||||
1. **Check exit code and last logs**
|
||||
```bash
|
||||
kubectl logs payment-api-7d4b8c6f5-x2j9k -n payment-service --previous
|
||||
```
|
||||
Most common: OOM (exit 137), config errors (exit 1), dependency failures
|
||||
|
||||
2. **Check pod events and resource limits**
|
||||
```bash
|
||||
kubectl describe pod payment-api-7d4b8c6f5-x2j9k -n payment-service
|
||||
```
|
||||
Look for: OOMKilled, ImagePullBackOff, liveness probe failures
|
||||
|
||||
3. **Check recent deployment changes**
|
||||
```bash
|
||||
kubectl rollout history deployment/payment-api -n payment-service
|
||||
```
|
||||
If recent: consider rollback while investigating
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
# Pattern: Infrastructure Diagnosis
|
||||
|
||||
**Goal**: Systematic troubleshooting for common infrastructure issues.
|
||||
|
||||
## Kubernetes Troubleshooting
|
||||
|
||||
### K8s MCP Server Setup
|
||||
|
||||
For persistent K8s context, install the K8s MCP server:
|
||||
|
||||
```json
|
||||
// ~/.claude/mcp.json
|
||||
{
|
||||
"mcpServers": {
|
||||
"kubernetes": {
|
||||
"command": "npx",
|
||||
"args": ["-y", "@anthropic/mcp-kubernetes"]
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Benefits**: Claude can query cluster state directly, reducing copy-paste cycles.
|
||||
|
||||
**Without MCP**: You'll pipe kubectl output to Claude manually (still effective).
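
For the manual route, it can help to gather the usual pod diagnostics into a single stream before piping. The helper below is a minimal sketch: `collect_pod_context` is a hypothetical name, and the choice of `describe` plus the last 50 lines of the previous container's logs is an assumption about what is most useful, not a fixed recipe.

```bash
#!/usr/bin/env bash
# Hypothetical helper: gather common pod diagnostics into one stream.
collect_pod_context() {
  local pod="$1" ns="$2"
  echo "=== describe pod $pod (namespace $ns) ==="
  kubectl describe pod "$pod" -n "$ns"
  echo "=== logs from previous container (last 50 lines) ==="
  # --previous fails when the container has never restarted; keep going anyway.
  kubectl logs "$pod" -n "$ns" --previous --tail=50 2>&1 \
    || echo "(no previous container logs available)"
}

# Usage (pod and namespace are placeholders):
#   collect_pod_context <pod> <ns> | claude "Analyze this pod failure"
```

Because everything is printed by one function, the whole output reaches the pipe, which avoids losing part of it when commands are chained by hand.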
|
||||
|
||||
### Prompts by Symptom
|
||||
|
||||
Copy-paste these prompts, replacing `<bracketed>` values.
|
||||
|
||||
#### CrashLoopBackOff
|
||||
|
||||
```bash
|
||||
kubectl describe pod <pod> -n <ns> | claude "Analyze this CrashLoopBackOff:
|
||||
1. What's the exit code and what does it mean?
|
||||
2. Check the last 5 restarts pattern (timing, consistent or escalating?)
|
||||
3. Suggest 3 most likely root causes based on the events
|
||||
4. Give me the exact commands to investigate each hypothesis"
|
||||
```
|
||||
|
||||
**Common Causes Claude Will Identify**:
|
||||
- Exit 137: OOMKilled (memory limit hit)
|
||||
- Exit 1: Application error (bad config, missing dependency)
|
||||
- Exit 143: SIGTERM (graceful shutdown timeout)
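
The codes above follow the usual convention that an exit code greater than 128 means the process was killed by signal number `code - 128` (137 = 128 + 9, 143 = 128 + 15). A small decoder, sketched here with the hypothetical name `describe_exit_code`, makes that arithmetic explicit; the wording of each message is illustrative.

```bash
#!/usr/bin/env bash
# Hypothetical helper: translate a container exit code into a likely cause.
describe_exit_code() {
  local code="$1" sig
  if [ "$code" -eq 0 ]; then
    echo "exit 0: clean exit"
  elif [ "$code" -gt 128 ]; then
    # Codes above 128 encode the terminating signal as 128 + signal number.
    sig=$((code - 128))
    case "$sig" in
      9)  echo "exit $code: SIGKILL (signal 9), commonly OOMKilled" ;;
      15) echo "exit $code: SIGTERM (signal 15), asked to shut down" ;;
      *)  echo "exit $code: terminated by signal $sig" ;;
    esac
  else
    echo "exit $code: application error"
  fi
}

# Usage: describe_exit_code 137
```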
|
||||
|
||||
#### OOMKilled
|
||||
|
||||
```bash
|
||||
(kubectl top pods -n <ns> && kubectl describe pod <pod> -n <ns>) | claude "This pod was OOMKilled:
|
||||
1. Compare requests vs limits vs actual usage
|
||||
2. Is this a memory leak or under-provisioning?
|
||||
3. If leak: what patterns in the container suggest investigation paths?
|
||||
4. If under-provisioned: suggest optimal resource settings based on this data"
|
||||
```
|
||||
|
||||
**Follow-up for Memory Leaks**:
|
||||
```bash
|
||||
claude "The pod has been restarting every 2 hours with OOMKilled.
|
||||
Memory grows linearly from 200Mi to 512Mi limit before crash.
|
||||
Language: Node.js 18
|
||||
What are the top 3 things to check for memory leaks in this stack?"
|
||||
```
|
||||
|
||||
#### ImagePullBackOff
|
||||
|
||||
```bash
|
||||
kubectl describe pod <pod> -n <ns> | claude "ImagePullBackOff diagnosis:
|
||||
1. Is this an auth issue, network issue, or wrong image name?
|
||||
2. What's the exact error message telling us?
|
||||
3. Give me commands to verify the image exists and credentials work"
|
||||
```
|
||||
|
||||
#### Pending Pod (Not Scheduling)
|
||||
|
||||
```bash
|
||||
(kubectl describe pod <pod> -n <ns> && kubectl describe nodes) | claude "Pod stuck in Pending:
|
||||
1. Is this resource constraints, node selectors, or affinity rules?
|
||||
2. Which nodes were considered and why rejected?
|
||||
3. What's the quickest fix vs proper solution?"
|
||||
```
|
||||
|
||||
#### Service Not Reachable
|
||||
|
||||
```bash
|
||||
(kubectl get svc,endpoints -n <ns> && kubectl describe svc <svc> -n <ns>) | claude "Service not reachable:
|
||||
1. Are there healthy endpoints?
|
||||
2. Is the selector matching pods correctly?
|
||||
3. Is it a network policy blocking traffic?
|
||||
Give me diagnostic commands for each possibility"
|
||||
```
|
||||
|
||||
### Case Study: Production Outage Root Cause
|
||||
|
||||
**Situation**: E-commerce platform, 3 AM page, checkout service returning 503s.
|
||||
|
||||
**FIRE in Action**:
|
||||
|
||||
```bash
|
||||
# F - First Response
|
||||
claude "INCIDENT: checkout-service returning 503s, started 10 min ago.
|
||||
Impact: 100% of checkout attempts failing.
|
||||
Environment: AWS EKS production, us-east-1.
|
||||
Recent changes: deployment 2 hours ago (new feature flag logic).
|
||||
What's the fastest diagnostic path?"
|
||||
|
||||
# I - Investigate (Claude suggested checking pods first)
|
||||
kubectl get pods -n checkout -l app=checkout-service
|
||||
# Output: 3/5 pods in CrashLoopBackOff
|
||||
|
||||
kubectl logs checkout-service-xxx -n checkout --previous | tail -50 | claude "Analyze crash logs"
|
||||
# Claude identifies: panic: nil pointer dereference in feature flag code
|
||||
|
||||
# R - Remediate
|
||||
claude "Root cause identified: nil pointer in feature flag logic from recent deploy.
|
||||
Options:
|
||||
A) Rollback to previous version
|
||||
B) Hotfix the nil check
|
||||
Which is faster and safer at 3 AM?"
|
||||
# Claude recommends: Rollback (faster, proven state, fix properly tomorrow)
|
||||
|
||||
kubectl rollout undo deployment/checkout-service -n checkout
|
||||
# Service restored in 2 minutes
|
||||
|
||||
# E - Evaluate (next day)
|
||||
claude "Create postmortem from this incident:
|
||||
Timeline: 3:02 AM alert, 3:15 AM root cause found, 3:17 AM rollback, 3:19 AM resolved
|
||||
Root cause: Feature flag nil pointer from commit abc123
|
||||
Impact: 15 minutes checkout downtime
|
||||
Format: Blameless, focused on prevention"
|
||||
```
|
||||
|
||||
**Outcome**: 15-minute MTTR, clear postmortem, prevention action items identified.
|
||||
|
||||
## Log Analysis & Correlation
|
||||
|
||||
### Multi-Service Log Correlation
|
||||
|
||||
```bash
|
||||
# Collect logs from related services
|
||||
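# --tail=-1 matters here: with a label selector, kubectl returns only the last 10 lines per pod by default
kubectl logs -l app=api-gateway -n ingress --since=10m --tail=-1 > gateway.log
kubectl logs -l app=auth-service -n auth --since=10m --tail=-1 > auth.log
kubectl logs -l app=payment-service -n payment --since=10m --tail=-1 > payment.log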
|
||||
|
||||
# Analyze correlation
|
||||
cat gateway.log auth.log payment.log | claude "Correlate these logs:
|
||||
1. Find the request flow for failed transactions
|
||||
2. Identify where the failure originates
|
||||
3. Are there patterns in timing or specific endpoints?
|
||||
4. Create a timeline of events"
|
||||
```
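
Concatenating the three files with `cat` keeps each service's lines grouped together, which makes a cross-service timeline harder to read. One option is to interleave the lines by timestamp first. The sketch below assumes every log line begins with an ISO-8601 timestamp in the same format and time zone (for example, logs collected with `kubectl logs --timestamps`); `merge_logs` is a hypothetical helper name.

```bash
# merge_logs: interleave several log files into one stream ordered by timestamp.
# Assumes each line starts with an ISO-8601 timestamp followed by a space.
merge_logs() {
  for f in "$@"; do
    svc=$(basename "$f" .log)
    # Insert "[service]" after the timestamp so each line keeps its origin.
    sed "s/^\([^ ]*\) /\1 [$svc] /" "$f"
  done | LC_ALL=C sort -s -k1,1   # stable sort on the timestamp field only
}
```

Possible usage: `merge_logs gateway.log auth.log payment.log | claude "Correlate these logs: ..."`.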
|
||||
|
||||
### Log Pattern Detection
|
||||
|
||||
```bash
|
||||
# Find anomalies in error patterns
|
||||
grep -E "ERROR|WARN|Exception" app.log | claude "Analyze error patterns:
|
||||
1. Cluster similar errors (group by type, not timestamp)
|
||||
2. What's the most frequent vs most severe?
|
||||
3. Which errors are correlated (same root cause)?
|
||||
4. Prioritize investigation order"
|
||||
```
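
Part of the clustering can happen locally before anything is sent to Claude, which keeps the prompt small. The sketch below strips a leading timestamp, replaces numbers with `N` so near-identical messages collapse together, and counts the results. The timestamp pattern and the name `cluster_errors` are assumptions; adjust them to the actual log format.

```bash
# cluster_errors: group similar error lines from stdin and print "count message",
# most frequent first. The optional argument limits the number of groups (default 20).
cluster_errors() {
  grep -E 'ERROR|WARN|Exception' \
    | sed -E -e 's/^[0-9]{4}-[0-9]{2}-[0-9]{2}[T ][0-9:.,]+Z? *//' -e 's/[0-9]+/N/g' \
    | sort | uniq -c | sort -rn \
    | awk '{$1=$1; print}' \
    | head -n "${1:-20}"
}
```

Possible usage: `cluster_errors 30 < app.log | claude "Analyze error patterns: ..."`.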
|
||||
|
||||
### Prometheus/Grafana Query Help
|
||||
|
||||
```bash
|
||||
claude "I need a PromQL query to:
|
||||
- Show p99 latency for the payment-service
|
||||
- Group by endpoint
|
||||
- Alert if > 500ms for 5 minutes
|
||||
Include the alert rule YAML too"
|
||||
```
|
||||
|
||||
## What Claude CAN'T Do (Limitations)
|
||||
|
||||
Understanding limitations prevents frustration and unsafe reliance.
|
||||
|
||||
| Limitation | Impact | Workaround |
|------------|--------|------------|
| **No real-time cluster state** | Can't see current pod status | Use K8s MCP or paste kubectl output |
| **No direct API access** | Can't call AWS/GCP APIs | Use MCP servers or share CLI output |
| **Context window limits** | ~200K tokens max | Focus on relevant logs, not full dumps |
| **No persistent memory** | Forgets between sessions | Use CLAUDE.md for project context |
| **Hallucination risk** | May suggest invalid flags | Always verify commands before running |
| **No real-time metrics** | Can't see current graphs | Screenshot Grafana or paste metric values |
| **No secrets access** | Can't read vault/secrets | Good! Never share secrets with any LLM |
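
For the context window row, a rough way to stay inside a budget is to keep only the most recent part of a log. The sketch below assumes about 4 bytes per token, which is a rule of thumb and not a real tokenizer; `tail_tokens` is a hypothetical helper name.

```bash
# tail_tokens: keep only the newest part of a stream, sized to a token budget.
# Assumes roughly 4 bytes per token; the first kept line may be cut mid-line.
tail_tokens() {
  budget_tokens=${1:-20000}
  tail -c "$((budget_tokens * 4))"
}
```

Possible usage: `tail_tokens 20000 < app.log | claude "Analyze for error patterns"`.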
|
||||
|
||||
### When NOT to Use Claude
|
||||
|
||||
- **Time-critical decisions under 30 seconds**: Your muscle memory is faster
|
||||
- **Highly confidential incidents**: Data breach investigation (legal implications)
|
||||
- **Simple, obvious fixes**: If you know the answer, just do it
|
||||
- **Compliance-restricted environments**: Check if AI tools are allowed
|
||||
|
||||
### When Claude Excels
|
||||
|
||||
- **Complex root cause analysis**: Multiple interacting systems
|
||||
- **Documentation generation**: Postmortems, runbooks, procedures
|
||||
- **Learning new tools**: Unfamiliar cloud services, new k8s features
|
||||
- **Second opinion**: Validating your hypothesis
|
||||
- **Bulk operations**: Generating configs for multiple environments
|
||||
|
||||
---
|
||||
|
||||
# Pattern: Incident Response
|
||||
|
||||
**Goal**: Structured workflows for incident management.
|
||||
|
||||
## Solo Incident Workflow
|
||||
|
||||
**Reality**: At 3 AM, you're alone. This workflow is designed for one person.
|
||||
|
||||
### FIRE in Action: Solo Incident
|
||||
|
||||
#### F - First Response (30 seconds)
|
||||
|
||||
```bash
|
||||
claude "INCIDENT: [symptom - be specific]
|
||||
Context: [service], [environment], [time started]
|
||||
Recent changes: [deploys, infra changes, traffic spikes]
|
||||
Current impact: [% users affected, revenue impact if known]
|
||||
What are the 3 most critical things to check first?"
|
||||
```
|
||||
|
||||
**Example**:
|
||||
```bash
|
||||
claude "INCIDENT: API returning 500 errors on /checkout endpoint
|
||||
Context: checkout-service, production-us-east-1, started 5 min ago
|
||||
Recent changes: deployed v2.3.4 at 2:45 AM, added new payment provider
|
||||
Current impact: ~30% of checkout requests failing
|
||||
What are the 3 most critical things to check first?"
|
||||
```
|
||||
|
||||
#### I - Investigate (2-5 minutes)
|
||||
|
||||
Run Claude's suggested commands, share output:
|
||||
|
||||
```bash
|
||||
# Claude suggested checking pod health first
|
||||
kubectl get pods -n checkout | claude "Quick assessment of this pod list"
|
||||
|
||||
# Then checking recent logs
|
||||
kubectl logs -l app=checkout -n checkout --since=5m --tail=100 | claude "Analyze for error patterns"
|
||||
|
||||
# Then checking the rollout history (add --revision=<n> to see a specific revision's pod template)
|
||||
kubectl rollout history deployment/checkout-service -n checkout | claude "What changed in the last deployment?"
|
||||
```
|
||||
|
||||
**Pro tip**: Keep one terminal for running commands and another for the Claude conversation.
|
||||
|
||||
#### R - Remediate (with approval)
|
||||
|
||||
```bash
|
||||
claude "Based on investigation:
|
||||
- Root cause: [your understanding]
|
||||
- Evidence: [key findings]
|
||||
|
||||
Propose remediation options:
|
||||
1. Quick mitigation (restore service)
|
||||
2. Proper fix (address root cause)
|
||||
|
||||
CONSTRAINT: I need to approve before any action. Show exact commands."
|
||||
```
|
||||
|
||||
**Approval Gate Example**:
|
||||
```
|
||||
Claude: "Recommended: Rollback to v2.3.3
|
||||
Command: kubectl rollout undo deployment/checkout-service -n checkout
|
||||
Risk: Low - previous version was stable for 2 weeks
|
||||
Alternative: Disable the new payment provider feature flag
|
||||
|
||||
Which approach do you want to take?"
|
||||
|
||||
You: "Proceed with rollback"
|
||||
```
|
||||
|
||||
#### E - Evaluate (post-incident, not during)
|
||||
|
||||
```bash
|
||||
claude "Create incident postmortem:
|
||||
|
||||
Timeline:
|
||||
- 2:45 AM: Deployed v2.3.4
|
||||
- 3:00 AM: First alerts fired
|
||||
- 3:05 AM: Incident declared
|
||||
- 3:12 AM: Root cause identified (nil pointer in new payment provider code)
|
||||
- 3:15 AM: Rollback initiated
|
||||
- 3:17 AM: Service restored
|
||||
|
||||
Format: Blameless, focus on systems not people
|
||||
Include: Action items with owners"
|
||||
```
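
A postmortem usually reports durations (time to detect, time to restore) rather than raw timestamps. Those can be computed from the timeline before drafting. The sketch below assumes same-day times in 24-hour `HH:MM` form; `minutes_between` is a hypothetical helper name.

```bash
# minutes_between: elapsed minutes between two same-day times in 24-hour HH:MM form.
# awk converts "03" to the number 3, so leading zeros are not treated as octal.
minutes_between() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    split(a, s, ":"); split(b, e, ":")
    print (e[1] * 60 + e[2]) - (s[1] * 60 + s[2])
  }'
}
```

For the timeline above, `minutes_between 03:00 03:17` gives the time from first alert to restoration, and `minutes_between 03:05 03:12` gives the time from incident declaration to root cause.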
|
||||
|
||||
## Communication During Incidents
|
||||
|
||||
### Stakeholder Update Generator
|
||||
|
||||
```bash
|
||||
claude "Generate incident update for stakeholders:
|
||||
|
||||
Incident: Checkout service degradation
|
||||
Current status: Mitigated, monitoring
|
||||
Impact: 15 minutes of 30% checkout failures
|
||||
ETA to full resolution: 2 hours (proper fix in next deploy)
|
||||
|
||||
Audience: Non-technical executives
|
||||
Tone: Professional, reassuring, factual
|
||||
Length: 3 sentences max"
|
||||
```
|
||||
|
||||
**Output Example**:
|
||||
> We experienced a 15-minute disruption to our checkout service affecting approximately 30% of transactions, which has now been resolved. The issue was caused by a software bug in a recent update and was quickly rolled back. We'll deploy a permanent fix during our next scheduled maintenance window with no expected customer impact.
|
||||
|
||||
### Incident Bridge Prompt
|
||||
|
||||
For real-time incident channels:
|
||||
|
||||
```bash
|
||||
claude "I'm managing an incident bridge. Help me:
|
||||
1. Maintain a running timeline of events
|
||||
2. Suggest next investigation steps when we hit dead ends
|
||||
3. Draft comms updates every 15 minutes
|
||||
4. Flag when I should escalate
|
||||
|
||||
Current status: [paste latest update]
|
||||
What should I communicate to the bridge now?"
|
||||
```
|
||||
|
||||
## Multi-Agent Pattern: Post-Incident Analysis
|
||||
|
||||
**When to use multi-agent**: Not during active incidents. Use for comprehensive analysis afterward.
|
||||
|
||||
```bash
|
||||
# Agent 1: Timeline Reconstruction
|
||||
claude "You are an incident timeline analyst.
|
||||
From these logs and Slack messages, reconstruct a precise timeline:
|
||||
[paste logs and comms]
|
||||
Output: Timestamped events, who did what when"
|
||||
|
||||
# Agent 2: Root Cause Analysis
|
||||
claude "You are a root cause analyst.
|
||||
Given this timeline and system architecture, perform 5-whys analysis:
|
||||
[paste timeline from Agent 1]
|
||||
Output: Root cause chain, contributing factors"
|
||||
|
||||
# Agent 3: Prevention Recommendations
|
||||
claude "You are an SRE process improvement specialist.
|
||||
Given this root cause analysis:
|
||||
[paste RCA from Agent 2]
|
||||
Output: Prioritized prevention measures, effort estimates, ownership suggestions"
|
||||
```
|
||||
|
||||
### Case Study: OpsWorker.ai MTTR Reduction
|
||||
|
||||
**Context**: SRE team managing 200+ microservices, 5 on-call engineers.
|
||||
|
||||
**Before Claude**:
|
||||
- Average MTTR: 45 minutes
|
||||
- Postmortems: Often delayed or skipped
|
||||
- Knowledge silos: Each engineer knew different services
|
||||
|
||||
**Claude Integration**:
|
||||
1. FIRE framework adopted for all incidents
|
||||
2. Claude generates initial postmortem draft within 1 hour
|
||||
3. Runbooks augmented with Claude-assisted troubleshooting
|
||||
|
||||
**After 3 Months**:
|
||||
- Average MTTR: 18 minutes (60% reduction)
|
||||
- Postmortem completion: 95% within 24 hours
|
||||
- Knowledge sharing: Claude-generated runbooks accessible to all
|
||||
|
||||
**Key Insight**: Biggest gains weren't speed—they were consistency and documentation.
|
||||
|
||||
---
|
||||
|
||||
# Pattern: Infrastructure as Code
|
||||
|
||||
**Goal**: Leverage Claude for Terraform, Ansible, and GitOps workflows.
|
||||
|
||||
## Terraform with Claude
|
||||
|
||||
### Reference: Anton Babenko's Terraform Skill
|
||||
|
||||
The most comprehensive Terraform skill for Claude Code:
|
||||
|
||||
**Repository**: [antonbabenko/terraform-skill](https://github.com/antonbabenko/terraform-skill)
|
||||
**Author**: Anton Babenko (creator of terraform-aws-modules, 1B+ downloads)
|
||||
|
||||
```bash
|
||||
# Install
|
||||
mkdir -p ~/.claude/skills && cd ~/.claude/skills
|
||||
git clone https://github.com/antonbabenko/terraform-skill.git terraform
|
||||
```
|
||||
|
||||
**What it provides**:
|
||||
- Best practices for module structure
|
||||
- AWS, GCP, Azure patterns
|
||||
- State management guidance
|
||||
- CI/CD integration patterns
|
||||
|
||||
### Common Terraform Prompts
|
||||
|
||||
#### Plan Review
|
||||
|
||||
```bash
|
||||
terraform plan -out=tfplan && terraform show -no-color tfplan | claude "Review this Terraform plan:
|
||||
1. Any dangerous changes? (data loss, downtime)
|
||||
2. Are the changes what we expect?
|
||||
3. Any missing changes we should add?
|
||||
4. Cost implications if visible"
|
||||
```
|
||||
|
||||
#### Module Generation
|
||||
|
||||
```bash
|
||||
claude "Generate a Terraform module for:
|
||||
- AWS ECS Fargate service
|
||||
- With ALB and target group
|
||||
- Auto-scaling based on CPU
|
||||
- Secrets from SSM Parameter Store
|
||||
|
||||
Follow these conventions:
|
||||
- Use for_each over count
|
||||
- All resources tagged with var.tags
|
||||
- Output the service URL and ARN"
|
||||
```
|
||||
|
||||
#### State Surgery Helper
|
||||
|
||||
```bash
|
||||
claude "I need to move a resource to a different state file:
|
||||
Current state: terraform-prod/terraform.tfstate
|
||||
Resource: aws_s3_bucket.logs
|
||||
Target state: terraform-shared/terraform.tfstate
|
||||
|
||||
What's the safest procedure? Include rollback steps."
|
||||
```
|
||||
|
||||
### Drift Detection Workflow
|
||||
|
||||
```bash
|
||||
# Detect drift
|
||||
terraform plan -detailed-exitcode -no-color 2>&1 | tee drift.txt
|
||||
|
||||
# Analyze with Claude
|
||||
cat drift.txt | claude "Analyze this Terraform drift:
|
||||
1. What changed outside of Terraform?
|
||||
2. Is this drift expected (manual change) or concerning?
|
||||
3. Should we update the configuration to match the change, or re-apply to revert it?
|
||||
4. What's the safest remediation path?"
|
||||
```
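
`terraform plan -detailed-exitcode` reports its result through the exit status: 0 means no changes, 1 means an error, and 2 means changes are present. When the plan is piped through `tee`, the shell reports the status of `tee` instead unless `pipefail` is set, so a scripted check should capture the status directly. The sketch below shows one way to do that; `classify_plan_exit` is a hypothetical helper name.

```bash
# classify_plan_exit: translate the exit status of `terraform plan -detailed-exitcode`.
# Documented statuses: 0 = no changes, 1 = error, 2 = changes present.
classify_plan_exit() {
  case "$1" in
    0) echo "clean" ;;
    2) echo "drift" ;;
    *) echo "error" ;;
  esac
}
```

Possible usage: `terraform plan -detailed-exitcode -no-color > drift.txt 2>&1; status=$(classify_plan_exit $?)`, then send `drift.txt` to Claude only when `$status` is `drift`.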
|
||||
|
||||
## Ansible with Claude
|
||||
|
||||
### Playbook Review
|
||||
|
||||
```bash
|
||||
cat playbook.yml | claude "Review this Ansible playbook:
|
||||
1. Idempotency issues?
|
||||
2. Security concerns?
|
||||
3. Error handling gaps?
|
||||
4. Performance optimizations?"
|
||||
```
|
||||
|
||||
### Role Generation
|
||||
|
||||
```bash
|
||||
claude "Generate an Ansible role for:
|
||||
- Installing and configuring Nginx
|
||||
- SSL certificates via Let's Encrypt (certbot)
|
||||
- Hardened configuration (disable server tokens, etc.)
|
||||
- Log rotation
|
||||
|
||||
Follow best practices:
|
||||
- Use handlers for service restarts
|
||||
- Variables in defaults/main.yml
|
||||
- Include molecule tests structure"
|
||||
```
|
||||
|
||||
## GitOps with Claude
|
||||
|
||||
### ArgoCD Application Review
|
||||
|
||||
```bash
|
||||
cat application.yaml | claude "Review this ArgoCD Application:
|
||||
1. Sync policy appropriate for the environment?
|
||||
2. Resource health checks defined?
|
||||
3. Any sync wave ordering issues?
|
||||
4. Namespace and project permissions correct?"
|
||||
```
|
||||
|
||||
### Helm Values Generation
|
||||
|
||||
```bash
|
||||
claude "Generate Helm values for deploying [application] to:
|
||||
- Environment: staging
|
||||
- Resources: Limited (cost-conscious)
|
||||
- Replicas: 2
|
||||
- Ingress: Internal only
|
||||
- Secrets: From external-secrets operator
|
||||
|
||||
Base chart: [chart name]
|
||||
Include comments explaining each value"
|
||||
```
|
||||
|
||||
## Security Review Automation
|
||||
|
||||
### Infrastructure Security Scan
|
||||
|
||||
```bash
|
||||
# Run tfsec or checkov, analyze results
|
||||
tfsec . --format=json | claude "Analyze these security findings:
|
||||
1. Prioritize by severity and exploitability
|
||||
2. Which are false positives in our context?
|
||||
3. For real issues: what's the fix?
|
||||
4. Which can we ignore with a documented reason?"
|
||||
```
|
||||
|
||||
### IAM Policy Review
|
||||
|
||||
```bash
|
||||
cat iam-policy.json | claude "Review this IAM policy:
|
||||
1. Does it follow least privilege?
|
||||
2. Any overly permissive actions? (*, admin, etc.)
|
||||
3. Resource constraints appropriate?
|
||||
4. Suggest a more restrictive version that still works"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
# Guardrails & Adoption
|
||||
|
||||
**Goal**: Implement Claude Code safely and get team buy-in.
|
||||
|
||||
## Cost Awareness
|
||||
|
||||
### Claude Code Costs
|
||||
|
||||
| Model | Input (1M tokens) | Output (1M tokens) |
|-------|-------------------|-------------------|
| Sonnet 4 | $3 | $15 |
| Opus 4 | $15 | $75 |
|
||||
|
||||
**Typical DevOps session**: 20K-50K tokens = $0.10-$0.50
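
The session estimate follows from the per-million-token prices: cost = (input tokens × input price + output tokens × output price) / 1,000,000. The sketch below does that arithmetic; `estimate_cost` is a hypothetical helper name, and the split between input and output tokens is an assumption you supply.

```bash
# estimate_cost: USD cost of a session from token counts and per-million-token prices.
# Arguments: input_tokens output_tokens input_price_per_1M output_price_per_1M
estimate_cost() {
  awk -v in_tok="$1" -v out_tok="$2" -v in_price="$3" -v out_price="$4" 'BEGIN {
    printf "%.4f\n", (in_tok * in_price + out_tok * out_price) / 1000000
  }'
}
```

For example, `estimate_cost 40000 5000 3 15` prints `0.1950`: about $0.20 for 40K input and 5K output tokens at the Sonnet 4 prices listed above.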
|
||||
|
||||
**Cost control strategies**:
|
||||
1. Use Sonnet for routine tasks (default)
|
||||
2. Reserve Opus for complex multi-system analysis
|
||||
3. Use `/compact` to reduce context when conversation gets long
|
||||
4. Avoid pasting entire log files; grep relevant sections first
|
||||
|
||||
### Infrastructure Costs from Claude Suggestions
|
||||
|
||||
**Beware**: Claude doesn't see your cloud bill. Always ask:
|
||||
|
||||
```bash
|
||||
claude "Before I apply these changes, estimate:
|
||||
1. Monthly cost impact (compute, storage, network)
|
||||
2. Any resources that could scale unbounded?
|
||||
3. Cost optimization alternatives?"
|
||||
```
|
||||
|
||||
## Security Boundaries
|
||||
|
||||
### Never Share with Claude
|
||||
|
||||
| Data Type | Why Not | Alternative |
|-----------|---------|-------------|
| API keys, tokens | Could be cached/logged | Use placeholders: `<API_KEY>` |
| Production secrets | Security risk | Describe the secret type, not value |
| Customer PII | Privacy/compliance | Use anonymized examples |
| Proprietary algorithms | IP protection | Describe behavior, not code |
| Incident details with PII | Legal liability | Sanitize before sharing |
|
||||
|
||||
### Safe Prompting Template
|
||||
|
||||
```bash
|
||||
claude "Debug this authentication issue:
|
||||
- Service: auth-service
|
||||
- Error: 401 Unauthorized for valid tokens
|
||||
- Environment: staging (not production)
|
||||
- Token format: JWT with claims [user_id, org_id, exp]
|
||||
- NOTE: I've redacted all actual token values
|
||||
|
||||
Here's the sanitized log:
|
||||
[paste log with secrets replaced]"
|
||||
```
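
Redaction by hand is easy to get wrong under pressure, so a filter that masks common secret shapes can serve as a first pass. The sketch below is illustrative and far from exhaustive: the patterns and the name `redact` are assumptions, and the output still needs a manual review before it is pasted anywhere.

```bash
# redact: mask common secret shapes in log text read from stdin.
# Covers "Bearer <token>" headers and key=value / key: value pairs for a few key names.
redact() {
  sed -E \
    -e 's/(Bearer )[^ ]+/\1<REDACTED>/g' \
    -e 's/((password|passwd|secret|token|api_key)[=:] *)[^ ]+/\1<REDACTED>/g'
}
```

Possible usage: `redact < auth.log | claude "Debug this authentication issue: ..."`.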
|
||||
|
||||
### Approval Gates for Production
|
||||
|
||||
Always require human approval for:
|
||||
|
||||
```yaml
|
||||
# Example: Production change checklist
|
||||
approval_required:
|
||||
- kubectl delete
|
||||
- kubectl scale (down)
|
||||
- terraform destroy
|
||||
- DROP TABLE / DELETE FROM
|
||||
- rm -rf (outside tmp directories)
|
||||
- Any production database write
|
||||
- Any IAM policy change
|
||||
- Any security group modification
|
||||
```
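
A checklist like this can also be enforced in a wrapper script that refuses to run matching commands without confirmation. The sketch below only covers the command-shaped entries, and it flags every `kubectl scale`, not just scale-downs; `needs_approval` is a hypothetical helper name.

```bash
# needs_approval: succeed (exit 0) when a command string matches a pattern that
# should wait for a human decision. Patterns mirror the checklist; extend as needed.
needs_approval() {
  case "$1" in
    *"kubectl delete"*|*"kubectl scale"*|*"terraform destroy"*) return 0 ;;
    *"DROP TABLE"*|*"DELETE FROM"*|*"rm -rf"*) return 0 ;;
    *) return 1 ;;
  esac
}
```

Possible usage: `if needs_approval "$cmd"; then echo "Needs approval: $cmd"; else sh -c "$cmd"; fi`.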
|
||||
|
||||
## Team Rollout Checklist
|
||||
|
||||
### Phase 1: Pilot (1-2 engineers, 2 weeks)
|
||||
|
||||
- [ ] Install Claude Code for pilot users
|
||||
- [ ] Create team CLAUDE.md with common context
|
||||
- [ ] Document first 5 successful use cases
|
||||
- [ ] Identify one workflow to standardize
|
||||
- [ ] Track time saved (before/after)
|
||||
|
||||
### Phase 2: Expand (Team, 4 weeks)
|
||||
|
||||
- [ ] Share pilot learnings in team meeting
|
||||
- [ ] Create team-specific prompts library
|
||||
- [ ] Establish security guidelines (what to share/not share)
|
||||
- [ ] Set up shared skills/commands repository
|
||||
- [ ] Define when to use Claude vs when not to
|
||||
|
||||
### Phase 3: Optimize (Ongoing)
|
||||
|
||||
- [ ] Monthly review of prompt library
|
||||
- [ ] A/B test: Claude-assisted vs traditional for similar incidents
|
||||
- [ ] Contribute back to community (awesome-lists, this guide)
|
||||
- [ ] Track MTTR, postmortem completion, documentation quality
|
||||
|
||||
### Adoption Pitfalls to Avoid
|
||||
|
||||
| Pitfall | Why It Happens | Prevention |
|---------|---------------|------------|
| **Over-reliance** | Claude is so helpful | Mandate learning time, not just output |
| **Blind trust** | Commands usually work | Always review before running |
| **Context dumping** | Hope Claude figures it out | Provide focused context, not everything |
| **Skipping verification** | Time pressure | Build verification into workflow |
| **Shadow usage** | No team visibility | Share wins, normalize usage |
|
||||
|
||||
---
|
||||
|
||||
# Quick Reference
|
||||
|
||||
## FIRE Framework Summary
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────────────────────────┐
│ F - FIRST RESPONSE                                          │
│   "INCIDENT: [symptom]. Context: [service, env, time].      │
│    Recent changes: [what]. Impact: [who affected].          │
│    What are the 3 most critical things to check?"           │
├─────────────────────────────────────────────────────────────┤
│ I - INVESTIGATE                                             │
│   Run Claude's suggested commands                           │
│   Share output: "[output] | claude 'Analyze this'"          │
│   Iterate until root cause identified                       │
├─────────────────────────────────────────────────────────────┤
│ R - REMEDIATE                                               │
│   "Based on [findings], propose remediation.                │
│    CONSTRAINT: I need to approve before any action."        │
│   APPROVE → Execute │ REJECT → More investigation           │
├─────────────────────────────────────────────────────────────┤
│ E - EVALUATE                                                │
│   "Create postmortem: Timeline, root cause, prevention.     │
│    Format: Blameless, action items with owners."            │
└─────────────────────────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
## Prompts by Symptom
|
||||
|
||||
### Kubernetes
|
||||
|
||||
| Symptom | Prompt |
|---------|--------|
| CrashLoopBackOff | `kubectl describe pod <pod> -n <ns> \| claude "Exit code meaning? 3 likely causes? Commands to investigate?"` |
| OOMKilled | `{ kubectl top pods; kubectl describe pod <pod>; } \| claude "Leak or under-provisioned? Optimal resources?"` |
| ImagePullBackOff | `kubectl describe pod \| claude "Auth, network, or wrong image? Verification commands?"` |
| Pending | `{ kubectl describe pod <pod>; kubectl describe nodes; } \| claude "Resource, selector, or affinity issue?"` |
| Service unreachable | `kubectl get svc,endpoints \| claude "Healthy endpoints? Selector matching? Network policy?"` |
|
||||
|
||||
### Cloud/Infrastructure
|
||||
|
||||
| Symptom | Prompt |
|---------|--------|
| High latency | `[metrics] \| claude "Bottleneck location? Is it compute, network, or dependency?"` |
| Disk full | `{ df -h; du -sh /*; } \| claude "What's consuming space? Safe to delete?"` |
| Connection refused | `netstat -tlnp \| claude "Service listening? Port correct? Firewall rules?"` |
| SSL cert expiry | `echo \| openssl s_client -connect host:443 2>/dev/null \| openssl x509 -noout -dates \| claude "Days until expiry? Renewal steps?"` |
| DNS issues | `dig +trace domain \| claude "Where does resolution fail?"` |
|
||||
|
||||
### Terraform
|
||||
|
||||
| Task | Prompt |
|------|--------|
| Plan review | `terraform plan \| claude "Dangerous changes? Missing changes? Cost impact?"` |
| Drift analysis | `terraform plan -detailed-exitcode \| claude "What drifted? Expected? Remediation?"` |
| Module request | `claude "Generate Terraform module for [resource] with [requirements]"` |
|
||||
|
||||
## MCP Servers for DevOps
|
||||
|
||||
| Server | Purpose | Install |
|
||||
|--------|---------|---------|
|
||||
| Kubernetes | Direct cluster access | `npx -y @anthropic/mcp-kubernetes` |
|
||||
| AWS | AWS API access | `npx -y @anthropic/mcp-aws` |
|
||||
| GCP | GCP API access | `npx -y @anthropic/mcp-gcp` |
|
||||
| Prometheus | Direct metrics queries | Community: search awesome-mcp-servers |
|
||||
| Terraform | State/plan analysis | Community: search awesome-mcp-servers |
|
||||
|
||||
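**Config location**: `.mcp.json` in the project root (project scope) or `~/.claude.json` (user scope); `claude mcp add` writes these entries for you.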
|
||||
|
||||
```json
|
||||
{
|
||||
"mcpServers": {
|
||||
"kubernetes": {
|
||||
"command": "npx",
|
||||
"args": ["-y", "@anthropic/mcp-kubernetes"]
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## External Resources
|
||||
|
||||
### Awesome Lists
|
||||
|
||||
- **[awesome-claude-code-subagents](https://github.com/VoltAgent/awesome-claude-code-subagents)** (8.1k stars): Agent personas including SRE
|
||||
- **[awesome-claude-skills](https://github.com/travisvn/awesome-claude-skills)** (4.6k stars): Skills including infra-related
|
||||
|
||||
### Official Resources
|
||||
|
||||
- **[terraform-skill](https://github.com/antonbabenko/terraform-skill)**: Production-grade Terraform skill by Anton Babenko
|
||||
- **[Claude Code Docs](https://docs.anthropic.com/en/docs/claude-code)**: Official documentation
|
||||
|
||||
### Community
|
||||
|
||||
- **Anthropic Discord**: #claude-code channel
|
||||
- **Reddit**: r/ClaudeAI
|
||||
- **GitHub**: Open issues on awesome-lists for feature requests
|
||||
|
||||
---
|
||||
|
||||
## See Also
|
||||
|
||||
- **[Agent Template](../examples/agents/devops-sre.md)**: DevOps/SRE agent persona for Claude
|
||||
- **[CLAUDE.md Template](../examples/claude-md/devops-sre.md)**: Project configuration for DevOps teams
|
||||
- **[Security Hardening Guide](./security-hardening.md)**: Additional security practices
|
||||
- **[Architecture Guide](./architecture.md)**: How Claude Code works internally
|
||||
|
||||
---
|
||||
|
||||
*Contributions welcome! If you have DevOps prompts that work well, consider adding them to the awesome-lists or submitting a PR to this guide.*
|
||||
|
|
@ -10,7 +10,7 @@
|
|||
|
||||
**Last updated**: January 2026
|
||||
|
||||
**Version**: 3.9.8
|
||||
**Version**: 3.9.9
|
||||
|
||||
---
|
||||
|
||||
|
|
@ -4920,6 +4920,17 @@ git clone https://github.com/antonbabenko/terraform-skill.git terraform
|
|||
|
||||
If you create specialized skills for other domains (DevOps, data science, ML/AI, etc.), consider sharing them with the community through similar repositories or pull requests to existing collections.
|
||||
|
||||
### DevOps & SRE Guide
|
||||
|
||||
For comprehensive DevOps/SRE workflows, see **[DevOps & SRE Guide](./devops-sre.md)**:
|
||||
- **The FIRE Framework**: First Response → Investigate → Remediate → Evaluate
|
||||
- **Kubernetes troubleshooting**: Prompts by symptom (CrashLoopBackOff, OOMKilled, etc.)
|
||||
- **Incident response**: Solo and multi-agent patterns
|
||||
- **IaC patterns**: Terraform, Ansible, GitOps workflows
|
||||
- **Guardrails**: Security boundaries and team adoption checklist
|
||||
|
||||
**Quick Start**: [Agent Template](../examples/agents/devops-sre.md) | [CLAUDE.md Template](../examples/claude-md/devops-sre.md)
|
||||
|
||||
---
|
||||
|
||||
# 6. Commands
|
||||
|
|
@ -11151,4 +11162,4 @@ Thumbs.db
|
|||
|
||||
**Contributions**: Issues and PRs welcome.
|
||||
|
||||
**Last updated**: January 2026 | **Version**: 3.9.8
|
||||
**Last updated**: January 2026 | **Version**: 3.9.9
|
||||
|
|
|
|||
|
|
@ -3,7 +3,7 @@
|
|||
# Source: guide/ultimate-guide.md
|
||||
# Purpose: Condensed index for LLMs to quickly answer user questions about Claude Code
|
||||
|
||||
version: "3.9.8"
|
||||
version: "3.9.9"
|
||||
updated: "2026-01-20"
|
||||
|
||||
# ════════════════════════════════════════════════════════════════
|
||||
|
|
@ -88,33 +88,45 @@ deep_dive:
|
|||
skill_examples: 4608
|
||||
community_skills_cybersec: 4788
|
||||
community_skills_iac: 4871
|
||||
commands: 4939
|
||||
command_template: 5009
|
||||
hooks: 5262
|
||||
hook_templates: 5407
|
||||
security_hooks: 5669
|
||||
mcp_servers: 5810
|
||||
mcp_config: 6104
|
||||
mcp_security: 6472
|
||||
cicd: 6790
|
||||
ide_integration: 7479
|
||||
feedback_loops: 7549
|
||||
batch_operations: 7979
|
||||
pitfalls: 8098
|
||||
git_best_practices: 8367
|
||||
cost_optimization: 8833
|
||||
session_teleportation: 9432
|
||||
commands_table: 9608
|
||||
shortcuts_table: 9641
|
||||
troubleshooting: 9767
|
||||
cheatsheet: 10142
|
||||
daily_workflow: 10218
|
||||
# AI Ecosystem (Section 11, ~line 10494)
|
||||
ai_ecosystem: 10494
|
||||
ai_ecosystem_complementarity: 10494
|
||||
ai_ecosystem_tool_matrix: 10520
|
||||
ai_ecosystem_workflows: 10545
|
||||
ai_ecosystem_integration: 10671
|
||||
# DevOps/SRE Guide (guide/devops-sre.md)
|
||||
devops_sre_guide: "guide/devops-sre.md"
|
||||
devops_fire_framework: "guide/devops-sre.md:50"
|
||||
devops_k8s_troubleshooting: "guide/devops-sre.md:120"
|
||||
devops_k8s_prompts: "guide/devops-sre.md:160"
|
||||
devops_incident_response: "guide/devops-sre.md:340"
|
||||
devops_iac_patterns: "guide/devops-sre.md:520"
|
||||
devops_guardrails: "guide/devops-sre.md:650"
|
||||
devops_limitations: "guide/devops-sre.md:290"
|
||||
devops_quick_reference: "guide/devops-sre.md:750"
|
||||
devops_agent: "examples/agents/devops-sre.md"
|
||||
devops_claude_md: "examples/claude-md/devops-sre.md"
|
||||
commands: 4950
|
||||
command_template: 5020
|
||||
hooks: 5273
|
||||
hook_templates: 5418
|
||||
security_hooks: 5680
|
||||
mcp_servers: 5821
|
||||
mcp_config: 6115
|
||||
mcp_security: 6483
|
||||
cicd: 6801
|
||||
ide_integration: 7490
|
||||
feedback_loops: 7560
|
||||
batch_operations: 7990
|
||||
pitfalls: 8109
|
||||
git_best_practices: 8378
|
||||
cost_optimization: 8844
|
||||
session_teleportation: 9443
|
||||
commands_table: 9619
|
||||
shortcuts_table: 9652
|
||||
troubleshooting: 9778
|
||||
cheatsheet: 10153
|
||||
daily_workflow: 10229
|
||||
# AI Ecosystem (Section 11, ~line 10491)
|
||||
ai_ecosystem: 10491
|
||||
ai_ecosystem_complementarity: 10493
|
||||
ai_ecosystem_tool_matrix: 10531
|
||||
ai_ecosystem_workflows: 10556
|
||||
ai_ecosystem_integration: 10682
|
||||
ai_ecosystem_detailed: "guide/ai-ecosystem.md"
|
||||
ai_ecosystem_voice_to_text: "guide/ai-ecosystem.md:449"
|
||||
ai_ecosystem_alternative_providers: "guide/ai-ecosystem.md:959"
|
||||
|
|
@ -448,7 +460,7 @@ ecosystem:
|
|||
- "Cross-links modified → Update all 4 repos"
|
||||
history:
|
||||
- date: "2026-01-20"
|
||||
event: "Code Landing sync v3.9.8, 66 templates, cross-links"
|
||||
event: "Code Landing sync v3.9.9, 66 templates, cross-links"
|
||||
commit: "5b5ce62"
|
||||
- date: "2026-01-20"
|
||||
event: "Cowork Landing fix (paths, README, UI badges)"
|
||||
|
|
|
|||