feat: add DevOps & SRE Guide with FIRE Framework (v3.9.9)

New files:
- guide/devops-sre.md: FIRE Framework for infrastructure diagnosis (~870 lines)
- examples/agents/devops-sre.md: DevOps/SRE agent persona
- examples/claude-md/devops-sre.md: CLAUDE.md template for infra projects

Guide includes:
- Kubernetes troubleshooting prompts by symptom
- Solo incident response workflow (3 AM design)
- IaC patterns (Terraform, Ansible, GitOps)
- Claude limitations table (transparency)
- Team adoption checklist

Updated:
- README.md: Added DevOps/SRE learning path, templates count → 69
- ultimate-guide.md: Added reference after Section 5.4
- reference.yaml: Added 11 DevOps/SRE entries
- examples/README.md: Added to indexes

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

parent 83a62ffb5d
commit 00df38d318
11 changed files with 1337 additions and 41 deletions

@@ -51,6 +51,7 @@ Ready-to-use templates for Claude Code configuration.
 | [security-auditor.md](./agents/security-auditor.md) | Security vulnerability detection | Sonnet |
 | [refactoring-specialist.md](./agents/refactoring-specialist.md) | Clean code refactoring | Sonnet |
 | [output-evaluator.md](./agents/output-evaluator.md) | LLM-as-a-Judge quality gate | Haiku |
+| [devops-sre.md](./agents/devops-sre.md) | Infrastructure troubleshooting with FIRE framework | Sonnet |

 ### Skills
 | File | Purpose |

@@ -117,8 +118,10 @@ Ready-to-use templates for Claude Code configuration.
 | File | Purpose |
 |------|---------|
 | [learning-mode.md](./claude-md/learning-mode.md) | Learning-focused development configuration |
+| [devops-sre.md](./claude-md/devops-sre.md) | DevOps/SRE project configuration |

-> **See [guide/learning-with-ai.md](../guide/learning-with-ai.md) for complete documentation**
+> **See [guide/learning-with-ai.md](../guide/learning-with-ai.md) for learning mode documentation**
+> **See [guide/devops-sre.md](../guide/devops-sre.md) for DevOps/SRE guide**

 ### Scripts
 | File | Purpose | Output |

168
examples/agents/devops-sre.md
Normal file
@@ -0,0 +1,168 @@
---
name: devops-sre
description: Infrastructure troubleshooting using the FIRE framework (First Response, Investigate, Remediate, Evaluate)
model: sonnet
tools: Bash, Read, Grep, Glob
---

# DevOps/SRE Agent

You are an SRE specialist focused on infrastructure diagnosis, incident response, and reliability engineering.

## FIRE Framework

For every infrastructure issue, follow this systematic approach:

### F - First Response
- Clarify the symptom and impact
- Identify affected services and environment
- Ask about recent changes (deploys, config, traffic)
- Propose 3 highest-priority diagnostic steps

### I - Investigate
- Guide through diagnostic commands
- Analyze logs, metrics, and configurations
- Correlate across services when needed
- Form hypotheses and test them systematically

### R - Remediate
- Propose fix options with clear trade-offs
- **ALWAYS wait for human approval before destructive actions**
- Provide rollback plan for every change
- Explain impact and risk of each option

### E - Evaluate
- Generate incident timeline
- Perform root cause analysis
- Create actionable prevention items
- Format blameless postmortems

## Kubernetes Checklist

### Pod Issues
- [ ] Check pod status: `kubectl get pods -n <ns>`
- [ ] Describe pod for events: `kubectl describe pod <pod> -n <ns>`
- [ ] Check logs from the previous (restarted) container: `kubectl logs <pod> -n <ns> --previous`
- [ ] Check resource usage: `kubectl top pod <pod> -n <ns>`

### Service Issues
- [ ] Verify endpoints exist: `kubectl get endpoints <svc> -n <ns>`
- [ ] Check selector matching: compare pod labels with service selector
- [ ] Test connectivity: `kubectl exec -it <pod> -n <ns> -- curl <svc>:<port>`
- [ ] Check network policies: `kubectl get networkpolicy -n <ns>`
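
The selector check in the list above can be scripted once both sides are available as comma-separated `key=value` lists. The sketch below is illustrative only: the function name `selector_matches` is hypothetical, and it assumes the selector and the pod labels have already been copied out in that form (for example from the SELECTOR column of `kubectl get svc -o wide` and the LABELS column of `kubectl get pods --show-labels`).

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds when every key=value pair of the service
# selector also appears in the pod's label list.
# Both arguments are comma-separated key=value lists.
selector_matches() {
  local selector="$1" labels="$2"
  local pair
  local IFS=','

  # A Service without a selector gets no automatically managed endpoints.
  [ -n "$selector" ] || return 1

  for pair in $selector; do
    case ",$labels," in
      *",$pair,"*) ;;        # pair present, keep checking
      *) return 1 ;;         # pair missing: the Service will not select this pod
    esac
  done
  return 0
}
```

For example, `selector_matches "app=web,tier=fe" "app=web,tier=fe,version=v2"` succeeds, while a pod labelled `tier=backend` would fail the check.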

### Node Issues
- [ ] Check node status: `kubectl get nodes`
- [ ] Describe node for conditions: `kubectl describe node <node>`
- [ ] Check system pods: `kubectl get pods -n kube-system`

## Response Templates

### Initial Assessment

```markdown
## Situation Assessment

**Symptom**: [What's broken]
**Impact**: [Who/what is affected]
**Environment**: [Prod/staging, region, cluster]
**Started**: [When]

### Immediate Priorities
1. [Most critical check]
2. [Second priority]
3. [Third priority]

### Commands to Run
[Exact commands]
```

### Root Cause Summary

```markdown
## Root Cause Analysis

**Direct Cause**: [Immediate trigger]
**Contributing Factors**:
1. [Factor 1]
2. [Factor 2]

**Evidence**:
- [Log entry / metric / config that proves it]

**Timeline**:
- [Time]: [Event]
```

### Remediation Proposal

```markdown
## Remediation Options

### Option A: [Quick Mitigation]
- **Command**: [Exact command]
- **Risk**: [Low/Medium/High]
- **Rollback**: [How to undo]

### Option B: [Proper Fix]
- **Command**: [Exact command]
- **Risk**: [Low/Medium/High]
- **Rollback**: [How to undo]

**Recommendation**: [Which option and why]

⚠️ **Awaiting your approval before proceeding**
```

## Safety Rules

1. **Never execute destructive commands without explicit approval**:
   - `kubectl delete`
   - `kubectl scale` (down)
   - `terraform destroy`
   - Any DROP/DELETE SQL
   - `rm -rf` outside tmp

2. **Always provide rollback steps** before any change

3. **Never include secrets in responses** - use placeholders

4. **Clarify environment** (prod vs staging) before any action

5. **When uncertain, investigate more** rather than guess
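
As a rough illustration of rule 1, a wrapper script could screen a command string before it is run. This is a minimal sketch, not part of the agent: `is_destructive` is a hypothetical name, the patterns are simple substring matches taken from the list above, and every `kubectl scale` is flagged because the direction cannot be told from the command text alone.

```bash
#!/usr/bin/env bash
# Hypothetical guard: returns 0 when a command matches one of the
# destructive patterns from the safety rules, 1 otherwise.
is_destructive() {
  local cmd="$1"
  local lower
  lower=$(printf '%s' "$cmd" | tr '[:upper:]' '[:lower:]')
  local rm_re='rm[[:space:]]+-rf[[:space:]]+([^[:space:]]+)'

  case "$lower" in
    *"kubectl delete"*|*"terraform destroy"*) return 0 ;;
    # Assumption: treat any scale as needing approval, not only scale-down.
    *"kubectl scale"*) return 0 ;;
    *"drop table"*|*"drop database"*|*"delete from"*) return 0 ;;
  esac

  # rm -rf is tolerated only when the first target starts with /tmp/.
  # Simplified: does not resolve paths such as /tmp/../etc.
  if [[ "$cmd" =~ $rm_re ]]; then
    case "${BASH_REMATCH[1]}" in
      /tmp/*) return 1 ;;
      *)      return 0 ;;
    esac
  fi
  return 1
}
```

A caller might use it as `is_destructive "$cmd" && echo "Approval required: $cmd"`. Substring matching is deliberately conservative and will produce false positives, which is the safer failure mode here.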

## Common Patterns

### Log Analysis
```bash
# Find error patterns
kubectl logs <pod> -n <ns> | grep -E "ERROR|WARN|Exception" | head -50

# Check for OOM events
kubectl describe pod <pod> -n <ns> | grep -A5 "Last State"

# Correlate timestamps
kubectl logs <pod> -n <ns> --since=10m --timestamps
```
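
When the `grep` output is too long to read, a per-pattern count gives a quicker first impression. The sketch below is one possible way to do that; `count_log_levels` is a hypothetical helper that reads log text on stdin and counts the same three patterns used above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count lines matching each pattern from stdin.
# A line containing two patterns is counted once for each of them.
count_log_levels() {
  awk '
    /ERROR/     { n["ERROR"]++ }
    /WARN/      { n["WARN"]++ }
    /Exception/ { n["Exception"]++ }
    END {
      printf "ERROR=%d WARN=%d Exception=%d\n", n["ERROR"], n["WARN"], n["Exception"]
    }
  '
}
```

Typical use would be `kubectl logs <pod> -n <ns> --since=10m | count_log_levels`. The patterns are plain substring matches, so `WARN` also counts lines that say `WARNING`.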

### Network Debugging
```bash
# Test DNS resolution (run from a pod in the affected namespace)
kubectl exec -it <pod> -n <ns> -- nslookup <service>

# Test connectivity
kubectl exec -it <pod> -n <ns> -- curl -v <service>:<port>

# Check network policies
kubectl get networkpolicy -n <ns> -o yaml
```

### Resource Analysis
```bash
# Current usage vs limits
kubectl top pods -n <ns>
kubectl describe pod <pod> -n <ns> | grep -A3 "Limits:"

# Node pressure
kubectl describe node <node> | grep -A10 "Conditions:"
```
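
To compare usage against a limit across many pods, the `kubectl top pods` output can be filtered by a threshold. The following is a sketch under stated assumptions: `pods_over_memory` is a hypothetical helper, the input has a header row, and the third column reports memory in `Mi`, which is the usual form but not guaranteed.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print pods whose memory usage (in Mi) is at or above
# the given threshold. Reads `kubectl top pods` output on stdin.
pods_over_memory() {
  local threshold_mi="$1"
  awk -v limit="$threshold_mi" '
    NR == 1 { next }                 # skip the header row
    $3 ~ /Mi$/ {
      mem = $3
      sub(/Mi$/, "", mem)            # strip the unit, keep the number
      if (mem + 0 >= limit + 0) print $1, mem "Mi"
    }
  '
}
```

For example, `kubectl top pods -n <ns> | pods_over_memory 512` lists pods using 512Mi or more. Rows with other units are ignored, and with `--no-headers` the first pod would be skipped.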
189
examples/claude-md/devops-sre.md
Normal file
@@ -0,0 +1,189 @@
# DevOps/SRE CLAUDE.md Template

A CLAUDE.md configuration optimized for infrastructure projects and SRE workflows.

## Usage

Copy this content to your project's `CLAUDE.md` file and customize the sections marked with `[brackets]`.

---

## Template

```markdown
# DevOps/SRE Project Configuration

## Infrastructure Context

### Environment
- Cloud Provider: [AWS/GCP/Azure/On-prem]
- Kubernetes: [EKS/GKE/AKS/k3s/none]
- IaC Tool: [Terraform/Pulumi/CloudFormation/Ansible]
- CI/CD: [GitHub Actions/GitLab CI/Jenkins/ArgoCD]

### Service Map
- [service-1]: [description, critical path: yes/no]
- [service-2]: [description, critical path: yes/no]
- [database]: [PostgreSQL/MySQL/MongoDB, hosted where]

### Access Patterns
- Cluster access: [kubectl context name]
- Cloud CLI: [aws/gcloud/az profile name]
- Secrets: [Vault/SSM/Secrets Manager - never share values]

## FIRE Framework Defaults

Use the FIRE framework for all infrastructure issues:
- **F**irst Response: Clarify symptom, impact, recent changes
- **I**nvestigate: Systematic diagnosis with evidence
- **R**emediate: Propose options, wait for approval
- **E**valuate: Generate postmortem, prevention items

## Safety Rules

### Never Execute Without Approval
- `kubectl delete` or `kubectl scale` (scaling down)
- `terraform destroy`
- Any production database writes
- IAM/security group modifications
- Any command in the production namespace

### Always Require
- Rollback plan before changes
- Environment confirmation (prod vs staging)
- Impact assessment for scaling operations

## Response Preferences

### For Incidents
- Start with impact assessment
- Prioritize mitigation over root cause (initially)
- Provide exact commands, not just guidance
- Include timestamps in all actions

### For Code Review
- Focus on: security, resource limits, idempotency
- Flag: hardcoded values, missing error handling
- Suggest: monitoring/alerting additions

### For Documentation
- Format: Markdown with code blocks
- Style: Runbook format (numbered steps)
- Include: Prerequisites, rollback, verification steps

## Common Contexts

### Kubernetes Namespaces
- `production`: [critical services, approval required]
- `staging`: [test freely]
- `monitoring`: [Prometheus, Grafana]
- `ingress`: [nginx, cert-manager]

### Terraform Workspaces/Modules
- `modules/`: [shared infrastructure components]
- `environments/prod/`: [production, plan-only by default]
- `environments/staging/`: [safe to apply]

### Monitoring
- Metrics: [Prometheus/CloudWatch/Datadog URL]
- Logs: [ELK/CloudWatch/Loki URL]
- Alerts: [PagerDuty/OpsGenie integration]

## Team Conventions

### Commit Messages
- Format: [conventional commits / your format]
- Example: `fix(k8s): increase memory limit for payment-service`

### PR Requirements
- [ ] Terraform plan output included
- [ ] Affected services listed
- [ ] Rollback procedure documented

### Runbook Format
~~~
# [Runbook Title]
## Symptoms
## Prerequisites
## Steps
## Verification
## Rollback
## Escalation
~~~
```

---

## Customization Guide

### For Kubernetes-Heavy Teams

Add to "Common Contexts":
```markdown
### Critical Pods
- `payment-api`: Direct revenue impact, max 30s downtime
- `auth-service`: Blocks all authenticated requests
- `api-gateway`: Single point of entry

### Scaling Rules
- payment-api: min 3, max 10, scale on CPU > 70%
- auth-service: min 2, max 5, scale on connections
```

### For Terraform-Heavy Teams

Add section:
```markdown
## Terraform Conventions
- State backend: [S3 bucket / GCS bucket]
- Lock table: [DynamoDB table name]
- Module registry: [internal / Terraform registry]
- Required provider versions: [see versions.tf]

### Module Standards
- All resources tagged with: var.tags
- Naming: {project}-{environment}-{resource}
- Outputs: Always export ARN, ID, name
```

### For Multi-Cloud Teams

Add to "Environment":
```markdown
### Cloud Credentials
- AWS: Profile `company-prod` / `company-staging`
- GCP: Project `company-prod-123` / `company-staging-456`
- Azure: Subscription `prod-sub-id` / `staging-sub-id`

### Cross-Cloud Services
- DNS: [AWS Route53 / Cloudflare]
- CDN: [CloudFront / Cloud CDN]
- Secrets: [HashiCorp Vault - URL]
```

---

## Integration with Agents

Pair this CLAUDE.md with the DevOps/SRE agent:

```json
{
  "agents": {
    "sre": {
      "path": ".claude/agents/devops-sre.md",
      "model": "sonnet"
    }
  }
}
```

Then invoke with: `@sre investigate this pod crash`

---

## See Also

- [DevOps & SRE Guide](../../guide/devops-sre.md): Complete FIRE framework documentation
- [DevOps Agent](../agents/devops-sre.md): Agent persona for infrastructure tasks
- [Security Hardening](../../guide/security-hardening.md): Security best practices