feat: add agent/skill quality audit tooling + Grenier evaluation
AUDIT TOOLING (3 templates):
- Command: /audit-agents-skills (quick project audits)
  - 16-criteria framework (Identity 3x, Prompt 2x, Validation 1x, Design 2x)
  - Weighted scoring: 32 pts (agents/skills), 20 pts (commands)
  - Production grading (A-F, 80% threshold)
  - Fix mode with actionable suggestions
- Skill: audit-agents-skills (advanced audits)
  - 3 modes: Quick (top-5), Full (all 16), Comparative (vs templates)
  - JSON + Markdown output for CI/CD
- Scoring grids: criteria.yaml (externalized for reuse)

EVALUATION:
- Grenier agent/skill quality (3/5 - Moderate Value)
- Gap: 29.5% deploy without evaluation (LangChang 2026)
- Integration: Created audit command + skill + criteria
- Industry context: 18% cite agent bugs as top challenge

DOCUMENTATION:
- Guide refs: 2 strategic call-outs (after Agent/Skill validation)
- CHANGELOG: New "Added" section + evaluation details
- README: Templates 106→107, Evaluations 49→24 (count corrections)
- reference.yaml: 10 new audit entries + updated counts

SYNC:
- Landing index.html: Templates 107, Evals 24, Quiz 257
- Landing examples/index.html: Templates 107

FILES: 14 changed, 4148 insertions (+1250 lines new audit content)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
parent c5fad9f092
commit b48d95c024
14 changed files with 4148 additions and 13 deletions
@@ -4948,6 +4948,8 @@ Before deploying a custom agent, validate against these criteria:

> 💡 **Rule of Three**: If an agent doesn't save significant time on at least 3 recurring tasks, it's probably over-engineering. Start with skills, graduate to agents only when complexity demands it.

> **Automated audit**: Run `/audit-agents-skills` for a comprehensive quality audit across all agents, skills, and commands. Scores each file on 16 criteria with weighted grading (32 points for agents/skills, 20 for commands). See `examples/skills/audit-agents-skills/` for the full scoring methodology.

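The weighted grading mentioned in the call-out can be illustrated with a small sketch. The category weights (Identity 3x, Prompt 2x, Validation 1x, Design 2x), the 32-point maximum, and the 80% production threshold come from the commit message. Everything else is an assumption made for illustration: four pass/fail criteria per category (which happens to yield 4 × (3+2+1+2) = 32), the letter-grade bands, and all function names. The real grids live in `criteria.yaml` and may differ.

```python
# Category weights as stated in the commit message.
WEIGHTS = {"identity": 3, "prompt": 2, "validation": 1, "design": 2}

# Assumption: four pass/fail criteria per category, 16 in total.
CRITERIA_PER_CATEGORY = 4

# "80% threshold" from the commit message, expressed as a percentage.
PRODUCTION_THRESHOLD_PCT = 80


def max_points() -> int:
    """Maximum achievable score for an agent or skill file."""
    return CRITERIA_PER_CATEGORY * sum(WEIGHTS.values())


def score(passed: dict[str, int]) -> int:
    """Sum weighted points; `passed` maps category -> criteria passed."""
    total = 0
    for category, count in passed.items():
        if category not in WEIGHTS:
            raise KeyError(f"unknown category: {category}")
        if not 0 <= count <= CRITERIA_PER_CATEGORY:
            raise ValueError(f"{category}: count out of range: {count}")
        total += WEIGHTS[category] * count
    return total


def percent(points: int) -> float:
    """Score as a percentage of the maximum."""
    return 100 * points / max_points()


def grade(points: int) -> str:
    """Hypothetical letter bands; only the 80% cut-off is from the source."""
    pct = percent(points)
    for letter, floor in (("A", 90), ("B", 80), ("C", 70), ("D", 60)):
        if pct >= floor:
            return letter
    return "F"


def production_ready(points: int) -> bool:
    """True when the score meets the 80% production threshold."""
    return percent(points) >= PRODUCTION_THRESHOLD_PCT
```

Under these assumptions, a file that passes every Identity criterion but only half of the Validation and Design criteria can still fall below the production threshold, which is the point of weighting Identity most heavily.
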
## 4.5 Agent Examples

### Example 1: Code Reviewer Agent

@@ -5490,6 +5492,8 @@ skills-ref validate ./my-skill # Check frontmatter + naming conventions
skills-ref to-prompt ./my-skill # Generate <available_skills> XML for agent prompts
```

> **Beyond spec validation**: `/audit-agents-skills` extends frontmatter checks with content quality, design patterns, and production readiness scoring. Works on both skills and agents together with weighted criteria (32 points max per file).

## 5.3 Skill Template

```markdown

@@ -15985,6 +15989,193 @@ I'll decide based on our team context.

---

## 9.20 Agent Teams (Multi-Agent Coordination)

**Reading time**: 5 minutes (overview) | [Full workflow guide →](./workflows/agent-teams.md) (~30 min)
**Skill level**: Month 2+ (Advanced)
**Status**: ⚠️ Experimental (v2.1.32+, Opus 4.6 required)

### What Are Agent Teams?

**Agent teams** enable multiple Claude instances to work in parallel on a shared codebase, coordinating autonomously without human intervention. One session acts as **team lead** to break down tasks and synthesize findings from **teammate** sessions.

**Key difference from Multi-Instance** (§9.17):
- **Multi-Instance** = You manually orchestrate separate Claude sessions (independent projects, no shared state)
- **Agent Teams** = Claude manages coordination automatically (shared codebase, git-based communication)

```
Setup:
export CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1
claude

OR in ~/.claude/settings.json:
{
  "experimental": {
    "agentTeams": true
  }
}
```

### When Introduced & Production Validation

**Version**: v2.1.32 (2026-02-05) as research preview
**Model requirement**: Opus 4.6 minimum

**Production metrics** (validated cases):
- **Fountain** (workforce management): 50% faster screening, 2x conversions
- **CRED** (15M users, financial services): 2x execution speed
- **Anthropic Research**: Autonomous C compiler completion (no human intervention)

Source: [2026 Agentic Coding Trends Report](https://resources.anthropic.com/hubfs/2026%20Agentic%20Coding%20Trends%20Report.pdf), [Anthropic Engineering Blog](https://www.anthropic.com/engineering/building-c-compiler)

### Architecture Quick View

```
Team Lead (Main Session)
├─ Breaks tasks into subtasks
├─ Spawns teammate sessions (each with 1M token context)
└─ Synthesizes findings from all agents
    │
    ├─ Teammate 1: Task A (independent context)
    └─ Teammate 2: Task B (independent context)

Coordination: Git-based (task locking, continuous merge, conflict resolution)
Navigation: Shift+Up/Down or tmux to switch between agents
```

### Teams vs Multi-Instance vs Dual-Instance

| Pattern | Coordination | Best For | Cost | Setup |
|---------|--------------|----------|------|-------|
| **Agent Teams** | Automatic (git-based) | Read-heavy tasks needing coordination | High (3x+) | Experimental flag |
| **Multi-Instance** ([§9.17](#917-scaling-patterns-multi-instance-workflows)) | Manual (human) | Independent parallel tasks | Medium (2x) | Multiple terminals |
| **Dual-Instance** | Manual (human) | Quality assurance (plan-execute) | Medium (2x) | 2 terminals |

### Use Cases That Work Well

**✅ Excellent fit** (read-heavy, clear boundaries):
1. **Multi-layer code review**: Security agent + API agent + Frontend agent (Fountain: 50% faster)
2. **Parallel hypothesis testing**: Debug by testing 3 theories simultaneously
3. **Large-scale refactoring**: 47+ files across layers with clear interfaces
4. **Full codebase analysis**: Architecture review, pattern detection

**❌ Poor fit** (avoid these):
- Simple tasks (<5 files affected): coordination overhead not justified
- Write-heavy tasks (many shared file modifications): merge conflict risks
- Sequential dependencies: no parallelization benefit
- Budget-constrained projects: 3x token cost multiplier

### Quick Example: Multi-Layer Code Review

```markdown
Prompt:
"Review this PR comprehensively using agent teams:
- Security agent: Check for vulnerabilities, auth issues, data exposure
- API agent: Review endpoint design, validation, error handling
- Frontend agent: Check UI patterns, accessibility, performance

PR: https://github.com/company/repo/pull/123"

Result:
Team lead spawns 3 agents → Each analyzes their domain in parallel →
Team lead synthesizes findings → Comprehensive review in 1/3 the time
```

### Critical Limitations

**Read-heavy > Write-heavy trade-off**:
```
✅ Good: Code review (agents read, analyze, report)
✅ Good: Bug tracing (agents read logs, trace execution)
✅ Good: Architecture analysis (agents read structure)

⚠️ Risky: Refactoring shared types (merge conflicts)
⚠️ Risky: Database schema changes (coordinated migrations)
❌ Bad: Same file modified by multiple agents (conflict hell)
```

**Mitigation**: Assign non-overlapping file sets, take an interface-first approach, and define contracts before parallel work begins.

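The "non-overlapping file sets" mitigation can be checked mechanically before a team is started. The following is a minimal sketch, not part of Claude Code: the function name, the agent names, and the file paths are all hypothetical, and it assumes you have already written down which files each teammate is expected to modify.

```python
from itertools import combinations


def find_overlaps(
    assignments: dict[str, set[str]],
) -> dict[tuple[str, str], set[str]]:
    """Return files claimed by more than one agent, keyed by agent pair.

    `assignments` maps a (hypothetical) agent name to the set of file
    paths that agent is expected to modify.
    """
    overlaps: dict[tuple[str, str], set[str]] = {}
    # Compare every unordered pair of agents exactly once.
    for (agent_a, files_a), (agent_b, files_b) in combinations(
        sorted(assignments.items()), 2
    ):
        shared = files_a & files_b
        if shared:
            overlaps[(agent_a, agent_b)] = shared
    return overlaps


# Illustrative plan: two agents both intend to edit the shared types file.
plan = {
    "api": {"src/api/routes.ts", "src/types/user.ts"},
    "frontend": {"src/ui/Profile.tsx", "src/types/user.ts"},
    "security": {"src/auth/session.ts"},
}
conflicts = find_overlaps(plan)
```

An empty result means the plan has no shared files. Any non-empty entry marks a file that should either be given to a single owner or settled as an interface contract before parallel work starts.
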
**Token intensity**: 3x+ cost multiplier (3 agents = 3 model inferences). Only justified when time saved > cost increase.

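The "time saved > cost increase" rule only works once both sides are in the same unit. A rough sketch of that comparison follows, assuming you can estimate the single-agent token spend, the hours a team would save, and what an hour is worth to you. The function name and every number are illustrative; only the 3x multiplier comes from the text above.

```python
def teams_worth_it(
    single_agent_cost: float,
    cost_multiplier: float,
    hours_saved: float,
    hourly_rate: float,
) -> bool:
    """Compare the extra token spend with the value of the time saved.

    All inputs are estimates supplied by the reader, in one currency.
    """
    if cost_multiplier < 1:
        raise ValueError("cost_multiplier must be at least 1")
    # Extra spend relative to running the task with a single agent.
    extra_cost = single_agent_cost * (cost_multiplier - 1)
    # Monetary value of the wall-clock time the team is expected to save.
    value_of_time = hours_saved * hourly_rate
    return value_of_time > extra_cost


# Illustrative numbers only: a 10.00 task at 3x costs 20.00 extra,
# which one saved hour valued at 50.00 would cover.
review_pays_off = teams_worth_it(10.0, 3.0, 1.0, 50.0)
```

Because the multiplier is described as "3x+", it is worth rerunning the comparison with a higher value before committing to a large task.
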
**Experimental status**: No stability guarantee, bugs expected, feature may change. Report issues to [Anthropic GitHub](https://github.com/anthropics/claude-code/issues).

### Decision Tree: When to Use Agent Teams

```
Is task simple (<5 files)? ──YES──> Single agent
    │
    NO
    │
Tasks completely independent? ──YES──> Multi-Instance (§9.17)
    │
    NO
    │
Need quality assurance split? ──YES──> Dual-Instance
    │
    NO
    │
Read-heavy (analysis, review)? ──YES──> Agent Teams ✓
    │
    NO
    │
Write-heavy (many file mods)? ──YES──> Single agent
    │
    NO
    │
Budget-constrained? ──YES──> Single agent
    │
    NO
    │
Complex coordination needed? ──YES──> Agent Teams ✓
                             ──NO──> Single agent
```

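The decision tree above reads as an ordered series of checks, so it can be written as a plain function. This sketch mirrors the branches in the same order; the `Task` fields and the `recommend` name are hypothetical, and each answer remains a judgment call made by whoever fills in the fields.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Answers to the questions in the decision tree (all self-assessed)."""

    files_affected: int
    independent: bool = False
    needs_qa_split: bool = False
    read_heavy: bool = False
    write_heavy: bool = False
    budget_constrained: bool = False
    complex_coordination: bool = False


def recommend(task: Task) -> str:
    """Walk the tree top to bottom and return the first matching pattern."""
    if task.files_affected < 5:
        return "Single agent"
    if task.independent:
        return "Multi-Instance"
    if task.needs_qa_split:
        return "Dual-Instance"
    if task.read_heavy:
        return "Agent Teams"
    if task.write_heavy:
        return "Single agent"
    if task.budget_constrained:
        return "Single agent"
    if task.complex_coordination:
        return "Agent Teams"
    return "Single agent"


# A 40-file, read-heavy architecture review lands on Agent Teams.
choice = recommend(Task(files_affected=40, read_heavy=True))
```

Order matters here: because the read-heavy check comes before the budget check, a read-heavy task is routed to Agent Teams even when the budget is tight, exactly as the tree is drawn.
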
### Practitioner Testimonial

**Paul Rayner** (CEO Virtual Genius, EventStorming Handbook author):
> "Running 3 concurrent agent team sessions across separate terminals. Pretty impressive compared to previous multi-terminal workflows without coordination."

**Workflows used** (Feb 2026):
1. Job search app: Design research + bug fixing
2. Business ops: Operating system + conference planning
3. Infrastructure: Playwright MCP + beads framework management

Source: [Paul Rayner LinkedIn](https://www.linkedin.com/posts/thepaulrayner_this-is-wild-i-just-upgraded-claude-code-activity-7425635159678414850-MNyv)

### Navigation Between Agents

**Built-in controls**:
- **Shift+Up/Down**: Switch between sub-agents
- **tmux**: Use tmux commands if in a tmux session
- **Direct takeover**: Take control of any agent's work mid-execution

**Monitoring**: Each agent reports progress; the team lead synthesizes once all agents have finished.

### Full Documentation

This section is a quick overview. For the complete guide, see:
- **[Agent Teams Workflow](./workflows/agent-teams.md)** (~30 min, 10 sections)
  - Architecture deep-dive (team lead, teammates, git coordination)
  - Setup instructions (2 methods)
  - 5 production use cases with metrics
  - Workflow impact analysis (before/after)
  - Limitations & gotchas (read/write trade-offs)
  - Decision framework (Teams vs Multi-Instance vs Beads)
  - Best practices, troubleshooting

**Related patterns**:
- [§9.17 Multi-Instance Workflows](#917-scaling-patterns-multi-instance-workflows): Manual parallel coordination
- [§4.3 Sub-Agents](#43-sub-agents): Single-agent task delegation
- [AI Ecosystem: Beads Framework](./ai-ecosystem.md): Alternative orchestration (Gas Town)

**Official sources**:
- [Introducing Claude Opus 4.6](https://www.anthropic.com/news/claude-opus-4-6) (Anthropic, Feb 2026)
- [Building a C compiler with agent teams](https://www.anthropic.com/engineering/building-c-compiler) (Anthropic Engineering, Feb 2026)
- [2026 Agentic Coding Trends Report](https://resources.anthropic.com/hubfs/2026%20Agentic%20Coding%20Trends%20Report.pdf) (Anthropic, Jan 2026)

---

## 🎯 Section 9 Recap: Pattern Mastery Checklist

Before moving to Section 10 (Reference), verify you understand:

@@ -16016,6 +16207,7 @@ Before moving to Section 10 (Reference), verify you understand:
- [ ] **Session Teleportation**: Migrate sessions between cloud and local environments
- [ ] **Background Tasks**: Run tasks in cloud while working locally (`%` prefix)
- [ ] **Multi-Instance Scaling**: Understand when/how to orchestrate parallel Claude instances (advanced teams only)
- [ ] **Agent Teams**: Multi-agent coordination for read-heavy tasks (experimental, Opus 4.6+)
- [ ] **Permutation Frameworks**: Systematically test multiple approaches before committing

### What's Next?

1220
guide/workflows/agent-teams.md
Normal file
File diff suppressed because it is too large