claude-code-ultimate-guide/examples/commands/audit-agents-skills.md
Florian BRUNIAUX b48d95c024 feat: add agent/skill quality audit tooling + Grenier evaluation
AUDIT TOOLING (3 templates):
- Command: /audit-agents-skills (quick project audits)
  - 16-criteria framework (Identity 3x, Prompt 2x, Validation 1x, Design 2x)
  - Weighted scoring: 32 pts (agents/skills), 20 pts (commands)
  - Production grading (A-F, 80% threshold)
  - Fix mode with actionable suggestions
- Skill: audit-agents-skills (advanced audits)
  - 3 modes: Quick (top-5), Full (all 16), Comparative (vs templates)
  - JSON + Markdown output for CI/CD
- Scoring grids: criteria.yaml (externalized for reuse)

EVALUATION:
- Grenier agent/skill quality (3/5 - Moderate Value)
  - Gap: 29.5% deploy without evaluation (LangChang 2026)
  - Integration: Created audit command + skill + criteria
  - Industry context: 18% cite agent bugs as top challenge

DOCUMENTATION:
- Guide refs: 2 strategic call-outs (after Agent/Skill validation)
- CHANGELOG: New "Added" section + evaluation details
- README: Templates 106→107, Evaluations 49→24 (count corrections)
- reference.yaml: 10 new audit entries + updated counts

SYNC:
- Landing index.html: Templates 107, Evals 24, Quiz 257
- Landing examples/index.html: Templates 107

FILES: 14 changed, 4148 insertions (+1250 lines new audit content)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-07 15:40:18 +01:00

475 lines
16 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

---
name: audit-agents-skills
description: Audit quality of agents, skills, and commands in a Claude Code project
argument-hint: "[path] [--fix] [--verbose]"
---
# Audit Agents/Skills/Commands Quality
Comprehensive quality audit for Claude Code agents, skills, and commands. Scores each file on weighted criteria with production readiness grading.
## Arguments
- `[path]` - Directory to audit (default: current project `.claude/`)
- `--fix` - Generate fix suggestions for failing criteria
- `--verbose` - Show details for all criteria (not just failures)
## Usage
```bash
/audit-agents-skills # Audit current project
/audit-agents-skills --fix # Audit + fix suggestions
/audit-agents-skills ~/other-repo # Audit another project
/audit-agents-skills --verbose # Full details for all criteria
```
---
## Phase 1: Discovery
**Objective**: Locate and classify all agents, skills, and commands
### Steps
1. **Scan directories**:
```
.claude/agents/
.claude/skills/
.claude/commands/
examples/agents/ (if exists)
examples/skills/ (if exists)
examples/commands/ (if exists)
```
2. **Classify files**:
- **Agent**: File in `agents/` directory with YAML frontmatter containing `tools:` field
- **Skill**: File in `skills/` directory OR has `SKILL.md` name OR frontmatter with `allowed-tools:` field
- **Command**: File in `commands/` directory with frontmatter containing `name:` and `description:`
3. **Display summary**:
```
Found: X agents, Y skills, Z commands
```
---
## Phase 2: Audit Individual Files
Each file type is scored on **weighted criteria**. Maximum scores:
- **Agents**: 32 points
- **Skills**: 32 points
- **Commands**: 20 points
### Agents (32 points max)
#### Identity (weight: 3x) - 12 points
| Criterion | Points | Detection |
|-----------|--------|-----------|
| Clear `name` field | 3 | Frontmatter YAML has `name:` field that's descriptive (not generic like "agent1") |
| `description` with triggers | 3 | Description contains "when", "use", or "trigger" keywords indicating activation context |
| `model` specified | 3 | Frontmatter has `model:` field (sonnet/haiku/opus) |
| `tools` restricted appropriately | 3 | Tools list doesn't include Bash unless justified, or includes explanation for risky tools |
**Rationale**: Identity determines **discoverability** and **activation**. If users can't locate or invoke the agent, downstream quality is irrelevant.
#### Prompt Quality (weight: 2x) - 8 points
| Criterion | Points | Detection |
|-----------|--------|-----------|
| Role defined | 2 | Contains "You are" or "Your role" statement defining agent persona |
| Output format specified | 2 | Has section titled "Output", "Format", or "Deliverables" specifying expected structure |
| Scope/limits defined | 2 | Has section defining scope, triggers, or when NOT to use the agent |
| Anti-hallucination measures | 2 | Contains keywords: "verify", "cite", "source", "evidence", or warnings against hallucination |
**Rationale**: Prompt quality determines **reliability** and **accuracy** of agent responses.
#### Validation (weight: 1x) - 4 points
| Criterion | Points | Detection |
|-----------|--------|-----------|
| 3+ usage examples | 1 | Has "Examples", "Usage", or "Scenarios" section with at least 3 distinct examples |
| Edge cases documented | 1 | Mentions "edge case", "error", "failure", or "limitation" scenarios |
| Integration documented | 1 | References other agents, skills, or tools it works with |
| Error handling described | 1 | Mentions "fallback", "recovery", "error handling", or failure modes |
**Rationale**: Validation ensures **robustness** through comprehensive testing scenarios.
#### Design (weight: 2x) - 8 points
| Criterion | Points | Detection |
|-----------|--------|-----------|
| Single responsibility | 2 | File size <5000 tokens AND description is focused (not "general purpose" or multiple verbs) |
| No duplication | 2 | Description doesn't overlap significantly with other agents (>50% keyword similarity check) |
| Composable (skills references) | 2 | References skills or other agents it can invoke, showing modularity |
| Reasonable token budget | 2 | File size <8000 tokens (avoids context bloat) |
**Rationale**: Design patterns determine **maintainability** and **scalability** of agent architecture.
---
### Skills (32 points max)
#### Structure (weight: 3x) - 12 points
| Criterion | Points | Detection |
|-----------|--------|-----------|
| Valid SKILL.md or frontmatter | 3 | File named `SKILL.md` OR has YAML frontmatter with `name:` field |
| `name` valid | 3 | Name is lowercase, 1-64 chars, matches pattern `[a-z0-9-]+` (no spaces/special chars) |
| `description` non-empty | 3 | Description field exists and is >20 characters |
| `allowed-tools` specified | 3 | Frontmatter has `allowed-tools:` field listing tool permissions |
**Rationale**: Structure compliance ensures **spec compatibility** with Claude Code runtime.
#### Content (weight: 2x) - 8 points
| Criterion | Points | Detection |
|-----------|--------|-----------|
| Methodology/workflow described | 2 | Has section titled "Methodology", "Workflow", "Process", or numbered steps |
| Output format specified | 2 | Has section specifying deliverable format (Markdown, JSON, report structure) |
| Examples provided | 2 | Has "Examples", "Usage", or "Scenarios" section with concrete instances |
| Checklists included | 2 | Contains Markdown checkbox syntax `- [ ]` or `- [x]` for actionable items |
**Rationale**: Content richness determines **usability** and **learning curve**.
#### Technical (weight: 1x) - 4 points
| Criterion | Points | Detection |
|-----------|--------|-----------|
| Scripts have error handling | 1 | If bundled scripts exist, contain `set -e`, `trap`, or `|| exit` patterns |
| No hardcoded paths | 1 | No absolute paths like `/Users/`, `/home/`, `C:\` in code or instructions |
| No secrets | 1 | No keywords: "password", "secret", "token", "api_key", "credentials" in plaintext |
| Dependencies documented | 1 | If external tools required, has "Requirements", "Dependencies", or "Prerequisites" section |
**Rationale**: Technical hygiene prevents **portability issues** and **security risks**.
#### Design (weight: 2x) - 8 points
| Criterion | Points | Detection |
|-----------|--------|-----------|
| Single responsibility | 2 | Description is focused on one domain (not "general" or multi-purpose) |
| Clear triggers | 2 | Has section defining "When to use", "Triggers", or "Activation criteria" |
| No overlap with other skills | 2 | Description doesn't duplicate >50% of keywords from other skills in project |
| Portable | 2 | No Claude Code-specific extensions that break portability (check for custom APIs) |
**Rationale**: Design determines **findability** and **maintainability** across projects.
---
### Commands (20 points max)
#### Structure (weight: 3x) - 12 points
| Criterion | Points | Detection |
|-----------|--------|-----------|
| Valid frontmatter | 3 | Has YAML frontmatter with both `name:` and `description:` fields |
| `argument-hint` if takes args | 3 | If `$ARGUMENTS` variable is used in body, frontmatter has `argument-hint:` field |
| Step-by-step workflow | 3 | Body contains numbered sections (1., 2., 3.) or clear phase structure |
| Usage examples | 3 | Has section titled "Usage", "Examples", or shows invocation patterns |
**Rationale**: Structure determines **usability** and **learnability** for command users.
#### Quality (weight: 2x) - 8 points
| Criterion | Points | Detection |
|-----------|--------|-----------|
| Error handling | 2 | Mentions "error", "failure", "fallback", or conditional paths for failures |
| Output format defined | 2 | Specifies what command outputs (report, file, summary) and its structure |
| Validation gates | 2 | Contains checkpoints, verification steps, or "before proceeding" checks |
| Arguments parsed properly | 2 | If takes args, shows how to parse/validate `$ARGUMENTS` (default values, validation) |
**Rationale**: Quality determines **reliability** and **production readiness**.
---
## Phase 3: Scoring
### Individual File Score
```
Score = (Points Obtained / Max Points) × 100
```
**Example**: Agent scores 26/32 points → 81% score
### Grade Assignment
| Grade | Score Range | Status |
|-------|-------------|--------|
| A | 90-100% | Production-ready ✅ |
| B | 80-89% | Good (production threshold) ⚠️ |
| C | 70-79% | Needs improvement 🔧 |
| D | 60-69% | Significant gaps ⚠️ |
| F | <60% | Critical issues |
**Production Threshold**: 80% (Grade B or higher)
### Overall Project Score
Weighted average by file type:
```
Overall = (Σ Agent Scores × Agent Count + Σ Skill Scores × Skill Count + Σ Command Scores × Command Count) / Total Files
```
---
## Phase 4: Report Generation
### Report Structure
```markdown
# Audit: Agents/Skills/Commands
**Project**: {path}
**Date**: {date}
**Overall Score**: {score}% ({grade})
**Files Audited**: {total} ({n} agents, {n} skills, {n} commands)
**Production Ready**: {count} files ({percentage}%)
---
## Summary
| Type | Files | Avg Score | Grade | Production Ready |
|------|-------|-----------|-------|------------------|
| Agents | X | Y% | Z | N/X (%) |
| Skills | X | Y% | Z | N/X (%) |
| Commands | X | Y% | Z | N/X (%) |
---
## Individual Scores
| File | Type | Score | Grade | Top Issues |
|------|------|-------|-------|------------|
| agent-name.md | Agent | 85% | B | Missing anti-hallucination measures, no edge cases |
| skill-name/ | Skill | 72% | C | Hardcoded paths, no error handling |
| command.md | Command | 95% | A | None |
---
## Top Issues (Across All Files)
1. **Missing error handling** (8 files affected)
- Impact: Runtime failures unhandled
- Fix: Add error handling sections, fallback strategies
2. **Hardcoded paths** (5 files affected)
- Impact: Portability broken across systems
- Fix: Use relative paths or environment variables
3. **No usage examples** (4 files affected)
- Impact: Poor learnability, unclear invocation
- Fix: Add "Examples" section with 3+ scenarios
---
## Detailed Breakdown
<details>
<summary>agent-name.md (Agent, 85%, Grade B)</summary>
### Scores by Category
| Category | Points | Max | Pass |
|----------|--------|-----|------|
| Identity | 12 | 12 | ✅ |
| Prompt Quality | 6 | 8 | ⚠️ |
| Validation | 2 | 4 | ❌ |
| Design | 6 | 8 | ⚠️ |
### Failed Criteria
-**Anti-hallucination measures** (2 pts): No keywords found for source verification
-**Edge cases documented** (1 pt): No mention of failure scenarios
-**Integration documented** (1 pt): No references to other agents/skills
### Recommendations
1. Add "Source Verification" section requiring citation of claims
2. Document edge cases: API failures, timeout scenarios, invalid input
3. List compatible skills/agents for composition patterns
</details>
---
## Recommendations (Prioritized)
### High Priority (Critical for production)
1. **Add error handling to 8 files**
- Files: [list]
- Action: Add error handling sections, define fallback behaviors
2. **Remove hardcoded paths from 5 files**
- Files: [list]
- Action: Replace with `$HOME`, relative paths, or env vars
### Medium Priority (Improves quality)
3. **Add usage examples to 4 files**
- Files: [list]
- Action: Create "Examples" section with 3+ scenarios
4. **Define output formats in 3 files**
- Files: [list]
- Action: Specify deliverable structure (Markdown/JSON/report)
### Low Priority (Polish)
5. **Add integration docs to 2 files**
- Files: [list]
- Action: List compatible agents/skills for composition
---
## Next Steps
1. Review failures: Focus on Grade D/F files first
2. Run with `--fix` for automated suggestions
3. Re-audit after improvements to track progress
4. Aim for 80%+ (Grade B) across all files for production readiness
```
---
## Phase 5: Fix Mode (Optional)
**Trigger**: `--fix` flag
For each failing criterion, generate specific fix suggestion:
### Example Fix Suggestions
**File**: `agent-name.md`
**Issue**: Missing anti-hallucination measures (2 pts lost)
**Suggested Fix**:
```markdown
Add this section after the "Methodology" section:
## Source Verification
- Always cite sources for factual claims
- Use phrases like "According to [source]..." or "Based on [documentation]..."
- If uncertain, explicitly state "I don't have verified information on..."
- Never invent statistics, version numbers, or API details
```
**File**: `skill-debugging/scripts/analyze.sh`
**Issue**: No error handling (1 pt lost)
**Suggested Fix**:
```bash
Add to top of script:
set -e # Exit on error
trap 'echo "Error on line $LINENO"' ERR
# Replace risky commands:
curl https://api.example.com # ❌ No error check
curl https://api.example.com || { # ✅ Error handled
echo "API call failed"
exit 1
}
```
---
## Verbose Mode (Optional)
**Trigger**: `--verbose` flag
By default, report shows only **failed criteria**. Verbose mode shows **all criteria** with pass/fail status:
```markdown
### All Criteria (Verbose)
| Criterion | Status | Points | Notes |
|-----------|--------|--------|-------|
| Clear name | ✅ Pass | 3/3 | Name is "debugging-specialist" (descriptive) |
| Description with triggers | ✅ Pass | 3/3 | Contains "Use when debugging..." |
| Model specified | ❌ Fail | 0/3 | No `model:` field in frontmatter |
| Tools restricted | ⚠️ Partial | 2/3 | Includes Bash but no justification |
| ... | ... | ... | ... |
```
---
## Industry Context
**Source**: LangChain Agent Report 2026 (verified)
**Key Statistics**:
- 29.5% of organizations deploy agents without systematic evaluation
- 18% cite "agent bugs" as their top challenge
- Only 12% use automated quality checks
**Implication**: This audit addresses a **real industry gap**. Most teams deploy agents/skills without validation, leading to production issues. The 80% threshold (Grade B) aligns with industry best practices for production readiness.
**Comparison**: Manual checklists (like the Guide's Agent Validation Checklist on line 4921) are comprehensive but error-prone. Automated scoring reduces human error and provides quantitative metrics for tracking improvements over time.
---
## Related
- **Agent Validation Checklist** (guide line 4921): Manual 16-criteria checklist
- **Skill Validation** (guide line 5491): Spec compliance documentation
- **Examples**: `examples/agents/`, `examples/skills/`, `examples/commands/`
- **Advanced Audit**: Use `audit-agents-skills` skill (see `examples/skills/`) for comparative analysis vs templates
---
## Implementation Notes
### Detection Patterns
**Frontmatter Parsing**:
```python
import re
yaml_match = re.search(r'^---\n(.*?)\n---', content, re.DOTALL)
if yaml_match:
import yaml
frontmatter = yaml.safe_load(yaml_match.group(1))
```
**Keyword Detection** (case-insensitive):
```python
has_trigger = any(word in description.lower() for word in ['when', 'use', 'trigger'])
```
**Token Counting** (approximate):
```python
tokens = len(content.split()) * 1.3 # Rough estimate: 1 token ≈ 0.75 words
```
### Overlap Detection
Compare descriptions using Jaccard similarity:
```python
def jaccard_similarity(desc1, desc2):
words1 = set(desc1.lower().split())
words2 = set(desc2.lower().split())
intersection = words1 & words2
union = words1 | words2
return len(intersection) / len(union) if union else 0
# Flag if similarity > 0.5 (50% keyword overlap)
```
### Grade Color Coding (Terminal Output)
```python
COLORS = {
'A': '\033[92m', # Green
'B': '\033[93m', # Yellow
'C': '\033[93m', # Yellow
'D': '\033[91m', # Red
'F': '\033[91m' # Red
}
```
---
**Command ready for use**: `/audit-agents-skills`