AUDIT TOOLING (3 templates): - Command: /audit-agents-skills (quick project audits) - 16-criteria framework (Identity 3x, Prompt 2x, Validation 1x, Design 2x) - Weighted scoring: 32 pts (agents/skills), 20 pts (commands) - Production grading (A-F, 80% threshold) - Fix mode with actionable suggestions - Skill: audit-agents-skills (advanced audits) - 3 modes: Quick (top-5), Full (all 16), Comparative (vs templates) - JSON + Markdown output for CI/CD - Scoring grids: criteria.yaml (externalized for reuse) EVALUATION: - Grenier agent/skill quality (3/5 - Moderate Value) - Gap: 29.5% deploy without evaluation (LangChang 2026) - Integration: Created audit command + skill + criteria - Industry context: 18% cite agent bugs as top challenge DOCUMENTATION: - Guide refs: 2 strategic call-outs (after Agent/Skill validation) - CHANGELOG: New "Added" section + evaluation details - README: Templates 106→107, Evaluations 49→24 (count corrections) - reference.yaml: 10 new audit entries + updated counts SYNC: - Landing index.html: Templates 107, Evals 24, Quiz 257 - Landing examples/index.html: Templates 107 FILES: 14 changed, 4148 insertions (+1250 lines new audit content) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
475 lines
16 KiB
Markdown
475 lines
16 KiB
Markdown
---
|
||
name: audit-agents-skills
|
||
description: Audit quality of agents, skills, and commands in a Claude Code project
|
||
argument-hint: "[path] [--fix] [--verbose]"
|
||
---
|
||
|
||
# Audit Agents/Skills/Commands Quality
|
||
|
||
Comprehensive quality audit for Claude Code agents, skills, and commands. Scores each file on weighted criteria with production readiness grading.
|
||
|
||
## Arguments
|
||
|
||
- `[path]` - Directory to audit (default: current project `.claude/`)
|
||
- `--fix` - Generate fix suggestions for failing criteria
|
||
- `--verbose` - Show details for all criteria (not just failures)
|
||
|
||
## Usage
|
||
|
||
```bash
|
||
/audit-agents-skills # Audit current project
|
||
/audit-agents-skills --fix # Audit + fix suggestions
|
||
/audit-agents-skills ~/other-repo # Audit another project
|
||
/audit-agents-skills --verbose # Full details for all criteria
|
||
```
|
||
|
||
---
|
||
|
||
## Phase 1: Discovery
|
||
|
||
**Objective**: Locate and classify all agents, skills, and commands
|
||
|
||
### Steps
|
||
|
||
1. **Scan directories**:
|
||
```
|
||
.claude/agents/
|
||
.claude/skills/
|
||
.claude/commands/
|
||
examples/agents/ (if exists)
|
||
examples/skills/ (if exists)
|
||
examples/commands/ (if exists)
|
||
```
|
||
|
||
2. **Classify files**:
|
||
- **Agent**: File in `agents/` directory with YAML frontmatter containing `tools:` field
|
||
- **Skill**: File in `skills/` directory OR has `SKILL.md` name OR frontmatter with `allowed-tools:` field
|
||
- **Command**: File in `commands/` directory with frontmatter containing `name:` and `description:`
|
||
|
||
3. **Display summary**:
|
||
```
|
||
Found: X agents, Y skills, Z commands
|
||
```
|
||
|
||
---
|
||
|
||
## Phase 2: Audit Individual Files
|
||
|
||
Each file type is scored on **weighted criteria**. Maximum scores:
|
||
- **Agents**: 32 points
|
||
- **Skills**: 32 points
|
||
- **Commands**: 20 points
|
||
|
||
### Agents (32 points max)
|
||
|
||
#### Identity (weight: 3x) - 12 points
|
||
|
||
| Criterion | Points | Detection |
|
||
|-----------|--------|-----------|
|
||
| Clear `name` field | 3 | Frontmatter YAML has `name:` field that's descriptive (not generic like "agent1") |
|
||
| `description` with triggers | 3 | Description contains "when", "use", or "trigger" keywords indicating activation context |
|
||
| `model` specified | 3 | Frontmatter has `model:` field (sonnet/haiku/opus) |
|
||
| `tools` restricted appropriately | 3 | Tools list doesn't include Bash unless justified, or includes explanation for risky tools |
|
||
|
||
**Rationale**: Identity determines **discoverability** and **activation**. If users can't locate or invoke the agent, downstream quality is irrelevant.
|
||
|
||
#### Prompt Quality (weight: 2x) - 8 points
|
||
|
||
| Criterion | Points | Detection |
|
||
|-----------|--------|-----------|
|
||
| Role defined | 2 | Contains "You are" or "Your role" statement defining agent persona |
|
||
| Output format specified | 2 | Has section titled "Output", "Format", or "Deliverables" specifying expected structure |
|
||
| Scope/limits defined | 2 | Has section defining scope, triggers, or when NOT to use the agent |
|
||
| Anti-hallucination measures | 2 | Contains keywords: "verify", "cite", "source", "evidence", or warnings against hallucination |
|
||
|
||
**Rationale**: Prompt quality determines **reliability** and **accuracy** of agent responses.
|
||
|
||
#### Validation (weight: 1x) - 4 points
|
||
|
||
| Criterion | Points | Detection |
|
||
|-----------|--------|-----------|
|
||
| 3+ usage examples | 1 | Has "Examples", "Usage", or "Scenarios" section with at least 3 distinct examples |
|
||
| Edge cases documented | 1 | Mentions "edge case", "error", "failure", or "limitation" scenarios |
|
||
| Integration documented | 1 | References other agents, skills, or tools it works with |
|
||
| Error handling described | 1 | Mentions "fallback", "recovery", "error handling", or failure modes |
|
||
|
||
**Rationale**: Validation ensures **robustness** through comprehensive testing scenarios.
|
||
|
||
#### Design (weight: 2x) - 8 points
|
||
|
||
| Criterion | Points | Detection |
|
||
|-----------|--------|-----------|
|
||
| Single responsibility | 2 | File size <5000 tokens AND description is focused (not "general purpose" or multiple verbs) |
|
||
| No duplication | 2 | Description doesn't overlap significantly with other agents (>50% keyword similarity check) |
|
||
| Composable (skills references) | 2 | References skills or other agents it can invoke, showing modularity |
|
||
| Reasonable token budget | 2 | File size <8000 tokens (avoids context bloat) |
|
||
|
||
**Rationale**: Design patterns determine **maintainability** and **scalability** of agent architecture.
|
||
|
||
---
|
||
|
||
### Skills (32 points max)
|
||
|
||
#### Structure (weight: 3x) - 12 points
|
||
|
||
| Criterion | Points | Detection |
|
||
|-----------|--------|-----------|
|
||
| Valid SKILL.md or frontmatter | 3 | File named `SKILL.md` OR has YAML frontmatter with `name:` field |
|
||
| `name` valid | 3 | Name is lowercase, 1-64 chars, matches pattern `[a-z0-9-]+` (no spaces/special chars) |
|
||
| `description` non-empty | 3 | Description field exists and is >20 characters |
|
||
| `allowed-tools` specified | 3 | Frontmatter has `allowed-tools:` field listing tool permissions |
|
||
|
||
**Rationale**: Structure compliance ensures **spec compatibility** with Claude Code runtime.
|
||
|
||
#### Content (weight: 2x) - 8 points
|
||
|
||
| Criterion | Points | Detection |
|
||
|-----------|--------|-----------|
|
||
| Methodology/workflow described | 2 | Has section titled "Methodology", "Workflow", "Process", or numbered steps |
|
||
| Output format specified | 2 | Has section specifying deliverable format (Markdown, JSON, report structure) |
|
||
| Examples provided | 2 | Has "Examples", "Usage", or "Scenarios" section with concrete instances |
|
||
| Checklists included | 2 | Contains Markdown checkbox syntax `- [ ]` or `- [x]` for actionable items |
|
||
|
||
**Rationale**: Content richness determines **usability** and **learning curve**.
|
||
|
||
#### Technical (weight: 1x) - 4 points
|
||
|
||
| Criterion | Points | Detection |
|
||
|-----------|--------|-----------|
|
||
| Scripts have error handling | 1 | If bundled scripts exist, contain `set -e`, `trap`, or `|| exit` patterns |
|
||
| No hardcoded paths | 1 | No absolute paths like `/Users/`, `/home/`, `C:\` in code or instructions |
|
||
| No secrets | 1 | No keywords: "password", "secret", "token", "api_key", "credentials" in plaintext |
|
||
| Dependencies documented | 1 | If external tools required, has "Requirements", "Dependencies", or "Prerequisites" section |
|
||
|
||
**Rationale**: Technical hygiene prevents **portability issues** and **security risks**.
|
||
|
||
#### Design (weight: 2x) - 8 points
|
||
|
||
| Criterion | Points | Detection |
|
||
|-----------|--------|-----------|
|
||
| Single responsibility | 2 | Description is focused on one domain (not "general" or multi-purpose) |
|
||
| Clear triggers | 2 | Has section defining "When to use", "Triggers", or "Activation criteria" |
|
||
| No overlap with other skills | 2 | Description doesn't duplicate >50% of keywords from other skills in project |
|
||
| Portable | 2 | No Claude Code-specific extensions that break portability (check for custom APIs) |
|
||
|
||
**Rationale**: Design determines **findability** and **maintainability** across projects.
|
||
|
||
---
|
||
|
||
### Commands (20 points max)
|
||
|
||
#### Structure (weight: 3x) - 12 points
|
||
|
||
| Criterion | Points | Detection |
|
||
|-----------|--------|-----------|
|
||
| Valid frontmatter | 3 | Has YAML frontmatter with both `name:` and `description:` fields |
|
||
| `argument-hint` if takes args | 3 | If `$ARGUMENTS` variable is used in body, frontmatter has `argument-hint:` field |
|
||
| Step-by-step workflow | 3 | Body contains numbered sections (1., 2., 3.) or clear phase structure |
|
||
| Usage examples | 3 | Has section titled "Usage", "Examples", or shows invocation patterns |
|
||
|
||
**Rationale**: Structure determines **usability** and **learnability** for command users.
|
||
|
||
#### Quality (weight: 2x) - 8 points
|
||
|
||
| Criterion | Points | Detection |
|
||
|-----------|--------|-----------|
|
||
| Error handling | 2 | Mentions "error", "failure", "fallback", or conditional paths for failures |
|
||
| Output format defined | 2 | Specifies what command outputs (report, file, summary) and its structure |
|
||
| Validation gates | 2 | Contains checkpoints, verification steps, or "before proceeding" checks |
|
||
| Arguments parsed properly | 2 | If takes args, shows how to parse/validate `$ARGUMENTS` (default values, validation) |
|
||
|
||
**Rationale**: Quality determines **reliability** and **production readiness**.
|
||
|
||
---
|
||
|
||
## Phase 3: Scoring
|
||
|
||
### Individual File Score
|
||
|
||
```
|
||
Score = (Points Obtained / Max Points) × 100
|
||
```
|
||
|
||
**Example**: Agent scores 26/32 points → 81% score
|
||
|
||
### Grade Assignment
|
||
|
||
| Grade | Score Range | Status |
|
||
|-------|-------------|--------|
|
||
| A | 90-100% | Production-ready ✅ |
|
||
| B | 80-89% | Good (production threshold) ⚠️ |
|
||
| C | 70-79% | Needs improvement 🔧 |
|
||
| D | 60-69% | Significant gaps ⚠️ |
|
||
| F | <60% | Critical issues ❌ |
|
||
|
||
**Production Threshold**: 80% (Grade B or higher)
|
||
|
||
### Overall Project Score
|
||
|
||
Weighted average by file type:
|
||
```
|
||
Overall = (Σ Agent Scores × Agent Count + Σ Skill Scores × Skill Count + Σ Command Scores × Command Count) / Total Files
|
||
```
|
||
|
||
---
|
||
|
||
## Phase 4: Report Generation
|
||
|
||
### Report Structure
|
||
|
||
```markdown
|
||
# Audit: Agents/Skills/Commands
|
||
|
||
**Project**: {path}
|
||
**Date**: {date}
|
||
**Overall Score**: {score}% ({grade})
|
||
**Files Audited**: {total} ({n} agents, {n} skills, {n} commands)
|
||
**Production Ready**: {count} files ({percentage}%)
|
||
|
||
---
|
||
|
||
## Summary
|
||
|
||
| Type | Files | Avg Score | Grade | Production Ready |
|
||
|------|-------|-----------|-------|------------------|
|
||
| Agents | X | Y% | Z | N/X (%) |
|
||
| Skills | X | Y% | Z | N/X (%) |
|
||
| Commands | X | Y% | Z | N/X (%) |
|
||
|
||
---
|
||
|
||
## Individual Scores
|
||
|
||
| File | Type | Score | Grade | Top Issues |
|
||
|------|------|-------|-------|------------|
|
||
| agent-name.md | Agent | 85% | B | Missing anti-hallucination measures, no edge cases |
|
||
| skill-name/ | Skill | 72% | C | Hardcoded paths, no error handling |
|
||
| command.md | Command | 95% | A | None |
|
||
|
||
---
|
||
|
||
## Top Issues (Across All Files)
|
||
|
||
1. **Missing error handling** (8 files affected)
|
||
- Impact: Runtime failures unhandled
|
||
- Fix: Add error handling sections, fallback strategies
|
||
|
||
2. **Hardcoded paths** (5 files affected)
|
||
- Impact: Portability broken across systems
|
||
- Fix: Use relative paths or environment variables
|
||
|
||
3. **No usage examples** (4 files affected)
|
||
- Impact: Poor learnability, unclear invocation
|
||
- Fix: Add "Examples" section with 3+ scenarios
|
||
|
||
---
|
||
|
||
## Detailed Breakdown
|
||
|
||
<details>
|
||
<summary>agent-name.md (Agent, 85%, Grade B)</summary>
|
||
|
||
### Scores by Category
|
||
|
||
| Category | Points | Max | Pass |
|
||
|----------|--------|-----|------|
|
||
| Identity | 12 | 12 | ✅ |
|
||
| Prompt Quality | 6 | 8 | ⚠️ |
|
||
| Validation | 2 | 4 | ❌ |
|
||
| Design | 6 | 8 | ⚠️ |
|
||
|
||
### Failed Criteria
|
||
|
||
- ❌ **Anti-hallucination measures** (2 pts): No keywords found for source verification
|
||
- ❌ **Edge cases documented** (1 pt): No mention of failure scenarios
|
||
- ❌ **Integration documented** (1 pt): No references to other agents/skills
|
||
|
||
### Recommendations
|
||
|
||
1. Add "Source Verification" section requiring citation of claims
|
||
2. Document edge cases: API failures, timeout scenarios, invalid input
|
||
3. List compatible skills/agents for composition patterns
|
||
|
||
</details>
|
||
|
||
---
|
||
|
||
## Recommendations (Prioritized)
|
||
|
||
### High Priority (Critical for production)
|
||
|
||
1. **Add error handling to 8 files**
|
||
- Files: [list]
|
||
- Action: Add error handling sections, define fallback behaviors
|
||
|
||
2. **Remove hardcoded paths from 5 files**
|
||
- Files: [list]
|
||
- Action: Replace with `$HOME`, relative paths, or env vars
|
||
|
||
### Medium Priority (Improves quality)
|
||
|
||
3. **Add usage examples to 4 files**
|
||
- Files: [list]
|
||
- Action: Create "Examples" section with 3+ scenarios
|
||
|
||
4. **Define output formats in 3 files**
|
||
- Files: [list]
|
||
- Action: Specify deliverable structure (Markdown/JSON/report)
|
||
|
||
### Low Priority (Polish)
|
||
|
||
5. **Add integration docs to 2 files**
|
||
- Files: [list]
|
||
- Action: List compatible agents/skills for composition
|
||
|
||
---
|
||
|
||
## Next Steps
|
||
|
||
1. Review failures: Focus on Grade D/F files first
|
||
2. Run with `--fix` for automated suggestions
|
||
3. Re-audit after improvements to track progress
|
||
4. Aim for 80%+ (Grade B) across all files for production readiness
|
||
```
|
||
|
||
---
|
||
|
||
## Phase 5: Fix Mode (Optional)
|
||
|
||
**Trigger**: `--fix` flag
|
||
|
||
For each failing criterion, generate specific fix suggestion:
|
||
|
||
### Example Fix Suggestions
|
||
|
||
**File**: `agent-name.md`
|
||
**Issue**: Missing anti-hallucination measures (2 pts lost)
|
||
|
||
**Suggested Fix**:
|
||
```markdown
|
||
Add this section after the "Methodology" section:
|
||
|
||
## Source Verification
|
||
|
||
- Always cite sources for factual claims
|
||
- Use phrases like "According to [source]..." or "Based on [documentation]..."
|
||
- If uncertain, explicitly state "I don't have verified information on..."
|
||
- Never invent statistics, version numbers, or API details
|
||
```
|
||
|
||
**File**: `skill-debugging/scripts/analyze.sh`
|
||
**Issue**: No error handling (1 pt lost)
|
||
|
||
**Suggested Fix**:
|
||
```bash
|
||
Add to top of script:
|
||
|
||
set -e # Exit on error
|
||
trap 'echo "Error on line $LINENO"' ERR
|
||
|
||
# Replace risky commands:
|
||
curl https://api.example.com # ❌ No error check
|
||
curl https://api.example.com || { # ✅ Error handled
|
||
echo "API call failed"
|
||
exit 1
|
||
}
|
||
```
|
||
|
||
---
|
||
|
||
## Verbose Mode (Optional)
|
||
|
||
**Trigger**: `--verbose` flag
|
||
|
||
By default, report shows only **failed criteria**. Verbose mode shows **all criteria** with pass/fail status:
|
||
|
||
```markdown
|
||
### All Criteria (Verbose)
|
||
|
||
| Criterion | Status | Points | Notes |
|
||
|-----------|--------|--------|-------|
|
||
| Clear name | ✅ Pass | 3/3 | Name is "debugging-specialist" (descriptive) |
|
||
| Description with triggers | ✅ Pass | 3/3 | Contains "Use when debugging..." |
|
||
| Model specified | ❌ Fail | 0/3 | No `model:` field in frontmatter |
|
||
| Tools restricted | ⚠️ Partial | 2/3 | Includes Bash but no justification |
|
||
| ... | ... | ... | ... |
|
||
```
|
||
|
||
---
|
||
|
||
## Industry Context
|
||
|
||
**Source**: LangChain Agent Report 2026 (verified)
|
||
|
||
**Key Statistics**:
|
||
- 29.5% of organizations deploy agents without systematic evaluation
|
||
- 18% cite "agent bugs" as their top challenge
|
||
- Only 12% use automated quality checks
|
||
|
||
**Implication**: This audit addresses a **real industry gap**. Most teams deploy agents/skills without validation, leading to production issues. The 80% threshold (Grade B) aligns with industry best practices for production readiness.
|
||
|
||
**Comparison**: Manual checklists (like the Guide's Agent Validation Checklist on line 4921) are comprehensive but error-prone. Automated scoring reduces human error and provides quantitative metrics for tracking improvements over time.
|
||
|
||
---
|
||
|
||
## Related
|
||
|
||
- **Agent Validation Checklist** (guide line 4921): Manual 16-criteria checklist
|
||
- **Skill Validation** (guide line 5491): Spec compliance documentation
|
||
- **Examples**: `examples/agents/`, `examples/skills/`, `examples/commands/`
|
||
- **Advanced Audit**: Use `audit-agents-skills` skill (see `examples/skills/`) for comparative analysis vs templates
|
||
|
||
---
|
||
|
||
## Implementation Notes
|
||
|
||
### Detection Patterns
|
||
|
||
**Frontmatter Parsing**:
|
||
```python
|
||
import re
|
||
yaml_match = re.search(r'^---\n(.*?)\n---', content, re.DOTALL)
|
||
if yaml_match:
|
||
import yaml
|
||
frontmatter = yaml.safe_load(yaml_match.group(1))
|
||
```
|
||
|
||
**Keyword Detection** (case-insensitive):
|
||
```python
|
||
has_trigger = any(word in description.lower() for word in ['when', 'use', 'trigger'])
|
||
```
|
||
|
||
**Token Counting** (approximate):
|
||
```python
|
||
tokens = len(content.split()) * 1.3 # Rough estimate: 1 token ≈ 0.75 words
|
||
```
|
||
|
||
### Overlap Detection
|
||
|
||
Compare descriptions using Jaccard similarity:
|
||
```python
|
||
def jaccard_similarity(desc1, desc2):
|
||
words1 = set(desc1.lower().split())
|
||
words2 = set(desc2.lower().split())
|
||
intersection = words1 & words2
|
||
union = words1 | words2
|
||
return len(intersection) / len(union) if union else 0
|
||
|
||
# Flag if similarity > 0.5 (50% keyword overlap)
|
||
```
|
||
|
||
### Grade Color Coding (Terminal Output)
|
||
|
||
```python
|
||
COLORS = {
|
||
'A': '\033[92m', # Green
|
||
'B': '\033[93m', # Yellow
|
||
'C': '\033[93m', # Yellow
|
||
'D': '\033[91m', # Red
|
||
'F': '\033[91m' # Red
|
||
}
|
||
```
|
||
|
||
---
|
||
|
||
**Command ready for use**: `/audit-agents-skills`
|