Reorganize 22 guide files from a flat directory into 5 thematic subdirs: - core/ (architecture, methodologies, known-issues, claude-code-releases, visual-reference) - security/ (security-hardening, sandbox-isolation, sandbox-native, production-safety, data-privacy) - ecosystem/ (ai-ecosystem, mcp-servers-ecosystem, third-party-tools, remarkable-ai) - roles/ (ai-roles, adoption-approaches, learning-with-ai, agent-evaluation) - ops/ (devops-sre, observability, ai-traceability) All internal links updated across ~50 files (ultimate-guide.md, workflows/, diagrams/, README.md, docs/, tools/, examples/, machine-readable/). Also: merge search-tools-cheatsheet.md into cheatsheet.md, rewrite guide/README.md with H2 grouped sections, update CLAUDE.md repository structure. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
12 KiB
| title | description | tags | |||
|---|---|---|---|---|---|
| Agent Evaluation | Metrics, patterns, and tools for measuring custom agent effectiveness |
|
Agent Evaluation
Quick nav: Why Evaluate? · Metrics to Track · Implementation · Example · Tools
Why Evaluate Agents?
When you create custom agents in .claude/agents/, you're encoding specialized expertise into reusable workflows. But how do you know if your agents are actually effective?
Without evaluation, you're building blind:
- ❌ No way to measure if agent responses are improving or degrading over time
- ❌ Can't compare different agent configurations objectively
- ❌ Difficult to identify which aspects of agent context/instructions need refinement
- ❌ No data to justify investment in agent development
With evaluation, you iterate with confidence:
- ✅ Quantify agent quality through metrics (response time, accuracy, tool usage)
- ✅ A/B test different agent configurations with measurable outcomes
- ✅ Identify patterns in successful vs failed interactions
- ✅ Build feedback loops for continuous improvement
Core principle: Agents are code. Like all code, they need tests, metrics, and observability.
Metrics to Track
1. Response Quality Metrics
What to measure:
- Task completion rate: Did the agent accomplish the stated goal?
- Correctness: Were the agent's outputs factually accurate?
- Relevance: Did the response stay on-topic and address the actual question?
- Hallucination rate: How often did the agent invent information?
How to track:
# Post-response hook: .claude/hooks/log-response-quality.sh
# Triggered after each agent response
# Log structure:
{
"timestamp": "2026-02-10T14:32:00Z",
"agent_id": "backend-architect",
"task_completed": true,
"correctness_score": 4.5, # User rating 1-5
"hallucinations": 0,
"response_tokens": 1250
}
Implementation tip: Use user feedback prompts (thumbs up/down) or automated checks (test suite passing after agent code generation).
2. Tool Usage Metrics
What to measure:
- Tool call success rate: Percentage of tool calls that executed without errors
- Tool selection accuracy: Did agent choose the right tool for the task?
- Tool call efficiency: Minimum calls to achieve goal (avoid unnecessary reads/searches)
- Error recovery: Did agent handle tool failures gracefully?
How to track:
# Post-tool-use hook: .claude/hooks/log-tool-usage.sh
# Triggered after each tool call
# Log structure:
{
"timestamp": "2026-02-10T14:32:05Z",
"agent_id": "backend-architect",
"tool_name": "Read",
"tool_success": true,
"tool_parameters": {"file_path": "src/auth.ts"},
"execution_time_ms": 45
}
Implementation tip: Use Claude Code hooks system (see examples/hooks/) to automatically log tool calls.
3. Performance Metrics
What to measure:
- Response time: Total time from user prompt to complete response
- Token efficiency: Input/output tokens used per task
- Context utilization: How much of context window was used?
- Cost per task: API cost for the full interaction
How to track:
# Session-end hook: .claude/hooks/log-performance.sh
# Triggered at end of session
# Log structure:
{
"timestamp": "2026-02-10T14:35:00Z",
"agent_id": "backend-architect",
"session_duration_s": 180,
"input_tokens": 3500,
"output_tokens": 2800,
"total_cost_usd": 0.15,
"context_utilization": 0.42
}
Implementation tip: Parse Claude Code session logs or use MCP observability tools.
4. User Satisfaction Metrics
What to measure:
- Explicit feedback: User ratings, comments, bug reports
- Implicit signals: Did user accept agent's suggestions? Did they retry the prompt?
- Adoption rate: How often is this agent used vs alternatives?
- Retention: Do users return to this agent for similar tasks?
How to track:
# Manual feedback collection
# After agent completes task, prompt user:
"Rate this agent's performance (1-5): _"
# Log:
{
"timestamp": "2026-02-10T14:35:10Z",
"agent_id": "backend-architect",
"user_rating": 5,
"user_comment": "Perfect analysis of auth flow",
"would_use_again": true
}
Implementation tip: Add feedback prompts to agent templates or use post-session surveys.
Implementation Patterns
Pattern 1: Logging Hook System
Use Case: Automatically track all agent interactions without manual intervention
Setup:
# .claude/hooks/post-tool-use.sh
#!/bin/bash
# Triggered after every tool call
AGENT_ID=$(echo "$CLAUDE_AGENT_ID" | jq -r)
TOOL_NAME=$(echo "$CLAUDE_TOOL_NAME" | jq -r)
TOOL_SUCCESS=$(echo "$CLAUDE_TOOL_SUCCESS" | jq -r)
# Append to metrics log
echo "{\"timestamp\":\"$(date -Iseconds)\",\"agent\":\"$AGENT_ID\",\"tool\":\"$TOOL_NAME\",\"success\":$TOOL_SUCCESS}" \
>> .claude/logs/agent-metrics.jsonl
Pros: Zero manual overhead, complete coverage, time-series data Cons: Requires parsing Claude Code environment variables (may change across versions)
Pattern 2: Agent Unit Tests
Use Case: Regression testing to ensure agent improvements don't break existing capabilities
Setup:
# tests/agents/backend-architect.test.sh
#!/bin/bash
# Test 1: Agent correctly identifies hexagonal architecture layers
echo "Test: Hexagonal architecture analysis"
RESULT=$(claude agent backend-architect "Analyze src/auth.ts for layer violations")
if echo "$RESULT" | grep -q "domain layer"; then
echo "✅ PASS: Identified layers"
else
echo "❌ FAIL: Did not identify layers"
exit 1
fi
# Test 2: Agent recommends correct patterns
echo "Test: Pattern recommendations"
RESULT=$(claude agent backend-architect "Improve error handling in src/api.ts")
if echo "$RESULT" | grep -q "Result<T, E>"; then
echo "✅ PASS: Recommended Result pattern"
else
echo "❌ FAIL: Incorrect pattern"
exit 1
fi
Pros: Automated, catches regressions, CI/CD integration Cons: Requires maintenance, may have false positives/negatives
Pattern 3: A/B Testing Configurations
Use Case: Compare two versions of agent to determine which performs better
Setup:
# .claude/agents/backend-architect-v1.md (control)
name: backend-architect
version: 1.0
instructions: |
You are a backend architect specializing in...
[original instructions]
# .claude/agents/backend-architect-v2.md (experiment)
name: backend-architect-v2
version: 2.0
instructions: |
You are a backend architect specializing in...
[modified instructions with new pattern emphasis]
Evaluation:
# Run same task with both agents, compare metrics
# Task: "Analyze src/auth.ts for security issues"
# Version 1 metrics:
# - Response time: 45s
# - Issues found: 3
# - User rating: 4/5
# Version 2 metrics:
# - Response time: 38s
# - Issues found: 5 (2 additional critical issues)
# - User rating: 5/5
# Conclusion: Version 2 is more thorough and faster → promote to production
Pros: Data-driven decisions, quantifiable improvements Cons: Requires discipline to run controlled experiments
Pattern 4: Feedback Loop Integration
Use Case: Continuously improve agent based on real-world usage data
Setup:
# After agent completes task
echo "How would you rate this response? (1-5, or 'skip'): "
read RATING
if [ "$RATING" != "skip" ]; then
echo "Any specific feedback?: "
read COMMENT
# Log feedback
echo "{\"timestamp\":\"$(date -Iseconds)\",\"agent\":\"$AGENT_ID\",\"rating\":$RATING,\"comment\":\"$COMMENT\"}" \
>> .claude/logs/agent-feedback.jsonl
fi
# Weekly: Review feedback.jsonl, identify patterns
# Monthly: Update agent instructions based on aggregated feedback
Pros: Aligns agent with actual user needs, identifies edge cases Cons: Requires manual review and action on feedback
Example: Agent with Evaluation
Full template available: examples/agents/analytics-with-eval/ includes complete agent definition, hooks, analysis scripts, and report template.
Setup: Analytics Agent with Built-in Metrics
# .claude/agents/analytics-agent.md
---
name: analytics-agent
description: SQL query generator with evaluation hooks
version: 1.0
tools:
- Read
- Write
- Bash
hooks:
post_response: .claude/hooks/log-analytics-metrics.sh
---
# Analytics Agent
You are an expert SQL analyst helping users query databases.
## Evaluation Criteria
After each query:
1. **Correctness**: Does query produce expected results?
2. **Performance**: Query execution time < 5s?
3. **Safety**: No destructive operations (DELETE, DROP, TRUNCATE)?
4. **Best practices**: Uses proper JOINs, indexes, parameterized queries?
## Instructions
[... agent instructions ...]
Metrics Hook
# .claude/hooks/log-analytics-metrics.sh
#!/bin/bash
# Triggered after analytics-agent response
# Extract query from response (naive grep, improve with jq)
QUERY=$(echo "$CLAUDE_RESPONSE" | grep -oP 'SELECT.*?;')
if [ -n "$QUERY" ]; then
# Test query (requires database connection)
EXEC_TIME=$( (time psql -U user -d db -c "$QUERY") 2>&1 | grep real | awk '{print $2}')
# Check for destructive operations
if echo "$QUERY" | grep -iE 'DELETE|DROP|TRUNCATE'; then
SAFETY="FAIL"
else
SAFETY="PASS"
fi
# Log metrics
echo "{\"timestamp\":\"$(date -Iseconds)\",\"query\":\"$QUERY\",\"exec_time\":\"$EXEC_TIME\",\"safety\":\"$SAFETY\"}" \
>> .claude/logs/analytics-metrics.jsonl
fi
Analysis
# Monthly review: Analyze metrics
jq -s 'group_by(.safety) | map({safety: .[0].safety, count: length})' \
.claude/logs/analytics-metrics.jsonl
# Output:
# [
# {"safety": "PASS", "count": 127},
# {"safety": "FAIL", "count": 3}
# ]
# Action: Review 3 failed queries, update agent instructions to prevent future violations
Tools & References
Open-Source Evaluation Frameworks
nao (Analytics Agents)
What it provides:
- Built-in evaluation framework for analytics agents
- Unit testing capabilities for agent responses
- Metrics collection (response quality, tool usage, performance)
- Feedback loop integration
How to adapt for Claude Code:
- Context builder pattern: Apply nao's structured context approach to
.claude/agents/config - Evaluation hooks: Translate nao's evaluation framework to Claude Code hooks system
- Metrics schema: Use nao's metrics schema as template for your logs
Status: Production-ready, actively maintained, TypeScript + Python
Claude Code Native Patterns
Hooks system: .claude/hooks/ for automated logging (see examples/hooks/README.md)
Agents directory: .claude/agents/ for custom agent definitions (see guide/ultimate-guide.md Section 4)
MCP observability: Use MCP servers for advanced logging and metrics aggregation
Best Practices
Start Simple
Week 1: Add basic logging hook (tool calls only) Week 2: Add user feedback prompt (manual ratings) Week 3: Build dashboard to visualize metrics Week 4: Run first A/B test on agent configuration
Focus on Actionable Metrics
Don't track metrics you won't act on. Prioritize:
- Task completion rate → Refine agent instructions
- Tool call errors → Improve context or add examples
- User ratings → Identify confusing or unhelpful responses
Automate Where Possible
Manual evaluation doesn't scale. Use:
- Hooks for automatic logging
- CI/CD integration for agent unit tests
- Scripts for periodic metric aggregation
Build Feedback Loops
Metrics are useless without action:
- Weekly: Review metrics, identify patterns
- Monthly: Update agent instructions based on data
- Quarterly: Major agent refactoring if needed
Related Sections
- Agents: Creating custom agents
- Hooks: Automation with event hooks
- Observability: Logging and monitoring strategies
- AI Ecosystem: External frameworks like nao
Next steps:
- Add logging hook to your most-used agent
- Collect 1 week of metrics
- Analyze and refine agent based on data
Template: See examples/agents/analytics-with-eval/ for complete implementation with hooks, scripts, and report template