Major addition: Complete agent evaluation framework with production-ready template. ## Added - **Resource Evaluation**: nao framework (score 3/5) - Identified critical gap: agent evaluation not documented - Technical challenge adjusted score 2/5 → 3/5 - All claims fact-checked (TypeScript 58.9%, Python 38.5%) - **Guide Section**: Agent Evaluation (guide/agent-evaluation.md, ~3K tokens) - Metrics: response quality, tool usage, performance, satisfaction - Patterns: logging hooks, unit tests, A/B testing, feedback loops - Example: analytics agent with built-in metrics - Tools: nao framework reference, Claude Code hooks integration - **AI Ecosystem**: Section 8.2 Domain-Specific Agent Frameworks - nao (Analytics Agents): Database-agnostic, built-in evaluation - Transposable patterns: context builder, evaluation hooks, DB integrations - **Template**: Analytics Agent with Evaluation (5 files, ~1K lines) - README: setup, usage, troubleshooting - Agent: SQL generator with evaluation criteria, safety rules - Hook: automated metrics logging (safety, performance, errors) - Script: analysis with stats, safety reports, recommendations - Report template: monthly evaluation format ## Changed - Agent Evaluation Guide: updated template references, verified links - Landing Site: templates count 110 → 114 - Version: 3.23.5 → 3.24.0 Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
436 lines
12 KiB
Markdown
436 lines
12 KiB
Markdown
# Agent Evaluation
|
|
|
|
**Quick nav**: [Why Evaluate?](#why-evaluate-agents) · [Metrics to Track](#metrics-to-track) · [Implementation](#implementation-patterns) · [Example](#example-agent-with-evaluation) · [Tools](#tools--references)
|
|
|
|
---
|
|
|
|
## Why Evaluate Agents?
|
|
|
|
When you create custom agents in `.claude/agents/`, you're encoding specialized expertise into reusable workflows. But how do you know if your agents are actually effective?
|
|
|
|
**Without evaluation**, you're building blind:
|
|
- ❌ No way to measure if agent responses are improving or degrading over time
|
|
- ❌ Can't compare different agent configurations objectively
|
|
- ❌ Difficult to identify which aspects of agent context/instructions need refinement
|
|
- ❌ No data to justify investment in agent development
|
|
|
|
**With evaluation**, you iterate with confidence:
|
|
- ✅ Quantify agent quality through metrics (response time, accuracy, tool usage)
|
|
- ✅ A/B test different agent configurations with measurable outcomes
|
|
- ✅ Identify patterns in successful vs failed interactions
|
|
- ✅ Build feedback loops for continuous improvement
|
|
|
|
**Core principle**: Agents are code. Like all code, they need tests, metrics, and observability.
|
|
|
|
---
|
|
|
|
## Metrics to Track
|
|
|
|
### 1. Response Quality Metrics
|
|
|
|
**What to measure**:
|
|
- **Task completion rate**: Did the agent accomplish the stated goal?
|
|
- **Correctness**: Were the agent's outputs factually accurate?
|
|
- **Relevance**: Did the response stay on-topic and address the actual question?
|
|
- **Hallucination rate**: How often did the agent invent information?
|
|
|
|
**How to track**:
|
|
```bash
|
|
# Post-response hook: .claude/hooks/log-response-quality.sh
|
|
# Triggered after each agent response
|
|
|
|
# Log structure:
|
|
{
|
|
"timestamp": "2026-02-10T14:32:00Z",
|
|
"agent_id": "backend-architect",
|
|
"task_completed": true,
|
|
"correctness_score": 4.5, # User rating 1-5
|
|
"hallucinations": 0,
|
|
"response_tokens": 1250
|
|
}
|
|
```
|
|
|
|
**Implementation tip**: Use user feedback prompts (thumbs up/down) or automated checks (test suite passing after agent code generation).
|
|
|
|
---
|
|
|
|
### 2. Tool Usage Metrics
|
|
|
|
**What to measure**:
|
|
- **Tool call success rate**: Percentage of tool calls that executed without errors
|
|
- **Tool selection accuracy**: Did agent choose the right tool for the task?
|
|
- **Tool call efficiency**: Minimum calls to achieve goal (avoid unnecessary reads/searches)
|
|
- **Error recovery**: Did agent handle tool failures gracefully?
|
|
|
|
**How to track**:
|
|
```bash
|
|
# Post-tool-use hook: .claude/hooks/log-tool-usage.sh
|
|
# Triggered after each tool call
|
|
|
|
# Log structure:
|
|
{
|
|
"timestamp": "2026-02-10T14:32:05Z",
|
|
"agent_id": "backend-architect",
|
|
"tool_name": "Read",
|
|
"tool_success": true,
|
|
"tool_parameters": {"file_path": "src/auth.ts"},
|
|
"execution_time_ms": 45
|
|
}
|
|
```
|
|
|
|
**Implementation tip**: Use Claude Code hooks system (see `examples/hooks/`) to automatically log tool calls.
|
|
|
|
---
|
|
|
|
### 3. Performance Metrics
|
|
|
|
**What to measure**:
|
|
- **Response time**: Total time from user prompt to complete response
|
|
- **Token efficiency**: Input/output tokens used per task
|
|
- **Context utilization**: How much of context window was used?
|
|
- **Cost per task**: API cost for the full interaction
|
|
|
|
**How to track**:
|
|
```bash
|
|
# Session-end hook: .claude/hooks/log-performance.sh
|
|
# Triggered at end of session
|
|
|
|
# Log structure:
|
|
{
|
|
"timestamp": "2026-02-10T14:35:00Z",
|
|
"agent_id": "backend-architect",
|
|
"session_duration_s": 180,
|
|
"input_tokens": 3500,
|
|
"output_tokens": 2800,
|
|
"total_cost_usd": 0.15,
|
|
"context_utilization": 0.42
|
|
}
|
|
```
|
|
|
|
**Implementation tip**: Parse Claude Code session logs or use MCP observability tools.
|
|
|
|
---
|
|
|
|
### 4. User Satisfaction Metrics
|
|
|
|
**What to measure**:
|
|
- **Explicit feedback**: User ratings, comments, bug reports
|
|
- **Implicit signals**: Did user accept agent's suggestions? Did they retry the prompt?
|
|
- **Adoption rate**: How often is this agent used vs alternatives?
|
|
- **Retention**: Do users return to this agent for similar tasks?
|
|
|
|
**How to track**:
|
|
```bash
|
|
# Manual feedback collection
|
|
# After agent completes task, prompt user:
|
|
"Rate this agent's performance (1-5): _"
|
|
|
|
# Log:
|
|
{
|
|
"timestamp": "2026-02-10T14:35:10Z",
|
|
"agent_id": "backend-architect",
|
|
"user_rating": 5,
|
|
"user_comment": "Perfect analysis of auth flow",
|
|
"would_use_again": true
|
|
}
|
|
```
|
|
|
|
**Implementation tip**: Add feedback prompts to agent templates or use post-session surveys.
|
|
|
|
---
|
|
|
|
## Implementation Patterns
|
|
|
|
### Pattern 1: Logging Hook System
|
|
|
|
**Use Case**: Automatically track all agent interactions without manual intervention
|
|
|
|
**Setup**:
|
|
```bash
|
|
# .claude/hooks/post-tool-use.sh
|
|
#!/bin/bash
|
|
# Triggered after every tool call
|
|
|
|
AGENT_ID=$(echo "$CLAUDE_AGENT_ID" | jq -r)
|
|
TOOL_NAME=$(echo "$CLAUDE_TOOL_NAME" | jq -r)
|
|
TOOL_SUCCESS=$(echo "$CLAUDE_TOOL_SUCCESS" | jq -r)
|
|
|
|
# Append to metrics log
|
|
echo "{\"timestamp\":\"$(date -Iseconds)\",\"agent\":\"$AGENT_ID\",\"tool\":\"$TOOL_NAME\",\"success\":$TOOL_SUCCESS}" \
|
|
>> .claude/logs/agent-metrics.jsonl
|
|
```
|
|
|
|
**Pros**: Zero manual overhead, complete coverage, time-series data
|
|
**Cons**: Requires parsing Claude Code environment variables (may change across versions)
|
|
|
|
---
|
|
|
|
### Pattern 2: Agent Unit Tests
|
|
|
|
**Use Case**: Regression testing to ensure agent improvements don't break existing capabilities
|
|
|
|
**Setup**:
|
|
```bash
|
|
# tests/agents/backend-architect.test.sh
|
|
#!/bin/bash
|
|
|
|
# Test 1: Agent correctly identifies hexagonal architecture layers
|
|
echo "Test: Hexagonal architecture analysis"
|
|
RESULT=$(claude agent backend-architect "Analyze src/auth.ts for layer violations")
|
|
if echo "$RESULT" | grep -q "domain layer"; then
|
|
echo "✅ PASS: Identified layers"
|
|
else
|
|
echo "❌ FAIL: Did not identify layers"
|
|
exit 1
|
|
fi
|
|
|
|
# Test 2: Agent recommends correct patterns
|
|
echo "Test: Pattern recommendations"
|
|
RESULT=$(claude agent backend-architect "Improve error handling in src/api.ts")
|
|
if echo "$RESULT" | grep -q "Result<T, E>"; then
|
|
echo "✅ PASS: Recommended Result pattern"
|
|
else
|
|
echo "❌ FAIL: Incorrect pattern"
|
|
exit 1
|
|
fi
|
|
```
|
|
|
|
**Pros**: Automated, catches regressions, CI/CD integration
|
|
**Cons**: Requires maintenance, may have false positives/negatives
|
|
|
|
---
|
|
|
|
### Pattern 3: A/B Testing Configurations
|
|
|
|
**Use Case**: Compare two versions of agent to determine which performs better
|
|
|
|
**Setup**:
|
|
```yaml
|
|
# .claude/agents/backend-architect-v1.md (control)
|
|
name: backend-architect
|
|
version: 1.0
|
|
instructions: |
|
|
You are a backend architect specializing in...
|
|
[original instructions]
|
|
|
|
# .claude/agents/backend-architect-v2.md (experiment)
|
|
name: backend-architect-v2
|
|
version: 2.0
|
|
instructions: |
|
|
You are a backend architect specializing in...
|
|
[modified instructions with new pattern emphasis]
|
|
```
|
|
|
|
**Evaluation**:
|
|
```bash
|
|
# Run same task with both agents, compare metrics
|
|
# Task: "Analyze src/auth.ts for security issues"
|
|
|
|
# Version 1 metrics:
|
|
# - Response time: 45s
|
|
# - Issues found: 3
|
|
# - User rating: 4/5
|
|
|
|
# Version 2 metrics:
|
|
# - Response time: 38s
|
|
# - Issues found: 5 (2 additional critical issues)
|
|
# - User rating: 5/5
|
|
|
|
# Conclusion: Version 2 is more thorough and faster → promote to production
|
|
```
|
|
|
|
**Pros**: Data-driven decisions, quantifiable improvements
|
|
**Cons**: Requires discipline to run controlled experiments
|
|
|
|
---
|
|
|
|
### Pattern 4: Feedback Loop Integration
|
|
|
|
**Use Case**: Continuously improve agent based on real-world usage data
|
|
|
|
**Setup**:
|
|
```bash
|
|
# After agent completes task
|
|
echo "How would you rate this response? (1-5, or 'skip'): "
|
|
read RATING
|
|
|
|
if [ "$RATING" != "skip" ]; then
|
|
echo "Any specific feedback?: "
|
|
read COMMENT
|
|
|
|
# Log feedback
|
|
echo "{\"timestamp\":\"$(date -Iseconds)\",\"agent\":\"$AGENT_ID\",\"rating\":$RATING,\"comment\":\"$COMMENT\"}" \
|
|
>> .claude/logs/agent-feedback.jsonl
|
|
fi
|
|
|
|
# Weekly: Review feedback.jsonl, identify patterns
|
|
# Monthly: Update agent instructions based on aggregated feedback
|
|
```
|
|
|
|
**Pros**: Aligns agent with actual user needs, identifies edge cases
|
|
**Cons**: Requires manual review and action on feedback
|
|
|
|
---
|
|
|
|
## Example: Agent with Evaluation
|
|
|
|
**Full template available**: [`examples/agents/analytics-with-eval/`](../../examples/agents/analytics-with-eval/) includes complete agent definition, hooks, analysis scripts, and report template.
|
|
|
|
### Setup: Analytics Agent with Built-in Metrics
|
|
|
|
```yaml
|
|
# .claude/agents/analytics-agent.md
|
|
---
|
|
name: analytics-agent
|
|
description: SQL query generator with evaluation hooks
|
|
version: 1.0
|
|
tools:
|
|
- Read
|
|
- Write
|
|
- Bash
|
|
hooks:
|
|
post_response: .claude/hooks/log-analytics-metrics.sh
|
|
---
|
|
|
|
# Analytics Agent
|
|
|
|
You are an expert SQL analyst helping users query databases.
|
|
|
|
## Evaluation Criteria
|
|
|
|
After each query:
|
|
1. **Correctness**: Does query produce expected results?
|
|
2. **Performance**: Query execution time < 5s?
|
|
3. **Safety**: No destructive operations (DELETE, DROP, TRUNCATE)?
|
|
4. **Best practices**: Uses proper JOINs, indexes, parameterized queries?
|
|
|
|
## Instructions
|
|
|
|
[... agent instructions ...]
|
|
```
|
|
|
|
### Metrics Hook
|
|
|
|
```bash
|
|
# .claude/hooks/log-analytics-metrics.sh
|
|
#!/bin/bash
|
|
# Triggered after analytics-agent response
|
|
|
|
# Extract query from response (naive grep, improve with jq)
|
|
QUERY=$(echo "$CLAUDE_RESPONSE" | grep -oP 'SELECT.*?;')
|
|
|
|
if [ -n "$QUERY" ]; then
|
|
# Test query (requires database connection)
|
|
EXEC_TIME=$( (time psql -U user -d db -c "$QUERY") 2>&1 | grep real | awk '{print $2}')
|
|
|
|
# Check for destructive operations
|
|
if echo "$QUERY" | grep -iE 'DELETE|DROP|TRUNCATE'; then
|
|
SAFETY="FAIL"
|
|
else
|
|
SAFETY="PASS"
|
|
fi
|
|
|
|
# Log metrics
|
|
echo "{\"timestamp\":\"$(date -Iseconds)\",\"query\":\"$QUERY\",\"exec_time\":\"$EXEC_TIME\",\"safety\":\"$SAFETY\"}" \
|
|
>> .claude/logs/analytics-metrics.jsonl
|
|
fi
|
|
```
|
|
|
|
### Analysis
|
|
|
|
```bash
|
|
# Monthly review: Analyze metrics
|
|
jq -s 'group_by(.safety) | map({safety: .[0].safety, count: length})' \
|
|
.claude/logs/analytics-metrics.jsonl
|
|
|
|
# Output:
|
|
# [
|
|
# {"safety": "PASS", "count": 127},
|
|
# {"safety": "FAIL", "count": 3}
|
|
# ]
|
|
|
|
# Action: Review 3 failed queries, update agent instructions to prevent future violations
|
|
```
|
|
|
|
---
|
|
|
|
## Tools & References
|
|
|
|
### Open-Source Evaluation Frameworks
|
|
|
|
#### nao (Analytics Agents)
|
|
|
|
**URL**: [github.com/getnao/nao](https://github.com/getnao/nao/)
|
|
|
|
**What it provides**:
|
|
- Built-in evaluation framework for analytics agents
|
|
- Unit testing capabilities for agent responses
|
|
- Metrics collection (response quality, tool usage, performance)
|
|
- Feedback loop integration
|
|
|
|
**How to adapt for Claude Code**:
|
|
- **Context builder pattern**: Apply nao's structured context approach to `.claude/agents/` config
|
|
- **Evaluation hooks**: Translate nao's evaluation framework to Claude Code hooks system
|
|
- **Metrics schema**: Use nao's metrics schema as template for your logs
|
|
|
|
**Status**: Production-ready, actively maintained, TypeScript + Python
|
|
|
|
---
|
|
|
|
### Claude Code Native Patterns
|
|
|
|
**Hooks system**: `.claude/hooks/` for automated logging (see `examples/hooks/README.md`)
|
|
|
|
**Agents directory**: `.claude/agents/` for custom agent definitions (see `guide/ultimate-guide.md` Section 4)
|
|
|
|
**MCP observability**: Use MCP servers for advanced logging and metrics aggregation
|
|
|
|
---
|
|
|
|
## Best Practices
|
|
|
|
### Start Simple
|
|
|
|
**Week 1**: Add basic logging hook (tool calls only)
|
|
**Week 2**: Add user feedback prompt (manual ratings)
|
|
**Week 3**: Build dashboard to visualize metrics
|
|
**Week 4**: Run first A/B test on agent configuration
|
|
|
|
### Focus on Actionable Metrics
|
|
|
|
Don't track metrics you won't act on. Prioritize:
|
|
1. **Task completion rate** → Refine agent instructions
|
|
2. **Tool call errors** → Improve context or add examples
|
|
3. **User ratings** → Identify confusing or unhelpful responses
|
|
|
|
### Automate Where Possible
|
|
|
|
Manual evaluation doesn't scale. Use:
|
|
- Hooks for automatic logging
|
|
- CI/CD integration for agent unit tests
|
|
- Scripts for periodic metric aggregation
|
|
|
|
### Build Feedback Loops
|
|
|
|
Metrics are useless without action:
|
|
- Weekly: Review metrics, identify patterns
|
|
- Monthly: Update agent instructions based on data
|
|
- Quarterly: Major agent refactoring if needed
|
|
|
|
---
|
|
|
|
## Related Sections
|
|
|
|
- **[Agents](./ultimate-guide.md#4-agents)**: Creating custom agents
|
|
- **[Hooks](./ultimate-guide.md#5-hooks)**: Automation with event hooks
|
|
- **[Observability](./observability.md)**: Logging and monitoring strategies
|
|
- **[AI Ecosystem](./ai-ecosystem.md#82-domain-specific-agent-frameworks)**: External frameworks like nao
|
|
|
|
---
|
|
|
|
**Next steps**:
|
|
1. Add logging hook to your most-used agent
|
|
2. Collect 1 week of metrics
|
|
3. Analyze and refine agent based on data
|
|
|
|
**Template**: See `examples/agents/analytics-with-eval/` for complete implementation with hooks, scripts, and report template
|