claude-code-ultimate-guide/guide/agent-evaluation.md
Florian BRUNIAUX ef7cdd899e release: v3.24.0 - Agent Evaluation Framework
Major addition: Complete agent evaluation framework with production-ready template.

## Added

- **Resource Evaluation**: nao framework (score 3/5)
  - Identified critical gap: agent evaluation not documented
  - Technical challenge adjusted score 2/5 → 3/5
  - All claims fact-checked (TypeScript 58.9%, Python 38.5%)

- **Guide Section**: Agent Evaluation (guide/agent-evaluation.md, ~3K tokens)
  - Metrics: response quality, tool usage, performance, satisfaction
  - Patterns: logging hooks, unit tests, A/B testing, feedback loops
  - Example: analytics agent with built-in metrics
  - Tools: nao framework reference, Claude Code hooks integration

- **AI Ecosystem**: Section 8.2 Domain-Specific Agent Frameworks
  - nao (Analytics Agents): Database-agnostic, built-in evaluation
  - Transposable patterns: context builder, evaluation hooks, DB integrations

- **Template**: Analytics Agent with Evaluation (5 files, ~1K lines)
  - README: setup, usage, troubleshooting
  - Agent: SQL generator with evaluation criteria, safety rules
  - Hook: automated metrics logging (safety, performance, errors)
  - Script: analysis with stats, safety reports, recommendations
  - Report template: monthly evaluation format

## Changed

- Agent Evaluation Guide: updated template references, verified links
- Landing Site: templates count 110 → 114
- Version: 3.23.5 → 3.24.0

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-10 11:52:13 +01:00

436 lines
12 KiB
Markdown

# Agent Evaluation
**Quick nav**: [Why Evaluate?](#why-evaluate-agents) · [Metrics to Track](#metrics-to-track) · [Implementation](#implementation-patterns) · [Example](#example-agent-with-evaluation) · [Tools](#tools--references)
---
## Why Evaluate Agents?
When you create custom agents in `.claude/agents/`, you're encoding specialized expertise into reusable workflows. But how do you know if your agents are actually effective?
**Without evaluation**, you're building blind:
- ❌ No way to measure if agent responses are improving or degrading over time
- ❌ Can't compare different agent configurations objectively
- ❌ Difficult to identify which aspects of agent context/instructions need refinement
- ❌ No data to justify investment in agent development
**With evaluation**, you iterate with confidence:
- ✅ Quantify agent quality through metrics (response time, accuracy, tool usage)
- ✅ A/B test different agent configurations with measurable outcomes
- ✅ Identify patterns in successful vs failed interactions
- ✅ Build feedback loops for continuous improvement
**Core principle**: Agents are code. Like all code, they need tests, metrics, and observability.
---
## Metrics to Track
### 1. Response Quality Metrics
**What to measure**:
- **Task completion rate**: Did the agent accomplish the stated goal?
- **Correctness**: Were the agent's outputs factually accurate?
- **Relevance**: Did the response stay on-topic and address the actual question?
- **Hallucination rate**: How often did the agent invent information?
**How to track**:
```bash
# Post-response hook: .claude/hooks/log-response-quality.sh
# Triggered after each agent response
# Log structure:
{
"timestamp": "2026-02-10T14:32:00Z",
"agent_id": "backend-architect",
"task_completed": true,
"correctness_score": 4.5, # User rating 1-5
"hallucinations": 0,
"response_tokens": 1250
}
```
**Implementation tip**: Use user feedback prompts (thumbs up/down) or automated checks (test suite passing after agent code generation).
---
### 2. Tool Usage Metrics
**What to measure**:
- **Tool call success rate**: Percentage of tool calls that executed without errors
- **Tool selection accuracy**: Did agent choose the right tool for the task?
- **Tool call efficiency**: Minimum calls to achieve goal (avoid unnecessary reads/searches)
- **Error recovery**: Did agent handle tool failures gracefully?
**How to track**:
```bash
# Post-tool-use hook: .claude/hooks/log-tool-usage.sh
# Triggered after each tool call
# Log structure:
{
"timestamp": "2026-02-10T14:32:05Z",
"agent_id": "backend-architect",
"tool_name": "Read",
"tool_success": true,
"tool_parameters": {"file_path": "src/auth.ts"},
"execution_time_ms": 45
}
```
**Implementation tip**: Use Claude Code hooks system (see `examples/hooks/`) to automatically log tool calls.
---
### 3. Performance Metrics
**What to measure**:
- **Response time**: Total time from user prompt to complete response
- **Token efficiency**: Input/output tokens used per task
- **Context utilization**: How much of context window was used?
- **Cost per task**: API cost for the full interaction
**How to track**:
```bash
# Session-end hook: .claude/hooks/log-performance.sh
# Triggered at end of session
# Log structure:
{
"timestamp": "2026-02-10T14:35:00Z",
"agent_id": "backend-architect",
"session_duration_s": 180,
"input_tokens": 3500,
"output_tokens": 2800,
"total_cost_usd": 0.15,
"context_utilization": 0.42
}
```
**Implementation tip**: Parse Claude Code session logs or use MCP observability tools.
---
### 4. User Satisfaction Metrics
**What to measure**:
- **Explicit feedback**: User ratings, comments, bug reports
- **Implicit signals**: Did user accept agent's suggestions? Did they retry the prompt?
- **Adoption rate**: How often is this agent used vs alternatives?
- **Retention**: Do users return to this agent for similar tasks?
**How to track**:
```bash
# Manual feedback collection
# After agent completes task, prompt user:
"Rate this agent's performance (1-5): _"
# Log:
{
"timestamp": "2026-02-10T14:35:10Z",
"agent_id": "backend-architect",
"user_rating": 5,
"user_comment": "Perfect analysis of auth flow",
"would_use_again": true
}
```
**Implementation tip**: Add feedback prompts to agent templates or use post-session surveys.
---
## Implementation Patterns
### Pattern 1: Logging Hook System
**Use Case**: Automatically track all agent interactions without manual intervention
**Setup**:
```bash
# .claude/hooks/post-tool-use.sh
#!/bin/bash
# Triggered after every tool call
AGENT_ID=$(echo "$CLAUDE_AGENT_ID" | jq -r)
TOOL_NAME=$(echo "$CLAUDE_TOOL_NAME" | jq -r)
TOOL_SUCCESS=$(echo "$CLAUDE_TOOL_SUCCESS" | jq -r)
# Append to metrics log
echo "{\"timestamp\":\"$(date -Iseconds)\",\"agent\":\"$AGENT_ID\",\"tool\":\"$TOOL_NAME\",\"success\":$TOOL_SUCCESS}" \
>> .claude/logs/agent-metrics.jsonl
```
**Pros**: Zero manual overhead, complete coverage, time-series data
**Cons**: Requires parsing Claude Code environment variables (may change across versions)
---
### Pattern 2: Agent Unit Tests
**Use Case**: Regression testing to ensure agent improvements don't break existing capabilities
**Setup**:
```bash
# tests/agents/backend-architect.test.sh
#!/bin/bash
# Test 1: Agent correctly identifies hexagonal architecture layers
echo "Test: Hexagonal architecture analysis"
RESULT=$(claude agent backend-architect "Analyze src/auth.ts for layer violations")
if echo "$RESULT" | grep -q "domain layer"; then
echo "✅ PASS: Identified layers"
else
echo "❌ FAIL: Did not identify layers"
exit 1
fi
# Test 2: Agent recommends correct patterns
echo "Test: Pattern recommendations"
RESULT=$(claude agent backend-architect "Improve error handling in src/api.ts")
if echo "$RESULT" | grep -q "Result<T, E>"; then
echo "✅ PASS: Recommended Result pattern"
else
echo "❌ FAIL: Incorrect pattern"
exit 1
fi
```
**Pros**: Automated, catches regressions, CI/CD integration
**Cons**: Requires maintenance, may have false positives/negatives
---
### Pattern 3: A/B Testing Configurations
**Use Case**: Compare two versions of agent to determine which performs better
**Setup**:
```yaml
# .claude/agents/backend-architect-v1.md (control)
name: backend-architect
version: 1.0
instructions: |
You are a backend architect specializing in...
[original instructions]
# .claude/agents/backend-architect-v2.md (experiment)
name: backend-architect-v2
version: 2.0
instructions: |
You are a backend architect specializing in...
[modified instructions with new pattern emphasis]
```
**Evaluation**:
```bash
# Run same task with both agents, compare metrics
# Task: "Analyze src/auth.ts for security issues"
# Version 1 metrics:
# - Response time: 45s
# - Issues found: 3
# - User rating: 4/5
# Version 2 metrics:
# - Response time: 38s
# - Issues found: 5 (2 additional critical issues)
# - User rating: 5/5
# Conclusion: Version 2 is more thorough and faster → promote to production
```
**Pros**: Data-driven decisions, quantifiable improvements
**Cons**: Requires discipline to run controlled experiments
---
### Pattern 4: Feedback Loop Integration
**Use Case**: Continuously improve agent based on real-world usage data
**Setup**:
```bash
# After agent completes task
echo "How would you rate this response? (1-5, or 'skip'): "
read RATING
if [ "$RATING" != "skip" ]; then
echo "Any specific feedback?: "
read COMMENT
# Log feedback
echo "{\"timestamp\":\"$(date -Iseconds)\",\"agent\":\"$AGENT_ID\",\"rating\":$RATING,\"comment\":\"$COMMENT\"}" \
>> .claude/logs/agent-feedback.jsonl
fi
# Weekly: Review feedback.jsonl, identify patterns
# Monthly: Update agent instructions based on aggregated feedback
```
**Pros**: Aligns agent with actual user needs, identifies edge cases
**Cons**: Requires manual review and action on feedback
---
## Example: Agent with Evaluation
**Full template available**: [`examples/agents/analytics-with-eval/`](../../examples/agents/analytics-with-eval/) includes complete agent definition, hooks, analysis scripts, and report template.
### Setup: Analytics Agent with Built-in Metrics
```yaml
# .claude/agents/analytics-agent.md
---
name: analytics-agent
description: SQL query generator with evaluation hooks
version: 1.0
tools:
- Read
- Write
- Bash
hooks:
post_response: .claude/hooks/log-analytics-metrics.sh
---
# Analytics Agent
You are an expert SQL analyst helping users query databases.
## Evaluation Criteria
After each query:
1. **Correctness**: Does query produce expected results?
2. **Performance**: Query execution time < 5s?
3. **Safety**: No destructive operations (DELETE, DROP, TRUNCATE)?
4. **Best practices**: Uses proper JOINs, indexes, parameterized queries?
## Instructions
[... agent instructions ...]
```
### Metrics Hook
```bash
# .claude/hooks/log-analytics-metrics.sh
#!/bin/bash
# Triggered after analytics-agent response
# Extract query from response (naive grep, improve with jq)
QUERY=$(echo "$CLAUDE_RESPONSE" | grep -oP 'SELECT.*?;')
if [ -n "$QUERY" ]; then
# Test query (requires database connection)
EXEC_TIME=$( (time psql -U user -d db -c "$QUERY") 2>&1 | grep real | awk '{print $2}')
# Check for destructive operations
if echo "$QUERY" | grep -iE 'DELETE|DROP|TRUNCATE'; then
SAFETY="FAIL"
else
SAFETY="PASS"
fi
# Log metrics
echo "{\"timestamp\":\"$(date -Iseconds)\",\"query\":\"$QUERY\",\"exec_time\":\"$EXEC_TIME\",\"safety\":\"$SAFETY\"}" \
>> .claude/logs/analytics-metrics.jsonl
fi
```
### Analysis
```bash
# Monthly review: Analyze metrics
jq -s 'group_by(.safety) | map({safety: .[0].safety, count: length})' \
.claude/logs/analytics-metrics.jsonl
# Output:
# [
# {"safety": "PASS", "count": 127},
# {"safety": "FAIL", "count": 3}
# ]
# Action: Review 3 failed queries, update agent instructions to prevent future violations
```
---
## Tools & References
### Open-Source Evaluation Frameworks
#### nao (Analytics Agents)
**URL**: [github.com/getnao/nao](https://github.com/getnao/nao/)
**What it provides**:
- Built-in evaluation framework for analytics agents
- Unit testing capabilities for agent responses
- Metrics collection (response quality, tool usage, performance)
- Feedback loop integration
**How to adapt for Claude Code**:
- **Context builder pattern**: Apply nao's structured context approach to `.claude/agents/` config
- **Evaluation hooks**: Translate nao's evaluation framework to Claude Code hooks system
- **Metrics schema**: Use nao's metrics schema as template for your logs
**Status**: Production-ready, actively maintained, TypeScript + Python
---
### Claude Code Native Patterns
**Hooks system**: `.claude/hooks/` for automated logging (see `examples/hooks/README.md`)
**Agents directory**: `.claude/agents/` for custom agent definitions (see `guide/ultimate-guide.md` Section 4)
**MCP observability**: Use MCP servers for advanced logging and metrics aggregation
---
## Best Practices
### Start Simple
**Week 1**: Add basic logging hook (tool calls only)
**Week 2**: Add user feedback prompt (manual ratings)
**Week 3**: Build dashboard to visualize metrics
**Week 4**: Run first A/B test on agent configuration
### Focus on Actionable Metrics
Don't track metrics you won't act on. Prioritize:
1. **Task completion rate** → Refine agent instructions
2. **Tool call errors** → Improve context or add examples
3. **User ratings** → Identify confusing or unhelpful responses
### Automate Where Possible
Manual evaluation doesn't scale. Use:
- Hooks for automatic logging
- CI/CD integration for agent unit tests
- Scripts for periodic metric aggregation
### Build Feedback Loops
Metrics are useless without action:
- Weekly: Review metrics, identify patterns
- Monthly: Update agent instructions based on data
- Quarterly: Major agent refactoring if needed
---
## Related Sections
- **[Agents](./ultimate-guide.md#4-agents)**: Creating custom agents
- **[Hooks](./ultimate-guide.md#5-hooks)**: Automation with event hooks
- **[Observability](./observability.md)**: Logging and monitoring strategies
- **[AI Ecosystem](./ai-ecosystem.md#82-domain-specific-agent-frameworks)**: External frameworks like nao
---
**Next steps**:
1. Add logging hook to your most-used agent
2. Collect 1 week of metrics
3. Analyze and refine agent based on data
**Template**: See `examples/agents/analytics-with-eval/` for complete implementation with hooks, scripts, and report template