release: v3.24.0 - Agent Evaluation Framework
Major addition: Complete agent evaluation framework with production-ready template. ## Added - **Resource Evaluation**: nao framework (score 3/5) - Identified critical gap: agent evaluation not documented - Technical challenge adjusted score 2/5 → 3/5 - All claims fact-checked (TypeScript 58.9%, Python 38.5%) - **Guide Section**: Agent Evaluation (guide/agent-evaluation.md, ~3K tokens) - Metrics: response quality, tool usage, performance, satisfaction - Patterns: logging hooks, unit tests, A/B testing, feedback loops - Example: analytics agent with built-in metrics - Tools: nao framework reference, Claude Code hooks integration - **AI Ecosystem**: Section 8.2 Domain-Specific Agent Frameworks - nao (Analytics Agents): Database-agnostic, built-in evaluation - Transposable patterns: context builder, evaluation hooks, DB integrations - **Template**: Analytics Agent with Evaluation (5 files, ~1K lines) - README: setup, usage, troubleshooting - Agent: SQL generator with evaluation criteria, safety rules - Hook: automated metrics logging (safety, performance, errors) - Script: analysis with stats, safety reports, recommendations - Report template: monthly evaluation format ## Changed - Agent Evaluation Guide: updated template references, verified links - Landing Site: templates count 110 → 114 - Version: 3.23.5 → 3.24.0 Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
This commit is contained in:
parent
1fb783ebb8
commit
ef7cdd899e
15 changed files with 1782 additions and 16 deletions
|
|
@ -16,6 +16,7 @@ Core documentation for mastering Claude Code.
|
|||
| [architecture.md](./architecture.md) | How Claude Code works internally (master loop, tools, context) | 25 min |
|
||||
| [learning-with-ai.md](./learning-with-ai.md) | Guide for juniors on using AI without losing skills | 15 min |
|
||||
| [adoption-approaches.md](./adoption-approaches.md) | Implementation strategies for teams | 15 min |
|
||||
| [agent-evaluation.md](./agent-evaluation.md) | **Agent quality metrics**: Measuring custom agent effectiveness with hooks, tests, and feedback loops | 20 min |
|
||||
| [data-privacy.md](./data-privacy.md) | Data retention and privacy guide | 10 min |
|
||||
| [observability.md](./observability.md) | Session monitoring and cost tracking | 15 min |
|
||||
| [methodologies.md](./methodologies.md) | 15 development methodologies reference (TDD, SDD, BDD, etc.) | 20 min |
|
||||
|
|
|
|||
436
guide/agent-evaluation.md
Normal file
436
guide/agent-evaluation.md
Normal file
|
|
@ -0,0 +1,436 @@
|
|||
# Agent Evaluation
|
||||
|
||||
**Quick nav**: [Why Evaluate?](#why-evaluate-agents) · [Metrics to Track](#metrics-to-track) · [Implementation](#implementation-patterns) · [Example](#example-agent-with-evaluation) · [Tools](#tools--references)
|
||||
|
||||
---
|
||||
|
||||
## Why Evaluate Agents?
|
||||
|
||||
When you create custom agents in `.claude/agents/`, you're encoding specialized expertise into reusable workflows. But how do you know if your agents are actually effective?
|
||||
|
||||
**Without evaluation**, you're building blind:
|
||||
- ❌ No way to measure if agent responses are improving or degrading over time
|
||||
- ❌ Can't compare different agent configurations objectively
|
||||
- ❌ Difficult to identify which aspects of agent context/instructions need refinement
|
||||
- ❌ No data to justify investment in agent development
|
||||
|
||||
**With evaluation**, you iterate with confidence:
|
||||
- ✅ Quantify agent quality through metrics (response time, accuracy, tool usage)
|
||||
- ✅ A/B test different agent configurations with measurable outcomes
|
||||
- ✅ Identify patterns in successful vs failed interactions
|
||||
- ✅ Build feedback loops for continuous improvement
|
||||
|
||||
**Core principle**: Agents are code. Like all code, they need tests, metrics, and observability.
|
||||
|
||||
---
|
||||
|
||||
## Metrics to Track
|
||||
|
||||
### 1. Response Quality Metrics
|
||||
|
||||
**What to measure**:
|
||||
- **Task completion rate**: Did the agent accomplish the stated goal?
|
||||
- **Correctness**: Were the agent's outputs factually accurate?
|
||||
- **Relevance**: Did the response stay on-topic and address the actual question?
|
||||
- **Hallucination rate**: How often did the agent invent information?
|
||||
|
||||
**How to track**:
|
||||
```bash
|
||||
# Post-response hook: .claude/hooks/log-response-quality.sh
|
||||
# Triggered after each agent response
|
||||
|
||||
# Log structure:
|
||||
{
|
||||
"timestamp": "2026-02-10T14:32:00Z",
|
||||
"agent_id": "backend-architect",
|
||||
"task_completed": true,
|
||||
"correctness_score": 4.5, # User rating 1-5
|
||||
"hallucinations": 0,
|
||||
"response_tokens": 1250
|
||||
}
|
||||
```
|
||||
|
||||
**Implementation tip**: Use user feedback prompts (thumbs up/down) or automated checks (test suite passing after agent code generation).
|
||||
|
||||
---
|
||||
|
||||
### 2. Tool Usage Metrics
|
||||
|
||||
**What to measure**:
|
||||
- **Tool call success rate**: Percentage of tool calls that executed without errors
|
||||
- **Tool selection accuracy**: Did agent choose the right tool for the task?
|
||||
- **Tool call efficiency**: Minimum calls to achieve goal (avoid unnecessary reads/searches)
|
||||
- **Error recovery**: Did agent handle tool failures gracefully?
|
||||
|
||||
**How to track**:
|
||||
```bash
|
||||
# Post-tool-use hook: .claude/hooks/log-tool-usage.sh
|
||||
# Triggered after each tool call
|
||||
|
||||
# Log structure:
|
||||
{
|
||||
"timestamp": "2026-02-10T14:32:05Z",
|
||||
"agent_id": "backend-architect",
|
||||
"tool_name": "Read",
|
||||
"tool_success": true,
|
||||
"tool_parameters": {"file_path": "src/auth.ts"},
|
||||
"execution_time_ms": 45
|
||||
}
|
||||
```
|
||||
|
||||
**Implementation tip**: Use Claude Code hooks system (see `examples/hooks/`) to automatically log tool calls.
|
||||
|
||||
---
|
||||
|
||||
### 3. Performance Metrics
|
||||
|
||||
**What to measure**:
|
||||
- **Response time**: Total time from user prompt to complete response
|
||||
- **Token efficiency**: Input/output tokens used per task
|
||||
- **Context utilization**: How much of context window was used?
|
||||
- **Cost per task**: API cost for the full interaction
|
||||
|
||||
**How to track**:
|
||||
```bash
|
||||
# Session-end hook: .claude/hooks/log-performance.sh
|
||||
# Triggered at end of session
|
||||
|
||||
# Log structure:
|
||||
{
|
||||
"timestamp": "2026-02-10T14:35:00Z",
|
||||
"agent_id": "backend-architect",
|
||||
"session_duration_s": 180,
|
||||
"input_tokens": 3500,
|
||||
"output_tokens": 2800,
|
||||
"total_cost_usd": 0.15,
|
||||
"context_utilization": 0.42
|
||||
}
|
||||
```
|
||||
|
||||
**Implementation tip**: Parse Claude Code session logs or use MCP observability tools.
|
||||
|
||||
---
|
||||
|
||||
### 4. User Satisfaction Metrics
|
||||
|
||||
**What to measure**:
|
||||
- **Explicit feedback**: User ratings, comments, bug reports
|
||||
- **Implicit signals**: Did user accept agent's suggestions? Did they retry the prompt?
|
||||
- **Adoption rate**: How often is this agent used vs alternatives?
|
||||
- **Retention**: Do users return to this agent for similar tasks?
|
||||
|
||||
**How to track**:
|
||||
```bash
|
||||
# Manual feedback collection
|
||||
# After agent completes task, prompt user:
|
||||
"Rate this agent's performance (1-5): _"
|
||||
|
||||
# Log:
|
||||
{
|
||||
"timestamp": "2026-02-10T14:35:10Z",
|
||||
"agent_id": "backend-architect",
|
||||
"user_rating": 5,
|
||||
"user_comment": "Perfect analysis of auth flow",
|
||||
"would_use_again": true
|
||||
}
|
||||
```
|
||||
|
||||
**Implementation tip**: Add feedback prompts to agent templates or use post-session surveys.
|
||||
|
||||
---
|
||||
|
||||
## Implementation Patterns
|
||||
|
||||
### Pattern 1: Logging Hook System
|
||||
|
||||
**Use Case**: Automatically track all agent interactions without manual intervention
|
||||
|
||||
**Setup**:
|
||||
```bash
|
||||
# .claude/hooks/post-tool-use.sh
|
||||
#!/bin/bash
|
||||
# Triggered after every tool call
|
||||
|
||||
AGENT_ID=$(echo "$CLAUDE_AGENT_ID" | jq -r)
|
||||
TOOL_NAME=$(echo "$CLAUDE_TOOL_NAME" | jq -r)
|
||||
TOOL_SUCCESS=$(echo "$CLAUDE_TOOL_SUCCESS" | jq -r)
|
||||
|
||||
# Append to metrics log
|
||||
echo "{\"timestamp\":\"$(date -Iseconds)\",\"agent\":\"$AGENT_ID\",\"tool\":\"$TOOL_NAME\",\"success\":$TOOL_SUCCESS}" \
|
||||
>> .claude/logs/agent-metrics.jsonl
|
||||
```
|
||||
|
||||
**Pros**: Zero manual overhead, complete coverage, time-series data
|
||||
**Cons**: Requires parsing Claude Code environment variables (may change across versions)
|
||||
|
||||
---
|
||||
|
||||
### Pattern 2: Agent Unit Tests
|
||||
|
||||
**Use Case**: Regression testing to ensure agent improvements don't break existing capabilities
|
||||
|
||||
**Setup**:
|
||||
```bash
|
||||
# tests/agents/backend-architect.test.sh
|
||||
#!/bin/bash
|
||||
|
||||
# Test 1: Agent correctly identifies hexagonal architecture layers
|
||||
echo "Test: Hexagonal architecture analysis"
|
||||
RESULT=$(claude agent backend-architect "Analyze src/auth.ts for layer violations")
|
||||
if echo "$RESULT" | grep -q "domain layer"; then
|
||||
echo "✅ PASS: Identified layers"
|
||||
else
|
||||
echo "❌ FAIL: Did not identify layers"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# Test 2: Agent recommends correct patterns
|
||||
echo "Test: Pattern recommendations"
|
||||
RESULT=$(claude agent backend-architect "Improve error handling in src/api.ts")
|
||||
if echo "$RESULT" | grep -q "Result<T, E>"; then
|
||||
echo "✅ PASS: Recommended Result pattern"
|
||||
else
|
||||
echo "❌ FAIL: Incorrect pattern"
|
||||
exit 1
|
||||
fi
|
||||
```
|
||||
|
||||
**Pros**: Automated, catches regressions, CI/CD integration
|
||||
**Cons**: Requires maintenance, may have false positives/negatives
|
||||
|
||||
---
|
||||
|
||||
### Pattern 3: A/B Testing Configurations
|
||||
|
||||
**Use Case**: Compare two versions of agent to determine which performs better
|
||||
|
||||
**Setup**:
|
||||
```yaml
|
||||
# .claude/agents/backend-architect-v1.md (control)
|
||||
name: backend-architect
|
||||
version: 1.0
|
||||
instructions: |
|
||||
You are a backend architect specializing in...
|
||||
[original instructions]
|
||||
|
||||
# .claude/agents/backend-architect-v2.md (experiment)
|
||||
name: backend-architect-v2
|
||||
version: 2.0
|
||||
instructions: |
|
||||
You are a backend architect specializing in...
|
||||
[modified instructions with new pattern emphasis]
|
||||
```
|
||||
|
||||
**Evaluation**:
|
||||
```bash
|
||||
# Run same task with both agents, compare metrics
|
||||
# Task: "Analyze src/auth.ts for security issues"
|
||||
|
||||
# Version 1 metrics:
|
||||
# - Response time: 45s
|
||||
# - Issues found: 3
|
||||
# - User rating: 4/5
|
||||
|
||||
# Version 2 metrics:
|
||||
# - Response time: 38s
|
||||
# - Issues found: 5 (2 additional critical issues)
|
||||
# - User rating: 5/5
|
||||
|
||||
# Conclusion: Version 2 is more thorough and faster → promote to production
|
||||
```
|
||||
|
||||
**Pros**: Data-driven decisions, quantifiable improvements
|
||||
**Cons**: Requires discipline to run controlled experiments
|
||||
|
||||
---
|
||||
|
||||
### Pattern 4: Feedback Loop Integration
|
||||
|
||||
**Use Case**: Continuously improve agent based on real-world usage data
|
||||
|
||||
**Setup**:
|
||||
```bash
|
||||
# After agent completes task
|
||||
echo "How would you rate this response? (1-5, or 'skip'): "
|
||||
read RATING
|
||||
|
||||
if [ "$RATING" != "skip" ]; then
|
||||
echo "Any specific feedback?: "
|
||||
read COMMENT
|
||||
|
||||
# Log feedback
|
||||
echo "{\"timestamp\":\"$(date -Iseconds)\",\"agent\":\"$AGENT_ID\",\"rating\":$RATING,\"comment\":\"$COMMENT\"}" \
|
||||
>> .claude/logs/agent-feedback.jsonl
|
||||
fi
|
||||
|
||||
# Weekly: Review feedback.jsonl, identify patterns
|
||||
# Monthly: Update agent instructions based on aggregated feedback
|
||||
```
|
||||
|
||||
**Pros**: Aligns agent with actual user needs, identifies edge cases
|
||||
**Cons**: Requires manual review and action on feedback
|
||||
|
||||
---
|
||||
|
||||
## Example: Agent with Evaluation
|
||||
|
||||
**Full template available**: [`examples/agents/analytics-with-eval/`](../../examples/agents/analytics-with-eval/) includes complete agent definition, hooks, analysis scripts, and report template.
|
||||
|
||||
### Setup: Analytics Agent with Built-in Metrics
|
||||
|
||||
```yaml
|
||||
# .claude/agents/analytics-agent.md
|
||||
---
|
||||
name: analytics-agent
|
||||
description: SQL query generator with evaluation hooks
|
||||
version: 1.0
|
||||
tools:
|
||||
- Read
|
||||
- Write
|
||||
- Bash
|
||||
hooks:
|
||||
post_response: .claude/hooks/log-analytics-metrics.sh
|
||||
---
|
||||
|
||||
# Analytics Agent
|
||||
|
||||
You are an expert SQL analyst helping users query databases.
|
||||
|
||||
## Evaluation Criteria
|
||||
|
||||
After each query:
|
||||
1. **Correctness**: Does query produce expected results?
|
||||
2. **Performance**: Query execution time < 5s?
|
||||
3. **Safety**: No destructive operations (DELETE, DROP, TRUNCATE)?
|
||||
4. **Best practices**: Uses proper JOINs, indexes, parameterized queries?
|
||||
|
||||
## Instructions
|
||||
|
||||
[... agent instructions ...]
|
||||
```
|
||||
|
||||
### Metrics Hook
|
||||
|
||||
```bash
|
||||
# .claude/hooks/log-analytics-metrics.sh
|
||||
#!/bin/bash
|
||||
# Triggered after analytics-agent response
|
||||
|
||||
# Extract query from response (naive grep, improve with jq)
|
||||
QUERY=$(echo "$CLAUDE_RESPONSE" | grep -oP 'SELECT.*?;')
|
||||
|
||||
if [ -n "$QUERY" ]; then
|
||||
# Test query (requires database connection)
|
||||
EXEC_TIME=$( (time psql -U user -d db -c "$QUERY") 2>&1 | grep real | awk '{print $2}')
|
||||
|
||||
# Check for destructive operations
|
||||
if echo "$QUERY" | grep -iE 'DELETE|DROP|TRUNCATE'; then
|
||||
SAFETY="FAIL"
|
||||
else
|
||||
SAFETY="PASS"
|
||||
fi
|
||||
|
||||
# Log metrics
|
||||
echo "{\"timestamp\":\"$(date -Iseconds)\",\"query\":\"$QUERY\",\"exec_time\":\"$EXEC_TIME\",\"safety\":\"$SAFETY\"}" \
|
||||
>> .claude/logs/analytics-metrics.jsonl
|
||||
fi
|
||||
```
|
||||
|
||||
### Analysis
|
||||
|
||||
```bash
|
||||
# Monthly review: Analyze metrics
|
||||
jq -s 'group_by(.safety) | map({safety: .[0].safety, count: length})' \
|
||||
.claude/logs/analytics-metrics.jsonl
|
||||
|
||||
# Output:
|
||||
# [
|
||||
# {"safety": "PASS", "count": 127},
|
||||
# {"safety": "FAIL", "count": 3}
|
||||
# ]
|
||||
|
||||
# Action: Review 3 failed queries, update agent instructions to prevent future violations
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Tools & References
|
||||
|
||||
### Open-Source Evaluation Frameworks
|
||||
|
||||
#### nao (Analytics Agents)
|
||||
|
||||
**URL**: [github.com/getnao/nao](https://github.com/getnao/nao/)
|
||||
|
||||
**What it provides**:
|
||||
- Built-in evaluation framework for analytics agents
|
||||
- Unit testing capabilities for agent responses
|
||||
- Metrics collection (response quality, tool usage, performance)
|
||||
- Feedback loop integration
|
||||
|
||||
**How to adapt for Claude Code**:
|
||||
- **Context builder pattern**: Apply nao's structured context approach to `.claude/agents/` config
|
||||
- **Evaluation hooks**: Translate nao's evaluation framework to Claude Code hooks system
|
||||
- **Metrics schema**: Use nao's metrics schema as template for your logs
|
||||
|
||||
**Status**: Production-ready, actively maintained, TypeScript + Python
|
||||
|
||||
---
|
||||
|
||||
### Claude Code Native Patterns
|
||||
|
||||
**Hooks system**: `.claude/hooks/` for automated logging (see `examples/hooks/README.md`)
|
||||
|
||||
**Agents directory**: `.claude/agents/` for custom agent definitions (see `guide/ultimate-guide.md` Section 4)
|
||||
|
||||
**MCP observability**: Use MCP servers for advanced logging and metrics aggregation
|
||||
|
||||
---
|
||||
|
||||
## Best Practices
|
||||
|
||||
### Start Simple
|
||||
|
||||
**Week 1**: Add basic logging hook (tool calls only)
|
||||
**Week 2**: Add user feedback prompt (manual ratings)
|
||||
**Week 3**: Build dashboard to visualize metrics
|
||||
**Week 4**: Run first A/B test on agent configuration
|
||||
|
||||
### Focus on Actionable Metrics
|
||||
|
||||
Don't track metrics you won't act on. Prioritize:
|
||||
1. **Task completion rate** → Refine agent instructions
|
||||
2. **Tool call errors** → Improve context or add examples
|
||||
3. **User ratings** → Identify confusing or unhelpful responses
|
||||
|
||||
### Automate Where Possible
|
||||
|
||||
Manual evaluation doesn't scale. Use:
|
||||
- Hooks for automatic logging
|
||||
- CI/CD integration for agent unit tests
|
||||
- Scripts for periodic metric aggregation
|
||||
|
||||
### Build Feedback Loops
|
||||
|
||||
Metrics are useless without action:
|
||||
- Weekly: Review metrics, identify patterns
|
||||
- Monthly: Update agent instructions based on data
|
||||
- Quarterly: Major agent refactoring if needed
|
||||
|
||||
---
|
||||
|
||||
## Related Sections
|
||||
|
||||
- **[Agents](./ultimate-guide.md#4-agents)**: Creating custom agents
|
||||
- **[Hooks](./ultimate-guide.md#5-hooks)**: Automation with event hooks
|
||||
- **[Observability](./observability.md)**: Logging and monitoring strategies
|
||||
- **[AI Ecosystem](./ai-ecosystem.md#82-domain-specific-agent-frameworks)**: External frameworks like nao
|
||||
|
||||
---
|
||||
|
||||
**Next steps**:
|
||||
1. Add logging hook to your most-used agent
|
||||
2. Collect 1 week of metrics
|
||||
3. Analyze and refine agent based on data
|
||||
|
||||
**Template**: See `examples/agents/analytics-with-eval/` for complete implementation with hooks, scripts, and report template
|
||||
|
|
@ -1609,6 +1609,32 @@ If you're not using Gas Town/multiclaude, you can still:
|
|||
- Experimentation tolerance is high (work may be lost/redone)
|
||||
- Team has SRE capacity to monitor/intervene
|
||||
|
||||
### 8.2 Domain-Specific Agent Frameworks
|
||||
|
||||
Beyond general-purpose coding assistants, specialized frameworks target specific use cases with built-in context, evaluation, and deployment patterns.
|
||||
|
||||
#### nao (Analytics Agents)
|
||||
|
||||
**URL**: [github.com/getnao/nao](https://github.com/getnao/nao/) | **Stack**: TypeScript 58.9%, Python 38.5%
|
||||
|
||||
**What it is**: Open-source framework for building and deploying analytics agents. Two-step architecture: build agent context via CLI (databases, docs, metadata) → deploy chat UI for natural language data queries.
|
||||
|
||||
**Key features**:
|
||||
- Database agnostic (PostgreSQL, BigQuery, Snowflake, Databricks)
|
||||
- Built-in evaluation framework with unit testing
|
||||
- Native data visualization in chat interface
|
||||
- Self-hosted deployment with Docker
|
||||
- Stack: Fastify, Drizzle ORM, tRPC, React, shadcn UI
|
||||
|
||||
**Relevance to Claude Code**: While nao deploys agents as standalone services (not Claude Code plugins), its patterns are transposable:
|
||||
- **Context builder architecture**: Structuring complex agent context (similar to `.claude/agents/` best practices)
|
||||
- **Evaluation framework**: Measuring agent quality through metrics, unit tests, and feedback loops (gap in current Claude Code workflows)
|
||||
- **Database integrations**: Patterns for injecting database context into agent prompts
|
||||
|
||||
**When to use**: Data teams building conversational analytics interfaces for business users. For Claude Code users, nao serves as reference architecture for agent evaluation and database context patterns.
|
||||
|
||||
**Status**: Active open-source project, production-ready, well-documented
|
||||
|
||||
---
|
||||
|
||||
## 9. Cost & Subscription Strategy
|
||||
|
|
|
|||
|
|
@ -6,7 +6,7 @@
|
|||
|
||||
**Written with**: Claude (Anthropic)
|
||||
|
||||
**Version**: 3.23.5 | **Last Updated**: February 2026
|
||||
**Version**: 3.24.0 | **Last Updated**: February 2026
|
||||
|
||||
---
|
||||
|
||||
|
|
@ -520,4 +520,4 @@ where.exe claude; claude doctor; claude mcp list
|
|||
|
||||
**Author**: Florian BRUNIAUX | [@Méthode Aristote](https://methode-aristote.fr) | Written with Claude
|
||||
|
||||
*Last updated: February 2026 | Version 3.23.5*
|
||||
*Last updated: February 2026 | Version 3.24.0*
|
||||
|
|
|
|||
|
|
@ -10,7 +10,7 @@
|
|||
|
||||
**Last updated**: January 2026
|
||||
|
||||
**Version**: 3.23.5
|
||||
**Version**: 3.24.0
|
||||
|
||||
---
|
||||
|
||||
|
|
@ -4241,7 +4241,7 @@ The `.claude/` folder is your project's Claude Code directory for memory, settin
|
|||
| Personal preferences | `CLAUDE.md` | ❌ Gitignore |
|
||||
| Personal permissions | `settings.local.json` | ❌ Gitignore |
|
||||
|
||||
### 3.23.5 Version Control & Backup
|
||||
### 3.24.0 Version Control & Backup
|
||||
|
||||
**Problem**: Without version control, losing your Claude Code configuration means hours of manual reconfiguration across agents, skills, hooks, and MCP servers.
|
||||
|
||||
|
|
@ -19293,4 +19293,4 @@ We'll evaluate and add it to this section if it meets quality criteria.
|
|||
|
||||
**Contributions**: Issues and PRs welcome.
|
||||
|
||||
**Last updated**: January 2026 | **Version**: 3.23.5
|
||||
**Last updated**: January 2026 | **Version**: 3.24.0
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue