release: v3.24.0 - Agent Evaluation Framework

Major addition: Complete agent evaluation framework with production-ready template. ## Added - **Resource Evaluation**: nao framework (score 3/5) - Identified critical gap: agent evaluation not documented - Technical challenge adjusted score 2/5 → 3/5 - All claims fact-checked (TypeScript 58.9%, Python 38.5%) - **Guide Section**: Agent Evaluation (guide/agent-evaluation.md, ~3K tokens) - Metrics: response quality, tool usage, performance, satisfaction - Patterns: logging hooks, unit tests, A/B testing, feedback loops - Example: analytics agent with built-in metrics - Tools: nao framework reference, Claude Code hooks integration - **AI Ecosystem**: Section 8.2 Domain-Specific Agent Frameworks - nao (Analytics Agents): Database-agnostic, built-in evaluation - Transposable patterns: context builder, evaluation hooks, DB integrations - **Template**: Analytics Agent with Evaluation (5 files, ~1K lines) - README: setup, usage, troubleshooting - Agent: SQL generator with evaluation criteria, safety rules - Hook: automated metrics logging (safety, performance, errors) - Script: analysis with stats, safety reports, recommendations - Report template: monthly evaluation format ## Changed - Agent Evaluation Guide: updated template references, verified links - Landing Site: templates count 110 → 114 - Version: 3.23.5 → 3.24.0 Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-10 11:52:13 +01:00 · 2026-02-10 11:52:13 +01:00 · ef7cdd899e
commit ef7cdd899e
parent 1fb783ebb8
15 changed files with 1782 additions and 16 deletions
--- a/examples/agents/analytics-with-eval/eval/report-template.md
+++ b/examples/agents/analytics-with-eval/eval/report-template.md
@ -0,0 +1,257 @@
+# Analytics Agent Evaluation Report
+
+**Month**: [YYYY-MM]
+**Report Date**: [YYYY-MM-DD]
+**Evaluator**: [Your Name]
+**Agent Version**: 1.0
+
+---
+
+## Executive Summary
+
+[2-3 sentence overview of agent performance this month]
+
+**Key Metrics**:
+- Total queries: [X]
+- Safety pass rate: [Y]%
+- Avg execution time: [Z]s
+
+**Status**: 🟢 Healthy / 🟡 Needs Attention / 🔴 Critical
+
+---
+
+## Metrics Overview
+
+### Volume
+
+| Metric | Value |
+|--------|-------|
+| Total queries generated | [X] |
+| Unique users/sessions | [Y] |
+| Queries per day (avg) | [Z] |
+| Growth vs last month | [+/-]% |
+
+### Quality Metrics
+
+| Metric | Target | Actual | Status |
+|--------|--------|--------|--------|
+| Safety pass rate | >95% | [X]% | 🟢/🟡/🔴 |
+| Query correctness | >90% | [Y]% | 🟢/🟡/🔴 |
+| User satisfaction | >4.0/5 | [Z]/5 | 🟢/🟡/🔴 |
+
+### Performance Metrics
+
+| Metric | Target | Actual | Status |
+|--------|--------|--------|--------|
+| Mean execution time | <3s | [X]s | 🟢/🟡/🔴 |
+| P95 execution time | <5s | [Y]s | 🟢/🟡/🔴 |
+| P99 execution time | <10s | [Z]s | 🟢/🟡/🔴 |
+
+---
+
+## Safety Analysis
+
+### Safety Check Results
+
+```
+Total: [X] queries
+- PASS: [Y] ([Z]%)
+- FAIL: [A] ([B]%)
+```
+
+### Top Safety Failures
+
+1. **[Failure Type]** - [X] occurrences
+   - Example: `[SQL query snippet]`
+   - Root cause: [Brief explanation]
+   - Action: [What was done to fix]
+
+2. **[Failure Type]** - [Y] occurrences
+   - Example: `[SQL query snippet]`
+   - Root cause: [Brief explanation]
+   - Action: [What was done to fix]
+
+### Trends
+
+[Graph or description showing safety pass rate over time]
+
+---
+
+## Performance Analysis
+
+### Execution Time Distribution
+
+```
+Mean:   [X]s
+Median: [Y]s
+P95:    [Z]s
+P99:    [A]s
+Max:    [B]s
+```
+
+### Slowest Queries
+
+1. **[Query description]** - [X]s
+   ```sql
+   [SQL query]
+   ```
+   - Reason: [Why slow]
+   - Optimization: [What could improve it]
+
+2. **[Query description]** - [Y]s
+   ```sql
+   [SQL query]
+   ```
+   - Reason: [Why slow]
+   - Optimization: [What could improve it]
+
+---
+
+## User Feedback
+
+### Explicit Feedback
+
+- **Positive**: [X] responses
+  - Common praise: "[Theme 1]", "[Theme 2]"
+- **Negative**: [Y] responses
+  - Common complaints: "[Theme 1]", "[Theme 2]"
+
+### Implicit Signals
+
+- **Query retry rate**: [X]% (users re-running queries)
+- **Query modification rate**: [Y]% (users editing generated queries)
+- **Adoption rate**: [Z] queries/user/week
+
+### Notable Feedback
+
+> "[User quote 1]"
+— [User name/role, if available]
+
+> "[User quote 2]"
+— [User name/role, if available]
+
+---
+
+## Incident Log
+
+### Critical Issues
+
+| Date | Issue | Impact | Resolution |
+|------|-------|--------|------------|
+| [YYYY-MM-DD] | [Brief description] | [High/Medium/Low] | [What was done] |
+
+### Near-Misses
+
+[List of queries that almost caused problems but were caught by safety checks]
+
+---
+
+## Improvements Made
+
+### Agent Instruction Updates
+
+1. **[Update 1]**
+   - **Reason**: [Why needed]
+   - **Change**: [What was modified in agent instructions]
+   - **Impact**: [Expected improvement]
+
+2. **[Update 2]**
+   - **Reason**: [Why needed]
+   - **Change**: [What was modified]
+   - **Impact**: [Expected improvement]
+
+### Hook/Metrics Updates
+
+- [Any changes to metrics collection or analysis]
+
+---
+
+## A/B Test Results (if applicable)
+
+### Test: [Description]
+
+**Period**: [Start date] to [End date]
+
+**Variants**:
+- **Control (A)**: [Description]
+- **Experiment (B)**: [Description]
+
+**Metrics**:
+
+| Metric | Control (A) | Experiment (B) | Change |
+|--------|-------------|----------------|--------|
+| Safety pass rate | [X]% | [Y]% | [+/-]% |
+| Avg exec time | [X]s | [Y]s | [+/-]s |
+| User satisfaction | [X]/5 | [Y]/5 | [+/-] |
+
+**Decision**: ✅ Promote B / ❌ Keep A / ⏸️ Needs more data
+
+**Rationale**: [Why this decision]
+
+---
+
+## Recommendations
+
+### High Priority
+
+1. **[Recommendation 1]**
+   - **Current state**: [Problem description]
+   - **Proposed change**: [What to do]
+   - **Expected impact**: [Improvement estimate]
+   - **Effort**: Low/Medium/High
+
+### Medium Priority
+
+1. **[Recommendation 2]**
+   - **Current state**: [Problem description]
+   - **Proposed change**: [What to do]
+   - **Expected impact**: [Improvement estimate]
+   - **Effort**: Low/Medium/High
+
+### Low Priority / Future
+
+- [Quick list of nice-to-have improvements]
+
+---
+
+## Next Month Goals
+
+1. **[Goal 1]**: [Specific, measurable target]
+2. **[Goal 2]**: [Specific, measurable target]
+3. **[Goal 3]**: [Specific, measurable target]
+
+---
+
+## Appendix
+
+### Methodology
+
+**Data sources**:
+- `.claude/logs/analytics-metrics.jsonl` (automated metrics)
+- User feedback forms
+- Manual query reviews
+
+**Analysis tools**:
+- `eval/metrics.sh` for automated reporting
+- SQL queries for deep-dive analysis
+- Manual review of safety failures
+
+**Limitations**:
+- [Any known gaps in data collection]
+- [Potential biases in analysis]
+
+### Raw Data
+
+**Export**: `analytics-metrics-[YYYY-MM].json`
+
+**Query**:
+```bash
+jq 'select(.timestamp >= "2026-MM-01" and .timestamp < "2026-MM+1-01")' \
+  .claude/logs/analytics-metrics.jsonl > analytics-metrics-2026-MM.json
+```
+
+---
+
+**Previous Reports**: [Link to folder with past reports]
+
+**Questions?** Contact [evaluation team email/slack]