release: v3.24.0 - Agent Evaluation Framework
Major addition: Complete agent evaluation framework with production-ready template.

## Added

- **Resource Evaluation**: nao framework (score 3/5)
  - Identified critical gap: agent evaluation not documented
  - Technical challenge adjusted score 2/5 → 3/5
  - All claims fact-checked (TypeScript 58.9%, Python 38.5%)
- **Guide Section**: Agent Evaluation (guide/agent-evaluation.md, ~3K tokens)
  - Metrics: response quality, tool usage, performance, satisfaction
  - Patterns: logging hooks, unit tests, A/B testing, feedback loops
  - Example: analytics agent with built-in metrics
  - Tools: nao framework reference, Claude Code hooks integration
- **AI Ecosystem**: Section 8.2 Domain-Specific Agent Frameworks
  - nao (Analytics Agents): Database-agnostic, built-in evaluation
  - Transposable patterns: context builder, evaluation hooks, DB integrations
- **Template**: Analytics Agent with Evaluation (5 files, ~1K lines)
  - README: setup, usage, troubleshooting
  - Agent: SQL generator with evaluation criteria, safety rules
  - Hook: automated metrics logging (safety, performance, errors)
  - Script: analysis with stats, safety reports, recommendations
  - Report template: monthly evaluation format

## Changed

- Agent Evaluation Guide: updated template references, verified links
- Landing Site: templates count 110 → 114
- Version: 3.23.5 → 3.24.0

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
parent 1fb783ebb8
commit ef7cdd899e
15 changed files with 1782 additions and 16 deletions
257
examples/agents/analytics-with-eval/eval/report-template.md
Normal file
@@ -0,0 +1,257 @@
# Analytics Agent Evaluation Report

**Month**: [YYYY-MM]
**Report Date**: [YYYY-MM-DD]
**Evaluator**: [Your Name]
**Agent Version**: 1.0

---

## Executive Summary

[2-3 sentence overview of agent performance this month]

**Key Metrics**:
- Total queries: [X]
- Safety pass rate: [Y]%
- Avg execution time: [Z]s

**Status**: 🟢 Healthy / 🟡 Needs Attention / 🔴 Critical

---

## Metrics Overview

### Volume

| Metric | Value |
|--------|-------|
| Total queries generated | [X] |
| Unique users/sessions | [Y] |
| Queries per day (avg) | [Z] |
| Growth vs last month | [+/-]% |

### Quality Metrics

| Metric | Target | Actual | Status |
|--------|--------|--------|--------|
| Safety pass rate | >95% | [X]% | 🟢/🟡/🔴 |
| Query correctness | >90% | [Y]% | 🟢/🟡/🔴 |
| User satisfaction | >4.0/5 | [Z]/5 | 🟢/🟡/🔴 |

### Performance Metrics

| Metric | Target | Actual | Status |
|--------|--------|--------|--------|
| Mean execution time | <3s | [X]s | 🟢/🟡/🔴 |
| P95 execution time | <5s | [Y]s | 🟢/🟡/🔴 |
| P99 execution time | <10s | [Z]s | 🟢/🟡/🔴 |

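The Status column compares each Actual value with its Target, but the template does not say where 🟡 ends and 🔴 begins. The sketch below is one possible rule, assuming a hypothetical 10% tolerance band around the target; `classify_status` and `tolerance` are illustrative names, not part of the template or its scripts.

```python
def classify_status(actual: float, target: float,
                    higher_is_better: bool = True,
                    tolerance: float = 0.10) -> str:
    """Return a traffic-light status for one metric.

    Assumption: a value that misses the target by at most `tolerance`
    (as a fraction of the target) is yellow; anything worse is red.
    """
    if higher_is_better:
        if actual > target:
            return "🟢"
        return "🟡" if actual >= target * (1 - tolerance) else "🔴"
    if actual < target:
        return "🟢"
    return "🟡" if actual <= target * (1 + tolerance) else "🔴"


# Targets come from the tables above; the actual values are made up.
print(classify_status(97.2, 95.0))                        # safety pass rate, target >95%
print(classify_status(4.1, 3.0, higher_is_better=False))  # mean execution time, target <3s
```

A team that prefers absolute bands (for example, yellow within 2 percentage points) could swap the relative tolerance for a fixed offset without changing the call sites.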
---

## Safety Analysis

### Safety Check Results

```
Total: [X] queries
- PASS: [Y] ([Z]%)
- FAIL: [A] ([B]%)
```

### Top Safety Failures

1. **[Failure Type]** - [X] occurrences
   - Example: `[SQL query snippet]`
   - Root cause: [Brief explanation]
   - Action: [What was done to fix]

2. **[Failure Type]** - [Y] occurrences
   - Example: `[SQL query snippet]`
   - Root cause: [Brief explanation]
   - Action: [What was done to fix]

### Trends

[Graph or description showing safety pass rate over time]

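The pass/fail counts and the ranked failure list in this section can be derived from the JSONL metrics log. The following is a minimal sketch, not the template's own script: it assumes each record carries a `safety` field (`"PASS"` or `"FAIL"`) and, for failures, a `failure_type` field. Both field names are hypothetical and should be adjusted to the real log schema.

```python
import json
from collections import Counter


def summarize_safety(lines):
    """Summarize safety results from an iterable of JSONL lines.

    Assumption: records have a "safety" field ("PASS"/"FAIL") and
    failures have a "failure_type" field (hypothetical schema).
    """
    records = [json.loads(line) for line in lines if line.strip()]
    total = len(records)
    passed = sum(1 for r in records if r.get("safety") == "PASS")
    failures = [r for r in records if r.get("safety") == "FAIL"]
    pass_rate = 100.0 * passed / total if total else 0.0
    # Rank failure types by number of occurrences, most frequent first.
    top = Counter(r.get("failure_type", "unknown") for r in failures).most_common(3)
    return {
        "total": total,
        "pass": passed,
        "fail": len(failures),
        "pass_rate": round(pass_rate, 1),
        "top_failures": top,
    }


# Made-up sample records for illustration only.
sample = [
    '{"safety": "PASS"}',
    '{"safety": "FAIL", "failure_type": "missing_limit"}',
    '{"safety": "PASS"}',
    '{"safety": "FAIL", "failure_type": "missing_limit"}',
    '{"safety": "FAIL", "failure_type": "write_statement"}',
]
print(summarize_safety(sample))
```

In practice the `lines` argument could be an open file handle on the metrics log, since iterating a file yields one line at a time.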
---

## Performance Analysis

### Execution Time Distribution

```
Mean: [X]s
Median: [Y]s
P95: [Z]s
P99: [A]s
Max: [B]s
```

### Slowest Queries

1. **[Query description]** - [X]s
   ```sql
   [SQL query]
   ```
   - Reason: [Why slow]
   - Optimization: [What could improve it]

2. **[Query description]** - [Y]s
   ```sql
   [SQL query]
   ```
   - Reason: [Why slow]
   - Optimization: [What could improve it]

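The five figures in the Execution Time Distribution block can be computed from a list of per-query durations. The sketch below uses the nearest-rank percentile definition, which is one of several common conventions and an assumption here; the template does not say which one `eval/metrics.sh` uses. The helper names and sample values are illustrative.

```python
import math
import statistics


def percentile(sorted_values, p):
    """Nearest-rank percentile: the smallest value such that at least
    p percent of the data is less than or equal to it."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def execution_time_summary(times_s):
    """Return mean, median, P95, P99 and max for a non-empty list of seconds."""
    values = sorted(times_s)
    return {
        "mean": round(statistics.mean(values), 2),
        "median": statistics.median(values),
        "p95": percentile(values, 95),
        "p99": percentile(values, 99),
        "max": values[-1],
    }


# Made-up execution times in seconds.
times = [0.8, 1.2, 0.9, 2.5, 1.1, 7.4, 1.0, 1.3, 0.7, 3.9]
print(execution_time_summary(times))
```

With small samples, P95 and P99 collapse onto the maximum, so the tail percentiles only become informative once a month contains a few hundred queries.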
---

## User Feedback

### Explicit Feedback

- **Positive**: [X] responses
  - Common praise: "[Theme 1]", "[Theme 2]"
- **Negative**: [Y] responses
  - Common complaints: "[Theme 1]", "[Theme 2]"

### Implicit Signals

- **Query retry rate**: [X]% (users re-running queries)
- **Query modification rate**: [Y]% (users editing generated queries)
- **Adoption rate**: [Z] queries/user/week

### Notable Feedback

> "[User quote 1]"
— [User name/role, if available]

> "[User quote 2]"
— [User name/role, if available]

---

## Incident Log

### Critical Issues

| Date | Issue | Impact | Resolution |
|------|-------|--------|------------|
| [YYYY-MM-DD] | [Brief description] | [High/Medium/Low] | [What was done] |

### Near-Misses

[List of queries that almost caused problems but were caught by safety checks]

---

## Improvements Made

### Agent Instruction Updates

1. **[Update 1]**
   - **Reason**: [Why needed]
   - **Change**: [What was modified in agent instructions]
   - **Impact**: [Expected improvement]

2. **[Update 2]**
   - **Reason**: [Why needed]
   - **Change**: [What was modified]
   - **Impact**: [Expected improvement]

### Hook/Metrics Updates

- [Any changes to metrics collection or analysis]

---

## A/B Test Results (if applicable)

### Test: [Description]

**Period**: [Start date] to [End date]

**Variants**:
- **Control (A)**: [Description]
- **Experiment (B)**: [Description]

**Metrics**:

| Metric | Control (A) | Experiment (B) | Change |
|--------|-------------|----------------|--------|
| Safety pass rate | [X]% | [Y]% | [+/-]% |
| Avg exec time | [X]s | [Y]s | [+/-]s |
| User satisfaction | [X]/5 | [Y]/5 | [+/-] |

**Decision**: ✅ Promote B / ❌ Keep A / ⏸️ Needs more data

**Rationale**: [Why this decision]

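The template leaves open how to choose between "Promote B", "Keep A", and "Needs more data". One possible approach for the safety pass rate, offered here only as a sketch, is a two-sided two-proportion z-test. The 0.05 significance threshold and the function names are assumptions, and the normal approximation is unreliable for small samples.

```python
import math


def two_proportion_z_test(pass_a, total_a, pass_b, total_b):
    """Two-sided z-test for the difference between two pass rates.

    Returns (difference B - A in percentage points, p-value).
    """
    rate_a = pass_a / total_a
    rate_b = pass_b / total_b
    pooled = (pass_a + pass_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if std_err == 0:
        return 0.0, 1.0  # both variants are all-pass or all-fail
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return 100 * (rate_b - rate_a), p_value


def decide(diff_pp, p_value, alpha=0.05):
    """Map the test result onto the template's three decisions.

    Assumption: alpha=0.05 is a conventional default, not a template rule.
    """
    if p_value >= alpha:
        return "Needs more data"
    return "Promote B" if diff_pp > 0 else "Keep A"


# Made-up counts: 930/1000 safe queries for A, 960/1000 for B.
diff, p = two_proportion_z_test(930, 1000, 960, 1000)
print(f"change: {diff:+.1f} pp, p-value: {p:.4f}, decision: {decide(diff, p)}")
```

A single metric rarely settles the decision on its own; execution time and user satisfaction from the table above would still need to be weighed in the Rationale.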
---

## Recommendations

### High Priority

1. **[Recommendation 1]**
   - **Current state**: [Problem description]
   - **Proposed change**: [What to do]
   - **Expected impact**: [Improvement estimate]
   - **Effort**: Low/Medium/High

### Medium Priority

1. **[Recommendation 2]**
   - **Current state**: [Problem description]
   - **Proposed change**: [What to do]
   - **Expected impact**: [Improvement estimate]
   - **Effort**: Low/Medium/High

### Low Priority / Future

- [Quick list of nice-to-have improvements]

---

## Next Month Goals

1. **[Goal 1]**: [Specific, measurable target]
2. **[Goal 2]**: [Specific, measurable target]
3. **[Goal 3]**: [Specific, measurable target]

---

## Appendix

### Methodology

**Data sources**:
- `.claude/logs/analytics-metrics.jsonl` (automated metrics)
- User feedback forms
- Manual query reviews

**Analysis tools**:
- `eval/metrics.sh` for automated reporting
- SQL queries for deep-dive analysis
- Manual review of safety failures

**Limitations**:
- [Any known gaps in data collection]
- [Potential biases in analysis]

### Raw Data

**Export**: `analytics-metrics-[YYYY-MM].json`

**Query**:
```bash
# Replace MM with the report month and MM+1 with the following month
# (for December, use 01 of the next year).
jq 'select(.timestamp >= "2026-MM-01" and .timestamp < "2026-MM+1-01")' \
  .claude/logs/analytics-metrics.jsonl > analytics-metrics-2026-MM.json
```

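The jq filter above needs the month boundaries filled in by hand, which is easy to get wrong in December. As a sketch of the same filter with the boundaries computed automatically, the Python below assumes, like the jq command, that every record has an ISO 8601 `timestamp` string in a single consistent time zone; the helper names are illustrative.

```python
import json
from datetime import date


def month_bounds(year, month):
    """Return ISO date strings for the first day of the month and of the next month."""
    start = date(year, month, 1)
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return start.isoformat(), end.isoformat()


def filter_month(lines, year, month):
    """Keep JSONL records whose "timestamp" falls inside the given month.

    Assumption: timestamps are ISO 8601 strings, so plain string
    comparison matches chronological order, as in the jq filter.
    """
    start, end = month_bounds(year, month)
    records = (json.loads(line) for line in lines if line.strip())
    return [r for r in records if start <= r.get("timestamp", "") < end]


# Made-up records around a year boundary.
sample_log = [
    '{"timestamp": "2026-11-30T23:59:59Z", "id": 1}',
    '{"timestamp": "2026-12-01T00:00:00Z", "id": 2}',
    '{"timestamp": "2026-12-31T23:59:00Z", "id": 3}',
    '{"timestamp": "2027-01-01T00:00:00Z", "id": 4}',
]
print(json.dumps(filter_month(sample_log, 2026, 12), indent=2))
```

Because `json.dumps` serializes the filtered list as one array, the result is a single valid JSON document, which matches the `.json` extension of the export file.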
---

**Previous Reports**: [Link to folder with past reports]

**Questions?** Contact [evaluation team email/slack]