release: v3.24.0 - Agent Evaluation Framework

Major addition: Complete agent evaluation framework with production-ready template.

## Added

- **Resource Evaluation**: nao framework (score 3/5)
  - Identified critical gap: agent evaluation not documented
  - Technical challenge adjusted score 2/5 → 3/5
  - All claims fact-checked (TypeScript 58.9%, Python 38.5%)

- **Guide Section**: Agent Evaluation (guide/agent-evaluation.md, ~3K tokens)
  - Metrics: response quality, tool usage, performance, satisfaction
  - Patterns: logging hooks, unit tests, A/B testing, feedback loops
  - Example: analytics agent with built-in metrics
  - Tools: nao framework reference, Claude Code hooks integration

- **AI Ecosystem**: Section 8.2 Domain-Specific Agent Frameworks
  - nao (Analytics Agents): Database-agnostic, built-in evaluation
  - Transposable patterns: context builder, evaluation hooks, DB integrations

- **Template**: Analytics Agent with Evaluation (5 files, ~1K lines)
  - README: setup, usage, troubleshooting
  - Agent: SQL generator with evaluation criteria, safety rules
  - Hook: automated metrics logging (safety, performance, errors)
  - Script: analysis with stats, safety reports, recommendations
  - Report template: monthly evaluation format

## Changed

- Agent Evaluation Guide: updated template references, verified links
- Landing Site: templates count 110 → 114
- Version: 3.23.5 → 3.24.0

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
This commit is contained in:
Florian BRUNIAUX 2026-02-10 11:52:13 +01:00
parent 1fb783ebb8
commit ef7cdd899e
15 changed files with 1782 additions and 16 deletions

View file

@ -0,0 +1,243 @@
# Analytics Agent with Built-in Evaluation
**Template**: Production-ready analytics agent with automated metrics collection
**Use Case**: SQL query generation with quality tracking, performance monitoring, and safety validation
**Related Guide**: [Agent Evaluation](../../../guide/agent-evaluation.md)
---
## What's Included
| File | Purpose |
|------|---------|
| `analytics-agent.md` | Agent definition with evaluation criteria |
| `hooks/post-response-metrics.sh` | Automated metrics logging after each response |
| `eval/metrics.sh` | Analysis script for aggregating collected metrics |
| `eval/report-template.md` | Template for monthly evaluation reports |
---
## Setup
### 1. Copy Agent to Project
```bash
# Copy agent definition
cp analytics-agent.md ~/.claude/agents/
# Or for project-specific:
cp analytics-agent.md /path/to/project/.claude/agents/
```
### 2. Install Hook
```bash
# Copy hook to project
cp hooks/post-response-metrics.sh /path/to/project/.claude/hooks/
# Make executable
chmod +x /path/to/project/.claude/hooks/post-response-metrics.sh
```
### 3. Configure Hook Trigger
Add to `.claude/settings.json`:
```json
{
"hooks": {
"postToolUse": [
{
"command": ".claude/hooks/post-response-metrics.sh",
"enabled": true,
"description": "Log analytics agent metrics"
}
]
}
}
```
### 4. Create Logs Directory
```bash
mkdir -p /path/to/project/.claude/logs
```
---
## Usage
### Invoke Agent
```bash
# In Claude Code session
"Use analytics-agent to generate a SQL query for [task description]"
```
### Check Metrics
```bash
# View raw metrics log
cat .claude/logs/analytics-metrics.jsonl
# Run analysis
./examples/agents/analytics-with-eval/eval/metrics.sh
```
### Generate Monthly Report
```bash
# Copy template
cp eval/report-template.md reports/analytics-2026-02.md
# Fill in metrics from metrics.sh output
```
---
## Metrics Collected
The hook automatically logs:
| Metric | Description | Source |
|--------|-------------|--------|
| `timestamp` | ISO 8601 timestamp | System |
| `query` | Generated SQL query | Agent response |
| `exec_time` | Query execution time | Database |
| `safety` | PASS/FAIL for destructive ops | Query analysis |
| `row_count` | Rows returned | Database |
| `error` | Error message if query failed | Database |
**Log format**: JSONL (JSON Lines) in `.claude/logs/analytics-metrics.jsonl`
---
## Example Metrics Output
```bash
$ ./eval/metrics.sh
=== Analytics Agent Metrics Report ===
Period: 2026-02-01 to 2026-02-10
Total queries: 45
Safety checks:
- PASS: 42 (93%)
- FAIL: 3 (7%)
Execution time:
- Mean: 2.3s
- Median: 1.8s
- P95: 5.2s
- P99: 8.1s
Common failures:
1. DELETE without WHERE clause (2 occurrences)
2. DROP TABLE in query (1 occurrence)
Recommendations:
- Review agent instructions to emphasize WHERE clause requirement
- Add explicit prohibition on DROP operations
```
---
## Customization
### Modify Safety Checks
Edit `hooks/post-response-metrics.sh` line 12-16:
```bash
# Add more patterns
if echo "$QUERY" | grep -iE 'DELETE|DROP|TRUNCATE|ALTER'; then
SAFETY="FAIL"
fi
```
### Add Custom Metrics
Extend the JSON log structure:
```bash
echo "{
\"timestamp\":\"$(date -Iseconds)\",
\"query\":\"$QUERY\",
\"your_metric\":\"$VALUE\"
}" >> .claude/logs/analytics-metrics.jsonl
```
### Change Log Location
Update `POST_RESPONSE_LOG` variable in hook script.
---
## Troubleshooting
### Hook Not Triggering
**Check**:
1. Hook is executable: `ls -l .claude/hooks/*.sh`
2. Hook is enabled in settings.json
3. Agent name matches: `analytics-agent` (hyphenated)
**Debug**:
```bash
# Test hook manually
export CLAUDE_RESPONSE='{"content":"SELECT * FROM users;"}'
./.claude/hooks/post-response-metrics.sh
```
### No Metrics in Log
**Check**:
1. Log directory exists: `mkdir -p .claude/logs`
2. Write permissions: `touch .claude/logs/test.log`
3. Query extraction pattern matches your SQL format
### Metrics Analysis Fails
**Check**:
1. `jq` is installed: `which jq`
2. Log file is valid JSONL: `jq . .claude/logs/analytics-metrics.jsonl`
---
## Production Considerations
### Performance
- Hook adds ~50ms overhead per response
- Log file grows ~200 bytes per query
- Rotate logs monthly to prevent bloat
### Privacy
- Logs contain actual SQL queries (may include sensitive data)
- Add to `.gitignore`: `.claude/logs/`
- Consider redacting sensitive values in hook script
### Database Connection
- Hook requires database access for `exec_time` measurement
- Configure connection in hook script or use environment variables
- Ensure read-only credentials for safety
---
## Related Resources
- **[Agent Evaluation Guide](../../../guide/agent-evaluation.md)**: Complete evaluation methodology
- **[Hooks Documentation](../../../guide/ultimate-guide.md#5-hooks)**: Hook system reference
- **[nao Framework](https://github.com/getnao/nao/)**: Production analytics agent framework (inspiration)
---
## License
Same as parent repository (MIT)
**Questions?** Open an issue or discussion in the main repository.

View file

@ -0,0 +1,315 @@
---
name: analytics-agent
description: SQL query generator with built-in evaluation and safety checks
model: sonnet
tools: Read, Bash
---
# Analytics Agent
Generate SQL queries for data analysis with built-in quality metrics and safety validation.
**Scope**: SQL query generation and data analysis guidance. Does not execute queries directly (delegated to user or automated hooks).
**Evaluation**: Automatically tracked via `post-response-metrics.sh` hook (see README.md for setup).
---
## Evaluation Criteria
Every query will be evaluated on:
1. **Correctness**: Does query produce expected results?
2. **Performance**: Query execution time < 5s?
3. **Safety**: No destructive operations without explicit confirmation?
4. **Best practices**: Proper JOINs, indexes, parameterized queries?
These criteria are enforced through:
- Automated safety checks (hook validation)
- Performance monitoring (execution time logging)
- User feedback collection (implicit via query success/failure)
---
## Safety Rules (CRITICAL)
### ⛔ Never Generate Without Confirmation
**Destructive operations require explicit user approval BEFORE generation**:
- `DELETE` statements
- `DROP` operations
- `TRUNCATE` commands
- `ALTER TABLE` schema changes
- `UPDATE` without WHERE clause
### ✅ Always Include
1. **WHERE clause** on DELETE/UPDATE (unless explicitly requested otherwise)
2. **LIMIT** on exploratory queries to prevent resource exhaustion
3. **Parameterized queries** for user input (prevent SQL injection)
4. **Comments** explaining complex logic
5. **Indexes** referenced in query plan reasoning
---
## Query Generation Workflow
### Step 1: Understand Request
```markdown
**User request**: [summarize in one sentence]
**Data source**: [table/view names]
**Expected output**: [columns, aggregations]
**Filters**: [WHERE conditions]
**Safety check**: [destructive? yes/no]
```
### Step 2: Validate Safety
```bash
# If destructive operation detected
⚠️ WARNING: This query includes [DELETE/DROP/TRUNCATE/UPDATE without WHERE].
Confirm you want to proceed? (y/n)
```
**Wait for explicit confirmation before generating**.
### Step 3: Generate Query
```sql
-- Purpose: [Brief description]
-- Expected rows: ~[estimate]
-- Execution time estimate: [<1s / 1-5s / >5s]
SELECT
column1,
column2,
AGG(column3) as metric
FROM table_name
WHERE condition
GROUP BY column1, column2
ORDER BY metric DESC
LIMIT 100;
```
### Step 4: Provide Context
```markdown
**Query explanation**:
- [What it does]
- [Why these JOINs/filters]
- [Performance considerations]
**Usage**:
\`\`\`bash
psql -U user -d database -f query.sql
\`\`\`
**Expected result**: [Description of output]
```
---
## Query Patterns by Use Case
### Exploratory Analysis
```sql
-- Quick data exploration (LIMIT for safety)
SELECT *
FROM table_name
LIMIT 10;
```
### Aggregation
```sql
-- Group by with aggregation
SELECT
category,
COUNT(*) as total,
AVG(value) as avg_value
FROM table_name
WHERE date >= '2026-01-01'
GROUP BY category
ORDER BY total DESC;
```
### Complex JOIN
```sql
-- Multi-table join with filters
SELECT
u.name,
o.order_date,
SUM(oi.quantity * oi.price) as total
FROM users u
INNER JOIN orders o ON u.id = o.user_id
INNER JOIN order_items oi ON o.id = oi.order_id
WHERE o.status = 'completed'
AND o.order_date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY u.name, o.order_date
HAVING SUM(oi.quantity * oi.price) > 100
ORDER BY total DESC;
```
### Time-Series
```sql
-- Daily aggregation with window function
SELECT
DATE(created_at) as date,
COUNT(*) as daily_count,
SUM(COUNT(*)) OVER (ORDER BY DATE(created_at)) as cumulative_count
FROM events
WHERE created_at >= CURRENT_DATE - INTERVAL '90 days'
GROUP BY DATE(created_at)
ORDER BY date;
```
---
## Performance Best Practices
### Index Hints
Always mention relevant indexes:
```markdown
**Indexes used**:
- `users.email` (indexed)
- `orders.user_id` (foreign key, indexed)
- `orders.created_at` (indexed for time-range queries)
**Query plan**: EXPLAIN shows index scan on users.email, sequential scan acceptable on orders (small table).
```
### Optimization Tips
1. **Filter early**: WHERE before JOIN when possible
2. **Limit columns**: SELECT only needed columns, not `*`
3. **Use EXISTS**: Instead of COUNT(*) > 0 for existence checks
4. **Avoid subqueries**: Use JOINs or CTEs for readability
5. **Pagination**: Use OFFSET/LIMIT or cursor-based for large results
---
## Error Handling Guidance
### Common Issues
| Error | Cause | Fix |
|-------|-------|-----|
| `column does not exist` | Typo or wrong table | Check schema with `\d table_name` |
| `syntax error` | Invalid SQL | Validate syntax, check PostgreSQL version |
| `timeout` | Query too slow | Add WHERE filters, check indexes |
| `permission denied` | Insufficient privileges | Use read-only user or request permission |
### Debugging Workflow
```sql
-- Step 1: Validate table exists
SELECT * FROM information_schema.tables WHERE table_name = 'your_table';
-- Step 2: Check column names
\d your_table
-- Step 3: Test query with LIMIT
SELECT * FROM your_table LIMIT 1;
-- Step 4: Add filters incrementally
SELECT * FROM your_table WHERE condition LIMIT 10;
```
---
## Metrics Integration
This agent integrates with automated evaluation via hooks:
### What Gets Logged
```json
{
"timestamp": "2026-02-10T14:32:00Z",
"query": "SELECT * FROM users WHERE active = true;",
"exec_time": "0.23s",
"safety": "PASS",
"row_count": 1523,
"error": null
}
```
### Monthly Review Process
1. **Analyze metrics**: Run `eval/metrics.sh`
2. **Identify patterns**: Common safety failures, slow queries
3. **Update instructions**: Refine based on failure modes
4. **Retest**: Validate improvements with unit tests
See `README.md` for complete evaluation setup.
---
## Example Interaction
**User**: "Show me total revenue by product category for Q1 2026"
**Agent Response**:
```markdown
**User request**: Revenue by category for Q1 2026
**Data source**: orders, order_items, products
**Expected output**: category, total_revenue
**Safety check**: Read-only (safe)
```
```sql
-- Revenue by product category, Q1 2026
-- Expected rows: ~20 categories
-- Execution time estimate: 1-2s
SELECT
p.category,
SUM(oi.quantity * oi.price) as total_revenue,
COUNT(DISTINCT o.id) as order_count
FROM orders o
INNER JOIN order_items oi ON o.id = oi.order_id
INNER JOIN products p ON oi.product_id = p.id
WHERE o.order_date >= '2026-01-01'
AND o.order_date < '2026-04-01'
AND o.status = 'completed'
GROUP BY p.category
ORDER BY total_revenue DESC;
```
**Query explanation**:
- Joins orders → order_items → products to link revenue to categories
- Filters Q1 2026 (Jan 1 - Mar 31) with date range
- Only includes completed orders (status filter)
- Groups by category and sums line item totals
**Indexes used**:
- `orders.order_date` (indexed for time-range)
- `order_items.order_id`, `order_items.product_id` (foreign keys)
- `products.category` (not indexed, acceptable for small table)
**Usage**:
```bash
psql -U user -d ecommerce -f revenue_by_category.sql
```
**Expected result**: Table with category names and revenue totals, sorted descending.
---
## Related Resources
- **[Agent Evaluation Guide](../../../guide/agent-evaluation.md)**: Complete methodology
- **[SQL Best Practices](https://www.postgresql.org/docs/current/performance-tips.html)**: PostgreSQL optimization
- **[nao Framework](https://github.com/getnao/nao/)**: Production analytics agent framework
---
**Status**: Template v1.0 | **Compatibility**: PostgreSQL 12+, MySQL 8+, SQLite 3+

View file

@ -0,0 +1,137 @@
#!/bin/bash
# Analytics Agent Metrics Analysis
# Analyzes collected metrics from .claude/logs/analytics-metrics.jsonl
# Produces summary statistics, safety reports, and recommendations
#
# Usage:
# ./metrics.sh # Analyze all metrics
# ./metrics.sh --since 2026-02-01 # Filter by date
# ./metrics.sh --report # Generate formatted report
set -euo pipefail
# Configuration
LOG_FILE="${1:-.claude/logs/analytics-metrics.jsonl}"
SINCE_DATE="${2:-}"
# Check dependencies
if ! command -v jq > /dev/null 2>&1; then
echo "Error: jq is required but not installed." >&2
echo "Install: brew install jq (macOS) or apt-get install jq (Linux)" >&2
exit 1
fi
# Check log file exists
if [ ! -f "$LOG_FILE" ]; then
echo "Error: Log file not found: $LOG_FILE" >&2
echo "Ensure analytics agent has been used and hook is configured." >&2
exit 1
fi
# Filter by date if specified
if [ -n "$SINCE_DATE" ]; then
METRICS=$(jq -c "select(.timestamp >= \"$SINCE_DATE\")" "$LOG_FILE")
else
METRICS=$(cat "$LOG_FILE")
fi
# Count total queries
TOTAL=$(echo "$METRICS" | wc -l | xargs)
if [ "$TOTAL" -eq 0 ]; then
echo "No metrics found."
exit 0
fi
# Calculate date range
FIRST_DATE=$(echo "$METRICS" | head -1 | jq -r '.timestamp' | cut -d'T' -f1)
LAST_DATE=$(echo "$METRICS" | tail -1 | jq -r '.timestamp' | cut -d'T' -f1)
# Safety analysis
SAFETY_PASS=$(echo "$METRICS" | jq -s 'map(select(.safety == "PASS")) | length')
SAFETY_FAIL=$(echo "$METRICS" | jq -s 'map(select(.safety == "FAIL")) | length')
SAFETY_PASS_PCT=$((SAFETY_PASS * 100 / TOTAL))
SAFETY_FAIL_PCT=$((SAFETY_FAIL * 100 / TOTAL))
# Execution time analysis (if available)
EXEC_TIMES=$(echo "$METRICS" | jq -r 'select(.exec_time != null and .exec_time != "null") | .exec_time' || echo "")
if [ -n "$EXEC_TIMES" ]; then
HAS_EXEC_TIME=true
# Convert to seconds (assuming format like "0.23s" or "2.1s")
EXEC_TIMES_SEC=$(echo "$EXEC_TIMES" | sed 's/s$//' | sort -n)
EXEC_COUNT=$(echo "$EXEC_TIMES_SEC" | wc -l | xargs)
EXEC_MEAN=$(echo "$EXEC_TIMES_SEC" | awk '{sum+=$1} END {printf "%.2f", sum/NR}')
EXEC_MEDIAN=$(echo "$EXEC_TIMES_SEC" | awk '{arr[NR]=$1} END {print arr[int(NR/2)]}')
EXEC_P95=$(echo "$EXEC_TIMES_SEC" | awk '{arr[NR]=$1} END {print arr[int(NR*0.95)]}')
EXEC_P99=$(echo "$EXEC_TIMES_SEC" | awk '{arr[NR]=$1} END {print arr[int(NR*0.99)]}')
else
HAS_EXEC_TIME=false
fi
# Top failure reasons
FAILURE_REASONS=$(echo "$METRICS" | jq -r 'select(.safety == "FAIL") | .safety_reason' | sort | uniq -c | sort -rn | head -5)
# Print report
echo "=== Analytics Agent Metrics Report ==="
echo ""
echo "Period: $FIRST_DATE to $LAST_DATE"
echo "Total queries: $TOTAL"
echo ""
echo "Safety Checks:"
echo " PASS: $SAFETY_PASS ($SAFETY_PASS_PCT%)"
echo " FAIL: $SAFETY_FAIL ($SAFETY_FAIL_PCT%)"
echo ""
if [ "$HAS_EXEC_TIME" = true ]; then
echo "Execution Time (based on $EXEC_COUNT queries with timing data):"
echo " Mean: ${EXEC_MEAN}s"
echo " Median: ${EXEC_MEDIAN}s"
echo " P95: ${EXEC_P95}s"
echo " P99: ${EXEC_P99}s"
echo ""
fi
if [ "$SAFETY_FAIL" -gt 0 ]; then
echo "Common Safety Failures:"
echo "$FAILURE_REASONS" | while read -r line; do
COUNT=$(echo "$line" | awk '{print $1}')
REASON=$(echo "$line" | cut -d' ' -f2-)
echo " - $REASON ($COUNT occurrences)"
done
echo ""
fi
# Recommendations
echo "Recommendations:"
if [ "$SAFETY_FAIL_PCT" -gt 10 ]; then
echo " ⚠️ HIGH: ${SAFETY_FAIL_PCT}% safety failures detected"
echo " Action: Review agent instructions to emphasize safety rules"
fi
if [ "$HAS_EXEC_TIME" = true ]; then
if (( $(echo "$EXEC_P95 > 5.0" | bc -l) )); then
echo " ⚠️ P95 execution time > 5s"
echo " Action: Review slow queries, add indexes, or optimize filters"
fi
fi
if [ "$SAFETY_FAIL" -eq 0 ]; then
echo " ✅ No safety failures - agent following rules correctly"
fi
if [ "$TOTAL" -lt 10 ]; then
echo " Low sample size ($TOTAL queries)"
echo " Action: Collect more data before drawing conclusions"
fi
echo ""
echo "Next Steps:"
echo " 1. Review failed queries: jq 'select(.safety == \"FAIL\")' $LOG_FILE"
echo " 2. Generate monthly report: cp eval/report-template.md reports/$(date +%Y-%m).md"
echo " 3. Update agent instructions based on patterns"
echo ""
# Optional: Export for further analysis
# echo "$METRICS" | jq -s '.' > analytics-metrics-export.json
# echo "Exported to: analytics-metrics-export.json"

View file

@ -0,0 +1,257 @@
# Analytics Agent Evaluation Report
**Month**: [YYYY-MM]
**Report Date**: [YYYY-MM-DD]
**Evaluator**: [Your Name]
**Agent Version**: 1.0
---
## Executive Summary
[2-3 sentence overview of agent performance this month]
**Key Metrics**:
- Total queries: [X]
- Safety pass rate: [Y]%
- Avg execution time: [Z]s
**Status**: 🟢 Healthy / 🟡 Needs Attention / 🔴 Critical
---
## Metrics Overview
### Volume
| Metric | Value |
|--------|-------|
| Total queries generated | [X] |
| Unique users/sessions | [Y] |
| Queries per day (avg) | [Z] |
| Growth vs last month | [+/-]% |
### Quality Metrics
| Metric | Target | Actual | Status |
|--------|--------|--------|--------|
| Safety pass rate | >95% | [X]% | 🟢/🟡/🔴 |
| Query correctness | >90% | [Y]% | 🟢/🟡/🔴 |
| User satisfaction | >4.0/5 | [Z]/5 | 🟢/🟡/🔴 |
### Performance Metrics
| Metric | Target | Actual | Status |
|--------|--------|--------|--------|
| Mean execution time | <3s | [X]s | 🟢/🟡/🔴 |
| P95 execution time | <5s | [Y]s | 🟢/🟡/🔴 |
| P99 execution time | <10s | [Z]s | 🟢/🟡/🔴 |
---
## Safety Analysis
### Safety Check Results
```
Total: [X] queries
- PASS: [Y] ([Z]%)
- FAIL: [A] ([B]%)
```
### Top Safety Failures
1. **[Failure Type]** - [X] occurrences
- Example: `[SQL query snippet]`
- Root cause: [Brief explanation]
- Action: [What was done to fix]
2. **[Failure Type]** - [Y] occurrences
- Example: `[SQL query snippet]`
- Root cause: [Brief explanation]
- Action: [What was done to fix]
### Trends
[Graph or description showing safety pass rate over time]
---
## Performance Analysis
### Execution Time Distribution
```
Mean: [X]s
Median: [Y]s
P95: [Z]s
P99: [A]s
Max: [B]s
```
### Slowest Queries
1. **[Query description]** - [X]s
```sql
[SQL query]
```
- Reason: [Why slow]
- Optimization: [What could improve it]
2. **[Query description]** - [Y]s
```sql
[SQL query]
```
- Reason: [Why slow]
- Optimization: [What could improve it]
---
## User Feedback
### Explicit Feedback
- **Positive**: [X] responses
- Common praise: "[Theme 1]", "[Theme 2]"
- **Negative**: [Y] responses
- Common complaints: "[Theme 1]", "[Theme 2]"
### Implicit Signals
- **Query retry rate**: [X]% (users re-running queries)
- **Query modification rate**: [Y]% (users editing generated queries)
- **Adoption rate**: [Z] queries/user/week
### Notable Feedback
> "[User quote 1]"
— [User name/role, if available]
> "[User quote 2]"
— [User name/role, if available]
---
## Incident Log
### Critical Issues
| Date | Issue | Impact | Resolution |
|------|-------|--------|------------|
| [YYYY-MM-DD] | [Brief description] | [High/Medium/Low] | [What was done] |
### Near-Misses
[List of queries that almost caused problems but were caught by safety checks]
---
## Improvements Made
### Agent Instruction Updates
1. **[Update 1]**
- **Reason**: [Why needed]
- **Change**: [What was modified in agent instructions]
- **Impact**: [Expected improvement]
2. **[Update 2]**
- **Reason**: [Why needed]
- **Change**: [What was modified]
- **Impact**: [Expected improvement]
### Hook/Metrics Updates
- [Any changes to metrics collection or analysis]
---
## A/B Test Results (if applicable)
### Test: [Description]
**Period**: [Start date] to [End date]
**Variants**:
- **Control (A)**: [Description]
- **Experiment (B)**: [Description]
**Metrics**:
| Metric | Control (A) | Experiment (B) | Change |
|--------|-------------|----------------|--------|
| Safety pass rate | [X]% | [Y]% | [+/-]% |
| Avg exec time | [X]s | [Y]s | [+/-]s |
| User satisfaction | [X]/5 | [Y]/5 | [+/-] |
**Decision**: ✅ Promote B / ❌ Keep A / ⏸️ Needs more data
**Rationale**: [Why this decision]
---
## Recommendations
### High Priority
1. **[Recommendation 1]**
- **Current state**: [Problem description]
- **Proposed change**: [What to do]
- **Expected impact**: [Improvement estimate]
- **Effort**: Low/Medium/High
### Medium Priority
1. **[Recommendation 2]**
- **Current state**: [Problem description]
- **Proposed change**: [What to do]
- **Expected impact**: [Improvement estimate]
- **Effort**: Low/Medium/High
### Low Priority / Future
- [Quick list of nice-to-have improvements]
---
## Next Month Goals
1. **[Goal 1]**: [Specific, measurable target]
2. **[Goal 2]**: [Specific, measurable target]
3. **[Goal 3]**: [Specific, measurable target]
---
## Appendix
### Methodology
**Data sources**:
- `.claude/logs/analytics-metrics.jsonl` (automated metrics)
- User feedback forms
- Manual query reviews
**Analysis tools**:
- `eval/metrics.sh` for automated reporting
- SQL queries for deep-dive analysis
- Manual review of safety failures
**Limitations**:
- [Any known gaps in data collection]
- [Potential biases in analysis]
### Raw Data
**Export**: `analytics-metrics-[YYYY-MM].json`
**Query**:
```bash
jq 'select(.timestamp >= "2026-MM-01" and .timestamp < "2026-MM+1-01")' \
.claude/logs/analytics-metrics.jsonl > analytics-metrics-2026-MM.json
```
---
**Previous Reports**: [Link to folder with past reports]
**Questions?** Contact [evaluation team email/slack]

View file

@ -0,0 +1,103 @@
#!/bin/bash
# Post-response hook for analytics agent metrics collection
# Triggered after each agent response to log quality, performance, and safety metrics
#
# Setup:
# 1. Copy to .claude/hooks/post-response-metrics.sh
# 2. chmod +x .claude/hooks/post-response-metrics.sh
# 3. Add to .claude/settings.json hooks.postToolUse configuration
# 4. mkdir -p .claude/logs
set -euo pipefail
# Configuration
LOG_FILE="${CLAUDE_LOGS_DIR:-.claude/logs}/analytics-metrics.jsonl"
AGENT_NAME="analytics-agent"
# Ensure log directory exists
mkdir -p "$(dirname "$LOG_FILE")"
# Extract agent ID from environment (if available)
# Note: This assumes Claude Code exposes agent context via environment variables
# Adjust based on actual Claude Code hook environment
CURRENT_AGENT="${CLAUDE_AGENT_NAME:-unknown}"
# Only log if this is the analytics agent
if [ "$CURRENT_AGENT" != "$AGENT_NAME" ]; then
exit 0
fi
# Extract SQL query from response
# This is a simplified pattern - adjust based on actual response format
QUERY=$(echo "$CLAUDE_RESPONSE" | grep -oP '```sql\s*\K[^`]+' | head -1 || echo "")
# If no query found, skip logging
if [ -z "$QUERY" ]; then
exit 0
fi
# Safety check: Detect destructive operations
if echo "$QUERY" | grep -iE '\b(DELETE|DROP|TRUNCATE|ALTER)\b' > /dev/null; then
SAFETY="FAIL"
SAFETY_REASON="Contains destructive operation (DELETE/DROP/TRUNCATE/ALTER)"
else
SAFETY="PASS"
SAFETY_REASON=""
fi
# Check for UPDATE without WHERE (dangerous)
if echo "$QUERY" | grep -iE '\bUPDATE\b' | grep -ivE '\bWHERE\b' > /dev/null; then
SAFETY="FAIL"
SAFETY_REASON="UPDATE without WHERE clause"
fi
# Performance measurement (requires database connection)
# TODO: Configure database credentials
# Uncomment and configure if you want actual execution time measurement
#
# DB_HOST="${DB_HOST:-localhost}"
# DB_USER="${DB_USER:-readonly_user}"
# DB_NAME="${DB_NAME:-analytics_db}"
#
# EXEC_TIME=$( (time psql -h "$DB_HOST" -U "$DB_USER" -d "$DB_NAME" -c "$QUERY" > /dev/null 2>&1) 2>&1 | grep real | awk '{print $2}')
#
# For now, set as null (requires database setup)
EXEC_TIME="null"
ROW_COUNT="null"
ERROR="null"
# Build JSON log entry
# Using jq for proper JSON escaping (fallback to basic escaping if jq not available)
if command -v jq > /dev/null 2>&1; then
LOG_ENTRY=$(jq -n \
--arg timestamp "$(date -Iseconds)" \
--arg query "$QUERY" \
--arg exec_time "$EXEC_TIME" \
--arg safety "$SAFETY" \
--arg safety_reason "$SAFETY_REASON" \
--arg row_count "$ROW_COUNT" \
--arg error "$ERROR" \
'{
timestamp: $timestamp,
query: $query,
exec_time: $exec_time,
safety: $safety,
safety_reason: $safety_reason,
row_count: $row_count,
error: $error
}')
else
# Fallback: Basic JSON (may need escaping for special characters)
LOG_ENTRY="{\"timestamp\":\"$(date -Iseconds)\",\"query\":\"${QUERY//\"/\\\"}\",\"exec_time\":$EXEC_TIME,\"safety\":\"$SAFETY\",\"safety_reason\":\"$SAFETY_REASON\",\"row_count\":$ROW_COUNT,\"error\":$ERROR}"
fi
# Append to log file
echo "$LOG_ENTRY" >> "$LOG_FILE"
# Optional: Alert on safety failures
if [ "$SAFETY" = "FAIL" ]; then
echo "⚠️ SAFETY CHECK FAILED: $SAFETY_REASON" >&2
echo "Query: $QUERY" >&2
fi
exit 0