marketing-shibata50/claude-code-ultimate-guide

Florian BRUNIAUX ef7cdd899e release: v3.24.0 - Agent Evaluation Framework

Major addition: Complete agent evaluation framework with production-ready template.

## Added

- **Resource Evaluation**: nao framework (score 3/5)
  - Identified critical gap: agent evaluation not documented
  - Technical challenge adjusted score 2/5 → 3/5
  - All claims fact-checked (TypeScript 58.9%, Python 38.5%)

- **Guide Section**: Agent Evaluation (guide/agent-evaluation.md, ~3K tokens)
  - Metrics: response quality, tool usage, performance, satisfaction
  - Patterns: logging hooks, unit tests, A/B testing, feedback loops
  - Example: analytics agent with built-in metrics
  - Tools: nao framework reference, Claude Code hooks integration

- **AI Ecosystem**: Section 8.2 Domain-Specific Agent Frameworks
  - nao (Analytics Agents): Database-agnostic, built-in evaluation
  - Transposable patterns: context builder, evaluation hooks, DB integrations

- **Template**: Analytics Agent with Evaluation (5 files, ~1K lines)
  - README: setup, usage, troubleshooting
  - Agent: SQL generator with evaluation criteria, safety rules
  - Hook: automated metrics logging (safety, performance, errors)
  - Script: analysis with stats, safety reports, recommendations
  - Report template: monthly evaluation format

## Changed

- Agent Evaluation Guide: updated template references, verified links
- Landing Site: templates count 110 → 114
- Version: 3.23.5 → 3.24.0

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2026-02-10 11:52:13 +01:00

5 KiB

Raw Blame History

Analytics Agent with Built-in Evaluation

Template: Production-ready analytics agent with automated metrics collection

Use Case: SQL query generation with quality tracking, performance monitoring, and safety validation

Related Guide: Agent Evaluation

What's Included

File	Purpose
`analytics-agent.md`	Agent definition with evaluation criteria
`hooks/post-response-metrics.sh`	Automated metrics logging after each response
`eval/metrics.sh`	Analysis script for aggregating collected metrics
`eval/report-template.md`	Template for monthly evaluation reports

Setup

1. Copy Agent to Project

# Copy agent definition
cp analytics-agent.md ~/.claude/agents/

# Or for project-specific:
cp analytics-agent.md /path/to/project/.claude/agents/

2. Install Hook

# Copy hook to project
cp hooks/post-response-metrics.sh /path/to/project/.claude/hooks/

# Make executable
chmod +x /path/to/project/.claude/hooks/post-response-metrics.sh

3. Configure Hook Trigger

Add to .claude/settings.json:

{
  "hooks": {
    "postToolUse": [
      {
        "command": ".claude/hooks/post-response-metrics.sh",
        "enabled": true,
        "description": "Log analytics agent metrics"
      }
    ]
  }
}

4. Create Logs Directory

mkdir -p /path/to/project/.claude/logs

Usage

Invoke Agent

# In Claude Code session
"Use analytics-agent to generate a SQL query for [task description]"

Check Metrics

# View raw metrics log
cat .claude/logs/analytics-metrics.jsonl

# Run analysis
./examples/agents/analytics-with-eval/eval/metrics.sh

Generate Monthly Report

# Copy template
cp eval/report-template.md reports/analytics-2026-02.md

# Fill in metrics from metrics.sh output

Metrics Collected

The hook automatically logs:

Metric	Description	Source
`timestamp`	ISO 8601 timestamp	System
`query`	Generated SQL query	Agent response
`exec_time`	Query execution time	Database
`safety`	PASS/FAIL for destructive ops	Query analysis
`row_count`	Rows returned	Database
`error`	Error message if query failed	Database

Log format: JSONL (JSON Lines) in .claude/logs/analytics-metrics.jsonl

Example Metrics Output

$ ./eval/metrics.sh

=== Analytics Agent Metrics Report ===
Period: 2026-02-01 to 2026-02-10

Total queries: 45
Safety checks:
  - PASS: 42 (93%)
  - FAIL: 3 (7%)

Execution time:
  - Mean: 2.3s
  - Median: 1.8s
  - P95: 5.2s
  - P99: 8.1s

Common failures:
  1. DELETE without WHERE clause (2 occurrences)
  2. DROP TABLE in query (1 occurrence)

Recommendations:
  - Review agent instructions to emphasize WHERE clause requirement
  - Add explicit prohibition on DROP operations

Customization

Modify Safety Checks

Edit hooks/post-response-metrics.sh line 12-16:

# Add more patterns
if echo "$QUERY" | grep -iE 'DELETE|DROP|TRUNCATE|ALTER'; then
  SAFETY="FAIL"
fi

Add Custom Metrics

Extend the JSON log structure:

echo "{
  \"timestamp\":\"$(date -Iseconds)\",
  \"query\":\"$QUERY\",
  \"your_metric\":\"$VALUE\"
}" >> .claude/logs/analytics-metrics.jsonl

Change Log Location

Update POST_RESPONSE_LOG variable in hook script.

Troubleshooting

Hook Not Triggering

Check:

Hook is executable: ls -l .claude/hooks/*.sh
Hook is enabled in settings.json
Agent name matches: analytics-agent (hyphenated)

Debug:

# Test hook manually
export CLAUDE_RESPONSE='{"content":"SELECT * FROM users;"}'
./.claude/hooks/post-response-metrics.sh

No Metrics in Log

Check:

Log directory exists: mkdir -p .claude/logs
Write permissions: touch .claude/logs/test.log
Query extraction pattern matches your SQL format

Metrics Analysis Fails

Check:

jq is installed: which jq
Log file is valid JSONL: jq . .claude/logs/analytics-metrics.jsonl

Production Considerations

Performance

Hook adds ~50ms overhead per response
Log file grows ~200 bytes per query
Rotate logs monthly to prevent bloat

Privacy

Logs contain actual SQL queries (may include sensitive data)
Add to .gitignore: .claude/logs/
Consider redacting sensitive values in hook script

Database Connection

Hook requires database access for exec_time measurement
Configure connection in hook script or use environment variables
Ensure read-only credentials for safety

Agent Evaluation Guide: Complete evaluation methodology
Hooks Documentation: Hook system reference
nao Framework: Production analytics agent framework (inspiration)

License

Same as parent repository (MIT)

Questions? Open an issue or discussion in the main repository.

5 KiB Raw Blame History

Analytics Agent with Built-in Evaluation

What's Included

Setup

1. Copy Agent to Project

2. Install Hook

3. Configure Hook Trigger

4. Create Logs Directory

Usage

Invoke Agent

Check Metrics

Generate Monthly Report

Metrics Collected

Example Metrics Output

Customization

Modify Safety Checks

Add Custom Metrics

Change Log Location

Troubleshooting

Hook Not Triggering

No Metrics in Log

Metrics Analysis Fails

Production Considerations

Performance

Privacy

Database Connection

Related Resources

License

5 KiB

Raw Blame History