release: v3.24.0 - Agent Evaluation Framework

Major addition: Complete agent evaluation framework with production-ready template.

## Added

- **Resource Evaluation**: nao framework (score 3/5)
  - Identified critical gap: agent evaluation not documented
  - Technical challenge adjusted score 2/5 → 3/5
  - All claims fact-checked (TypeScript 58.9%, Python 38.5%)
- **Guide Section**: Agent Evaluation (guide/agent-evaluation.md, ~3K tokens)
  - Metrics: response quality, tool usage, performance, satisfaction
  - Patterns: logging hooks, unit tests, A/B testing, feedback loops
  - Example: analytics agent with built-in metrics
  - Tools: nao framework reference, Claude Code hooks integration
- **AI Ecosystem**: Section 8.2 Domain-Specific Agent Frameworks
  - nao (Analytics Agents): Database-agnostic, built-in evaluation
  - Transposable patterns: context builder, evaluation hooks, DB integrations
- **Template**: Analytics Agent with Evaluation (5 files, ~1K lines)
  - README: setup, usage, troubleshooting
  - Agent: SQL generator with evaluation criteria, safety rules
  - Hook: automated metrics logging (safety, performance, errors)
  - Script: analysis with stats, safety reports, recommendations
  - Report template: monthly evaluation format

## Changed

- Agent Evaluation Guide: updated template references, verified links
- Landing Site: templates count 110 → 114
- Version: 3.23.5 → 3.24.0

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-10 11:52:13 +01:00 · 2026-02-10 11:52:13 +01:00 · ef7cdd899e
commit ef7cdd899e
parent 1fb783ebb8
15 changed files with 1782 additions and 16 deletions
--- a/docs/resource-evaluations/nao-framework.md
+++ b/docs/resource-evaluations/nao-framework.md
@@ -0,0 +1,202 @@
+# Resource Evaluation: nao Framework
+
+**URL**: https://github.com/getnao/nao/
+**Type**: Open-source framework for building analytics agents
+**Evaluation Date**: 2026-02-10
+**Evaluator**: Claude Code (with technical-writer challenge)
+**Target Guide**: Claude Code Ultimate Guide
+
+---
+
+## Summary
+
+**nao** is an open-source framework for building and deploying analytics agents. It uses a two-step architecture: build the agent context via CLI tools, then deploy a chat UI that lets end users query data conversationally.
+
+**Key Features**:
+- Context builder supporting flexible data, metadata, documentation, and tool integrations
+- Database agnostic (PostgreSQL, BigQuery, Snowflake, Databricks)
+- Built-in evaluation framework with unit testing capabilities
+- Self-hosted deployment with Docker containerization
+- Native data visualization within chat interface
+- Tech stack: Fastify, Drizzle ORM, tRPC, React, shadcn UI (TypeScript 58.9%, Python 38.5%)
+
+---
+
+## Relevance Score: 3/5 (Moderate - Useful Complement)
+
+### Initial Score: 2/5 → Adjusted to 3/5 after technical challenge
+
+**Justification for 3/5**:
+
+**Relevance (+2 points)**:
+- ✅ **Transposable architecture patterns** to Claude Code agents (.claude/agents/)
+- ✅ **Evaluation framework** addresses critical gap in guide (no section on agent evaluation)
+- ✅ **Database context patterns** applicable to agents with DB integrations
+
+**Limitations (-2 points)**:
+- ❌ No direct integration with Claude Code CLI (not a plugin/MCP server)
+- ❌ Deployment scope (production chat UI) differs from guide's dev CLI focus
+
+**Partial overlap**: Agent concepts and evaluation patterns are transposable, but the final product (a deployed analytics UI) is outside the guide's focus.
+
+---
+
+## Comparative Analysis
+
+| Aspect | nao | Current Guide | Gap? |
+|--------|-----|---------------|------|
+| **Custom Agents** | ✅ Complete build + deploy framework | ✅ Extensive docs (.claude/agents/) | ➖ No |
+| **Agent Architecture** | ➕ Structured context builder pattern | ⚠️ No complex context patterns | ✅ **Minor gap** |
+| **Agent Evaluation** | ➕ Integrated framework (metrics, unit tests, feedback) | ❌ No mention | ✅ **Critical gap** |
+| **Database Integrations** | ➕ Native support: BigQuery, Snowflake, Databricks, PostgreSQL | ✅ Mentioned (database-branch-setup.md) | ➖ No |
+| **Agent Deployment** | ➕ Deployable chat UI with visualizations | ❌ Not covered (CLI focus) | ⚠️ Out of scope |
+| **Analytics-Specific** | ➕ Specialized for conversational analytics | ❌ No analytics focus | ➖ No |
+| **Context Management** | ✅ Context builder with docs/metadata | ✅ CLAUDE.md, .claude/rules/ | ➖ No |
+| **Self-Hosted** | ✅ Docker + deployment guides | ✅ Mentioned (security-hardening.md) | ➖ No |
+
+---
+
+## Integration Recommendations
+
+**Score 3/5 → Integrate (3 approaches)**
+
+### ✅ Priority 1: Dedicated Section (Recommended)
+
+**Create**: `guide/agent-evaluation.md` (~800 tokens)
+
+**Content**:
+- **Why Evaluate?**: Measure quality, track usage, identify bottlenecks
+- **Metrics to Track**: Response time, tool call success rate, context relevance, error rate
+- **Implementation Patterns**: Logging hooks, metrics aggregation, A/B testing, feedback collection
+- **Reference**: Mention nao as complete evaluation framework
+
+**Estimated tokens**: 500-800
+**Suggested timeline**: Week 1
+**Insertion point**: After `guide/ultimate-guide.md` section 4 (Agents)
+
+---
+
+### ✅ Priority 2: Agent Template with Evaluation (Optional)
+
+**Create**: `examples/agents/analytics-with-eval/`
+
+**Structure**:
+```
+analytics-with-eval/
+├── config.yaml          # Agent config
+├── context.md           # Agent instructions
+├── eval/
+│   ├── metrics.sh       # Collect metrics
+│   └── report.md        # Example report
+└── hooks/
+    └── post_response.sh # Log response metrics
+```
+
+**Estimated tokens**: 300-400 (code + README)
+**Suggested timeline**: Week 2-3
+**Value**: Demonstrates concrete evaluation patterns
+
+---
+
+### ✅ Priority 3: Ecosystem Mention (Minimal)
+
+**Add**: Section "Domain-Specific Agent Frameworks" in `guide/ai-ecosystem.md`
+
+**Content**: 1 paragraph + link to nao
+
+**Estimated tokens**: 100-150
+**Suggested timeline**: Immediate
+**Insertion point**: After the orchestration frameworks section (~line 2200)
+
+---
+
+## Technical Challenge Results
+
+The technical-writer agent identified several biases in the initial evaluation:
+
+### Missed Points in Initial Evaluation
+
+1. **Transposable agent architecture**: nao's `context builder` pattern applicable to Claude Code agents
+2. **Evaluation framework = critical gap**: Guide has NO mention of agent evaluation (metrics, testing, feedback)
+3. **Database context patterns**: Patterns for context injection from databases not documented
+
+### Score Justification
+
+**Correction**: 2/5 → **3/5** (Moderate)
+
+**Arguments for 3/5**:
+- Transposable architecture patterns (+1)
+- Evaluation framework addresses identified gap (+1)
+- Usable database context patterns (+0.5)
+- Open-source, well-documented (+0.5)
+
+### The "Agent Evaluation" Gap Must Be Addressed
+
+**YES, the guide MUST have this section** because:
+- Devs create agents without knowing how to measure quality
+- Anthropic docs mention evaluations but not in Claude Code context
+- nao proves this is feasible and useful (production-ready)
+
+### Risks of Non-Integration
+
+1. **Evaluation gap remains undocumented** → Devs don't know how to measure agent quality
+2. **Database context patterns undocumented** → Devs reinvent already-proven patterns
+3. **Loss of credibility** → If evaluation becomes standard, guide will be behind
+
+### Why Initial Evaluation Was Biased
+
+1. **Confusion between scope and relevance**: Different scope ≠ not relevant
+2. **Focus on final product**: Evaluated nao as *competing product*, not *pattern source*
+3. **Underestimation of gaps**: Agent evaluation = critical gap not previously identified
+4. **Premature rejection**: "Don't integrate" despite identifying 2 major gaps
+
+**Lesson**: Evaluate resources for *transposable patterns*, not just direct integration.
+
+---
+
+## Fact-Check Results
+
+All technical claims verified by re-fetching GitHub repository:
+
+| Claim | Verified | Source |
+|-------|----------|--------|
+| TypeScript 58.9%, Python 38.5% | ✅ | Repository footer |
+| Stack: Fastify, Drizzle, tRPC, React, shadcn, TanStack Query | ✅ | "Stack" section |
+| Databases: PostgreSQL, BigQuery, Snowflake, Databricks | ✅ | Repository topics |
+| Evaluation framework with unit testing | ✅ | "Evaluation framework" section |
+| GitHub repo integration + Slack bot | ✅ | Quickstart + topics |
+| Docker containerization + self-hosted | ✅ | "Docker" section + docs |
+
+**Corrections made**: None (all initial claims were accurate)
+
+**Stats requiring external research**: None (all verifiable on GitHub page)
+
+---
+
+## Final Decision
+
+- **Final Score**: **3/5 (Moderate - Useful Complement)**
+- **Action**: **Integrate** via 3 approaches (priority 1 + 2 + 3)
+- **Confidence**: **High** (all claims fact-checked ✅)
+
+### Concrete Action Plan
+
+1. **Immediate** (today): Add mention in `guide/ai-ecosystem.md` section "Domain-Specific Agent Frameworks"
+2. **Week 1**: Create `guide/agent-evaluation.md` with patterns inspired by nao
+3. **Week 2-3**: Create template `examples/agents/analytics-with-eval/` with metrics
+
+### Added Value for Guide
+
+- ✅ Addresses critical gap (agent evaluation)
+- ✅ Adds transposable patterns (context builder, DB integrations)
+- ✅ Demonstrates complete lifecycle (build → eval → iterate)
+- ✅ References production-ready framework in specific domain
+
+---
+
+## Metadata
+
+**Evaluation performed with**: WebFetch (2×), Grep (3×), Task (technical-writer challenge)
+**Evaluation time**: ~15 minutes
+**Quality**: Complete challenge + fact-check ✅
+**Follow-up**: Implement integration recommendations (Priorities 1, 2, 3)
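
The "Metrics to Track" item under Priority 1 in the evaluation above (response time, tool call success rate, error rate) can be illustrated with a small aggregation routine. The following is a minimal sketch only: the `AgentEvent` schema and the `summarize` function are hypothetical names introduced for illustration, and they do not reflect nao's API, Claude Code's hook payloads, or the template shipped in this commit.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class AgentEvent:
    """One logged agent response (hypothetical schema, assumed for this sketch)."""
    response_time_s: float  # wall-clock time to produce the response
    tool_calls: int         # number of tool calls attempted
    tool_calls_ok: int      # number of tool calls that succeeded
    error: bool             # True if the response ended in an error


def summarize(events: list[AgentEvent]) -> dict[str, Optional[float]]:
    """Aggregate per-response events into the metrics named in the evaluation."""
    if not events:
        raise ValueError("no events to summarize")
    total_calls = sum(e.tool_calls for e in events)
    ok_calls = sum(e.tool_calls_ok for e in events)
    return {
        "responses": len(events),
        "mean_response_time_s": mean(e.response_time_s for e in events),
        # Undefined when no tool was called, so report None instead of dividing by zero.
        "tool_call_success_rate": ok_calls / total_calls if total_calls else None,
        "error_rate": sum(e.error for e in events) / len(events),
    }


# Two made-up events, purely to exercise the function.
events = [
    AgentEvent(response_time_s=1.0, tool_calls=2, tool_calls_ok=2, error=False),
    AgentEvent(response_time_s=3.0, tool_calls=2, tool_calls_ok=1, error=True),
]
report = summarize(events)
print(report)
```

In practice the events would come from whatever a logging hook writes (for example, one JSON object per line); keeping the aggregation separate from the logging makes it easy to recompute metrics when the set of tracked fields changes.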