release: v3.24.0 - Agent Evaluation Framework

Major addition: Complete agent evaluation framework with production-ready template.

## Added

- **Resource Evaluation**: nao framework (score 3/5)
  - Identified critical gap: agent evaluation not documented
  - Technical challenge adjusted score 2/5 → 3/5
  - All claims fact-checked (TypeScript 58.9%, Python 38.5%)

- **Guide Section**: Agent Evaluation (guide/agent-evaluation.md, ~3K tokens)
  - Metrics: response quality, tool usage, performance, satisfaction
  - Patterns: logging hooks, unit tests, A/B testing, feedback loops
  - Example: analytics agent with built-in metrics
  - Tools: nao framework reference, Claude Code hooks integration

- **AI Ecosystem**: Section 8.2 Domain-Specific Agent Frameworks
  - nao (Analytics Agents): Database-agnostic, built-in evaluation
  - Transposable patterns: context builder, evaluation hooks, DB integrations

- **Template**: Analytics Agent with Evaluation (5 files, ~1K lines)
  - README: setup, usage, troubleshooting
  - Agent: SQL generator with evaluation criteria, safety rules
  - Hook: automated metrics logging (safety, performance, errors)
  - Script: analysis with stats, safety reports, recommendations
  - Report template: monthly evaluation format

## Changed

- Agent Evaluation Guide: updated template references, verified links
- Landing Site: templates count 110 → 114
- Version: 3.23.5 → 3.24.0

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
This commit is contained in:
Florian BRUNIAUX 2026-02-10 11:52:13 +01:00
parent 1fb783ebb8
commit ef7cdd899e
15 changed files with 1782 additions and 16 deletions

View file

@ -0,0 +1,202 @@
# Resource Evaluation: nao Framework
**URL**: https://github.com/getnao/nao/
**Type**: Open-source framework for building analytics agents
**Evaluation Date**: 2026-02-10
**Evaluator**: Claude Code (with technical-writer challenge)
**Target Guide**: Claude Code Ultimate Guide
---
## Summary
**nao** is an open-source framework for building and deploying analytics agents with a two-step architecture: build agent context via CLI tools, then deploy chat UI for end users to query data conversationally.
**Key Features**:
- Context builder supporting flexible data, metadata, documentation, and tool integrations
- Database agnostic (PostgreSQL, BigQuery, Snowflake, Databricks)
- Built-in evaluation framework with unit testing capabilities
- Self-hosted deployment with Docker containerization
- Native data visualization within chat interface
- Tech stack: Fastify, Drizzle ORM, tRPC, React, shadcn UI (TypeScript 58.9%, Python 38.5%)
---
## Relevance Score: 3/5 (Moderate - Useful Complement)
### Initial Score: 2/5 → Adjusted to 3/5 after technical challenge
**Justification for 3/5**:
**Relevance (+2 points)**:
- ✅ **Transposable architecture patterns** to Claude Code agents (.claude/agents/)
- ✅ **Evaluation framework** addresses critical gap in guide (no section on agent evaluation)
- ✅ **Database context patterns** applicable to agents with DB integrations
**Limitations (-2 points)**:
- ❌ No direct integration with Claude Code CLI (not a plugin/MCP server)
- ❌ Deployment scope (production chat UI) differs from guide's dev CLI focus
**Partial overlap**: Agent concepts and evaluation patterns are transposable, but final product (deployed analytics UI) is not guide's focus.
---
## Comparative Analysis
| Aspect | nao | Current Guide | Gap? |
|--------|-----|---------------|------|
| **Custom Agents** | ✅ Complete build + deploy framework | ✅ Extensive docs (.claude/agents/) | No |
| **Agent Architecture** | Structured context builder pattern | ⚠️ No complex context patterns | ✅ **Minor gap** |
| **Agent Evaluation** | Integrated framework (metrics, unit tests, feedback) | ❌ No mention | ✅ **Critical gap** |
| **Database Integrations** | Native support: BigQuery, Snowflake, Databricks, PostgreSQL | ✅ Mentioned (database-branch-setup.md) | No |
| **Agent Deployment** | Deployable chat UI with visualizations | ❌ Not covered (CLI focus) | ⚠️ Out of scope |
| **Analytics-Specific** | Specialized for conversational analytics | ❌ No analytics focus | No |
| **Context Management** | ✅ Context builder with docs/metadata | ✅ CLAUDE.md, .claude/rules/ | No |
| **Self-Hosted** | ✅ Docker + deployment guides | ✅ Mentioned (security-hardening.md) | No |
---
## Integration Recommendations
**Score 3/5 → Integrate (3 approaches)**
### ✅ Priority 1: Dedicated Section (Recommended)
**Create**: `guide/agent-evaluation.md` (~800 tokens)
**Content**:
- **Why Evaluate?**: Measure quality, track usage, identify bottlenecks
- **Metrics to Track**: Response time, tool call success rate, context relevance, error rate
- **Implementation Patterns**: Logging hooks, metrics aggregation, A/B testing, feedback collection
- **Reference**: Mention nao as complete evaluation framework
**Estimated tokens**: 500-800
**Suggested timeline**: Week 1
**Insertion point**: After `guide/ultimate-guide.md` section 4 (Agents)
---
### ✅ Priority 2: Agent Template with Evaluation (Optional)
**Create**: `examples/agents/analytics-with-eval/`
**Structure**:
```
analytics-with-eval/
├── config.yaml # Agent config
├── context.md # Agent instructions
├── eval/
│ ├── metrics.sh # Collect metrics
│ └── report.md # Example report
└── hooks/
└── post_response.sh # Log response metrics
```
**Estimated tokens**: 300-400 (code + README)
**Suggested timeline**: Week 2-3
**Value**: Demonstrates concrete evaluation patterns
---
### ✅ Priority 3: Ecosystem Mention (Minimal)
**Add**: Section "Domain-Specific Agent Frameworks" in `guide/ai-ecosystem.md`
**Content**: 1 paragraph + link to nao
**Estimated tokens**: 100-150
**Suggested timeline**: Immediate
**Suggested line**: After orchestration frameworks section (~line 2200)
---
## Technical Challenge Results
The technical-writer agent identified several biases in the initial evaluation:
### Missed Points in Initial Evaluation
1. **Transposable agent architecture**: nao's `context builder` pattern applicable to Claude Code agents
2. **Evaluation framework = critical gap**: Guide has NO mention of agent evaluation (metrics, testing, feedback)
3. **Database context patterns**: Patterns for context injection from databases not documented
### Score Justification
**Correction**: 2/5 → **3/5** (Moderate)
**Arguments for 3/5**:
- Transposable architecture patterns (+1)
- Evaluation framework addresses identified gap (+1)
- Usable database context patterns (+0.5)
- Open-source, well-documented (+0.5)
### Gap "Agent Evaluation" Must Be Addressed
**YES, the guide MUST have this section** because:
- Devs create agents without knowing how to measure quality
- Anthropic docs mention evaluations but not in Claude Code context
- nao proves this is feasible and useful (production-ready)
### Risks of Non-Integration
1. **Evaluation gap remains undocumented** → Devs don't know how to measure agent quality
2. **Database context patterns undocumented** → Devs reinvent already-proven patterns
3. **Loss of credibility** → If evaluation becomes standard, guide will be behind
### Why Initial Evaluation Was Biased
1. **Confusion between scope and relevance**: Different scope ≠ not relevant
2. **Focus on final product**: Evaluated nao as *competing product*, not *pattern source*
3. **Underestimation of gaps**: Agent evaluation = critical gap not previously identified
4. **Premature rejection**: "Don't integrate" despite identifying 2 major gaps
**Lesson**: Evaluate resources for *transposable patterns*, not just direct integration.
---
## Fact-Check Results
All technical claims verified by re-fetching GitHub repository:
| Claim | Verified | Source |
|-------|----------|--------|
| TypeScript 58.9%, Python 38.5% | ✅ | Repository footer |
| Stack: Fastify, Drizzle, tRPC, React, shadcn, TanStack Query | ✅ | "Stack" section |
| Databases: PostgreSQL, BigQuery, Snowflake, Databricks | ✅ | Repository topics |
| Evaluation framework with unit testing | ✅ | "Evaluation framework" section |
| GitHub repo integration + Slack bot | ✅ | Quickstart + topics |
| Docker containerization + self-hosted | ✅ | "Docker" section + docs |
**Corrections made**: None (all initial claims were accurate)
**Stats requiring external research**: None (all verifiable on GitHub page)
---
## Final Decision
- **Final Score**: **3/5 (Moderate - Useful Complement)**
- **Action**: **Integrate** via 3 approaches (priority 1 + 2 + 3)
- **Confidence**: **High** (all claims fact-checked ✅)
### Concrete Action Plan
1. **Immediate** (today): Add mention in `guide/ai-ecosystem.md` section "Domain-Specific Agent Frameworks"
2. **Week 1**: Create `guide/agent-evaluation.md` with patterns inspired by nao
3. **Week 2-3**: Create template `examples/agents/analytics-with-eval/` with metrics
### Added Value for Guide
- ✅ Addresses critical gap (agent evaluation)
- ✅ Adds transposable patterns (context builder, DB integrations)
- ✅ Demonstrates complete lifecycle (build → eval → iterate)
- ✅ References production-ready framework in specific domain
---
## Metadata
**Evaluation performed with**: WebFetch (2×), Grep (3×), Task (technical-writer challenge)
**Evaluation time**: ~15 minutes
**Quality**: Complete challenge + fact-check ✅
**Follow-up**: Implement integration recommendations (A, B, C)