claude-code-ultimate-guide/docs/resource-evaluations/nao-framework.md
Florian BRUNIAUX ef7cdd899e release: v3.24.0 - Agent Evaluation Framework
Major addition: Complete agent evaluation framework with production-ready template.

## Added

- **Resource Evaluation**: nao framework (score 3/5)
  - Identified critical gap: agent evaluation not documented
  - Technical challenge adjusted score 2/5 → 3/5
  - All claims fact-checked (TypeScript 58.9%, Python 38.5%)

- **Guide Section**: Agent Evaluation (guide/agent-evaluation.md, ~3K tokens)
  - Metrics: response quality, tool usage, performance, satisfaction
  - Patterns: logging hooks, unit tests, A/B testing, feedback loops
  - Example: analytics agent with built-in metrics
  - Tools: nao framework reference, Claude Code hooks integration

- **AI Ecosystem**: Section 8.2 Domain-Specific Agent Frameworks
  - nao (Analytics Agents): Database-agnostic, built-in evaluation
  - Transposable patterns: context builder, evaluation hooks, DB integrations

- **Template**: Analytics Agent with Evaluation (5 files, ~1K lines)
  - README: setup, usage, troubleshooting
  - Agent: SQL generator with evaluation criteria, safety rules
  - Hook: automated metrics logging (safety, performance, errors)
  - Script: analysis with stats, safety reports, recommendations
  - Report template: monthly evaluation format

## Changed

- Agent Evaluation Guide: updated template references, verified links
- Landing Site: templates count 110 → 114
- Version: 3.23.5 → 3.24.0

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-10 11:52:13 +01:00

202 lines
7.7 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Resource Evaluation: nao Framework
**URL**: https://github.com/getnao/nao/
**Type**: Open-source framework for building analytics agents
**Evaluation Date**: 2026-02-10
**Evaluator**: Claude Code (with technical-writer challenge)
**Target Guide**: Claude Code Ultimate Guide
---
## Summary
**nao** is an open-source framework for building and deploying analytics agents with a two-step architecture: build agent context via CLI tools, then deploy chat UI for end users to query data conversationally.
**Key Features**:
- Context builder supporting flexible data, metadata, documentation, and tool integrations
- Database agnostic (PostgreSQL, BigQuery, Snowflake, Databricks)
- Built-in evaluation framework with unit testing capabilities
- Self-hosted deployment with Docker containerization
- Native data visualization within chat interface
- Tech stack: Fastify, Drizzle ORM, tRPC, React, shadcn UI (TypeScript 58.9%, Python 38.5%)
---
## Relevance Score: 3/5 (Moderate - Useful Complement)
### Initial Score: 2/5 → Adjusted to 3/5 after technical challenge
**Justification for 3/5**:
**Relevance (+2 points)**:
-**Transposable architecture patterns** to Claude Code agents (.claude/agents/)
-**Evaluation framework** addresses critical gap in guide (no section on agent evaluation)
-**Database context patterns** applicable to agents with DB integrations
**Limitations (-2 points)**:
- ❌ No direct integration with Claude Code CLI (not a plugin/MCP server)
- ❌ Deployment scope (production chat UI) differs from guide's dev CLI focus
**Partial overlap**: Agent concepts and evaluation patterns are transposable, but final product (deployed analytics UI) is not guide's focus.
---
## Comparative Analysis
| Aspect | nao | Current Guide | Gap? |
|--------|-----|---------------|------|
| **Custom Agents** | ✅ Complete build + deploy framework | ✅ Extensive docs (.claude/agents/) | No |
| **Agent Architecture** | Structured context builder pattern | ⚠️ No complex context patterns | ✅ **Minor gap** |
| **Agent Evaluation** | Integrated framework (metrics, unit tests, feedback) | ❌ No mention | ✅ **Critical gap** |
| **Database Integrations** | Native support: BigQuery, Snowflake, Databricks, PostgreSQL | ✅ Mentioned (database-branch-setup.md) | No |
| **Agent Deployment** | Deployable chat UI with visualizations | ❌ Not covered (CLI focus) | ⚠️ Out of scope |
| **Analytics-Specific** | Specialized for conversational analytics | ❌ No analytics focus | No |
| **Context Management** | ✅ Context builder with docs/metadata | ✅ CLAUDE.md, .claude/rules/ | No |
| **Self-Hosted** | ✅ Docker + deployment guides | ✅ Mentioned (security-hardening.md) | No |
---
## Integration Recommendations
**Score 3/5 → Integrate (3 approaches)**
### ✅ Priority 1: Dedicated Section (Recommended)
**Create**: `guide/agent-evaluation.md` (~800 tokens)
**Content**:
- **Why Evaluate?**: Measure quality, track usage, identify bottlenecks
- **Metrics to Track**: Response time, tool call success rate, context relevance, error rate
- **Implementation Patterns**: Logging hooks, metrics aggregation, A/B testing, feedback collection
- **Reference**: Mention nao as complete evaluation framework
**Estimated tokens**: 500-800
**Suggested timeline**: Week 1
**Insertion point**: After `guide/ultimate-guide.md` section 4 (Agents)
---
### ✅ Priority 2: Agent Template with Evaluation (Optional)
**Create**: `examples/agents/analytics-with-eval/`
**Structure**:
```
analytics-with-eval/
├── config.yaml # Agent config
├── context.md # Agent instructions
├── eval/
│ ├── metrics.sh # Collect metrics
│ └── report.md # Example report
└── hooks/
└── post_response.sh # Log response metrics
```
**Estimated tokens**: 300-400 (code + README)
**Suggested timeline**: Week 2-3
**Value**: Demonstrates concrete evaluation patterns
---
### ✅ Priority 3: Ecosystem Mention (Minimal)
**Add**: Section "Domain-Specific Agent Frameworks" in `guide/ai-ecosystem.md`
**Content**: 1 paragraph + link to nao
**Estimated tokens**: 100-150
**Suggested timeline**: Immediate
**Suggested line**: After orchestration frameworks section (~line 2200)
---
## Technical Challenge Results
The technical-writer agent identified several biases in the initial evaluation:
### Missed Points in Initial Evaluation
1. **Transposable agent architecture**: nao's `context builder` pattern applicable to Claude Code agents
2. **Evaluation framework = critical gap**: Guide has NO mention of agent evaluation (metrics, testing, feedback)
3. **Database context patterns**: Patterns for context injection from databases not documented
### Score Justification
**Correction**: 2/5 → **3/5** (Moderate)
**Arguments for 3/5**:
- Transposable architecture patterns (+1)
- Evaluation framework addresses identified gap (+1)
- Usable database context patterns (+0.5)
- Open-source, well-documented (+0.5)
### Gap "Agent Evaluation" Must Be Addressed
**YES, the guide MUST have this section** because:
- Devs create agents without knowing how to measure quality
- Anthropic docs mention evaluations but not in Claude Code context
- nao proves this is feasible and useful (production-ready)
### Risks of Non-Integration
1. **Evaluation gap remains undocumented** → Devs don't know how to measure agent quality
2. **Database context patterns undocumented** → Devs reinvent already-proven patterns
3. **Loss of credibility** → If evaluation becomes standard, guide will be behind
### Why Initial Evaluation Was Biased
1. **Confusion between scope and relevance**: Different scope ≠ not relevant
2. **Focus on final product**: Evaluated nao as *competing product*, not *pattern source*
3. **Underestimation of gaps**: Agent evaluation = critical gap not previously identified
4. **Premature rejection**: "Don't integrate" despite identifying 2 major gaps
**Lesson**: Evaluate resources for *transposable patterns*, not just direct integration.
---
## Fact-Check Results
All technical claims verified by re-fetching GitHub repository:
| Claim | Verified | Source |
|-------|----------|--------|
| TypeScript 58.9%, Python 38.5% | ✅ | Repository footer |
| Stack: Fastify, Drizzle, tRPC, React, shadcn, TanStack Query | ✅ | "Stack" section |
| Databases: PostgreSQL, BigQuery, Snowflake, Databricks | ✅ | Repository topics |
| Evaluation framework with unit testing | ✅ | "Evaluation framework" section |
| GitHub repo integration + Slack bot | ✅ | Quickstart + topics |
| Docker containerization + self-hosted | ✅ | "Docker" section + docs |
**Corrections made**: None (all initial claims were accurate)
**Stats requiring external research**: None (all verifiable on GitHub page)
---
## Final Decision
- **Final Score**: **3/5 (Moderate - Useful Complement)**
- **Action**: **Integrate** via 3 approaches (priority 1 + 2 + 3)
- **Confidence**: **High** (all claims fact-checked ✅)
### Concrete Action Plan
1. **Immediate** (today): Add mention in `guide/ai-ecosystem.md` section "Domain-Specific Agent Frameworks"
2. **Week 1**: Create `guide/agent-evaluation.md` with patterns inspired by nao
3. **Week 2-3**: Create template `examples/agents/analytics-with-eval/` with metrics
### Added Value for Guide
- ✅ Addresses critical gap (agent evaluation)
- ✅ Adds transposable patterns (context builder, DB integrations)
- ✅ Demonstrates complete lifecycle (build → eval → iterate)
- ✅ References production-ready framework in specific domain
---
## Metadata
**Evaluation performed with**: WebFetch (2×), Grep (3×), Task (technical-writer challenge)
**Evaluation time**: ~15 minutes
**Quality**: Complete challenge + fact-check ✅
**Follow-up**: Implement integration recommendations (A, B, C)