Major addition: Complete agent evaluation framework with production-ready template. ## Added - **Resource Evaluation**: nao framework (score 3/5) - Identified critical gap: agent evaluation not documented - Technical challenge adjusted score 2/5 → 3/5 - All claims fact-checked (TypeScript 58.9%, Python 38.5%) - **Guide Section**: Agent Evaluation (guide/agent-evaluation.md, ~3K tokens) - Metrics: response quality, tool usage, performance, satisfaction - Patterns: logging hooks, unit tests, A/B testing, feedback loops - Example: analytics agent with built-in metrics - Tools: nao framework reference, Claude Code hooks integration - **AI Ecosystem**: Section 8.2 Domain-Specific Agent Frameworks - nao (Analytics Agents): Database-agnostic, built-in evaluation - Transposable patterns: context builder, evaluation hooks, DB integrations - **Template**: Analytics Agent with Evaluation (5 files, ~1K lines) - README: setup, usage, troubleshooting - Agent: SQL generator with evaluation criteria, safety rules - Hook: automated metrics logging (safety, performance, errors) - Script: analysis with stats, safety reports, recommendations - Report template: monthly evaluation format ## Changed - Agent Evaluation Guide: updated template references, verified links - Landing Site: templates count 110 → 114 - Version: 3.23.5 → 3.24.0 Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
202 lines
7.7 KiB
Markdown
202 lines
7.7 KiB
Markdown
# Resource Evaluation: nao Framework
|
||
|
||
**URL**: https://github.com/getnao/nao/
|
||
**Type**: Open-source framework for building analytics agents
|
||
**Evaluation Date**: 2026-02-10
|
||
**Evaluator**: Claude Code (with technical-writer challenge)
|
||
**Target Guide**: Claude Code Ultimate Guide
|
||
|
||
---
|
||
|
||
## Summary
|
||
|
||
**nao** is an open-source framework for building and deploying analytics agents with a two-step architecture: build agent context via CLI tools, then deploy chat UI for end users to query data conversationally.
|
||
|
||
**Key Features**:
|
||
- Context builder supporting flexible data, metadata, documentation, and tool integrations
|
||
- Database agnostic (PostgreSQL, BigQuery, Snowflake, Databricks)
|
||
- Built-in evaluation framework with unit testing capabilities
|
||
- Self-hosted deployment with Docker containerization
|
||
- Native data visualization within chat interface
|
||
- Tech stack: Fastify, Drizzle ORM, tRPC, React, shadcn UI (TypeScript 58.9%, Python 38.5%)
|
||
|
||
---
|
||
|
||
## Relevance Score: 3/5 (Moderate - Useful Complement)
|
||
|
||
### Initial Score: 2/5 → Adjusted to 3/5 after technical challenge
|
||
|
||
**Justification for 3/5**:
|
||
|
||
**Relevance (+2 points)**:
|
||
- ✅ **Transposable architecture patterns** to Claude Code agents (.claude/agents/)
|
||
- ✅ **Evaluation framework** addresses critical gap in guide (no section on agent evaluation)
|
||
- ✅ **Database context patterns** applicable to agents with DB integrations
|
||
|
||
**Limitations (-2 points)**:
|
||
- ❌ No direct integration with Claude Code CLI (not a plugin/MCP server)
|
||
- ❌ Deployment scope (production chat UI) differs from guide's dev CLI focus
|
||
|
||
**Partial overlap**: Agent concepts and evaluation patterns are transposable, but final product (deployed analytics UI) is not guide's focus.
|
||
|
||
---
|
||
|
||
## Comparative Analysis
|
||
|
||
| Aspect | nao | Current Guide | Gap? |
|
||
|--------|-----|---------------|------|
|
||
| **Custom Agents** | ✅ Complete build + deploy framework | ✅ Extensive docs (.claude/agents/) | ➖ No |
|
||
| **Agent Architecture** | ➕ Structured context builder pattern | ⚠️ No complex context patterns | ✅ **Minor gap** |
|
||
| **Agent Evaluation** | ➕ Integrated framework (metrics, unit tests, feedback) | ❌ No mention | ✅ **Critical gap** |
|
||
| **Database Integrations** | ➕ Native support: BigQuery, Snowflake, Databricks, PostgreSQL | ✅ Mentioned (database-branch-setup.md) | ➖ No |
|
||
| **Agent Deployment** | ➕ Deployable chat UI with visualizations | ❌ Not covered (CLI focus) | ⚠️ Out of scope |
|
||
| **Analytics-Specific** | ➕ Specialized for conversational analytics | ❌ No analytics focus | ➖ No |
|
||
| **Context Management** | ✅ Context builder with docs/metadata | ✅ CLAUDE.md, .claude/rules/ | ➖ No |
|
||
| **Self-Hosted** | ✅ Docker + deployment guides | ✅ Mentioned (security-hardening.md) | ➖ No |
|
||
|
||
---
|
||
|
||
## Integration Recommendations
|
||
|
||
**Score 3/5 → Integrate (3 approaches)**
|
||
|
||
### ✅ Priority 1: Dedicated Section (Recommended)
|
||
|
||
**Create**: `guide/agent-evaluation.md` (~800 tokens)
|
||
|
||
**Content**:
|
||
- **Why Evaluate?**: Measure quality, track usage, identify bottlenecks
|
||
- **Metrics to Track**: Response time, tool call success rate, context relevance, error rate
|
||
- **Implementation Patterns**: Logging hooks, metrics aggregation, A/B testing, feedback collection
|
||
- **Reference**: Mention nao as complete evaluation framework
|
||
|
||
**Estimated tokens**: 500-800
|
||
**Suggested timeline**: Week 1
|
||
**Insertion point**: After `guide/ultimate-guide.md` section 4 (Agents)
|
||
|
||
---
|
||
|
||
### ✅ Priority 2: Agent Template with Evaluation (Optional)
|
||
|
||
**Create**: `examples/agents/analytics-with-eval/`
|
||
|
||
**Structure**:
|
||
```
|
||
analytics-with-eval/
|
||
├── config.yaml # Agent config
|
||
├── context.md # Agent instructions
|
||
├── eval/
|
||
│ ├── metrics.sh # Collect metrics
|
||
│ └── report.md # Example report
|
||
└── hooks/
|
||
└── post_response.sh # Log response metrics
|
||
```
|
||
|
||
**Estimated tokens**: 300-400 (code + README)
|
||
**Suggested timeline**: Week 2-3
|
||
**Value**: Demonstrates concrete evaluation patterns
|
||
|
||
---
|
||
|
||
### ✅ Priority 3: Ecosystem Mention (Minimal)
|
||
|
||
**Add**: Section "Domain-Specific Agent Frameworks" in `guide/ai-ecosystem.md`
|
||
|
||
**Content**: 1 paragraph + link to nao
|
||
|
||
**Estimated tokens**: 100-150
|
||
**Suggested timeline**: Immediate
|
||
**Suggested line**: After orchestration frameworks section (~line 2200)
|
||
|
||
---
|
||
|
||
## Technical Challenge Results
|
||
|
||
The technical-writer agent identified several biases in the initial evaluation:
|
||
|
||
### Missed Points in Initial Evaluation
|
||
|
||
1. **Transposable agent architecture**: nao's `context builder` pattern applicable to Claude Code agents
|
||
2. **Evaluation framework = critical gap**: Guide has NO mention of agent evaluation (metrics, testing, feedback)
|
||
3. **Database context patterns**: Patterns for context injection from databases not documented
|
||
|
||
### Score Justification
|
||
|
||
**Correction**: 2/5 → **3/5** (Moderate)
|
||
|
||
**Arguments for 3/5**:
|
||
- Transposable architecture patterns (+1)
|
||
- Evaluation framework addresses identified gap (+1)
|
||
- Usable database context patterns (+0.5)
|
||
- Open-source, well-documented (+0.5)
|
||
|
||
### Gap "Agent Evaluation" Must Be Addressed
|
||
|
||
**YES, the guide MUST have this section** because:
|
||
- Devs create agents without knowing how to measure quality
|
||
- Anthropic docs mention evaluations but not in Claude Code context
|
||
- nao proves this is feasible and useful (production-ready)
|
||
|
||
### Risks of Non-Integration
|
||
|
||
1. **Evaluation gap remains undocumented** → Devs don't know how to measure agent quality
|
||
2. **Database context patterns undocumented** → Devs reinvent already-proven patterns
|
||
3. **Loss of credibility** → If evaluation becomes standard, guide will be behind
|
||
|
||
### Why Initial Evaluation Was Biased
|
||
|
||
1. **Confusion between scope and relevance**: Different scope ≠ not relevant
|
||
2. **Focus on final product**: Evaluated nao as *competing product*, not *pattern source*
|
||
3. **Underestimation of gaps**: Agent evaluation = critical gap not previously identified
|
||
4. **Premature rejection**: "Don't integrate" despite identifying 2 major gaps
|
||
|
||
**Lesson**: Evaluate resources for *transposable patterns*, not just direct integration.
|
||
|
||
---
|
||
|
||
## Fact-Check Results
|
||
|
||
All technical claims verified by re-fetching GitHub repository:
|
||
|
||
| Claim | Verified | Source |
|
||
|-------|----------|--------|
|
||
| TypeScript 58.9%, Python 38.5% | ✅ | Repository footer |
|
||
| Stack: Fastify, Drizzle, tRPC, React, shadcn, TanStack Query | ✅ | "Stack" section |
|
||
| Databases: PostgreSQL, BigQuery, Snowflake, Databricks | ✅ | Repository topics |
|
||
| Evaluation framework with unit testing | ✅ | "Evaluation framework" section |
|
||
| GitHub repo integration + Slack bot | ✅ | Quickstart + topics |
|
||
| Docker containerization + self-hosted | ✅ | "Docker" section + docs |
|
||
|
||
**Corrections made**: None (all initial claims were accurate)
|
||
|
||
**Stats requiring external research**: None (all verifiable on GitHub page)
|
||
|
||
---
|
||
|
||
## Final Decision
|
||
|
||
- **Final Score**: **3/5 (Moderate - Useful Complement)**
|
||
- **Action**: **Integrate** via 3 approaches (priority 1 + 2 + 3)
|
||
- **Confidence**: **High** (all claims fact-checked ✅)
|
||
|
||
### Concrete Action Plan
|
||
|
||
1. **Immediate** (today): Add mention in `guide/ai-ecosystem.md` section "Domain-Specific Agent Frameworks"
|
||
2. **Week 1**: Create `guide/agent-evaluation.md` with patterns inspired by nao
|
||
3. **Week 2-3**: Create template `examples/agents/analytics-with-eval/` with metrics
|
||
|
||
### Added Value for Guide
|
||
|
||
- ✅ Addresses critical gap (agent evaluation)
|
||
- ✅ Adds transposable patterns (context builder, DB integrations)
|
||
- ✅ Demonstrates complete lifecycle (build → eval → iterate)
|
||
- ✅ References production-ready framework in specific domain
|
||
|
||
---
|
||
|
||
## Metadata
|
||
|
||
**Evaluation performed with**: WebFetch (2×), Grep (3×), Task (technical-writer challenge)
|
||
**Evaluation time**: ~15 minutes
|
||
**Quality**: Complete challenge + fact-check ✅
|
||
**Follow-up**: Implement integration recommendations (A, B, C)
|