Major addition: Complete agent evaluation framework with production-ready template. ## Added - **Resource Evaluation**: nao framework (score 3/5) - Identified critical gap: agent evaluation not documented - Technical challenge adjusted score 2/5 → 3/5 - All claims fact-checked (TypeScript 58.9%, Python 38.5%) - **Guide Section**: Agent Evaluation (guide/agent-evaluation.md, ~3K tokens) - Metrics: response quality, tool usage, performance, satisfaction - Patterns: logging hooks, unit tests, A/B testing, feedback loops - Example: analytics agent with built-in metrics - Tools: nao framework reference, Claude Code hooks integration - **AI Ecosystem**: Section 8.2 Domain-Specific Agent Frameworks - nao (Analytics Agents): Database-agnostic, built-in evaluation - Transposable patterns: context builder, evaluation hooks, DB integrations - **Template**: Analytics Agent with Evaluation (5 files, ~1K lines) - README: setup, usage, troubleshooting - Agent: SQL generator with evaluation criteria, safety rules - Hook: automated metrics logging (safety, performance, errors) - Script: analysis with stats, safety reports, recommendations - Report template: monthly evaluation format ## Changed - Agent Evaluation Guide: updated template references, verified links - Landing Site: templates count 110 → 114 - Version: 3.23.5 → 3.24.0 Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
7.7 KiB
Resource Evaluation: nao Framework
URL: https://github.com/getnao/nao/ Type: Open-source framework for building analytics agents Evaluation Date: 2026-02-10 Evaluator: Claude Code (with technical-writer challenge) Target Guide: Claude Code Ultimate Guide
Summary
nao is an open-source framework for building and deploying analytics agents with a two-step architecture: build agent context via CLI tools, then deploy chat UI for end users to query data conversationally.
Key Features:
- Context builder supporting flexible data, metadata, documentation, and tool integrations
- Database agnostic (PostgreSQL, BigQuery, Snowflake, Databricks)
- Built-in evaluation framework with unit testing capabilities
- Self-hosted deployment with Docker containerization
- Native data visualization within chat interface
- Tech stack: Fastify, Drizzle ORM, tRPC, React, shadcn UI (TypeScript 58.9%, Python 38.5%)
Relevance Score: 3/5 (Moderate - Useful Complement)
Initial Score: 2/5 → Adjusted to 3/5 after technical challenge
Justification for 3/5:
Relevance (+2 points):
- ✅ Transposable architecture patterns to Claude Code agents (.claude/agents/)
- ✅ Evaluation framework addresses critical gap in guide (no section on agent evaluation)
- ✅ Database context patterns applicable to agents with DB integrations
Limitations (-2 points):
- ❌ No direct integration with Claude Code CLI (not a plugin/MCP server)
- ❌ Deployment scope (production chat UI) differs from guide's dev CLI focus
Partial overlap: Agent concepts and evaluation patterns are transposable, but final product (deployed analytics UI) is not guide's focus.
Comparative Analysis
| Aspect | nao | Current Guide | Gap? |
|---|---|---|---|
| Custom Agents | ✅ Complete build + deploy framework | ✅ Extensive docs (.claude/agents/) | ➖ No |
| Agent Architecture | ➕ Structured context builder pattern | ⚠️ No complex context patterns | ✅ Minor gap |
| Agent Evaluation | ➕ Integrated framework (metrics, unit tests, feedback) | ❌ No mention | ✅ Critical gap |
| Database Integrations | ➕ Native support: BigQuery, Snowflake, Databricks, PostgreSQL | ✅ Mentioned (database-branch-setup.md) | ➖ No |
| Agent Deployment | ➕ Deployable chat UI with visualizations | ❌ Not covered (CLI focus) | ⚠️ Out of scope |
| Analytics-Specific | ➕ Specialized for conversational analytics | ❌ No analytics focus | ➖ No |
| Context Management | ✅ Context builder with docs/metadata | ✅ CLAUDE.md, .claude/rules/ | ➖ No |
| Self-Hosted | ✅ Docker + deployment guides | ✅ Mentioned (security-hardening.md) | ➖ No |
Integration Recommendations
Score 3/5 → Integrate (3 approaches)
✅ Priority 1: Dedicated Section (Recommended)
Create: guide/agent-evaluation.md (~800 tokens)
Content:
- Why Evaluate?: Measure quality, track usage, identify bottlenecks
- Metrics to Track: Response time, tool call success rate, context relevance, error rate
- Implementation Patterns: Logging hooks, metrics aggregation, A/B testing, feedback collection
- Reference: Mention nao as complete evaluation framework
Estimated tokens: 500-800
Suggested timeline: Week 1
Insertion point: After guide/ultimate-guide.md section 4 (Agents)
✅ Priority 2: Agent Template with Evaluation (Optional)
Create: examples/agents/analytics-with-eval/
Structure:
analytics-with-eval/
├── config.yaml # Agent config
├── context.md # Agent instructions
├── eval/
│ ├── metrics.sh # Collect metrics
│ └── report.md # Example report
└── hooks/
└── post_response.sh # Log response metrics
Estimated tokens: 300-400 (code + README) Suggested timeline: Week 2-3 Value: Demonstrates concrete evaluation patterns
✅ Priority 3: Ecosystem Mention (Minimal)
Add: Section "Domain-Specific Agent Frameworks" in guide/ai-ecosystem.md
Content: 1 paragraph + link to nao
Estimated tokens: 100-150 Suggested timeline: Immediate Suggested line: After orchestration frameworks section (~line 2200)
Technical Challenge Results
The technical-writer agent identified several biases in the initial evaluation:
Missed Points in Initial Evaluation
- Transposable agent architecture: nao's
context builderpattern applicable to Claude Code agents - Evaluation framework = critical gap: Guide has NO mention of agent evaluation (metrics, testing, feedback)
- Database context patterns: Patterns for context injection from databases not documented
Score Justification
Correction: 2/5 → 3/5 (Moderate)
Arguments for 3/5:
- Transposable architecture patterns (+1)
- Evaluation framework addresses identified gap (+1)
- Usable database context patterns (+0.5)
- Open-source, well-documented (+0.5)
Gap "Agent Evaluation" Must Be Addressed
YES, the guide MUST have this section because:
- Devs create agents without knowing how to measure quality
- Anthropic docs mention evaluations but not in Claude Code context
- nao proves this is feasible and useful (production-ready)
Risks of Non-Integration
- Evaluation gap remains undocumented → Devs don't know how to measure agent quality
- Database context patterns undocumented → Devs reinvent already-proven patterns
- Loss of credibility → If evaluation becomes standard, guide will be behind
Why Initial Evaluation Was Biased
- Confusion between scope and relevance: Different scope ≠ not relevant
- Focus on final product: Evaluated nao as competing product, not pattern source
- Underestimation of gaps: Agent evaluation = critical gap not previously identified
- Premature rejection: "Don't integrate" despite identifying 2 major gaps
Lesson: Evaluate resources for transposable patterns, not just direct integration.
Fact-Check Results
All technical claims verified by re-fetching GitHub repository:
| Claim | Verified | Source |
|---|---|---|
| TypeScript 58.9%, Python 38.5% | ✅ | Repository footer |
| Stack: Fastify, Drizzle, tRPC, React, shadcn, TanStack Query | ✅ | "Stack" section |
| Databases: PostgreSQL, BigQuery, Snowflake, Databricks | ✅ | Repository topics |
| Evaluation framework with unit testing | ✅ | "Evaluation framework" section |
| GitHub repo integration + Slack bot | ✅ | Quickstart + topics |
| Docker containerization + self-hosted | ✅ | "Docker" section + docs |
Corrections made: None (all initial claims were accurate)
Stats requiring external research: None (all verifiable on GitHub page)
Final Decision
- Final Score: 3/5 (Moderate - Useful Complement)
- Action: Integrate via 3 approaches (priority 1 + 2 + 3)
- Confidence: High (all claims fact-checked ✅)
Concrete Action Plan
- Immediate (today): Add mention in
guide/ai-ecosystem.mdsection "Domain-Specific Agent Frameworks" - Week 1: Create
guide/agent-evaluation.mdwith patterns inspired by nao - Week 2-3: Create template
examples/agents/analytics-with-eval/with metrics
Added Value for Guide
- ✅ Addresses critical gap (agent evaluation)
- ✅ Adds transposable patterns (context builder, DB integrations)
- ✅ Demonstrates complete lifecycle (build → eval → iterate)
- ✅ References production-ready framework in specific domain
Metadata
Evaluation performed with: WebFetch (2×), Grep (3×), Task (technical-writer challenge) Evaluation time: ~15 minutes Quality: Complete challenge + fact-check ✅ Follow-up: Implement integration recommendations (A, B, C)