Eric S 9965d1baae feat: /eval skill — universal AI output evaluation suite

Evaluate any AI-powered feature: chat agents, content generators,
classifiers, summarizers, or custom AI systems.

- Asks what 'good' looks like, generates test scenarios automatically
- 13 criterion types (contains, regex, max_length, no_hallucination, etc.)
- Multi-turn chat and single-turn content modes
- Baseline comparison and regression detection
- Threshold gating (fail CI if score drops below X%)

Run: npx tsx eval/run-eval.ts
Docs: eval/README.md

2026-04-03 17:34:19 -07:00

6.4 KiB


/eval — AI Output Evaluation Suite

Evaluate any AI-powered feature: chat agents, content generators, classifiers, summarizers, code generators, or any system that takes input and produces AI output. The skill defines what "good" looks like through conversation, generates test cases, runs them, and scores the results.

When to use

After changing prompts, models, or system instructions
Before deploying any AI feature to production
Weekly to catch quality drift
When defining quality standards for a new AI feature

How to invoke

/eval                          # Start fresh or run existing config
/eval --run                    # Skip setup, run existing eval.config.json
/eval --verbose                # Show full outputs during run
/eval --baseline               # Save results as baseline for future comparisons

Instructions for Claude

When the user invokes /eval:

Step 1: Check for existing config

ls eval.config.json 2>/dev/null && echo "CONFIG_EXISTS" || echo "NO_CONFIG"

If CONFIG_EXISTS and the user did NOT pass --run: ask "You have an existing eval config with N scenarios. Want to run it, update it, or start fresh?"
If CONFIG_EXISTS and the user passed --run: skip to Step 4.
If NO_CONFIG: proceed to Step 2.
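
The three branches of Step 1 can be read as a small decision function. The sketch below is illustrative only: `nextStep` and its return labels are hypothetical names, not part of the skill or its runner.

```typescript
// Hypothetical helper mirroring the Step 1 branches; names are illustrative.
type NextStep = "ask_user" | "run_evals" | "setup";

function nextStep(configExists: boolean, passedRun: boolean): NextStep {
  if (configExists && passedRun) return "run_evals"; // skip to Step 4
  if (configExists) return "ask_user"; // run it, update it, or start fresh?
  return "setup"; // NO_CONFIG: proceed to Step 2
}

console.log(nextStep(true, false));
```

Note that this sketch treats `--run` without a config the same as NO_CONFIG, since the text above gives no separate rule for that case.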

Step 2: Understand what you're evaluating

Ask the user these questions ONE AT A TIME via AskUserQuestion. The first question determines the flow for the rest.

Q1: "What type of AI output are you evaluating?" Options:

A) Chat agent / conversational AI (multi-turn: user sends messages, agent responds)
B) Content generator (single input, long-form output: blog posts, emails, plans)
C) Classifier / scorer (input data, output label or score)
D) Summarizer / extractor (input document, output summary or structured data)
E) Something else (describe it)

Q2: "Describe what it does in one sentence." Example: "It generates SEO-optimized blog post outlines from a keyword."

Q3 (varies by type):

For CHAT AGENTS (A):

"What's the API endpoint? How do I send a message and get a response?"
"What should a good conversation look like? List 3-5 things the agent should always do."
"What should it never do?"
"Who are ideal users vs. who should be turned away?"

For CONTENT GENERATORS (B):

"What's the API endpoint or function? What input does it take?"
"What makes the output GOOD? (length, tone, structure, accuracy, keywords)"
"What makes the output BAD? (hallucinations, wrong tone, too short/long, missing sections)"
"Show me one example of good output if you have it."

For CLASSIFIERS (C):

"What's the API endpoint or function?"
"What are the possible output labels/scores?"
"Do you have labeled test data (known correct answers)?"
"What's the cost of a false positive vs. false negative?"

For SUMMARIZERS (D):

"What's the API endpoint or function?"
"What should the summary include? What should it exclude?"
"What's the max length?"
"Should it preserve specific details (names, numbers, dates)?"

For OTHER (E):

"Walk me through: what goes in, what comes out?"
"How do you know when the output is good vs. bad?"
"What are the failure modes you're worried about?"

Step 3: Generate eval config

Based on the user's answers, generate test cases appropriate to the type:

Chat agents: 10-20 multi-turn conversation scenarios (qualified users, unqualified users, edge cases, product knowledge tests, hostile users, capability boundaries)

Content generators: 10-15 input variations testing different topics, edge cases, and quality dimensions (accuracy, tone, length, structure, keyword inclusion)

Classifiers: 20-30 test inputs with known correct labels, covering each class, edge cases, and adversarial inputs

Summarizers: 10-15 test documents of varying length and complexity, checking for completeness, accuracy, length compliance, and hallucination

For all types, generate criteria based on:

Things the user said make output GOOD -> contains, regex, max_length checks
Things the user said make output BAD -> not_contains checks
Type-specific quality checks (see criterion types below)

Write the config to eval.config.json. Show the user and ask: "Does this cover what matters? Want to add or change anything?"
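
As a rough illustration of the GOOD -> contains and BAD -> not_contains mapping, the sketch below turns two lists of phrases into criteria. The `Criterion` shape (`type` plus `value`) and the function name are assumptions made for this example; the real eval.config.json schema for criteria may differ.

```typescript
// Assumed criterion shape; the actual schema used by the runner may differ.
interface Criterion {
  type: "contains" | "not_contains";
  value: string;
}

// Hypothetical helper: phrases the output must include or must avoid.
function criteriaFromBehaviors(mustInclude: string[], mustAvoid: string[]): Criterion[] {
  const good = mustInclude.map((value): Criterion => ({ type: "contains", value }));
  const bad = mustAvoid.map((value): Criterion => ({ type: "not_contains", value }));
  return [...good, ...bad];
}

const criteria = criteriaFromBehaviors(["pricing"], ["guarantee"]);
console.log(JSON.stringify(criteria, null, 2));
```

In practice the user's answers are free-form, so Claude would still need to extract concrete phrases, patterns, or length limits from them before a mapping like this applies.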

Step 4: Run the evals

npx tsx .claude/skills/eval/run-eval.ts [--config eval.config.json] [--verbose] [--baseline]

The runner handles all types. For chat agents, it sends messages sequentially and evaluates the full conversation. For single-input types, it sends one request per scenario and evaluates the output.

Step 5: Report results

Summary table (scenario x criterion, pass/fail)
Overall score: "X/Y criteria passed (Z%)"
If baseline exists: "Score changed from A% to B%"
Regressions: scenarios that got worse since last run
Top 3 failures with diagnosis
Recommendation: fix or ship
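
The score line, regression list, and threshold check could be computed along these lines. This is a sketch under assumptions: the `ScenarioResult` shape is invented for the example (the format of eval-results.json is not specified here), and rounding the percentage to a whole number is a presentation choice, not a documented behavior.

```typescript
// Assumed result shape: one boolean per criterion checked in a scenario.
interface ScenarioResult {
  scenario: string;
  checks: boolean[];
}

function passCount(r: ScenarioResult): number {
  return r.checks.filter((ok) => ok).length;
}

function summarize(results: ScenarioResult[]): { passed: number; total: number; percent: number } {
  let passed = 0;
  let total = 0;
  for (const r of results) {
    passed += passCount(r);
    total += r.checks.length;
  }
  const percent = total === 0 ? 0 : (passed / total) * 100;
  return { passed, total, percent };
}

// A scenario counts as a regression if it passes fewer criteria than in the baseline.
function findRegressions(baseline: ScenarioResult[], current: ScenarioResult[]): string[] {
  const before: Record<string, number> = {};
  for (const r of baseline) before[r.scenario] = passCount(r);
  return current
    .filter((r) => r.scenario in before && passCount(r) < before[r.scenario])
    .map((r) => r.scenario);
}

// Threshold gating: the run passes only if the score is at or above the threshold.
function meetsThreshold(percent: number, threshold: number): boolean {
  return percent >= threshold;
}

const baseline: ScenarioResult[] = [
  { scenario: "greeting", checks: [true, true] },
  { scenario: "refund", checks: [true, false] },
];
const current: ScenarioResult[] = [
  { scenario: "greeting", checks: [true, false] },
  { scenario: "refund", checks: [true, true] },
];

const s = summarize(current);
console.log(`${s.passed}/${s.total} criteria passed (${s.percent.toFixed(0)}%)`);
console.log("Regressions:", findRegressions(baseline, current));
console.log("Meets threshold of 80:", meetsThreshold(s.percent, 80));
```

In this made-up data the overall score is unchanged between runs, yet "greeting" is still flagged, which is why regressions are reported per scenario rather than inferred from the total.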

Step 6: Iterate

If failures exist:

Read the failing output from eval-results.json
Diagnose root cause (prompt issue, missing data, model limitation)
Suggest a fix
After fix: re-run to verify

Config format

{
  "name": "My AI Feature Eval",
  "type": "chat | content | classifier | summarizer | custom",
  "endpoint": "https://my-api.com/endpoint",
  "method": "POST",
  "headers": {},
  "request_template": {},
  "response_field": "response",
  "threshold": 80,
  "good_behaviors": [],
  "bad_behaviors": [],
  "scenarios": []
}
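
For readers who prefer types, the template above might be described in TypeScript roughly as follows. The field names come from the template; the field types, the `validateConfig` helper, and the reading of `threshold` as a percentage from 0 to 100 are assumptions, and the scenario shape is left open because this document does not specify it.

```typescript
const EVAL_TYPES = ["chat", "content", "classifier", "summarizer", "custom"] as const;
type EvalType = (typeof EVAL_TYPES)[number];

// Field names follow the template; the types are a best guess.
interface EvalConfig {
  name: string;
  type: EvalType;
  endpoint: string;
  method: string;
  headers: Record<string, string>;
  request_template: Record<string, unknown>;
  response_field: string;
  threshold: number; // assumed to be a percentage, 0 to 100
  good_behaviors: string[];
  bad_behaviors: string[];
  scenarios: unknown[]; // scenario shape is not specified in this document
}

// Hypothetical sanity check; returns a list of problems (empty means OK).
function validateConfig(raw: unknown): string[] {
  if (typeof raw !== "object" || raw === null) return ["config must be a JSON object"];
  const cfg = raw as Partial<EvalConfig>;
  const problems: string[] = [];
  if (typeof cfg.name !== "string" || cfg.name === "") problems.push("name must be a non-empty string");
  if (EVAL_TYPES.indexOf(cfg.type as EvalType) === -1) {
    problems.push("type must be one of: " + EVAL_TYPES.join(", "));
  }
  if (typeof cfg.endpoint !== "string") problems.push("endpoint must be a string");
  if (typeof cfg.threshold !== "number" || cfg.threshold < 0 || cfg.threshold > 100) {
    problems.push("threshold must be a number from 0 to 100");
  }
  if (!Array.isArray(cfg.scenarios)) problems.push("scenarios must be an array");
  return problems;
}

console.log(validateConfig({ name: "My AI Feature Eval", type: "chat", endpoint: "https://my-api.com/endpoint", threshold: 80, scenarios: [] }));
```

A check like this would reject the template as written, because its `type` value lists all the options rather than picking one.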

The type field determines how scenarios are executed:

chat: multi-turn (sends messages sequentially, maintains history)
content | classifier | summarizer | custom: single-turn (one request per scenario)
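
The difference between the two execution modes can be sketched as below. This is not the actual code of run-eval.ts: the `Message` shape, the `Send` callback (standing in for the HTTP call to the endpoint), and `runScenario` are all hypothetical names used to show the control flow.

```typescript
type EvalType = "chat" | "content" | "classifier" | "summarizer" | "custom";

// Assumed message shape; the real request format depends on request_template.
interface Message {
  role: "user" | "assistant";
  content: string;
}

// Stand-in for the API call: takes the conversation so far, returns the reply.
type Send = (history: Message[]) => Promise<string>;

async function runScenario(type: EvalType, inputs: string[], send: Send): Promise<string[]> {
  if (type === "chat") {
    // Multi-turn: send messages one at a time and keep the growing history.
    const history: Message[] = [];
    const replies: string[] = [];
    for (const input of inputs) {
      history.push({ role: "user", content: input });
      const reply = await send([...history]);
      history.push({ role: "assistant", content: reply });
      replies.push(reply);
    }
    return replies;
  }
  // Single-turn: one request per scenario, no history.
  const output = await send([{ role: "user", content: inputs[0] ?? "" }]);
  return [output];
}

// Usage with a fake sender that reports how many messages it received.
const fakeSend: Send = async (history) => `saw ${history.length} message(s)`;
runScenario("chat", ["hi", "what does it cost?"], fakeSend).then((replies) => console.log(replies));
```

Passing the sender in as a parameter keeps the sketch testable without a live endpoint.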

Criterion types

Type	Works for	Description
`contains`	All	Output contains a string (case-insensitive)
`not_contains`	All	Output does NOT contain a string
`regex`	All	Output matches a regex pattern
`max_length`	All	Output is under N characters
`min_length`	Content	Output is at least N characters
`max_sentences`	Chat, Content	Output is under N sentences
`response_time`	All	API responds within N milliseconds
`json_valid`	Classifier, Custom	Output is valid JSON
`json_field_equals`	Classifier	A JSON field equals an expected value
`no_hallucination`	Content, Summarizer	Output doesn't contain claims not in the input
`preserves_names`	Summarizer	Key names from input appear in output
`preserves_numbers`	Summarizer	Key numbers from input appear in output
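
To make the table concrete, here is a minimal sketch of how a few of the deterministic criterion types could be checked against an output string. The `Criterion` field names (`value`, `field`) are hypothetical, the case-insensitive handling of `not_contains` is an assumption (the table only states it for `contains`), and types such as `no_hallucination` or `response_time` are omitted because they need the input or timing data.

```typescript
// Assumed criterion shape; `value` and `field` are illustrative names.
interface Criterion {
  type: string;
  value?: string | number;
  field?: string;
}

function containsIgnoreCase(output: string, needle: string): boolean {
  return output.toLowerCase().indexOf(needle.toLowerCase()) !== -1;
}

function check(criterion: Criterion, output: string): boolean {
  const { type, value, field } = criterion;
  switch (type) {
    case "contains":
      return containsIgnoreCase(output, String(value));
    case "not_contains":
      return !containsIgnoreCase(output, String(value)); // case-insensitivity assumed
    case "regex":
      return new RegExp(String(value)).test(output);
    case "max_length":
      return output.length < Number(value); // "under N characters"
    case "min_length":
      return output.length >= Number(value); // "at least N characters"
    case "json_valid":
      try {
        JSON.parse(output);
        return true;
      } catch {
        return false;
      }
    case "json_field_equals":
      try {
        return JSON.parse(output)[field ?? ""] === value;
      } catch {
        return false; // invalid JSON or non-object output cannot match
      }
    default:
      throw new Error(`Unsupported criterion type in this sketch: ${type}`);
  }
}

console.log(check({ type: "contains", value: "refund" }, "Our Refund policy is simple."));
console.log(check({ type: "json_field_equals", field: "label", value: "spam" }, '{"label":"spam"}'));
```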
