
Eric S 9965d1baae feat: /eval skill — universal AI output evaluation suite Evaluate any AI-powered feature: chat agents, content generators, classifiers, summarizers, or custom AI systems. - Asks what 'good' looks like, generates test scenarios automatically - 13 criterion types (contains, regex, max_length, no_hallucination, etc.) - Multi-turn chat and single-turn content modes - Baseline comparison and regression detection - Threshold gating (fail CI if score drops below X%) Run: npx tsx eval/run-eval.ts Docs: eval/README.md		2026-04-03 17:34:19 -07:00
CLAUDE.md	feat: /eval skill — universal AI output evaluation suite	2026-04-03 17:34:19 -07:00
eval.config.example.json	feat: /eval skill — universal AI output evaluation suite	2026-04-03 17:34:19 -07:00
README.md	feat: /eval skill — universal AI output evaluation suite	2026-04-03 17:34:19 -07:00
run-eval.ts	feat: /eval skill — universal AI output evaluation suite	2026-04-03 17:34:19 -07:00

README.md

/eval — AI Output Evaluation Skill for Claude Code

Automated evaluation suite for any AI-powered feature. Works with chat agents, content generators, classifiers, summarizers, and any system that takes input and produces AI output.

What it does

Asks you what "good" looks like for your AI feature
Generates test scenarios and pass/fail criteria
Runs the scenarios against your API
Scores results and detects regressions
Helps you fix failures

Install

Copy the eval/ directory into your project's .claude/skills/:

cp -r eval/ your-project/.claude/skills/eval/

Or clone and symlink:

git clone https://github.com/nichetools/ai-marketing-skills.git
# Use an absolute target: a relative target is resolved from the link's own directory
ln -s "$(pwd)/ai-marketing-skills/eval" your-project/.claude/skills/eval

Quick start

In Claude Code, type /eval
Answer the questions about what your AI does and what good/bad looks like
The skill generates eval.config.json with test scenarios
It runs the scenarios and shows you the results
Fix failures, re-run, repeat

Running evals manually

# Run all scenarios
npx tsx .claude/skills/eval/run-eval.ts

# Use a specific config
npx tsx .claude/skills/eval/run-eval.ts --config my-eval.config.json

# See full AI outputs
npx tsx .claude/skills/eval/run-eval.ts --verbose

# Save current results as baseline
npx tsx .claude/skills/eval/run-eval.ts --baseline

Supported AI types

Type	How it works
Chat agent	Sends multi-turn messages, evaluates full conversation
Content generator	Sends single input, evaluates output quality
Classifier	Sends inputs with known labels, checks accuracy
Summarizer	Sends documents, checks summary quality
Custom	Any input/output API you define

Criterion types

Type	What it checks
`contains`	Output includes a string (case-insensitive)
`not_contains`	Output does NOT include a string
`regex`	Output matches a regex pattern
`max_length`	Output is under N characters
`min_length`	Output is at least N characters
`max_sentences`	Output is under N sentences
`response_time`	API responds within N milliseconds
`json_valid`	Output is valid JSON
`json_field_equals`	A JSON field equals an expected value
`no_hallucination`	Output doesn't contain claims not in the input
`preserves_names`	Key names from input appear in output
`preserves_numbers`	Key numbers from input appear in output
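
To make a few rows of the table concrete, here is a minimal TypeScript sketch of how such checks could be written. It is an illustration, not the code in run-eval.ts: the `Criterion` shape and the `checkCriterion` name are assumptions, only a handful of criterion types are covered, and the reading of "under N" as strictly fewer than N is a guess.

```typescript
// Hypothetical criterion shape; mirrors the config fields shown in this README.
type Criterion = {
  type: string;
  value?: string | number;
  description?: string;
};

// Illustrative checker for a subset of criterion types (not the real run-eval.ts logic).
function checkCriterion(criterion: Criterion, output: string): boolean {
  switch (criterion.type) {
    case "contains":
      // Case-insensitive substring match, as described in the table.
      return output.toLowerCase().includes(String(criterion.value).toLowerCase());
    case "not_contains":
      // Assumed to be case-insensitive as well; the table does not say.
      return !output.toLowerCase().includes(String(criterion.value).toLowerCase());
    case "regex":
      return new RegExp(String(criterion.value)).test(output);
    case "max_length":
      // "Under N characters" is read here as strictly fewer than N (assumption).
      return output.length < Number(criterion.value);
    case "min_length":
      return output.length >= Number(criterion.value);
    case "json_valid":
      try {
        JSON.parse(output);
        return true;
      } catch {
        return false;
      }
    default:
      throw new Error(`Unsupported criterion type in this sketch: ${criterion.type}`);
  }
}
```

Throwing on an unknown type, rather than silently passing, keeps a typo in the config from inflating the score.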

Config format

{
  "name": "My AI Feature Eval",
  "type": "chat",
  "endpoint": "https://my-api.com/chat",
  "method": "POST",
  "headers": { "Content-Type": "application/json" },
  "request_template": {
    "message": "{{message}}",
    "history": "{{history}}"
  },
  "response_field": "response",
  "threshold": 80,
  "scenarios": [
    {
      "name": "basic_greeting",
      "messages": ["Hello"],
      "criteria": [
        { "type": "contains", "value": "help", "description": "Offers help" }
      ]
    }
  ]
}

See eval.config.example.json for a full example.
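
The `{{message}}` and `{{history}}` placeholders in `request_template` suggest a substitution step before each request. The sketch below shows one way that could work. The `fillTemplate` helper and the `TemplateValue` type are hypothetical, and the rule that a string consisting of exactly one placeholder is replaced by the raw value (so `history` can stay an array) is an assumption, not documented behavior.

```typescript
type TemplateValue =
  | string
  | number
  | boolean
  | null
  | TemplateValue[]
  | { [key: string]: TemplateValue };

// Hypothetical helper: fills {{name}} placeholders in a request template.
function fillTemplate(template: TemplateValue, vars: Record<string, TemplateValue>): TemplateValue {
  if (typeof template === "string") {
    // A string that is exactly one placeholder is replaced by the raw value,
    // so "{{history}}" can become an array rather than a string (assumption).
    const whole = template.match(/^\{\{(\w+)\}\}$/);
    if (whole && whole[1] in vars) return vars[whole[1]];
    // Otherwise substitute placeholders inside the string; unknown names are left as-is.
    return template.replace(/\{\{(\w+)\}\}/g, (match, name: string) =>
      name in vars ? String(vars[name]) : match
    );
  }
  if (Array.isArray(template)) return template.map((item) => fillTemplate(item, vars));
  if (template !== null && typeof template === "object") {
    const out: { [key: string]: TemplateValue } = {};
    for (const [key, value] of Object.entries(template)) out[key] = fillTemplate(value, vars);
    return out;
  }
  return template;
}

const body = fillTemplate(
  { message: "{{message}}", history: "{{history}}" },
  { message: "Hello", history: [] }
);
console.log(JSON.stringify(body)); // {"message":"Hello","history":[]}
```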

Regression detection

Save a baseline after your first passing run:

npx tsx .claude/skills/eval/run-eval.ts --baseline

Future runs automatically compare against the baseline and flag:

Score drops
Individual scenarios that got worse
New failures that didn't exist before
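
The three flags above can be sketched as a comparison between two result sets. This is a minimal illustration under stated assumptions: the `ScenarioResult` shape, the `compareToBaseline` name, and the definition of a "new failure" (a scenario that passed every criterion in the baseline and no longer does) are all hypothetical, and the real baseline file format may differ.

```typescript
// Hypothetical result shape; the real baseline file format may differ.
type ScenarioResult = { name: string; passed: number; total: number };

type RegressionReport = {
  scoreDrop: number;     // overall percentage-point drop (0 if the score held or improved)
  worsened: string[];    // scenarios passing fewer criteria than in the baseline
  newFailures: string[]; // scenarios that fully passed in the baseline but now fail
};

// Overall score as the percentage of criteria passed across all scenarios.
function overallScore(results: ScenarioResult[]): number {
  const total = results.reduce((sum, r) => sum + r.total, 0);
  const passed = results.reduce((sum, r) => sum + r.passed, 0);
  return total === 0 ? 0 : (passed * 100) / total;
}

function compareToBaseline(baseline: ScenarioResult[], current: ScenarioResult[]): RegressionReport {
  const before = new Map(baseline.map((r) => [r.name, r] as const));
  const worsened: string[] = [];
  const newFailures: string[] = [];
  for (const now of current) {
    const prev = before.get(now.name);
    if (!prev) continue; // scenario added after the baseline; nothing to compare
    if (now.passed < prev.passed) worsened.push(now.name);
    if (prev.passed === prev.total && now.passed < now.total) newFailures.push(now.name);
  }
  const scoreDrop = Math.max(0, overallScore(baseline) - overallScore(current));
  return { scoreDrop, worsened, newFailures };
}
```

Matching scenarios by name means renaming a scenario resets its history, which is worth keeping in mind when editing the config.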

When to run evals

After every prompt or model change
Before deploying to production
Weekly to catch quality drift from content/data changes
When onboarding a new AI feature
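
For the pre-deploy case, the `threshold` field from the config can act as a gate in CI. The sketch below shows the idea only. The `gateExitCode` helper is hypothetical, and scoring as the percentage of criteria passed is an assumption about how run-eval.ts computes its score.

```typescript
// Hypothetical gate: maps a score and the config threshold to a process exit code.
function gateExitCode(passedCriteria: number, totalCriteria: number, thresholdPercent: number): number {
  // An empty run scores 0 so that it cannot pass the gate by accident.
  const score = totalCriteria === 0 ? 0 : (passedCriteria * 100) / totalCriteria;
  return score >= thresholdPercent ? 0 : 1;
}

// A CI script might end with: process.exit(gateExitCode(passed, total, 80));
console.log(gateExitCode(17, 20, 80)); // 0 (85% meets the 80% threshold)
console.log(gateExitCode(15, 20, 80)); // 1 (75% is below it)
```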

Adding custom criteria

Edit run-eval.ts and add a case to the evaluateCriterion function:

case 'my_custom_check':
  // Your logic here
  return someCondition;

Then use it in your config:

{ "type": "my_custom_check", "value": "whatever", "description": "My check" }
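
As a concrete illustration, a custom check that verifies how the output begins might look like the following. It is written as a standalone function because the exact parameters of `evaluateCriterion` are not shown in this README; the `startsWithCheck` name and the `CustomCriterion` type are hypothetical.

```typescript
// Standalone illustration; in run-eval.ts the body would live inside the switch
// in evaluateCriterion, whose exact parameters are not shown in this README.
type CustomCriterion = { type: string; value: string; description?: string };

function startsWithCheck(criterion: CustomCriterion, output: string): boolean {
  // Ignore leading whitespace and letter case before comparing (a choice made for this sketch).
  return output.trimStart().toLowerCase().startsWith(criterion.value.toLowerCase());
}

console.log(startsWithCheck({ type: "starts_with", value: "Sure" }, "  sure, here you go")); // true
```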

Philosophy

"Don't ship prompts without evals. It's the AI equivalent of shipping code without tests."

Manual testing is important for tone and feel. But automated evals catch regressions, enforce quality standards, and give you a score to track over time. Use both.