feat: /eval skill — universal AI output evaluation suite

Evaluate any AI-powered feature: chat agents, content generators,
classifiers, summarizers, or custom AI systems.

- Asks what 'good' looks like, generates test scenarios automatically
- 13 criterion types (contains, regex, max_length, no_hallucination, etc.)
- Multi-turn chat and single-turn content modes
- Baseline comparison and regression detection
- Threshold gating (fail CI if score drops below X%)

Run: npx tsx eval/run-eval.ts
Docs: eval/README.md
This commit is contained in:
Eric S 2026-04-03 17:34:19 -07:00
parent 3a0e0931d9
commit 9965d1baae
4 changed files with 894 additions and 0 deletions

161
eval/CLAUDE.md Normal file
View file

@ -0,0 +1,161 @@
# /eval — AI Output Evaluation Suite
Evaluate any AI-powered feature: chat agents, content generators, classifiers, summarizers, code generators, or any system that takes input and produces AI output. Defines what "good" looks like through conversation, generates test cases, runs them, and scores results.
## When to use
- After changing prompts, models, or system instructions
- Before deploying any AI feature to production
- Weekly to catch quality drift
- When defining quality standards for a new AI feature
## How to invoke
```
/eval # Start fresh or run existing config
/eval --run # Skip setup, run existing eval.config.json
/eval --verbose # Show full outputs during run
/eval --baseline # Save results as baseline for future comparisons
```
## Instructions for Claude
When the user invokes `/eval`:
### Step 1: Check for existing config
```bash
ls eval.config.json 2>/dev/null && echo "CONFIG_EXISTS" || echo "NO_CONFIG"
```
If `CONFIG_EXISTS` and user did NOT pass `--run`: ask "You have an existing eval config with N scenarios. Want to run it, update it, or start fresh?"
If `CONFIG_EXISTS` and user passed `--run`: skip to Step 4.
If `NO_CONFIG`: proceed to Step 2.
### Step 2: Understand what you're evaluating
Ask the user these questions ONE AT A TIME via AskUserQuestion. The first question determines the flow for the rest.
**Q1: "What type of AI output are you evaluating?"**
Options:
- A) Chat agent / conversational AI (multi-turn: user sends messages, agent responds)
- B) Content generator (single input, long-form output: blog posts, emails, plans)
- C) Classifier / scorer (input data, output label or score)
- D) Summarizer / extractor (input document, output summary or structured data)
- E) Something else (describe it)
**Q2: "Describe what it does in one sentence."**
Example: "It generates SEO-optimized blog post outlines from a keyword."
**Q3 (varies by type):**
For CHAT AGENTS (A):
- "What's the API endpoint? How do I send a message and get a response?"
- "What should a good conversation look like? List 3-5 things the agent should always do."
- "What should it never do?"
- "Who are ideal users vs. who should be turned away?"
For CONTENT GENERATORS (B):
- "What's the API endpoint or function? What input does it take?"
- "What makes the output GOOD? (length, tone, structure, accuracy, keywords)"
- "What makes the output BAD? (hallucinations, wrong tone, too short/long, missing sections)"
- "Show me one example of good output if you have it."
For CLASSIFIERS (C):
- "What's the API endpoint or function?"
- "What are the possible output labels/scores?"
- "Do you have labeled test data (known correct answers)?"
- "What's the cost of a false positive vs. false negative?"
For SUMMARIZERS (D):
- "What's the API endpoint or function?"
- "What should the summary include? What should it exclude?"
- "What's the max length?"
- "Should it preserve specific details (names, numbers, dates)?"
For OTHER (E):
- "Walk me through: what goes in, what comes out?"
- "How do you know when the output is good vs. bad?"
- "What are the failure modes you're worried about?"
### Step 3: Generate eval config
Based on the user's answers, generate test cases appropriate to the type:
**Chat agents:** 10-20 multi-turn conversation scenarios (qualified users, unqualified users, edge cases, product knowledge tests, hostile users, capability boundaries)
**Content generators:** 10-15 input variations testing different topics, edge cases, and quality dimensions (accuracy, tone, length, structure, keyword inclusion)
**Classifiers:** 20-30 test inputs with known correct labels, covering each class, edge cases, and adversarial inputs
**Summarizers:** 10-15 test documents of varying length and complexity, checking for completeness, accuracy, length compliance, and hallucination
**For all types, generate criteria based on:**
- Things the user said make output GOOD -> `contains`, `regex`, `max_length` checks
- Things the user said make output BAD -> `not_contains` checks
- Type-specific quality checks (see criterion types below)
Write the config to `eval.config.json`. Show the user and ask: "Does this cover what matters? Want to add or change anything?"
### Step 4: Run the evals
```bash
npx tsx .claude/skills/eval/run-eval.ts [--config eval.config.json] [--verbose] [--baseline]
```
The runner handles all types. For chat agents, it sends messages sequentially and evaluates the full conversation. For single-input types, it sends one request per scenario and evaluates the output.
### Step 5: Report results
1. Summary table (scenario x criterion, pass/fail)
2. Overall score: "X/Y criteria passed (Z%)"
3. If baseline exists: "Score changed from A% to B%"
4. Regressions: scenarios that got worse since last run
5. Top 3 failures with diagnosis
6. Recommendation: fix or ship
### Step 6: Iterate
If failures exist:
1. Read the failing output from eval-results.json
2. Diagnose root cause (prompt issue, missing data, model limitation)
3. Suggest a fix
4. After fix: re-run to verify
## Config format
```json
{
"name": "My AI Feature Eval",
"type": "chat | content | classifier | summarizer | custom",
"endpoint": "https://my-api.com/endpoint",
"method": "POST",
"headers": {},
"request_template": {},
"response_field": "response",
"threshold": 80,
"good_behaviors": [],
"bad_behaviors": [],
"scenarios": []
}
```
The `type` field determines how scenarios are executed:
- `chat`: multi-turn (sends messages sequentially, maintains history)
- `content | classifier | summarizer | custom`: single-turn (one request per scenario)
## Criterion types
| Type | Works for | Description |
|------|-----------|-------------|
| `contains` | All | Output contains a string (case-insensitive) |
| `not_contains` | All | Output does NOT contain a string |
| `regex` | All | Output matches a regex pattern |
| `max_length` | All | Output is under N characters |
| `min_length` | Content | Output is at least N characters |
| `max_sentences` | Chat, Content | Output is under N sentences |
| `response_time` | All | API responds within N milliseconds |
| `json_valid` | Classifier, Custom | Output is valid JSON |
| `json_field_equals` | Classifier | A JSON field equals an expected value |
| `no_hallucination` | Content, Summarizer | Output doesn't contain claims not in the input |
| `preserves_names` | Summarizer | Key names from input appear in output |
| `preserves_numbers` | Summarizer | Key numbers from input appear in output |

148
eval/README.md Normal file
View file

@ -0,0 +1,148 @@
# /eval — AI Output Evaluation Skill for Claude Code
Automated evaluation suite for any AI-powered feature. Works with chat agents, content generators, classifiers, summarizers, and any system that takes input and produces AI output.
## What it does
1. Asks you what "good" looks like for your AI feature
2. Generates test scenarios and pass/fail criteria
3. Runs the scenarios against your API
4. Scores results and detects regressions
5. Helps you fix failures
## Install
Copy the `eval/` directory into your project's `.claude/skills/`:
```bash
cp -r eval/ your-project/.claude/skills/eval/
```
Or clone and symlink:
```bash
git clone https://github.com/nichetools/ai-marketing-skills.git
ln -s ai-marketing-skills/eval your-project/.claude/skills/eval
```
## Quick start
1. In Claude Code, type `/eval`
2. Answer the questions about what your AI does and what good/bad looks like
3. The skill generates `eval.config.json` with test scenarios
4. It runs the scenarios and shows you the results
5. Fix failures, re-run, repeat
## Running evals manually
```bash
# Run all scenarios
npx tsx .claude/skills/eval/run-eval.ts
# Use a specific config
npx tsx .claude/skills/eval/run-eval.ts --config my-eval.config.json
# See full AI outputs
npx tsx .claude/skills/eval/run-eval.ts --verbose
# Save current results as baseline
npx tsx .claude/skills/eval/run-eval.ts --baseline
```
## Supported AI types
| Type | How it works |
|------|-------------|
| **Chat agent** | Sends multi-turn messages, evaluates full conversation |
| **Content generator** | Sends single input, evaluates output quality |
| **Classifier** | Sends inputs with known labels, checks accuracy |
| **Summarizer** | Sends documents, checks summary quality |
| **Custom** | Any input/output API you define |
## Criterion types
| Type | What it checks |
|------|---------------|
| `contains` | Output includes a string (case-insensitive) |
| `not_contains` | Output does NOT include a string |
| `regex` | Output matches a regex pattern |
| `max_length` | Output is under N characters |
| `min_length` | Output is at least N characters |
| `max_sentences` | Output is under N sentences |
| `response_time` | API responds within N milliseconds |
| `json_valid` | Output is valid JSON |
| `json_field_equals` | A JSON field equals an expected value |
| `no_hallucination` | Output doesn't contain claims not in the input |
| `preserves_names` | Key names from input appear in output |
| `preserves_numbers` | Key numbers from input appear in output |
## Config format
```json
{
"name": "My AI Feature Eval",
"type": "chat",
"endpoint": "https://my-api.com/chat",
"method": "POST",
"headers": { "Content-Type": "application/json" },
"request_template": {
"message": "{{message}}",
"history": "{{history}}"
},
"response_field": "response",
"threshold": 80,
"scenarios": [
{
"name": "basic_greeting",
"messages": ["Hello"],
"criteria": [
{ "type": "contains", "value": "help", "description": "Offers help" }
]
}
]
}
```
See `eval.config.example.json` for a full example.
## Regression detection
Save a baseline after your first passing run:
```bash
npx tsx .claude/skills/eval/run-eval.ts --baseline
```
Future runs automatically compare against the baseline and flag:
- Score drops
- Individual scenarios that got worse
- New failures that didn't exist before
## When to run evals
- After every prompt or model change
- Before deploying to production
- Weekly to catch quality drift from content/data changes
- When onboarding a new AI feature
## Adding custom criteria
Edit `run-eval.ts` and add a case to the `evaluateCriterion` function:
```typescript
case 'my_custom_check':
// Your logic here
return someCondition;
```
Then use it in your config:
```json
{ "type": "my_custom_check", "value": "whatever", "description": "My check" }
```
## Philosophy
> "Don't ship prompts without evals. It's the AI equivalent of shipping code without tests."
Manual testing is important for tone and feel. But automated evals catch regressions, enforce quality standards, and give you a score to track over time. Use both.

View file

@ -0,0 +1,85 @@
{
"name": "Customer Support Bot Eval",
"type": "chat",
"endpoint": "https://your-api.com/api/chat",
"method": "POST",
"headers": {
"Content-Type": "application/json"
},
"request_template": {
"tenant": "{{tenant}}",
"message": "{{message}}",
"history": "{{history}}"
},
"response_field": "response",
"session_id_field": "session_id",
"defaults": {
"tenant": "default"
},
"threshold": 80,
"good_behaviors": [
"Greets the user",
"Asks clarifying questions",
"Provides specific answers with numbers",
"Offers to help further"
],
"bad_behaviors": [
"Says 'I don't know' about core product features",
"Makes up information",
"Gives responses longer than 3 sentences"
],
"scenarios": [
{
"name": "greeting",
"messages": ["Hello"],
"criteria": [
{ "type": "contains", "value": "help", "description": "Should offer help" },
{ "type": "max_sentences", "value": 3, "description": "Concise greeting" }
]
},
{
"name": "pricing_question",
"messages": [
"Hi there",
"How much does your product cost?"
],
"criteria": [
{ "type": "regex", "value": "\\$\\d+", "description": "Mentions a price" },
{ "type": "not_contains", "value": "I don't know", "description": "Knows pricing" },
{ "type": "max_sentences", "value": 4, "description": "Concise answer" }
]
},
{
"name": "unqualified_prospect",
"messages": [
"Hi, I'm a student doing research on your industry",
"I don't have any budget"
],
"criteria": [
{ "type": "not_contains", "value": "book a call", "description": "No booking push" },
{ "type": "contains", "value": "free", "description": "Suggests free resources" }
]
},
{
"name": "feature_question",
"messages": [
"Does your product integrate with Salesforce?"
],
"criteria": [
{ "type": "not_contains", "value": "I'm not sure", "description": "Knows integrations" },
{ "type": "max_sentences", "value": 3, "description": "Concise answer" }
]
},
{
"name": "hostile_prospect",
"messages": [
"Your product looks terrible",
"Why would anyone pay for this?"
],
"criteria": [
{ "type": "not_contains", "value": "sorry", "description": "Doesn't over-apologize" },
{ "type": "max_sentences", "value": 3, "description": "Stays concise under pressure" }
]
}
]
}

500
eval/run-eval.ts Normal file
View file

@ -0,0 +1,500 @@
#!/usr/bin/env npx tsx
/**
* Universal AI Eval Runner
*
* Evaluates any AI-powered API against test scenarios with pass/fail criteria.
* Works with chat agents (multi-turn), content generators, classifiers, and summarizers.
*
* Usage:
* npx tsx .claude/skills/eval/run-eval.ts
* npx tsx .claude/skills/eval/run-eval.ts --config path/to/config.json
* npx tsx .claude/skills/eval/run-eval.ts --verbose
* npx tsx .claude/skills/eval/run-eval.ts --baseline
*/
import { readFileSync, writeFileSync, existsSync } from 'fs';
import { resolve } from 'path';
// --- Types ---
interface Criterion {
type: string;
value?: string | number;
field?: string;
description: string;
}
interface Scenario {
name: string;
messages?: string[]; // For chat type (multi-turn)
input?: string | object; // For single-turn types
criteria: Criterion[];
tenant?: string;
}
interface EvalConfig {
name: string;
type: 'chat' | 'content' | 'classifier' | 'summarizer' | 'custom';
endpoint: string;
method?: string;
headers?: Record<string, string>;
request_template: Record<string, any>;
response_field: string;
session_id_field?: string;
defaults?: Record<string, any>;
threshold: number;
good_behaviors?: string[];
bad_behaviors?: string[];
scenarios: Scenario[];
}
interface ScenarioResult {
name: string;
criteria_results: { criterion: Criterion; passed: boolean; actual?: string }[];
passed: number;
total: number;
conversation?: { role: string; text: string }[];
response_times?: number[];
error?: string;
}
// --- Args ---
const args = process.argv.slice(2);
const configPath = args.includes('--config')
? args[args.indexOf('--config') + 1]
: 'eval.config.json';
const verbose = args.includes('--verbose');
const saveBaseline = args.includes('--baseline');
// --- Load config ---
const fullConfigPath = resolve(process.cwd(), configPath);
if (!existsSync(fullConfigPath)) {
console.error(`Config not found: ${fullConfigPath}`);
console.error('Create one with /eval or copy from .claude/skills/eval/eval.config.example.json');
process.exit(1);
}
const config: EvalConfig = JSON.parse(readFileSync(fullConfigPath, 'utf-8'));
console.log(`\n${'='.join ? '=' : '='}${'='.repeat(79)}`);
console.log(`EVAL: ${config.name}`);
console.log(`Type: ${config.type} | Scenarios: ${config.scenarios.length} | Threshold: ${config.threshold}%`);
console.log(`Endpoint: ${config.endpoint}`);
console.log('='.repeat(80));
// --- Helpers ---
function sleep(ms: number) {
return new Promise((r) => setTimeout(r, ms));
}
function countSentences(text: string): number {
return text.split(/[.!?]+/).filter((s) => s.trim().length > 0).length;
}
function interpolate(template: any, vars: Record<string, any>): any {
if (typeof template === 'string') {
return template.replace(/\{\{(\w+)\}\}/g, (_, key) => {
const val = vars[key];
return val !== undefined ? (typeof val === 'string' ? val : JSON.stringify(val)) : '';
});
}
if (Array.isArray(template)) {
return template.map((item) => interpolate(item, vars));
}
if (typeof template === 'object' && template !== null) {
const result: Record<string, any> = {};
for (const [k, v] of Object.entries(template)) {
result[k] = interpolate(v, vars);
}
return result;
}
return template;
}
function evaluateCriterion(
criterion: Criterion,
fullOutput: string,
responseTimeMs?: number,
input?: string
): boolean {
const lower = fullOutput.toLowerCase();
const val = criterion.value;
switch (criterion.type) {
case 'contains':
return lower.includes(String(val).toLowerCase());
case 'not_contains':
return !lower.includes(String(val).toLowerCase());
case 'regex':
return new RegExp(String(val), 'i').test(fullOutput);
case 'max_length':
return fullOutput.length <= Number(val);
case 'min_length':
return fullOutput.length >= Number(val);
case 'max_sentences':
return countSentences(fullOutput) <= Number(val);
case 'response_time':
return (responseTimeMs || 0) <= Number(val);
case 'json_valid':
try { JSON.parse(fullOutput); return true; } catch { return false; }
case 'json_field_equals':
try {
const parsed = JSON.parse(fullOutput);
return parsed[criterion.field || ''] === val;
} catch { return false; }
case 'no_hallucination':
// Basic check: output shouldn't contain proper nouns not in the input
// This is a heuristic, not perfect
if (!input) return true;
const outputNames = fullOutput.match(/[A-Z][a-z]+(?:\s[A-Z][a-z]+)*/g) || [];
const inputLower = input.toLowerCase();
const hallucinated = outputNames.filter(
(name) => !inputLower.includes(name.toLowerCase()) && name.length > 3
);
return hallucinated.length <= 2; // Allow some tolerance
case 'preserves_names':
if (!input) return true;
const names = input.match(/[A-Z][a-z]+(?:\s[A-Z][a-z]+)+/g) || [];
return names.length === 0 || names.some((n) => fullOutput.includes(n));
case 'preserves_numbers':
if (!input) return true;
const numbers = input.match(/\d[\d,]+(?:\.\d+)?/g) || [];
if (numbers.length === 0) return true;
return numbers.some((n) => fullOutput.includes(n));
default:
console.warn(`Unknown criterion type: ${criterion.type}`);
return true;
}
}
// --- Run scenarios ---
async function runChatScenario(scenario: Scenario): Promise<ScenarioResult> {
const messages = scenario.messages || [];
const conversation: { role: string; text: string }[] = [];
const responseTimes: number[] = [];
let sessionId: string | null = null;
const history: { role: string; text: string }[] = [];
for (const userMessage of messages) {
const vars: Record<string, any> = {
message: userMessage,
history: history,
session_id: sessionId || '',
...(config.defaults || {}),
...(scenario.tenant ? { tenant: scenario.tenant } : {}),
};
const body = interpolate(config.request_template, vars);
// Ensure history is passed as array, not string
if (body.history === '[]' || body.history === '') {
body.history = history.map((h) => ({ role: h.role, text: h.text }));
}
if (sessionId && config.session_id_field) {
body[config.session_id_field] = sessionId;
}
const start = Date.now();
try {
const res = await fetch(config.endpoint, {
method: config.method || 'POST',
headers: { 'Content-Type': 'application/json', ...(config.headers || {}) },
body: JSON.stringify(body),
});
const elapsed = Date.now() - start;
responseTimes.push(elapsed);
if (!res.ok) {
return {
name: scenario.name,
criteria_results: [],
passed: 0,
total: scenario.criteria.length,
error: `API returned ${res.status}`,
};
}
const data = await res.json();
const responseText = data[config.response_field] || '';
if (config.session_id_field && data[config.session_id_field]) {
sessionId = data[config.session_id_field];
}
conversation.push({ role: 'user', text: userMessage });
conversation.push({ role: 'agent', text: responseText });
history.push({ role: 'user', text: userMessage });
history.push({ role: 'model', text: responseText });
} catch (err: any) {
return {
name: scenario.name,
criteria_results: [],
passed: 0,
total: scenario.criteria.length,
error: err.message,
};
}
await sleep(1000);
}
// Evaluate criteria against full conversation
const fullOutput = conversation
.filter((c) => c.role === 'agent')
.map((c) => c.text)
.join('\n');
const avgResponseTime =
responseTimes.length > 0
? responseTimes.reduce((a, b) => a + b, 0) / responseTimes.length
: undefined;
const criteriaResults = scenario.criteria.map((criterion) => ({
criterion,
passed: evaluateCriterion(criterion, fullOutput, avgResponseTime),
}));
return {
name: scenario.name,
criteria_results: criteriaResults,
passed: criteriaResults.filter((r) => r.passed).length,
total: criteriaResults.length,
conversation,
response_times: responseTimes,
};
}
async function runSingleTurnScenario(scenario: Scenario): Promise<ScenarioResult> {
const input = typeof scenario.input === 'string' ? scenario.input : JSON.stringify(scenario.input);
const vars: Record<string, any> = {
input,
message: input,
...(config.defaults || {}),
...(scenario.tenant ? { tenant: scenario.tenant } : {}),
};
const body = interpolate(config.request_template, vars);
const start = Date.now();
try {
const res = await fetch(config.endpoint, {
method: config.method || 'POST',
headers: { 'Content-Type': 'application/json', ...(config.headers || {}) },
body: JSON.stringify(body),
});
const elapsed = Date.now() - start;
if (!res.ok) {
return {
name: scenario.name,
criteria_results: [],
passed: 0,
total: scenario.criteria.length,
error: `API returned ${res.status}`,
};
}
const data = await res.json();
const responseText =
typeof data === 'string' ? data : data[config.response_field] || JSON.stringify(data);
const criteriaResults = scenario.criteria.map((criterion) => ({
criterion,
passed: evaluateCriterion(criterion, responseText, elapsed, input),
}));
return {
name: scenario.name,
criteria_results: criteriaResults,
passed: criteriaResults.filter((r) => r.passed).length,
total: criteriaResults.length,
conversation: [
{ role: 'input', text: input },
{ role: 'output', text: responseText },
],
response_times: [elapsed],
};
} catch (err: any) {
return {
name: scenario.name,
criteria_results: [],
passed: 0,
total: scenario.criteria.length,
error: err.message,
};
}
}
// --- Main ---
async function main() {
const results: ScenarioResult[] = [];
for (let i = 0; i < config.scenarios.length; i++) {
const scenario = config.scenarios[i];
process.stdout.write(`[${i + 1}/${config.scenarios.length}] Running: ${scenario.name}... `);
let result: ScenarioResult;
if (config.type === 'chat') {
result = await runChatScenario(scenario);
} else {
result = await runSingleTurnScenario(scenario);
}
results.push(result);
if (result.error) {
console.log(`ERROR: ${result.error}`);
} else {
const fails = result.total - result.passed;
console.log(
`${result.passed}/${result.total}${fails > 0 ? ` (${fails} FAIL)` : ' (PASS)'}`
);
}
if (verbose && result.conversation) {
for (const msg of result.conversation) {
const label = msg.role === 'user' || msg.role === 'input' ? 'USER' : 'AGENT';
console.log(` ${label}: ${msg.text.slice(0, 200)}${msg.text.length > 200 ? '...' : ''}`);
}
console.log('');
}
}
// --- Summary table ---
console.log('\n' + '='.repeat(80));
console.log('EVAL RESULTS');
console.log('='.repeat(80));
// Collect all unique criterion descriptions
const allCriteria = new Set<string>();
for (const r of results) {
for (const cr of r.criteria_results) {
allCriteria.add(cr.criterion.description.slice(0, 12));
}
}
const criteriaHeaders = Array.from(allCriteria);
// Header
const nameWidth = 30;
const colWidth = 7;
let header = 'Scenario'.padEnd(nameWidth);
for (const ch of criteriaHeaders) {
header += ch.padEnd(colWidth);
}
header += 'Score';
console.log(header);
console.log('-'.repeat(header.length));
// Rows
let totalPassed = 0;
let totalCriteria = 0;
for (const r of results) {
let row = r.name.slice(0, nameWidth - 2).padEnd(nameWidth);
for (const ch of criteriaHeaders) {
const cr = r.criteria_results.find(
(c) => c.criterion.description.slice(0, 12) === ch
);
if (!cr) {
row += '-'.padEnd(colWidth);
} else {
row += (cr.passed ? 'OK' : 'FAIL').padEnd(colWidth);
}
}
row += `${r.passed}/${r.total}`;
console.log(row);
totalPassed += r.passed;
totalCriteria += r.total;
}
console.log('-'.repeat(header.length));
const pct = totalCriteria > 0 ? ((totalPassed / totalCriteria) * 100).toFixed(1) : '0';
console.log(`TOTAL: ${totalPassed}/${totalCriteria} criteria passed (${pct}%)\n`);
// --- Baseline comparison ---
const resultsPath = resolve(process.cwd(), 'eval-results.json');
const baselinePath = resolve(process.cwd(), 'eval-baseline.json');
if (existsSync(baselinePath) && !saveBaseline) {
try {
const baseline = JSON.parse(readFileSync(baselinePath, 'utf-8'));
const baselinePct = baseline.score_pct || 0;
const delta = parseFloat(pct) - baselinePct;
console.log(`Baseline: ${baselinePct}% | Current: ${pct}% | Delta: ${delta > 0 ? '+' : ''}${delta.toFixed(1)}%`);
if (delta < 0) {
console.log('WARNING: Score DROPPED from baseline. Check for regressions.');
// Find scenarios that got worse
for (const r of results) {
const baseScenario = baseline.results?.find((b: any) => b.name === r.name);
if (baseScenario && r.passed < baseScenario.passed) {
console.log(` REGRESSION: ${r.name} (was ${baseScenario.passed}/${baseScenario.total}, now ${r.passed}/${r.total})`);
}
}
}
console.log('');
} catch {
// Baseline unreadable, skip
}
}
// --- Save results ---
const output = {
name: config.name,
timestamp: new Date().toISOString(),
score_pct: parseFloat(pct),
total_passed: totalPassed,
total_criteria: totalCriteria,
results: results.map((r) => ({
name: r.name,
passed: r.passed,
total: r.total,
error: r.error,
criteria: r.criteria_results.map((cr) => ({
description: cr.criterion.description,
type: cr.criterion.type,
passed: cr.passed,
})),
conversation: r.conversation,
})),
};
writeFileSync(resultsPath, JSON.stringify(output, null, 2));
console.log(`Results written to: ${resultsPath}`);
if (saveBaseline) {
writeFileSync(baselinePath, JSON.stringify(output, null, 2));
console.log(`Baseline saved to: ${baselinePath}`);
}
// --- Threshold check ---
if (parseFloat(pct) < config.threshold) {
console.log(`\nFAIL: Score ${pct}% is below threshold ${config.threshold}%`);
process.exit(1);
} else {
console.log(`\nPASS: Score ${pct}% meets threshold ${config.threshold}%`);
}
}
main().catch((err) => {
console.error('Eval runner failed:', err);
process.exit(1);
});