From 45acb965baea890bcdf20abdeb96f497c4ff376a Mon Sep 17 00:00:00 2001
From: Jiayuan Zhang
Date: Sun, 15 Feb 2026 18:32:04 +0800
Subject: [PATCH] docs: add SWE-bench section to CLAUDE.md

Co-Authored-By: Claude Opus 4.6
---
 CLAUDE.md | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/CLAUDE.md b/CLAUDE.md
index a9112e6e..442df2e8 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -190,6 +190,26 @@ Logged events: `run_start`, `run_end`, `llm_call`, `llm_result`, `tool_start`, `
 
 Each line is a JSON object with `ts` (timestamp) and `event` (type), suitable for AI-assisted log analysis. Full event reference: `packages/core/src/agent/run-log.ts`.
 
+## SWE-bench (Agent Benchmark)
+
+Run the Multica agent against [SWE-bench](https://www.swebench.com/), the standard benchmark for evaluating AI coding agents on real GitHub issues.
+
+```bash
+# Download dataset
+python scripts/swe-bench/download-dataset.py --dataset lite --limit 5
+
+# Run agent against tasks
+npx tsx scripts/swe-bench/run.ts --limit 5 --provider kimi-coding
+
+# Analyze results
+npx tsx scripts/swe-bench/analyze.ts
+
+# Official evaluation (requires Docker)
+bash scripts/swe-bench/evaluate.sh
+```
+
+Scripts are in `scripts/swe-bench/`. Full guide: `docs/swe-bench.md`.
+
 ## E2E Testing (Agent-Driven)
 
 E2E tests are executed and analyzed by the Coding Agent (Claude Code), not by vitest. The Coding Agent runs the Multica agent via CLI, reads the structured run-log, and intelligently analyzes intermediate behavior and results.