From 10c57c0f7a0efabd87985b6b0820891dd58c5901 Mon Sep 17 00:00:00 2001
From: Jiayuan Zhang
Date: Sun, 15 Feb 2026 18:30:58 +0800
Subject: [PATCH] docs: add SWE-bench runner guide

Covers the full pipeline: dataset download, agent execution, result
analysis, and official Docker evaluation. Includes runner options,
output format, known limitations, and initial benchmark results.

Co-Authored-By: Claude Opus 4.6
---
 docs/swe-bench.md | 253 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 253 insertions(+)
 create mode 100644 docs/swe-bench.md

diff --git a/docs/swe-bench.md b/docs/swe-bench.md
new file mode 100644
index 00000000..51d89732
--- /dev/null
+++ b/docs/swe-bench.md
@@ -0,0 +1,253 @@
+# SWE-bench: Agent Coding Benchmark
+
+Run and evaluate the Multica agent against [SWE-bench](https://www.swebench.com/), the standard benchmark for AI coding agents. SWE-bench tasks are real GitHub issues from open-source Python projects; the agent must read the issue, explore the codebase, and produce a patch that fixes the bug.
+
+## Quick Start
+
+```bash
+# 1. Download dataset (requires: pip install datasets)
+python scripts/swe-bench/download-dataset.py --dataset lite --limit 5
+
+# 2. Run the agent
+npx tsx scripts/swe-bench/run.ts --limit 5
+
+# 3. Analyze results
+npx tsx scripts/swe-bench/analyze.ts
+```
+
+## Scripts
+
+```
+scripts/swe-bench/
+├── download-dataset.py   # Download from HuggingFace → JSONL
+├── run.ts                # Core runner: Agent API → git diff → predictions
+├── evaluate.sh           # Official Docker evaluation harness wrapper
+├── analyze.ts            # Summarize run results
+└── .gitignore            # Ignores downloaded datasets and output files
+```
+
+## Pipeline
+
+```
+                                    ┌──────────────────┐
+  HuggingFace ──download──► JSONL ──┤  For each task:  │
+                                    │  1. git clone    │
+                                    │  2. git checkout │
+                                    │  3. Agent.run()  │
+                                    │  4. git diff     │
+                                    └────────┬─────────┘
+                                             │
+                           predictions.jsonl (SWE-bench format)
+                                             │
+                             ┌───────────────┴───────────────┐
+                             │  swebench.harness (Docker)    │
+                             │  Apply patch → run tests      │
+                             │  → pass/fail verdict          │
+                             └───────────────────────────────┘
+```
+
+## Dataset Variants
+
+| Variant | Size | HuggingFace ID | Recommended For |
+|---------|------|----------------|-----------------|
+| **Lite** | 300 tasks | `princeton-nlp/SWE-bench_Lite` | Quick iteration, development |
+| **Verified** | 500 tasks | `princeton-nlp/SWE-bench_Verified` | Official benchmarking, leaderboard |
+| **Full** | ~2294 tasks | `princeton-nlp/SWE-bench` | Comprehensive evaluation |
+
+```bash
+# Download specific variant
+python scripts/swe-bench/download-dataset.py --dataset verified
+python scripts/swe-bench/download-dataset.py --dataset lite --limit 20
+```
+
+## Runner Options
+
+```bash
+npx tsx scripts/swe-bench/run.ts [options]
+
+Options:
+  --dataset PATH     JSONL dataset path (default: scripts/swe-bench/lite.jsonl)
+  --provider NAME    LLM provider (default: kimi-coding)
+  --model NAME       Model override
+  --limit N          Max tasks to run (default: all)
+  --offset N         Skip first N tasks (default: 0)
+  --output PATH      Output predictions JSONL (default: scripts/swe-bench/predictions.jsonl)
+  --workdir PATH     Repo clone directory (default: /tmp/swe-bench)
+  --timeout MS       Per-task timeout in milliseconds (default: 300000 = 5 min)
+  --instance ID      Run a single instance
+  --debug            Enable debug logging
+```
+
+### Examples
+
+```bash
+# Run 10 tasks with Anthropic Claude
+npx tsx scripts/swe-bench/run.ts --limit 10 --provider anthropic
+
+# Run a specific instance
+npx tsx scripts/swe-bench/run.ts --instance "django__django-16379"
+
+# Resume from task 50 with longer timeout
+npx tsx scripts/swe-bench/run.ts --offset 50 --limit 10 --timeout 600000
+
+# Compare providers (run separately, different output files)
+npx tsx scripts/swe-bench/run.ts --provider kimi-coding --output scripts/swe-bench/pred-kimi.jsonl
+npx tsx scripts/swe-bench/run.ts --provider anthropic --output scripts/swe-bench/pred-claude.jsonl
+```
+
+## How the Agent Solves Tasks
+
+For each task, the runner:
+
+1. **Clones the repository** to `/tmp/swe-bench//` and checks out `base_commit`
+2. **Creates an Agent** with a focused system prompt and restricted tools (coding only: no web, no cron, no sessions)
+3. **Runs the agent** with the issue description as the prompt
+4. **Collects `git diff`** as the patch after the agent finishes
+5. **Appends** the prediction to `predictions.jsonl` in SWE-bench format
+
+The agent has access to:
+- `read`, `write`, `edit`: file operations
+- `exec`, `process`: shell commands (for exploring code, running tests)
+- `glob`: file search
+
+Tools explicitly denied: `web_fetch`, `web_search`, `cron`, `data`, `sessions_spawn`, `sessions_list`, `memory_search`, `send_file`.
+
+## Output Files
+
+After a run, two files are produced:
+
+### `predictions.jsonl`: SWE-bench format
+
+```json
+{"instance_id": "astropy__astropy-12907", "model_patch": "diff --git a/...", "model_name_or_path": "multica-kimi-coding"}
+```
+
+This file is the input to the official evaluation harness.
+
+### `predictions.results.jsonl`: detailed run metrics
+
+```json
+{
+  "instance_id": "astropy__astropy-12907",
+  "success": true,
+  "patch": "diff --git a/...",
+  "error": null,
+  "duration_ms": 141892,
+  "session_id": "019c60c7-52ac-702a-9b9c-dc53c0daea6b"
+}
+```
+
+## Analyzing Results
+
+```bash
+# Summary report
+npx tsx scripts/swe-bench/analyze.ts
+
+# Or specify a results file
+npx tsx scripts/swe-bench/analyze.ts scripts/swe-bench/pred-kimi.results.jsonl
+```
+
+Output includes:
+- Patch rate (how many tasks produced a diff)
+- Duration statistics (avg/min/max)
+- Error breakdown
+- Per-repository stats
+- Slowest tasks
+
+### Run-Log Analysis
+
+Each agent session writes a structured `run-log.jsonl` to `~/.super-multica/sessions//`. This file captures every LLM call, tool invocation, and timing:
+
+```bash
+# Find a session's run log
+cat ~/.super-multica/sessions//run-log.jsonl | head -5
+
+# Quick stats from a run log
+cat ~/.super-multica/sessions//run-log.jsonl | python3 -c "
+import json, sys
+events = [json.loads(l) for l in sys.stdin if l.strip()]
+tools = [e for e in events if e['event'] == 'tool_start']
+llm_ms = sum(e.get('duration_ms', 0) for e in events if e['event'] == 'llm_result')
+print(f'LLM time: {llm_ms/1000:.1f}s | Tool calls: {len(tools)}')
+"
+```
+
+## Official Evaluation (Docker)
+
+The runner produces patches, but **only the official SWE-bench harness determines pass/fail** by applying the patch and running the project's test suite.
+
+### Prerequisites
+
+- Docker running (at least 120 GB storage, 16 GB RAM, 8 CPU cores)
+- `pip install swebench`
+
+### Run Evaluation
+
+```bash
+# Using the wrapper script
+bash scripts/swe-bench/evaluate.sh
+
+# Or directly
+python -m swebench.harness.run_evaluation \
+    --dataset_name princeton-nlp/SWE-bench_Lite \
+    --predictions_path scripts/swe-bench/predictions.jsonl \
+    --max_workers 4 \
+    --run_id multica
+```
+
+Results are written to `logs/` and `evaluation_results/`.
+
+## Known Limitations and Improvements
+
+### Current Limitations
+
+1. **No Docker isolation for agent execution**: The agent runs on the host, so `pip install` and other commands affect the system Python. Standard SWE-bench practice is to run each task in its own Docker container.
+
+2. **`SMC_DATA_DIR` timing**: Setting `SMC_DATA_DIR` after the process has started has no effect, because `DATA_DIR` is resolved at module import time. Sessions therefore write to `~/.super-multica/sessions/`. To isolate them, set the environment variable before the process starts:
+   ```bash
+   SMC_DATA_DIR=~/.swe-bench-eval npx tsx scripts/swe-bench/run.ts --limit 5
+   ```
+
+3. **Sequential execution**: Tasks run one at a time. For large-scale runs, launch multiple processes with `--offset`/`--limit` to parallelize:
+   ```bash
+   # Run 4 workers in parallel (4 x 75 = 300 tasks, the size of Lite)
+   npx tsx scripts/swe-bench/run.ts --offset 0 --limit 75 --output pred-0.jsonl &
+   npx tsx scripts/swe-bench/run.ts --offset 75 --limit 75 --output pred-1.jsonl &
+   npx tsx scripts/swe-bench/run.ts --offset 150 --limit 75 --output pred-2.jsonl &
+   npx tsx scripts/swe-bench/run.ts --offset 225 --limit 75 --output pred-3.jsonl &
+   wait
+   cat pred-[0-3].jsonl > predictions.jsonl  # pred-*.jsonl would also match any pred-N.results.jsonl files
+   ```
+
+4. **Repo cloning per instance**: Each instance clones the full repo. For repos with many tasks (e.g., astropy, django), a shared clone with `git worktree` would be faster.
+
+### Potential Improvements
+
+- **Docker-per-task**: Run each agent in a Docker container matching the SWE-bench environment spec (correct Python version, pre-installed dependencies)
+- **Shared repo pool**: Clone each unique repo once, use `git worktree` for per-task isolation
+- **Cost tracking**: Parse run-log token counts for per-task and aggregate cost estimates
+- **Multi-turn retries**: If the agent produces no patch, retry with feedback
+- **System prompt tuning**: The current prompt is minimal; more detailed guidance (e.g., "search for related test files to understand expected behavior") could improve solve rate
+
+## Related Benchmarks
+
+| Benchmark | Focus | Notes |
+|-----------|-------|-------|
+| [SWE-bench Verified](https://openai.com/index/introducing-swe-bench-verified/) | Bug fixing (Python) | Gold standard, 500 human-verified tasks |
+| [SWE-bench Multilingual](https://github.com/SWE-bench/SWE-bench) | Bug fixing (7 languages) | Java, TS, JS, Go, Rust, C, C++ |
+| [Terminal-Bench](https://www.swebench.com/) | CLI workflows | Multi-step sandboxed terminal tasks |
+| [Aider Polyglot](https://aider.chat/docs/leaderboards/) | Code editing | 225 Exercism exercises, 6 languages |
+| [DPAI Arena](https://www.jetbrains.com/) | Full dev workflow | JetBrains: patch, test, review, analysis |
+| [HumanEval](https://github.com/openai/human-eval) | Function generation | 164 Python function tasks, largely saturated |
+
+## Initial Results (kimi-coding, 3 tasks)
+
+First run on 3 SWE-bench Lite tasks (all astropy):
+
+| Task | Status | Duration | LLM Time | Tools | Fix |
+|------|--------|----------|----------|-------|-----|
+| `astropy__astropy-12907` | PATCHED | 141.9s | 125.1s | 30 | `_cstack`: `= 1` → `= right` |
+| `astropy__astropy-14182` | PATCHED | 192.0s | 166.9s | 56 | Added `header_rows` param to RST writer |
+| `astropy__astropy-14365` | PATCHED | 65.7s | 49.6s | 23 | `re.compile()` + `re.IGNORECASE` |
+
+3/3 tasks produced patches. Formal evaluation pending (requires Docker harness).
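
The patch rate quoted in the guide (tasks that produced a non-empty diff) can be recomputed directly from a `predictions.results.jsonl` file. The following is a minimal Python sketch, not the logic of `analyze.ts`: it assumes each line is a JSON object with the `instance_id`, `patch`, and `duration_ms` fields shown in the guide's example record, and that instance IDs end in `-<number>` as in the examples. The function name `summarize_results` is hypothetical.

```python
import json
from collections import defaultdict


def summarize_results(lines):
    """Summarize run records shaped like lines of predictions.results.jsonl.

    `lines` is any iterable of strings (an open file works); blank lines are skipped.
    """
    records = [json.loads(line) for line in lines if line.strip()]
    # A task counts as "patched" when its patch field is a non-empty string.
    patched = [r for r in records if r.get("patch")]
    durations = [r["duration_ms"] for r in records if "duration_ms" in r]

    # Assumption: instance IDs look like "<owner>__<repo>-<number>", so dropping
    # the final "-<number>" gives a per-repository key.
    per_repo = defaultdict(lambda: {"total": 0, "patched": 0})
    for r in records:
        repo = r["instance_id"].rsplit("-", 1)[0]
        per_repo[repo]["total"] += 1
        if r.get("patch"):
            per_repo[repo]["patched"] += 1

    return {
        "total": len(records),
        "patched": len(patched),
        "patch_rate": len(patched) / len(records) if records else 0.0,
        "avg_duration_s": sum(durations) / len(durations) / 1000 if durations else 0.0,
        "per_repo": dict(per_repo),
    }
```

Passing an open file handle for a results file to `summarize_results` returns a dictionary with the totals and a per-repository breakdown. A patch rate computed this way only says that a diff was produced; pass/fail still comes from the official Docker harness.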