docs: add SWE-bench runner guide

Covers the full pipeline: dataset download, agent execution,
result analysis, and official Docker evaluation. Includes
runner options, output format, known limitations, and initial
benchmark results.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Jiayuan Zhang 2026-02-15 18:30:58 +08:00
parent 90d374ffd5
commit 10c57c0f7a

docs/swe-bench.md

@@ -0,0 +1,253 @@
# SWE-bench: Agent Coding Benchmark
Run and evaluate the Multica agent against [SWE-bench](https://www.swebench.com/), the standard benchmark for AI coding agents. SWE-bench tasks are real GitHub issues from open-source Python projects — the agent must read the issue, explore the codebase, and produce a patch that fixes the bug.
## Quick Start
```bash
# 1. Download dataset (requires: pip install datasets)
python scripts/swe-bench/download-dataset.py --dataset lite --limit 5
# 2. Run the agent
npx tsx scripts/swe-bench/run.ts --limit 5
# 3. Analyze results
npx tsx scripts/swe-bench/analyze.ts
```
## Scripts
```
scripts/swe-bench/
├── download-dataset.py # Download from HuggingFace → JSONL
├── run.ts # Core runner: Agent API → git diff → predictions
├── evaluate.sh # Official Docker evaluation harness wrapper
├── analyze.ts # Summarize run results
└── .gitignore # Ignores downloaded datasets and output files
```
## Pipeline
```
                                  ┌──────────────────┐
HuggingFace ──download──► JSONL ──┤ For each task:   │
                                  │ 1. git clone     │
                                  │ 2. git checkout  │
                                  │ 3. Agent.run()   │
                                  │ 4. git diff      │
                                  └────────┬─────────┘
                         predictions.jsonl (SWE-bench format)
                           ┌───────────────┴───────────────┐
                           │ swebench.harness (Docker)     │
                           │ Apply patch → run tests       │
                           │ → pass/fail verdict           │
                           └───────────────────────────────┘
```
## Dataset Variants
| Variant | Size | HuggingFace ID | Recommended For |
|---------|------|----------------|-----------------|
| **Lite** | 300 tasks | `princeton-nlp/SWE-bench_Lite` | Quick iteration, development |
| **Verified** | 500 tasks | `princeton-nlp/SWE-bench_Verified` | Official benchmarking, leaderboard |
| **Full** | 2,294 tasks | `princeton-nlp/SWE-bench` | Comprehensive evaluation |
```bash
# Download specific variant
python scripts/swe-bench/download-dataset.py --dataset verified
python scripts/swe-bench/download-dataset.py --dataset lite --limit 20
```
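A minimal sketch of what a download step like this might look like, assuming the HuggingFace `datasets` package and the `test` split. The actual `download-dataset.py` may differ; the `DATASETS` mapping and the `write_jsonl` and `download` helpers are hypothetical names used only for illustration.

```python
import json
from itertools import islice
from pathlib import Path

# Variant name -> HuggingFace dataset ID (taken from the table above).
DATASETS = {
    "lite": "princeton-nlp/SWE-bench_Lite",
    "verified": "princeton-nlp/SWE-bench_Verified",
    "full": "princeton-nlp/SWE-bench",
}


def write_jsonl(rows, path, limit=None):
    """Write an iterable of dict rows to `path`, one JSON object per line."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with path.open("w", encoding="utf-8") as f:
        # islice(rows, None) yields every row, so limit=None means "all".
        for row in islice(rows, limit):
            f.write(json.dumps(row) + "\n")
            count += 1
    return count


def download(variant="lite", out_dir="scripts/swe-bench", limit=None):
    """Hypothetical helper: fetch one variant and save it as <variant>.jsonl."""
    # Imported lazily so write_jsonl works without `datasets` installed.
    from datasets import load_dataset

    # Assumption: tasks are read from the "test" split.
    rows = load_dataset(DATASETS[variant], split="test")
    return write_jsonl(rows, Path(out_dir) / f"{variant}.jsonl", limit)
```

Under these assumptions, `download("lite", limit=5)` would write the first five Lite tasks to `scripts/swe-bench/lite.jsonl`, which matches the runner's default `--dataset` path.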
## Runner Options
```bash
npx tsx scripts/swe-bench/run.ts [options]
Options:
--dataset PATH JSONL dataset path (default: scripts/swe-bench/lite.jsonl)
--provider NAME LLM provider (default: kimi-coding)
--model NAME Model override
--limit N Max tasks to run (default: all)
--offset N Skip first N tasks (default: 0)
--output PATH Output predictions JSONL (default: scripts/swe-bench/predictions.jsonl)
--workdir PATH Repo clone directory (default: /tmp/swe-bench)
--timeout MS Per-task timeout (default: 300000 = 5min)
--instance ID Run a single instance
--debug Enable debug logging
```
### Examples
```bash
# Run 10 tasks with Anthropic Claude
npx tsx scripts/swe-bench/run.ts --limit 10 --provider anthropic
# Run a specific instance
npx tsx scripts/swe-bench/run.ts --instance "django__django-16379"
# Resume from task 50 with longer timeout
npx tsx scripts/swe-bench/run.ts --offset 50 --limit 10 --timeout 600000
# Compare providers (run separately, different output files)
npx tsx scripts/swe-bench/run.ts --provider kimi-coding --output scripts/swe-bench/pred-kimi.jsonl
npx tsx scripts/swe-bench/run.ts --provider anthropic --output scripts/swe-bench/pred-claude.jsonl
```
## How the Agent Solves Tasks
For each task, the runner:
1. **Clones the repository** to `/tmp/swe-bench/<instance_id>/` and checks out `base_commit`
2. **Creates an Agent** with a focused system prompt and restricted tools (coding only — no web, no cron, no sessions)
3. **Runs the agent** with the issue description as the prompt
4. **Collects `git diff`** as the patch after the agent finishes
5. **Appends** the prediction to `predictions.jsonl` in SWE-bench format
The agent has access to:
- `read`, `write`, `edit` — file operations
- `exec`, `process` — shell commands (for exploring code, running tests)
- `glob` — file search
Tools explicitly denied: `web_fetch`, `web_search`, `cron`, `data`, `sessions_spawn`, `sessions_list`, `memory_search`, `send_file`.
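The per-task steps in this section can be sketched in Python as follows. This is an illustration under stated assumptions, not the code in `run.ts`: `run_agent` is a hypothetical callable standing in for the Multica Agent API, and `run_task` and its `repo_url` parameter are made-up names. The `repo`, `base_commit`, and `problem_statement` keys are the standard SWE-bench dataset fields.

```python
import json
import subprocess
from pathlib import Path


def git(*args, cwd):
    """Run a git command in `cwd` and return its stdout; raise on failure."""
    result = subprocess.run(
        ["git", *args], cwd=cwd, check=True, capture_output=True, text=True
    )
    return result.stdout


def run_task(task, run_agent, workdir, output, model_name, repo_url=None):
    """Clone, check out base_commit, run the agent, and record the diff.

    `run_agent(prompt, cwd)` is a hypothetical stand-in for the agent;
    it is expected to edit files under `cwd`.
    """
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    repo_dir = workdir / task["instance_id"]
    # Assumption: repos come from GitHub unless repo_url is given.
    url = repo_url or f"https://github.com/{task['repo']}.git"
    if not repo_dir.exists():
        git("clone", "--quiet", str(url), str(repo_dir), cwd=workdir)
    git("checkout", "--quiet", task["base_commit"], cwd=repo_dir)

    run_agent(task["problem_statement"], repo_dir)

    # Plain `git diff` skips untracked files, so stage everything first
    # and diff the index against HEAD to include newly created files.
    git("add", "-A", cwd=repo_dir)
    patch = git("diff", "--cached", cwd=repo_dir)

    prediction = {
        "instance_id": task["instance_id"],
        "model_patch": patch,
        "model_name_or_path": model_name,
    }
    with open(output, "a", encoding="utf-8") as f:
        f.write(json.dumps(prediction) + "\n")
    return prediction
```

Staging before diffing is a design choice of this sketch: a bare `git diff` would silently drop any file the agent created, so whether the real runner handles new files is worth checking.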
## Output Files
After a run, two files are produced:
### `predictions.jsonl` — SWE-bench format
```json
{"instance_id": "astropy__astropy-12907", "model_patch": "diff --git a/...", "model_name_or_path": "multica-kimi-coding"}
```
This file is the input to the official evaluation harness.
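Because the harness consumes this file directly, it can help to sanity-check it before starting a long Docker run. The sketch below checks only the three keys shown in the example above; `check_predictions` is a hypothetical helper, not part of the scripts or of `swebench`.

```python
import json

REQUIRED_KEYS = ("instance_id", "model_patch", "model_name_or_path")


def check_predictions(lines):
    """Return a list of (line_number, problem) tuples for JSONL lines."""
    problems = []
    seen = set()
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
            continue
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            problems.append((number, f"missing keys: {', '.join(missing)}"))
            continue
        if record["instance_id"] in seen:
            problems.append((number, "duplicate instance_id"))
        seen.add(record["instance_id"])
        if not record["model_patch"]:
            problems.append((number, "empty patch"))
    return problems
```

A typical call would be `check_predictions(open("scripts/swe-bench/predictions.jsonl"))`; an empty list means no problems were found by these particular checks.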
### `predictions.results.jsonl` — detailed run metrics
```json
{
  "instance_id": "astropy__astropy-12907",
  "success": true,
  "patch": "diff --git a/...",
  "error": null,
  "duration_ms": 141892,
  "session_id": "019c60c7-52ac-702a-9b9c-dc53c0daea6b"
}
```
## Analyzing Results
```bash
# Summary report
npx tsx scripts/swe-bench/analyze.ts
# Or specify a results file
npx tsx scripts/swe-bench/analyze.ts scripts/swe-bench/pred-kimi.results.jsonl
```
Output includes:
- Patch rate (how many tasks produced a diff)
- Duration statistics (avg/min/max)
- Error breakdown
- Per-repository stats
- Slowest tasks
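As a rough illustration of how these figures can be derived from `predictions.results.jsonl`, here is a Python sketch. It assumes only the record fields shown in the example above; `summarize` and `repo_of` are hypothetical names, and `analyze.ts` may compute things differently.

```python
from collections import Counter, defaultdict


def repo_of(instance_id):
    """Map 'astropy__astropy-12907' to 'astropy/astropy'."""
    return instance_id.rsplit("-", 1)[0].replace("__", "/", 1)


def summarize(records):
    """Compute patch rate, duration stats, errors, and per-repo counts."""
    total = len(records)
    patched = sum(1 for r in records if r.get("patch"))
    durations = [
        r["duration_ms"] / 1000 for r in records if r.get("duration_ms") is not None
    ]
    per_repo = defaultdict(lambda: {"tasks": 0, "patched": 0})
    for r in records:
        stats = per_repo[repo_of(r["instance_id"])]
        stats["tasks"] += 1
        stats["patched"] += 1 if r.get("patch") else 0
    slowest = sorted(records, key=lambda r: r.get("duration_ms") or 0, reverse=True)
    return {
        "total": total,
        "patch_rate": patched / total if total else 0.0,
        "duration_s": {
            "avg": sum(durations) / len(durations),
            "min": min(durations),
            "max": max(durations),
        } if durations else None,
        "errors": dict(Counter(r["error"] for r in records if r.get("error"))),
        "per_repo": dict(per_repo),
        "slowest": [r["instance_id"] for r in slowest[:5]],
    }
```

Note that the patch rate only counts tasks that produced a diff; it says nothing about whether the diff passes the tests.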
### Run-Log Analysis
Each agent session writes a structured `run-log.jsonl` to `~/.super-multica/sessions/<session-id>/`. This captures every LLM call, tool invocation, and timing:
```bash
# Find a session's run log
cat ~/.super-multica/sessions/<session-id>/run-log.jsonl | head -5
# Quick stats from a run log
cat ~/.super-multica/sessions/<session-id>/run-log.jsonl | python3 -c "
import json, sys
events = [json.loads(l) for l in sys.stdin if l.strip()]
tools = [e for e in events if e['event'] == 'tool_start']
llm_ms = sum(e.get('duration_ms', 0) for e in events if e['event'] == 'llm_result')
print(f'LLM time: {llm_ms/1000:.1f}s | Tool calls: {len(tools)}')
"
```
## Official Evaluation (Docker)
The runner produces patches, but **only the official SWE-bench harness determines pass/fail** by applying the patch and running the project's test suite.
### Prerequisites
- Docker running, with at least 120GB of free storage, 16GB of RAM, and 8 CPU cores available
- `pip install swebench`
### Run Evaluation
```bash
# Using the wrapper script
bash scripts/swe-bench/evaluate.sh
# Or directly
python -m swebench.harness.run_evaluation \
  --dataset_name princeton-nlp/SWE-bench_Lite \
  --predictions_path scripts/swe-bench/predictions.jsonl \
  --max_workers 4 \
  --run_id multica
```
Results are written to `logs/` and `evaluation_results/`.
## Known Limitations and Improvements
### Current Limitations
1. **No Docker isolation for agent execution**: The agent runs on the host, so `pip install` and other commands affect the system Python. SWE-bench standard practice is to run each task in a Docker container.
2. **`SMC_DATA_DIR` timing**: `DATA_DIR` is resolved once, at module import time, so setting `SMC_DATA_DIR` from inside the running process has no effect and sessions are still written to `~/.super-multica/sessions/`. To isolate benchmark sessions, set the variable in the environment before the process starts:
```bash
SMC_DATA_DIR=~/.swe-bench-eval npx tsx scripts/swe-bench/run.ts --limit 5
```
3. **Sequential execution**: Tasks run one at a time. For large-scale runs, launch multiple processes with `--offset`/`--limit` to parallelize:
```bash
# Run 4 workers in parallel
npx tsx scripts/swe-bench/run.ts --offset 0 --limit 75 --output pred-0.jsonl &
npx tsx scripts/swe-bench/run.ts --offset 75 --limit 75 --output pred-1.jsonl &
npx tsx scripts/swe-bench/run.ts --offset 150 --limit 75 --output pred-2.jsonl &
npx tsx scripts/swe-bench/run.ts --offset 225 --limit 75 --output pred-3.jsonl &
wait
cat pred-*.jsonl > predictions.jsonl
```
4. **Repo cloning per instance**: Each instance clones the full repo. For repos with many tasks (e.g., astropy, django), a shared clone with `git worktree` would be faster.
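For the parallel runs described in limitation 3, the `--offset`/`--limit` pairs and the final merge can be sketched in Python. Both helpers are hypothetical and not part of the scripts. Compared with a plain `cat`, this merge tolerates a shard file without a trailing newline and drops repeated `instance_id`s, which can appear if a shard is re-run.

```python
import json
import math
from pathlib import Path


def shard_ranges(total, workers):
    """Split `total` tasks into (offset, limit) pairs, one per worker."""
    if total <= 0:
        return []
    size = math.ceil(total / workers)
    return [(start, min(size, total - start)) for start in range(0, total, size)]


def merge_predictions(shard_paths, output):
    """Concatenate shard files, keeping the first prediction per instance_id."""
    seen = set()
    kept = 0
    with open(output, "w", encoding="utf-8") as out:
        for path in shard_paths:
            for line in Path(path).read_text(encoding="utf-8").splitlines():
                if not line.strip():
                    continue
                instance_id = json.loads(line)["instance_id"]
                if instance_id in seen:
                    continue
                seen.add(instance_id)
                out.write(line + "\n")
                kept += 1
    return kept
```

For the 300-task Lite set and 4 workers, `shard_ranges(300, 4)` gives the same four offset and limit pairs used in the shell example above.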
### Potential Improvements
- **Docker-per-task**: Run each agent in a Docker container matching the SWE-bench environment spec (correct Python version, pre-installed dependencies)
- **Shared repo pool**: Clone each unique repo once, use `git worktree` for per-task isolation
- **Cost tracking**: Parse run-log token counts for per-task and aggregate cost estimates
- **Multi-turn retries**: If the agent produces no patch, retry with feedback
- **System prompt tuning**: The current prompt is minimal; more detailed guidance (e.g., "search for related test files to understand expected behavior") could improve solve rate
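The shared repo pool idea could look roughly like the Python sketch below, which clones each repository once and gives every task its own detached worktree. It is an assumption-laden illustration rather than a planned implementation: the helper names and directory layout are hypothetical, and only standard `git clone` and `git worktree` subcommands are used.

```python
import subprocess
from pathlib import Path


def git(*args, cwd):
    """Run a git command in `cwd` and return its stdout; raise on failure."""
    result = subprocess.run(
        ["git", *args], cwd=cwd, check=True, capture_output=True, text=True
    )
    return result.stdout


def ensure_clone(repo_url, pool_dir, name):
    """Clone `repo_url` into the pool once and reuse it on later calls."""
    clone = Path(pool_dir) / name
    if not clone.exists():
        clone.parent.mkdir(parents=True, exist_ok=True)
        git("clone", "--quiet", str(repo_url), str(clone), cwd=clone.parent)
    return clone


def add_worktree(clone, task_dir, commit):
    """Create a detached worktree checked out at `commit` for one task."""
    task_dir = Path(task_dir)
    task_dir.parent.mkdir(parents=True, exist_ok=True)
    git("worktree", "add", "--detach", str(task_dir.resolve()), commit, cwd=clone)
    return task_dir


def remove_worktree(clone, task_dir):
    """Delete a task's worktree, discarding any uncommitted changes."""
    git("worktree", "remove", "--force", str(Path(task_dir).resolve()), cwd=clone)
```

Each worktree has its own working directory and index, so the patch for a task can still be collected with `git diff` inside that directory while the object store is shared across tasks.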
## Related Benchmarks
| Benchmark | Focus | Notes |
|-----------|-------|-------|
| [SWE-bench Verified](https://openai.com/index/introducing-swe-bench-verified/) | Bug fixing (Python) | Gold standard, 500 human-verified tasks |
| [SWE-bench Multilingual](https://github.com/SWE-bench/SWE-bench) | Bug fixing (7 languages) | Java, TS, JS, Go, Rust, C, C++ |
| [Terminal-Bench](https://www.tbench.ai/) | CLI workflows | Multi-step sandboxed terminal tasks |
| [Aider Polyglot](https://aider.chat/docs/leaderboards/) | Code editing | 225 Exercism exercises, 6 languages |
| [DPAI Arena](https://www.jetbrains.com/) | Full dev workflow | JetBrains: patch, test, review, analysis |
| [HumanEval](https://github.com/openai/human-eval) | Function generation | 164 Python function tasks, largely saturated |
## Initial Results (kimi-coding, 3 tasks)
First run on 3 SWE-bench Lite tasks (all astropy):
| Task | Status | Duration | LLM Time | Tools | Fix |
|------|--------|----------|----------|-------|-----|
| `astropy__astropy-12907` | PATCHED | 141.9s | 125.1s | 30 | `_cstack`: `= 1` → `= right` |
| `astropy__astropy-14182` | PATCHED | 192.0s | 166.9s | 56 | Added `header_rows` param to RST writer |
| `astropy__astropy-14365` | PATCHED | 65.7s | 49.6s | 23 | `re.compile()` + `re.IGNORECASE` |
All 3 tasks produced a patch. PATCHED only means the agent emitted a diff; whether the patches actually pass the tests is still pending the official Docker evaluation.
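One thing the table makes easy to check is how much of each run's wall-clock time went to LLM calls. The Python snippet below does that arithmetic on the Duration and LLM Time columns, assuming LLM Time is the summed LLM call duration from the run log; `llm_share` is a name introduced here for illustration.

```python
# (duration_s, llm_time_s) copied from the results table.
runs = {
    "astropy__astropy-12907": (141.9, 125.1),
    "astropy__astropy-14182": (192.0, 166.9),
    "astropy__astropy-14365": (65.7, 49.6),
}


def llm_share(duration_s, llm_s):
    """Fraction of wall-clock time spent waiting on the LLM."""
    return llm_s / duration_s


for task, (duration_s, llm_s) in runs.items():
    print(f"{task}: {llm_share(duration_s, llm_s):.0%} of wall time in LLM calls")

total_duration = sum(d for d, _ in runs.values())
total_llm = sum(l for _, l in runs.values())
print(f"overall: {llm_share(total_duration, total_llm):.0%}")
```

Across the three runs this comes to roughly 85% of wall time spent in LLM calls (341.6s of 399.6s). With only three tasks, that is an observation, not a trend.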