docs: add SWE-bench runner guide
Covers the full pipeline: dataset download, agent execution, result analysis, and official Docker evaluation. Includes runner options, output format, known limitations, and initial benchmark results.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
parent 90d374ffd5
commit 10c57c0f7a
1 changed file with 253 additions and 0 deletions

docs/swe-bench.md (new file)
@@ -0,0 +1,253 @@

# SWE-bench: Agent Coding Benchmark

Run and evaluate the Multica agent against [SWE-bench](https://www.swebench.com/), the standard benchmark for AI coding agents. SWE-bench tasks are real GitHub issues from open-source Python projects: the agent must read the issue, explore the codebase, and produce a patch that fixes the bug.

## Quick Start

```bash
# 1. Download dataset (requires: pip install datasets)
python scripts/swe-bench/download-dataset.py --dataset lite --limit 5

# 2. Run the agent
npx tsx scripts/swe-bench/run.ts --limit 5

# 3. Analyze results
npx tsx scripts/swe-bench/analyze.ts
```

## Scripts

```
scripts/swe-bench/
├── download-dataset.py   # Download from HuggingFace → JSONL
├── run.ts                # Core runner: Agent API → git diff → predictions
├── evaluate.sh           # Official Docker evaluation harness wrapper
├── analyze.ts            # Summarize run results
└── .gitignore            # Ignores downloaded datasets and output files
```

## Pipeline

```
                                  ┌──────────────────┐
HuggingFace ──download──► JSONL ──┤ For each task:   │
                                  │  1. git clone    │
                                  │  2. git checkout │
                                  │  3. Agent.run()  │
                                  │  4. git diff     │
                                  └────────┬─────────┘
                                           │
                         predictions.jsonl (SWE-bench format)
                                           │
                           ┌───────────────┴───────────────┐
                           │   swebench.harness (Docker)   │
                           │   Apply patch → run tests     │
                           │   → pass/fail verdict         │
                           └───────────────────────────────┘
```

## Dataset Variants

| Variant | Size | HuggingFace ID | Recommended For |
|---------|------|----------------|-----------------|
| **Lite** | 300 tasks | `princeton-nlp/SWE-bench_Lite` | Quick iteration, development |
| **Verified** | 500 tasks | `princeton-nlp/SWE-bench_Verified` | Official benchmarking, leaderboard |
| **Full** | 2,294 tasks | `princeton-nlp/SWE-bench` | Comprehensive evaluation |

```bash
# Download a specific variant
python scripts/swe-bench/download-dataset.py --dataset verified
python scripts/swe-bench/download-dataset.py --dataset lite --limit 20
```

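This guide does not show the internals of `download-dataset.py`. As a rough sketch of the HuggingFace → JSONL step, assuming the `datasets` package and the `test` split, something like the following would work. The `VARIANTS` mapping mirrors the table above; `write_jsonl` and `download` are hypothetical names, not the script's actual API.

```python
import json
from itertools import islice
from typing import Iterable, Mapping, Optional

# Variant name -> HuggingFace dataset ID (taken from the table above).
VARIANTS = {
    "lite": "princeton-nlp/SWE-bench_Lite",
    "verified": "princeton-nlp/SWE-bench_Verified",
    "full": "princeton-nlp/SWE-bench",
}


def write_jsonl(rows: Iterable[Mapping], path: str, limit: Optional[int] = None) -> int:
    """Write up to `limit` rows to `path`, one JSON object per line."""
    count = 0
    with open(path, "w", encoding="utf-8") as fh:
        for row in islice(rows, limit):  # limit=None means "all rows"
            fh.write(json.dumps(dict(row)) + "\n")
            count += 1
    return count


def download(variant: str, path: str, limit: Optional[int] = None) -> int:
    """Fetch a variant from HuggingFace and dump it as JSONL (needs network)."""
    from datasets import load_dataset  # lazy import: pip install datasets

    rows = load_dataset(VARIANTS[variant], split="test")
    return write_jsonl(rows, path, limit)
```

Keeping the network call in `download` and the serialization in `write_jsonl` means the JSONL step can be exercised with plain dictionaries, without touching HuggingFace.
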
## Runner Options

```bash
npx tsx scripts/swe-bench/run.ts [options]

Options:
  --dataset PATH    JSONL dataset path (default: scripts/swe-bench/lite.jsonl)
  --provider NAME   LLM provider (default: kimi-coding)
  --model NAME      Model override
  --limit N         Max tasks to run (default: all)
  --offset N        Skip first N tasks (default: 0)
  --output PATH     Output predictions JSONL (default: scripts/swe-bench/predictions.jsonl)
  --workdir PATH    Repo clone directory (default: /tmp/swe-bench)
  --timeout MS      Per-task timeout (default: 300000 = 5min)
  --instance ID     Run a single instance
  --debug           Enable debug logging
```

### Examples

```bash
# Run 10 tasks with Anthropic Claude
npx tsx scripts/swe-bench/run.ts --limit 10 --provider anthropic

# Run a specific instance
npx tsx scripts/swe-bench/run.ts --instance "django__django-16379"

# Skip the first 50 tasks, then run the next 10 with a longer timeout (600000 ms = 10 min)
npx tsx scripts/swe-bench/run.ts --offset 50 --limit 10 --timeout 600000

# Compare providers (run separately, different output files)
npx tsx scripts/swe-bench/run.ts --provider kimi-coding --output scripts/swe-bench/pred-kimi.jsonl
npx tsx scripts/swe-bench/run.ts --provider anthropic --output scripts/swe-bench/pred-claude.jsonl
```

## How the Agent Solves Tasks

For each task, the runner:

1. **Clones the repository** to `/tmp/swe-bench/<instance_id>/` and checks out `base_commit`
2. **Creates an Agent** with a focused system prompt and restricted tools (coding only: no web, no cron, no sessions)
3. **Runs the agent** with the issue description as the prompt
4. **Collects `git diff`** as the patch after the agent finishes
5. **Appends** the prediction to `predictions.jsonl` in SWE-bench format

The agent has access to:
- `read`, `write`, `edit`: file operations
- `exec`, `process`: shell commands (for exploring code, running tests)
- `glob`: file search

Tools explicitly denied: `web_fetch`, `web_search`, `cron`, `data`, `sessions_spawn`, `sessions_list`, `memory_search`, `send_file`.

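The tool restriction amounts to an allow list plus a deny list. The sketch below only illustrates that idea using the tool names listed above; `select_tools` is a hypothetical helper, and the real runner may configure the agent quite differently.

```python
# Tool names copied from the lists above.
ALLOWED_TOOLS = {"read", "write", "edit", "exec", "process", "glob"}
DENIED_TOOLS = {
    "web_fetch", "web_search", "cron", "data",
    "sessions_spawn", "sessions_list", "memory_search", "send_file",
}


def select_tools(available: list[str]) -> list[str]:
    """Keep only allowed tools, preserving order; denied and unknown names are dropped."""
    overlap = ALLOWED_TOOLS & DENIED_TOOLS
    if overlap:
        # A tool on both lists is a configuration mistake, so fail early.
        raise ValueError(f"tools both allowed and denied: {sorted(overlap)}")
    return [name for name in available if name in ALLOWED_TOOLS]
```

Filtering by the allow list, and not only by the deny list, means a tool added to the agent later stays unavailable to benchmark runs until someone opts it in.
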
## Output Files

After a run, two files are produced:

### `predictions.jsonl`: SWE-bench format

```json
{"instance_id": "astropy__astropy-12907", "model_patch": "diff --git a/...", "model_name_or_path": "multica-kimi-coding"}
```

This file is the input to the official evaluation harness.

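As a sketch of what one line of this file contains, the helpers below build a record with the three keys shown above and check an existing file against them. `make_prediction`, `load_predictions`, and the `multica-<provider>` naming rule are assumptions inferred from the example line, not the runner's actual code.

```python
import json

REQUIRED_KEYS = ("instance_id", "model_patch", "model_name_or_path")


def make_prediction(instance_id: str, patch: str, provider: str) -> str:
    """Serialize one prediction as a single JSONL line (no trailing newline)."""
    record = {
        "instance_id": instance_id,
        "model_patch": patch,
        # Assumed naming rule, inferred from "multica-kimi-coding" above.
        "model_name_or_path": f"multica-{provider}",
    }
    return json.dumps(record)


def load_predictions(path: str) -> list[dict]:
    """Parse a predictions file, failing loudly on records with missing keys."""
    records = []
    with open(path, encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = [k for k in REQUIRED_KEYS if k not in record]
            if missing:
                raise ValueError(f"line {lineno}: missing {missing}")
            records.append(record)
    return records
```

Running a check like `load_predictions` before starting the Docker harness is cheap and catches truncated or malformed lines early.
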
### `predictions.results.jsonl`: detailed run metrics

```json
{
  "instance_id": "astropy__astropy-12907",
  "success": true,
  "patch": "diff --git a/...",
  "error": null,
  "duration_ms": 141892,
  "session_id": "019c60c7-52ac-702a-9b9c-dc53c0daea6b"
}
```

## Analyzing Results

```bash
# Summary report
npx tsx scripts/swe-bench/analyze.ts

# Or specify a results file
npx tsx scripts/swe-bench/analyze.ts scripts/swe-bench/pred-kimi.results.jsonl
```

Output includes:
- Patch rate (how many tasks produced a diff)
- Duration statistics (avg/min/max)
- Error breakdown
- Per-repository stats
- Slowest tasks

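The first two items can be reproduced by hand from a results file. This is a minimal sketch, not `analyze.ts` itself: it assumes the field names shown in the `predictions.results.jsonl` example and treats a non-empty `patch` as "produced a diff". The function name `summarize` is hypothetical.

```python
import json
from statistics import mean


def summarize(lines):
    """Compute patch rate and duration stats from results JSONL lines."""
    results = [json.loads(line) for line in lines if line.strip()]
    if not results:
        return {"tasks": 0}
    durations = [r["duration_ms"] / 1000 for r in results]  # ms -> seconds
    patched = sum(1 for r in results if r.get("patch"))
    return {
        "tasks": len(results),
        "patched": patched,
        "patch_rate": patched / len(results),
        "avg_s": mean(durations),
        "min_s": min(durations),
        "max_s": max(durations),
    }
```

Because `summarize` accepts any iterable of lines, an open file handle on a results file can be passed to it directly.
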
### Run-Log Analysis

Each agent session writes a structured `run-log.jsonl` to `~/.super-multica/sessions/<session-id>/`. This captures every LLM call, tool invocation, and timing:

```bash
# Show the first five events of a session's run log
head -5 ~/.super-multica/sessions/<session-id>/run-log.jsonl

# Quick stats from a run log
cat ~/.super-multica/sessions/<session-id>/run-log.jsonl | python3 -c "
import json, sys
# One JSON event per line; skip blank lines
events = [json.loads(l) for l in sys.stdin if l.strip()]
# .get() avoids a KeyError on events without an 'event' field
tools = [e for e in events if e.get('event') == 'tool_start']
llm_ms = sum(e.get('duration_ms', 0) for e in events if e.get('event') == 'llm_result')
print(f'LLM time: {llm_ms/1000:.1f}s | Tool calls: {len(tools)}')
"
```

## Official Evaluation (Docker)

The runner only produces patches. **Pass/fail is determined solely by the official SWE-bench harness**, which applies each patch and runs the project's test suite.

### Prerequisites

- Docker installed and running (at least 120 GB of free storage, 16 GB RAM, and 8 CPU cores)
- `pip install swebench`

### Run Evaluation

```bash
# Using the wrapper script
bash scripts/swe-bench/evaluate.sh

# Or directly (use the dataset the predictions were generated from)
python -m swebench.harness.run_evaluation \
  --dataset_name princeton-nlp/SWE-bench_Lite \
  --predictions_path scripts/swe-bench/predictions.jsonl \
  --max_workers 4 \
  --run_id multica
```

Results are written to `logs/` and `evaluation_results/`.

## Known Limitations and Improvements

### Current Limitations

1. **No Docker isolation for agent execution**: The agent runs on the host, so `pip install` and other commands affect the system Python. Standard SWE-bench practice is to run each task in a Docker container.

2. **`SMC_DATA_DIR` timing**: `DATA_DIR` is resolved at module import time, so setting `SMC_DATA_DIR` after the process has started has no effect, and sessions are written to `~/.super-multica/sessions/`. To isolate a run, set the variable before the process starts:
   ```bash
   SMC_DATA_DIR=~/.swe-bench-eval npx tsx scripts/swe-bench/run.ts --limit 5
   ```

3. **Sequential execution**: Tasks run one at a time. For large-scale runs, launch multiple processes with `--offset`/`--limit` to parallelize:
   ```bash
   # Run 4 workers in parallel (4 x 75 = 300 tasks, i.e. all of Lite)
   npx tsx scripts/swe-bench/run.ts --offset 0 --limit 75 --output pred-0.jsonl &
   npx tsx scripts/swe-bench/run.ts --offset 75 --limit 75 --output pred-1.jsonl &
   npx tsx scripts/swe-bench/run.ts --offset 150 --limit 75 --output pred-2.jsonl &
   npx tsx scripts/swe-bench/run.ts --offset 225 --limit 75 --output pred-3.jsonl &
   wait
   # Merge predictions only: pred-*.jsonl would also match the pred-N.results.jsonl files
   cat pred-[0-3].jsonl > predictions.jsonl
   ```

4. **Repo cloning per instance**: Each instance clones the full repo. For repos with many tasks (e.g., astropy, django), a shared clone with `git worktree` would be faster.

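The `--offset`/`--limit` arithmetic in item 3 generalizes to any task count and worker count. The helper below is a small sketch of that split; `shard_ranges` is a hypothetical name and is not part of the scripts.

```python
def shard_ranges(total: int, workers: int) -> list[tuple[int, int]]:
    """Split `total` tasks into contiguous (offset, limit) pairs, one per worker."""
    if total < 0 or workers < 1:
        raise ValueError("total must be >= 0 and workers >= 1")
    base, extra = divmod(total, workers)
    ranges, offset = [], 0
    for i in range(workers):
        limit = base + (1 if i < extra else 0)  # spread the remainder over the first workers
        if limit:
            ranges.append((offset, limit))  # workers with nothing to do are skipped
        offset += limit
    return ranges


# Print one runner command per shard for the 300-task Lite set.
for i, (offset, limit) in enumerate(shard_ranges(300, 4)):
    print(f"npx tsx scripts/swe-bench/run.ts --offset {offset} --limit {limit} --output pred-{i}.jsonl &")
```

For 300 tasks and 4 workers this yields the same four `(offset, limit)` pairs as the shell example in item 3.
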
### Potential Improvements

- **Docker-per-task**: Run each agent in a Docker container matching the SWE-bench environment spec (correct Python version, pre-installed dependencies)
- **Shared repo pool**: Clone each unique repo once, use `git worktree` for per-task isolation
- **Cost tracking**: Parse run-log token counts for per-task and aggregate cost estimates
- **Multi-turn retries**: If the agent produces no patch, retry with feedback
- **System prompt tuning**: The current prompt is minimal; more detailed guidance (e.g., "search for related test files to understand expected behavior") could improve solve rate

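The shared repo pool idea could look roughly like the sketch below: one full clone per repository, then a detached `git worktree` per task. It only builds the command lists and leaves execution to the caller. The pool directory, the `worktree_commands` helper, and the GitHub URL pattern are assumptions for illustration; `repo` and `base_commit` are the corresponding SWE-bench dataset fields.

```python
from pathlib import Path


def worktree_commands(repo: str, base_commit: str, instance_id: str,
                      pool: str = "/tmp/swe-bench/_repos",
                      workdir: str = "/tmp/swe-bench") -> list[list[str]]:
    """Return the git commands for one task; the caller decides how to run them."""
    clone = Path(pool) / repo.replace("/", "__")  # e.g. astropy/astropy -> astropy__astropy
    task_dir = Path(workdir) / instance_id
    commands = []
    if not clone.exists():
        # One full clone per unique repository (hypothetical pool layout).
        commands.append(["git", "clone", f"https://github.com/{repo}.git", str(clone)])
    # A detached worktree gives each task its own checkout of base_commit.
    commands.append(["git", "-C", str(clone), "worktree", "add", "--detach",
                     str(task_dir), base_commit])
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`. After a task finishes, `git worktree remove` on the task directory would release it, although cleanup is not covered in this sketch.
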
## Related Benchmarks

| Benchmark | Focus | Notes |
|-----------|-------|-------|
| [SWE-bench Verified](https://openai.com/index/introducing-swe-bench-verified/) | Bug fixing (Python) | Gold standard, 500 human-verified tasks |
| [SWE-bench Multilingual](https://github.com/SWE-bench/SWE-bench) | Bug fixing (7 languages) | Java, TS, JS, Go, Rust, C, C++ |
| [Terminal-Bench](https://www.swebench.com/) | CLI workflows | Multi-step sandboxed terminal tasks |
| [Aider Polyglot](https://aider.chat/docs/leaderboards/) | Code editing | 225 Exercism exercises, 6 languages |
| [DPAI Arena](https://www.jetbrains.com/) | Full dev workflow | JetBrains: patch, test, review, analysis |
| [HumanEval](https://github.com/openai/human-eval) | Function generation | 164 Python function tasks, largely saturated |

## Initial Results (kimi-coding, 3 tasks)

First run on 3 SWE-bench Lite tasks (all astropy):

| Task | Status | Duration | LLM Time | Tools | Fix |
|------|--------|----------|----------|-------|-----|
| `astropy__astropy-12907` | PATCHED | 141.9s | 125.1s | 30 | `_cstack`: `= 1` → `= right` |
| `astropy__astropy-14182` | PATCHED | 192.0s | 166.9s | 56 | Added `header_rows` param to RST writer |
| `astropy__astropy-14365` | PATCHED | 65.7s | 49.6s | 23 | `re.compile()` + `re.IGNORECASE` |

3/3 tasks produced patches. Formal evaluation pending (requires Docker harness).
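
The Duration and LLM Time columns also show where the time goes. The snippet below is plain arithmetic on the three rows above, with no new measurements; `llm_share` is a hypothetical helper name.

```python
# (task, total seconds, LLM seconds) copied from the table above.
RUNS = [
    ("astropy__astropy-12907", 141.9, 125.1),
    ("astropy__astropy-14182", 192.0, 166.9),
    ("astropy__astropy-14365", 65.7, 49.6),
]


def llm_share(total_s: float, llm_s: float) -> float:
    """Fraction of a task's wall-clock duration spent in LLM calls."""
    return llm_s / total_s


for task, total_s, llm_s in RUNS:
    print(f"{task}: {llm_share(total_s, llm_s):.1%} LLM, {total_s - llm_s:.1f}s other")

# Aggregate share across all three runs.
overall = llm_share(sum(r[1] for r in RUNS), sum(r[2] for r in RUNS))
print(f"overall: {overall:.1%}")
```

On these three runs, LLM calls account for roughly three quarters or more of each task's duration, so tool execution is not the bottleneck at this sample size.
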