docs: add SWE-bench runner guide

Covers the full pipeline: dataset download, agent execution, result analysis, and official Docker evaluation. Includes runner options, output format, known limitations, and initial benchmark results.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 18:30:58 +08:00
commit 10c57c0f7a
parent 90d374ffd5
1 changed file with 253 additions and 0 deletions
--- a/docs/swe-bench.md
+++ b/docs/swe-bench.md
@@ -0,0 +1,253 @@
+# SWE-bench: Agent Coding Benchmark
+
+Run and evaluate the Multica agent against [SWE-bench](https://www.swebench.com/), the standard benchmark for AI coding agents. SWE-bench tasks are real GitHub issues from open-source Python projects — the agent must read the issue, explore the codebase, and produce a patch that fixes the bug.
+
+## Quick Start
+
+```bash
+# 1. Download dataset (requires: pip install datasets)
+python scripts/swe-bench/download-dataset.py --dataset lite --limit 5
+
+# 2. Run the agent
+npx tsx scripts/swe-bench/run.ts --limit 5
+
+# 3. Analyze results
+npx tsx scripts/swe-bench/analyze.ts
+```
+
+## Scripts
+
+```
+scripts/swe-bench/
+├── download-dataset.py    # Download from HuggingFace → JSONL
+├── run.ts                 # Core runner: Agent API → git diff → predictions
+├── evaluate.sh            # Official Docker evaluation harness wrapper
+├── analyze.ts             # Summarize run results
+└── .gitignore             # Ignores downloaded datasets and output files
+```
+
+## Pipeline
+
+```
+                                    ┌───────────────────┐
+  HuggingFace ──download──► JSONL ──┤  For each task:   │
+                                    │  1. git clone     │
+                                    │  2. git checkout  │
+                                    │  3. Agent.run()   │
+                                    │  4. git diff      │
+                                    └────────┬──────────┘
+                                             │
+                              predictions.jsonl (SWE-bench format)
+                                             │
+                              ┌───────────────┴───────────────┐
+                              │  swebench.harness (Docker)    │
+                              │  Apply patch → run tests      │
+                              │  → pass/fail verdict          │
+                              └───────────────────────────────┘
+```
+
+## Dataset Variants
+
+| Variant | Size | HuggingFace ID | Recommended For |
+|---------|------|----------------|-----------------|
+| **Lite** | 300 tasks | `princeton-nlp/SWE-bench_Lite` | Quick iteration, development |
+| **Verified** | 500 tasks | `princeton-nlp/SWE-bench_Verified` | Official benchmarking, leaderboard |
+| **Full** | 2,294 tasks | `princeton-nlp/SWE-bench` | Comprehensive evaluation |
+
+```bash
+# Download specific variant
+python scripts/swe-bench/download-dataset.py --dataset verified
+python scripts/swe-bench/download-dataset.py --dataset lite --limit 20
+```
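+
+For reference, the download step can be approximated with the HuggingFace `datasets` library. This is a minimal sketch, not the contents of `download-dataset.py`: the `VARIANTS` mapping, the function names, and the choice of the `test` split are assumptions made for illustration.
+
+```python
+import json
+from itertools import islice
+
+# Variant name -> HuggingFace ID, taken from the table above.
+VARIANTS = {
+    "lite": "princeton-nlp/SWE-bench_Lite",
+    "verified": "princeton-nlp/SWE-bench_Verified",
+    "full": "princeton-nlp/SWE-bench",
+}
+
+
+def write_jsonl(rows, path, limit=None):
+    """Write up to `limit` rows as one JSON object per line; return the count."""
+    count = 0
+    with open(path, "w", encoding="utf-8") as fh:
+        for row in islice(rows, limit):
+            fh.write(json.dumps(dict(row)) + "\n")
+            count += 1
+    return count
+
+
+def download(variant, path, limit=None):
+    """Hypothetical helper: fetch one variant and dump it to JSONL."""
+    # Imported lazily so write_jsonl works without `datasets` installed.
+    from datasets import load_dataset
+
+    dataset = load_dataset(VARIANTS[variant], split="test")  # assumed split
+    return write_jsonl(dataset, path, limit)
+```
+
+Under these assumptions, `download("lite", "lite.jsonl", limit=5)` would roughly correspond to the `--dataset lite --limit 5` invocation.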
+
+## Runner Options
+
+```bash
+npx tsx scripts/swe-bench/run.ts [options]
+
+Options:
+  --dataset PATH      JSONL dataset path          (default: scripts/swe-bench/lite.jsonl)
+  --provider NAME     LLM provider                (default: kimi-coding)
+  --model NAME        Model override
+  --limit N           Max tasks to run             (default: all)
+  --offset N          Skip first N tasks           (default: 0)
+  --output PATH       Output predictions JSONL     (default: scripts/swe-bench/predictions.jsonl)
+  --workdir PATH      Repo clone directory         (default: /tmp/swe-bench)
+  --timeout MS        Per-task timeout             (default: 300000 = 5min)
+  --instance ID       Run a single instance
+  --debug             Enable debug logging
+```
+
+### Examples
+
+```bash
+# Run 10 tasks with Anthropic Claude
+npx tsx scripts/swe-bench/run.ts --limit 10 --provider anthropic
+
+# Run a specific instance
+npx tsx scripts/swe-bench/run.ts --instance "django__django-16379"
+
+# Skip the first 50 tasks, then run the next 10 with a 10-minute timeout
+npx tsx scripts/swe-bench/run.ts --offset 50 --limit 10 --timeout 600000
+
+# Compare providers (run separately, different output files)
+npx tsx scripts/swe-bench/run.ts --provider kimi-coding --output scripts/swe-bench/pred-kimi.jsonl
+npx tsx scripts/swe-bench/run.ts --provider anthropic   --output scripts/swe-bench/pred-claude.jsonl
+```
+
+## How the Agent Solves Tasks
+
+For each task, the runner:
+
+1. **Clones the repository** to `/tmp/swe-bench/<instance_id>/` and checks out `base_commit`
+2. **Creates an Agent** with a focused system prompt and restricted tools (coding only — no web, no cron, no sessions)
+3. **Runs the agent** with the issue description as the prompt
+4. **Collects `git diff`** as the patch after the agent finishes
+5. **Appends** the prediction to `predictions.jsonl` in SWE-bench format
+
+The agent has access to:
+- `read`, `write`, `edit` — file operations
+- `exec`, `process` — shell commands (for exploring code, running tests)
+- `glob` — file search
+
+Tools explicitly denied: `web_fetch`, `web_search`, `cron`, `data`, `sessions_spawn`, `sessions_list`, `memory_search`, `send_file`.
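+
+Steps 1 to 4 above can be sketched in a few lines. The sketch below is illustrative only and is not the code in `run.ts`: `git_commands`, `run_task`, and the `run_agent` callable are hypothetical names, and the GitHub clone URL is an assumption. The `instance_id`, `repo`, `base_commit`, and `problem_statement` keys are the standard SWE-bench dataset fields.
+
+```python
+import subprocess
+from pathlib import Path
+
+
+def git_commands(task, workdir):
+    """Return the per-task repo directory and the git commands that prepare it."""
+    repo_dir = Path(workdir) / task["instance_id"]
+    url = f"https://github.com/{task['repo']}.git"  # assumed clone source
+    commands = [
+        ["git", "clone", url, str(repo_dir)],
+        ["git", "-C", str(repo_dir), "checkout", task["base_commit"]],
+    ]
+    return repo_dir, commands
+
+
+def run_task(task, workdir, run_agent):
+    """Clone, check out, let the agent edit files, then return `git diff`."""
+    repo_dir, commands = git_commands(task, workdir)
+    for argv in commands:
+        subprocess.run(argv, check=True)
+    # `run_agent` is a stand-in for the Agent API; it edits files in place.
+    run_agent(repo_dir, task["problem_statement"])
+    diff = subprocess.run(
+        ["git", "-C", str(repo_dir), "diff"],
+        check=True, capture_output=True, text=True,
+    )
+    return diff.stdout
+```
+
+Collecting the patch with `git diff` after the run means the agent does not have to emit a diff itself; whatever it changed in the working tree becomes the prediction.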
+
+## Output Files
+
+After a run, two files are produced:
+
+### `predictions.jsonl` — SWE-bench format
+
+```json
+{"instance_id": "astropy__astropy-12907", "model_patch": "diff --git a/...", "model_name_or_path": "multica-kimi-coding"}
+```
+
+This file is the input to the official evaluation harness.
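+
+A record in this format can be built and appended as sketched below. The helper names are hypothetical, and the `multica-<provider>` naming is inferred from the example line above rather than confirmed from `run.ts`.
+
+```python
+import json
+
+
+def make_prediction(instance_id, patch, provider):
+    """Build one SWE-bench prediction record."""
+    return {
+        "instance_id": instance_id,
+        "model_patch": patch,
+        # Assumed naming scheme, based on "multica-kimi-coding" above.
+        "model_name_or_path": f"multica-{provider}",
+    }
+
+
+def append_prediction(path, prediction):
+    """Append one record as a single JSON line."""
+    with open(path, "a", encoding="utf-8") as fh:
+        fh.write(json.dumps(prediction) + "\n")
+```
+
+Appending one line per task keeps earlier predictions intact if a long run is interrupted.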
+
+### `predictions.results.jsonl` — detailed run metrics
+
+```json
+{
+  "instance_id": "astropy__astropy-12907",
+  "success": true,
+  "patch": "diff --git a/...",
+  "error": null,
+  "duration_ms": 141892,
+  "session_id": "019c60c7-52ac-702a-9b9c-dc53c0daea6b"
+}
+```
+
+## Analyzing Results
+
+```bash
+# Summary report
+npx tsx scripts/swe-bench/analyze.ts
+
+# Or specify a results file
+npx tsx scripts/swe-bench/analyze.ts scripts/swe-bench/pred-kimi.results.jsonl
+```
+
+Output includes:
+- Patch rate (how many tasks produced a diff)
+- Duration statistics (avg/min/max)
+- Error breakdown
+- Per-repository stats
+- Slowest tasks
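+
+The core of such a summary can be computed directly from the results records. This is a rough sketch and not the logic of `analyze.ts`: it assumes a task counts as patched when its `patch` field is non-empty, and it groups by the `instance_id` prefix before the final `-`.
+
+```python
+from collections import Counter
+from statistics import mean
+
+
+def summarize(results):
+    """Aggregate patch rate, duration stats, errors, and per-repo counts."""
+    durations = [r["duration_ms"] for r in results]
+    patched = [r for r in results if r.get("patch")]
+    # "astropy__astropy-12907" -> "astropy__astropy"
+    repos = Counter(r["instance_id"].rsplit("-", 1)[0] for r in results)
+    errors = Counter(r["error"] for r in results if r.get("error"))
+    return {
+        "total": len(results),
+        "patch_rate": len(patched) / len(results) if results else 0.0,
+        "duration_ms": (
+            {"avg": mean(durations), "min": min(durations), "max": max(durations)}
+            if durations else None
+        ),
+        "by_repo": dict(repos),
+        "errors": dict(errors),
+    }
+```
+
+Patch rate only says that a diff exists; it is not a solve rate.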
+
+### Run-Log Analysis
+
+Each agent session writes a structured `run-log.jsonl` to `~/.super-multica/sessions/<session-id>/`. This captures every LLM call, tool invocation, and timing:
+
+```bash
+# Preview the first five events of a session's run log
+head -5 ~/.super-multica/sessions/<session-id>/run-log.jsonl
+
+# Quick stats from a run log
+cat ~/.super-multica/sessions/<session-id>/run-log.jsonl | python3 -c "
+import json, sys
+events = [json.loads(l) for l in sys.stdin if l.strip()]
+# .get() avoids a KeyError if an entry has no 'event' field
+tools = [e for e in events if e.get('event') == 'tool_start']
+llm_ms = sum(e.get('duration_ms', 0) for e in events if e.get('event') == 'llm_result')
+print(f'LLM time: {llm_ms/1000:.1f}s | Tool calls: {len(tools)}')
+"
+```
+
+## Official Evaluation (Docker)
+
+The runner produces patches, but **only the official SWE-bench harness determines pass/fail** by applying the patch and running the project's test suite.
+
+### Prerequisites
+
+- Docker running (at least 120GB storage, 16GB RAM, 8 CPU cores)
+- `pip install swebench`
+
+### Run Evaluation
+
+```bash
+# Using the wrapper script
+bash scripts/swe-bench/evaluate.sh
+
+# Or directly
+python -m swebench.harness.run_evaluation \
+  --dataset_name princeton-nlp/SWE-bench_Lite \
+  --predictions_path scripts/swe-bench/predictions.jsonl \
+  --max_workers 4 \
+  --run_id multica
+```
+
+Results are written to `logs/` and `evaluation_results/`.
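+
+Because a full evaluation run is expensive, it can be worth sanity-checking `predictions.jsonl` first. The checker below is a hypothetical sketch, not part of `evaluate.sh` or the `swebench` package; it only verifies that the three keys of the prediction format are present and flags duplicates and empty patches for review.
+
+```python
+import json
+
+REQUIRED_KEYS = {"instance_id", "model_patch", "model_name_or_path"}
+
+
+def check_predictions(lines):
+    """Return (valid_count, problems) for an iterable of JSONL lines."""
+    seen = set()
+    problems = []
+    valid = 0
+    for number, line in enumerate(lines, start=1):
+        if not line.strip():
+            continue  # ignore blank lines
+        record = json.loads(line)
+        missing = REQUIRED_KEYS - record.keys()
+        if missing:
+            problems.append(f"line {number}: missing {sorted(missing)}")
+        elif record["instance_id"] in seen:
+            problems.append(f"line {number}: duplicate {record['instance_id']}")
+        elif not record["model_patch"]:
+            problems.append(f"line {number}: empty patch for {record['instance_id']}")
+        else:
+            valid += 1
+        seen.add(record.get("instance_id"))
+    return valid, problems
+```
+
+For example, `check_predictions(open("scripts/swe-bench/predictions.jsonl"))` returns the number of usable records and a list of human-readable problems.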
+
+## Known Limitations and Improvements
+
+### Current Limitations
+
+1. **No Docker isolation for agent execution**: The agent runs on the host, so `pip install` and other commands affect the system Python. SWE-bench standard practice is to run each task in a Docker container.
+
+2. **`SMC_DATA_DIR` timing**: Setting `SMC_DATA_DIR` at runtime doesn't affect `DATA_DIR` (resolved at module import time). Sessions currently write to `~/.super-multica/sessions/`. To isolate, set the env var before the process starts:
+   ```bash
+   SMC_DATA_DIR=~/.swe-bench-eval npx tsx scripts/swe-bench/run.ts --limit 5
+   ```
+
+3. **Sequential execution**: Tasks run one at a time. For large-scale runs, launch multiple processes with `--offset`/`--limit` to parallelize:
+   ```bash
+   # Run 4 workers in parallel
+   npx tsx scripts/swe-bench/run.ts --offset 0   --limit 75 --output pred-0.jsonl &
+   npx tsx scripts/swe-bench/run.ts --offset 75  --limit 75 --output pred-1.jsonl &
+   npx tsx scripts/swe-bench/run.ts --offset 150 --limit 75 --output pred-2.jsonl &
+   npx tsx scripts/swe-bench/run.ts --offset 225 --limit 75 --output pred-3.jsonl &
+   wait
+   # Merge only the prediction files; pred-*.jsonl would also match any pred-N.results.jsonl
+   cat pred-[0-3].jsonl > predictions.jsonl
+   ```
+
+4. **Repo cloning per instance**: Each instance clones the full repo. For repos with many tasks (e.g., astropy, django), a shared clone with `git worktree` would be faster.
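+
+The `--offset`/`--limit` values for the parallel workaround in item 3 do not have to be worked out by hand. The helper below is a hypothetical sketch that splits a task count into contiguous shards and prints the matching runner commands; only the flags already documented above are used.
+
+```python
+def shard_ranges(total, workers):
+    """Split `total` tasks into contiguous (offset, limit) shards."""
+    base, extra = divmod(total, workers)
+    ranges, offset = [], 0
+    for index in range(workers):
+        # The first `extra` shards take one additional task.
+        limit = base + (1 if index < extra else 0)
+        if limit:
+            ranges.append((offset, limit))
+        offset += limit
+    return ranges
+
+
+def worker_commands(total, workers):
+    """Build one runner command per shard, each with its own output file."""
+    return [
+        f"npx tsx scripts/swe-bench/run.ts --offset {offset} --limit {limit} "
+        f"--output pred-{index}.jsonl"
+        for index, (offset, limit) in enumerate(shard_ranges(total, workers))
+    ]
+
+
+if __name__ == "__main__":
+    for command in worker_commands(300, 4):
+        print(command)
+```
+
+For the 300 Lite tasks and 4 workers this yields the same offsets as the shell example: 0, 75, 150, and 225, each with a limit of 75.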
+
+### Potential Improvements
+
+- **Docker-per-task**: Run each agent in a Docker container matching the SWE-bench environment spec (correct Python version, pre-installed dependencies)
+- **Shared repo pool**: Clone each unique repo once, use `git worktree` for per-task isolation
+- **Cost tracking**: Parse run-log token counts for per-task and aggregate cost estimates
+- **Multi-turn retries**: If the agent produces no patch, retry with feedback
+- **System prompt tuning**: The current prompt is minimal; more detailed guidance (e.g., "search for related test files to understand expected behavior") could improve solve rate
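+
+As an illustration of the shared repo pool idea, the sketch below builds the git commands for one task: clone the repository once into a pool directory, then add a detached worktree at the task's `base_commit`. Nothing here exists in the runner yet; the function name, the pool layout, and the clone URL are assumptions.
+
+```python
+from pathlib import Path
+
+
+def worktree_commands(pool_dir, workdir, task):
+    """Return git commands that give `task` its own worktree of a shared clone."""
+    # One shared clone per repository, e.g. <pool_dir>/astropy__astropy
+    repo_clone = Path(pool_dir) / task["repo"].replace("/", "__")
+    task_dir = Path(workdir) / task["instance_id"]
+    commands = []
+    if not repo_clone.exists():
+        url = f"https://github.com/{task['repo']}.git"  # assumed clone source
+        commands.append(["git", "clone", url, str(repo_clone)])
+    # A detached worktree checks out base_commit without touching any branch.
+    commands.append([
+        "git", "-C", str(repo_clone), "worktree", "add", "--detach",
+        str(task_dir), task["base_commit"],
+    ])
+    return commands
+```
+
+Each command list can be passed to `subprocess.run(argv, check=True)`. Only the first task of a repository pays for the clone; later tasks reuse its object store.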
+
+## Related Benchmarks
+
+| Benchmark | Focus | Notes |
+|-----------|-------|-------|
+| [SWE-bench Verified](https://openai.com/index/introducing-swe-bench-verified/) | Bug fixing (Python) | Gold standard, 500 human-verified tasks |
+| [SWE-bench Multilingual](https://github.com/SWE-bench/SWE-bench) | Bug fixing (7 languages) | Java, TS, JS, Go, Rust, C, C++ |
+| [Terminal-Bench](https://www.tbench.ai/) | CLI workflows | Multi-step sandboxed terminal tasks |
+| [Aider Polyglot](https://aider.chat/docs/leaderboards/) | Code editing | 225 Exercism exercises, 6 languages |
+| [DPAI Arena](https://www.jetbrains.com/) | Full dev workflow | JetBrains: patch, test, review, analysis |
+| [HumanEval](https://github.com/openai/human-eval) | Function generation | 164 Python function tasks, largely saturated |
+
+## Initial Results (kimi-coding, 3 tasks)
+
+First run on 3 SWE-bench Lite tasks (all astropy):
+
+| Task | Status | Duration | LLM Time | Tool Calls | Fix |
+|------|--------|----------|----------|-------|-----|
+| `astropy__astropy-12907` | PATCHED | 141.9s | 125.1s | 30 | `_cstack`: `= 1` → `= right` |
+| `astropy__astropy-14182` | PATCHED | 192.0s | 166.9s | 56 | Added `header_rows` param to RST writer |
+| `astropy__astropy-14365` | PATCHED | 65.7s | 49.6s | 23 | `re.compile()` + `re.IGNORECASE` |
+
+All 3 tasks produced patches. PATCHED only means the agent produced a non-empty diff; pass/fail verdicts are pending the official Docker evaluation.