Covers the full pipeline: dataset download, agent execution, result analysis, and official Docker evaluation. Includes runner options, output format, known limitations, and initial benchmark results. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
SWE-bench: Agent Coding Benchmark
Run and evaluate the Multica agent against SWE-bench, the standard benchmark for AI coding agents. SWE-bench tasks are real GitHub issues from open-source Python projects — the agent must read the issue, explore the codebase, and produce a patch that fixes the bug.
Quick Start
# 1. Download dataset (requires: pip install datasets)
python scripts/swe-bench/download-dataset.py --dataset lite --limit 5
# 2. Run the agent
npx tsx scripts/swe-bench/run.ts --limit 5
# 3. Analyze results
npx tsx scripts/swe-bench/analyze.ts
Scripts
scripts/swe-bench/
├── download-dataset.py # Download from HuggingFace → JSONL
├── run.ts # Core runner: Agent API → git diff → predictions
├── evaluate.sh # Official Docker evaluation harness wrapper
├── analyze.ts # Summarize run results
└── .gitignore # Ignores downloaded datasets and output files
Pipeline
                                  ┌──────────────────┐
HuggingFace ──download──► JSONL ──┤ For each task:   │
                                  │ 1. git clone     │
                                  │ 2. git checkout  │
                                  │ 3. Agent.run()   │
                                  │ 4. git diff      │
                                  └────────┬─────────┘
                                           │
                         predictions.jsonl (SWE-bench format)
                                           │
                           ┌───────────────┴───────────────┐
                           │ swebench.harness (Docker)     │
                           │ Apply patch → run tests       │
                           │ → pass/fail verdict           │
                           └───────────────────────────────┘
Dataset Variants
| Variant | Size | HuggingFace ID | Recommended For |
|---|---|---|---|
| Lite | 300 tasks | princeton-nlp/SWE-bench_Lite | Quick iteration, development |
| Verified | 500 tasks | princeton-nlp/SWE-bench_Verified | Official benchmarking, leaderboard |
| Full | ~2294 tasks | princeton-nlp/SWE-bench | Comprehensive evaluation |
# Download specific variant
python scripts/swe-bench/download-dataset.py --dataset verified
python scripts/swe-bench/download-dataset.py --dataset lite --limit 20
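For orientation, the download step can be sketched roughly as follows. This is not the contents of download-dataset.py: VARIANTS, write_jsonl, and download are hypothetical names, and the sketch assumes the tasks live in each dataset's test split and are loaded with load_dataset from the datasets package.

```python
import json
from itertools import islice

# Variant name -> HuggingFace dataset ID, taken from the table above.
VARIANTS = {
    "lite": "princeton-nlp/SWE-bench_Lite",
    "verified": "princeton-nlp/SWE-bench_Verified",
    "full": "princeton-nlp/SWE-bench",
}


def write_jsonl(rows, path, limit=None):
    """Write up to `limit` rows (all rows if None) as one JSON object per line."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for row in islice(rows, limit):
            f.write(json.dumps(dict(row)) + "\n")
            count += 1
    return count


def download(variant, path, limit=None):
    """Fetch one variant and store it as JSONL (assumes the `test` split)."""
    # Imported lazily so write_jsonl stays usable without `datasets` installed.
    from datasets import load_dataset

    rows = load_dataset(VARIANTS[variant], split="test")
    return write_jsonl(rows, path, limit)
```

Under these assumptions, download("lite", "scripts/swe-bench/lite.jsonl", limit=20) would correspond to the second command above.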
Runner Options
npx tsx scripts/swe-bench/run.ts [options]
Options:
--dataset PATH JSONL dataset path (default: scripts/swe-bench/lite.jsonl)
--provider NAME LLM provider (default: kimi-coding)
--model NAME Model override
--limit N Max tasks to run (default: all)
--offset N Skip first N tasks (default: 0)
--output PATH Output predictions JSONL (default: scripts/swe-bench/predictions.jsonl)
--workdir PATH Repo clone directory (default: /tmp/swe-bench)
--timeout MS Per-task timeout (default: 300000 = 5min)
--instance ID Run a single instance
--debug Enable debug logging
Examples
# Run 10 tasks with Anthropic Claude
npx tsx scripts/swe-bench/run.ts --limit 10 --provider anthropic
# Run a specific instance
npx tsx scripts/swe-bench/run.ts --instance "django__django-16379"
# Resume from task 50 with longer timeout
npx tsx scripts/swe-bench/run.ts --offset 50 --limit 10 --timeout 600000
# Compare providers (run separately, different output files)
npx tsx scripts/swe-bench/run.ts --provider kimi-coding --output scripts/swe-bench/pred-kimi.jsonl
npx tsx scripts/swe-bench/run.ts --provider anthropic --output scripts/swe-bench/pred-claude.jsonl
How the Agent Solves Tasks
For each task, the runner:
- Clones the repository to /tmp/swe-bench/<instance_id>/ and checks out base_commit
- Creates an Agent with a focused system prompt and restricted tools (coding only: no web, no cron, no sessions)
- Runs the agent with the issue description as the prompt
- Collects git diff as the patch after the agent finishes
- Appends the prediction to predictions.jsonl in SWE-bench format
The agent has access to:
- read, write, edit: file operations
- exec, process: shell commands (for exploring code, running tests)
- glob: file search
Tools explicitly denied: web_fetch, web_search, cron, data, sessions_spawn, sessions_list, memory_search, send_file.
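The git steps that bracket the agent call can be sketched as plain command lists. This is a minimal illustration, not the code in run.ts: git_commands is a hypothetical helper, and it assumes the dataset's repo, base_commit, and instance_id fields plus an HTTPS clone from GitHub.

```python
def git_commands(task, workdir="/tmp/swe-bench"):
    """Return (setup, collect) argv lists for one SWE-bench task.

    `setup` runs before the agent starts, `collect` after it finishes.
    """
    repo_dir = f"{workdir.rstrip('/')}/{task['instance_id']}"
    setup = [
        # Assumption: repositories are cloned from GitHub over HTTPS.
        ["git", "clone", f"https://github.com/{task['repo']}.git", repo_dir],
        ["git", "-C", repo_dir, "checkout", task["base_commit"]],
    ]
    # The agent's edits are the uncommitted changes relative to base_commit.
    collect = ["git", "-C", repo_dir, "diff"]
    return setup, collect
```

Each list can be passed to subprocess.run(cmd, check=True). Note that plain git diff omits untracked files, so a fix that only adds new files would yield an empty patch unless those files are staged first.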
Output Files
After a run, two files are produced:
predictions.jsonl — SWE-bench format
{"instance_id": "astropy__astropy-12907", "model_patch": "diff --git a/...", "model_name_or_path": "multica-kimi-coding"}
This file is the input to the official evaluation harness.
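Before handing the file to the harness, a quick sanity check can catch malformed lines. The sketch below only verifies the three keys shown in the example above and flags repeated instance IDs; check_predictions is a hypothetical helper, not part of the scripts.

```python
import json

REQUIRED_KEYS = ("instance_id", "model_patch", "model_name_or_path")


def check_predictions(lines):
    """Return a list of (line_number, problem) for malformed prediction lines."""
    problems = []
    seen = set()
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append((number, "not valid JSON"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
            continue
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            problems.append((number, "missing " + ", ".join(missing)))
            continue
        if record["instance_id"] in seen:
            problems.append((number, "duplicate instance_id"))
        seen.add(record["instance_id"])
    return problems
```

Usage: check_predictions(open("scripts/swe-bench/predictions.jsonl", encoding="utf-8")) returns an empty list when every line is well-formed.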
predictions.results.jsonl — detailed run metrics
{
"instance_id": "astropy__astropy-12907",
"success": true,
"patch": "diff --git a/...",
"error": null,
"duration_ms": 141892,
"session_id": "019c60c7-52ac-702a-9b9c-dc53c0daea6b"
}
Analyzing Results
# Summary report
npx tsx scripts/swe-bench/analyze.ts
# Or specify a results file
npx tsx scripts/swe-bench/analyze.ts scripts/swe-bench/pred-kimi.results.jsonl
Output includes:
- Patch rate (how many tasks produced a diff)
- Duration statistics (avg/min/max)
- Error breakdown
- Per-repository stats
- Slowest tasks
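The headline numbers can be reproduced from predictions.results.jsonl with a few lines of Python. This is a rough sketch rather than a port of analyze.ts: summarize is a hypothetical name, it treats a task as patched when patch is a non-empty string, and it derives the repository from the part of instance_id before the last hyphen.

```python
import json
from collections import Counter


def summarize(lines):
    """Aggregate predictions.results.jsonl records into headline numbers."""
    records = [json.loads(line) for line in lines if line.strip()]
    if not records:
        return {"total": 0}
    # Assumption: a task counts as patched when `patch` is a non-empty string.
    patched = sum(1 for r in records if r.get("patch"))
    durations = [r["duration_ms"] / 1000 for r in records]
    # "astropy__astropy-12907" -> "astropy__astropy"
    repos = Counter(r["instance_id"].rsplit("-", 1)[0] for r in records)
    errors = Counter(r["error"] for r in records if r.get("error"))
    return {
        "total": len(records),
        "patch_rate": patched / len(records),
        "avg_s": sum(durations) / len(durations),
        "min_s": min(durations),
        "max_s": max(durations),
        "per_repo": dict(repos),
        "errors": dict(errors),
    }
```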
Run-Log Analysis
Each agent session writes a structured run-log.jsonl to ~/.super-multica/sessions/<session-id>/. This captures every LLM call, tool invocation, and timing:
# Find a session's run log
cat ~/.super-multica/sessions/<session-id>/run-log.jsonl | head -5
# Quick stats from a run log
cat ~/.super-multica/sessions/<session-id>/run-log.jsonl | python3 -c "
import json, sys
events = [json.loads(l) for l in sys.stdin if l.strip()]
tools = [e for e in events if e['event'] == 'tool_start']
llm_ms = sum(e.get('duration_ms', 0) for e in events if e['event'] == 'llm_result')
print(f'LLM time: {llm_ms/1000:.1f}s | Tool calls: {len(tools)}')
"
Official Evaluation (Docker)
The runner produces patches, but only the official SWE-bench harness determines pass/fail by applying the patch and running the project's test suite.
Prerequisites
- Docker running (at least 120GB storage, 16GB RAM, 8 CPU cores)
- pip install swebench
Run Evaluation
# Using the wrapper script
bash scripts/swe-bench/evaluate.sh
# Or directly
python -m swebench.harness.run_evaluation \
--dataset_name princeton-nlp/SWE-bench_Lite \
--predictions_path scripts/swe-bench/predictions.jsonl \
--max_workers 4 \
--run_id multica
Results are written to logs/ and evaluation_results/.
Known Limitations and Improvements
Current Limitations
- No Docker isolation for agent execution: The agent runs on the host, so pip install and other commands affect the system Python. SWE-bench standard practice is to run each task in a Docker container.
- SMC_DATA_DIR timing: Setting SMC_DATA_DIR at runtime doesn't affect DATA_DIR, which is resolved at module import time, so sessions still write to ~/.super-multica/sessions/. To isolate them, set the env var before the process starts:
    SMC_DATA_DIR=~/.swe-bench-eval npx tsx scripts/swe-bench/run.ts --limit 5
- Sequential execution: Tasks run one at a time. For large-scale runs, launch multiple processes with --offset/--limit to parallelize:
    # Run 4 workers in parallel
    npx tsx scripts/swe-bench/run.ts --offset 0 --limit 75 --output pred-0.jsonl &
    npx tsx scripts/swe-bench/run.ts --offset 75 --limit 75 --output pred-1.jsonl &
    npx tsx scripts/swe-bench/run.ts --offset 150 --limit 75 --output pred-2.jsonl &
    npx tsx scripts/swe-bench/run.ts --offset 225 --limit 75 --output pred-3.jsonl &
    wait
    # Merge only the prediction files; pred-*.jsonl would also match pred-N.results.jsonl
    cat pred-[0-3].jsonl > predictions.jsonl
- Repo cloning per instance: Each instance clones the full repo. For repos with many tasks (e.g., astropy, django), a shared clone with git worktree would be faster.
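The --offset/--limit pairs for a parallel run do not have to be worked out by hand. A small sketch, where plan_shards is a hypothetical helper and not part of the scripts, splits a task count into contiguous shards and prints one runner command per worker:

```python
def plan_shards(total, workers):
    """Split `total` tasks into contiguous (offset, limit) pairs, one per worker."""
    if total < 0 or workers < 1:
        raise ValueError("total must be >= 0 and workers >= 1")
    base, extra = divmod(total, workers)
    shards = []
    offset = 0
    for index in range(workers):
        # The first `extra` workers take one more task so every task is covered.
        limit = base + (1 if index < extra else 0)
        if limit:
            shards.append((offset, limit))
        offset += limit
    return shards


# 300 Lite tasks across 4 workers gives four shards of 75 tasks each.
for index, (offset, limit) in enumerate(plan_shards(300, 4)):
    print(f"npx tsx scripts/swe-bench/run.ts --offset {offset} --limit {limit} "
          f"--output pred-{index}.jsonl &")
```

When the task count is not divisible by the worker count, the leftover tasks go to the first shards, so no task is dropped.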
Potential Improvements
- Docker-per-task: Run each agent in a Docker container matching the SWE-bench environment spec (correct Python version, pre-installed dependencies)
- Shared repo pool: Clone each unique repo once, use git worktree for per-task isolation
- Cost tracking: Parse run-log token counts for per-task and aggregate cost estimates
- Multi-turn retries: If the agent produces no patch, retry with feedback
- System prompt tuning: The current prompt is minimal; more detailed guidance (e.g., "search for related test files to understand expected behavior") could improve solve rate
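The shared repo pool idea could look roughly like this. It is a sketch of one possible layout, not an existing feature: worktree_plan and the pool directory are hypothetical, and the clone URL assumes GitHub over HTTPS.

```python
def worktree_plan(tasks, pool="/tmp/swe-bench/pool", workdir="/tmp/swe-bench"):
    """Return git argv lists: one clone per repo, one worktree per task."""
    commands = []
    cloned = set()
    for task in tasks:
        # "astropy/astropy" -> "/tmp/swe-bench/pool/astropy__astropy"
        clone_dir = f"{pool}/{task['repo'].replace('/', '__')}"
        if task["repo"] not in cloned:
            cloned.add(task["repo"])
            commands.append(
                ["git", "clone", f"https://github.com/{task['repo']}.git", clone_dir]
            )
        # --detach checks out base_commit without creating a branch.
        commands.append([
            "git", "-C", clone_dir, "worktree", "add", "--detach",
            f"{workdir}/{task['instance_id']}", task["base_commit"],
        ])
    return commands
```

With many tasks per repository, the clone cost is paid once per repository and each task still gets its own working directory.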
Related Benchmarks
| Benchmark | Focus | Notes |
|---|---|---|
| SWE-bench Verified | Bug fixing (Python) | Gold standard, 500 human-verified tasks |
| Multi-SWE-bench | Bug fixing (7 languages) | Java, TS, JS, Go, Rust, C, C++ |
| Terminal-Bench | CLI workflows | Multi-step sandboxed terminal tasks |
| Aider Polyglot | Code editing | 225 Exercism exercises, 6 languages |
| DPAI Arena | Full dev workflow | JetBrains: patch, test, review, analysis |
| HumanEval | Function generation | 164 Python function tasks, largely saturated |
Initial Results (kimi-coding, 3 tasks)
First run on 3 SWE-bench Lite tasks (all astropy):
| Task | Status | Duration | LLM Time | Tools | Fix |
|---|---|---|---|---|---|
| astropy__astropy-12907 | PATCHED | 141.9s | 125.1s | 30 | _cstack: = 1 → = right |
| astropy__astropy-14182 | PATCHED | 192.0s | 166.9s | 56 | Added header_rows param to RST writer |
| astropy__astropy-14365 | PATCHED | 65.7s | 49.6s | 23 | re.compile() + re.IGNORECASE |
3/3 tasks produced patches. Formal evaluation pending (requires Docker harness).