multica/docs/swe-bench.md
Jiayuan Zhang 10c57c0f7a docs: add SWE-bench runner guide
Covers the full pipeline: dataset download, agent execution,
result analysis, and official Docker evaluation. Includes
runner options, output format, known limitations, and initial
benchmark results.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 18:30:58 +08:00

SWE-bench: Agent Coding Benchmark

Run and evaluate the Multica agent against SWE-bench, the standard benchmark for AI coding agents. SWE-bench tasks are real GitHub issues from open-source Python projects — the agent must read the issue, explore the codebase, and produce a patch that fixes the bug.

Quick Start

# 1. Download dataset (requires: pip install datasets)
python scripts/swe-bench/download-dataset.py --dataset lite --limit 5

# 2. Run the agent
npx tsx scripts/swe-bench/run.ts --limit 5

# 3. Analyze results
npx tsx scripts/swe-bench/analyze.ts

Scripts

scripts/swe-bench/
├── download-dataset.py    # Download from HuggingFace → JSONL
├── run.ts                 # Core runner: Agent API → git diff → predictions
├── evaluate.sh            # Official Docker evaluation harness wrapper
├── analyze.ts             # Summarize run results
└── .gitignore             # Ignores downloaded datasets and output files

Pipeline

                                    ┌───────────────────┐
  HuggingFace ──download──► JSONL ──┤  For each task:   │
                                    │  1. git clone     │
                                    │  2. git checkout  │
                                    │  3. Agent.run()   │
                                    │  4. git diff      │
                                    └────────┬──────────┘
                                             │
                              predictions.jsonl (SWE-bench format)
                                             │
                              ┌───────────────┴───────────────┐
                              │  swebench.harness (Docker)    │
                              │  Apply patch → run tests      │
                              │  → pass/fail verdict          │
                              └───────────────────────────────┘

Dataset Variants

| Variant | Size | HuggingFace ID | Recommended For |
|---|---|---|---|
| Lite | 300 tasks | princeton-nlp/SWE-bench_Lite | Quick iteration, development |
| Verified | 500 tasks | princeton-nlp/SWE-bench_Verified | Official benchmarking, leaderboard |
| Full | 2,294 tasks | princeton-nlp/SWE-bench | Comprehensive evaluation |

# Download specific variant
python scripts/swe-bench/download-dataset.py --dataset verified
python scripts/swe-bench/download-dataset.py --dataset lite --limit 20
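
The download step can be sketched roughly as follows. This is a minimal illustration, not the actual download-dataset.py: it assumes the `datasets` package's `load_dataset` function and that each variant exposes a `test` split, and the helper names `write_jsonl` and `download` are hypothetical.

```python
import json
from pathlib import Path
from typing import Iterable, Mapping, Optional

# Variant name -> HuggingFace dataset ID, as listed for the three variants.
DATASET_IDS = {
    "lite": "princeton-nlp/SWE-bench_Lite",
    "verified": "princeton-nlp/SWE-bench_Verified",
    "full": "princeton-nlp/SWE-bench",
}


def write_jsonl(rows: Iterable[Mapping], path: Path, limit: Optional[int] = None) -> int:
    """Write up to `limit` rows as one JSON object per line; return the count."""
    count = 0
    with path.open("w", encoding="utf-8") as fh:
        for row in rows:
            if limit is not None and count >= limit:
                break
            fh.write(json.dumps(dict(row)) + "\n")
            count += 1
    return count


def download(variant: str, out_dir: Path, limit: Optional[int] = None) -> int:
    """Hypothetical equivalent of `--dataset <variant> --limit <n>`."""
    # Imported lazily so write_jsonl works without `datasets` installed.
    from datasets import load_dataset

    rows = load_dataset(DATASET_IDS[variant], split="test")
    return write_jsonl(rows, out_dir / f"{variant}.jsonl", limit)
```

Writing to `<variant>.jsonl` mirrors the runner's default dataset path (scripts/swe-bench/lite.jsonl), but the real script may name its output differently.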

Runner Options

npx tsx scripts/swe-bench/run.ts [options]

Options:
  --dataset PATH      JSONL dataset path          (default: scripts/swe-bench/lite.jsonl)
  --provider NAME     LLM provider                (default: kimi-coding)
  --model NAME        Model override
  --limit N           Max tasks to run             (default: all)
  --offset N          Skip first N tasks           (default: 0)
  --output PATH       Output predictions JSONL     (default: scripts/swe-bench/predictions.jsonl)
  --workdir PATH      Repo clone directory         (default: /tmp/swe-bench)
  --timeout MS        Per-task timeout             (default: 300000 = 5min)
  --instance ID       Run a single instance
  --debug             Enable debug logging

Examples

# Run 10 tasks with Anthropic Claude
npx tsx scripts/swe-bench/run.ts --limit 10 --provider anthropic

# Run a specific instance
npx tsx scripts/swe-bench/run.ts --instance "django__django-16379"

# Resume from task 50 with longer timeout
npx tsx scripts/swe-bench/run.ts --offset 50 --limit 10 --timeout 600000

# Compare providers (run separately, different output files)
npx tsx scripts/swe-bench/run.ts --provider kimi-coding --output scripts/swe-bench/pred-kimi.jsonl
npx tsx scripts/swe-bench/run.ts --provider anthropic   --output scripts/swe-bench/pred-claude.jsonl

How the Agent Solves Tasks

For each task, the runner:

  1. Clones the repository to /tmp/swe-bench/<instance_id>/ and checks out base_commit
  2. Creates an Agent with a focused system prompt and restricted tools (coding only — no web, no cron, no sessions)
  3. Runs the agent with the issue description as the prompt
  4. Collects git diff as the patch after the agent finishes
  5. Appends the prediction to predictions.jsonl in SWE-bench format

The agent has access to:

  • read, write, edit — file operations
  • exec, process — shell commands (for exploring code, running tests)
  • glob — file search

Tools explicitly denied: web_fetch, web_search, cron, data, sessions_spawn, sessions_list, memory_search, send_file.
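
Steps 1 and 4 of the runner loop might look like the sketch below. It is written in Python for illustration only (the real runner is run.ts) and assumes each task carries the SWE-bench fields `instance_id`, `repo` (for example astropy/astropy) and `base_commit`, and that repositories are cloned from GitHub. The function names are hypothetical.

```python
import subprocess
from typing import Dict, List


def checkout_commands(task: Dict[str, str], workdir: str = "/tmp/swe-bench") -> List[List[str]]:
    """Return the git commands that prepare one task's working tree."""
    repo_dir = f"{workdir.rstrip('/')}/{task['instance_id']}"
    url = f"https://github.com/{task['repo']}.git"
    return [
        ["git", "clone", "--quiet", url, repo_dir],
        ["git", "-C", repo_dir, "checkout", "--quiet", task["base_commit"]],
    ]


def prepare(task: Dict[str, str], workdir: str = "/tmp/swe-bench") -> None:
    """Run the clone and checkout commands, raising if either fails."""
    for argv in checkout_commands(task, workdir):
        subprocess.run(argv, check=True)


def collect_patch(repo_dir: str) -> str:
    """Return the working-tree changes as a unified diff (empty if none)."""
    # Plain `git diff` skips untracked files, so files the agent created
    # would need `git add -A` plus `git diff --cached` to be captured.
    result = subprocess.run(
        ["git", "-C", repo_dir, "diff"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```

Whether the actual runner stages new files before diffing is not stated here, so treat the comment in `collect_patch` as something to verify rather than a description of current behavior.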

Output Files

After a run, two files are produced:

predictions.jsonl — SWE-bench format

{"instance_id": "astropy__astropy-12907", "model_patch": "diff --git a/...", "model_name_or_path": "multica-kimi-coding"}

This file is the input to the official evaluation harness.
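
As an illustration of the format, a record like the one above could be built and appended as follows. The `multica-<provider>` naming is an assumption inferred from the example value "multica-kimi-coding", and both helper names are hypothetical.

```python
import json
from typing import Dict


def make_prediction(instance_id: str, patch: str, provider: str) -> Dict[str, str]:
    """Build one prediction record with the three SWE-bench keys."""
    return {
        "instance_id": instance_id,
        "model_patch": patch,
        # Assumed naming scheme, inferred from "multica-kimi-coding".
        "model_name_or_path": f"multica-{provider}",
    }


def append_prediction(path: str, prediction: Dict[str, str]) -> None:
    """Append the record to the predictions file as a single JSON line."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(prediction) + "\n")
```

Appending one line per task means a partial run still leaves a usable predictions file.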

predictions.results.jsonl — detailed run metrics

{
  "instance_id": "astropy__astropy-12907",
  "success": true,
  "patch": "diff --git a/...",
  "error": null,
  "duration_ms": 141892,
  "session_id": "019c60c7-52ac-702a-9b9c-dc53c0daea6b"
}

Analyzing Results

# Summary report
npx tsx scripts/swe-bench/analyze.ts

# Or specify a results file
npx tsx scripts/swe-bench/analyze.ts scripts/swe-bench/pred-kimi.results.jsonl

Output includes:

  • Patch rate (how many tasks produced a diff)
  • Duration statistics (avg/min/max)
  • Error breakdown
  • Per-repository stats
  • Slowest tasks
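
The metrics listed above can be derived from the results file along these lines. This is a sketch rather than the logic of analyze.ts: it assumes a task counts as patched when its `patch` field is non-empty, and that the repository name can be recovered from the `instance_id`. `repo_of`, `summarize` and `load_results` are hypothetical names.

```python
import json
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List


def repo_of(instance_id: str) -> str:
    """Map 'astropy__astropy-12907' to 'astropy/astropy'."""
    name, _, _ = instance_id.rpartition("-")
    return name.replace("__", "/", 1)


def summarize(results: Iterable[dict], slowest_n: int = 5) -> dict:
    """Compute patch rate, duration stats, error counts and per-repo totals."""
    rows: List[dict] = list(results)
    if not rows:
        return {"total": 0}
    seconds = [r["duration_ms"] / 1000 for r in rows]
    per_repo: Dict[str, Dict[str, int]] = defaultdict(lambda: {"total": 0, "patched": 0})
    errors: Dict[str, int] = defaultdict(int)
    patched = 0
    for r in rows:
        stats = per_repo[repo_of(r["instance_id"])]
        stats["total"] += 1
        if r.get("patch"):  # assumption: non-empty patch means "patched"
            stats["patched"] += 1
            patched += 1
        if r.get("error"):
            errors[str(r["error"])] += 1
    slowest = sorted(rows, key=lambda r: r["duration_ms"], reverse=True)[:slowest_n]
    return {
        "total": len(rows),
        "patch_rate": patched / len(rows),
        "duration_s": {"avg": mean(seconds), "min": min(seconds), "max": max(seconds)},
        "errors": dict(errors),
        "per_repo": dict(per_repo),
        "slowest": [r["instance_id"] for r in slowest],
    }


def load_results(path: str) -> List[dict]:
    """Read a *.results.jsonl file, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Note that patch rate only measures whether a diff was produced; it says nothing about whether the patch passes the tests.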

Run-Log Analysis

Each agent session writes a structured run-log.jsonl to ~/.super-multica/sessions/<session-id>/. This captures every LLM call, tool invocation, and timing:

# Find a session's run log
cat ~/.super-multica/sessions/<session-id>/run-log.jsonl | head -5

# Quick stats from a run log
cat ~/.super-multica/sessions/<session-id>/run-log.jsonl | python3 -c "
import json, sys
# One JSON event per line; skip blank lines
events = [json.loads(l) for l in sys.stdin if l.strip()]
# .get() tolerates records that lack an 'event' or 'duration_ms' field
tools = [e for e in events if e.get('event') == 'tool_start']
llm_ms = sum(e.get('duration_ms', 0) for e in events if e.get('event') == 'llm_result')
print(f'LLM time: {llm_ms/1000:.1f}s | Tool calls: {len(tools)}')
"
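
For repeated use, the same statistics can live in a small reusable function. The sketch below relies only on the `event` and `duration_ms` fields and the `tool_start` and `llm_result` event names used in the one-liner; any other fields in the run log are not assumed. `summarize_run_log` is a hypothetical name.

```python
import json
from collections import Counter
from typing import Iterable


def summarize_run_log(lines: Iterable[str]) -> dict:
    """Count events by type and total the LLM time from run-log lines."""
    events = [json.loads(line) for line in lines if line.strip()]
    counts = Counter(e.get("event", "unknown") for e in events)
    llm_ms = sum(
        e.get("duration_ms", 0) for e in events if e.get("event") == "llm_result"
    )
    return {
        "llm_seconds": llm_ms / 1000,
        "tool_calls": counts["tool_start"],
        "events": dict(counts),
    }
```

It accepts any iterable of lines, so an open file handle for run-log.jsonl can be passed directly.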

Official Evaluation (Docker)

The runner produces patches, but only the official SWE-bench harness determines pass/fail by applying the patch and running the project's test suite.

Prerequisites

  • Docker running (at least 120GB storage, 16GB RAM, 8 CPU cores)
  • pip install swebench

Run Evaluation

# Using the wrapper script
bash scripts/swe-bench/evaluate.sh

# Or directly
python -m swebench.harness.run_evaluation \
  --dataset_name princeton-nlp/SWE-bench_Lite \
  --predictions_path scripts/swe-bench/predictions.jsonl \
  --max_workers 4 \
  --run_id multica

Results are written to logs/ and evaluation_results/.
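
Because an evaluation run is expensive, it can be worth sanity-checking the predictions file first. The sketch below checks only the three keys shown in the predictions.jsonl example; it is not part of the official harness, and `check_predictions` is a hypothetical helper.

```python
import json
from typing import Iterable, List

REQUIRED_KEYS = ("instance_id", "model_patch", "model_name_or_path")


def check_predictions(lines: Iterable[str]) -> List[str]:
    """Return human-readable problems found in a predictions file."""
    problems: List[str] = []
    seen = set()
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {lineno}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(record, dict):
            problems.append(f"line {lineno}: expected a JSON object")
            continue
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            problems.append(f"line {lineno}: missing {', '.join(missing)}")
            continue
        instance_id = record["instance_id"]
        if instance_id in seen:
            problems.append(f"line {lineno}: duplicate instance_id {instance_id}")
        seen.add(instance_id)
        if not record["model_patch"]:
            problems.append(f"line {lineno}: empty patch for {instance_id}")
    return problems
```

An empty list means the file is structurally ready for the harness; duplicates and empty patches are flagged because they usually indicate a merge mistake or a task that produced no diff.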

Known Limitations and Improvements

Current Limitations

  1. No Docker isolation for agent execution: The agent runs on the host, so pip install and other commands affect the system Python. SWE-bench standard practice is to run each task in a Docker container.

  2. SMC_DATA_DIR timing: DATA_DIR is resolved once, at module import time, so setting SMC_DATA_DIR after the process has started has no effect and sessions are still written to ~/.super-multica/sessions/. To isolate benchmark sessions, set the variable before the process starts:

    SMC_DATA_DIR=~/.swe-bench-eval npx tsx scripts/swe-bench/run.ts --limit 5
    
  3. Sequential execution: Tasks run one at a time. For large-scale runs, launch multiple processes with --offset/--limit to parallelize:

    # Run 4 workers in parallel
    npx tsx scripts/swe-bench/run.ts --offset 0   --limit 75 --output pred-0.jsonl &
    npx tsx scripts/swe-bench/run.ts --offset 75  --limit 75 --output pred-1.jsonl &
    npx tsx scripts/swe-bench/run.ts --offset 150 --limit 75 --output pred-2.jsonl &
    npx tsx scripts/swe-bench/run.ts --offset 225 --limit 75 --output pred-3.jsonl &
    wait
    # Merge predictions only: pred-*.jsonl would also match any pred-N.results.jsonl metrics files
    cat pred-[0-3].jsonl > predictions.jsonl
    
  4. Repo cloning per instance: Each instance clones the full repo. For repos with many tasks (e.g., astropy, django), a shared clone with git worktree would be faster.
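
The --offset/--limit split from item 3 can be computed instead of written by hand. This is a small sketch using only the runner flags documented above; `plan_shards` and `shard_commands` are hypothetical helpers, and the pred-N.jsonl output names follow the example in item 3.

```python
from typing import List, Tuple


def plan_shards(total: int, workers: int) -> List[Tuple[int, int]]:
    """Split `total` tasks into contiguous (offset, limit) pairs, one per worker."""
    if total < 0 or workers < 1:
        raise ValueError("total must be >= 0 and workers >= 1")
    base, extra = divmod(total, workers)
    shards: List[Tuple[int, int]] = []
    offset = 0
    for index in range(workers):
        # The first `extra` workers take one additional task each.
        limit = base + (1 if index < extra else 0)
        if limit == 0:
            break
        shards.append((offset, limit))
        offset += limit
    return shards


def shard_commands(total: int, workers: int) -> List[str]:
    """Render one runner invocation per shard."""
    return [
        f"npx tsx scripts/swe-bench/run.ts --offset {offset} --limit {limit} "
        f"--output pred-{index}.jsonl"
        for index, (offset, limit) in enumerate(plan_shards(total, workers))
    ]
```

For the 300-task Lite set and 4 workers this yields the same 75-task shards as the manual example.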

Potential Improvements

  • Docker-per-task: Run each agent in a Docker container matching the SWE-bench environment spec (correct Python version, pre-installed dependencies)
  • Shared repo pool: Clone each unique repo once, use git worktree for per-task isolation
  • Cost tracking: Parse run-log token counts for per-task and aggregate cost estimates
  • Multi-turn retries: If the agent produces no patch, retry with feedback
  • System prompt tuning: The current prompt is minimal; more detailed guidance (e.g., "search for related test files to understand expected behavior") could improve solve rate
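
The shared repo pool idea could be sketched as follows. Nothing here exists in the current runner: the pool directory layout and all function names are hypothetical, and only the standard `git worktree add` and `git worktree remove` subcommands are assumed.

```python
from typing import Dict, List


def shared_clone_path(repo: str, pool_dir: str = "/tmp/swe-bench/_repos") -> str:
    """One clone per repository, e.g. 'astropy/astropy' -> '<pool>/astropy__astropy'."""
    return f"{pool_dir}/{repo.replace('/', '__')}"


def worktree_add_command(
    task: Dict[str, str],
    workdir: str = "/tmp/swe-bench",
    pool_dir: str = "/tmp/swe-bench/_repos",
) -> List[str]:
    """Check out base_commit into a per-task directory that shares the clone's objects."""
    shared = shared_clone_path(task["repo"], pool_dir)
    tree = f"{workdir}/{task['instance_id']}"
    return ["git", "-C", shared, "worktree", "add", "--detach", tree, task["base_commit"]]


def worktree_remove_command(
    task: Dict[str, str],
    workdir: str = "/tmp/swe-bench",
    pool_dir: str = "/tmp/swe-bench/_repos",
) -> List[str]:
    """Discard the per-task directory, including the agent's uncommitted edits."""
    shared = shared_clone_path(task["repo"], pool_dir)
    tree = f"{workdir}/{task['instance_id']}"
    return ["git", "-C", shared, "worktree", "remove", "--force", tree]
```

Each repository would still need to be cloned once into the pool before the first worktree is added; after that, per-task setup avoids a full clone.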

| Benchmark | Focus | Notes |
|---|---|---|
| SWE-bench Verified | Bug fixing (Python) | Gold standard, 500 human-verified tasks |
| SWE-bench Multilingual | Bug fixing (7 languages) | Java, TS, JS, Go, Rust, C, C++ |
| Terminal-Bench | CLI workflows | Multi-step sandboxed terminal tasks |
| Aider Polyglot | Code editing | 225 Exercism exercises, 6 languages |
| DPAI Arena | Full dev workflow | JetBrains: patch, test, review, analysis |
| HumanEval | Function generation | 164 Python function tasks, largely saturated |

Initial Results (kimi-coding, 3 tasks)

First run on 3 SWE-bench Lite tasks (all astropy):

| Task | Status | Duration | LLM Time | Tool Calls | Fix |
|---|---|---|---|---|---|
| astropy__astropy-12907 | PATCHED | 141.9s | 125.1s | 30 | _cstack: = 1 → = right |
| astropy__astropy-14182 | PATCHED | 192.0s | 166.9s | 56 | Added header_rows param to RST writer |
| astropy__astropy-14365 | PATCHED | 65.7s | 49.6s | 23 | re.compile() + re.IGNORECASE |

3/3 tasks produced patches. Formal evaluation pending (requires Docker harness).