# SWE-bench: Agent Coding Benchmark

Run and evaluate the Multica agent against [SWE-bench](https://www.swebench.com/), the standard benchmark for AI coding agents. SWE-bench tasks are real GitHub issues from open-source Python projects — the agent must read the issue, explore the codebase, and produce a patch that fixes the bug.

## Quick Start

```bash
# 1. Download dataset (requires: pip install datasets)
python scripts/swe-bench/download-dataset.py --dataset lite --limit 5

# 2. Run the agent
npx tsx scripts/swe-bench/run.ts --limit 5

# 3. Analyze results
npx tsx scripts/swe-bench/analyze.ts
```

## Scripts

```
scripts/swe-bench/
├── download-dataset.py    # Download from HuggingFace → JSONL
├── run.ts                 # Core runner: Agent API → git diff → predictions
├── evaluate.sh            # Official Docker evaluation harness wrapper
├── analyze.ts             # Summarize run results
└── .gitignore             # Ignores downloaded datasets and output files
```

## Pipeline

```
                                    ┌───────────────────┐
  HuggingFace ──download──► JSONL ──┤  For each task:   │
                                    │  1. git clone     │
                                    │  2. git checkout  │
                                    │  3. Agent.run()   │
                                    │  4. git diff      │
                                    └────────┬──────────┘
                                             │
                              predictions.jsonl (SWE-bench format)
                                             │
                              ┌───────────────┴───────────────┐
                              │  swebench.harness (Docker)    │
                              │  Apply patch → run tests      │
                              │  → pass/fail verdict          │
                              └───────────────────────────────┘
```

## Dataset Variants

| Variant | Size | HuggingFace ID | Recommended For |
|---------|------|----------------|-----------------|
| **Lite** | 300 tasks | `princeton-nlp/SWE-bench_Lite` | Quick iteration, development |
| **Verified** | 500 tasks | `princeton-nlp/SWE-bench_Verified` | Official benchmarking, leaderboard |
| **Full** | 2,294 tasks | `princeton-nlp/SWE-bench` | Comprehensive evaluation |

```bash
# Download specific variant
python scripts/swe-bench/download-dataset.py --dataset verified
python scripts/swe-bench/download-dataset.py --dataset lite --limit 20
```
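
As a rough illustration of the "HuggingFace → JSONL" step, the sketch below loads one variant with the `datasets` library and writes one JSON object per line. It is not the contents of `download-dataset.py`: the helper names `write_jsonl` and `download` are hypothetical, and the use of the `test` split is an assumption.

```python
import json

# HuggingFace IDs from the table above
DATASETS = {
    "lite": "princeton-nlp/SWE-bench_Lite",
    "verified": "princeton-nlp/SWE-bench_Verified",
    "full": "princeton-nlp/SWE-bench",
}


def write_jsonl(rows, path, limit=None):
    """Write up to `limit` rows as one JSON object per line; return the count written."""
    count = 0
    with open(path, "w", encoding="utf-8") as handle:
        for row in rows:
            if limit is not None and count >= limit:
                break
            handle.write(json.dumps(dict(row)) + "\n")
            count += 1
    return count


def download(variant, path, limit=None):
    """Hypothetical helper: fetch a variant from HuggingFace and save it as JSONL."""
    from datasets import load_dataset  # requires: pip install datasets

    rows = load_dataset(DATASETS[variant], split="test")  # assumption: test split
    return write_jsonl(rows, path, limit)
```

Calling `download("lite", "lite.jsonl", limit=5)` would then correspond roughly to `--dataset lite --limit 5`.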

## Runner Options

```bash
npx tsx scripts/swe-bench/run.ts [options]

Options:
  --dataset PATH      JSONL dataset path          (default: scripts/swe-bench/lite.jsonl)
  --provider NAME     LLM provider                (default: kimi-coding)
  --model NAME        Model override
  --limit N           Max tasks to run             (default: all)
  --offset N          Skip first N tasks           (default: 0)
  --output PATH       Output predictions JSONL     (default: scripts/swe-bench/predictions.jsonl)
  --workdir PATH      Repo clone directory         (default: /tmp/swe-bench)
  --timeout MS        Per-task timeout             (default: 300000 = 5min)
  --instance ID       Run a single instance
  --debug             Enable debug logging
```

### Examples

```bash
# Run 10 tasks with Anthropic Claude
npx tsx scripts/swe-bench/run.ts --limit 10 --provider anthropic

# Run a specific instance
npx tsx scripts/swe-bench/run.ts --instance "django__django-16379"

# Skip the first 50 tasks, then run the next 10 with a 10-minute timeout
npx tsx scripts/swe-bench/run.ts --offset 50 --limit 10 --timeout 600000

# Compare providers (run separately, different output files)
npx tsx scripts/swe-bench/run.ts --provider kimi-coding --output scripts/swe-bench/pred-kimi.jsonl
npx tsx scripts/swe-bench/run.ts --provider anthropic   --output scripts/swe-bench/pred-claude.jsonl
```

## How the Agent Solves Tasks

For each task, the runner:

1. **Clones the repository** to `/tmp/swe-bench/<instance_id>/` and checks out `base_commit`
2. **Creates an Agent** with a focused system prompt and restricted tools (coding only — no web, no cron, no sessions)
3. **Runs the agent** with the issue description as the prompt
4. **Collects `git diff`** as the patch after the agent finishes
5. **Appends** the prediction to `predictions.jsonl` in SWE-bench format

The agent has access to:
- `read`, `write`, `edit` — file operations
- `exec`, `process` — shell commands (for exploring code, running tests)
- `glob` — file search

Tools explicitly denied: `web_fetch`, `web_search`, `cron`, `data`, `sessions_spawn`, `sessions_list`, `memory_search`, `send_file`.
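
The five steps above can be sketched in Python as follows. This is an illustration of the flow, not the code in `run.ts`: `solve_task`, `run_agent`, and `git_cmd` are hypothetical names, the clone URL pattern is an assumption, and `repo`, `base_commit`, and `problem_statement` are the field names used by the SWE-bench dataset.

```python
import subprocess
from pathlib import Path
from typing import List, Optional


def git(args: List[str], cwd: Optional[Path] = None) -> str:
    """Run a git command and return its stdout; raises if git exits non-zero."""
    done = subprocess.run(
        ["git", *args], cwd=cwd, check=True, capture_output=True, text=True
    )
    return done.stdout


def solve_task(task, workdir, run_agent, model_name, git_cmd=git):
    """Clone, check out, run the agent, and return a SWE-bench prediction dict."""
    repo_dir = Path(workdir) / task["instance_id"]

    # 1. Clone the repository and pin it to the task's base commit
    git_cmd(["clone", f"https://github.com/{task['repo']}.git", str(repo_dir)])
    git_cmd(["checkout", task["base_commit"]], cwd=repo_dir)

    # 2-3. Hypothetical agent entry point: edits files inside repo_dir in place
    run_agent(task["problem_statement"], repo_dir)

    # 4. Collect the changes; plain `git diff` does not include untracked files
    patch = git_cmd(["diff"], cwd=repo_dir)

    # 5. One record in the predictions.jsonl shape
    return {
        "instance_id": task["instance_id"],
        "model_patch": patch,
        "model_name_or_path": model_name,
    }
```

Passing `git_cmd` as a parameter keeps the flow easy to exercise without network access, for example by substituting a stub that records the commands it receives.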

## Output Files

After a run, two files are produced:

### `predictions.jsonl` — SWE-bench format

```json
{"instance_id": "astropy__astropy-12907", "model_patch": "diff --git a/...", "model_name_or_path": "multica-kimi-coding"}
```

This file is the input to the official evaluation harness.
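
Before handing the file to the harness, a quick sanity check can catch malformed lines early. The sketch below is a hypothetical helper (`check_predictions` is not part of the scripts) that only verifies the three keys shown above and flags duplicate instance IDs; the duplicate check is an assumption about what you would want, not a documented harness requirement.

```python
import json

REQUIRED_KEYS = ("instance_id", "model_patch", "model_name_or_path")


def check_predictions(lines):
    """Return a list of (line_number, problem) for lines that look malformed."""
    problems = []
    seen = set()
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
            continue
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            problems.append((number, "missing keys: " + ", ".join(missing)))
            continue
        if record["instance_id"] in seen:
            problems.append((number, f"duplicate instance_id: {record['instance_id']}"))
        seen.add(record["instance_id"])
    return problems


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], encoding="utf-8") as handle:
        for number, problem in check_predictions(handle):
            print(f"line {number}: {problem}")
```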

### `predictions.results.jsonl` — detailed run metrics

```json
{
  "instance_id": "astropy__astropy-12907",
  "success": true,
  "patch": "diff --git a/...",
  "error": null,
  "duration_ms": 141892,
  "session_id": "019c60c7-52ac-702a-9b9c-dc53c0daea6b"
}
```

## Analyzing Results

```bash
# Summary report
npx tsx scripts/swe-bench/analyze.ts

# Or specify a results file
npx tsx scripts/swe-bench/analyze.ts scripts/swe-bench/pred-kimi.results.jsonl
```

Output includes:
- Patch rate (how many tasks produced a diff)
- Duration statistics (avg/min/max)
- Error breakdown
- Per-repository stats
- Slowest tasks
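
To make these metrics concrete, here is a minimal sketch of how they could be derived from `predictions.results.jsonl` records. It is not the implementation of `analyze.ts`: the function names are hypothetical, a task is assumed to count as patched when its `patch` field is non-empty, and the repository name is assumed to be recoverable from the `<owner>__<repo>-<number>` shape of `instance_id`.

```python
from collections import Counter
from statistics import mean


def repo_of(instance_id):
    """'astropy__astropy-12907' -> 'astropy/astropy' (assumed ID convention)."""
    return instance_id.rsplit("-", 1)[0].replace("__", "/")


def summarize(results, top=5):
    """Compute patch rate, duration stats, errors, per-repo counts, slowest tasks."""
    total = len(results)
    patched = sum(1 for r in results if r.get("patch"))
    seconds = [r["duration_ms"] / 1000 for r in results]
    slowest = sorted(results, key=lambda r: r["duration_ms"], reverse=True)[:top]
    return {
        "total": total,
        "patched": patched,
        "patch_rate": patched / total if total else 0.0,
        "duration_s": (
            {"avg": mean(seconds), "min": min(seconds), "max": max(seconds)}
            if seconds
            else None
        ),
        "errors": Counter(r["error"] for r in results if r.get("error")),
        "per_repo": Counter(repo_of(r["instance_id"]) for r in results),
        "slowest": [r["instance_id"] for r in slowest],
    }
```

Feeding it the parsed lines of a results file (one `json.loads` per line) yields a dictionary that mirrors the report sections listed above.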

### Run-Log Analysis

Each agent session writes a structured `run-log.jsonl` to `~/.super-multica/sessions/<session-id>/`. This captures every LLM call, tool invocation, and timing:

```bash
# Show the first five events of a session's run log
head -5 ~/.super-multica/sessions/<session-id>/run-log.jsonl

# Quick stats from a run log
cat ~/.super-multica/sessions/<session-id>/run-log.jsonl | python3 -c "
import json, sys
events = [json.loads(line) for line in sys.stdin if line.strip()]
# .get() tolerates events that lack an 'event' or 'duration_ms' field
tools = [e for e in events if e.get('event') == 'tool_start']
llm_ms = sum(e.get('duration_ms', 0) for e in events if e.get('event') == 'llm_result')
print(f'LLM time: {llm_ms/1000:.1f}s | Tool calls: {len(tools)}')
"
```

## Official Evaluation (Docker)

The runner produces patches, but **only the official SWE-bench harness determines pass/fail** by applying the patch and running the project's test suite.

### Prerequisites

- Docker running (at least 120GB storage, 16GB RAM, 8 CPU cores)
- `pip install swebench`

### Run Evaluation

```bash
# Using the wrapper script
bash scripts/swe-bench/evaluate.sh

# Or directly
python -m swebench.harness.run_evaluation \
  --dataset_name princeton-nlp/SWE-bench_Lite \
  --predictions_path scripts/swe-bench/predictions.jsonl \
  --max_workers 4 \
  --run_id multica
```

Results are written to `logs/` and `evaluation_results/`.

## Known Limitations and Improvements

### Current Limitations

1. **No Docker isolation for agent execution**: The agent runs on the host, so `pip install` and other commands affect the system Python. SWE-bench standard practice is to run each task in a Docker container.

2. **`SMC_DATA_DIR` timing**: Setting `SMC_DATA_DIR` at runtime doesn't affect `DATA_DIR` (resolved at module import time). Sessions currently write to `~/.super-multica/sessions/`. To isolate, set the env var before the process starts:
   ```bash
   SMC_DATA_DIR=~/.swe-bench-eval npx tsx scripts/swe-bench/run.ts --limit 5
   ```

3. **Sequential execution**: Tasks run one at a time. For large-scale runs, launch multiple processes with `--offset`/`--limit` to parallelize:
   ```bash
   # Run 4 workers in parallel
   npx tsx scripts/swe-bench/run.ts --offset 0   --limit 75 --output pred-0.jsonl &
   npx tsx scripts/swe-bench/run.ts --offset 75  --limit 75 --output pred-1.jsonl &
   npx tsx scripts/swe-bench/run.ts --offset 150 --limit 75 --output pred-2.jsonl &
   npx tsx scripts/swe-bench/run.ts --offset 225 --limit 75 --output pred-3.jsonl &
   wait
   # Merge only the prediction files; pred-*.jsonl would also match pred-N.results.jsonl
   cat pred-[0-3].jsonl > predictions.jsonl
   ```

4. **Repo cloning per instance**: Each instance clones the full repo. For repos with many tasks (e.g., astropy, django), a shared clone with `git worktree` would be faster.
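
The `--offset`/`--limit` pairs in the parallel example under item 3 come from splitting the task list into contiguous shards. A small sketch of that arithmetic, using a hypothetical helper `shard_ranges` that also handles totals not divisible by the worker count:

```python
def shard_ranges(total, workers):
    """Split `total` tasks into contiguous (offset, limit) pairs, one per worker."""
    base, extra = divmod(total, workers)
    ranges = []
    offset = 0
    for index in range(workers):
        # The first `extra` workers take one additional task each
        limit = base + (1 if index < extra else 0)
        if limit:
            ranges.append((offset, limit))
        offset += limit
    return ranges


if __name__ == "__main__":
    # 300 tasks (SWE-bench Lite) across 4 workers
    for worker, (offset, limit) in enumerate(shard_ranges(300, 4)):
        print(
            "npx tsx scripts/swe-bench/run.ts "
            f"--offset {offset} --limit {limit} --output pred-{worker}.jsonl &"
        )
```

For 300 tasks and 4 workers this reproduces the offsets 0, 75, 150, and 225 with a limit of 75 each.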

### Potential Improvements

- **Docker-per-task**: Run each agent in a Docker container matching the SWE-bench environment spec (correct Python version, pre-installed dependencies)
- **Shared repo pool**: Clone each unique repo once, use `git worktree` for per-task isolation
- **Cost tracking**: Parse run-log token counts for per-task and aggregate cost estimates
- **Multi-turn retries**: If the agent produces no patch, retry with feedback
- **System prompt tuning**: The current prompt is minimal; more detailed guidance (e.g., "search for related test files to understand expected behavior") could improve solve rate
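
As one possible shape for the shared repo pool, the sketch below builds the git commands for a one-time clone per repository and a detached worktree per task. It is a design sketch only: the directory layout, the function names, and the clone URL pattern are assumptions, and nothing here exists in the current runner.

```python
from pathlib import Path


def pool_path(repo, pool_dir):
    """Location of the shared clone, e.g. <pool_dir>/astropy__astropy."""
    return Path(pool_dir) / repo.replace("/", "__")


def clone_command(repo, pool_dir):
    """Run once per unique repository."""
    return ["git", "clone", f"https://github.com/{repo}.git", str(pool_path(repo, pool_dir))]


def worktree_command(repo, pool_dir, task_dir, base_commit):
    """Run once per task: a separate checkout that shares the clone's object store."""
    return [
        "git", "-C", str(pool_path(repo, pool_dir)),
        "worktree", "add", "--detach", str(task_dir), base_commit,
    ]


if __name__ == "__main__":
    print(" ".join(clone_command("astropy/astropy", "/tmp/swe-bench/pool")))
    print(" ".join(worktree_command(
        "astropy/astropy", "/tmp/swe-bench/pool",
        "/tmp/swe-bench/astropy__astropy-12907", "<base_commit>",
    )))
```

Each worktree has its own working directory and index, so `git diff` inside a task directory would still see only that task's changes.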

## Related Benchmarks

| Benchmark | Focus | Notes |
|-----------|-------|-------|
| [SWE-bench Verified](https://openai.com/index/introducing-swe-bench-verified/) | Bug fixing (Python) | Gold standard, 500 human-verified tasks |
| [SWE-bench Multilingual](https://github.com/SWE-bench/SWE-bench) | Bug fixing (9 languages) | C, C++, Go, Java, JS, TS, PHP, Ruby, Rust |
| [Terminal-Bench](https://www.tbench.ai/) | CLI workflows | Multi-step sandboxed terminal tasks |
| [Aider Polyglot](https://aider.chat/docs/leaderboards/) | Code editing | 225 Exercism exercises, 6 languages |
| [DPAI Arena](https://www.jetbrains.com/) | Full dev workflow | JetBrains: patch, test, review, analysis |
| [HumanEval](https://github.com/openai/human-eval) | Function generation | 164 Python function tasks, largely saturated |

## Initial Results (kimi-coding, 3 tasks)

First run on 3 SWE-bench Lite tasks (all astropy):

| Task | Status | Duration | LLM Time | Tools | Fix |
|------|--------|----------|----------|-------|-----|
| `astropy__astropy-12907` | PATCHED | 141.9s | 125.1s | 30 | `_cstack`: `= 1` → `= right` |
| `astropy__astropy-14182` | PATCHED | 192.0s | 166.9s | 56 | Added `header_rows` param to RST writer |
| `astropy__astropy-14365` | PATCHED | 65.7s | 49.6s | 23 | `re.compile()` + `re.IGNORECASE` |

3/3 tasks produced patches. Formal evaluation pending (requires Docker harness).
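
The Duration and LLM Time columns also show where the wall-clock time goes. The short calculation below uses only the numbers from the table; the remainder is presumably tool execution and runner overhead, which the table does not break down.

```python
# (wall-clock seconds, LLM seconds) copied from the table above
RUNS = {
    "astropy__astropy-12907": (141.9, 125.1),
    "astropy__astropy-14182": (192.0, 166.9),
    "astropy__astropy-14365": (65.7, 49.6),
}


def llm_share(duration_s, llm_s):
    """Fraction of wall-clock time spent waiting on LLM calls."""
    return llm_s / duration_s


for instance_id, (duration_s, llm_s) in RUNS.items():
    print(f"{instance_id}: {llm_share(duration_s, llm_s):.0%} of wall-clock time in LLM calls")

total_duration = sum(duration for duration, _ in RUNS.values())
total_llm = sum(llm for _, llm in RUNS.values())
print(f"overall: {llm_share(total_duration, total_llm):.0%}")
```

Across these three tasks, roughly 75% to 88% of each run is LLM time, about 85% overall.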