Covers the full pipeline: dataset download, agent execution,
result analysis, and official Docker evaluation. Includes
runner options, output format, known limitations, and initial
benchmark results.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>