Finance Agent-Driven E2E Benchmark

This benchmark suite is designed for complex financial analysis scenarios and follows the workflow in docs/e2e-testing-guide.md.

Scope

Domain: equity, macro, rates, credit, cross-asset allocation
Complexity: multi-step planning, data collection, analysis, local artifact generation
Providers: kimi-coding and claude-code
Cases: 10

Case prompts are stored in:

Credentials are configured (run pnpm multica credentials init if needed)
Dev auth exists for web_search/data tools (~/.super-multica-dev/auth.json)
Required env:

export SMC_DATA_DIR=~/.super-multica-e2e
export MULTICA_API_URL=https://api-dev.copilothub.ai
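
A preflight check along the following lines can catch a missing variable or auth file before a long batch starts. This is only a sketch: check_prereqs is a hypothetical helper and is not part of the benchmark scripts. The default auth path is the one listed above.

```shell
# Preflight sketch (hypothetical helper, not shipped with the benchmark).
# Returns 0 when both required env vars are set and the dev auth file exists.
check_prereqs() {
  local auth_file="${1:-$HOME/.super-multica-dev/auth.json}"
  local missing=0

  if [ -z "${SMC_DATA_DIR:-}" ]; then
    echo "missing env var: SMC_DATA_DIR" >&2
    missing=1
  fi
  if [ -z "${MULTICA_API_URL:-}" ]; then
    echo "missing env var: MULTICA_API_URL" >&2
    missing=1
  fi
  if [ ! -f "$auth_file" ]; then
    echo "missing dev auth file: $auth_file" >&2
    missing=1
  fi
  return "$missing"
}
```

Usage: check_prereqs && scripts/e2e-finance-benchmark/run.sh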

scripts/e2e-finance-benchmark/run.sh

The script defaults:

Generated artifact:

manifest.tsv: provider, case id, status, session id, session dir, raw log file
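
For a quick overview of a batch, the manifest can be tallied per provider and status. The sketch below assumes that manifest.tsv is tab-separated with columns in the order listed above, and that a header row, if there is one, starts with "provider". Neither assumption is confirmed here, and summarize_manifest is a hypothetical helper name.

```shell
# Sketch: count rows per (provider, status) in manifest.tsv.
# Assumed column order: provider, case id, status, session id,
# session dir, raw log file (tab-separated).
summarize_manifest() {
  local manifest="$1"
  awk -F'\t' '
    NR == 1 && $1 == "provider" { next }   # skip an assumed header row
    NF >= 3 { count[$1 "\t" $3]++ }        # key on provider + status
    END { for (k in count) printf "%s\t%d\n", k, count[k] }
  ' "$manifest" | LC_ALL=C sort
}
```

Usage: summarize_manifest path/to/manifest.tsv prints one line per provider and status with the number of cases in that state.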

Run only one provider:

PROVIDERS="kimi-coding" scripts/e2e-finance-benchmark/run.sh

Run only specific cases by glob:

CASE_GLOB="case-0[1-3]*.txt" scripts/e2e-finance-benchmark/run.sh
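
Before launching a filtered run, it can help to preview which case files a glob would select. The helper below is a hypothetical sketch: it takes the case directory as an argument because the location is not given here, and it uses plain shell pathname expansion, which may differ from how run.sh applies CASE_GLOB.

```shell
# Sketch: list the case files in a directory that match a glob.
# preview_cases is a hypothetical helper; run.sh may match differently.
preview_cases() {
  local dir="$1" glob="$2" f
  # $glob is left unquoted on purpose so the shell expands the pattern.
  for f in "$dir"/$glob; do
    if [ -f "$f" ]; then
      basename "$f"
    fi
  done
  return 0
}
```

Usage: preview_cases path/to/cases 'case-0[1-3]*.txt' prints the matching file names, one per line, and prints nothing when no file matches.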

Run with higher parallelism and a longer per-case timeout (2700 s = 45 min) for long-horizon tasks:

MAX_PARALLEL=4 CASE_TIMEOUT_SEC=2700 scripts/e2e-finance-benchmark/run.sh

case-01-top10-financial-reports.txt
- Top-10 US market cap 3-year filing analysis + workbook + 2026 allocation memo
case-02-ai-value-chain-scorecard.txt
- AI value-chain factor model and weighted ranking
case-03-us-bank-stress-test.txt
- US large-bank stress scenarios (mild/severe recession)
case-04-consumer-sector-macro-linkage.txt
- Consumer sector earnings elasticity vs macro variables
case-05-energy-transport-sensitivity.txt
- Energy/transport sensitivity and hedge ideas under oil scenarios
case-06-cross-asset-allocation.txt
- Cross-asset tactical portfolio design with scenario stress tests
case-07-reit-rate-risk.txt
- REIT screening under rate scenarios and debt maturity pressure
case-08-earnings-quality-forensics.txt
- Forensic accounting quality framework and red-flag scoring
case-09-post-earnings-drift-study.txt
- PEAD strategy feasibility study with risk controls
case-10-investment-committee-pack.txt

For each run (session-dir/run-log.jsonl):

For content quality (session.jsonl and output files on Desktop):

Most cases intentionally require web + financial data gathering and local file generation.
Cases are designed to test planning quality, not just the quality of the final answer.
After a batch run, you can analyze individual sessions by opening the session dir paths listed in manifest.tsv.
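
To walk the sessions of a batch, the session dir column can be pulled out of the manifest, optionally filtered by provider. As a sketch, this assumes a tab-separated manifest with the session dir in the fifth column (the order in which the columns are listed for manifest.tsv). session_dirs is a hypothetical helper name.

```shell
# Sketch: print session dir paths from manifest.tsv.
# Assumed layout: tab-separated, provider in column 1, session dir in column 5.
session_dirs() {
  local manifest="$1" provider="${2:-}"
  awk -F'\t' -v p="$provider" '
    NR == 1 && $1 == "provider" { next }          # skip an assumed header row
    NF >= 5 && (p == "" || $1 == p) { print $5 }  # empty p means all providers
  ' "$manifest"
}
```

Usage: session_dirs path/to/manifest.tsv claude-code | while read -r dir; do ls "$dir"; done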