Finance Agent-Driven E2E Benchmark
This benchmark suite is designed for complex financial analysis scenarios and follows the workflow in docs/e2e-testing-guide.md.
Scope
- Domain: equity, macro, rates, credit, cross-asset allocation
- Complexity: multi-step planning, data collection, analysis, local artifact generation
- Providers: kimi-coding and claude-code
- Cases: 10
Case prompts are stored in:
scripts/e2e-finance-benchmark/cases/
Prerequisites
- Credentials are configured (pnpm multica credentials init if needed)
- Dev auth exists for web_search/data tools (~/.super-multica-dev/auth.json)
- Required env:
export SMC_DATA_DIR=~/.super-multica-e2e
export MULTICA_API_URL=https://api-dev.copilothub.ai
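The prerequisites above can be checked before a batch run. The following is a minimal sketch, not part of the repository: check_prereqs is a hypothetical helper that only verifies that the two environment variables are set and that the dev auth file exists. It does not validate credentials.

```shell
# Preflight sketch: verify the environment described above.
# check_prereqs is a hypothetical helper, not shipped with the benchmark.
# Optional argument: path to the dev auth file (defaults to the documented path).
check_prereqs() {
  local auth_file="${1:-$HOME/.super-multica-dev/auth.json}"
  local missing=0

  if [ -z "${SMC_DATA_DIR:-}" ]; then
    echo "missing env: SMC_DATA_DIR" >&2
    missing=1
  fi
  if [ -z "${MULTICA_API_URL:-}" ]; then
    echo "missing env: MULTICA_API_URL" >&2
    missing=1
  fi
  if [ ! -f "$auth_file" ]; then
    echo "missing dev auth file: $auth_file" >&2
    missing=1
  fi

  # Returns 0 only when every check passed.
  return "$missing"
}
```

After sourcing the function, it can gate the run: check_prereqs && scripts/e2e-finance-benchmark/run.sh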
Run All Cases (Both Providers)
scripts/e2e-finance-benchmark/run.sh
The script defaults:
- Providers: kimi-coding claude-code
- Case glob: case-*.txt
- Max parallel workers: 2
- Per-case timeout: 900s (set CASE_TIMEOUT_SEC=0 to disable)
- Output directory: .context/finance-e2e-runs/<timestamp>/
Generated artifact:
manifest.tsv: provider, case id, status, session id, session dir, raw log file
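To get a quick overview of a batch, the manifest can be tallied per provider and status. This is a sketch under assumptions: summarize_manifest is a hypothetical helper, and it assumes manifest.tsv is tab-separated, has its columns in the order listed above, and has no header row. If the file does have a header, that row is counted like any other.

```shell
# summarize_manifest is a hypothetical helper, not part of the benchmark.
# Assumed layout: tab-separated, column 1 = provider, column 3 = status.
# Prints one line per (provider, status) pair: provider<TAB>status<TAB>count.
summarize_manifest() {
  awk -F '\t' '
    NF >= 3 { count[$1 "\t" $3]++ }
    END { for (key in count) printf "%s\t%d\n", key, count[key] }
  ' "$1" | LC_ALL=C sort
}
```

Example: summarize_manifest .context/finance-e2e-runs/<timestamp>/manifest.tsv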
Run a Subset
Run only one provider:
PROVIDERS="kimi-coding" scripts/e2e-finance-benchmark/run.sh
Run only specific cases by glob:
CASE_GLOB="case-0[1-3]*.txt" scripts/e2e-finance-benchmark/run.sh
Run with higher parallelism and a longer per-case timeout for long-horizon tasks:
MAX_PARALLEL=4 CASE_TIMEOUT_SEC=2700 scripts/e2e-finance-benchmark/run.sh
Case List
- case-01-top10-financial-reports.txt - Top-10 US market cap 3-year filing analysis + workbook + 2026 allocation memo
- case-02-ai-value-chain-scorecard.txt - AI value-chain factor model and weighted ranking
- case-03-us-bank-stress-test.txt - US large-bank stress scenarios (mild/severe recession)
- case-04-consumer-sector-macro-linkage.txt - Consumer sector earnings elasticity vs macro variables
- case-05-energy-transport-sensitivity.txt - Energy/transport sensitivity and hedge ideas under oil scenarios
- case-06-cross-asset-allocation.txt - Cross-asset tactical portfolio design with scenario stress tests
- case-07-reit-rate-risk.txt - REIT screening under rate scenarios and debt maturity pressure
- case-08-earnings-quality-forensics.txt - Forensic accounting quality framework and red-flag scoring
- case-09-post-earnings-drift-study.txt - PEAD strategy feasibility study with risk controls
- case-10-investment-committee-pack.txt - Q2 2026 investment committee pack + devil's advocate memo
Evaluation Checklist
For each run (session-dir/run-log.jsonl):
- Event completeness
  - run_start appears before run_end
- Tool pairing
  - Every tool_start has a matching tool_end
- Error handling
  - Check tool_end.is_error, error_classify, auth_rotate
- Compaction health
  - If compaction occurs: compaction.tokens_removed > 0
- Performance
  - Inspect llm_result.duration_ms and tool durations for outliers
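The first two run-log checks can be roughly automated. The sketch below is an assumption-heavy approximation: check_run_log is a hypothetical helper, and since the run-log.jsonl schema is not shown here, it assumes only that each line is one event and that the event name appears as a quoted string such as "tool_start". It compares counts of tool_start and tool_end, which is a coarse signal: equal counts do not prove that every start has its own matching end.

```shell
# check_run_log is a hypothetical helper, not part of the benchmark.
# Assumption: event names appear in each JSONL line as quoted strings.
check_run_log() {
  local log="$1"
  local starts ends first_start first_end
  local status=0

  # grep -c prints 0 and exits non-zero when nothing matches; "|| true"
  # keeps the function usable under "set -e".
  starts=$(grep -c '"tool_start"' "$log" || true)
  ends=$(grep -c '"tool_end"' "$log" || true)

  # Line numbers of the first run_start and run_end events (empty if absent).
  first_start=$(grep -n '"run_start"' "$log" | head -n 1 | cut -d: -f1 || true)
  first_end=$(grep -n '"run_end"' "$log" | head -n 1 | cut -d: -f1 || true)

  if [ -z "$first_start" ] || [ -z "$first_end" ] || [ "$first_start" -ge "$first_end" ]; then
    echo "event completeness: run_start must appear before run_end" >&2
    status=1
  fi
  if [ "$starts" -ne "$ends" ]; then
    echo "tool pairing: $starts tool_start vs $ends tool_end" >&2
    status=1
  fi
  return "$status"
}
```

A stricter pairing check would match each tool_start to its tool_end by a call identifier, which requires knowing the actual field names in the log.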
For content quality (session.jsonl and output files on Desktop):
- Required files are created in target output directory
- Assumptions are explicit and traceable
- Sources are listed (sources.md with links + dates)
- Output distinguishes facts vs inferences when requested
- Strategy conclusions include risk and invalidation conditions
Notes
- Most cases intentionally require web + financial data gathering and local file generation.
- Cases are designed to test planning quality, not only final answer quality.
- You can analyze sessions after batch runs by opening the session_dir paths in manifest.tsv.
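For post-run analysis, the session directories can be pulled out of the manifest in one step. This is a sketch only: list_session_dirs is a hypothetical helper, and it assumes manifest.tsv is tab-separated with the session dir in the fifth column (following the column order given for the manifest) and no header row.

```shell
# list_session_dirs is a hypothetical helper, not part of the benchmark.
# Assumed layout: tab-separated, column 5 = session dir.
# Prints each distinct, non-empty session dir once.
list_session_dirs() {
  awk -F '\t' 'NF >= 5 && $5 != "" { print $5 }' "$1" | LC_ALL=C sort -u
}
```

One possible use is to confirm that each session has a run log: list_session_dirs manifest.tsv | while IFS= read -r dir; do ls -l "$dir/run-log.jsonl"; done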