multica/docs/e2e-finance-benchmark.md
2026-02-16 02:15:55 +08:00

3.6 KiB

Finance Agent-Driven E2E Benchmark

This benchmark suite is designed for complex financial analysis scenarios and follows the workflow in docs/e2e-testing-guide.md.

Scope

  • Domain: equity, macro, rates, credit, cross-asset allocation
  • Complexity: multi-step planning, data collection, analysis, local artifact generation
  • Providers: kimi-coding and claude-code
  • Cases: 10

Case prompts are stored in:

  • scripts/e2e-finance-benchmark/cases/

Prerequisites

  1. Credentials are configured (pnpm multica credentials init if needed)
  2. Dev auth exists for web_search/data tools (~/.super-multica-dev/auth.json)
  3. Required env:
export SMC_DATA_DIR=~/.super-multica-e2e
export MULTICA_API_URL=https://api-dev.copilothub.ai

Run All Cases (Both Providers)

scripts/e2e-finance-benchmark/run.sh

The script defaults:

  • Providers: kimi-coding claude-code
  • Case glob: case-*.txt
  • Max parallel workers: 2
  • Per-case timeout: 900s (set CASE_TIMEOUT_SEC=0 to disable)
  • Output directory: .context/finance-e2e-runs/<timestamp>/

Generated artifact:

  • manifest.tsv: provider, case id, status, session id, session dir, raw log file

Run a Subset

Run only one provider:

PROVIDERS="kimi-coding" scripts/e2e-finance-benchmark/run.sh

Run only specific cases by glob:

CASE_GLOB="case-0[1-3]*.txt" scripts/e2e-finance-benchmark/run.sh

Run with higher parallelism for long-horizon tasks:

MAX_PARALLEL=4 CASE_TIMEOUT_SEC=2700 scripts/e2e-finance-benchmark/run.sh

Case List

  1. case-01-top10-financial-reports.txt
    • Top-10 US market cap 3-year filing analysis + workbook + 2026 allocation memo
  2. case-02-ai-value-chain-scorecard.txt
    • AI value-chain factor model and weighted ranking
  3. case-03-us-bank-stress-test.txt
    • US large-bank stress scenarios (mild/severe recession)
  4. case-04-consumer-sector-macro-linkage.txt
    • Consumer sector earnings elasticity vs macro variables
  5. case-05-energy-transport-sensitivity.txt
    • Energy/transport sensitivity and hedge ideas under oil scenarios
  6. case-06-cross-asset-allocation.txt
    • Cross-asset tactical portfolio design with scenario stress tests
  7. case-07-reit-rate-risk.txt
    • REIT screening under rate scenarios and debt maturity pressure
  8. case-08-earnings-quality-forensics.txt
    • Forensic accounting quality framework and red-flag scoring
  9. case-09-post-earnings-drift-study.txt
    • PEAD strategy feasibility study with risk controls
  10. case-10-investment-committee-pack.txt
  • Q2 2026 investment committee pack + devil's advocate memo

Evaluation Checklist

For each run (session-dir/run-log.jsonl):

  1. Event completeness
    • run_start appears before run_end
  2. Tool pairing
    • Every tool_start has matching tool_end
  3. Error handling
    • Check tool_end.is_error, error_classify, auth_rotate
  4. Compaction health
    • If compaction occurs: compaction.tokens_removed > 0
  5. Performance
    • Inspect llm_result.duration_ms and tool durations for outliers

For content quality (session.jsonl and output files on Desktop):

  1. Required files are created in target output directory
  2. Assumptions are explicit and traceable
  3. Sources are listed (sources.md with links + dates)
  4. Output distinguishes facts vs inferences when requested
  5. Strategy conclusions include risk and invalidation conditions

Notes

  • Most cases intentionally require web + financial data gathering and local file generation.
  • Cases are designed to test planning quality, not only final answer quality.
  • You can analyze sessions after batch runs by opening the session_dir paths in manifest.tsv.