feat(benchmark): add finance e2e case suite

2026-02-16 01:29:24 +08:00 · 2026-02-16 01:29:24 +08:00 · edc55390cf
commit edc55390cf
parent faa54fc671
12 changed files with 326 additions and 0 deletions
--- a/docs/e2e-finance-benchmark.md
+++ b/docs/e2e-finance-benchmark.md
@ -0,0 +1,104 @@
+# Finance Agent-Driven E2E Benchmark
+
+This benchmark suite is designed for complex financial analysis scenarios and follows the workflow in `docs/e2e-testing-guide.md`.
+
+## Scope
+
+- Domain: equity, macro, rates, credit, cross-asset allocation
+- Complexity: multi-step planning, data collection, analysis, local artifact generation
+- Providers: `kimi-coding` and `claude-code`
+- Cases: 10
+
+Case prompts are stored in:
+- `scripts/e2e-finance-benchmark/cases/`
+
+## Prerequisites
+
+1. Credentials are configured (`pnpm multica credentials init` if needed)
+2. Dev auth exists for `web_search`/`data` tools (`~/.super-multica-dev/auth.json`)
+3. Required env:
+
+```bash
+export SMC_DATA_DIR=~/.super-multica-e2e
+export MULTICA_API_URL=https://api-dev.copilothub.ai
+```
+
+## Run All Cases (Both Providers)
+
+```bash
+scripts/e2e-finance-benchmark/run.sh
+```
+
+The script defaults:
+- Providers: `kimi-coding claude-code`
+- Case glob: `case-*.txt`
+- Output directory: `.context/finance-e2e-runs/<timestamp>/`
+
+Generated artifact:
+- `manifest.tsv`: provider, case id, status, session id, session dir, raw log file
+
+## Run a Subset
+
+Run only one provider:
+
+```bash
+PROVIDERS="kimi-coding" scripts/e2e-finance-benchmark/run.sh
+```
+
+Run only specific cases by glob:
+
+```bash
+CASE_GLOB="case-0[1-3]*.txt" scripts/e2e-finance-benchmark/run.sh
+```
+
+## Case List
+
+1. `case-01-top10-financial-reports.txt`
+   - Top-10 US market cap 3-year filing analysis + workbook + 2026 allocation memo
+2. `case-02-ai-value-chain-scorecard.txt`
+   - AI value-chain factor model and weighted ranking
+3. `case-03-us-bank-stress-test.txt`
+   - US large-bank stress scenarios (mild/severe recession)
+4. `case-04-consumer-sector-macro-linkage.txt`
+   - Consumer sector earnings elasticity vs macro variables
+5. `case-05-energy-transport-sensitivity.txt`
+   - Energy/transport sensitivity and hedge ideas under oil scenarios
+6. `case-06-cross-asset-allocation.txt`
+   - Cross-asset tactical portfolio design with scenario stress tests
+7. `case-07-reit-rate-risk.txt`
+   - REIT screening under rate scenarios and debt maturity pressure
+8. `case-08-earnings-quality-forensics.txt`
+   - Forensic accounting quality framework and red-flag scoring
+9. `case-09-post-earnings-drift-study.txt`
+   - PEAD strategy feasibility study with risk controls
+10. `case-10-investment-committee-pack.txt`
+   - Q2 2026 investment committee pack + devil's advocate memo
+
+## Evaluation Checklist
+
+For each run (`session-dir/run-log.jsonl`):
+
+1. Event completeness
+   - `run_start` appears before `run_end`
+2. Tool pairing
+   - Every `tool_start` has matching `tool_end`
+3. Error handling
+   - Check `tool_end.is_error`, `error_classify`, `auth_rotate`
+4. Compaction health
+   - If compaction occurs: `compaction.tokens_removed > 0`
+5. Performance
+   - Inspect `llm_result.duration_ms` and tool durations for outliers
+
+For content quality (`session.jsonl` and output files on Desktop):
+
+1. Required files are created in target output directory
+2. Assumptions are explicit and traceable
+3. Sources are listed (`sources.md` with links + dates)
+4. Output distinguishes facts vs inferences when requested
+5. Strategy conclusions include risk and invalidation conditions
+
+## Notes
+
+- Most cases intentionally require web + financial data gathering and local file generation.
+- Cases are designed to test planning quality, not only final answer quality.
+- You can analyze sessions after batch runs by opening the `session_dir` paths in `manifest.tsv`.