feat(benchmark): add finance e2e case suite

Jiayuan Zhang 2026-02-16 01:29:24 +08:00
parent faa54fc671
commit edc55390cf
12 changed files with 326 additions and 0 deletions


@@ -0,0 +1,104 @@
# Finance Agent-Driven E2E Benchmark
This benchmark suite is designed for complex financial analysis scenarios and follows the workflow in `docs/e2e-testing-guide.md`.
## Scope
- Domain: equity, macro, rates, credit, cross-asset allocation
- Complexity: multi-step planning, data collection, analysis, local artifact generation
- Providers: `kimi-coding` and `claude-code`
- Cases: 10
Case prompts are stored in:
- `scripts/e2e-finance-benchmark/cases/`
## Prerequisites
1. Credentials are configured (run `pnpm multica credentials init` if needed)
2. Dev auth for the `web_search` and `data` tools exists at `~/.super-multica-dev/auth.json`
3. Required environment variables:
```bash
export SMC_DATA_DIR=~/.super-multica-e2e
export MULTICA_API_URL=https://api-dev.copilothub.ai
```
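A quick preflight check can catch a missing prerequisite before a long batch starts. The following is a minimal sketch, not part of the repository: the `preflight` function name and its messages are hypothetical, and it only verifies that the auth file exists and that both variables are exported and non-empty.

```bash
# Preflight sketch. The function name and messages are illustrative, not part of the repo.
preflight() {
  # Optional first argument overrides the auth file path (handy for testing).
  local auth_file="${1:-$HOME/.super-multica-dev/auth.json}"
  local missing=0 var
  if [ ! -f "$auth_file" ]; then
    echo "missing dev auth file: $auth_file" >&2
    missing=1
  fi
  for var in SMC_DATA_DIR MULTICA_API_URL; do
    # printenv prints nothing for a variable that is unset or not exported
    if [ -z "$(printenv "$var")" ]; then
      echo "missing or unexported env var: $var" >&2
      missing=1
    fi
  done
  # Exit status 0 means every check passed, 1 means at least one failed
  return "$missing"
}
```

Calling `preflight && scripts/e2e-finance-benchmark/run.sh` would then start the batch only when all checks pass.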
## Run All Cases (Both Providers)
```bash
scripts/e2e-finance-benchmark/run.sh
```
Script defaults:
- Providers: `kimi-coding claude-code`
- Case glob: `case-*.txt`
- Output directory: `.context/finance-e2e-runs/<timestamp>/`
Generated artifact:
- `manifest.tsv`: provider, case id, status, session id, session dir, raw log file
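For a quick overview of a batch, the manifest can be tallied per provider and status. This is a sketch under assumptions: the file is tab-separated, the fields appear in the order listed above, and a header row (if present) starts with the literal word `provider`. The `summarize_manifest` name is hypothetical, and no particular status values are assumed.

```bash
# Count runs per provider and status from a manifest.tsv.
# Assumed field order: provider, case id, status, session id, session dir, raw log file.
summarize_manifest() {
  awk -F'\t' '
    NR == 1 && $1 == "provider" { next }          # skip a header row, if one exists (assumption)
    NF >= 3 { count[$1 "\t" $3]++ }               # key on provider + status
    END { for (key in count) printf "%s\t%d\n", key, count[key] }
  ' "$1" | LC_ALL=C sort                          # awk iteration order is unspecified, so sort
}
```

Each output line has the form `provider<TAB>status<TAB>count`, so an unexpected status value stands out immediately.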
## Run a Subset
Run only one provider:
```bash
PROVIDERS="kimi-coding" scripts/e2e-finance-benchmark/run.sh
```
Run only specific cases by glob:
```bash
CASE_GLOB="case-0[1-3]*.txt" scripts/e2e-finance-benchmark/run.sh
```
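Before launching a subset, it can help to preview which case files a glob selects. The sketch below assumes `CASE_GLOB` is matched against file names inside the cases directory; the `preview_cases` helper is hypothetical and does not run anything.

```bash
# List the case files a glob would select, without running the benchmark.
# Assumption: CASE_GLOB is matched against file names in the cases directory.
preview_cases() {
  local cases_dir="$1" glob="$2"
  # -name applies shell-style pattern matching to the base name only;
  # quoting "$glob" keeps the shell from expanding it first
  find "$cases_dir" -maxdepth 1 -type f -name "$glob" | sort
}

# Example (from the repository root):
#   preview_cases scripts/e2e-finance-benchmark/cases 'case-0[1-3]*.txt'
```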
## Case List
1. `case-01-top10-financial-reports.txt`
- Top-10 US market cap 3-year filing analysis + workbook + 2026 allocation memo
2. `case-02-ai-value-chain-scorecard.txt`
- AI value-chain factor model and weighted ranking
3. `case-03-us-bank-stress-test.txt`
- US large-bank stress scenarios (mild/severe recession)
4. `case-04-consumer-sector-macro-linkage.txt`
- Consumer sector earnings elasticity vs macro variables
5. `case-05-energy-transport-sensitivity.txt`
- Energy/transport sensitivity and hedge ideas under oil scenarios
6. `case-06-cross-asset-allocation.txt`
- Cross-asset tactical portfolio design with scenario stress tests
7. `case-07-reit-rate-risk.txt`
- REIT screening under rate scenarios and debt maturity pressure
8. `case-08-earnings-quality-forensics.txt`
- Forensic accounting quality framework and red-flag scoring
9. `case-09-post-earnings-drift-study.txt`
- PEAD strategy feasibility study with risk controls
10. `case-10-investment-committee-pack.txt`
- Q2 2026 investment committee pack + devil's advocate memo
## Evaluation Checklist
For each run (`<session_dir>/run-log.jsonl`):
1. Event completeness
- `run_start` appears before `run_end`
2. Tool pairing
- Every `tool_start` has matching `tool_end`
3. Error handling
- Check `tool_end.is_error`, `error_classify`, `auth_rotate`
4. Compaction health
- If compaction occurs: `compaction.tokens_removed > 0`
5. Performance
- Inspect `llm_result.duration_ms` and tool durations for outliers
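The first two checks can be approximated with plain `grep`. This is a crude sketch, assuming one JSON record per line with the event name appearing as a quoted string such as `"tool_start"`; it does not assume which field holds the name. The helper names are hypothetical. Equal counts are necessary but not sufficient for correct pairing, and an event name quoted inside tool output would skew the result, so treat a failure as a prompt to inspect the log by hand.

```bash
# Crude structural checks for one run-log.jsonl.
# Assumption: one JSON record per line, event names appear as quoted strings.

# Print the line number of the first record mentioning event $2 in file $1.
first_line_of() {
  grep -n -F "\"$2\"" "$1" | head -n 1 | cut -d: -f1
}

# Succeeds when both events exist and run_start comes first.
check_run_order() {
  local start end
  start=$(first_line_of "$1" run_start)
  end=$(first_line_of "$1" run_end)
  [ -n "$start" ] && [ -n "$end" ] && [ "$start" -lt "$end" ]
}

# Succeeds when the number of tool_start and tool_end records is equal.
check_tool_pairing() {
  local starts ends
  # grep -c exits non-zero on zero matches but still prints 0
  starts=$(grep -c -F '"tool_start"' "$1" || true)
  ends=$(grep -c -F '"tool_end"' "$1" || true)
  [ "$starts" -eq "$ends" ]
}
```

A stricter check would match each `tool_start` to its `tool_end` by identifier, which requires knowing the log schema.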
For content quality (`session.jsonl` and output files on the Desktop):
1. Required files are created in the target output directory
2. Assumptions are explicit and traceable
3. Sources are listed (`sources.md` with links + dates)
4. Output distinguishes facts vs inferences when requested
5. Strategy conclusions include risk and invalidation conditions
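The sources item lends itself to a mechanical first pass. The sketch below uses a hypothetical `check_sources` helper that only confirms `sources.md` exists, is non-empty, and contains at least one HTTP(S) link. It does not validate dates, since the date format is not specified here, and it says nothing about whether the sources are relevant.

```bash
# First-pass check that an output directory lists its sources.
# The helper name is illustrative; only existence and link presence are tested.
check_sources() {
  local sources="$1/sources.md"
  # -s: file exists and is non-empty; grep -q: at least one http(s) URL
  [ -s "$sources" ] && grep -q -E 'https?://' "$sources"
}
```

The remaining items (traceable assumptions, facts versus inferences, invalidation conditions) still need a human or model reviewer.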
## Notes
- Most cases intentionally require web + financial data gathering and local file generation.
- Cases are designed to test planning quality, not only final answer quality.
- After a batch run, analyze individual sessions by opening the `session_dir` paths listed in `manifest.tsv`.
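To walk every session from one batch, the run-log paths can be derived from the manifest. This sketch assumes a tab-separated manifest with the session directory in the fifth field (following the column order listed for `manifest.tsv`) and an optional header row starting with `provider`. The `run_logs_from_manifest` name is hypothetical.

```bash
# Print the run-log.jsonl path for every row of a manifest.tsv.
# Assumption: tab-separated, session dir is the fifth field.
run_logs_from_manifest() {
  awk -F'\t' '
    NR == 1 && $1 == "provider" { next }               # skip a header row, if one exists
    NF >= 5 && $5 != "" { print $5 "/run-log.jsonl" }  # ignore rows without a session dir
  ' "$1"
}

# Example: report sessions whose run log is missing.
#   run_logs_from_manifest path/to/manifest.tsv |
#     while IFS= read -r log; do [ -f "$log" ] || echo "missing: $log"; done
```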