feat(benchmark): add finance e2e case suite
This commit is contained in:
parent
faa54fc671
commit
edc55390cf
12 changed files with 326 additions and 0 deletions
104
docs/e2e-finance-benchmark.md
Normal file
104
docs/e2e-finance-benchmark.md
Normal file
|
|
@ -0,0 +1,104 @@
|
|||
# Finance Agent-Driven E2E Benchmark
|
||||
|
||||
This benchmark suite is designed for complex financial analysis scenarios and follows the workflow in `docs/e2e-testing-guide.md`.
|
||||
|
||||
## Scope
|
||||
|
||||
- Domain: equity, macro, rates, credit, cross-asset allocation
|
||||
- Complexity: multi-step planning, data collection, analysis, local artifact generation
|
||||
- Providers: `kimi-coding` and `claude-code`
|
||||
- Cases: 10
|
||||
|
||||
Case prompts are stored in:
|
||||
- `scripts/e2e-finance-benchmark/cases/`
|
||||
|
||||
## Prerequisites
|
||||
|
||||
1. Credentials are configured (`pnpm multica credentials init` if needed)
|
||||
2. Dev auth exists for `web_search`/`data` tools (`~/.super-multica-dev/auth.json`)
|
||||
3. Required env:
|
||||
|
||||
```bash
|
||||
export SMC_DATA_DIR=~/.super-multica-e2e
|
||||
export MULTICA_API_URL=https://api-dev.copilothub.ai
|
||||
```
|
||||
|
||||
## Run All Cases (Both Providers)
|
||||
|
||||
```bash
|
||||
scripts/e2e-finance-benchmark/run.sh
|
||||
```
|
||||
|
||||
The script defaults:
|
||||
- Providers: `kimi-coding claude-code`
|
||||
- Case glob: `case-*.txt`
|
||||
- Output directory: `.context/finance-e2e-runs/<timestamp>/`
|
||||
|
||||
Generated artifact:
|
||||
- `manifest.tsv`: provider, case id, status, session id, session dir, raw log file
|
||||
|
||||
## Run a Subset
|
||||
|
||||
Run only one provider:
|
||||
|
||||
```bash
|
||||
PROVIDERS="kimi-coding" scripts/e2e-finance-benchmark/run.sh
|
||||
```
|
||||
|
||||
Run only specific cases by glob:
|
||||
|
||||
```bash
|
||||
CASE_GLOB="case-0[1-3]*.txt" scripts/e2e-finance-benchmark/run.sh
|
||||
```
|
||||
|
||||
## Case List
|
||||
|
||||
1. `case-01-top10-financial-reports.txt`
|
||||
- Top-10 US market cap 3-year filing analysis + workbook + 2026 allocation memo
|
||||
2. `case-02-ai-value-chain-scorecard.txt`
|
||||
- AI value-chain factor model and weighted ranking
|
||||
3. `case-03-us-bank-stress-test.txt`
|
||||
- US large-bank stress scenarios (mild/severe recession)
|
||||
4. `case-04-consumer-sector-macro-linkage.txt`
|
||||
- Consumer sector earnings elasticity vs macro variables
|
||||
5. `case-05-energy-transport-sensitivity.txt`
|
||||
- Energy/transport sensitivity and hedge ideas under oil scenarios
|
||||
6. `case-06-cross-asset-allocation.txt`
|
||||
- Cross-asset tactical portfolio design with scenario stress tests
|
||||
7. `case-07-reit-rate-risk.txt`
|
||||
- REIT screening under rate scenarios and debt maturity pressure
|
||||
8. `case-08-earnings-quality-forensics.txt`
|
||||
- Forensic accounting quality framework and red-flag scoring
|
||||
9. `case-09-post-earnings-drift-study.txt`
|
||||
- PEAD strategy feasibility study with risk controls
|
||||
10. `case-10-investment-committee-pack.txt`
|
||||
- Q2 2026 investment committee pack + devil's advocate memo
|
||||
|
||||
## Evaluation Checklist
|
||||
|
||||
For each run (`session-dir/run-log.jsonl`):
|
||||
|
||||
1. Event completeness
|
||||
- `run_start` appears before `run_end`
|
||||
2. Tool pairing
|
||||
- Every `tool_start` has matching `tool_end`
|
||||
3. Error handling
|
||||
- Check `tool_end.is_error`, `error_classify`, `auth_rotate`
|
||||
4. Compaction health
|
||||
- If compaction occurs: `compaction.tokens_removed > 0`
|
||||
5. Performance
|
||||
- Inspect `llm_result.duration_ms` and tool durations for outliers
|
||||
|
||||
For content quality (`session.jsonl` and output files on Desktop):
|
||||
|
||||
1. Required files are created in target output directory
|
||||
2. Assumptions are explicit and traceable
|
||||
3. Sources are listed (`sources.md` with links + dates)
|
||||
4. Output distinguishes facts vs inferences when requested
|
||||
5. Strategy conclusions include risk and invalidation conditions
|
||||
|
||||
## Notes
|
||||
|
||||
- Most cases intentionally require web + financial data gathering and local file generation.
|
||||
- Cases are designed to test planning quality, not only final answer quality.
|
||||
- You can analyze sessions after batch runs by opening the `session_dir` paths in `manifest.tsv`.
|
||||
Loading…
Add table
Add a link
Reference in a new issue