# Skills Agent-Driven E2E Benchmark

This benchmark validates the meta skill workflow for capability-gap discovery, ClawHub installation, and security-gated rollout.

## Scope

- Domain: skill discovery + installation + update
- Focus: `skills/meta-skill-installer`
- Providers: default `kimi-coding` (override with `PROVIDERS`)
- Cases: 5

Case prompts are stored in:

- `scripts/e2e-skills-benchmark/cases/`

## Real ClawHub Examples Used

The case set references real public pages from ClawHub:

- [CalDAV Calendar](https://clawhub.ai/skills/caldav-calendar)
- [Home Assistant](https://clawhub.ai/skills/homeassistant)
- [CodexMonitor](https://clawhub.ai/odrobnik/codexmonitor)
- [Spotify (gap-discovery UX flow)](https://clawhub.ai/search?q=spotify)
- [Notion (gap-discovery UX flow)](https://clawhub.ai/search?q=notion)

## Prerequisites

1. Credentials configured (run `pnpm multica credentials init` if needed)
2. Dependencies installed in the repo (`pnpm install`)
3. `clawhub` CLI available, or allow the runtime fallback to `npx -y clawhub`
4. Required environment variables:

```bash
export SMC_DATA_DIR=~/.super-multica-e2e
export MULTICA_API_URL=https://api-dev.copilothub.ai
```

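The prerequisites above can be verified before a run. The following is a minimal sketch, not part of the repository: the helper names `resolve_clawhub` and `check_required_env` are hypothetical, and it only checks what this section lists (the two environment variables and the `clawhub` CLI with its `npx -y clawhub` fallback).

```bash
#!/usr/bin/env bash
# Preflight sketch. Helper names are hypothetical, not from the repository.

# Print the command to use for ClawHub: the installed CLI if it is on PATH,
# otherwise the npx fallback named in the prerequisites.
resolve_clawhub() {
  if command -v clawhub >/dev/null 2>&1; then
    echo "clawhub"
  else
    echo "npx -y clawhub"
  fi
}

# Report each required variable that is unset or empty.
# Returns 0 when both are set, 1 otherwise.
check_required_env() {
  missing=0
  if [ -z "${SMC_DATA_DIR:-}" ]; then
    echo "missing required env: SMC_DATA_DIR" >&2
    missing=1
  fi
  if [ -z "${MULTICA_API_URL:-}" ]; then
    echo "missing required env: MULTICA_API_URL" >&2
    missing=1
  fi
  return "$missing"
}
```

Sourcing this file and running `check_required_env` before `scripts/e2e-skills-benchmark/run.sh` surfaces a missing variable early instead of partway through a case.
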
## Run Benchmark

```bash
scripts/e2e-skills-benchmark/run.sh
```

Defaults:

- Providers: `kimi-coding`
- Case glob: `case-*.txt`
- Max parallel workers: `1`
- Per-case timeout: `1200s` (`CASE_TIMEOUT_SEC=0` to disable)
- Output directory: `.context/skills-e2e-runs/<timestamp>/`

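The defaults listed above can be spelled out as shell parameter fallbacks. This is a sketch under the assumption that `run.sh` reads the four environment variables that appear in this document (`PROVIDERS`, `CASE_GLOB`, `MAX_PARALLEL`, `CASE_TIMEOUT_SEC`); the `print_effective_config` helper is hypothetical, and the script's actual resolution logic is not shown here.

```bash
#!/usr/bin/env bash
# Sketch: show the value each setting would take, using the documented
# defaults when the corresponding variable is unset or empty.
# print_effective_config is a hypothetical helper, not part of run.sh.
print_effective_config() {
  printf 'PROVIDERS=%s\n' "${PROVIDERS:-kimi-coding}"
  printf 'CASE_GLOB=%s\n' "${CASE_GLOB:-case-*.txt}"
  printf 'MAX_PARALLEL=%s\n' "${MAX_PARALLEL:-1}"
  printf 'CASE_TIMEOUT_SEC=%s\n' "${CASE_TIMEOUT_SEC:-1200}"
}
```

Calling `print_effective_config` in the same shell where overrides are exported is a quick way to confirm which values a run is expected to pick up.
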
Generated artifacts:

- `manifest.tsv`: provider/case/status/session/log metadata
- `analysis.txt`: human-readable pass/fail report
- `analysis.json`: structured detailed check output

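Because each run writes into its own `<timestamp>` directory, it helps to locate the most recent one before opening the artifacts. The sketch below assumes the timestamp directory names sort chronologically when sorted lexicographically, which this document does not confirm; `latest_run_dir` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch: print the last run directory in lexicographic order.
# Assumes <timestamp> names sort chronologically (an assumption).
# latest_run_dir is a hypothetical helper, not part of the repository.
latest_run_dir() {
  base="${1:-.context/skills-e2e-runs}"
  latest=""
  # Glob results are sorted, so the last directory seen is the newest name.
  for dir in "$base"/*/; do
    [ -d "$dir" ] || continue
    latest="${dir%/}"
  done
  # Return non-zero when no run directory exists.
  [ -n "$latest" ] && printf '%s\n' "$latest"
}
```

For example, `cat "$(latest_run_dir)/analysis.txt"` prints the pass/fail report of the newest run.
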
## Run Subset

Run a single case:

```bash
CASE_GLOB="case-01-*.txt" scripts/e2e-skills-benchmark/run.sh
```

Run multiple providers:

```bash
PROVIDERS="kimi-coding claude-code" scripts/e2e-skills-benchmark/run.sh
```

Increase throughput (two parallel workers, 1800-second per-case timeout):

```bash
MAX_PARALLEL=2 CASE_TIMEOUT_SEC=1800 scripts/e2e-skills-benchmark/run.sh
```

## Analyzer Checks

For each run:

1. `run_start` and `run_end` are both present
2. `run_end.error` is empty/null
3. `tool_start` and `tool_end` are paired
4. no `tool_end.is_error=true`
5. at least one `exec` tool call exists
6. case-specific command evidence in `tool_start.args`:
   - `clawhub search`
   - `clawhub install`
   - `review-skill-security.mjs`
   - for case 03, also `clawhub update`
   - for case 04, the prompt is a natural user request only; the agent must discover the capability gap on its own, propose ClawHub + security review + install confirmation, and must not run workaround commands (`osascript`, `ha.sh`, `spogo`, `spotify_player`) before user confirmation
   - for case 05, the prompt is a natural Notion request; the agent must discover the missing capability, search for skill candidates, trigger `install_guard` (blocked until confirmation), and ask for explicit install consent plus token/auth prerequisites

## Notes

- These are agent-driven tests; prompt intent plus run-log evidence are both evaluated.
- `SMC_DATA_DIR=~/.super-multica-e2e` avoids polluting normal user skill/session data.
- If a case fails, open `manifest.tsv` and inspect the matching `session_dir/run-log.jsonl`.

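The lookup described in the last note can be scripted. The sketch below rests on assumptions this document does not confirm: that `manifest.tsv` is tab-separated with a header row, and that the header contains columns named `case`, `status`, and `session_dir`. The value that marks a passing row is also unknown, so it is passed in as an argument. `failed_run_logs` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch: list the run log of every manifest row whose status differs from
# the given passing value. The column names (case, status, session_dir) and
# the header row are assumptions about manifest.tsv.
# Usage: failed_run_logs <manifest.tsv> <passing-status-value>
failed_run_logs() {
  awk -F '\t' -v ok="$2" '
    NR == 1 {
      # Map header names to column positions.
      for (i = 1; i <= NF; i++) col[$i] = i
      if (!("case" in col) || !("status" in col) || !("session_dir" in col)) {
        print "manifest header lacks case/status/session_dir" > "/dev/stderr"
        exit 2
      }
      next
    }
    $(col["status"]) != ok {
      print $(col["case"]) "\t" $(col["session_dir"]) "/run-log.jsonl"
    }
  ' "$1"
}
```

Each output line pairs a case name with the path of the log to inspect. If the real manifest uses different column names, adjust the three names in the header check accordingly.
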