Skills Agent-Driven E2E Benchmark
This benchmark validates the meta skill workflow for capability-gap discovery, ClawHub installation, and security-gated rollout.
Scope
- Domain: skill discovery + installation + update
- Focus: skills/meta-skill-installer
- Providers: default kimi-coding (override with PROVIDERS)
- Cases: 5
Case prompts are stored in:
scripts/e2e-skills-benchmark/cases/
Real ClawHub Examples Used
The case set references real public pages from ClawHub:
- CalDAV Calendar
- Home Assistant
- CodexMonitor
- Spotify (gap-discovery UX flow)
- Notion (gap-discovery UX flow)
Prerequisites
- Credentials configured (pnpm multica credentials init if needed)
- Dependencies installed in repo (pnpm install)
- clawhub CLI available, or allow runtime fallback to npx -y clawhub
- Required env:
export SMC_DATA_DIR=~/.super-multica-e2e
export MULTICA_API_URL=https://api-dev.copilothub.ai
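The prerequisites above can be checked before a run with a small preflight helper. This is only a sketch and not part of the repository: the function names missing_env and clawhub_cmd are hypothetical, and only the variable names and the npx -y clawhub fallback are taken from this document.

```shell
# Hypothetical preflight helpers; not part of scripts/e2e-skills-benchmark.

# Print the name of each required variable that is unset or empty.
missing_env() {
  for name in SMC_DATA_DIR MULTICA_API_URL; do
    eval "value=\${$name:-}"
    [ -n "$value" ] || echo "$name"
  done
}

# Print the command to use for ClawHub: the installed CLI if it is on PATH,
# otherwise the npx fallback named in the prerequisites.
clawhub_cmd() {
  if command -v clawhub >/dev/null 2>&1; then
    echo "clawhub"
  else
    echo "npx -y clawhub"
  fi
}
```

If missing_env prints nothing, both required variables are set; any name it prints still needs an export.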
Run Benchmark
scripts/e2e-skills-benchmark/run.sh
Defaults:
- Providers: kimi-coding
- Case glob: case-*.txt
- Max parallel workers: 1
- Per-case timeout: 1200s (CASE_TIMEOUT_SEC=0 to disable)
- Output directory: .context/skills-e2e-runs/<timestamp>/
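Each of these defaults can be overridden through an environment variable. run.sh is not reproduced here; as a sketch under that caveat, a script could resolve such settings with shell parameter expansion. The variable names are the ones used in this document, while the function name resolve_settings is hypothetical.

```shell
# Hypothetical sketch: use the documented default when a variable is unset or empty.
resolve_settings() {
  PROVIDERS="${PROVIDERS:-kimi-coding}"
  CASE_GLOB="${CASE_GLOB:-case-*.txt}"
  MAX_PARALLEL="${MAX_PARALLEL:-1}"
  # A value of 0 is non-empty, so it survives and can mean "no timeout".
  CASE_TIMEOUT_SEC="${CASE_TIMEOUT_SEC:-1200}"
}
```

With this pattern, a value set on the command line wins and everything else falls back to the defaults listed above.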
Generated artifacts:
- manifest.tsv: provider/case/status/session/log metadata
- analysis.txt: human-readable pass/fail report
- analysis.json: structured detailed check output
Run Subset
Only one case:
CASE_GLOB="case-01-*.txt" scripts/e2e-skills-benchmark/run.sh
Multiple providers:
PROVIDERS="kimi-coding claude-code" scripts/e2e-skills-benchmark/run.sh
Faster throughput:
MAX_PARALLEL=2 CASE_TIMEOUT_SEC=1800 scripts/e2e-skills-benchmark/run.sh
Analyzer Checks
For each run:
- run_start and run_end both present
- run_end.error is empty/null
- tool_start and tool_end are paired
- no tool_end.is_error=true
- at least one exec tool call exists
- case-specific command evidence in tool_start.args:
  - clawhub search
  - clawhub install
  - review-skill-security.mjs
  - for case 03 also clawhub update
- for case 04, prompt is a natural user request only; agent must self-discover capability gap, propose ClawHub + security review + install confirmation, and must not run workaround commands (osascript, ha.sh, spogo, spotify_player) before user confirmation
- for case 05, prompt is a natural Notion request; agent must discover missing capability, search skill candidates, trigger install_guard (blocked until confirmation), and ask for explicit install consent plus token/auth prerequisites
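The pairing check can be approximated by hand when inspecting a run log. The sketch below is not the analyzer itself: it only compares how many lines mention each event name, which is a necessary but not sufficient condition for proper pairing. It assumes run-log.jsonl holds one JSON object per line and that event names appear as quoted strings; the function names are hypothetical.

```shell
# Hypothetical sketch: count the log lines that mention a quoted event name.
# Usage: count_event FILE EVENT_NAME
count_event() {
  # grep -c exits non-zero when the count is 0, so ignore its exit status.
  grep -c "\"$2\"" "$1" || true
}

# Succeed only when tool_start and tool_end occur equally often in the log.
tool_events_balanced() {
  starts="$(count_event "$1" tool_start)"
  ends="$(count_event "$1" tool_end)"
  [ "$starts" -eq "$ends" ]
}
```

A stricter check would parse each JSON line and match every tool_end to its tool_start by identifier, which this count-based version does not attempt.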
Notes
- These are agent-driven tests; prompt intent plus run-log evidence are both evaluated.
- SMC_DATA_DIR=~/.super-multica-e2e avoids polluting normal user skill/session data.
- If a case fails, open manifest.tsv and inspect the matching session_dir/run-log.jsonl.
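To find the rows worth inspecting, manifest.tsv can be filtered by status. The sketch below assumes the file is tab-separated with a header row that names a status column; that layout and the status values are not specified in this document, so check them against a real manifest first. The function name rows_by_status is hypothetical.

```shell
# Hypothetical helper: print manifest rows whose "status" column equals STATUS.
# Usage: rows_by_status FILE STATUS
# Assumes a header row that names a "status" column; prints nothing otherwise.
rows_by_status() {
  awk -F '\t' -v want="$2" '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "status") col = i; next }
    col && $col == want
  ' "$1"
}
```

Pass the manifest path from the run's output directory and whichever status value the manifest uses for failed cases.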