Jiayuan Zhang 8a2b3e10f3 test(e2e): add natural Notion gap-discovery benchmark case

2026-02-17 02:37:29 +08:00

Skills Agent-Driven E2E Benchmark

This benchmark validates the meta-skill workflow end to end: capability-gap discovery, ClawHub installation, and security-gated rollout.

Scope

Domain: skill discovery + installation + update
Focus: skills/meta-skill-installer
Providers: kimi-coding by default (override with the PROVIDERS environment variable)
Cases: 5

Case prompts are stored in:

scripts/e2e-skills-benchmark/cases/

Real ClawHub Examples Used

The case set references real public pages from ClawHub:

Prerequisites

Credentials configured (run pnpm multica credentials init if needed)
Dependencies installed in the repo (pnpm install)
clawhub CLI available, or runtime fallback to npx -y clawhub allowed
Required environment variables:

export SMC_DATA_DIR=~/.super-multica-e2e
export MULTICA_API_URL=https://api-dev.copilothub.ai

Run Benchmark

scripts/e2e-skills-benchmark/run.sh

Defaults:

Providers: kimi-coding (PROVIDERS)
Case glob: case-*.txt (CASE_GLOB)
Max parallel workers: 1 (MAX_PARALLEL)
Per-case timeout: 1200s (CASE_TIMEOUT_SEC; set to 0 to disable)
Output directory: .context/skills-e2e-runs/<timestamp>/

Generated artifacts:

manifest.tsv: provider/case/status/session/log metadata
analysis.txt: human-readable pass/fail report
analysis.json: structured detailed check output

Run Subset

Only one case:

CASE_GLOB="case-01-*.txt" scripts/e2e-skills-benchmark/run.sh

Multiple providers:

PROVIDERS="kimi-coding claude-code" scripts/e2e-skills-benchmark/run.sh

Higher throughput (two parallel workers, with a longer per-case timeout):

MAX_PARALLEL=2 CASE_TIMEOUT_SEC=1800 scripts/e2e-skills-benchmark/run.sh
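
The way the overrides above combine with the defaults can be sketched in a few lines of Python. This is only an illustration of the documented semantics, not the logic inside run.sh: the function name resolve_settings and the returned keys are hypothetical, and treating an empty variable as "unset" is an assumption.

```python
import os
from typing import Mapping, Optional

# Defaults as documented in this README.
DEFAULTS = {
    "PROVIDERS": "kimi-coding",
    "CASE_GLOB": "case-*.txt",
    "MAX_PARALLEL": "1",
    "CASE_TIMEOUT_SEC": "1200",
}


def resolve_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Merge environment overrides with the documented defaults (hypothetical helper)."""
    env = os.environ if env is None else env
    # Assumption: an unset or empty variable falls back to the default.
    raw = {key: env.get(key) or default for key, default in DEFAULTS.items()}
    timeout = int(raw["CASE_TIMEOUT_SEC"])
    return {
        # PROVIDERS is a space-separated list, e.g. "kimi-coding claude-code".
        "providers": raw["PROVIDERS"].split(),
        "case_glob": raw["CASE_GLOB"],
        "max_parallel": int(raw["MAX_PARALLEL"]),
        # CASE_TIMEOUT_SEC=0 disables the per-case timeout.
        "timeout_sec": None if timeout == 0 else timeout,
    }


if __name__ == "__main__":
    print(resolve_settings({"PROVIDERS": "kimi-coding claude-code", "MAX_PARALLEL": "2"}))
```

With two providers and the default case glob, every matching case file would run once per provider, so raising MAX_PARALLEL mainly helps when the provider list or case set is large.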

Analyzer Checks

For each run:

run_start and run_end both present
run_end.error is empty/null
tool_start and tool_end are paired
no tool_end.is_error=true
at least one exec tool call exists
case-specific command evidence in tool_start.args:
- clawhub search
- clawhub install
- review-skill-security.mjs
- for case 03, also clawhub update
- for case 04, the prompt is a natural user request only; the agent must discover the capability gap on its own, propose the ClawHub + security review + install confirmation flow, and must not run workaround commands (osascript, ha.sh, spogo, spotify_player) before the user confirms
- for case 05, the prompt is a natural Notion request; the agent must discover the missing capability, search for skill candidates, trigger install_guard (blocked until confirmation), and ask for explicit install consent plus token/auth prerequisites
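
The structural checks listed above can be sketched as a small Python function over parsed run-log events. This is a hedged illustration, not the benchmark's actual analyzer: the JSONL schema is an assumption (each line is an object with an event field, tool events carry a tool_call_id, tool_start has name and args, tool_end has is_error, run_end has error), and the names parse_events and check_run are hypothetical.

```python
import json
from collections import Counter
from typing import Dict, Iterable, List


def parse_events(jsonl_text: str) -> List[dict]:
    """Parse a run-log.jsonl payload, skipping blank lines."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def check_run(events: List[dict], required_commands: Iterable[str]) -> Dict[str, bool]:
    """Return one boolean per check; field names are assumed, not confirmed."""
    kinds = [e.get("event") for e in events]
    run_end = next((e for e in events if e.get("event") == "run_end"), {})
    starts = [e for e in events if e.get("event") == "tool_start"]
    ends = [e for e in events if e.get("event") == "tool_end"]
    # Flatten every tool_start.args payload into one searchable string.
    evidence = "\n".join(json.dumps(e.get("args", {})) for e in starts)
    return {
        # run_start and run_end both present
        "run_bounds": "run_start" in kinds and "run_end" in kinds,
        # run_end.error is empty/null (a missing run_end is caught by run_bounds)
        "run_end_clean": not run_end.get("error"),
        # every tool_start has a matching tool_end, compared by call id
        "tools_paired": Counter(e.get("tool_call_id") for e in starts)
        == Counter(e.get("tool_call_id") for e in ends),
        # no tool_end reports is_error=true
        "no_tool_errors": not any(e.get("is_error") is True for e in ends),
        # at least one exec tool call exists
        "has_exec_call": any(e.get("name") == "exec" for e in starts),
        # case-specific command evidence appears in tool_start.args
        "command_evidence": all(cmd in evidence for cmd in required_commands),
    }


if __name__ == "__main__":
    sample = [
        {"event": "run_start"},
        {"event": "tool_start", "tool_call_id": "t1", "name": "exec",
         "args": {"command": "clawhub search example"}},
        {"event": "tool_end", "tool_call_id": "t1", "is_error": False},
        {"event": "run_end", "error": None},
    ]
    print(check_run(sample, ["clawhub search"]))
```

Under these assumptions, case 03 would pass required_commands such as clawhub search, clawhub install, review-skill-security.mjs, and clawhub update. The ordering rules for cases 04 and 05 (no workaround commands before confirmation, install blocked until consent) need event positions as well as substrings, so they are left out of this sketch.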

Notes

These are agent-driven tests; both prompt intent and run-log evidence are evaluated.
Setting SMC_DATA_DIR=~/.super-multica-e2e keeps benchmark runs from polluting normal user skill/session data.
If a case fails, open manifest.tsv and inspect the matching session_dir/run-log.jsonl.
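
The triage step in the last note can be sketched as follows. This is a minimal illustration under stated assumptions: manifest.tsv is assumed to have a header row with columns named provider, case, status, and session_dir, and passing runs are assumed to carry the status value pass. The real column names and status values may differ, so check the file produced by your run. The helper names failed_runs and run_log_path are hypothetical.

```python
import csv
import io
from pathlib import Path
from typing import Dict, List


def failed_runs(manifest_text: str, ok_status: str = "pass") -> List[Dict[str, str]]:
    """Return manifest rows whose status differs from ok_status (assumed value)."""
    reader = csv.DictReader(io.StringIO(manifest_text), delimiter="\t")
    return [row for row in reader if row.get("status") != ok_status]


def run_log_path(row: Dict[str, str]) -> Path:
    """Build the run-log location for one manifest row (assumed session_dir column)."""
    return Path(row["session_dir"]) / "run-log.jsonl"


if __name__ == "__main__":
    # Hypothetical manifest content, for illustration only.
    sample = (
        "provider\tcase\tstatus\tsession_dir\n"
        "kimi-coding\tcase-01\tpass\t/tmp/sessions/a\n"
        "kimi-coding\tcase-02\tfail\t/tmp/sessions/b\n"
    )
    for row in failed_runs(sample):
        print(row["provider"], row["case"], run_log_path(row))
```

To use it on a real run, read the text of manifest.tsv from the run's output directory and pass it to failed_runs.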