Agent-Driven E2E Testing Guide
This guide teaches Coding Agents (Claude Code, etc.) how to perform automated end-to-end testing of Super Multica features. Unlike traditional test frameworks, the Coding Agent itself is the test runner and oracle — it executes the agent, reads structured logs, and intelligently analyzes the results.
Overview
The testing flow:
- Coding Agent runs `pnpm multica run --run-log "test prompt"`
- The agent engine executes the prompt with full structured logging
- Coding Agent reads the `run-log.jsonl` and `session.jsonl` files
- Coding Agent analyzes events, tool calls, and behavior for correctness
This approach is superior to static assertions because:
- The AI can understand intent — did the agent do what the prompt asked?
- It can reason about intermediate process — were the right tools called in the right order?
- It can detect subtle issues — token counts that don't make sense, unnecessary retries, missing events
Prerequisites
- Credentials configured: Run `pnpm multica credentials init` or ensure `~/.super-multica/credentials.json5` has valid provider credentials
- Available providers: Check with `pnpm multica profile list` or inspect the credentials file
- Default provider: `kimi-coding` (Kimi Code, free tier available). Can override with `--provider`
- `MULTICA_API_URL`: Required for `web_search` and `data` tools. Set to `https://api-dev.copilothub.ai` for the dev environment. Without this, web search and financial data tools will fail with `MULTICA_API_URL is required`
- `SMC_DATA_DIR`: Set to `~/.super-multica-e2e` to isolate E2E test sessions from dev (`~/.super-multica-dev`) and production (`~/.super-multica`) data. Without this, test sessions pollute the production sessions directory
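The two environment prerequisites can be verified before any run starts. The following is a minimal sketch, not part of the CLI: `check_e2e_env` is a hypothetical helper name, and it only inspects the two variables named above.

```python
import os
from typing import Mapping


def check_e2e_env(env: Mapping[str, str]) -> list[str]:
    """Return one warning per missing E2E environment variable (hypothetical helper)."""
    warnings = []
    if not env.get("SMC_DATA_DIR"):
        warnings.append(
            "SMC_DATA_DIR is unset: test sessions will pollute the production sessions directory"
        )
    if not env.get("MULTICA_API_URL"):
        warnings.append("MULTICA_API_URL is unset: web_search and data tools will fail")
    return warnings


if __name__ == "__main__":
    # Print warnings for the current shell environment.
    for warning in check_e2e_env(os.environ):
        print("WARNING:", warning)
```

An empty list means both variables are set; it does not confirm that the credentials file is valid.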
Running a Test
Environment variables
All E2E test commands should include these env vars:
# SMC_DATA_DIR — isolates test sessions from dev/production
# MULTICA_API_URL — enables web_search and data tools
export SMC_DATA_DIR=~/.super-multica-e2e
export MULTICA_API_URL=https://api-dev.copilothub.ai
Basic command
# For prompts that only need exec/read/write tools:
SMC_DATA_DIR=~/.super-multica-e2e pnpm multica run --run-log "your test prompt here"
# For prompts that need web_search or data tools:
SMC_DATA_DIR=~/.super-multica-e2e MULTICA_API_URL=https://api-dev.copilothub.ai pnpm multica run --run-log "your test prompt here"
With provider override
SMC_DATA_DIR=~/.super-multica-e2e MULTICA_API_URL=https://api-dev.copilothub.ai pnpm multica run --run-log --provider claude-code "your test prompt"
SMC_DATA_DIR=~/.super-multica-e2e MULTICA_API_URL=https://api-dev.copilothub.ai pnpm multica run --run-log --provider kimi-coding "your test prompt"
Resume a session (multi-turn testing)
# First turn
SMC_DATA_DIR=~/.super-multica-e2e MULTICA_API_URL=https://api-dev.copilothub.ai pnpm multica run --run-log "Create a file called test.txt with content 'hello'"
# Note the session ID from stderr output: [session: 019c584a-...]
# Second turn (same session)
SMC_DATA_DIR=~/.super-multica-e2e MULTICA_API_URL=https://api-dev.copilothub.ai pnpm multica run --run-log --session 019c584a-... "Read the file test.txt and tell me its content"
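When a Coding Agent scripts these turns rather than typing them, the command line and environment can be assembled programmatically. This is a sketch under stated assumptions: `build_run_command` and `build_env` are hypothetical helper names, and only the flags shown in this guide (`--run-log`, `--provider`, `--session`) are used.

```python
import os
from typing import Optional

# Values copied from this guide; adjust them if your environment differs.
E2E_ENV = {
    "SMC_DATA_DIR": os.path.expanduser("~/.super-multica-e2e"),
    "MULTICA_API_URL": "https://api-dev.copilothub.ai",
}


def build_run_command(
    prompt: str,
    session: Optional[str] = None,
    provider: Optional[str] = None,
) -> list[str]:
    """Build the argv for one `pnpm multica run` turn (hypothetical helper)."""
    argv = ["pnpm", "multica", "run", "--run-log"]
    if provider:
        argv += ["--provider", provider]
    if session:
        argv += ["--session", session]  # resume an existing session
    argv.append(prompt)
    return argv


def build_env() -> dict[str, str]:
    """Current environment plus the E2E isolation variables."""
    return {**os.environ, **E2E_ENV}
```

The result can be passed to `subprocess.run(argv, env=build_env(), capture_output=True, text=True)`; the session ID for the second turn then comes from the captured stderr.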
Cleanup
# Remove all E2E test sessions
rm -rf ~/.super-multica-e2e
Output
The CLI prints metadata to stderr:
[session: 019c584a-7753-762d-9fb9-9eb0a8187df5]
[session-dir: /Users/you/.super-multica/sessions/019c584a-7753-762d-9fb9-9eb0a8187df5]
Agent text output goes to stdout.
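Because the metadata lines have a fixed bracketed shape, they can be extracted from captured stderr with two regular expressions. A minimal sketch, assuming the lines look exactly like the example above; `parse_run_metadata` is a hypothetical name.

```python
import re
from typing import Optional

# Match the bracketed metadata lines as shown in the example output.
SESSION_RE = re.compile(r"^\[session: ([^\]]+)\][ \t\r]*$", re.MULTILINE)
SESSION_DIR_RE = re.compile(r"^\[session-dir: ([^\]]+)\][ \t\r]*$", re.MULTILINE)


def parse_run_metadata(stderr: str) -> dict[str, Optional[str]]:
    """Extract the session ID and session directory from CLI stderr."""
    session = SESSION_RE.search(stderr)
    session_dir = SESSION_DIR_RE.search(stderr)
    return {
        "session": session.group(1) if session else None,
        "session_dir": session_dir.group(1) if session_dir else None,
    }
```

A `None` value signals that the expected line was missing, which is itself worth reporting as a test failure.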
Reading Results
After a run, two files contain the data needed for analysis:
run-log.jsonl
Location: {session-dir}/run-log.jsonl
Each line is a JSON object with structured event data. Read this file to understand what happened during execution.
{"ts":1739000001,"event":"run_start","prompt":"What is 2+2?","provider":"kimi-coding","model":"kimi-k2-thinking","messages":0}
{"ts":1739000002,"event":"llm_call","provider":"kimi-coding","model":"kimi-k2-thinking","messages":2}
{"ts":1739000005,"event":"llm_result","duration_ms":3000}
{"ts":1739000005,"event":"run_end","duration_ms":4000,"error":null,"text":"4"}
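Since every line is an independent JSON object, the file can be parsed line by line. The sketch below assumes only that format; `parse_run_log` and `load_run_log` are hypothetical helper names.

```python
import json
from pathlib import Path


def parse_run_log(text: str) -> list[dict]:
    """Parse JSONL text into a list of event dicts, skipping blank lines."""
    events = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        try:
            events.append(json.loads(line))
        except json.JSONDecodeError as exc:
            # A malformed line usually means a truncated or corrupted log.
            raise ValueError(f"run-log line {line_no} is not valid JSON: {exc}") from exc
    return events


def load_run_log(session_dir: str) -> list[dict]:
    """Read {session-dir}/run-log.jsonl and parse it."""
    path = Path(session_dir) / "run-log.jsonl"
    return parse_run_log(path.read_text(encoding="utf-8"))
```

Raising on a malformed line, rather than skipping it, keeps log corruption visible to the analysis step.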
session.jsonl
Location: {session-dir}/session.jsonl
Contains the full conversation transcript (user messages, assistant replies, tool calls and results). Read this for message content analysis.
Run-Log Event Reference
Source of truth: `packages/core/src/agent/run-log.ts` (JSDoc at top of file)
Lifecycle Events
| Event | Fields | Description |
|---|---|---|
| `run_start` | `prompt`, `internal`, `provider`, `model`, `messages` | Agent run begins |
| `run_end` | `duration_ms`, `error`, `text`, `aborted?` | Agent run completes |
LLM Interaction
| Event | Fields | Description |
|---|---|---|
| `llm_call` | `provider`, `model`, `profile`, `messages` | LLM API request sent |
| `llm_result` | `duration_ms` | LLM API response received |
Tool Execution
| Event | Fields | Description |
|---|---|---|
| `tool_start` | `tool`, `args` | Tool execution begins |
| `tool_end` | `tool`, `duration_ms`, `is_error` | Tool execution completes |
Context Management
| Event | Fields | Description |
|---|---|---|
| `preflight_compact_start` | `utilization`, `trigger`, `messages`, `est_tokens` | Preflight compaction triggered |
| `preflight_compact_end` | `messages_before`, `messages_after`, `pruned` | Preflight compaction done |
| `tool_result_pruning` | `soft_trimmed`, `hard_cleared`, `chars_saved`, `phase`, `tokens_before?`, `tokens_after?` | Tool result pruning (Phase 1) |
| `compaction` | `removed`, `kept`, `tokens_removed`, `tokens_kept`, `reason`, `pruning_stats?` | Summary compaction (Phase 2) |
| `compaction_detail` | `pre_pruning_tokens`, `post_compaction_tokens`, `messages_removed`, `reason`, `pruning_applied` | Detailed compaction breakdown |
Error Recovery
| Event | Fields | Description |
|---|---|---|
| `context_overflow` | `attempt`, `messages_before` | Context window overflow detected |
| `context_overflow_compacted` | `messages_after`, `tokens_removed` | Recovered via compaction |
| `context_overflow_forced` | `messages_before`, `messages_after` | Recovered via forced drop |
| `error_classify` | `error`, `reason`, `rotatable` | Error classified for rotation |
| `auth_rotate` | `from`, `to`, `reason` | Auth profile rotated |
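The tables above can double as a lookup for spotting event types or fields that the reference does not document. This is a sketch, not a schema from the codebase: the field sets are transcribed from the tables, `ts` and `event` are treated as common fields because they appear on every sample line, and field presence is not enforced because the sample log omits some listed fields (for example `internal` and `profile`).

```python
COMMON_FIELDS = {"ts", "event"}  # assumption: present on every event, as in the sample log

# Transcribed from the reference tables; optional fields ("?") are listed without the marker.
KNOWN_FIELDS = {
    "run_start": {"prompt", "internal", "provider", "model", "messages"},
    "run_end": {"duration_ms", "error", "text", "aborted"},
    "llm_call": {"provider", "model", "profile", "messages"},
    "llm_result": {"duration_ms"},
    "tool_start": {"tool", "args"},
    "tool_end": {"tool", "duration_ms", "is_error"},
    "preflight_compact_start": {"utilization", "trigger", "messages", "est_tokens"},
    "preflight_compact_end": {"messages_before", "messages_after", "pruned"},
    "tool_result_pruning": {"soft_trimmed", "hard_cleared", "chars_saved", "phase",
                            "tokens_before", "tokens_after"},
    "compaction": {"removed", "kept", "tokens_removed", "tokens_kept", "reason",
                   "pruning_stats"},
    "compaction_detail": {"pre_pruning_tokens", "post_compaction_tokens",
                          "messages_removed", "reason", "pruning_applied"},
    "context_overflow": {"attempt", "messages_before"},
    "context_overflow_compacted": {"messages_after", "tokens_removed"},
    "context_overflow_forced": {"messages_before", "messages_after"},
    "error_classify": {"error", "reason", "rotatable"},
    "auth_rotate": {"from", "to", "reason"},
}


def find_undocumented(events: list[dict]) -> list[str]:
    """Report event types and fields that are absent from the reference tables."""
    findings = []
    for index, event in enumerate(events):
        name = event.get("event")
        if name not in KNOWN_FIELDS:
            findings.append(f"event {index}: unknown event type {name!r}")
            continue
        extra = set(event) - KNOWN_FIELDS[name] - COMMON_FIELDS
        if extra:
            findings.append(f"event {index} ({name}): undocumented fields {sorted(extra)}")
    return findings
```

A finding does not necessarily indicate a bug; it may mean this reference has drifted from `run-log.ts`, which remains the source of truth.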
Feature Test Playbooks
1. Basic Prompt Completion
Goal: Verify the agent can complete a simple prompt end-to-end.
pnpm multica run --run-log "What is the capital of France? Reply in one word."
What to check in run-log:
- `run_start` event exists with correct provider
- `llm_call` → `llm_result` pair exists (at least one)
- `run_end` event has `error: null`
- `run_end.duration_ms` is reasonable (< 30s for simple prompt)
What to check in output:
- Text contains "Paris"
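The checks for this playbook can be mechanized once the run-log has been parsed into a list of dicts. A minimal sketch; `check_basic_completion` is a hypothetical name, and the 30-second limit is the threshold quoted above.

```python
def check_basic_completion(
    events: list[dict],
    output: str,
    expected_provider: str,
    expected_text: str = "Paris",
    max_ms: int = 30_000,
) -> list[str]:
    """Return a list of problems; an empty list means the playbook passed."""
    problems = []
    starts = [e for e in events if e.get("event") == "run_start"]
    ends = [e for e in events if e.get("event") == "run_end"]

    if not starts:
        problems.append("missing run_start")
    elif starts[0].get("provider") != expected_provider:
        problems.append(f"unexpected provider: {starts[0].get('provider')!r}")

    # At least one llm_call and one llm_result must be present.
    kinds = [e.get("event") for e in events]
    if "llm_call" not in kinds or "llm_result" not in kinds:
        problems.append("no llm_call/llm_result pair")

    if not ends:
        problems.append("missing run_end")
    else:
        end = ends[-1]
        if end.get("error") is not None:
            problems.append(f"run_end.error is {end['error']!r}")
        duration = end.get("duration_ms")
        if duration is None or duration >= max_ms:
            problems.append(f"run_end.duration_ms out of range: {duration!r}")

    if expected_text.lower() not in output.lower():
        problems.append(f"output does not contain {expected_text!r}")
    return problems
```

The text comparison is deliberately loose (case-insensitive substring); whether the answer is actually correct is still a judgment for the Coding Agent.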
2. Tool Usage
Goal: Verify tools are called correctly when the prompt requires them.
pnpm multica run --run-log --cwd /tmp "List the files in the current directory"
What to check in run-log:
- `tool_start` event with `tool: "exec"` or similar filesystem tool
- Matching `tool_end` with `is_error: false`
- Tool called before final `run_end`
What to check in output:
- Output contains actual file names from /tmp
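Pairing `tool_start` with `tool_end` is easy to get wrong by eye on longer logs. The sketch below pairs events by tool name in log order, which is an assumption: the reference lists no call ID field, so concurrent calls to the same tool cannot be told apart. `check_tool_usage` is a hypothetical name.

```python
def check_tool_usage(events: list[dict]) -> list[str]:
    """Return problems found in tool_start/tool_end pairing and ordering."""
    problems = []
    open_tools: list = []  # names of tools started but not yet ended
    saw_tool = False
    run_ended = False

    for event in events:
        kind = event.get("event")
        name = event.get("tool")
        if kind == "run_start":
            run_ended = False  # a resumed session starts a new run
        elif kind == "run_end":
            run_ended = True
        elif kind == "tool_start":
            saw_tool = True
            if run_ended:
                problems.append(f"tool {name!r} started after run_end")
            open_tools.append(name)
        elif kind == "tool_end":
            if name in open_tools:
                open_tools.remove(name)
            else:
                problems.append(f"tool_end for {name!r} without tool_start")
            if event.get("is_error"):
                problems.append(f"tool {name!r} reported is_error")

    if not saw_tool:
        problems.append("no tool_start event")
    for name in open_tools:
        problems.append(f"tool {name!r} never finished")
    return problems
```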
3. Context Compaction
Goal: Verify compaction works correctly on long sessions.
# Build up a long session to trigger compaction
pnpm multica run --run-log "Write a detailed 2000-word essay about climate change"
# Note session ID, then continue:
pnpm multica run --run-log --session {id} "Now write another 2000-word essay about renewable energy"
pnpm multica run --run-log --session {id} "Summarize both essays in 3 bullet points"
What to check in run-log:
- `preflight_compact_start` appears when utilization exceeds trigger ratio
- `tool_result_pruning` shows `soft_trimmed > 0` or `hard_cleared > 0` if tool results were pruned
- `compaction` event has `tokens_removed > 0` (not near-zero like the bug we fixed)
- `compaction_detail` shows `pre_pruning_tokens` > `post_compaction_tokens`
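These numeric conditions translate directly into comparisons on the parsed events. A minimal sketch; `check_compaction` is a hypothetical name, and `require_compaction` is an illustrative switch for sessions that may legitimately stay below the trigger ratio.

```python
def check_compaction(events: list[dict], require_compaction: bool = True) -> list[str]:
    """Return problems found in compaction-related events."""
    problems = []
    saw_compaction = False

    for event in events:
        kind = event.get("event")
        if kind == "compaction":
            saw_compaction = True
            if event.get("tokens_removed", 0) <= 0:
                problems.append("compaction removed no tokens")
        elif kind == "compaction_detail":
            before = event.get("pre_pruning_tokens")
            after = event.get("post_compaction_tokens")
            if before is None or after is None or before <= after:
                problems.append(
                    f"compaction_detail did not shrink context: {before!r} -> {after!r}"
                )
        elif kind == "tool_result_pruning":
            trimmed = event.get("soft_trimmed", 0)
            cleared = event.get("hard_cleared", 0)
            if trimmed <= 0 and cleared <= 0:
                problems.append("tool_result_pruning trimmed and cleared nothing")

    if require_compaction and not saw_compaction:
        problems.append("no compaction event found")
    return problems
```

Token counts in these events are estimates, so the comparisons check direction only, not exact values.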
4. Multi-Provider Comparison
Goal: Verify the same prompt works across different providers.
pnpm multica run --run-log --provider kimi-coding "Explain recursion in 2 sentences"
pnpm multica run --run-log --provider claude-code "Explain recursion in 2 sentences"
What to check:
- Both runs complete without errors
- Both `run_end` events have `error: null`
- Compare `llm_result.duration_ms` across providers
- Both outputs are meaningful explanations of recursion
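To compare providers side by side, each run-log can be reduced to a small summary. A sketch under the assumption that each run-log has been parsed into a list of dicts; `summarize_run` is a hypothetical name.

```python
def summarize_run(events: list[dict]) -> dict:
    """Reduce one run-log to the fields needed for a provider comparison."""
    provider = next(
        (e.get("provider") for e in events if e.get("event") == "run_start"), None
    )
    llm_durations = [
        e["duration_ms"]
        for e in events
        if e.get("event") == "llm_result" and "duration_ms" in e
    ]
    end = next((e for e in reversed(events) if e.get("event") == "run_end"), None)
    return {
        "provider": provider,
        "llm_calls": len(llm_durations),
        "llm_ms_total": sum(llm_durations),
        # A missing run_end is reported as an error string rather than None.
        "error": end.get("error") if end else "missing run_end",
    }
```

Printing `summarize_run(...)` for both runs gives a quick latency comparison; judging whether each output is a meaningful explanation remains a reading task.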
5. Error Handling & Auth Rotation
Goal: Verify error recovery when credentials are invalid.
pnpm multica run --run-log --provider anthropic --api-key "sk-invalid-key" "Hello"
What to check in run-log:
- `error_classify` event with `reason: "auth"`
- `auth_rotate` event if multiple profiles are configured
- `run_end` with appropriate error message if no valid profiles exist
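The expected outcome depends on how many profiles are configured, so a check needs that as an input. The sketch below is one interpretation of the list above, not the engine's actual rules: it assumes that a run with no `auth_rotate` event should end with a non-null `error`. `check_auth_failure` is a hypothetical name.

```python
def check_auth_failure(events: list[dict], multiple_profiles: bool) -> list[str]:
    """Return problems found in the error-recovery events of an invalid-key run."""
    problems = []
    classified = [e for e in events if e.get("event") == "error_classify"]
    rotated = any(e.get("event") == "auth_rotate" for e in events)
    end = next((e for e in reversed(events) if e.get("event") == "run_end"), None)

    if not any(e.get("reason") == "auth" for e in classified):
        problems.append("no error_classify event with reason 'auth'")
    if multiple_profiles and not rotated:
        problems.append("multiple profiles configured but no auth_rotate event")
    if end is None:
        problems.append("missing run_end")
    elif not rotated and not end.get("error"):
        # Assumption: without rotation there was no valid profile to fall back to.
        problems.append("run_end has no error although no rotation happened")
    return problems
```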
Analysis Patterns
When analyzing run-logs, look for these patterns:
Healthy Run
run_start → llm_call → llm_result → run_end (error: null)
Run with Tool Usage
run_start → llm_call → llm_result → tool_start → tool_end → llm_call → llm_result → run_end
Run with Compaction
run_start → preflight_compact_start → tool_result_pruning → preflight_compact_end → llm_call → ...
Red Flags
- `run_end` without preceding `run_start` (log corruption)
- `tool_start` without matching `tool_end` (tool hang/crash)
- `compaction` with `tokens_removed` near zero (compaction ineffective)
- Multiple `error_classify` events (repeated failures)
- `context_overflow_forced` (emergency fallback; should be rare)
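The red flags above can be scanned for in a single pass over the parsed events. A minimal sketch; `find_red_flags` is a hypothetical name, and the `near_zero_tokens` threshold of 100 is an arbitrary illustrative value, since this guide does not define "near zero".

```python
from collections import Counter


def find_red_flags(events: list[dict], near_zero_tokens: int = 100) -> list[str]:
    """Return the red flags present in one run-log."""
    flags = []
    counts = Counter(e.get("event") for e in events)
    run_open = False

    for event in events:
        kind = event.get("event")
        if kind == "run_start":
            run_open = True
        elif kind == "run_end":
            if not run_open:
                flags.append("run_end without preceding run_start")
            run_open = False
        elif kind == "compaction":
            # Threshold is an assumption; tune it to your context sizes.
            if event.get("tokens_removed", 0) < near_zero_tokens:
                flags.append("compaction with tokens_removed near zero")

    if counts["tool_start"] > counts["tool_end"]:
        flags.append("tool_start without matching tool_end")
    if counts["error_classify"] > 1:
        flags.append("multiple error_classify events")
    if counts["context_overflow_forced"]:
        flags.append("context_overflow_forced occurred")
    return flags
```

An empty result means none of the listed red flags were found; it does not prove the run did what the prompt asked.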
Creating a New Test Playbook
When a new feature is implemented, create a test playbook following this template:
### N. Feature Name
**Goal**: One sentence describing what to verify.
**Command**:
\`\`\`bash
pnpm multica run --run-log [options] "prompt that exercises the feature"
\`\`\`
**What to check in run-log**:
- List specific events and field values to verify
- Include both positive checks (event exists) and negative checks (no errors)
**What to check in output**:
- What the text output should contain or look like
**What to check in session.jsonl** (if applicable):
- Specific message patterns to verify
Tips for Coding Agents
- Always use `--run-log`: without it, there's no structured data to analyze
- Use `--cwd` to control the working directory for file-related tests
- Read run-log line by line: each line is independent JSON, so parse each one individually
- Check event ordering: events are chronologically ordered by `ts`
- Token counts are estimates: don't expect exact values, check for reasonable ranges
- Clean up test sessions: after testing, remove the E2E data directory (`rm -rf ~/.super-multica-e2e`) to avoid clutter
- Use `--provider` to test specific providers; it defaults to whatever is configured in credentials
- For multi-turn tests, always capture and reuse the session ID from the first run