web_search and data tools authenticate via auth.json (sid + deviceId). When SMC_DATA_DIR is set (e.g. for E2E tests), the auth file may not exist in the custom dir. Now getLocalAuth() falls back to ~/.super-multica-dev/auth.json, which is created by pnpm dev:local Desktop login and valid for the dev backend (api-dev.copilothub.ai). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
295 lines
11 KiB
Markdown
295 lines
11 KiB
Markdown
# Agent-Driven E2E Testing Guide
|
|
|
|
This guide teaches Coding Agents (Claude Code, etc.) how to perform automated end-to-end testing of Super Multica features. Unlike traditional test frameworks, **the Coding Agent itself is the test runner and oracle** — it executes the agent, reads structured logs, and intelligently analyzes the results.
|
|
|
|
## Overview
|
|
|
|
The testing flow:
|
|
|
|
1. Coding Agent runs `pnpm multica run --run-log "test prompt"`
|
|
2. The agent engine executes the prompt with full structured logging
|
|
3. Coding Agent reads the `run-log.jsonl` and `session.jsonl` files
|
|
4. Coding Agent analyzes events, tool calls, and behavior for correctness
|
|
|
|
This approach is superior to static assertions because:
|
|
- The AI can understand **intent** — did the agent do what the prompt asked?
|
|
- It can reason about **intermediate process** — were the right tools called in the right order?
|
|
- It can detect **subtle issues** — token counts that don't make sense, unnecessary retries, missing events
|
|
|
|
## Prerequisites
|
|
|
|
1. **Credentials configured**: Run `pnpm multica credentials init` or ensure `~/.super-multica/credentials.json5` has valid provider credentials
|
|
2. **Available providers**: Check with `pnpm multica profile list` or inspect credentials file
|
|
3. **Default provider**: `kimi-coding` (Kimi Code, free tier available). Can override with `--provider`
|
|
4. **`MULTICA_API_URL`**: Required for `web_search` and `data` tools. Set to `https://api-dev.copilothub.ai` for dev environment. Without this, web search and financial data tools will fail with `MULTICA_API_URL is required`
|
|
5. **`SMC_DATA_DIR`**: Set to `~/.super-multica-e2e` to isolate E2E test sessions from dev (`~/.super-multica-dev`) and production (`~/.super-multica`) data. Without this, test sessions pollute the production sessions directory
|
|
6. **Dev auth for `web_search`/`data` tools**: These tools authenticate via `auth.json` (session ID + device ID). The auth store automatically falls back to `~/.super-multica-dev/auth.json` when the E2E data dir has no auth. If `~/.super-multica-dev/auth.json` doesn't exist, run `pnpm dev:local` first and log in through the Desktop app to create it
|
|
|
|
## Running a Test
|
|
|
|
### Environment variables
|
|
|
|
All E2E test commands should include these env vars:
|
|
|
|
```bash
|
|
# SMC_DATA_DIR — isolates test sessions from dev/production
|
|
# MULTICA_API_URL — enables web_search and data tools
|
|
export SMC_DATA_DIR=~/.super-multica-e2e
|
|
export MULTICA_API_URL=https://api-dev.copilothub.ai
|
|
```
|
|
|
|
### Basic command
|
|
|
|
```bash
|
|
# For prompts that only need exec/read/write tools:
|
|
SMC_DATA_DIR=~/.super-multica-e2e pnpm multica run --run-log "your test prompt here"
|
|
|
|
# For prompts that need web_search or data tools:
|
|
SMC_DATA_DIR=~/.super-multica-e2e MULTICA_API_URL=https://api-dev.copilothub.ai pnpm multica run --run-log "your test prompt here"
|
|
```
|
|
|
|
### With provider override
|
|
|
|
```bash
|
|
SMC_DATA_DIR=~/.super-multica-e2e MULTICA_API_URL=https://api-dev.copilothub.ai pnpm multica run --run-log --provider claude-code "your test prompt"
|
|
SMC_DATA_DIR=~/.super-multica-e2e MULTICA_API_URL=https://api-dev.copilothub.ai pnpm multica run --run-log --provider kimi-coding "your test prompt"
|
|
```
|
|
|
|
### Resume a session (multi-turn testing)
|
|
|
|
```bash
|
|
# First turn
|
|
SMC_DATA_DIR=~/.super-multica-e2e MULTICA_API_URL=https://api-dev.copilothub.ai pnpm multica run --run-log "Create a file called test.txt with content 'hello'"
|
|
# Note the session ID from stderr output: [session: 019c584a-...]
|
|
|
|
# Second turn (same session)
|
|
SMC_DATA_DIR=~/.super-multica-e2e MULTICA_API_URL=https://api-dev.copilothub.ai pnpm multica run --run-log --session 019c584a-... "Read the file test.txt and tell me its content"
|
|
```
|
|
|
|
### Cleanup
|
|
|
|
```bash
|
|
# Remove all E2E test sessions
|
|
rm -rf ~/.super-multica-e2e
|
|
```
|
|
|
|
### Output
|
|
|
|
The CLI prints metadata to stderr:
|
|
```
|
|
[session: 019c584a-7753-762d-9fb9-9eb0a8187df5]
|
|
[session-dir: /Users/you/.super-multica/sessions/019c584a-7753-762d-9fb9-9eb0a8187df5]
|
|
```
|
|
|
|
Agent text output goes to stdout.
|
|
|
|
## Reading Results
|
|
|
|
After a run, two files contain the data needed for analysis:
|
|
|
|
### run-log.jsonl
|
|
|
|
Location: `{session-dir}/run-log.jsonl`
|
|
|
|
Each line is a JSON object with structured event data. Read this file to understand **what happened during execution**.
|
|
|
|
```jsonl
|
|
{"ts":1739000001,"event":"run_start","prompt":"What is 2+2?","provider":"kimi-coding","model":"kimi-k2-thinking","messages":0}
|
|
{"ts":1739000002,"event":"llm_call","provider":"kimi-coding","model":"kimi-k2-thinking","messages":2}
|
|
{"ts":1739000005,"event":"llm_result","duration_ms":3000}
|
|
{"ts":1739000005,"event":"run_end","duration_ms":4000,"error":null,"text":"4"}
|
|
```
|
|
|
|
### session.jsonl
|
|
|
|
Location: `{session-dir}/session.jsonl`
|
|
|
|
Contains the full conversation transcript (user messages, assistant replies, tool calls and results). Read this for **message content analysis**.
|
|
|
|
## Run-Log Event Reference
|
|
|
|
> Source of truth: `packages/core/src/agent/run-log.ts` (JSDoc at top of file)
|
|
|
|
### Lifecycle Events
|
|
|
|
| Event | Fields | Description |
|
|
|-------|--------|-------------|
|
|
| `run_start` | prompt, internal, provider, model, messages | Agent run begins |
|
|
| `run_end` | duration_ms, error, text, aborted? | Agent run completes |
|
|
|
|
### LLM Interaction
|
|
|
|
| Event | Fields | Description |
|
|
|-------|--------|-------------|
|
|
| `llm_call` | provider, model, profile, messages | LLM API request sent |
|
|
| `llm_result` | duration_ms | LLM API response received |
|
|
|
|
### Tool Execution
|
|
|
|
| Event | Fields | Description |
|
|
|-------|--------|-------------|
|
|
| `tool_start` | tool, args | Tool execution begins |
|
|
| `tool_end` | tool, duration_ms, is_error | Tool execution completes |
|
|
|
|
### Context Management
|
|
|
|
| Event | Fields | Description |
|
|
|-------|--------|-------------|
|
|
| `preflight_compact_start` | utilization, trigger, messages, est_tokens | Preflight compaction triggered |
|
|
| `preflight_compact_end` | messages_before, messages_after, pruned | Preflight compaction done |
|
|
| `tool_result_pruning` | soft_trimmed, hard_cleared, chars_saved, phase, tokens_before?, tokens_after? | Tool result pruning (Phase 1) |
|
|
| `compaction` | removed, kept, tokens_removed, tokens_kept, reason, pruning_stats? | Summary compaction (Phase 2) |
|
|
| `compaction_detail` | pre_pruning_tokens, post_compaction_tokens, messages_removed, reason, pruning_applied | Detailed compaction breakdown |
|
|
|
|
### Error Recovery
|
|
|
|
| Event | Fields | Description |
|
|
|-------|--------|-------------|
|
|
| `context_overflow` | attempt, messages_before | Context window overflow detected |
|
|
| `context_overflow_compacted` | messages_after, tokens_removed | Recovered via compaction |
|
|
| `context_overflow_forced` | messages_before, messages_after | Recovered via forced drop |
|
|
| `error_classify` | error, reason, rotatable | Error classified for rotation |
|
|
| `auth_rotate` | from, to, reason | Auth profile rotated |
|
|
|
|
## Feature Test Playbooks
|
|
|
|
### 1. Basic Prompt Completion
|
|
|
|
**Goal**: Verify the agent can complete a simple prompt end-to-end.
|
|
|
|
```bash
|
|
pnpm multica run --run-log "What is the capital of France? Reply in one word."
|
|
```
|
|
|
|
**What to check in run-log**:
|
|
- `run_start` event exists with correct provider
|
|
- `llm_call` → `llm_result` pair exists (at least one)
|
|
- `run_end` event has `error: null`
|
|
- `run_end.duration_ms` is reasonable (< 30s for simple prompt)
|
|
|
|
**What to check in output**:
|
|
- Text contains "Paris"
|
|
|
|
### 2. Tool Usage
|
|
|
|
**Goal**: Verify tools are called correctly when the prompt requires them.
|
|
|
|
```bash
|
|
pnpm multica run --run-log --cwd /tmp "List the files in the current directory"
|
|
```
|
|
|
|
**What to check in run-log**:
|
|
- `tool_start` event with `tool: "exec"` or similar filesystem tool
|
|
- Matching `tool_end` with `is_error: false`
|
|
- Tool called before final `run_end`
|
|
|
|
**What to check in output**:
|
|
- Output contains actual file names from /tmp
|
|
|
|
### 3. Context Compaction
|
|
|
|
**Goal**: Verify compaction works correctly on long sessions.
|
|
|
|
```bash
|
|
# Build up a long session to trigger compaction
|
|
pnpm multica run --run-log "Write a detailed 2000-word essay about climate change"
|
|
# Note session ID, then continue:
|
|
pnpm multica run --run-log --session {id} "Now write another 2000-word essay about renewable energy"
|
|
pnpm multica run --run-log --session {id} "Summarize both essays in 3 bullet points"
|
|
```
|
|
|
|
**What to check in run-log**:
|
|
- `preflight_compact_start` appears when utilization exceeds trigger ratio
|
|
- `tool_result_pruning` shows `soft_trimmed > 0` or `hard_cleared > 0` if tool results were pruned
|
|
- `compaction` event has `tokens_removed > 0` (not near-zero like the bug we fixed)
|
|
- `compaction_detail` shows `pre_pruning_tokens` > `post_compaction_tokens`
|
|
|
|
### 4. Multi-Provider Comparison
|
|
|
|
**Goal**: Verify the same prompt works across different providers.
|
|
|
|
```bash
|
|
pnpm multica run --run-log --provider kimi-coding "Explain recursion in 2 sentences"
|
|
pnpm multica run --run-log --provider claude-code "Explain recursion in 2 sentences"
|
|
```
|
|
|
|
**What to check**:
|
|
- Both runs complete without errors
|
|
- Both `run_end` events have `error: null`
|
|
- Compare `llm_result.duration_ms` across providers
|
|
- Both outputs are meaningful explanations of recursion
|
|
|
|
### 5. Error Handling & Auth Rotation
|
|
|
|
**Goal**: Verify error recovery when credentials are invalid.
|
|
|
|
```bash
|
|
pnpm multica run --run-log --provider anthropic --api-key "sk-invalid-key" "Hello"
|
|
```
|
|
|
|
**What to check in run-log**:
|
|
- `error_classify` event with `reason: "auth"`
|
|
- `auth_rotate` event if multiple profiles are configured
|
|
- `run_end` with appropriate error message if no valid profiles exist
|
|
|
|
## Analysis Patterns
|
|
|
|
When analyzing run-logs, look for these patterns:
|
|
|
|
### Healthy Run
|
|
```
|
|
run_start → llm_call → llm_result → run_end (error: null)
|
|
```
|
|
|
|
### Run with Tool Usage
|
|
```
|
|
run_start → llm_call → llm_result → tool_start → tool_end → llm_call → llm_result → run_end
|
|
```
|
|
|
|
### Run with Compaction
|
|
```
|
|
run_start → preflight_compact_start → tool_result_pruning → preflight_compact_end → llm_call → ...
|
|
```
|
|
|
|
### Red Flags
|
|
- `run_end` without preceding `run_start` (log corruption)
|
|
- `tool_start` without matching `tool_end` (tool hang/crash)
|
|
- `compaction` with `tokens_removed` near zero (compaction ineffective)
|
|
- Multiple `error_classify` events (repeated failures)
|
|
- `context_overflow_forced` (emergency fallback — should be rare)
|
|
|
|
## Creating a New Test Playbook
|
|
|
|
When a new feature is implemented, create a test playbook following this template:
|
|
|
|
```markdown
|
|
### N. Feature Name
|
|
|
|
**Goal**: One sentence describing what to verify.
|
|
|
|
**Command**:
|
|
\`\`\`bash
|
|
pnpm multica run --run-log [options] "prompt that exercises the feature"
|
|
\`\`\`
|
|
|
|
**What to check in run-log**:
|
|
- List specific events and field values to verify
|
|
- Include both positive checks (event exists) and negative checks (no errors)
|
|
|
|
**What to check in output**:
|
|
- What the text output should contain or look like
|
|
|
|
**What to check in session.jsonl** (if applicable):
|
|
- Specific message patterns to verify
|
|
```
|
|
|
|
## Tips for Coding Agents
|
|
|
|
1. **Always use `--run-log`** — without it, there's no structured data to analyze
|
|
2. **Use `--cwd`** to control the working directory for file-related tests
|
|
3. **Read run-log line by line** — each line is independent JSON, parse individually
|
|
4. **Check event ordering** — events are chronologically ordered by `ts`
|
|
5. **Token counts are estimates** — don't expect exact values, check for reasonable ranges
|
|
6. **Clean up test sessions** — after testing, remove session dirs from `~/.super-multica/sessions/` to avoid clutter
|
|
7. **Use `--provider`** to test specific providers — defaults to whatever is configured in credentials
|
|
8. **For multi-turn tests**, always capture and reuse the session ID from the first run
|