feat: pivot to AI-native task management platform (#232)

Replace the agent framework codebase with a new monorepo structure for an
AI-native Linear-like product where agents are first-class citizens.

New architecture:
- server/: Go backend (Chi + gorilla/websocket + sqlc)
  - API server with REST routes for issues, agents, inbox, workspaces
  - WebSocket hub for real-time updates
  - Local daemon entry point for agent runtime connection
  - PostgreSQL migration with 13 tables (issue, agent, inbox, etc.)
  - WebSocket protocol types for server<->daemon communication
- apps/web/: Next.js 16 frontend
  - Dashboard layout with sidebar navigation
  - Route skeleton: inbox, issues, agents, board, settings
- packages/ui/: Preserved shadcn/ui design system (26+ components)
- packages/types/: Full API contract types (Issue, Agent, Workspace, Inbox, Events)
- packages/sdk/: REST ApiClient + WebSocket WSClient
- packages/store/: Zustand stores (issue, agent, inbox, auth)
- packages/hooks/: React hooks (useIssues, useAgents, useInbox, useRealtime)
- packages/utils/: Shared utilities

Removed: apps/cli, apps/desktop, apps/mobile, apps/gateway, packages/core,
skills/, and all agent-framework code.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
parent 3f589d8326
commit d4f5c5b16f
677 changed files with 2779 additions and 122531 deletions

@@ -1,29 +0,0 @@
# Documentation Index (Priority-Based)

This repo keeps documentation intentionally small to reduce stale AI context.
Only workflow/testing/process documentation should be maintained.
Project-intro and architecture explanation docs are intentionally omitted.

## P0 (Keep Fresh)

1. `README.md`
2. `CLAUDE.md`
3. `docs/development.md`
4. `docs/cli.md`
5. `docs/credentials.md`

## P1 (Operational)

1. `docs/skills-and-tools.md`
2. `docs/package-management.md`
3. `docs/e2e-testing-guide.md`

## P2 (Benchmarks / Specialized)

1. `docs/e2e-finance-benchmark.md`
2. `docs/web-tools-policy-optimization.md`

## Regeneration Rule

When code behavior changes, update only impacted P0/P1 docs first.
If unsure, prefer deleting stale sections over keeping speculative content.

129
docs/cli.md
@@ -1,129 +0,0 @@
# CLI Guide (`multica`)

## Entry

```bash
pnpm multica
```

Equivalent command names:

- `multica`
- `mu`

## Core Commands

```bash
multica                        # interactive chat (default)
multica run "<prompt>"         # one-shot run
multica chat                   # explicit interactive mode
multica session <command>      # session management
multica profile <command>      # profile management
multica skills <command>       # skill management
multica tools <command>        # tool policy inspection
multica credentials <command>  # credentials management
multica cron <command>         # scheduled tasks
multica dev [service]          # start dev services
multica help
```

## Run Mode

```bash
multica run [options] <prompt>
echo "prompt" | multica run
```

Common options:

- `--profile <id>`
- `--provider <name>`
- `--model <name>`
- `--session <id>`
- `--cwd <dir>`
- `--run-log`
- `--tools-allow a,b,c`
- `--tools-deny a,b,c`
- `--context-window <tokens>`

## Chat Mode

```bash
multica chat [options]
multica [options]
```

In-chat commands:

- `/help`
- `/exit`
- `/clear`
- `/session`
- `/new`
- `/multiline`
- `/provider`
- `/model`

## Sessions

```bash
multica session list
multica session show <id>
multica session delete <id>
```

Session data root:

- `~/.super-multica/sessions/`
- or `SMC_DATA_DIR/sessions/`

## Profiles

```bash
multica profile list
multica profile new <id>
multica profile setup <id>
multica profile show <id>
multica profile edit <id>
multica profile delete <id>
```

## Skills

```bash
multica skills list
multica skills status [id]
multica skills install <id>
multica skills add <owner/repo[/skill]>
multica skills remove <name>
```

## Tools

```bash
multica tools list
multica tools list --allow group:fs,web_fetch
multica tools list --deny exec
multica tools groups
```

## Credentials

```bash
multica credentials init
multica credentials show
multica credentials edit
```

## Cron

```bash
multica cron status
multica cron list
multica cron add -n "name" --every "30m" --message "..."
multica cron run <id>
multica cron enable <id>
multica cron disable <id>
multica cron remove <id>
multica cron logs <id>
```

@@ -1,76 +0,0 @@
# Credentials Guide

## Initialize

```bash
pnpm multica credentials init
```

This creates:

- `~/.super-multica/credentials.json5`

## Path Resolution

Credential file lookup order:

1. `SMC_CREDENTIALS_PATH` (explicit override)
2. `SMC_DATA_DIR/credentials.json5` (or default data dir)
3. `~/.super-multica/credentials.json5` fallback
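
The lookup order above can be sketched in a few lines of Python. This is an illustration only, not the project's resolver: `resolve_credentials_path` is a hypothetical helper, and it assumes a candidate is chosen as soon as its environment variable is set (whether the real code also checks that the file exists is not stated here).

```python
import os
from pathlib import Path

# Default data dir named in this guide.
DEFAULT_DATA_DIR = Path.home() / ".super-multica"


def resolve_credentials_path(env=None):
    """Return the credentials path following the documented lookup order."""
    env = os.environ if env is None else env
    # 1. Explicit override wins.
    explicit = env.get("SMC_CREDENTIALS_PATH")
    if explicit:
        return Path(explicit).expanduser()
    # 2. Data-dir override, if set.
    data_dir = env.get("SMC_DATA_DIR")
    if data_dir:
        return Path(data_dir).expanduser() / "credentials.json5"
    # 3. Fallback to the default data dir.
    return DEFAULT_DATA_DIR / "credentials.json5"
```

Passing a plain dict instead of `os.environ` keeps the function easy to exercise without touching the real environment.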

## Minimal Template

```json5
{
  version: 1,
  llm: {
    provider: "kimi-coding",
    providers: {
      "kimi-coding": {
        apiKey: "your-key",
      },
    },
  },
  tools: {
    // tool-specific keys
  },
}
```

## Multi-Key Rotation (Per Provider)

You can define multiple keys under one provider namespace:

```json5
{
  llm: {
    providers: {
      "anthropic": { apiKey: "primary" },
      "anthropic:backup": { apiKey: "backup" },
    },
    order: {
      anthropic: ["anthropic", "anthropic:backup"],
    },
  },
}
```
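
One way to picture how the `order` list might be consumed is the Python sketch below. Everything in it is an assumption made for illustration: `call_with_rotation`, `AuthError`, and the caller-supplied `send` function are hypothetical, and the rule "move to the next profile when a key is rejected" is a guess at the intended behavior, not a description of the actual runtime.

```python
class AuthError(Exception):
    """Hypothetical error type standing in for a rejected API key."""


def call_with_rotation(llm_config, provider, send):
    """Try each profile listed in order[provider] until one succeeds.

    send(api_key) is a caller-supplied function (an assumption of this
    sketch) that performs the request and raises AuthError on a bad key.
    """
    # Fall back to the bare provider name when no order is configured.
    profiles = llm_config.get("order", {}).get(provider, [provider])
    last_error = None
    for profile in profiles:
        api_key = llm_config["providers"][profile]["apiKey"]
        try:
            return profile, send(api_key)
        except AuthError as err:
            last_error = err  # rotate to the next profile
    raise RuntimeError(f"all profiles failed for {provider}") from last_error


# Same shape as the json5 example above, written as a Python dict.
config = {
    "providers": {
        "anthropic": {"apiKey": "primary"},
        "anthropic:backup": {"apiKey": "backup"},
    },
    "order": {"anthropic": ["anthropic", "anthropic:backup"]},
}


def fake_send(api_key):
    # Simulate the primary key being rejected.
    if api_key == "primary":
        raise AuthError("invalid key")
    return "ok"


used_profile, result = call_with_rotation(config, "anthropic", fake_send)
```

In this simulated run the first profile fails and the call lands on `anthropic:backup`.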

## OAuth Providers

- `claude-code`: run `claude login`
- `openai-codex`: run `codex login`

API-key providers are configured directly in `credentials.json5`.

## Tool Credentials

Tool credentials are read from:

- `credentials.json5` under `tools`
- skill-level `.env` files under skill directories

## Security

- Keep credentials file mode private (`600` on Unix-like systems).
- Do not commit keys into the repository.
- Prefer isolated data dirs (`SMC_DATA_DIR`) for test/dev environments.
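
The file-mode recommendation can be verified with a small Python check. This is a sketch for Unix-like systems; `is_private` is a hypothetical helper, and the demo runs against a throwaway temp file rather than a real credentials file.

```python
import os
import stat
import tempfile


def is_private(path):
    """Return True when group and others have no permission bits (e.g. 0o600)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0


# Demonstrate on a temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    demo_path = tmp.name
os.chmod(demo_path, 0o600)
private_after_chmod = is_private(demo_path)
os.chmod(demo_path, 0o644)
private_when_world_readable = is_private(demo_path)
os.remove(demo_path)
```

A check like this could run before loading credentials and warn when the file is readable by other users.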

@@ -1,134 +0,0 @@
# Development Guide

## Prerequisites

- Node.js 20+
- pnpm 10+
- macOS/Linux/Windows

## Install

```bash
pnpm install
```

`.npmrc` must keep:

```ini
shamefully-hoist=true
```

## Main Dev Entry Points

```bash
# Recommended local desktop workflow
pnpm dev

# Service-specific
pnpm dev:desktop
pnpm dev:gateway
pnpm dev:web

# Full local stack with isolated dev data
pnpm dev:local
pnpm dev:local:archive
```

## What Each Command Does

- `pnpm dev`: builds shared packages, then runs `types + utils + core + desktop` watch flow.
- `pnpm dev:desktop`: Electron desktop only.
- `pnpm dev:gateway`: NestJS WebSocket gateway (`PORT`, default `3000`).
- `pnpm dev:web`: Next.js web app (`3000` by script).
- `pnpm dev:local`: gateway + web + desktop with dev-safe env defaults.
- `pnpm dev:local:archive`: archive dev data and start fresh.

## Important Environment Variables

- `SMC_DATA_DIR`: override runtime data root (default `~/.super-multica`)
- `GATEWAY_URL`: gateway endpoint for desktop/CLI hub connection
- `MULTICA_API_URL`: required by web/data tools
- `PORT`: gateway/server port
- `MULTICA_WORKSPACE_DIR`: override workspace root
- `MULTICA_RUN_LOG=1`: enable structured run-log output

## Agent / Conversation Semantics

- `agentId`: logical owner identity (capabilities/profile scope).
- `conversationId`: isolated runtime thread under an agent.
- `sessionId`: internal runner/storage identifier for a conversation. External protocols use `conversationId`.

Protocol rules:

- Hub RPC is conversation-first: `createConversation/listConversations/deleteConversation`.
- All message, stream, and verify payloads use `conversationId` (no `sessionId` alias fields).
- New integrations should always pass `conversationId` explicitly.

Telegram behavior:

- One Telegram DM binds to one active `conversationId`.
- `/new` creates and switches to a new conversation.
- `/session <id>` switches the active conversation.
- `/sessions` lists available conversations.

Channel route behavior:

- Runtime route key is `channelId:accountId:externalConversationId`.
- Each route key is bound to one Hub `conversationId`.
- Incoming/outgoing channel traffic is isolated per bound conversation (no global first-agent fallback).
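
The route-key binding described above can be modeled roughly as follows. This is a toy Python sketch, not the gateway's code: `RouteTable` and the sample identifiers are hypothetical, and only the key format and the "no global fallback" rule come from the text.

```python
class RouteTable:
    """Maps channel route keys to Hub conversation ids (illustrative only)."""

    def __init__(self):
        self._bindings = {}

    @staticmethod
    def route_key(channel_id, account_id, external_conversation_id):
        # Key format from the text: channelId:accountId:externalConversationId
        return f"{channel_id}:{account_id}:{external_conversation_id}"

    def bind(self, key, conversation_id):
        # Each route key is bound to exactly one conversation id.
        self._bindings[key] = conversation_id

    def resolve(self, key):
        # No global fallback: an unbound route resolves to None.
        return self._bindings.get(key)


routes = RouteTable()
key = RouteTable.route_key("telegram", "bot-1", "dm-42")
routes.bind(key, "conv-a")
```

Resolving an unbound key returns `None` instead of falling back to some default conversation, which mirrors the isolation rule.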

## Local Full-Stack Notes (`pnpm dev:local`)

`pnpm dev:local` is the recommended way to run the full local stack for integration work.

Setup:

1. `cp .env.example .env`
2. Set `TELEGRAM_BOT_TOKEN` in root `.env`
3. Run `pnpm dev:local`

Services started by the script:

| Service | Address | Notes |
|---------|---------|-------|
| Gateway | `http://localhost:4000` | Telegram long-polling mode (`PORT=4000`) |
| Web | `http://localhost:3000` | OAuth login / frontend |
| Desktop | - | Uses `GATEWAY_URL=http://localhost:4000` and local web URL |

Data/workspace isolation used by the script:

- `SMC_DATA_DIR=~/.super-multica-dev`
- `MULTICA_WORKSPACE_DIR=~/Documents/Multica-dev`

Why this matters:

- avoids polluting production data under `~/.super-multica`
- provides a stable local target for auth/session debugging

Common follow-up:

```bash
pnpm dev:local:archive
```

This archives prior dev data before starting fresh local runs.

## Build / Quality

```bash
pnpm build
pnpm typecheck
pnpm test
pnpm test:coverage
```

## Useful Reset Commands

```bash
# Reset default + dev data dirs used by desktop scripts
pnpm dev:desktop:reset

# Reset and relaunch desktop onboarding flow
pnpm dev:desktop:fresh
pnpm dev:desktop:onboarding
```

@@ -1,112 +0,0 @@
# Finance Agent-Driven E2E Benchmark

This benchmark suite is designed for complex financial analysis scenarios and follows the workflow in `docs/e2e-testing-guide.md`.

## Scope

- Domain: equity, macro, rates, credit, cross-asset allocation
- Complexity: multi-step planning, data collection, analysis, local artifact generation
- Providers: `kimi-coding` and `claude-code`
- Cases: 10

Case prompts are stored in:
- `scripts/e2e-finance-benchmark/cases/`

## Prerequisites

1. Credentials are configured (`pnpm multica credentials init` if needed)
2. Dev auth exists for `web_search`/`data` tools (`~/.super-multica-dev/auth.json`)
3. Required env:

```bash
export SMC_DATA_DIR=~/.super-multica-e2e
export MULTICA_API_URL=https://api-dev.copilothub.ai
```

## Run All Cases (Both Providers)

```bash
scripts/e2e-finance-benchmark/run.sh
```

The script defaults:
- Providers: `kimi-coding claude-code`
- Case glob: `case-*.txt`
- Max parallel workers: `2`
- Per-case timeout: `900s` (set `CASE_TIMEOUT_SEC=0` to disable)
- Output directory: `.context/finance-e2e-runs/<timestamp>/`

Generated artifact:
- `manifest.tsv`: provider, case id, status, session id, session dir, raw log file
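
For post-run triage, the manifest can be loaded with a few lines of Python. The sketch below rests on assumptions: the column names in `COLUMNS` are invented labels for the six fields listed above, the file is assumed to have no header row, and the status values in the sample (`ok`, `timeout`) are made up for the demo.

```python
import csv
import io

# Assumed column order, taken from the field list above; the real file may
# include a header row or use different names.
COLUMNS = ["provider", "case_id", "status", "session_id", "session_dir", "raw_log"]


def read_manifest(text):
    """Parse manifest.tsv content into a list of dicts keyed by COLUMNS."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [dict(zip(COLUMNS, row)) for row in reader if row]


# Fabricated sample rows, for illustration only.
sample = (
    "kimi-coding\tcase-01\tok\tabc\t/tmp/sessions/abc\t/tmp/logs/abc.log\n"
    "claude-code\tcase-01\ttimeout\tdef\t/tmp/sessions/def\t/tmp/logs/def.log\n"
)
rows = read_manifest(sample)
failed = [r["case_id"] + "/" + r["provider"] for r in rows if r["status"] != "ok"]
```

From here, each row's `session_dir` points at the directory to inspect for that run.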

## Run a Subset

Run only one provider:

```bash
PROVIDERS="kimi-coding" scripts/e2e-finance-benchmark/run.sh
```

Run only specific cases by glob:

```bash
CASE_GLOB="case-0[1-3]*.txt" scripts/e2e-finance-benchmark/run.sh
```

Run with higher parallelism for long-horizon tasks:

```bash
MAX_PARALLEL=4 CASE_TIMEOUT_SEC=2700 scripts/e2e-finance-benchmark/run.sh
```

## Case List

1. `case-01-top10-financial-reports.txt`
   - Top-10 US market cap 3-year filing analysis + workbook + 2026 allocation memo
2. `case-02-ai-value-chain-scorecard.txt`
   - AI value-chain factor model and weighted ranking
3. `case-03-us-bank-stress-test.txt`
   - US large-bank stress scenarios (mild/severe recession)
4. `case-04-consumer-sector-macro-linkage.txt`
   - Consumer sector earnings elasticity vs macro variables
5. `case-05-energy-transport-sensitivity.txt`
   - Energy/transport sensitivity and hedge ideas under oil scenarios
6. `case-06-cross-asset-allocation.txt`
   - Cross-asset tactical portfolio design with scenario stress tests
7. `case-07-reit-rate-risk.txt`
   - REIT screening under rate scenarios and debt maturity pressure
8. `case-08-earnings-quality-forensics.txt`
   - Forensic accounting quality framework and red-flag scoring
9. `case-09-post-earnings-drift-study.txt`
   - PEAD strategy feasibility study with risk controls
10. `case-10-investment-committee-pack.txt`
    - Q2 2026 investment committee pack + devil's advocate memo

## Evaluation Checklist

For each run (`session-dir/run-log.jsonl`):

1. Event completeness
   - `run_start` appears before `run_end`
2. Tool pairing
   - Every `tool_start` has matching `tool_end`
3. Error handling
   - Check `tool_end.is_error`, `error_classify`, `auth_rotate`
4. Compaction health
   - If compaction occurs: `compaction.tokens_removed > 0`
5. Performance
   - Inspect `llm_result.duration_ms` and tool durations for outliers
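
The first two checklist items lend themselves to a mechanical pre-check before the agent reads the log in detail. The Python sketch below is one possible shape for that, with `check_run_log` as a hypothetical helper. It pairs tool events by counting starts and ends per tool name, which is an assumption: the log fields shown in these docs do not include a per-call id.

```python
import json


def check_run_log(lines):
    """Return a list of problems found in run-log JSONL lines."""
    events = [json.loads(line) for line in lines if line.strip()]
    names = [e["event"] for e in events]
    problems = []

    # 1. Event completeness: run_start must appear before run_end.
    if "run_start" not in names or "run_end" not in names:
        problems.append("missing run_start or run_end")
    elif names.index("run_start") > names.index("run_end"):
        problems.append("run_end precedes run_start")

    # 2. Tool pairing: starts and ends must balance per tool name.
    balance = {}
    for e in events:
        if e["event"] == "tool_start":
            balance[e["tool"]] = balance.get(e["tool"], 0) + 1
        elif e["event"] == "tool_end":
            balance[e["tool"]] = balance.get(e["tool"], 0) - 1
    for tool, count in balance.items():
        if count != 0:
            problems.append(f"unpaired tool events for {tool}")
    return problems


# Minimal fabricated logs: one healthy, one with a tool that never ended.
good = [
    '{"event": "run_start"}',
    '{"event": "tool_start", "tool": "exec"}',
    '{"event": "tool_end", "tool": "exec", "is_error": false}',
    '{"event": "run_end", "error": null}',
]
hung = good[:2] + good[3:]
```

An empty result means both structural checks passed; items 3 to 5 still need the judgment-based review described above.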

For content quality (`session.jsonl` and output files on Desktop):

1. Required files are created in target output directory
2. Assumptions are explicit and traceable
3. Sources are listed (`sources.md` with links + dates)
4. Output distinguishes facts vs inferences when requested
5. Strategy conclusions include risk and invalidation conditions

## Notes

- Most cases intentionally require web + financial data gathering and local file generation.
- Cases are designed to test planning quality, not only final answer quality.
- You can analyze sessions after batch runs by opening the `session_dir` paths in `manifest.tsv`.

@@ -1,98 +0,0 @@
# Skills Agent-Driven E2E Benchmark

This benchmark validates the meta skill workflow for capability-gap discovery, ClawHub installation, and security-gated rollout.

## Scope

- Domain: skill discovery + installation + update
- Focus: `skills/meta-skill-installer`
- Providers: default `kimi-coding` (override with `PROVIDERS`)
- Cases: 5

Case prompts are stored in:
- `scripts/e2e-skills-benchmark/cases/`

## Real ClawHub Examples Used

The case set references real public pages from ClawHub:

- [CalDAV Calendar](https://clawhub.ai/skills/caldav-calendar)
- [Home Assistant](https://clawhub.ai/skills/homeassistant)
- [CodexMonitor](https://clawhub.ai/odrobnik/codexmonitor)
- [Spotify (gap-discovery UX flow)](https://clawhub.ai/search?q=spotify)
- [Notion (gap-discovery UX flow)](https://clawhub.ai/search?q=notion)

## Prerequisites

1. Credentials configured (`pnpm multica credentials init` if needed)
2. Dependencies installed in repo (`pnpm install`)
3. `clawhub` CLI available, or allow runtime fallback to `npx -y clawhub`
4. Required env:

```bash
export SMC_DATA_DIR=~/.super-multica-e2e
export MULTICA_API_URL=https://api-dev.copilothub.ai
```

## Run Benchmark

```bash
scripts/e2e-skills-benchmark/run.sh
```

Defaults:

- Providers: `kimi-coding`
- Case glob: `case-*.txt`
- Max parallel workers: `1`
- Per-case timeout: `1200s` (`CASE_TIMEOUT_SEC=0` to disable)
- Output directory: `.context/skills-e2e-runs/<timestamp>/`

Generated artifacts:

- `manifest.tsv`: provider/case/status/session/log metadata
- `analysis.txt`: human-readable pass/fail report
- `analysis.json`: structured detailed check output

## Run Subset

Only one case:

```bash
CASE_GLOB="case-01-*.txt" scripts/e2e-skills-benchmark/run.sh
```

Multiple providers:

```bash
PROVIDERS="kimi-coding claude-code" scripts/e2e-skills-benchmark/run.sh
```

Faster throughput:

```bash
MAX_PARALLEL=2 CASE_TIMEOUT_SEC=1800 scripts/e2e-skills-benchmark/run.sh
```

## Analyzer Checks

For each run:

1. `run_start` and `run_end` both present
2. `run_end.error` is empty/null
3. `tool_start` and `tool_end` are paired
4. no `tool_end.is_error=true`
5. at least one `exec` tool call exists
6. case-specific command evidence in `tool_start.args`:
   - `clawhub search`
   - `clawhub install`
   - `review-skill-security.mjs`
   - for case 03 also `clawhub update`
   - for case 04, prompt is a natural user request only; agent must self-discover capability gap, propose ClawHub + security review + install confirmation, and must not run workaround commands (`osascript`, `ha.sh`, `spogo`, `spotify_player`) before user confirmation
   - for case 05, prompt is a natural Notion request; agent must discover missing capability, search skill candidates, trigger `install_guard` (blocked until confirmation), and ask for explicit install consent plus token/auth prerequisites
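
Check 6 amounts to searching `tool_start.args` for expected command strings. The Python sketch below shows the idea; it is not the benchmark's analyzer. `commands_seen` is a hypothetical helper, the `command` key inside `args` is an assumed shape, and the sample log lines are fabricated for the demo.

```python
import json


def commands_seen(lines, needles):
    """Map each needle to True if it appears in any tool_start args."""
    found = {needle: False for needle in needles}
    for line in lines:
        if not line.strip():
            continue
        event = json.loads(line)
        if event.get("event") != "tool_start":
            continue
        # The args shape is not specified here, so search its JSON text.
        haystack = json.dumps(event.get("args", {}))
        for needle in needles:
            if needle in haystack:
                found[needle] = True
    return found


# Fabricated log lines; the "command" key is an assumption.
log = [
    '{"event": "tool_start", "tool": "exec", "args": {"command": "clawhub search caldav"}}',
    '{"event": "tool_end", "tool": "exec", "is_error": false}',
    '{"event": "tool_start", "tool": "exec", "args": {"command": "clawhub install caldav-calendar"}}',
]
evidence = commands_seen(log, ["clawhub search", "clawhub install", "clawhub update"])
```

Searching the serialized `args` keeps the check independent of the exact argument layout, at the cost of possible false positives when a needle appears in unrelated text.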

## Notes

- These are agent-driven tests; prompt intent plus run-log evidence are both evaluated.
- `SMC_DATA_DIR=~/.super-multica-e2e` avoids polluting normal user skill/session data.
- If a case fails, open `manifest.tsv` and inspect the matching `session_dir/run-log.jsonl`.

@@ -1,295 +0,0 @@
# Agent-Driven E2E Testing Guide

This guide teaches Coding Agents (Claude Code, etc.) how to perform automated end-to-end testing of Super Multica features. Unlike traditional test frameworks, **the Coding Agent itself is the test runner and oracle**: it executes the agent, reads structured logs, and intelligently analyzes the results.

## Overview

The testing flow:

1. Coding Agent runs `pnpm multica run --run-log "test prompt"`
2. The agent engine executes the prompt with full structured logging
3. Coding Agent reads the `run-log.jsonl` and `session.jsonl` files
4. Coding Agent analyzes events, tool calls, and behavior for correctness

This approach is superior to static assertions because:
- The AI can understand **intent**: did the agent do what the prompt asked?
- It can reason about **intermediate process**: were the right tools called in the right order?
- It can detect **subtle issues**: token counts that don't make sense, unnecessary retries, missing events

## Prerequisites

1. **Credentials configured**: Run `pnpm multica credentials init` or ensure `~/.super-multica/credentials.json5` has valid provider credentials
2. **Available providers**: Check with `pnpm multica profile list` or inspect credentials file
3. **Default provider**: `kimi-coding` (Kimi Code, free tier available). Can override with `--provider`
4. **`MULTICA_API_URL`**: Required for `web_search` and `data` tools. Set to `https://api-dev.copilothub.ai` for dev environment. Without this, web search and financial data tools will fail with `MULTICA_API_URL is required`
5. **`SMC_DATA_DIR`**: Set to `~/.super-multica-e2e` to isolate E2E test sessions from dev (`~/.super-multica-dev`) and production (`~/.super-multica`) data. Without this, test sessions pollute the production sessions directory
6. **Dev auth for `web_search`/`data` tools**: These tools authenticate via `auth.json` (session ID + device ID). The auth store automatically falls back to `~/.super-multica-dev/auth.json` when the E2E data dir has no auth. If `~/.super-multica-dev/auth.json` doesn't exist, run `pnpm dev:local` first and log in through the Desktop app to create it

## Running a Test

### Environment variables

All E2E test commands should include these env vars:

```bash
# SMC_DATA_DIR: isolates test sessions from dev/production
# MULTICA_API_URL: enables web_search and data tools
export SMC_DATA_DIR=~/.super-multica-e2e
export MULTICA_API_URL=https://api-dev.copilothub.ai
```

### Basic command

```bash
# For prompts that only need exec/read/write tools:
SMC_DATA_DIR=~/.super-multica-e2e pnpm multica run --run-log "your test prompt here"

# For prompts that need web_search or data tools:
SMC_DATA_DIR=~/.super-multica-e2e MULTICA_API_URL=https://api-dev.copilothub.ai pnpm multica run --run-log "your test prompt here"
```

### With provider override

```bash
SMC_DATA_DIR=~/.super-multica-e2e MULTICA_API_URL=https://api-dev.copilothub.ai pnpm multica run --run-log --provider claude-code "your test prompt"
SMC_DATA_DIR=~/.super-multica-e2e MULTICA_API_URL=https://api-dev.copilothub.ai pnpm multica run --run-log --provider kimi-coding "your test prompt"
```

### Resume a session (multi-turn testing)

```bash
# First turn
SMC_DATA_DIR=~/.super-multica-e2e MULTICA_API_URL=https://api-dev.copilothub.ai pnpm multica run --run-log "Create a file called test.txt with content 'hello'"
# Note the session ID from stderr output: [session: 019c584a-...]

# Second turn (same session)
SMC_DATA_DIR=~/.super-multica-e2e MULTICA_API_URL=https://api-dev.copilothub.ai pnpm multica run --run-log --session 019c584a-... "Read the file test.txt and tell me its content"
```

### Cleanup

```bash
# Remove all E2E test sessions
rm -rf ~/.super-multica-e2e
```

### Output

The CLI prints metadata to stderr:
```
[session: 019c584a-7753-762d-9fb9-9eb0a8187df5]
[session-dir: /Users/you/.super-multica/sessions/019c584a-7753-762d-9fb9-9eb0a8187df5]
```

Agent text output goes to stdout.

## Reading Results

After a run, two files contain the data needed for analysis:

### run-log.jsonl

Location: `{session-dir}/run-log.jsonl`

Each line is a JSON object with structured event data. Read this file to understand **what happened during execution**.

```jsonl
{"ts":1739000001,"event":"run_start","prompt":"What is 2+2?","provider":"kimi-coding","model":"kimi-k2-thinking","messages":0}
{"ts":1739000002,"event":"llm_call","provider":"kimi-coding","model":"kimi-k2-thinking","messages":2}
{"ts":1739000005,"event":"llm_result","duration_ms":3000}
{"ts":1739000005,"event":"run_end","duration_ms":4000,"error":null,"text":"4"}
```
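
Because each line is an independent JSON object, the file can be parsed line by line and summarized before deeper analysis. The Python sketch below illustrates this; `parse_events`, `load_events`, and `summarize` are hypothetical helper names, and the sample lines are shortened versions of the ones above.

```python
import json
from collections import Counter


def parse_events(lines):
    """Parse JSONL lines into dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def load_events(path):
    """Read and parse a run-log.jsonl file from disk."""
    with open(path, encoding="utf-8") as handle:
        return parse_events(handle)


def summarize(events):
    """Count events by name and report the run_end error field."""
    counts = Counter(event["event"] for event in events)
    run_end = next((e for e in events if e["event"] == "run_end"), None)
    return {
        "counts": dict(counts),
        "has_run_end": run_end is not None,
        "error": run_end.get("error") if run_end else None,
    }


sample = [
    '{"ts":1739000001,"event":"run_start","prompt":"What is 2+2?"}',
    '{"ts":1739000002,"event":"llm_call","messages":2}',
    '{"ts":1739000005,"event":"llm_result","duration_ms":3000}',
    '{"ts":1739000005,"event":"run_end","duration_ms":4000,"error":null,"text":"4"}',
]
summary = summarize(parse_events(sample))
```

For a real run, `load_events` would be pointed at `{session-dir}/run-log.jsonl` and the result passed to `summarize`.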

### session.jsonl

Location: `{session-dir}/session.jsonl`

Contains the full conversation transcript (user messages, assistant replies, tool calls and results). Read this for **message content analysis**.

## Run-Log Event Reference

> Source of truth: `packages/core/src/agent/run-log.ts` (JSDoc at top of file)

### Lifecycle Events

| Event | Fields | Description |
|-------|--------|-------------|
| `run_start` | prompt, internal, provider, model, messages | Agent run begins |
| `run_end` | duration_ms, error, text, aborted? | Agent run completes |

### LLM Interaction

| Event | Fields | Description |
|-------|--------|-------------|
| `llm_call` | provider, model, profile, messages | LLM API request sent |
| `llm_result` | duration_ms | LLM API response received |

### Tool Execution

| Event | Fields | Description |
|-------|--------|-------------|
| `tool_start` | tool, args | Tool execution begins |
| `tool_end` | tool, duration_ms, is_error | Tool execution completes |

### Context Management

| Event | Fields | Description |
|-------|--------|-------------|
| `preflight_compact_start` | utilization, trigger, messages, est_tokens | Preflight compaction triggered |
| `preflight_compact_end` | messages_before, messages_after, pruned | Preflight compaction done |
| `tool_result_pruning` | soft_trimmed, hard_cleared, chars_saved, phase, tokens_before?, tokens_after? | Tool result pruning (Phase 1) |
| `compaction` | removed, kept, tokens_removed, tokens_kept, reason, pruning_stats? | Summary compaction (Phase 2) |
| `compaction_detail` | pre_pruning_tokens, post_compaction_tokens, messages_removed, reason, pruning_applied | Detailed compaction breakdown |

### Error Recovery

| Event | Fields | Description |
|-------|--------|-------------|
| `context_overflow` | attempt, messages_before | Context window overflow detected |
| `context_overflow_compacted` | messages_after, tokens_removed | Recovered via compaction |
| `context_overflow_forced` | messages_before, messages_after | Recovered via forced drop |
| `error_classify` | error, reason, rotatable | Error classified for rotation |
| `auth_rotate` | from, to, reason | Auth profile rotated |

## Feature Test Playbooks

### 1. Basic Prompt Completion

**Goal**: Verify the agent can complete a simple prompt end-to-end.

```bash
pnpm multica run --run-log "What is the capital of France? Reply in one word."
```

**What to check in run-log**:
- `run_start` event exists with correct provider
- `llm_call` → `llm_result` pair exists (at least one)
- `run_end` event has `error: null`
- `run_end.duration_ms` is reasonable (< 30s for simple prompt)

**What to check in output**:
- Text contains "Paris"

### 2. Tool Usage

**Goal**: Verify tools are called correctly when the prompt requires them.

```bash
pnpm multica run --run-log --cwd /tmp "List the files in the current directory"
```

**What to check in run-log**:
- `tool_start` event with `tool: "exec"` or similar filesystem tool
- Matching `tool_end` with `is_error: false`
- Tool called before final `run_end`

**What to check in output**:
- Output contains actual file names from /tmp

### 3. Context Compaction

**Goal**: Verify compaction works correctly on long sessions.

```bash
# Build up a long session to trigger compaction
pnpm multica run --run-log "Write a detailed 2000-word essay about climate change"
# Note session ID, then continue:
pnpm multica run --run-log --session {id} "Now write another 2000-word essay about renewable energy"
pnpm multica run --run-log --session {id} "Summarize both essays in 3 bullet points"
```

**What to check in run-log**:
- `preflight_compact_start` appears when utilization exceeds trigger ratio
- `tool_result_pruning` shows `soft_trimmed > 0` or `hard_cleared > 0` if tool results were pruned
- `compaction` event has `tokens_removed > 0` (not near-zero like the bug we fixed)
- `compaction_detail` shows `pre_pruning_tokens` > `post_compaction_tokens`

### 4. Multi-Provider Comparison

**Goal**: Verify the same prompt works across different providers.

```bash
pnpm multica run --run-log --provider kimi-coding "Explain recursion in 2 sentences"
pnpm multica run --run-log --provider claude-code "Explain recursion in 2 sentences"
```

**What to check**:
- Both runs complete without errors
- Both `run_end` events have `error: null`
- Compare `llm_result.duration_ms` across providers
- Both outputs are meaningful explanations of recursion

### 5. Error Handling & Auth Rotation

**Goal**: Verify error recovery when credentials are invalid.

```bash
pnpm multica run --run-log --provider anthropic --api-key "sk-invalid-key" "Hello"
```

**What to check in run-log**:
- `error_classify` event with `reason: "auth"`
- `auth_rotate` event if multiple profiles are configured
- `run_end` with appropriate error message if no valid profiles exist

## Analysis Patterns

When analyzing run-logs, look for these patterns:

### Healthy Run
```
run_start → llm_call → llm_result → run_end (error: null)
```

### Run with Tool Usage
```
run_start → llm_call → llm_result → tool_start → tool_end → llm_call → llm_result → run_end
```

### Run with Compaction
```
run_start → preflight_compact_start → tool_result_pruning → preflight_compact_end → llm_call → ...
```

### Red Flags
- `run_end` without preceding `run_start` (log corruption)
- `tool_start` without matching `tool_end` (tool hang/crash)
- `compaction` with `tokens_removed` near zero (compaction ineffective)
- Multiple `error_classify` events (repeated failures)
- `context_overflow_forced` (emergency fallback; should be rare)
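
Three of the red flags above (ineffective compaction, repeated `error_classify`, forced overflow recovery) are simple enough to scan for automatically. The Python sketch below shows one way to do that over already-parsed events. `red_flags` is a hypothetical helper, and `NEAR_ZERO_TOKENS` is an arbitrary threshold chosen for the demo, since the text does not define what "near zero" means.

```python
NEAR_ZERO_TOKENS = 100  # hypothetical threshold; tune for your context sizes


def red_flags(events):
    """Return human-readable warnings for a list of parsed run-log events."""
    flags = []
    for e in events:
        if e["event"] == "compaction" and e.get("tokens_removed", 0) < NEAR_ZERO_TOKENS:
            flags.append("compaction removed almost nothing")
        if e["event"] == "context_overflow_forced":
            flags.append("forced context drop")
    # Repeated classification suggests the run kept failing.
    classify_count = sum(1 for e in events if e["event"] == "error_classify")
    if classify_count > 1:
        flags.append(f"{classify_count} error_classify events")
    return flags


# Fabricated events for illustration.
events = [
    {"event": "run_start"},
    {"event": "compaction", "tokens_removed": 3},
    {"event": "error_classify", "reason": "auth"},
    {"event": "error_classify", "reason": "auth"},
    {"event": "run_end", "error": None},
]
warnings = red_flags(events)
```

A non-empty result is a prompt for closer reading of the log, not a verdict on its own.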

## Creating a New Test Playbook

When a new feature is implemented, create a test playbook following this template:

```markdown
### N. Feature Name

**Goal**: One sentence describing what to verify.

**Command**:
\`\`\`bash
pnpm multica run --run-log [options] "prompt that exercises the feature"
\`\`\`

**What to check in run-log**:
- List specific events and field values to verify
- Include both positive checks (event exists) and negative checks (no errors)

**What to check in output**:
- What the text output should contain or look like

**What to check in session.jsonl** (if applicable):
- Specific message patterns to verify
```

## Tips for Coding Agents

1. **Always use `--run-log`**: without it, there is no structured data to analyze
2. **Use `--cwd`** to control the working directory for file-related tests
3. **Read run-log line by line**: each line is an independent JSON object, so parse lines individually
4. **Check event ordering**: events are chronologically ordered by `ts`
5. **Token counts are estimates**: don't expect exact values, check for reasonable ranges
6. **Clean up test sessions**: after testing, remove session dirs from `~/.super-multica/sessions/` to avoid clutter
7. **Use `--provider`** to test specific providers; it defaults to whatever is configured in credentials
8. **For multi-turn tests**, always capture and reuse the session ID from the first run
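
Tips 3 and 4 can be sketched in a few lines. This is an illustrative helper, not the project's own code. It assumes `ts` values are directly comparable (for example, numeric timestamps or ISO-8601 strings in one consistent format); the function names are hypothetical.

```python
import json

def parse_run_log(text):
    """Parse run-log text where each non-empty line is an independent JSON object."""
    events = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            events.append(json.loads(line))
        except json.JSONDecodeError as exc:
            # Report the offending line instead of failing on the whole file.
            raise ValueError(f"line {line_no} is not valid JSON: {exc}") from exc
    return events

def is_chronological(events):
    """Return True if `ts` never decreases from one event to the next."""
    stamps = [e["ts"] for e in events]
    return all(a <= b for a, b in zip(stamps, stamps[1:]))
```

Parsing per line keeps one malformed record from hiding the rest of the log, and the error message points at the exact line to inspect.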
@ -1,48 +0,0 @@
# Package Management

## Workspace

- Package manager: `pnpm` (workspace mode)
- Build orchestrator: `turbo`

## Required `.npmrc`

Keep this in repo root:

```ini
shamefully-hoist=true
```

This is required for Electron packaging compatibility in this monorepo.

## Install

```bash
pnpm install
```

## Clean Reinstall (When Needed)

Use this when lockfile/hoist state is corrupted or after major package-manager config changes:

```bash
rm -rf node_modules apps/*/node_modules packages/*/node_modules
rm -f pnpm-lock.yaml
pnpm install
```

## Build / Check

```bash
pnpm build
pnpm typecheck
pnpm test
```

## Targeted Commands

```bash
pnpm --filter @multica/desktop build
pnpm --filter @multica/core build
pnpm --filter @multica/web dev
```

@ -1,85 +0,0 @@
# Skills and Tools

## Skills Loading Model

Skills are loaded from two sources with precedence:

1. Managed skills: `~/.super-multica/skills/`
2. Profile skills: `~/.super-multica/agent-profiles/<profile-id>/skills/`

Profile skills override managed skills when IDs conflict.
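
The precedence rule amounts to a keyed merge in which the later source wins. A minimal sketch, assuming each source has already been loaded into a `{skill_id: path}` mapping (that shape and the function name are hypothetical, not taken from the codebase):

```python
def merge_skills(managed, profile):
    """Combine two {skill_id: path} maps; profile entries win on ID conflicts."""
    merged = dict(managed)   # start from managed skills
    merged.update(profile)   # profile skills overwrite matching IDs
    return merged
```

Skills that exist only in one source pass through unchanged; only conflicting IDs are affected.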

## Skill File Contract

A valid skill directory must include:

- `SKILL.md`

Optional runtime files:

- `.env`
- helper scripts/assets

## Current Repo Note

This repository intentionally keeps docs and bundled skill metadata minimal.
If a directory under `skills/` does not contain `SKILL.md`, it will not be loaded as a skill.
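
The `SKILL.md` requirement can be expressed as a directory filter. The sketch below is an assumption-laden illustration, not the loader's actual implementation: it treats the directory name as the skill ID, which the original text does not state.

```python
from pathlib import Path

def discover_skills(root):
    """Map skill ID (assumed here to be the directory name) to its path.

    Only child directories that contain a SKILL.md file are included.
    """
    root = Path(root)
    if not root.is_dir():
        return {}
    return {
        child.name: child
        for child in sorted(root.iterdir())
        if child.is_dir() and (child / "SKILL.md").is_file()
    }
```

Directories without `SKILL.md` are silently skipped, which matches the note above that such directories are not loaded as skills.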

## Skills CLI

```bash
multica skills list
multica skills status [id]
multica skills install <id>
multica skills add <owner/repo[/skill]>
multica skills remove <name>
```

## Tool System

`@multica/core` composes:

- base coding tools (`read/write/edit/...`)
- extended tools (`exec`, `process`, `glob`, `web_fetch`, `web_search`, `data`, `cron`, `delegate`)
- conditional tools (`send_file`)

Tool errors are wrapped into structured tool results instead of crashing runs.

## Tool Groups

Supported group aliases:

- `group:fs` -> `read, write, edit, glob`
- `group:runtime` -> `exec, process`
- `group:web` -> `web_search, web_fetch`
- `group:subagent` -> `delegate`
- `group:cron` -> `cron`
- `group:data` -> `data`
- `group:core` -> core local/web/data set

## Tool Policy Example

```json5
{
  tools: {
    allow: ["group:fs", "web_search", "web_fetch"],
    deny: ["exec"],
    byProvider: {
      "openai": {
        deny: ["data"],
      },
    },
  },
}
```

`deny` always has priority over `allow`.
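
To make the precedence concrete, here is a rough sketch of how such a policy could be resolved. It is not the `@multica/core` implementation. It assumes that an empty `allow` list means "all tools", handles only `deny` under `byProvider`, and omits `group:core` because its exact members are not listed above. The function names are hypothetical.

```python
# Group aliases as documented above (group:core omitted: members not specified).
TOOL_GROUPS = {
    "group:fs": ["read", "write", "edit", "glob"],
    "group:runtime": ["exec", "process"],
    "group:web": ["web_search", "web_fetch"],
    "group:subagent": ["delegate"],
    "group:cron": ["cron"],
    "group:data": ["data"],
}

def expand(names):
    """Expand group aliases into a set of concrete tool names."""
    tools = set()
    for name in names:
        tools.update(TOOL_GROUPS.get(name, [name]))
    return tools

def effective_tools(all_tools, policy, provider=None):
    """Apply allow first, then subtract deny (global plus per-provider)."""
    allow = expand(policy.get("allow", []))
    deny = expand(policy.get("deny", []))
    if provider is not None:
        override = policy.get("byProvider", {}).get(provider, {})
        deny |= expand(override.get("deny", []))
    # Assumption: no allow list means every known tool is a candidate.
    allowed = set(all_tools) & allow if allow else set(all_tools)
    # Subtracting deny last is what gives it priority over allow.
    return sorted(allowed - deny)
```

Because `deny` is subtracted after `allow` is applied, a tool named in both lists (directly or through a group) is always excluded.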

## Inspect Effective Tools

```bash
multica tools list
multica tools list --allow group:fs,web_fetch
multica tools list --deny exec
multica tools groups
```

@ -1,63 +0,0 @@
# Web Tools Policy Optimization Roadmap

Related Linear issue: [MUL-267](https://linear.app/indexlabs/issue/MUL-267/refactor-web-evidence-guard-to-hybrid-policy-and-configurable-rule)

## Context

The current web evidence guard solved the immediate quality issue:
- It enforces `web_search` -> `web_fetch` evidence coverage in runtime.
- It blocks snippet-only finalization in key web-dependent cases.

However, semantic intent detection currently relies on hard-coded regex cue groups in `packages/core/src/agent/web-tools-policy.ts`. This is deterministic but not ideal for long-term maintainability and multilingual robustness.

## Problem Statement

Current limitations:
- Semantic classification logic is tightly coupled with runtime enforcement code.
- Pattern lists are code-level constants, making iteration high-friction.
- Coverage expansion risks overfitting and regression without a stronger benchmark loop.

## Target Architecture

Use a hybrid policy model:
1. Deterministic guardrail layer (must keep)
- Tool-trace based invariants (e.g. search/fetch sequencing, minimum successful fetch count).

2. Semantic decision layer (new)
- Lightweight model/classifier returns decision + confidence + reason codes.

3. Rulepack fallback layer (refactor existing patterns)
- Externalized locale-aware cue packs for conservative fallback only.
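
The three layers above could be wired together roughly as follows. This is a speculative sketch of the proposed design, not existing code. The `Decision` type, the `decide` signature, the classifier callback shape, the 0.8 threshold, and the substring-based cue matching are all assumptions made for illustration. The `source` labels reuse the decision-source names listed in the migration plan.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

@dataclass
class Decision:
    requires_fetch: bool  # True means finalization should wait for a web_fetch
    source: str           # "tool-trace", "semantic", or "fallback-pattern"

def decide(
    prompt: str,
    searches: int,
    successful_fetches: int,
    classifier: Optional[Callable[[str], Tuple[bool, float]]] = None,
    fallback_cues: Tuple[str, ...] = (),
    threshold: float = 0.8,
    min_fetches: int = 1,
) -> Decision:
    missing_fetch = successful_fetches < min_fetches
    # Layer 1: deterministic tool-trace invariant, always the final authority.
    if searches > 0 and missing_fetch:
        return Decision(True, "tool-trace")
    # Layer 2: optional semantic classifier, used only when confident enough.
    if classifier is not None:
        needs_web, confidence = classifier(prompt)
        if confidence >= threshold:
            return Decision(needs_web and missing_fetch, "semantic")
    # Layer 3: conservative cue-pack fallback (hypothetical substring match).
    hit = any(cue in prompt.lower() for cue in fallback_cues)
    return Decision(hit and missing_fetch, "fallback-pattern")
```

Checking the tool-trace invariant first means a low-confidence or incorrect classifier result can never relax the deterministic guard.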

## Migration Plan

Phase 1: Decouple configuration
- Move regex cue groups out of `web-tools-policy.ts` into a policy registry.
- Keep behavior equivalent.

Phase 2: Add semantic classifier path
- Add an optional semantic decision step with confidence threshold.
- Preserve deterministic tool-trace constraints as final authority.

Phase 3: Observability and tuning
- Emit run-log fields for policy decision source:
  - `tool-trace`
  - `semantic`
  - `fallback-pattern`
- Add benchmark slices focused on false-positive/false-negative policy triggers.

Phase 4: Reduce hard-coded fallback
- Keep only minimal safety patterns in code.
- Shift language/phrase evolution to versioned config updates.

## Acceptance Criteria

- No large hard-coded regex arrays in runtime policy file.
- Semantic decision path is independently testable and feature-flagged.
- Baseline behavior remains backward-compatible for existing guard cases.
- Benchmark report shows equal or lower policy misfire rate.

## Non-goals

- Replacing deterministic tool-trace enforcement with pure model decisions.
- Expanding scope to unrelated tool policy domains in the same iteration.