feat: pivot to AI-native task management platform (#232)

Replace the agent framework codebase with a new monorepo structure for an AI-native Linear-like product where agents are first-class citizens.

New architecture:
- server/ — Go backend (Chi + gorilla/websocket + sqlc)
  - API server with REST routes for issues, agents, inbox, workspaces
  - WebSocket hub for real-time updates
  - Local daemon entry point for agent runtime connection
  - PostgreSQL migration with 13 tables (issue, agent, inbox, etc.)
  - WebSocket protocol types for server<->daemon communication
- apps/web/ — Next.js 16 frontend
  - Dashboard layout with sidebar navigation
  - Route skeleton: inbox, issues, agents, board, settings
- packages/ui/ — Preserved shadcn/ui design system (26+ components)
- packages/types/ — Full API contract types (Issue, Agent, Workspace, Inbox, Events)
- packages/sdk/ — REST ApiClient + WebSocket WSClient
- packages/store/ — Zustand stores (issue, agent, inbox, auth)
- packages/hooks/ — React hooks (useIssues, useAgents, useInbox, useRealtime)
- packages/utils/ — Shared utilities

Removed: apps/cli, apps/desktop, apps/mobile, apps/gateway, packages/core, skills/, and all agent-framework code.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-20 17:55:49 +08:00 · d4f5c5b16f
commit d4f5c5b16f
parent 3f589d8326
677 changed files with 2779 additions and 122531 deletions
--- a/docs/README.md
+++ b/docs/README.md
@@ -1,29 +0,0 @@
-# Documentation Index (Priority-Based)
-
-This repo keeps documentation intentionally small to reduce stale AI context.
-Only workflow/testing/process documentation should be maintained.
-Project-intro and architecture explanation docs are intentionally omitted.
-
-## P0 (Keep Fresh)
-
-1. `README.md`
-2. `CLAUDE.md`
-3. `docs/development.md`
-4. `docs/cli.md`
-5. `docs/credentials.md`
-
-## P1 (Operational)
-
-1. `docs/skills-and-tools.md`
-2. `docs/package-management.md`
-3. `docs/e2e-testing-guide.md`
-
-## P2 (Benchmarks / Specialized)
-
-1. `docs/e2e-finance-benchmark.md`
-2. `docs/web-tools-policy-optimization.md`
-
-## Regeneration Rule
-
-When code behavior changes, update only impacted P0/P1 docs first.
-If unsure, prefer deleting stale sections over keeping speculative content.
--- a/docs/cli.md
+++ b/docs/cli.md
@@ -1,129 +0,0 @@
-# CLI Guide (`multica`)
-
-## Entry
-
-```bash
-pnpm multica
-```
-
-Equivalent command names:
-
- `multica`
- `mu`
-
-## Core Commands
-
-```bash
-multica                       # interactive chat (default)
-multica run "<prompt>"        # one-shot run
-multica chat                  # explicit interactive mode
-multica session <command>     # session management
-multica profile <command>     # profile management
-multica skills <command>      # skill management
-multica tools <command>       # tool policy inspection
-multica credentials <command> # credentials management
-multica cron <command>        # scheduled tasks
-multica dev [service]         # start dev services
-multica help
-```
-
-## Run Mode
-
-```bash
-multica run [options] <prompt>
-echo "prompt" | multica run
-```
-
-Common options:
-
- `--profile <id>`
- `--provider <name>`
- `--model <name>`
- `--session <id>`
- `--cwd <dir>`
- `--run-log`
- `--tools-allow a,b,c`
- `--tools-deny a,b,c`
- `--context-window <tokens>`
-
-## Chat Mode
-
-```bash
-multica chat [options]
-multica [options]
-```
-
-In-chat commands:
-
- `/help`
- `/exit`
- `/clear`
- `/session`
- `/new`
- `/multiline`
- `/provider`
- `/model`
-
-## Sessions
-
-```bash
-multica session list
-multica session show <id>
-multica session delete <id>
-```
-
-Session data root:
-
- `~/.super-multica/sessions/`
- or `SMC_DATA_DIR/sessions/`
-
-## Profiles
-
-```bash
-multica profile list
-multica profile new <id>
-multica profile setup <id>
-multica profile show <id>
-multica profile edit <id>
-multica profile delete <id>
-```
-
-## Skills
-
-```bash
-multica skills list
-multica skills status [id]
-multica skills install <id>
-multica skills add <owner/repo[/skill]>
-multica skills remove <name>
-```
-
-## Tools
-
-```bash
-multica tools list
-multica tools list --allow group:fs,web_fetch
-multica tools list --deny exec
-multica tools groups
-```
-
-## Credentials
-
-```bash
-multica credentials init
-multica credentials show
-multica credentials edit
-```
-
-## Cron
-
-```bash
-multica cron status
-multica cron list
-multica cron add -n "name" --every "30m" --message "..."
-multica cron run <id>
-multica cron enable <id>
-multica cron disable <id>
-multica cron remove <id>
-multica cron logs <id>
-```
--- a/docs/credentials.md
+++ b/docs/credentials.md
@@ -1,76 +0,0 @@
-# Credentials Guide
-
-## Initialize
-
-```bash
-pnpm multica credentials init
-```
-
-This creates:
-
- `~/.super-multica/credentials.json5`
-
-## Path Resolution
-
-Credential file lookup order:
-
-1. `SMC_CREDENTIALS_PATH` (explicit override)
-2. `SMC_DATA_DIR/credentials.json5` (or default data dir)
-3. `~/.super-multica/credentials.json5` fallback
-
-## Minimal Template
-
-```json5
-{
-  version: 1,
-  llm: {
-    provider: "kimi-coding",
-    providers: {
-      "kimi-coding": {
-        apiKey: "your-key",
-      },
-    },
-  },
-  tools: {
-    // tool-specific keys
-  },
-}
-```
-
-## Multi-Key Rotation (Per Provider)
-
-You can define multiple keys under one provider namespace:
-
-```json5
-{
-  llm: {
-    providers: {
-      "anthropic": { apiKey: "primary" },
-      "anthropic:backup": { apiKey: "backup" },
-    },
-    order: {
-      anthropic: ["anthropic", "anthropic:backup"],
-    },
-  },
-}
-```
-
-## OAuth Providers
-
- `claude-code`: run `claude login`
- `openai-codex`: run `codex login`
-
-API-key providers are configured directly in `credentials.json5`.
-
-## Tool Credentials
-
-Tool credentials are read from:
-
- `credentials.json5` under `tools`
- skill-level `.env` files under skill directories
-
-## Security
-
- Keep credentials file mode private (`600` on Unix-like systems).
- Do not commit keys into the repository.
- Prefer isolated data dirs (`SMC_DATA_DIR`) for test/dev environments.
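The lookup order listed under "Path Resolution" in the removed `docs/credentials.md` can be sketched as a small pure function. This is only an illustration of the documented order, not the removed implementation: `resolveCredentialsPath`, the `Env` type, and the `homeDir` parameter are hypothetical names, and the sketch assumes POSIX-style `/` separators with no `~` expansion.

```typescript
// Hypothetical helper; only the lookup order comes from the removed guide.
type Env = Record<string, string | undefined>;

function resolveCredentialsPath(env: Env, homeDir: string): string {
  // 1. An explicit override always wins.
  const explicit = env["SMC_CREDENTIALS_PATH"];
  if (explicit) return explicit;

  // 2. Otherwise use the data dir; 3. fall back to ~/.super-multica.
  const dataDir = env["SMC_DATA_DIR"] || `${homeDir}/.super-multica`;
  return `${dataDir}/credentials.json5`;
}
```

Passing the environment and home directory as arguments keeps the function easy to test against an isolated data dir.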
--- a/docs/development.md
+++ b/docs/development.md
@@ -1,134 +0,0 @@
-# Development Guide
-
-## Prerequisites
-
- Node.js 20+
- pnpm 10+
- macOS/Linux/Windows
-
-## Install
-
-```bash
-pnpm install
-```
-
-`.npmrc` must keep:
-
-```ini
-shamefully-hoist=true
-```
-
-## Main Dev Entry Points
-
-```bash
-# Recommended local desktop workflow
-pnpm dev
-
-# Service-specific
-pnpm dev:desktop
-pnpm dev:gateway
-pnpm dev:web
-
-# Full local stack with isolated dev data
-pnpm dev:local
-pnpm dev:local:archive
-```
-
-## What Each Command Does
-
- `pnpm dev`: builds shared packages, then runs `types + utils + core + desktop` watch flow.
- `pnpm dev:desktop`: Electron desktop only.
- `pnpm dev:gateway`: NestJS WebSocket gateway (`PORT`, default `3000`).
- `pnpm dev:web`: Next.js web app (`3000` by script).
- `pnpm dev:local`: gateway + web + desktop with dev-safe env defaults.
- `pnpm dev:local:archive`: archive dev data and start fresh.
-
-## Important Environment Variables
-
- `SMC_DATA_DIR`: override runtime data root (default `~/.super-multica`)
- `GATEWAY_URL`: gateway endpoint for desktop/CLI hub connection
- `MULTICA_API_URL`: required by web/data tools
- `PORT`: gateway/server port
- `MULTICA_WORKSPACE_DIR`: override workspace root
- `MULTICA_RUN_LOG=1`: enable structured run-log output
-
-## Agent / Conversation Semantics
-
- `agentId`: logical owner identity (capabilities/profile scope).
- `conversationId`: isolated runtime thread under an agent.
- `sessionId`: internal runner/storage identifier for a conversation. External protocols use `conversationId`.
-
-Protocol rules:
-
- Hub RPC is conversation-first: `createConversation/listConversations/deleteConversation`.
- All message, stream, and verify payloads use `conversationId` (no `sessionId` alias fields).
- New integrations should always pass `conversationId` explicitly.
-
-Telegram behavior:
-
- One Telegram DM binds to one active `conversationId`.
- `/new` creates and switches to a new conversation.
- `/session <id>` switches the active conversation.
- `/sessions` lists available conversations.
-
-Channel route behavior:
-
- Runtime route key is `channelId:accountId:externalConversationId`.
- Each route key is bound to one Hub `conversationId`.
- Incoming/outgoing channel traffic is isolated per bound conversation (no global first-agent fallback).
-
-## Local Full-Stack Notes (`pnpm dev:local`)
-
-`pnpm dev:local` is the recommended way to run the full local stack for integration work.
-
-Setup:
-
-1. `cp .env.example .env`
-2. Set `TELEGRAM_BOT_TOKEN` in root `.env`
-3. Run `pnpm dev:local`
-
-Services started by the script:
-
-| Service | Address | Notes |
-|---------|---------|-------|
-| Gateway | `http://localhost:4000` | Telegram long-polling mode (`PORT=4000`) |
-| Web | `http://localhost:3000` | OAuth login / frontend |
-| Desktop | — | Uses `GATEWAY_URL=http://localhost:4000` and local web URL |
-
-Data/workspace isolation used by the script:
-
- `SMC_DATA_DIR=~/.super-multica-dev`
- `MULTICA_WORKSPACE_DIR=~/Documents/Multica-dev`
-
-Why this matters:
-
- avoids polluting production data under `~/.super-multica`
- provides a stable local target for auth/session debugging
-
-Common follow-up:
-
-```bash
-pnpm dev:local:archive
-```
-
-This archives prior dev data before starting fresh local runs.
-
-## Build / Quality
-
-```bash
-pnpm build
-pnpm typecheck
-pnpm test
-pnpm test:coverage
-```
-
-## Useful Reset Commands
-
-```bash
-# Reset default + dev data dirs used by desktop scripts
-pnpm dev:desktop:reset
-
-# Reset and relaunch desktop onboarding flow
-pnpm dev:desktop:fresh
-pnpm dev:desktop:onboarding
-```
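The "Channel route behavior" rules in the removed `docs/development.md` (a route key of `channelId:accountId:externalConversationId`, each bound to one Hub `conversationId`) could look roughly like the sketch below. All names here (`ChannelRoute`, `routeKey`, `RouteTable`) are hypothetical, and two things are assumptions rather than documented behavior: a binding is created lazily on first sight of a key, and none of the three IDs contains a `:`.

```typescript
// Hypothetical types; only the key format comes from the removed guide.
interface ChannelRoute {
  channelId: string;
  accountId: string;
  externalConversationId: string;
}

function routeKey(r: ChannelRoute): string {
  return `${r.channelId}:${r.accountId}:${r.externalConversationId}`;
}

class RouteTable {
  private bindings: Record<string, string> = {};

  // Returns the conversationId bound to this route, creating the binding
  // on first use (assumption). There is no global fallback conversation.
  resolve(route: ChannelRoute, createConversation: () => string): string {
    const key = routeKey(route);
    const existing = this.bindings[key];
    if (existing !== undefined) return existing;
    const created = createConversation();
    this.bindings[key] = created;
    return created;
  }
}
```

Because every lookup goes through the full three-part key, traffic from two external conversations on the same account never shares a Hub conversation.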
--- a/docs/e2e-finance-benchmark.md
+++ b/docs/e2e-finance-benchmark.md
@@ -1,112 +0,0 @@
-# Finance Agent-Driven E2E Benchmark
-
-This benchmark suite is designed for complex financial analysis scenarios and follows the workflow in `docs/e2e-testing-guide.md`.
-
-## Scope
-
- Domain: equity, macro, rates, credit, cross-asset allocation
- Complexity: multi-step planning, data collection, analysis, local artifact generation
- Providers: `kimi-coding` and `claude-code`
- Cases: 10
-
-Case prompts are stored in:
- `scripts/e2e-finance-benchmark/cases/`
-
-## Prerequisites
-
-1. Credentials are configured (`pnpm multica credentials init` if needed)
-2. Dev auth exists for `web_search`/`data` tools (`~/.super-multica-dev/auth.json`)
-3. Required env:
-
-```bash
-export SMC_DATA_DIR=~/.super-multica-e2e
-export MULTICA_API_URL=https://api-dev.copilothub.ai
-```
-
-## Run All Cases (Both Providers)
-
-```bash
-scripts/e2e-finance-benchmark/run.sh
-```
-
-The script defaults:
- Providers: `kimi-coding claude-code`
- Case glob: `case-*.txt`
- Max parallel workers: `2`
- Per-case timeout: `900s` (set `CASE_TIMEOUT_SEC=0` to disable)
- Output directory: `.context/finance-e2e-runs/<timestamp>/`
-
-Generated artifact:
- `manifest.tsv`: provider, case id, status, session id, session dir, raw log file
-
-## Run a Subset
-
-Run only one provider:
-
-```bash
-PROVIDERS="kimi-coding" scripts/e2e-finance-benchmark/run.sh
-```
-
-Run only specific cases by glob:
-
-```bash
-CASE_GLOB="case-0[1-3]*.txt" scripts/e2e-finance-benchmark/run.sh
-```
-
-Run with higher parallelism for long-horizon tasks:
-
-```bash
-MAX_PARALLEL=4 CASE_TIMEOUT_SEC=2700 scripts/e2e-finance-benchmark/run.sh
-```
-
-## Case List
-
-1. `case-01-top10-financial-reports.txt`
-   - Top-10 US market cap 3-year filing analysis + workbook + 2026 allocation memo
-2. `case-02-ai-value-chain-scorecard.txt`
-   - AI value-chain factor model and weighted ranking
-3. `case-03-us-bank-stress-test.txt`
-   - US large-bank stress scenarios (mild/severe recession)
-4. `case-04-consumer-sector-macro-linkage.txt`
-   - Consumer sector earnings elasticity vs macro variables
-5. `case-05-energy-transport-sensitivity.txt`
-   - Energy/transport sensitivity and hedge ideas under oil scenarios
-6. `case-06-cross-asset-allocation.txt`
-   - Cross-asset tactical portfolio design with scenario stress tests
-7. `case-07-reit-rate-risk.txt`
-   - REIT screening under rate scenarios and debt maturity pressure
-8. `case-08-earnings-quality-forensics.txt`
-   - Forensic accounting quality framework and red-flag scoring
-9. `case-09-post-earnings-drift-study.txt`
-   - PEAD strategy feasibility study with risk controls
-10. `case-10-investment-committee-pack.txt`
-   - Q2 2026 investment committee pack + devil's advocate memo
-
-## Evaluation Checklist
-
-For each run (`session-dir/run-log.jsonl`):
-
-1. Event completeness
-   - `run_start` appears before `run_end`
-2. Tool pairing
-   - Every `tool_start` has matching `tool_end`
-3. Error handling
-   - Check `tool_end.is_error`, `error_classify`, `auth_rotate`
-4. Compaction health
-   - If compaction occurs: `compaction.tokens_removed > 0`
-5. Performance
-   - Inspect `llm_result.duration_ms` and tool durations for outliers
-
-For content quality (`session.jsonl` and output files on Desktop):
-
-1. Required files are created in target output directory
-2. Assumptions are explicit and traceable
-3. Sources are listed (`sources.md` with links + dates)
-4. Output distinguishes facts vs inferences when requested
-5. Strategy conclusions include risk and invalidation conditions
-
-## Notes
-
- Most cases intentionally require web + financial data gathering and local file generation.
- Cases are designed to test planning quality, not only final answer quality.
- You can analyze sessions after batch runs by opening the `session_dir` paths in `manifest.tsv`.
--- a/docs/e2e-skills-benchmark.md
+++ b/docs/e2e-skills-benchmark.md
@@ -1,98 +0,0 @@
-# Skills Agent-Driven E2E Benchmark
-
-This benchmark validates the meta skill workflow for capability-gap discovery, ClawHub installation, and security-gated rollout.
-
-## Scope
-
- Domain: skill discovery + installation + update
- Focus: `skills/meta-skill-installer`
- Providers: default `kimi-coding` (override with `PROVIDERS`)
- Cases: 5
-
-Case prompts are stored in:
- `scripts/e2e-skills-benchmark/cases/`
-
-## Real ClawHub Examples Used
-
-The case set references real public pages from ClawHub:
-
- [CalDAV Calendar](https://clawhub.ai/skills/caldav-calendar)
- [Home Assistant](https://clawhub.ai/skills/homeassistant)
- [CodexMonitor](https://clawhub.ai/odrobnik/codexmonitor)
- [Spotify (gap-discovery UX flow)](https://clawhub.ai/search?q=spotify)
- [Notion (gap-discovery UX flow)](https://clawhub.ai/search?q=notion)
-
-## Prerequisites
-
-1. Credentials configured (`pnpm multica credentials init` if needed)
-2. Dependencies installed in repo (`pnpm install`)
-3. `clawhub` CLI available, or allow runtime fallback to `npx -y clawhub`
-4. Required env:
-
-```bash
-export SMC_DATA_DIR=~/.super-multica-e2e
-export MULTICA_API_URL=https://api-dev.copilothub.ai
-```
-
-## Run Benchmark
-
-```bash
-scripts/e2e-skills-benchmark/run.sh
-```
-
-Defaults:
-
- Providers: `kimi-coding`
- Case glob: `case-*.txt`
- Max parallel workers: `1`
- Per-case timeout: `1200s` (`CASE_TIMEOUT_SEC=0` to disable)
- Output directory: `.context/skills-e2e-runs/<timestamp>/`
-
-Generated artifacts:
-
- `manifest.tsv`: provider/case/status/session/log metadata
- `analysis.txt`: human-readable pass/fail report
- `analysis.json`: structured detailed check output
-
-## Run Subset
-
-Only one case:
-
-```bash
-CASE_GLOB="case-01-*.txt" scripts/e2e-skills-benchmark/run.sh
-```
-
-Multiple providers:
-
-```bash
-PROVIDERS="kimi-coding claude-code" scripts/e2e-skills-benchmark/run.sh
-```
-
-Faster throughput:
-
-```bash
-MAX_PARALLEL=2 CASE_TIMEOUT_SEC=1800 scripts/e2e-skills-benchmark/run.sh
-```
-
-## Analyzer Checks
-
-For each run:
-
-1. `run_start` and `run_end` both present
-2. `run_end.error` is empty/null
-3. `tool_start` and `tool_end` are paired
-4. no `tool_end.is_error=true`
-5. at least one `exec` tool call exists
-6. case-specific command evidence in `tool_start.args`:
-   - `clawhub search`
-   - `clawhub install`
-   - `review-skill-security.mjs`
-   - for case 03 also `clawhub update`
-   - for case 04, prompt is a natural user request only; agent must self-discover capability gap, propose ClawHub + security review + install confirmation, and must not run workaround commands (`osascript`, `ha.sh`, `spogo`, `spotify_player`) before user confirmation
-   - for case 05, prompt is a natural Notion request; agent must discover missing capability, search skill candidates, trigger `install_guard` (blocked until confirmation), and ask for explicit install consent plus token/auth prerequisites
-
-## Notes
-
- These are agent-driven tests; prompt intent plus run-log evidence are both evaluated.
- `SMC_DATA_DIR=~/.super-multica-e2e` avoids polluting normal user skill/session data.
- If a case fails, open `manifest.tsv` and inspect the matching `session_dir/run-log.jsonl`.
--- a/docs/e2e-testing-guide.md
+++ b/docs/e2e-testing-guide.md
@@ -1,295 +0,0 @@
-# Agent-Driven E2E Testing Guide
-
-This guide teaches Coding Agents (Claude Code, etc.) how to perform automated end-to-end testing of Super Multica features. Unlike traditional test frameworks, **the Coding Agent itself is the test runner and oracle** — it executes the agent, reads structured logs, and intelligently analyzes the results.
-
-## Overview
-
-The testing flow:
-
-1. Coding Agent runs `pnpm multica run --run-log "test prompt"`
-2. The agent engine executes the prompt with full structured logging
-3. Coding Agent reads the `run-log.jsonl` and `session.jsonl` files
-4. Coding Agent analyzes events, tool calls, and behavior for correctness
-
-This approach is superior to static assertions because:
- The AI can understand **intent** — did the agent do what the prompt asked?
- It can reason about **intermediate process** — were the right tools called in the right order?
- It can detect **subtle issues** — token counts that don't make sense, unnecessary retries, missing events
-
-## Prerequisites
-
-1. **Credentials configured**: Run `pnpm multica credentials init` or ensure `~/.super-multica/credentials.json5` has valid provider credentials
-2. **Available providers**: Check with `pnpm multica profile list` or inspect credentials file
-3. **Default provider**: `kimi-coding` (Kimi Code, free tier available). Can override with `--provider`
-4. **`MULTICA_API_URL`**: Required for `web_search` and `data` tools. Set to `https://api-dev.copilothub.ai` for dev environment. Without this, web search and financial data tools will fail with `MULTICA_API_URL is required`
-5. **`SMC_DATA_DIR`**: Set to `~/.super-multica-e2e` to isolate E2E test sessions from dev (`~/.super-multica-dev`) and production (`~/.super-multica`) data. Without this, test sessions pollute the production sessions directory
-6. **Dev auth for `web_search`/`data` tools**: These tools authenticate via `auth.json` (session ID + device ID). The auth store automatically falls back to `~/.super-multica-dev/auth.json` when the E2E data dir has no auth. If `~/.super-multica-dev/auth.json` doesn't exist, run `pnpm dev:local` first and log in through the Desktop app to create it
-
-## Running a Test
-
-### Environment variables
-
-All E2E test commands should include these env vars:
-
-```bash
-# SMC_DATA_DIR — isolates test sessions from dev/production
-# MULTICA_API_URL — enables web_search and data tools
-export SMC_DATA_DIR=~/.super-multica-e2e
-export MULTICA_API_URL=https://api-dev.copilothub.ai
-```
-
-### Basic command
-
-```bash
-# For prompts that only need exec/read/write tools:
-SMC_DATA_DIR=~/.super-multica-e2e pnpm multica run --run-log "your test prompt here"
-
-# For prompts that need web_search or data tools:
-SMC_DATA_DIR=~/.super-multica-e2e MULTICA_API_URL=https://api-dev.copilothub.ai pnpm multica run --run-log "your test prompt here"
-```
-
-### With provider override
-
-```bash
-SMC_DATA_DIR=~/.super-multica-e2e MULTICA_API_URL=https://api-dev.copilothub.ai pnpm multica run --run-log --provider claude-code "your test prompt"
-SMC_DATA_DIR=~/.super-multica-e2e MULTICA_API_URL=https://api-dev.copilothub.ai pnpm multica run --run-log --provider kimi-coding "your test prompt"
-```
-
-### Resume a session (multi-turn testing)
-
-```bash
-# First turn
-SMC_DATA_DIR=~/.super-multica-e2e MULTICA_API_URL=https://api-dev.copilothub.ai pnpm multica run --run-log "Create a file called test.txt with content 'hello'"
-# Note the session ID from stderr output: [session: 019c584a-...]
-
-# Second turn (same session)
-SMC_DATA_DIR=~/.super-multica-e2e MULTICA_API_URL=https://api-dev.copilothub.ai pnpm multica run --run-log --session 019c584a-... "Read the file test.txt and tell me its content"
-```
-
-### Cleanup
-
-```bash
-# Remove all E2E test sessions
-rm -rf ~/.super-multica-e2e
-```
-
-### Output
-
-The CLI prints metadata to stderr:
-```
-[session: 019c584a-7753-762d-9fb9-9eb0a8187df5]
-[session-dir: /Users/you/.super-multica/sessions/019c584a-7753-762d-9fb9-9eb0a8187df5]
-```
-
-Agent text output goes to stdout.
-
-## Reading Results
-
-After a run, two files contain the data needed for analysis:
-
-### run-log.jsonl
-
-Location: `{session-dir}/run-log.jsonl`
-
-Each line is a JSON object with structured event data. Read this file to understand **what happened during execution**.
-
-```jsonl
-{"ts":1739000001,"event":"run_start","prompt":"What is 2+2?","provider":"kimi-coding","model":"kimi-k2-thinking","messages":0}
-{"ts":1739000002,"event":"llm_call","provider":"kimi-coding","model":"kimi-k2-thinking","messages":2}
-{"ts":1739000005,"event":"llm_result","duration_ms":3000}
-{"ts":1739000005,"event":"run_end","duration_ms":4000,"error":null,"text":"4"}
-```
-
-### session.jsonl
-
-Location: `{session-dir}/session.jsonl`
-
-Contains the full conversation transcript (user messages, assistant replies, tool calls and results). Read this for **message content analysis**.
-
-## Run-Log Event Reference
-
-> Source of truth: `packages/core/src/agent/run-log.ts` (JSDoc at top of file)
-
-### Lifecycle Events
-
-| Event | Fields | Description |
-|-------|--------|-------------|
-| `run_start` | prompt, internal, provider, model, messages | Agent run begins |
-| `run_end` | duration_ms, error, text, aborted? | Agent run completes |
-
-### LLM Interaction
-
-| Event | Fields | Description |
-|-------|--------|-------------|
-| `llm_call` | provider, model, profile, messages | LLM API request sent |
-| `llm_result` | duration_ms | LLM API response received |
-
-### Tool Execution
-
-| Event | Fields | Description |
-|-------|--------|-------------|
-| `tool_start` | tool, args | Tool execution begins |
-| `tool_end` | tool, duration_ms, is_error | Tool execution completes |
-
-### Context Management
-
-| Event | Fields | Description |
-|-------|--------|-------------|
-| `preflight_compact_start` | utilization, trigger, messages, est_tokens | Preflight compaction triggered |
-| `preflight_compact_end` | messages_before, messages_after, pruned | Preflight compaction done |
-| `tool_result_pruning` | soft_trimmed, hard_cleared, chars_saved, phase, tokens_before?, tokens_after? | Tool result pruning (Phase 1) |
-| `compaction` | removed, kept, tokens_removed, tokens_kept, reason, pruning_stats? | Summary compaction (Phase 2) |
-| `compaction_detail` | pre_pruning_tokens, post_compaction_tokens, messages_removed, reason, pruning_applied | Detailed compaction breakdown |
-
-### Error Recovery
-
-| Event | Fields | Description |
-|-------|--------|-------------|
-| `context_overflow` | attempt, messages_before | Context window overflow detected |
-| `context_overflow_compacted` | messages_after, tokens_removed | Recovered via compaction |
-| `context_overflow_forced` | messages_before, messages_after | Recovered via forced drop |
-| `error_classify` | error, reason, rotatable | Error classified for rotation |
-| `auth_rotate` | from, to, reason | Auth profile rotated |
-
-## Feature Test Playbooks
-
-### 1. Basic Prompt Completion
-
-**Goal**: Verify the agent can complete a simple prompt end-to-end.
-
-```bash
-pnpm multica run --run-log "What is the capital of France? Reply in one word."
-```
-
-**What to check in run-log**:
- `run_start` event exists with correct provider
- `llm_call` → `llm_result` pair exists (at least one)
- `run_end` event has `error: null`
- `run_end.duration_ms` is reasonable (< 30s for simple prompt)
-
-**What to check in output**:
- Text contains "Paris"
-
-### 2. Tool Usage
-
-**Goal**: Verify tools are called correctly when the prompt requires them.
-
-```bash
-pnpm multica run --run-log --cwd /tmp "List the files in the current directory"
-```
-
-**What to check in run-log**:
- `tool_start` event with `tool: "exec"` or similar filesystem tool
- Matching `tool_end` with `is_error: false`
- Tool called before final `run_end`
-
-**What to check in output**:
- Output contains actual file names from /tmp
-
-### 3. Context Compaction
-
-**Goal**: Verify compaction works correctly on long sessions.
-
-```bash
-# Build up a long session to trigger compaction
-pnpm multica run --run-log "Write a detailed 2000-word essay about climate change"
-# Note session ID, then continue:
-pnpm multica run --run-log --session {id} "Now write another 2000-word essay about renewable energy"
-pnpm multica run --run-log --session {id} "Summarize both essays in 3 bullet points"
-```
-
-**What to check in run-log**:
- `preflight_compact_start` appears when utilization exceeds trigger ratio
- `tool_result_pruning` shows `soft_trimmed > 0` or `hard_cleared > 0` if tool results were pruned
- `compaction` event has `tokens_removed > 0` (not near-zero like the bug we fixed)
- `compaction_detail` shows `pre_pruning_tokens` > `post_compaction_tokens`
-
-### 4. Multi-Provider Comparison
-
-**Goal**: Verify the same prompt works across different providers.
-
-```bash
-pnpm multica run --run-log --provider kimi-coding "Explain recursion in 2 sentences"
-pnpm multica run --run-log --provider claude-code "Explain recursion in 2 sentences"
-```
-
-**What to check**:
- Both runs complete without errors
- Both `run_end` events have `error: null`
- Compare `llm_result.duration_ms` across providers
- Both outputs are meaningful explanations of recursion
-
-### 5. Error Handling & Auth Rotation
-
-**Goal**: Verify error recovery when credentials are invalid.
-
-```bash
-pnpm multica run --run-log --provider anthropic --api-key "sk-invalid-key" "Hello"
-```
-
-**What to check in run-log**:
- `error_classify` event with `reason: "auth"`
- `auth_rotate` event if multiple profiles are configured
- `run_end` with appropriate error message if no valid profiles exist
-
-## Analysis Patterns
-
-When analyzing run-logs, look for these patterns:
-
-### Healthy Run
-```
-run_start → llm_call → llm_result → run_end (error: null)
-```
-
-### Run with Tool Usage
-```
-run_start → llm_call → llm_result → tool_start → tool_end → llm_call → llm_result → run_end
-```
-
-### Run with Compaction
-```
-run_start → preflight_compact_start → tool_result_pruning → preflight_compact_end → llm_call → ...
-```
-
-### Red Flags
- `run_end` without preceding `run_start` (log corruption)
- `tool_start` without matching `tool_end` (tool hang/crash)
- `compaction` with `tokens_removed` near zero (compaction ineffective)
- Multiple `error_classify` events (repeated failures)
- `context_overflow_forced` (emergency fallback — should be rare)
-
-## Creating a New Test Playbook
-
-When a new feature is implemented, create a test playbook following this template:
-
-```markdown
-### N. Feature Name
-
-**Goal**: One sentence describing what to verify.
-
-**Command**:
-\`\`\`bash
-pnpm multica run --run-log [options] "prompt that exercises the feature"
-\`\`\`
-
-**What to check in run-log**:
- List specific events and field values to verify
- Include both positive checks (event exists) and negative checks (no errors)
-
-**What to check in output**:
- What the text output should contain or look like
-
-**What to check in session.jsonl** (if applicable):
- Specific message patterns to verify
-```
-
-## Tips for Coding Agents
-
-1. **Always use `--run-log`** — without it, there's no structured data to analyze
-2. **Use `--cwd`** to control the working directory for file-related tests
-3. **Read run-log line by line** — each line is independent JSON, parse individually
-4. **Check event ordering** — events are chronologically ordered by `ts`
-5. **Token counts are estimates** — don't expect exact values, check for reasonable ranges
-6. **Clean up test sessions** — after testing, remove session dirs from `~/.super-multica/sessions/` to avoid clutter
-7. **Use `--provider`** to test specific providers — defaults to whatever is configured in credentials
-8. **For multi-turn tests**, always capture and reuse the session ID from the first run
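Several of the "Red Flags" in the removed `docs/e2e-testing-guide.md` are mechanical enough to check in code before any judgment-based analysis. The sketch below is one possible shape for that, not the removed analyzer: `analyzeRunLog` and both interfaces are hypothetical names, only the event and field names (`run_start`, `run_end`, `tool_start`, `tool_end`, `is_error`, `error`) come from the guide, and it assumes one JSON object per line.

```typescript
// Hypothetical shapes; event and field names follow the removed guide.
interface RunLogEvent {
  ts: number;
  event: string;
  tool?: string;
  is_error?: boolean;
  error?: unknown;
}

interface RunLogReport {
  ordered: boolean;       // run_start appears before run_end
  unmatchedTools: number; // tool_start events with no matching tool_end
  toolErrors: number;     // tool_end events with is_error: true
  runError: unknown;      // run_end.error (null on a healthy run)
}

function analyzeRunLog(jsonl: string): RunLogReport {
  let startAt = -1;
  let endAt = -1;
  let runError: unknown = undefined;
  let toolErrors = 0;
  const open: Record<string, number> = {}; // per-tool start/end balance

  const lines = jsonl.split("\n").filter((line) => line.trim() !== "");
  lines.forEach((line, i) => {
    const e = JSON.parse(line) as RunLogEvent;
    const tool = e.tool || "unknown";
    if (e.event === "run_start" && startAt === -1) startAt = i;
    if (e.event === "run_end") {
      endAt = i;
      runError = e.error;
    }
    if (e.event === "tool_start") open[tool] = (open[tool] || 0) + 1;
    if (e.event === "tool_end") {
      open[tool] = (open[tool] || 0) - 1;
      if (e.is_error === true) toolErrors += 1;
    }
  });

  let unmatchedTools = 0;
  for (const tool of Object.keys(open)) {
    const balance = open[tool] || 0;
    if (balance > 0) unmatchedTools += balance;
  }

  const ordered = startAt !== -1 && endAt !== -1 && startAt < endAt;
  return { ordered, unmatchedTools, toolErrors, runError };
}
```

A report like this only covers structural checks; whether the agent did what the prompt asked still needs the intent-level review the guide describes.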
--- a/docs/package-management.md
+++ b/docs/package-management.md
@@ -1,48 +0,0 @@
-# Package Management
-
-## Workspace
-
- Package manager: `pnpm` (workspace mode)
- Build orchestrator: `turbo`
-
-## Required `.npmrc`
-
-Keep this in repo root:
-
-```ini
-shamefully-hoist=true
-```
-
-This is required for Electron packaging compatibility in this monorepo.
-
-## Install
-
-```bash
-pnpm install
-```
-
-## Clean Reinstall (When Needed)
-
-Use this when lockfile/hoist state is corrupted or after major package-manager config changes:
-
-```bash
-rm -rf node_modules apps/*/node_modules packages/*/node_modules
-rm -f pnpm-lock.yaml
-pnpm install
-```
-
-## Build / Check
-
-```bash
-pnpm build
-pnpm typecheck
-pnpm test
-```
-
-## Targeted Commands
-
-```bash
-pnpm --filter @multica/desktop build
-pnpm --filter @multica/core build
-pnpm --filter @multica/web dev
-```
--- a/docs/skills-and-tools.md
+++ b/docs/skills-and-tools.md
@@ -1,85 +0,0 @@
-# Skills and Tools
-
-## Skills Loading Model
-
-Skills are loaded from two sources with precedence:
-
-1. Managed skills: `~/.super-multica/skills/`
-2. Profile skills: `~/.super-multica/agent-profiles/<profile-id>/skills/`
-
-Profile skills override managed skills when IDs conflict.
-
-## Skill File Contract
-
-A valid skill directory must include:
-
- `SKILL.md`
-
-Optional runtime files:
-
- `.env`
- helper scripts/assets
-
-## Current Repo Note
-
-This repository intentionally keeps docs and bundled skill metadata minimal.
-If a directory under `skills/` does not contain `SKILL.md`, it will not be loaded as a skill.
-
-## Skills CLI
-
-```bash
-multica skills list
-multica skills status [id]
-multica skills install <id>
-multica skills add <owner/repo[/skill]>
-multica skills remove <name>
-```
-
-## Tool System
-
-`@multica/core` composes:
-
- base coding tools (`read/write/edit/...`)
- extended tools (`exec`, `process`, `glob`, `web_fetch`, `web_search`, `data`, `cron`, `delegate`)
- conditional tools (`send_file`)
-
-Tool errors are wrapped into structured tool results instead of crashing runs.
-
-## Tool Groups
-
-Supported group aliases:
-
- `group:fs` -> `read, write, edit, glob`
- `group:runtime` -> `exec, process`
- `group:web` -> `web_search, web_fetch`
- `group:subagent` -> `delegate`
- `group:cron` -> `cron`
- `group:data` -> `data`
- `group:core` -> core local/web/data set
-
-## Tool Policy Example
-
-```json5
-{
-  tools: {
-    allow: ["group:fs", "web_search", "web_fetch"],
-    deny: ["exec"],
-    byProvider: {
-      "openai": {
-        deny: ["data"],
-      },
-    },
-  },
-}
-```
-
-`deny` always has priority over `allow`.
-
-## Inspect Effective Tools
-
-```bash
-multica tools list
-multica tools list --allow group:fs,web_fetch
-multica tools list --deny exec
-multica tools groups
-```
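The group aliases and the "`deny` always has priority over `allow`" rule from the removed `docs/skills-and-tools.md` can be illustrated with a short resolver. This is a sketch under stated assumptions, not the removed `@multica/core` logic: `TOOL_GROUPS`, `ToolPolicy`, `expand`, and `effectiveTools` are hypothetical names, `group:core` is omitted because the guide does not list its members, and an empty `allow` list is treated here as "nothing allowed", which may differ from the real behavior.

```typescript
// Group members copied from the removed guide (group:core omitted).
const TOOL_GROUPS: Record<string, string[]> = {
  "group:fs": ["read", "write", "edit", "glob"],
  "group:runtime": ["exec", "process"],
  "group:web": ["web_search", "web_fetch"],
  "group:subagent": ["delegate"],
  "group:cron": ["cron"],
  "group:data": ["data"],
};

// Hypothetical policy shape mirroring the json5 example in the guide.
interface ToolPolicy {
  allow?: string[];
  deny?: string[];
  byProvider?: Record<string, { deny?: string[] }>;
}

// Expands group aliases into tool names, keeping first-seen order.
function expand(names: string[]): string[] {
  const out: string[] = [];
  for (const name of names) {
    for (const tool of TOOL_GROUPS[name] || [name]) {
      if (out.indexOf(tool) === -1) out.push(tool);
    }
  }
  return out;
}

function effectiveTools(policy: ToolPolicy, provider: string): string[] {
  const allowed = expand(policy.allow || []);
  const rule = policy.byProvider ? policy.byProvider[provider] : undefined;
  const providerDeny = rule && rule.deny ? rule.deny : [];
  const denied = expand((policy.deny || []).concat(providerDeny));
  // deny (global or per-provider) always wins over allow.
  return allowed.filter((tool) => denied.indexOf(tool) === -1);
}
```

Expanding both lists before subtracting means a denied tool is removed even when it was allowed indirectly through a group alias.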
--- a/docs/web-tools-policy-optimization.md
+++ b/docs/web-tools-policy-optimization.md
@@ -1,63 +0,0 @@
-# Web Tools Policy Optimization Roadmap
-
-Related Linear issue: [MUL-267](https://linear.app/indexlabs/issue/MUL-267/refactor-web-evidence-guard-to-hybrid-policy-and-configurable-rule)
-
-## Context
-
-The current web evidence guard solved the immediate quality issue:
- It enforces `web_search` -> `web_fetch` evidence coverage in runtime.
- It blocks snippet-only finalization in key web-dependent cases.
-
-However, semantic intent detection currently relies on hard-coded regex cue groups in `packages/core/src/agent/web-tools-policy.ts`. This is deterministic but not ideal for long-term maintainability and multilingual robustness.
-
-## Problem Statement
-
-Current limitations:
- Semantic classification logic is tightly coupled with runtime enforcement code.
- Pattern lists are code-level constants, making iteration high-friction.
- Coverage expansion risks overfitting and regression without a stronger benchmark loop.
-
-## Target Architecture
-
-Use a hybrid policy model:
-1. Deterministic guardrail layer (must keep)
- Tool-trace based invariants (e.g. search/fetch sequencing, minimum successful fetch count).
-
-2. Semantic decision layer (new)
- Lightweight model/classifier returns decision + confidence + reason codes.
-
-3. Rulepack fallback layer (refactor existing patterns)
- Externalized locale-aware cue packs for conservative fallback only.
-
-## Migration Plan
-
-Phase 1: Decouple configuration
- Move regex cue groups out of `web-tools-policy.ts` into a policy registry.
- Keep behavior equivalent.
-
-Phase 2: Add semantic classifier path
- Add an optional semantic decision step with confidence threshold.
- Preserve deterministic tool-trace constraints as final authority.
-
-Phase 3: Observability and tuning
- Emit run-log fields for policy decision source:
-  - `tool-trace`
-  - `semantic`
-  - `fallback-pattern`
- Add benchmark slices focused on false-positive/false-negative policy triggers.
-
-Phase 4: Reduce hard-coded fallback
- Keep only minimal safety patterns in code.
- Shift language/phrase evolution to versioned config updates.
-
-## Acceptance Criteria
-
- No large hard-coded regex arrays in runtime policy file.
- Semantic decision path is independently testable and feature-flagged.
- Baseline behavior remains backward-compatible for existing guard cases.
- Benchmark report shows equal or lower policy misfire rate.
-
-## Non-goals
-
- Replacing deterministic tool-trace enforcement with pure model decisions.
- Expanding scope to unrelated tool policy domains in the same iteration.