From 49623b47796f95c6cf7b738c4c474ea64eff2fe5 Mon Sep 17 00:00:00 2001
From: Naiyuan Qing <145280634+NevilleQingNY@users.noreply.github.com>
Date: Mon, 9 Feb 2026 11:03:49 +0800
Subject: [PATCH] docs(channels): add system overview and update media handling
 docs

- Create docs/channels/README.md: plugin architecture, adapters, lastRoute
  pattern, message flow, configuration, and new plugin guide
- Update media-handling.md: local whisper priority in tables, rewrite
  fallback section, remove completed items from future work
- Add @see doc references in types.ts, telegram.ts, manager.ts,
  transcribe.ts, describe-image.ts, describe-video.ts

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---
 docs/channels/README.md         | 175 ++++++++++++++++++++++++++++++++
 docs/channels/media-handling.md | 161 +++++++++++++++++++++++++++++
 src/channels/types.ts           |   2 +
 3 files changed, 338 insertions(+)
 create mode 100644 docs/channels/README.md
 create mode 100644 docs/channels/media-handling.md

diff --git a/docs/channels/README.md b/docs/channels/README.md
new file mode 100644
index 00000000..426ebdb3
--- /dev/null
+++ b/docs/channels/README.md
@@ -0,0 +1,175 @@
+# Channel System
+
+The Channel system connects external messaging platforms (Telegram, Discord, etc.) to the Hub's agent. Each platform is implemented as a **plugin** that adapts its platform-specific API to a unified interface.
+
+> For media handling details (audio transcription, image/video description), see [media-handling.md](./media-handling.md).
+> For message flow across all three I/O paths (Desktop / Web / Channel), see [message-paths.md](../message-paths.md).
+
+## Architecture
+
+```
+┌─────────────────────────────────────────────────────────────┐
+│  credentials.json5                                          │
+│  { channels: { telegram: { default: { botToken } } } }     │
+└──────────────────────┬──────────────────────────────────────┘
+                       │ loadChannelsConfig()
+                       ▼
+┌─────────────────────────────────────────────────────────────┐
+│  Channel Manager (manager.ts)                               │
+│                                                             │
+│  startAll() → iterate plugins → startAccount() per account  │
+│  subscribeToAgent() → listen for AI replies                 │
+│                                                             │
+│  Incoming: routeIncoming() → routeMedia() → agent.write()  │
+│  Outgoing: lastRoute → aggregator → plugin.outbound.*()    │
+└──────────┬──────────────────────────────────────────────────┘
+           │
+           ▼
+┌─────────────────────────────────────────────────────────────┐
+│  Plugin Registry (registry.ts)                              │
+│  registerChannel(plugin) / listChannels() / getChannel(id)  │
+└──────────┬──────────────────────────────────────────────────┘
+           │
+           ▼
+┌─────────────────────────────────────────────────────────────┐
+│  Channel Plugins (e.g. telegram.ts)                         │
+│                                                             │
+│  config    — resolve account credentials                    │
+│  gateway   — receive messages (polling / webhook)           │
+│  outbound  — send replies back to platform                  │
+│  downloadMedia() — download media files to local disk       │
+└─────────────────────────────────────────────────────────────┘
+```
+
+## Plugin Interface
+
+Each channel plugin implements `ChannelPlugin` (defined in `types.ts`):
+
+```typescript
+interface ChannelPlugin {
+  readonly id: string;                          // "telegram", "discord", etc.
+  readonly meta: { name: string; description: string };
+  readonly chunkerConfig?: BlockChunkerConfig;  // override text chunking per platform
+  readonly config: ChannelConfigAdapter;        // credential resolution
+  readonly gateway: ChannelGatewayAdapter;      // receive messages
+  readonly outbound: ChannelOutboundAdapter;    // send replies
+  downloadMedia?(fileId: string, accountId: string): Promise<string>;  // optional
+}
+```
+
+### Three Adapters
+
+| Adapter | Role | Key Methods |
+|---------|------|-------------|
+| **config** | Resolve credentials from `credentials.json5` | `listAccountIds()`, `resolveAccount()`, `isConfigured()` |
+| **gateway** | Receive inbound messages from the platform | `start(accountId, config, onMessage, signal)` |
+| **outbound** | Send replies back to the platform | `sendText()`, `replyText()`, `sendTyping?()` |
+
+### downloadMedia (optional)
+
+Platforms that support media (voice, image, video, document) implement `downloadMedia()` to download files to `~/.super-multica/cache/media/` with UUID filenames. The Manager calls this before processing media.
+
+## Message Flow
+
+### Inbound (Platform → Agent)
+
+```
+User sends message in Telegram
+  → grammy long-polling → onMessage callback
+    → ChannelManager.routeIncoming()
+      1. Update lastRoute (reply target)
+      2. Start typing indicator
+      3. If media: routeMedia() → download → transcribe/describe → text
+      4. agent.write(text)
+```
+
+All media is converted to text before the agent sees it. See [media-handling.md](./media-handling.md) for details.
+
+### Outbound (Agent → Platform)
+
+```
+Agent produces reply
+  → agent.subscribe() in ChannelManager
+    → Check: if (!lastRoute) return   // not from a channel, skip
+    → message_start → create MessageAggregator
+    → message_update → feed text to aggregator
+    → message_end → aggregator flushes final block
+      → Aggregator emits BlockReply chunks
+        → Block 0: plugin.outbound.replyText()   // Telegram reply format
+        → Block N: plugin.outbound.sendText()     // follow-up messages
+```
+
+The **MessageAggregator** buffers streaming LLM output and splits it into blocks at natural text boundaries (paragraphs, code blocks). This is necessary because messaging platforms cannot consume raw streaming deltas.
+
+## lastRoute Pattern
+
+`lastRoute` tracks which channel most recently sent a message:
+
+- **Channel message arrives** → `lastRoute` is set to that plugin + conversation
+- **Desktop/Web message arrives** → `clearLastRoute()` is called
+- **Agent replies** → if `lastRoute` is set, the reply goes to that channel; otherwise channel delivery is skipped
+
+This ensures replies go back to the originating channel. Desktop and Web always receive agent events independently via their own mechanisms (IPC / Gateway).
+
+## Configuration
+
+Channel credentials are stored in `~/.super-multica/credentials.json5` under the `channels` key:
+
+```json5
+{
+  channels: {
+    telegram: {
+      default: {
+        botToken: "123456:ABC-DEF..."
+      }
+    },
+    // discord: { default: { botToken: "..." } },
+  }
+}
+```
+
+Each channel ID maps to accounts (keyed by account ID, typically `"default"`). The config adapter for each plugin knows how to extract and validate its credentials.
+
+## Adding a New Plugin
+
+1. Create `src/channels/plugins/<name>.ts` implementing `ChannelPlugin`
+2. Register it in `src/channels/index.ts`:
+   ```typescript
+   import { <name>Channel } from "./plugins/<name>.js";
+   registerChannel(<name>Channel);
+   ```
+3. Add the config shape to the `channels` section of `credentials.json5`
+
+### Implementation Checklist
+
+- [ ] `config` adapter: parse credentials from `credentials.json5`
+- [ ] `gateway` adapter: connect to platform, normalize messages to `ChannelMessage`
+- [ ] `outbound` adapter: `sendText`, `replyText`, optional `sendTyping`
+- [ ] `downloadMedia` (if platform supports media): download to `MEDIA_CACHE_DIR`
+- [ ] Group filtering: only respond to messages directed at the bot
+- [ ] Graceful shutdown: respect the `AbortSignal` passed to `gateway.start()`
+
+## File Map
+
+| File | Role |
+|------|------|
+| `src/channels/types.ts` | All type definitions (`ChannelPlugin`, `ChannelMessage`, `DeliveryContext`, etc.) |
+| `src/channels/manager.ts` | `ChannelManager` — bridges plugins to the Hub's agent |
+| `src/channels/registry.ts` | Plugin registry (`registerChannel`, `listChannels`, `getChannel`) |
+| `src/channels/config.ts` | Load channel config from `credentials.json5` |
+| `src/channels/index.ts` | Bootstrap: register built-in plugins, re-export public API |
+| `src/channels/plugins/telegram.ts` | Telegram plugin (grammy, long polling) |
+| `src/channels/plugins/telegram-format.ts` | Markdown → Telegram HTML converter |
+| `src/media/transcribe.ts` | Audio transcription (local whisper → OpenAI API) |
+| `src/media/describe-image.ts` | Image description (OpenAI Vision API) |
+| `src/media/describe-video.ts` | Video description (ffmpeg frame + Vision API) |
+| `src/shared/paths.ts` | `MEDIA_CACHE_DIR` path constant |
+| `src/hub/message-aggregator.ts` | Streaming text → block chunking for channel delivery |
+
+## Current Plugins
+
+| Plugin | Platform | Transport | Library |
+|--------|----------|-----------|---------|
+| `telegram` | Telegram | Long polling | grammy |
+
+Planned: Discord, Feishu, LINE, etc.
diff --git a/docs/channels/media-handling.md b/docs/channels/media-handling.md
new file mode 100644
index 00000000..bfed1ff7
--- /dev/null
+++ b/docs/channels/media-handling.md
@@ -0,0 +1,161 @@
+# Channel Media Handling
+
+This document describes how multimedia messages (voice, image, video, document) from messaging platforms are processed before reaching the Agent.
+
+## Core Principle
+
+All media is converted to text before the Agent sees it. The Agent only ever receives plain text via `agent.write()`.
+
+```
+Platform message (voice/image/video/doc)
+  → Plugin: detect type + download file
+  → Manager: convert to text (transcription / vision description)
+  → Agent receives text via agent.write()
+```
+
+## Reference Architecture (OpenClaw)
+
+OpenClaw supports 6 platforms (Telegram, Discord, LINE, Signal, iMessage, Slack). All share the same media processing pipeline.
+
+### Per-Platform Layer (different for each platform)
+
+Each platform detects media type using its own API:
+
+| Platform | Detection Method |
+|----------|-----------------|
+| Telegram | `msg.voice`, `msg.audio`, `msg.photo`, `msg.video`, `msg.document` |
+| Discord | `attachment.content_type` MIME prefix (`audio/`, `image/`, `video/`) |
+| LINE | `message.type` field (`"audio"`, `"image"`, `"video"`, `"file"`) |
+| Signal | `attachment.contentType` MIME prefix |
+| iMessage | `attachment.mime_type` MIME prefix |
+| Slack | Any file attachment (MIME-based detection happens later) |
+
+Each platform downloads the file using its own API, saves to local disk, and tags it:
+- `<media:audio>` for voice/audio
+- `<media:image>` for images
+- `<media:video>` for video
+- `<media:document>` for files
+
+### Shared Layer (`applyMediaUnderstanding()`)
+
+One function handles all conversions, called automatically before the Agent sees the message:
+
+1. Reads local file path + MIME type
+2. Selects conversion method based on type:
+   - **audio** → transcription (whisper local / OpenAI API / Groq / Deepgram / Google)
+   - **image** → vision model description (Gemini / OpenAI / Anthropic)
+   - **video** → vision model description
+3. Replaces placeholder with formatted text:
+   - Audio: `[Audio]\nTranscript:\n<transcribed text>`
+   - Image: `[Image]\nDescription:\n<description text>`
+4. If conversion fails (no provider configured), the raw placeholder stays in the message
+
+### Transcription Provider Priority
+
+Auto-detection order:
+1. sherpa-onnx-offline (local)
+2. whisper-cli / whisper.cpp (local)
+3. whisper Python CLI (local)
+4. gemini CLI (local)
+5. API providers: OpenAI → Groq → Deepgram → Google
+
+### Skill Integration
+
+Whisper skills declare requirements in `SKILL.md` metadata:
+```yaml
+requires:
+  bins: ["whisper"]  # must exist in PATH
+```
+
+If the binary is missing, the skill is filtered out — the Agent never sees it. If present, the Agent can use it for transcription.
+
+---
+
+## Our Implementation
+
+All media is converted to text in the Manager layer (`routeMedia()`) before reaching the Agent, matching OpenClaw's `applyMediaUnderstanding()` pattern.
+
+### Architecture
+
+```
+┌─────────────────────────────────────────────────────┐
+│  Platform Plugin (e.g. telegram.ts)                  │
+│                                                      │
+│  bot.on("message:voice") → detect type               │
+│  bot.api.getFile() → download to local disk           │
+│  Emit ChannelMessage with media attachment            │
+└──────────────────┬──────────────────────────────────┘
+                   │
+                   ▼
+┌─────────────────────────────────────────────────────┐
+│  Channel Manager (manager.ts → routeMedia())         │
+│                                                      │
+│  Download file via plugin.downloadMedia()            │
+│  audio → transcribeAudio() → text                    │
+│  image → describeImage() → text                      │
+│  video → describeVideo() (ffmpeg frame + vision) → text │
+│  document → file path info                           │
+│  All results → agent.write(text)                     │
+└──────────────────┬──────────────────────────────────┘
+                   │
+                   ▼
+┌─────────────────────────────────────────────────────┐
+│  Agent receives plain text only                      │
+│  e.g. "[Voice Message]\nTranscript: ..."             │
+│  e.g. "[Image]\nDescription: ..."                    │
+│  e.g. "[Video]\nDescription: ..."                    │
+└─────────────────────────────────────────────────────┘
+```
+
+### Media Processing Modules
+
+| Type | Module | Method | API |
+|------|--------|--------|-----|
+| audio | `src/media/transcribe.ts` | `transcribeAudio()` | Local whisper/whisper-cli → OpenAI Whisper API (`whisper-1`) |
+| image | `src/media/describe-image.ts` | `describeImage()` | OpenAI Vision API (`gpt-4o-mini`) |
+| video | `src/media/describe-video.ts` | `describeVideo()` | ffmpeg frame extraction + Vision API |
+| document | (inline in manager) | — | File path info only |
+
+### Agent Output Format
+
+| Type | Success | No Provider |
+|------|---------|------------|
+| audio | `[Voice Message]\nTranscript: <text>` | `[audio message received]\nFile: <path>` |
+| image | `[Image]\nDescription: <text>` | `[image message received]\nFile: <path>` |
+| video | `[Video]\nDescription: <text>` | `[video message received]\nFile: <path>` |
+| document | `[document message received]\nFile: <path>` | same |
+
+### Audio Transcription Priority
+
+`transcribeAudio()` tries providers in order, matching OpenClaw's local-first approach:
+
+1. **Local whisper/whisper-cli** — Free, no network round-trip, works offline. Detected via `which` and cached.
+2. **OpenAI Whisper API** (`whisper-1`) — Requires an API key in `credentials.json5`.
+3. **null** — No provider available. The placeholder stays in the message and the agent responds to it directly (e.g. by suggesting that whisper be installed).
+
+### Whisper Skill (Agent Fallback)
+
+The `skills/whisper/SKILL.md` skill is a secondary safety net. If `transcribeAudio()` returns null (no local binary detected, no API key), the agent receives a placeholder with the file path. If whisper is installed, the skill tells the agent how to transcribe that file via the exec tool.
+
+### File Map
+
+| File | Role |
+|------|------|
+| `src/channels/types.ts` | `ChannelMediaAttachment`, `ChannelMessage.media`, `ChannelPlugin.downloadMedia` |
+| `src/channels/plugins/telegram.ts` | Detect voice/audio/photo/video/document + download via grammy API |
+| `src/channels/manager.ts` | `routeMedia()` — download, convert, `agent.write(text)` |
+| `src/media/transcribe.ts` | Audio → text (local whisper → OpenAI Whisper API) |
+| `src/media/describe-image.ts` | Image → text via OpenAI Vision API (gpt-4o-mini) |
+| `src/media/describe-video.ts` | Video → extract frame (ffmpeg) → text via Vision API |
+| `src/shared/paths.ts` | `MEDIA_CACHE_DIR` (`~/.super-multica/cache/media/`) |
+| `skills/whisper/SKILL.md` | Local whisper CLI fallback skill |
+
+### Future Work
+
+| Task | Scope |
+|------|-------|
+| Groq / Deepgram fallback for audio | `src/media/transcribe.ts` |
+| Multi-provider vision support (Gemini, Anthropic) | `src/media/describe-image.ts` |
+| Document text extraction (PDF, DOCX) | `src/media/` |
+| Media cache cleanup (delete old files) | `src/shared/` |
+| Outbound media (send images/audio back to channels) | `types.ts`, plugins |
diff --git a/src/channels/types.ts b/src/channels/types.ts
index 0486f393..a1aa58e9 100644
--- a/src/channels/types.ts
+++ b/src/channels/types.ts
@@ -3,6 +3,8 @@
  *
  * Each messaging platform (Telegram, Discord, Feishu, etc.) implements the
  * ChannelPlugin interface with three adapters: config, gateway, outbound.
+ *
+ * @see docs/channels/README.md — Channel system overview and plugin guide
  */
 
 import type { BlockChunkerConfig } from "../hub/block-chunker.js";