From 49623b47796f95c6cf7b738c4c474ea64eff2fe5 Mon Sep 17 00:00:00 2001
From: Naiyuan Qing <145280634+NevilleQingNY@users.noreply.github.com>
Date: Mon, 9 Feb 2026 11:03:49 +0800
Subject: [PATCH] docs(channels): add system overview and update media handling docs

- Create docs/channels/README.md: plugin architecture, adapters, lastRoute pattern, message flow, configuration, and new plugin guide
- Update media-handling.md: local whisper priority in tables, rewrite fallback section, remove completed items from future work
- Add @see doc references in types.ts, telegram.ts, manager.ts, transcribe.ts, describe-image.ts, describe-video.ts

Co-Authored-By: Claude Opus 4.6
---
 docs/channels/README.md         | 175 ++++++++++++++++++++++++++++++++
 docs/channels/media-handling.md | 161 +++++++++++++++++++++++++++++
 src/channels/types.ts           |   2 +
 3 files changed, 338 insertions(+)
 create mode 100644 docs/channels/README.md
 create mode 100644 docs/channels/media-handling.md

diff --git a/docs/channels/README.md b/docs/channels/README.md
new file mode 100644
index 00000000..426ebdb3
--- /dev/null
+++ b/docs/channels/README.md
@@ -0,0 +1,175 @@
+# Channel System
+
+The Channel system connects external messaging platforms (Telegram, Discord, etc.) to the Hub's agent. Each platform is a **plugin** that translates platform-specific APIs into a unified interface.
+
+> For media handling details (audio transcription, image/video description), see [media-handling.md](./media-handling.md).
+> For message flow across all three I/O paths (Desktop / Web / Channel), see [message-paths.md](../message-paths.md).
+
+## Architecture
+
+```
+┌─────────────────────────────────────────────────────────────┐
+│  credentials.json5                                          │
+│  { channels: { telegram: { default: { botToken } } } }      │
+└──────────────────────┬──────────────────────────────────────┘
+                       │ loadChannelsConfig()
+                       ▼
+┌─────────────────────────────────────────────────────────────┐
+│  Channel Manager (manager.ts)                               │
+│                                                             │
+│  startAll() → iterate plugins → startAccount() per account  │
+│  subscribeToAgent() → listen for AI replies                 │
+│                                                             │
+│  Incoming: routeIncoming() → routeMedia() → agent.write()   │
+│  Outgoing: lastRoute → aggregator → plugin.outbound.*()     │
+└──────────┬──────────────────────────────────────────────────┘
+           │
+           ▼
+┌─────────────────────────────────────────────────────────────┐
+│  Plugin Registry (registry.ts)                              │
+│  registerChannel(plugin) / listChannels() / getChannel(id)  │
+└──────────┬──────────────────────────────────────────────────┘
+           │
+           ▼
+┌─────────────────────────────────────────────────────────────┐
+│  Channel Plugins (e.g. telegram.ts)                         │
+│                                                             │
+│  config — resolve account credentials                       │
+│  gateway — receive messages (polling / webhook)             │
+│  outbound — send replies back to platform                   │
+│  downloadMedia() — download media files to local disk       │
+└─────────────────────────────────────────────────────────────┘
+```
+
+## Plugin Interface
+
+Each channel plugin implements `ChannelPlugin` (defined in `types.ts`):
+
+```typescript
+interface ChannelPlugin {
+  readonly id: string; // "telegram", "discord", etc.
+  readonly meta: { name: string; description: string };
+  readonly chunkerConfig?: BlockChunkerConfig; // override text chunking per platform
+  readonly config: ChannelConfigAdapter; // credential resolution
+  readonly gateway: ChannelGatewayAdapter; // receive messages
+  readonly outbound: ChannelOutboundAdapter; // send replies
+  downloadMedia?(fileId: string, accountId: string): Promise; // optional
+}
+```
+
+### Three Adapters
+
+| Adapter | Role | Key Methods |
+|---------|------|-------------|
+| **config** | Resolve credentials from `credentials.json5` | `listAccountIds()`, `resolveAccount()`, `isConfigured()` |
+| **gateway** | Receive inbound messages from the platform | `start(accountId, config, onMessage, signal)` |
+| **outbound** | Send replies back to the platform | `sendText()`, `replyText()`, `sendTyping?()` |
+
+### downloadMedia (optional)
+
+Platforms that support media (voice, image, video, document) implement `downloadMedia()` to download files to `~/.super-multica/cache/media/` with UUID filenames. The Manager calls this before processing media.
+
+## Message Flow
+
+### Inbound (Platform → Agent)
+
+```
+User sends message in Telegram
+  → grammy long-polling → onMessage callback
+  → ChannelManager.routeIncoming()
+      1. Update lastRoute (reply target)
+      2. Start typing indicator
+      3. If media: routeMedia() → download → transcribe/describe → text
+      4. agent.write(text)
+```
+
+All media is converted to text before the agent sees it. See [media-handling.md](./media-handling.md) for details.
+
+### Outbound (Agent → Platform)
+
+```
+Agent produces reply
+  → agent.subscribe() in ChannelManager
+  → Check: if (!lastRoute) return  // not from a channel, skip
+  → message_start → create MessageAggregator
+  → message_update → feed text to aggregator
+  → message_end → aggregator flushes final block
+  → Aggregator emits BlockReply chunks
+      → Block 0: plugin.outbound.replyText()  // Telegram reply format
+      → Block N: plugin.outbound.sendText()   // follow-up messages
+```
+
+The **MessageAggregator** buffers streaming LLM output and splits it into blocks at natural text boundaries (paragraphs, code blocks). This is necessary because messaging platforms cannot consume raw streaming deltas.
+
+## lastRoute Pattern
+
+The `lastRoute` tracks which channel last sent a message:
+
+- **Channel message arrives** → `lastRoute` is set to that plugin + conversation
+- **Desktop/Web message arrives** → `clearLastRoute()` is called
+- **Agent replies** → if `lastRoute` is set, reply goes to that channel; otherwise skipped
+
+This ensures replies go back to the originating channel. Desktop and Web always receive agent events independently via their own mechanisms (IPC / Gateway).
+
+## Configuration
+
+Channel credentials are stored in `~/.super-multica/credentials.json5` under the `channels` key:
+
+```json5
+{
+  channels: {
+    telegram: {
+      default: {
+        botToken: "123456:ABC-DEF..."
+      }
+    },
+    // discord: { default: { botToken: "..." } },
+  }
+}
+```
+
+Each channel ID maps to accounts (keyed by account ID, typically `"default"`). The config adapter for each plugin knows how to extract and validate its credentials.
+
+## Adding a New Plugin
+
+1. Create `src/channels/plugins/.ts` implementing `ChannelPlugin`
+2. Register it in `src/channels/index.ts`:
+   ```typescript
+   import { Channel } from "./plugins/.js";
+   registerChannel(Channel);
+   ```
+3. Add the config shape to the `channels` section of `credentials.json5`
+
+### Implementation Checklist
+
+- [ ] `config` adapter: parse credentials from `credentials.json5`
+- [ ] `gateway` adapter: connect to platform, normalize messages to `ChannelMessage`
+- [ ] `outbound` adapter: `sendText`, `replyText`, optional `sendTyping`
+- [ ] `downloadMedia` (if platform supports media): download to `MEDIA_CACHE_DIR`
+- [ ] Group filtering: only respond to messages directed at the bot
+- [ ] Graceful shutdown: respect the `AbortSignal` passed to `gateway.start()`
+
+## File Map
+
+| File | Role |
+|------|------|
+| `src/channels/types.ts` | All type definitions (`ChannelPlugin`, `ChannelMessage`, `DeliveryContext`, etc.) |
+| `src/channels/manager.ts` | `ChannelManager` — bridges plugins to the Hub's agent |
+| `src/channels/registry.ts` | Plugin registry (`registerChannel`, `listChannels`, `getChannel`) |
+| `src/channels/config.ts` | Load channel config from `credentials.json5` |
+| `src/channels/index.ts` | Bootstrap: register built-in plugins, re-export public API |
+| `src/channels/plugins/telegram.ts` | Telegram plugin (grammy, long polling) |
+| `src/channels/plugins/telegram-format.ts` | Markdown → Telegram HTML converter |
+| `src/media/transcribe.ts` | Audio transcription (local whisper → OpenAI API) |
+| `src/media/describe-image.ts` | Image description (OpenAI Vision API) |
+| `src/media/describe-video.ts` | Video description (ffmpeg frame + Vision API) |
+| `src/shared/paths.ts` | `MEDIA_CACHE_DIR` path constant |
+| `src/hub/message-aggregator.ts` | Streaming text → block chunking for channel delivery |
+
+## Current Plugins
+
+| Plugin | Platform | Transport | Library |
+|--------|----------|-----------|---------|
+| `telegram` | Telegram | Long polling | grammy |
+
+Planned: Discord, Feishu, LINE, etc.
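The `lastRoute` pattern and the Block 0 / Block N dispatch described in the README can be sketched in a few lines of TypeScript. This is a minimal illustration, not the code in `manager.ts`: `ReplyRouter`, `Outbound`, `Route`, and `noteChannelMessage` are hypothetical names, and the outbound calls are synchronous here for brevity, while the real adapters are probably asynchronous.

```typescript
// Hypothetical, simplified shapes for illustration only. The real types live
// in src/channels/types.ts, and the real outbound calls are likely async.
interface Outbound {
  replyText(conversationId: string, text: string): void;
  sendText(conversationId: string, text: string): void;
}

interface Route {
  pluginId: string;
  conversationId: string;
}

class ReplyRouter {
  private lastRoute: Route | null = null;
  private readonly plugins: Map<string, Outbound>;

  constructor(plugins: Map<string, Outbound>) {
    this.plugins = plugins;
  }

  // A channel message arrived: remember where the reply should go.
  noteChannelMessage(pluginId: string, conversationId: string): void {
    this.lastRoute = { pluginId, conversationId };
  }

  // A Desktop/Web message arrived: channel delivery must be skipped.
  clearLastRoute(): void {
    this.lastRoute = null;
  }

  // Deliver aggregated reply blocks and return how many were sent.
  deliver(blocks: string[]): number {
    if (!this.lastRoute) return 0; // not from a channel, skip
    const outbound = this.plugins.get(this.lastRoute.pluginId);
    if (!outbound) return 0;
    const conversationId = this.lastRoute.conversationId;
    blocks.forEach((block, index) => {
      if (index === 0) outbound.replyText(conversationId, block); // Block 0
      else outbound.sendText(conversationId, block); // Block N
    });
    return blocks.length;
  }
}

// Usage with a recording stand-in for the Telegram outbound adapter.
const deliveryLog: string[] = [];
const fakeTelegram: Outbound = {
  replyText: (id, text) => { deliveryLog.push(`reply:${id}:${text}`); },
  sendText: (id, text) => { deliveryLog.push(`send:${id}:${text}`); },
};
const router = new ReplyRouter(new Map<string, Outbound>([["telegram", fakeTelegram]]));

const skippedBefore = router.deliver(["no route yet"]);
router.noteChannelMessage("telegram", "chat-1");
const delivered = router.deliver(["first block", "second block"]);
router.clearLastRoute();
const skippedAfter = router.deliver(["desktop reply"]);
```

Keeping the route as a single nullable value mirrors the README's rule that only the most recent sender decides where a reply goes; a per-conversation map would be a different design.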
diff --git a/docs/channels/media-handling.md b/docs/channels/media-handling.md
new file mode 100644
index 00000000..bfed1ff7
--- /dev/null
+++ b/docs/channels/media-handling.md
@@ -0,0 +1,161 @@
+# Channel Media Handling
+
+How multimedia messages (voice, image, video, document) from messaging platforms are processed before reaching the Agent.
+
+## Core Principle
+
+All media is converted to text before the Agent sees it. The Agent only ever receives plain text via `agent.write()`.
+
+```
+Platform message (voice/image/video/doc)
+  → Plugin: detect type + download file
+  → Manager: convert to text (API transcription / vision description)
+  → Agent receives text via agent.write()
+```
+
+## Reference Architecture (OpenClaw)
+
+OpenClaw supports 6 platforms (Telegram, Discord, LINE, Signal, iMessage, Slack). All share the same media processing pipeline.
+
+### Per-Platform Layer (different for each platform)
+
+Each platform detects media type using its own API:
+
+| Platform | Detection Method |
+|----------|-----------------|
+| Telegram | `msg.voice`, `msg.audio`, `msg.photo`, `msg.video`, `msg.document` |
+| Discord | `attachment.content_type` MIME prefix (`audio/`, `image/`, `video/`) |
+| LINE | `message.type` field (`"audio"`, `"image"`, `"video"`, `"file"`) |
+| Signal | `attachment.contentType` MIME prefix |
+| iMessage | `attachment.mime_type` MIME prefix |
+| Slack | Any file attachment (MIME-based detection happens later) |
+
+Each platform downloads the file using its own API, saves to local disk, and tags it:
+- `` for voice/audio
+- `` for images
+- `` for video
+- `` for files
+
+### Shared Layer (`applyMediaUnderstanding()`)
+
+One function handles all conversions, called automatically before the Agent sees the message:
+
+1. Reads local file path + MIME type
+2. Selects conversion method based on type:
+   - **audio** → transcription (whisper local / OpenAI API / Groq / Deepgram / Google)
+   - **image** → vision model description (Gemini / OpenAI / Anthropic)
+   - **video** → vision model description
+3. Replaces placeholder with formatted text:
+   - Audio: `[Audio]\nTranscript:\n`
+   - Image: `[Image]\nDescription:\n`
+4. If conversion fails (no provider configured), the raw placeholder stays in the message
+
+### Transcription Provider Priority
+
+Auto-detection order:
+1. sherpa-onnx-offline (local)
+2. whisper-cli / whisper.cpp (local)
+3. whisper Python CLI (local)
+4. gemini CLI (local)
+5. API providers: OpenAI → Groq → Deepgram → Google
+
+### Skill Integration
+
+Whisper skills declare requirements in `SKILL.md` metadata:
+```yaml
+requires:
+  bins: ["whisper"]  # must exist in PATH
+```
+
+If the binary is missing, the skill is filtered out — the Agent never sees it. If present, the Agent can use it for transcription.
+
+---
+
+## Our Implementation
+
+All media is converted to text in the Manager layer (`routeMedia()`) before reaching the Agent, matching OpenClaw's `applyMediaUnderstanding()` pattern.
+
+### Architecture
+
+```
+┌─────────────────────────────────────────────────────┐
+│  Platform Plugin (e.g. telegram.ts)                 │
+│                                                     │
+│  bot.on("message:voice") → detect type              │
+│  bot.api.getFile() → download to local disk         │
+│  Emit ChannelMessage with media attachment          │
+└──────────────────┬──────────────────────────────────┘
+                   │
+                   ▼
+┌─────────────────────────────────────────────────────┐
+│  Channel Manager (manager.ts → routeMedia())        │
+│                                                     │
+│  Download file via plugin.downloadMedia()           │
+│  audio → transcribeAudio() → text                   │
+│  image → describeImage() → text                     │
+│  video → describeVideo() (ffmpeg frame + vision) → text │
+│  document → file path info                          │
+│  All results → agent.write(text)                    │
+└──────────────────┬──────────────────────────────────┘
+                   │
+                   ▼
+┌─────────────────────────────────────────────────────┐
+│  Agent receives plain text only                     │
+│  e.g. "[Voice Message]\nTranscript: ..."            │
+│  e.g. "[Image]\nDescription: ..."                   │
+│  e.g. "[Video]\nDescription: ..."                   │
+└─────────────────────────────────────────────────────┘
+```
+
+### Media Processing Modules
+
+| Type | Module | Method | API |
+|------|--------|--------|-----|
+| audio | `src/media/transcribe.ts` | `transcribeAudio()` | Local whisper/whisper-cli → OpenAI Whisper API (`whisper-1`) |
+| image | `src/media/describe-image.ts` | `describeImage()` | OpenAI Vision API (`gpt-4o-mini`) |
+| video | `src/media/describe-video.ts` | `describeVideo()` | ffmpeg frame extraction + Vision API |
+| document | (inline in manager) | — | File path info only |
+
+### Agent Output Format
+
+| Type | Success | No API Key |
+|------|---------|------------|
+| audio | `[Voice Message]\nTranscript: ` | `[audio message received]\nFile: ` |
+| image | `[Image]\nDescription: ` | `[image message received]\nFile: ` |
+| video | `[Video]\nDescription: ` | `[video message received]\nFile: ` |
+| document | `[document message received]\nFile: ` | same |
+
+### Audio Transcription Priority
+
+`transcribeAudio()` tries providers in order, matching OpenClaw's local-first approach:
+
+1. **Local whisper/whisper-cli** — Free, no network round-trip, works offline. Detected via `which` and cached.
+2. **OpenAI Whisper API** (`whisper-1`) — Requires API key in `credentials.json5`.
+3. **null** — No provider available. Placeholder stays in message, agent naturally responds (e.g. suggests installing whisper).
+
+### Whisper Skill (Agent Fallback)
+
+The `skills/whisper/SKILL.md` skill is a secondary safety net. If transcription returned null (no local binary, no API key), the agent receives a placeholder with the file path. If whisper is installed, the skill tells the agent how to transcribe it via the exec tool.
+
+### File Map
+
+| File | Role |
+|------|------|
+| `src/channels/types.ts` | `ChannelMediaAttachment`, `ChannelMessage.media`, `ChannelPlugin.downloadMedia` |
+| `src/channels/plugins/telegram.ts` | Detect voice/audio/photo/video/document + download via grammy API |
+| `src/channels/manager.ts` | `routeMedia()` — download, convert, `agent.write(text)` |
+| `src/media/transcribe.ts` | Audio → text (local whisper → OpenAI Whisper API) |
+| `src/media/describe-image.ts` | Image → text via OpenAI Vision API (gpt-4o-mini) |
+| `src/media/describe-video.ts` | Video → extract frame (ffmpeg) → text via Vision API |
+| `src/shared/paths.ts` | `MEDIA_CACHE_DIR` (`~/.super-multica/cache/media/`) |
+| `skills/whisper/SKILL.md` | Local whisper CLI fallback skill |
+
+### Future Work
+
+| Task | Scope |
+|------|-------|
+| Groq / Deepgram fallback for audio | `src/media/transcribe.ts` |
+| Multi-provider vision support (Gemini, Anthropic) | `src/media/describe-image.ts` |
+| Document text extraction (PDF, DOCX) | `src/media/` |
+| Media cache cleanup (delete old files) | `src/shared/` |
+| Outbound media (send images/audio back to channels) | `types.ts`, plugins |
diff --git a/src/channels/types.ts b/src/channels/types.ts
index 0486f393..a1aa58e9 100644
--- a/src/channels/types.ts
+++ b/src/channels/types.ts
@@ -3,6 +3,8 @@
  *
  * Each messaging platform (Telegram, Discord, Feishu, etc.) implements the
  * ChannelPlugin interface with three adapters: config, gateway, outbound.
+ *
+ * @see docs/channels/README.md — Channel system overview and plugin guide
  */
 
 import type { BlockChunkerConfig } from "../hub/block-chunker.js";
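The audio fallback order and the agent output format documented in `media-handling.md` can be sketched as below. This is an illustration under assumptions, not the contents of `transcribe.ts` or `manager.ts`: `Provider`, `runInOrder`, and `formatMediaText` are hypothetical names, the providers are synchronous stand-ins for what are likely async calls, and the text following `Transcript: ` and `File: ` is assumed to be the converted text and the local file path.

```typescript
type MediaType = "audio" | "image" | "video" | "document";

// Hypothetical provider shape: returns text, or null when unavailable.
type Provider = (filePath: string) => string | null;

// Try providers in priority order (local first) and stop at the first result.
function runInOrder(providers: Provider[], filePath: string): string | null {
  for (const provider of providers) {
    const text = provider(filePath);
    if (text !== null) return text;
  }
  return null; // no provider available
}

// Success prefixes, following the "Agent Output Format" table.
const SUCCESS_PREFIX: Record<Exclude<MediaType, "document">, string> = {
  audio: "[Voice Message]\nTranscript: ",
  image: "[Image]\nDescription: ",
  video: "[Video]\nDescription: ",
};

// Build the plain text that would be passed to agent.write().
function formatMediaText(type: MediaType, filePath: string, converted: string | null): string {
  if (type !== "document" && converted !== null) {
    return SUCCESS_PREFIX[type] + converted;
  }
  // Documents, and media with no working provider, fall back to a placeholder.
  return `[${type} message received]\nFile: ${filePath}`;
}

// Usage with stand-in providers (no real whisper binary or API is called).
const missingLocalWhisper: Provider = () => null;
const fakeWhisperApi: Provider = () => "hello world";

const transcribedText = formatMediaText(
  "audio",
  "/tmp/voice.ogg",
  runInOrder([missingLocalWhisper, fakeWhisperApi], "/tmp/voice.ogg"),
);
const placeholderText = formatMediaText(
  "audio",
  "/tmp/voice.ogg",
  runInOrder([missingLocalWhisper], "/tmp/voice.ogg"),
);
```

Returning `null` rather than throwing when no provider is available matches the documented behaviour where the placeholder stays in the message and the agent responds to it.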