docs(channels): add system overview and update media handling docs
- Create docs/channels/README.md: plugin architecture, adapters, lastRoute pattern, message flow, configuration, and new plugin guide
- Update media-handling.md: local whisper priority in tables, rewrite fallback section, remove completed items from future work
- Add @see doc references in types.ts, telegram.ts, manager.ts, transcribe.ts, describe-image.ts, describe-video.ts

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
parent 30e9041084
commit 49623b4779
3 changed files with 338 additions and 0 deletions
docs/channels/README.md (Normal file, +175)
@@ -0,0 +1,175 @@
# Channel System

The Channel system connects external messaging platforms (Telegram, Discord, etc.) to the Hub's agent. Each platform is a **plugin** that translates platform-specific APIs into a unified interface.

> For media handling details (audio transcription, image/video description), see [media-handling.md](./media-handling.md).
> For message flow across all three I/O paths (Desktop / Web / Channel), see [message-paths.md](../message-paths.md).

## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                      credentials.json5                      │
│  { channels: { telegram: { default: { botToken } } } }      │
└──────────────────────┬──────────────────────────────────────┘
                       │ loadChannelsConfig()
                       ▼
┌─────────────────────────────────────────────────────────────┐
│                Channel Manager (manager.ts)                 │
│                                                             │
│  startAll() → iterate plugins → startAccount() per account  │
│  subscribeToAgent() → listen for AI replies                 │
│                                                             │
│  Incoming: routeIncoming() → routeMedia() → agent.write()   │
│  Outgoing: lastRoute → aggregator → plugin.outbound.*()     │
└──────────┬──────────────────────────────────────────────────┘
           │
           ▼
┌─────────────────────────────────────────────────────────────┐
│                Plugin Registry (registry.ts)                │
│  registerChannel(plugin) / listChannels() / getChannel(id)  │
└──────────┬──────────────────────────────────────────────────┘
           │
           ▼
┌─────────────────────────────────────────────────────────────┐
│             Channel Plugins (e.g. telegram.ts)              │
│                                                             │
│  config          - resolve account credentials              │
│  gateway         - receive messages (polling / webhook)     │
│  outbound        - send replies back to platform            │
│  downloadMedia() - download media files to local disk       │
└─────────────────────────────────────────────────────────────┘
```

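The registry layer in the diagram can be pictured as a map keyed by plugin ID. The following is a minimal sketch, not the contents of `registry.ts`: the `MinimalPlugin` type is a hypothetical stand-in for `ChannelPlugin`, and rejecting duplicate IDs is an assumption.

```typescript
// Hypothetical, trimmed-down plugin shape; the real ChannelPlugin lives in types.ts.
interface MinimalPlugin {
  readonly id: string;
  readonly meta: { name: string; description: string };
}

const registry = new Map<string, MinimalPlugin>();

// Assumption: registering the same ID twice is treated as a programming error.
function registerChannel(plugin: MinimalPlugin): void {
  if (registry.has(plugin.id)) {
    throw new Error(`Channel already registered: ${plugin.id}`);
  }
  registry.set(plugin.id, plugin);
}

// Returns all registered plugins in registration order.
function listChannels(): MinimalPlugin[] {
  return Array.from(registry.values());
}

// Returns the plugin for an ID, or undefined if none is registered.
function getChannel(id: string): MinimalPlugin | undefined {
  return registry.get(id);
}

registerChannel({
  id: "telegram",
  meta: { name: "Telegram", description: "Telegram bot channel" },
});
```

With this shape, the Manager's `startAll()` could iterate `listChannels()` and start each configured account.
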
## Plugin Interface

Each channel plugin implements `ChannelPlugin` (defined in `types.ts`):

```typescript
interface ChannelPlugin {
  readonly id: string;                         // "telegram", "discord", etc.
  readonly meta: { name: string; description: string };
  readonly chunkerConfig?: BlockChunkerConfig; // override text chunking per platform
  readonly config: ChannelConfigAdapter;       // credential resolution
  readonly gateway: ChannelGatewayAdapter;     // receive messages
  readonly outbound: ChannelOutboundAdapter;   // send replies
  downloadMedia?(fileId: string, accountId: string): Promise<string>; // optional
}
```

### Three Adapters

| Adapter | Role | Key Methods |
|---------|------|-------------|
| **config** | Resolve credentials from `credentials.json5` | `listAccountIds()`, `resolveAccount()`, `isConfigured()` |
| **gateway** | Receive inbound messages from the platform | `start(accountId, config, onMessage, signal)` |
| **outbound** | Send replies back to the platform | `sendText()`, `replyText()`, `sendTyping?()` |

### downloadMedia (optional)

Platforms that support media (voice, image, video, document) implement `downloadMedia()` to download files to `~/.super-multica/cache/media/` with UUID filenames. The Manager calls this before processing media.

## Message Flow

### Inbound (Platform → Agent)

```
User sends message in Telegram
  → grammy long-polling → onMessage callback
    → ChannelManager.routeIncoming()
      1. Update lastRoute (reply target)
      2. Start typing indicator
      3. If media: routeMedia() → download → transcribe/describe → text
      4. agent.write(text)
```

All media is converted to text before the agent sees it. See [media-handling.md](./media-handling.md) for details.

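The four inbound steps can be sketched as a single function with its collaborators injected. This is an illustration under assumptions, not the code in `manager.ts`: the `InboundMessage` and `InboundDeps` shapes are hypothetical, and appending a caption after the converted media text is a guess at reasonable behaviour.

```typescript
// Hypothetical shapes; the real ChannelMessage type is defined in types.ts.
interface InboundMessage {
  pluginId: string;
  conversationId: string;
  text?: string;
  media?: { kind: "audio" | "image" | "video" | "document"; fileId: string };
}

// Hypothetical collaborators, injected so the flow can be exercised in isolation.
interface InboundDeps {
  setLastRoute(pluginId: string, conversationId: string): void;
  startTyping(conversationId: string): void;
  routeMedia(media: NonNullable<InboundMessage["media"]>): Promise<string>;
  agentWrite(text: string): void;
}

async function routeIncoming(msg: InboundMessage, deps: InboundDeps): Promise<void> {
  // 1. Remember where the reply should go.
  deps.setLastRoute(msg.pluginId, msg.conversationId);
  // 2. Show activity on the platform while the agent works.
  deps.startTyping(msg.conversationId);
  // 3. Convert media to text first; keep any accompanying text (assumption).
  const parts: string[] = [];
  if (msg.media) parts.push(await deps.routeMedia(msg.media));
  if (msg.text) parts.push(msg.text);
  // 4. The agent only ever receives text.
  deps.agentWrite(parts.join("\n"));
}
```
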
### Outbound (Agent → Platform)

```
Agent produces reply
  → agent.subscribe() in ChannelManager
    → Check: if (!lastRoute) return   // not from a channel, skip
    → message_start  → create MessageAggregator
    → message_update → feed text to aggregator
    → message_end    → aggregator flushes final block
      → Aggregator emits BlockReply chunks
        → Block 0: plugin.outbound.replyText()  // Telegram reply format
        → Block N: plugin.outbound.sendText()   // follow-up messages
```

The **MessageAggregator** buffers streaming LLM output and splits it into blocks at natural text boundaries (paragraphs, code blocks). This is necessary because messaging platforms cannot consume raw streaming deltas.

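To make the buffering idea concrete, here is a minimal sketch of an aggregator that emits a block at each blank line, except when the blank line falls inside an open Markdown code fence. It is not the `MessageAggregator` from `src/hub/message-aggregator.ts`: the class name `ParagraphAggregator` and its boundary rule are assumptions, and the real chunker is configurable via `BlockChunkerConfig`.

```typescript
const FENCE = "`".repeat(3); // Markdown code-fence marker

class ParagraphAggregator {
  private buffer = "";
  private nextIndex = 0;
  private readonly onBlock: (block: string, index: number) => void;

  constructor(onBlock: (block: string, index: number) => void) {
    this.onBlock = onBlock;
  }

  // Feed one streaming delta and emit every complete block found so far.
  push(delta: string): void {
    this.buffer += delta;
    let cut = this.findBoundary();
    while (cut !== -1) {
      this.emit(this.buffer.slice(0, cut));
      this.buffer = this.buffer.slice(cut + 2);
      cut = this.findBoundary();
    }
  }

  // Emit whatever is left; corresponds to message_end.
  flush(): void {
    this.emit(this.buffer);
    this.buffer = "";
  }

  // Index of the first blank line that is not inside an open fence, or -1.
  private findBoundary(): number {
    let from = 0;
    for (;;) {
      const at = this.buffer.indexOf("\n\n", from);
      if (at === -1) return -1;
      const fencesBefore = this.buffer.slice(0, at).split(FENCE).length - 1;
      if (fencesBefore % 2 === 0) return at;
      from = at + 1;
    }
  }

  private emit(text: string): void {
    const block = text.trim();
    if (block.length > 0) this.onBlock(block, this.nextIndex++);
  }
}

const sent: string[] = [];
const aggregator = new ParagraphAggregator((block, index) => {
  // Block 0 would go to replyText(), later blocks to sendText().
  sent.push(`${index === 0 ? "reply" : "send"}: ${block}`);
});
aggregator.push("Hello ");
aggregator.push("world.\n\nSecond");
aggregator.push(" paragraph.");
aggregator.flush();
```

After the calls above, `sent` holds one `reply:` entry for the first paragraph and one `send:` entry for the second.
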
## lastRoute Pattern

`lastRoute` records which channel most recently sent a message:

- **Channel message arrives** → `lastRoute` is set to that plugin + conversation
- **Desktop/Web message arrives** → `clearLastRoute()` is called
- **Agent replies** → if `lastRoute` is set, the reply goes to that channel; otherwise channel delivery is skipped

This ensures replies go back to the originating channel. Desktop and Web always receive agent events independently via their own mechanisms (IPC / Gateway).

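The three rules above amount to a small piece of state. A minimal sketch follows, assuming a hypothetical `Route` shape and a hypothetical `RouteTracker` class; only the `clearLastRoute()` name comes from this document.

```typescript
// Hypothetical reply target; the real DeliveryContext type is in types.ts.
interface Route {
  pluginId: string;
  conversationId: string;
}

class RouteTracker {
  private lastRoute: Route | null = null;

  // A channel message arrived: remember where to reply.
  onChannelMessage(route: Route): void {
    this.lastRoute = route;
  }

  // A Desktop/Web message arrived: channel delivery no longer applies.
  clearLastRoute(): void {
    this.lastRoute = null;
  }

  // Deliver an agent reply to the last channel, if any. Returns true if sent.
  deliver(text: string, send: (route: Route, text: string) => void): boolean {
    if (!this.lastRoute) return false; // not from a channel, skip
    send(this.lastRoute, text);
    return true;
  }
}
```
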
## Configuration

Channel credentials are stored in `~/.super-multica/credentials.json5` under the `channels` key:

```json5
{
  channels: {
    telegram: {
      default: {
        botToken: "123456:ABC-DEF..."
      }
    },
    // discord: { default: { botToken: "..." } },
  }
}
```

Each channel ID maps to accounts (keyed by account ID, typically `"default"`). The config adapter for each plugin knows how to extract and validate its credentials.

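As an illustration of what a config adapter does with this shape, the sketch below works on the already-parsed `channels` object. The method names match the adapter table, but their signatures here are assumptions, as are the `ChannelsConfig` and `TelegramAccount` types and the rule that a non-empty string `botToken` counts as valid.

```typescript
// Hypothetical types for the parsed `channels` section.
interface TelegramAccount {
  botToken: string;
}
type ChannelsConfig = {
  [channelId: string]: { [accountId: string]: unknown } | undefined;
};

// All account IDs configured for a channel (empty if the channel is absent).
function listAccountIds(channels: ChannelsConfig, channelId: string): string[] {
  return Object.keys(channels[channelId] ?? {});
}

// Extract and validate one account; undefined if missing or malformed.
function resolveAccount(
  channels: ChannelsConfig,
  channelId: string,
  accountId: string,
): TelegramAccount | undefined {
  const raw = channels[channelId]?.[accountId];
  if (typeof raw !== "object" || raw === null) return undefined;
  const token = (raw as { botToken?: unknown }).botToken;
  return typeof token === "string" && token.length > 0 ? { botToken: token } : undefined;
}

// A channel counts as configured if at least one account resolves.
function isConfigured(channels: ChannelsConfig, channelId: string): boolean {
  return listAccountIds(channels, channelId).some(
    (accountId) => resolveAccount(channels, channelId, accountId) !== undefined,
  );
}

const channels: ChannelsConfig = {
  telegram: { default: { botToken: "123456:ABC-DEF..." } },
};
```
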
## Adding a New Plugin

1. Create `src/channels/plugins/<name>.ts` implementing `ChannelPlugin`
2. Register it in `src/channels/index.ts`:
   ```typescript
   import { <name>Channel } from "./plugins/<name>.js";
   registerChannel(<name>Channel);
   ```
3. Add the config shape to the `channels` section of `credentials.json5`

### Implementation Checklist

- [ ] `config` adapter: parse credentials from `credentials.json5`
- [ ] `gateway` adapter: connect to platform, normalize messages to `ChannelMessage`
- [ ] `outbound` adapter: `sendText`, `replyText`, optional `sendTyping`
- [ ] `downloadMedia` (if platform supports media): download to `MEDIA_CACHE_DIR`
- [ ] Group filtering: only respond to messages directed at the bot
- [ ] Graceful shutdown: respect the `AbortSignal` passed to `gateway.start()`

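For the graceful-shutdown item, one possible pattern is to stop the receive loop from an `abort` listener. The sketch below is hypothetical: `startPolling` and its parameters are illustrative names, it uses a plain timer rather than any platform SDK, and the real `gateway.start()` also receives `accountId` and `config`.

```typescript
// Hypothetical gateway core: polls `fetchUpdates` on a timer until the signal aborts.
function startPolling(
  fetchUpdates: () => string[],
  onMessage: (text: string) => void,
  signal: AbortSignal,
  intervalMs = 1000,
): { isRunning(): boolean } {
  // Never start if shutdown was requested before start() was called.
  if (signal.aborted) return { isRunning: () => false };

  let running = true;
  const timer = setInterval(() => {
    for (const update of fetchUpdates()) onMessage(update);
  }, intervalMs);

  signal.addEventListener(
    "abort",
    () => {
      clearInterval(timer); // stop polling so the process can exit cleanly
      running = false;
    },
    { once: true },
  );

  return { isRunning: () => running };
}

const controller = new AbortController();
const gateway = startPolling(() => [], () => {}, controller.signal);
controller.abort(); // e.g. triggered when the Hub shuts down
```

The early return matters because an already-aborted signal never fires `abort` again, so a listener attached afterwards would not run.
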
## File Map

| File | Role |
|------|------|
| `src/channels/types.ts` | All type definitions (`ChannelPlugin`, `ChannelMessage`, `DeliveryContext`, etc.) |
| `src/channels/manager.ts` | `ChannelManager`: bridges plugins to the Hub's agent |
| `src/channels/registry.ts` | Plugin registry (`registerChannel`, `listChannels`, `getChannel`) |
| `src/channels/config.ts` | Load channel config from `credentials.json5` |
| `src/channels/index.ts` | Bootstrap: register built-in plugins, re-export public API |
| `src/channels/plugins/telegram.ts` | Telegram plugin (grammy, long polling) |
| `src/channels/plugins/telegram-format.ts` | Markdown → Telegram HTML converter |
| `src/media/transcribe.ts` | Audio transcription (local whisper → OpenAI API) |
| `src/media/describe-image.ts` | Image description (OpenAI Vision API) |
| `src/media/describe-video.ts` | Video description (ffmpeg frame + Vision API) |
| `src/shared/paths.ts` | `MEDIA_CACHE_DIR` path constant |
| `src/hub/message-aggregator.ts` | Streaming text → block chunking for channel delivery |

## Current Plugins

| Plugin | Platform | Transport | Library |
|--------|----------|-----------|---------|
| `telegram` | Telegram | Long polling | grammy |

Planned: Discord, Feishu, LINE, etc.

docs/channels/media-handling.md (Normal file, +161)
@@ -0,0 +1,161 @@
# Channel Media Handling

How multimedia messages (voice, image, video, document) from messaging platforms are processed before reaching the Agent.

## Core Principle

All media is converted to text before the Agent sees it. The Agent only ever receives plain text via `agent.write()`.

```
Platform message (voice/image/video/doc)
  → Plugin: detect type + download file
    → Manager: convert to text (API transcription / vision description)
      → Agent receives text via agent.write()
```

## Reference Architecture (OpenClaw)

OpenClaw supports 6 platforms (Telegram, Discord, LINE, Signal, iMessage, Slack). All share the same media processing pipeline.

### Per-Platform Layer (different for each platform)

Each platform detects media type using its own API:

| Platform | Detection Method |
|----------|-----------------|
| Telegram | `msg.voice`, `msg.audio`, `msg.photo`, `msg.video`, `msg.document` |
| Discord | `attachment.content_type` MIME prefix (`audio/`, `image/`, `video/`) |
| LINE | `message.type` field (`"audio"`, `"image"`, `"video"`, `"file"`) |
| Signal | `attachment.contentType` MIME prefix |
| iMessage | `attachment.mime_type` MIME prefix |
| Slack | Any file attachment (MIME-based detection happens later) |

Each platform downloads the file using its own API, saves to local disk, and tags it:

- `<media:audio>` for voice/audio
- `<media:image>` for images
- `<media:video>` for video
- `<media:document>` for files

### Shared Layer (`applyMediaUnderstanding()`)

One function handles all conversions, called automatically before the Agent sees the message:

1. Reads local file path + MIME type
2. Selects conversion method based on type:
   - **audio** → transcription (whisper local / OpenAI API / Groq / Deepgram / Google)
   - **image** → vision model description (Gemini / OpenAI / Anthropic)
   - **video** → vision model description
3. Replaces placeholder with formatted text:
   - Audio: `[Audio]\nTranscript:\n<transcribed text>`
   - Image: `[Image]\nDescription:\n<description text>`
4. If conversion fails (no provider configured), the raw placeholder stays in the message

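Steps 3 and 4 can be illustrated with a small placeholder-substitution function. This is a sketch of the idea only, not OpenClaw's `applyMediaUnderstanding()`: the function name `applyConversion` and the `LABELS` table are hypothetical, and only the two formats listed in step 3 are covered.

```typescript
type ConvertibleKind = "audio" | "image";

// Tag and heading per kind, taken from the formats listed in step 3.
const LABELS: Record<ConvertibleKind, [string, string]> = {
  audio: ["[Audio]", "Transcript:"],
  image: ["[Image]", "Description:"],
};

// Replace the first <media:kind> placeholder with formatted text.
// A null conversion result leaves the body untouched (step 4).
function applyConversion(body: string, kind: ConvertibleKind, converted: string | null): string {
  if (converted === null) return body;
  const [tag, heading] = LABELS[kind];
  const replacement = `${tag}\n${heading}\n${converted}`;
  // A replacer function avoids "$" in the converted text being read as a pattern.
  return body.replace(`<media:${kind}>`, () => replacement);
}
```

For example, `applyConversion("<media:audio>", "audio", null)` returns the placeholder unchanged, which is what lets the agent see that an unconverted attachment arrived.
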
### Transcription Provider Priority

Auto-detection order:

1. sherpa-onnx-offline (local)
2. whisper-cli / whisper.cpp (local)
3. whisper Python CLI (local)
4. gemini CLI (local)
5. API providers: OpenAI → Groq → Deepgram → Google

### Skill Integration

Whisper skills declare requirements in `SKILL.md` metadata:

```yaml
requires:
  bins: ["whisper"]  # must exist in PATH
```

If the binary is missing, the skill is filtered out and the Agent never sees it. If present, the Agent can use it for transcription.

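The filtering rule can be sketched as a predicate over the declared `bins`. The `SkillMeta` shape and the injected `hasBinary` lookup below are assumptions made for illustration; how the actual loader parses `SKILL.md` or probes `PATH` is not shown in this document.

```typescript
// Hypothetical parsed form of the SKILL.md metadata shown above.
interface SkillMeta {
  name: string;
  requires?: { bins?: string[] };
}

// Keep only skills whose required binaries are all available.
// `hasBinary` is injected so the PATH lookup itself stays out of this sketch.
function availableSkills(
  skills: SkillMeta[],
  hasBinary: (bin: string) => boolean,
): SkillMeta[] {
  return skills.filter((skill) =>
    (skill.requires?.bins ?? []).every((bin) => hasBinary(bin)),
  );
}
```

A skill with no `requires` block passes the filter, since `every` over an empty list is true.
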
---

## Our Implementation

All media is converted to text in the Manager layer (`routeMedia()`) before reaching the Agent, matching OpenClaw's `applyMediaUnderstanding()` pattern.

### Architecture

```
┌─────────────────────────────────────────────────────────────┐
│             Platform Plugin (e.g. telegram.ts)              │
│                                                             │
│  bot.on("message:voice") → detect type                      │
│  bot.api.getFile()       → download to local disk           │
│  Emit ChannelMessage with media attachment                  │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│         Channel Manager (manager.ts → routeMedia())         │
│                                                             │
│  Download file via plugin.downloadMedia()                   │
│  audio    → transcribeAudio() → text                        │
│  image    → describeImage() → text                          │
│  video    → describeVideo() (ffmpeg frame + vision) → text  │
│  document → file path info                                  │
│  All results → agent.write(text)                            │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│  Agent receives plain text only                             │
│  e.g. "[Voice Message]\nTranscript: ..."                    │
│  e.g. "[Image]\nDescription: ..."                           │
│  e.g. "[Video]\nDescription: ..."                           │
└─────────────────────────────────────────────────────────────┘
```

### Media Processing Modules

| Type | Module | Method | API |
|------|--------|--------|-----|
| audio | `src/media/transcribe.ts` | `transcribeAudio()` | Local whisper/whisper-cli → OpenAI Whisper API (`whisper-1`) |
| image | `src/media/describe-image.ts` | `describeImage()` | OpenAI Vision API (`gpt-4o-mini`) |
| video | `src/media/describe-video.ts` | `describeVideo()` | ffmpeg frame extraction + Vision API |
| document | (inline in manager) | n/a | File path info only |

### Agent Output Format

| Type | Success | No Provider / API Key |
|------|---------|-----------------------|
| audio | `[Voice Message]\nTranscript: <text>` | `[audio message received]\nFile: <path>` |
| image | `[Image]\nDescription: <text>` | `[image message received]\nFile: <path>` |
| video | `[Video]\nDescription: <text>` | `[video message received]\nFile: <path>` |
| document | `[document message received]\nFile: <path>` | same |

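The table can be read as a single formatting function. The sketch below is one way to express it, assuming `\n` in the table denotes a newline and that a failed conversion is represented as `null`; the function name `formatForAgent` is hypothetical and not taken from `manager.ts`.

```typescript
type MediaType = "audio" | "image" | "video" | "document";

// Build the text handed to agent.write() for one media attachment.
// `text` is the transcript or description, or null when conversion was not possible.
function formatForAgent(type: MediaType, filePath: string, text: string | null): string {
  if (type !== "document" && text !== null) {
    if (type === "audio") return `[Voice Message]\nTranscript: ${text}`;
    const label = type === "image" ? "[Image]" : "[Video]";
    return `${label}\nDescription: ${text}`;
  }
  // Documents, and any type without a usable provider, fall back to the file path.
  return `[${type} message received]\nFile: ${filePath}`;
}
```
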
### Audio Transcription Priority

`transcribeAudio()` tries providers in order, matching OpenClaw's local-first approach:

1. **Local whisper/whisper-cli**: free, no network round trip, works offline. Detected via `which`, and the result is cached.
2. **OpenAI Whisper API** (`whisper-1`): requires an API key in `credentials.json5`.
3. **null**: no provider is available. The placeholder stays in the message and the agent responds to it naturally (e.g. by suggesting that the user install whisper).

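The ordering above is a first-success fallback chain. A minimal sketch follows, with hypothetical names (`Transcriber`, `transcribeWithFallback`); treating a thrown error the same as "unavailable" is an assumption and may differ from what `transcribe.ts` does.

```typescript
// A provider returns the transcript, or null if it is unavailable.
type Transcriber = (filePath: string) => Promise<string | null>;

// Try each provider in priority order; the first non-null result wins.
async function transcribeWithFallback(
  filePath: string,
  providers: Transcriber[],
): Promise<string | null> {
  for (const provider of providers) {
    try {
      const text = await provider(filePath);
      if (text !== null) return text;
    } catch {
      // Assumption: a failing provider is skipped rather than aborting the chain.
    }
  }
  return null; // no provider available; the caller keeps the placeholder
}

const providers: Transcriber[] = [
  async () => null,                        // e.g. local whisper not installed
  async (file) => `transcript of ${file}`, // e.g. an API-backed provider
];
```

Keeping the providers as an ordered array would also make it simple to slot in further fallbacks later without touching the loop.
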
### Whisper Skill (Agent Fallback)

The `skills/whisper/SKILL.md` skill is a secondary safety net. If transcription returns null (no local binary detected, no API key), the agent receives a placeholder with the file path. If a whisper binary is available to the agent, the skill tells it how to transcribe the file via the exec tool.

### File Map

| File | Role |
|------|------|
| `src/channels/types.ts` | `ChannelMediaAttachment`, `ChannelMessage.media`, `ChannelPlugin.downloadMedia` |
| `src/channels/plugins/telegram.ts` | Detect voice/audio/photo/video/document + download via the grammY API |
| `src/channels/manager.ts` | `routeMedia()`: download, convert, `agent.write(text)` |
| `src/media/transcribe.ts` | Audio → text (local whisper → OpenAI Whisper API) |
| `src/media/describe-image.ts` | Image → text via OpenAI Vision API (gpt-4o-mini) |
| `src/media/describe-video.ts` | Video → extract frame (ffmpeg) → text via Vision API |
| `src/shared/paths.ts` | `MEDIA_CACHE_DIR` (`~/.super-multica/cache/media/`) |
| `skills/whisper/SKILL.md` | Local whisper CLI fallback skill |

### Future Work

| Task | Scope |
|------|-------|
| Groq / Deepgram fallback for audio | `src/media/transcribe.ts` |
| Multi-provider vision support (Gemini, Anthropic) | `src/media/describe-image.ts` |
| Document text extraction (PDF, DOCX) | `src/media/` |
| Media cache cleanup (delete old files) | `src/shared/` |
| Outbound media (send images/audio back to channels) | `types.ts`, plugins |

@@ -3,6 +3,8 @@
 *
 * Each messaging platform (Telegram, Discord, Feishu, etc.) implements the
 * ChannelPlugin interface with three adapters: config, gateway, outbound.
 *
 * @see docs/channels/README.md - Channel system overview and plugin guide
 */

import type { BlockChunkerConfig } from "../hub/block-chunker.js";