Channel Media Handling
How multimedia messages (voice, image, video, document) from messaging platforms are processed before reaching the Agent.
Core Principle
All media is converted to text before the Agent sees it. The Agent only ever receives plain text via agent.write().
Platform message (voice/image/video/doc)
→ Plugin: detect type + download file
→ Manager: convert to text (API transcription / vision description)
→ Agent receives text via agent.write()
Reference Architecture (OpenClaw)
OpenClaw supports 6 platforms (Telegram, Discord, LINE, Signal, iMessage, Slack). All share the same media processing pipeline.
Per-Platform Layer (different for each platform)
Each platform detects media type using its own API:
| Platform | Detection Method |
|---|---|
| Telegram | msg.voice, msg.audio, msg.photo, msg.video, msg.document |
| Discord | attachment.content_type MIME prefix (audio/, image/, video/) |
| LINE | message.type field ("audio", "image", "video", "file") |
| Signal | attachment.contentType MIME prefix |
| iMessage | attachment.mime_type MIME prefix |
| Slack | Any file attachment (MIME-based detection happens later) |
Each platform downloads the file using its own API, saves to local disk, and tags it:
- <media:audio> for voice/audio
- <media:image> for images
- <media:video> for video
- <media:document> for files
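The MIME-prefix detection used by Discord, Signal, and iMessage, together with the tagging step, can be sketched roughly as follows. This is a minimal illustration, not OpenClaw's actual code: `MediaKind`, `kindFromMime`, and `placeholderFor` are hypothetical names, and treating every unrecognized MIME type as a document is an assumption.

```typescript
// Hypothetical names throughout; only the <media:...> tags come from the text above.
type MediaKind = "audio" | "image" | "video" | "document";

// Classify an attachment by MIME prefix. Anything that is not audio, image,
// or video is treated as a generic document (an assumption of this sketch).
function kindFromMime(mime: string | undefined): MediaKind {
  const m = (mime ?? "").toLowerCase();
  if (m.startsWith("audio/")) return "audio";
  if (m.startsWith("image/")) return "image";
  if (m.startsWith("video/")) return "video";
  return "document";
}

// Build the placeholder tag that the shared layer replaces later.
function placeholderFor(kind: MediaKind): string {
  return `<media:${kind}>`;
}
```

A plugin would call `placeholderFor(kindFromMime(attachment.contentType))` after saving the file, so the shared layer only needs the local path, the MIME type, and the tag.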
Shared Layer (applyMediaUnderstanding())
One function handles all conversions, called automatically before the Agent sees the message:
- Reads local file path + MIME type
- Selects conversion method based on type:
  - audio → transcription (whisper local / OpenAI API / Groq / Deepgram / Google)
  - image → vision model description (Gemini / OpenAI / Anthropic)
  - video → vision model description
- Replaces placeholder with formatted text:
  - Audio: [Audio]\nTranscript:\n<transcribed text>
  - Image: [Image]\nDescription:\n<description text>
- If conversion fails (no provider configured), the raw placeholder stays in the message
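The placeholder replacement step might look roughly like this. It is a sketch under assumptions: `Conversion` and `applyConversion` are hypothetical names, only the audio and image formats listed above are covered, and a `null` text stands in for "conversion failed".

```typescript
// Hypothetical shape for a conversion result; null text means no provider succeeded.
type Conversion = { kind: "audio" | "image"; text: string | null };

// Prefixes taken from the formats listed above.
const LABELS = {
  audio: "[Audio]\nTranscript:\n",
  image: "[Image]\nDescription:\n",
} as const;

// Replace the first matching placeholder with formatted text.
// On failure the message body is returned unchanged, so the raw tag remains.
function applyConversion(body: string, conversion: Conversion): string {
  const text = conversion.text;
  if (text === null) return body;
  const placeholder = `<media:${conversion.kind}>`;
  const formatted = LABELS[conversion.kind] + text;
  // A replacer function avoids special handling of "$" sequences in the text.
  return body.replace(placeholder, () => formatted);
}
```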
Transcription Provider Priority
Auto-detection order:
- sherpa-onnx-offline (local)
- whisper-cli / whisper.cpp (local)
- whisper Python CLI (local)
- gemini CLI (local)
- API providers: OpenAI → Groq → Deepgram → Google
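The auto-detection order above amounts to a first-match search: local binaries first, then API providers. A minimal sketch, assuming availability checks are injected as callbacks; `pickProvider`, the `Provider` type, and the lowercase API ids are hypothetical, while the binary names are the ones listed above.

```typescript
type Provider = { id: string; kind: "local" | "api" };

// Order mirrors the list above: local tools first, then hosted APIs.
const LOCAL_BINARIES = ["sherpa-onnx-offline", "whisper-cli", "whisper", "gemini"];
const API_PROVIDERS = ["openai", "groq", "deepgram", "google"]; // hypothetical ids

// Return the first available provider, or null when nothing is configured.
function pickProvider(
  hasBinary: (name: string) => boolean,
  hasApiKey: (id: string) => boolean,
): Provider | null {
  for (const bin of LOCAL_BINARIES) {
    if (hasBinary(bin)) return { id: bin, kind: "local" };
  }
  for (const id of API_PROVIDERS) {
    if (hasApiKey(id)) return { id, kind: "api" };
  }
  return null;
}
```

Injecting the two checks keeps the ordering logic independent of how binaries and credentials are actually discovered.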
Skill Integration
Whisper skills declare requirements in SKILL.md metadata:
requires:
  bins: ["whisper"]  # must exist in PATH
If the binary is missing, the skill is filtered out — the Agent never sees it. If present, the Agent can use it for transcription.
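The filtering rule can be illustrated with a small sketch. The `Skill` shape and `filterSkills` are hypothetical; only the `requires.bins` field comes from the metadata shown above, and the PATH lookup is injected rather than implemented.

```typescript
// Hypothetical in-memory form of parsed SKILL.md metadata.
type Skill = { name: string; requires?: { bins?: string[] } };

// Keep a skill only when every required binary is available.
// Skills without requirements always pass.
function filterSkills(skills: Skill[], isOnPath: (bin: string) => boolean): Skill[] {
  return skills.filter((skill) => (skill.requires?.bins ?? []).every(isOnPath));
}
```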
Our Implementation
All media is converted to text in the Manager layer (routeMedia()) before reaching the Agent, matching OpenClaw's applyMediaUnderstanding() pattern.
Architecture
┌─────────────────────────────────────────────────────┐
│ Platform Plugin (e.g. telegram.ts) │
│ │
│ bot.on("message:voice") → detect type │
│ bot.api.getFile() → download to local disk │
│ Emit ChannelMessage with media attachment │
└──────────────────┬──────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ Channel Manager (manager.ts → routeMedia()) │
│ │
│ Download file via plugin.downloadMedia() │
│ audio → transcribeAudio() → text │
│ image → describeImage() → text │
│ video → describeVideo() (ffmpeg frame + vision) → text │
│ document → file path info │
│ All results → agent.write(text) │
└──────────────────┬──────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ Agent receives plain text only │
│ e.g. "[Voice Message]\nTranscript: ..." │
│ e.g. "[Image]\nDescription: ..." │
│ e.g. "[Video]\nDescription: ..." │
└─────────────────────────────────────────────────────┘
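The Manager step in the diagram is essentially a dispatch on media type followed by a single `agent.write()` call. The sketch below shows that flow only. It is not the real `routeMedia()`: `Attachment`, `RouteDeps`, and `routeMediaSketch` are hypothetical, and the converters, formatter, and writer are injected so the example stays self-contained.

```typescript
type MediaKind = "audio" | "image" | "video" | "document";

// Hypothetical attachment shape: the media type plus the downloaded file path.
interface Attachment { kind: MediaKind; path: string }

// A converter resolves to text, or null when no provider is available.
type Converter = (path: string) => Promise<string | null>;

interface RouteDeps {
  // Documents have no converter in this sketch; they fall through to null.
  converters: Partial<Record<MediaKind, Converter>>;
  format: (att: Attachment, text: string | null) => string;
  write: (text: string) => void; // stands in for agent.write()
}

// Convert the attachment to text (if possible) and hand the result to the agent.
async function routeMediaSketch(att: Attachment, deps: RouteDeps): Promise<void> {
  const convert = deps.converters[att.kind];
  const text = convert ? await convert(att.path) : null;
  deps.write(deps.format(att, text));
}
```

Whatever the media type, the agent side sees exactly one string per attachment.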
Media Processing Modules
| Type | Module | Method | API |
|---|---|---|---|
| audio | src/media/transcribe.ts | transcribeAudio() | Local whisper/whisper-cli → OpenAI Whisper API (whisper-1) |
| image | src/media/describe-image.ts | describeImage() | OpenAI Vision API (gpt-4o-mini) |
| video | src/media/describe-video.ts | describeVideo() | ffmpeg frame extraction + Vision API |
| document | (inline in manager) | — | File path info only |
Agent Output Format
| Type | Success | No API Key |
|---|---|---|
| audio | [Voice Message]\nTranscript: <text> | [audio message received]\nFile: <path> |
| image | [Image]\nDescription: <text> | [image message received]\nFile: <path> |
| video | [Video]\nDescription: <text> | [video message received]\nFile: <path> |
| document | [document message received]\nFile: <path> | same |
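The table can be expressed as one small formatting function. This is a sketch: `formatForAgent` and `SUCCESS_PREFIX` are hypothetical names, while the strings are copied from the table.

```typescript
type MediaKind = "audio" | "image" | "video" | "document";

// Prefixes for a successful conversion, copied from the table above.
const SUCCESS_PREFIX: Record<Exclude<MediaKind, "document">, string> = {
  audio: "[Voice Message]\nTranscript: ",
  image: "[Image]\nDescription: ",
  video: "[Video]\nDescription: ",
};

// Build the text the agent receives. Documents, and any media whose
// conversion returned null, fall back to the file-path form.
function formatForAgent(kind: MediaKind, path: string, text: string | null): string {
  if (kind !== "document" && text !== null) {
    return SUCCESS_PREFIX[kind] + text;
  }
  return `[${kind} message received]\nFile: ${path}`;
}
```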
Audio Transcription Priority
transcribeAudio() tries providers in order, matching OpenClaw's local-first approach:
- Local whisper/whisper-cli: free, no network round-trip, works offline. Detected via which and cached.
- OpenAI Whisper API (whisper-1): requires an API key in credentials.json5.
- null: no provider available. The placeholder stays in the message, and the agent responds naturally (e.g. suggests installing whisper).
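The cached detection and the ordered fallback can be sketched as two small helpers. This is not the actual `transcribeAudio()`: `cachedDetector`, `Transcriber`, and `transcribeWithFallback` are hypothetical names, the `which` lookup is injected as a callback, and the convention that a provider returns `null` when it is unavailable is an assumption.

```typescript
// Wrap a binary lookup (e.g. a `which` call) so each name is probed only once.
function cachedDetector(probe: (bin: string) => boolean): (bin: string) => boolean {
  const cache = new Map<string, boolean>();
  return (bin) => {
    const hit = cache.get(bin);
    if (hit !== undefined) return hit;
    const found = probe(bin);
    cache.set(bin, found);
    return found;
  };
}

// A provider resolves to a transcript, or null when it is unavailable.
type Transcriber = (path: string) => Promise<string | null>;

// Try providers in order and return the first transcript.
// Returns null when every provider declines.
async function transcribeWithFallback(
  path: string,
  providers: Transcriber[],
): Promise<string | null> {
  for (const provider of providers) {
    const text = await provider(path);
    if (text !== null) return text;
  }
  return null;
}
```

In this arrangement the local provider would sit first in the array and return `null` when the cached detector reports no binary, letting the API provider run next.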
Whisper Skill (Agent Fallback)
The skills/whisper/SKILL.md skill is a secondary safety net. If transcribeAudio() returns null (no local binary, no API key), the agent receives a placeholder with the file path. If whisper is installed, the skill tells the agent how to transcribe that file via the exec tool.
File Map
| File | Role |
|---|---|
| src/channels/types.ts | ChannelMediaAttachment, ChannelMessage.media, ChannelPlugin.downloadMedia |
| src/channels/plugins/telegram.ts | Detect voice/audio/photo/video/document + download via grammY API |
| src/channels/manager.ts | routeMedia(): download, convert, agent.write(text) |
| src/media/transcribe.ts | Audio → text (local whisper → OpenAI Whisper API) |
| src/media/describe-image.ts | Image → text via OpenAI Vision API (gpt-4o-mini) |
| src/media/describe-video.ts | Video → extract frame (ffmpeg) → text via Vision API |
| src/shared/paths.ts | MEDIA_CACHE_DIR (~/.super-multica/cache/media/) |
| skills/whisper/SKILL.md | Local whisper CLI fallback skill |
Future Work
| Task | Scope |
|---|---|
| Groq / Deepgram fallback for audio | src/media/transcribe.ts |
| Multi-provider vision support (Gemini, Anthropic) | src/media/describe-image.ts |
| Document text extraction (PDF, DOCX) | src/media/ |
| Media cache cleanup (delete old files) | src/shared/ |
| Outbound media (send images/audio back to channels) | types.ts, plugins |