Media Understanding (Inbound) — 2026-01-17

Mayros can summarize inbound media (image/audio/video) before the reply pipeline runs. It auto-detects when local tools or provider keys are available, and it can be disabled or customized. If understanding is off, models still receive the original files/URLs as usual.

Goals

Optional: pre-digest inbound media into short text for faster routing and better command parsing.
Always preserve delivery of the original media to the model.
Support provider APIs and CLI fallbacks.
Allow multiple models with ordered fallback (on error, size limit, or timeout).

High-level behavior

Collect inbound attachments (MediaPaths, MediaUrls, MediaTypes).
For each enabled capability (image/audio/video), select attachments according to policy (default: first).
Choose the first eligible model entry (size + capability + auth).
If a model fails or the media is too large, fall back to the next entry.
On success:
- Body becomes an [Image], [Audio], or [Video] block.
- Audio sets {{Transcript}}; command parsing uses the caption text when present, otherwise the transcript.
- Captions are preserved as User text: inside the block.

If understanding fails or is disabled, the reply flow continues with the original body + attachments.
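
The selection and fallback steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Mayros implementation: `ModelEntry`, `understand`, and `fake_run` are hypothetical names, and a real entry carries more fields (prompt, timeout, profile).

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class ModelEntry:
    """Hypothetical stand-in for one models[] entry."""
    name: str
    capabilities: Sequence[str]
    max_bytes: int
    has_auth: bool = True


def understand(
    entries: Sequence[ModelEntry],
    capability: str,
    media_bytes: int,
    run: Callable[[ModelEntry], str],
) -> Optional[str]:
    """Try entries in order; return the first successful summary, else None."""
    for entry in entries:
        eligible = (
            capability in entry.capabilities
            and media_bytes <= entry.max_bytes
            and entry.has_auth
        )
        if not eligible:
            continue  # wrong capability, media too large, or no credentials
        try:
            return run(entry)
        except Exception:
            continue  # failure or timeout: fall back to the next entry
    return None  # caller continues with the original body + attachments


def fake_run(entry: ModelEntry) -> str:
    """Simulated model call used only for this example."""
    if entry.name == "flaky":
        raise TimeoutError("simulated timeout")
    return f"summary from {entry.name}"


entries = [
    ModelEntry("small", ["image"], max_bytes=1_000),
    ModelEntry("flaky", ["image"], max_bytes=10_000_000),
    ModelEntry("backup", ["image", "video"], max_bytes=10_000_000),
]
result = understand(entries, "image", 5_000, fake_run)
```

Here `small` is skipped because the media exceeds its size limit, `flaky` times out, and `backup` produces the summary. A `None` result corresponds to the case where the reply flow continues with the original body and attachments.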

Config overview

tools.media supports shared models and per-capability overrides:

tools.media.models: shared model list (gate with capabilities).
tools.media.image / tools.media.audio / tools.media.video:
- defaults (prompt, maxChars, maxBytes, timeoutSeconds, language)
- provider overrides (baseUrl, headers, providerOptions)
- Deepgram audio options via tools.media.audio.providerOptions.deepgram
- optional per-capability models list (preferred before shared models)
- attachments policy (mode, maxAttachments, prefer)
- scope (optional gating by channel/chatType/session key)
tools.media.concurrency: max concurrent capability runs (default 2).

json5
{
  tools: {
    media: {
      models: [
        /* shared list */
      ],
      image: {
        /* optional overrides */
      },
      audio: {
        /* optional overrides */
      },
      video: {
        /* optional overrides */
      },
    },
  },
}
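
One way to picture the tools.media.concurrency limit is a semaphore around the per-capability runs. The sketch below is an assumption about the general technique, not a description of how Mayros schedules work; `run_limited` and `fake_capability` are hypothetical names.

```python
import asyncio

state = {"active": 0, "peak": 0}  # bookkeeping for the demo only


async def run_limited(jobs, concurrency=2):
    """Run zero-argument coroutine factories with at most `concurrency` in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(job):
        async with semaphore:
            return await job()

    # gather preserves the order of the input jobs in its result list.
    return await asyncio.gather(*(guarded(job) for job in jobs))


async def fake_capability(name):
    """Simulated capability run that records how many runs overlap."""
    state["active"] += 1
    state["peak"] = max(state["peak"], state["active"])
    await asyncio.sleep(0.01)
    state["active"] -= 1
    return f"{name} done"


results = asyncio.run(
    run_limited(
        [
            lambda: fake_capability("image"),
            lambda: fake_capability("audio"),
            lambda: fake_capability("video"),
        ],
        concurrency=2,
    )
)
```

With three capabilities and a limit of 2, the third run waits until one of the first two finishes.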

Model entries

Each models[] entry can be a provider entry or a CLI entry:

json5
{
  type: "provider", // default if omitted
  provider: "openai",
  model: "gpt-5.2",
  prompt: "Describe the image in <= 500 chars.",
  maxChars: 500,
  maxBytes: 10485760,
  timeoutSeconds: 60,
  capabilities: ["image"], // optional, used for multi-modal entries
  profile: "vision-profile",
  preferredProfile: "vision-fallback",
}

json5
{
  type: "cli",
  command: "gemini",
  args: [
    "-m",
    "gemini-3-flash",
    "--allowed-tools",
    "read_file",
    "Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters.",
  ],
  maxChars: 500,
  maxBytes: 52428800,
  timeoutSeconds: 120,
  capabilities: ["video", "image"],
}

CLI templates can also use:

{{MediaDir}} (directory containing the media file)
{{OutputDir}} (scratch dir created for this run)
{{OutputBase}} (scratch file base path, without extension)
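
To illustrate how placeholders in CLI args might be filled in, here is a small sketch. The function name `expand_template`, the sample paths, and the choice to leave unknown placeholders untouched are all assumptions made for the example; Mayros may expand templates differently.

```python
import re
from typing import Dict, List


def expand_template(arg: str, values: Dict[str, str]) -> str:
    """Replace {{Name}} placeholders; names missing from `values` are left as-is."""

    def substitute(match: "re.Match") -> str:
        key = match.group(1)
        return values.get(key, match.group(0))

    return re.sub(r"\{\{(\w+)\}\}", substitute, arg)


# Sample values for illustration only.
values = {
    "MediaPath": "/tmp/inbound/clip.mp4",
    "MediaDir": "/tmp/inbound",
    "MaxChars": "500",
}
args: List[str] = [
    "-m",
    "gemini-3-flash",
    "Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters.",
]
expanded = [expand_template(arg, values) for arg in args]
```

Expanding each argument separately, rather than building one shell string, keeps paths with spaces intact when the args are passed to the process as a list.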

Defaults and limits

Recommended defaults:

maxChars: 500 for image/video (short, command-friendly)
maxChars: unset for audio (full transcript unless a limit is set)
maxBytes:
- image: 10MB
- audio: 20MB
- video: 50MB

Rules:

If media exceeds maxBytes, that model is skipped and the next model is tried.
If a model returns more than maxChars, the output is trimmed.
prompt defaults to a simple "Describe the ." plus maxChars guidance (image/video only).
If <capability>.enabled: true but no models are configured, Mayros tries the active reply model when its provider supports the capability.
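
The size and length rules can be expressed in a few lines. This is a sketch under assumptions: the helper names are hypothetical, MB is read as 1024 * 1024 bytes (consistent with the 10485760 and 52428800 values used in the config examples), and trimming is shown as a plain cut, which may not match how Mayros trims.

```python
from typing import Optional

MIB = 1024 * 1024

# Recommended defaults from the list above, expressed in bytes and characters.
DEFAULT_MAX_BYTES = {"image": 10 * MIB, "audio": 20 * MIB, "video": 50 * MIB}
DEFAULT_MAX_CHARS = {"image": 500, "video": 500, "audio": None}  # None = no limit


def within_size_limit(capability: str, size_bytes: int, max_bytes: Optional[int] = None) -> bool:
    """True if the media is small enough for this entry; otherwise the entry is skipped."""
    limit = max_bytes if max_bytes is not None else DEFAULT_MAX_BYTES[capability]
    return size_bytes <= limit


def trim_output(text: str, max_chars: Optional[int]) -> str:
    """Cut model output down to max_chars; leave it untouched when no limit is set."""
    if max_chars is None or len(text) <= max_chars:
        return text
    return text[:max_chars]
```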

Auto-detect media understanding (default)

If tools.media.<capability>.enabled is not set to false and you have not configured models, Mayros auto-detects in this order and stops at the first working option:

Local CLIs (audio only; if installed)
- sherpa-onnx-offline (requires SHERPA_ONNX_MODEL_DIR with encoder/decoder/joiner/tokens)
- whisper-cli (whisper-cpp; uses WHISPER_CPP_MODEL or the bundled tiny model)
- whisper (Python CLI; downloads models automatically)
Gemini CLI (gemini) using read_many_files
Provider keys
- Audio: OpenAI → Groq → Deepgram → Google
- Image: OpenAI → Anthropic → Google → MiniMax
- Video: Google

To disable auto-detection, set:

json5
{
  tools: {
    media: {
      audio: {
        enabled: false,
      },
    },
  },
}

Note: Binary detection is best-effort on macOS/Linux/Windows; make sure the CLI is on PATH (we expand ~), or set an explicit CLI model with a full command path.
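
The audio detection order can be sketched with a PATH lookup followed by a key check. Everything here is illustrative: the environment variable names for provider keys are assumptions (Mayros may read credentials from elsewhere), and `detect_audio_backend` with its "cli:" / "provider:" return strings is a hypothetical helper. The `which` and `env` parameters exist only so the function can be exercised without touching the real system.

```python
import os
import shutil
from typing import Callable, Mapping, Optional

AUDIO_CLIS = ["sherpa-onnx-offline", "whisper-cli", "whisper"]

# Assumed environment variable names; not confirmed by the Mayros docs.
AUDIO_PROVIDER_KEYS = [
    ("openai", "OPENAI_API_KEY"),
    ("groq", "GROQ_API_KEY"),
    ("deepgram", "DEEPGRAM_API_KEY"),
    ("google", "GEMINI_API_KEY"),
]


def detect_audio_backend(
    which: Callable[[str], Optional[str]] = shutil.which,
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Return the first usable backend in the documented order, or None."""
    for binary in AUDIO_CLIS:
        if not which(binary):
            continue
        # sherpa-onnx-offline also needs its model directory to be configured.
        if binary == "sherpa-onnx-offline" and not env.get("SHERPA_ONNX_MODEL_DIR"):
            continue
        return f"cli:{binary}"
    if which("gemini"):
        return "cli:gemini"
    for provider, key_name in AUDIO_PROVIDER_KEYS:
        if env.get(key_name):
            return f"provider:{provider}"
    return None


# Example with injected fakes: no binaries on PATH, only a Groq key present.
backend = detect_audio_backend(which=lambda name: None, env={"GROQ_API_KEY": "example"})
```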

Capabilities (optional)

If you set capabilities, the entry only runs for those media types. For shared lists, Mayros can infer defaults:

openai, anthropic, minimax: image
google (Gemini API): image + audio + video
groq: audio
deepgram: audio

For CLI entries, set capabilities explicitly to avoid surprising matches. If you omit capabilities, the entry is eligible for the list it appears in.
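
A rough sketch of how capabilities could be resolved for an entry follows. The function `entry_capabilities` is hypothetical, and one behavior is an assumption rather than something this page states: an entry in the shared list with no explicit capabilities and no known provider default is treated as eligible for all three media types, which is the kind of surprising match the advice above warns about.

```python
from typing import Optional, Set

ALL_CAPABILITIES = {"image", "audio", "video"}

# Provider defaults as listed above.
DEFAULT_CAPABILITIES = {
    "openai": {"image"},
    "anthropic": {"image"},
    "minimax": {"image"},
    "google": {"image", "audio", "video"},
    "groq": {"audio"},
    "deepgram": {"audio"},
}


def entry_capabilities(entry: dict, list_capability: Optional[str] = None) -> Set[str]:
    """Resolve which media types an entry may handle.

    list_capability is the per-capability list the entry appears in
    ("image", "audio", "video"), or None for the shared tools.media.models list.
    """
    explicit = entry.get("capabilities")
    if explicit:
        return set(explicit)
    if list_capability is not None:
        # Omitted capabilities: eligible for the list the entry appears in.
        return {list_capability}
    inferred = DEFAULT_CAPABILITIES.get(entry.get("provider"))
    if inferred is not None:
        return set(inferred)
    # Assumption: an unannotated CLI entry in the shared list matches everything.
    return set(ALL_CAPABILITIES)
```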

Provider support matrix (Mayros integrations)

Capability	Provider integration	Notes
Image	OpenAI / Anthropic / Google / others via `pi-ai`	Any image-capable model in the registry works.
Audio	OpenAI, Groq, Deepgram, Google	Provider transcription (Whisper/Deepgram/Gemini).
Video	Google (Gemini API)	Provider video understanding.

Recommended providers

Image

Prefer your active model if it supports images.
Good defaults: openai/gpt-5.2, anthropic/claude-opus-4-6, google/gemini-3-pro-preview.

Audio

openai/gpt-4o-mini-transcribe, groq/whisper-large-v3-turbo, or deepgram/nova-3.
CLI fallback: whisper-cli (whisper-cpp) or whisper.
Deepgram setup: Deepgram (audio transcription).

Video

google/gemini-3-flash-preview (fast), google/gemini-3-pro-preview (richer).
CLI fallback: gemini CLI (supports read_file on video/audio).

Attachment policy

The per-capability attachments setting controls which attachments are processed:

mode: first (default) or all
maxAttachments: cap the number processed (default 1)
prefer: first, last, path, url

When mode: "all", outputs are labeled [Image 1/2], [Audio 2/2], etc.
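
The policy fields can be read as a small selection function. The sketch below is an interpretation, not the Mayros implementation: the exact meaning of prefer (here, an ordering applied before the cut), the attachment dict shape with "path" / "url" keys, and the unnumbered label for a single output are all assumptions.

```python
from typing import Dict, List, Optional

Attachment = Dict[str, Optional[str]]  # hypothetical shape: {"path": ..., "url": ...}


def select_attachments(
    attachments: List[Attachment],
    mode: str = "first",
    max_attachments: int = 1,
    prefer: str = "first",
) -> List[Attachment]:
    """Order attachments by preference, then keep one (first) or up to the cap (all)."""
    ordered = list(attachments)
    if prefer == "last":
        ordered.reverse()
    elif prefer == "path":
        # Stable sort: attachments with a local path move to the front.
        ordered.sort(key=lambda a: a.get("path") is None)
    elif prefer == "url":
        ordered.sort(key=lambda a: a.get("url") is None)
    if mode == "first":
        return ordered[:1]
    return ordered[:max_attachments]


def label(kind: str, index: int, total: int) -> str:
    """Build a block label such as [Audio 2/2]; a single output stays unnumbered."""
    return f"[{kind}]" if total == 1 else f"[{kind} {index}/{total}]"


atts = [
    {"url": "https://example.com/a.ogg"},
    {"path": "/tmp/b.ogg"},
    {"path": "/tmp/c.ogg"},
]
chosen = select_attachments(atts, mode="all", max_attachments=2, prefer="path")
labels = [label("Audio", i + 1, len(chosen)) for i in range(len(chosen))]
```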

Config examples

1) Shared models list + overrides

json5
{
  tools: {
    media: {
      models: [
        { provider: "openai", model: "gpt-5.2", capabilities: ["image"] },
        {
          provider: "google",
          model: "gemini-3-flash-preview",
          capabilities: ["image", "audio", "video"],
        },
        {
          type: "cli",
          command: "gemini",
          args: [
            "-m",
            "gemini-3-flash",
            "--allowed-tools",
            "read_file",
            "Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters.",
          ],
          capabilities: ["image", "video"],
        },
      ],
      audio: {
        attachments: { mode: "all", maxAttachments: 2 },
      },
      video: {
        maxChars: 500,
      },
    },
  },
}

2) Audio + Video only (image off)

json5
{
  tools: {
    media: {
      audio: {
        enabled: true,
        models: [
          { provider: "openai", model: "gpt-4o-mini-transcribe" },
          {
            type: "cli",
            command: "whisper",
            args: ["--model", "base", "{{MediaPath}}"],
          },
        ],
      },
      video: {
        enabled: true,
        maxChars: 500,
        models: [
          { provider: "google", model: "gemini-3-flash-preview" },
          {
            type: "cli",
            command: "gemini",
            args: [
              "-m",
              "gemini-3-flash",
              "--allowed-tools",
              "read_file",
              "Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters.",
            ],
          },
        ],
      },
    },
  },
}

3) Optional image understanding

json5
{
  tools: {
    media: {
      image: {
        enabled: true,
        maxBytes: 10485760,
        maxChars: 500,
        models: [
          { provider: "openai", model: "gpt-5.2" },
          { provider: "anthropic", model: "claude-opus-4-6" },
          {
            type: "cli",
            command: "gemini",
            args: [
              "-m",
              "gemini-3-flash",
              "--allowed-tools",
              "read_file",
              "Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters.",
            ],
          },
        ],
      },
    },
  },
}

4) Multi‑modal single entry (explicit capabilities)

json5
{
  tools: {
    media: {
      image: {
        models: [
          {
            provider: "google",
            model: "gemini-3-pro-preview",
            capabilities: ["image", "video", "audio"],
          },
        ],
      },
      audio: {
        models: [
          {
            provider: "google",
            model: "gemini-3-pro-preview",
            capabilities: ["image", "video", "audio"],
          },
        ],
      },
      video: {
        models: [
          {
            provider: "google",
            model: "gemini-3-pro-preview",
            capabilities: ["image", "video", "audio"],
          },
        ],
      },
    },
  },
}

Status output

When media understanding runs, /status includes a short summary line:

📎 Media: image ok (openai/gpt-5.2) · audio skipped (maxBytes)

It shows per-capability outcomes and, when applicable, the chosen provider/model.
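
For readers who want to reproduce or parse this line, the format can be approximated as below. `format_status` and the (capability, outcome, detail) tuple shape are hypothetical; the separator and prefix are copied from the sample line above.

```python
from typing import List, Optional, Tuple

Outcome = Tuple[str, str, Optional[str]]  # (capability, outcome, detail or None)


def format_status(outcomes: List[Outcome]) -> str:
    """Join per-capability outcomes into a single summary line."""
    parts = []
    for capability, outcome, detail in outcomes:
        if detail:
            parts.append(f"{capability} {outcome} ({detail})")
        else:
            parts.append(f"{capability} {outcome}")
    return "📎 Media: " + " · ".join(parts)


line = format_status(
    [
        ("image", "ok", "openai/gpt-5.2"),
        ("audio", "skipped", "maxBytes"),
    ]
)
```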

Notes

Understanding is best-effort. Errors do not block replies.
Attachments are still passed to models even when understanding is disabled.
Use scope to limit where understanding runs (for example, DMs only).