
Quantum Minds Media Operators

Introduction

Media operators in Quantum Minds let your minds work with non-text content: images, audio, and speech. With them, a mind can generate images from prompts, transcribe audio, and synthesize speech, which extends text-only applications into multimodal ones.

Available Media Operators

| Operator | Description | Common Use Cases |
| --- | --- | --- |
| ImageGenerationUrl | Creates images from text prompts with URL output | Visual content creation, marketing materials, illustrations |
| ImageGeneration64 | Creates images with base64-encoded output | Embedded visuals, application integration, offline use |
| SpeechToText | Converts audio to text transcriptions | Transcription, meeting notes, audio analysis |
| TextToSpeech | Converts text to spoken audio | Audio content creation, accessibility, voice interfaces |

ImageGenerationUrl

The ImageGenerationUrl operator creates images based on text descriptions and returns accessible URLs for the generated images.

Inputs

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| model | string | Yes | Image generation model to use (e.g., "dall-e-3") |
| prompt | string | Yes | Description of the image to generate |
| quality | string | Yes | Image quality level (e.g., "standard", "hd") |
| size | string | Yes | Image dimensions (e.g., "1024x1024", "512x512") |
| trigger | string | No | Optional control signal |
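
The required and optional inputs in the table above can be checked before a step is submitted. The following is a minimal Python sketch, not part of Quantum Minds: the function name image_generation_url_step is hypothetical, and the step layout (an "operator" name plus an "input" object) is assumed from the JSON flow examples elsewhere on this page.

```python
import json

# Input names taken from the Inputs table; everything else here is illustrative.
REQUIRED = ("model", "prompt", "quality", "size")
OPTIONAL = ("trigger",)


def image_generation_url_step(**inputs):
    """Return an ImageGenerationUrl step dict, or raise ValueError on bad inputs."""
    # An empty string is treated the same as a missing value.
    missing = [name for name in REQUIRED if not inputs.get(name)]
    if missing:
        raise ValueError(f"missing required input(s): {', '.join(missing)}")
    unknown = sorted(set(inputs) - set(REQUIRED) - set(OPTIONAL))
    if unknown:
        raise ValueError(f"unknown input(s): {', '.join(unknown)}")
    return {"operator": "ImageGenerationUrl", "input": dict(inputs)}


step = image_generation_url_step(
    model="dall-e-3",
    prompt="A futuristic smart city with sustainable architecture",
    quality="hd",
    size="1024x1024",
)
print(json.dumps(step, indent=2))
```

Failing early on a missing quality or size is cheaper than discovering the problem after the flow has started running.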

Outputs

| Parameter | Type | Description |
| --- | --- | --- |
| type | string | Output format (markdown) |
| content | string | Markdown containing the image URL and generation details |

Example Usage

Model: "dall-e-3"
Prompt: "A futuristic smart city with sustainable architecture, flying vehicles, green spaces, and solar panels, in a photorealistic style"
Quality: "hd"
Size: "1024x1024"

Output: Markdown with an embedded image URL showing the described futuristic city
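
Because the operator returns Markdown rather than a bare URL, downstream code usually has to pull the URL out of the content field. The exact Markdown layout is not documented on this page, so this sketch assumes standard Markdown image syntax, ![alt](url), with an http or https URL. The function name extract_image_urls and the sample content are hypothetical.

```python
import re

# Captures the URL of each Markdown image: ![alt text](https://...)
# Assumption: URLs contain no spaces or closing parentheses.
IMAGE_URL = re.compile(r"!\[[^\]]*\]\((https?://[^\s)]+)")


def extract_image_urls(markdown):
    """Return every image URL found in a Markdown string, in order."""
    return IMAGE_URL.findall(markdown)


# Hypothetical operator output, for illustration only.
content = (
    "Generated image:\n\n"
    "![Futuristic city](https://example.com/images/city.png)\n\n"
    "Model: dall-e-3"
)
print(extract_image_urls(content))
```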

Best Practices

ImageGeneration64

The ImageGeneration64 operator creates images based on text descriptions and returns them in base64-encoded format for direct embedding.

Inputs

The inputs are the same as for ImageGenerationUrl:

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| model | string | Yes | Image generation model to use |
| prompt | string | Yes | Description of the image to generate |
| quality | string | Yes | Image quality level |
| size | string | Yes | Image dimensions |
| trigger | string | No | Optional control signal |

Outputs

| Parameter | Type | Description |
| --- | --- | --- |
| type | string | Output format (markdown) |
| content | string | Markdown with embedded base64-encoded image |
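
To use the embedded image outside of Markdown (for example, to save it to disk), the base64 payload has to be decoded back into bytes. How the image is embedded is not specified here; this sketch assumes a Markdown image whose target is a data URI of the form data:image/png;base64,.... The function name decode_embedded_image and the sample content are hypothetical.

```python
import base64
import re

# Assumed embedding: ![alt](data:image/<subtype>;base64,<payload>)
DATA_URI = re.compile(
    r"!\[[^\]]*\]\(data:(image/[\w.+-]+);base64,([A-Za-z0-9+/=]+)\)"
)


def decode_embedded_image(markdown):
    """Return (mime_type, image_bytes) for the first embedded image."""
    match = DATA_URI.search(markdown)
    if match is None:
        raise ValueError("no base64-encoded image found")
    mime_type, payload = match.groups()
    # validate=True rejects characters outside the base64 alphabet.
    return mime_type, base64.b64decode(payload, validate=True)


# Stand-in for real output: only the 8-byte PNG signature is encoded here.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
sample = base64.b64encode(PNG_SIGNATURE).decode("ascii")
content = f"![generated](data:image/png;base64,{sample})"

mime_type, image_bytes = decode_embedded_image(content)
print(mime_type, len(image_bytes))
```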

Supported Models

Both image generation operators support:

| Model | Provider | Strengths |
| --- | --- | --- |
| dall-e-2 | OpenAI | Faster generation, stylized results |
| dall-e-3 | OpenAI | Higher quality, better prompt adherence |

Choosing Between Image Operators

| Consideration | ImageGenerationUrl | ImageGeneration64 |
| --- | --- | --- |
| Persistence | Images hosted externally | Image data contained in output |
| Loading speed | May be faster for large images | May be slower for large images |
| Integration | Requires URL access | Works offline or in closed systems |
| Storage | Minimal output size | Larger output size |
| Sharing | Easier to share URLs | Requires full data transfer |
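
The Storage row follows from how base64 works: every 3 bytes of image data become 4 output characters, so the encoded image is about one third larger than the raw file. The helper below (an illustrative name, not a platform function) estimates the encoded size and cross-checks the formula against Python's standard library.

```python
import base64


def base64_length(raw_bytes: int) -> int:
    """Character count of padded base64 for an input of raw_bytes bytes."""
    # Each group of up to 3 input bytes yields exactly 4 output characters.
    return 4 * ((raw_bytes + 2) // 3)


# Cross-check the formula on a 1,000-byte payload.
payload = bytes(1000)
assert len(base64.b64encode(payload)) == base64_length(1000)

one_megabyte = 1_000_000
ratio = base64_length(one_megabyte) / one_megabyte
print(f"encoded size is {ratio:.2f}x the raw size")
```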

SpeechToText

The SpeechToText operator converts audio content to text transcriptions.

Inputs

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| file | string | Yes | Audio file to transcribe |
| model | string | No | Transcription model to use (default based on file type) |
| trigger | string | No | Optional control signal |

Outputs

| Parameter | Type | Description |
| --- | --- | --- |
| type | string | Output format (markdown) |
| content | markdown | Transcribed text |

Supported Models

| Model | Provider | Best For |
| --- | --- | --- |
| whisper-large | Fireworks | General transcription, multiple languages |
| whisper-large-v3 | Groq | Enhanced accuracy, speaker diarization |

Example Usage

File: [Meeting recording audio file]
Model: "whisper-large-v3"

Output: Complete transcription of the meeting with speaker identification

Best Practices

TextToSpeech

The TextToSpeech operator converts text to spoken audio content.

Inputs

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| text | string | Yes | Text to convert to speech |
| language | string | No | Language code (e.g., "en-US", "fr-FR") |
| trigger | string | No | Optional control signal |
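
The language examples in the table ("en-US", "fr-FR") follow a language-region shape. A flow builder could check that shape before sending a step. This is only a sketch: the pattern is a simplified subset of real language tags, the function name is_language_code is hypothetical, and the platform may accept other forms (such as a bare "en") that this check would reject.

```python
import re

# Simplified shape: 2-3 lowercase letters, a hyphen, 2 uppercase letters.
LANGUAGE_CODE = re.compile(r"[a-z]{2,3}-[A-Z]{2}")


def is_language_code(value: str) -> bool:
    """Return True if value looks like a single language-region code."""
    return LANGUAGE_CODE.fullmatch(value) is not None


for candidate in ("en-US", "fr-FR", "english", "es-ES, fr-FR"):
    print(candidate, is_language_code(candidate))
```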

Outputs

| Parameter | Type | Description |
| --- | --- | --- |
| type | string | Output format (markdown) |
| content | markdown | Markdown with embedded audio player |

Example Usage

Text: "Welcome to our quarterly financial review. In this presentation, we'll cover the key performance indicators, revenue highlights, and our outlook for the coming quarter."
Language: "en-US"

Output: Markdown with an embedded audio player that plays the spoken version of the provided text

Best Practices

Combining Media Operators

Media operators can be chained so that the output of one step feeds the next. The two flows below show this:

Text-to-Speech-to-Text Verification

{
  "operator": "TextToSpeech",
  "input": {
    "text": "Important announcement regarding the system upgrade scheduled for next weekend."
  }
}

↓

{
  "operator": "SpeechToText",
  "input": {
    "file": "$TextToSpeech_001.output.content"
  }
}

↓

{
  "operator": "TableToTextSummary",
  "input": {
    "prompt": "Compare the original text with the transcription and identify any discrepancies",
    "dataframe": { "original": "Important announcement...", "transcribed": "$SpeechToText_001.output.content" }
  }
}
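
The comparison in the last step can also be done deterministically, outside the flow. The sketch below uses Python's difflib to list word-level differences between the original text and a transcription. It is an alternative to the TableToTextSummary step, not a description of how that operator works; the function names and the sample transcription are hypothetical.

```python
import difflib
import re


def normalize(text):
    """Lowercase and drop punctuation so only wording differences remain."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_discrepancies(original, transcribed):
    """Return (change, original_words, transcribed_words) for each difference."""
    a, b = normalize(original), normalize(transcribed)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, " ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


original = (
    "Important announcement regarding the system upgrade "
    "scheduled for next weekend."
)
# Hypothetical transcription with one misheard word.
transcribed = (
    "Important announcement regarding the system upgrades "
    "scheduled for next weekend"
)
print(word_discrepancies(original, transcribed))
```

Normalizing case and punctuation first keeps the report focused on misheard words rather than on formatting differences between the two texts.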

Image Generation with Audio Description

{
  "operator": "ImageGenerationUrl",
  "input": {
    "model": "dall-e-3",
    "prompt": "A visualization of global supply chain networks with highlighted routes and distribution centers",
    "quality": "standard",
    "size": "1024x1024"
  }
}

↓

{
  "operator": "TextToSpeech",
  "input": {
    "text": "This visualization shows our global supply chain network. The red lines represent primary shipping routes, while the blue dots indicate major distribution centers. Note the concentration of activity in Southeast Asia and North America."
  }
}

↓

{
  "operator": "CardGenerator",
  "input": {
    "prompt": "Create an interactive card with the image and audio narration"
  }
}

Integration with Other Operators

Multi-Modal Content Creation

Media operators can be chained with operators from other categories. Typical pipelines:

  1. Data Visualization with Narration:
    SQLExecution → TableToGraph → TextToSpeech

  2. Automated Report Generation:
    PandasAi → TextToSpeech → CardGenerator

  3. Image Generation from Data:
    TableToTextSummary → ImageGenerationUrl

  4. Audio Transcription Analysis:
    SpeechToText → TextToNoSQL → MongoExecution
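
The arrow notation above can be turned into an ordered list of steps. The sketch below splits a pipeline string and numbers each step in the style of the references used on this page (for example $TextToSpeech_001). The numbering rule, a per-operator counter padded to three digits, is an assumption inferred from those references, and both function names are hypothetical.

```python
from collections import Counter


def parse_pipeline(notation):
    """Split 'A → B → C' into operator names, ignoring surrounding spaces."""
    return [name.strip() for name in notation.split("→") if name.strip()]


def assign_step_ids(operators):
    """Number each operator per name: the first TextToSpeech is TextToSpeech_001."""
    seen = Counter()
    step_ids = []
    for name in operators:
        seen[name] += 1
        step_ids.append(f"{name}_{seen[name]:03d}")
    return step_ids


operators = parse_pipeline("SpeechToText → TextToNoSQL → MongoExecution")
print(assign_step_ids(operators))
```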

Using with MultiModal Operators

Media operators complement the MultiModal operators:

Advanced Use Cases

Accessibility Enhancement

Create accessible versions of content:

{
  "operator": "RAGSummarize",
  "input": {
    "prompt": "Summarize the key points from our annual report",
    "collection": "company_reports"
  }
}

↓

{
  "operator": "TextToSpeech",
  "input": {
    "text": "$RAGSummarize_001.output.content"
  }
}
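
The "$RAGSummarize_001.output.content" value is a reference to the previous step's output. The platform resolves such references itself; the sketch below only illustrates the idea, assuming the form $<step id>.output.<field>. The function name resolve_inputs and the sample output are hypothetical.

```python
import re

# Assumed reference form: $<StepId>.output.<field>
REFERENCE = re.compile(r"\$(\w+)\.output\.(\w+)")


def resolve_inputs(step_input, outputs):
    """Replace whole-value references with the referenced step output."""
    resolved = {}
    for name, value in step_input.items():
        match = REFERENCE.fullmatch(value) if isinstance(value, str) else None
        if match:
            step_id, field = match.groups()
            # A KeyError here means the referenced step has not produced output.
            resolved[name] = outputs[step_id][field]
        else:
            resolved[name] = value
    return resolved


# Hypothetical output of the first step.
outputs = {"RAGSummarize_001": {"type": "markdown", "content": "Summary text."}}
step_input = {"text": "$RAGSummarize_001.output.content", "language": "en-US"}
print(resolve_inputs(step_input, outputs))
```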

Content Localization

Translate and voice content for each target language. The language input is documented as a single language code, so run one translation step and one TextToSpeech step per language (Spanish shown here):

{
  "operator": "OpenSearch",
  "input": {
    "prompt": "Translate the following product description to Spanish: [product description]"
  }
}

↓

{
  "operator": "TextToSpeech",
  "input": {
    "text": "$OpenSearch_001.output.content",
    "language": "es-ES"
  }
}
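
Since the language input of TextToSpeech is described as a single language code, one cautious way to voice several languages is to generate one TextToSpeech step per language. The sketch below builds those steps from a mapping of language code to translated text. The function name and the placeholder translations are hypothetical, and the step layout is copied from the JSON examples on this page.

```python
import json


def text_to_speech_steps(translations):
    """Build one TextToSpeech step per (language code, translated text) pair."""
    return [
        {"operator": "TextToSpeech", "input": {"text": text, "language": code}}
        for code, text in translations.items()
    ]


# Placeholder translations keyed by language code.
translations = {
    "es-ES": "Hola",
    "fr-FR": "Bonjour",
    "de-DE": "Hallo",
}
steps = text_to_speech_steps(translations)
print(json.dumps(steps, indent=2, ensure_ascii=False))
```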

Interactive Tutorials

Pair a generated illustration with spoken instructions, then combine both in a tutorial card:

{
  "operator": "ImageGenerationUrl",
  "input": {
    "prompt": "Step-by-step illustration of how to configure the system settings",
    "model": "dall-e-3",
    "quality": "hd",
    "size": "1024x1024"
  }
}

↓

{
  "operator": "TextToSpeech",
  "input": {
    "text": "In this tutorial, we'll walk through the system configuration process..."
  }
}

↓

{
  "operator": "CardGenerator",
  "input": {
    "prompt": "Create an interactive tutorial card with both visual and audio guidance"
  }
}

Best Practices for Media Operators

Performance Considerations

Content Quality

Technical Limitations

Ethical Considerations

Next Steps

Explore how Media Operators can be combined with Excel Operators to create rich, data-driven presentations and reports.

