Quantum Minds Media Operators

Introduction

Media operators in Quantum Minds let your minds work with non-text content: images and audio, including speech. They both process and generate media, so a mind is not limited to text-only applications and can deliver multimodal results.

Available Media Operators

Operator	Description	Common Use Cases
ImageGenerationUrl	Creates images from text prompts with URL output	Visual content creation, marketing materials, illustrations
ImageGeneration64	Creates images with base64-encoded output	Embedded visuals, application integration, offline use
SpeechToText	Converts audio to text transcriptions	Transcription, meeting notes, audio analysis
TextToSpeech	Converts text to spoken audio	Audio content creation, accessibility, voice interfaces

ImageGenerationUrl

The ImageGenerationUrl operator creates images based on text descriptions and returns accessible URLs for the generated images.

Inputs

Parameter	Type	Required	Description
model	string	Yes	Image generation model to use (e.g., "dall-e-3")
prompt	string	Yes	Description of the image to generate
quality	string	Yes	Image quality level (e.g., "standard", "hd")
size	string	Yes	Image dimensions (e.g., "1024x1024", "512x512")
trigger	string	No	Optional control signal

Outputs

Parameter	Type	Description
type	string	Output format (markdown)
content	string	Markdown containing the image URL and generation details
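Because the output is Markdown rather than a bare URL, downstream code usually needs to pull the URL out of `content`. The following is a minimal sketch, assuming the image appears in standard Markdown image syntax (`![alt](url)`); the exact layout of the generated Markdown is not specified here, and `extract_image_urls` is a hypothetical helper name.

```python
import re

# Matches Markdown image syntax: ![alt text](url "optional title")
MARKDOWN_IMAGE = re.compile(r"!\[[^\]]*\]\(\s*([^)\s]+)[^)]*\)")


def extract_image_urls(markdown: str) -> list[str]:
    """Return every image target found in a Markdown string, in order."""
    return MARKDOWN_IMAGE.findall(markdown)


# Made-up sample output; the real content layout may differ.
sample_content = (
    "![Generated image](https://example.com/images/city.png)\n\n"
    "Model: dall-e-3, size: 1024x1024"
)
print(extract_image_urls(sample_content))
```

If the operator emits the URL as a plain link instead of an image, the pattern would need to be adjusted accordingly.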

Example Usage

Model: "dall-e-3"
Prompt: "A futuristic smart city with sustainable architecture, flying vehicles, green spaces, and solar panels, in a photorealistic style"
Quality: "hd"
Size: "1024x1024"

Output: Markdown with an embedded image URL showing the described futuristic city
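The same inputs can be assembled into a step definition in code. This is a minimal sketch, assuming the `{"operator": ..., "input": ...}` JSON shape used in the combined examples later in this document; `build_image_step` and its validation are hypothetical and not part of Quantum Minds.

```python
import json

# Required inputs for ImageGenerationUrl, taken from the Inputs table.
REQUIRED_IMAGE_INPUTS = ("model", "prompt", "quality", "size")


def build_image_step(operator: str, **inputs: str) -> dict:
    """Return a step dict, raising if a required input is missing or empty.

    Hypothetical helper: the step shape mirrors the JSON examples in this
    document, but the function itself is illustrative only.
    """
    missing = [name for name in REQUIRED_IMAGE_INPUTS if not inputs.get(name)]
    if missing:
        raise ValueError(f"missing required inputs: {', '.join(missing)}")
    return {"operator": operator, "input": dict(inputs)}


step = build_image_step(
    "ImageGenerationUrl",
    model="dall-e-3",
    prompt="A futuristic smart city with sustainable architecture",
    quality="hd",
    size="1024x1024",
)
print(json.dumps(step, indent=2))
```

Validating required inputs before submission surfaces a missing `quality` or `size` early instead of at generation time.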

Best Practices

Provide detailed, specific descriptions
Mention style, perspective, lighting, and mood
Include important details about what should and shouldn't be included
Be aware of content policies that may restrict certain types of images
Use quality and size parameters appropriate for your use case

ImageGeneration64

The ImageGeneration64 operator creates images based on text descriptions and returns them in base64-encoded format for direct embedding.

Inputs

The inputs are the same as for ImageGenerationUrl:

Parameter	Type	Required	Description
model	string	Yes	Image generation model to use
prompt	string	Yes	Description of the image to generate
quality	string	Yes	Image quality level
size	string	Yes	Image dimensions
trigger	string	No	Optional control signal

Outputs

Parameter	Type	Description
type	string	Output format (markdown)
content	string	Markdown with embedded base64-encoded image
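To save or post-process the image, the base64 payload has to be decoded back into bytes. The sketch below assumes the Markdown embeds the image as a data URI (`![alt](data:image/png;base64,...)`); that layout is an assumption, and `decode_embedded_image` is a hypothetical helper.

```python
import base64
import re

# Assumed layout: ![alt](data:image/<subtype>;base64,<payload>)
DATA_URI = re.compile(
    r"!\[[^\]]*\]\(data:image/([a-zA-Z0-9.+-]+);base64,([A-Za-z0-9+/=]+)\)"
)


def decode_embedded_image(markdown: str) -> tuple[str, bytes]:
    """Return (image subtype, raw bytes) for the first embedded image."""
    match = DATA_URI.search(markdown)
    if match is None:
        raise ValueError("no base64-encoded image found")
    subtype, payload = match.groups()
    return subtype, base64.b64decode(payload, validate=True)


# Stand-in payload: the 8-byte PNG signature, not a full image.
png_signature = b"\x89PNG\r\n\x1a\n"
encoded = base64.b64encode(png_signature).decode("ascii")
sample = f"![image](data:image/png;base64,{encoded})"
subtype, data = decode_embedded_image(sample)
print(subtype, len(data))
```

The decoded bytes can then be written to a file opened in binary mode.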

Supported Models

Both image generation operators support:

Model	Provider	Strengths
dall-e-2	OpenAI	Faster generation, stylized results
dall-e-3	OpenAI	Higher quality, better prompt adherence

Choosing Between Image Operators

Consideration	ImageGenerationUrl	ImageGeneration64
Persistence	Images hosted externally	Image data contained in output
Loading speed	May be faster for large images	May be slower for large images
Integration	Requires URL access	Works offline or in closed systems
Storage	Minimal output size	Larger output size
Sharing	Easier to share URLs	Requires full data transfer
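The storage difference follows from how base64 works: every 3 bytes of binary data become 4 ASCII characters, so an encoded image is about one third larger than the raw file. A short worked example, using a hypothetical 1.5 MB image:

```python
import base64


def base64_length(raw_bytes: int) -> int:
    """Length in characters of padded base64 for raw_bytes bytes of input."""
    return 4 * ((raw_bytes + 2) // 3)


raw_size = 1_500_000  # hypothetical 1.5 MB image
encoded_size = base64_length(raw_size)
print(encoded_size, round(encoded_size / raw_size, 3))

# Cross-check the formula against the standard library on a small input.
assert len(base64.b64encode(b"x" * 1000)) == base64_length(1000)
```

For the 1.5 MB example this gives 2,000,000 characters, before counting any surrounding Markdown.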

SpeechToText

The SpeechToText operator converts audio content to text transcriptions.

Inputs

Parameter	Type	Required	Description
file	string	Yes	Audio file to transcribe
model	string	No	Transcription model to use (default based on file type)
trigger	string	No	Optional control signal

Outputs

Parameter	Type	Description
type	string	Output format (markdown)
content	string	Transcribed text (Markdown)

Supported Models

Model	Provider	Best For
whisper-large	Fireworks	General transcription, multiple languages
whisper-large-v3	Groq	Enhanced accuracy, speaker diarization

Example Usage

File: [Meeting recording audio file]
Model: "whisper-large-v3"

Output: Complete transcription of the meeting with speaker identification

Best Practices

Use high-quality audio when possible
Consider pre-processing noisy audio
Choose appropriate models for your language needs
Be aware of length limitations for audio files
Review and correct transcriptions for critical content
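A pre-flight size check is one way to act on the length limitation before submitting a file. The sketch below is illustrative only: the 25 MB value is a placeholder, since the actual limit is not stated here, and `check_audio_size` is a hypothetical helper.

```python
import os
import tempfile

# Placeholder limit: the real limit is not stated here, so check your deployment.
MAX_AUDIO_BYTES = 25 * 1024 * 1024


def check_audio_size(path: str, limit: int = MAX_AUDIO_BYTES) -> int:
    """Return the file size in bytes, raising ValueError if it exceeds limit."""
    size = os.path.getsize(path)
    if size > limit:
        raise ValueError(f"{path} is {size} bytes; limit is {limit} bytes")
    return size


# Demonstrate with a small temporary file standing in for a recording.
with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as handle:
    handle.write(b"\0" * 2048)
    demo_path = handle.name

demo_size = check_audio_size(demo_path)
print(demo_size)
os.remove(demo_path)
```

Files that fail the check can be split or compressed before transcription.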

TextToSpeech

The TextToSpeech operator converts text to spoken audio content.

Inputs

Parameter	Type	Required	Description
text	string	Yes	Text to convert to speech
language	string	No	Language code (e.g., "en-US", "fr-FR")
trigger	string	No	Optional control signal

Outputs

Parameter	Type	Description
type	string	Output format (markdown)
content	string	Markdown with embedded audio player

Example Usage

Text: "Welcome to our quarterly financial review. In this presentation, we'll cover the key performance indicators, revenue highlights, and our outlook for the coming quarter."
Language: "en-US"

Output: Markdown with an embedded audio player for the spoken version of the provided text

Best Practices

Break long text into natural paragraphs
Use punctuation to influence pacing and intonation
Consider including pronunciation guides for uncommon terms
Test with smaller segments before processing large content
Be mindful of total character count for efficiency
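The first and last points can be combined by grouping paragraphs into chunks under a character budget and sending each chunk as its own TextToSpeech request. This is a minimal sketch; `chunk_paragraphs` is a hypothetical helper and the budget is whatever suits your content, not a documented limit.

```python
def chunk_paragraphs(text: str, max_chars: int) -> list[str]:
    """Group paragraphs into chunks of at most max_chars characters.

    Paragraphs are separated by blank lines. A single paragraph longer than
    max_chars is kept whole in its own chunk rather than split mid-sentence.
    """
    chunks: list[str] = []
    current = ""
    for paragraph in (p.strip() for p in text.split("\n\n")):
        if not paragraph:
            continue
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


script = "First paragraph.\n\nSecond paragraph.\n\nThird."
for index, chunk in enumerate(chunk_paragraphs(script, max_chars=35), start=1):
    print(index, len(chunk))
```

Splitting only at paragraph boundaries keeps each audio segment ending at a natural pause.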

Combining Media Operators

Media operators can be chained, with the output of one step feeding the input of the next:

Text-to-Speech-to-Text Verification

{
  "operator": "TextToSpeech",
  "input": {
    "text": "Important announcement regarding the system upgrade scheduled for next weekend."
  }
}

↓

{
  "operator": "SpeechToText",
  "input": {
    "file": "$TextToSpeech_001.output.content"
  }
}

↓

{
  "operator": "TableToTextSummary",
  "input": {
    "prompt": "Compare the original text with the transcription and identify any discrepancies",
    "dataframe": { "original": "Important announcement...", "transcribed": "$SpeechToText_001.output.content" }
  }
}
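A deterministic word-level comparison can complement the summary step when exact discrepancies matter. The sketch below uses Python's standard `difflib`; it runs outside Quantum Minds, and the transcribed sentence is a made-up example of a recognition error.

```python
import difflib
import re


def normalize(text: str) -> list[str]:
    """Lowercase and strip punctuation so only word differences remain."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_discrepancies(original: str, transcribed: str) -> list[tuple[str, str, str]]:
    """Return (operation, original words, transcribed words) for each mismatch."""
    a, b = normalize(original), normalize(transcribed)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, " ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


original = "Important announcement regarding the system upgrade scheduled for next weekend."
transcribed = "Important announcement regarding the system upgrade scheduled for next week."
print(word_discrepancies(original, transcribed))
```

An empty result means the transcription matched the original word for word after normalization.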

Image Generation with Audio Description

{
  "operator": "ImageGenerationUrl",
  "input": {
    "model": "dall-e-3",
    "prompt": "A visualization of global supply chain networks with highlighted routes and distribution centers",
    "quality": "standard",
    "size": "1024x1024"
  }
}

↓

{
  "operator": "TextToSpeech",
  "input": {
    "text": "This visualization shows our global supply chain network. The red lines represent primary shipping routes, while the blue dots indicate major distribution centers. Note the concentration of activity in Southeast Asia and North America."
  }
}

↓

{
  "operator": "CardGenerator",
  "input": {
    "prompt": "Create an interactive card with the image and audio narration"
  }
}
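The examples in this section pass results between steps with references such as `$SpeechToText_001.output.content`. The sketch below shows one way such references could be resolved against earlier outputs. It is an assumption about the semantics, not the platform's actual resolver, and the `image` input name on the card step is hypothetical.

```python
import re

# Assumed reference form, e.g. "$TextToSpeech_001.output.content".
REFERENCE = re.compile(r"^\$(\w+)\.output\.(\w+)$")


def resolve_inputs(step_input: dict, outputs: dict) -> dict:
    """Replace reference strings in step_input with values from outputs.

    outputs maps a step ID (such as "TextToSpeech_001") to that step's
    output dict. Non-reference values are passed through unchanged.
    """
    resolved = {}
    for key, value in step_input.items():
        match = REFERENCE.match(value) if isinstance(value, str) else None
        if match:
            step_id, field = match.groups()
            resolved[key] = outputs[step_id][field]
        else:
            resolved[key] = value
    return resolved


outputs = {
    "ImageGenerationUrl_001": {
        "type": "markdown",
        "content": "![chart](https://example.com/chart.png)",
    }
}
card_input = {
    "prompt": "Create an interactive card with the image and audio narration",
    "image": "$ImageGenerationUrl_001.output.content",  # hypothetical input name
}
print(resolve_inputs(card_input, outputs))
```

A reference to a step that has not run raises `KeyError`, which makes ordering mistakes in a chain visible immediately.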

Integration with Other Operators

Multi-Modal Content Creation

Media operators can be chained with operators from other categories:

Data Visualization with Narration: SQLExecution → TableToGraph → TextToSpeech
Automated Report Generation: PandasAi → TextToSpeech → CardGenerator
Image Generation from Data: TableToTextSummary → ImageGenerationUrl
Audio Transcription Analysis: SpeechToText → TextToNoSQL → MongoExecution

Using with MultiModal Operators

Media operators complement the MultiModal operators:

Use SpeechToText to prepare audio for GeminiMultiModal analysis
Use ImageGenerationUrl to create visuals based on ClaudeMultiModal insights
Process TextToSpeech output with GeminiMultiModal for secondary analysis

Advanced Use Cases

Accessibility Enhancement

Create accessible versions of content:

{
  "operator": "RAGSummarize",
  "input": {
    "prompt": "Summarize the key points from our annual report",
    "collection": "company_reports"
  }
}

↓

{
  "operator": "TextToSpeech",
  "input": {
    "text": "$RAGSummarize_001.output.content"
  }
}

Content Localization

Translate and voice content for multiple languages:

{
  "operator": "OpenSearch",
  "input": {
    "prompt": "Translate the following product description to Spanish, French, and German: [product description]"
  }
}

↓

{
  "operator": "TextToSpeech",
  "input": {
    "text": "$OpenSearch_001.output.content",
    "language": "es-ES, fr-FR, de-DE"
  }
}
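The `language` input is documented with single-code examples such as "en-US", so a cautious variant is to emit one TextToSpeech step per language, each with its own translated text. This is a sketch under that assumption; `speech_steps` is a hypothetical helper and the translations are placeholders.

```python
def speech_steps(translations: dict[str, str]) -> list[dict]:
    """Build one TextToSpeech step per (language code, translated text) pair."""
    return [
        {"operator": "TextToSpeech", "input": {"text": text, "language": code}}
        for code, text in translations.items()
    ]


# Placeholder translations; in a real flow these would come from the
# preceding translation step.
translations = {
    "es-ES": "Descripción del producto",
    "fr-FR": "Description du produit",
    "de-DE": "Produktbeschreibung",
}
for step in speech_steps(translations):
    print(step["input"]["language"], "->", step["input"]["text"])
```

Keeping one step per language also means a failure in one voice track does not block the others.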

Interactive Tutorials

Create rich learning experiences:

{
  "operator": "ImageGenerationUrl",
  "input": {
    "prompt": "Step-by-step illustration of how to configure the system settings",
    "model": "dall-e-3",
    "quality": "hd",
    "size": "1024x1024"
  }
}

↓

{
  "operator": "TextToSpeech",
  "input": {
    "text": "In this tutorial, we'll walk through the system configuration process..."
  }
}

↓

{
  "operator": "CardGenerator",
  "input": {
    "prompt": "Create an interactive tutorial card with both visual and audio guidance"
  }
}

Best Practices for Media Operators

Performance Considerations

Be aware of processing times for media generation
Consider asynchronous processing for long audio files
Optimize image sizes for your specific use case
Cache frequently used media when possible
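Caching can be sketched as a lookup keyed by a hash of the operator name and its inputs, so that an identical request reuses the earlier result. The code below is illustrative: `MediaCache` and `fake_generate` are hypothetical, and the generator stands in for a real operator call.

```python
import hashlib
import json
from typing import Callable


def cache_key(operator: str, inputs: dict) -> str:
    """Stable key: SHA-256 of the operator name and its sorted inputs."""
    payload = json.dumps({"operator": operator, "input": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class MediaCache:
    """In-memory cache; swap the dict for disk or object storage as needed."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}
        self.hits = 0

    def get_or_create(
        self, operator: str, inputs: dict, generate: Callable[[dict], str]
    ) -> str:
        key = cache_key(operator, inputs)
        if key in self._store:
            self.hits += 1
        else:
            self._store[key] = generate(inputs)
        return self._store[key]


def fake_generate(inputs: dict) -> str:
    """Stand-in for a real operator call."""
    return f"![image](https://example.com/{len(inputs['prompt'])}.png)"


cache = MediaCache()
request = {"model": "dall-e-3", "prompt": "A city skyline", "quality": "standard", "size": "1024x1024"}
reordered = dict(reversed(list(request.items())))  # same inputs, different key order
first = cache.get_or_create("ImageGenerationUrl", request, fake_generate)
second = cache.get_or_create("ImageGenerationUrl", reordered, fake_generate)
print(first == second, cache.hits)
```

Sorting keys before hashing means the order in which inputs are written does not cause a cache miss.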

Content Quality

Provide detailed prompts for image generation
Use clear, well-paced text for speech synthesis
Ensure audio files have good quality for transcription
Review generated media for accuracy and appropriateness

Technical Limitations

Be aware of file size limits for audio processing
Consider format compatibility across systems
Understand resolution constraints for image generation
Plan for potential failures in media processing
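One common way to plan for failures is to retry a media step with exponential backoff. The following is a generic sketch, not a Quantum Minds feature: `with_retries` is a hypothetical wrapper, `flaky_generation` simulates an operator call, and in practice the caught exception type should be narrowed to errors that are actually transient.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call, retrying on any exception with exponential backoff.

    Delays are base_delay, 2 * base_delay, 4 * base_delay, and so on.
    The last exception is re-raised once attempts are exhausted.
    """
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")


# Simulated operator call that fails twice before succeeding.
calls = {"count": 0}


def flaky_generation() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return "![image](https://example.com/result.png)"


delays: list[float] = []
result = with_retries(flaky_generation, base_delay=0.5, sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter lets the example record the delays instead of actually waiting, which also makes the wrapper easy to test.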

Ethical Considerations

Ensure generated content adheres to appropriate guidelines
Consider bias in image generation and speech recognition
Be transparent about AI-generated media when appropriate
Respect copyright and intellectual property in prompts

Next Steps

Explore how Media Operators can be combined with Excel Operators to create rich, data-driven presentations and reports.
