google/gemini-omni-flash/reference-to-video

Generates video with audio from combined multimodal references. Accepts text, images, audio, and video together as input to guide subject, motion, style, and sound in the output.

Inference

Commercial use

Partner

Input

Prompt*

Type # in the prompt to reference the attached inputs.

Image URLs*

Aspect Ratio

Duration
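
The form fields above (Prompt and Image URLs marked required, Aspect Ratio and Duration optional) suggest the rough shape of a request body. The following is a minimal sketch only: the key names `prompt`, `image_urls`, `aspect_ratio`, and `duration` are assumptions inferred from the field labels, not confirmed by this page, and the helper `build_request` is hypothetical. The accepted values for aspect ratio and duration are not listed here, so the sketch does not validate them.

```python
import json

# Model identifier as shown in the page title.
MODEL_ID = "google/gemini-omni-flash/reference-to-video"


def build_request(prompt, image_urls, aspect_ratio=None, duration=None):
    """Assemble a request body. Key names are assumed from the form labels."""
    if not prompt or not prompt.strip():
        raise ValueError("prompt is required")
    if not image_urls:
        raise ValueError("at least one image URL is required")

    payload = {"prompt": prompt, "image_urls": list(image_urls)}
    # Optional fields are only included when the caller sets them.
    if aspect_ratio is not None:
        payload["aspect_ratio"] = aspect_ratio
    if duration is not None:
        payload["duration"] = duration
    return payload


if __name__ == "__main__":
    body = build_request(
        "A slow pan across the scene",
        ["https://example.com/reference.png"],  # placeholder URL
    )
    print(MODEL_ID)
    print(json.dumps(body, indent=2))
```

Leaving optional keys out of the payload, rather than sending nulls, lets the service apply its own defaults, whatever those are.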

Result

{
  "video": {
    "url": "https://v3b.fal.media/files/b/0aa06557/WF8fgoBMMq5QwnHbWXkFO_da643ad0160e4e8488b864723150f274.mp4",
    "content_type": "video/mp4",
    "file_name": "da643ad0160e4e8488b864723150f274.mp4",
    "file_size": 1946074
  }
}
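
As a small illustration of consuming the result above, the sketch below parses the sample response and pulls out the fields a client would typically need. The helper name `summarize_result` is hypothetical, and the sketch assumes every response has the same `video` object shape as this one sample.

```python
import json

# The sample response shown above, copied verbatim.
RESULT_JSON = """
{
  "video": {
    "url": "https://v3b.fal.media/files/b/0aa06557/WF8fgoBMMq5QwnHbWXkFO_da643ad0160e4e8488b864723150f274.mp4",
    "content_type": "video/mp4",
    "file_name": "da643ad0160e4e8488b864723150f274.mp4",
    "file_size": 1946074
  }
}
"""


def summarize_result(raw):
    """Extract the video fields and convert the byte count to megabytes."""
    video = json.loads(raw)["video"]
    return {
        "url": video["url"],
        "content_type": video["content_type"],
        "file_name": video["file_name"],
        "size_mb": round(video["file_size"] / 1_000_000, 2),
    }


if __name__ == "__main__":
    for key, value in summarize_result(RESULT_JSON).items():
        print(f"{key}: {value}")
```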

Billing is based on total token consumption: input tokens (text/audio/video) cost $1.875 per 1 million tokens, and output tokens cost $21.875 per 1 million tokens. For 720p output, this works out to approximately $0.13 per second of video.
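
The billing arithmetic can be sketched as follows. The rates and the $0.13 per second figure come from the paragraph above; the function names are hypothetical, and actual token counts per request are not stated on this page, so any counts passed in are the caller's own estimates. The last function assumes the per-second figure is dominated by output tokens, which is an inference rather than something the page confirms.

```python
# Rates quoted in the billing note (USD per 1 million tokens).
INPUT_USD_PER_MILLION = 1.875
OUTPUT_USD_PER_MILLION = 21.875
# Approximate figure quoted for 720p output (USD per second of video).
APPROX_USD_PER_SECOND_720P = 0.13


def token_cost(input_tokens, output_tokens):
    """Cost in USD for the given token counts at the quoted rates."""
    total = (input_tokens * INPUT_USD_PER_MILLION
             + output_tokens * OUTPUT_USD_PER_MILLION)
    return total / 1_000_000


def approx_720p_cost(seconds):
    """Rough cost in USD for a 720p clip, using the per-second estimate."""
    return seconds * APPROX_USD_PER_SECOND_720P


def implied_output_tokens_per_second():
    """Output tokens per second implied by the two quoted figures.

    Assumes the per-second estimate reflects output tokens only.
    """
    return APPROX_USD_PER_SECOND_720P / OUTPUT_USD_PER_MILLION * 1_000_000


if __name__ == "__main__":
    print(f"10 s at 720p: about ${approx_720p_cost(10):.2f}")
    print(f"Implied output tokens/s: {implied_output_tokens_per_second():.0f}")
```

Because output tokens cost more than ten times as much as input tokens, clip length is likely to drive cost far more than the size of the prompt or references.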
