Sora 2 Text to Video

fal-ai/sora-2/text-to-video/pro
Text-to-video endpoint for Sora 2 Pro, OpenAI's state-of-the-art video model capable of creating richly detailed, dynamic clips with audio from natural language or images.
Inference | Commercial use | Partner

If you provide an OpenAI API key, OpenAI charges that key directly. Otherwise, the cost is deducted from your fal credits. Pricing is $0.30 per second at 720p and $0.50 per second at 1080p.
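Because billing is per second of output, the cost of a clip can be estimated before submitting a request. The following is a minimal sketch using only the two rates quoted above; the function name `estimate_cost` is hypothetical and is not part of any fal or OpenAI library.

```python
from decimal import Decimal

# Per-second rates quoted on this page, in US dollars.
RATE_PER_SECOND = {
    "720p": Decimal("0.30"),
    "1080p": Decimal("0.50"),
}


def estimate_cost(duration_seconds: int, resolution: str) -> Decimal:
    """Return the estimated charge in dollars for one generated clip."""
    if resolution not in RATE_PER_SECOND:
        raise ValueError(f"unknown resolution: {resolution!r}")
    if duration_seconds <= 0:
        raise ValueError("duration must be a positive number of seconds")
    # Decimal avoids binary floating-point rounding in currency arithmetic.
    return RATE_PER_SECOND[resolution] * duration_seconds
```

Under these rates, a 25-second clip at 1080p comes to $12.50, and a 12-second clip at 720p comes to $3.60.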

Sora 2 Pro | [text-to-video]

OpenAI's Sora 2 Pro generates videos up to 25 seconds long with synchronized audio, priced at $0.50 per second for 1080p output. Its clip length and built-in audio set it apart from the many competing models that cap output at 10 seconds and produce no sound. It is built for filmmakers, content creators, and developers who need production-ready clips with natural dialogue and environmental audio.

Use Cases: Cinematic Scene Generation | Marketing Video Production | AI-Assisted Filmmaking


Performance

At $0.50/second for 1080p (or $0.30/second for 720p), Sora 2 Pro is a premium text-to-video option: it costs more per second in exchange for clips of up to 25 seconds and native audio synthesis.

Metric | Result | Context
Maximum Duration | Up to 25 seconds | 2.5x longer than most competing models (10s standard)
Resolution Options | 720p, 1080p | Two quality tiers with tiered pricing
Cost per Second | $0.30 (720p), $0.50 (1080p) | Premium pricing for audio-enabled, extended-duration output
Aspect Ratios | 9:16, 16:9 | Vertical and horizontal formats for social and cinematic use
Audio Synthesis | Native audio generation | Synchronized dialogue, ambient sound, and environmental audio
Related Endpoints | Sora 2 Video to Video, Sora 2 Text to Video | Pro vs Standard tiers and remix capabilities

Audio-First Video Generation

Sora 2 Pro breaks from traditional silent video generation by synthesizing audio alongside visual content. Dialogue lip-syncing, environmental sounds, and ambient audio emerge from the same text prompt that describes the scene.

What this means for you:

  • Synchronized dialogue generation: Characters speak naturally, with lip-sync that matches the emotional tone and scene context. No separate audio track is required.

  • Environmental audio integration: Ambient sounds (wind, traffic, footsteps) are generated from the visual elements described in your prompt.

  • Extended temporal coherence: The 25-second maximum duration keeps visuals and audio consistent across longer narrative arcs than typical 4-10 second models allow.

  • Flexible duration control: Generate 4, 8, or 12-second clips for rapid iteration, or push to 25 seconds for complete scene development
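The options above map naturally onto a single request. The sketch below assumes the fal Python client (`fal_client`) and its `subscribe` call; the argument names `prompt`, `duration`, `resolution`, and `aspect_ratio` are assumptions inferred from the options listed on this page, not a confirmed schema, so check the endpoint's API documentation before relying on them.

```python
from typing import Any, Dict

# Endpoint ID as shown at the top of this page.
ENDPOINT = "fal-ai/sora-2/text-to-video/pro"


def build_arguments(
    prompt: str,
    duration: int = 4,
    resolution: str = "720p",
    aspect_ratio: str = "16:9",
) -> Dict[str, Any]:
    """Assemble the request payload.

    The field names are assumptions, not the confirmed input schema.
    """
    return {
        "prompt": prompt,
        "duration": duration,
        "resolution": resolution,
        "aspect_ratio": aspect_ratio,
    }


def generate(prompt: str, **options: Any) -> Any:
    """Submit the request and block until the endpoint returns a result."""
    # Imported lazily so build_arguments works without the package installed.
    import fal_client

    return fal_client.subscribe(ENDPOINT, arguments=build_arguments(prompt, **options))
```

One plausible workflow is to iterate on the prompt with short 720p drafts, for example `generate(prompt, duration=4)`, and request a longer 1080p render only once the scene and dialogue look right, since cost scales with both duration and resolution.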


Technical Specifications

Spec | Details
Architecture | Sora 2 Pro
Input Formats | Text prompts (natural language descriptions)
Output Formats | MP4 video with audio, optional thumbnail and spritesheet
Resolution Options | 720p ($0.30/s), 1080p ($0.50/s)
Duration Range | 4, 8, 12 seconds (standard), up to 25 seconds (extended)
Aspect Ratios | 9:16 (vertical), 16:9 (horizontal)
License | Commercial use via OpenAI API or fal credits
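The specification table can be turned into a client-side check that rejects unsupported settings before any credits are spent. This is a rough sketch: the helper name `validate_settings` is hypothetical, and treating "extended" as any whole number of seconds above 12 and up to 25 is an assumption, since the table does not list the exact extended values.

```python
STANDARD_DURATIONS = (4, 8, 12)   # seconds, listed as "standard"
MAX_DURATION = 25                 # seconds, listed as the extended ceiling
RESOLUTIONS = ("720p", "1080p")
ASPECT_RATIOS = ("9:16", "16:9")


def validate_settings(duration: int, resolution: str, aspect_ratio: str) -> str:
    """Return "standard" or "extended"; raise ValueError for unsupported settings."""
    if resolution not in RESOLUTIONS:
        raise ValueError(f"resolution must be one of {RESOLUTIONS}")
    if aspect_ratio not in ASPECT_RATIOS:
        raise ValueError(f"aspect ratio must be one of {ASPECT_RATIOS}")
    if duration in STANDARD_DURATIONS:
        return "standard"
    # Assumption: any whole-second value in (12, 25] counts as extended.
    if max(STANDARD_DURATIONS) < duration <= MAX_DURATION:
        return "extended"
    raise ValueError(
        f"duration must be one of {STANDARD_DURATIONS} or at most {MAX_DURATION} seconds"
    )
```

Keeping the allowed values in module-level constants means a change to the endpoint's limits needs only a one-line update.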


How It Stacks Up

Sora 2 Text to Video (Standard) – Sora 2 Pro costs more per second in exchange for extended duration and audio synthesis. Standard Sora 2 remains ideal for rapid prototyping and shorter clips where audio isn't required.

Hunyuan Video V1.5 Text to Video – Sora 2 Pro prioritizes audio integration and extended temporal coherence (up to 25s) for narrative-driven content. Hunyuan Video V1.5 emphasizes cost-effective generation for standard-length clips without audio requirements.

LongCat Video Text to Video – Sora 2 Pro delivers native audio synthesis and dialogue lip-syncing for production-ready scenes. LongCat Video focuses on visual-only generation with competitive pricing for silent video workflows.