Cosmos 3 Super
Omnimodal Generation Across Image, Video, and Action
NVIDIA's omnimodal world model, ranked #1 among open-weights models in both Text-to-Image and Image-to-Video on the Artificial Analysis leaderboards.
What Makes NVIDIA Cosmos 3 Super Different

Top Open-Weights Quality, Text to Image
Cosmos3-Super-Text2Image takes the #1 open-weights spot in Text-to-Image on the Artificial Analysis leaderboard. It runs through an agentic prompt-upsampling harness, turning a simple prompt into the structured detail the generator needs for sharp, faithful results.
Top Open-Weights Quality, Image to Video
Cosmos3-Super-Image2Video takes the #1 open-weights spot in Image-to-Video (No Audio) on the Artificial Analysis leaderboard. Hand it a single frame and it expands the scene into coherent motion grounded in real-world physics.

One Model Across Image, Video, Audio, and Action
Cosmos 3 is a collection of omnimodal world models capable of generating dynamic, high-quality video, image, audio, and action commands from combinations of text, image, video, and action trajectory inputs. A single model spans these modalities and supports flexible input-output configurations.
Text-to-image and image-to-video in one API
Both Cosmos 3 Super endpoints run on fal's serverless API with pay-per-use pricing and no GPUs to manage.
See what Cosmos 3 Super can create
A mix of text-to-image stills and image-to-video clips, each generated on fal.

"A weathered fishing boat moored at a misty harbor at dawn, coiled ropes and chipped paint, soft directional light across the hull"
"The moored boat rocks gently as ripples spread across the water and mist drifts past the bow, gulls crossing the frame"

"Exploded technical view of a mechanical wristwatch, gears, springs, and screws suspended mid-air, clean studio lighting"
"Starting from this exact exploded vertical view of a luxury mechanical wristwatch, the suspended parts begin to assemble themselves in mid-air. Brass gears spin into alignment, tiny screws spiral downward, springs coil and tighten, ruby bearings glint as they lock into place, and polished steel plates glide together with precise magnetic motion. The dial, hands, crown, case, and crystal slowly descend and snap into perfect position, creating a smooth cinematic “watch forming together” sequence. Hyper-detailed macro product shot, clean white studio background, soft diffused lighting, realistic metal reflections, subtle motion blur on rotating gears, elegant slow-motion engineering aesthetic, premium horology, photorealistic 3D render, crisp focus, high-end luxury watch commercial style."
A few lines of code.
Image and video output.
fal.ai handles the infrastructure: fast inference, auto-scaling, and a developer-friendly API. No GPUs to manage.
- Serverless: scales to zero, scales to millions
- Text-to-image and image-to-video from one SDK
- Python and JavaScript SDKs, plus REST API
import fal_client

result = fal_client.run(
    "nvidia/cosmos-3-super/text-to-image",
    arguments={
        "prompt": "A weathered fishing boat moored at a misty harbor at dawn, soft directional light",
    }
)
# result["images"][0]["url"] → your generated image

Common questions about Cosmos 3 Super
What is Cosmos 3 Super?
Cosmos 3 is a collection of omnimodal world models capable of generating dynamic, high-quality video, image, audio, and action commands from combinations of text, image, video, and action trajectory inputs. Cosmos 3 Super is NVIDIA's 64B model in the family, and its Text-to-Image and Image-to-Video fine-tuned variants rank #1 among open-weights models on the Artificial Analysis leaderboards.
What can I do with Cosmos 3 Super on fal?
Two endpoints are available on fal: text-to-image, which generates images from a prompt, and image-to-video, which animates a reference image into video with coherent, physically grounded motion. Both run through fal's serverless API with no GPUs to manage.
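As a rough sketch of how the image-to-video endpoint might be called, the snippet below builds a request payload and submits it with fal_client. The endpoint ID is inferred by analogy with the text-to-image ID used in the code sample on this page, and the argument names image_url and prompt are assumptions rather than confirmed parameters; check the API documentation before relying on them.

```python
# Sketch only: the endpoint ID and the argument names below are assumptions,
# not confirmed parameters of the Cosmos 3 Super API on fal.
ASSUMED_I2V_ENDPOINT = "nvidia/cosmos-3-super/image-to-video"


def build_image_to_video_arguments(image_url: str, prompt: str) -> dict:
    """Assemble a request payload for the (assumed) image-to-video endpoint."""
    if not image_url:
        raise ValueError("image_url must not be empty")
    return {"image_url": image_url, "prompt": prompt.strip()}


def animate(image_url: str, prompt: str) -> dict:
    """Submit the request; needs the fal-client package and an API key."""
    import fal_client  # imported lazily so the payload helper works offline

    return fal_client.run(
        ASSUMED_I2V_ENDPOINT,
        arguments=build_image_to_video_arguments(image_url, prompt),
    )
```

Keeping the payload builder separate from the network call makes it easy to inspect or log the request before spending credits on a generation.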
How does Cosmos 3 Super rank on the Artificial Analysis leaderboards?
Cosmos 3 Super's fine-tuned variants rank #1 among open-weights models on two Artificial Analysis leaderboards: Text-to-Image and Image-to-Video (No Audio).
How does the Cosmos 3 architecture work?
Cosmos 3 uses a single Mixture-of-Transformers that pairs an autoregressive reasoner with a diffusion generator. The reasoner interprets and plans, while the generator produces pixels. This unified design lets one model span language, image, video, audio, and action for Physical AI.
What are the Cosmos 3 variants?
The Cosmos 3 family comes in two base sizes: Nano (16B, an 8B reasoner tower plus an 8B generator tower) and Super (64B, a 32B reasoner tower plus a 32B generator tower). The Super model also has Text-to-Image and Image-to-Video fine-tuned variants, which are the versions ranked on the Artificial Analysis Arena leaderboards.
How much does Cosmos 3 Super cost on fal?
Pricing is pay-per-use with no minimums or subscriptions. Text-to-image costs $0.04 per generated image, with optional prompt expansion adding $0.02 per request. Image-to-video is billed at $0.05 per second of generated video, rounded up.
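To make the arithmetic concrete, here is a small cost estimator built from the prices quoted above. It works in whole cents to avoid floating-point drift. Two points are assumptions on my part: that "rounded up" means the video duration is rounded up to the next whole second, and that a single request can return more than one image.

```python
import math

# Prices in cents, taken from the figures quoted above.
IMAGE_CENTS = 4             # $0.04 per generated image
PROMPT_EXPANSION_CENTS = 2  # $0.02 per request when prompt expansion is on
VIDEO_CENTS_PER_SECOND = 5  # $0.05 per second of generated video


def text_to_image_cost_cents(num_images: int, prompt_expansion: bool = False) -> int:
    """Cost of one text-to-image request that produces num_images images."""
    if num_images < 1:
        raise ValueError("num_images must be at least 1")
    cost = num_images * IMAGE_CENTS
    if prompt_expansion:
        cost += PROMPT_EXPANSION_CENTS  # charged once per request, not per image
    return cost


def image_to_video_cost_cents(duration_seconds: float) -> int:
    """Assumption: the duration is rounded up to a whole second before billing."""
    return math.ceil(duration_seconds) * VIDEO_CENTS_PER_SECOND


print(text_to_image_cost_cents(3, prompt_expansion=True))  # 3 * 4 + 2 = 14 cents
print(image_to_video_cost_cents(4.2))                      # ceil(4.2) * 5 = 25 cents
```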
Why does Cosmos 3 use structured JSON prompts?
Cosmos 3 generators take structured JSON prompts rather than plain text, and prompting in that format is what reproduces the leaderboard-topping results. Prompt upsampling converts a simple prompt into that structured form. It can be handled by an external agentic harness or by the model's own reasoner branch, so Cosmos 3 can also run self-contained, without an external harness.
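The shape of that conversion can be illustrated with a toy stand-in. The Cosmos 3 prompt schema is not described here, so the keys "subject" and "details" below are invented purely for illustration, and a real upsampler would be a language model rather than a comma splitter. The only point is the plain-text-in, JSON-out contract.

```python
import json


def upsample_prompt_stub(plain_prompt: str) -> str:
    """Toy stand-in for prompt upsampling: plain text in, JSON string out.

    The keys "subject" and "details" are hypothetical, NOT the Cosmos 3 schema.
    """
    clauses = [c.strip() for c in plain_prompt.split(",") if c.strip()]
    if not clauses:
        raise ValueError("prompt must not be empty")
    # First clause becomes the subject; the remaining clauses become details.
    structured = {"subject": clauses[0], "details": clauses[1:]}
    return json.dumps(structured, indent=2)


print(upsample_prompt_stub(
    "A weathered fishing boat moored at a misty harbor at dawn, "
    "coiled ropes and chipped paint, soft directional light across the hull"
))
```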
How do I get started with the API?
Install the fal SDK (Python or JavaScript), grab an API key from your dashboard, and make your first request in a few lines of code. The API is serverless, so there are no GPUs to manage and no infrastructure to set up. Check the API documentation for all available parameters.
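A minimal first-request sketch in Python might look like the following. It assumes the SDK is installed with pip install fal-client and that the API key is supplied through the FAL_KEY environment variable; both reflect how the fal Python client is commonly set up, but confirm them against the API documentation. The helper names are hypothetical, and the response shape follows the result["images"][0]["url"] comment in the code sample on this page.

```python
import os

# Endpoint ID as used in the code sample on this page.
ENDPOINT = "nvidia/cosmos-3-super/text-to-image"


def has_api_key(env=os.environ) -> bool:
    """True if a non-empty FAL_KEY is present (assumed credential variable)."""
    return bool(env.get("FAL_KEY", "").strip())


def first_image_url(result: dict) -> str:
    """Pull the first image URL from a response shaped like result["images"][0]["url"]."""
    return result["images"][0]["url"]


def generate(prompt: str) -> str:
    """Make one text-to-image request and return the URL of the first image."""
    if not has_api_key():
        raise RuntimeError("Set FAL_KEY to your fal API key before making requests")
    import fal_client  # lazy import: requires `pip install fal-client`

    result = fal_client.run(ENDPOINT, arguments={"prompt": prompt})
    return first_image_url(result)
```

Checking for the key up front gives a clear error message instead of a failed request, which helps when the code runs in CI or on a fresh machine.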
Can I use Cosmos 3 Super for commercial projects?
Yes. Content generated through the fal API can be used in commercial projects. Check fal's terms of service for full details on usage rights and licensing.

