fal-ai/omnigen-v1

OmniGen is a unified image generation model that can generate a wide range of images from multi-modal prompts. It supports tasks such as image editing, personalized image generation, virtual try-on, and multi-person generation.

Result

{
  "seed": 5536220984035758000,
  "images": [
    {
      "url": "https://fal.media/files/tiger/McOSLy9PtgUjSEC_xyM0c.jpeg",
      "width": 1024,
      "height": 1024,
      "content_type": "image/jpeg"
    }
  ],
  "prompt": "Neon words \"Omni Gen\" are flashing in the prosperous future city, the sense of science and technology, quality details, hyper realistic, high definition, 8K, photo, best quality, high quality.",
  "timings": {
    "inference": 24.059714447706938
  },
  "has_nsfw_concepts": [
    false
  ]
}
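
A result shaped like the one above could be consumed as in the following minimal sketch. It assumes that `has_nsfw_concepts` is aligned with `images` by index, which the sample suggests but does not state; `safe_image_urls` is a hypothetical helper name, and the sample payload is trimmed from the result shown above.

```python
import json

# Sample payload, trimmed from the result shown above.
SAMPLE_RESPONSE = """
{
  "seed": 5536220984035758000,
  "images": [
    {
      "url": "https://fal.media/files/tiger/McOSLy9PtgUjSEC_xyM0c.jpeg",
      "width": 1024,
      "height": 1024,
      "content_type": "image/jpeg"
    }
  ],
  "timings": {"inference": 24.059714447706938},
  "has_nsfw_concepts": [false]
}
"""


def safe_image_urls(response: dict) -> list:
    """Return the URLs of images that are not flagged as NSFW.

    Assumption: has_nsfw_concepts[i] describes images[i]. An image with
    no corresponding flag is treated as not flagged.
    """
    images = response.get("images", [])
    flags = response.get("has_nsfw_concepts", [])
    urls = []
    for index, image in enumerate(images):
        flagged = flags[index] if index < len(flags) else False
        if not flagged:
            urls.append(image["url"])
    return urls


result = json.loads(SAMPLE_RESPONSE)
urls = safe_image_urls(result)
print(urls)
```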

Your request will cost $0.10 per processed megapixel.

OmniGen v1 | [text/image-to-image]

OmniGen is a unified image generation model that handles multi-modal prompts across image editing, personalized generation, virtual try-on, and multi-person composition at $0.10 per processed megapixel. Trading specialized single-task architectures for unified multi-modal flexibility, it consolidates workflows that typically require switching between separate models. This matters for production teams managing complex image generation pipelines where reference images, editing constraints, and generation need to work together seamlessly.

Use Cases: Image Editing | Personalized Image Generation | Virtual Try-On

Performance

OmniGen's unified architecture processes up to 4 megapixels with multi-image input support at $0.10 per processed megapixel.

Metric	Result	Context
Resolution Support	Up to 4 megapixels (2048×2048)	Multiple preset sizes including square_hd
Multi-Image Input	Multiple reference images	Via `<|image_1|>` syntax in prompts
Inference Steps	1-50 steps (default 50)	Configurable quality/speed tradeoff
Cost per Megapixel	$0.10	About 10 one-megapixel images per $1.00
Batch Generation	1-4 images per request	Parallel output with shared inference cost
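
The per-megapixel pricing in the table can be turned into a rough estimate. This is a sketch under stated assumptions, not the billing formula: it takes one megapixel to be 1,000,000 pixels, applies no rounding, and bills each image in a batch separately. `estimate_cost_usd` is a hypothetical helper name.

```python
PRICE_PER_MEGAPIXEL_USD = 0.10  # rate stated in the table above


def estimate_cost_usd(width: int, height: int, num_images: int = 1) -> float:
    """Estimate the cost of a request from its output dimensions.

    Assumptions (not confirmed by the source): 1 megapixel = 1,000,000
    pixels, no rounding, and every image in a batch is billed in full.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if num_images < 1:
        raise ValueError("num_images must be at least 1")
    megapixels = width * height / 1_000_000
    return megapixels * PRICE_PER_MEGAPIXEL_USD * num_images


# A single 1024x1024 image is slightly over one megapixel.
single = estimate_cost_usd(1024, 1024)
# Four 2048x2048 images, the largest batch and resolution listed above.
batch = estimate_cost_usd(2048, 2048, num_images=4)
print(f"single: ${single:.4f}, batch: ${batch:.4f}")
```

Under these assumptions a 1024×1024 image comes to about $0.105, which is why "10 images per dollar" is only approximate.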

OmniGen consolidates image editing, personalization, and generation into a single unified model that accepts text prompts with embedded image references. Where traditional workflows require separate models for inpainting, style transfer, and generation, OmniGen processes multi-modal inputs through a shared architecture with dual guidance controls.

What this means for you:

Reference Image Integration: Embed multiple images directly in text prompts using `<|image_1|>` syntax. Personalization and style transfer workflows need no separate preprocessing pipeline
Dual Guidance Control: Independent text guidance (0-20 CFG) and image guidance (0-20) scales let you balance prompt adherence against reference image influence for precise creative control
Task-Agnostic Pipeline: Handle virtual try-on, multi-person generation, and targeted editing through the same endpoint, which consolidates API calls and simplifies production infrastructure
Flexible Output Configuration: Generate 1-4 images per request with deterministic seeding, configurable inference steps (1-50), and JPEG/PNG output format selection
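
The controls listed above can be sketched as a request payload builder. Only `prompt`, `input_image_urls`, and `output_format` are named in this page; the other keys (`guidance_scale`, `img_guidance_scale`, `num_inference_steps`, `num_images`, `seed`), the lowercase format values, and the helper name `build_arguments` are assumptions made for illustration. Optional settings are left out of the payload when unset, so the service defaults would apply.

```python
import re

# Matches placeholders such as <|image_1|> and captures the image number.
IMAGE_TAG = re.compile(r"<\|image_(\d+)\|>")


def build_arguments(prompt, input_image_urls, guidance_scale=None,
                    img_guidance_scale=None, num_inference_steps=None,
                    num_images=None, output_format=None, seed=None):
    """Build a request payload, checking the ranges listed above."""
    # Every <|image_N|> tag must point at a supplied URL (1-based).
    for n in sorted({int(m) for m in IMAGE_TAG.findall(prompt)}):
        if not 1 <= n <= len(input_image_urls):
            raise ValueError(f"prompt references image {n}, but only "
                             f"{len(input_image_urls)} URL(s) were given")

    def check(name, value, low, high):
        if value is not None and not low <= value <= high:
            raise ValueError(f"{name} must be between {low} and {high}")

    check("guidance_scale", guidance_scale, 0, 20)
    check("img_guidance_scale", img_guidance_scale, 0, 20)
    check("num_inference_steps", num_inference_steps, 1, 50)
    check("num_images", num_images, 1, 4)
    if output_format is not None and output_format not in ("jpeg", "png"):
        raise ValueError("output_format must be 'jpeg' or 'png'")

    arguments = {"prompt": prompt, "input_image_urls": list(input_image_urls)}
    # Key names below other than output_format are assumed, not confirmed.
    optional = {
        "guidance_scale": guidance_scale,
        "img_guidance_scale": img_guidance_scale,
        "num_inference_steps": num_inference_steps,
        "num_images": num_images,
        "output_format": output_format,
        "seed": seed,
    }
    arguments.update({k: v for k, v in optional.items() if v is not None})
    return arguments


# Placeholder URLs for illustration only.
args = build_arguments(
    "The woman in <|image_1|> wearing the jacket from <|image_2|>",
    ["https://example.com/person.jpg", "https://example.com/jacket.jpg"],
    guidance_scale=3.0,
    num_images=2,
    output_format="png",
)
print(args)
```

A payload like this could then be submitted to the `fal-ai/omnigen-v1` endpoint with whichever client you use. Validating locally first avoids paying for a request that references an image that was never attached.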

Technical Specifications

Spec	Details
Architecture	OmniGen
Input Formats	Text prompts, multiple image URLs via `input_image_urls` parameter
Output Formats	JPEG, PNG (configurable via `output_format`)
Resolution Options	square_hd, square, portrait_4_3, portrait_16_9, landscape_4_3, landscape_16_9 (up to 4MP)
License	Commercial use enabled

How It Stacks Up

AuraFlow Text to Image ($0.055 per image) – OmniGen prioritizes multi-modal prompt integration and task flexibility at $0.10 per processed megapixel versus AuraFlow's single-task text-to-image focus. AuraFlow delivers faster inference for straightforward text-to-image generation where reference image integration isn't required, giving up multi-modal capabilities for a cost that is roughly 1.8x lower at a one-megapixel output ($0.055 versus about $0.10 per image).