ideogram/v4/trainer

Train custom LoRAs for personalization, styles or other use cases on top of Ideogram V4.

The cost of training depends on the number of steps. The formula is: 0.00675 * steps. With 1000 steps, your request will cost $6.75.
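As a quick check of the pricing formula above, the following Python sketch computes the cost for a few step counts. The function name `training_cost_usd` is a hypothetical helper, not part of any fal API, and the rate is simply the one quoted above.

```python
from decimal import Decimal

# Per-step rate quoted in the pricing note above (US dollars).
PRICE_PER_STEP_USD = Decimal("0.00675")


def training_cost_usd(steps: int) -> Decimal:
    """Return the quoted training cost in US dollars for a given step count."""
    if steps <= 0:
        raise ValueError("steps must be positive")
    return PRICE_PER_STEP_USD * steps


if __name__ == "__main__":
    for steps in (1000, 2000, 4000):
        print(f"{steps} steps -> ${training_cost_usd(steps):.2f}")
```

`Decimal` is used instead of `float` so that the arithmetic matches the quoted figures exactly.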

Ideogram V4 Trainer User Manual

This guide explains how to use the Ideogram V4 trainer to fine-tune a LoRA adapter for text-to-image generation. The result is a small LoRA weights file you can apply at inference time to teach Ideogram V4 a new subject, character, object, or visual style.

Overview

The Ideogram V4 trainer fine-tunes a LoRA (Low-Rank Adaptation) adapter on top of the Ideogram V4 image model. Instead of retraining the whole model, it learns a compact set of additional weights that nudge the base model toward the look of your dataset.

Key features:

  • Train on your own images and captions from a single zip archive
  • Automatic resolution selection that picks the largest crop your images support, with named presets and custom sizes available as alternatives
  • Style or subject training driven entirely by your captions and trigger phrases
  • Two output formats so the weights drop straight into the fal Ideogram V4 endpoint or into ComfyUI

Input Parameters Reference

Dataset
`images_data_url` (required)

Type: `string`

URL to a zip archive containing your training images and (optionally) caption files.

Supported image formats: `.png`, `.jpg`, `.jpeg`, `.webp`. Files in any other format (including `.heic`, `.heif`, and `.avif`) are not processed and are silently skipped, so convert them to one of the supported formats before zipping.

Captions: Add a `.txt` file with the same base name as each image. The caption file holds the text that describes that image.

sunset.png
sunset.txt    # Contains: "a watercolor painting of a sunset over the ocean"

If an image has no caption file, the trainer falls back to `default_caption` (see below). If neither is present, training stops with an error telling you which image is missing a caption.

Notes:

  • You need at least one image. For most use cases, aim for the dataset sizes in the Tips section below.
  • Files ending in `_mask` (for example `sunset_mask.png`) are ignored, so you can leave masks in the archive without affecting training.
  • Folder structure inside the zip does not matter, but every file's base name must be unique across the archive. During extraction, nested folders are flattened by filename, so two files that share a name (for example `cat/01.png` and `dog/01.png`) collide and one is lost. Give each image (and its caption) a unique name.
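
The archive rules above can be checked locally before uploading. The sketch below is a hypothetical pre-flight helper (`check_archive_names` is not part of the trainer); it only mirrors the rules described in this section and assumes that a name collision means two entries sharing the same filename once folders are flattened.

```python
from collections import Counter
from pathlib import PurePosixPath

# Image formats the manual lists as supported.
SUPPORTED_IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".webp"}


def check_archive_names(names, has_default_caption=False):
    """Return a list of human-readable problems found in a zip's member names."""
    files = []
    for name in names:
        if name.endswith("/"):  # directory entry, nothing to check
            continue
        path = PurePosixPath(name)
        if "__MACOSX" in path.parts or path.name.startswith("._"):
            continue  # macOS metadata entries are ignored
        files.append(path)

    problems = []

    # Folders are flattened by filename, so repeated filenames collide.
    counts = Counter(p.name for p in files)
    for filename, count in sorted(counts.items()):
        if count > 1:
            problems.append(f"duplicate filename: {filename} appears {count} times")

    caption_stems = {p.stem for p in files if p.suffix.lower() == ".txt"}
    images = []
    for p in files:
        suffix = p.suffix.lower()
        if suffix == ".txt":
            continue
        if suffix not in SUPPORTED_IMAGE_SUFFIXES:
            problems.append(f"unsupported format (will be skipped): {p.name}")
        elif not p.stem.endswith("_mask"):  # mask files are ignored
            images.append(p)

    if not images:
        problems.append("no usable images found")
    if not has_default_caption:
        for p in images:
            if p.stem not in caption_stems:
                problems.append(f"missing caption for {p.name}")
    return problems
```

To run it against a real archive, pass the member list from the standard library, for example `check_archive_names(zipfile.ZipFile("dataset.zip").namelist())`.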

Captions
`default_caption`

Type: `string` Default: none

A fallback caption used for any image that does not have its own `.txt` file. This is convenient when every image in your dataset shares the same description (for example, a single style or a single subject).

If you provide caption files for some images and a `default_caption` as well, each image uses its own caption when present and falls back to `default_caption` only when its caption file is missing.

Training Parameters
`steps`

Type: `integer` (100 - 40,000) Default: `1000`

Total number of training steps. More steps expose the adapter to your dataset more times, so it learns the data more strongly, eventually to the point of overfitting.

| Steps | Use Case |
| --- | --- |
| 500-1500 | Styles, or small/simple datasets |
| 1500-3000 | A specific subject, character, or object |
| 3000+ | Larger or more varied datasets (watch for overfitting) |

`learning_rate`

Type: `float` (1e-6 to 1e-2) Default: `0.0001`

How large each training update is. The default works well for most cases. Raise it cautiously for faster learning at the risk of instability; lower it for gentler, slower learning.

`resolution`

Type: `string` Default: `"auto"`

The pixel dimensions images are trained at. Images are scaled to cover this size and center-cropped, but never upscaled, so every image in your dataset must be at least as large as the selected resolution.

You can pass:

  • `auto` (the default) to let the trainer pick the largest size your dataset supports (see "What Happens to Your Data")
  • One of the named presets below
  • A custom `WIDTHxHEIGHT` string (for example `1280x768`), where both numbers must be divisible by 16

| Preset | Dimensions (W×H) | Shape |
| --- | --- | --- |
| `square` | 1024×1024 | Square |
| `landscape` | 1536×1024 | Landscape |
| `portrait` | 1024×1536 | Portrait |
| `widescreen` | 1920×1088 | Wide |
| `ultrawide` | 2048×768 | Very wide |
| `phone_wallpaper` | 1024×1792 | Tall |
| `social_banner` | 1584×400 | Banner |

Limits for custom and auto resolutions: width up to 2048, height up to 1792, and total area up to about 2,088,960 pixels. A custom size beyond these limits is rejected. Every named preset is within these limits by construction, so any preset is always accepted.

Guidance: Pick the resolution that matches what you intend to generate later. If your dataset images are smaller than every preset, use `auto` so the trainer can size the crop to your images instead of rejecting them.
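
The resolution rules above can be expressed as a small validator. This is a sketch under stated assumptions: `parse_resolution` is a hypothetical local helper, the preset table and limits are copied from this section, and the exact parsing the service applies (for example, whether an uppercase `X` is accepted) is not documented here.

```python
import re

# Preset sizes and limits as listed in this manual.
PRESETS = {
    "square": (1024, 1024),
    "landscape": (1536, 1024),
    "portrait": (1024, 1536),
    "widescreen": (1920, 1088),
    "ultrawide": (2048, 768),
    "phone_wallpaper": (1024, 1792),
    "social_banner": (1584, 400),
}
MAX_WIDTH, MAX_HEIGHT, MAX_AREA = 2048, 1792, 2_088_960


def parse_resolution(value):
    """Return (width, height) for a preset or WIDTHxHEIGHT string, or None for 'auto'."""
    if value == "auto":
        return None  # resolved later from the dataset itself
    if value in PRESETS:
        return PRESETS[value]
    match = re.fullmatch(r"(\d+)x(\d+)", value)  # assumes a lowercase 'x'
    if not match:
        raise ValueError(f"not 'auto', a preset, or WIDTHxHEIGHT: {value!r}")
    width, height = int(match[1]), int(match[2])
    if width == 0 or height == 0:
        raise ValueError("width and height must be positive")
    if width % 16 or height % 16:
        raise ValueError("width and height must both be divisible by 16")
    if width > MAX_WIDTH or height > MAX_HEIGHT or width * height > MAX_AREA:
        raise ValueError("size exceeds the supported limits")
    return width, height
```

For example, `parse_resolution("1280x768")` returns `(1280, 768)`, while `"1280x770"` fails the divisibility check and `"2048x1792"` fails the area limit even though each side is within its own cap.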

Output
`output_lora_format`

Type: `string` (`"fal"` or `"comfy"`) Default: `"fal"`

Naming scheme for the keys inside the produced weights file.

| Value | Use Case |
| --- | --- |
| `fal` | Use the adapter with fal's Ideogram V4 inference endpoint |
| `comfy` | Use the adapter in ComfyUI's Ideogram V4 workflow |

The two files contain the same trained weights; only the internal key names differ. Choose the one that matches where you will load the adapter.

How the Training Works

Pipeline Overview
  1. Preprocessing
    1. Your archive is downloaded and extracted.
    2. Images are discovered and validated; captions are matched to images.
    3. The training resolution is resolved (either from your choice or inferred automatically).
    4. Each image is cropped and prepared along with its caption.
  2. Training
    1. The adapter is trained for the number of `steps` you requested.
  3. Output
    1. The trained LoRA weights file is produced in your chosen format.
    2. A configuration file recording your settings is produced alongside it.

What Happens to Your Data

Archive extraction: The zip is unpacked, including any nested folders. Mac-specific junk entries (`__MACOSX`, files beginning with `._`) are ignored. Only `.png`, `.jpg`, `.jpeg`, and `.webp` images are used; files in other formats are skipped.

File matching: Each image is paired with a caption file that has the exact same base name (`name.png` pairs with `name.txt`). Images whose name ends in `_mask` are skipped entirely. If a caption file is missing, the image uses `default_caption`; if there is no `default_caption` either, training stops and tells you which caption is missing.

Orientation: Images are auto-rotated to their correct upright orientation based on their embedded orientation metadata before anything else happens.

Image fitting: Each image is scaled to cover the target resolution while keeping its aspect ratio, then center-cropped to the exact width and height. Images are never upscaled, so any image smaller than the chosen resolution is rejected with a message telling you to pick a smaller resolution or upload larger images. Very large images (beyond the platform's maximum pixel limit) are also rejected.
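
The geometry of the fitting step can be sketched with plain arithmetic. This is an illustration of a cover-then-center-crop as described above, not the trainer's actual code; `plan_cover_crop` is a hypothetical helper, and the rounding choice (rounding the resized side up so the image always covers the target) is an assumption.

```python
def plan_cover_crop(width, height, target_width, target_height):
    """Return ((resized_w, resized_h), (left, top, right, bottom)) for a cover-and-center-crop."""
    if width < target_width or height < target_height:
        # Covering the target would require upscaling, which is never done.
        raise ValueError("image is smaller than the target resolution")

    # Compare target_width/width with target_height/height using integers only,
    # which avoids floating-point rounding surprises.
    if target_width * height >= target_height * width:
        # Width is the binding side: match it exactly, let height overshoot.
        resized_w = target_width
        resized_h = -(-height * target_width // width)  # ceiling division
    else:
        # Height is the binding side: match it exactly, let width overshoot.
        resized_h = target_height
        resized_w = -(-width * target_height // height)  # ceiling division

    # Center the crop window inside the resized image.
    left = (resized_w - target_width) // 2
    top = (resized_h - target_height) // 2
    return (resized_w, resized_h), (left, top, left + target_width, top + target_height)
```

For a 1600×1200 image and a 1024×1024 target, the image is scaled to 1366×1024 and the crop box is `(171, 0, 1195, 1024)`, so roughly equal strips are trimmed from the left and right edges.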

Auto resolution: When `resolution` is `auto`, the trainer inspects every image, finds the smallest width and smallest height across the dataset, and rounds each down to a multiple of 16. That becomes the training size, so the largest possible crop is used without upscaling any image. Note that `auto` does not shrink the result to fit the limits: if your smallest images are large enough that the inferred size exceeds the caps (width 2048, height 1792, or area of about 2,088,960 pixels), which is common for high-resolution datasets, training fails with an error asking you to choose a preset or a smaller custom resolution. Pick an explicit `resolution` in that case.
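
The `auto` rule described above can be reproduced locally to predict which size the trainer will pick, or whether it will fail. The sketch below is a hypothetical helper (`infer_auto_resolution` is not the trainer's code); it assumes the image sizes passed in are measured after orientation correction.

```python
MAX_WIDTH, MAX_HEIGHT, MAX_AREA = 2048, 1792, 2_088_960  # limits from this manual


def infer_auto_resolution(sizes):
    """Infer the training size from an iterable of (width, height) image sizes."""
    sizes = list(sizes)
    if not sizes:
        raise ValueError("no images")
    # Smallest width and smallest height, each rounded down to a multiple of 16.
    width = min(w for w, _ in sizes) // 16 * 16
    height = min(h for _, h in sizes) // 16 * 16
    if width == 0 or height == 0:
        raise ValueError("images are too small to train on")
    # 'auto' does not shrink the result to fit the limits; it fails instead.
    if width > MAX_WIDTH or height > MAX_HEIGHT or width * height > MAX_AREA:
        raise ValueError("inferred size exceeds the limits; choose a preset or a smaller custom resolution")
    return width, height
```

With images of 1200×900 and 1000×1300, the smallest width is 1000 and the smallest height is 900, which round down to a training size of 992×896. A dataset whose smallest image is 4000×3000 raises an error, which matches the failure case described above.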

Captions: Caption text is read as-is from your `.txt` files (or taken from `default_caption`). Captions are not rewritten, summarized, or auto-generated. Whatever you write is exactly what the adapter learns to associate with each image.

LoRA Training

A LoRA adapter is a small set of extra weights layered onto the base Ideogram V4 model. Training only updates these extra weights, leaving the base model untouched. This keeps the output file small and makes it easy to switch the learned look on or off at inference time. The base model itself is never modified or redistributed.

Tips for Getting Good Results

Dataset Quality

Your dataset is the single biggest factor in the result. A quality checklist:

  • Consistency: For a subject or character, show the same subject across varied poses, angles, lighting, and backgrounds. For a style, show varied content all rendered in that same style.
  • Resolution: Provide images at least as large as your target resolution. If many images are small, use `resolution: auto` so they are not rejected.
  • Clean images: Avoid watermarks, heavy compression artifacts, borders, or collages. The adapter will faithfully reproduce whatever junk is in the data.
  • Variety where it matters: Don't use near-duplicate images. Variety in the things you want to stay flexible (pose, background) helps the model generalize.

Suggested starting sizes:

| Goal | Image Count |
| --- | --- |
| A consistent style | 10-30 images |
| A specific subject / character / object | 10-30 images |
| A complex or highly varied concept | 30+ images |

Caption Best Practices
  • Describe what is actually in each image in plain language.
  • Be consistent in how you phrase recurring elements across captions.
  • Caption the things you want the model to be able to vary at generation time (clothing, background, pose). Things you describe become "controllable"; things you leave out get baked into the concept.

Good caption:

a photo of sks dog sitting on a wooden porch, soft afternoon light

Weak caption:

dog

Trigger Phrases

A trigger phrase is a distinctive word or short phrase you include in every caption so you can summon the learned concept at generation time.

  • Subject / object / character training: Use a rare, unusual trigger token (for example `sks dog`, `tok_woman`, `mybrand logo`) in every caption. At inference, include that same trigger phrase in your prompt to invoke the subject.
  • Style training: Either use a consistent trailing phrase (for example `in the style of mystyle`) on every caption, or rely on `default_caption` to apply one shared description to the whole set. At inference, add that phrase to steer the style.
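
Because a trigger phrase only works reliably when it appears in every caption, it can be worth checking the captions before zipping. The sketch below is a hypothetical helper (`captions_missing_trigger`); it assumes captions have already been read into a dictionary keyed by filename and that a case-insensitive substring match is an acceptable test.

```python
def captions_missing_trigger(captions, trigger):
    """Return the sorted names of captions that do not contain the trigger phrase."""
    needle = trigger.casefold()  # case-insensitive comparison (an assumption)
    return sorted(
        name for name, text in captions.items() if needle not in text.casefold()
    )


if __name__ == "__main__":
    captions = {
        "01.txt": "a photo of sks dog sitting on a wooden porch, soft afternoon light",
        "02.txt": "a dog running on a beach",
    }
    print(captions_missing_trigger(captions, "sks dog"))
```

An empty result means every caption contains the phrase.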

Captions and Inference Must Match

The adapter learns the relationship between your caption wording and your images. At generation time, prompt in the same style and format you used in training, including your trigger phrase. If you trained with short tag-like captions, short prompts will work best; if you trained with full descriptive sentences, prompt with full sentences.

A reasonable starting configuration (JSON):

{
  "images_data_url": "https://your-host/dataset.zip",
  "steps": 1000,
  "resolution": "auto",
  "learning_rate": 0.0001,
  "default_caption": "a photo of sks subject",
  "output_lora_format": "fal"
}

Run this first, then generate a few images with the resulting LoRA and adjust `steps` (and optionally `learning_rate`) based on what you see.
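
When assembling the request in code, the documented ranges can be checked before anything is submitted. The sketch below is a hypothetical local helper (`build_training_request`) that only builds and validates the payload; it does not call any fal API, and the ranges are the ones given in the parameter reference above.

```python
import json


def build_training_request(images_data_url, *, steps=1000, learning_rate=1e-4,
                           resolution="auto", default_caption=None,
                           output_lora_format="fal"):
    """Build the training input as a dict, checking the documented ranges."""
    if not 100 <= steps <= 40_000:
        raise ValueError("steps must be between 100 and 40,000")
    if not 1e-6 <= learning_rate <= 1e-2:
        raise ValueError("learning_rate must be between 1e-6 and 1e-2")
    if output_lora_format not in ("fal", "comfy"):
        raise ValueError("output_lora_format must be 'fal' or 'comfy'")

    request = {
        "images_data_url": images_data_url,
        "steps": steps,
        "resolution": resolution,
        "learning_rate": learning_rate,
        "output_lora_format": output_lora_format,
    }
    # default_caption has no default, so include it only when provided.
    if default_caption is not None:
        request["default_caption"] = default_caption
    return request


if __name__ == "__main__":
    payload = build_training_request(
        "https://your-host/dataset.zip",
        default_caption="a photo of sks subject",
    )
    print(json.dumps(payload, indent=2))
```

Keeping validation next to payload construction means an out-of-range `steps` or `learning_rate` is caught locally rather than after the upload.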

Diagnosing Issues

Judge results by applying the trained LoRA at inference time and looking at what it generates. The job returns the LoRA weights and a config file only — no intermediate metrics or previews — so evaluate the finished adapter directly.

Signs of overfitting (trained too hard):

  • Generated images look like near-copies of your training images regardless of the prompt.
  • Backgrounds, poses, or colors from the dataset show up even when your prompt asks for something else.
  • The model ignores parts of new prompts.
  • Fix: lower `steps`, add more variety to the dataset, or apply the LoRA at a lower scale at inference time.

Signs of underfitting (trained too little):

  • The learned subject or style barely shows up, even with the trigger phrase.
  • Output looks like the plain base model.
  • Fix: increase `steps`, make sure your trigger phrase is in every caption, or add more representative images.

Training failures:

  • "Missing caption": an image has no matching `.txt` and no `default_caption` was set. Add captions or set `default_caption`.
  • Resolution too large for an image: an image is smaller than the chosen resolution. Switch to `resolution: auto` or pick a smaller size.
  • Resolution rejected / too large: the requested size exceeds the supported limits (this applies both to a custom `WIDTHxHEIGHT` and to a size that `auto` inferred from large images), or a custom size is not divisible by 16. Pick a preset, or a smaller custom size that is a multiple of 16.
  • No images found: the archive contained no usable images — often because they are in an unsupported format (e.g. HEIC). Convert images to `.png`/`.jpg`/`.jpeg`/`.webp`, re-create the zip, and confirm it opens on your computer.

Common Pitfalls
  • Mismatched prompt format at inference. Train and generate with the same caption style and trigger phrase.
  • Images smaller than the resolution. Use `auto` when in doubt.
  • Inconsistent or missing trigger phrase. It must appear in every caption to be reliable.
  • Dirty data. Watermarks, borders, and duplicates get learned faithfully.
  • Too many steps on a tiny dataset. This is the fastest route to overfitting.
  • Wrong output format. Use `fal` for the fal endpoint and `comfy` for ComfyUI.