fal-ai/ltx23-trainer-v2/i2v

Fine-tune LTX 2.3 to animate a starting image — supply a still plus a prompt at inference and the model generates a video that begins from that frame.


Training cost scales with the number of steps: $0.0024 × steps. For example, a 1000-step run costs $2.40.


LTX 2.3 Trainer — Image-to-Video (`/i2v`)

Overview

The `/i2v` endpoint trains a LoRA adapter for the LTX 2.3 video model that learns to animate a starting image. The first frame is held as a condition during training, so at inference you supply a still image plus a prompt and the model generates a video that begins from that image.

Key features:

  • Learns to bring a subject or style to life starting from a first-frame image.
  • Adjustable first-frame conditioning: at the default `0.5`, the LoRA is versatile and works for both image-to-video and plain text-to-video inference.
  • Optional joint audio training (auto-detected — audio is used only if every clip has a track).
  • Trains on videos or images, with an optional trigger phrase.
  • Validation previews (each with a first-frame image) generated during training.

Dataset Format

Provide a single `.zip` archive (linked via `training_data_url`) containing your clips and optional captions:

  • Videos: `.mp4`, `.mov`, `.avi`, `.mkv`
  • Images: `.png`, `.jpg`, `.jpeg`
  • Captions: a `.txt` file with the same base name as each media file (e.g. `clip01.mp4` + `clip01.txt`). Optional but recommended.

The archive must contain only videos OR only images. Aim for at least 10 files. Files in subfolders are fine — clips with the same name in different subfolders are kept distinct automatically. The first frame of each clip is what the model learns to condition on, so make sure clips open on a representative, clean frame.
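
The archive rules above can be checked locally before uploading. The following is a minimal sketch, not part of the trainer: the function name `check_archive` and the fields it returns are hypothetical, and it assumes a caption must sit in the same folder as its media file. It uses only Python's standard `zipfile` module.

```python
import zipfile
from pathlib import PurePosixPath

VIDEO_EXTS = {".mp4", ".mov", ".avi", ".mkv"}
IMAGE_EXTS = {".png", ".jpg", ".jpeg"}


def check_archive(zip_path):
    """Hypothetical pre-upload check mirroring the dataset rules described above."""
    with zipfile.ZipFile(zip_path) as zf:
        # Skip directory entries and macOS metadata folders.
        names = [
            n for n in zf.namelist()
            if not n.endswith("/") and "__MACOSX" not in PurePosixPath(n).parts
        ]
    paths = [PurePosixPath(n) for n in names]
    videos = [p for p in paths if p.suffix.lower() in VIDEO_EXTS]
    images = [p for p in paths if p.suffix.lower() in IMAGE_EXTS]
    if videos and images:
        raise ValueError("archive mixes videos and images; use only one kind")
    media = videos or images
    if not media:
        raise ValueError("archive contains no supported media files")
    # Assumption: a caption shares the folder and base name of its media file.
    captions = {p.with_suffix("") for p in paths if p.suffix.lower() == ".txt"}
    missing = [str(p) for p in media if p.with_suffix("") not in captions]
    return {
        "kind": "video" if videos else "image",
        "count": len(media),
        "enough_files": len(media) >= 10,  # the "at least 10 files" guideline
        "missing_captions": missing,
    }
```

Calling `check_archive("my_dataset.zip")` before uploading gives a quick summary; a non-empty `missing_captions` list is only a warning, since captions are optional.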

Minimum clip length: with `auto_scale_input` off (the default), each video must already have at least `number_of_frames` frames (default 89, ≈ 3.7 s at 24 fps). Shorter clips are silently skipped, and if every clip is too short the request fails (422, "All training videos are too short to be trainable"). Turn on `auto_scale_input` to resample shorter clips to the target frame count instead. This applies to video datasets only; images are single-frame and always usable.
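
The minimum-length rule can be expressed as a small filter. This is an illustrative sketch only: `usable_clips` and `min_duration_seconds` are hypothetical helpers, and the frame counts are assumed to come from whatever tool you already use to inspect your videos.

```python
def min_duration_seconds(number_of_frames=89, frame_rate=24):
    """Shortest clip duration that still provides number_of_frames frames."""
    return number_of_frames / frame_rate


def usable_clips(frame_counts, number_of_frames=89, auto_scale_input=False):
    """Return the clip names kept under the rule described above.

    frame_counts maps a clip name to its frame count.
    """
    if auto_scale_input:
        # Shorter clips are resampled to the target count, so none are dropped.
        return sorted(frame_counts)
    kept = sorted(n for n, f in frame_counts.items() if f >= number_of_frames)
    if not kept:
        # Mirrors the 422 failure described above.
        raise ValueError("All training videos are too short to be trainable")
    return kept
```

For example, with the defaults a 120-frame clip is kept and a 60-frame clip is skipped unless `auto_scale_input` is on.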

Input Parameters Reference

Dataset
`training_data_url` (required)

Type: `string`

URL to the `.zip` archive of training clips (and optional captions).

`trigger_phrase`

Type: `string` Default: `""`

A phrase prepended to every caption during training and used at inference to activate the learned concept.

Training Parameters
`rank`

Type: `integer` (one of `8`, `16`, `32`, `64`, `128`) Default: `32`

LoRA capacity. Higher values capture more detail but use more memory and overfit more easily.

| Value | Use Case |
|---|---|
| 8–16 | Small datasets, subtle changes |
| 32 | Balanced default |
| 64–128 | Larger datasets or complex subjects |

`number_of_steps`

Type: `integer` Default: `2000` (range `100` to `20000`)

Number of optimization steps.

`learning_rate`

Type: `number` Default: `0.0002`

Optimization step size. Start with the default; lower it if results are unstable.

`first_frame_conditioning_p`

Type: `number` Default: `0.5` (range `0.0` to `1.0`)

Probability that the first frame is held as a condition during training.

| Value | Behavior |
|---|---|
| 0.0 | Never conditions on the first frame (behaves like text-to-video) |
| 0.5 | Versatile: the resulting LoRA works for both image-to-video and text-to-video inference |
| 1.0 | Always conditions on the first frame (most specialized to image-to-video) |

Note: in image datasets each sample is a single frame, so the first frame is the entire sample. First-frame conditioning is therefore skipped automatically and training behaves like text-to-video on stills.
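
To make the probability semantics concrete, the sketch below draws one independent decision per training step. It is a toy simulation under stated assumptions (a per-step Bernoulli draw, a hypothetical `conditioning_mask` helper), not the trainer's actual sampling code.

```python
import random


def conditioning_mask(num_steps, p=0.5, seed=0):
    """One Bernoulli(p) draw per step; True means the first frame is held as a condition."""
    rng = random.Random(seed)
    # rng.random() is in [0, 1), so p=0.0 never conditions and p=1.0 always does.
    return [rng.random() < p for _ in range(num_steps)]


if __name__ == "__main__":
    mask = conditioning_mask(2000, p=0.5)
    print(f"conditioned on the first frame in {sum(mask)} of {len(mask)} steps")
```

At `p=0.5` roughly half of the steps see the clean first frame, which is why the resulting LoRA remains usable with and without an input image.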

Video Configuration
`number_of_frames`

Type: `integer` Default: `89` (range `9` to `121`)

Frames per training clip. Must satisfy `frames % 8 == 1`; other values snap down to the nearest valid count.
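
The snapping rule can be reproduced with integer arithmetic. This is a minimal sketch of the rule as stated here; `snap_frame_count` is a hypothetical helper name, and rejecting out-of-range values is an assumption about how the endpoint validates input.

```python
def snap_frame_count(frames, lo=9, hi=121):
    """Snap down to the nearest count satisfying frames % 8 == 1."""
    if not lo <= frames <= hi:
        raise ValueError(f"number_of_frames must be between {lo} and {hi}")
    return ((frames - 1) // 8) * 8 + 1
```

For instance, a request for 100 frames would train on 97, while 89 and 121 are already valid and stay unchanged.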

`frame_rate`

Type: `integer` Default: `24` (range `8` to `60`)

Target frames per second. LTX 2.3's native rate is 24.

`resolution`

Type: `string` (`low`, `medium`, `high`) Default: `medium`

| Resolution | 16:9 | 1:1 | 9:16 |
|---|---|---|---|
| low | 512×288 | 512×512 | 288×512 |
| medium | 768×448 | 768×768 | 448×768 |
| high | 960×544 | 960×960 | 544×960 |

`aspect_ratio`

Type: `string` (`16:9`, `1:1`, `9:16`) Default: `1:1`
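
The `resolution` and `aspect_ratio` settings together select one bucket from the table above. A lookup like the following can help when planning crops; the `BUCKETS` dictionary and `bucket_size` function are illustrative names, with the sizes copied from the table.

```python
# (width, height) per resolution and aspect ratio, copied from the table above.
BUCKETS = {
    "low": {"16:9": (512, 288), "1:1": (512, 512), "9:16": (288, 512)},
    "medium": {"16:9": (768, 448), "1:1": (768, 768), "9:16": (448, 768)},
    "high": {"16:9": (960, 544), "1:1": (960, 960), "9:16": (544, 960)},
}


def bucket_size(resolution="medium", aspect_ratio="1:1"):
    """Return the (width, height) training bucket for the given settings."""
    try:
        return BUCKETS[resolution][aspect_ratio]
    except KeyError:
        raise ValueError(f"unsupported combination: {resolution!r}, {aspect_ratio!r}")
```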

`auto_scale_input`

Type: `boolean` Default: `false`

Fit videos to the target frame count and frame rate. No effect on images.

`split_input_into_scenes`

Type: `boolean` Default: `true`

Split long videos into scenes before training.

`split_input_duration_threshold`

Type: `number` Default: `30.0` (range `1.0` to `60.0`)

Duration above which a video is eligible for scene splitting.

Audio Configuration
`with_audio`

Type: `boolean` or `null` Default: `null` (auto-detect)

Controls joint audio-video training, all-or-nothing across the dataset. With `null` (default, auto-detect), audio is enabled only when every clip has an audio track — if even one clip is silent, the whole run trains video-only. Set `true` to force audio training (the request fails with a 422 if any clip lacks an audio track), or `false` to ignore audio even when present.
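
The three `with_audio` modes reduce to a small decision function. The sketch below restates the behavior described above; `resolve_audio` is a hypothetical name, and a `ValueError` stands in for the endpoint's 422 response.

```python
def resolve_audio(with_audio, clips_have_audio):
    """Decide whether a run trains audio.

    with_audio: True, False, or None (auto-detect).
    clips_have_audio: one boolean per clip, True if the clip has an audio track.
    """
    all_have_audio = all(clips_have_audio)
    if with_audio is None:
        # Auto-detect: a single silent clip switches the whole run to video-only.
        return all_have_audio
    if with_audio and not all_have_audio:
        raise ValueError("with_audio is true but at least one clip has no audio track")
    return with_audio
```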

`audio_normalize`

Type: `boolean` Default: `true`

Peak-normalize audio for consistent loudness.

`audio_preserve_pitch`

Type: `boolean` Default: `true`

Preserve pitch when fitting audio to video duration.

Validation
`validation`

Type: `array` Default: `[]` (max 2 entries)

Validation samples, each an object with:

  • `prompt` (`string`) — the text prompt.
  • `image_url` (`string`, required) — the first-frame image to animate.
`validation_negative_prompt`

Type: `string` Default: a built-in quality negative prompt.

`validation_number_of_frames`

Type: `integer` Default: `89` (range `9` to `121`)

`validation_frame_rate`

Type: `integer` Default: `24` (range `8` to `60`)

`validation_resolution`

Type: `string` Default: `high`

`validation_aspect_ratio`

Type: `string` Default: `1:1`

`stg_scale`

Type: `number` Default: `1.0` (range `0.0` to `3.0`)

Spatio-Temporal Guidance scale for previews; `1.0` recommended, `0.0` disables.

`debug_dataset`

Type: `boolean` Default: `false`

Return an archive of the preprocessed data for inspection.

Outputs

  • `lora_file` — the trained LoRA weights (`.safetensors`).
  • `config_file` — JSON describing the trigger phrase and training type.
  • `video` — a combined validation reel (when validation samples were provided).
  • `debug_dataset` — preprocessed-data archive, only when `debug_dataset` is enabled.

Billing

A successful run is billed for `max(100, number_of_steps)` units. Requests that fail before training completes (input-validation errors returning HTTP 422, or dataset-download failures) are billed 0 units.
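
The billing rule can be turned into a quick estimate. This sketch assumes that one billable unit costs $0.0024, which is inferred from the per-step price quoted in the pricing note at the top of this page; the function names are hypothetical.

```python
PRICE_PER_UNIT_USD = 0.0024  # assumption: price per unit, taken from the pricing note


def billable_units(number_of_steps, succeeded=True):
    """Units billed for a run; runs that fail before training completes bill 0."""
    return max(100, number_of_steps) if succeeded else 0


def estimated_cost_usd(number_of_steps, succeeded=True):
    """Estimated charge in US dollars, rounded to avoid float noise."""
    return round(billable_units(number_of_steps, succeeded) * PRICE_PER_UNIT_USD, 4)
```

With the default `number_of_steps` of 2000, this estimate comes to $4.80.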

How the Training Works

Pipeline Overview
  1. Preprocessing — the archive is extracted, media matched to captions, clips resized/cropped to the resolution bucket, optionally scene-split, and audio prepared.
  2. Training — the LoRA trains for `number_of_steps`, conditioning on the first frame with probability `first_frame_conditioning_p`, running validation previews at intervals.
  3. Output — the LoRA, config, and validation reel are uploaded.
What Happens to Your Data
  • Archive extraction: the `.zip` is unpacked; macOS `__MACOSX` metadata folders are ignored.
  • File matching: each media file is paired with the `.txt` of the same base name.
  • Video fitting: clips are resized to fill the target resolution (keeping aspect ratio), then center-cropped to the exact bucket size. With `auto_scale_input`, clips are resampled to the target frame rate and frame count.
  • First frame: during training the first frame of each clip is sometimes held as a clean condition (per `first_frame_conditioning_p`) and the model learns to generate the rest.
  • Scene splitting: long clips are cut into shots when enabled.
  • Captions: the trigger phrase (if set) is prepended.
  • Audio: when enabled, each clip's soundtrack is prepared and learned jointly.
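
The "resize to fill, then center-crop" step in the list above comes down to a little arithmetic. The sketch below works through it with integer math; `fill_and_center_crop` is a hypothetical helper, and rounding the resized dimension up is an assumption, since the exact rounding used by the trainer is not documented here.

```python
def fill_and_center_crop(src_w, src_h, dst_w, dst_h):
    """Return (resized_w, resized_h, crop_x, crop_y) for a fill-then-crop fit.

    The source is scaled, keeping its aspect ratio, until it covers the target
    bucket, and the crop window of size dst_w x dst_h is centered on the result.
    """
    if dst_w * src_h >= dst_h * src_w:
        # Width is the limiting side: match it and let the height overflow.
        new_w = dst_w
        new_h = -(-src_h * dst_w // src_w)  # ceiling division
    else:
        # Height is the limiting side: match it and let the width overflow.
        new_h = dst_h
        new_w = -(-src_w * dst_h // src_h)  # ceiling division
    return new_w, new_h, (new_w - dst_w) // 2, (new_h - dst_h) // 2
```

A 1920×1080 clip fitted to the medium 1:1 bucket (768×768) is resized to 1366×768 and cropped starting 299 pixels from the left, so content near the left and right edges is lost. Framing the subject near the center of each clip avoids surprises.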
LoRA Training

A LoRA is a compact set of adapter weights trained on top of the frozen base model. You load it alongside the base LTX 2.3 model at inference. Here it specializes the model in animating a supplied first frame.

Audio Support

If clips have sound and audio is enabled, the LoRA can produce a matching soundtrack. Audio is all-or-nothing across the dataset: if any clip is silent, the whole run trains video-only.

Tips for Getting Good Results

Dataset Quality
  • At least 10 clips; more variety helps.
  • Clips should open on a clean, representative first frame — that frame drives generation.
  • Show the subject from multiple angles and contexts for a robust concept.
Caption Best Practices
  • Describe the content of each clip plainly and specifically.
  • Subject/object: combine your trigger phrase with a description, e.g. `tronlog0 a glowing logo rotating slowly`.
  • Style: describe the content and let the look be learned.

Good caption: `a golden retriever runs across a grassy field`

Weak caption: `dog`

Trigger Phrases
  • Use a rare, distinctive trigger token for a specific subject and include it in every caption.
  • Skip the trigger phrase for an always-on style.
Scene Splitting and Captions

With scene splitting on, one long video produces several clips sharing one caption — which may not describe each split accurately. Pre-split and disable scene splitting for precise captions.

Inference Format Matching

Prompt the LoRA the way you captioned it, including the trigger phrase. Supply a first-frame image consistent with your training footage. At `first_frame_conditioning_p = 0.5` the LoRA also works without an image (text-to-video).

Example training request:

```json
{
  "training_data_url": "https://example.com/my_dataset.zip",
  "trigger_phrase": "tronlog0",
  "rank": 32,
  "number_of_steps": 2000,
  "learning_rate": 0.0002,
  "first_frame_conditioning_p": 0.5,
  "number_of_frames": 89,
  "frame_rate": 24,
  "resolution": "medium",
  "aspect_ratio": "1:1",
  "validation": [
    { "prompt": "tronlog0 rotating slowly", "image_url": "https://example.com/start.png" }
  ]
}
```

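
A request like the one above could be assembled and submitted from Python roughly as follows. This is a sketch under assumptions: `build_request` and `submit` are hypothetical helpers, the submission step assumes the `fal_client` package is installed with credentials configured in the environment, and the endpoint id is taken from the title of this page.

```python
def build_request(training_data_url, validation=(), **overrides):
    """Assemble a training request, enforcing the validation limits documented above."""
    validation = list(validation)
    if len(validation) > 2:
        raise ValueError("validation accepts at most 2 entries")
    for sample in validation:
        if not sample.get("image_url"):
            raise ValueError("each validation sample needs an image_url")
    request = {"training_data_url": training_data_url, "validation": validation}
    request.update(overrides)  # e.g. rank=32, number_of_steps=2000
    return request


def submit(request):
    """Send the request and wait for the result (assumes the fal_client package)."""
    # Imported lazily so build_request works without the client installed.
    import fal_client

    return fal_client.subscribe("fal-ai/ltx23-trainer-v2/i2v", arguments=request)
```

Keeping the payload construction separate from the network call makes it easy to inspect or log the request before any training cost is incurred.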
Diagnosing Issues
  • Overfitting (previews copy training clips or ignore the prompt): fewer steps, lower `rank`, more varied data.
  • Underfitting (concept barely shows): more steps, higher `rank`, better captions.
  • Animation ignores the supplied image: raise `first_frame_conditioning_p` toward `1.0`.
  • Training fails: verify the archive is all-video or all-image, and that captions match base names.
Validation Prompt Tips
  • Pair each prompt with a representative first-frame image.
  • Use prompts in the same style as your captions, including the trigger phrase.
Common Pitfalls
  • Mixing images and videos in one archive.
  • Clips that open on blurry or unrepresentative first frames.
  • Forgetting the trigger phrase or a first-frame image at inference.