fal-ai/ltx23-trainer-v2/ic-lora/v2v

Train an IC-LoRA that learns a video-to-video transformation from paired before/after clips, conditioned at inference on a reference (control) video.
Training
Commercial use

The cost of training depends on the number of steps. The formula is: 0.0059 * steps. With 1000 steps, your request will cost $5.90.

LTX 2.3 Trainer — Video-to-Video IC-LoRA (`/ic-lora/v2v`)

Overview

The `/ic-lora/v2v` endpoint trains an IC-LoRA (In-Context LoRA) for the LTX 2.3 video model. An IC-LoRA is a small adapter that does not generate from text alone — instead it conditions on a reference (control) video supplied at inference and learns to produce a corresponding target video. In other words, it learns a transformation: "given this kind of input clip, produce that kind of output clip." You teach it that mapping by providing pairs of clips — a "before" reference and an "after" target — and the LoRA learns to go from one to the other.

This is the right endpoint when you want a control-driven video transformation, for example:

  • Pose / depth / sketch / edge control to full-resolution video.
  • Restyling or colorization that follows the motion of a control clip.
  • A recurring, repeatable edit applied to arbitrary footage.
  • Any reference-to-video mapping you can demonstrate with paired examples.

Key features:

  • Trains an IC-LoRA from paired reference→target video clips.
  • Optional reference downscaling so the LoRA can be driven by a coarse / low-resolution control proxy (e.g. a small pose or depth map) yet output full resolution.
  • Optional reference temporal downsampling so the LoRA can be driven by a low-FPS reference.
  • Video-only; no audio is trained.

Dataset Format

Provide a single `.zip` archive (linked via `training_data_url`) of paired clips:

  • `<name>_start.<ext>` — the reference / control video (the "input" to transform).
  • `<name>_end.<ext>` — the target video (the desired "output").
  • `<name>.txt` — caption for the pair (optional only if a `trigger_phrase` is set; otherwise required).

Each `_start` must have a matching `_end` with the same base `<name>`. Video formats: `.mp4`, `.mov`, `.avi`, `.mkv`. The `_start` and `_end` clips of a pair must have matching frame counts. File names must be unique across the archive. Aim for at least 10 pairs.

Minimum clip length: with `auto_scale_input` off (the default), both clips in each pair (`_start` and `_end`) must already have at least `number_of_frames` frames (default 89, ≈ 3.7 s at 24 fps). A pair with either side too short is skipped, and if no pair qualifies the request fails (422). Turn on `auto_scale_input` to resample shorter clips instead.

Example layout:

sample01_start.mp4   sample01_end.mp4   sample01.txt
sample02_start.mp4   sample02_end.mp4   sample02.txt
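
The naming rules above can be checked locally before uploading. The following is a minimal sketch, not the service's actual validator: `check_pairs` is a hypothetical helper that takes a list of archive member names (for example from `zipfile.ZipFile.namelist()`) and reports unpaired clips, duplicate clip names, and missing captions when no trigger phrase is set.

```python
from pathlib import PurePosixPath

# Video extensions listed in this document.
VIDEO_EXTS = {".mp4", ".mov", ".avi", ".mkv"}


def check_pairs(names, trigger_phrase=""):
    """Return (sorted pair base names, list of problem strings)."""
    starts, ends, captions, problems = set(), set(), set(), []
    for raw in names:
        path = PurePosixPath(raw)
        # Skip directory entries and macOS metadata folders.
        if raw.endswith("/") or "__MACOSX" in path.parts:
            continue
        stem, ext = path.stem, path.suffix.lower()
        if ext == ".txt":
            captions.add(stem)
            continue
        if ext not in VIDEO_EXTS:
            continue
        for suffix, seen in (("_start", starts), ("_end", ends)):
            if stem.endswith(suffix):
                base = stem[: -len(suffix)]
                if base in seen:
                    problems.append(f"{base}: duplicate {suffix} clip")
                seen.add(base)
    # Every _start needs a matching _end, and vice versa.
    for base in sorted(starts ^ ends):
        missing = "_end" if base in starts else "_start"
        problems.append(f"{base}: missing {missing} clip")
    pairs = sorted(starts & ends)
    # Captions are only optional when a trigger phrase is set.
    if not trigger_phrase:
        for base in pairs:
            if base not in captions:
                problems.append(f"{base}: missing caption {base}.txt")
    return pairs, problems


if __name__ == "__main__":
    members = ["sample01_start.mp4", "sample01_end.mp4", "sample01.txt",
               "sample02_start.mp4"]
    print(check_pairs(members))
```

Working on a plain list of names keeps the check independent of how the archive is opened; frame counts are not inspected here.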

Within each pair, the `_start` and `_end` must have the same frame count. If any reference/target pair differs in length, the entire request is rejected (HTTP 422); the mismatched pair is not silently skipped. Pre-trim your clips so each pair matches.
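
The two frame-count rules behave differently: a length mismatch inside a pair rejects the whole request, while a pair that is merely too short is skipped. The sketch below illustrates that triage with `auto_scale_input` off. `triage_pairs` is a hypothetical helper, the frame counts are assumed to be measured beforehand, and the order of the checks (mismatch first) is an assumption rather than documented behavior.

```python
def triage_pairs(frame_counts, number_of_frames=89):
    """Classify pairs by frame count.

    frame_counts maps a pair name to (start_frames, end_frames).
    Returns (usable, skipped); raises ValueError where the text above
    says the request would fail with HTTP 422.
    """
    mismatched = sorted(n for n, (s, e) in frame_counts.items() if s != e)
    if mismatched:
        # A mismatched pair is not silently skipped: the request is rejected.
        raise ValueError(f"frame-count mismatch in pairs: {mismatched}")

    usable, skipped = [], []
    for name, (start_frames, _end_frames) in sorted(frame_counts.items()):
        # Both sides are equal here, so checking one side is enough.
        (usable if start_frames >= number_of_frames else skipped).append(name)

    if not usable:
        raise ValueError("no pair has enough frames; the request would fail")
    return usable, skipped


if __name__ == "__main__":
    print(triage_pairs({"sample01": (120, 120), "sample02": (60, 60)}))
```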

Input Parameters Reference

Dataset
`training_data_url` (required)

Type: `string`

URL to the `.zip` archive of `_start`/`_end` pairs (and optional captions).

`trigger_phrase`

Type: `string` Default: `""`

Phrase prepended to captions during training; include it at inference to activate the transformation.

Training Parameters
`rank`

Type: `integer` (`8`, `16`, `32`, `64`, `128`) Default: `32`

IC-LoRA capacity. Higher values capture more detail at the cost of memory and overfitting risk.

`number_of_steps`

Type: `integer` Default: `3000` (range `100`–`20000`)

Number of optimization steps. Video-to-video transformations typically benefit from a somewhat higher step count than plain text-to-video, hence the higher default.

`learning_rate`

Type: `number` Default: `0.0002`

Optimization step size.

`first_frame_conditioning_p`

Type: `number` Default: `0.1` (range `0.0`–`1.0`)

Probability of conditioning on the first frame during training. Lower values work better for video-to-video transformation (the default is intentionally low).

`reference_downscale_factor`

Type: `integer` Default: `1` (range `1`–`8`)

Spatially downscale the reference (control) video by this factor before it is encoded, so the LoRA learns to drive a full-resolution output from a coarse / low-resolution reference (e.g. a small pose, depth, or sketch proxy). `1` means no downscaling.

| Value | Use Case |
| --- | --- |
| 1 | Reference and target at the same resolution (default) |
| 2–8 | Reference is a smaller / coarser control proxy than the desired output |

Note: both width and height must be divisible by the factor, and width ÷ factor and height ÷ factor must each be divisible by 32 (checked against both the training and validation resolutions); an incompatible value fails the request with a 422.
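
The divisibility rule in the note can be checked ahead of time. The sketch below is illustrative only: it copies the resolution buckets from the `resolution` table in this document, and `downscale_ok` and `valid_factors` are hypothetical helper names, not part of the API.

```python
# Width x height per resolution and aspect ratio, as listed in this document.
RESOLUTIONS = {
    "low": {"16:9": (512, 288), "1:1": (512, 512), "9:16": (288, 512)},
    "medium": {"16:9": (768, 448), "1:1": (768, 768), "9:16": (448, 768)},
    "high": {"16:9": (960, 544), "1:1": (960, 960), "9:16": (544, 960)},
}


def downscale_ok(width, height, factor):
    """Each side must divide by the factor, and the result must divide by 32."""
    return all(d % factor == 0 and (d // factor) % 32 == 0 for d in (width, height))


def valid_factors(resolution, aspect_ratio):
    """List the factors in 1..8 that pass the rule for one bucket."""
    width, height = RESOLUTIONS[resolution][aspect_ratio]
    return [f for f in range(1, 9) if downscale_ok(width, height, f)]


if __name__ == "__main__":
    train = valid_factors("medium", "1:1")  # [1, 2, 3, 4, 6, 8]
    val = valid_factors("high", "1:1")      # [1, 2, 3, 5, 6]
    # The factor must pass for both the training and validation resolutions.
    print(sorted(set(train) & set(val)))    # [1, 2, 3, 6]
```

Because the check applies to both resolutions, a factor that works for the training bucket (such as 4 at `medium` 1:1) can still be rejected when `validation_resolution` is left at `high`.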

`reference_temporal_scale_factor`

Type: `integer` Default: `1` (range `1`–`8`)

Temporally downsample the reference video (lower FPS) by this factor before encoding, so the LoRA can be driven by a low-FPS reference. `1` means no change.

`(number_of_frames − 1)` must be divisible by the factor, and after subsampling `(frames − 1)` must remain a multiple of 8 — checked against both `number_of_frames` and `validation_number_of_frames`. An incompatible factor/frame-count combination fails the request with a 422. (Example: with the default 89 frames, a factor of 2 is invalid because `(89 − 1) ÷ 2 = 44`, which is not a multiple of 8.)
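
The rule above reduces to two modulo checks, sketched here under the assumption that "after subsampling" means `(frames − 1) ÷ factor`, as the worked example implies. `temporal_factor_ok` is a hypothetical helper name.

```python
def temporal_factor_ok(frames, factor):
    """(frames - 1) must divide by the factor, and the quotient by 8."""
    span = frames - 1
    return span % factor == 0 and (span // factor) % 8 == 0


def valid_temporal_factors(frames):
    """Factors in the documented 1..8 range that pass for this frame count."""
    return [f for f in range(1, 9) if temporal_factor_ok(frames, f)]


if __name__ == "__main__":
    print(valid_temporal_factors(89))   # [1]
    print(valid_temporal_factors(65))   # [1, 2, 4, 8]
    print(valid_temporal_factors(121))  # [1, 3, 5]
```

Under this reading, the default 89 frames admit no factor above 1, so using temporal downsampling means picking a frame count such as 65 or 121 for both `number_of_frames` and `validation_number_of_frames`.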

Video Configuration
`number_of_frames`

Type: `integer` Default: `89` (range `9`–`121`)

Frames per training clip. Must satisfy `frames % 8 == 1`; other values are snapped down to the nearest valid count.
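
As a rough illustration of the snapping rule, the sketch below rounds a requested count down to the nearest value with `frames % 8 == 1`. The exact formula the service uses is not stated here, so `snap_frames` is an assumed reading of "snapped down".

```python
def snap_frames(requested):
    """Round down to the nearest count satisfying frames % 8 == 1."""
    return ((requested - 1) // 8) * 8 + 1


if __name__ == "__main__":
    for n in (89, 96, 100, 121):
        print(n, "->", snap_frames(n))
```

For example, a request for 100 frames would become 97, and 96 would become 89.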

`frame_rate`

Type: `integer` Default: `24` (range `8`–`60`)

Target frames per second. LTX 2.3's native rate is 24.

`resolution`

Type: `string` (`low`, `medium`, `high`) Default: `medium`

| Resolution | 16:9 | 1:1 | 9:16 |
| --- | --- | --- | --- |
| low | 512×288 | 512×512 | 288×512 |
| medium | 768×448 | 768×768 | 448×768 |
| high | 960×544 | 960×960 | 544×960 |

`aspect_ratio`

Type: `string` (`16:9`, `1:1`, `9:16`) Default: `1:1`

`auto_scale_input`

Type: `boolean` Default: `false`

Fit videos to the target frame count and frame rate.

`split_input_into_scenes`

Type: `boolean` Default: `true`

Split long clips into scenes (kept in sync across the reference/target pair).

`split_input_duration_threshold`

Type: `number` Default: `30.0` (range `1.0`–`60.0`)

Duration above which a clip is eligible for scene splitting.

Validation
`validation`

Type: `array` Default: `[]` (max 2 entries)

Validation samples, each an object with:

  • `prompt` (`string`) — the text prompt.
  • `reference_video_url` (`string`, required) — the reference / control video to transform.
`validation_negative_prompt`

Type: `string` Default: a built-in quality negative prompt.

`validation_number_of_frames`

Type: `integer` Default: `89` (range `9`–`121`)

`validation_frame_rate`

Type: `integer` Default: `24` (range `8`–`60`)

`validation_resolution`

Type: `string` Default: `high`

`validation_aspect_ratio`

Type: `string` Default: `1:1`

`stg_scale`

Type: `number` Default: `1.0` (range `0.0`–`3.0`)

`debug_dataset`

Type: `boolean` Default: `false`

Return an archive of the preprocessed data for inspection.

Outputs

  • `lora_file` — the trained IC-LoRA weights (`.safetensors`).
  • `config_file` — JSON describing the trigger phrase and training type.
  • `video` — combined validation reel (when validation samples were provided).
  • `debug_dataset` — preprocessed-data archive, only when `debug_dataset` is enabled.

Billing

A successful run is billed `max(100, number_of_steps)` billable units. Requests that fail before training completes (input-validation errors / HTTP 422, or dataset-download failures) are billed 0 units.
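
To estimate what a run will cost, the billable-unit rule can be combined with the per-step price quoted in the pricing note on this page (0.0059 per step). That the same price applies to each billable unit is an assumption of this sketch, and `estimated_cost` is a hypothetical helper.

```python
from decimal import Decimal

# Per-step price from the pricing note; assumed to apply per billable unit.
PRICE_PER_UNIT = Decimal("0.0059")


def billable_units(number_of_steps):
    """A successful run is billed max(100, number_of_steps) units."""
    return max(100, number_of_steps)


def estimated_cost(number_of_steps):
    """Estimated charge in dollars for a successful run."""
    return PRICE_PER_UNIT * billable_units(number_of_steps)


if __name__ == "__main__":
    print(estimated_cost(1000))  # 5.9000
    print(estimated_cost(3000))  # 17.7000
```

`Decimal` is used so the result matches the quoted $5.90 for 1000 steps without floating-point rounding noise.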

How the Training Works

Pipeline Overview
  1. Preprocessing — the archive is extracted, `_start`/`_end` pairs and captions are matched, each clip is resized/cropped to the resolution bucket (optionally scene-split in sync), and the reference is optionally downscaled / temporally downsampled.
  2. Training — the IC-LoRA trains for `number_of_steps`, conditioned on the reference clip, learning to produce the target. Validation previews run at intervals.
  3. Output — the IC-LoRA, config, and validation reel are uploaded.
What Happens to Your Data
  • Archive extraction: the `.zip` is unpacked; macOS `__MACOSX` metadata folders are ignored.
  • Pair matching: each `<name>_start` is paired with its `<name>_end` and the optional `<name>.txt` caption. Targets missing a reference (or vice versa) are reported as errors.
  • Video fitting: both clips in a pair are resized to fill the resolution bucket and center-cropped; with `auto_scale_input` they are resampled to the target frame rate/count. Reference and target stay aligned.
  • Reference scaling: when `reference_downscale_factor` or `reference_temporal_scale_factor` is above 1, the reference is physically downscaled / temporally downsampled before encoding, so the LoRA learns to drive full-resolution output from a coarse reference.
  • Scene splitting: when on, pairs are split using synchronized boundaries so the reference and target never desync.
  • Captions: the trigger phrase (if set) is prepended.
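
The "resize to fill, then center-crop" step can be pictured with a little geometry. This is a sketch of the general technique, not the trainer's actual code: the rounding behavior and the helper name `fill_and_center_crop` are assumptions.

```python
def fill_and_center_crop(src_w, src_h, dst_w, dst_h):
    """Return (resized_size, crop_box) for a fill-and-center-crop fit.

    The clip is scaled uniformly until it covers the target bucket,
    then the overflow is trimmed equally from both sides.
    crop_box is (left, top, right, bottom) in resized coordinates.
    """
    scale = max(dst_w / src_w, dst_h / src_h)
    # Guard against rounding leaving a side one pixel short of the target.
    new_w = max(dst_w, round(src_w * scale))
    new_h = max(dst_h, round(src_h * scale))
    left = (new_w - dst_w) // 2
    top = (new_h - dst_h) // 2
    return (new_w, new_h), (left, top, left + dst_w, top + dst_h)


if __name__ == "__main__":
    # A 1920x1080 clip fitted to the medium 1:1 bucket (768x768).
    print(fill_and_center_crop(1920, 1080, 768, 768))
```

Applying the same geometry to both clips of a pair is what keeps the reference and target spatially aligned; content near the edges of a wide clip is lost when the bucket is square.
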
What an IC-LoRA Is

An IC-LoRA is a LoRA trained to perform an in-context transformation: instead of generating from text alone, it conditions on a reference video supplied at inference and produces a transformed result. The trained file specializes the base model for your reference→target mapping. At inference you supply a reference video (scaled the same way as during training) plus a prompt, and the LoRA produces the corresponding target.

Tips for Getting Good Results

Dataset Quality
  • Use at least 10 well-aligned reference→target pairs; more variety yields a more general transformation.
  • The reference and target of each pair must show the same scene and motion, differing only by the transformation you want learned.
  • Keep frame counts matched within a pair.
Caption Best Practices
  • Describe the target content plainly, optionally with a trigger phrase.
  • Keep captions consistent in style across pairs so the LoRA associates the transformation, not the wording.

Good caption: `tron1ze a neon-outlined city street at night`

Weak caption: `street`

Trigger Phrases
  • A distinctive trigger phrase helps cleanly invoke the transformation at inference; include it in every caption and at inference.
Reference Scaling
  • If you want to drive generation from a small / coarse control map (pose, depth, edges), set `reference_downscale_factor` above 1 and supply matching coarse references at inference.
  • Use the same scaling at inference as you did at training. During validation, the training-time scaling is applied to your reference video for you.
Scene Splitting and Captions

Scene splitting keeps reference/target in sync, but each split inherits the pair's single caption. For precise captions, pre-split your pairs and disable scene splitting.

Inference Format Matching

At inference, supply a reference video, use the same trigger phrase, and apply the same reference scaling you trained with.

Example training request (note that the validation prompt reuses the trigger phrase):

```json
{
  "training_data_url": "https://example.com/v2v_pairs.zip",
  "trigger_phrase": "tron1ze",
  "rank": 32,
  "number_of_steps": 3000,
  "learning_rate": 0.0002,
  "first_frame_conditioning_p": 0.1,
  "reference_downscale_factor": 1,
  "reference_temporal_scale_factor": 1,
  "number_of_frames": 89,
  "frame_rate": 24,
  "resolution": "medium",
  "aspect_ratio": "1:1",
  "validation": [
    { "prompt": "tron1ze a busy city street", "reference_video_url": "https://example.com/ref.mp4" }
  ]
}
```

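
A request like the one above could be submitted from Python with the `fal_client` package. This is a hedged sketch: it assumes `fal_client` is installed and a `FAL_KEY` credential is set in the environment, `build_request` and `submit_training` are hypothetical helpers, and the shape of the returned result is not assumed beyond the output names listed under Outputs.

```python
ENDPOINT = "fal-ai/ltx23-trainer-v2/ic-lora/v2v"


def build_request(training_data_url, trigger_phrase="", **overrides):
    """Assemble the request body; omitted parameters keep their defaults."""
    request = {"training_data_url": training_data_url}
    if trigger_phrase:
        request["trigger_phrase"] = trigger_phrase
    request.update(overrides)
    return request


def submit_training(request):
    """Submit the request and block until the training run finishes."""
    # Imported lazily so the payload helper works without the package.
    import fal_client  # third-party client, assumed to be installed

    return fal_client.subscribe(ENDPOINT, arguments=request)


if __name__ == "__main__":
    payload = build_request(
        "https://example.com/v2v_pairs.zip",
        trigger_phrase="tron1ze",
        number_of_steps=3000,
        validation=[{
            "prompt": "tron1ze a busy city street",
            "reference_video_url": "https://example.com/ref.mp4",
        }],
    )
    print(payload)
    # result = submit_training(payload)  # uncomment to start a billed run
```

The submit call is left commented out because a successful run is billed.
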
Diagnosing Issues
  • Transformation too weak: more steps, higher `rank`, or more consistent pairs.
  • Output ignores the reference structure: make sure your pairs are well aligned, consider lowering `first_frame_conditioning_p` (already low by default), and check that the reference scaling at inference matches training.
  • Overfitting (previews copy training targets): fewer steps, lower `rank`, more varied pairs.
  • Dataset errors: every `_start` needs a matching `_end`; frame counts within a pair must match; names must be unique.
Validation Prompt Tips
  • Use a reference video that the LoRA has not seen, so previews reveal generalization.
  • Match the caption style and trigger phrase you trained with.
Common Pitfalls
  • Mismatched or missing `_start`/`_end` pairs.
  • Reference and target that differ by more than the intended transformation.
  • Using different reference scaling at inference than at training.