fal-ai/ltx23-trainer-v2/ic-lora/v2v
The cost of training depends on the number of steps. The formula is: 0.0059 * steps. With 1000 steps, your request will cost $5.90.
LTX 2.3 Trainer — Video-to-Video IC-LoRA (`/ic-lora/v2v`)
Overview
The `/ic-lora/v2v` endpoint trains an IC-LoRA (In-Context LoRA) for the LTX 2.3 video model. An IC-LoRA is a small adapter that does not generate from text alone — instead it conditions on a reference (control) video supplied at inference and learns to produce a corresponding target video. In other words, it learns a transformation: "given this kind of input clip, produce that kind of output clip." You teach it that mapping by providing pairs of clips — a "before" reference and an "after" target — and the LoRA learns to go from one to the other.
This is the right endpoint when you want a control-driven video transformation, for example:
- Pose / depth / sketch / edge control to full-resolution video.
- Restyling or colorization that follows the motion of a control clip.
- A recurring, repeatable edit applied to arbitrary footage.
- Any reference-to-video mapping you can demonstrate with paired examples.
Key features:
- Trains an IC-LoRA from paired reference→target video clips.
- Optional reference downscaling so the LoRA can be driven by a coarse / low-resolution control proxy (e.g. a small pose or depth map) yet output full resolution.
- Optional reference temporal downsampling so the LoRA can be driven by a low-FPS reference.
- Video-only; no audio is trained.
Dataset Format
Provide a single `.zip` archive (linked via `training_data_url`) of paired clips:
- `<name>_start.<ext>`: the reference / control video (the "input" to transform).
- `<name>_end.<ext>`: the target video (the desired "output").
- `<name>.txt`: caption for the pair (optional only if a `trigger_phrase` is set; otherwise required).
Each `_start` must have a matching `_end` with the same base `<name>`. Video formats: `.mp4`, `.mov`, `.avi`, `.mkv`. The `_start` and `_end` clips of a pair must have matching frame counts. File names must be unique across the archive. Aim for at least 10 pairs.
Minimum clip length: with `auto_scale_input` off (the default), both clips in each pair (`_start` and `_end`) must already have at least `number_of_frames` frames (default 89, ≈ 3.7 s at 24 fps). A pair with either side too short is skipped, and if no pair qualifies the request fails (422). Turn on `auto_scale_input` to resample shorter clips instead.
Example layout:
    sample01_start.mp4
    sample01_end.mp4
    sample01.txt
    sample02_start.mp4
    sample02_end.mp4
    sample02.txt
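The naming rules above can be checked locally before uploading. The following is a minimal sketch, not the service's own validator: the function names `check_archive_names` and `check_zip` are hypothetical, and the check only looks at file names (it does not open the videos or compare frame counts).

```python
import zipfile
from pathlib import PurePosixPath

VIDEO_EXTS = {".mp4", ".mov", ".avi", ".mkv"}


def check_archive_names(names, has_trigger_phrase=False):
    """Return a list of problems found in a list of archive member names."""
    starts, ends, captions = {}, {}, set()
    for raw in names:
        path = PurePosixPath(raw)
        # Ignore directory entries and macOS metadata folders.
        if raw.endswith("/") or "__MACOSX" in path.parts:
            continue
        stem, ext = path.stem, path.suffix.lower()
        if ext == ".txt":
            captions.add(stem)
        elif ext in VIDEO_EXTS and stem.endswith("_start"):
            starts.setdefault(stem[: -len("_start")], []).append(raw)
        elif ext in VIDEO_EXTS and stem.endswith("_end"):
            ends.setdefault(stem[: -len("_end")], []).append(raw)

    problems = []
    for base in sorted(set(starts) | set(ends)):
        if base not in ends:
            problems.append(f"{base}: _start has no matching _end")
        elif base not in starts:
            problems.append(f"{base}: _end has no matching _start")
        if len(starts.get(base, [])) > 1 or len(ends.get(base, [])) > 1:
            problems.append(f"{base}: duplicate clip name")
        # Captions are optional only when a trigger phrase is set.
        if not has_trigger_phrase and base not in captions:
            problems.append(f"{base}: missing caption {base}.txt")
    return problems


def check_zip(zip_path, has_trigger_phrase=False):
    """Run the name check against a .zip file on disk."""
    with zipfile.ZipFile(zip_path) as archive:
        return check_archive_names(archive.namelist(), has_trigger_phrase)
```

An empty list means the names look consistent; each string otherwise describes one problem to fix before uploading.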
Within each pair, the `_start` and `_end` must have the same frame count. If any reference/target pair differs in length, the entire request is rejected (HTTP 422); the mismatched pair is not silently skipped. Pre-trim your clips so each pair matches.
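The frame-count rules can be summarized as a small decision function. This is one reading of the rules described above, not the service's code: the names `classify_pair` and `classify_dataset` and the status strings are made up for illustration, and frame counts are assumed to have been measured already (for example with a tool such as `ffprobe`).

```python
def classify_pair(start_frames, end_frames, number_of_frames=89, auto_scale_input=False):
    """Classify one _start/_end pair by its frame counts."""
    if start_frames != end_frames:
        # A length mismatch rejects the whole request (HTTP 422).
        return "reject_request"
    if not auto_scale_input and start_frames < number_of_frames:
        # Too short and no resampling requested: the pair is skipped.
        return "skip_pair"
    return "ok"


def classify_dataset(pairs, number_of_frames=89, auto_scale_input=False):
    """Classify a whole dataset given (start_frames, end_frames) tuples."""
    statuses = [
        classify_pair(s, e, number_of_frames, auto_scale_input) for s, e in pairs
    ]
    # The request fails on any mismatch, or when no pair qualifies.
    if "reject_request" in statuses or "ok" not in statuses:
        return "rejected (422)"
    return "accepted"
```

For example, a dataset with one 60-frame pair and one 120-frame pair is accepted with the defaults (the short pair is skipped), while a dataset containing only the 60-frame pair is rejected unless `auto_scale_input` is on.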
Input Parameters Reference
Dataset
`training_data_url` (required)
Type: `string`
URL to the `.zip` archive of `_start`/`_end` pairs (and optional captions).
`trigger_phrase`
Type: `string`
Default: `""`
Phrase prepended to captions during training; include it at inference to activate the transformation.
Training Parameters
`rank`
Type: `integer` (`8`, `16`, `32`, `64`, `128`)
Default: `32`
IC-LoRA capacity. Higher values capture more detail at the cost of memory and overfitting risk.
`number_of_steps`
Type: `integer`
Default: `3000` (range `100`–`20000`)
Number of optimization steps. Video-to-video transformations typically benefit from a somewhat higher step count than plain text-to-video, hence the higher default.
`learning_rate`
Type: `number`
Default: `0.0002`
Optimization step size.
`first_frame_conditioning_p`
Type: `number`
Default: `0.1` (range `0.0`–`1.0`)
Probability of conditioning on the first frame during training. Lower values work better for video-to-video transformation (the default is intentionally low).
`reference_downscale_factor`
Type: `integer`
Default: `1` (range `1`–`8`)
Spatially downscale the reference (control) video by this factor before it is encoded, so the LoRA learns to drive a full-resolution output from a coarse / low-resolution reference (e.g. a small pose, depth, or sketch proxy). `1` means no downscaling.
| Value | Use Case |
|---|---|
| 1 | Reference and target at the same resolution (default) |
| 2–8 | Reference is a smaller / coarser control proxy than the desired output |
Note: both width and height must be divisible by the factor, and width ÷ factor and height ÷ factor must each be divisible by 32 (checked against both the training and validation resolutions); an incompatible value fails the request with a 422.
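The divisibility rule in the note can be checked ahead of time. The sketch below is illustrative only (the function names are hypothetical); it lists the factors from 1 to 8 that satisfy the rule for every resolution passed in, so you can pass both the training and the validation resolution.

```python
def _dimension_ok(size, factor):
    """True if size divides evenly by factor and the quotient is a multiple of 32."""
    return size % factor == 0 and (size // factor) % 32 == 0


def valid_downscale_factors(resolutions, max_factor=8):
    """Return the factors that are valid for every (width, height) in resolutions."""
    return [
        factor
        for factor in range(1, max_factor + 1)
        if all(_dimension_ok(w, factor) and _dimension_ok(h, factor) for w, h in resolutions)
    ]


# Default training bucket (medium, 1:1) together with the default
# validation bucket (high, 1:1), using the sizes from the resolution table.
print(valid_downscale_factors([(768, 768), (960, 960)]))  # [1, 2, 3, 6]
```

Note that non-square buckets are more restrictive: for 768×448 only factors 1 and 2 pass, because 448 ÷ 4 = 112 is not a multiple of 32.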
`reference_temporal_scale_factor`
Type: `integer`
Default: `1` (range `1`–`8`)
Temporally downsample the reference video (lower FPS) by this factor before encoding, so the LoRA can be driven by a low-FPS reference. `1` means no change.
`(number_of_frames − 1)` must be divisible by the factor, and after subsampling `(frames − 1)` must remain a multiple of 8 — checked against both `number_of_frames` and `validation_number_of_frames`. An incompatible factor/frame-count combination fails the request with a 422. (Example: with the default 89 frames, a factor of 2 is invalid because `(89 − 1) ÷ 2 = 44`, which is not a multiple of 8.)
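The same kind of pre-check works for the temporal factor. This is a sketch of the rule as stated above, with a hypothetical function name; pass both `number_of_frames` and `validation_number_of_frames` to find factors valid for both.

```python
def valid_temporal_factors(frame_counts, max_factor=8):
    """Return temporal factors valid for every frame count in frame_counts."""
    valid = []
    for factor in range(1, max_factor + 1):
        ok = all(
            # (frames - 1) must divide by the factor, and the subsampled
            # (frames - 1) must still be a multiple of 8.
            (n - 1) % factor == 0 and ((n - 1) // factor) % 8 == 0
            for n in frame_counts
        )
        if ok:
            valid.append(factor)
    return valid


print(valid_temporal_factors([89]))  # [1]
print(valid_temporal_factors([97]))  # [1, 2, 3, 4, 6]
```

With the default 89 frames only a factor of 1 passes, so using temporal downsampling in practice means choosing a different frame count, such as 97, for both training and validation.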
Video Configuration
`number_of_frames`
Type: `integer`
Default: `89` (range `9`–`121`)
Frames per training clip. Must satisfy `frames % 8 == 1`; other values are snapped down to the nearest valid count.
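To see which count a given value snaps to, the rule can be written as a one-line computation. This is a sketch under the assumption that snapping is a plain round-down to the nearest value with `frames % 8 == 1`; the function name is hypothetical, and raising on out-of-range input is a choice made here, not documented behavior.

```python
def snap_frame_count(requested, minimum=9, maximum=121):
    """Round a frame count down to the nearest value with frames % 8 == 1."""
    if not minimum <= requested <= maximum:
        # Assumption: out-of-range values are treated as invalid input here.
        raise ValueError(f"number_of_frames must be in [{minimum}, {maximum}]")
    return ((requested - 1) // 8) * 8 + 1


print(snap_frame_count(96))   # 89
print(snap_frame_count(100))  # 97
```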
`frame_rate`
Type: `integer`
Default: `24` (range `8`–`60`)
Target frames per second. LTX 2.3's native rate is 24.
`resolution`
Type: `string` (`low`, `medium`, `high`)
Default: `medium`
| Resolution | 16:9 | 1:1 | 9:16 |
|---|---|---|---|
| low | 512×288 | 512×512 | 288×512 |
| medium | 768×448 | 768×768 | 448×768 |
| high | 960×544 | 960×960 | 544×960 |
`aspect_ratio`
Type: `string` (`16:9`, `1:1`, `9:16`)
Default: `1:1`
`auto_scale_input`
Type: `boolean`
Default: `false`
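Resample videos to fit the target frame count and frame rate. When enabled, clips shorter than `number_of_frames` are resampled instead of being skipped.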
`split_input_into_scenes`
Type: `boolean`
Default: `true`
Split long clips into scenes (kept in sync across the reference/target pair).
`split_input_duration_threshold`
Type: `number`
Default: `30.0` (range `1.0`–`60.0`)
Duration above which a clip is eligible for scene splitting.
Validation
`validation`
Type: `array`
Default: `[]` (max 2 entries)
Validation samples, each an object with:
- `prompt` (`string`): the text prompt.
- `reference_video_url` (`string`, required): the reference / control video to transform.
`validation_negative_prompt`
Type: `string`
Default: a built-in quality negative prompt.
`validation_number_of_frames`
Type: `integer`
Default: `89` (range `9`–`121`)
`validation_frame_rate`
Type: `integer`
Default: `24` (range `8`–`60`)
`validation_resolution`
Type: `string`
Default: `high`
`validation_aspect_ratio`
Type: `string`
Default: `1:1`
`stg_scale`
Type: `number`
Default: `1.0` (range `0.0`–`3.0`)
`debug_dataset`
Type: `boolean`
Default: `false`
Return an archive of the preprocessed data for inspection.
Outputs
- `lora_file`: the trained IC-LoRA weights (`.safetensors`).
- `config_file`: JSON describing the trigger phrase and training type.
- `video`: combined validation reel (when validation samples were provided).
- `debug_dataset`: preprocessed-data archive, only when `debug_dataset` is enabled.
Billing
A successful run is billed `max(100, number_of_steps)` billable units. Requests that fail before training completes (input-validation errors / HTTP 422, or dataset-download failures) are billed 0 units.
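The billing rule can be turned into a quick estimate. The sketch below assumes that one billable unit costs the $0.0059 per step quoted in the cost formula at the top of this page; that mapping is an assumption, and the function names are hypothetical.

```python
from decimal import Decimal

# Assumption: price of one billable unit, taken from the per-step cost formula.
PRICE_PER_UNIT_USD = Decimal("0.0059")


def billable_units(number_of_steps, succeeded=True):
    """Units billed for a run; failed requests are billed 0 units."""
    return max(100, number_of_steps) if succeeded else 0


def estimate_cost_usd(number_of_steps, succeeded=True):
    """Estimated cost in USD under the unit-price assumption above."""
    return billable_units(number_of_steps, succeeded) * PRICE_PER_UNIT_USD


print(estimate_cost_usd(1000))  # 5.9000
print(estimate_cost_usd(3000))  # 17.7000
```

Under this assumption the default of 3000 steps comes to $17.70, three times the 1000-step figure quoted above.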
How the Training Works
Pipeline Overview
- Preprocessing: the archive is extracted, `_start`/`_end` pairs and captions are matched, each clip is resized/cropped to the resolution bucket (optionally scene-split in sync), and the reference is optionally downscaled / temporally downsampled.
- Training: the IC-LoRA trains for `number_of_steps`, conditioned on the reference clip, learning to produce the target. Validation previews run at intervals.
- Output: the IC-LoRA, config, and validation reel are uploaded.
What Happens to Your Data
- Archive extraction: the `.zip` is unpacked; macOS `__MACOSX` metadata folders are ignored.
- Pair matching: each `<name>_start` is paired with its `<name>_end` and the optional `<name>.txt` caption. Targets missing a reference (or vice versa) are reported as errors.
- Video fitting: both clips in a pair are resized to fill the resolution bucket and center-cropped; with `auto_scale_input` they are resampled to the target frame rate/count. Reference and target stay aligned.
- Reference scaling: when `reference_downscale_factor` or `reference_temporal_scale_factor` is above 1, the reference is physically downscaled / temporally downsampled before encoding, so the LoRA learns to drive full-resolution output from a coarse reference.
- Scene splitting: when on, pairs are split using synchronized boundaries so the reference and target never desync.
- Captions: the trigger phrase (if set) is prepended.
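To preview how much of a clip survives the video-fitting step, the "resize to fill, then center-crop" geometry can be computed directly. This is a generic sketch of that technique, not the trainer's implementation: the function name is hypothetical, and the exact rounding the service uses is not documented here (this version rounds the scaled size up so the frame always covers the bucket).

```python
def fill_and_center_crop(src_w, src_h, dst_w, dst_h):
    """Return (scaled_size, crop_box) for a resize-to-fill plus center crop.

    crop_box is (left, top, right, bottom) in the scaled frame's pixels.
    """
    if src_w * dst_h >= src_h * dst_w:
        # Source is relatively wider than the bucket: match heights, crop width.
        scaled_h = dst_h
        scaled_w = (src_w * dst_h + src_h - 1) // src_h  # ceiling division
    else:
        # Source is relatively taller than the bucket: match widths, crop height.
        scaled_w = dst_w
        scaled_h = (src_h * dst_w + src_w - 1) // src_w  # ceiling division
    left = (scaled_w - dst_w) // 2
    top = (scaled_h - dst_h) // 2
    return (scaled_w, scaled_h), (left, top, left + dst_w, top + dst_h)


# A 1920x1080 clip fitted into the medium 1:1 bucket (768x768).
print(fill_and_center_crop(1920, 1080, 768, 768))
# ((1366, 768), (299, 0, 1067, 768))
```

In this example roughly 44% of the frame width is cropped away, which is worth knowing if the control signal (for example a pose skeleton) sits near the left or right edge. Choosing an `aspect_ratio` close to the source footage keeps more of the frame.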
What an IC-LoRA Is
An IC-LoRA is a LoRA trained to perform an in-context transformation: instead of generating from text alone, it conditions on a reference video supplied at inference and produces a transformed result. The trained file specializes the base model for your reference→target mapping. At inference you supply a reference video (scaled the same way as during training) plus a prompt, and the LoRA produces the corresponding target.
Tips for Getting Good Results
Dataset Quality
- Use at least 10 well-aligned reference→target pairs; more variety yields a more general transformation.
- The reference and target of each pair must show the same scene and motion, differing only by the transformation you want the LoRA to learn.
- Keep frame counts matched within a pair.
Caption Best Practices
- Describe the target content plainly, optionally with a trigger phrase.
- Keep captions consistent in style across pairs so the LoRA associates the transformation, not the wording.
Good caption: `tron1ze a neon-outlined city street at night`
Weak caption: `street`
Trigger Phrases
- A distinctive trigger phrase helps cleanly invoke the transformation at inference; include it in every caption and at inference.
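A small helper can keep inference and validation prompts consistent with the trigger phrase. This is an illustrative sketch with a hypothetical function name; it only builds prompts on the client side and says nothing about how the trainer itself joins the phrase to captions.

```python
def with_trigger(prompt, trigger_phrase):
    """Return the prompt with the trigger phrase in front, without doubling it."""
    prompt = prompt.strip()
    trigger = trigger_phrase.strip()
    if not trigger:
        return prompt
    # Leave the prompt alone if it already starts with the trigger phrase.
    if prompt.lower().startswith(trigger.lower()):
        return prompt
    return f"{trigger} {prompt}"


print(with_trigger("a busy city street", "tron1ze"))
# tron1ze a busy city street
```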
Reference Scaling
- If you want to drive generation from a small / coarse control map (pose, depth, edges), set `reference_downscale_factor` above 1 and supply matching coarse references at inference.
- Use the same scaling at inference as you did at training; the validation localizer mirrors the training-time scaling for you.
Scene Splitting and Captions
Scene splitting keeps reference/target in sync, but each split inherits the pair's single caption. For precise captions, pre-split your pairs and disable scene splitting.
Inference Format Matching
At inference, supply a reference video, use the same trigger phrase, and apply the same reference scaling you trained with.
Recommended Starting Configuration
```json
{
  "training_data_url": "https://example.com/v2v_pairs.zip",
  "trigger_phrase": "tron1ze",
  "rank": 32,
  "number_of_steps": 3000,
  "learning_rate": 0.0002,
  "first_frame_conditioning_p": 0.1,
  "reference_downscale_factor": 1,
  "reference_temporal_scale_factor": 1,
  "number_of_frames": 89,
  "frame_rate": 24,
  "resolution": "medium",
  "aspect_ratio": "1:1",
  "validation": [
    {
      "prompt": "tron1ze a busy city street",
      "reference_video_url": "https://example.com/ref.mp4"
    }
  ]
}
```
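A configuration like this could be submitted from Python roughly as follows. This is a hedged sketch, not official client code: `build_arguments` and `submit_training` are hypothetical helpers, the example URLs are placeholders, and the call assumes the third-party `fal-client` package is installed and authenticated (typically through a `FAL_KEY` environment variable).

```python
ENDPOINT = "fal-ai/ltx23-trainer-v2/ic-lora/v2v"


def build_arguments(training_data_url, trigger_phrase="", **overrides):
    """Build the request arguments, starting from the recommended values."""
    arguments = {
        "training_data_url": training_data_url,
        "trigger_phrase": trigger_phrase,
        "rank": 32,
        "number_of_steps": 3000,
        "learning_rate": 0.0002,
        "first_frame_conditioning_p": 0.1,
        "reference_downscale_factor": 1,
        "reference_temporal_scale_factor": 1,
        "number_of_frames": 89,
        "frame_rate": 24,
        "resolution": "medium",
        "aspect_ratio": "1:1",
    }
    # Keyword overrides replace or extend the recommended values.
    arguments.update(overrides)
    return arguments


def submit_training(arguments):
    """Submit the job and block until it finishes (requires fal-client)."""
    import fal_client  # third-party; imported lazily so build_arguments works without it

    return fal_client.subscribe(ENDPOINT, arguments=arguments, with_logs=True)


if __name__ == "__main__":
    args = build_arguments(
        "https://example.com/v2v_pairs.zip",
        trigger_phrase="tron1ze",
        validation=[
            {
                "prompt": "tron1ze a busy city street",
                "reference_video_url": "https://example.com/ref.mp4",
            }
        ],
    )
    result = submit_training(args)
    print(result)
```

The returned object is expected to carry the fields listed under Outputs; its exact shape is not shown here, so inspect the printed result before relying on specific keys.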
Diagnosing Issues
- Transformation too weak: more steps, higher `rank`, or more consistent pairs.
- Output ignores the reference structure: ensure your pairs are well aligned; consider lowering `first_frame_conditioning_p` (already low by default) and check reference scaling matches training.
- Overfitting (previews copy training targets): fewer steps, lower `rank`, more varied pairs.
- Dataset errors: every `_start` needs a matching `_end`; frame counts within a pair must match; names must be unique.
Validation Prompt Tips
- Use a reference video that the LoRA has not seen, so previews reveal generalization.
- Match the caption style and trigger phrase you trained with.
Common Pitfalls
- Mismatched or missing `_start`/`_end` pairs.
- Reference and target that differ by more than the intended transformation.
- Using different reference scaling at inference than at training.