fal-ai/ltx23-trainer-v2/ic-lora/av2av
The cost of training depends on the number of steps. The formula is: 0.0068 * steps. With 1000 steps, your request will cost $6.80.
LTX 2.3 Trainer — Audio+Video Reference IC-LoRA (`/ic-lora/av2av`)
Overview
The `/ic-lora/av2av` endpoint trains an IC-LoRA (In-Context LoRA) for the LTX 2.3 model that performs a combined audio+video → audio+video transformation. An IC-LoRA is a small adapter that does not generate from text alone. Instead, it conditions on a reference clip supplied at inference (both its video and its audio track) and produces a corresponding target clip with generated video and generated audio. You teach it that joint mapping with pairs of clips, a reference and a target that each carry audio, and the LoRA learns to go from one to the other across both modalities at once.
Use this endpoint when you want a control-driven transformation across both modalities together, for example:
- Restyling a video and its sound at the same time.
- Applying a recurring audiovisual edit that touches both picture and audio.
- Any reference-to-(audio+video) mapping you can demonstrate with paired examples.
Key features:
- Trains an IC-LoRA from paired reference→target clips, each with audio.
- Both video and audio are generated, each conditioned on its own reference.
- Optional reference video downscaling / temporal downsampling so the LoRA can be driven by a coarse / low-FPS video control proxy.
- Validation previews transform a supplied reference clip (video + audio) into a combined video preview that carries the generated audio.
Dataset Format
Provide a single `.zip` archive (linked via `training_data_url`) of paired clips, each with an audio track:
- `<name>_start.<ext>`: the reference clip (video + audio).
- `<name>_end.<ext>`: the target clip (video + audio).
- `<name>.txt`: optional caption.
Video formats: `.mp4`, `.mov`, `.webm`, `.mkv`, `.avi`. Both the reference and the target of every pair must contain an audio track — silent clips are rejected with a clear error. Each `_start` must have a matching `_end`. File names must be unique across the archive. Aim for at least 10 pairs.
Minimum clip length: with `auto_scale_input` off (the default), both clips in each pair (`_start` and `_end`) must already have at least `number_of_frames` frames (default 89, ≈ 3.7 s at 24 fps). A pair with either side too short is skipped, and if no pair qualifies the request fails (422). Turn on `auto_scale_input` to resample shorter clips instead.
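The length rule above can be pre-checked locally before building the archive. The sketch below is only an illustration of that rule: the helper names are hypothetical, and it assumes you already know each clip's frame count (for example from your editing tool).

```python
def min_clip_seconds(number_of_frames: int = 89, frame_rate: int = 24) -> float:
    """Shortest duration that still holds number_of_frames frames."""
    return number_of_frames / frame_rate


def pair_is_usable(start_frames: int, end_frames: int,
                   number_of_frames: int = 89,
                   auto_scale_input: bool = False) -> bool:
    """Mirror the documented rule for one _start/_end pair.

    With auto_scale_input off, both clips must already have at least
    number_of_frames frames. With it on, shorter clips are resampled,
    so this sketch treats every pair as usable.
    """
    if auto_scale_input:
        return True
    return min(start_frames, end_frames) >= number_of_frames


print(round(min_clip_seconds(), 2))   # seconds needed for the default 89 frames at 24 fps
print(pair_is_usable(120, 80))        # one side is shorter than 89 frames
```

A pair that fails this check would be skipped by the trainer, so it is worth either trimming the dataset or enabling `auto_scale_input`.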
Example layout:
    sample01_start.mp4
    sample01_end.mp4
    sample01.txt
    sample02_start.mp4
    sample02_end.mp4
    sample02.txt
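Before uploading, it can help to confirm that every `_start` clip has a matching `_end`. The following is a minimal local sketch of that pairing logic, not the service's own implementation; `match_pairs` is a hypothetical helper that works on a plain list of archive entry names.

```python
from pathlib import PurePosixPath

VIDEO_EXTS = {".mp4", ".mov", ".webm", ".mkv", ".avi"}


def match_pairs(filenames):
    """Group entries by <name> and report names missing _start or _end."""
    clips, captions = {}, {}
    for entry in filenames:
        path = PurePosixPath(entry)
        # Skip hidden files and macOS metadata folders.
        if path.name.startswith(".") or "__MACOSX" in path.parts:
            continue
        stem, ext = path.stem, path.suffix.lower()
        if ext in VIDEO_EXTS:
            for role in ("start", "end"):
                suffix = f"_{role}"
                if stem.endswith(suffix):
                    clips.setdefault(stem[: -len(suffix)], {})[role] = entry
        elif ext == ".txt":
            captions[stem] = entry

    complete, incomplete = {}, []
    for name, roles in sorted(clips.items()):
        if "start" in roles and "end" in roles:
            complete[name] = {**roles, "caption": captions.get(name)}
        else:
            incomplete.append(name)
    return complete, incomplete
```

Calling it with `zipfile.ZipFile(path).namelist()` as the input list gives a quick view of which names would form complete pairs.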
Input Parameters Reference
Dataset
`training_data_url` (required)
Type: `string`
URL to the `.zip` archive of `_start`/`_end` clip pairs (with audio).
`trigger_phrase`
Type: `string`
Default: `""`
Phrase prepended to captions during training; include it at inference.
Training Parameters
`rank`
Type: `integer` (`8`, `16`, `32`, `64`, `128`)
Default: `32`
IC-LoRA capacity.
`number_of_steps`
Type: `integer`
Default: `2000` (range `100`–`20000`)
Number of optimization steps.
`learning_rate`
Type: `number`
Default: `0.0002`
Optimization step size.
`reference_downscale_factor`
Type: `integer`
Default: `1` (range `1`–`8`)
Spatially downscale the reference video by this factor before encoding (coarse / low-resolution control proxy → full-resolution output). `1` means no downscaling. Both width and height must be divisible by the factor, and width ÷ factor and height ÷ factor must each be divisible by 32 (checked against both the training and validation resolutions); an incompatible value fails the request with a 422. (Applies to the video reference only; audio has no spatial knob.)
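The divisibility rules above are easy to test offline. This is a small sketch of those checks as stated; `downscale_factor_ok` is a hypothetical helper, and you would need to run it for both the training and the validation resolution.

```python
def downscale_factor_ok(width: int, height: int, factor: int) -> bool:
    """Check the documented constraints for reference_downscale_factor."""
    if not 1 <= factor <= 8:
        return False
    # Width and height must be divisible by the factor ...
    if width % factor or height % factor:
        return False
    # ... and the downscaled sides must each be divisible by 32.
    return (width // factor) % 32 == 0 and (height // factor) % 32 == 0


# Medium 16:9 bucket (768x448) versus low 16:9 bucket (512x288), factor 2.
print(downscale_factor_ok(768, 448, 2))
print(downscale_factor_ok(512, 288, 2))
```

The second call fails because 288 ÷ 2 = 144 is not a multiple of 32, which is the kind of combination that would be rejected with a 422.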
`reference_temporal_scale_factor`
Type: `integer`
Default: `1` (range `1`–`8`)
Temporally downsample the reference video (lower FPS) by this factor before encoding. `1` means no change. `(number_of_frames − 1)` must be divisible by the factor, and after subsampling `(frames − 1)` must remain a multiple of 8 — checked against both `number_of_frames` and `validation_number_of_frames`. An incompatible factor/frame-count combination fails the request with a 422. (Example: with the default 89 frames, a factor of 2 is invalid because `(89 − 1) ÷ 2 = 44`, which is not a multiple of 8.)
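To see which factors are usable for a given frame count, the rule can be enumerated. The sketch below follows the reading implied by the example in the text, namely that `(frames − 1) ÷ factor` must be a whole number and a multiple of 8; the helper names are hypothetical.

```python
def temporal_factor_ok(frames: int, factor: int) -> bool:
    """Check one frame count against reference_temporal_scale_factor."""
    if not 1 <= factor <= 8:
        return False
    if (frames - 1) % factor:
        return False
    return ((frames - 1) // factor) % 8 == 0


def valid_temporal_factors(number_of_frames: int,
                           validation_number_of_frames: int) -> list:
    """Factors that pass for both the training and validation frame counts."""
    return [
        f for f in range(1, 9)
        if temporal_factor_ok(number_of_frames, f)
        and temporal_factor_ok(validation_number_of_frames, f)
    ]


print(valid_temporal_factors(89, 89))
print(valid_temporal_factors(65, 65))
```

Under this reading, the default 89 frames admit only a factor of 1, while 65 frames admit 1, 2, 4, and 8, so temporal downsampling generally means choosing a different frame count for both training and validation.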
Video Configuration
`number_of_frames`
Type: `integer`
Default: `89` (range `9`–`121`)
Frames per training clip. Must satisfy `frames % 8 == 1`; other values are snapped down to the nearest valid count.
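The snapping rule can be reproduced with one line of arithmetic. This is a sketch of the stated behaviour only; `snap_frame_count` is a hypothetical name, and raising on out-of-range input is a choice made here, not documented service behaviour.

```python
def snap_frame_count(frames: int) -> int:
    """Snap down to the nearest count satisfying frames % 8 == 1."""
    if not 9 <= frames <= 121:
        # Assumption: treat values outside the documented range as errors.
        raise ValueError("number_of_frames must be between 9 and 121")
    return frames - (frames - 1) % 8


for requested in (89, 90, 96, 97, 100):
    print(requested, "->", snap_frame_count(requested))
```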
`frame_rate`
Type: `integer`
Default: `24` (range `8`–`60`)
Target frames per second.
`resolution`
Type: `string` (`low`, `medium`, `high`)
Default: `medium`
| Resolution | 16:9 | 1:1 | 9:16 |
|---|---|---|---|
| low | 512×288 | 512×512 | 288×512 |
| medium | 768×448 | 768×768 | 448×768 |
| high | 960×544 | 960×960 | 544×960 |
`aspect_ratio`
Type: `string` (`16:9`, `1:1`, `9:16`)
Default: `1:1`
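The resolution table can be kept as a small lookup when scripting checks around it. The dictionary below simply transcribes the table; `bucket_size` is a hypothetical helper, and its defaults are the documented training defaults.

```python
# (width, height) per resolution preset and aspect ratio, copied from the table.
RESOLUTION_BUCKETS = {
    "low":    {"16:9": (512, 288), "1:1": (512, 512), "9:16": (288, 512)},
    "medium": {"16:9": (768, 448), "1:1": (768, 768), "9:16": (448, 768)},
    "high":   {"16:9": (960, 544), "1:1": (960, 960), "9:16": (544, 960)},
}


def bucket_size(resolution: str = "medium", aspect_ratio: str = "1:1"):
    """Return (width, height) for a resolution / aspect_ratio pair."""
    return RESOLUTION_BUCKETS[resolution][aspect_ratio]


print(bucket_size())
print(bucket_size("high", "9:16"))
```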
`auto_scale_input`
Type: `boolean`
Default: `false`
Fit videos to the target frame count and frame rate.
`split_input_into_scenes`
Type: `boolean`
Default: `false`
Off for this mode: scene splitting would desync the reference/target pair. Provide pre-split clips instead.
`split_input_duration_threshold`
Type: `number`
Default: `30.0` (range `1.0`–`60.0`)
Duration threshold for scene splitting (only relevant if splitting were enabled).
Audio Configuration
`audio_normalize`
Type: `boolean`
Default: `true`
Peak-normalize audio for consistent loudness across the dataset.
`audio_preserve_pitch`
Type: `boolean`
Default: `true`
Preserve pitch when fitting audio to video duration.
Validation
`validation`
Type: `array`
Default: `[]` (max 2 entries)
Validation samples, each an object with:
- `prompt` (`string`): the text prompt.
- `reference_video_url` (`string`, required): the reference video that conditions the generated audio+video. Its audio track also serves as the audio reference unless `reference_audio_url` is set.
- `reference_audio_url` (`string`, optional): a separate reference audio. If omitted, the audio is taken from the reference video's own track, so that video must contain an audio track; otherwise the validation sample is rejected with a 422.
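A validation list can be assembled with a couple of small helpers that enforce the limits described above. This is an illustrative sketch: the function names are hypothetical and the audio URL is a placeholder.

```python
def make_validation_entry(prompt, reference_video_url, reference_audio_url=None):
    """Build one validation sample; the audio URL is optional."""
    if not reference_video_url:
        raise ValueError("reference_video_url is required")
    entry = {"prompt": prompt, "reference_video_url": reference_video_url}
    if reference_audio_url is not None:
        entry["reference_audio_url"] = reference_audio_url
    return entry


def make_validation(entries):
    """Return the list for the `validation` field (at most 2 entries)."""
    entries = list(entries)
    if len(entries) > 2:
        raise ValueError("validation accepts at most 2 entries")
    return entries


validation = make_validation([
    make_validation_entry("av_st9le a neon-lit street",
                          "https://example.com/ref.mp4"),
    make_validation_entry("av_st9le a neon-lit street",
                          "https://example.com/ref.mp4",
                          # Placeholder URL for a separate audio reference.
                          reference_audio_url="https://example.com/ref_audio.wav"),
])
print(len(validation))
```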
`validation_negative_prompt`
Type: `string`
Default: a built-in quality negative prompt.
`validation_number_of_frames`
Type: `integer`
Default: `89` (range `9`–`121`)
`validation_frame_rate`
Type: `integer`
Default: `24` (range `8`–`60`)
`validation_resolution`
Type: `string`
Default: `high`
`validation_aspect_ratio`
Type: `string`
Default: `1:1`
`stg_scale`
Type: `number`
Default: `1.0` (range `0.0`–`3.0`)
`debug_dataset`
Type: `boolean`
Default: `false`
Return an archive of the preprocessed data for inspection.
Outputs
- `lora_file`: the trained IC-LoRA weights (`.safetensors`).
- `config_file`: JSON describing the trigger phrase and training type.
- `video`: combined validation reel; the preview video carries the generated audio track.
- `debug_dataset`: preprocessed-data archive, only when `debug_dataset` is enabled.
Billing
A successful run is billed `max(100, number_of_steps)` billable units. Requests that fail before training completes (input-validation errors / HTTP 422, or dataset-download failures) are billed 0 units.
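For budgeting, the billing rule can be combined with the per-step price quoted in the pricing note near the top of this page (0.0068 per step). The sketch below assumes that one billable unit is charged at that per-step price; that link is an assumption, not something the page states outright, and the function names are hypothetical.

```python
PRICE_PER_UNIT_USD = 0.0068  # per-step price from the pricing note (assumed per unit)


def billable_units(number_of_steps: int, succeeded: bool = True) -> int:
    """max(100, number_of_steps) on success, 0 for failed requests."""
    return max(100, number_of_steps) if succeeded else 0


def estimated_cost_usd(number_of_steps: int, succeeded: bool = True) -> float:
    """Rough cost estimate under the one-unit-per-step pricing assumption."""
    return round(billable_units(number_of_steps, succeeded) * PRICE_PER_UNIT_USD, 2)


print(estimated_cost_usd(1000))
print(estimated_cost_usd(2000))
print(estimated_cost_usd(2000, succeeded=False))
```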
How the Training Works
Pipeline Overview
- Preprocessing: the archive is extracted, `_start`/`_end` pairs and captions are matched, both clips are verified to have audio, clips are fit to the resolution bucket, and the video reference is optionally downscaled / temporally downsampled. The target's audio is extracted from its own track; the reference's audio is extracted from the reference clip.
- Training: the IC-LoRA trains for `number_of_steps`. Both modalities are generated: the video conditioned on the reference video, the audio conditioned on the reference audio. Validation previews run at intervals.
- Output: the IC-LoRA, config, and validation reel (with generated audio) are uploaded.
What Happens to Your Data
- Archive extraction: the `.zip` is unpacked; macOS metadata and hidden files are ignored.
- Pair matching: each `<name>_start` is paired with its `<name>_end` and the optional `<name>.txt`. Targets missing a reference cause a clear error.
- Audio requirement: both clips of every pair must carry audio; silent pairs are rejected.
- Video fitting: both clips are resized to fill the resolution bucket and center-cropped, staying aligned.
- Reference scaling: when set above 1, the reference video is downscaled / temporally downsampled before encoding so the LoRA learns to drive full-resolution output from a coarse reference.
- Captions: the trigger phrase (if set) is prepended.
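The video fitting step (resize to fill the bucket, then center-crop) can be pictured with a little geometry. The sketch below shows one common way to compute it; the exact rounding the service uses is not documented, so treat the numbers as approximate and the function name as hypothetical.

```python
def fill_and_center_crop(src_w: int, src_h: int, dst_w: int, dst_h: int):
    """Return (resized_w, resized_h, crop_left, crop_top).

    The source is scaled uniformly until it covers the target bucket,
    then the overflow is trimmed equally from both sides.
    """
    scale = max(dst_w / src_w, dst_h / src_h)
    # Guard against rounding leaving a side one pixel short of the bucket.
    resized_w = max(dst_w, round(src_w * scale))
    resized_h = max(dst_h, round(src_h * scale))
    crop_left = (resized_w - dst_w) // 2
    crop_top = (resized_h - dst_h) // 2
    return resized_w, resized_h, crop_left, crop_top


# A 1920x1080 clip fitted to the medium 1:1 bucket (768x768).
print(fill_and_center_crop(1920, 1080, 768, 768))
```

Because the same geometry would apply to both clips of a pair with identical source dimensions, the reference and target stay spatially aligned after cropping.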
What an IC-LoRA Is
An IC-LoRA performs an in-context transformation: it conditions on a reference clip supplied at inference and produces a transformed result, here both video and audio together. The trained file specializes the base model for your reference→target audiovisual mapping. At inference you supply a reference video (and optionally a separate reference audio) plus a prompt, and the LoRA produces the corresponding target clip with both generated video and audio.
Tips for Getting Good Results
Dataset Quality
- Use at least 10 well-aligned reference→target pairs, each with clean, in-sync audio.
- Reference and target should describe the same scene/sound differing only by the transformation you want learned.
Caption Best Practices
- Describe the target's content and sound plainly, optionally with a trigger phrase.
- Keep captions consistent in style across pairs.
Good caption: `av_st9le a neon-lit street with a synthwave soundtrack`
Weak caption: `street`
Trigger Phrases
- A distinctive trigger phrase helps invoke the transformation cleanly; include it in every caption and at inference.
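When writing validation or inference prompts, a tiny helper can keep the trigger phrase consistent. This sketch assumes the phrase is joined to the description with a single space, as in the good caption example above; the service's exact joining rule is not documented, and `build_prompt` is a hypothetical name.

```python
def build_prompt(description: str, trigger_phrase: str = "") -> str:
    """Prefix a description with the trigger phrase, without doubling it."""
    description = description.strip()
    trigger_phrase = trigger_phrase.strip()
    if not trigger_phrase or description.startswith(trigger_phrase):
        return description
    return f"{trigger_phrase} {description}"


print(build_prompt("a neon-lit street with a synthwave soundtrack", "av_st9le"))
```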
Reference Scaling
- To drive generation from a coarse video control map, set `reference_downscale_factor` above 1 and supply matching coarse references at inference; the validation localizer mirrors training-time scaling for you.
Inference Format Matching
At inference, supply a reference video (and optionally a separate reference audio), use the same trigger phrase, and apply the same reference scaling you trained with.
Recommended Starting Configuration
    {
      "training_data_url": "https://example.com/av2av_pairs.zip",
      "trigger_phrase": "av_st9le",
      "rank": 32,
      "number_of_steps": 2000,
      "learning_rate": 0.0002,
      "reference_downscale_factor": 1,
      "reference_temporal_scale_factor": 1,
      "number_of_frames": 89,
      "frame_rate": 24,
      "resolution": "medium",
      "aspect_ratio": "1:1",
      "validation": [
        {
          "prompt": "av_st9le a neon-lit street",
          "reference_video_url": "https://example.com/ref.mp4"
        }
      ]
    }
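A configuration like the one above can be sanity-checked against the documented ranges before it is submitted. The following is a minimal local sketch, not the service's validator; `check_payload` is a hypothetical helper and covers only the ranges and choices listed in the parameter reference.

```python
RANGES = {
    "number_of_steps": (100, 20000),
    "reference_downscale_factor": (1, 8),
    "reference_temporal_scale_factor": (1, 8),
    "number_of_frames": (9, 121),
    "frame_rate": (8, 60),
}
CHOICES = {
    "rank": {8, 16, 32, 64, 128},
    "resolution": {"low", "medium", "high"},
    "aspect_ratio": {"16:9", "1:1", "9:16"},
}


def check_payload(payload: dict) -> list:
    """Return a list of problems; an empty list means none were found."""
    problems = []
    if not payload.get("training_data_url"):
        problems.append("training_data_url is required")
    for key, (low, high) in RANGES.items():
        if key in payload and not low <= payload[key] <= high:
            problems.append(f"{key} must be between {low} and {high}")
    for key, allowed in CHOICES.items():
        if key in payload and payload[key] not in allowed:
            problems.append(f"{key} must be one of {sorted(allowed)}")
    if len(payload.get("validation", [])) > 2:
        problems.append("validation accepts at most 2 entries")
    return problems


payload = {
    "training_data_url": "https://example.com/av2av_pairs.zip",
    "trigger_phrase": "av_st9le",
    "rank": 32,
    "number_of_steps": 2000,
    "learning_rate": 0.0002,
    "reference_downscale_factor": 1,
    "reference_temporal_scale_factor": 1,
    "number_of_frames": 89,
    "frame_rate": 24,
    "resolution": "medium",
    "aspect_ratio": "1:1",
    "validation": [
        {"prompt": "av_st9le a neon-lit street",
         "reference_video_url": "https://example.com/ref.mp4"}
    ],
}
print(check_payload(payload))
```

The resulting dictionary is what you would pass as the request arguments to whichever client you use to call the endpoint.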
Diagnosing Issues
- Transformation too weak: more steps, higher `rank`, or more consistent pairs.
- Audio or video ignores the reference: ensure pairs are well aligned and differ only by the intended transformation; check reference scaling matches at inference.
- Dataset rejected for silence: both clips of every pair must contain audio.
- Overfitting: fewer steps, lower `rank`, more pairs.
Validation Prompt Tips
- Use a fresh reference clip to gauge generalization.
- Match the caption style and trigger phrase you trained with.
Common Pitfalls
- Silent reference or target clips (rejected).
- Missing or mismatched `_start`/`_end` pairs.
- Using different reference scaling at inference than at training.
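Clips without an audio track, the first pitfall above, can be caught locally before upload. The sketch below assumes `ffprobe` (part of FFmpeg) is installed and on your `PATH`; the helper names are hypothetical. It only checks that an audio stream exists, which may differ from the exact check the service performs.

```python
import json
import subprocess


def ffprobe_command(path: str) -> list:
    """Command that prints a clip's stream metadata as JSON."""
    return ["ffprobe", "-v", "error", "-show_streams", "-of", "json", path]


def has_audio_stream(ffprobe_output: str) -> bool:
    """True if the ffprobe JSON lists at least one audio stream."""
    streams = json.loads(ffprobe_output).get("streams", [])
    return any(s.get("codec_type") == "audio" for s in streams)


def clip_has_audio(path: str) -> bool:
    """Run ffprobe on a local file and look for an audio stream."""
    result = subprocess.run(ffprobe_command(path),
                            capture_output=True, text=True, check=True)
    return has_audio_stream(result.stdout)
```

Running `clip_has_audio` over every `_start` and `_end` file gives an early warning about pairs that would otherwise cause the dataset to be rejected.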