
fal-ai/qwen-image-2512-trainer-v2

Fast LoRA trainer for Qwen-Image-2512
Training
Commercial use

Your request will cost $0.00095 per step (minimum of 500 steps is charged). For $0.95 you can fine-tune a LoRA for 1000 steps.
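The billing rule above can be sketched as a small helper (a minimal illustration of the stated pricing, not an official fal utility):

```python
def training_cost_usd(steps: int,
                      price_per_step: float = 0.00095,
                      min_billed_steps: int = 500) -> float:
    """Estimate the training charge: at least 500 steps are always billed."""
    return max(steps, min_billed_steps) * price_per_step

print(f"{training_cost_usd(1000):.2f}")  # prints 0.95
print(f"{training_cost_usd(200):.4f}")   # billed as 500 steps: prints 0.4750
```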


Qwen-Image-2512 Trainer V2

Fine-tunes the Qwen-Image-2512 model using LoRA to teach it new subjects, objects, or styles from your images and captions.

Input Parameters

`image_data_url` (required)

URL to a zip archive containing image + caption pairs. Each image (e.g., `photo.jpg`) should have a matching caption file (e.g., `photo.txt`).
Supported formats: `.jpg`, `.jpeg`, `.png`, `.webp`, `.heic`, `.heif`, `.avif`, `.bmp`, `.psd`, `.tiff`

`default_caption` (default = `null`)

Fallback caption for images without a `.txt` file. If not set and a caption is missing, training fails.

`steps` (default = `2000`)

The ideal number of training steps depends on the difficulty of your task.
Simple tasks (e.g., learning a basic object or loose style) often work well with 1000–2000 steps.
Complex tasks (e.g., accurate character likeness, detailed faces, or precise artistic styles) usually require more steps — commonly 2000–4000+.
Tip: Start with the default (2000) and experiment. Train a few versions with different step counts, then compare the results. More steps can improve learning but may cause overfitting if your dataset is small or not diverse enough.

`learning_rate` (default = `5e-4`)

Use `1e-4` for slower/conservative learning, `1e-3` for faster/aggressive learning.
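Putting the parameters together, a request payload might look like the following (the dataset URL is a placeholder and `default_caption` is optional):

```json
{
  "image_data_url": "https://example.com/my-dataset.zip",
  "default_caption": "a photo of ohwx person",
  "steps": 2000,
  "learning_rate": 0.0005
}
```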


How the Training Works

Image-caption pairing: Each image (e.g., `photo.jpg`) is paired with its caption file (`photo.txt` or `photo.TXT`).

Aspect ratio bucketing: Images are assigned to the nearest bucket matching their aspect ratio, preserving natural proportions.

| Bucket (H×W) | AR | Orientation |
|---|---|---|
| 1344×576 | 3:7 | Portrait |
| 1280×720 | 9:16 | Portrait |
| 1248×832 | 2:3 | Portrait |
| 1152×864 | 3:4 | Portrait |
| 1152×896 | 7:9 | Portrait |
| 1024×1024 | 1:1 | Square |
| 896×1152 | 9:7 | Landscape |
| 864×1152 | 4:3 | Landscape |
| 832×1248 | 3:2 | Landscape |
| 720×1280 | 16:9 | Landscape |
| 576×1344 | 7:3 | Landscape |
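Bucket assignment can be sketched as a nearest-aspect-ratio lookup over the table above (an illustrative approximation; the service's exact assignment rule may differ):

```python
BUCKETS = [  # (height, width), from the bucket table
    (1344, 576), (1280, 720), (1248, 832), (1152, 864), (1152, 896),
    (1024, 1024),
    (896, 1152), (864, 1152), (832, 1248), (720, 1280), (576, 1344),
]

def nearest_bucket(width: int, height: int) -> tuple[int, int]:
    """Return the (H, W) bucket whose aspect ratio is closest to the image's."""
    ar = width / height
    return min(BUCKETS, key=lambda b: abs(b[1] / b[0] - ar))

print(nearest_bucket(1920, 1080))  # 16:9 image -> (720, 1280)
```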

Caption dropout: 5% of the time, captions are dropped during training to help the model generalize.
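In code, caption dropout amounts to occasionally training on an empty caption; this sketch shows the idea (the 5% rate comes from the text; the function is hypothetical, not part of the trainer's API):

```python
import random

DROPOUT_P = 0.05  # captions are dropped ~5% of the time, per the docs

def maybe_drop_caption(caption: str, rng: random.Random) -> str:
    """Replace the caption with an empty string with probability DROPOUT_P,
    so the model also learns from the images alone."""
    return "" if rng.random() < DROPOUT_P else caption
```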


Tips for Good Results

Dataset
  • Size: 5–10 images minimum; 15–30 is optimal for subjects, 20–50 for styles.
  • Quality: High-resolution images (1024px+) with good lighting and no watermarks.
  • Variety: Different poses, angles, lighting, and backgrounds for better results.
Captions

Write clear, descriptive captions in a consistent style:

  • Good: `a golden retriever sitting on a red couch, soft afternoon lighting`
  • Bad: `dog`

Tip: Use an image captioning model like Moondream-3 to generate captions automatically, then review them for accuracy and consistency.

Trigger phrases: For subject training, include a unique trigger word or phrase in every caption (e.g., `a photo of ohwx person`). Use the same trigger when generating new images later.
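When preparing caption files, the trigger phrase can be added programmatically; this sketch uses the `ohwx` example from above (the helper is illustrative, not part of the trainer):

```python
def add_trigger(caption: str, trigger: str = "ohwx person") -> str:
    """Prefix a unique trigger phrase if the caption doesn't already contain it.
    'ohwx' is a rare token example; any uncommon word works as a trigger."""
    if trigger in caption:
        return caption
    return f"a photo of {trigger}, {caption}"

print(add_trigger("sitting on a red couch, soft afternoon lighting"))
# prints: a photo of ohwx person, sitting on a red couch, soft afternoon lighting
```

Use the same trigger phrase in your prompts when generating images with the trained LoRA.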

Troubleshooting
| Problem | Signs | Solution |
|---|---|---|
| Overfitting | Outputs match training images exactly, poor generalization | Reduce `steps`, add more varied images |
| Underfitting | Outputs barely resemble your training data | Increase `steps`, improve captions or dataset quality |
| Caption errors | Training fails | Add missing `.txt` files or set `default_caption` |