
Qwen-Image-2512 Trainer V2

`fal-ai/qwen-image-2512-trainer-v2`

Fast LoRA trainer for Qwen-Image-2512. Training · Commercial use

Your request will cost $0.00095 per step (a minimum of 500 steps is charged). For $0.95 you can fine-tune a LoRA for 1000 steps.
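
The billing rule can be sketched in a few lines (a back-of-envelope helper, not an official SDK function):

```python
def training_cost(steps: int, price_per_step: float = 0.00095,
                  min_billed_steps: int = 500) -> float:
    """USD cost: $0.00095 per step, with a 500-step minimum billed."""
    return max(steps, min_billed_steps) * price_per_step

print(training_cost(1000))  # about $0.95
print(training_cost(200))   # billed as 500 steps, about $0.475
```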

Fine-tunes the Qwen-Image-2512 model using LoRA to teach it new subjects, objects, or styles from your images and captions.
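
A minimal sketch of submitting a training job with the fal Python client (assuming the `fal-client` package and its `subscribe` helper). The argument names mirror the input parameters documented below; the dataset URL is a placeholder, and the call itself is commented out because it requires `FAL_KEY` credentials.

```python
# Request payload for fal-ai/qwen-image-2512-trainer-v2; keys follow
# the parameters documented on this page. The URL is a placeholder.
arguments = {
    "image_data_url": "https://example.com/my-dataset.zip",
    "steps": 1000,
    "learning_rate": 5e-4,
    "default_caption": "a photo of ohwx person",
}

# Requires credentials, so the actual call is left commented out:
# import fal_client  # pip install fal-client
# result = fal_client.subscribe("fal-ai/qwen-image-2512-trainer-v2",
#                               arguments=arguments)
```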

Input Parameters

`image_data_url` (required)

URL to a zip archive containing image + caption pairs. Each image (e.g., `photo.jpg`) should have a matching caption file (e.g., `photo.txt`).

Supported formats: `.jpg`, `.jpeg`, `.png`, `.webp`, `.heic`, `.heif`, `.avif`, `.bmp`, `.psd`, `.tiff`
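
Packaging the archive is straightforward; here is one way to zip a folder of image + caption pairs (an illustrative helper, not part of the API):

```python
import zipfile
from pathlib import Path

# Supported image extensions, per the list above.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp", ".heic", ".heif",
              ".avif", ".bmp", ".psd", ".tiff"}

def build_dataset_zip(image_dir: str, out_path: str = "dataset.zip") -> str:
    """Zip each image together with its matching caption file,
    e.g. photo.jpg alongside photo.txt."""
    with zipfile.ZipFile(out_path, "w") as zf:
        for img in sorted(Path(image_dir).iterdir()):
            if img.suffix.lower() not in IMAGE_EXTS:
                continue
            zf.write(img, img.name)
            caption = img.with_suffix(".txt")
            if caption.exists():
                zf.write(caption, caption.name)
    return out_path
```

Upload the resulting zip anywhere publicly reachable and pass its URL as `image_data_url`.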

`default_caption`

Default: `null`

Fallback caption for images without a `.txt` file. If not set and a caption is missing, training fails.

`steps`

Default: `2000`

| Dataset Size | Recommended Steps |
|---|---|
| 5-10 images | 500-1000 |
| 10-30 images | 1000-2000 |
| 30-100 images | 2000-4000 |
| 100+ images | 4000+ |

`learning_rate`

Default: `5e-4`

Use `1e-4` for slower/conservative learning, `1e-3` for faster/aggressive learning.

How the Training Works

Image-caption pairing: Each image (e.g., `photo.jpg`) is paired with its caption file (`photo.txt` or `photo.TXT`).

Aspect ratio bucketing: Images are assigned to the nearest bucket matching their aspect ratio, preserving natural proportions.

| Bucket (H×W) | AR (W:H) | Orientation |
|---|---|---|
| 1344×576 | 3:7 | Portrait |
| 1280×720 | 9:16 | Portrait |
| 1248×832 | 2:3 | Portrait |
| 1152×864 | 3:4 | Portrait |
| 1152×896 | 7:9 | Portrait |
| 1024×1024 | 1:1 | Square |
| 896×1152 | 9:7 | Landscape |
| 864×1152 | 4:3 | Landscape |
| 832×1248 | 3:2 | Landscape |
| 720×1280 | 16:9 | Landscape |
| 576×1344 | 7:3 | Landscape |
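
The exact assignment rule isn't documented; one plausible sketch picks the bucket whose aspect ratio is closest in log space:

```python
import math

# Bucket list reproduced from the table above, as (height, width).
BUCKETS = [
    (1344, 576), (1280, 720), (1248, 832), (1152, 864), (1152, 896),
    (1024, 1024),
    (896, 1152), (864, 1152), (832, 1248), (720, 1280), (576, 1344),
]

def nearest_bucket(height: int, width: int) -> tuple[int, int]:
    """Assign an image to the bucket with the closest aspect ratio
    (log-ratio distance; the trainer's actual metric may differ)."""
    target = math.log(width / height)
    return min(BUCKETS, key=lambda hw: abs(math.log(hw[1] / hw[0]) - target))
```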

Caption dropout: 5% of the time, captions are dropped during training to help the model generalize.
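
Caption dropout is the standard trick of occasionally training on an empty caption; a minimal sketch:

```python
import random

def maybe_drop_caption(caption: str, dropout_prob: float = 0.05,
                       rng=random) -> str:
    """With probability dropout_prob (5% here), train on an empty
    caption so the model also learns unconditioned generation."""
    return "" if rng.random() < dropout_prob else caption
```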

Tips for Good Results

Dataset
  • Size: 5-10 images minimum, 15-30 optimal for subjects, 20-50 for styles
  • Quality: High resolution (1024px+), good lighting, no watermarks
  • Variety: Different poses, angles, lighting, backgrounds
Captions

Write specific, descriptive captions with consistent style:

  • Good: `a golden retriever sitting on a red couch, soft afternoon lighting`
  • Bad: `dog`

Tip: Use an image-captioning model such as Moondream-3 to generate captions automatically.

Trigger phrases: For subject training, use a unique trigger in all captions (e.g., `a photo of ohwx person standing in a park`). Include the same trigger at inference time.
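
A tiny helper can enforce the trigger across a caption set (illustrative; `ohwx` is just a common convention for a rare token):

```python
TRIGGER = "ohwx person"  # any rare token works as a trigger

def with_trigger(caption: str, trigger: str = TRIGGER) -> str:
    """Ensure every training caption contains the trigger phrase."""
    return caption if trigger in caption else f"a photo of {trigger}, {caption}"
```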

Troubleshooting

| Problem | Signs | Solution |
|---|---|---|
| Overfitting | Outputs match training images exactly, poor generalization | Reduce `steps`, add more images |
| Underfitting | Outputs don't resemble training data | Increase `steps`, improve captions |
| Caption errors | Training fails | Add `.txt` files or set `default_caption` |