fal-ai/qwen-image-2512-trainer-v2
Your request will cost $0.00095 per step (minimum of 500 steps is charged). For $0.95 you can fine-tune a LoRA for 1000 steps.
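The pricing rule above can be sketched as a small helper. This is just arithmetic from the stated prices (names and signature are hypothetical, not part of the API):

```python
def training_cost_usd(steps: int, price_per_step: float = 0.00095, min_steps: int = 500) -> float:
    """Estimated cost: $0.00095 per step, with a minimum of 500 steps billed."""
    return max(steps, min_steps) * price_per_step
```

For example, a 1000-step run costs `training_cost_usd(1000)` = $0.95, and any run under 500 steps is billed at the 500-step minimum ($0.475).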
Qwen-Image-2512 Trainer V2
Fine-tunes the Qwen-Image-2512 model using LoRA to teach it new subjects, objects, or styles from your images and captions.
Input Parameters
`image_data_url` (required)
URL to a zip archive containing image + caption pairs. Each image (e.g., `photo.jpg`) should have a matching caption file (e.g., `photo.txt`).
Supported formats: `.jpg`, `.jpeg`, `.png`, `.webp`, `.heic`, `.heif`, `.avif`, `.bmp`, `.psd`, `.tiff`
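A minimal sketch of packaging a dataset into the expected layout: each supported image at the archive root, alongside a same-named `.txt` caption when one exists. The function name and signature are illustrative, not part of the trainer:

```python
import zipfile
from pathlib import Path

# Supported image extensions, per the docs above.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp", ".heic", ".heif",
              ".avif", ".bmp", ".psd", ".tiff"}

def build_dataset_zip(image_dir: str, out_path: str = "dataset.zip") -> None:
    """Zip every supported image together with its same-named .txt caption."""
    with zipfile.ZipFile(out_path, "w") as zf:
        for img in sorted(Path(image_dir).iterdir()):
            if img.suffix.lower() not in IMAGE_EXTS:
                continue
            zf.write(img, img.name)          # e.g., photo.jpg
            caption = img.with_suffix(".txt")
            if caption.exists():
                zf.write(caption, caption.name)  # e.g., photo.txt
```

Upload the resulting zip anywhere publicly reachable and pass its URL as `image_data_url`.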
`default_caption` (default = `null`)
Fallback caption for images without a `.txt` file. If not set and a caption is missing, training fails.
`steps` (default = `2000`)
The ideal number of training steps depends on the difficulty of your task.
Simple tasks (e.g., learning a basic object or loose style) often work well with 1000–2000 steps.
Complex tasks (e.g., accurate character likeness, detailed faces, or precise artistic styles) usually require more steps — commonly 2000–4000+.
Tip: Start with the default (2000) and experiment. Train a few versions with different step counts, then compare the results. More steps can improve learning but may cause overfitting if your dataset is small or not diverse enough.
`learning_rate` (default = `5e-4`)
Use `1e-4` for slower/conservative learning, `1e-3` for faster/aggressive learning.
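Putting the parameters together, a request payload might look like the following. The endpoint id matches this page; the (commented-out) client call assumes fal's standard Python client (`pip install fal-client`), and the dataset URL and caption are placeholders:

```python
# Example trainer input, using the documented parameters and defaults.
arguments = {
    "image_data_url": "https://example.com/dataset.zip",  # placeholder URL
    "steps": 2000,              # default; raise for complex subjects
    "learning_rate": 5e-4,      # default; 1e-4 conservative, 1e-3 aggressive
    "default_caption": "a photo of ohwx person",
}

# import fal_client
# result = fal_client.subscribe("fal-ai/qwen-image-2512-trainer-v2", arguments=arguments)
```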
How the Training Works
Image-caption pairing: Each image (e.g., `photo.jpg`) is paired with its caption file (`photo.txt` or `photo.TXT`).
Aspect ratio bucketing: Images are assigned to the nearest bucket matching their aspect ratio, preserving natural proportions.
| Bucket (H×W) | Aspect ratio (W:H) | Orientation |
|---|---|---|
| 1344×576 | 3:7 | Portrait |
| 1280×720 | 9:16 | Portrait |
| 1248×832 | 2:3 | Portrait |
| 1152×864 | 3:4 | Portrait |
| 1152×896 | 7:9 | Portrait |
| 1024×1024 | 1:1 | Square |
| 896×1152 | 9:7 | Landscape |
| 864×1152 | 4:3 | Landscape |
| 832×1248 | 3:2 | Landscape |
| 720×1280 | 16:9 | Landscape |
| 576×1344 | 7:3 | Landscape |
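Bucket assignment can be sketched as a nearest-aspect-ratio lookup over the table above. The trainer's exact distance metric isn't documented; comparing log aspect ratios is a common choice and is assumed here:

```python
import math

# Buckets from the table above, as (height, width).
BUCKETS = [
    (1344, 576), (1280, 720), (1248, 832), (1152, 864), (1152, 896),
    (1024, 1024),
    (896, 1152), (864, 1152), (832, 1248), (720, 1280), (576, 1344),
]

def nearest_bucket(width: int, height: int) -> tuple[int, int]:
    """Pick the bucket whose aspect ratio is closest (in log space) to the image's."""
    target = math.log(width / height)
    return min(BUCKETS, key=lambda hw: abs(math.log(hw[1] / hw[0]) - target))
```

For example, a 1920×1080 photo maps to the 720×1280 (16:9 landscape) bucket.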
Caption dropout: 5% of the time, captions are dropped during training to help the model generalize.
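Caption dropout amounts to replacing the caption with an empty string at a fixed rate; a minimal sketch (helper name and signature are illustrative):

```python
import random

def maybe_drop_caption(caption: str, rate: float = 0.05, rng=None) -> str:
    """With probability `rate`, train on an empty caption instead of the real one."""
    rng = rng or random
    return "" if rng.random() < rate else caption
```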
Tips for Good Results
Dataset
- Size: 5–10 images minimum; 15–30 is optimal for subjects, 20–50 for styles.
- Quality: High-resolution images (1024px+) with good lighting and no watermarks.
- Variety: Different poses, angles, lighting, and backgrounds for better results.
Captions
Write clear, descriptive captions in a consistent style:
- Good: `a golden retriever sitting on a red couch, soft afternoon lighting`
- Bad: `dog`
Tip: Use an image captioning model (e.g., Moondream-3) to generate captions automatically.
Trigger phrases: For subject training, include a unique trigger word or phrase in every caption (e.g., `a photo of ohwx person`). Use the same trigger when generating new images later.
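Since the trigger must appear in every caption, a quick pre-flight check is easy to write. This is a hypothetical helper, not part of the trainer:

```python
def missing_trigger(captions: dict[str, str], trigger: str = "ohwx person") -> list[str]:
    """Return caption filenames that do not contain the trigger phrase."""
    return [name for name, text in captions.items() if trigger not in text]
```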
Troubleshooting
| Problem | Signs | Solution |
|---|---|---|
| Overfitting | Outputs match training images exactly, poor generalization | Reduce `steps`, add more varied images |
| Underfitting | Outputs barely resemble your training data | Increase `steps`, improve captions or dataset quality |
| Caption errors | Training fails | Add missing `.txt` files or set `default_caption` |
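The "Caption errors" row can be avoided entirely with a local pre-flight scan of the archive before uploading. A sketch, matching the documented case-insensitive `.txt` pairing (function name is illustrative):

```python
import zipfile
from pathlib import PurePosixPath

# Supported image extensions, per the docs above.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp", ".heic", ".heif",
              ".avif", ".bmp", ".psd", ".tiff"}

def images_missing_captions(zip_path) -> list[str]:
    """List images in the archive with no matching .txt (case-insensitive) caption."""
    names = zipfile.ZipFile(zip_path).namelist()
    txt_stems = {PurePosixPath(n).stem.lower() for n in names if n.lower().endswith(".txt")}
    return [
        n for n in names
        if PurePosixPath(n).suffix.lower() in IMAGE_EXTS
        and PurePosixPath(n).stem.lower() not in txt_stems
    ]
```

If this returns any names, either add the missing `.txt` files or set `default_caption`.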