fal-ai/qwen-image-2512-trainer-v2
Your request will cost $0.00095 per step (minimum of 500 steps is charged). For $0.95 you can fine-tune a LoRA for 1000 steps.
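The pricing rule above can be sketched as a small helper. This is just arithmetic from the stated prices (names and signature are hypothetical, not part of the API):

```python
def training_cost_usd(steps: int, price_per_step: float = 0.00095, min_steps: int = 500) -> float:
    """Estimated cost: $0.00095 per step, with a minimum of 500 steps billed."""
    return max(steps, min_steps) * price_per_step
```

For example, a 1000-step run costs `training_cost_usd(1000)` = $0.95, and any run under 500 steps is billed at the 500-step minimum ($0.475).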
Qwen-Image-2512 Trainer V2
Fine-tunes the Qwen-Image-2512 model using LoRA to teach it new subjects, objects, or styles from your images and captions.
Input Parameters
`image_data_url` (required)
URL to a zip archive containing image + caption pairs. Each image (e.g., `photo.jpg`) should have a matching caption file (e.g., `photo.txt`).
Supported formats: `.jpg`, `.jpeg`, `.png`, `.webp`, `.heic`, `.heif`, `.avif`, `.bmp`, `.psd`, `.tiff`
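A minimal sketch of packaging a dataset into the expected layout: each supported image at the archive root, alongside a same-named `.txt` caption when one exists. The function name and signature are illustrative, not part of the trainer:

```python
import zipfile
from pathlib import Path

# Supported image extensions, per the docs above.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp", ".heic", ".heif",
              ".avif", ".bmp", ".psd", ".tiff"}

def build_dataset_zip(image_dir: str, out_path: str = "dataset.zip") -> None:
    """Zip every supported image together with its same-named .txt caption."""
    with zipfile.ZipFile(out_path, "w") as zf:
        for img in sorted(Path(image_dir).iterdir()):
            if img.suffix.lower() not in IMAGE_EXTS:
                continue
            zf.write(img, img.name)          # e.g., photo.jpg
            caption = img.with_suffix(".txt")
            if caption.exists():
                zf.write(caption, caption.name)  # e.g., photo.txt
```

Upload the resulting zip anywhere publicly reachable and pass its URL as `image_data_url`.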
`default_caption` (default = `null`)
Fallback caption for images without a `.txt` file. If not set and a caption is missing, training fails.
`steps` (default = `2000`)
The ideal number of training steps depends on the difficulty of your task.
Simple tasks (e.g., learning a basic object or loose style) often work well with 1000–2000 steps.
Complex tasks (e.g., accurate character likeness, detailed faces, or precise artistic styles) usually require more steps — commonly 2000–4000+.
Tip: Start with the default (2000) and experiment. Train a few versions with different step counts, then compare the results. More steps can improve learning but may cause overfitting if your dataset is small or not diverse enough.
`learning_rate` (default = `5e-4`)
Use `1e-4` for slower/conservative learning, `1e-3` for faster/aggressive learning.
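Putting the parameters together, a request payload might look like the following. The endpoint id matches this page; the (commented-out) client call assumes fal's standard Python client (`pip install fal-client`), and the dataset URL and caption are placeholders:

```python
# Example trainer input, using the documented parameters and defaults.
arguments = {
    "image_data_url": "https://example.com/dataset.zip",  # placeholder URL
    "steps": 2000,              # default; raise for complex subjects
    "learning_rate": 5e-4,      # default; 1e-4 conservative, 1e-3 aggressive
    "default_caption": "a photo of ohwx person",
}

# import fal_client
# result = fal_client.subscribe("fal-ai/qwen-image-2512-trainer-v2", arguments=arguments)
```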
How the Training Works
Image-caption pairing: Each image (e.g., `photo.jpg`) is paired with its caption file (`photo.txt` or `photo.TXT`).
Aspect ratio bucketing: Images are assigned to the nearest bucket matching their aspect ratio, preserving natural proportions.
| Bucket (H×W) | Aspect ratio (W:H) | Orientation |
|---|---|---|
| 1344×576 | 3:7 | Portrait |
| 1280×720 | 9:16 | Portrait |
| 1248×832 | 2:3 | Portrait |
| 1152×864 | 3:4 | Portrait |
| 1152×896 | 7:9 | Portrait |
| 1024×1024 | 1:1 | Square |
| 896×1152 | 9:7 | Landscape |
| 864×1152 | 4:3 | Landscape |
| 832×1248 | 3:2 | Landscape |
| 720×1280 | 16:9 | Landscape |
| 576×1344 | 7:3 | Landscape |
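Bucket assignment can be sketched as a nearest-aspect-ratio lookup over the table above. The trainer's exact distance metric isn't documented; comparing log aspect ratios is a common choice and is assumed here:

```python
import math

# Buckets from the table above, as (height, width).
BUCKETS = [
    (1344, 576), (1280, 720), (1248, 832), (1152, 864), (1152, 896),
    (1024, 1024),
    (896, 1152), (864, 1152), (832, 1248), (720, 1280), (576, 1344),
]

def nearest_bucket(width: int, height: int) -> tuple[int, int]:
    """Pick the bucket whose aspect ratio is closest (in log space) to the image's."""
    target = math.log(width / height)
    return min(BUCKETS, key=lambda hw: abs(math.log(hw[1] / hw[0]) - target))
```

For example, a 1920×1080 photo maps to the 720×1280 (16:9 landscape) bucket.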
Caption dropout: 5% of the time, captions are dropped during training to help the model generalize.
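Caption dropout amounts to replacing the caption with an empty string at a fixed rate; a minimal sketch (helper name and signature are illustrative):

```python
import random

def maybe_drop_caption(caption: str, rate: float = 0.05, rng=None) -> str:
    """With probability `rate`, train on an empty caption instead of the real one."""
    rng = rng or random
    return "" if rng.random() < rate else caption
```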
Tips for Good Results
Dataset
- Size: 5–10 images minimum; 15–30 is optimal for subjects, 20–50 for styles.
- Quality: High-resolution images (1024px+) with good lighting and no watermarks.
- Variety: Different poses, angles, lighting, and backgrounds for better results.
Captions
Write clear, descriptive captions in a consistent style:
- Good: `a golden retriever sitting on a red couch, soft afternoon lighting`
- Bad: `dog`
Tip: Use an image captioning model (e.g., Moondream-3) to generate captions automatically.
Trigger phrases: For subject training, include a unique trigger word or phrase in every caption (e.g., `a photo of ohwx person`). Use the same trigger when generating new images later.
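Since the trigger must appear in every caption, a quick pre-flight check is easy to write. This is a hypothetical helper, not part of the trainer:

```python
def missing_trigger(captions: dict[str, str], trigger: str = "ohwx person") -> list[str]:
    """Return caption filenames that do not contain the trigger phrase."""
    return [name for name, text in captions.items() if trigger not in text]
```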
Troubleshooting
| Problem | Signs | Solution |
|---|---|---|
| Overfitting | Outputs match training images exactly, poor generalization | Reduce `steps`, add more varied images |
| Underfitting | Outputs barely resemble your training data | Increase `steps`, improve captions or dataset quality |
| Caption errors | Training fails | Add missing `.txt` files or set `default_caption` |
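The "Caption errors" row can be avoided entirely with a local pre-flight scan of the archive before uploading. A sketch, matching the documented case-insensitive `.txt` pairing (function name is illustrative):

```python
import zipfile
from pathlib import PurePosixPath

# Supported image extensions, per the docs above.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp", ".heic", ".heif",
              ".avif", ".bmp", ".psd", ".tiff"}

def images_missing_captions(zip_path) -> list[str]:
    """List images in the archive with no matching .txt (case-insensitive) caption."""
    names = zipfile.ZipFile(zip_path).namelist()
    txt_stems = {PurePosixPath(n).stem.lower() for n in names if n.lower().endswith(".txt")}
    return [
        n for n in names
        if PurePosixPath(n).suffix.lower() in IMAGE_EXTS
        and PurePosixPath(n).stem.lower() not in txt_stems
    ]
```

If this returns any names, either add the missing `.txt` files or set `default_caption`.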