Wan 2.6 Prompt Guide: Mastering Three Generation Modes


Wan 2.6 delivers video generation across three modes: text-to-video, image-to-video, and reference-to-video. Each requires specific prompt structures and techniques for optimal output quality.

Last updated: 12/17/2025
Edited by: Brad Rose
Read time: 5 minutes
Prompt Engineering Across All Modes

Wan 2.6 introduces three distinct video generation approaches on fal, each with specific prompting requirements. Text-to-video generates from descriptions alone, image-to-video animates static images, and reference-to-video maintains subject consistency across new contexts.

The difference between adequate and exceptional results comes down to prompt structure. Generic descriptions produce generic videos. Precise prompts with timing markers, camera movements, and style descriptors generate cinematic output. This guide covers the technical requirements and practical techniques for each mode.

Text-to-Video Mode

Text-to-video accepts prompts up to 800 characters and supports resolutions of 720p or 1080p at durations of 5, 10, or 15 seconds.

Prompt Structure

Effective prompts contain two components:

  1. Global style description (lighting, quality, aesthetic)
  2. Individual shot descriptions with timing brackets

Example:

A cinematic journey through ancient ruins at sunset. Photoreal, 4K, film grain.

Shot 1 [0-3s] Wide establishing shot of stone pillars with sunlight streaming through.
Shot 2 [3-7s] Camera tracks forward through an archway revealing a hidden chamber.
Shot 3 [7-10s] Close-up of ancient inscriptions as dust particles float in light beams.
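
The two-part structure above can be assembled programmatically. The following is a minimal sketch, not part of any fal library: the Shot class and build_prompt function are hypothetical helpers, and the only fact taken from this guide is the 800-character prompt limit.

```python
from dataclasses import dataclass

MAX_PROMPT_CHARS = 800  # text-to-video prompt limit stated in this guide


@dataclass
class Shot:
    """One timed shot (hypothetical helper type, not a fal API object)."""
    start: int        # start time in seconds
    end: int          # end time in seconds
    description: str  # camera action plus scene elements


def build_prompt(style: str, shots: list[Shot]) -> str:
    """Join a global style line and numbered, timed shot lines."""
    lines = [style.strip(), ""]
    for number, shot in enumerate(shots, start=1):
        lines.append(
            f"Shot {number} [{shot.start}-{shot.end}s] {shot.description.strip()}"
        )
    prompt = "\n".join(lines)
    if len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError(
            f"prompt is {len(prompt)} characters; limit is {MAX_PROMPT_CHARS}"
        )
    return prompt


if __name__ == "__main__":
    print(build_prompt(
        "A cinematic journey through ancient ruins at sunset. Photoreal, 4K, film grain.",
        [
            Shot(0, 3, "Wide establishing shot of stone pillars."),
            Shot(3, 7, "Camera tracks forward through an archway."),
        ],
    ))
```

Failing early on an over-long prompt keeps the check on the client side, before any generation request is made.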

Multi-Shot Formatting

The multi_shots parameter (enabled by default) allows segmented narratives within a single generation. Structure each shot with:

  • Timing indicators: [0-3s], [3-7s], etc.
  • Camera action: push, pull, pan, orbit, track
  • Scene elements: subject position, lighting changes, environmental details

Maintain continuity between shots by referencing consistent elements (characters, locations, objects). Abrupt changes between unrelated shots produce disjointed results.
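
A quick sanity check on timing brackets can catch pacing mistakes before a generation is spent. This sketch assumes shots should start at 0s, follow one another without gaps, and end at the video duration, as in the example earlier in this guide. That rule is a working assumption rather than a documented requirement, and check_shot_timings is a hypothetical helper.

```python
import re

# Matches timing brackets such as [0-3s] or [7-10s]
SHOT_TIMING = re.compile(r"\[(\d+)-(\d+)s\]")


def check_shot_timings(prompt: str, duration: int) -> list[str]:
    """Return a list of timing problems found in a multi-shot prompt."""
    spans = [(int(a), int(b)) for a, b in SHOT_TIMING.findall(prompt)]
    if not spans:
        return ["no timing brackets found"]

    problems = []
    if spans[0][0] != 0:
        problems.append("first shot does not start at 0s")
    for start, end in spans:
        if end <= start:
            problems.append(f"shot [{start}-{end}s] has no positive length")
    for (_, prev_end), (next_start, _) in zip(spans, spans[1:]):
        if next_start != prev_end:
            problems.append(f"gap or overlap between {prev_end}s and {next_start}s")
    if spans[-1][1] != duration:
        problems.append(f"last shot ends at {spans[-1][1]}s but the video is {duration}s")
    return problems


if __name__ == "__main__":
    example = "Shot 1 [0-3s] Wide shot.\nShot 2 [3-7s] Track forward.\nShot 3 [7-10s] Close-up."
    print(check_shot_timings(example, duration=10))
```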

Resolution and Aspect Ratio

Wan 2.6 text-to-video supports five aspect ratios: 16:9, 9:16, 1:1, 4:3, and 3:4. Tailor prompts to the selected ratio, for example:

  • 16:9 (landscape): Wide establishing shots, horizontal camera movement
  • 9:16 (portrait): Vertical composition, tighter framing on subjects
  • 1:1 (square): Centered subjects, balanced composition
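
The text-to-video options described in this section can be validated before a request is sent. This is a sketch under stated assumptions: only multi_shots is a parameter name taken from this guide, while the field names prompt, resolution, duration, and aspect_ratio, and the use of an integer for duration, are guesses that should be checked against the model's API schema.

```python
TEXT_TO_VIDEO_RESOLUTIONS = {"720p", "1080p"}
DURATIONS_SECONDS = {5, 10, 15}
ASPECT_RATIOS = {"16:9", "9:16", "1:1", "4:3", "3:4"}


def text_to_video_args(
    prompt: str,
    resolution: str = "1080p",
    duration: int = 5,
    aspect_ratio: str = "16:9",
    multi_shots: bool = True,
) -> dict:
    """Build a request-arguments dict (field names other than multi_shots are assumed)."""
    if resolution not in TEXT_TO_VIDEO_RESOLUTIONS:
        raise ValueError(f"unsupported resolution: {resolution}")
    if duration not in DURATIONS_SECONDS:
        raise ValueError(f"unsupported duration: {duration}")
    if aspect_ratio not in ASPECT_RATIOS:
        raise ValueError(f"unsupported aspect ratio: {aspect_ratio}")
    return {
        "prompt": prompt,
        "resolution": resolution,
        "duration": duration,
        "aspect_ratio": aspect_ratio,
        "multi_shots": multi_shots,
    }


if __name__ == "__main__":
    print(text_to_video_args("Vertical street scene, tight framing.", aspect_ratio="9:16"))
```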

Image-to-Video Mode

Image-to-video animates a provided still image based on motion descriptions. The image_url parameter accepts publicly accessible URLs or base64 data URIs. Input images and output settings must meet these specifications:

Parameter     Requirement
File size     Max 100MB
Dimensions    360px - 2000px (width/height)
Resolution    480p, 720p, or 1080p
Duration      5, 10, or 15 seconds
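
The input limits in the table can be checked locally before uploading. The sketch below is illustrative only: image_problems is a hypothetical helper, and it assumes "100MB" means 100 x 1024 x 1024 bytes and that the 360px to 2000px range applies to both width and height.

```python
MAX_IMAGE_BYTES = 100 * 1024 * 1024  # assumes "100MB" means 100 MiB
MIN_SIDE_PX = 360
MAX_SIDE_PX = 2000


def image_problems(width: int, height: int, size_bytes: int) -> list[str]:
    """Return the reasons an input image would fall outside the stated limits."""
    problems = []
    if size_bytes > MAX_IMAGE_BYTES:
        problems.append("file is larger than 100MB")
    for name, value in (("width", width), ("height", height)):
        if not MIN_SIDE_PX <= value <= MAX_SIDE_PX:
            problems.append(
                f"{name} {value}px is outside {MIN_SIDE_PX}-{MAX_SIDE_PX}px"
            )
    return problems


if __name__ == "__main__":
    print(image_problems(width=1920, height=1080, size_bytes=4_000_000))
    print(image_problems(width=3840, height=2160, size_bytes=4_000_000))
```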

Motion Description

Prompts should describe how the camera moves and how elements within the frame animate:

Continue from first frame. Gentle camera push toward the mountain peak as clouds drift overhead. Light changes from morning to golden hour. Cinematic and serene movement.

Avoid describing what's already in the image. Focus on temporal changes: camera motion, lighting shifts, environmental animation (water, clouds, foliage).

Multi-Shot Image Animation

Multi-shot formatting is disabled by default for image-to-video. Enable it by setting multi_shots: true, then structure the prompt with timed shots:

Shot 1 [0-5s] Continue from first frame. Slow zoom out revealing more of the landscape.
Shot 2 [5-10s] Camera pans right to follow the river flowing through the valley.
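
A request for multi-shot image animation might be assembled as follows. This is a sketch, not the documented API: image_url and multi_shots are the parameter names given in this guide, the prompt field name is an assumption, and to_data_uri and image_to_video_args are hypothetical helpers. The data URI helper covers the case where the image is a local file rather than a public URL.

```python
import base64


def to_data_uri(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URI."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def image_to_video_args(image_url: str, prompt: str, multi_shots: bool = False) -> dict:
    """Build request arguments; multi_shots defaults to off, as in image-to-video."""
    return {
        "image_url": image_url,    # public URL or data URI
        "prompt": prompt,          # assumed field name
        "multi_shots": multi_shots,
    }


if __name__ == "__main__":
    prompt = (
        "Shot 1 [0-5s] Continue from first frame. Slow zoom out.\n"
        "Shot 2 [5-10s] Camera pans right to follow the river."
    )
    # Placeholder bytes stand in for a real image file read from disk.
    args = image_to_video_args(to_data_uri(b"placeholder"), prompt, multi_shots=True)
    print(args["multi_shots"], args["image_url"][:22])
```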

Image Selection

Images with these characteristics animate more effectively:

  • High resolution (minimum 1080p)
  • Clear depth of field
  • Atmospheric elements (clouds, water, smoke)
  • Uncluttered composition
  • Well-defined subjects

Busy compositions with multiple competing elements produce inconsistent motion.

Reference-to-Video Mode

Reference-to-video maintains subject consistency by extracting subjects from provided reference videos and placing them in new contexts. This mode accepts 1-3 reference videos via the video_urls parameter and supports only 5- or 10-second durations (15 seconds is unavailable).[^3]

Reference Syntax

Tag reference videos in prompts using @Video1, @Video2, and @Video3:

@Video1 walks through a futuristic cityscape as holographic displays activate around them. Cinematic lighting, shallow depth of field.

The model extracts the primary subject from each reference video and composites it into the generated scene.

Multi-Reference Interactions

When using multiple references, specify spatial relationships and interactions:

Dance battle between @Video1 and @Video2 in an ancient colosseum. @Video3 watches from a throne. Dynamic camera movement, dramatic lighting.

Without explicit positioning, the model places subjects based on prompt context, which may not match your intent.
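
Mismatched tags are easy to miss when prompts are edited by hand. The sketch below checks that every @VideoN tag points at a supplied reference video and that the duration is one this mode supports. video_urls is the parameter named in this guide; the prompt and duration field names and the reference_to_video_args helper are assumptions for illustration.

```python
import re

VIDEO_TAG = re.compile(r"@Video(\d+)")
REFERENCE_DURATIONS = {5, 10}  # 15 seconds is unavailable in this mode


def reference_to_video_args(prompt: str, video_urls: list[str], duration: int = 5) -> dict:
    """Validate @VideoN tags against the reference list and build request arguments."""
    if not 1 <= len(video_urls) <= 3:
        raise ValueError("provide between 1 and 3 reference videos")
    if duration not in REFERENCE_DURATIONS:
        raise ValueError(f"unsupported duration: {duration}")

    tagged = {int(number) for number in VIDEO_TAG.findall(prompt)}
    missing = sorted(n for n in tagged if not 1 <= n <= len(video_urls))
    if missing:
        raise ValueError(f"prompt tags videos that were not supplied: {missing}")

    return {"prompt": prompt, "video_urls": list(video_urls), "duration": duration}


if __name__ == "__main__":
    args = reference_to_video_args(
        "Dance battle between @Video1 and @Video2 in an ancient colosseum.",
        ["https://example.com/a.mp4", "https://example.com/b.mp4"],
        duration=10,
    )
    print(len(args["video_urls"]), args["duration"])
```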

Reference Video Selection

Optimal reference videos share these characteristics:

  • Clear, well-lit subjects
  • Subject shown from multiple angles (if available)
  • Duration under 10 seconds
  • Subject is the dominant element in frame
  • Minimal background clutter

The model performs subject extraction, so videos with complex backgrounds or multiple subjects may produce inconsistent results.

Advanced Techniques

Negative Prompts

The negative_prompt parameter (max 500 characters) specifies what to avoid:

negative_prompt: "low quality, blurry, distorted faces, unnatural movement, text, watermarks, shaky camera"

Because of the 500-character limit, list the most problematic artifacts first and keep each description concise.
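
One way to stay within the limit is to keep a priority-ordered list of terms and drop whatever does not fit. This is a minimal sketch: build_negative_prompt is a hypothetical helper, and only the 500-character limit comes from this guide.

```python
MAX_NEGATIVE_CHARS = 500  # negative_prompt limit stated in this guide


def build_negative_prompt(terms: list[str], limit: int = MAX_NEGATIVE_CHARS) -> str:
    """Join terms in priority order, stopping before the limit is exceeded."""
    kept = []
    length = 0
    for term in terms:
        term = term.strip()
        if not term:
            continue
        added = len(term) + (2 if kept else 0)  # 2 accounts for the ", " separator
        if length + added > limit:
            break
        kept.append(term)
        length += added
    return ", ".join(kept)


if __name__ == "__main__":
    print(build_negative_prompt(
        ["distorted faces", "unnatural movement", "low quality", "blurry", "watermarks"]
    ))
```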

Motion Intensity Control

Adjust motion intensity through descriptive language:

  • Minimal: "subtle camera drift," "gentle movement"
  • Moderate: "smooth camera track," "flowing motion"
  • Dramatic: "dynamic sweeping motion," "rapid camera movement"
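
When generating many prompts, the intensity tiers above can be kept in a small lookup so the wording stays consistent. The helper name with_motion is hypothetical, and the phrases are the ones listed in this section.

```python
MOTION_PHRASES = {
    "minimal": ["subtle camera drift", "gentle movement"],
    "moderate": ["smooth camera track", "flowing motion"],
    "dramatic": ["dynamic sweeping motion", "rapid camera movement"],
}


def with_motion(prompt: str, intensity: str) -> str:
    """Append the first phrase for the chosen intensity as its own sentence."""
    if intensity not in MOTION_PHRASES:
        raise ValueError(f"unknown intensity: {intensity}")
    phrase = MOTION_PHRASES[intensity][0]
    return f"{prompt.rstrip()} {phrase[0].upper()}{phrase[1:]}."


if __name__ == "__main__":
    print(with_motion("Clouds drift over the mountain peak.", "minimal"))
```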

Audio Integration

The audio_url parameter accepts publicly accessible audio files (WAV or MP3, 3-30 seconds, max 15MB). Audio handling follows these rules:

  • Audio longer than video duration is truncated
  • Audio shorter than video duration leaves remaining video silent
  • For dialogue, include speaker cues in prompts
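
The truncation and silence rules reduce to simple arithmetic, sketched below. audio_fit is a hypothetical helper, and the 3 to 30 second range is the audio length limit stated above.

```python
def audio_fit(audio_seconds: float, video_seconds: float) -> dict:
    """Report how much audio plays, is cut off, or is missing for a given video length."""
    if not 3 <= audio_seconds <= 30:
        raise ValueError("audio must be between 3 and 30 seconds long")
    return {
        "played_seconds": min(audio_seconds, video_seconds),
        "truncated_seconds": max(0, audio_seconds - video_seconds),
        "silent_seconds": max(0, video_seconds - audio_seconds),
    }


if __name__ == "__main__":
    # A 12-second track on a 10-second video loses its last 2 seconds.
    print(audio_fit(12, 10))
    # A 4-second track on a 10-second video leaves 6 seconds silent.
    print(audio_fit(4, 10))
```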

Prompt Expansion

The enable_prompt_expansion parameter (enabled by default) uses an LLM to enhance prompts. For best results:

  • Specify style references: cinematic, documentary, animation style
  • Include visual descriptors: lighting type, color palette, mood
  • Use technical terminology when precision matters

Main prompts are limited to 800 characters. Prompt expansion adds detail without increasing your character count.

Common Issues and Solutions

Issue                      Solution
Incoherent narrative       Break complex scenes into specific shots with timing
Inconsistent subjects      Add detailed character descriptions across all shots
Unnatural motion           Specify camera movements explicitly (push, pan, orbit)
Low visual quality         Add quality descriptors: "photorealistic, 4K, high detail"
Ignored prompt elements    Place critical details early in shot descriptions

Implementation Priorities

  1. Start with text-to-video: Test prompt structures and multi-shot formatting without image dependencies

  2. Experiment with timing: Adjust shot durations to find natural pacing for your content type

  3. Build a prompt library: Document successful prompts and patterns for reuse

  4. Test reference extraction: Evaluate which types of reference videos maintain consistency

  5. Iterate systematically: Change one variable at a time to understand its impact
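
Changing one variable at a time can be made mechanical by generating variants from a baseline set of settings. This is a sketch with a hypothetical helper, one_at_a_time; the setting names in the example are illustrative rather than confirmed API fields.

```python
def one_at_a_time(baseline: dict, alternatives: dict) -> list[dict]:
    """Return copies of the baseline that each differ from it in exactly one setting."""
    variants = []
    for key, values in alternatives.items():
        for value in values:
            if value == baseline.get(key):
                continue  # identical to the baseline, nothing to learn
            variant = dict(baseline)
            variant[key] = value
            variants.append(variant)
    return variants


if __name__ == "__main__":
    baseline = {"resolution": "720p", "duration": 5, "multi_shots": True}
    runs = one_at_a_time(
        baseline,
        {"resolution": ["720p", "1080p"], "duration": [10, 15]},
    )
    for run in runs:
        print(run)
```

Comparing each variant's output against the baseline isolates the effect of that single setting.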

Wan 2.6 on fal provides three complementary approaches to video generation. Text-to-video offers maximum creative freedom, image-to-video animates existing assets, and reference-to-video maintains character consistency. Each mode requires specific prompt engineering, but the underlying principles remain consistent: be specific, structure your prompts, and iterate based on results.

about the author
Brad Rose
A content producer with creative focus, Brad covers and crafts stories spanning all of generative media.
