Seedance 1.5 Prompt Guide: Mastering ByteDance's Audio-Video Generation Model

Seedance 1.5 Pro's dual-branch diffusion transformer generates synchronized audio and video from text prompts. Output quality depends on structured prompting across four layers: subject definition, dialogue or key sound events, environmental audio cues, and visual style.

Last updated: 1/11/2026
Edited by: Zachary Roth
Read time: 6 minutes

Why Audio-Video Synchronization Changes Prompt Strategy

Most text-to-video models treat sound as an afterthought, bolting audio onto finished visuals through post-processing. ByteDance's Seedance 1.5 Pro takes a fundamentally different approach: it generates audio and video simultaneously using a dual-branch diffusion transformer architecture that renders both modalities in the same latent space. [1] When a character speaks, lip movements synchronize with the dialogue at millisecond precision. When glass shatters on screen, the audio spike occurs at that exact frame.

This architectural distinction matters for prompt engineering. Traditional video prompts focus exclusively on visual elements because sound arrives separately. Seedance 1.5 prompts must account for both visual and auditory dimensions simultaneously, describing what viewers see and what they hear in a unified instruction set. Models achieving tight temporal alignment require conditioning mechanisms that account for both modalities during the generation process, not as sequential steps. [2]

The Four-Layer Prompt Structure

Effective Seedance 1.5 prompts follow a structured format that guides the model through multiple layers of detail. Think of your prompt as a director's shot list combined with a sound designer's cue sheet.

Layer 1: Primary Action or Subject. Start with the core visual element, specifying what happens or who appears in the frame. Precision about the subject and their primary action establishes the foundation.

Layer 2: Dialogue or Key Sound Event. When your scene includes speech or a critical sound moment, include it in quotes. This signals to Seedance 1.5 that it should prioritize audio generation matching the visual.

Layer 3: Environmental Audio Cues. Describe ambient sounds and secondary audio elements that create atmosphere. Use comma-separated phrases to list multiple sound sources.

Layer 4: Visual Style and Mood. Conclude with descriptive terms establishing overall aesthetic and emotional tone.

Combined example: "Defense attorney declaring 'Ladies and gentlemen, reasonable doubt isn't just a phrase, it's the foundation of justice itself', footsteps on marble, jury shifting, courtroom drama, closing argument power."
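The four layers can be assembled programmatically when you generate prompts in bulk. The helper below is a hypothetical sketch (not part of any SDK): it simply joins the layers in the recommended order, wrapping dialogue in quotes so the model prioritizes matching audio.

```python
def build_prompt(subject, dialogue=None, ambience=None, style=None):
    """Assemble a Seedance 1.5 prompt from the four layers.

    Hypothetical helper: joins the layers in the recommended order
    with comma separators. Dialogue is quoted to signal a key
    sound event, per the guide's Layer 2 convention.
    """
    parts = [subject]
    if dialogue:
        parts.append(f"'{dialogue}'")  # Layer 2: quoted dialogue / key sound
    if ambience:
        parts.extend(ambience)         # Layer 3: environmental audio cues
    if style:
        parts.extend(style)            # Layer 4: visual style and mood
    return ", ".join(parts)

prompt = build_prompt(
    subject="Defense attorney declaring",
    dialogue="Ladies and gentlemen, reasonable doubt isn't just a phrase",
    ambience=["footsteps on marble", "jury shifting"],
    style=["courtroom drama", "closing argument power"],
)
```

A helper like this also makes it easy to vary one layer at a time while holding the others fixed during iteration.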

API Parameters and Pricing

Understanding the parameter schema helps you optimize for cost, speed, or quality depending on project phase.

| Parameter | Options | Default | Notes |
|---|---|---|---|
| prompt | string (required) | - | Scene, action, dialogue, camera, and sound |
| aspect_ratio | 21:9, 16:9, 4:3, 1:1, 3:4, 9:16 | 16:9 | Affects composition and platform fit |
| resolution | 480p, 720p | 720p | 480p for iteration, 720p for output |
| duration | 4 through 12 seconds | 5 | Integer values only |
| generate_audio | boolean | true | Set false for silent video |
| camera_fixed | boolean | false | true locks camera position |
| seed | integer | random | Use -1 for random, a fixed value for reproducibility |

Pricing: Each 720p 5-second video with audio costs approximately $0.26. Token-based pricing runs $2.40 per million video tokens with audio, $1.20 without. Token calculation: (height x width x FPS x duration) / 1024.
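The documented token formula makes cost estimates straightforward to script. This sketch applies the formula and rates above; the 24 fps frame rate is an assumed value for illustration, since the guide does not state the model's output frame rate.

```python
def video_tokens(height, width, fps, seconds):
    # Documented token formula: (height x width x FPS x duration) / 1024
    return height * width * fps * seconds / 1024

def cost_usd(tokens, with_audio=True):
    # $2.40 per million video tokens with audio, $1.20 without
    rate = 2.40 if with_audio else 1.20
    return tokens / 1_000_000 * rate

# 720p (1280x720), 5-second clip; 24 fps is assumed for illustration
tokens = video_tokens(720, 1280, 24, 5)
print(round(cost_usd(tokens), 2))  # → 0.26
```

With these assumptions the formula reproduces the quoted ~$0.26 per 720p 5-second clip with audio.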

Generation time: A 5-second clip typically completes in 30-45 seconds depending on resolution and server load.

The response returns a video object containing the URL and the seed value used for generation:

{
  "video": { "url": "https://..." },
  "seed": 42
}

Capture the returned seed when you achieve good results. Submitting the same seed with a modified prompt produces variations that preserve core characteristics of the original generation.
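A minimal sketch of capturing the seed from a response shaped like the example above; the URL here is a placeholder, not a real endpoint.

```python
import json

# Hypothetical response payload matching the documented shape
raw = '{"video": {"url": "https://example.com/clip.mp4"}, "seed": 42}'
response = json.loads(raw)

video_url = response["video"]["url"]
seed = response["seed"]  # store this for controlled variations later
```

Persisting the seed alongside the prompt that produced it gives you a reproducible record of every good generation.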

Resolution and Duration Trade-offs

| Use Case | Resolution | Duration | Cost Estimate |
|---|---|---|---|
| Prompt testing | 480p | 4-5 seconds | ~$0.08 |
| Social content | 720p | 5-6 seconds | ~$0.26-$0.31 |
| Final production | 720p | 8-12 seconds | ~$0.42-$0.62 |

Start with 480p and short durations during prompt development. The lower cost per generation makes rapid iteration practical. Switch to 720p for final output once you have refined your prompt structure.

Advanced Prompting Techniques

Sound-First Prompting

For scenes where audio drives the narrative, structure your prompt around the soundscape:

"Thunder cracking overhead, rain pelting windows, wind howling through trees, abandoned mansion interior, Gothic atmosphere, lightning flashes illuminating dusty furniture"

This approach signals that audio carries primary storytelling weight, with visuals supporting the sonic environment.

Action Sequencing

When conveying a progression of events within your duration limit, use temporal markers:

"Chef's hands chopping vegetables rapidly, knife striking cutting board rhythmically, sizzling as ingredients hit hot pan, steam rising, professional kitchen energy, culinary precision"

Comma-separated actions create a natural sequence the model interprets temporally.

Contrast and Juxtaposition

Build contrasts directly into your prompt for visual interest:

"Quiet library, sudden book slam echoing, startled students looking up, whispered apologies, tension breaking into nervous laughter, academic setting, comedic timing"

This technique works particularly well for scenes with emotional shifts or dramatic reveals.

Common Failure Modes

Understanding how prompts fail helps you debug generation issues efficiently.

Vague descriptions produce generic output. "A person walking in a city" lacks the specificity needed for coherent audio-video synthesis. Compare with: "Business professional striding confidently down rain-slicked sidewalk, heels clicking rhythmically, traffic sounds in background, urban morning commute, determined energy."

Conflicting instructions cause incoherent results. The model cannot reconcile "peaceful meditation garden with loud rock concert and quiet library atmosphere." When prompts contain contradictory audio or visual elements, expect unpredictable output rather than an error message. The generation completes but produces visual or audio artifacts.
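Contradictions like this can be caught before spending a generation. The check below is a toy lint, assuming small illustrative keyword lists rather than anything exhaustive; it flags prompts that mix terms from opposing audio moods.

```python
# Toy contradiction check; keyword sets are illustrative assumptions,
# not an exhaustive taxonomy of conflicting audio cues.
QUIET = {"quiet", "peaceful", "silent", "hushed"}
LOUD = {"loud", "blaring", "roaring", "rock concert"}

def has_audio_conflict(prompt):
    """Return True when a prompt mixes quiet and loud audio terms."""
    text = prompt.lower()
    return any(q in text for q in QUIET) and any(l in text for l in LOUD)

has_audio_conflict("peaceful meditation garden with loud rock concert")  # → True
```

Even a crude pre-flight check like this saves iteration budget, since conflicting prompts complete successfully but produce artifacts rather than errors.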

Ignoring audio wastes the model's primary capability. "Dancer performing on stage" misses the opportunity for synchronized sound. "Ballet dancer executing pirouettes, pointe shoes tapping wooden stage, orchestral music swelling, audience gasps at difficult leap" gives the dual-branch architecture material to work with.

The safety checker rejects certain content. The enable_safety_checker parameter defaults to true. Prompts triggering content filters return an error rather than a video. If you receive unexpected failures, check whether your prompt contains elements the safety system might flag.

Parameter Combinations for Specific Use Cases

Cinematic establishing shot: aspect_ratio: "21:9", resolution: "720p", duration: 10, camera_fixed: false.
Prompt: "Sunrise over mountain valley, birds beginning morning chorus, mist rising from lake, gentle breeze rustling leaves, epic landscape, golden hour cinematography"

Product demonstration: aspect_ratio: "4:3", resolution: "720p", duration: 6, camera_fixed: true.
Prompt: "Artisan coffee being poured into ceramic mug, liquid streaming steadily, cup settling on wooden table, steam rising visibly, cafe ambiance, craftsmanship focus"

Social media hook: aspect_ratio: "9:16", resolution: "720p", duration: 5, camera_fixed: false.
Prompt: "Surprised face reacting to off-screen reveal, sudden gasp, dramatic music sting, quick zoom in, viral moment energy, engaging expression"

Audio Generation Controls

Set generate_audio: false when you plan to add custom soundtracks in post-production. This reduces cost to $1.20 per million tokens and may slightly decrease generation time.

When audio generation is enabled, include specific audio cues in prompts for key moments where sound drives narrative. The model generates dialogue with lip-sync, environmental foley, and ambient sound mixed at 48 kHz AAC. Output format is MP4 with H.264 video encoding.

Iteration Workflow

The seed parameter enables systematic prompt refinement. When a generation achieves 80% of your vision:

  • Note the seed value from the response
  • Modify specific prompt elements while keeping the seed constant
  • Compare outputs to isolate which prompt changes produce which effects

This approach transforms prompt engineering from random experimentation into controlled iteration. Without seed control, you cannot distinguish whether output differences stem from prompt changes or random variation.
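The workflow above can be sketched as a batch builder that holds the seed constant while swapping one prompt element, so output differences are attributable to the wording change. The base prompt and variants here are illustrative.

```python
# Hold the seed fixed and vary one element per generation, so
# differences in output trace back to the prompt change alone.
BASE = "Ballet dancer executing pirouettes, pointe shoes tapping wooden stage"
MOODS = ["orchestral music swelling", "sparse piano notes", "silent hall tension"]

def variation_batch(base_prompt, variants, seed):
    """Build one request payload per variant, all sharing one seed."""
    return [
        {"prompt": f"{base_prompt}, {v}", "seed": seed, "resolution": "480p"}
        for v in variants
    ]

batch = variation_batch(BASE, MOODS, seed=42)
```

Running the batch at 480p keeps each comparison cheap, per the iteration guidance above.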

For production workflows requiring multiple generations, use the queue API with webhooks rather than blocking calls. Submit requests asynchronously, track progress through status endpoints, and retrieve results when ready.
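The status-tracking half of that pattern reduces to a generic polling loop. This sketch is deliberately API-agnostic: `get_status` stands in for a call to the queue's status endpoint, and the fake status source below exists only so the loop can be demonstrated without network access. Status strings and backoff delays are assumptions.

```python
import itertools
import time

def poll_until_done(get_status, delays=(1, 2, 4, 8, 16)):
    """Poll a status callable until it reports completion.

    get_status is any zero-argument callable returning a status
    string; in a real workflow it would hit the queue's status
    endpoint. Delays grow, then hold at the last value, to avoid
    hammering the API.
    """
    for delay in itertools.chain(delays, itertools.repeat(delays[-1])):
        status = get_status()
        if status == "COMPLETED":
            return status
        time.sleep(delay)

# Fake status source standing in for the real endpoint
states = iter(["IN_QUEUE", "IN_PROGRESS", "COMPLETED"])
result = poll_until_done(lambda: next(states), delays=(0,))
```

In production you would prefer webhooks over polling where available, keeping a loop like this only as a fallback.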

References

  1. ByteDance Seed Team. "Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model." ByteDance, 2024. https://seed.bytedance.com/en/seedance1_5_pro

  2. Liu, Haohe, et al. "SyncFlow: Toward Temporally Aligned Joint Audio-Video Generation from Text." arXiv preprint arXiv:2412.15220 (2024). https://arxiv.org/abs/2412.15220

about the author
Zachary Roth
A generative media engineer with a focus on growth, Zach has deep expertise in building RAG architecture for complex content systems.
