Seedance 1.5 Prompt Guide: Mastering ByteDance's Audio-Video Generation Model

Seedance 1.5 Pro's dual-branch diffusion transformer generates synchronized audio and video from text prompts. Output quality depends on structured prompting across four layers: subject definition, dialogue or key sound events, environmental audio cues, and visual style.

Last updated: 1/11/2026
Edited by: Zachary Roth
Read time: 6 minutes

Why Audio-Video Synchronization Changes Prompt Strategy

Most text-to-video models treat sound as an afterthought, bolting audio onto finished visuals through post-processing. ByteDance's Seedance 1.5 Pro takes a fundamentally different approach: it generates audio and video simultaneously using a dual-branch diffusion transformer architecture that renders both modalities in the same latent space. [1] When a character speaks, lip movements synchronize with the dialogue at millisecond precision. When glass shatters on screen, the audio spike occurs at that exact frame.

This architectural distinction matters for prompt engineering. Traditional video prompts focus exclusively on visual elements because sound arrives separately. Seedance 1.5 prompts must account for both visual and auditory dimensions simultaneously, describing what viewers see and what they hear in a unified instruction set. Models achieving tight temporal alignment require conditioning mechanisms that account for both modalities during the generation process, not as sequential steps. [2]

The Four-Layer Prompt Structure

Effective Seedance 1.5 prompts follow a structured format that guides the model through multiple layers of detail. Think of your prompt as a director's shot list combined with a sound designer's cue sheet.

Layer 1: Primary Action or Subject. Start with the core visual element, specifying what happens or who appears in the frame. Precision about the subject and their primary action establishes the foundation.

Layer 2: Dialogue or Key Sound Event. When your scene includes speech or a critical sound moment, include it in quotes. This signals to Seedance 1.5 that it should prioritize audio generation matching the visual.

Layer 3: Environmental Audio Cues. Describe ambient sounds and secondary audio elements that create atmosphere. Use comma-separated phrases to list multiple sound sources.

Layer 4: Visual Style and Mood. Conclude with descriptive terms establishing overall aesthetic and emotional tone.

Combined example: "Defense attorney declaring 'Ladies and gentlemen, reasonable doubt isn't just a phrase, it's the foundation of justice itself', footsteps on marble, jury shifting, courtroom drama, closing argument power."
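The four layers can be assembled programmatically when you generate prompts in bulk. The helper below is a hypothetical sketch (not part of any SDK): it simply joins the layers in the recommended order, wrapping dialogue in quotes so the model prioritizes matching audio.

```python
def build_prompt(subject, dialogue=None, ambience=None, style=None):
    """Assemble a Seedance 1.5 prompt from the four layers.

    Hypothetical helper: joins the layers in the recommended order
    with comma separators. Dialogue is quoted to signal a key
    sound event, per the guide's Layer 2 convention.
    """
    parts = [subject]
    if dialogue:
        parts.append(f"'{dialogue}'")  # Layer 2: quoted dialogue / key sound
    if ambience:
        parts.extend(ambience)         # Layer 3: environmental audio cues
    if style:
        parts.extend(style)            # Layer 4: visual style and mood
    return ", ".join(parts)

prompt = build_prompt(
    subject="Defense attorney declaring",
    dialogue="Ladies and gentlemen, reasonable doubt isn't just a phrase",
    ambience=["footsteps on marble", "jury shifting"],
    style=["courtroom drama", "closing argument power"],
)
```

A helper like this also makes it easy to vary one layer at a time while holding the others fixed during iteration.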

API Parameters and Pricing

Understanding the parameter schema helps you optimize for cost, speed, or quality depending on project phase.

| Parameter | Options | Default | Notes |
|---|---|---|---|
| prompt | string (required) | - | Scene, action, dialogue, camera, and sound |
| aspect_ratio | 21:9, 16:9, 4:3, 1:1, 3:4, 9:16 | 16:9 | Affects composition and platform fit |
| resolution | 480p, 720p | 720p | 480p for iteration, 720p for output |
| duration | 4 through 12 seconds | 5 | Integer values only |
| generate_audio | boolean | true | Set false for silent video |
| camera_fixed | boolean | false | true locks camera position |
| seed | integer | random | Use -1 for random, a fixed value for reproducibility |

Pricing: Each 720p 5-second video with audio costs approximately $0.26. Token-based pricing runs $2.40 per million video tokens with audio, $1.20 without. Token calculation: (height x width x FPS x duration) / 1024.
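The documented token formula makes cost estimates straightforward to script. This sketch applies the formula and rates above; the 24 fps frame rate is an assumed value for illustration, since the guide does not state the model's output frame rate.

```python
def video_tokens(height, width, fps, seconds):
    # Documented token formula: (height x width x FPS x duration) / 1024
    return height * width * fps * seconds / 1024

def cost_usd(tokens, with_audio=True):
    # $2.40 per million video tokens with audio, $1.20 without
    rate = 2.40 if with_audio else 1.20
    return tokens / 1_000_000 * rate

# 720p (1280x720), 5-second clip; 24 fps is assumed for illustration
tokens = video_tokens(720, 1280, 24, 5)
print(round(cost_usd(tokens), 2))  # → 0.26
```

With these assumptions the formula reproduces the quoted ~$0.26 per 720p 5-second clip with audio.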

Generation time: A 5-second clip typically completes in 30-45 seconds depending on resolution and server load.

The response returns a video object containing the URL and the seed value used for generation:

{
  "video": { "url": "https://..." },
  "seed": 42
}

Capture the returned seed when you achieve good results. Submitting the same seed with a modified prompt produces variations that preserve core characteristics of the original generation.
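A minimal sketch of capturing the seed from a response shaped like the example above; the URL here is a placeholder, not a real endpoint.

```python
import json

# Hypothetical response payload matching the documented shape
raw = '{"video": {"url": "https://example.com/clip.mp4"}, "seed": 42}'
response = json.loads(raw)

video_url = response["video"]["url"]
seed = response["seed"]  # store this for controlled variations later
```

Persisting the seed alongside the prompt that produced it gives you a reproducible record of every good generation.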

Resolution and Duration Trade-offs

| Use Case | Resolution | Duration | Cost Estimate |
|---|---|---|---|
| Prompt testing | 480p | 4-5 seconds | ~$0.08 |
| Social content | 720p | 5-6 seconds | ~$0.26-$0.31 |
| Final production | 720p | 8-12 seconds | ~$0.42-$0.62 |

Start with 480p and short durations during prompt development. The lower cost per generation makes rapid iteration practical. Switch to 720p for final output once you have refined your prompt structure.

Advanced Prompting Techniques

Sound-First Prompting

For scenes where audio drives the narrative, structure your prompt around the soundscape:

"Thunder cracking overhead, rain pelting windows, wind howling through trees, abandoned mansion interior, Gothic atmosphere, lightning flashes illuminating dusty furniture"

This approach signals that audio carries primary storytelling weight, with visuals supporting the sonic environment.

Action Sequencing

When conveying a progression of events within your duration limit, use temporal markers:

"Chef's hands chopping vegetables rapidly, knife striking cutting board rhythmically, sizzling as ingredients hit hot pan, steam rising, professional kitchen energy, culinary precision"

Comma-separated actions create a natural sequence the model interprets temporally.

Contrast and Juxtaposition

Build contrasts directly into your prompt for visual interest:

"Quiet library, sudden book slam echoing, startled students looking up, whispered apologies, tension breaking into nervous laughter, academic setting, comedic timing"

This technique works particularly well for scenes with emotional shifts or dramatic reveals.

Common Failure Modes

Understanding how prompts fail helps you debug generation issues efficiently.

Vague descriptions produce generic output. "A person walking in a city" lacks the specificity needed for coherent audio-video synthesis. Compare with: "Business professional striding confidently down rain-slicked sidewalk, heels clicking rhythmically, traffic sounds in background, urban morning commute, determined energy."

Conflicting instructions cause incoherent results. The model cannot reconcile "peaceful meditation garden with loud rock concert and quiet library atmosphere." When prompts contain contradictory audio or visual elements, expect unpredictable output rather than an error message. The generation completes but produces visual or audio artifacts.
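Contradictions like this can be caught before spending a generation. The check below is a toy lint, assuming small illustrative keyword lists rather than anything exhaustive; it flags prompts that mix terms from opposing audio moods.

```python
# Toy contradiction check; keyword sets are illustrative assumptions,
# not an exhaustive taxonomy of conflicting audio cues.
QUIET = {"quiet", "peaceful", "silent", "hushed"}
LOUD = {"loud", "blaring", "roaring", "rock concert"}

def has_audio_conflict(prompt):
    """Return True when a prompt mixes quiet and loud audio terms."""
    text = prompt.lower()
    return any(q in text for q in QUIET) and any(l in text for l in LOUD)

has_audio_conflict("peaceful meditation garden with loud rock concert")  # → True
```

Even a crude pre-flight check like this saves iteration budget, since conflicting prompts complete successfully but produce artifacts rather than errors.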

Ignoring audio wastes the model's primary capability. "Dancer performing on stage" misses the opportunity for synchronized sound. "Ballet dancer executing pirouettes, pointe shoes tapping wooden stage, orchestral music swelling, audience gasps at difficult leap" gives the dual-branch architecture material to work with.

The safety checker rejects certain content. The enable_safety_checker parameter defaults to true. Prompts triggering content filters return an error rather than a video. If you receive unexpected failures, check whether your prompt contains elements the safety system might flag.

Parameter Combinations for Specific Use Cases

Cinematic establishing shot: aspect_ratio: "21:9", resolution: "720p", duration: 10, camera_fixed: false.
Prompt: "Sunrise over mountain valley, birds beginning morning chorus, mist rising from lake, gentle breeze rustling leaves, epic landscape, golden hour cinematography"

Product demonstration: aspect_ratio: "4:3", resolution: "720p", duration: 6, camera_fixed: true.
Prompt: "Artisan coffee being poured into ceramic mug, liquid streaming steadily, cup settling on wooden table, steam rising visibly, cafe ambiance, craftsmanship focus"

Social media hook: aspect_ratio: "9:16", resolution: "720p", duration: 5, camera_fixed: false.
Prompt: "Surprised face reacting to off-screen reveal, sudden gasp, dramatic music sting, quick zoom in, viral moment energy, engaging expression"

Audio Generation Controls

Set generate_audio: false when you plan to add custom soundtracks in post-production. This reduces cost to $1.20 per million tokens and may slightly decrease generation time.

When audio generation is enabled, include specific audio cues in prompts for key moments where sound drives narrative. The model generates dialogue with lip-sync, environmental foley, and ambient sound mixed at 48 kHz AAC. Output format is MP4 with H.264 video encoding.

Iteration Workflow

The seed parameter enables systematic prompt refinement. When a generation achieves 80% of your vision:

  • Note the seed value from the response
  • Modify specific prompt elements while keeping the seed constant
  • Compare outputs to isolate which prompt changes produce which effects

This approach transforms prompt engineering from random experimentation into controlled iteration. Without seed control, you cannot distinguish whether output differences stem from prompt changes or random variation.
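The workflow above can be sketched as a batch builder that holds the seed constant while swapping one prompt element, so output differences are attributable to the wording change. The base prompt and variants here are illustrative.

```python
# Hold the seed fixed and vary one element per generation, so
# differences in output trace back to the prompt change alone.
BASE = "Ballet dancer executing pirouettes, pointe shoes tapping wooden stage"
MOODS = ["orchestral music swelling", "sparse piano notes", "silent hall tension"]

def variation_batch(base_prompt, variants, seed):
    """Build one request payload per variant, all sharing one seed."""
    return [
        {"prompt": f"{base_prompt}, {v}", "seed": seed, "resolution": "480p"}
        for v in variants
    ]

batch = variation_batch(BASE, MOODS, seed=42)
```

Running the batch at 480p keeps each comparison cheap, per the iteration guidance above.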

For production workflows requiring multiple generations, use the queue API with webhooks rather than blocking calls. Submit requests asynchronously, track progress through status endpoints, and retrieve results when ready.
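The status-tracking half of that pattern reduces to a generic polling loop. This sketch is deliberately API-agnostic: `get_status` stands in for a call to the queue's status endpoint, and the fake status source below exists only so the loop can be demonstrated without network access. Status strings and backoff delays are assumptions.

```python
import itertools
import time

def poll_until_done(get_status, delays=(1, 2, 4, 8, 16)):
    """Poll a status callable until it reports completion.

    get_status is any zero-argument callable returning a status
    string; in a real workflow it would hit the queue's status
    endpoint. Delays grow, then hold at the last value, to avoid
    hammering the API.
    """
    for delay in itertools.chain(delays, itertools.repeat(delays[-1])):
        status = get_status()
        if status == "COMPLETED":
            return status
        time.sleep(delay)

# Fake status source standing in for the real endpoint
states = iter(["IN_QUEUE", "IN_PROGRESS", "COMPLETED"])
result = poll_until_done(lambda: next(states), delays=(0,))
```

In production you would prefer webhooks over polling where available, keeping a loop like this only as a fallback.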

References

  1. ByteDance Seed Team. "Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model." ByteDance, 2024. https://seed.bytedance.com/en/seedance1_5_pro

  2. Liu, Haohe, et al. "SyncFlow: Toward Temporally Aligned Joint Audio-Video Generation from Text." arXiv preprint arXiv:2412.15220 (2024). https://arxiv.org/abs/2412.15220

about the author
Zachary Roth
A generative media engineer with a focus on growth, Zach has deep expertise in building RAG architecture for complex content systems.
