Seed Audio 1.0 is ByteDance's audio generation model that produces multi-character dialogue, sound effects, and background music in a single pass. It supports English and Chinese, accepts up to 3 reference voice clips (30 seconds each), and generates up to 2 minutes of audio per pass.
Seed Audio 1.0 is ByteDance's flagship audio generation model, built by ByteDance's Seed team. It goes beyond traditional text-to-speech by generating multi-character dialogue, sound effects, and background music in a single pass. It runs in two modes: text-to-speech, with no background sound effects or music, and text-to-audio, where you generate cinematic audio alongside the dialogue.
What makes Seed Audio 1.0 genuinely wild is that it does two hard things at once. It generates net-new, film-quality audio from a text prompt, and it repairs and reshapes audio you already have, filling silent gaps, swapping lines, extending clips, and generating alternate endings. It is a cinematic audio generator and an audio editor in a single model, which is rare.
Seed Audio 1.0 points toward a new creative workflow: audio-first media generation. Instead of treating sound as the final layer added after video, creators can start with a complete audio scene (dialogue, speaker emotion, sound effects, music, and atmosphere), then use that scene as the foundation for video, podcasts, audiobooks, games, or cinematic content. Audio-first pipelines could become a new foundation for generative media workflows.
Featured Audio Snippets
Want to create scenes like these? Keep reading, and by the end of this guide you'll know how to generate every one of them.
Sports Commentary
Preview: "OH, HE SCORES!!! WHAT A GOAL! He beats two men and buries it in the top corner!"
Fantasy Movie Scene
Preview: "We're trapped! Blue fire ahead, crimson fire behind, that's it, we're finished!"
Action Movie Trailer
Preview: "Move now! The tunnel is coming down behind us!"
Movie Trailer Narrator
Preview: "In a world where aliens come down from the skies, everything is about to turn upside down for one family living in South Texas."
Sitcom: Changing the Ending (Before and After)
With Seed Audio you can take a finished clip and change how it ends, then redo it as many times as you want. Here is the same sitcom scene before and after a new comedic ending is generated in, all in the same voices and room tone (covered in detail in section 8 below).
Before:
After:
The Next Paradigm of Audio
This model brings a new paradigm to generative media on two fronts at once. On one side, it generates high-fidelity, cinematic audio from scratch: sound effects, multi-character dialogue, music, and atmosphere in a single pass. On the other, it fixes and reshapes audio you already have, repairing dropped lines, filling silent gaps, and regenerating endings without re-recording a thing. Most tools do one or the other. Seed Audio 1.0 does both, and that combination is what makes it so powerful for audio-first industries.
Most AI dialogue carries tells that give it away as synthetic: robotic inflections, unnatural rhythm, audio that is too clean, or missing natural sounds like breathing. Seed Audio 1.0 layers in background noise that adds a level of sound design and helps the voices feel natural and grounded in a specific scene. When you add descriptors to the prompt (like young boy, breathless shaky voice, panicky, fantasy film style), the model nails it, generating speech at a variable cadence with the subtle vocal sounds that make it feel human.
Run Seed Audio 1.0 on fal
fal is one of the best places to run Seed Audio 1.0 because you can generate film-quality audio and pair it with state-of-the-art video models like Seedance 2.0 and Seedance 2.0 4K. fal gives businesses and developers a single API to generate assets programmatically, faster and at a fraction of the cost of running them elsewhere. Seed Audio 1.0 is purpose-built for audio, and when you stitch its output together with video models, you can produce podcasts, audiobooks, short films, movies, TV shows, documentaries, and more.
The model is available at the endpoint bytedance/seed-audio-1.0. In your prompt, refer to audio inputs by the order you supplied them, using @Audio1, @Audio2, and @Audio3.
import { fal } from "@fal-ai/client";

const result = await fal.subscribe("bytedance/seed-audio-1.0", {
  input: {
    prompt:
      "Generate a short suspense radio drama in a late-night convenience store.",
  },
});
console.log(result.data);

Testing the Model and Common Workflows
We tested Seed Audio with a variety of prompts, and the results are excellent. The model responds best to clear audio direction: environment, speaker identity, voice traits, delivery, dialogue, sound design, and constraints. The strongest prompts read less like vague creative briefs and more like mini audio scripts.
This guide covers 10 common workflows:
- Sound-design-heavy workflow (full cinematic audio scenes)
- Text-to-speech scenes using saved reference voices
- Build text-and-audio-to-audio scenes
- T2A / TTS scenes with no reference voices
- Personalized text-to-speech (use your own voice)
- Combining multiple audios with background effects or music
- Audio extending (generates a continuation to append to the original clip)
- Audio inpainting (keeps the original clip and fills in or replaces a section)
- Audio stitching (merges two separate clips)
- Audio editing (deleting/changing words)
Seed Audio is especially sensitive to speaker clarity. If you want consistent voices, controlled emotion, and clean multi-character scenes, structure matters.
Key Terms to Know
Before you start, a few terms appear throughout:
- @Audio1, @Audio2, …: references to your saved or uploaded audio clips, numbered in the order you add them. Use them in a prompt to reuse a specific reference voice or clip.
- T2A: text-to-audio. Use it when generating a full audio scene from a text prompt only, with no reference voice.
- TA2A: text-and-audio-to-audio. Use it when generating audio from text plus one or more reference voice clips.
- TTS: text-to-speech. Turns written text into spoken words.
Simple rule: use T2A for text-only prompts, and TA2A when you want characters to match saved voices using @Audio# references.
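The simple rule above can be sketched as a small pre-flight check. The helper names here (`findAudioRefs`, `pickMode`, `missingRefs`) are hypothetical and not part of the fal client; they only inspect the prompt string, on the assumption that references always take the `@Audio<number>` form described in this guide.

```typescript
// Hypothetical helpers (not part of @fal-ai/client): they only read the prompt text.
type Mode = "T2A" | "TA2A";

// Collect the distinct @AudioN indices mentioned in a prompt, in ascending order.
function findAudioRefs(prompt: string): number[] {
  const pattern = /@Audio(\d+)/g;
  const found: number[] = [];
  let match: RegExpExecArray | null = pattern.exec(prompt);
  while (match !== null) {
    const index = parseInt(match[1], 10);
    if (found.indexOf(index) === -1) found.push(index);
    match = pattern.exec(prompt);
  }
  return found.sort((a, b) => a - b);
}

// T2A when the prompt is text only, TA2A when it references uploaded clips.
function pickMode(prompt: string): Mode {
  return findAudioRefs(prompt).length > 0 ? "TA2A" : "T2A";
}

// List references that point past the clips actually supplied.
function missingRefs(prompt: string, clipCount: number): number[] {
  return findAudioRefs(prompt).filter((i) => i < 1 || i > clipCount);
}
```

Running `missingRefs` before a request is a cheap way to catch a prompt that mentions @Audio3 when only two clips were uploaded.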
1. Sound-Design-Heavy Workflow for Seed Audio 1.0
A strong prompt for cinematic audio usually contains six parts:
1. [Genre + environment + mood.] Set the style, location, and emotional tone.
2. [Continuous sound bed.] Name the main sound that stays underneath the whole scene.
3. Speaker Name (voice attributes + emotion + pace + optional reference tag) delivery verb: "Dialogue." Introduce the speaker with clear voice and performance direction.
4. [Concrete sound effect or transition.] Add one specific sound that supports the story moment.
5. Speaker Name (voice attributes...) delivery verb: "Dialogue." Continue the scene with dialogue that escalates the situation.
6. [Silence, closing sound, music cue, or fade.] End with a sound cue that resolves or suspends the moment.
Don't overload the prompt. One strong sound bed plus a few precise cues beats ten random effects.
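The six-part structure above can be assembled programmatically. This is a minimal sketch: the `SceneSpec` and `SpeakerLine` types are hypothetical names for illustration, and joining the parts with newlines is an assumption. The model only ever receives the final string.

```typescript
// Hypothetical types for illustration; the model only sees the final prompt string.
interface SpeakerLine {
  speaker: string;  // e.g. "Mara"
  voice: string;    // voice attributes, emotion, pace
  verb: string;     // delivery verb such as "shouts"
  dialogue: string; // the spoken line, without surrounding quotes
}

interface SceneSpec {
  setting: string;         // part 1: genre + environment + mood
  soundBed: string;        // part 2: continuous sound bed
  opening: SpeakerLine;    // part 3: first speaker line
  soundEffect: string;     // part 4: concrete sound effect or transition
  escalation: SpeakerLine; // part 5: escalating dialogue
  ending: string;          // part 6: silence, closing sound, music cue, or fade
}

// Speaker Name (voice attributes) delivery verb: "Dialogue."
function formatLine(line: SpeakerLine): string {
  return `${line.speaker} (${line.voice}) ${line.verb}: "${line.dialogue}"`;
}

// Bracket the sound directions and keep the six parts in order, one per line.
function buildScenePrompt(scene: SceneSpec): string {
  return [
    `[${scene.setting}]`,
    `[${scene.soundBed}]`,
    formatLine(scene.opening),
    `[${scene.soundEffect}]`,
    formatLine(scene.escalation),
    `[${scene.ending}]`,
  ].join("\n");
}

const trailerScene: SceneSpec = {
  setting: "Action movie trailer style. A collapsing underground train tunnel, tense and urgent.",
  soundBed: "A deep rumbling tunnel collapse continues underneath the whole scene.",
  opening: {
    speaker: "Mara",
    voice: "female, early 30s, sharp commanding voice, fast pace",
    verb: "shouts",
    dialogue: "Move now! The tunnel is coming down behind us!",
  },
  soundEffect: "A steel support beam snaps overhead with a violent metallic crack.",
  escalation: {
    speaker: "Jonah",
    voice: "male, late 20s, breathless nervous voice, rapid pace",
    verb: "blurts",
    dialogue: "The exit gate is sealed, we're trapped!",
  },
  ending: "A sudden explosive blast punches through the sealed gate, then silence.",
};

console.log(buildScenePrompt(trailerScene));
```

Keeping each part as its own field makes it harder to pile on extra effects by accident, which fits the advice to use one sound bed plus a few precise cues.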
Example 1: Fantasy movie scene
[Fantasy adventure film style. An ancient underground trial chamber, black stone walls, glowing symbols, blue fire ahead and crimson fire behind. Tense, urgent, magical.]
[A continuous roar of flames fills the chamber throughout, with crackling embers, low stone echoes, and heat hissing against the walls.]
Tobin (young boy, breathless shaky voice, panicky, rapid pace) blurts, backing away: "We're trapped! Blue fire ahead, crimson fire behind — that's it, we're finished!"
Liora (young girl, clear articulate voice, sharp and composed, quick steady pace) answers: "Don't panic. This isn't an attack — it's a logic puzzle. There's a note here… whoever built this wanted us to think."
Kael (young boy, low tight voice, brave but tense) asks: "There are seven vials. Which one gets us through?"
Liora (focused, fast, certain) says: "Some are poison, some are harmless. One takes us forward, one takes us back… Got it. This tiny vial takes you through the blue flames."
Tobin (nervous, swallowing hard) asks: "And the way back?"
Liora (crisp) replies: "The round vial sends you through the crimson fire."
Kael (quiet, decisive) says: "But there's only enough for one."
[A brief silence. Only the flames roar.]
Liora (trembling but firm) says: "Kael, you go. You have to reach the final chamber."
Kael (softly, resolved) says: "Tobin, take Liora back and warn the elders."
Tobin (choked up) whispers: "Be careful."
Generated using Seed Audio 1.0 on fal.
Example 2: Action movie trailer
[Action movie trailer style. A collapsing underground train tunnel, red emergency lights flashing through dust, tense and urgent.]
[A deep rumbling tunnel collapse continues underneath the whole scene, with distant metal groans and loose concrete falling in short bursts.]
Mara (female, early 30s, American accent, sharp commanding voice, controlled but urgent, fast pace) shouts, forcing calm over the chaos: "Move now! The tunnel is coming down behind us!"
[A steel support beam snaps overhead with a violent metallic crack, showering sparks onto the tracks.]
Jonah (male, late 20s, American accent, breathless nervous voice, panicked but trying to keep up, rapid pace) blurts, stumbling over the rubble: "The exit gate is sealed — we're trapped!"
Mara (sharp commanding voice, controlled but urgent) snaps, decisive and fierce: "Then we make our own exit. Get behind me."
[A brief silence as the rumble drops low, leaving only falling dust and one sharp inhale.]
[A sudden explosive blast punches through the sealed gate, followed by a rush of air, crashing metal, and the percussion bed cutting hard to silence.]
Generated using Seed Audio 1.0 on fal.
2. Text-to-Speech Scenes Using Saved Reference Voices
Here's the workflow for creating basic text-to-speech output from a reusable voice library.
Create a Reference Voice Library
Use short, clean, single-speaker clips as reusable voice assets. Each reference clip should be:
- Around 30 seconds
- One speaker only
- One consistent emotion and timbre
- No music
- No second voice
- Minimal background sound
- Clear, steady mic level
- Natural pacing, roughly 70–85 words
Use these clips later in TTS prompts with reference markers such as @Audio1, @Audio2, and @Audio3.
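The checklist above can be turned into a quick sanity check before a clip goes into the library. This sketch assumes you already know the clip's metadata. The `ReferenceClip` shape is hypothetical, and the code does not analyze audio, so fields like `speakerCount` and `hasMusic` have to come from you or from another tool.

```typescript
// Hypothetical metadata record; nothing here inspects the audio itself.
interface ReferenceClip {
  label: string;
  durationSeconds: number;
  speakerCount: number;
  hasMusic: boolean;
  transcript: string;
}

// Count whitespace-separated words in the transcript.
function countWords(text: string): number {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// Return a list of checklist violations; an empty list means the clip looks usable.
function checkReferenceClip(clip: ReferenceClip): string[] {
  const problems: string[] = [];
  if (clip.durationSeconds > 30) problems.push("longer than 30 seconds");
  if (clip.speakerCount !== 1) problems.push("must contain exactly one speaker");
  if (clip.hasMusic) problems.push("contains music");
  const words = countWords(clip.transcript);
  if (words < 70 || words > 85) {
    problems.push(`${words} words; aim for roughly 70-85`);
  }
  return problems;
}
```

The word-count check is only a pacing heuristic taken from the list above, so treat that message as a warning instead of a hard failure.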
Example 1: Dex, confident broadcaster (use as @Audio1)
[Quiet studio room tone, very faint.] Dex (male, 30s–40s, American accent, smooth warm confident broadcaster voice, crisp articulation, relaxed medium pace) says, easy and inviting: "Welcome back to the show — good to have you here. Every week I sit down in this little studio thinking I've heard every story there is, and every week somebody proves me wrong. That's the whole reason we do this. So settle in, grab whatever you're drinking, and let's get into it. No script today, no rush — just a real conversation. Trust me, this one's a good one."
Example 2: Priya, lively, blunt, comedic (use as @Audio2)
[Quiet indoor room tone.] Priya (female, mid 20s, American accent, bright lively voice, blunt and quick, fast pace) says, exasperated and funny: "Okay so — first of all? Absolutely not. I have seen a lot of questionable decisions in my time, but this one might take the crown. No, I'm serious, put it down. Whatever you're about to do, the answer is no. Look, I love you, I do, but you have the survival instincts of a houseplant. Let me handle this. Sit. Stay. Watch a professional work, please."
3. Build Text-and-Audio-to-Audio Scenes
Use TA2A when you want a scene to reuse saved reference voices. Here's an example with no extra sound design, using the two saved voices above in dialogue:
Dex (smooth warm confident broadcaster voice, voiced by @Audio1), relaxed and amused, says: "Alright, Priya, I'm going to ask this carefully — what exactly did you do?"
Priya (bright lively blunt comedic voice, voiced by @Audio2), jumping in fast, already defensive, replies: "Okay, first of all, the word 'exactly' feels hostile."
Dex (voiced by @Audio1), chuckling under his breath, says: "That usually means the story is good."
Priya (voiced by @Audio2), exasperated but funny, says: "It means the story has paperwork, Dex. There's a difference."
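Because @Audio numbers follow upload order, it can help to derive the tags from a single ordered list instead of typing them by hand. The sketch below assumes a hypothetical `voiceLibrary` array whose order matches the order in which the clips are attached to the request. How the clips are attached is left out, since the endpoint's input field for reference audio is not shown in this guide.

```typescript
// Hypothetical voice library: array order must match the order clips are uploaded.
const voiceLibrary: string[] = ["Dex", "Priya"];

// Map a speaker name to its @AudioN tag based on position in the library.
function audioTag(speaker: string, library: string[]): string {
  const position = library.indexOf(speaker);
  if (position === -1) {
    throw new Error(`No reference clip saved for ${speaker}`);
  }
  if (position >= 3) {
    throw new Error("Only 3 reference clips are supported per generation");
  }
  return `@Audio${position + 1}`;
}

interface TaggedLine {
  speaker: string;
  direction: string; // performance direction, e.g. "exasperated but funny"
  verb: string;      // delivery verb, e.g. "says"
  dialogue: string;  // spoken text without surrounding quotes
}

// Produce a line in the same shape as the Dex and Priya examples above.
function formatTaggedLine(line: TaggedLine, library: string[]): string {
  const tag = audioTag(line.speaker, library);
  return `${line.speaker} (voiced by ${tag}), ${line.direction}, ${line.verb}: "${line.dialogue}"`;
}
```

If a clip is later reordered or swapped out, only `voiceLibrary` needs to change and every line picks up the new tag.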
4. T2A / TTS Scenes With No Reference Voices
Use T2A when you don't need reference audio. Instead of @Audio reference tags, fully describe each voice in the prompt.
Example 1: Dialogue
Mira (young girl, clear bright voice, calm and quick-thinking, even pace) says: "Don't worry — I've figured out harder things than this. See, if you just slow down and look at the pieces one at a time, it always makes sense in the end. People panic; that's their mistake. I don't. I think. Give me a minute and a little quiet, and I promise — I'll find the way through. I always do. Now hand me that lamp and stop shaking, would you?"
Example 2: Movie trailer
Create a deep, polished movie trailer narrator voice. In a world where aliens come down from the skies, everything is about to turn upside down for one family living in South Texas.
Example 3: Commercial
Create a British radio commercial with some intro music and then background music as a female voice actress says "Celebrate with our Summer Getaway trip sale, and right now you can save 100 pounds per person, which is 500 pounds off for a family of five..."
5. Personalized Text-to-Speech
You can record your voice on any audio platform and upload it into Seed Audio as a reference. For something simple, use a browser-based recorder to capture a clip, then upload it. Write whatever you want the voice to say and click run. You are only authorized to use voices for which you have direct permission.
Input audio file:
Below is the output from a test run of Seed Audio's personalized text-to-speech.
Output audio file:
The test was successful: the output matched the sound of the reference voice and produced a highly personalized text-to-speech result on par with the latest state-of-the-art TTS models. The output even cleaned up some of the static and clicking present in the input file.
6. Combining Multiple Audios With Background Effects or Music
Audio blending is about combining multiple sounds, actors, ambience, and music into one cohesive track so they feel like they belong together. Instead of generating a dry voice clip or a single sound effect, the model has to balance every layer at once: dialogue stays clear, background effects create the setting, music supports the emotion, and each sound enters at the right moment. This is especially useful for cinematic scenes, audio dramas, game cutscenes, ads, trailers, podcasts, and any workflow where voice, environment, and score need to feel like one finished production.
Example 1: Lighthouse storm
Interior, the glass lamp room atop a stone lighthouse at the height of a night storm — throughout, heavy rain lashes the windows, wind howls and whistles through the railings, distant thunder rolls, and waves boom against rock far below, while the great rotating lens hums and clicks in a slow, steady rhythm; a low marine foghorn sounds twice in the distance. A tense orchestral score of low strings simmers underneath, swelling at the climax.
A burst of radio static crackles. Eamon (man in his late sixties, deep gravelly voice, slow and weathered, soft Irish lilt, calm and granite-steady) says low and certain, "Easy now. She'll hold. This old lamp's outlasted worse than you." Quick wet footsteps ring on a metal stair. Mira (young woman, early twenties, bright quick voice, breathless and frightened) blurts, voice shaking, "Grandda — there's a second boat! Out past the reef, I saw her lights go under!" A trembling, ragged breath. Bram (man around forty, low hoarse voice, exhausted, soft coastal accent, teeth chattering) rasps, barely above the wind, "Two of us… there were two. My brother's still out there. Please." A beat — only wind and the slow click of the turning lens. Eamon exhales hard, decisive, "Then we don't wait for the coast guard. Mira — swing the beam south, hold it on the reef. Bram, get on that radio and don't stop calling." Mira, fear turning to fierce, "I've got it — turning now!" A heavy brass mechanism groans and grinds as the beam swings. Bram, voice cracking with hope, fast, "Mayday, mayday — this is the lighthouse, we have your position—" The orchestral score surges; thunder cracks overhead. Eamon, quiet and fierce beneath the swell, "Hold the light, girl. Don't you let it go." The foghorn blares long and low as rain and music rise together, then settle.
Example 2: Sports commentary
Inside a huge football stadium, with the deafening roar of tens of thousands of fans throughout the background. The commentator (middle-aged male, British accent, rich and penetrating voice, classic sports commentary, extremely exhilarated) shouts in a rapid, soaring, full-throated tone: "OH, HE SCORES!!! WHAT A GOAL! He beats two men and buries it in the top corner — UNBELIEVABLE! The stadium is on its feet!!!" He draws out the word "GOAL" with a voice slightly hoarse from excitement, and the crowd's cheering erupts at the moment of the goal and continues to the end.
Example 3: Sitcom
Opens with a classic upbeat sitcom intro tune — bright electric-guitar strumming, cheerful bass, crisp drums, playful and bouncy; plays a few seconds then fades out. Inside a pet store, relaxed atmosphere, very faint room tone. Dave (man around 30, American accent, low voice, forced composure, pretend-expert) says smugly and falsely composed: "Oh~ this Golden Retriever is so well taken care of, scientifically raised since a pup. How old's the little guy?" Lily (young woman around 25, American accent, bright lively voice, blunt) sincerely blurts out: "Sir… that's an alpaca." Immediately a classic sitcom "audience roaring" canned laughter bursts out, fading loud to soft. Dave freezes, struggles to save face, voice cracking mid-sentence: "I… I know that, I mean it… it really looks like a Golden Retriever." Audience laughter erupts louder. Lily nods earnestly, twisting the knife: "Yeah, it thinks so too — that's why it's been staring at you this whole time." Ends with audience laughter mixed with scattered applause.
7. Audio Extending
Audio extending lets you take an existing clip and continue it naturally, as if the original recording never stopped. This is useful when you have a short voice memo, narration, character line, or scene but need a longer version without re-recording from scratch. The model preserves the speaker's voice, pacing, and subject matter while generating a natural continuation.
Input reference audio:
@Audio1 Continue this exact speech seamlessly in the same voice and topic, as if it never stopped, for about {add_seconds} more seconds
Replace {add_seconds} with the number of additional seconds you want.
New extended section to append to the original:
You can also guide the extension more directly by adding specific lines:
@Audio1 Continue this exact speech seamlessly in the same voice and topic, as if it never stopped, for about 10 more seconds. The character should add the following lines: "and he does it. he wins the world championships and won over the hearts of all the boys and girls out there watching."
New extended section to append to the original:
This makes audio extending useful for lengthening narration, continuing character dialogue, creating alternate endings, expanding podcast-style commentary, or extending a scene while keeping the same voice and performance style.
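The two extension prompts in this section differ only in the duration and the optional guided lines, so they can be generated from one template. `buildExtendPrompt` is a hypothetical helper name; the wording of the prompt is copied from the examples above.

```typescript
// Hypothetical helper that fills in the extension prompt used in this section.
function buildExtendPrompt(addSeconds: number, extraLines?: string): string {
  // Rejects zero, negative numbers, and NaN.
  if (!(addSeconds > 0)) {
    throw new Error("addSeconds must be a positive number");
  }
  let prompt =
    "@Audio1 Continue this exact speech seamlessly in the same voice and topic, " +
    `as if it never stopped, for about ${addSeconds} more seconds`;
  // Optionally steer the continuation with specific lines.
  if (extraLines !== undefined && extraLines.trim() !== "") {
    prompt += `. The character should add the following lines: "${extraLines.trim()}"`;
  }
  return prompt;
}

console.log(buildExtendPrompt(10));
console.log(buildExtendPrompt(10, "and he does it."));
```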
8. Audio Inpainting
Audio inpainting repairs or completes a recording by filling in a missing section between two surrounding clips. Instead of regenerating the entire file from scratch, the model uses the audio before and after the gap as context, then creates the missing speech so the final recording flows naturally.
Tip: Reuse the original transcript or prompt, and append the dialogue for the new section to the end. The model often falls back to a blend of TTS and T2A behavior, so it is less reliable when the prompt doesn't spell out the new words.
Sitcom, Audio 1 (beginning):
What you hear: Dave proudly shows off his "Golden Retriever," Lily deadpans "Sir… that's an alpaca," the canned laughter hits, and the clip cuts off right there.
Sitcom, Audio 2 (with new inpainted ending):
What changed: the opening is identical, but a new comedic ending is filled in. Lily twists the knife ("you even kind of look like one yourself"), Dave snaps back ("I never want to hear from you again!"), and the audience gasps, all in the same voices and room tone as the original.
This is where inpainting gets powerful. We took one finished clip and gave it a completely different ending, without re-recording the opening or changing the original voices. The model keeps everything from the beginning intact and only generates the new section, so the added lines flow in naturally instead of sounding bolted on.
The creative implication is huge: a single piece of content can become an endless set of variations. Alternate punchlines, different character reactions, multiple endings to A/B test, or a fresh cut tailored to each platform, all built on the same opening. Instead of producing one clip, you can spin up effectively infinite versions of it from the same source audio.
@Audio1 and @Audio2 are the audio before and after a missing section. Reproduce the complete recording with the missing speech filled in naturally.
Opens with a classic upbeat sitcom intro tune — bright electric-guitar strumming, cheerful bass, crisp drums, playful and bouncy; plays a few seconds then fades out. Inside a pet store, relaxed atmosphere, very faint room tone. Dave (man around 30, American accent, low voice, forced composure, pretend-expert) says smugly and falsely composed: "Oh~ this Golden Retriever is so well taken care of, scientifically raised since a pup. How old's the little guy?" Lily (young woman around 25, American accent, bright lively voice, blunt) sincerely blurts out: "Sir… that's an alpaca." Immediately a classic sitcom "audience roaring" canned laughter bursts out, fading loud to soft. Lily (young woman around 25, American accent, bright lively voice, blunt) saying you even kind of look like one yourself. Dave (man around 30, American accent, low voice, forced composure, pretend-expert) says sharply and rudely: "I never want to hear from you again! I can't believe you would even think to say that to me." Immediately a classic sitcom "audience gasps" fading loud to soft. Abruptly the sound cuts off.
This is useful when a recording has a dropped phrase, corrupted section, awkward cut, or missing line. The goal is for the filled-in section to match the same speaker, tone, pacing, room sound, and emotional delivery, so listeners can't tell where the original audio ends and the repaired audio begins.
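The tip earlier in this section (original script first, new dialogue appended at the end) can be sketched as a small prompt assembler. `buildInpaintPrompt` and `INPAINT_PROMPT_LIMIT` are hypothetical names. The 2,048-character limit comes from the constraints listed later in this guide, and string length here counts UTF-16 code units, so treat it as an approximation of the real limit.

```typescript
// Prompt limit stated in this guide; length is measured in UTF-16 code units.
const INPAINT_PROMPT_LIMIT = 2048;

// Instruction text copied from the two-clip inpainting example above.
const INPAINT_INSTRUCTION =
  "@Audio1 and @Audio2 are the audio before and after a missing section. " +
  "Reproduce the complete recording with the missing speech filled in naturally.";

// Assemble: instruction, then the original script, then the new dialogue last.
function buildInpaintPrompt(originalScript: string, newDialogue: string): string {
  if (newDialogue.trim() === "") {
    throw new Error("Spell out the new dialogue; the model is less reliable without it");
  }
  const prompt = [
    INPAINT_INSTRUCTION,
    originalScript.trim(),
    newDialogue.trim(),
  ].join("\n");
  if (prompt.length > INPAINT_PROMPT_LIMIT) {
    throw new Error(`Prompt is ${prompt.length} characters; the limit is ${INPAINT_PROMPT_LIMIT}`);
  }
  return prompt;
}
```

Failing early on an over-long prompt is useful here because repeating the full original script leaves little room for the new section.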
Single-Audio Inpainting
You can also inpaint within a single clip. In this example, the file has a one-second gap of silence in the middle. With this workflow you can seamlessly recreate the audio without the gap.
Input audio file:
Output audio file:
@Audio1 This recording has a silent pause in the middle where a part is missing. Reproduce the full recording with the missing part filled in naturally, in the same voice, so it flows continuously.
He says "That is why I am not going to leave them. I will be here until my last breath"
9. Audio Stitching
If you have two clips that are split apart, you can merge them into one seamless-sounding track.
Piano, Audio 1 (beginning):
Piano, Audio 2 (ending):
Stitched audio:
generate a 10 second piano medley combining @Audio1 with @Audio2
10. Audio Editing
In this example, we wanted the character to say a different line, so we supplied the input reference file, changed the script, and regenerated.
Input audio file:
What you hear: the repaired clip from the single-audio inpainting example in section 8, where the character says "That is why I am not going to leave them. I will be here until my last breath."
Output audio file:
What changed: same voice, tone, and room sound, but the line is rewritten to "My brother we shall fight for the right to be alive every single day until death."
@Audio1 Keep this exact voice, tone, pacing, and room sound, but change the spoken line to: "My brother we shall fight for the right to be alive every single day until death"
Constraints to Know About
The maximum text prompt is 2,048 characters. For scale, a two-minute speech at a natural conversational pace runs roughly 1,400 to 2,000 characters. The maximum generated audio length is 2 minutes per generation, so the model works best for short clips that you stitch together. You can supply at most 3 reference clips, each 30 seconds or shorter. With fal's API you can generate all your clips at once, so producing 30+ minutes of audio for a podcast or other media is straightforward with minimal waiting.
A planned July 2026 update is expected to support generating up to 10 minutes of audio in a single pass, along with length control so you can set the desired duration of your output.
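Under the current limits, longer projects come down to simple arithmetic plus splitting the script into prompt-sized pieces. The sketch below is one rough way to do that. The function names are hypothetical, the sentence splitting is deliberately naive (it breaks on ., !, and ?), and it only packs spoken text, so any voice or sound directions you add to each chunk need their own headroom.

```typescript
// Current limits as stated in this guide.
const PROMPT_CHAR_LIMIT = 2048;
const MAX_CLIP_SECONDS = 120;

// Minimum number of generations needed to cover a target duration.
function clipsNeeded(totalSeconds: number): number {
  return Math.ceil(totalSeconds / MAX_CLIP_SECONDS);
}

// Greedily pack whole sentences into chunks that stay within the character limit.
function splitScript(script: string, limit: number = PROMPT_CHAR_LIMIT): string[] {
  // Naive sentence split: text up to a run of ., !, or ?, or up to the end of the script.
  const sentences = script.match(/[^.!?]+(?:[.!?]+|$)/g) || [];
  const chunks: string[] = [];
  let current = "";
  for (const raw of sentences) {
    const sentence = raw.trim();
    if (sentence === "") continue;
    if (sentence.length > limit) {
      throw new Error("A single sentence exceeds the chunk limit");
    }
    const candidate = current === "" ? sentence : current + " " + sentence;
    if (candidate.length <= limit) {
      current = candidate;
    } else {
      chunks.push(current);
      current = sentence;
    }
  }
  if (current !== "") chunks.push(current);
  return chunks;
}

console.log(clipsNeeded(30 * 60)); // a 30-minute podcast needs at least 15 clips
```

Since a two-minute speech runs roughly 1,400 to 2,000 characters by the estimate above, passing a lower limit such as 1,400 is a conservative way to keep each chunk under the two-minute cap.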
Two-Language Support
Seed Audio 1.0 currently supports English and Chinese audio generation, which makes it a strong option for English and Chinese voice workflows today. When writing your prompt, specify the language of the desired output. Broader multilingual support is planned for the July 2026 update.
Start Building With Seed Audio 1.0
Seed Audio 1.0 is built for audio-first production: multi-character dialogue, sound effects, and music in a single pass, with reference voices, extending, inpainting, stitching, and editing all from one model. Try it on fal and pair your audio with state-of-the-art video models to build podcasts, audiobooks, short films, and more, all through one API.



