Pipio Lipsync API
This service runs the core model of EditYourself / pipio.ai: a diffusion-based model that generates professional-grade lip-sync for talking-head videos from a video or image input and its corresponding audio. It also supports seamless addition and removal of scenes while preserving the speaker's identity and visual continuity end-to-end.
Additions and removals are controlled with the `edit_addition*` and `edit_removal*` fields; see their field descriptions for usage details.
Requests are billed at $0.09 per second for 1080p, $0.16 per second for 1440p, or $0.36 per second for 2160p.
For more information, visit our project page or pipio.ai.
Required Fields
`video`
Type: `File` (required)
URL or path to the conditioning input video or image file. If the conditioning input is a video, the model runs video-to-video (v2v); if it is an image, it runs image-to-video (i2v).
`audio`
Type: `File` (required)
URL or path to the audio file used for lip-syncing; the audio determines the lip movements in the output. Can be either an audio or a video file. If a video file is provided, only its audio track is used.
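A minimal request needs only the two required fields. The sketch below (Python) builds such a payload as a plain dictionary and posts it as JSON; the endpoint URL and transport are assumptions here and depend on how the service is hosted.

```python
import json
import urllib.request

# Hypothetical endpoint -- substitute the URL from your deployment.
ENDPOINT = "https://example.com/pipio-lipsync"

# Minimal payload: only the two required fields.
payload = {
    "video": "https://example.com/assets/talking_head.mp4",  # video input -> v2v, image input -> i2v
    "audio": "https://example.com/assets/narration.wav",     # drives the lip movements
}

request = urllib.request.Request(
    ENDPOINT,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# response = urllib.request.urlopen(request)  # uncomment once ENDPOINT points at a real deployment
```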
Video/Audio Settings
`frame_rate`
Type: `integer`
Default: `-1` (native fps)
Frame rate (fps) of the output video.
| Value | Behavior |
|---|---|
| `-1` | Uses the native frame rate of the input video |
| `24` | Forces 24 fps output |
| `30` | Forces 30 fps output |
`height`
Type: `integer`
Default: `-1` (native height)
Height in pixels of the output video.
| Value | Behavior |
|---|---|
| `-1` | Native height |
| `-2` | Native height ÷ 2 |
| `-3` | Native height ÷ 3 |
| `720` | Fixed 720px height |
`width`
Type: `integer`
Default: `-1` (native width)
Width in pixels of the output video. Uses the same special value system as `height`.
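To illustrate the special values, here is a small client-side helper (not part of the API) that resolves a requested `height`/`width` against the input's native dimension; any additional rounding the service may apply is not modeled.

```python
def resolve_dimension(requested: int, native: int) -> int:
    """Resolve a height/width request against the native dimension.

    -1 -> native, -2 -> native // 2, -3 -> native // 3,
    any positive value -> used as-is.
    """
    if requested > 0:
        return requested
    if requested in (-1, -2, -3):
        return native // -requested
    raise ValueError(f"Unsupported dimension value: {requested}")

# Example: a 1920x1080 input with height=-2 and width=-2 yields 960x540.
print(resolve_dimension(-2, 1080), resolve_dimension(-2, 1920))
```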
`num_frames`
Type: `integer`
Default: `100`
Number of frames to process from the input video. This determines the length of the output video when no edits are present.
Processing Parameters
`vae_chunk_size`
Type: `integer`
Default: `65`
Size of chunks for VAE encode/decode during long video inference. Must follow the formula `8n + 1` (e.g., 17, 25, 33, 41, 49, 57, 65, 73...).
- Lower values: Less memory usage, potentially slower
- Higher values: More memory usage, potentially faster
- Very large values: Disables chunking entirely
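If you compute the chunk size programmatically, it helps to snap it to a valid `8n + 1` value first. A minimal client-side sketch (the lower bound of 17 is taken from the examples above, not a documented minimum):

```python
def snap_to_8n_plus_1(value: int) -> int:
    """Round a chunk size down to the nearest value of the form 8n + 1."""
    if value < 17:
        return 17  # smallest value listed in the examples above; actual minimum unverified
    return ((value - 1) // 8) * 8 + 1

assert snap_to_8n_plus_1(65) == 65   # already valid
assert snap_to_8n_plus_1(70) == 65   # snapped down to the nearest 8n + 1
assert snap_to_8n_plus_1(73) == 73
```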
`vae_overlap_window_width`
Type: `integer`
Default: `16`
Size of the overlap window between VAE encode/decode chunks. Helps reduce visible seams between chunks.
| Value | Behavior |
|---|---|
| `0` | No overlap (may cause visible seams) |
| `8-16` | Typical values for smooth transitions |
| `32+` | Higher quality but slower |
`frame_block_width`
Type: `integer`
Default: `136`
For long video inference, the transformer processes the video in blocks of this width (in frames). Affects temporal consistency and memory usage.
- Lower values: Less memory, potentially less temporal consistency
- Higher values: Better temporal consistency, more memory
`feed_forward_num_splits`
Type: `integer`
Default: `2`
Number of chunks to split the feed-forward layer into during processing. Higher values reduce memory usage but may increase processing time.
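When you run into memory limits, these are the usual levers. A hedged example of request overrides that trade speed for a smaller memory footprint; the specific numbers are illustrative, not tuned recommendations:

```python
# Illustrative low-memory overrides; merge these into your request payload.
low_memory_overrides = {
    "vae_chunk_size": 33,           # must stay of the form 8n + 1
    "vae_overlap_window_width": 8,  # keep some overlap to avoid visible seams
    "frame_block_width": 88,        # smaller transformer blocks, less memory
    "feed_forward_num_splits": 4,   # more splits -> lower peak memory, slower
}

payload = {
    "video": "https://example.com/assets/talking_head.mp4",
    "audio": "https://example.com/assets/narration.wav",
    **low_memory_overrides,
}
```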
Conditioning Strength
`face_id_cond_strength`
Type: `integer`
Default: `8`
Range: `1-16`
Controls how strongly the model preserves the subject's face identity from the input video.
`appearance_cond_strength`
Type: `integer`
Default: `1`
Range: `1-16`
Controls how closely fully synthetic frames align with the conditioning (original video appearance).
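For instance, to favor identity preservation you might raise `face_id_cond_strength` while leaving `appearance_cond_strength` at its default; the values below are illustrative only:

```python
conditioning_overrides = {
    "face_id_cond_strength": 12,    # stronger identity preservation (range 1-16)
    "appearance_cond_strength": 1,  # default appearance conditioning
}
```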
Edit Operations
Edit operations allow you to add or remove content from the video timeline.
`edit_addition_start_frames`
Type: `list[int] | null`
Default: `null`
List of 0-based frame indices where new synthetic content should be inserted. Must have the same length as `edit_addition_durations`.
`edit_addition_durations`
Type: `list[int] | null`
Default: `null`
List of durations (in frames) for each addition edit. Must match the length of `edit_addition_start_frames`.
`edit_removal_ranges`
Type: `list[int] | null`
Default: `null`
List of frame index pairs specifying ranges to remove from the video. Values come in pairs: `[start1, end1, start2, end2, ...]`. Both start and end are inclusive, 0-based indices.
`edit_removal_bridge_durations`
Type: `list[int] | null`
Default: `null`
List of bridge durations (in frames) for each removal range. Determines how the gap is filled:
| Value | Behavior |
|---|---|
| `0` | Jump cut (no transition) |
| `>0` | Synthetic bridge frames |
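Putting the four edit fields together, here is a hedged sketch of a request that inserts 16 synthetic frames at frame 40 and removes frames 200-239 with a short generated bridge; the frame indices and durations are illustrative:

```python
edit_payload = {
    "video": "https://example.com/assets/talking_head.mp4",
    "audio": "https://example.com/assets/narration.wav",
    # Insert 16 synthetic frames starting at frame 40 (0-based).
    "edit_addition_start_frames": [40],
    "edit_addition_durations": [16],
    # Remove frames 200-239 inclusive, bridging the gap with 8 synthetic frames.
    "edit_removal_ranges": [200, 239],
    "edit_removal_bridge_durations": [8],
}

# Client-side sanity checks mirroring the documented constraints.
assert len(edit_payload["edit_addition_start_frames"]) == len(edit_payload["edit_addition_durations"])
assert len(edit_payload["edit_removal_ranges"]) == 2 * len(edit_payload["edit_removal_bridge_durations"])
```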
Advanced Settings
`seed`
Type: `integer`
Default: `42`
Random seed for reproducible video generation. Use the same seed with identical inputs to get consistent results.
`use_custom_prompt`
Type: `boolean`
Default: `false`
When enabled, uses the `custom_prompt` field instead of automatically generating a prompt from the video content.
`custom_prompt`
Type: `string`
Default: `"A high quality video."`
Custom text prompt describing the video. Only used when `use_custom_prompt` is `true`. A good prompt helps the model understand the scene context.
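A short example of combining a fixed seed with a custom prompt for reproducible, prompt-guided runs (the prompt text is illustrative):

```python
advanced_overrides = {
    "seed": 1234,                   # reuse the same seed for consistent results
    "use_custom_prompt": True,      # required for custom_prompt to take effect
    "custom_prompt": "A news anchor speaking to camera in a brightly lit studio.",
}
```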
Tips and Best Practices
- Start with defaults: the default values work well for most use cases; only adjust if needed.
- Memory issues? Use a lower resolution, reduce `vae_chunk_size` or `frame_block_width`, or increase `feed_forward_num_splits`.
- Better quality? Increase `face_id_cond_strength` for identity preservation and use a higher resolution.
- Reproducibility: always set the same `seed` if you need consistent outputs.
- Edit operations: ensure `edit_addition_start_frames` and `edit_addition_durations` have matching lengths, and that `edit_removal_ranges` contains one pair of values per entry in `edit_removal_bridge_durations`.
- Resolution: using native resolution (`-1`) generally produces the best quality; downscaling (`-2`, `-3`) can speed up processing.
- Frame alignment: many internal parameters work in multiples of 8 frames. When possible, use `num_frames` values like 121, 129, 137, etc. (8n + 1); see the sketch below.
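As a convenience, here is a small client-side helper (not part of the API) that converts a target duration into a `num_frames` value of the form `8n + 1`:

```python
def aligned_num_frames(seconds: float, fps: float) -> int:
    """Convert a target duration to the nearest frame count of the form 8n + 1."""
    frames = round(seconds * fps)
    return max(9, round((frames - 1) / 8) * 8 + 1)  # 9 = 8*1 + 1, used as a conservative floor

# Example: ~5 seconds at 24 fps -> 121 frames (24 * 5 = 120, snapped to 121).
print(aligned_num_frames(5, 24))
```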