SAM 3 Video to Video
SAM 3 Object Tracking | [video-segmentation]
Meta's Segment Anything Model 3 (SAM 3) delivers unified video segmentation at $0.005 per 16 frames, trading specialized single-task models for multi-prompt flexibility. The architecture handles text, point, box, and mask prompts in a single inference pass, eliminating the workflow friction of switching between detection and tracking models. It is built for production teams that need real-time object isolation across video content without maintaining separate segmentation pipelines.
Use Cases: Video editing workflows | Object tracking and removal | Content moderation automation
Performance
SAM 3 Video to Video is priced at $0.005 per 16 frames, or 3,200 frames per dollar (200 blocks of 16 frames). That makes it cost-effective for batch video processing compared to frame-by-frame image segmentation, which would require a separate API call for every frame.
| Metric | Result | Context |
|---|---|---|
| Prompt Flexibility | Text, point, box, mask | Unified model handles 4 prompt types vs specialized tools per modality |
| Detection Threshold | 0.01 to 1.0 configurable | Default 0.5 for existing objects, 0.7 for new; lower to 0.2 to 0.3 if text prompts fail |
| Cost | $0.005 per 16 frames | 3,200 frames per $1.00 on fal |
| Output Formats | MP4 video + optional bounding box ZIP | Segmented video with frame-by-frame overlay archives |
| Related Endpoints | SAM 3 Image, SAM 3D Objects, SAM 3D Body | Image segmentation, object reconstruction, and human body estimation variants |
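As a rough sanity check on the pricing figures above, the arithmetic can be sketched in Python. This assumes usage is billed in whole 16-frame blocks, rounded up; the page does not say how partial blocks are billed, so that rounding and the function names here are illustrative assumptions only.

```python
import math
from decimal import Decimal

PRICE_PER_BLOCK = Decimal("0.005")  # USD per block, from the stated pricing
FRAMES_PER_BLOCK = 16


def frames_per_dollar() -> int:
    """Frames processed per 1.00 USD at the listed rate."""
    return int(Decimal(1) / PRICE_PER_BLOCK * FRAMES_PER_BLOCK)


def estimate_cost(frame_count: int) -> Decimal:
    """Estimated cost in USD.

    Assumption: a partial block is billed as a full block.
    """
    if frame_count < 0:
        raise ValueError("frame_count must be non-negative")
    blocks = math.ceil(frame_count / FRAMES_PER_BLOCK)
    return PRICE_PER_BLOCK * blocks
```

Under that assumption, a 10-second clip at 30 fps (300 frames) occupies 19 blocks, so `estimate_cost(300)` comes to 0.095 USD.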
Multi-Prompt Architecture for Production Workflows
SAM 3 consolidates detection, segmentation, and tracking into a single API call, contrasting with traditional pipelines that chain separate models for object detection, mask generation, and temporal tracking. The unified architecture accepts text prompts ("person, cloth"), point coordinates with frame indices, bounding boxes, or initial masks, letting you switch prompt strategies mid-project without model swaps.
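To make the four prompt types concrete, here is a minimal sketch of how a client might assemble one request body from any mix of them. Every key name (`video_url`, `prompt`, `point_prompts`, `box_prompts`, `mask_url`) and both dataclasses are hypothetical placeholders, not the documented schema; check the endpoint's API reference for the real field names.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class PointPrompt:
    x: int
    y: int
    frame_index: int


@dataclass
class BoxPrompt:
    x_min: int
    y_min: int
    x_max: int
    y_max: int
    frame_index: int


def build_request(
    video_url: str,
    text_targets: Optional[list[str]] = None,
    points: Optional[list[PointPrompt]] = None,
    boxes: Optional[list[BoxPrompt]] = None,
    mask_url: Optional[str] = None,
) -> dict:
    """Assemble a request body. All key names are hypothetical."""
    if not (text_targets or points or boxes or mask_url):
        raise ValueError("provide at least one prompt")
    payload: dict = {"video_url": video_url}
    if text_targets:
        # Several objects go into one comma-separated text prompt.
        payload["prompt"] = ", ".join(t.strip() for t in text_targets)
    if points:
        payload["point_prompts"] = [asdict(p) for p in points]
    if boxes:
        payload["box_prompts"] = [asdict(b) for b in boxes]
    if mask_url:
        payload["mask_url"] = mask_url
    return payload
```

For example, `build_request("https://example.com/clip.mp4", text_targets=["person", "cloth"])` produces a body whose text prompt is `"person, cloth"`. Switching to point or box prompts changes only the arguments, not the call.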
What this means for you:
- Configurable precision control: Adjust detection thresholds per object type. Use 0.5 for high-confidence tracking of known objects, and drop to 0.2 to 0.3 when text prompts initially fail to detect targets.
- Multi-object tracking: Track multiple objects simultaneously via comma-separated text prompts, eliminating sequential processing overhead.
- Frame-specific interaction: Apply point or box prompts at specific frame indices for user-guided refinement when automated detection misses edge cases.
- Developer-friendly output: Returns a segmented MP4 plus optional per-frame bounding box overlays as a ZIP archive for downstream processing pipelines.
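The threshold advice above (start at 0.5, lower toward 0.2 to 0.3 when a text prompt finds nothing) can be sketched as a simple retry loop. `run_segmentation` is a hypothetical callable standing in for whatever code actually sends the request and returns a list of detections; the fallback order is one reasonable choice, not a documented behavior.

```python
from typing import Callable, Optional, Sequence

# Thresholds to try in order: the 0.5 default, then the lower values
# suggested for text prompts that fail to detect anything.
FALLBACK_THRESHOLDS = (0.5, 0.3, 0.2)


def detect_with_fallback(
    run_segmentation: Callable[[float], Sequence],
    thresholds: Sequence[float] = FALLBACK_THRESHOLDS,
) -> tuple[Optional[float], Sequence]:
    """Call run_segmentation(threshold) until it returns any detections.

    Returns (threshold_used, detections), or (None, []) if every
    attempt comes back empty.
    """
    for threshold in thresholds:
        # The configurable range stated on this page is 0.01 to 1.0.
        if not 0.01 <= threshold <= 1.0:
            raise ValueError(f"threshold {threshold} outside 0.01 to 1.0")
        detections = run_segmentation(threshold)
        if detections:
            return threshold, detections
    return None, []
```

Each retry is a separate request, so it would presumably be billed again. Lower thresholds also tend to admit more false positives, which is why the loop stops at the first threshold that finds anything.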
Technical Specifications
| Spec | Details |
|---|---|
| Architecture | Segment Anything Model 3 |
| Input Formats | MP4, MOV, WebM, M4V, GIF video; optional mask PNG/JPEG |
| Output Formats | MP4 video with segmentation masks; optional ZIP of per-frame bounding box overlays |
| Prompt Types | Text strings, point coordinates (x, y, frame_index), bounding boxes, initial mask URLs |
| License | Commercial use permitted |
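Since the endpoint accepts only the listed container formats, a client might reject unsupported files before uploading. The sketch below checks the file extension only, which is a cheap heuristic and not a guarantee about the actual file contents; the helper name is made up for illustration.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Extensions matching the input formats listed in the table above.
VIDEO_EXTENSIONS = {".mp4", ".mov", ".webm", ".m4v", ".gif"}
MASK_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def has_accepted_extension(path_or_url: str, allowed: set[str]) -> bool:
    """True if the file name ends in one of the allowed extensions."""
    # urlparse separates any query string or fragment from a URL;
    # a typical local path ends up unchanged in the .path attribute.
    path = urlparse(path_or_url).path
    return PurePosixPath(path).suffix.lower() in allowed
```

For instance, `has_accepted_extension("https://example.com/a/clip.mp4?token=1", VIDEO_EXTENSIONS)` is true, while a `.avi` file would be rejected.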
How It Stacks Up
SAM 3 Image to Image ($0.005 per image) – SAM 3 Video to Video extends the same prompt flexibility to temporal data at the same $0.005 unit price, but each unit covers 16 frames rather than one image, and it adds frame-to-frame tracking that image endpoints can't provide. SAM 3 Image remains ideal for single-frame segmentation tasks where temporal consistency isn't required.
SAM 3D Objects (see pricing) – SAM 3 Video to Video handles 2D video segmentation, while SAM 3D Objects reconstructs full 3D geometry from images. Use Video to Video for content editing workflows; use 3D Objects when you need mesh output for game engines or AR applications.
SAM 3D Body (see pricing) – SAM 3 Video to Video segments any object via text or visual prompts, trading specialized human body estimation for general-purpose flexibility. SAM 3D Body delivers SMPL-format body models with 70 keypoints for motion capture pipelines where human-specific accuracy matters.
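To put the SAM 3 Image comparison in numbers, the sketch below works out what the same frames would cost through each endpoint, using only the two list prices quoted on this page. It assumes strictly proportional pricing with no rounding or minimum charge, which may not match actual billing.

```python
from fractions import Fraction

IMAGE_PRICE_PER_FRAME = Fraction(5, 1000)       # USD per image, SAM 3 Image
VIDEO_PRICE_PER_FRAME = Fraction(5, 1000) / 16  # USD per frame, this endpoint


def compare_costs(frame_count: int) -> tuple[float, float]:
    """Return (image_endpoint_usd, video_endpoint_usd) for frame_count frames."""
    image_cost = IMAGE_PRICE_PER_FRAME * frame_count
    video_cost = VIDEO_PRICE_PER_FRAME * frame_count
    return float(image_cost), float(video_cost)
```

At these rates, 3,200 frames cost $16.00 as individual images versus $1.00 as video, a 16x difference per frame before counting the overhead of 3,200 separate API calls.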