Sam 3 Video to Video

fal-ai/sam-3/video-rle
SAM 3 is a unified foundation model for promptable segmentation in images and videos. It can detect, segment, and track objects using text or visual prompts such as points, boxes, and masks.

SAM 3 Object Tracking | [video-segmentation]

Meta's Segment Anything Model 3 (SAM 3) delivers unified video segmentation at $0.005 per 16 frames, replacing specialized single-task models with one multi-prompt model. The architecture handles text, point, box, and mask prompts in a single inference pass, so you do not have to switch between separate detection and tracking models. It is built for production teams that need real-time object isolation across video content without maintaining separate segmentation pipelines.

Use Cases: Video editing workflows | Object tracking and removal | Content moderation automation


Performance

SAM 3 Video to Video is priced at $0.005 per 16 frames, which works out to 3,200 frames per dollar, or $0.0003125 per frame. This makes it cost-effective for batch video processing compared with frame-by-frame image segmentation, which requires a separate API call for every frame.

Metric | Result | Context
Prompt Flexibility | Text, point, box, mask | Unified model handles 4 prompt types vs specialized tools per modality
Detection Threshold | 0.01 to 1.0 configurable | Default 0.5 for existing objects, 0.7 for new; lower to 0.2 to 0.3 if text prompts fail
Cost per Inference | $0.005 per 16 frames | 3,200 frames per $1.00 on fal
Output Formats | MP4 video + optional bounding box ZIP | Segmented video with frame-by-frame overlay archives
Related Endpoints | SAM 3 Image, SAM 3D Objects, SAM 3D Body | Image segmentation, object reconstruction, and human body estimation variants
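
The pricing arithmetic above can be sketched in a few lines of Python. This is a minimal estimate, not a billing implementation: the price and block size come from this page, but rounding a partial 16-frame block up to a full block is an assumption, and the function names are illustrative.

```python
import math

PRICE_PER_BLOCK_USD = 0.005  # price quoted on this page
FRAMES_PER_BLOCK = 16        # billing unit quoted on this page


def frames_per_dollar() -> float:
    """Number of frames that one dollar buys at the quoted rate."""
    return FRAMES_PER_BLOCK / PRICE_PER_BLOCK_USD


def estimate_cost_usd(frame_count: int, round_up: bool = True) -> float:
    """Estimate the cost of segmenting `frame_count` frames.

    ASSUMPTION: with round_up=True, a partial 16-frame block is billed
    as a full block. The page does not say how partial blocks are billed;
    pass round_up=False for a strictly proportional estimate.
    """
    if frame_count < 0:
        raise ValueError("frame_count must be non-negative")
    blocks = frame_count / FRAMES_PER_BLOCK
    if round_up:
        blocks = math.ceil(blocks)
    return blocks * PRICE_PER_BLOCK_USD


if __name__ == "__main__":
    # A 10-second clip at 30 fps has 300 frames.
    print(f"frames per dollar: {frames_per_dollar():.0f}")
    print(f"300 frames: ${estimate_cost_usd(300):.3f}")
```

Under the round-up assumption, a 300-frame clip spans 19 blocks, so the estimate is about $0.095.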

Multi-Prompt Architecture for Production Workflows

SAM 3 consolidates detection, segmentation, and tracking into a single API call, contrasting with traditional pipelines that chain separate models for object detection, mask generation, and temporal tracking. The unified architecture accepts text prompts ("person, cloth"), point coordinates with frame indices, bounding boxes, or initial masks, letting you switch prompt strategies mid-project without model swaps.
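
As a rough illustration of the single-call workflow, the sketch below builds a text-prompt request and submits it with the fal Python client. The endpoint id is the one shown at the top of this page and `fal_client.subscribe` is the client's blocking call, but the argument field names (`video_url`, `prompt`, `detection_threshold`) are assumptions that should be checked against the endpoint's API schema before use.

```python
from typing import Any

ENDPOINT_ID = "fal-ai/sam-3/video-rle"  # endpoint id shown on this page


def build_text_prompt_request(
    video_url: str,
    labels: list[str],
    detection_threshold: float = 0.5,
) -> dict[str, Any]:
    """Build request arguments for a text-prompted segmentation call.

    ASSUMPTION: the field names below are illustrative, not confirmed
    against the endpoint schema.
    """
    if not labels:
        raise ValueError("at least one label is required")
    # The page documents a configurable range of 0.01 to 1.0.
    if not 0.01 <= detection_threshold <= 1.0:
        raise ValueError("detection_threshold must be within 0.01 and 1.0")
    return {
        "video_url": video_url,
        # Multiple objects are passed as one comma-separated text prompt.
        "prompt": ", ".join(label.strip() for label in labels),
        "detection_threshold": detection_threshold,
    }


def run_segmentation(arguments: dict[str, Any]) -> Any:
    """Submit the request and wait for the result.

    Requires `pip install fal-client` and a FAL_KEY environment variable.
    The import is local so the builder above works without the client.
    """
    import fal_client

    return fal_client.subscribe(ENDPOINT_ID, arguments=arguments)


if __name__ == "__main__":
    args = build_text_prompt_request(
        "https://example.com/clip.mp4",  # placeholder URL
        ["person", "cloth"],
    )
    print(args)
```

Keeping the payload builder separate from the network call makes it easy to validate arguments locally before spending any credits.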

What this means for you:

  • Configurable precision control: Adjust detection thresholds per object type. Use 0.5 for high-confidence tracking of known objects, drop to 0.2 to 0.3 when text prompts initially fail to detect targets

  • Multi-object tracking: Track multiple objects simultaneously via comma-separated text prompts, eliminating sequential processing overhead

  • Frame-specific interaction: Apply point or box prompts at specific frame indices for user-guided refinement when automated detection misses edge cases

  • Developer-friendly output: Returns a segmented MP4 plus optional per-frame bounding box overlays as a ZIP archive for downstream processing pipelines
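
The threshold guidance in the list above (start at 0.5, drop to 0.2 to 0.3 when a text prompt finds nothing) can be sketched as a simple retry loop. This is one possible approach under stated assumptions: `run` is a hypothetical callable that wraps the API call for a given threshold and returns a falsy value when nothing was detected. How an empty detection actually appears in the response is not documented on this page.

```python
from typing import Callable, Optional, Sequence, Tuple

# Threshold schedule taken from the guidance above: default first, then lower.
DEFAULT_THRESHOLDS: Tuple[float, ...] = (0.5, 0.3, 0.2)


def segment_with_fallback(
    run: Callable[[float], Optional[dict]],
    thresholds: Sequence[float] = DEFAULT_THRESHOLDS,
) -> Optional[Tuple[float, dict]]:
    """Try each threshold in order and stop at the first non-empty result.

    `run` is a hypothetical wrapper around the segmentation call.
    Returns (threshold_used, result), or None if every attempt came back empty.
    """
    for threshold in thresholds:
        result = run(threshold)
        if result:
            return threshold, result
    return None


if __name__ == "__main__":
    # Stand-in for the real call: pretend the object is only found at <= 0.3.
    def fake_run(threshold: float) -> Optional[dict]:
        return {"video": "segmented.mp4"} if threshold <= 0.3 else None

    print(segment_with_fallback(fake_run))
```

Each retry is a separate billed request, so a short schedule keeps the worst-case cost predictable.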


Technical Specifications

Spec | Details
Architecture | Segment Anything Model 3
Input Formats | MP4, MOV, WebM, M4V, GIF video; optional mask PNG/JPEG
Output Formats | MP4 video with segmentation masks; optional ZIP of per-frame bounding box overlays
Prompt Types | Text strings, point coordinates (x, y, frame_index), bounding boxes, initial mask URLs
License | Commercial use permitted
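
To make the visual prompt types listed above concrete, here is a minimal sketch of how point and box prompts could be modeled and validated on the client side. The `(x, y, frame_index)` shape of a point prompt comes from this page. The box field names and the serialized dictionary layout are assumptions, not the endpoint's confirmed schema.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PointPrompt:
    """A click at pixel (x, y) on a given frame, as described on this page."""
    x: int
    y: int
    frame_index: int

    def __post_init__(self) -> None:
        if self.frame_index < 0:
            raise ValueError("frame_index must be non-negative")


@dataclass(frozen=True)
class BoxPrompt:
    """A bounding box on a given frame. Field names are hypothetical."""
    x_min: int
    y_min: int
    x_max: int
    y_max: int
    frame_index: int

    def __post_init__(self) -> None:
        if self.frame_index < 0:
            raise ValueError("frame_index must be non-negative")
        if self.x_max <= self.x_min or self.y_max <= self.y_min:
            raise ValueError("box must have positive width and height")


if __name__ == "__main__":
    point = PointPrompt(x=320, y=180, frame_index=0)
    box = BoxPrompt(x_min=100, y_min=50, x_max=400, y_max=300, frame_index=12)
    # asdict() yields plain dictionaries that can be placed in a JSON payload.
    print(asdict(point))
    print(asdict(box))
```

Validating prompts before submission catches inverted boxes and negative frame indices without a round trip to the API.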



How It Stacks Up

SAM 3 Image to Image ($0.005 per image) – SAM 3 Video to Video extends the same prompt flexibility to temporal data at the same $0.005 unit price, but the video unit is 16 frames rather than one image, and it adds frame-by-frame tracking that image endpoints can't provide. SAM 3 Image remains ideal for single-frame segmentation tasks where temporal consistency isn't required.

SAM 3D Objects (see pricing) – SAM 3 Video to Video handles 2D video segmentation, while SAM 3D Objects reconstructs full 3D geometry from images. Use Video to Video for content editing workflows; use 3D Objects when you need mesh output for game engines or AR applications.

SAM 3D Body (see pricing) – SAM 3 Video to Video segments any object via text or visual prompts, favoring general-purpose flexibility over specialized human body estimation. SAM 3D Body delivers SMPL-format body models with 70 keypoints for motion capture pipelines where human-specific accuracy matters.