SAM 3: Video Segmentation AI | Object Detection + Tracking

SAM 3 Object Tracking | [video-segmentation]

Meta's Segment Anything Model 3 (SAM 3) delivers unified video segmentation at $0.005 per 16 frames, trading specialized single-task models for multi-prompt flexibility. The architecture handles text, point, box, and mask prompts in a single inference pass, so you no longer need to switch between separate detection and tracking models. It is built for production teams that need real-time object isolation across video content without maintaining separate segmentation pipelines.

Use Cases: Video editing workflows | Object tracking and removal | Content moderation automation

Performance

SAM 3 Video to Video is priced at $0.005 per 16 frames, which works out to 3,200 frames per dollar. That makes it cost-effective for batch video processing compared with frame-by-frame image segmentation, which requires a separate API call for every frame.
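The pricing arithmetic above can be sketched as a small estimator. This is a rough illustration, not a billing reference: it assumes partial 16-frame blocks are billed as full blocks (the page does not say how remainders are handled), and the function names are made up for this example. The per-image price used for comparison is the $0.005 per image listed for SAM 3 Image on this page.

```python
import math

PRICE_PER_BLOCK_USD = 0.005  # video price per 16-frame block, as stated on this page
FRAMES_PER_BLOCK = 16
PRICE_PER_IMAGE_USD = 0.005  # SAM 3 Image price per image, as stated on this page


def frames_per_dollar() -> int:
    """Frames processed per $1.00 at the video rate."""
    return round(FRAMES_PER_BLOCK / PRICE_PER_BLOCK_USD)


def estimate_video_cost(total_frames: int) -> float:
    """Estimated USD cost for a video.

    ASSUMPTION: a partial block is billed as a full block (rounding up).
    """
    if total_frames < 0:
        raise ValueError("total_frames must be non-negative")
    blocks = math.ceil(total_frames / FRAMES_PER_BLOCK)
    return round(blocks * PRICE_PER_BLOCK_USD, 6)


def estimate_image_cost(total_frames: int) -> float:
    """Estimated USD cost of segmenting every frame as a separate image."""
    if total_frames < 0:
        raise ValueError("total_frames must be non-negative")
    return round(total_frames * PRICE_PER_IMAGE_USD, 6)


if __name__ == "__main__":
    frames = 30 * 30  # a 30-second clip at 30 fps
    print(f"video endpoint:  ${estimate_video_cost(frames):.3f}")
    print(f"image per frame: ${estimate_image_cost(frames):.3f}")
```

Under these assumptions, a 900-frame clip needs 57 blocks on the video endpoint, against 900 separate calls on the image endpoint.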

Metric	Result	Context
Prompt Flexibility	Text, point, box, mask	Unified model handles 4 prompt types vs specialized tools per modality
Detection Threshold	0.01 to 1.0 configurable	Default 0.5 for existing objects, 0.7 for new; lower to 0.2 to 0.3 if text prompts fail
Cost per Inference	$0.005 per 16 frames	3,200 frames per $1.00 on fal
Output Formats	MP4 video + optional bounding box ZIP	Segmented video with frame-by-frame overlay archives
Related Endpoints	SAM 3 Image, SAM 3D Objects, SAM 3D Body	Image segmentation, object reconstruction, and human body estimation variants
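The threshold guidance in the table (start at the default, lower to 0.2 to 0.3 if text prompts fail) can be expressed as a simple retry schedule. The sketch below is one possible way to organize that, not a documented client feature: `run_detection` is a hypothetical callable standing in for whatever code submits a request at a given threshold and returns the number of detected objects.

```python
from typing import Callable, List, Optional, Sequence

MIN_THRESHOLD = 0.01  # configurable range stated in the table above
MAX_THRESHOLD = 1.0
DEFAULT_THRESHOLD = 0.5
FALLBACK_THRESHOLDS = (0.3, 0.2)  # the "lower to 0.2 to 0.3" guidance


def threshold_schedule(
    start: float = DEFAULT_THRESHOLD,
    fallbacks: Sequence[float] = FALLBACK_THRESHOLDS,
) -> List[float]:
    """Return the thresholds to try, in order, after range-checking each one."""
    schedule = [start, *fallbacks]
    for value in schedule:
        if not MIN_THRESHOLD <= value <= MAX_THRESHOLD:
            raise ValueError(f"threshold {value} is outside [0.01, 1.0]")
    return schedule


def detect_with_fallback(
    run_detection: Callable[[float], int],
    schedule: Sequence[float],
) -> Optional[float]:
    """Return the first threshold at which at least one object is found.

    `run_detection` is HYPOTHETICAL: it takes a threshold and returns the
    number of detected objects. Returns None if every threshold fails.
    """
    for threshold in schedule:
        if run_detection(threshold) > 0:
            return threshold
    return None


if __name__ == "__main__":
    # Stand-in detector that only finds the target at 0.3 or below.
    fake_detector = lambda threshold: 1 if threshold <= 0.3 else 0
    print(detect_with_fallback(fake_detector, threshold_schedule()))
```

Each retry is a separate billed inference, so a short schedule keeps the cost of a failed text prompt bounded.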

Multi-Prompt Architecture for Production Workflows

SAM 3 consolidates detection, segmentation, and tracking into a single API call, contrasting with traditional pipelines that chain separate models for object detection, mask generation, and temporal tracking. The unified architecture accepts text prompts ("person, cloth"), point coordinates with frame indices, bounding boxes, or initial masks, letting you switch prompt strategies mid-project without model swaps.

What this means for you:

Configurable precision control: Adjust detection thresholds per object type. Use 0.5 for high-confidence tracking of known objects; drop to 0.2 to 0.3 when text prompts initially fail to detect targets
Multi-object tracking: Track multiple objects simultaneously via comma-separated text prompts, eliminating sequential processing overhead
Frame-specific interaction: Apply point or box prompts at specific frame indices for user-guided refinement when automated detection misses edge cases
Developer-friendly output: Returns a segmented MP4 plus optional per-frame bounding box overlays as a ZIP archive for downstream processing pipelines
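As a rough illustration of how these options might combine in one request, the sketch below builds a payload with comma-separated text targets, a detection threshold, and frame-indexed point prompts. The field names (`video_url`, `prompt`, `detection_threshold`, `point_prompts`) are assumptions made for this example, not the confirmed schema; check the endpoint's API documentation for the real parameter names. The endpoint ID is the one listed on this page, and `fal_client.subscribe` is the Python client call used to submit it.

```python
from typing import Any, Dict, Optional, Sequence, Tuple

ENDPOINT_ID = "fal-ai/sam-3/video-rle"  # endpoint ID as listed on this page
MIN_THRESHOLD, MAX_THRESHOLD = 0.01, 1.0

PointPrompt = Tuple[int, int, int]  # (x, y, frame_index)


def build_request(
    video_url: str,
    text_targets: Sequence[str],
    detection_threshold: float = 0.5,
    points: Optional[Sequence[PointPrompt]] = None,
) -> Dict[str, Any]:
    """Build a request payload. All field names are HYPOTHETICAL."""
    if not MIN_THRESHOLD <= detection_threshold <= MAX_THRESHOLD:
        raise ValueError("detection_threshold must be within [0.01, 1.0]")
    targets = [t.strip() for t in text_targets if t.strip()]
    if not targets and not points:
        raise ValueError("provide at least one text target or point prompt")

    request: Dict[str, Any] = {
        "video_url": video_url,
        "detection_threshold": detection_threshold,
    }
    if targets:
        # Multiple objects are tracked via one comma-separated text prompt.
        request["prompt"] = ", ".join(targets)
    if points:
        # Point prompts are pinned to the frame where the user clicked.
        request["point_prompts"] = [
            {"x": x, "y": y, "frame_index": frame} for x, y, frame in points
        ]
    return request


def submit(arguments: Dict[str, Any]) -> Any:
    """Send the request via the fal Python client (needs FAL_KEY set)."""
    import fal_client  # third-party: pip install fal-client

    return fal_client.subscribe(ENDPOINT_ID, arguments=arguments)


if __name__ == "__main__":
    payload = build_request(
        "https://example.com/clip.mp4",  # placeholder URL
        ["person", "cloth"],
        detection_threshold=0.3,
        points=[(120, 80, 0)],
    )
    print(payload)
```

Keeping payload construction separate from submission makes the request easy to validate and log before any billed inference runs.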

Technical Specifications

Spec	Details
Architecture	Segment Anything Model 3
Input Formats	MP4, MOV, WebM, M4V, GIF video; optional mask PNG/JPEG
Output Formats	MP4 video with segmentation masks; optional ZIP of per-frame bounding box overlays
Prompt Types	Text strings, point coordinates (x, y, frame_index), bounding boxes, initial mask URLs
License	Commercial use permitted
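A pre-flight check against the input formats in the table can catch unsupported files before a request is billed. The sketch below is a client-side convenience only, under the assumption that the file extension reflects the container format; the helper names are illustrative, and treating both `.jpg` and `.jpeg` as JPEG is this example's choice.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Formats taken from the Input Formats row above.
SUPPORTED_VIDEO_EXTENSIONS = {".mp4", ".mov", ".webm", ".m4v", ".gif"}
SUPPORTED_MASK_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def extension_of(url_or_path: str) -> str:
    """Return the lowercase file extension, ignoring any URL query string."""
    path = urlparse(url_or_path).path
    return PurePosixPath(path).suffix.lower()


def is_supported_video(url_or_path: str) -> bool:
    """True if the extension matches one of the listed video formats."""
    return extension_of(url_or_path) in SUPPORTED_VIDEO_EXTENSIONS


def is_supported_mask(url_or_path: str) -> bool:
    """True if the extension matches one of the listed mask image formats."""
    return extension_of(url_or_path) in SUPPORTED_MASK_EXTENSIONS


if __name__ == "__main__":
    for candidate in ("https://example.com/a/clip.MOV?token=1", "clip.avi"):
        print(candidate, is_supported_video(candidate))
```

An extension check does not verify the actual codec or container, so the service may still reject a file that passes it.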


How It Stacks Up

SAM 3 Image to Image ($0.005 per image) – SAM 3 Video to Video extends the same prompt flexibility to temporal data at the same $0.005 unit price, billed per 16 frames rather than per image, and adds frame-by-frame tracking that image endpoints can't provide. SAM 3 Image remains ideal for single-frame segmentation tasks where temporal consistency isn't required.

[SAM 3D Objects] (see pricing) – SAM 3 Video to Video handles 2D video segmentation, while SAM 3D Objects reconstructs full 3D geometry from images. Use Video to Video for content editing workflows; use 3D Objects when you need mesh output for game engines or AR applications.

[SAM 3D Body] (see pricing) – SAM 3 Video to Video segments any object via text or visual prompts, trading specialized human body estimation for general-purpose flexibility. SAM 3D Body delivers SMPL-format body models with 70 keypoints for motion capture pipelines where human-specific accuracy matters.

fal-ai/sam-3/video-rle
