SAM 3: Video to Video Segmentation + Object Tracking

Segment Anything Model 3 (SAM 3) | [video-to-video]

Meta's SAM 3 delivers unified video segmentation at $0.005 per 16 frames, trading specialized single-frame accuracy for continuous object tracking across video sequences. This architecture handles text, point, box, and mask prompts simultaneously—eliminating the need to switch between separate image and video segmentation tools for production workflows.

Use Cases: Video Content Moderation | Automated Sports Highlight Generation | Medical Video Analysis

Performance

At $0.31 per 1,000 frames ($0.005 per 16 frames), SAM 3 Video provides cost-effective video segmentation compared to frame-by-frame processing with dedicated image models.

Metric	Result	Context
Detection Threshold	0.1-1.0 (adjustable)	Configurable precision vs. recall tradeoff for domain-specific needs
Prompt Flexibility	4 input types	Text, points, boxes, masks—combinable within single inference
Cost per 16 Frames	$0.005	Approximately 200 frames per $1.00 on fal
Output Formats	MP4 video + optional bbox ZIP	Segmented video with per-frame bounding box overlays available
Related Endpoints	SAM 3 Image, SAM 3 3D Objects, SAM 3 3D Body	Image, object reconstruction, and body estimation variants

SAM 3 Video unifies four prompt types—text descriptions, coordinate points, bounding boxes, and mask overlays—into a single inference call. Standard video segmentation workflows require separate tools for each prompt modality, forcing you to maintain multiple API integrations and reconcile inconsistent output formats.

What this means for you:

Cross-frame object persistence: Track "person, cloth" across camera cuts and occlusions using comma-separated text prompts. The model maintains object identity without manual re-initialization
Adjustable detection sensitivity: The `detection_threshold` parameter (0.1-1.0) lets you tune false positive rates per use case—set to 0.3 for high-recall content moderation or 0.8 for precision medical imaging
Optional mask application: Toggle `apply_mask: false` to receive raw segmentation data without visual overlays, enabling custom post-processing or analytics pipelines
Per-frame bounding box export: Request `boundingbox_frames_zip` output to extract frame-by-frame coordinates for downstream tracking systems or training data generation

Technical Specifications

Spec	Details
Architecture	Segment Anything Model 3
Input Formats	MP4, MOV, WebM, M4V, GIF video files
Output Formats	MP4 video (masked), ZIP archive (per-frame bounding boxes)
Prompt Types	Text strings, point coordinates, box coordinates (x_min, y_min, x_max, y_max), mask overlays
License	Commercial use permitted

API Documentation | Quickstart Guide | Enterprise Pricing

How It Stacks Up

SAM 3 Image ($0.002/image) – SAM 3 Video trades per-frame segmentation precision for temporal object tracking at 2.5x the per-frame cost ($0.005 per 16 frames vs. $0.002 per static image). SAM 3 Image remains ideal for single-frame workflows like product photography masking or medical scan annotation where frame-to-frame consistency isn't required.

SAM 3 3D Objects ($0.015/inference) – SAM 3 Video provides 2D video segmentation and tracking at $0.005 per 16 frames, while SAM 3 3D Objects reconstructs full 3D meshes from images at $0.015 per generation—3x the cost for spatial geometry. Use SAM 3 Video for content analysis pipelines and SAM 3 3D Objects when you need exportable GLB/FBX assets for game engines or AR applications.

SAM 3 3D Body ($0.015/inference) – SAM 3 Video handles general object segmentation in video at $0.005 per 16 frames, while SAM 3 3D Body specializes in human pose estimation and body mesh reconstruction at $0.015 per inference. SAM 3 3D Body delivers rigged skeletal output for animation workflows where SAM 3 Video provides mask-based tracking for surveillance, sports analytics, or automated video editing.

fal-ai/sam-3/video

Input

Result

What would you like to do next?

Logs

Segment Anything Model 3 (SAM 3) | [video-to-video]

Performance

Technical Specifications

How It Stacks Up

fal-ai/sam-3/video

Input

Result

What would you like to do next?

Logs

Segment Anything Model 3 (SAM 3) | [video-to-video]

Performance

Multi-Modal Prompting Without Pipeline Complexity

Technical Specifications

How It Stacks Up