SAM 3 Video to Video

fal-ai/sam-3/video
SAM 3 is a unified foundation model for promptable segmentation in images and videos. It can detect, segment, and track objects using text or visual prompts such as points, boxes, and masks.
Inference
Commercial use

Your request will cost $0.005 per 16 frames of video input.

Segment Anything Model 3 (SAM 3) | [video-to-video]

Meta's SAM 3 delivers unified video segmentation at $0.005 per 16 frames, trading specialized single-frame accuracy for continuous object tracking across video sequences. This architecture handles text, point, box, and mask prompts simultaneously—eliminating the need to switch between separate image and video segmentation tools for production workflows.

Use Cases: Video Content Moderation | Automated Sports Highlight Generation | Medical Video Analysis


Performance

At roughly $0.31 per 1,000 frames ($0.005 per 16 frames, or $0.0003125 per frame), SAM 3 Video costs less per frame than running a dedicated image model on every frame. For comparison, SAM 3 Image at $0.002 per image would cost $2.00 for the same 1,000 frames.
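The pricing arithmetic above can be sketched as a small estimator. The price and block size come from this page; whether a partial 16-frame block is billed as a full block is not stated here, so the `round_up_blocks` flag is an assumption, and both function names are illustrative.

```python
import math

PRICE_PER_BLOCK_USD = 0.005  # from this page: $0.005 per 16 frames
FRAMES_PER_BLOCK = 16


def frames_in_clip(duration_s: float, fps: float) -> int:
    """Approximate frame count of a clip from its duration and frame rate."""
    return round(duration_s * fps)


def estimate_cost_usd(frame_count: int, round_up_blocks: bool = True) -> float:
    """Estimate the cost of segmenting `frame_count` frames.

    ASSUMPTION: with round_up_blocks=True a partial 16-frame block is
    billed as a whole block. The page does not confirm this; pass False
    for a purely pro-rata estimate.
    """
    if frame_count < 0:
        raise ValueError("frame_count must be non-negative")
    blocks = frame_count / FRAMES_PER_BLOCK
    if round_up_blocks:
        blocks = math.ceil(blocks)
    return blocks * PRICE_PER_BLOCK_USD


if __name__ == "__main__":
    frames = frames_in_clip(duration_s=10, fps=30)
    print(f"{frames} frames -> ${estimate_cost_usd(frames):.3f}")
```

Under the round-up assumption, a 10-second clip at 30 fps is 300 frames, which is 19 blocks, or about $0.095. The pro-rata figure for 1,000 frames is $0.3125, which matches the "$0.31 per 1,000 frames" quoted above.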

| Metric | Result | Context |
| --- | --- | --- |
| Detection Threshold | 0.1-1.0 (adjustable) | Configurable precision vs. recall tradeoff for domain-specific needs |
| Prompt Flexibility | 4 input types | Text, points, boxes, masks; combinable within a single inference |
| Cost per 16 Frames | $0.005 | Approximately 3,200 frames per $1.00 on fal |
| Output Formats | MP4 video + optional bbox ZIP | Segmented video with per-frame bounding box overlays available |
| Related Endpoints | SAM 3 Image, SAM 3 3D Objects, SAM 3 3D Body | Image, object reconstruction, and body estimation variants |

Multi-Modal Prompting Without Pipeline Complexity

SAM 3 Video unifies four prompt types (text descriptions, coordinate points, bounding boxes, and mask overlays) in a single inference call. Video segmentation workflows often need a separate tool for each prompt modality, which forces you to maintain multiple API integrations and reconcile inconsistent output formats.

What this means for you:

  • Cross-frame object persistence: Track "person, cloth" across camera cuts and occlusions using comma-separated text prompts. The model maintains object identity without manual re-initialization

  • Adjustable detection sensitivity: The `detection_threshold` parameter (0.1-1.0) lets you tune false positive rates per use case—set to 0.3 for high-recall content moderation or 0.8 for precision medical imaging

  • Optional mask application: Toggle `apply_mask: false` to receive raw segmentation data without visual overlays, enabling custom post-processing or analytics pipelines

  • Per-frame bounding box export: Request `boundingbox_frames_zip` output to extract frame-by-frame coordinates for downstream tracking systems or training data generation
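The options listed above can be combined into one request. The following is a minimal sketch, not the documented schema: the endpoint ID, `detection_threshold`, and `apply_mask` appear on this page, while the `video_url` and `prompt` field names and the helper names `build_arguments` and `run` are assumptions made for illustration. Check the endpoint's API schema before relying on them.

```python
from typing import Any, Dict

ENDPOINT = "fal-ai/sam-3/video"  # endpoint ID shown on this page


def build_arguments(
    video_url: str,
    prompt: str,
    *,
    detection_threshold: float,
    apply_mask: bool = True,
) -> Dict[str, Any]:
    """Build a request payload for a text-prompted segmentation call.

    ASSUMPTION: "video_url" and "prompt" are guessed field names.
    "detection_threshold" (0.1-1.0) and "apply_mask" are named on this page.
    """
    if not 0.1 <= detection_threshold <= 1.0:
        raise ValueError("detection_threshold must be within 0.1-1.0")
    # Normalize a comma-separated prompt such as "person, cloth".
    labels = [part.strip() for part in prompt.split(",") if part.strip()]
    if not labels:
        raise ValueError("prompt must name at least one object")
    return {
        "video_url": video_url,        # assumed field name
        "prompt": ", ".join(labels),   # assumed field name
        "detection_threshold": detection_threshold,
        "apply_mask": apply_mask,
    }


def run(arguments: Dict[str, Any]) -> Any:
    """Submit the request with the fal Python client and wait for the result.

    Requires `pip install fal-client` and a FAL_KEY environment variable.
    """
    import fal_client  # imported lazily so the payload helper works offline

    return fal_client.subscribe(ENDPOINT, arguments=arguments)
```

For a high-recall moderation pass without overlays, one might call `build_arguments(url, "person, cloth", detection_threshold=0.3, apply_mask=False)` and hand the result to `run`. Validating the threshold locally avoids paying for a round trip that would be rejected anyway.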


Technical Specifications

| Spec | Details |
| --- | --- |
| Architecture | Segment Anything Model 3 |
| Input Formats | MP4, MOV, WebM, M4V, GIF video files |
| Output Formats | MP4 video (masked), ZIP archive (per-frame bounding boxes) |
| Prompt Types | Text strings, point coordinates, box coordinates (x_min, y_min, x_max, y_max), mask overlays |
| License | Commercial use permitted |
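Inputs can be checked against these specifications before a request is sent. This is a sketch with illustrative helper names: the extension list and the (x_min, y_min, x_max, y_max) ordering come from the table above, while treating box coordinates as pixels measured from the top-left corner of the frame is an assumption the page does not confirm.

```python
from pathlib import PurePosixPath
from typing import Tuple
from urllib.parse import urlparse

# Input formats listed in the specifications above.
SUPPORTED_EXTENSIONS = {".mp4", ".mov", ".webm", ".m4v", ".gif"}

Box = Tuple[float, float, float, float]


def is_supported_video(url_or_path: str) -> bool:
    """Check the file extension, ignoring any query string on a URL."""
    path = urlparse(url_or_path).path
    return PurePosixPath(path).suffix.lower() in SUPPORTED_EXTENSIONS


def validate_box(box: Box, frame_width: float, frame_height: float) -> Box:
    """Validate a box prompt ordered as (x_min, y_min, x_max, y_max).

    ASSUMPTION: coordinates are pixels with the origin at the top-left.
    """
    x_min, y_min, x_max, y_max = box
    if not 0 <= x_min < x_max <= frame_width:
        raise ValueError("x coordinates must satisfy 0 <= x_min < x_max <= width")
    if not 0 <= y_min < y_max <= frame_height:
        raise ValueError("y coordinates must satisfy 0 <= y_min < y_max <= height")
    return (x_min, y_min, x_max, y_max)
```

The extension check is only a cheap pre-filter; it says nothing about the codec inside the container, so the service may still reject a file that passes it.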


How It Stacks Up

SAM 3 Image ($0.002/image) – SAM 3 Video trades per-frame segmentation precision for temporal object tracking. Its billing unit costs 2.5x as much ($0.005 per 16 frames vs. $0.002 per static image), but each unit covers 16 frames, so the per-frame cost is about $0.0003, roughly 6.4x lower than SAM 3 Image. SAM 3 Image remains ideal for single-frame workflows like product photography masking or medical scan annotation where frame-to-frame consistency isn't required.

SAM 3 3D Objects ($0.015/inference) – SAM 3 Video provides 2D video segmentation and tracking at $0.005 per 16 frames, while SAM 3 3D Objects reconstructs full 3D meshes from images at $0.015 per generation, 3x the price of one 16-frame video block in exchange for spatial geometry. Use SAM 3 Video for content analysis pipelines and SAM 3 3D Objects when you need exportable GLB/FBX assets for game engines or AR applications.

SAM 3 3D Body ($0.015/inference) – SAM 3 Video handles general object segmentation in video at $0.005 per 16 frames, while SAM 3 3D Body specializes in human pose estimation and body mesh reconstruction at $0.015 per inference. SAM 3 3D Body delivers rigged skeletal output for animation workflows, whereas SAM 3 Video provides mask-based tracking for surveillance, sports analytics, or automated video editing.