SAM 3 Video to Video
Segment Anything Model 3 (SAM 3) | [video-to-video]
Meta's SAM 3 delivers unified video segmentation at $0.005 per 16 frames, trading specialized single-frame accuracy for continuous object tracking across video sequences. This architecture handles text, point, box, and mask prompts simultaneously—eliminating the need to switch between separate image and video segmentation tools for production workflows.
Use Cases: Video Content Moderation | Automated Sports Highlight Generation | Medical Video Analysis
Performance
At $0.31 per 1,000 frames ($0.005 per 16 frames), SAM 3 Video provides cost-effective video segmentation compared to frame-by-frame processing with dedicated image models.
| Metric | Result | Context |
|---|---|---|
| Detection Threshold | 0.1-1.0 (adjustable) | Configurable precision vs. recall tradeoff for domain-specific needs |
| Prompt Flexibility | 4 input types | Text, points, boxes, masks—combinable within single inference |
| Cost per 16 Frames | $0.005 | Approximately 3,200 frames per $1.00 on fal |
| Output Formats | MP4 video + optional bbox ZIP | Segmented video with per-frame bounding box overlays available |
| Related Endpoints | SAM 3 Image, SAM 3 3D Objects, SAM 3 3D Body | Image, object reconstruction, and body estimation variants |
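The pricing figures above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and billing a partial 16-frame block as a full block is an assumption, since the page does not say how partial blocks are charged. The $0.002 per-image figure is the SAM 3 Image price quoted elsewhere on this page.

```python
import math

PRICE_PER_BLOCK_USD = 0.005  # from the pricing above: $0.005 per 16 frames
FRAMES_PER_BLOCK = 16
IMAGE_PRICE_USD = 0.002      # SAM 3 Image price per image, as quoted on this page


def estimate_video_cost(num_frames: int, round_up: bool = True) -> float:
    """Estimate the SAM 3 Video cost in USD for a clip of num_frames frames.

    round_up=True bills a partial 16-frame block as a full block. That
    rounding policy is an assumption, not something the page states.
    """
    if num_frames < 0:
        raise ValueError("num_frames must be non-negative")
    blocks = num_frames / FRAMES_PER_BLOCK
    if round_up:
        blocks = math.ceil(blocks)
    return blocks * PRICE_PER_BLOCK_USD


def frames_per_dollar() -> float:
    """Number of frames that $1.00 covers at the listed rate."""
    return FRAMES_PER_BLOCK / PRICE_PER_BLOCK_USD


if __name__ == "__main__":
    clip_frames = 30 * 24  # a 30-second clip at 24 fps is 720 frames
    print(f"SAM 3 Video:          ${estimate_video_cost(clip_frames):.4f}")
    print(f"SAM 3 Image per frame: ${clip_frames * IMAGE_PRICE_USD:.4f}")
    print(f"Frames per dollar:     {frames_per_dollar():.0f}")
```

For the 720-frame clip in the example, the video endpoint comes to 45 blocks, or $0.225, against $1.44 for segmenting each frame as a separate image at the listed rates.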
Multi-Modal Prompting Without Pipeline Complexity
SAM 3 Video unifies four prompt types (text descriptions, coordinate points, bounding boxes, and mask overlays) in a single inference call. Video segmentation workflows typically rely on a separate tool for each prompt modality, which means maintaining multiple API integrations and reconciling inconsistent output formats.
What this means for you:
- Cross-frame object persistence: Track "person, cloth" across camera cuts and occlusions using comma-separated text prompts. The model maintains object identity without manual re-initialization.
- Adjustable detection sensitivity: The `detection_threshold` parameter (0.1-1.0) lets you tune false positive rates per use case. Set it to 0.3 for high-recall content moderation or 0.8 for precision medical imaging.
- Optional mask application: Toggle `apply_mask: false` to receive raw segmentation data without visual overlays, enabling custom post-processing or analytics pipelines.
- Per-frame bounding box export: Request `boundingbox_frames_zip` output to extract frame-by-frame coordinates for downstream tracking systems or training data generation.
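The parameters in the list above might be assembled into a request along these lines. This is a sketch under stated assumptions, not the endpoint's documented schema: `detection_threshold` and `apply_mask` are named on this page, but the `video_url` and `prompt` field names, the default threshold of 0.5, and the endpoint ID are guesses to verify against the API documentation. How `boundingbox_frames_zip` is requested is not specified here, so it is left out.

```python
from typing import Any, Dict

# Hypothetical endpoint ID; confirm the real one on the model's API page.
ENDPOINT_ID = "fal-ai/sam-3/video"


def build_arguments(
    video_url: str,
    prompt: str,
    detection_threshold: float = 0.5,  # assumed default, not from the page
    apply_mask: bool = True,
) -> Dict[str, Any]:
    """Build a request payload for the video segmentation endpoint.

    detection_threshold and apply_mask are the parameters named on this
    page; video_url and prompt are assumed field names.
    """
    if not 0.1 <= detection_threshold <= 1.0:
        raise ValueError("detection_threshold must be within 0.1-1.0")
    return {
        "video_url": video_url,
        "prompt": prompt,  # comma-separated text prompt, e.g. "person, cloth"
        "detection_threshold": detection_threshold,
        "apply_mask": apply_mask,
    }


def run_segmentation(arguments: Dict[str, Any]) -> Dict[str, Any]:
    """Submit the request with the fal Python client (pip install fal-client).

    Requires the FAL_KEY environment variable to be set.
    """
    import fal_client  # imported lazily so build_arguments works without it

    return fal_client.subscribe(ENDPOINT_ID, arguments=arguments)


if __name__ == "__main__":
    args = build_arguments(
        video_url="https://example.com/clip.mp4",  # placeholder URL
        prompt="person, cloth",
        detection_threshold=0.3,  # high-recall setting suggested above
    )
    print(args)
```

Keeping payload construction separate from submission lets the threshold range be validated locally before any billable request is made.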
Technical Specifications
| Spec | Details |
|---|---|
| Architecture | Segment Anything Model 3 |
| Input Formats | MP4, MOV, WebM, M4V, GIF video files |
| Output Formats | MP4 video (masked), ZIP archive (per-frame bounding boxes) |
| Prompt Types | Text strings, point coordinates, box coordinates (x_min, y_min, x_max, y_max), mask overlays |
| License | Commercial use permitted |
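Two of the specifications above lend themselves to small client-side checks before a request is sent. The sketch below is a minimal illustration with hypothetical helper names: one function tests a file name or URL against the listed input formats, and the other converts a box given as origin plus width and height into the (x_min, y_min, x_max, y_max) order listed for box prompts. Whether coordinates are expected in pixels or normalized units is not stated on this page, so the helper leaves units untouched.

```python
from pathlib import PurePosixPath
from typing import Tuple
from urllib.parse import urlparse

# Input formats listed in the specification table above.
ACCEPTED_EXTENSIONS = {".mp4", ".mov", ".webm", ".m4v", ".gif"}

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def is_accepted_video(path_or_url: str) -> bool:
    """Return True if the file extension is one of the listed input formats.

    Query strings on URLs are ignored; only the path component is checked.
    """
    path = urlparse(path_or_url).path
    return PurePosixPath(path).suffix.lower() in ACCEPTED_EXTENSIONS


def xywh_to_box(x: float, y: float, width: float, height: float) -> Box:
    """Convert an (x, y, width, height) box to (x_min, y_min, x_max, y_max)."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    return (x, y, x + width, y + height)


if __name__ == "__main__":
    print(is_accepted_video("https://example.com/a.webm?token=1"))
    print(xywh_to_box(10, 20, 30, 40))
```

An extension check only catches obvious mismatches; it does not confirm that the container or codec inside the file is one the endpoint can decode.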
How It Stacks Up
SAM 3 Image ($0.002/image) – SAM 3 Video trades per-frame segmentation precision for temporal object tracking. Its billing unit costs 2.5x as much as a single image ($0.005 vs. $0.002) but covers 16 frames, which works out to about $0.0003 per frame, roughly 6.4x less per frame than SAM 3 Image. SAM 3 Image remains ideal for single-frame workflows like product photography masking or medical scan annotation where frame-to-frame consistency isn't required.
SAM 3 3D Objects ($0.015/inference) – SAM 3 Video provides 2D video segmentation and tracking at $0.005 per 16 frames, while SAM 3 3D Objects reconstructs full 3D meshes from images at $0.015 per inference, 3x the price of one 16-frame video unit in exchange for spatial geometry. Use SAM 3 Video for content analysis pipelines and SAM 3 3D Objects when you need exportable GLB/FBX assets for game engines or AR applications.
SAM 3 3D Body ($0.015/inference) – SAM 3 Video handles general object segmentation in video at $0.005 per 16 frames, while SAM 3 3D Body specializes in human pose estimation and body mesh reconstruction at $0.015 per inference. SAM 3 3D Body delivers rigged skeletal output for animation workflows, whereas SAM 3 Video provides mask-based tracking for surveillance, sports analytics, or automated video editing.