SAM 3D transforms single 2D images into detailed 3D models through three specialized tools: Body for humans, Objects for items, and Align for complete scenes. Perfect for AR/VR experiences, game asset creation, e-commerce product visualization, and digital twin generation.
Building 3D Assets from Single Images
Single-image 3D reconstruction traditionally demanded specialized hardware, controlled environments, and multiple camera angles. SAM 3D eliminates these requirements through neural reconstruction techniques that extract depth and geometry from ordinary photographs.
The suite comprises three models: SAM 3D Body reconstructs human forms, SAM 3D Objects generates object meshes, and SAM 3D Align combines both into unified scenes. Processing times range from 5-10 seconds for single subjects at 512px to 30+ seconds for complex multi-object scenes at 1024px+. Output file sizes typically span 2-15MB for GLB meshes and 5-50MB for Gaussian splat files, depending on geometric complexity.
SAM 3D Body: Human Form Reconstruction
SAM 3D Body reconstructs three-dimensional human body geometry and pose from single RGB images [1]. The model applies parametric body representations combined with learned pose estimation to infer complete body structure even when portions remain occluded.
Core Parameters
| Parameter | Type | Purpose | Default |
|---|---|---|---|
| image_url | string | Source image containing person | Required |
| mask_url | string | Binary segmentation mask (white=person, black=background) | Optional |
| export_meshes | boolean | Generate individual mesh files per detected person | true |
| include_3d_keypoints | boolean | Include skeletal keypoint markers in visualization | true |
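The parameters above map directly onto a request payload. A minimal sketch, assuming a fal-style queue client (the endpoint ID and client call are illustrative placeholders; check the quickstart for the exact values):

```python
def build_body_request(image_url, mask_url=None,
                       export_meshes=True, include_3d_keypoints=True):
    """Assemble the argument dict for a SAM 3D Body call.

    Only image_url is required; mask_url is omitted from the payload
    when not supplied, so the model falls back to automatic segmentation.
    """
    args = {
        "image_url": image_url,
        "export_meshes": export_meshes,
        "include_3d_keypoints": include_3d_keypoints,
    }
    if mask_url is not None:
        args["mask_url"] = mask_url
    return args

# Submitting the request (illustrative endpoint ID):
# import fal_client
# result = fal_client.subscribe("fal-ai/sam-3d/body",
#                               arguments=build_body_request("https://example.com/person.jpg"))
```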
Prompting Strategies
Image Composition. Clear subject isolation produces superior results. When working with complex backgrounds, generate precise masks using Segment Anything Model 2 before reconstruction. Full-body visibility improves accuracy, though the model infers occluded limbs based on visible anatomical constraints.
Pose Considerations. Natural standing or walking poses yield the most reliable reconstructions. Front-facing or three-quarter views align with training data distributions. Extreme poses (handstands, complex acrobatics) introduce ambiguity that degrades geometric accuracy [2].
Resolution Requirements. Minimum 512px on shortest dimension. For facial detail and hand geometry, use 1024px+ resolution. Higher resolution increases processing time proportionally but captures fine features more accurately.
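The resolution thresholds above can be encoded as a small preflight check before submitting a request. A sketch; the tier names ("basic", "detail") are ours, not part of the API:

```python
def resolution_tier(width, height, min_side=512, detail_side=1024):
    """Classify an input image against the documented thresholds.

    Returns "reject" below 512px on the shortest side, "basic" up to
    1024px, and "detail" when facial and hand geometry are feasible.
    """
    shortest = min(width, height)
    if shortest < min_side:
        return "reject"
    return "detail" if shortest >= detail_side else "basic"
```

Rejecting undersized images client-side avoids paying for a round trip that would return a 400 error.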
The model returns GLB files containing textured meshes, visualization images, individual PLY mesh files when multiple people appear, and structured metadata including camera intrinsics and skeletal keypoints.
SAM 3D Objects: Item Reconstruction
SAM 3D Objects transforms objects into three-dimensional representations using Gaussian splatting for photorealistic texture capture [3]. The model handles geometric complexity while preserving surface appearance across viewing angles.
Essential Parameters
- prompt: Text description enabling automatic segmentation
- mask_urls: Explicit boundaries for precise object isolation (array of URLs)
- point_prompts: Coordinate arrays `[[x, y, label], ...]`, where label is 1 (foreground) or 0 (background)
- box_prompts: Bounding box arrays `[[x1, y1, x2, y2], ...]` indicating regions
- seed: Reproducibility control (integer)
- pointmap_url: External depth map for improved geometric accuracy
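The prompt array formats are easy to get wrong by hand. A couple of small helpers, sketched to match the shapes listed above:

```python
def make_point_prompts(foreground, background=()):
    """Build the [[x, y, label], ...] array: label 1 marks foreground
    points, label 0 marks background points."""
    return ([[x, y, 1] for x, y in foreground]
            + [[x, y, 0] for x, y in background])

def make_box_prompt(x1, y1, x2, y2):
    """Build a [x1, y1, x2, y2] box, normalizing corner order so the
    first corner is always top-left regardless of click order."""
    return [min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2)]
```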
Object Reconstruction Techniques
Segmentation Precision. Automatic segmentation performs well with prominent, clearly defined objects against simple backgrounds. Complex scenes benefit from explicit masks. Specific text prompts ("wooden dining chair with armrests") improve multi-object disambiguation versus generic terms ("chair").
Lighting and Materials. Diffused, even lighting produces accurate geometry by minimizing shadow-induced artifacts. Reflective or transparent surfaces (glass, polished metal) confuse depth estimation and require manual mesh refinement in post-processing.
Multi-Object Workflows. For scenes containing multiple items, use individual masks per object. Generate accurate masks via Segment Anything Model 2 as preprocessing input.
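One reasonable decomposition of a multi-object scene is one request per mask, holding the seed fixed so re-runs reproduce identical geometry. A sketch (whether the endpoint also accepts all masks in a single batched call is not specified here):

```python
def build_object_requests(image_url, mask_urls, seed=42):
    """Produce one SAM 3D Objects payload per object mask.

    Each payload reuses the same source image and seed; only the
    mask_urls entry varies, isolating one object per request.
    """
    return [
        {"image_url": image_url, "mask_urls": [mask], "seed": seed}
        for mask in mask_urls
    ]
```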
Output includes Gaussian splat files (.ply format), GLB mesh files with embedded textures, and transformation metadata containing scale factors and rotation matrices for each reconstructed object.
SAM 3D Align: Scene Assembly
SAM 3D Align positions human and object reconstructions within coherent spatial relationships. The model computes relative scales and transformations that preserve perspective consistency from the source image.
Alignment Parameters
Minimum requirements: image_url and body_mesh_url. Optional parameters include body_mask_url for refined human positioning, object_mesh_url for scene element inclusion, and focal_length from SAM 3D Body metadata (estimated if omitted).
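A payload builder for Align might look like the following sketch; the focal length is forwarded from SAM 3D Body metadata when available (the metadata key name is an assumption, and the model estimates focal length when the field is omitted):

```python
def build_align_request(image_url, body_mesh_url, body_metadata=None,
                        object_mesh_url=None, body_mask_url=None):
    """Assemble a SAM 3D Align payload from the two required fields
    plus whichever optional inputs are on hand.

    body_metadata is the dict returned by SAM 3D Body; its
    "focal_length" key is a hypothetical name for illustration.
    """
    args = {"image_url": image_url, "body_mesh_url": body_mesh_url}
    if body_mask_url:
        args["body_mask_url"] = body_mask_url
    if object_mesh_url:
        args["object_mesh_url"] = object_mesh_url
    focal = (body_metadata or {}).get("focal_length")
    if focal is not None:
        args["focal_length"] = focal  # prevents scale mismatches downstream
    return args
```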
Scene Composition Practices
Maintain Source Consistency. Use identical source images for body and object reconstruction. Perspective shifts between images cause alignment failures because the model assumes shared camera parameters.
Leverage Metadata. Pass focal length values from body reconstruction to object alignment. This prevents scale mismatches between scene elements. Without correct focal length, relative sizes may appear distorted.
Manage Complexity Progressively. Begin with simple scenes (one person, one object) before attempting multi-object arrangements. Each additional element introduces potential alignment error.
Advanced Techniques
Depth Map Integration
External depth maps improve geometric accuracy for SAM 3D Objects [4]. Generate a depth map with MiDaS depth estimation, then pass its URL as the pointmap_url parameter. This provides geometric priors that constrain reconstruction, particularly for objects with ambiguous depth cues.
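Wiring the depth result into the Objects payload is a one-key merge. A sketch that assumes a fal-style response shape with the generated map under `result["image"]["url"]` (a hypothetical key; adjust to the actual depth endpoint's schema):

```python
def attach_pointmap(objects_args, depth_result):
    """Merge an external depth map URL into a SAM 3D Objects payload.

    Falls back to the unmodified arguments when the depth response
    lacks an image URL, so reconstruction still proceeds without the
    geometric prior.
    """
    url = depth_result.get("image", {}).get("url")
    if url:
        return {**objects_args, "pointmap_url": url}
    return dict(objects_args)
```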
Complex Scene Workflows
Multi-element scenes require modular assembly: generate individual object reconstructions with SAM 3D Objects, create human models using SAM 3D Body, then combine elements through SAM 3D Align. This pipeline offers granular control while ensuring spatial coherence.
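The modular pipeline above can be sketched as a single orchestration function. Here `run(endpoint, arguments)` is an injected callable (e.g. a thin wrapper over a queue client) so the flow is testable with a stub; the endpoint IDs and the `mesh_url` result key are illustrative assumptions:

```python
def reconstruct_scene(run, image_url, person_mask, object_masks):
    """Body -> Objects -> Align assembly for one scene.

    The Align parameters described in this guide take a single object
    mesh, so only the first object is attached here; multi-object
    scenes would loop or batch at that step.
    """
    body = run("sam-3d/body", {"image_url": image_url,
                               "mask_url": person_mask})
    objects = [
        run("sam-3d/objects", {"image_url": image_url, "mask_urls": [m]})
        for m in object_masks
    ]
    align_args = {"image_url": image_url,
                  "body_mesh_url": body["mesh_url"]}
    if objects:
        align_args["object_mesh_url"] = objects[0]["mesh_url"]
    return run("sam-3d/align", align_args)
```

Injecting the client keeps the same source image flowing through every stage, which is exactly the consistency requirement called out above.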
Performance Characteristics
| Input Configuration | Processing Time | Output Size (GLB) | Use Case |
|---|---|---|---|
| Single person, 512px | 5-8 seconds | 2-5MB | Mobile AR, avatars |
| Single person, 1024px+ | 8-12 seconds | 5-12MB | Desktop applications, detailed models |
| Single object, 512px | 4-7 seconds | 3-8MB | E-commerce visualization |
| Multi-object scene, 1024px+ | 25-35 seconds | 12-25MB | Game assets, architectural visualization |
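The timing figures in the table translate into client timeout budgets. A rough heuristic of ours (not an API limit), scaling the upper-bound estimates by a safety factor:

```python
def suggest_timeout(resolution_px, num_objects=1, safety=3.0):
    """Pick a client timeout in seconds from the rough benchmarks:
    ~8 s single subject at 512px, ~12 s at 1024px+, ~35 s for
    multi-object scenes, all multiplied by a safety margin."""
    base = 35 if num_objects > 1 else (12 if resolution_px > 512 else 8)
    return int(base * safety)
```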
Production Deployment Checklist
Error Handling. API returns status codes for common failures. Insufficient image quality produces 400 errors. Segmentation failures (no valid masks) return 422 errors. Implement retry logic with exponential backoff for transient failures.
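The retry policy described above (fail fast on 400/422, back off exponentially on transient failures) can be sketched transport-agnostically. `call` is any function returning a `(status, body)` pair; `sleep` is injectable for testing:

```python
import time

RETRYABLE = {429, 500, 502, 503, 504}  # transient; 400/422 will not improve on retry

def with_retries(call, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Run call() with exponential backoff on transient HTTP statuses.

    Client errors such as 400 (bad image) and 422 (no valid masks)
    raise immediately, since retrying the same input cannot succeed.
    """
    for attempt in range(max_attempts):
        status, body = call()
        if status < 400:
            return body
        if status not in RETRYABLE or attempt == max_attempts - 1:
            raise RuntimeError(f"request failed with status {status}")
        sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```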
File Size Management. GLB meshes range 2-25MB depending on complexity. Gaussian splat files (.ply) range 5-50MB. For web delivery, consider mesh decimation post-processing or progressive loading strategies. Cache generated assets with content-addressable storage to avoid redundant processing.
Rate Limits. Standard fal accounts support concurrent requests based on subscription tier. Batch processing workflows should implement request queuing to respect limits. Monitor X-RateLimit-Remaining response headers.
Caching Strategies. Hash input parameters (image URL, prompts, seed) to create cache keys. Store generated assets in CDN with appropriate cache-control headers. Typical cache hit rates of 40-60% reduce processing costs in production environments.
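Hashing the input parameters into a cache key is a few lines with the standard library. A sketch: canonical JSON (sorted keys) ensures the same inputs always hash identically regardless of argument order:

```python
import hashlib
import json

def cache_key(image_url, prompt=None, seed=None, **extra):
    """Derive a content-addressable cache key from every parameter
    that affects the generated asset."""
    params = {"image_url": image_url, "prompt": prompt, "seed": seed, **extra}
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Any parameter change, including the seed, yields a different key, so distinct outputs never collide in the store.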
Integration Patterns. GLB format integrates directly with Three.js, Babylon.js, Unity, and Unreal Engine. For web applications, use GLTFLoader in Three.js. For game engines, import GLB files directly into asset pipelines. Gaussian splat files require custom rendering implementations or specialized viewers.
Limitations and Constraints
Occlusion Handling. Heavily occluded subjects produce less accurate geometry. The models infer hidden structure, but reconstruction quality degrades proportionally with occlusion percentage. Critical applications require minimal occlusion or planned post-processing.
Material Constraints. Transparent and highly reflective surfaces confuse depth estimation. Glass, mirrors, and chrome objects require manual mesh cleanup for production use.
Scale Ambiguity. Single-image reconstruction cannot determine absolute scale without reference objects of known dimensions. SAM 3D Align handles relative scaling effectively, but absolute measurements may need manual adjustment.
Format Considerations. GLB format enjoys broad support, but some legacy 3D engines exhibit issues with specific material properties. Validate your complete pipeline before production deployment.
Troubleshooting Common Issues
Incomplete Reconstructions: Provide manual masks generated via SAM 2 to improve segmentation accuracy.
Inaccurate Scaling: Include known focal length values from camera metadata when available, extracting them from EXIF data if present.
Misaligned Elements: Verify consistent source imagery across reconstructions. Perspective shifts break alignment assumptions.
Missing Details: Use specific text prompts for complex objects ("wooden dining chair with armrests" versus "chair").
Texture Artifacts: Avoid images with strong directional lighting or heavy shadows that become embedded in 3D textures.
Processing Timeouts: For high-resolution inputs (2048px+), expect extended processing times. Implement appropriate timeout values (60+ seconds) in production clients.
Implementation
SAM 3D provides accessible 3D reconstruction through Body, Objects, and Align models. Success depends on understanding each model's constraints, providing clear inputs, implementing appropriate error handling, and managing file sizes for your target platform.
For API implementation details, consult the quickstart guide.
References

1. Loper, Matthew, et al. "SMPL: A Skinned Multi-Person Linear Model." ACM Transactions on Graphics (SIGGRAPH), 2015. https://files.is.tue.mpg.de/black/papers/SMPL2015.pdf
2. Kanazawa, Angjoo, et al. "End-to-end Recovery of Human Shape and Pose." Computer Vision and Pattern Recognition (CVPR), 2018. https://arxiv.org/abs/1712.06584
3. Kerbl, Bernhard, et al. "3D Gaussian Splatting for Real-Time Radiance Field Rendering." ACM Transactions on Graphics (SIGGRAPH), 2023. https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/
4. Ranftl, René, et al. "Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer." IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020. https://arxiv.org/abs/1907.01341



