SAM 3D transforms single 2D images into detailed 3D models of humans, objects, and complete scenes in seconds through three specialized APIs - perfect for AR/VR, gaming, e-commerce product visualization, and immersive storytelling.
Integrating Single-Image 3D Reconstruction
Traditional 3D modeling demands specialized equipment, controlled capture environments, and substantial processing infrastructure. SAM 3D reconstructs detailed 3D assets from single RGB images through three specialized APIs addressing distinct reconstruction challenges: human body geometry, object meshes, and spatial scene alignment.
This guide demonstrates integration patterns for SAM 3D's component architecture. The system applies Gaussian splatting techniques for photorealistic rendering from sparse input data [1]. These methods enable production-grade 3D asset generation suitable for AR/VR applications, game development, e-commerce visualization, and interactive media.
Component Architecture
| Component | Purpose | Input | Output | Processing Time |
|---|---|---|---|---|
| SAM 3D Body | Human reconstruction | RGB image | GLB mesh + keypoints | 5-10 seconds |
| SAM 3D Objects | Object reconstruction | RGB image + prompt/mask | GLB mesh + Gaussian splat | 4-8 seconds |
| SAM 3D Align | Scene composition | Image + body/object meshes | Unified GLB scene | 3-6 seconds |
Installation
Install the appropriate client library:
# Python
pip install fal-client
# JavaScript
npm install --save @fal-ai/client
Authentication requires a fal API key. The client libraries read it from the FAL_KEY environment variable, so export it before running the examples below.
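A missing key usually surfaces only as a failed request, so it can help to check for it at startup. The following is a minimal sketch, assuming the key lives in the `FAL_KEY` environment variable; the helper name `require_fal_key` is hypothetical and not part of the fal client.

```python
import os


def require_fal_key(env=None):
    """Return the fal API key, or raise a clear error if it is missing."""
    # Accept an explicit mapping so the check can be exercised without
    # touching the real process environment.
    env = os.environ if env is None else env
    key = env.get("FAL_KEY", "").strip()
    if not key:
        raise RuntimeError("FAL_KEY is not set; export it before calling the API")
    return key
```

Calling `require_fal_key()` once during application startup turns a confusing authentication failure into an immediate, readable error.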
SAM 3D Body: Human Reconstruction
SAM 3D Body reconstructs human body geometry from single images using parametric body models combined with learned pose estimation. The system infers complete body structure including occluded regions, generates skeletal keypoint data, and exports camera intrinsics.
Implementation
import fal_client

result = fal_client.subscribe(
    "fal-ai/sam-3/3d-body",
    arguments={
        "image_url": "YOUR_IMAGE_URL",
        "export_meshes": True,
        "include_3d_keypoints": True,
    },
)

glb_model_url = result["model_glb"]
keypoints = result["metadata"]["people"][0]["keypoints_3d"]
focal_length = result["metadata"]["people"][0]["focal_length"]
Parameters
- mask_url: Binary segmentation mask (white=person, black=background) for explicit figure selection
- export_meshes: Generate individual mesh files per detected person (default: true)
- include_3d_keypoints: Include skeletal keypoint markers in GLB output (default: true)
Technical Constraints
Accuracy degrades with extreme poses (inverted positions, complex acrobatics) where pose ambiguity increases outside standard viewing angles. Occlusion beyond 40-50% compromises geometric precision. Front-facing or three-quarter views produce optimal results. Multi-person detection operates automatically but benefits from explicit masks in overlapping scenarios.
SAM 3D Objects: Object Reconstruction
SAM 3D Objects employs Gaussian splatting for photorealistic texture capture while maintaining geometric fidelity [1]. Segmentation operates through text descriptions, coordinate-based point prompts, or bounding box specifications.
Implementation
import { fal } from "@fal-ai/client";

const result = await fal.subscribe("fal-ai/sam-3/3d-objects", {
  input: {
    image_url: "YOUR_OBJECT_IMAGE_URL",
    prompt: "wooden dining chair",
    seed: 42,
  },
});

const gaussianSplatUrl = result.data.gaussian_splat.url;
const glbModelUrl = result.data.individual_glbs[0].url;
const metadata = result.data.metadata[0];
Segmentation Methods
Text prompts: Describe target object ("red sports car" vs "car" for disambiguation)
Point prompts: Coordinate arrays [[x, y, label], ...] where label is 1 (foreground) or 0 (background)
Box prompts: Bounding box arrays [[x1, y1, x2, y2], ...] indicating object regions
Custom masks: Pre-segmented masks for absolute control in complex scenes
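The four segmentation methods above can be combined into one request payload. The sketch below builds and validates the `arguments` dictionary before it is sent. The field names `prompt`, `point_prompts`, `box_prompts`, and `mask_url` are the ones used elsewhere in this guide; the helper `build_object_arguments` is hypothetical, and the assumption that boxes are ordered top-left to bottom-right is ours, not something the API is confirmed to enforce.

```python
def build_object_arguments(image_url, prompt=None, point_prompts=None,
                           box_prompts=None, mask_url=None):
    """Build the arguments dict for an object reconstruction request."""
    arguments = {"image_url": image_url}
    if prompt:
        arguments["prompt"] = prompt
    if point_prompts:
        for _x, _y, label in point_prompts:
            # Labels follow the convention above: 1 = foreground, 0 = background.
            if label not in (0, 1):
                raise ValueError("point label must be 1 (foreground) or 0 (background)")
        arguments["point_prompts"] = [list(p) for p in point_prompts]
    if box_prompts:
        for x1, y1, x2, y2 in box_prompts:
            # Assumption: (x1, y1) is the top-left corner, (x2, y2) the bottom-right.
            if x2 <= x1 or y2 <= y1:
                raise ValueError("box must satisfy x1 < x2 and y1 < y2")
        arguments["box_prompts"] = [list(b) for b in box_prompts]
    if mask_url:
        arguments["mask_url"] = mask_url
    return arguments
```

For example, `build_object_arguments(url, prompt="red sports car", point_prompts=[(420, 310, 1)])` yields a payload that can be passed as `arguments=` to the subscribe call. Validating locally catches malformed prompts before they cost an API call.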
Technical Constraints
Performance degrades with transparent or highly reflective materials (glass, polished metal) where depth estimation becomes ambiguous. Multi-object scenes require explicit segmentation guidance through masks or prompts. Output includes traditional meshes (GLB) and Gaussian splat files (PLY) with transformation metadata.
SAM 3D Align: Scene Composition
SAM 3D Align computes relative scales and transformations between human and object reconstructions, preserving perspective consistency from source imagery.
Implementation
import fal_client

# body_result and object_result are the responses returned by the
# SAM 3D Body and SAM 3D Objects calls for the same source image.
result = fal_client.subscribe(
    "fal-ai/sam-3/3d-align",
    arguments={
        "image_url": "YOUR_SCENE_IMAGE_URL",
        "body_mesh_url": body_result["model_glb"],
        "object_mesh_url": object_result["individual_glbs"][0]["url"],
        "focal_length": body_result["metadata"]["people"][0]["focal_length"],
    },
)

scene_glb_url = result["scene_glb"]["url"]
aligned_body_url = result["body_mesh_glb"]["url"]
Requirements
The model requires identical source images for all components to maintain shared camera parameters. Perspective shifts between images cause alignment failures. Optimal performance occurs with 2-3 scene elements; accuracy decreases as element count increases. Passing focal length from body reconstruction prevents scale mismatches.
Complete Pipeline Example
import fal_client

def create_3d_scene(image_url):
    """Generate complete 3D scene from single image."""
    try:
        # Reconstruct human body
        body_result = fal_client.subscribe(
            "fal-ai/sam-3/3d-body",
            arguments={"image_url": image_url}
        )
        # Reconstruct objects
        object_result = fal_client.subscribe(
            "fal-ai/sam-3/3d-objects",
            arguments={
                "image_url": image_url,
                "prompt": "chair"
            }
        )
        # Align into unified scene
        scene_result = fal_client.subscribe(
            "fal-ai/sam-3/3d-align",
            arguments={
                "image_url": image_url,
                "body_mesh_url": body_result["model_glb"],
                "object_mesh_url": object_result["individual_glbs"][0]["url"],
                "focal_length": body_result["metadata"]["people"][0]["focal_length"]
            }
        )
        return {
            "scene_url": scene_result["scene_glb"]["url"],
            "cost": 0.06,  # $0.02 * 3 components
            "processing_time": "12-24 seconds"
        }
    except Exception as e:
        # 400: Invalid image format/quality
        # 422: Segmentation failure (no objects/people detected)
        if hasattr(e, 'status_code'):
            if e.status_code == 422:
                # Retry with explicit mask or different prompt
                pass
        raise
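The except block above distinguishes input errors from other failures but leaves the retry policy open. One way to fill that gap is a small wrapper that retries transient failures with exponential backoff and re-raises 400 and 422 immediately, since resending an unchanged request will not fix an invalid image or a failed segmentation. This is a sketch: `call_with_retry` is a hypothetical helper, and it assumes, as the pipeline code does, that API errors may expose a `status_code` attribute.

```python
import time

NON_RETRYABLE = {400, 422}  # input problems: the same request would fail again


def call_with_retry(request, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call request() and retry failures with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return request()
        except Exception as exc:
            status = getattr(exc, "status_code", None)
            last_attempt = attempt == max_attempts - 1
            if status in NON_RETRYABLE or last_attempt:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
```

Each stage of the pipeline could then be wrapped, for example `call_with_retry(lambda: fal_client.subscribe("fal-ai/sam-3/3d-body", arguments={"image_url": image_url}))`. The `sleep` parameter exists so tests can substitute a stub instead of waiting.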
Response Schema
SAM 3D Body Returns:
{
    "model_glb": str,  # URL to GLB file
    "metadata": {
        "people": [{
            "keypoints_3d": [[x, y, z], ...],  # 3D joint coordinates
            "focal_length": float,  # Camera focal length
            "camera_intrinsics": {"fx": float, "fy": float, "cx": float, "cy": float}
        }]
    }
}
SAM 3D Objects Returns:
{
    "gaussian_splat": {"url": str},  # PLY format
    "individual_glbs": [{"url": str}, ...],  # One per object
    "metadata": [{"scale": float, "rotation": [...], "translation": [...]}]
}
SAM 3D Align Returns:
{
    "scene_glb": {"url": str},  # Combined scene
    "body_mesh_glb": {"url": str}  # Aligned body mesh
}
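Indexing straight into `people[0]` or `individual_glbs[0]` raises an opaque `IndexError` when nothing was detected. The sketch below assembles the align request from the two upstream responses and fails with a descriptive message instead. It relies only on the fields in the schemas above; the helper name `build_align_arguments` is hypothetical.

```python
def build_align_arguments(image_url, body_result, object_result):
    """Assemble the align request from body and object responses."""
    people = body_result.get("metadata", {}).get("people", [])
    glbs = object_result.get("individual_glbs", [])
    if not people:
        raise ValueError("body response contains no detected people")
    if not glbs:
        raise ValueError("object response contains no meshes")
    return {
        "image_url": image_url,
        "body_mesh_url": body_result["model_glb"],
        # Use the first object mesh and the first person's focal length,
        # mirroring the pipeline example.
        "object_mesh_url": glbs[0]["url"],
        "focal_length": people[0]["focal_length"],
    }
```

The returned dictionary can be passed directly as `arguments=` to the `fal-ai/sam-3/3d-align` call.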
Implementation Best Practices
Resolution: Minimum 512px on shortest dimension; use 1024px+ for facial/hand detail. Higher resolution increases processing time proportionally.
Lighting: Use diffused, even lighting. Strong directional shadows embed in 3D textures creating viewing artifacts.
Segmentation: Specific prompts ("wooden dining chair with armrests") improve disambiguation. Generate masks via SAM 2 for complex backgrounds.
Caching: Hash input parameters (image URL, prompts, seed) for cache keys. Store generated assets with content-addressable storage. Typical 40-60% cache hit rates reduce costs.
Error Handling: Implement retry with exponential backoff for transient failures. Handle 400 (invalid input) and 422 (segmentation failure) explicitly.
Async Processing: Use on_queue_update callback for progress. Implement 60+ second timeouts for high-resolution inputs.
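The caching advice above depends on a key that is stable across runs and independent of argument order. A minimal sketch, using only the standard library, serializes the endpoint and arguments as canonical JSON and hashes the result. The function name `cache_key` is hypothetical.

```python
import hashlib
import json


def cache_key(endpoint, arguments):
    """Derive a deterministic cache key from the endpoint and its arguments."""
    # sort_keys makes the serialization independent of dict insertion order.
    payload = json.dumps(
        {"endpoint": endpoint, "arguments": arguments},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Note that this hashes the image URL, not the image bytes. If the content behind a URL can change, hashing the downloaded file instead would be the safer choice.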
Output Format Integration
GLB Meshes: Compatible with Three.js (GLTFLoader), Babylon.js, Unity, and Unreal Engine. Import directly into asset pipelines. File sizes range 2-15MB depending on complexity.
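// Three.js integration
import * as THREE from "three";
import { GLTFLoader } from "three/examples/jsm/loaders/GLTFLoader.js";
const scene = new THREE.Scene();
const loader = new GLTFLoader();
// glbModelUrl is the GLB URL returned by a SAM 3D call
loader.load(glbModelUrl, (gltf) => scene.add(gltf.scene));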
Gaussian Splat Files: PLY format requiring custom rendering implementations or specialized viewers. File sizes range 5-50MB. Provides higher texture fidelity at computational cost.
Metadata Utilization: Body reconstructions include skeletal keypoints and camera intrinsics. Object outputs contain transformation matrices and scale factors. Use this structured data for procedural animation and runtime scene modification.
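As one example of using this metadata, the camera intrinsics allow 3D keypoints to be projected back onto the source image, which is useful for drawing overlays or sanity-checking a reconstruction. The sketch below applies the standard pinhole model. It assumes the keypoints are expressed in the camera coordinate frame with z pointing away from the camera; the guide does not state the coordinate convention, so verify this against real output. The function name `project_point` is hypothetical.

```python
def project_point(point, intrinsics):
    """Project a 3D camera-space point to pixel coordinates (pinhole model)."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must be in front of the camera (z > 0)")
    # u = fx * x / z + cx, v = fy * y / z + cy
    u = intrinsics["fx"] * x / z + intrinsics["cx"]
    v = intrinsics["fy"] * y / z + intrinsics["cy"]
    return u, v
```

With `intrinsics = {"fx": 1000.0, "fy": 1000.0, "cx": 512.0, "cy": 384.0}`, the point `(0.5, -0.25, 2.0)` projects to pixel `(762.0, 259.0)`.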
Technical Limitations
Pose Constraints: Pose ambiguity increases significantly outside standard viewing angles. Handstands, splits, and extreme positions reduce reconstruction quality.
Material Handling: Transparent and reflective surfaces confuse depth estimation algorithms. Production workflows requiring glass, mirrors, or chrome objects need manual mesh cleanup.
Scale Ambiguity: Single-image reconstruction cannot determine absolute scale without reference objects of known dimensions. Relative scaling between elements works effectively, but absolute measurements require manual adjustment.
Processing Variability: Simple reconstructions complete in 5-10 seconds. Complex, high-resolution inputs may require 30+ seconds. Monitor response times and implement appropriate timeout values.
Format Compatibility: GLB enjoys broad support, but legacy 3D engines may exhibit import issues with specific material properties or skeletal animations. Validate complete pipeline before production deployment.
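The manual adjustment for scale ambiguity reduces to a single ratio once one real-world dimension is known. In the sketch below, both helper names are hypothetical, and the vertex list stands in for whatever geometry representation a pipeline actually uses.

```python
def scale_factor(measured_length, known_length):
    """Ratio that maps model units to real-world units."""
    if measured_length <= 0:
        raise ValueError("measured_length must be positive")
    return known_length / measured_length


def rescale_points(points, factor):
    """Apply a uniform scale to a list of [x, y, z] points."""
    return [[coord * factor for coord in point] for point in points]
```

If a door measures 4.0 units in the reconstruction and is known to be 2.0 m tall, `scale_factor(4.0, 2.0)` gives 0.5, and applying that factor to every vertex expresses the scene in meters. Because relative scaling between elements is already consistent, one reference measurement is enough for the whole aligned scene.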
Troubleshooting Common Issues
Distorted Reconstruction: Usually caused by strong directional shadows, extreme lighting, or low resolution in the source image. Use diffused, even lighting and an input of at least 512px on the shortest dimension.
Scene Alignment Failure: Verify identical source image used for all components. Pass focal_length from body reconstruction to align call. Mismatched perspectives cause scaling errors.
Segmentation Failure (422 Error): Object prompt too generic ("object" vs "wooden chair"). Use specific descriptions. Alternatively, provide explicit mask_url or box_prompts: [[x1, y1, x2, y2]].
Processing Timeout: High-resolution input (2048px+) requires 60+ second timeout. Reduce resolution or implement async processing with on_queue_update callback.
Missing Object Detection: Use point_prompts: [[x, y, 1]] to indicate foreground, or generate mask via SAM 2 for complex backgrounds.
Artifacts on Transparent or Reflective Objects: Glass, chrome, and similar materials cause depth estimation failures. These reconstructions require manual mesh cleanup for production use.
Performance Optimization
Batch Processing: Submit multiple requests concurrently. fal's serverless infrastructure handles parallel processing automatically. No explicit batching API required.
Cost Management: Each API call costs $0.02. Typical workflows: Body-only ($0.02), Body+Objects ($0.04), Full scene ($0.06). Cache results aggressively.
Mobile Deployment: Target 10,000-50,000 triangles for real-time mobile performance. Post-process GLB with mesh decimation tools. Apply texture compression (KTX2, Basis Universal).
Resolution Strategy: Use 512px for previews (2-4s processing), 1024px for standard quality (5-10s), 2048px+ for maximum detail (15-30s). Balance quality against processing time and cost.
Progressive Loading: For web applications, load low-poly preview mesh immediately, then swap to high-resolution asset once fully loaded. Improves perceived performance.
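Since no batching API is required, concurrent submission can be handled on the client side. The sketch below fans requests out over a thread pool and returns results in input order. The `reconstruct` callable is an assumption: in practice it would wrap a `fal_client.subscribe` call, but it is injected here so the helper stays independent of the network. The name `reconstruct_many` is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def reconstruct_many(image_urls, reconstruct, max_workers=4):
    """Run reconstruct(url) for each URL concurrently; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of image_urls and re-raises the first
        # exception encountered when its result is consumed.
        return list(pool.map(reconstruct, image_urls))
```

A call might look like `reconstruct_many(urls, lambda u: fal_client.subscribe("fal-ai/sam-3/3d-body", arguments={"image_url": u}))`. Threads suit this workload because each call spends most of its time waiting on the remote service.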
Conclusion
SAM 3D provides production-grade 3D reconstruction through specialized components: Body for human geometry, Objects for item meshes, and Align for scene composition. Complete scenes process in 12-24 seconds at $0.06 per scene with GLB outputs compatible with Three.js, Unity, Unreal Engine, and Babylon.js.
References
[1] Kerbl, Bernhard, et al. "3D Gaussian Splatting for Real-Time Radiance Field Rendering." ACM Transactions on Graphics (SIGGRAPH), 2023. https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/



