SAM 3D Developer Guide: Building 3D Experiences from 2D Images

SAM 3D transforms single 2D images into detailed 3D models of humans, objects, and complete scenes in seconds through three specialized APIs - perfect for AR/VR, gaming, e-commerce product visualization, and immersive storytelling.

Last updated: 12/9/2025 | Edited by: Zachary Roth | Read time: 5 minutes
Integrating Single-Image 3D Reconstruction

Traditional 3D modeling demands specialized equipment, controlled capture environments, and substantial processing infrastructure. SAM 3D reconstructs detailed 3D assets from single RGB images through three specialized APIs addressing distinct reconstruction challenges: human body geometry, object meshes, and spatial scene alignment.

This guide demonstrates integration patterns for SAM 3D's component architecture. The system applies Gaussian splatting techniques for photorealistic rendering from sparse input data [1]. These methods enable production-grade 3D asset generation suitable for AR/VR applications, game development, e-commerce visualization, and interactive media.

Component Architecture

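| Component      | Purpose               | Input                       | Output                   | Processing Time |
|----------------|-----------------------|-----------------------------|--------------------------|-----------------|
| SAM 3D Body    | Human reconstruction  | RGB image                   | GLB mesh + keypoints     | 5-10 seconds    |
| SAM 3D Objects | Object reconstruction | RGB image + prompt/mask     | GLB mesh + Gaussian splat | 4-8 seconds     |
| SAM 3D Align   | Scene composition     | Image + body/object meshes  | Unified GLB scene        | 3-6 seconds     |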

Installation

Install the appropriate client library:

# Python
pip install fal-client

# JavaScript
npm install --save @fal-ai/client

Authentication requires a fal API key, which the client libraries read from the FAL_KEY environment variable.
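
Failing early on a missing key tends to give a clearer error than a rejected request later on. A minimal sketch of a startup check, assuming the key is exported as FAL_KEY; require_fal_key is a hypothetical helper and not part of fal-client:

```python
import os


def require_fal_key(env=None):
    """Return the API key, or raise a clear error if it is missing.

    `env` defaults to os.environ; passing a dict makes the check easy to test.
    The FAL_KEY variable name is an assumption about the deployment setup.
    """
    env = os.environ if env is None else env
    key = env.get("FAL_KEY", "").strip()
    if not key:
        raise RuntimeError("FAL_KEY is not set; export it before calling the API")
    return key
```

Calling require_fal_key() once at application startup is enough; the returned value does not need to be passed to the client explicitly when the variable is set in the environment.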

SAM 3D Body: Human Reconstruction

SAM 3D Body reconstructs human body geometry from single images using parametric body models combined with learned pose estimation. The system infers complete body structure including occluded regions, generates skeletal keypoint data, and exports camera intrinsics.

Implementation

import fal_client

result = fal_client.subscribe(
    "fal-ai/sam-3/3d-body",
    arguments={
        "image_url": "YOUR_IMAGE_URL",
        "export_meshes": True,
        "include_3d_keypoints": True
    }
)

glb_model_url = result["model_glb"]
keypoints = result["metadata"]["people"][0]["keypoints_3d"]
focal_length = result["metadata"]["people"][0]["focal_length"]

Parameters

  • mask_url: Binary segmentation mask (white=person, black=background) for explicit figure selection
  • export_meshes: Generate individual mesh files per detected person (default: true)
  • include_3d_keypoints: Include skeletal keypoint markers in GLB output (default: true)

Technical Constraints

Accuracy degrades with extreme poses (inverted positions, complex acrobatics) where pose ambiguity increases outside standard viewing angles. Occlusion beyond 40-50% compromises geometric precision. Front-facing or three-quarter views produce optimal results. Multi-person detection operates automatically but benefits from explicit masks in overlapping scenarios.

SAM 3D Objects: Object Reconstruction

SAM 3D Objects employs Gaussian splatting for photorealistic texture capture while maintaining geometric fidelity [1]. Segmentation operates through text descriptions, coordinate-based point prompts, or bounding box specifications.

Implementation

import { fal } from "@fal-ai/client";

const result = await fal.subscribe("fal-ai/sam-3/3d-objects", {
  input: {
    image_url: "YOUR_OBJECT_IMAGE_URL",
    prompt: "wooden dining chair",
    seed: 42,
  },
});

const gaussianSplatUrl = result.data.gaussian_splat.url;
const glbModelUrl = result.data.individual_glbs[0].url;
const metadata = result.data.metadata[0];

Segmentation Methods

Text prompts: Describe target object ("red sports car" vs "car" for disambiguation)

Point prompts: Coordinate arrays [[x, y, label], ...] where label is 1 (foreground) or 0 (background)

Box prompts: Bounding box arrays [[x1, y1, x2, y2], ...] indicating object regions

Custom masks: Pre-segmented masks for absolute control in complex scenes
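
The prompt types above can be combined into a single arguments dictionary. The sketch below is illustrative only: build_object_arguments is a hypothetical helper, the parameter names point_prompts, box_prompts, and mask_url follow the names used elsewhere in this guide, and the check that x1 < x2 and y1 < y2 assumes boxes are given as top-left then bottom-right corners.

```python
def build_object_arguments(image_url, prompt=None, point_prompts=None,
                           box_prompts=None, mask_url=None):
    """Assemble an arguments dict containing only the hints that were provided."""
    # Point prompts are [x, y, label] with label 1 (foreground) or 0 (background).
    for _x, _y, label in point_prompts or []:
        if label not in (0, 1):
            raise ValueError(f"point label must be 0 or 1, got {label!r}")
    # Assumption: boxes are [x1, y1, x2, y2] with (x1, y1) the top-left corner.
    for x1, y1, x2, y2 in box_prompts or []:
        if x2 <= x1 or y2 <= y1:
            raise ValueError("box must satisfy x1 < x2 and y1 < y2")

    arguments = {"image_url": image_url}
    optional = {
        "prompt": prompt,
        "point_prompts": point_prompts,
        "box_prompts": box_prompts,
        "mask_url": mask_url,
    }
    # Drop hints that were not supplied so the request stays minimal.
    arguments.update({k: v for k, v in optional.items() if v is not None})
    return arguments
```

The result can be passed as the arguments value of a fal_client.subscribe call to the objects endpoint.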

Technical Constraints

Performance degrades with transparent or highly reflective materials (glass, polished metal) where depth estimation becomes ambiguous. Multi-object scenes require explicit segmentation guidance through masks or prompts. Output includes traditional meshes (GLB) and Gaussian splat files (PLY) with transformation metadata.

SAM 3D Align: Scene Composition

SAM 3D Align computes relative scales and transformations between human and object reconstructions, preserving perspective consistency from source imagery.

Implementation

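import fal_client

# body_result and object_result are the responses returned by the
# SAM 3D Body and SAM 3D Objects calls for the same source image.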

result = fal_client.subscribe(
    "fal-ai/sam-3/3d-align",
    arguments={
        "image_url": "YOUR_SCENE_IMAGE_URL",
        "body_mesh_url": body_result["model_glb"],
        "object_mesh_url": object_result["individual_glbs"][0]["url"],
        "focal_length": body_result["metadata"]["people"][0]["focal_length"]
    }
)

scene_glb_url = result["scene_glb"]["url"]
aligned_body_url = result["body_mesh_glb"]["url"]

Requirements

The model requires identical source images for all components to maintain shared camera parameters. Perspective shifts between images cause alignment failures. Optimal performance occurs with 2-3 scene elements; accuracy decreases as element count increases. Passing focal length from body reconstruction prevents scale mismatches.

Complete Pipeline Example

import fal_client

def create_3d_scene(image_url):
    """Generate complete 3D scene from single image."""
    try:
        # Reconstruct human body
        body_result = fal_client.subscribe(
            "fal-ai/sam-3/3d-body",
            arguments={"image_url": image_url}
        )

        # Reconstruct objects
        object_result = fal_client.subscribe(
            "fal-ai/sam-3/3d-objects",
            arguments={
                "image_url": image_url,
                "prompt": "chair"
            }
        )

        # Align into unified scene
        scene_result = fal_client.subscribe(
            "fal-ai/sam-3/3d-align",
            arguments={
                "image_url": image_url,
                "body_mesh_url": body_result["model_glb"],
                "object_mesh_url": object_result["individual_glbs"][0]["url"],
                "focal_length": body_result["metadata"]["people"][0]["focal_length"]
            }
        )

        return {
            "scene_url": scene_result["scene_glb"]["url"],
            "cost": 0.06,  # $0.02 * 3 components
            "processing_time": "12-24 seconds"
        }

    except Exception as e:
        # 400: Invalid image format/quality
        # 422: Segmentation failure (no objects/people detected)
        if hasattr(e, 'status_code'):
            if e.status_code == 422:
                # Retry with explicit mask or different prompt
                pass
        raise
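
The 422 branch in the pipeline is left as a placeholder. One way to fill it, sketched here under assumptions, is to try progressively more specific prompts. reconstruct_with_fallback is a hypothetical helper; it takes the API call as a parameter, and it assumes (as the pipeline code does) that errors expose a status_code attribute.

```python
def reconstruct_with_fallback(call, image_url, prompts):
    """Try each prompt in order; move to the next only on a 422 error.

    `call` is any callable that accepts an arguments dict, for example
    lambda args: fal_client.subscribe("fal-ai/sam-3/3d-objects", arguments=args)
    """
    last_error = None
    for prompt in prompts:
        try:
            return call({"image_url": image_url, "prompt": prompt})
        except Exception as error:
            # Anything other than a segmentation failure is re-raised untouched.
            if getattr(error, "status_code", None) != 422:
                raise
            last_error = error
    if last_error is None:
        raise ValueError("prompts must not be empty")
    raise last_error
```

A prompt list such as ["chair", "wooden dining chair"] keeps the cheap generic attempt first and falls back to the specific description only when segmentation fails.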

Response Schema

SAM 3D Body Returns:

{
    "model_glb": str,  # URL to GLB file
    "metadata": {
        "people": [{
            "keypoints_3d": [[x, y, z], ...],  # 3D joint coordinates
            "focal_length": float,  # Camera focal length
            "camera_intrinsics": {"fx": float, "fy": float, "cx": float, "cy": float}
        }]
    }
}

SAM 3D Objects Returns:

{
    "gaussian_splat": {"url": str},  # PLY format
    "individual_glbs": [{"url": str}, ...],  # One per object
    "metadata": [{"scale": float, "rotation": [...], "translation": [...]}]
}

SAM 3D Align Returns:

{
    "scene_glb": {"url": str},  # Combined scene
    "body_mesh_glb": {"url": str}  # Aligned body mesh
}

Implementation Best Practices

Resolution: Minimum 512px on shortest dimension; use 1024px+ for facial/hand detail. Higher resolution increases processing time proportionally.

Lighting: Use diffused, even lighting. Strong directional shadows embed in 3D textures creating viewing artifacts.

Segmentation: Specific prompts ("wooden dining chair with armrests") improve disambiguation. Generate masks via SAM 2 for complex backgrounds.

Caching: Hash input parameters (image URL, prompts, seed) for cache keys. Store generated assets in content-addressable storage. Cache hit rates of 40-60% are typical and reduce costs accordingly.
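
A minimal sketch of the cache-key idea, assuming the key is a SHA-256 digest over the endpoint name and a canonical JSON encoding of the arguments. cache_key is a hypothetical helper; the storage layer is left out.

```python
import hashlib
import json


def cache_key(endpoint, arguments):
    """Derive a deterministic hex key from an endpoint and its arguments."""
    # sort_keys makes the encoding independent of dict insertion order.
    payload = json.dumps(
        {"endpoint": endpoint, "arguments": arguments},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Because the key is derived from the image URL rather than the image bytes, the same image served from two different URLs produces two cache entries. Hashing the file content instead avoids that at the cost of downloading the image first.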

Error Handling: Implement retry with exponential backoff for transient failures. Handle 400 (invalid input) and 422 (segmentation failure) explicitly.
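
The retry policy can be sketched as follows. This is an illustration under assumptions rather than fal-client behavior: call_with_backoff is a hypothetical helper, errors are assumed to expose a status_code attribute, and 400 and 422 are treated as non-retryable because a retry with the same input would fail the same way.

```python
import random
import time


def call_with_backoff(call, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Run `call`, retrying failures with exponentially growing delays.

    Errors with status_code 400 or 422 are raised immediately.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as error:
            status = getattr(error, "status_code", None)
            if status in (400, 422) or attempt == max_attempts - 1:
                raise
            delay = base_delay * (2 ** attempt)  # 1s, 2s, 4s, ...
            # Jitter spreads out retries from clients that failed together.
            sleep(delay + random.uniform(0, delay / 2))
```

Usage would wrap the API call in a zero-argument callable, for example call_with_backoff(lambda: fal_client.subscribe(endpoint, arguments=args)).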

Async Processing: Use on_queue_update callback for progress. Implement 60+ second timeouts for high-resolution inputs.
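
A sketch of a progress handler suitable for the on_queue_update callback. It assumes that in-progress updates carry a logs attribute holding a list of dictionaries with a "message" field; updates without logs are ignored. collect_log_messages is a hypothetical helper, written with duck typing so it can be exercised without the client installed.

```python
def collect_log_messages(update, sink):
    """Append any log messages carried by a queue update to `sink`."""
    # Assumption: updates that include logs expose them as `update.logs`.
    for entry in getattr(update, "logs", None) or []:
        message = entry.get("message") if isinstance(entry, dict) else str(entry)
        if message:
            sink.append(message)


# Usage sketch (requires fal-client and a valid key):
#
#   messages = []
#   result = fal_client.subscribe(
#       "fal-ai/sam-3/3d-body",
#       arguments={"image_url": "YOUR_IMAGE_URL"},
#       with_logs=True,
#       on_queue_update=lambda update: collect_log_messages(update, messages),
#   )
```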

Output Format Integration

GLB Meshes: Compatible with Three.js (GLTFLoader), Babylon.js, Unity, and Unreal Engine. Import directly into asset pipelines. File sizes range 2-15MB depending on complexity.

// Three.js integration
import { GLTFLoader } from "three/examples/jsm/loaders/GLTFLoader";
const loader = new GLTFLoader();
loader.load(glb_url, (gltf) => scene.add(gltf.scene));

Gaussian Splat Files: PLY format requiring custom rendering implementations or specialized viewers. File sizes range 5-50MB. Provides higher texture fidelity at computational cost.

Metadata Utilization: Body reconstructions include skeletal keypoints and camera intrinsics. Object outputs contain transformation matrices and scale factors. Use this structured data for procedural animation and runtime scene modification.
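
As one example of using this metadata, the camera intrinsics can project 3D keypoints back onto the source image, which is useful for drawing overlays or sanity-checking a reconstruction. The sketch below uses a standard pinhole model and assumes the keypoints are expressed in camera coordinates with +z pointing away from the camera; the response schema does not state the coordinate frame, so verify this against real output. project_keypoints is a hypothetical helper.

```python
def project_keypoints(keypoints_3d, intrinsics):
    """Project camera-space [x, y, z] points to pixel (u, v) coordinates.

    Points with z <= 0 cannot be projected and are returned as None.
    """
    fx, fy = intrinsics["fx"], intrinsics["fy"]
    cx, cy = intrinsics["cx"], intrinsics["cy"]
    pixels = []
    for x, y, z in keypoints_3d:
        if z <= 0:
            pixels.append(None)
            continue
        # Pinhole model: u = fx * x / z + cx, v = fy * y / z + cy
        pixels.append((fx * x / z + cx, fy * y / z + cy))
    return pixels
```

With the Body response, the inputs would be person["keypoints_3d"] and person["camera_intrinsics"] for an entry of result["metadata"]["people"].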

Technical Limitations

Pose Constraints: Pose ambiguity increases significantly outside standard viewing angles. Handstands, splits, and extreme positions reduce reconstruction quality.

Material Handling: Transparent and reflective surfaces confuse depth estimation algorithms. Production workflows requiring glass, mirrors, or chrome objects need manual mesh cleanup.

Scale Ambiguity: Single-image reconstruction cannot determine absolute scale without reference objects of known dimensions. Relative scaling between elements works effectively, but absolute measurements require manual adjustment.
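
The manual adjustment amounts to one uniform scale factor: the real-world length of a reference object divided by its length in the reconstructed mesh. A minimal sketch with hypothetical helper names:

```python
def absolute_scale_factor(mesh_length, real_length_m):
    """Return the factor that converts mesh units to meters."""
    if mesh_length <= 0:
        raise ValueError("mesh_length must be positive")
    return real_length_m / mesh_length


def rescale(points, factor):
    """Apply a uniform scale to a list of (x, y, z) points."""
    return [(x * factor, y * factor, z * factor) for x, y, z in points]
```

For example, if a door known to be 2.0 m tall measures 0.8 units in the mesh, the factor is 2.5, and applying it to every element of the scene preserves the relative scaling computed by the alignment step.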

Processing Variability: Simple reconstructions complete in 5-10 seconds. Complex, high-resolution inputs may require 30+ seconds. Monitor response times and implement appropriate timeout values.

Format Compatibility: GLB enjoys broad support, but legacy 3D engines may exhibit import issues with specific material properties or skeletal animations. Validate complete pipeline before production deployment.

Troubleshooting Common Issues

Distorted Reconstruction: Source image has strong directional shadows or extreme lighting. Use diffused, even lighting. Minimum 512px resolution required.

Scene Alignment Failure: Verify identical source image used for all components. Pass focal_length from body reconstruction to align call. Mismatched perspectives cause scaling errors.

Segmentation Failure (422 Error): Object prompt too generic ("object" vs "wooden chair"). Use specific descriptions. Alternatively, provide explicit mask_url or box_prompts: [[x1, y1, x2, y2]].

Processing Timeout: High-resolution input (2048px+) requires 60+ second timeout. Reduce resolution or implement async processing with on_queue_update callback.

Missing Object Detection: Use point_prompts: [[x, y, 1]] to indicate foreground, or generate mask via SAM 2 for complex backgrounds.

Artifacts on Transparent or Reflective Objects: Transparent/reflective materials (glass, chrome) cause depth estimation failures. These require manual mesh cleanup for production use.

Performance Optimization

Batch Processing: Submit multiple requests concurrently. fal's serverless infrastructure handles parallel processing automatically. No explicit batching API required.
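
Concurrent submission can be sketched with a thread pool, since each request spends most of its time waiting on the network. run_concurrently is a hypothetical helper that takes the API call as a parameter; the worker count of 4 is an arbitrary starting point, not a documented limit.

```python
from concurrent.futures import ThreadPoolExecutor


def run_concurrently(call, argument_sets, max_workers=4):
    """Run `call` once per arguments dict in parallel, preserving input order.

    The first exception raised by any call propagates to the caller.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(call, argument_sets))
```

A typical call would pass lambda args: fal_client.subscribe("fal-ai/sam-3/3d-body", arguments=args) together with one arguments dict per image.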

Cost Management: Each API call costs $0.02. Typical workflows: Body-only ($0.02), Body+Objects ($0.04), Full scene ($0.06). Cache results aggressively.
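
The arithmetic behind these figures, as a small estimator. It uses the $0.02 per-call price quoted above and assumes that a cache hit avoids the API call entirely and that the hit rate applies uniformly across calls; estimate_cost is a hypothetical helper.

```python
COST_PER_CALL_USD = 0.02  # per-call price quoted in this guide


def estimate_cost(scenes, calls_per_scene=3, cache_hit_rate=0.0):
    """Estimate spend in USD for a number of scenes.

    calls_per_scene: 1 for Body only, 2 for Body+Objects, 3 for a full scene.
    cache_hit_rate: fraction of calls served from cache (0.0 to 1.0).
    """
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    billable_calls = scenes * calls_per_scene * (1.0 - cache_hit_rate)
    return round(billable_calls * COST_PER_CALL_USD, 2)
```

Under these assumptions, 1,000 full scenes cost $60 without caching and $30 at a 50% hit rate.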

Mobile Deployment: Target 10,000-50,000 triangles for real-time mobile performance. Post-process GLB with mesh decimation tools. Apply texture compression (KTX2, Basis Universal).
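
Before decimating, it helps to know how far a mesh is from the triangle budget. The sketch below reads the JSON chunk of a GLB file (per the glTF 2.0 binary layout: a 12-byte header followed by a JSON chunk) and sums triangle counts from the accessor metadata, without loading geometry. count_triangles is a hypothetical helper; it counts each mesh definition once, ignores instancing, and skips primitives that are not plain triangle lists.

```python
import json
import struct


def count_triangles(glb_bytes):
    """Count triangles in a GLB using accessor counts from its JSON chunk."""
    magic, _version, _length = struct.unpack_from("<4sII", glb_bytes, 0)
    if magic != b"glTF":
        raise ValueError("not a GLB file")
    chunk_length, chunk_type = struct.unpack_from("<I4s", glb_bytes, 12)
    if chunk_type != b"JSON":
        raise ValueError("first chunk is not JSON")
    gltf = json.loads(glb_bytes[20:20 + chunk_length].decode("utf-8"))

    accessors = gltf.get("accessors", [])
    total = 0
    for mesh in gltf.get("meshes", []):
        for primitive in mesh.get("primitives", []):
            if primitive.get("mode", 4) != 4:  # 4 = TRIANGLES (the default)
                continue
            if "indices" in primitive:
                count = accessors[primitive["indices"]]["count"]
            else:
                # Non-indexed geometry: every three vertices form a triangle.
                count = accessors[primitive["attributes"]["POSITION"]]["count"]
            total += count // 3
    return total
```

Dividing the mobile budget by the returned count gives a rough target ratio to hand to a decimation tool.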

Resolution Strategy: Use 512px for previews (2-4s processing), 1024px for standard quality (5-10s), 2048px+ for maximum detail (15-30s). Balance quality against processing time and cost.
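
A small helper for computing resize dimensions per tier. It assumes the pixel figures above refer to the shortest side of the image, matching the minimum-resolution guidance earlier in this guide; the tier names and target_size are hypothetical.

```python
# Assumption: tier sizes are measured on the shortest image side.
TIERS = {"preview": 512, "standard": 1024, "detail": 2048}


def target_size(width, height, tier):
    """Return (width, height) scaled so the shorter side matches the tier."""
    shortest_side = TIERS[tier]
    scale = shortest_side / min(width, height)
    return round(width * scale), round(height * scale)
```

The resulting dimensions can be passed to any image library's resize call before uploading, so that previews do not pay the processing time of a full-resolution input.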

Progressive Loading: For web applications, load low-poly preview mesh immediately, then swap to high-resolution asset once fully loaded. Improves perceived performance.

Conclusion

SAM 3D provides production-grade 3D reconstruction through specialized components: Body for human geometry, Objects for item meshes, and Align for scene composition. Complete scenes process in 12-24 seconds at $0.06 per scene with GLB outputs compatible with Three.js, Unity, Unreal Engine, and Babylon.js.

References

  1. Kerbl, Bernhard, et al. "3D Gaussian Splatting for Real-Time Radiance Field Rendering." ACM Transactions on Graphics (SIGGRAPH), 2023. https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/

about the author
Zachary Roth
A generative media engineer with a focus on growth, Zach has deep expertise in building RAG architecture for complex content systems.
