SAM 3D transforms single 2D images into detailed 3D models through three specialized tools: Body for humans, Objects for items, and Align for complete scenes. Perfect for AR/VR experiences, game asset creation, e-commerce product visualization, and digital twin generation.
Building 3D Assets from Single Images
Single-image 3D reconstruction traditionally demanded specialized hardware, controlled environments, and multiple camera angles. SAM 3D eliminates these requirements through neural reconstruction techniques that extract depth and geometry from ordinary photographs.
The suite comprises three models: SAM 3D Body reconstructs human forms, SAM 3D Objects generates object meshes, and SAM 3D Align combines both into unified scenes. Processing times range from 5-10 seconds for single subjects at 512px to 30+ seconds for complex multi-object scenes at 1024px+. Output file sizes typically span 2-15MB for GLB meshes and 5-50MB for Gaussian splat files, depending on geometric complexity.
SAM 3D Body: Human Form Reconstruction
SAM 3D Body reconstructs three-dimensional human body geometry and pose from single RGB images [1]. The model applies parametric body representations combined with learned pose estimation to infer complete body structure even when portions remain occluded.
Core Parameters
| Parameter | Type | Purpose | Default |
|---|---|---|---|
| image_url | string | Source image containing person | Required |
| mask_url | string | Binary segmentation mask (white=person, black=background) | Optional |
| export_meshes | boolean | Generate individual mesh files per detected person | true |
| include_3d_keypoints | boolean | Include skeletal keypoint markers in visualization | true |
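The parameters above map directly onto a request payload. A minimal sketch, assuming a fal-style queue client (the endpoint ID and client call are illustrative placeholders; check the quickstart for the exact values):

```python
def build_body_request(image_url, mask_url=None,
                       export_meshes=True, include_3d_keypoints=True):
    """Assemble the argument dict for a SAM 3D Body call.

    Only image_url is required; mask_url is omitted from the payload
    when not supplied, so the model falls back to automatic segmentation.
    """
    args = {
        "image_url": image_url,
        "export_meshes": export_meshes,
        "include_3d_keypoints": include_3d_keypoints,
    }
    if mask_url is not None:
        args["mask_url"] = mask_url
    return args

# Submitting the request (illustrative endpoint ID):
# import fal_client
# result = fal_client.subscribe("fal-ai/sam-3d/body",
#                               arguments=build_body_request("https://example.com/person.jpg"))
```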
Prompting Strategies
Image Composition. Clear subject isolation produces superior results. When working with complex backgrounds, generate precise masks using Segment Anything Model 2 before reconstruction. Full-body visibility improves accuracy, though the model infers occluded limbs based on visible anatomical constraints.
Pose Considerations. Natural standing or walking poses yield the most reliable reconstructions. Front-facing or three-quarter views align with training data distributions. Extreme poses (handstands, complex acrobatics) introduce ambiguity that degrades geometric accuracy [2].
Resolution Requirements. Minimum 512px on shortest dimension. For facial detail and hand geometry, use 1024px+ resolution. Higher resolution increases processing time proportionally but captures fine features more accurately.
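The resolution thresholds above can be encoded as a small preflight check before submitting a request. A sketch; the tier names ("basic", "detail") are ours, not part of the API:

```python
def resolution_tier(width, height, min_side=512, detail_side=1024):
    """Classify an input image against the documented thresholds.

    Returns "reject" below 512px on the shortest side, "basic" up to
    1024px, and "detail" when facial and hand geometry are feasible.
    """
    shortest = min(width, height)
    if shortest < min_side:
        return "reject"
    return "detail" if shortest >= detail_side else "basic"
```

Rejecting undersized images client-side avoids paying for a round trip that would return a 400 error.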
The model returns GLB files containing textured meshes, visualization images, individual PLY mesh files when multiple people appear, and structured metadata including camera intrinsics and skeletal keypoints.
SAM 3D Objects: Item Reconstruction
SAM 3D Objects transforms objects into three-dimensional representations using Gaussian splatting for photorealistic texture capture [3]. The model handles geometric complexity while preserving surface appearance across viewing angles.
Essential Parameters
- prompt: Text description enabling automatic segmentation
- mask_urls: Explicit boundaries for precise object isolation (array of URLs)
- point_prompts: Coordinate arrays `[[x, y, label], ...]`, where label is 1 (foreground) or 0 (background)
- box_prompts: Bounding box arrays `[[x1, y1, x2, y2], ...]` indicating regions
- seed: Reproducibility control (integer)
- pointmap_url: External depth map for improved geometric accuracy
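The prompt array formats are easy to get wrong by hand. A couple of small helpers, sketched to match the shapes listed above:

```python
def make_point_prompts(foreground, background=()):
    """Build the [[x, y, label], ...] array: label 1 marks foreground
    points, label 0 marks background points."""
    return ([[x, y, 1] for x, y in foreground]
            + [[x, y, 0] for x, y in background])

def make_box_prompt(x1, y1, x2, y2):
    """Build a [x1, y1, x2, y2] box, normalizing corner order so the
    first corner is always top-left regardless of click order."""
    return [min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2)]
```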
Object Reconstruction Techniques
Segmentation Precision. Automatic segmentation performs well with prominent, clearly defined objects against simple backgrounds. Complex scenes benefit from explicit masks. Specific text prompts ("wooden dining chair with armrests") improve multi-object disambiguation versus generic terms ("chair").
Lighting and Materials. Diffused, even lighting produces accurate geometry by minimizing shadow-induced artifacts. Reflective or transparent surfaces (glass, polished metal) confuse depth estimation and require manual mesh refinement in post-processing.
Multi-Object Workflows. For scenes containing multiple items, use individual masks per object. Generate accurate masks via Segment Anything Model 2 as preprocessing input.
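One reasonable decomposition of a multi-object scene is one request per mask, holding the seed fixed so re-runs reproduce identical geometry. A sketch (whether the endpoint also accepts all masks in a single batched call is not specified here):

```python
def build_object_requests(image_url, mask_urls, seed=42):
    """Produce one SAM 3D Objects payload per object mask.

    Each payload reuses the same source image and seed; only the
    mask_urls entry varies, isolating one object per request.
    """
    return [
        {"image_url": image_url, "mask_urls": [mask], "seed": seed}
        for mask in mask_urls
    ]
```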
Output includes Gaussian splat files (.ply format), GLB mesh files with embedded textures, and transformation metadata containing scale factors and rotation matrices for each reconstructed object.
SAM 3D Align: Scene Assembly
SAM 3D Align positions human and object reconstructions within coherent spatial relationships. The model computes relative scales and transformations that preserve perspective consistency from the source image.
Alignment Parameters
Minimum requirements: image_url and body_mesh_url. Optional parameters include body_mask_url for refined human positioning, object_mesh_url for scene element inclusion, and focal_length from SAM 3D Body metadata (estimated if omitted).
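A payload builder for Align might look like the following sketch; the focal length is forwarded from SAM 3D Body metadata when available (the metadata key name is an assumption, and the model estimates focal length when the field is omitted):

```python
def build_align_request(image_url, body_mesh_url, body_metadata=None,
                        object_mesh_url=None, body_mask_url=None):
    """Assemble a SAM 3D Align payload from the two required fields
    plus whichever optional inputs are on hand.

    body_metadata is the dict returned by SAM 3D Body; its
    "focal_length" key is a hypothetical name for illustration.
    """
    args = {"image_url": image_url, "body_mesh_url": body_mesh_url}
    if body_mask_url:
        args["body_mask_url"] = body_mask_url
    if object_mesh_url:
        args["object_mesh_url"] = object_mesh_url
    focal = (body_metadata or {}).get("focal_length")
    if focal is not None:
        args["focal_length"] = focal  # prevents scale mismatches downstream
    return args
```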
Scene Composition Practices
Maintain Source Consistency. Use identical source images for body and object reconstruction. Perspective shifts between images cause alignment failures because the model assumes shared camera parameters.
Leverage Metadata. Pass focal length values from body reconstruction to object alignment. This prevents scale mismatches between scene elements. Without correct focal length, relative sizes may appear distorted.
Manage Complexity Progressively. Begin with simple scenes (one person, one object) before attempting multi-object arrangements. Each additional element introduces potential alignment error.
Advanced Techniques
Depth Map Integration
External depth maps improve geometric accuracy for SAM 3D Objects [4]. Generate a depth map with MiDaS depth estimation, then pass its URL as the pointmap_url parameter. This provides geometric priors that constrain reconstruction, particularly for objects with ambiguous depth cues.
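Wiring the depth result into the Objects payload is a one-key merge. A sketch that assumes a fal-style response shape with the generated map under `result["image"]["url"]` (a hypothetical key; adjust to the actual depth endpoint's schema):

```python
def attach_pointmap(objects_args, depth_result):
    """Merge an external depth map URL into a SAM 3D Objects payload.

    Falls back to the unmodified arguments when the depth response
    lacks an image URL, so reconstruction still proceeds without the
    geometric prior.
    """
    url = depth_result.get("image", {}).get("url")
    if url:
        return {**objects_args, "pointmap_url": url}
    return dict(objects_args)
```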
Complex Scene Workflows
Multi-element scenes require modular assembly: generate individual object reconstructions with SAM 3D Objects, create human models using SAM 3D Body, then combine elements through SAM 3D Align. This pipeline offers granular control while ensuring spatial coherence.
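The modular pipeline above can be sketched as a single orchestration function. Here `run(endpoint, arguments)` is an injected callable (e.g. a thin wrapper over a queue client) so the flow is testable with a stub; the endpoint IDs and the `mesh_url` result key are illustrative assumptions:

```python
def reconstruct_scene(run, image_url, person_mask, object_masks):
    """Body -> Objects -> Align assembly for one scene.

    The Align parameters described in this guide take a single object
    mesh, so only the first object is attached here; multi-object
    scenes would loop or batch at that step.
    """
    body = run("sam-3d/body", {"image_url": image_url,
                               "mask_url": person_mask})
    objects = [
        run("sam-3d/objects", {"image_url": image_url, "mask_urls": [m]})
        for m in object_masks
    ]
    align_args = {"image_url": image_url,
                  "body_mesh_url": body["mesh_url"]}
    if objects:
        align_args["object_mesh_url"] = objects[0]["mesh_url"]
    return run("sam-3d/align", align_args)
```

Injecting the client keeps the same source image flowing through every stage, which is exactly the consistency requirement called out above.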
Performance Characteristics
| Input Configuration | Processing Time | Output Size (GLB) | Use Case |
|---|---|---|---|
| Single person, 512px | 5-8 seconds | 2-5MB | Mobile AR, avatars |
| Single person, 1024px+ | 8-12 seconds | 5-12MB | Desktop applications, detailed models |
| Single object, 512px | 4-7 seconds | 3-8MB | E-commerce visualization |
| Multi-object scene, 1024px+ | 25-35 seconds | 12-25MB | Game assets, architectural visualization |
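The timing figures in the table translate into client timeout budgets. A rough heuristic of ours (not an API limit), scaling the upper-bound estimates by a safety factor:

```python
def suggest_timeout(resolution_px, num_objects=1, safety=3.0):
    """Pick a client timeout in seconds from the rough benchmarks:
    ~8 s single subject at 512px, ~12 s at 1024px+, ~35 s for
    multi-object scenes, all multiplied by a safety margin."""
    base = 35 if num_objects > 1 else (12 if resolution_px > 512 else 8)
    return int(base * safety)
```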
Production Deployment Checklist
Error Handling. API returns status codes for common failures. Insufficient image quality produces 400 errors. Segmentation failures (no valid masks) return 422 errors. Implement retry logic with exponential backoff for transient failures.
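The retry policy described above (fail fast on 400/422, back off exponentially on transient failures) can be sketched transport-agnostically. `call` is any function returning a `(status, body)` pair; `sleep` is injectable for testing:

```python
import time

RETRYABLE = {429, 500, 502, 503, 504}  # transient; 400/422 will not improve on retry

def with_retries(call, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Run call() with exponential backoff on transient HTTP statuses.

    Client errors such as 400 (bad image) and 422 (no valid masks)
    raise immediately, since retrying the same input cannot succeed.
    """
    for attempt in range(max_attempts):
        status, body = call()
        if status < 400:
            return body
        if status not in RETRYABLE or attempt == max_attempts - 1:
            raise RuntimeError(f"request failed with status {status}")
        sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```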
File Size Management. GLB meshes range 2-25MB depending on complexity. Gaussian splat files (.ply) range 5-50MB. For web delivery, consider mesh decimation post-processing or progressive loading strategies. Cache generated assets with content-addressable storage to avoid redundant processing.
Rate Limits. Standard fal accounts support concurrent requests based on subscription tier. Batch processing workflows should implement request queuing to respect limits. Monitor X-RateLimit-Remaining response headers.
Caching Strategies. Hash input parameters (image URL, prompts, seed) to create cache keys. Store generated assets in CDN with appropriate cache-control headers. Typical cache hit rates of 40-60% reduce processing costs in production environments.
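Hashing the input parameters into a cache key is a few lines with the standard library. A sketch: canonical JSON (sorted keys) ensures the same inputs always hash identically regardless of argument order:

```python
import hashlib
import json

def cache_key(image_url, prompt=None, seed=None, **extra):
    """Derive a content-addressable cache key from every parameter
    that affects the generated asset."""
    params = {"image_url": image_url, "prompt": prompt, "seed": seed, **extra}
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Any parameter change, including the seed, yields a different key, so distinct outputs never collide in the store.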
Integration Patterns. GLB format integrates directly with Three.js, Babylon.js, Unity, and Unreal Engine. For web applications, use GLTFLoader in Three.js. For game engines, import GLB files directly into asset pipelines. Gaussian splat files require custom rendering implementations or specialized viewers.
Limitations and Constraints
Occlusion Handling. Heavily occluded subjects produce less accurate geometry. The models infer hidden structure, but reconstruction quality degrades proportionally with occlusion percentage. Critical applications require minimal occlusion or planned post-processing.
Material Constraints. Transparent and highly reflective surfaces confuse depth estimation. Glass, mirrors, and chrome objects require manual mesh cleanup for production use.
Scale Ambiguity. Single-image reconstruction cannot determine absolute scale without reference objects of known dimensions. SAM 3D Align handles relative scaling effectively, but absolute measurements may need manual adjustment.
Format Considerations. GLB format enjoys broad support, but some legacy 3D engines exhibit issues with specific material properties. Validate your complete pipeline before production deployment.
Troubleshooting Common Issues
Incomplete Reconstructions: Provide manual masks generated via SAM 2 to improve segmentation accuracy.
Inaccurate Scaling: Include known focal length values from camera metadata when available, extracting them from EXIF data if present.
Misaligned Elements: Verify consistent source imagery across reconstructions. Perspective shifts break alignment assumptions.
Missing Details: Use specific text prompts for complex objects ("wooden dining chair with armrests" versus "chair").
Texture Artifacts: Avoid images with strong directional lighting or heavy shadows that become embedded in 3D textures.
Processing Timeouts: For high-resolution inputs (2048px+), expect extended processing times. Implement appropriate timeout values (60+ seconds) in production clients.
Implementation
SAM 3D provides accessible 3D reconstruction through Body, Objects, and Align models. Success depends on understanding each model's constraints, providing clear inputs, implementing appropriate error handling, and managing file sizes for your target platform.
For API implementation details, consult the quickstart guide.
References

1. Loper, Matthew, et al. "SMPL: A Skinned Multi-Person Linear Model." ACM Transactions on Graphics (SIGGRAPH), 2015. https://files.is.tue.mpg.de/black/papers/SMPL2015.pdf
2. Kanazawa, Angjoo, et al. "End-to-end Recovery of Human Shape and Pose." Computer Vision and Pattern Recognition (CVPR), 2018. https://arxiv.org/abs/1712.06584
3. Kerbl, Bernhard, et al. "3D Gaussian Splatting for Real-Time Radiance Field Rendering." ACM Transactions on Graphics (SIGGRAPH), 2023. https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/
4. Ranftl, René, et al. "Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer." IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020. https://arxiv.org/abs/1907.01341



