When you use torch.compile() with PyTorch models, the first run compiles optimized CUDA kernels, which can take significant time. By sharing these compiled kernels across workers, you can dramatically reduce startup latency for subsequent workers.
The Problem
PyTorch’s Inductor compiler (torch.compile) generates optimized GPU kernels on first run. Without cache sharing:
- Every worker recompiles the same kernels, wasting GPU time
- Startup latency multiplies across workers (N workers × compilation time)
- GPU resources are used inefficiently during deployment
With cache sharing:
- One worker compiles during initial warmup
- Other workers load pre-compiled kernels (significantly faster)
- Consistent performance across all workers
Quick Start
The simplest way to use Inductor caching is with the `synchronized_inductor_cache` context manager:
The first worker compiles the kernels and syncs them to `/data/inductor-caches/<GPU_TYPE>/<cache_key>.zip` (on the shared `/data` filesystem accessible to all workers), while subsequent workers load the pre-compiled kernels.
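The calling pattern looks roughly like the following sketch. Since the real import path for `synchronized_inductor_cache` and a compiled model are not shown above, both are stubbed here so the pattern is self-contained:

```python
# Hypothetical sketch of the Quick Start pattern. The context manager and the
# model below are stand-ins; in real code you would import
# synchronized_inductor_cache from the platform library and warm up a
# torch.compile()-ed model inside the context.
from contextlib import contextmanager

@contextmanager
def synchronized_inductor_cache(cache_key: str):
    # Stand-in: the real context manager loads the shared cache on entry
    # and syncs newly compiled kernels back to /data on exit.
    print(f"loading cache for {cache_key!r}")
    yield
    print(f"syncing cache for {cache_key!r}")

def warmup(model):
    # Run representative inputs so compilation happens inside the context.
    return [model(x) for x in (1.0, 2.0, 3.0)]

model = lambda x: x * 2  # placeholder for a torch.compile()-ed model

with synchronized_inductor_cache("my-model/v1"):
    outputs = warmup(model)

print(outputs)  # [2.0, 4.0, 6.0]
```

The key point is that all warmup calls that trigger compilation happen inside the `with` block, so newly compiled kernels are present when the context syncs on exit.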
API Reference
synchronized_inductor_cache(cache_key: str)
A context manager that handles both loading and syncing of Inductor caches automatically.
Parameters:
`cache_key` (str): A unique identifier for this cache. Use a descriptive name with versioning (e.g., `"my-model/v1"`).
- Loads existing cache from `/data/inductor-caches/` if available
- After the context exits, syncs any newly compiled kernels back to shared storage
- Handles GPU-specific cache organization automatically
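A stdlib-only sketch of the mechanics these bullets describe. The paths, helper names, and internal logic here are assumptions for illustration, not the library's actual code (temporary directories stand in for `/tmp` and `/data`):

```python
# Illustrative reimplementation of the load -> warmup -> sync cycle.
import hashlib
import os
import pathlib
import shutil
import tempfile
import zipfile
from contextlib import contextmanager

LOCAL = pathlib.Path(tempfile.mkdtemp()) / "inductor-cache"   # ~ /tmp/inductor-cache/
SHARED = pathlib.Path(tempfile.mkdtemp())                     # ~ /data/inductor-caches/<GPU_TYPE>/

def dir_hash(root: pathlib.Path) -> str:
    # Hash file names and contents so new kernels change the hash.
    h = hashlib.sha256()
    for p in sorted(root.rglob("*")):
        if p.is_file():
            h.update(p.relative_to(root).as_posix().encode())
            h.update(p.read_bytes())
    return h.hexdigest()

@contextmanager
def synchronized_inductor_cache(cache_key: str):
    zip_path = SHARED / f"{cache_key}.zip"
    LOCAL.mkdir(parents=True, exist_ok=True)
    if zip_path.exists():                       # load: extract shared cache locally
        with zipfile.ZipFile(zip_path) as zf:
            zf.extractall(LOCAL)
    before = dir_hash(LOCAL)
    # Point Inductor at the local cache (assumed variable name):
    os.environ["TORCHINDUCTOR_CACHE_DIR"] = str(LOCAL)
    yield
    if dir_hash(LOCAL) != before:               # sync only when kernels changed
        shutil.make_archive(str(zip_path.with_suffix("")), "zip", LOCAL)

with synchronized_inductor_cache("my-model-v1"):
    # torch.compile() would write kernel artifacts into LOCAL during warmup;
    # we simulate that with a plain file.
    (LOCAL / "kernel.py").write_text("# compiled kernel")

print((SHARED / "my-model-v1.zip").exists())  # True
```

On a second run against the same shared directory, the `.zip` would be extracted on entry and, with no new kernels written, the exit sync would be a no-op.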
load_inductor_cache(cache_key: str) -> str
Explicitly loads an Inductor cache from shared storage.
Parameters:
`cache_key` (str): The cache identifier to load.
Returns:
`str`: A directory hash representing the cache state. Pass this to `sync_inductor_cache()` later.
sync_inductor_cache(cache_key: str, dir_hash: str) -> None
Syncs the local Inductor cache back to shared storage.
Parameters:
`cache_key` (str): The cache identifier to sync.
`dir_hash` (str): The directory hash returned by `load_inductor_cache()`.
- Compares local cache with the hash to detect new compiled kernels
- If changes are detected, re-packs and uploads the entire cache to `/data/inductor-caches/`
- If no changes are detected, skips the upload (no-op)
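The change-detection step can be sketched as hashing the local cache directory and comparing against the hash captured at load time. The `dir_hash` helper below is illustrative, not the library's actual implementation:

```python
# Sketch: hash the cache directory; an unchanged hash means skip the upload.
import hashlib
import pathlib
import tempfile

def dir_hash(root: pathlib.Path) -> str:
    h = hashlib.sha256()
    for p in sorted(root.rglob("*")):
        if p.is_file():
            h.update(p.relative_to(root).as_posix().encode())
            h.update(p.read_bytes())
    return h.hexdigest()

cache = pathlib.Path(tempfile.mkdtemp())
(cache / "a.bin").write_bytes(b"kernel A")
h1 = dir_hash(cache)  # hash returned at load time

print(dir_hash(cache) == h1)  # True  -> no new kernels, skip upload (no-op)
(cache / "b.bin").write_bytes(b"kernel B")
print(dir_hash(cache) == h1)  # False -> new kernels, re-pack and upload
```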
Complete Working Example
Here’s a complete example using Stable Diffusion Turbo that demonstrates the actual speedup. What to expect:
- The first worker takes longer during warmup (compilation is happening)
- Subsequent workers warmup significantly faster (loading cached kernels)
- All workers produce identical outputs - only startup time changes
Manual Approach (Advanced)
For more control over cache loading and syncing, you can use the explicit API. This approach is useful for:
- Multi-stage warmup processes
- Distributed training with controlled sync timing
- Workflows that need explicit control over cache load/sync behavior
How It Works
Storage Locations & Connection Mechanism
- Local cache: `/tmp/inductor-cache/` - Each worker’s temporary cache
- Shared cache: `/data/inductor-caches/<GPU_TYPE>/<key>.zip` - Persistent, shared across workers
When load_inductor_cache() is called, it sets the environment variable that points Inductor at the local cache directory.
torch.compile() automatically reads this environment variable to locate compiled kernels. You don’t need to configure anything - just call load_inductor_cache() before torch.compile() and the connection happens automatically.
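The variable is not named above; PyTorch's documented override for the Inductor cache directory is `TORCHINDUCTOR_CACHE_DIR`, so a minimal sketch assuming that variable:

```python
# Assumption: the environment variable in question is TORCHINDUCTOR_CACHE_DIR,
# PyTorch's standard override for the Inductor cache directory.
import os

os.environ["TORCHINDUCTOR_CACHE_DIR"] = "/tmp/inductor-cache"

# Any subsequent torch.compile() call would pick up this directory, e.g.:
# compiled = torch.compile(model)
print(os.environ["TORCHINDUCTOR_CACHE_DIR"])  # /tmp/inductor-cache
```

Setting the variable before the first torch.compile() call is what matters; no other configuration is needed.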
GPU Separation
Caches are GPU-specific (H100, H200, A100, etc.) and automatically organized by GPU type using `get_gpu_type()`. This ensures compiled kernels match the hardware they’ll run on.
Process Flow
The behavior differs based on whether a cache already exists.
Cache Miss (First Worker):
- Load attempt → Cache not found
- Warmup triggers compilation
- Compilation phase → torch.compile() generates CUDA kernels
- Kernels saved to `/tmp/inductor-cache/`
- Sync creates `.zip` and uploads to `/data/inductor-caches/<GPU_TYPE>/<cache_key>.zip`
Cache Hit (Subsequent Workers):
- Load attempt → Cache found
- Extract `.zip` to `/tmp/inductor-cache/`
- torch.compile() finds existing kernels
- Warmup uses cached kernels (no compilation)
- Sync compares hash → Usually no-op (no changes to upload)
This does not change model outputs or behavior - only startup speed changes. The compiled model produces identical results to the uncompiled version.
Best Practices
Warmup Coverage
Warm up with representative input shapes to maximize cache coverage:
- Use `dynamic=True` for flexibility across input variations
- Cover 3-5 representative sizes
- Focus on your most common use cases
- Tradeoff: More warmup shapes = longer startup, but faster inference
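The tradeoff above can be illustrated with a toy model (no torch involved) of shape-specialized compilation: each shape pays the compile cost once, so only shapes covered during warmup are free at serving time.

```python
# Toy illustration of warmup shape coverage. The dict stands in for the
# Inductor cache; "compiling" is just inserting an entry per input shape.
compiled = {}

def run(shape):
    if shape not in compiled:          # cache miss -> pay compile cost here
        compiled[shape] = f"kernel{shape}"
    return compiled[shape]

# Warmup over 3 representative batch sizes populates the cache:
for s in [(1, 512), (4, 512), (8, 512)]:
    run(s)

print(len(compiled))        # shapes that will serve with no compile cost
print((2, 512) in compiled)  # an uncovered shape would still recompile
```

Compiling with `dynamic=True` reduces this per-shape specialization, which is why it pairs well with a small set of representative warmup sizes.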
When to Skip
Not all models benefit from Inductor caching:
- CPU-only models - No GPU compilation involved
- Models without torch.compile - No Inductor caching needed
- Lightweight models - Minimal compilation overhead, caching may not be worth it
Troubleshooting
Workers Still Compiling (Cache Not Working)?
If you see compilation happening on every worker despite using `synchronized_inductor_cache`:
1. Verify you’re calling warmup inside the cache context
2. Compile with `dynamic=True` for flexible input shapes
Debugging
Enable verbose logging to see what PyTorch is doing:
See Also
- Optimize Model Performance - Learn about torch.compile and the `optimize()` helper
- Use Persistent Storage - Understand the `/data` directory for persistent storage
- Deploy Multi-GPU Inference - Deploy large compiled models across multiple GPUs