LT3SD: Latent Trees for 3D Scene Diffusion
Technical University of Munich
Abstract
We present LT3SD, a novel latent diffusion model for large-scale 3D scene generation. Recent advances in diffusion models have shown impressive results in 3D object generation but remain limited in spatial extent and quality when extended to 3D scenes. To generate complex and diverse 3D scene structures, we introduce a latent tree representation that effectively encodes both lower-frequency geometry and higher-frequency detail in a coarse-to-fine hierarchy. We then learn a generative diffusion process in this latent 3D scene space, modeling the latent components of a scene at each resolution level. To synthesize scenes of varying sizes, we train our diffusion model on scene patches and produce arbitrary-sized output 3D scenes through shared diffusion generation across multiple scene patches. Through extensive experiments, we demonstrate the efficacy and benefits of LT3SD for large-scale, high-quality unconditional 3D scene generation and for the probabilistic completion of partial scene observations.
LT3SD enables high-fidelity generation of infinite 3D environments in a patch-by-patch, coarse-to-fine fashion. Samples of generated infinite 3D scene meshes can be downloaded here.
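To make the latent tree representation more concrete, below is a minimal Python sketch of how a scene's TUDF (truncated unsigned distance field) grid could be decomposed into per-level TUDF and latent feature grids. The class and function names (LatentTreeLevel, build_latent_tree) and the encoder interface are illustrative assumptions, not the released implementation.

```python
# A minimal sketch (PyTorch, assumed interfaces) of the latent tree idea:
# each resolution level stores a coarse TUDF grid plus a latent feature grid
# carrying the higher-frequency detail lost when coarsening.
from dataclasses import dataclass
import torch
import torch.nn.functional as F


@dataclass
class LatentTreeLevel:
    tudf: torch.Tensor    # (1, 1, D, H, W) coarse geometry at this level
    latent: torch.Tensor  # (1, C, D, H, W) detail features at this level


def build_latent_tree(tudf_fine, encoders):
    """Decompose a fine TUDF grid into a coarse-to-fine latent tree.

    encoders[i] is assumed to map the level-(i+1) TUDF grid to a coarser
    TUDF grid and a latent feature grid at level i.
    """
    levels, current = [], tudf_fine
    for encoder in reversed(encoders):           # finest -> coarsest
        coarse_tudf, latent = encoder(current)
        levels.append(LatentTreeLevel(coarse_tudf, latent))
        current = coarse_tudf
    return list(reversed(levels))                # levels[0] = coarsest


if __name__ == "__main__":
    # Stand-in encoder (average pooling + channel repeat) used only to show
    # the intended interface; the real encoders are learned networks.
    def dummy_encoder(x):
        coarse = F.avg_pool3d(x, kernel_size=2)
        return coarse, coarse.repeat(1, 4, 1, 1, 1)

    tree = build_latent_tree(torch.rand(1, 1, 64, 64, 64), [dummy_encoder] * 3)
    print([lvl.tudf.shape for lvl in tree])      # 8^3, 16^3, 32^3 grids
```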
Video
How It Works
We formulate 3D scene generation as a patch-based latent diffusion process. Left: To characterize complex scene geometry, we encode 3D scenes in a novel latent tree representation, where each scene resolution level is decomposed into a TUDF (truncated unsigned distance field) grid and a latent feature grid. Top Right: During latent tree training, the encoder encodes a patch from the scene grid at resolution level i+1 into a coarser TUDF patch and a latent feature patch at level i. The decoder then reconstructs the scene patch from the factorized grids. Bottom Right: During generation, the diffusion model learns to generate a latent feature patch conditioned on the TUDF patch at the same level. Our method enables arbitrary-sized 3D scene generation at inference time by synthesizing scenes in a coarse-to-fine hierarchy and a patch-by-patch fashion, as sketched below.
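The following is a minimal sketch of this coarse-to-fine, patch-by-patch inference loop, assuming hypothetical per-level diffusion samplers and decoders with the interfaces noted in the comments. Patch overlap and blending, as well as generation of the coarsest level itself, are omitted for brevity.

```python
# Illustrative generation loop (assumed interfaces, not the released API):
# starting from a coarsest-level TUDF grid, each level samples latent feature
# patches conditioned on TUDF patches and decodes them into the next,
# finer-resolution TUDF grid.
import torch


def generate_scene(coarse_tudf, samplers, decoders, patch=32):
    """samplers[l](tudf_patch) -> latent feature patch at level l (diffusion);
    decoders[l](tudf_patch, latent_patch) -> TUDF patch at level l+1,
    assumed here to double the spatial resolution."""
    tudf = coarse_tudf                                    # (1, 1, D, H, W)
    for level in range(len(samplers)):                    # coarse -> fine
        _, _, D, H, W = tudf.shape
        refined = torch.zeros(1, 1, 2 * D, 2 * H, 2 * W)
        for z in range(0, D, patch):
            for y in range(0, H, patch):
                for x in range(0, W, patch):
                    tudf_patch = tudf[..., z:z + patch, y:y + patch, x:x + patch]
                    # Diffusion samples the latent feature patch conditioned
                    # on the same-level TUDF patch.
                    latent_patch = samplers[level](tudf_patch)
                    # The decoder combines the factorized patches into the
                    # finer-resolution TUDF patch.
                    fine_patch = decoders[level](tudf_patch, latent_patch)
                    refined[..., 2 * z:2 * (z + patch),
                                 2 * y:2 * (y + patch),
                                 2 * x:2 * (x + patch)] = fine_patch
        tudf = refined
    return tudf  # finest-level TUDF grid; mesh extraction follows
```

Note that the actual method uses shared diffusion generation across neighboring scene patches rather than fully independent per-patch sampling; the loop above simplifies this for readability.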
Comparisons
We compare unconditional 3D scene generation against the 3D diffusion methods PVD, NFD, and BlockFusion. All methods were trained on house-level scenes from the 3D-FRONT dataset. Our latent tree-based 3D scene diffusion approach synthesizes cleaner surfaces with more geometric detail and captures a diverse range of furniture objects.
Citation
Acknowledgements
This work was supported by the ERC Starting Grant SpatialSem (101076253). Matthias Niessner was supported by the ERC Starting Grant Scan2CAD (804724).