LT3SD: Latent Trees for 3D Scene Diffusion
Technical University of Munich
Abstract
We present LT3SD, a novel latent diffusion model for large-scale 3D scene generation. Recent advances in diffusion models have shown impressive results in 3D object generation but remain limited in spatial extent and quality when extended to 3D scenes. To generate complex and diverse 3D scene structures, we introduce a latent tree representation that effectively encodes both lower-frequency geometry and higher-frequency detail in a coarse-to-fine hierarchy. We then learn a generative diffusion process in this latent 3D scene space, modeling the latent components of a scene at each resolution level. To synthesize scenes of varying sizes, we train our diffusion model on scene patches and produce arbitrary-sized output 3D scenes through shared diffusion generation across multiple scene patches. Through extensive experiments, we demonstrate the efficacy and benefits of LT3SD for large-scale, high-quality unconditional 3D scene generation and for the probabilistic completion of partial scene observations.
LT3SD enables high-fidelity generation of infinite 3D environments in a patch-by-patch, coarse-to-fine fashion. Samples of generated infinite 3D scene meshes can be downloaded here.
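To make the latent tree representation more concrete, below is a minimal Python sketch of how a scene's TUDF (truncated unsigned distance field) grid could be decomposed into per-level TUDF and latent feature grids. The class and function names (LatentTreeLevel, build_latent_tree) and the encoder interface are illustrative assumptions, not the released implementation.

```python
# A minimal sketch (PyTorch, assumed interfaces) of the latent tree idea:
# each resolution level stores a coarse TUDF grid plus a latent feature grid
# carrying the higher-frequency detail lost when coarsening.
from dataclasses import dataclass
import torch
import torch.nn.functional as F


@dataclass
class LatentTreeLevel:
    tudf: torch.Tensor    # (1, 1, D, H, W) coarse geometry at this level
    latent: torch.Tensor  # (1, C, D, H, W) detail features at this level


def build_latent_tree(tudf_fine, encoders):
    """Decompose a fine TUDF grid into a coarse-to-fine latent tree.

    encoders[i] is assumed to map the level-(i+1) TUDF grid to a coarser
    TUDF grid and a latent feature grid at level i.
    """
    levels, current = [], tudf_fine
    for encoder in reversed(encoders):           # finest -> coarsest
        coarse_tudf, latent = encoder(current)
        levels.append(LatentTreeLevel(coarse_tudf, latent))
        current = coarse_tudf
    return list(reversed(levels))                # levels[0] = coarsest


if __name__ == "__main__":
    # Stand-in encoder (average pooling + channel repeat) used only to show
    # the intended interface; the real encoders are learned networks.
    def dummy_encoder(x):
        coarse = F.avg_pool3d(x, kernel_size=2)
        return coarse, coarse.repeat(1, 4, 1, 1, 1)

    tree = build_latent_tree(torch.rand(1, 1, 64, 64, 64), [dummy_encoder] * 3)
    print([lvl.tudf.shape for lvl in tree])      # 8^3, 16^3, 32^3 grids
```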
Video
How It Works
We formulate 3D scene generation as a patch-based latent diffusion process. Left: To characterize complex scene geometry, we encode 3D scenes in a novel latent tree representation, where each scene resolution level is decomposed into a TUDF (truncated unsigned distance field) grid and a latent feature grid. Top Right: During latent tree training, the encoder encodes a patch from the scene grid at resolution level i+1 into a coarser TUDF patch and a latent feature patch at level i. The decoder then reconstructs the scene patch from the factorized grids. Bottom Right: During generation, the diffusion model learns to generate a latent feature patch conditioned on the TUDF patch at the same level. Our method enables arbitrary-sized 3D scene generation at inference time by synthesizing scenes in a coarse-to-fine hierarchy and a patch-by-patch fashion, as sketched below.
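The following is a minimal sketch of this coarse-to-fine, patch-by-patch inference loop, assuming hypothetical per-level diffusion samplers and decoders with the interfaces noted in the comments. Patch overlap and blending, as well as generation of the coarsest level itself, are omitted for brevity.

```python
# Illustrative generation loop (assumed interfaces, not the released API):
# starting from a coarsest-level TUDF grid, each level samples latent feature
# patches conditioned on TUDF patches and decodes them into the next,
# finer-resolution TUDF grid.
import torch


def generate_scene(coarse_tudf, samplers, decoders, patch=32):
    """samplers[l](tudf_patch) -> latent feature patch at level l (diffusion);
    decoders[l](tudf_patch, latent_patch) -> TUDF patch at level l+1,
    assumed here to double the spatial resolution."""
    tudf = coarse_tudf                                    # (1, 1, D, H, W)
    for level in range(len(samplers)):                    # coarse -> fine
        _, _, D, H, W = tudf.shape
        refined = torch.zeros(1, 1, 2 * D, 2 * H, 2 * W)
        for z in range(0, D, patch):
            for y in range(0, H, patch):
                for x in range(0, W, patch):
                    tudf_patch = tudf[..., z:z + patch, y:y + patch, x:x + patch]
                    # Diffusion samples the latent feature patch conditioned
                    # on the same-level TUDF patch.
                    latent_patch = samplers[level](tudf_patch)
                    # The decoder combines the factorized patches into the
                    # finer-resolution TUDF patch.
                    fine_patch = decoders[level](tudf_patch, latent_patch)
                    refined[..., 2 * z:2 * (z + patch),
                                 2 * y:2 * (y + patch),
                                 2 * x:2 * (x + patch)] = fine_patch
        tudf = refined
    return tudf  # finest-level TUDF grid; mesh extraction follows
```

Note that the actual method uses shared diffusion generation across neighboring scene patches rather than fully independent per-patch sampling; the loop above simplifies this for readability.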
Comparisons
We compare unconditional 3D scene generation against the 3D diffusion methods PVD, NFD, and BlockFusion. All methods were trained on house-level scenes from the 3D-FRONT dataset. Our latent tree-based 3D scene diffusion approach synthesizes cleaner surfaces with more geometric detail and captures a diverse range of furniture objects.
Citation
Acknowledgements
This work was supported by the ERC Starting Grant SpatialSem (101076253). Matthias Niessner was supported by the ERC Starting Grant Scan2CAD (804724).