FLAT turns video diffusion latents into triangle-based 3D scenes

FLAT aims to turn latent video features into usable scene geometry

Researchers from Google Research, the University of Oxford and the Technical University of Munich have introduced FLAT, a feedforward method that decodes video diffusion latents into explicit triangle-based 3D scene representations. The work, presented as a preprint, is designed to improve geometric accuracy while keeping the rendering pipeline lightweight and practical for downstream use.

Unlike approaches that first generate volumetric or implicit scene data and then refine it, FLAT predicts surface-aligned triangle splats in a single forward pass. The authors say this makes the output easier to render with standard triangle rasterizers and more suitable for later refinement into assets that can support physics-based interaction.

A different target than Gaussian splats

Many recent feedforward scene generation systems have focused on 3D Gaussians or similar volumetric primitives. FLAT takes a different route. The method uses the structure already present in compressed video diffusion latents and maps those latent features directly to triangle splats, which represent surfaces more explicitly.

According to the researchers, that change in representation matters because triangles are a better fit for applications that need clear geometry, not just visually plausible output. The system is intended to preserve competitive visual quality while improving the consistency of the underlying shape.

To make triangle regression stable, the team introduced geometry-focused training choices, including a ray-centered triangle parameterization and a product window rendering function. The paper says these design elements help keep gradients usable even when small orientation errors would otherwise disrupt optimization.
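The article does not give the exact formulas behind these two ideas, so the following is only an illustrative sketch under stated assumptions. It assumes a "ray-centered" triangle is one whose centroid is pinned to a camera ray at a predicted depth, and it approximates a "product window" as a product of one smooth sigmoid per triangle edge. The function names, the `sharpness` parameter and the sigmoid choice are hypothetical and are not taken from the paper.

```python
import numpy as np

def ray_centered_triangle(origin, direction, depth, offsets):
    """Build one triangle whose centroid sits on a camera ray (assumed parameterization).

    origin, direction: (3,) ray origin and direction (need not be unit length).
    depth: positive distance along the ray to the triangle centroid.
    offsets: (3, 3) raw per-vertex offsets, re-centered so the centroid stays on the ray.
    """
    unit_dir = direction / np.linalg.norm(direction)
    center = origin + depth * unit_dir
    centered = offsets - offsets.mean(axis=0, keepdims=True)
    return center + centered  # (3, 3): one row per vertex

def product_window(points, tri2d, sharpness=20.0):
    """Soft coverage of 2D points by a non-degenerate 2D triangle (assumed form).

    Multiplies one sigmoid per edge, each applied to the signed distance from
    the point to that edge (positive on the interior side).
    """
    tri = np.asarray(tri2d, dtype=float)
    # Force counter-clockwise winding so each edge's left-hand normal points inward.
    e1, e2 = tri[1] - tri[0], tri[2] - tri[0]
    if e1[0] * e2[1] - e1[1] * e2[0] < 0:
        tri = tri[::-1]
    pts = np.atleast_2d(np.asarray(points, dtype=float))
    weight = np.ones(len(pts))
    for i in range(3):
        a, b = tri[i], tri[(i + 1) % 3]
        edge = b - a
        inward = np.array([-edge[1], edge[0]]) / np.linalg.norm(edge)
        signed_dist = (pts - a) @ inward
        weight *= 1.0 / (1.0 + np.exp(-sharpness * signed_dist))
    return weight
```

In this sketch the weight is close to 1 well inside the triangle and falls smoothly toward 0 outside it, so a point near an edge still contributes a nonzero gradient. That is the general property the paper's description points to, although the authors' actual window function may differ.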

One decoder, multiple video pipelines

FLAT is built to decode denoised latents from Wan-2.1, a video model family that supports several generation modes. Because the method operates in the latent space rather than directly on RGB frames, it can be used across different Wan-2.1 variants without requiring a separate scene model for each one.

The researchers say this covers text-to-video, image-to-video, video-to-video, long-horizon and interactive settings, as long as the model produces the same denoised latent representation. In practice, FLAT replaces the usual visual decoder with a scene decoder that outputs geometry instead of pixels.
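To make the decoder swap concrete, here is a deliberately simplified sketch of a head that maps each latent cell to a few triangles rather than to pixels. The real FLAT decoder is a learned network whose architecture is not described in this article. The single linear layer, the 13-value triangle layout, the tensor shapes and the name `decode_latents_to_triangles` are all assumptions made for illustration.

```python
import numpy as np

# Assumed layout per triangle: 9 vertex coordinates + 3 color channels + 1 opacity.
PARAMS_PER_TRIANGLE = 13

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_latents_to_triangles(latents, weight, bias):
    """Map a latent video tensor to triangle parameters with one linear layer.

    latents: (T, H, W, C) denoised latent features.
    weight:  (C, K * 13) and bias: (K * 13,), where K is triangles per latent cell.
    Returns vertices (N, 3, 3), colors (N, 3) in [0, 1], opacity (N,) in [0, 1].
    """
    channels = latents.shape[-1]
    feats = latents.reshape(-1, channels)  # one row per latent cell
    raw = (feats @ weight + bias).reshape(-1, PARAMS_PER_TRIANGLE)
    vertices = raw[:, :9].reshape(-1, 3, 3)
    colors = sigmoid(raw[:, 9:12])
    opacity = sigmoid(raw[:, 12])
    return vertices, colors, opacity
```

Because a head like this reads only the latent tensor, any upstream pipeline that emits latents of the same shape could feed it unchanged, which is the reuse property the researchers describe.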

That design could make the approach more flexible as the underlying video model improves. The paper argues that better generation quality or new upstream controls carry over automatically through the same latent scene decoder.

From triangle soup to interactive assets

The raw output from FLAT is described as a triangle soup optimized for geometric fidelity. A lightweight refinement stage then turns that prediction into a fully opaque asset that can be used in more conventional graphics and simulation pipelines.
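Once the triangles are opaque, handing them to conventional tools is largely a matter of file format. As a minimal sketch, assuming the triangle soup is stored as an array of shape (N, 3, 3), the helper below writes it out in the widely supported Wavefront OBJ text format. The function name is hypothetical, and this is not the paper's refinement stage, only a possible export step after it.

```python
import numpy as np

def triangle_soup_to_obj(triangles):
    """Serialize an (N, 3, 3) triangle soup as Wavefront OBJ text.

    Vertices are not shared between faces, matching the "soup" structure:
    triangle i uses vertices 3i+1, 3i+2, 3i+3 (OBJ indices are 1-based).
    """
    tris = np.asarray(triangles, dtype=float)
    lines = []
    for v in tris.reshape(-1, 3):
        lines.append(f"v {v[0]:.6f} {v[1]:.6f} {v[2]:.6f}")
    for i in range(len(tris)):
        base = 3 * i + 1
        lines.append(f"f {base} {base + 1} {base + 2}")
    return "\n".join(lines) + "\n"
```

Writing the returned string to a `.obj` file gives a mesh that most standard viewers and game engines can load.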

The project demonstrates that the resulting scenes can be viewed with a simple triangle renderer and navigated interactively. The authors also say the refined geometry can be used directly in a rigid-body simulation without relying on a separate collision proxy.

The paper includes examples of novel-view renders and surface normals, which it uses to show that appearance and structure remain aligned across viewpoints. That consistency is central to the project’s goal of making the geometry legible rather than merely visually convincing.
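Surface normals are straightforward to obtain from explicit triangles, which is part of why they make a convenient check on geometry. The sketch below computes one unit normal per triangle with a cross product and maps it to a color using the common convention of shifting each component from [-1, 1] into [0, 1]. It assumes an (N, 3, 3) triangle array and illustrates the standard technique, not the authors' visualization code.

```python
import numpy as np

def face_normals(triangles):
    """Return one unit normal per triangle for an (N, 3, 3) array.

    The direction follows the right-hand rule over the vertex order.
    Degenerate (zero-area) triangles yield a zero vector.
    """
    tris = np.asarray(triangles, dtype=float)
    n = np.cross(tris[:, 1] - tris[:, 0], tris[:, 2] - tris[:, 0])
    length = np.linalg.norm(n, axis=1, keepdims=True)
    return n / np.maximum(length, 1e-12)

def normals_to_rgb(normals):
    """Map normal components from [-1, 1] to [0, 1] for display."""
    return 0.5 * (np.asarray(normals, dtype=float) + 1.0)
```

If neighbouring triangles on a flat wall come out with near-identical colors in such a map, the geometry is consistent there. Noisy colors would suggest that the surface only looks right in the rendered image.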

Broader implications for scene generation

FLAT is part of a broader push to make generative vision systems produce assets that are easier to export, inspect and use in real applications. By decoding explicit surface geometry from video latents, the method tries to bridge a gap between high-quality generative models and the demands of graphics, simulation and interactive environments.

The authors position the work as a feedforward alternative to generate-and-optimize pipelines, with the potential to support more practical scene generation from existing video diffusion systems.