Flex4DHuman aims to simplify 4D human reconstruction

A research team has released Flex4DHuman, a system designed to convert monocular or sparse multi-view footage of people into synchronized dense multi-view video. The project, published by researchers affiliated with the University of Washington and World Labs, uses camera-pose conditioning rather than hand-crafted geometry priors to infer novel views of a moving subject.

The method is intended to support 4D human reconstruction, a task that combines appearance, motion and scene geometry over time. According to the project materials, Flex4DHuman can take one or more reference videos, together with the camera poses of those references and the camera poses of the desired target views, and synthesize consistent videos from the requested viewpoints. Those generated views can then be used to build dynamic 4D Gaussian splats.

How the system works

Flex4DHuman relies on video diffusion conditioned on relative camera pose. In practical terms, the model is fed reference footage and camera information, then asked to produce synchronized clips from new angles. The project states that this approach works with a single input view as well as sparse multi-view setups.
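To make "relative camera pose" concrete, the following is a minimal sketch of how a relative pose between a reference camera and a target camera is commonly computed from two world-to-camera matrices. It is a generic illustration, not the project's code: the article does not describe how Flex4DHuman encodes poses internally, and the function names, the 4x4 matrix convention and the example camera placement are all assumptions.

```python
import numpy as np


def make_world_to_camera(rotation, translation):
    """Assemble a 4x4 world-to-camera matrix from a 3x3 rotation and a 3-vector.

    The 4x4 homogeneous convention is an assumption made for this sketch.
    """
    extrinsic = np.eye(4)
    extrinsic[:3, :3] = rotation
    extrinsic[:3, 3] = translation
    return extrinsic


def relative_pose(ref_w2c, target_w2c):
    """Return the transform taking points from the reference camera frame
    to the target camera frame."""
    return target_w2c @ np.linalg.inv(ref_w2c)


def rotation_about_y(angle_rad):
    """3x3 rotation matrix about the vertical (y) axis."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])


# Hypothetical setup: a reference camera and a target camera rotated 90 degrees
# around the subject, both with the same translation component.
ref_w2c = make_world_to_camera(np.eye(3), np.array([0.0, 0.0, 2.0]))
target_w2c = make_world_to_camera(rotation_about_y(np.pi / 2), np.array([0.0, 0.0, 2.0]))
rel = relative_pose(ref_w2c, target_w2c)

# A world-space point seen in both camera frames (homogeneous coordinates).
point_world = np.array([0.3, 1.2, -0.5, 1.0])
point_in_ref = ref_w2c @ point_world
point_in_target = target_w2c @ point_world
```

Applying `rel` to `point_in_ref` yields `point_in_target`, so the relative pose describes the target view purely in terms of the reference view, without needing a shared world coordinate frame.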

The key emphasis is on synchronization. Rather than producing separate, loosely related view predictions, the system is designed to output multi-view clips that line up in time. That makes the generated footage more useful for downstream reconstruction, where consistency across views is important for representing moving subjects.
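One way to see why time alignment matters downstream is to look at how per-view clips might be regrouped into per-timestep sets of views before reconstruction. The sketch below is illustrative only and assumes each clip is an array of shape (frames, height, width, 3); the function name and the array layout are hypothetical and not taken from the project. Equal frame counts are a necessary condition for synchronization, not proof of it.

```python
import numpy as np


def group_by_timestep(view_clips):
    """Stack per-view clips of shape (T, H, W, 3) into one array of shape
    (T, V, H, W, 3), so index t holds every view of the same moment.

    Raises ValueError if the clips do not share a frame count.
    """
    frame_counts = {clip.shape[0] for clip in view_clips}
    if len(frame_counts) != 1:
        raise ValueError(
            f"views are not synchronized: frame counts {sorted(frame_counts)}"
        )
    return np.stack(view_clips, axis=1)


# Hypothetical example: three generated views, eight frames each, 4x4 pixels.
clips = [np.zeros((8, 4, 4, 3)) for _ in range(3)]
per_timestep = group_by_timestep(clips)
```

Here `per_timestep[t]` contains all three views for frame `t`, which is the kind of grouping a dynamic reconstruction step would need as input.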

The project says this workflow does not depend on explicit geometry priors. Instead, it uses relative camera-pose information to guide the generation process. The resulting videos can then be lifted into 4D Gaussian splats, a representation that can be rendered interactively.

Potential uses in AR, VR and content creation

The researchers say the technique could be used for applications such as augmented reality, virtual reality, gaming, simulation and video re-shooting. One example highlighted by the project describes a workflow in which a person records a casual social media clip on a phone and then uses Flex4DHuman to produce a 4D asset from that footage.

The project frames the system as a way to move from ordinary single-camera video capture toward a more flexible asset creation pipeline. Instead of requiring specialized multi-camera rigs, a user could start with a monocular recording and generate the additional views needed for reconstruction.

Removing the need for a multi-camera rig could make 4D capture more accessible for creators and researchers working with human motion, although the project materials do not claim the method is a finished production tool. The release focuses on the underlying research result and its reconstruction workflow.

Release details

Flex4DHuman is presented as “Flexible Multi-view Video Diffusion for 4D Human Reconstruction.” The project lists Jen-Hao Cheng, Yipeng Wang, Hao Zhang, Gengshan Yang and Jenq-Neng Hwang as authors. The work is available through an arXiv paper, a public code repository and a multi-view caption dataset on Hugging Face.

The project also shows results generated from different numbers of reference views, including monocular input, two-view input and four-view input. In each case, the system produces dense synchronized novel-view clips of a dynamic subject.

By combining camera-conditioned generation with 4D Gaussian splat reconstruction, Flex4DHuman adds to a growing set of tools aimed at capturing moving people from more limited source footage. For researchers, the release offers another path toward turning simple video into editable, view-consistent 3D or 4D assets.