GEN3C is a generative video model designed to produce high-quality, 3D-consistent videos with precise camera control and spatiotemporal consistency.
What is GEN3C?
GEN3C is a generative video model developed by NVIDIA, the University of Toronto, and the Vector Institute. It generates high-quality videos with precise camera control and 3D spatiotemporal consistency. GEN3C builds a point-cloud-based 3D cache to guide video generation: depth estimated from the input images or video frames is back-projected into a 3D representation of the scene. The cache is then rendered into 2D videos along user-provided camera trajectories, and these renderings serve as conditional input to the generative model.
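For intuition, the back-projection step might look roughly like the NumPy sketch below, which lifts a depth map into a world-space point cloud under an assumed pinhole camera model. The function name and signature are illustrative only and are not part of GEN3C's released code.

```python
import numpy as np

def backproject_depth(depth, K, cam_to_world):
    """Lift a per-pixel depth map into a world-space point cloud.

    depth:        (H, W) metric depth, e.g. from an off-the-shelf estimator
    K:            (3, 3) pinhole intrinsics (assumed known)
    cam_to_world: (4, 4) camera-to-world pose
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    # Unproject the pixel grid to camera-space rays, then scale by depth.
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T
    pts_cam = rays * depth.reshape(-1, 1)
    # Move camera-space points into a shared world frame.
    pts_h = np.concatenate([pts_cam, np.ones((pts_cam.shape[0], 1))], axis=1)
    return (pts_h @ cam_to_world.T)[:, :3]  # (H*W, 3) points contributed to the 3D cache
```

Storing each point's color alongside its position lets the cache be splatted back into any novel view later.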
Key Features of GEN3C
- Precise Camera Control: Generates videos based on user-specified camera trajectories, supporting complex camera movements (such as zooming, panning, and rotating) while maintaining spatiotemporal consistency.
- 3D Consistent Video Generation: Produces realistic, 3D-consistent videos, avoiding artifacts such as objects abruptly appearing or disappearing.
- Novel View Synthesis from Sparse or Multiple Views: Supports single-view, sparse multi-view, or dense multi-view inputs to generate high-quality novel-view videos.
- 3D Editing and Scene Manipulation: Allows users to modify 3D point clouds (e.g., adding or removing objects) to edit scenes and generate corresponding videos.
- Long Video Generation: Supports the generation of long videos while maintaining spatiotemporal consistency.
Technical Principles of GEN3C
- Building the 3D Cache: Depth estimated from the input images or video frames is back-projected into 3D point clouds, forming a spatiotemporally consistent 3D cache. The cache serves as the foundation for video generation, providing an explicit 3D structure for the scene (see the back-projection sketch above).
- Rendering the 3D Cache: The 3D cache is rendered into 2D videos along user-provided camera trajectories (a rough point-splatting render is sketched after this list).
- Video Generation: A pre-trained video diffusion model (such as Stable Video Diffusion or Cosmos) takes the rendered 3D cache as conditional input and generates the final high-quality video. During denoising, the model repairs rendering flaws (e.g., disocclusions and holes) and fills in missing information.
- Multi-View Fusion: When the input includes multiple viewpoints, GEN3C aggregates the per-view cache renders with a max-pooling fusion strategy before feeding them to the video generation model, producing consistent videos (see the fusion sketch after this list).
- Autoregressive Generation and Cache Update: For long video generation, GEN3C divides the video into overlapping segments, generates them sequentially, and updates the 3D cache after each segment to maintain spatiotemporal consistency (sketched after this list).
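To make the rendering and fusion steps concrete, here is a minimal sketch: each cache point cloud is splatted into the target camera with a z-buffer, and multiple per-view renders (or their encoded feature maps) are then fused with an elementwise max. The function names, raw NumPy arrays, and nearest-pixel splatting are assumptions for illustration, not GEN3C's actual renderer.

```python
import numpy as np

def render_points(points, colors, K, world_to_cam, hw):
    """Splat a colored point cloud into a target view with a z-buffer.

    Returns an (H, W, 3) image and an (H, W) coverage mask; pixels the
    cache does not cover stay empty and are left for the diffusion model.
    """
    H, W = hw
    img = np.zeros((H, W, 3), dtype=np.float32)
    zbuf = np.full((H, W), np.inf, dtype=np.float32)
    pts_h = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    cam = (pts_h @ world_to_cam.T)[:, :3]
    in_front = cam[:, 2] > 1e-6
    cam, cols = cam[in_front], colors[in_front]
    uv = cam @ K.T
    uv = uv[:, :2] / uv[:, 2:3]
    u, v = np.round(uv[:, 0]).astype(int), np.round(uv[:, 1]).astype(int)
    ok = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    for ui, vi, zi, ci in zip(u[ok], v[ok], cam[ok, 2], cols[ok]):
        if zi < zbuf[vi, ui]:          # keep only the closest point per pixel
            zbuf[vi, ui] = zi
            img[vi, ui] = ci
    return img, np.isfinite(zbuf)

def fuse_views(feature_maps):
    """Max-pooling fusion over per-view renders/feature maps, stacked on axis 0."""
    return np.max(np.stack(feature_maps, axis=0), axis=0)
```

In the paper's description the max-pooling is applied to the per-view renders after they are encoded; the elementwise max above is the same aggregation shown directly on arrays.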
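The autoregressive loop for long videos can be sketched as follows. The component callables (render_fn, diffuse_fn, update_cache_fn) and the chunk/overlap sizes are placeholders standing in for the model's real components, not GEN3C's actual interface.

```python
def generate_long_video(trajectory, cache, render_fn, diffuse_fn, update_cache_fn,
                        chunk_len=14, overlap=2):
    """Generate a long video chunk by chunk along a camera trajectory.

    render_fn(cache, cams)               -> conditioning renders for those cameras
    diffuse_fn(renders, context)         -> generated frames for the chunk
    update_cache_fn(cache, frames, cams) -> cache augmented with newly seen geometry
    """
    frames, context, start = [], None, 0
    while start < len(trajectory):
        cams = trajectory[start:start + chunk_len]
        renders = render_fn(cache, cams)                 # render the 3D cache for this segment
        chunk = diffuse_fn(renders, context)             # diffusion fills gaps, fixes artifacts
        new = chunk if start == 0 else chunk[overlap:]   # drop frames shared with the context
        frames.extend(new)
        cache = update_cache_fn(cache, chunk, cams)      # keep the cache spatiotemporally consistent
        context = chunk[-overlap:]                       # overlap conditions the next chunk
        start += chunk_len - overlap
    return frames
```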
Applications of GEN3C
- Single-View Video Generation: Generates dynamic videos from a single image, suitable for rapid content creation.
- Novel View Synthesis: Generates new view videos from a limited number of viewpoints, useful for VR/AR and 3D reconstruction.
- Driving Simulation: Generates driving scenes from different viewpoints, aiding autonomous driving simulation and training.
- Dynamic Video Re-Rendering: Re-renders existing videos from new viewpoints, useful for video editing and creative reuse of footage.
- 3D Scene Editing: Modifies scene content and generates new videos, assisting in film production and game development.