GEN3C is a generative video model designed to produce high-quality, 3D-consistent videos with precise camera control and spatiotemporal consistency.
What is GEN3C?
GEN3C is a generative video model developed by NVIDIA, the University of Toronto, and the Vector Institute. It generates high-quality videos with precise camera control and 3D spatiotemporal consistency. GEN3C builds a point-cloud-based 3D cache to guide video generation: depth estimated from the input images or video frames is back-projected into a 3D representation of the scene. The cache is then rendered into 2D videos along user-provided camera trajectories, and these renderings serve as conditional input to the generative model.
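For intuition, the back-projection step might look roughly like the NumPy sketch below, which lifts a depth map into a world-space point cloud under an assumed pinhole camera model. The function name and signature are illustrative only and are not part of GEN3C's released code.

```python
import numpy as np

def backproject_depth(depth, K, cam_to_world):
    """Lift a per-pixel depth map into a world-space point cloud.

    depth:        (H, W) metric depth, e.g. from an off-the-shelf estimator
    K:            (3, 3) pinhole intrinsics (assumed known)
    cam_to_world: (4, 4) camera-to-world pose
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    # Unproject the pixel grid to camera-space rays, then scale by depth.
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T
    pts_cam = rays * depth.reshape(-1, 1)
    # Move camera-space points into a shared world frame.
    pts_h = np.concatenate([pts_cam, np.ones((pts_cam.shape[0], 1))], axis=1)
    return (pts_h @ cam_to_world.T)[:, :3]  # (H*W, 3) points contributed to the 3D cache
```

Storing each point's color alongside its position lets the cache be splatted back into any novel view later.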
Key Features of GEN3C
- Precise Camera Control: Generates videos based on user-specified camera trajectories, supporting complex camera movements (such as zooming, panning, and rotating) while maintaining spatiotemporal consistency.
- 3D Consistent Video Generation: Produces realistic, 3D-consistent videos, avoiding artifacts such as objects abruptly appearing or disappearing.
- Novel View Synthesis from Sparse or Multiple Views: Supports single-view, sparse multi-view, or dense multi-view inputs to generate high-quality novel-view videos.
- 3D Editing and Scene Manipulation: Allows users to modify 3D point clouds (e.g., adding or removing objects) to edit scenes and generate corresponding videos.
- Long Video Generation: Supports the generation of long videos while maintaining spatiotemporal consistency.
Technical Principles of GEN3C
- Building the 3D Cache: Depth estimated from the input images or video frames is back-projected into 3D point clouds, forming a spatiotemporally consistent 3D cache. The cache serves as the foundation for video generation, providing an explicit 3D structure for the scene (see the back-projection sketch above).
- Rendering the 3D Cache: The 3D cache is rendered into 2D videos along user-provided camera trajectories (a rough point-splatting render is sketched after this list).
- Video Generation: A pre-trained video diffusion model (such as Stable Video Diffusion or Cosmos) takes the rendered 3D cache as conditional input and generates the final high-quality video. During denoising, the model repairs rendering flaws (e.g., disocclusions and holes) and fills in missing information.
- Multi-View Fusion: When the input includes multiple viewpoints, GEN3C aggregates the per-view cache renders with a max-pooling fusion strategy before feeding them to the video generation model, producing consistent videos (see the fusion sketch after this list).
- Autoregressive Generation and Cache Update: For long video generation, GEN3C divides the video into overlapping segments, generates them sequentially, and updates the 3D cache after each segment to maintain spatiotemporal consistency (sketched after this list).
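To make the rendering and fusion steps concrete, here is a minimal sketch: each cache point cloud is splatted into the target camera with a z-buffer, and multiple per-view renders (or their encoded feature maps) are then fused with an elementwise max. The function names, raw NumPy arrays, and nearest-pixel splatting are assumptions for illustration, not GEN3C's actual renderer.

```python
import numpy as np

def render_points(points, colors, K, world_to_cam, hw):
    """Splat a colored point cloud into a target view with a z-buffer.

    Returns an (H, W, 3) image and an (H, W) coverage mask; pixels the
    cache does not cover stay empty and are left for the diffusion model.
    """
    H, W = hw
    img = np.zeros((H, W, 3), dtype=np.float32)
    zbuf = np.full((H, W), np.inf, dtype=np.float32)
    pts_h = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    cam = (pts_h @ world_to_cam.T)[:, :3]
    in_front = cam[:, 2] > 1e-6
    cam, cols = cam[in_front], colors[in_front]
    uv = cam @ K.T
    uv = uv[:, :2] / uv[:, 2:3]
    u, v = np.round(uv[:, 0]).astype(int), np.round(uv[:, 1]).astype(int)
    ok = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    for ui, vi, zi, ci in zip(u[ok], v[ok], cam[ok, 2], cols[ok]):
        if zi < zbuf[vi, ui]:          # keep only the closest point per pixel
            zbuf[vi, ui] = zi
            img[vi, ui] = ci
    return img, np.isfinite(zbuf)

def fuse_views(feature_maps):
    """Max-pooling fusion over per-view renders/feature maps, stacked on axis 0."""
    return np.max(np.stack(feature_maps, axis=0), axis=0)
```

In the paper's description the max-pooling is applied to the per-view renders after they are encoded; the elementwise max above is the same aggregation shown directly on arrays.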
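The autoregressive loop for long videos can be sketched as follows. The component callables (render_fn, diffuse_fn, update_cache_fn) and the chunk/overlap sizes are placeholders standing in for the model's real components, not GEN3C's actual interface.

```python
def generate_long_video(trajectory, cache, render_fn, diffuse_fn, update_cache_fn,
                        chunk_len=14, overlap=2):
    """Generate a long video chunk by chunk along a camera trajectory.

    render_fn(cache, cams)               -> conditioning renders for those cameras
    diffuse_fn(renders, context)         -> generated frames for the chunk
    update_cache_fn(cache, frames, cams) -> cache augmented with newly seen geometry
    """
    frames, context, start = [], None, 0
    while start < len(trajectory):
        cams = trajectory[start:start + chunk_len]
        renders = render_fn(cache, cams)                 # render the 3D cache for this segment
        chunk = diffuse_fn(renders, context)             # diffusion fills gaps, fixes artifacts
        new = chunk if start == 0 else chunk[overlap:]   # drop frames shared with the context
        frames.extend(new)
        cache = update_cache_fn(cache, chunk, cams)      # keep the cache spatiotemporally consistent
        context = chunk[-overlap:]                       # overlap conditions the next chunk
        start += chunk_len - overlap
    return frames
```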
Applications of GEN3C
- Single-View Video Generation: Generates dynamic videos from a single image, suitable for rapid content creation.
- Novel View Synthesis: Generates new view videos from a limited number of viewpoints, useful for VR/AR and 3D reconstruction.
- Driving Simulation: Generates driving scenes from different viewpoints, aiding autonomous driving simulation and training.
- Dynamic Video Re-Rendering: Re-renders existing videos from new viewpoints, useful for video editing and creative reuse of footage.
- 3D Scene Editing: Modifies scene content and generates new videos, assisting in film production and game development.