Phantom

by ByteDance
Phantom is a framework for generating videos with consistent subjects, combining text and image prompts to extract elements from reference images and create video content that matches the text description.

What is Phantom?

Phantom is a framework developed by ByteDance's Intelligent Creation Team for Subject-to-Video (S2V) generation. It uses cross-modal alignment to combine text and image prompts: subject elements are extracted from the reference images and rendered in a video that follows the text description. Building on existing Text-to-Video (T2V) and Image-to-Video (I2V) architectures, Phantom redesigns the joint text-image injection model and learns cross-modal alignment from text-image-video triplet data. The framework supports single- and multi-subject references, places particular emphasis on subject consistency when generating humans, and covers existing identity-preserving (ID-preserving) video generation tasks while offering additional advantages.

Key Features of Phantom

  • Extract Subject Elements from Reference Images: Identifies and extracts subjects (e.g., people, animals, objects) from images as the core content for video generation.
  • Generate Videos Based on Text Prompts: Users can control the content and style of videos through text instructions, enabling highly customized video generation.
  • Multi-Subject Video Generation: Supports handling multiple subjects simultaneously, generating complex interactive scenes such as multi-person interactions or human-pet interactions.
  • Identity Preservation (ID-Preserving): Retains the identity features of subjects (e.g., faces, clothing) in generated videos, making it particularly suitable for virtual try-ons and digital human generation.
  • High-Quality Video Output: The generated videos are described as comparable to existing commercial solutions in visual quality, subject consistency, and responsiveness to the text prompt.

Technical Principles of Phantom

  • Data Structure Design: Phantom builds text-image-video triplets to train the model on the relationships between modalities. The data is split into In-paired samples (the reference image and the video show the same subject) and Cross-paired samples (subjects matched across different videos), which discourages the model from simply copying the input image into the output.
  • Model Architecture: Based on existing T2V and I2V architectures, Phantom redesigns the joint text-image injection model. The model consists of an Input Head and a trainable DiT module. The Input Head encodes video, text, and reference images, while the DiT module handles cross-modal alignment and video generation.
  • Cross-Modal Alignment: Reference images are encoded using specific visual encoders (e.g., VAE and CLIP) and then concatenated with video and text features, which are input into the visual and text branches of the DiT module.
  • Identity Preservation Technology: When handling identity features like faces, a facial recognition model (e.g., ArcFace) evaluates the similarity between the generated video and the reference image to ensure subject identity consistency.
  • Optimization and Training: The model is trained on large-scale triplet data to learn how to balance text and image prompts during video generation. It is initialized from the pre-trained weights of the base model and then fine-tuned on cross-modal data to achieve high-quality video generation.
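
Two of the ideas above, the In-paired/Cross-paired split and the similarity-based identity check, can be illustrated with a minimal sketch. This is not Phantom's code: the `Triplet` class, its field names, and both helper functions are hypothetical, and the rule used here (a sample is In-paired when the reference image comes from the same video it is paired with) is an assumption about how the split might be represented. The embeddings are toy vectors standing in for the output of a face recognition model such as ArcFace.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class Triplet:
    """One text-image-video training sample (hypothetical layout)."""
    text: str
    reference_video_id: str  # video the reference image was taken from
    target_video_id: str     # video the model learns to generate


def pairing_type(sample: Triplet) -> str:
    """Label a sample, assuming pairing is decided by the source video."""
    if sample.reference_video_id == sample.target_video_id:
        return "in-paired"
    return "cross-paired"


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    same = Triplet("a woman waves at the camera", "vid_001", "vid_001")
    other = Triplet("a woman waves at the camera", "vid_002", "vid_001")
    print(pairing_type(same), pairing_type(other))

    # Toy vectors; real face embeddings would have many more dimensions.
    reference_face = [0.2, 0.9, 0.1]
    generated_face = [0.25, 0.85, 0.05]
    print(round(cosine_similarity(reference_face, generated_face), 3))
```

A higher similarity score between the reference embedding and the embedding of a generated frame suggests the subject's identity was better preserved; any acceptance threshold would depend on the face model used.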

Application Scenarios of Phantom

  • Virtual Try-On: Generates dynamic clothing display videos to help users preview effects.
  • Digital Human Generation: Creates virtual characters with specific appearances for use in scenarios like virtual hosts.
  • Ad Video Production: Quickly generates product advertisements based on images and text, improving production efficiency.
  • Film and Animation: Generates character animation prototypes to assist in creative validation and reduce production costs.
  • Education and Training: Generates teaching videos for scientific experiments or historical scenes to enhance interactivity.

Framework Features

Supported Tasks: Video Generation, Subject-to-Video, Identity Preservation, Multi-Subject Generation
Tags: Video Generation, Subject-to-Video, Cross-Modal Alignment, AI Video Tools, Creative AI, Identity Preservation, Multi-Subject Generation, Text-to-Video, Image-to-Video, ByteDance

Stats

544 GitHub Stars
