AniPortrait is an open-source framework developed by Tencent that generates photorealistic, lip-synced animations from audio and a reference portrait image. It is designed to produce high-quality, temporally consistent videos with natural facial expressions and precise lip movements.
AniPortrait consists of two main modules:
The first module, Audio2Lmk, predicts a sequence of 3D facial meshes and head poses from the audio input. It uses a pre-trained wav2vec model to extract speech features that capture pronunciation and intonation, which are crucial for generating realistic animations. These features are mapped to 3D meshes and head poses, which are then projected into 2D facial landmarks.
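The last step of the first module, turning a posed 3D mesh into 2D facial landmarks, can be illustrated with a small NumPy sketch. This is not AniPortrait's own code: the function names, the Euler-angle convention, and the simple pinhole camera (`focal`, `center`) are assumptions chosen only for illustration.

```python
import numpy as np

def rotation_from_euler(yaw, pitch, roll):
    """Build a 3x3 rotation matrix from head-pose angles in radians.

    Assumed convention: yaw about y, pitch about x, roll about z.
    """
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    rot_y = np.array([[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]])
    rot_x = np.array([[1.0, 0.0, 0.0], [0.0, cp, -sp], [0.0, sp, cp]])
    rot_z = np.array([[cr, -sr, 0.0], [sr, cr, 0.0], [0.0, 0.0, 1.0]])
    return rot_z @ rot_x @ rot_y

def project_mesh(vertices, rotation, translation, focal, center):
    """Apply the head pose, then perspective-project to pixel coordinates.

    vertices: (N, 3) mesh points; returns (N, 2) landmark positions.
    """
    cam = vertices @ rotation.T + translation   # rotate and move into camera space
    xy = cam[:, :2] / cam[:, 2:3]               # perspective divide by depth
    return xy * focal + center                  # scale and shift to pixels

# Toy usage: three hypothetical mesh points, head turned slightly to one side.
mesh = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.0, 0.1, 0.05]])
pose = rotation_from_euler(yaw=0.2, pitch=0.0, roll=0.0)
landmarks = project_mesh(mesh, pose, np.array([0.0, 0.0, 2.0]), 500.0,
                         np.array([256.0, 256.0]))
print(landmarks.shape)
```

Each audio frame would yield one mesh and one pose, so running the projection per frame produces the landmark sequence that drives the video stage.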
The second module, Lmk2Video, generates a temporally consistent video from the reference portrait and the 2D facial landmark sequence. It uses Stable Diffusion 1.5 as the backbone, combined with a temporal motion module, to produce high-quality video frames. A ReferenceNet extracts appearance features from the reference image so that facial identity stays consistent throughout the animation.
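To give a feel for what a temporal motion module contributes, here is a minimal sketch of self-attention along the frame axis, a mechanism commonly used for this purpose in video diffusion models. It assumes the motion module is attention-based and omits the learned query, key, and value projections entirely; `temporal_self_attention` is a hypothetical name, not a function from the AniPortrait codebase.

```python
import numpy as np

def temporal_self_attention(features):
    """Mix information across frames at each spatial position.

    features: (frames, positions, channels) array of per-frame feature maps.
    Returns an array of the same shape.
    """
    x = np.transpose(features, (1, 0, 2))                 # (positions, frames, channels)
    scores = x @ np.transpose(x, (0, 2, 1))               # (positions, frames, frames)
    scores = scores / np.sqrt(x.shape[-1])                # scaled dot-product
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over frames
    out = weights @ x                                     # weighted mix of frames
    return np.transpose(out, (1, 0, 2))                   # back to (frames, positions, channels)

# Toy usage: 8 frames, 16 spatial positions, 4 channels of random features.
rng = np.random.default_rng(0)
frames = rng.normal(size=(8, 16, 4))
smoothed = temporal_self_attention(frames)
print(smoothed.shape)
```

Because every output frame is a weighted blend of all frames at the same location, neighboring frames cannot drift apart arbitrarily, which is the intuition behind temporal consistency.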
To get started with AniPortrait, clone the GitHub repository and follow the setup instructions. The repository includes detailed documentation, examples, and pre-trained models to help you generate your own animations.