VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time

Sicheng Xu^*, Guojun Chen^*, Yu-Xiao Guo^*, Jiaolong Yang^*‡,
Chong Li, Zhenyu Zang, Yizhong Zhang, Xin Tong, Baining Guo

Microsoft Research Asia
^*Equal Contributions ^‡Corresponding Author: jiaoyan@microsoft.com

TL;DR: single portrait photo + speech audio = hyper-realistic talking face video with precise lip-audio sync, lifelike facial behavior, and naturalistic head movements, generated in real time.

Abstract

We introduce VASA, a framework for generating lifelike talking faces of virtual characters with appealing visual affective skills (VAS), given a single static image and a speech audio clip. Our premiere model, VASA-1, is capable of not only producing lip movements that are exquisitely synchronized with the audio, but also capturing a large spectrum of facial nuances and natural head motions that contribute to the perception of authenticity and liveliness. The core innovations include a holistic facial dynamics and head movement generation model that works in a face latent space, and the development of such an expressive and disentangled face latent space using videos. Through extensive experiments including evaluation on a set of new metrics, we show that our method significantly outperforms previous methods along various dimensions comprehensively. Our method not only delivers high video quality with realistic facial and head dynamics but also supports the online generation of 512×512 videos at up to 40 FPS with negligible starting latency. It paves the way for real-time engagements with lifelike avatars that emulate human conversational behaviors.

(Note: all portrait images on this page are virtual, non-existing identities generated by StyleGAN2 or DALL·E-3 (except for Mona Lisa). We are exploring visual affective skill generation for virtual, interactive characters, NOT impersonating any person in the real world. This is only a research demonstration and there’s no product or API release plan. See also the bottom of this page for more of our Responsible AI considerations.)

Realism and liveliness

Our method is capable of not only producing precious lip-audio synchronization, but also generating a large spectrum of expressive facial nuances and natural head motions. It can handle arbitary-length audio and stably output seamless talking face videos.

Examples with audio input of one minute long

More shorter examples with diverse audio input

Out-of-distribution generalization

Our method exhibits the capability to handle photo and audio inputs that are out of the training distribution. For example, it can handle artistic photos, singing audios, and non-English speech. These types of data were not present in the training set.

Real-time efficiency

Our method generates video frames of 512×512 size at 45fps in the offline batch processing mode, and can support up to 40fps in the online streaming mode with a preceding latency of only 170ms , evaluated on a desktop PC with a single NVIDIA RTX 4090 GPU.

Risks and responsible AI considerations

Our research focuses on generating visual affective skills for virtual AI avatars, aiming for positive applications. It is not intended to create content that is used to mislead or deceive. However, like other related content generation techniques, it could still potentially be misused for impersonating humans. We are opposed to any behavior to create misleading or harmful contents of real persons, and are interested in applying our technique for advancing forgery detection. Currently, the videos generated by this method still contain identifiable artifacts, and the numerical analysis shows that there’s still a gap to achieve the authenticity of real videos.

While acknowledging the possibility of misuse, it’s imperative to recognize the substantial positive potential of our technique. The benefits – such as enhancing educational equity, improving accessibility for individuals with communication challenges, offering companionship or therapeutic support to those in need, among many others – underscore the importance of our research and other related explorations. We are dedicated to developing AI responsibly, with the goal of advancing human well-being.

Given such context, we have no plans to release an online demo, API, product, additional implementation details, or any related offerings until we are certain that the technology will be used responsibly and in accordance with proper regulations.

VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time

Abstract

Realism and liveliness

Out-of-distribution generalization

Real-time efficiency

Risks and responsible AI considerations

Vasa 1 ai

How to use vasa 1

Vasa 1 try

vasa-1 huggingface

vasa-1 release date

vasa-1 download