VIVE3D: Meta’s GAN-Based Deepfake and Video-Altering Framework

About the author

Martin Anderson

I'm Martin Anderson, a writer occupied exclusively with machine learning, artificial intelligence, big data, and closely-related topics, with an emphasis on image synthesis, computer vision, and NLP.

It seems that Generative Adversarial Networks (GANs) are finally off the back foot, after more than six months in which latent diffusion models such as Stable Diffusion (SD) appeared to have consigned them to relative obscurity in terms of state-of-the-art image synthesis.

A few weeks ago, GigaGAN, the first truly powerful GAN-based system for SD-style text-to-image generation, made its debut, with impressive results obtainable in a fraction of the time required by LDM systems.

Now, a new system called VIVE3D, a collaboration between Meta's research arm and Saudi Arabia's King Abdullah University of Science and Technology (KAUST), offers GAN-based deepfaking and face editing at a quality and effectiveness that overshadows many similar GAN-centric editing projects of recent years, and rivals more recent approaches.

By jointly embedding and processing multiple source frames at a time, VIVE3D is capable of diverse types of video-based manipulations. One of the most impressive of these is the ability to change the facial angle of the subject in a source video:

President Obama (source video in center) is made to orient his head in diverse directions by VIVE3D. See source video (embedded at bottom of this article) for better resolution and definition. Source: https://www.youtube.com/watch?v=qfYGQwOw8pg

We have already seen both GAN- and NeRF-based systems produce variously plausible 'roll-around' recreations of people: limited arcs of motion that indicate a 3D face, but in which the face itself remains static and immobile. VIVE3D's supplementary materials do indeed show us this:

Familiar fare, as multiple source images are concatenated into a (scarcely) explorable moving image through VIVE3D, complete with obtained depth images (far right).

The difference with VIVE3D is that once it has assimilated the source material, it can not only recreate the original video neurally, but also use the obtained latent codes to transfer alternate identities into the video, in a manner similar to autoencoder-based deepfake systems. Gender, expression (which can be changed), age and glasses (or the lack of them; glasses can be added!) present no obstacle:

In this case, the source subject, a middle-aged man, is changed into a younger woman, with glasses removed, the expression changed, and a range of possible facial directions provided.

Age can also be changed for the same source subject:

A range of ages can be superimposed for a single identity in a source video. This is a still from a moving video – see the embedded video at end for movement and better resolution.

Though it is quite common to be able to remove glasses (something that VIVE3D does about as well as any of its predecessors)…

Along with other transformations, glasses can be removed as necessary. This is a still from a moving video – see the embedded video at end for movement and better resolution.

…VIVE3D can also add glasses where there were none, or change the type of glasses:

Various types of glasses can be added to source footage, or existing glasses changed to a different style.

Likewise, individual facets such as hair color and facial expression can be liberally edited in VIVE3D:

Changing an individual's appearance, including hair color and facial expression. This is a still from a moving video – see the embedded video at end for movement and better resolution.

In terms of age-changing, VIVE3D offers improved qualitative results over the prior methods VideoEditGAN (VEG) and Stitch it in Time (StiiT):

Comparing VIVE3D's ageing of President Obama to former methods. This is a still from a moving video – see the embedded video at end for movement and better resolution.

The new paper is titled VIVE3D: Viewpoint-Independent Video Editing using 3D-Aware GANs, and comes from five researchers at Meta Reality Labs Research in Sausalito and at KAUST.

Approach

VIVE3D uses a data-specific 3D-aware generator that infers 3D geometry from images via a photogrammetry-style method that’s not dissimilar to NeRF, which (in its base form) is also trained exclusively on 2D images.

The crucial innovation in the system is that the generator is simultaneously trained on multiple frames, rather than sequentially on single images – again, similar to a NeRF-style methodology.

The personalized generator at the heart of the system. Source: https://arxiv.org/pdf/2303.15893.pdf

The personalized generator inverts the source images, converting them into latent vectors that are pixel-independent (though, at this stage, their maximum perceptual resolution is no higher than that of the source images themselves, since they have not yet passed through any upsampling process).

These inverted images need to be optimized during this process of adaptation, which is handled by two losses: a basic pixel loss and the Learned Perceptual Image Patch Similarity (LPIPS) metric. Together, these tell the training system how well the optimization is going, and the extent to which further optimization may be needed.
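As a rough illustration of how such a joint inversion might be driven by these two losses, the sketch below (not the authors' code) optimizes a single shared latent code against several source frames using a plain L2 pixel loss plus the off-the-shelf lpips package; the generator object, its w_dim attribute and its call signature are assumed stand-ins for a 3D-aware generator such as EG3D.

```python
# Minimal sketch of joint multi-frame GAN inversion with pixel + LPIPS losses.
# `generator` and its call signature are hypothetical stand-ins for a pretrained
# 3D-aware generator (e.g. EG3D); only the loss structure reflects the text above.
import torch
import lpips  # pip install lpips

device = "cuda" if torch.cuda.is_available() else "cpu"
perceptual = lpips.LPIPS(net="vgg").to(device)  # LPIPS perceptual metric

def invert_frames(generator, frames, cameras, steps=500, lr=0.01):
    """Optimize one shared latent code so the generator reproduces every
    selected source frame under that frame's estimated camera pose."""
    # frames: list of (1, 3, H, W) tensors in [-1, 1]; cameras: per-frame pose params
    w = torch.zeros(1, generator.w_dim, device=device, requires_grad=True)
    opt = torch.optim.Adam([w], lr=lr)

    for _ in range(steps):
        opt.zero_grad()
        loss = 0.0
        for frame, cam in zip(frames, cameras):
            recon = generator(w, cam)                       # assumed call signature
            pixel_l2 = torch.mean((recon - frame) ** 2)     # basic pixel loss
            lpips_d = perceptual(recon, frame).mean()       # perceptual (LPIPS) loss
            loss = loss + pixel_l2 + lpips_d
        loss.backward()
        opt.step()
    return w.detach()
```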

The first of the two components that handle this personalized generator is Efficient Geometry-aware 3D Generative Adversarial Networks (EG3D), which maps random vectors into a semantically meaningful space, called w-space.

EG3D produces high-quality 3D geometry in a GAN context. Source: https://arxiv.org/pdf/2112.07945.pdf

The second key factor in the personalized generator is a 2D-based upsampler that generates 4x super-resolution from the original output of the first stage. These two processes project the source material into w-space in a manner similar to work from the 2021 paper Pivotal Tuning for Latent-based Editing of Real Images, according to the new paper’s authors.

Examples of 'pivotal tuning inversion', from the 2021 workflow utilized in VIVE3D. Source: https://arxiv.org/pdf/2106.05744.pdf

Once the latent codes exist in w-space, a global latent code is identified that captures the obtained features.

The pipeline for VIVE3D.

The authors found that evaluating the lower-resolution samples in the pipeline (i.e. the least 'adulterated' images) yielded more accurate results for the overall process than using upsampled imagery. To ensure that the target identity is maintained throughout the upsampling and inversion process, the BiSeNet semantic segmentation system is run over the incoming results, with facial regions such as noses and eyes encoded into color-coded areas.

BiSeNet face segmentation helps to maintain consistency of identity throughout the process of inverting real photos into the latent space. Source: https://www.researchgate.net/figure/An-example-of-face-segmentation-using-BiseNet_fig1_359685832
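As a hedged illustration of how such a parsing map could be used, the sketch below restricts a reconstruction loss to facial regions so that identity-bearing areas dominate the optimization; the class indices and tensor shapes are assumptions for illustration, not BiSeNet's actual output format.

```python
# Hypothetical use of a face-parsing map to confine a reconstruction loss to
# facial regions; class indices below are illustrative, not BiSeNet's real ones.
import torch

FACE_CLASSES = [1, 2, 3, 4, 5]  # e.g. skin, eyes, nose, lips, ears (assumed labels)

def masked_face_loss(recon, target, parsing):
    """L2 loss computed only where the parsing map marks facial regions."""
    # recon, target: (B, 3, H, W) images; parsing: (B, H, W) integer class map
    mask = torch.zeros_like(parsing, dtype=torch.bool)
    for c in FACE_CLASSES:
        mask |= (parsing == c)
    mask = mask.unsqueeze(1).float()  # (B, 1, H, W), broadcast across channels
    return ((recon - target) ** 2 * mask).sum() / mask.sum().clamp(min=1.0)
```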

Camera parameters such as yaw and pitch are estimated and considered during these calculations, so that the latent face obtained comes with a ‘real world’ photographic context that helps alignment with true video target footage.
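Purely by way of illustration, a camera rotation of this kind can be composed from estimated yaw and pitch angles along the lines below; the axis conventions and handedness are assumptions, not necessarily those of EG3D or VIVE3D.

```python
# Illustrative composition of a camera rotation from yaw and pitch estimates,
# so a generated face can be rendered under the source frame's viewpoint.
import numpy as np

def camera_rotation(yaw, pitch):
    """Rotation matrix: yaw about the y-axis, then pitch about the x-axis (radians)."""
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    r_yaw = np.array([[ cy, 0.0,  sy],
                      [0.0, 1.0, 0.0],
                      [-sy, 0.0,  cy]])
    r_pitch = np.array([[1.0, 0.0, 0.0],
                        [0.0,  cp, -sp],
                        [0.0,  sp,  cp]])
    return r_pitch @ r_yaw
```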

The authors note:

‘A key advantage of this joint optimization is that the facial characteristics of the person preserve their high fidelity even when seen from novel views. When inverting a single image of a side-facing person into the EG3D latent space, exploring other viewpoints of the inverted latent can lead to significant distortions. Often, unseen features (e.g. hidden ears) can be blurry or distorted, and the identity no longer resembles the input from a different viewpoint.

‘The joint inversion, however, ensures that the different views are embedded closely enough in latent space such that even unseen views yield consistently identity-preserving outputs.’

Conversion and Editing

With the base generator and framework now ready to provide an editable source identity, the target video footage must also be prepared. For this, facial keypoints are extracted from each frame in order to determine the location of the bounding box surrounding the face, which is the only area that will be changed in the transformed video.

The keypoints are likely to be quite jittery between frames, and therefore (as is common with popular deepfake packages such as DeepFaceLab and FaceSwap, each of which uses the same FAN Align extractor as VIVE3D) the continuity of these facial landmarks is smoothed out with Gaussian averaging.

The authors note, however, that excessive smoothing would interfere with the system's ability to handle sudden or abrupt movements, where major changes in landmarks can naturally be expected from one frame to the next (also a common issue with the aforementioned deepfake packages).
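A minimal sketch of these two preparation steps, assuming that 68-point landmarks (of the kind produced by FAN-style detectors) are already available for every frame as a (T, 68, 2) array, might look like the following; the margin and sigma values are arbitrary choices for illustration.

```python
# Sketch: per-frame face bounding boxes from landmarks, plus Gaussian temporal
# smoothing of the landmark trajectories to suppress frame-to-frame jitter.
import numpy as np
from scipy.ndimage import gaussian_filter1d

def smooth_landmarks(landmarks, sigma=2.0):
    """Gaussian-average each landmark coordinate over time (axis 0 = frames).
    Larger sigma gives steadier boxes but reacts more slowly to abrupt motion."""
    return gaussian_filter1d(landmarks, sigma=sigma, axis=0)

def face_bbox(frame_landmarks, margin=0.2):
    """Axis-aligned bounding box around one frame's landmarks, padded by a margin."""
    mins, maxs = frame_landmarks.min(axis=0), frame_landmarks.max(axis=0)
    pad = (maxs - mins) * margin
    x0, y0 = mins - pad
    x1, y1 = maxs + pad
    return int(x0), int(y0), int(x1), int(y1)
```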

After this, frame-by-frame inversion is performed between the generator and the target footage, and the data is ready for altering.

The authors note that EG3D, foundational to VIVE3D, is built over StyleGAN2, which allows the user to address various latent space directions (i.e., you can navigate between 'blonde' and 'brown' hair inside the neural network, or between 'male'- and 'female'-coded latent codes). Thus the great number of possible attribute edits demonstrated above becomes possible.

Navigating between various latent directions allows for numerous types of facial editing, as we have also seen in earlier examples illustrated above.
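In code terms, this style of attribute editing typically amounts to moving a latent code along a learned semantic direction, as in the hedged sketch below; the direction vectors themselves are assumed to come from elsewhere (for instance from InterfaceGAN-style analysis, or supplied by the framework), and the names used here are hypothetical.

```python
# Hypothetical latent-direction edit in a StyleGAN-style w-space.
import torch

def apply_edit(w, direction, strength):
    """Move a latent code along a (unit-normalized) semantic direction.
    w: (1, w_dim) latent; direction: (w_dim,) vector; strength: signed scalar."""
    return w + strength * direction / direction.norm()

# e.g. aged_w = apply_edit(w, age_direction, 3.0)   # 'age_direction' is assumed
```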

Now that all these edits can be performed, the challenge remains to integrate them into the source video without undermining the credibility of the footage. Since some of the changes involve actually altering the direction in which the person's head is facing, many of the potential amendments are prone to integrate badly, a problem that VIVE3D tackles through optical flow correction.

Optical flow estimates the apparent per-pixel motion between images, producing a dense field of displacement vectors that can be examined and mapped, not unlike the way that a sound wave can be rendered as an image and 'painted over' to remove background noise or unwanted voices, as if it were a bitmap (as with Adobe Audition's spectral display).

VIVE3D's view adjustment workflow in action, accounting for the way that transformations won't necessarily accord with the source video.

To accomplish this re-alignment, the authors convert the original and altered images to grayscale and use Farneback optical flow to compute a dense field delineating the displacement between the original and altered placements of the target faces (see image above). The resulting differential is then smoothed and refined by several further methods before being used as a template for placement.
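The Farneback step itself is available off the shelf in OpenCV. The sketch below shows how such a dense field could be computed between an original and an edited face crop; the parameter values are commonly used OpenCV settings, not necessarily those chosen by the authors.

```python
# Dense Farneback optical flow between an original and an edited face crop.
import cv2

def dense_flow(original_bgr, edited_bgr):
    """Per-pixel displacement field (H, W, 2) from the original crop to the edit."""
    prev_gray = cv2.cvtColor(original_bgr, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(edited_bgr, cv2.COLOR_BGR2GRAY)
    # args: prev, next, flow, pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags
    return cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
```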

Tests

The researchers conducted a variety of qualitative and quantitative tests, pitting VIVE3D against StiiT and VEG. Since the three frameworks are not entirely like-for-like in their capabilities, settings and metrics, some accommodation (as is common in these cases) was made to provide a reasonable basis for comparison.

Only facial metrics, for instance, were considered in these tests, even though some of the frameworks have wider possible applications.

The first test was for inversion quality: the extent to which the original source material is re-represented in the reconstructed output. For this test, the authors evaluated StiiT and VIVE3D on Peak Signal-to-Noise Ratio (PSNR) and the Structural Similarity Index (SSIM), using 16 videos from the VoxCeleb dataset. Fréchet Inception Distance (FID) was also used as a metric for age and angle editing:

Results for inversion/reconstruction quality.

Of these results, the authors state:

‘Both methods perform well on the reconstruction of the input signal and the final reconstruction quality of our technique is on par with StiiT.’

They also note, of the FID-based results, that VIVE3D scores very well in this respect. Since FID measures how closely the distribution of generated output matches that of real imagery, and therefore serves as a broad proxy for realism and recognizability, this is an important victory for the system.
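For readers wishing to compute the same reconstruction metrics on their own frame pairs, PSNR and SSIM are available in scikit-image, as in the sketch below; it assumes uint8 RGB frames of identical size and a recent scikit-image version (older releases use multichannel=True instead of channel_axis).

```python
# Frame-level PSNR and SSIM between an original and a reconstructed frame.
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def reconstruction_metrics(original, reconstructed):
    """original, reconstructed: uint8 RGB arrays of shape (H, W, 3)."""
    psnr = peak_signal_noise_ratio(original, reconstructed, data_range=255)
    ssim = structural_similarity(original, reconstructed,
                                 channel_axis=-1, data_range=255)
    return psnr, ssim
```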

Next, the researchers tested for face fidelity, based on the ArcFace metric developed at Imperial College London. Here, the dissimilarity between adjacent frames was measured as an index of temporal coherence:

Results for the face similarity metrics test.

Regarding these results, the researchers observe that VIVE3D's output lacks the jarring artefacts of VEG, and that the system is able to faithfully reconstruct faces even from (simulated) angles different to those that appear in the source video material.
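The temporal-coherence measure behind this test can be sketched as follows, with embed_face standing in as a hypothetical wrapper around an ArcFace-style recognition network (the real metric's exact aggregation may differ):

```python
# Mean cosine similarity of face embeddings between consecutive frames; lower
# values indicate identity flicker, i.e. poorer temporal coherence.
import numpy as np

def adjacent_frame_similarity(frames, embed_face):
    """frames: list of images; embed_face: callable returning a 1-D feature vector."""
    embeddings = [embed_face(f) for f in frames]
    sims = []
    for a, b in zip(embeddings[:-1], embeddings[1:]):
        sims.append(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return float(np.mean(sims))
```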

In calculating resource usage, VIVE3D also proved to be notably more efficient than the competition. The comparisons were run on a sample video with a 1920x1080px resolution on a single NVIDIA A100 GPU with 40GB of VRAM:

Though VEG has slightly lower VRAM requirements than VIVE3D, it also comes with notably extended processing times, offsetting any potential electricity savings.

Qualitative tests followed, beginning with the synthesizing of novel (unseen) views of target subjects. See the paper for the full range of tests conducted, but the tests for ‘viewpoint changes’ are noteworthy here:

Rival methods cannot match VIVE3D's capability to plausibly change the angle of the head; missing margins and other artefacts destroy the verisimilitude of their results.

The system was also tested for its ability to cope with extended or challenging boundaries, such as cases where long hair (which may need to be altered in some way) extends beyond the typical bounding box of the transformation. VIVE3D coped well with this scenario too, producing no abrupt transitions from the altered region to the original source.

Conclusion

VIVE3D is the first GAN-based system capable of plausibly changing head direction while enabling a wide variety of other GAN-based facial editing techniques, all of which can be applied to video. It is further evidence that Generative Adversarial Networks are not only beginning to encroach on the new wave of interest in latent diffusion generative AI systems, but that they may hold advantages over both LDMs and autoencoder-based deepfakes, not least because of the native disentanglement of a GAN's latent space.

Together with GigaGAN, VIVE3D may prove to be one of the pivotal motivations to revisit the GAN as a promising method of versatile, video-based facial image synthesis, particularly since the static-image editing tools the architecture already possesses may adapt well to video-based workflows.
