Generative Adversarial Networks (GANs) finally seem to be off the back foot, after more than six months in which latent diffusion models such as Stable Diffusion (SD) appeared to have consigned them to relative obscurity in terms of state-of-the-art image synthesis.
A few weeks ago GigaGAN, the first truly powerful GAN-based rival to SD-style text-to-image generation, was debuted, with impressive results obtainable in a fraction of the time required by latent diffusion model (LDM) systems.
Now a new system called VIVE3D, a collaboration between Meta's Reality Labs research division and Saudi Arabia's King Abdullah University of Science and Technology (KAUST), offers GAN-based deepfaking and face editing at a quality and effectiveness that overshadows many similar GAN-centric editing projects of recent years, and rivals more recent approaches.
By jointly embedding and processing multiple source frames at a time, VIVE3D is capable of diverse types of video-based manipulations. One of the most impressive of these is the ability to change the facial angle of the subject in a source video:
We have already seen both GAN and NeRF-based systems produce variously plausible 'roll-around' recreations of people – limited arcs of camera motion that demonstrate a 3D face, but in which the face itself remains static and immobile – and VIVE3D's supplementary materials do indeed show us this:
The difference with VIVE3D is that once it has assimilated the source material, it can not only recreate the original video neurally, but also use the obtained latent codes to transfer alternate identities into the video, in a manner similar to autoencoder-based deepfake systems – and gender, expression (which can be changed), age and glasses (or the lack of glasses – they can be added!) present no obstacle:
Age can also be changed for the same source subject:
Though it is quite common to be able to remove glasses (something that VIVE3D does about as well as any of its predecessors)…
…VIVE3D can also add glasses where there were none, or change the type of glasses:
Likewise, individual facets such as hair color and facial expression can be liberally edited in VIVE3D:
In terms of age-changing, VIVE3D offers improved qualitative results over the prior methods VideoEditGAN (VEG) and Stitch it in Time (STIT):
The new paper is titled VIVE3D: Viewpoint-Independent Video Editing using 3D-Aware GANs, and comes from five researchers across Meta Reality Labs Research, Sausalito, and KAUST.
VIVE3D uses a data-specific 3D-aware generator that infers 3D geometry from images via a photogrammetry-style method that’s not dissimilar to NeRF, which (in its base form) is also trained exclusively on 2D images.
The crucial innovation in the system is that the generator is simultaneously trained on multiple frames, rather than sequentially on single images – again, similar to a NeRF-style methodology.
The personalized generator inverts the source images – that is, it converts them into latent vectors that are no longer tied to individual pixels, although at this stage their maximum perceptual resolution is no higher than that of the source images themselves, since they have not yet passed through any upsampling.
These inversions are optimized through a process of adaptation, handled by two measures: a basic pixel loss and the Learned Perceptual Image Patch Similarity (LPIPS) metric. Together, these tell the training system how well the optimization is going, and the extent to which further optimization may be needed.
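As an illustration of how these two measures combine, here is a minimal Python sketch of a two-part inversion objective. The real system would use the actual LPIPS network for the perceptual term; the stand-in below merely compares block-averaged images, and the weighting `lam` is an assumed value, not the paper's:

```python
import numpy as np

def inversion_loss(generated, target, lam=0.8):
    """Schematic two-part inversion objective: a plain per-pixel L2
    term plus a perceptual term. (LPIPS would supply the perceptual
    distance in the actual system; here a crude 4x4 block-average
    comparison keeps the sketch self-contained.)"""
    pixel = np.mean((generated - target) ** 2)

    def pool(img):  # 4x4 block average, a toy 'perceptual' downsampling
        h, w = img.shape
        return img.reshape(h // 4, 4, w // 4, 4).mean(axis=(1, 3))

    perceptual = np.mean((pool(generated) - pool(target)) ** 2)
    return pixel + lam * perceptual
```

During inversion, the latent code behind `generated` would be iteratively updated to minimize this value; a perfect reconstruction scores zero.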
The first of the two components behind this personalized generator is the Efficient Geometry-aware 3D GAN (EG3D) framework, which maps random vectors into a semantically meaningful space known as w-space.
The second key factor in the personalized generator is a 2D-based upsampler that generates 4x super-resolution from the original output of the first stage. These two processes project the source material into w-space in a manner similar to work from the 2021 paper Pivotal Tuning for Latent-based Editing of Real Images, according to the new paper’s authors.
Once the latent codes exist in w-space, a global latent code is identified that captures the features shared across the inverted frames.
The authors found that evaluating the lower-resolution samples in the pipeline (i.e. the least 'adulterated' images) yielded more accurate results for the overall process than using upsampled imagery. To ensure that the target identity is maintained throughout the upsampling and inversion process, the BiSeNet semantic segmentation system is run over the incoming results, with facial regions such as noses and eyes encoded as color-indexed areas.
Camera parameters such as yaw and pitch are estimated and considered during these calculations, so that the latent face obtained comes with a ‘real world’ photographic context that helps alignment with true video target footage.
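The kind of camera parameterization involved can be sketched as a toy camera-to-world pose built from yaw and pitch angles, with the camera orbiting the face at a fixed distance, as 3D-aware GANs such as EG3D parameterize their rendering camera. The function name, orbit radius and axis conventions here are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def camera_pose(yaw, pitch, radius=2.7):
    """Toy 4x4 camera-to-world pose from estimated yaw/pitch
    (radians), orbiting the subject at a fixed radius. Values and
    conventions are illustrative assumptions only."""
    # Rotation about the vertical axis (yaw), then the lateral axis (pitch).
    Ry = np.array([[ np.cos(yaw), 0, np.sin(yaw)],
                   [ 0,           1, 0          ],
                   [-np.sin(yaw), 0, np.cos(yaw)]])
    Rx = np.array([[1, 0,              0             ],
                   [0, np.cos(pitch), -np.sin(pitch)],
                   [0, np.sin(pitch),  np.cos(pitch)]])
    R = Ry @ Rx
    t = R @ np.array([0.0, 0.0, radius])  # camera position on the orbit
    pose = np.eye(4)
    pose[:3, :3] = R
    pose[:3, 3] = t
    return pose
```

With yaw and pitch both zero, the camera sits directly in front of the face; changing yaw swings it around the head, which is the degree of freedom VIVE3D's novel-view edits exploit.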
The authors note:
‘A key advantage of this joint optimization is that the facial characteristics of the person preserve their high fidelity even when seen from novel views. When inverting a single image of a side-facing person into the EG3D latent space, exploring other viewpoints of the inverted latent can lead to significant distortions. Often, unseen features (e.g. hidden ears) can be blurry or distorted, and the identity no longer resembles the input from a different viewpoint.
‘The joint inversion, however, ensures that the different views are embedded closely enough in latent space such that even unseen views yield consistently identity-preserving outputs.’
Conversion and Editing
With the base generator and framework now adequately ready to provide an editable source identity, the target video footage must now also be prepared. For this, facial keypoints are extracted from each frame in order to determine the location of the bounding box which surrounds the face (which is the only area which will be changed in the transformed video).
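The keypoints-to-bounding-box step can be sketched in a few lines; the margin value and helper name below are illustrative assumptions, not VIVE3D's actual cropping rules:

```python
import numpy as np

def face_bbox(keypoints, margin=0.3):
    """Axis-aligned crop around detected facial landmarks.
    `keypoints` is an (N, 2) array of (x, y) positions; `margin`
    expands the box relative to its width/height so that the edit
    region covers the whole head. (Illustrative values only.)"""
    x0, y0 = keypoints.min(axis=0)
    x1, y1 = keypoints.max(axis=0)
    w, h = x1 - x0, y1 - y0
    return (x0 - margin * w, y0 - margin * h,
            x1 + margin * w, y1 + margin * h)
```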
The keypoints are likely to be quite jittery between frames, and therefore (as is common with popular deepfakes packages such as DeepFaceLab and FaceSwap, each of which uses the same FAN Align extractor as VIVE3D) the continuity of these facial landmarks is smoothed out with Gaussian averaging.
The authors note, however, that excessive smoothing would interfere with the system’s ability to handle sudden or abrupt movements, where major changes in landmarks can naturally be expected from one frame to the next (also a common issue with the aforementioned popular deepfakes packages).
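The Gaussian averaging of landmark trajectories might look something like the sketch below, using SciPy's 1D Gaussian filter along the time axis only. The sigma value is an assumption; as noted above, raising it trades jitter for lag on abrupt motion:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def smooth_landmarks(tracks, sigma=2.0):
    """Temporally smooth per-frame landmark tracks of shape
    (frames, landmarks, 2). Smoothing runs only along the time
    axis (axis 0), so the spatial layout of each face is untouched.
    Larger sigma = steadier but laggier landmarks."""
    return gaussian_filter1d(tracks, sigma=sigma, axis=0)
```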
After this, frame-by-frame inversion is performed between the generator and the target footage, and the data is ready for altering.
The authors note that EG3D, foundational to VIVE3D, is built over StyleGAN2, which allows the user to address various latent space directions (i.e., you can navigate between 'blonde' and 'brown' hair inside the neural network, or between 'male' and 'female'-coded latent codes). This is what makes possible the wide range of attribute edits demonstrated above.
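The basic mechanics of this kind of latent-direction editing can be sketched as a simple vector operation. The direction vectors themselves must be discovered in the trained GAN's latent space (methods such as InterFaceGAN do this for StyleGAN-family models); the function and variable names here are purely illustrative:

```python
import numpy as np

def edit_latent(w, direction, strength):
    """Move an inverted w-space code along a semantic direction
    (e.g. an 'age' or 'glasses' axis). `strength` controls how far
    the edit pushes; positive and negative values move in opposite
    semantic directions. The direction is normalized so strength
    has a consistent meaning across directions."""
    unit = direction / np.linalg.norm(direction)
    return w + strength * unit
```

In a real model, something like `edit_latent(w, age_direction, 3.0)` would age the subject, while `-3.0` would rejuvenate them; `age_direction` is a hypothetical pre-computed vector, not part of any published API.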
Now that all these edits can be performed, the challenge remains to integrate them into the source video without undermining the credibility of the footage for the viewer. Since some of these changes involve actually altering the direction in which the person's head is facing, many of the potential amendments are prone to integrate badly – a problem that VIVE3D tackles through optical flow correction.
Optical flow computes a dense field of motion vectors between video frames, effectively unrolling the movement in a clip into a flat, navigable space that can be examined and mapped – not unlike the way that a sound wave can be turned into an image and 'painted over' to remove background noise or unwanted voices, as if it were a bitmap (as with Adobe Audition's spectral display).
To accomplish this re-alignment, the authors convert the original and altered images to grayscale and use Farneback optical flow to define a dense field that delineates the original and altered placement of the target faces (see image above). The resulting differential is then massaged and smoothed by various other methods before being used as a template for placement.
The researchers conducted a variety of qualitative and quantitative tests, pitting VIVE3D against STIT and VEG. Since the three frameworks are not entirely like-for-like in their capabilities, settings and metrics, some accommodation (as is common in these cases) was made to establish reasonable bases for comparison.
Only facial metrics, for instance, were considered in these tests, even though some of the frameworks take in wider possible applications.
The first test was for inversion quality – the extent to which the original source material is reproduced in the reconstructed output. For this test, the authors evaluated STIT and VIVE3D on the metrics Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM), using 16 videos from the VoxCeleb dataset. Fréchet Inception Distance (FID) was also used as a metric for age and angle editing:
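Both PSNR and SSIM are standard metrics with off-the-shelf implementations – for instance in scikit-image, as in this small sketch comparing a synthetic 'reconstruction' against its source frame (the images here are random placeholders, not data from the paper):

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# A synthetic source frame and a lightly corrupted 'reconstruction'.
rng = np.random.default_rng(1)
original = rng.random((64, 64))
reconstruction = np.clip(original + rng.normal(0.0, 0.05, original.shape), 0.0, 1.0)

# Higher PSNR (in dB) and higher SSIM (max 1.0) both mean a closer match.
psnr = peak_signal_noise_ratio(original, reconstruction, data_range=1.0)
ssim = structural_similarity(original, reconstruction, data_range=1.0)
```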
Of these results, the authors state:
‘Both methods perform well on the reconstruction of the input signal and the final reconstruction quality of our technique is on par with STIT.’
They also note, of the FID-based results, that VIVE3D scores very well in this respect. Since FID essentially measures how closely generated output resembles authentic imagery, this is an important victory for the system.
Next, the researchers tested for face fidelity, based on the Imperial College London ArcFace metric. Here the dissimilarity between adjacent frames was measured, as an index of temporal coherence:
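The adjacent-frame dissimilarity measure can be sketched as a mean cosine distance over per-frame identity embeddings. In the real evaluation those embeddings would come from an ArcFace recognition network; here they are placeholder vectors, and the function name is purely illustrative:

```python
import numpy as np

def identity_drift(embeddings):
    """Mean cosine distance between identity embeddings of adjacent
    frames (shape: frames x dims). Lower values mean the perceived
    identity is more temporally stable across the video."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    cosines = np.sum(unit[:-1] * unit[1:], axis=1)
    return float(np.mean(1.0 - cosines))
```

A perfectly stable identity scores zero; a total identity switch between two frames (orthogonal embeddings) scores 1.0.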
Regarding these results, the researchers observe that VIVE3D's output lacks the jarring artefacts of VEG, and that the system is able to faithfully reconstruct faces even from (simulated) angles different to those that appear in the source video material.
In calculating resource usage, VIVE3D also proved to be notably more efficient than the competition. The comparisons were run on a sample video with a 1920x1080px resolution on a single NVIDIA A100 GPU with 40GB of VRAM:
Qualitative tests followed, beginning with the synthesizing of novel (unseen) views of target subjects. See the paper for the full range of tests conducted, but the tests for ‘viewpoint changes’ are noteworthy here:
The system was also tested for its ability to cope with extended or challenging boundaries, such as cases where long hair (which may need to be altered in some way) extends beyond the typical bounding box of transformation. VIVE3D was able to cope well with this scenario too, with no abrupt transitions from the altered to the original source material:
VIVE3D is the first GAN-based system capable of plausibly changing head direction while enabling a wide variety of other GAN-based facial editing techniques, all applicable to video – further evidence not only that Generative Adversarial Networks are beginning to encroach on the new wave of interest in latent diffusion generative AI systems, but that they may have advantages over both LDMs and autoencoder-based deepfakes, not least because of the native disentanglement of a GAN's latent space.
Together with GigaGAN, VIVE3D may prove to be one of the pivotal motivations to revisit the GAN as a promising method of video-based and versatile facial image synthesis, particularly since the static-image editing tools it already had may adapt well to video-based workflows.