Researchers at Tel Aviv University have devised a personalized deepfake generator that, after briefly scanning a user’s facial expressions and angles (similar to the one-off face-scanning that enables Apple’s Face ID in iOS devices), can then use that user’s facial movement to drive the facial actions of another identity.
Above, the ‘selfie’ capture that provides fodder for the GAN-assisted generator system. Below, sample results, as the ‘captured’ user controls an alternate identity. Sources: https://arxiv.org/pdf/2307.06307.pdf and https://arielazary.github.io/PGR/
The new system generalizes the information that it needs regarding how faces work primarily from the user’s brief face scan, though it does also use weights from the FFHQ web-scraped dataset as a reference for the way that faces may be posed, or the way that expressions manifest.
Using trained datasets featuring thousands of identities is a sure way to obtain the coverage that will be needed for a potential neural reproduction of a face. If a generative system that’s trying (for example) to recreate Tom Cruise doesn’t have in its training data a source image of Cruise smiling broadly at an extreme angle, a pre-trained system powered by millions of web-scraped images will almost certainly have a usable latent prior onto which the Cruise identity can be superimposed.
However, it has to be admitted that freely using a hyperscale dataset of this nature has begun to draw ever-more intense attention from the legal departments of copyright holders. While the new system proposed by the researchers keys on the user’s own scan, the scans are actually made useful and ductile by being incorporated into the latent space of a Generative Adversarial Network (GAN) that has been trained on the web-scraped FFHQ dataset.
In an ideal world, future systems of this type would be able to generate conceptual models of the human face by studying one face alone (i.e., the data obtained from the brief phone-scan previously described), instead of needing the trained priors from datasets of dubious legal provenance; or else have access to a truly hyperscale but open source IP-respecting database (a project that could take years, or even decades to compile).
Nonetheless, no-one else has solved this ‘legacy’ legal quandary either, and the proposed new system offers an extraordinarily accurate form of deepfake puppetry, using a number of novel approaches that are inspired by, or re-imagined from previous works.
The new system can reproduce and transfer facial movement with notable fidelity. Source: https://arielazary.github.io/PGR/
A novel, genericized approach to facial landmark capture, curation and engagement means that even very different-looking faces can be paired up for deepfake puppetry under the new system.
Challenges and Approach
The authors note the current trend for attempting to create effective deepfake puppetry techniques by the use of a single image of a person. Though a number of frameworks have attempted this (including Video2StyleGAN, StyleCLIP, and InterFaceGAN, among others), the ROOP repository has most recently garnered headlines with this technique.
Intuitively, a single image cannot provide the breadth of topographic information needed for effective deepfaking into videos where the target subjects move about extensively, revealing parts of their face and head that are occluded in the single static image.
Regarding this, and regarding also the use of CGI primitives (3DMMs, for example) as intermediary systems for neural facial reenactment, the authors of the new work comment*:
‘A single image is insufficient to learn the entire breadth of appearances of an individual, leading such methods to resort to unfaithful hallucination. Accordingly, other works have incorporated more footage of the individual on top of generic knowledge of faces. Kim et al. and the concurrent work by Wang et al. classically perform reenactment using 3DMM but employ person-specific generative models to transform 3DMM-based renders to realistic images.
‘Inevitably, they still partially suffer from limitations stemming from using 3D mesh models, such as not modeling the mouth cavity.’
It is not just the unseen angles in limited source photos which obstruct an authentic reconstruction. As we have pointed out before, one cannot possibly guess what a person’s smile will look like from source data that does not feature a smile – or how their ‘resting’ face will appear when the target video or image requires that the subject not be smiling, but the sole available data features big smiles (and this is something of a curse with many popular datasets that leverage ‘catwalk’ and ‘premiere’ photos of celebrities, such as Celeb-A, where the nature of the occasion and the prevailing social mores practically demand large smiles).
However, the alternative is to gather multiple photos of the source and train them into an aggregate neural representation. In the case of autoencoder deepfakes, this may require an extraordinary number of images, and entail a notable burden of curation; while in the case of less demanding and more recent systems such as DreamBooth (a personalization technique for Stable Diffusion that lets you ‘inject’ yourself – or anybody else – into the generative system), one has to prioritize and carefully anticipate the kind of generations that will be needed, since it is difficult to force the training process’s attention in any particular direction without unbalancing the data.
In the case of the new Tel-Aviv system, the minute or so of self-scanning produces in the order of 10,000 usable images, which can be trained – though not particularly quickly – into a GAN-based system that’s derived from the MyStyle generative framework offered by Google Research and Tel-Aviv in 2022:
The MyStyle system curates a limited number of diverse poses of an individual, and then generalizes a model from these by incorporating them into a StyleGAN system.
One of the main innovations in the new paper is to democratize this process so that ordinary individuals can quickly provide the necessary poses by self-scanning, without needing to run through any particular scripted sequence of actions, as is often necessary with similar systems.
Disentangling the source data
One innovation present in the new work is the extent to which it disentangles identity from driving movement.
A disadvantage of many autoencoder-based neural facial synthesis systems is the need for the contributing source data or priors to contain a wide range of facial expressions in a wide range of facial/head poses.
If, for instance, there are very few images of an individual smiling broadly, and all these images are ‘frontal’, passport-style poses that face the camera directly, it can be difficult for the training routine to separate out the characteristics of the facial expression from the pose that it appears in, within the source data.
A reason for this is that the facial landmarks which are obtained by such systems tend to inform the trained model not just about the conceptual or abstract movement of a face (i.e., ‘a smile’, ‘a frown’), but also end up informing the system about the actual geometry of the specific subject at hand (i.e., ‘Tom Cruise’s smile’, ‘Tom Cruise’s frown’).
This is not what’s wanted. What’s desirable is that the system achieve a default canonical pose from which any combination of expressions and poses (poses being the orientation of the head from the viewer’s point-of-view) can be derived, and that the expression-based landmarks be independent of facial identity. The authors explain**:
‘We find that a short, single-take, RGB video can include sufficiently varied head poses and expressions. We call such videos self-scans, as they can be conveniently captured by the individual themselves. Note that the individual does not need to be following any script.
‘Instead, the self-scan should intuitively include the poses and expressions that one desires to be possible for the generated video. One does not need to exhaust all combinations of poses and expressions, and performing each once is sufficient […]
‘…our method enables disentangling and recomposing poses and expressions that were not seen jointly in the self-scan.
‘In practice, for the experiments in this paper, we instructed several individuals to capture a 20-30 seconds, portrait video of themselves with varying head poses and expressions that could be used in a social interaction, like conversation.’
Once the frames are obtained, very similar frames (for instance, excessive contiguous frames which over-represent a single pose) are excised, and the remaining frames are used to fine-tune the StyleGAN2 component of the MyStyle framework.
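The pruning of near-identical frames can be illustrated with a minimal sketch. The paper's actual selection criterion is not described here, so the following makes several assumptions: each frame is represented by a small feature vector (in practice this might be landmarks or an embedding), similarity is a mean absolute difference, and `threshold` is a hypothetical tuning value.

```python
from typing import List, Sequence


def mean_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute difference between two equally sized feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def prune_near_duplicates(frames: List[Sequence[float]], threshold: float) -> List[int]:
    """Return the indices of frames worth keeping.

    A frame is kept only if it differs from the most recently kept frame
    by more than `threshold` (an assumed, illustrative criterion).
    """
    kept: List[int] = []
    for i, frame in enumerate(frames):
        if not kept or mean_abs_diff(frame, frames[kept[-1]]) > threshold:
            kept.append(i)
    return kept


# Toy self-scan: frames 1 and 3 barely differ from their predecessors.
frames = [[0.0, 0.0], [0.01, 0.0], [0.5, 0.5], [0.51, 0.5], [1.0, 0.0]]
kept = prune_near_duplicates(frames, threshold=0.1)
print(kept)
```

Comparing against the last kept frame, rather than the immediately preceding one, stops a slow drift of tiny changes from slipping through unnoticed.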
Subsequently, the neural representation can be edited and amended, with the system offering the capability to change the pose of the face, or to change the appearance of the output. These transformations can be applied either to the original source identity or to a different target identity.
The ability to change the pose of the target has become a subject of interest in the neural facial synthesis research sector recently, since it potentially allows webcam footage or other source data, where the subject is at an ‘obscure’ angle, to be straightened out so that the subject appears to be facing the camera directly. This functionality is eventually intended for avatar-style photorealistic videoconferencing.
The Stitch-in-Time method splits videos into individual frames, which are cropped and aligned before being inverted (i.e., ‘projected’ or ‘inserted’) into the latent space of the StyleGAN2 model using a pre-trained encoder. The inversion process ingrains the source data into available latent codes which can be accessed by the generator.
The authors note, citing prior work from 1985, that facial landmarks (or keypoints) tend to bring with them identity-specific information that can interfere with a puppet-style performance between two identities that do not have very similar physiognomies.
Traditionally, the paper observes, these ‘identity-coded’ landmarks only support self-reenactment, rather than identity transfer, or the ability of one person to drive a video depicting another person. Therefore the new Tel-Aviv initiative treats alignments with unusual care and curation.
The initial 468 landmarks provided by MediaPipe are notably in excess of the typical 68 provided by the dominant current landmark estimator, the Facial Alignment Network (FAN Align).
Surprisingly, in the new system, the subset of keypoints retained from this abundance of landmarks is very meager, covering only the inner parts of the eyelids, the irises (for estimating eye direction), and the inner parts of both lips. The rest of the information is supplied by the pixel data. This apparently scrawny set of minimal landmarks is, however, more than sufficient to provide representative movement that can power neural face swapping.
Under the new approach, the keypoints are separated into groups and normalized, which means that they no longer carry any cohesive and entangled traits of identity, but rather are separated out to control individual sections of the face, without dragging other tranches of movement (or identity) with them.
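As a rough illustration of why per-group normalization strips away identity, consider the sketch below. The article does not give the paper's normalization formula, so the choice here (centering each group on its centroid and scaling to unit RMS radius) is an assumption, as are the group name and the toy coordinates.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def normalize_group(points: List[Point]) -> List[Point]:
    """Centre a keypoint group on its centroid and scale it to unit RMS radius."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    centred = [(x - cx, y - cy) for x, y in points]
    rms = math.sqrt(sum(x * x + y * y for x, y in centred) / n)
    if rms == 0.0:
        return centred  # degenerate group: all points coincide
    return [(x / rms, y / rms) for x, y in centred]


def normalize_groups(groups: Dict[str, List[Point]]) -> Dict[str, List[Point]]:
    """Normalize every group (e.g. lips, each eye) independently of the others."""
    return {name: normalize_group(pts) for name, pts in groups.items()}


# The same mouth shape on two hypothetical faces: face B's mouth is three
# times larger and sits elsewhere in the image.
face_a = {"inner_lips": [(0.0, 0.0), (2.0, 0.0), (1.0, 1.0), (1.0, -1.0)]}
face_b = {"inner_lips": [(10.0, 5.0), (16.0, 5.0), (13.0, 8.0), (13.0, 2.0)]}

norm_a = normalize_groups(face_a)["inner_lips"]
norm_b = normalize_groups(face_b)["inner_lips"]
print(norm_a)
print(norm_b)
```

After normalization the two groups coincide, so only the shape of the mouth opening survives, and not the size or placement that would identify a particular face.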
A particular benefit of this optimization is that iris placement, one of the central bugbears of neural facial synthesis, which frequently finds depicted subjects looking in the wrong direction, is greatly improved. This problem comes about, as hinted at earlier, because the legacy datasets (including FFHQ) rely on ‘premiere’ and ‘event’ celebrity photos, where the subject tends to be all too frequently looking directly into the camera. By reducing the landmarks to a smaller subset, it’s possible to gain greater control over eye gaze direction.
Temporal consistency, currently the greatest challenge in latent diffusion video, is handled in part by basing the alignments of the current frame on the previous frame, whilst adding Gaussian noise to ensure fluidity of transition. The authors state:
‘Adding noise is necessary to allow optimization [to] escape the local minima of using the previous latent code again. We find that this initialization is crucial for biasing the optimization to consistently converge to nearby latent codes, minimizing unnecessary perturbation. We finally apply another Gaussian low-pass filter, this time on the produced latent codes themselves, making the transition between frames even smoother.’
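The two steps described in the quote (noisy initialization from the previous frame's latent code, and a Gaussian low-pass filter over the resulting sequence of codes) can be sketched as follows. This is a sketch under stated assumptions and not the authors' implementation: the per-frame optimization itself is omitted, and the noise scale, kernel radius and edge handling are hypothetical choices.

```python
import math
import random
from typing import List

Latent = List[float]


def init_from_previous(prev: Latent, sigma: float, rng: random.Random) -> Latent:
    """Start the next frame's optimization near the previous latent code."""
    return [v + rng.gauss(0.0, sigma) for v in prev]


def gaussian_kernel(radius: int, sigma: float) -> List[float]:
    """Normalized 1D Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma)) for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth_latents(latents: List[Latent], radius: int = 2, sigma: float = 1.0) -> List[Latent]:
    """Low-pass filter latent codes along the time axis (edge frames replicated)."""
    kernel = gaussian_kernel(radius, sigma)
    n, dim = len(latents), len(latents[0])
    out: List[Latent] = []
    for t in range(n):
        acc = [0.0] * dim
        for k, w in zip(range(-radius, radius + 1), kernel):
            src = latents[min(max(t + k, 0), n - 1)]
            for d in range(dim):
                acc[d] += w * src[d]
        out.append(acc)
    return out


def total_variation(latents: List[Latent]) -> float:
    """Sum of absolute frame-to-frame changes, a crude measure of jitter."""
    return sum(abs(a - b) for p, q in zip(latents, latents[1:]) for a, b in zip(p, q))


rng = random.Random(0)
start = init_from_previous([0.5, 0.5], sigma=0.05, rng=rng)

jittery = [[0.0], [1.0], [0.0], [1.0], [0.0], [1.0]]
smoothed = smooth_latents(jittery)
print(total_variation(jittery), total_variation(smoothed))
```

The filter leaves a static sequence unchanged while damping rapid frame-to-frame oscillation, which is the behavior that matters for flicker-free output.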
Data and Tests
(Due to bandwidth limitations, it’s not possible to reproduce the video results published for the study, but these comparison videos can be viewed in full motion at the site’s project page)
Four of the self-scanned videos taken during research were used as data, with a random 5-10 seconds utilized from the driving videos, leading to ten results.
Various metrics were used, including identity preservation (ID), Average Pose Distance (APD), and Average Expression Distance (AED). Additionally, the more traditional Fréchet Inception Distance (FID) was considered, but deemed unsuitable to the few-shot nature of the project; therefore the authors used Natural Image Quality Evaluator (NIQE), which evaluates image quality per se, and without direct reference to adjacent in-program data.
Of these results, the authors state:
‘As can be seen, our method outperform all existing methods on identity and expression preservation, and quality, and performs competitively with the leader LIA on pose preservation. Specifically, by comparing generated images to reference self-scan frames, one can observe that our method excels at preserving identity.’
The authors conducted further studies into stylization and semantic editing. As with previous reports, we will not examine the stylization results (in this case, turning the subject into ‘the Joker’ from Batman), as, in the literature, these tend to be included more to garner attention than offer potential useful applicability – a situation that occurs here as well, in our opinion. Nonetheless we refer the reader to the paper for further details about this aspect of the work.
More interesting are the authors’ efforts to perform editing of latent codes inside the latent space itself, by addressing the features elicited during training (whose location in the latent space is surprisingly easy to find, thanks to the architecture of the foundational MyStyle project).
Of this, the authors comment*:
‘In StyleGAN’s latent space there exist vectors or “directions” that when traversed upon, affect the generated image in a semantically consistent manner. For example, one such vector would consistently and gradually add a smile. The result of our latent optimization method is a set of latent codes, that turn into the reenacted video once passed through the generator.
‘Here, instead of generating frames from these latent codes, we first shift them a constant sized step in the direction of a specific semantic vector. Passing these shifted latent codes through the generator results in a reenacted video that was semantically additionally edited. We use three semantic vectors that are identified using InterFaceGAN [Shen et al. 2020] – smile, pose, and age.’
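The edit described in the quote amounts to adding the same offset to every per-frame latent code before it reaches the generator. A minimal sketch follows, assuming the semantic direction is normalized to unit length so that `step` is a distance in latent space. The two-dimensional codes and the direction values are toy placeholders, not real InterFaceGAN vectors, and the generator itself is out of scope.

```python
from typing import List

Latent = List[float]


def shift_latents(latents: List[Latent], direction: Latent, step: float) -> List[Latent]:
    """Move every per-frame latent code the same distance along `direction`."""
    norm = sum(d * d for d in direction) ** 0.5
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    unit = [d / norm for d in direction]
    return [[w + step * u for w, u in zip(code, unit)] for code in latents]


# Toy latent codes for two frames and a hypothetical 'smile' direction.
codes = [[0.0, 0.0], [1.0, 1.0]]
smile_direction = [3.0, 4.0]

edited = shift_latents(codes, smile_direction, step=5.0)
print(edited)
```

Because every frame receives an identical offset, the frame-to-frame differences produced by the reenactment stage are preserved, and only the overall attribute changes.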
The researchers themselves conclude with the promising note that their system, while it cannot reproduce expressions absent in the capture dataset (and it would be a bad idea to try, as we have seen), can very easily separate out pose and expression, easing the rigor and protocols for the capture phase. Many similar systems in recent papers have imposed onerous burdens in this respect, such as the need for green screens or other methods of isolating the subject, or of forcing the subject to run through a repeated range of facial expressions in a tedious cycle of face poses.
The ability to separate these necessary components for the dataset, combined with the accent on in-latent editing, and a system with an extraordinarily well-mapped and predictable latent space, could signify a step forward in the developing conventions for automated identity transfer.
One other notable advantage of the new system is that each participating party will need to cooperate and provide scans, making non-consensual deepfaking quite a challenge.
Probably Not Coming to an App Store Near You
It should be emphasized that the breakthrough here is one of ease and ergonomics; the system is not intended to produce sophisticated deepfake puppetry from scratch on even a powerful modern smartphone, which does not have the necessary processing capacity, and is not likely to in the near future.
The use of the smartphone as a capture device is relatively arbitrary in this respect. It’s useful because nearly all known lens configurations and capture schema in mobile devices are well-documented and predictable, which means that the system itself will not need expensive bespoke configuration in order to process data; but the actual number-crunching will require more formidable resources, and the only realistic way to incorporate that into an all-mobile workflow would be via remote processing and use of APIs.
* My conversion of the authors’ inline citations to hyperlinks.
** My emphases.