As we have noted before, it is extraordinarily difficult for neural facial synthesis systems to ‘guess’ what a person’s profile might look like based solely on training data that shows the front (rather than the side) of their face. Since frontal photos represent the overwhelming majority of facial poses in common and popular datasets, and since there is so little demand for profile (i.e., side-view) photos of people, the problem seems insoluble – and leads directly to poor performance in neural synthesis architectures when they attempt to invent profile views.
Click to play. From our feature last year: evidence that the lack of profile pictures, even of well-known celebrities, deeply affects the outcome in systems such as DeepFaceLive when the host turns sideways.
Though it is possible to use 3D projection and techniques such as monocular depth estimation (depth maps) to infer what the side of a head may look like based solely on a front view, renders require a reasonable amount of ground truth to fill in the sketchy blanks around complex geometry such as ears. This data is simply not available most of the time, except in cases – such as a dedicated VFX workflow – where resources have been spent to scan the subject in question.
Very few systems perform well in this regard, though we recently noted that the now-abandoned ROOP deepfake project, based on the InsightFace framework, achieved above-average results. In terms of FOSS code, however, the options for effective profile interpretation are rather slim.
Now, however, a new research collaboration from Korea offers a more effective approach to the problem, with a novel system that takes a two-tier approach to break down the challenge and provide notably more effective side-view neural renders, without needing custom scans or dedicated data:
The innovation in the Korean system is that the pose interpretation scenario is broken down into two separate and discrete challenges: the first concentrates resources on testing whether a synthesized photo looks real or fake; the other learns to tell whether a synthesized image is in agreement with the desired camera pose.
The system, titled SideGAN, also offers a new pose-matching loss function to learn the pose consistency of 3D Generative Adversarial Networks (GANs). Additionally, a new pose-sampling strategy further improves distillation of side views, helping the system to offer what may be the most effective frontal-to-profile inference from in-the-wild data yet seen.
The new paper is titled 3D-Aware Generative Model for Improved Side-View Image Synthesis, and comes from five researchers across KAIST, POSTECH, and Kakao Brain.
(It should be noted that many training pipelines enable data augmentation that duplicates photos flipped horizontally, effectively doubling the number of available images; however, profile data is so severely under-represented in major collections that this makes only a minimal difference – and it cannot be used, in any case, for notably asymmetrical subjects, such as people with a mole on one side of the face.)
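To illustrate both the trick and its limitation, horizontal-flip augmentation can be sketched in a few lines – a minimal example in which the array shapes and the yaw annotation are purely illustrative, not drawn from any particular training pipeline:

```python
import numpy as np

# Minimal sketch of horizontal-flip augmentation: mirroring a yaw-annotated
# face image also negates its yaw angle, turning a left profile into a right
# profile. This doubles pose coverage, but only for symmetrical subjects --
# a mole on one cheek would end up on the wrong side of the flipped copy.
def flip_augment(image: np.ndarray, yaw_degrees: float):
    """Return the mirrored image and its mirrored yaw annotation."""
    flipped = image[:, ::-1, :].copy()  # reverse the width axis of (H, W, C)
    return flipped, -yaw_degrees

img = np.arange(12).reshape(2, 2, 3)      # tiny 2x2 RGB stand-in image
mirrored, yaw = flip_augment(img, 35.0)
assert yaw == -35.0                       # left-facing pose becomes right-facing
assert np.array_equal(mirrored[:, 0, :], img[:, 1, :])  # columns swapped
```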
To address this imbalance, the authors of the new work have taken a deeper look at a process that has until now been considered a single obstacle. They explain:
‘To ease the challenging problem of learning photo-realistic and multi-view consistent image synthesis, we split the problem into two subproblems, each of which can be solved more easily. Specifically, we formulate the problem as a combination of two simple discrimination problems, one of which learns to discriminate whether a synthesized image looks real or not, and the other learns to discriminate whether a synthesized image agrees with the camera pose.
‘Unlike the formulations of the previous methods, which try to learn the real image distribution for each pose, or to learn pose estimation, our subproblems are much easier as each of them is analogous to a basic binary classification problem.’
This concept is encapsulated in SideGAN with the development of a dual-branched discriminator, which separately learns photo-realism and pose consistency.
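The idea of a shared trunk feeding two independent binary heads can be sketched as follows – a toy model only, with random weights standing in for learned parameters, and with the feature and pose dimensions chosen arbitrarily rather than taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dual-branch discriminator: a shared feature trunk feeds two
# separate binary heads -- one scoring photo-realism, one scoring whether the
# image agrees with the conditioning camera pose. All weights are random
# placeholders; a real system would learn them adversarially.
W_trunk = rng.standard_normal((64, 16))     # image features -> shared trunk
W_real = rng.standard_normal((16, 1))       # trunk -> realism logit
W_pose = rng.standard_normal((16 + 3, 1))   # trunk + pose -> agreement logit

def discriminate(image_feats: np.ndarray, camera_pose: np.ndarray):
    h = np.tanh(image_feats @ W_trunk)              # shared representation
    realism = (h @ W_real).item()                   # 'real or fake?' logit
    agreement = (np.concatenate([h, camera_pose]) @ W_pose).item()
    return realism, agreement                       # 'matches the pose?' logit

feats = rng.standard_normal(64)                     # stand-in image features
pose = np.array([0.3, -1.2, 0.0])                   # e.g. yaw / pitch / roll
r, p = discriminate(feats, pose)
```

The design point is that each head solves a plain binary classification problem, rather than one head having to model the full image distribution per pose.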
(In Generative Adversarial Networks, the discriminator performs iterative tests on the output being generated at training time and grants the generating module a score for its effort, without ever telling it where – or if – it went wrong. The generator therefore keeps trying, working out the glitches for itself, until the scores improve. In this sense, the discriminator is akin to an unskilled arts patron who knows nothing of the technical challenges involved but ‘knows what it likes’, and the generator to an exhausted artist saddled with a thorny client.)
The SideGAN generator is designed to address the face content and background content separately, which is unusual in a GAN architecture. It does this because otherwise the GAN begins to take non-face content into account, not least when evaluating the accuracy of images at training time.
The background component of SideGAN takes inspiration from the 2022 NeurIPS project EpiGRAF, from Snap Inc. The foreground generator component is taken from EG3D, the hybrid 3D GAN presented in the Stanford/NVIDIA 2022 paper Efficient Geometry-Aware 3D Generative Adversarial Networks, which creates triplane features from the latent codes and the known camera parameters (i.e., lens configuration, distance to subject, etc.) trained into the model based on annotated data (or else imposed by the training configuration).
The 3D positions estimated by this component yield foreground feature maps, which are ultimately fed to a super-resolution module that upscales the low-resolution generations to a workable resolution.
Meanwhile, the multilayer perceptron (MLP) in the EpiGRAF module converts the extracted latent code and estimated 3D position into a feature vector, which is aggregated with the foreground features and passed on to the image generator.
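The aggregation-then-upscale step described above can be caricatured in a few lines. To be clear, the shapes, the mask-based compositing, and the nearest-neighbour upscaler are all illustrative stand-ins – the real system uses learned modules – but the 64x64-to-256x256 flow mirrors the resolutions reported later in the article:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy sketch: foreground features (from the triplane branch) are combined with
# background features (from the MLP branch) at the neural-rendering resolution,
# then upscaled to the final output resolution. The mask and the nearest-
# neighbour upscaler are placeholders for learned components.
fg = rng.standard_normal((32, 64, 64))    # foreground features at 64x64
bg = rng.standard_normal((32, 64, 64))    # background features at 64x64
mask = rng.random((1, 64, 64)) > 0.5      # hypothetical foreground mask

aggregated = np.where(mask, fg, bg)       # composite foreground over background

def upscale_nearest(x: np.ndarray, factor: int) -> np.ndarray:
    """Crude stand-in for the learned super-resolution module."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

out = upscale_nearest(aggregated, 4)      # 64x64 -> 256x256
assert out.shape == (32, 256, 256)
```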
The dual-branch discriminator (DBD) receives an image and a camera pose as inputs; the supplied pose can be either positive (matching the image) or negative (deliberately mismatched). The discriminator then attempts to gauge whether the photo is real or fake, and whether it agrees with the posited pose.
SideGAN incorporates an Additional Uniform Pose Sampling (AUPS) strategy, which samples camera poses from a uniform distribution in addition to those actually present in the source dataset, improving learning opportunities for steep angles (which otherwise, for the aforementioned reasons, would be statistically scarce).
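The intuition behind mixing a uniform pose distribution into a front-heavy dataset can be sketched as below. The yaw ranges, the 50/50 mixing ratio, and the Gaussian stand-in for the dataset’s pose distribution are all assumptions for illustration, not the paper’s settings:

```python
import numpy as np

rng = np.random.default_rng(2)

# Sketch of the idea behind uniform pose sampling: alongside poses drawn from
# the dataset's (front-heavy) empirical distribution, also draw poses
# uniformly over the full yaw range, so that steep angles are seen far more
# often during training than the dataset alone would allow.
dataset_yaws = rng.normal(0.0, 10.0, size=10_000)   # mostly near-frontal

def sample_training_yaw(p_uniform: float = 0.5) -> float:
    if rng.random() < p_uniform:
        return rng.uniform(-90.0, 90.0)             # uniform over all yaws
    return float(rng.choice(dataset_yaws))          # empirical dataset pose

samples = np.array([sample_training_yaw() for _ in range(5_000)])
steep = np.mean(np.abs(samples) > 45.0)             # share of steep poses
assert steep > 0.2   # vastly more steep angles than the dataset alone yields
```

With the dataset distribution alone, steep yaws above 45 degrees would be vanishingly rare; the uniform component guarantees them a fixed share of every training epoch.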
Additional losses applied in the DBD include a non-saturating GAN loss (wherein the generator maximizes the log of the discriminator’s probability that its output is real), identity regularization, and a final loss that mediates between the other terms.
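The non-saturating formulation is standard enough to write down exactly. A minimal sketch, computed from a raw discriminator logit via the numerically stable identity −log(sigmoid(x)) = softplus(−x):

```python
import math

# Non-saturating generator loss: instead of minimizing log(1 - D(G(z))),
# the generator maximizes log D(G(z)). This keeps gradients strong when the
# discriminator confidently rejects a sample, which is exactly when the
# original (saturating) formulation goes flat.
def non_saturating_g_loss(fake_logit: float) -> float:
    # -log(sigmoid(logit)) == softplus(-logit), avoiding overflow/underflow
    return math.log1p(math.exp(-fake_logit))

# A confidently-rejected fake (very negative logit) yields a large loss,
# and hence a large corrective gradient, unlike the saturating original.
assert non_saturating_g_loss(-5.0) > non_saturating_g_loss(5.0)
```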
Data and Tests
Since SideGAN draws heavily on EG3D, the majority of experimental parameters and settings match those of the Stanford/NVIDIA project. Exceptions are that the dimension of the background latent vector is set to 512, the final image resolution to 256x256px, and the neural rendering resolution to 64x64px (the final resolution is obtained by internal upscaling routines).
Datasets used were CelebA-HQ, FFHQ and AFHQ (the cat database, not extensively explored in the work). The background regions were removed from CelebA-HQ by means of the associated ground-truth segmentation masks, though the backgrounds for FFHQ were retained (the paper does not say why, but presumably pre-made masks were not available).
Some of the tests were conducted with transfer learning (where an existing dataset is exploited to generate novel material), a common technique used in 3D GANs (i.e., GANs that have some conceptual knowledge of 3D space applied during the generative process, often via ancillary systems such as 3D Morphable Models, aka 3DMMs).
To accomplish this, a generator was pretrained with a pose-balanced synthetic dataset (i.e., not custom-made for the task – from the Microsoft ‘Fake It Till You Make It’ project; see embedded video below), to compensate for the aforementioned deficit of profile views in the datasets under test.
Microsoft’s synthetic dataset is able to generate any number of profile views ad hoc, since the source material is CGI-based.
Prior networks initially tested were π-GAN and EG3D, on the three aforementioned datasets. The AFHQ cats dataset was omitted in scenarios where transfer learning was used, since its size did not support pretraining.
Of these results, the authors comment:
‘For all the real-world human face datasets, π-GAN and EG3D generate blurry images for steep angles compared to realistic frontal images. In contrast, SideGAN robustly generates high-quality images irrespective of camera pose.’
Below are qualitative results for similar subsequent tests, with SideGAN pitched exclusively against EG3D, this time employing transfer learning:
Here the authors state:
‘For all the datasets, EG3D generates unnatural images for steep angles compared to realistic frontal images. On the other hand, SideGAN robustly generates high-quality images irrespective of camera pose.
‘These results indicate that our method is effective in learning to synthesize high-quality images at all camera poses in both cases with and without transfer learning.’
In examining small patches of recreated profile views, the authors contend that EG3D produces less realistic results, with a greater number of artifacts, than the new system. While both systems benefit from transfer learning, EG3D’s pose-sensitive process, which lacks SideGAN’s extra generalization measures, leaves it unable to ‘fill in’ or interpret ‘holes’ in the supplied pose when attempting to infer a side view.
For a quantitative comparison, evaluating image quality and shape quality, the authors generated images from randomly-sampled camera poses (i.e., points of view on the subject), using Fréchet Inception Distance (FID) and depth error as metrics (depth error was estimated by calculating the Mean Squared Error, or MSE, between the generated depth created by the model and the rendered depth from the estimated geometry). These reconstructions were provided, as they were with the prior EG3D model, by the Max Planck Institute’s 2021 DECA project.
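The depth-error metric described above is a straightforward MSE, and can be sketched as follows – with random arrays standing in for the model’s generated depth map and the depth rendered from the separately estimated geometry:

```python
import numpy as np

# Minimal sketch of the depth-error metric: the Mean Squared Error between
# the depth map produced by the generative model and a depth map rendered
# from independently estimated geometry. The arrays below are placeholders.
def depth_mse(generated_depth: np.ndarray, rendered_depth: np.ndarray) -> float:
    return float(np.mean((generated_depth - rendered_depth) ** 2))

gen = np.zeros((4, 4))                     # stand-in generated depth map
assert depth_mse(gen, gen) == 0.0          # identical maps -> zero error
assert depth_mse(gen, gen + 2.0) == 4.0    # constant offset of 2 -> MSE of 4
```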
Here the authors assert:
‘[In] both cases with and without transfer learning, SideGAN outperforms all the other baselines in terms of image [quality] thanks to our effective training method.’
Lastly, the authors conducted quantitative experiments to compare the systems’ varying capacity to produce synthesized images at frontal, steep and extrapolated angles. This was undertaken with Microsoft’s above-mentioned FaceSynthetic dataset, since FID would have required a balance of poses not extant in the popular datasets.
To determine whether SideGAN can really produce better results from in-the-wild data collections (rather than bespoke and highly-curated datasets), the researchers first created an artificially unbalanced subset of the FaceSynthetic dataset, giving a collection with a more ‘typical’ distribution of poses. Once a model was trained on this skewed dataset, FID scores were evaluated for EG3D and SideGAN, without the use of transfer learning.
In regard to these results, the authors state:
‘[Our] model performs comparably to EG3D at near-frontal angles, and as the angle gets larger, our model performs significantly better than EG3D, proving the effectiveness of our approach.’
The paper concludes:
‘Our experimental results show that our method can synthesize photo-realistic images irrespective of the camera pose on human and animal face datasets. Especially, even only with pose-imbalanced in-the-wild datasets, our model can generate details of side-view images such as ears, unlike blurry images from the baselines.’
We have noted before that creating authentic side-views is a data problem, not a mathematical challenge, or some type of coding roadblock, and have indicated that those voices on the internet who believe that an easy fix is imminent do not yet understand this.
Thus the central challenge, in the absence of an unexpected avalanche of profile data, is to ‘do more with less’, and to use such data as exists in the most efficient and imaginative way possible.
One potential road forward is the development of pretrained networks that specialize in profile views, whether synthetic or gathered arbitrarily from the small representations of these viewpoints across the popular collections.
Another might be to develop loss functions that explicitly map relationships between side-view data and front-view data, so that any arbitrary configuration of ‘side of head’ pixels would be associated with a specific and limited range of probabilities as to what the equivalent side view would look like. Since side-data foreshortening is so severe in a passport-style picture, this is not an easy prospect.
Regarding SideGAN, we must note that despite the use of celebrity-laden datasets, very well-known faces were not selected for the samples provided, which may indicate that the resulting inferences are more ‘plausible’ than ‘accurate’. That said, this is one of the hardest hurdles in neural facial synthesis, and no obvious or sudden solutions seem likely to emerge in the near future.