Solving the ‘Profile View Famine’ With Generative Adversarial Networks

About the author

Martin Anderson

I'm Martin Anderson, a writer occupied exclusively with machine learning, artificial intelligence, big data, and closely-related topics, with an emphasis on image synthesis, computer vision, and NLP.

As we have noted before, it is extraordinarily difficult for neural facial synthesis systems to ‘guess’ what a person’s profile might look like based solely on training data that shows the front (rather than the side) of their face. Yet since such photos represent the overwhelming majority of facial poses available in common and popular datasets, and since there is so little demand for profile (i.e., side-view) photos of people, the problem seems insoluble – and leads directly to poor performance in neural synthesis architectures when they attempt to invent profile views.

Click to play. From our feature last year: evidence that the lack of profile pictures, even of well-known celebrities, deeply affects the outcome in systems such as DeepFaceLive when the host turns sideways.

Though it is possible to use 3D projection and techniques such as monocular depth estimation (depth maps) to infer what the side of a head may look like based solely on knowledge of a front view, such renders require a reasonable amount of ground truth to fill in the sketchy blanks concerning complex geometry such as ears. This data is simply not available most of the time, except in cases, such as a dedicated VFX workflow, where resources have been spent to scan the subject in question.

From the new paper, we see the prior system PI-GAN failing to extrapolate an accurate side-view from a frontal pose. Source: https://arxiv.org/pdf/2309.10388.pdf

Very few systems perform well in this regard, though we’ve recently noted that the now-abandoned ROOP deepfake project achieved above-average results, based on the InsightFace framework. In terms of FOSS code, however, and in general, the options for effective profile interpretation are rather slim.

Now, however, a new research collaboration from Korea offers a more effective approach to the problem, with a novel system that takes a two-tier approach to break down the challenge and provide notably more effective side-view neural renders, without needing custom scans or dedicated data:

In tests, we see that the new Korean system creates far more plausible side views in comparison to analogous recent frameworks.

The innovation in the Korean system is that the pose-interpretation scenario is broken down into two separate and discrete challenges: the first concentrates resources on testing whether an interpreted photo looks real or fake, while the other learns to tell whether or not a synthesized image is in agreement with the desired camera pose.

Above, evidence of the greatly-improved profile synthesis possible with the new approach, which may lay the groundwork for inferring profile views more effectively with available data.

The system, titled SideGAN, also offers a new pose-matching loss function to learn the pose consistency of 3D Generative Adversarial Networks (GANs). Additionally, a new pose-sampling strategy further improves distillation of side-views, helping the system to offer what may be the most effective passport-to-profile inference from in-the-wild data yet seen.

The new paper is titled 3D-Aware Generative Model for Improved Side-View Image Synthesis, and comes from five researchers across KAIST, POSTECH, and Kakao Brain.

Approach

In the new work, the researchers illustrate how unbalanced common and popular face datasets are. Below we see the frequency of profile views across three of the most influential collections: FFHQ; CelebA-HQ; and AFHQ Cats.

As the severity of the angle increases across common datasets, the number of available photos plummets.

(It should be noted that many training routines enable data-augmentation processes that duplicate photos flipped horizontally, effectively doubling the number of available images; however, profile data is so severely under-represented in major collections that this makes only a minimal difference – and the technique cannot be used, in any case, for notably asymmetrical subjects, such as people who have a mole on one side of their face.)
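
The arithmetic of this limitation is easy to demonstrate. Below is a purely illustrative Python sketch (the image names and yaw values are invented, not drawn from any of the datasets above) showing that flip augmentation doubles the pool of images while leaving the proportion of steep-angle views exactly unchanged:

```python
# Illustrative sketch: horizontal-flip augmentation doubles the pool of
# images, but the *proportion* of steep-angle views is unchanged, so
# profile data remains just as scarce.
def flip_augment(samples):
    """Each sample is (image_id, yaw_degrees); a horizontal flip negates yaw."""
    flipped = [(img + "_flipped", -yaw) for img, yaw in samples]
    return samples + flipped

def profile_fraction(samples, threshold=60.0):
    """Fraction of samples at or beyond a steep yaw threshold."""
    steep = [s for s in samples if abs(s[1]) >= threshold]
    return len(steep) / len(samples)

# A toy pose distribution resembling frontal-heavy collections such as FFHQ.
dataset = [("img%d" % i, yaw) for i, yaw in
           enumerate([0, 5, -10, 15, 3, -7, 70, -2, 8, 1])]
augmented = flip_augment(dataset)

print(len(dataset), len(augmented))  # the pool doubles...
print(profile_fraction(dataset) == profile_fraction(augmented))  # ...the ratio does not
```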

To address this imbalance, the authors of the new work have taken a deeper look at a process that has until now been considered a single obstacle. They explain:

‘To ease the challenging problem of learning photo-realistic and multi-view consistent image synthesis, we split the problem into two subproblems, each of which can be solved more easily. Specifically, we formulate the problem as a combination of two simple discrimination problems, one of which learns to discriminate whether a synthesized image looks real or not, and the other learns to discriminate whether a synthesized image agrees with the camera pose.

‘Unlike the formulations of the previous methods, which try to learn the real image distribution for each pose, or to learn pose estimation, our subproblems are much easier as each of them is analogous to a basic binary classification problem.’

This concept is encapsulated in SideGAN with the development of a dual-branched discriminator, which separately learns photo-realism and pose consistency.
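
As a rough illustration of the idea (not the authors' code), a dual-branch discriminator can be sketched as a shared feature trunk feeding two binary heads, with the pose head additionally conditioned on the supplied camera pose. All dimensions and weights below are arbitrary stand-ins:

```python
import numpy as np

# A minimal, illustrative sketch of a dual-branch discriminator: a shared
# feature trunk feeds two binary heads, one scoring photo-realism, the
# other scoring agreement with the supplied camera pose.
rng = np.random.default_rng(0)

W_trunk = rng.standard_normal((64, 32)) * 0.1  # shared image-feature trunk
w_real = rng.standard_normal(32) * 0.1         # head 1: real vs. fake
w_pose = rng.standard_normal(32 + 3) * 0.1     # head 2: pose agreement

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def discriminate(image_feat, pose):
    """image_feat: (64,) flattened image features; pose: (3,) camera angles."""
    h = np.maximum(image_feat @ W_trunk, 0.0)              # shared ReLU trunk
    realism = sigmoid(h @ w_real)                          # does it look real?
    pose_ok = sigmoid(np.concatenate([h, pose]) @ w_pose)  # does it match the pose?
    return float(realism), float(pose_ok)

r, p = discriminate(rng.standard_normal(64), np.array([0.3, -0.1, 0.0]))
print(0.0 < r < 1.0 and 0.0 < p < 1.0)  # each head emits its own probability
```

Each head is then a basic binary classification problem, per the authors' quote above, rather than a joint model of the real-image distribution at every pose.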

Conceptual architecture for the branched discriminator.

(In Generative Adversarial Networks, the discriminator performs iterative tests on the current output being generated at training time and grants the generating module a score for its effort, without ever telling it where – or if – it went wrong. Therefore the generator module keeps trying, working out the glitches for itself, until the scores improve. In this sense, the discriminator is akin to an unskilled arts patron who knows nothing of the technical challenges involved, but ‘knows what it likes’, and the generator akin to an exhausted artist with a thorny client.)

The SideGAN generator is designed to address the face content and background content separately, which is unusual in a GAN architecture. It does this because otherwise the GAN begins to take non-face content into account, not least when evaluating the accuracy of images at training time.

The background component of SideGAN takes inspiration from the 2022 NeurIPS project EpiGRAF, from Snap Inc.; and the foreground generator component is taken from the hybrid morphable face model presented in the Stanford/NVIDIA 2023 paper Single-Shot Implicit Morphable Faces with Consistent Texture Parameterization, which builds on EG3D, and which creates triplane features from the latent codes and the known camera parameters (i.e., lens configuration, distance to subject, etc.) trained into the model based on annotated data (or else superimposed by the training configuration).

Examples of facial inference from the contributing paper Single-Shot Implicit Morphable Faces with Consistent Texture Parameterization, which builds on EG3D. Source: https://arxiv.org/pdf/2305.03043.pdf

The 3D positions estimated by this component yield foreground feature maps which are ultimately fed through to a super-resolution module that up-scales from the low-resolution generations into a workable resolution.

Meanwhile, the multilayer perceptron (MLP) in the EpiGRAF module converts the extracted latent code and estimated 3D position into a feature vector, which is aggregated with the previous calculations and sent on to the image generator.
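
A rough sketch of this aggregate-then-upscale flow follows, with assumed shapes, and with nearest-neighbour repetition standing in for the learned super-resolution module; none of this is the paper's actual code:

```python
import numpy as np

# Illustrative sketch (assumed shapes): foreground feature maps and a
# background feature vector are aggregated, then the low-resolution
# render is upscaled toward the working resolution.
fg = np.random.rand(64, 64, 32)  # foreground features at neural-render resolution
bg = np.random.rand(32)          # background feature vector from the MLP

aggregated = fg + bg             # broadcast the background vector over the map
low_res = aggregated.mean(axis=-1)  # collapse features into a 64x64 'render'

def upscale(img, factor=4):
    """Nearest-neighbour stand-in for a learned super-resolution module."""
    return np.repeat(np.repeat(img, factor, axis=0), factor, axis=1)

print(low_res.shape, upscale(low_res).shape)  # (64, 64) -> (256, 256)
```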

The dual-branch discriminator (DBD) receives a camera pose and an image as inputs; the supplied pose can be either a positive (matching) or a negative (non-matching) example. The discriminator then attempts to gauge whether the supplied photo is fake or real, and whether it agrees with the posited pose.

SideGAN incorporates an Additional Uniform Pose Sampling (AUPS) strategy, which supplements the poses available in the source dataset with poses drawn from a uniform distribution, improving learning opportunities for steep angles (which otherwise, for the aforementioned reasons, would be statistically scarce).
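
The intuition behind AUPS can be sketched in a few lines of Python; the mixing ratio and pose distributions below are invented for illustration, not taken from the paper:

```python
import random

# Sketch of uniform pose sampling mixed with the dataset's own
# (frontal-heavy) empirical pose distribution, so that steep angles are
# seen during training far more often than the data alone would provide.
random.seed(0)

dataset_yaws = [random.gauss(0, 15) for _ in range(1000)]  # frontal-heavy poses

def sample_pose(p_uniform=0.5):
    if random.random() < p_uniform:
        return random.uniform(-90.0, 90.0)  # uniform over the full yaw range
    return random.choice(dataset_yaws)      # empirical dataset pose

draws = [sample_pose() for _ in range(10000)]
steep_mixed = sum(abs(y) > 60 for y in draws) / len(draws)
steep_data = sum(abs(y) > 60 for y in dataset_yaws) / len(dataset_yaws)
print(steep_mixed > steep_data)  # steep poses now appear far more frequently
```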

Additional losses applied in the DBD include non-saturating GAN loss (wherein the generator maximizes the log of the discriminator’s estimated probability that its output is real), identity regularization, and a final combined loss, which mediates between the other available losses.
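
The non-saturating formulation can be shown concretely: rather than minimizing log(1 - D(G(z))), which yields vanishing gradients once the discriminator confidently rejects a sample, the generator minimizes -log D(G(z)). A minimal sketch:

```python
import math

# The original 'saturating' generator objective flattens out when the
# discriminator's score for a fake sample approaches zero; the
# non-saturating variant stays steep there, preserving the training signal.
def saturating_loss(d_fake):
    return math.log(1.0 - d_fake)

def non_saturating_loss(d_fake):
    return -math.log(d_fake)

# When the discriminator is confident a sample is fake (D ~ 0), the
# non-saturating loss is large, so the generator still receives a strong push.
for d in (0.01, 0.5, 0.99):
    print(round(saturating_loss(d), 3), round(non_saturating_loss(d), 3))
```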

Data and Tests

Since SideGAN draws heavily on EG3D, the majority of experimental parameters and settings match those of the Stanford/NVIDIA project. Exceptions are that the dimension of the background latent vectors is set to 512, with a final image resolution of 256x256px and a neural rendering resolution of 64x64px (the final resolution is obtained by internal upscaling routines).

Datasets used were CelebA-HQ, FFHQ, and AFHQ (the cat database, not extensively explored in the work). The background regions were removed from CelebA-HQ by means of the associated ground truth segmentation masks, though the backgrounds for FFHQ were retained (the paper does not mention why, but presumably pre-made masks were not available).

Some of the tests were conducted with transfer learning (where knowledge gained from training on an existing dataset is reused for a new task), a common technique in 3D GANs (i.e., GANs that have some conceptual knowledge of 3D space applied during the generative process, often via ancillary systems such as 3D Morphable Models, aka 3DMMs).

To accomplish this, a generator was pretrained with a pose-balanced in-the-wild (i.e., not custom-made for the task) dataset (from the Microsoft ‘Fake It Till You Make It’ project, see embedded video below), to compensate for the aforementioned deficit of profile views in the datasets under test.

Microsoft’s synthetic dataset is able to generate any number of profile views ad hoc, since the source material is CGI-based.

Prior networks tested initially were Pi-GAN and EG3D, on the three aforementioned datasets. The AFHQ cats dataset was omitted in scenarios where transfer learning was used, since the size of the cats dataset did not support pretraining.

Qualitative comparison against EG3D and Pi-GAN on CelebA-HQ.

Of these results, the authors comment:

‘For all the real-world human face datasets, π-GAN and EG3D generate blurry images for steep angles compared to realistic frontal images. In contrast, SideGAN robustly generates high-quality images irrespective of camera pose.’

Qualitative comparison against EG3D and Pi-GAN on FFHQ.

Below are qualitative results for similar subsequent tests, with SideGAN pitched exclusively against EG3D, this time employing transfer learning:

Results for CelebA-HQ and FFHQ using pretraining via transfer learning.

Here the authors state:

‘For all the datasets, EG3D generates unnatural images for steep angles compared to realistic frontal images. On the other hand, SideGAN robustly generates high-quality images irrespective of camera pose.

‘These results indicate that our method is effective in learning to synthesize high-quality images at all camera poses in both cases with and without transfer learning.’

In examining small patches of recreated profile views, the authors contend that EG3D produces less realistic results, with a greater number of artifacts, than the new system, and that while both systems benefit from transfer learning, EG3D’s pose-sensitive process, which does not contain the extra generalization measures in SideGAN, causes it to be unable to ‘fill in’ or interpret ‘holes’ in the supplied pose when attempting to infer a side-view.

Comparisons of ear reconstruction across EG3D and SideGAN, both with and without transfer learning.

For a quantitative comparison, evaluating image quality and shape quality, the authors generated images from randomly-sampled camera poses (i.e., points of view on the subject), using Fréchet Inception Distance (FID) and depth error as metrics (depth error was estimated by calculating the Mean Squared Error, or MSE, between the generated depth created by the model and the rendered depth from the estimated geometry). These reconstructions were provided, as they were with the prior EG3D model, by the Max Planck Institute’s 2021 DECA project.
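
The depth-error component of this evaluation reduces to a straightforward mean-squared-error computation. In the hedged sketch below, random arrays stand in for the model's depth map and the DECA-derived reference geometry:

```python
import numpy as np

# Depth error as described above: the MSE between the depth map produced
# by the generative model and the depth rendered from externally-estimated
# geometry (DECA, in the paper). The arrays here are random stand-ins.
def depth_mse(model_depth, reference_depth):
    return float(np.mean((model_depth - reference_depth) ** 2))

rng = np.random.default_rng(0)
model_depth = rng.random((64, 64))
reference_depth = model_depth + rng.normal(0, 0.05, (64, 64))  # slight disagreement

print(round(depth_mse(model_depth, reference_depth), 4))  # low, but non-zero
```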

Results from the quantitative comparison.

Here the authors assert:

‘[In] both cases with and without transfer learning, SideGAN outperforms all the other baselines in terms of image [quality] thanks to our effective training method.’

Lastly, the authors conducted quantitative experiments to compare the systems’ varying capacity to produce synthesized images at frontal, steep, and extrapolated angles. This was undertaken with Microsoft’s above-mentioned Face Synthetics dataset, since FID would have required a balance of poses not extant in the popular datasets.

To determine whether SideGAN can really produce better results from in-the-wild data collections (instead of bespoke and highly-curated datasets), the researchers first created an artificially unbalanced subset of the Face Synthetics dataset, to obtain a collection with a more ‘typical’ distribution of poses. Once a model was trained with this skewed dataset, the FID scores were evaluated for EG3D and SideGAN, without the use of transfer learning.
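
One simple way to construct such a skewed subset is to discard steep-angle samples with increasing probability; the keep-probability curve below is purely illustrative, since the paper does not publish its exact resampling scheme:

```python
import random

# Sketch: resample a pose-balanced synthetic collection into an
# artificially unbalanced one, so that its pose distribution resembles a
# 'typical' frontal-heavy in-the-wild dataset.
random.seed(0)

balanced = [(f"face{i}", yaw) for i, yaw in enumerate(range(-90, 91))]  # uniform yaws

def keep_probability(yaw):
    # Keep nearly all frontal views, and progressively fewer steep ones.
    return max(0.05, 1.0 - abs(yaw) / 90.0)

unbalanced = [s for s in balanced if random.random() < keep_probability(s[1])]

frontal = sum(abs(y) <= 30 for _, y in unbalanced)
steep = sum(abs(y) > 60 for _, y in unbalanced)
print(frontal > steep)  # the subset is now skewed toward frontal poses
```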

FID results relating to the extremity of the camera angle (i.e., how near the subject is to a complete 'side view' or profile pose). Though SideGAN is outperformed on front views, this is the lowest-hanging fruit, where any number of similar projects might shine. By contrast, SideGAN outperforms its rival in profile generation. All angles range between -90 and +90 degrees of horizontal facial orientation.

In regard to these results, the authors state:

‘[Our] model performs comparably to EG3D at near-frontal angles, and as the angle gets larger, our model performs significantly better than EG3D, proving the effectiveness of our approach.’

The paper concludes:

‘Our experimental results show that our method can synthesize photo-realistic images irrespective of the camera pose on human and animal face datasets. Especially, even only with pose-imbalanced in-the-wild datasets, our model can generate details of side-view images such as ears, unlike blurry images from the baselines.’

Conclusion

We have noted before that creating authentic side-views is a data problem, not a mathematical challenge or some type of coding roadblock, and have indicated that those voices on the internet who believe that an easy fix is imminent do not yet understand this.

Thus the central challenge, in the absence of an unexpected avalanche of profile data, is to ‘do more with less’, and to use such data as exists in the most efficient and imaginative way possible.

One potential road forward is the development of pretrained networks that specialize in profile views, whether synthetic or gathered arbitrarily from the small representations of these viewpoints across the popular collections.

Another might be to develop loss functions that explicitly map relationships between side-view data and front-view data, so that any arbitrary configuration of ‘side of head’ pixels would be associated with a specific and limited range of probabilities as to what the equivalent side view would look like. Since side-data foreshortening is so severe in a passport-style picture, this is not an easy prospect.

Regarding SideGAN, we have to note that despite the use of celebrity-laden datasets, very well-known faces were not selected for the samples provided, which may indicate that the resulting inferences are more ‘plausible’ than ‘accurate’. That said, this is one of the hardest hurdles in neural facial synthesis, and no ‘obvious’ or sudden solutions seem likely to emerge in the near future.
