Solving the ‘Profile View’ Crisis in Facial Image Synthesis

About the author

Martin Anderson

I'm Martin Anderson, a writer occupied exclusively with machine learning, artificial intelligence, big data, and closely-related topics, with an emphasis on image synthesis, computer vision, and NLP.

In 2022, we highlighted a critical shortcoming of neural facial synthesis and deepfake models – that they are not very good at recreating profile views of people.

Click to play. Failure cases in our tests with DeepFaceLive, whose facial deepfakes collapse catastrophically at profile angles. Source: https://blog.metaphysic.ai/to-uncover-a-deepfake-video-call-ask-the-caller-to-turn-sideways/

In the intense discussion that followed this study, and given that the advent of Stable Diffusion in that very month was set to astound the world, many people assumed that this weakness would quickly be overcome by the coming series of massive leaps in the state of the art in this field.

Eighteen months later, this has not proved to be the case, for exactly the reason that we originally outlined: any model that trains on human faces will train best on the most frequent types of facial pose featured in the data. That means frontal or near-frontal poses, leaving ‘extreme’ or profile facial poses severely underrepresented, for cultural and practical reasons that we looked into in that article.

We suggested at the time that the problem is in the data (or lack of it), and that therefore no amount of AI ingenuity can resolve the issue unless the quantity of extreme facial profile data is massively increased.

Considered Attention

Arguably, at least in some commonly-used image or video datasets, there is actually enough profile data to address this issue. The trouble is that these oblique views represent such a small portion of the total data that, even if included, they don’t get enough attention during training to obtain high detail in the final model.

Additionally, data architects need to consider whether they want to siphon valuable training attention away from more commonly-desired frontal view capabilities, given that side-views are often needed only momentarily, for instance when a person turns away for a moment at the end of a conversation.

Do we want to obtain models that are X% less effective at frontal views, in favor of being able to more realistically recreate side-views on demand?

Or do we want to be able to create additional models that can pinch-hit in moments in a video where extreme views (such as looking acutely sideways or acutely upwards) are required?

To boot, can we be certain that changing the distribution of frontal/profile face poses would even have an adverse effect on the model’s performance? To date, there has not been enough evenly-distributed data to definitively prove this.

Because of this lack of data, these have remained moot points to date; but it seems that there are no easy data choices to facilitate the creation of an agile, effective, profile-capable AI model.

Extreme Prejudice

Towards remedying the issue, a new dataset has been created by researchers from Vietnam, which curates over 450,000 of these hitherto-unloved facial poses into a new collection, titled the ExtremePose-Face-HQ dataset (EFHQ).

Tested against a variety of synthesizing frameworks, including Generative Adversarial Networks (GANs) and Stable Diffusion, the addition of this new data has, according to the authors’ tests, facilitated a noticeable improvement in the rendering of extreme face angles.

Generative systems trained with FFHQ (top row), FFHQ and LPFF (middle row) and FFHQ and the new dataset (bottom row), with clear improvement of accuracy using the new data. Source: https://bomcon123456.github.io/efhq/

The new collection is far larger than the only prior equivalent project, the Large-Pose Flickr Face (LPFF) initiative, which contains 19,000 images, in comparison to the 450,000 of the new work. The authors have additionally devised a novel and exhaustive curation pipeline, complete with a custom GUI, to handle exceptions and edge cases regarding pictures that should be included in the dataset, and to ‘rescue’ images which are generally not handled well by the curation methods of the contributing datasets.

Regarding the LPFF project, the authors of the new work state:

‘[LPFF] relies on images crawled from Flickr, thus having limited size (19k images) and missing identity information. Hence, its applications are restricted, covering only several face-generation tasks, unlike our large-scale and multi-purpose dataset.’

A concatenation of the example clips from the project site, demonstrating the efficacy of EFHQ in comparison to rival collections, across a variety of synthesis systems. Source: https://bomcon123456.github.io/efhq/

The new paper, titled EFHQ: Multi-purpose ExtremePose-Face-HQ dataset, is accompanied by a project site with numerous video examples of EFHQ-augmented video and still renderings, and comes from three researchers from VinAI Research in Hanoi. The project page promises that code (and data) will be forthcoming, though neither is currently available.

Method

Rather than resorting to static image sources such as Flickr, or curating out the small number of extreme poses available in common and massive collections such as ImageNet and LAION, the researchers used apposite frames from two recent facial video datasets: Video Face High Quality (VFHQ) and CelebV-HQ.

From the project site for CelebV-HQ, examples of the kind of high quality data from which the new dataset has been assembled. Source: https://celebv-hq.github.io/

This is an innovative, high-effort and relatively unusual approach. This kind of curation would likely not have been attempted even 7-8 years ago, since the available quality and resolution of video in a comparable collection of that era would have been largely inadequate in comparison to the work of actual photographers (as exemplified in the classic ‘static’ face datasets).

However, the recent surge to 4K and beyond means that a typical grab from a red-carpet interview is likely to be of significantly higher quality than these older sets, and can frequently offer superior detail and resolution.

The extraction pipeline is composed of four stages: defining bounding boxes for the face; applying facial landmarks; estimating an image quality score; and estimating facial identity.

For the bounding boxes, the 2019 RetinaFace and 2021 SynergyNet initiatives were used, and image quality was graded via the HyperIQA system. Since the VFHQ collection already comes with bounding boxes and five initial facial keypoints, these were reused and matched with an implementation of the 1955 Hungarian Method.
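As a rough illustration of how such a four-stage pipeline might be wired together, the sketch below uses hypothetical callables standing in for RetinaFace (detection), SynergyNet (landmarks), HyperIQA (quality scoring) and a face-recognition model (identity); it is not the authors' code.

```python
# Illustrative sketch of a four-stage frame-extraction pipeline; not the authors' code.
# The callables passed in are hypothetical stand-ins for RetinaFace (detection),
# SynergyNet (landmarks), HyperIQA (quality scoring) and a face-recognition model (identity).

from dataclasses import dataclass

@dataclass
class FaceRecord:
    frame_index: int
    bbox: tuple        # (x1, y1, x2, y2) face bounding box
    landmarks: list    # facial keypoints
    quality: float     # image-quality score
    identity: list     # identity embedding vector

def extract_faces(frames, detect, landmark, score_quality, embed, min_quality=0.5):
    """Run detection, landmarking, quality scoring and identity embedding on each frame."""
    records = []
    for i, frame in enumerate(frames):
        for bbox in detect(frame):                   # stage 1: bounding boxes
            lms = landmark(frame, bbox)              # stage 2: facial landmarks
            quality = score_quality(frame, bbox)     # stage 3: image-quality score
            if quality < min_quality:
                continue                             # drop low-quality crops early
            emb = embed(frame, bbox)                 # stage 4: identity embedding
            records.append(FaceRecord(i, bbox, lms, quality, emb))
    return records
```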

The SynergyNet workflow, used as one of the various pose estimators for EFHQ. Source: https://arxiv.org/pdf/2110.09772.pdf

However, in general, the current crop of head pose estimators is not capable of reliably capturing extreme poses, for exactly the same reason that the current crop of facial synthesis systems often fails – the data is lacking.

Therefore the researchers used a panoply of diverse estimators besides the aforementioned 3DMM-based SynergyNet, including the joint landmark/head-pose estimator DirectMHP, and the in-house system FacePoseNet, which uses a simple convolutional neural network (CNN) to directly estimate head poses.

Challenging examples for DirectMHP, one of the contributing pose estimation systems put through its paces for the project's unusually demanding scope. Source: https://arxiv.org/pdf/2302.01110.pdf

Thereafter, candidates are put through a ‘binning’ procedure, which attempts to categorize which kind of extreme pose the image may represent (if any). Cases which are doubtful are moved to the ‘confusing’ bin for further analysis or discarding.
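Conceptually, this binning step amounts to checking how far the various yaw estimates agree, and thresholding the consensus angle. The sketch below is a minimal illustration; the thresholds are assumptions, not values from the paper.

```python
# Illustrative yaw-based pose binning; thresholds are assumptions, not the paper's values.

def bin_pose(yaw_estimates, agree_tol=15.0, extreme_thresh=60.0, frontal_thresh=30.0):
    """Assign a frame to a pose bin given yaw estimates (in degrees) from several estimators."""
    spread = max(yaw_estimates) - min(yaw_estimates)
    if spread > agree_tol:
        return "confusing"          # estimators disagree: send to manual triage via the GUI
    yaw = sum(yaw_estimates) / len(yaw_estimates)
    if abs(yaw) >= extreme_thresh:
        return "extreme_profile"
    if abs(yaw) <= frontal_thresh:
        return "frontal"
    return "intermediate"

print(bin_pose([78.0, 81.5, 75.2]))   # -> 'extreme_profile'
print(bin_pose([10.0, 55.0, 30.0]))   # -> 'confusing'
```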

Conceptual schema for the fitting and selection workflow for EFHQ.

Naturally, the images obtained so far are accompanied by labels, and this is where the graphical user interface devised by the researchers plays a role in the triage process.

The GUI for the assessment process, featuring diverse guesses from the various contributing pose estimators (second column from left).

In the end, the 450,000 frames obtained came from 5,000 individual video clips. Each identity featured includes at least one frontal face of the subject, as well as multiple frames with extreme poses.

The intent of EFHQ is to redress the frontal/extreme imbalance in existing datasets; unless one is creating a dedicated model specifically for extreme angles, it is not meant to be trained into systems by itself.

The entire EFHQ dataset can therefore be used, selectively, to rebalance existing datasets that lack extreme pose data. Simply dumping EFHQ wholesale into an existing dataset could be unhelpful, since a small target dataset may be overwhelmed by the influx of lateral data, producing models that are better at side views than at frontal views.
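One way to picture this kind of selective rebalancing is as weighted sampling over pose bins, as in the hypothetical sketch below; the target profile fraction is an illustrative assumption, not a recipe from the paper.

```python
import random

# Illustrative pose-rebalancing sampler; the profile fraction is an assumption,
# not a recipe from the paper.

def rebalance(frontal_items, profile_items, n_samples, profile_fraction=0.2):
    """Draw a training subset containing a chosen fraction of profile-pose samples."""
    n_profile = min(int(n_samples * profile_fraction), len(profile_items))
    n_frontal = n_samples - n_profile
    subset = random.sample(profile_items, n_profile)         # profile poses, no repeats
    subset += random.choices(frontal_items, k=n_frontal)     # frontal poses, with replacement
    random.shuffle(subset)
    return subset
```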

An example of generative results from the exhaustive extreme poses available in EFHQ.

In the case of the very high-volume VoxCeleb dataset, it was, however, possible to add the entirety of EFHQ to the older collection for testing purposes, bringing the Vox face count from 4 million to 4.5 million – a suitable adjunct body of data to compensate for missing extreme cases in this particular instance. Below is a table featuring the various characteristics of the datasets considered for the project.

A table of key attributes and supported tasks across a range of datasets dealt with in the development of EFHQ.

Data and Tests

GAN

The new dataset was tested across a range of frameworks. For a GAN test, StyleGAN2-ADA models were trained from scratch using the FFHQ dataset in combination with EFHQ, at 1024×1024 resolution. An additional version featuring direct control over frontal vs. profile characteristics was also trained.

The training itself was extremely resource-intensive, requiring a full six days across eight NVIDIA A100 GPUs with 40GB of VRAM each.

The results were evaluated with Fréchet Inception Distance (FID), and the improved Recall metric from the QuatNet project. Conditional and unconditional models were trained, some incorporating the LPFF face dataset.
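For reference, FID compares the mean and covariance of Inception-network activations for real and generated image sets; the short NumPy/SciPy sketch below computes the metric itself, assuming the activations have already been extracted elsewhere.

```python
import numpy as np
from scipy import linalg

def frechet_inception_distance(act_real, act_fake):
    """FID between two sets of Inception activations, each of shape (N, 2048)."""
    mu_r, mu_f = act_real.mean(axis=0), act_fake.mean(axis=0)
    cov_r = np.cov(act_real, rowvar=False)
    cov_f = np.cov(act_fake, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_f, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real      # discard tiny imaginary parts from numerical error
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))
```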

Qualitative results across diverse datasets for StyleGAN2-ADA. Please refer to source paper for better resolution and detail.

The authors comment:

‘The qualitative [results] illustrate that our models generate high-quality frontal faces comparable to the FFHQ model while producing realistic and varied profile faces. Additionally, our method has less noise in extreme poses than the FFHQ+LPFF model.

‘Specifically, [the lower images show] LPFF’s generated face contains more noise and less photorealistic details than our model, such as the pixelated eye and nose patches. Meanwhile, ours achieves greater realism with smoother skin, well-defined features, noise reduction, and natural lighting.’

In quantitative terms, the GAN tests show that the EFHQ-enhanced material achieves competitive FID scores and improved Recall in comparison to other models, as well as enhanced pose diversity.

Quantitative results for the GAN testing round.

EG3D

Next, the authors tested against the Efficient Geometry-aware 3D Generative Adversarial Networks (EG3D) model, which required an even more demanding seven days of training, again across eight A100 GPUs.

In line with the original EG3D approach, the models were evaluated on FID, and also for multi-view consistency and identity preservation, using the ArcFace facial recognition model.

(Identity preservation is a particular concern in this pursuit, since it is otherwise possible for plausible but inaccurate extreme view recreation to occur, and identity fidelity for profile views is notoriously difficult to preserve when the data is scant.)

Pose accuracy and geometry were also evaluated using Mean Squared Error (MSE), against pseudo-labels taken from the Accurate 3D Face Reconstruction initiative. All the models were tested using 1,024 generated images, except for FID, which was calculated on 50,000 images. Backbones were trained on various combinations of FFHQ and LPFF.
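The identity-consistency part of such an evaluation typically reduces to cosine similarity between face-recognition embeddings of different views of the same subject; below is a minimal sketch, assuming ArcFace-style embeddings have already been extracted, rather than the authors' evaluation code.

```python
import numpy as np

def identity_similarity(emb_a, emb_b):
    """Cosine similarity between two face-recognition embeddings (e.g. ArcFace vectors)."""
    a = emb_a / np.linalg.norm(emb_a)
    b = emb_b / np.linalg.norm(emb_b)
    return float(a @ b)

def mean_identity_consistency(frontal_embs, rendered_embs):
    """Average frontal-vs-rendered similarity across paired views of the same identity."""
    return float(np.mean([identity_similarity(f, r)
                          for f, r in zip(frontal_embs, rendered_embs)]))
```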

Comparison across multiview samples for EG3D with varying datasets.

The researchers contend that the qualitative results (image above) indicate superior face shapes in extreme poses in comparison to models trained on the other datasets. In quantitative results (image below), performance appears comparable, without any degradation in the quality of frontal views.

Quantitative results for the EG3D tests.

The authors comment*:

‘Regarding identity consistency, we see a significant drop when the reference dataset contains profile-view images, due to the weak confidence of the pretrained face recognition model when handling profile-view images, as confirmed by its low average cosine similarity of around 0.6 on the simple frontal-to-profile verification set [CFP-FP].

‘Furthermore, our model also outperforms in pose accuracy and geometry quality with both frontal and cross-pose dataset.’

Stable Diffusion

Next, the authors tested the efficacy of EFHQ on Stable Diffusion, using the popular V1.5 model together with ControlNet, to see whether extreme facial view quality could be enhanced with EFHQ data.

This time training (fine-tuning) took two days on a single A100 GPU, using a learning rate of 1e-5 and the AdamW optimizer. The metric used was Normalized Mean Error (NME), with the prompt ‘a profile portrait image of a person’. Additional positive and negative text prompts were used, as is common practice with Stable Diffusion.

LPFF and EFHQ were the reference datasets, with FID the standard metric. Conditional images (image-to-image) were used, and particular attention was paid to the quality of gaze direction in the results. Naturally, the quality of facial synthesis was also considered.
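For orientation, inference with a ControlNet-conditioned Stable Diffusion 1.5 model is commonly run through the Hugging Face Diffusers library, roughly as sketched below. The ControlNet checkpoint path is a placeholder, since the authors' weights have not been released, and this is not their training code.

```python
# Rough Diffusers-based inference sketch for a ControlNet-conditioned SD 1.5 model.
# 'path/to/efhq-finetuned-controlnet' is a placeholder; the authors' weights are not released.
import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained(
    "path/to/efhq-finetuned-controlnet", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

condition = load_image("profile_landmarks.png")   # conditioning image (e.g. a landmark map)
result = pipe(
    prompt="a profile portrait image of a person",
    negative_prompt="blurry, deformed, low quality",
    image=condition,
    num_inference_steps=30,
).images[0]
result.save("profile_sample.png")
```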

On the left, profile views generated with ControlNet, and on the right, results from the post-trained ControlNet using the new EFHQ data.

Of the qualitative results, the researchers state:

‘Our fine-tuned model adeptly conditions diverse extreme poses, generating high-quality results with precise details while maintaining fidelity to the condition, such as mouth shape, and exhibiting minimal artifacts. These patterns indicate the superiority of the fine-tuned model over the baseline model in handling extreme viewing angles.’

The authors further note that their improved model offers superior eye gaze rendering. In quantitative terms, FID was applied generally and also specifically for the eye regions, in addition to NME:

Quantitative results for training on Stable Diffusion.

Here the paper comments:

‘Quantitative [results] consistently favor our fine-tuned model across all metrics. FID scores on full and eye patches exhibit significant enhancement, and NME performance remains stable on training and reference datasets (EFHQ and LPFF).

‘These findings showcase the fine-tuned model’s capability to generate high-quality, geometrically consistent faces across diverse poses.’

Face Reenactment

To test the extent to which the new data can improve facial reenactment, the authors used the popular Thin-Plate Spline Motion Model (TPS) and the Latent Image Animator (LIA), applying the same verification techniques originally devised for the First Order Motion Model (FOMM) project.

For this, the datasets VoxCeleb1 and VoxCeleb2 were augmented with EFHQ data, and the models were trained across two A100 GPUs, with hyperparameters conforming to the respective original papers. Since copyright issues prevented obtaining the entirety of the original data, the models were trained from scratch on the available remaining data, rather than fine-tuned. Please refer to the supplementary section of the new paper for extensive details of the settings used for each of the frameworks, which we do not have space to cover here.

The three metrics used were Average Keypoint Distance (AKD), which assesses the mean distance between facial landmarks; Average Euclidean Distance (AED), which measures identity preservation by computing the L2 distance between identity embeddings; and L1 distance.
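These metrics are simple to state in code; the sketch below assumes that landmarks and identity embeddings have already been extracted for the generated, driving and source frames, and is illustrative rather than the authors' implementation.

```python
import numpy as np

def average_keypoint_distance(kps_generated, kps_driving):
    """AKD: mean Euclidean distance between corresponding facial landmarks."""
    diffs = np.linalg.norm(np.asarray(kps_generated) - np.asarray(kps_driving), axis=-1)
    return float(diffs.mean())

def average_euclidean_distance(id_generated, id_source):
    """AED: L2 distance between identity embeddings of the generated and source faces."""
    return float(np.linalg.norm(np.asarray(id_generated) - np.asarray(id_source)))

def l1_reconstruction(frame_generated, frame_ground_truth):
    """Mean absolute pixel difference between generated and ground-truth frames."""
    return float(np.abs(np.asarray(frame_generated, dtype=np.float32) -
                        np.asarray(frame_ground_truth, dtype=np.float32)).mean())
```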

Qualitative comparisons for frontal/profile facial reenactment across the tested frameworks.

Of the qualitative results for this round of tests, the paper asserts:

‘[Qualitative] results prove that EFHQ-trained models excel in transferring facial expressions and motion across diverse poses. Models trained on EFHQ showcase improved image quality, reduced artifacts, and well-preserved shapes of driving faces.

‘These results highlight our models’ effectiveness in capturing and reproducing facial motions across varying poses while enhancing fidelity to the original facial features.’

In quantitative tests, the EFHQ-enhanced models proved able to obtain notable improvements, offering at least comparable results across all metrics, despite the apparent ‘overcrowding’ of the source data to accommodate the new extreme angles.

Quantitative results for the EFHQ-enhanced reenactment tests.

Face Verification

The authors then tested whether face recognition itself could be maintained under the additional burden of EFHQ data. The two systems selected for testing were ArcFace and AdaFace, with diverse backbones (ResNet variants 18, 50 and 100), and an assortment of datasets (including MS1MV [ArcFace], Glint360K [InsightFace], and WebFace4M).

Prior to assessment, the authors first amended the common distribution of Cross-Pose LFW (CPLFW) to remove misaligned images. The authors report that after this, the models ‘performed well’ on the curated set, with only small gaps between ArcFace and AdaFace.

The quantitative results below denote face-to-face (f2f), face-to-profile (f2p) and profile-to-profile (p2p) comparisons.
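Conceptually, each of these pose buckets is scored in the same way: embedding pairs are compared by cosine similarity against a decision threshold. The sketch below is a minimal illustration; the threshold value is an assumption, and would in practice be tuned on a validation split.

```python
import numpy as np

def verify_pairs(pairs, threshold=0.3):
    """Verification accuracy over (embedding_a, embedding_b, same_identity) triples."""
    correct = 0
    for emb_a, emb_b, same_identity in pairs:
        a = emb_a / np.linalg.norm(emb_a)
        b = emb_b / np.linalg.norm(emb_b)
        predicted_same = float(a @ b) >= threshold     # cosine similarity vs. threshold
        correct += int(predicted_same == same_identity)
    return correct / len(pairs)

def per_bucket_accuracy(buckets):
    """Accuracy per pose bucket, e.g. {'f2f': [...], 'f2p': [...], 'p2p': [...]}."""
    return {name: verify_pairs(pairs) for name, pairs in buckets.items()}
```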

Quantitative results for face verification tests. Lowest benchmark scores highlighted in bold.

Regarding this, the authors conclude:

‘In summary, pose mismatch brings significant challenges for face recognition, necessitating larger models trained on diverse data covering profile faces.’

Finally, a user survey was conducted, comparing the outputs of the aforementioned models (see supplementary material in source paper for criteria for this):

Results from the user survey.

Regarding the survey results, the authors comment:

‘In all human evaluation [results], our models outperform previous ones, notably on identity and motion test of face reenactment models (74.80% and 64.22%, respectively).’

Conclusion

One of the most interesting outcomes of introducing extreme poses at scale into standard datasets is the minimal extent to which the extra data affects the ability of most models to render frontal poses. This suggests, going forward, that a typical latent space is capable of assimilating this useful data without sacrificing synthesis quality for the more common tasks to which such trained models are put, i.e., the rendering of frontal and near-frontal faces.

From the point of view of a VFX pipeline, if this is truly the case, adjunct or specialized models for extreme views may not be necessary. This would be good news, since the transition points between standard and specialized models would otherwise have to be negotiated in some way, in terms of output, to avoid potentially jarring hand-offs between models.

From a security point of view, however, the prospect of extreme-angle-capable models opens up the possibility for scammers to survive the growing number of security checks in face-based video conferencing and identification systems that have latched onto ‘the profile gap’ as a shortcut for assessing the authenticity of a person in a video call.

* My conversion of the authors’ inline citations to hyperlinks.
