Using Generative Adversarial Networks to Rethink the Selfie

Images taken from the source paper at https://arxiv.org/pdf/2406.12700

About the author


Martin Anderson

I'm Martin Anderson, a writer occupied exclusively with machine learning, artificial intelligence, big data, and closely-related topics, with an emphasis on image synthesis, computer vision, and NLP.


In recent years, frameworks using Generative Adversarial Networks (GANs) have been eclipsed by subsequent synthesis approaches such as Neural Radiance Fields (NeRF), Latent Diffusion Models (LDMs), and, more recently, Gaussian Splatting.

It was therefore nice to come across a new paper last week, from Samsung Research, that leverages GANs while also making use of some more recent techniques and adjunct technologies.

The new approach, titled SUPER, is a prototype for a selfie-editing architecture that uses GANs, in combination with other techniques, to allow a user to adjust a self-portrait mobile phone image after it has been taken, so that aberrations such as ‘disappearing ears’ or oversized noses – caused by the subject's close proximity and the phone's wide lens – can be compensated for.

The transformations that SUPER enables can, additionally, make a selfie appear as though it had been taken casually by another person:

From the new paper, selfie pictures are transformed into new angles that don't smack of vanity. Source: https://arxiv.org/pdf/2406.12700

This is accomplished with a mixture of techniques, including reconstruction of the obtained face via a 3D GAN, inference of a depth map, and subsequent transformation of the depth map into a CGI mesh – as well as a method to estimate which areas of the image are occluded, so that this can be compensated for.

Various elements that constitute the interpretive powers of SUPER.

Though the system is a proof of concept, and would clearly need to operate within the (admittedly growing) AI hardware capabilities of mobile devices, the authors establish in tests that it is broadly superior, qualitatively and quantitatively, to prior similar methods.

It’s refreshing to see GANs return from relative obscurity and play to their traditional strength – the synthesis and manipulation of faces.

What’s additionally refreshing about the work is that the researchers generated and used their own data for the examples in the paper, in a research climate where papers tend to be strewn with unauthorized celebrity examples – though, admittedly, the unique capture requirements of this proof-of-concept system (which we will take a look at) made this kind of original data necessary.

The paper is titled SUPER: Selfie Undistortion and Head Pose Editing with Identity Preservation, and comes from seven researchers at Samsung Research.

Method

The beginning of the workflow for SUPER is to extract a 3D Morphable Model (3DMM) CGI representation of the face, which is handled by the 2020 Microsoft project Accurate 3D Face Reconstruction with Weakly-Supervised Learning (Deep3DFaceRecon).

From the original 2020 Microsoft paper 'Accurate 3D Face Reconstruction with Weakly-Supervised Learning', examples of the 3D inference from static images, CGI representations bolstered from interpreted facial alignment mappings (bottom left). Source: https://github.com/Microsoft/Deep3DFaceReconstruction

Deep3DFaceRecon provides essential camera variables, such as rotation, translation and lens focal length, inferred from the original image.
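
As a concrete (and purely illustrative) Python sketch, pose parameters of this kind can be assembled into conventional camera extrinsics and intrinsics for downstream rendering. The function and argument names below are hypothetical, not Deep3DFaceRecon's actual API, and the rotation convention is an assumption:

    import numpy as np

    def build_camera(rotation_euler, translation, focal_length, image_size=512):
        # Assemble a pinhole camera from pose parameters of the kind a 3DMM
        # fitter predicts. Names and conventions here are illustrative only.
        pitch, yaw, roll = rotation_euler  # rotation angles in radians

        # Rotation matrices about the x, y and z axes, composed into one rotation.
        Rx = np.array([[1, 0, 0],
                       [0, np.cos(pitch), -np.sin(pitch)],
                       [0, np.sin(pitch),  np.cos(pitch)]])
        Ry = np.array([[ np.cos(yaw), 0, np.sin(yaw)],
                       [0, 1, 0],
                       [-np.sin(yaw), 0, np.cos(yaw)]])
        Rz = np.array([[np.cos(roll), -np.sin(roll), 0],
                       [np.sin(roll),  np.cos(roll), 0],
                       [0, 0, 1]])
        R = Rz @ Ry @ Rx

        # World-to-camera extrinsics: rotation plus translation in a 4x4 matrix.
        extrinsics = np.eye(4)
        extrinsics[:3, :3] = R
        extrinsics[:3, 3] = translation

        # Simple pinhole intrinsics with the principal point at the image centre.
        intrinsics = np.array([[focal_length, 0, image_size / 2],
                               [0, focal_length, image_size / 2],
                               [0, 0, 1]])
        return extrinsics, intrinsics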

Though a notable previous approach called DisCO (which we covered when it came out) used a fairly exhaustive and resource-intensive method to achieve GAN inversion (the projection of an image into the GAN's own latent space, where it can be manipulated further), the authors instead use a lighter triplane network encoder called TriPlaneNet – and they note that this method easily separates the obtained geometry from camera effects, which is a critical function for the process.

Examples of novel view synthesis from TriPlaneNet, which is incorporated into the workflow for SUPER. Source: https://arxiv.org/pdf/2303.13497

With TriPlaneNet handling the inference of an initial face latent code, and Deep3DFaceRecon providing the camera parameters, both sets of variables are then optimized with two loss functions: Learned Perceptual Image Patch Similarity (LPIPS), and an L2 loss over facial landmarks estimated by Google’s popular MediaPipe FaceMesh-V2 framework.

The latent code and camera parameters are optimized at a learning rate of 0.0001 for 200 iterations.
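
A minimal Python sketch of what such a per-image refinement loop could look like is shown below. It assumes `generator` is an EG3D-style renderer that produces an image from a latent code and camera parameters, and that `predict_landmarks` is a differentiable stand-in for the MediaPipe FaceMesh-V2 landmark estimator (which is not itself differentiable); both are placeholders, and the loss weighting is not the paper's:

    import torch
    import lpips  # pip install lpips

    def refine(generator, predict_landmarks, source_img, source_landmarks,
               latent_init, camera_init, steps=200, lr=1e-4):
        # Jointly refine the face latent code and camera parameters so that the
        # rendered face matches the source selfie. `generator` and
        # `predict_landmarks` are placeholders (see note above).
        latent = latent_init.clone().requires_grad_(True)
        camera = camera_init.clone().requires_grad_(True)
        perceptual = lpips.LPIPS(net='vgg')          # LPIPS perceptual loss
        optimizer = torch.optim.Adam([latent, camera], lr=lr)

        for _ in range(steps):
            rendered = generator(latent, camera)     # (1, 3, H, W), values in [-1, 1]
            loss_lpips = perceptual(rendered, source_img).mean()
            loss_landmarks = torch.mean((predict_landmarks(rendered) - source_landmarks) ** 2)
            loss = loss_lpips + loss_landmarks       # loss weighting omitted for brevity
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        return latent.detach(), camera.detach()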

Conceptual schema for SUPER.

With the requisite elements now at hand, the task of generating a novel synthesized view is passed to the Stanford/NVIDIA 2022 3D GAN EG3D.

An overview of the functionality provided by the 2022 Stanford + NVIDIA offering EG3D. Source: https://www.youtube.com/watch?v=2SGhKAX6x4g

EG3D subsequently outputs a generated image and a depth map, which facilitate the creation of base 3D X/Y/Z coordinates for the optimized face (see the lower-left section of the schema image a few paragraphs above).

The 3D coordinates are generated from the depth map provided by EG3D. Connecting adjacent vertices in the map yields a coarse mesh (center of the earlier schema image above), which is then refined with bilateral blur smoothing, to prevent sharp angles from appearing in the final rendered image.
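
A simplified sketch of this depth-to-mesh step is given below: the depth map is back-projected through a pinhole camera, and neighbouring pixels are joined into triangles. The smoothing and refinement stages are omitted, and the function is illustrative rather than the authors' implementation:

    import numpy as np

    def depth_to_mesh(depth, focal, cx, cy):
        # Back-project a depth map into camera-space 3D points and connect
        # neighbouring pixels into triangles, yielding a coarse face mesh.
        h, w = depth.shape
        u, v = np.meshgrid(np.arange(w), np.arange(h))

        # Pinhole back-projection: x = (u - cx) * z / f, y = (v - cy) * z / f.
        z = depth
        x = (u - cx) * z / focal
        y = (v - cy) * z / focal
        vertices = np.stack([x, y, z], axis=-1).reshape(-1, 3)

        # Two triangles per pixel quad, indexing into the flattened vertex grid.
        idx = np.arange(h * w).reshape(h, w)
        tl, tr = idx[:-1, :-1].ravel(), idx[:-1, 1:].ravel()
        bl, br = idx[1:, :-1].ravel(), idx[1:, 1:].ravel()
        faces = np.concatenate([np.stack([tl, bl, tr], axis=1),
                                np.stack([tr, bl, br], axis=1)])
        return vertices, faces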

The enhanced mesh is then projected into the original source camera pose in order to obtain texture coordinates, and the texture is then resampled for novel view synthesis.
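
The projection into the source camera can be sketched as follows, re-using the hypothetical camera convention from the earlier build_camera() example; again, this illustrates the general operation rather than the paper's code:

    import numpy as np

    def texture_coordinates(vertices, extrinsics, intrinsics, image_size):
        # Project mesh vertices into the original (source) camera and normalise
        # the resulting pixel positions to [0, 1] so they can serve as UVs.
        homogeneous = np.concatenate([vertices, np.ones((len(vertices), 1))], axis=1)
        camera_space = (extrinsics @ homogeneous.T).T[:, :3]
        projected = (intrinsics @ camera_space.T).T
        uv = projected[:, :2] / projected[:, 2:3]    # perspective divide -> pixel coordinates
        return uv / image_size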

An image of the subject’s face is then rendered with a refined EG3D depth-map. This is a ‘warped’ image, wherein the original data passed through to this stage is essentially made plastic and deformed into the approximation of a novel viewpoint.

This illustrates the essential difference between a 2D and a 3D GAN: warping is a staple of image and video manipulation, and has long since been automated in CGI-based pipelines; a 3D GAN, by contrast, has a genuine understanding of 3D space and an innate vision of the entirety of the passed-through identity, and can view it from alternate angles.

However, the priority at this stage is the retention of identity, as well as of other qualities from the original (real-world) source image. Therefore two representations, the warped image and an image generated by the 3D GAN operating in the workflow, are blended with a blurred mask, via a three-level Laplacian pyramid.
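
A generic three-level Laplacian pyramid blend of this kind might look as follows, assuming the warped and generated images are same-sized uint8 arrays and the mask has already been blurred and scaled to [0, 1]; the exact masking strategy used in SUPER is more involved:

    import cv2
    import numpy as np

    def laplacian_blend(warped, generated, mask, levels=3):
        # Blend the mesh-warped image with the GAN-generated image under a
        # blurred mask, using a three-level Laplacian pyramid.
        if mask.ndim == 2:                           # broadcast a single-channel mask over RGB
            mask = np.repeat(mask[:, :, None], 3, axis=2)

        # Gaussian pyramids of both images and of the (already blurred, [0, 1]) mask.
        gp_a = [warped.astype(np.float32)]
        gp_b = [generated.astype(np.float32)]
        gp_m = [mask.astype(np.float32)]
        for _ in range(levels):
            gp_a.append(cv2.pyrDown(gp_a[-1]))
            gp_b.append(cv2.pyrDown(gp_b[-1]))
            gp_m.append(cv2.pyrDown(gp_m[-1]))

        # Laplacian pyramids: each level minus its upsampled parent.
        lp_a, lp_b = [gp_a[-1]], [gp_b[-1]]
        for i in range(levels, 0, -1):
            size = (gp_a[i - 1].shape[1], gp_a[i - 1].shape[0])
            lp_a.append(gp_a[i - 1] - cv2.pyrUp(gp_a[i], dstsize=size))
            lp_b.append(gp_b[i - 1] - cv2.pyrUp(gp_b[i], dstsize=size))

        # Blend each level with the corresponding mask level, then collapse the pyramid.
        blended = [m * a + (1 - m) * b for a, b, m in zip(lp_a, lp_b, gp_m[::-1])]
        out = blended[0]
        for i in range(1, levels + 1):
            size = (blended[i].shape[1], blended[i].shape[0])
            out = cv2.pyrUp(out, dstsize=size) + blended[i]
        return np.clip(out, 0, 255).astype(np.uint8)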

Data and Tests

For testing purposes, the researchers used two datasets: Caltech Multi-Distance Portraits (CMDP) and In-the-wild images (ITWI), from the aforementioned DisCO project.

Original and warped samples from Caltech Multi-Distance Portraits (CMDP), used as one of the baselines in testing the new system. Source: https://gfx.cs.princeton.edu/pubs/Fried_2016_PMO/fried2016-portraits.pdf

CMDP is a small collection featuring 53 people with diverse facial characteristics, each photographed at seven distinct distances. ITWI is composed of very distorted portrait photos, and the authors chose to use only those examples sourced from the open-license Unsplash repository. These were used as qualitative fodder, since ground truth is not applicable at many stages of an approach of this nature.

Additionally, the authors generated and curated their own dataset, titled Head Rotation (HeRo), which contains portrait images of 19 subjects with varying facial attributes, such as facial hair and glasses, and a range of facial expressions.

These images were captured with four Samsung Galaxy S23 FE phones arrayed on a rig and synchronized so that all exposures were taken simultaneously.

Left, the capture rig with the Samsung smartphones; right, examples of the varied simultaneous exposures.

Metrics used were Peak Signal-to-Noise Ratio (PSNR); Structural Similarity Index (SSIM); the aforementioned LPIPS; and an ID score based on the cosine distance between embeddings of the predicted and source face images.
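
For reference, the sketch below shows how these four metrics are commonly computed; `embed_face` stands in for an unspecified face-recognition embedder (the paper's exact choice is not detailed here), and the ID score is expressed as cosine similarity between embeddings:

    import numpy as np
    import torch
    import lpips
    from skimage.metrics import peak_signal_noise_ratio, structural_similarity

    lpips_fn = lpips.LPIPS(net='alex')

    def evaluate(pred, target, embed_face):
        # `pred` and `target` are uint8 RGB arrays of identical size;
        # `embed_face` is a placeholder face-recognition embedder.
        psnr = peak_signal_noise_ratio(target, pred, data_range=255)
        ssim = structural_similarity(target, pred, channel_axis=2, data_range=255)

        # LPIPS expects NCHW tensors scaled to [-1, 1].
        to_tensor = lambda im: torch.from_numpy(im).permute(2, 0, 1)[None].float() / 127.5 - 1.0
        lp = lpips_fn(to_tensor(pred), to_tensor(target)).item()

        # ID score: cosine similarity between face-recognition embeddings.
        e_p, e_t = embed_face(pred), embed_face(target)
        id_score = float(np.dot(e_p, e_t) / (np.linalg.norm(e_p) * np.linalg.norm(e_t)))
        return {'PSNR': psnr, 'SSIM': ssim, 'LPIPS': lp, 'ID': id_score}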

Rival frameworks tested were Perspective-aware Manipulation of Portrait Photos (‘Fried’s’ in test results); 3D Photography using Context-aware Layered Depth Inpainting (‘3DP’); High-fidelity 3D GAN Inversion by Pseudo-multi-view Optimization (‘HFGI3D’); 3D GAN Inversion with Pose Optimization (‘Ko’s’); the aforementioned TriPlaneNet; and DisCO.

Tested quantitatively, the results are depicted in the table below:

Metric-based quantitative results for the framework comparison on SUPER.

Regarding the results for the initial quantitative round pictured above, the paper states:

‘Evidently, our method notably outperforms others in terms of all metrics, especially in identity preservation. Due to the sampling of pixels from the original image, SUPER keeps crucial details of identity, such as eye color, wrinkles, earrings, etc.’

Next came the qualitative round:

Results for the qualitative round against rival frameworks. Please refer to the original source paper for better resolution (https://arxiv.org/pdf/2406.12700).

Here the authors comment*:

‘The portraits corrected through a variety of approaches, including SUPER, are depicted in [image above]. [Fried’s warping does] not seem to have a major effect on inputs. In contrast 3DP introduces noticeable changes, while amplifying distortions, so that the middle part of a face exhibits less distortion, but the head and chin are malformed.

‘Generative TriPlaneNet and DisCO make [faces] look different. HFGI3D produces recognizable portraits, yet oversmoothed and featuring visual artifacts. Our method generates faces with fewer perspective distortions while maintaining identity [as seen in image below], especially in cases when the camera changes notably and the inpainted regions are relatively large.’

Additional qualitative comparison, with In-the-wild Images.

In ablation tests, the authors sought to identify the optimal number of iteration steps for the process, noting that the PSNR metric peaks beyond 100 iterations. In the end they settled on 200 iterations as a default, and note that DisCO requires the same number.

Next the researchers assessed the head pose correction functionality of SUPER, by attempting to transform front-facing reference (i.e., source, or real) photos into views corresponding to left, right, or top positions. Quantitative results across the metric array are depicted in the table below:

Quantitative comparison for results on the head pose correction tests.

We can see in the above results that HFGI3D and PTI (Pivotal Tuning Inversion) were able to achieve higher scores in SSIM and PSNR, respectively, but that SUPER otherwise dominates. Below are qualitative examples from this round:

Qualitative results from the head pose correction tests on the authors' own HeRo dataset.

The authors conclude:

‘Our approach enriches 3D warping with the flexibility and expressiveness of a 3D generative model. The resulting image is a blend of a warped image obtained through mesh-based rendering, and another image produced with a 3D GAN.

‘Experiments on face undistortion benchmarks and our novel Head Rotation dataset proved that SUPER provides more realistic results with finer details and better preserves identity compared to existing techniques, and hence establishes a new state-of-the-art in face undistortion and head pose correction tasks.’

Conclusion

SUPER offers desirable functionality in the form of a truly AI-enhanced ‘filter’ for personal images – but, assuming it can mature into a performant system, at what cost? While there is a good deal of research going on into optimizing GANs so that they can run on mobile devices, the more computing heft and data volume a system requires, the more remote that prospect becomes.

Neither is on-device execution necessarily the most appealing method of provision for investors: a more realistic approach for a SUPER-style framework, and one which the public may grow increasingly weary of over the next 12-18 months (but which excites shareholders in generative AI companies), is to make such a system API-based, so that photos transit to a black-box model on corporate servers, and the results are fed back over the network.

Currently, the gen-AI superstar following this model is Luma, a predictably censored but otherwise fairly capable multimodal video generation tool, which – at least for the moment – offers 150 five-second text- or image-to-video generations for around $30 a month, 430 for $80 a month, and 2,030 for $400 a month (with higher rates in all cases for monthly rather than annual billing).

Therefore, unless systems of this type can adapt themselves natively to local execution on at least iPhone-level hardware, and the suppliers content themselves with the lower revenue that ‘outright-sale’ products provide, it seems that gen-AI’s march towards a multitude of rental models is set to highlight the tipping point between consumer demand and long-term customer retention.

* Repetitive citations omitted.
