Creating Hyperscale Face Datasets via ControlNet and Stable Diffusion

About the author

Martin Anderson

I'm Martin Anderson, a writer occupied exclusively with machine learning, artificial intelligence, big data, and closely-related topics, with an emphasis on image synthesis, computer vision, and NLP.

A new academic collaboration between two British universities has produced a novel yet strangely homogeneous high-scale dataset of 250,000 faces, using the ControlNet guide architecture for Stable Diffusion.

The depth-mapping functionality in ControlNet is used to produce a diversity of faces from a single face shape, complete with multiple lens projections (i.e., how 'wide-angle' the picture is, and how distorted the face depicted might be, based on the lens' focal length). Source: https://arxiv.org/pdf/2307.13639.pdf

Titled SynthFace, the ‘seeds’ for the varied faces in the dataset are provided by 3D Morphable Models (3DMMs), which offer flexible parametric faces, and an accessible intra-domain instrumentality that’s rarely found in purely neural architectures.

The conceptual architecture for the SynthFace generation pipeline. The 3DMM model (far left) produces face images from which features are derived, and a depth map sent to Stable Diffusion. An iterative and automated evolving prompt process produces multiple identities from the same depth map, and multiple depth maps are used to keep the collection variegated and diverse.

Since the face-generation pipeline is powered by such easily controllable techniques, it is straightforward to generate not only a diverse range of faces, but also a diverse range of focal lengths.

Freely interpreted by the author from examples in the new paper, we see that the focal length of the viewpoint can vary while retaining the integrity and consistency of the style of the dataset.

Being able to produce identities at varying focal lengths is, in general, a boon to the usability of a face dataset, since identity information is strongly bound up with focal length.

The distorting effect of varying focal lengths on identity qualities in the human face. Source: https://www.danvojtech.cz/blog/2016/07/amazing-how-focal-length-affect-shape-of-the-face/
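
The geometry behind this effect can be sketched with a simple pinhole-camera model. The function below is an illustration by this author, not from the paper; all numeric values are rough, invented approximations:

```python
# Illustrative pinhole-camera sketch of the focal-length effect described
# above. All numeric values are invented approximations, not paper data.

def nose_magnification(focal_mm, framing_const=10.0, nose_depth_mm=25.0):
    """Ratio of projection scale at the nose plane vs. the ear plane,
    assuming the camera distance scales with focal length so that the
    head occupies a constant fraction of the frame."""
    cam_z = framing_const * focal_mm        # camera-to-ear-plane distance (mm)
    return cam_z / (cam_z - nose_depth_mm)  # scale ratio, nose plane / ear plane

# A 24mm wide-angle lens exaggerates the nose far more than a 200mm telephoto:
wide, tele = nose_magnification(24), nose_magnification(200)
```

Under these toy numbers the wide-angle framing magnifies the nose plane by roughly 12%, while the telephoto distortion is barely over 1% – which is why focal length is so entangled with apparent identity.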

In addition to the new dataset, which – the authors contend – is the highest-volume set of its kind that features actual photorealistic faces, the researchers offer a new recognition framework called ControlFace, designed to be implemented in a wide range of possible applications beyond mere image synthesis, including uses in security and medical imaging.

Running inference on ControlFace.

The new paper is titled Fake It Without Making It: Conditioned Face Generation for Accurate 3D Face Shape Estimation, and comes from four researchers across the University of York and the University of Leeds.

Approach

The paper’s title is a play on the 2021 Microsoft paper Fake it till you make it: face analysis in the wild using synthetic data alone, which used purely (and evidently) synthetic, CGI-based data to provide a comprehensive face dataset of 100,000 facial images at 512x512px.

Regarding this prior work, the new paper’s authors comment:

‘This leads to robust performance but there remains a large domain gap; the images are not photorealistic, the process requires crafted assets, and it is computationally expensive. They propose to ‘fake it till you make it’ with crafted ‘fake’ data enabling them to ‘make it’ with strong performance in the real world.

‘We ‘fake’ it without having to make any assets at all.’

The domain gap issue is critical: datasets with real images may have copyright provenance issues (if not now, then in the future), be damagingly repetitive, and may use a variety of non-annotated focal lengths, which will make consistent semantic training difficult.

On the other hand, as we can see in the video embedded below, which demonstrates Microsoft’s Fake It Till You Make It project, the manifestly synthetic data used to train such systems does not resemble the real-world data on which the models are ultimately intended to be deployed; the models would logically perform much better if their training data more closely resembled the target data.

The SynthFace dataset consists of 250,000 photorealistic faces utilizing 10,000 distinct facial shapes. These 512x512px images were created in thirty hours on 12 NVIDIA 1080 GPUs – a model that is considered ‘mid-range’ in gaming (and certainly in machine learning), and which has only 8GB of VRAM.

The faces were originated by sampling 10,000 faces from the Faces Learned with an Articulated Model and Expressions (FLAME) 3DMM parametric head model, which in itself was trained on the basis of thousands of real-world 3D scans.

The FLAME model was originated from real-life 3D head scans. Source: https://ps.is.mpg.de/uploads_file/attachment/attachment/400/paper.pdf

Five depth maps were rendered from each head, with varying perspective projections (i.e., focal lengths of the virtual camera, see above), resulting in 50,000 depth maps to exploit. As the authors explain, the range of projections is designed to allow downstream networks to disentangle the underlying 3D shape of the face from the focal length of the photo when features are extracted.
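
As a concrete illustration of what varying the perspective projection means in practice, the standard conversion between a virtual camera's focal length and its field of view is shown below. This is a general photographic formula, not taken from the paper's code, and the 36mm full-frame sensor width is an assumption for illustration:

```python
import math

# Convert a virtual camera's focal length to its horizontal field of view.
# The 36mm sensor width is the full-frame photographic convention, assumed
# here for illustration; the paper does not specify its camera intrinsics.

def focal_to_fov_deg(focal_mm, sensor_width_mm=36.0):
    return 2 * math.degrees(math.atan(sensor_width_mm / (2 * focal_mm)))

# An 18mm lens on this sensor gives a 90-degree field of view,
# while a 200mm lens narrows it to roughly 10 degrees.
```

Short focal lengths therefore place the virtual camera close to the head, which is what produces the strong perspective distortion discussed earlier.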

The authors then used ControlNet with Stable Diffusion (the well-established version 1.5 model) to generate photorealistic faces from the depth maps. The resulting data comes complete with 3DMM parameters, which provides a granular and very useful interface to, and overview of, the face data. This means that when the faces are subsequently used in third-party systems, it will not be necessary to plug in quite so many third-party libraries in order to ‘guess’ focal length and many of the other parameters.

The authors note that 25 images are produced for each distinct generated 3D shape, which ultimately produces the 250,000 images. This means that a single ‘outline’ is powering a large number of identities:

Multiple identities emerge from the same depth map.
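
The arithmetic of the generation grid can be made explicit. The counts below are taken from the article; the per-depth-map figure is an inference from them (25 images per shape spread over 5 depth maps), and the variable names are the author's own:

```python
# Bookkeeping for the SynthFace generation grid, using the counts given
# in the article: 10,000 FLAME shapes, five depth maps (one per simulated
# focal length) per shape, and 25 final images per shape.

NUM_SHAPES = 10_000
MAPS_PER_SHAPE = 5
IMAGES_PER_SHAPE = 25

depth_maps = NUM_SHAPES * MAPS_PER_SHAPE             # 50,000 depth maps
images_per_map = IMAGES_PER_SHAPE // MAPS_PER_SHAPE  # 5 identities per map (inferred)
total_images = NUM_SHAPES * IMAGES_PER_SHAPE         # 250,000 images
```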

All 300 available FLAME shape parameters are used in the output, and feature extraction is then handled by ArcFace, a 2022 project lead-authored by Imperial College London, which can generate identity-based priors as well as extract them.

The new project did not use all the capabilities of ArcFace, or of several of the libraries or contributing technologies, many of which are designed to address faces in a variety of poses. Instead, the SynthFace output is designed to produce ‘passport-style’ images at a range of simulated focal distances or projections. Therefore the entire workflow is set up to arrive at a canonical (or ‘default’) view of the subject.

The Stable Diffusion generation procedure is particularly interesting, since the framework iteratively builds up the prompts used to inform the image-to-image procedure that starts with the depth maps. The authors explain:

‘A systematic method was undertaken to iteratively refine our prompt to generate realistic human faces. This process involved starting with a single text prompt, ‘studio portrait’, and iteratively adding single phrases, both to positive and negative prompts, to build an improved prompt.

‘The impact of these additional phrases was qualitatively evaluated in each case with only phrases that produced more visually lifelike outputs kept.’
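
The refinement loop the authors describe can be sketched as a greedy keep-if-better search. The scoring function below is a mechanical stand-in for their human qualitative judgement, and the candidate phrases are invented; the paper's loop also accumulates negative-prompt phrases, which is omitted here for brevity:

```python
# Greedy prompt-refinement sketch: start from a base prompt and keep a
# candidate phrase only if it improves the (here, stand-in) realism score.

def refine_prompt(base, candidates, realism_score):
    prompt, best = base, realism_score(base)
    for phrase in candidates:
        trial = f"{prompt}, {phrase}"
        score = realism_score(trial)
        if score > best:            # keep only phrases that help
            prompt, best = trial, score
    return prompt

# Toy stand-in scorer: rewards phrases from an invented 'helpful' set.
helpful = {"sharp focus", "natural skin texture"}
toy_score = lambda p: sum(phrase in p for phrase in helpful)

result = refine_prompt(
    "studio portrait",
    ["sharp focus", "cartoon style", "natural skin texture"],
    toy_score,
)
```

With this toy scorer, the two helpful phrases are retained and the unhelpful one is discarded, mirroring the keep-only-what-improves policy quoted above.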

One negative effect of the upstream frameworks used in the system is a notable age/race/gender imbalance, with SynthFace estimated by FaceLib to be 83.1% male and 16.9% female, with the age distribution similarly skewed:

The age distribution of rendered subjects is unkind to older folks, in SynthFace.

The authors note that these imbalances reflect data distributions in many of the contributing technologies, including Stable Diffusion and FLAME, and the way that they are used through ControlNet.

The second offering in the new paper is ControlFace, an interpretive network derived from SynthFace, which is capable of disentangling age from identity, and also of taking focal length into account – a rare boon in systems of this kind, which must usually rely either on the sporadic accuracy of plug-in libraries or on expensive labeling.

Training ControlFace, a mapping network informed by the SynthFace data. ControlFace minimizes the mesh reconstruction error between a mesh prediction and a known 3D mesh for each face that comes through from SynthFace. This is a rare accommodation in such projects.

Data and Tests

For the training process, RetinaFace is used for localization and cropping based on perceived landmarks.

RetinaFace is a collaboration between Imperial College, InsightFace, FaceSoft, and Middlesex University London, and performs complex 3D reconstruction in order to delineate facial characteristics. Source: https://openaccess.thecvf.com/content_CVPR_2020/papers/Deng_RetinaFace_Single-Shot_Multi-Level_Face_Localisation_in_the_Wild_CVPR_2020_paper.pdf

RetinaFace’s versatility allows the new project to warp input faces to the ‘straight-on’ canonical pose that’s desired for the work.

For the mapping network, the authors have leveraged MICA, the framework behind the aggregated dataset that the new paper’s researchers consider to be the current state of the art, and the nearest analog to their own work.

However, they argue, the MICA project is not amenable to further development:

‘The unification of existing 3D datasets through MICA has shown promising results in 3D face shape estimation, but it represents an upper bound on a dataset for supervised 3DMM regression unless more 3D data is collected.

‘We overcome this limitation by devising a dataset generation pipeline which combines 2D and 3D generative models.’

The training data split for the SynthFace images was a standard 80/20, with the best-performing model chosen based on the validation loss. Early stopping is used, with a patience (the number of unproductive epochs allowed to run) of 20, and the system is run for 100 epochs (a complete cycle through the available training data).
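
For readers unfamiliar with the mechanism, early stopping with patience can be sketched as follows. This is a generic illustration, not the authors' training code, and the plateauing loss curve is invented:

```python
# Generic early-stopping sketch: halt training once the validation loss
# has failed to improve for `patience` consecutive epochs.

def train_with_early_stopping(max_epochs, patience, val_loss):
    best_loss, best_epoch, stale = float("inf"), -1, 0
    for epoch in range(max_epochs):
        loss = val_loss(epoch)
        if loss < best_loss:
            best_loss, best_epoch, stale = loss, epoch, 0
        else:
            stale += 1
            if stale >= patience:   # patience exhausted: stop early
                break
    return best_epoch, best_loss

# Invented validation curve: improves for ten epochs, then plateaus.
curve = lambda e: max(0.5, 1.0 - 0.05 * e)
best_epoch, best_loss = train_with_early_stopping(100, 20, curve)
```

With these toy numbers the loop halts roughly twenty epochs after the last improvement, returning the epoch-10 checkpoint – the 'best-performing model chosen based on the validation loss' policy described above.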

The AdamW optimizer was used, under the strategy employed in MICA, with masked mesh loss concentrating on the inner regions of the face (which is, arguably, a controversial tactic these days).

For testing the system, the researchers used the ‘Not Quite In-The-Wild’ (NoW) benchmark, which offers 2054 images covering 100 identities, and which is similarly powered by FLAME.

From the paper 'Learning to Regress 3D Face Shape and Expression from an Image without 3D Supervision': the Max Planck Institute's RingNet also learns from the FLAME model, whose inference powers the NoW benchmark. Source: https://arxiv.org/pdf/1905.06817.pdf

The competing frameworks tested were Deep3D; DECA (2 variants); AlbedoGAN (2 variants); and MICA.

The test procedure involved evaluating alignments from the predicted meshes to the equivalent scans, via facial landmarks. For each vertex, the scan-to-mesh distance was recorded, and the mean distances subsequently calculated.
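
The evaluation can be sketched in simplified form. The NoW benchmark measures point-to-surface distances after rigid alignment on landmarks; the nearest-vertex version below is a coarser stand-in, written for illustration only:

```python
import math

# Simplified scan-to-mesh error: for every scan point, the distance to the
# nearest predicted-mesh vertex, averaged over the scan. The real NoW
# benchmark uses point-to-surface distance after landmark-based alignment.

def mean_scan_to_mesh(scan_points, mesh_vertices):
    nearest = lambda p: min(math.dist(p, v) for v in mesh_vertices)
    return sum(nearest(p) for p in scan_points) / len(scan_points)

# Toy example: a predicted mesh offset 1mm along z from the scan.
scan = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
mesh = [(0.0, 0.0, 1.0), (10.0, 0.0, 1.0)]
err = mean_scan_to_mesh(scan, mesh)   # 1.0
```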

Results for the comparison run.

Of these results, the authors state:

‘Our results are competitive with the current state-of-the-art in 3D face shape estimation without requiring any ground truth data. We achieve this by introducing a novel method for large dataset generation for 3D face shape estimation. Our work with ControlFace demonstrates that supervised training on this dataset leads to accurate 3D face shape estimation.

‘Crucially, our work is easily extensible. A longer generation time can lead to a larger dataset and improvements in 2D and 3D generative model capabilities can directly feed into future work. We believe this will enable future versions of SynthFace to close the performance gap with methods such as MICA and AlbedoGAN. Datasets for specific use cases, be that large pose variations or expressions, can be created by updating parameters in our generation code.’

Regarding limitations, the primary one – the demographic imbalance – is the most serious, though indicative of wider issues regarding the use of legacy data and methodologies that predate current concerns about fair representation. The authors also contend that the current embedding network used in the system could eventually be entirely removed and replaced with a single unified network that could learn to map 3DMM parameters in a supervised way.

The researchers additionally acknowledge the growing concern about the use of LAION-style web-scraped datasets, that power systems such as Stable Diffusion:

‘Generative models like stable diffusion require extensive datasets for training, which typically rely on publicly available data. Consequently, there’s a likelihood that individuals’ data has been used without their explicit consent. This raises clear ethical and legal concerns, particularly for models deployed in the real world.’

Finally, they concede that though the system is intended for practical and beneficial uses, such as prosthesis development, it could also be used to improve deepfake systems, and also mass surveillance systems.

Conclusion

Though the new dataset presented in the work has a number of clear limitations, those involving demographic bias are probably the easiest to systematically solve. This brings into view the growing possibility of generating highly diverse hyperscale facial datasets that easily bridge the domain gap – a challenge that cannot realistically be met by any other method, except in an authoritarian society that may consider the benefits of such research to outweigh the privacy and consent issues that may trouble its populace (i.e., the ad hoc use of state surveillance footage and/or citizen documents created for other purposes).

The other troubling aspect of systems such as the one proposed is their habitual dependence on upstream data that may eventually fall foul of developing privacy laws, and risk ending up in what would be a quite substantial bushel of ‘poison fruit’ downstream projects. Affected projects would suddenly need to find legitimate data at apparently impossible scales. Currently therefore, the challenge is somewhat recursive.
