Repairing Demographic Imbalance in Face Datasets With StyleGAN3

About the author

Martin Anderson

I'm Martin Anderson, a writer occupied exclusively with machine learning, artificial intelligence, big data, and closely-related topics, with an emphasis on image synthesis, computer vision, and NLP.

There is a kind of cultural entropy around historical face image datasets; many of them were scraped from the web ad hoc, years ago, from sources where white people – and in particular white males – tended, at least at the time, to predominate.

Examples from the Celeb-A dataset, one of the oldest and most influential in the computer vision research scene. Source: https://mmlab.ie.cuhk.edu.hk/projects/CelebA.html

In the image research sector, however, such ancestral issues have a far-reaching effect that is difficult to address, primarily because the largest and often least-balanced datasets naturally became very influential, and embedded in year-on-year metrics. The more that they predominated in the literature as the sector developed, the more the metrics for new work came to rely on them.

In turn, many of the most fundamental algorithms in the current computer vision research sector, such as certain loss functions, still reflect the demographics of these unbalanced datasets. Time has allowed these older collections to impose a significant technical debt on new work.

Though the various costs associated with fixing the issues are not insignificant, the fact that racial and gender imbalance can affect the efficiency and fairness of facial recognition (FR) systems, and that this has become a news flashpoint in recent years, has given the industry a strong motivation to address the problem.

Several important datasets in FR systems have been pulled in recent years, partially due to controversies centering on demographic inequality, including MS-Celeb-1M, MegaFace, and the University of Oxford’s VGGFace2.

Images from the VGGFace2 dataset, no longer available. Source: https://github.com/ox-vgg/vgg_face2

As others have noted, pulling a troublesome dataset does not represent a definitive withdrawal of the data, since it will often already have been used to train models, create subsets (which do not get pulled when the ‘mother’ set is cancelled), or even to originate algorithms that may end up contributing to and shaping a far wider range of models.

Creating new hyperscale datasets to replace these historical collections is not a trivial matter, however; even though web-scraping can ingest terabytes of face images in a short amount of time, the scale of the data makes it impossible to manually curate – and, ironically, many of the algorithms that could help in identifying race and gender classes within the collection are themselves affected by having been trained on imbalanced datasets.

This represents a formidable logistical problem. Any labels and metadata associated with a web-scraped image may themselves be biased; more often, though, such labels are either absent or unhelpful in terms of racial and gender classification, not least because such annotation could be interpreted as marginalizing, in the context of the original website.

One possible solution that has been explored in recent years is the creation of hyper-realistic synthetic datasets that provide a fairer distribution of races and gender, using generative systems such as Generative Adversarial Networks (GANs) and Latent Diffusion Models (LDMs) such as Stable Diffusion.

In 2022 the Generative Visual Prompt project used the FairFace dataset in combination with the text/image semantic encoding system CLIP to minimize racial and gender divergence.

Images from the Generative Visual Prompt project. Source: https://github.com/ChenWu98/Generative-Visual-Prompt

Further attempts have been made to re-balance older dataset content by changing race and/or gender (see image below) by manipulating the latent space of a GAN, and by utilizing an autoencoder to create visual distinctions.

The authors of the 2021 paper 'Semantic and Geometric Unfolding of StyleGAN Latent Space' experimented with altering race and/or gender in unbalanced datasets. Source: https://arxiv.org/pdf/2107.04481.pdf

Other approaches have included the use of traditional CGI, though it is not possible to obtain the same level of photorealism with this approach, as evidenced by initiatives such as DigiFace-1M and DreamFace.

Example faces from the DigiFace-1M dataset. Source: https://arxiv.org/pdf/2210.02579.pdf

Now, researchers from France and Switzerland are proposing to augment existing datasets with racially and gender-balanced augmentations based on the NVIDIA StyleGAN latent space, allowing the project to synthesize novel examples from the desired categories.

Ad hoc generation of target label faces using the new system. Source: https://arxiv.org/pdf/2309.08442.pdf

Since, as the authors observe, a notable domain gap can occur across different StyleGAN architectures, the StyleGAN3 architecture was adopted for the new work, as the researchers assert it has proven less prone to this issue.

The authors state:

‘In contrast to previous works that are using much more complex modeling schemes we used simple modeling technique. Our method can be employed to model and later on generate synthetic images according to arbitrary demographic groups.

‘One can categorize our proposed method as pre-processing method for addressing bias in existing models.’

The new paper is titled Toward responsible face datasets: modeling the distribution of a disentangled latent space for sampling face images from demographic groups, and comes from three researchers across the Idiap Research Institute, the École Polytechnique Fédérale de Lausanne (EPFL), and the School of Criminal Justice, at the Université de Lausanne (UNIL).

Approach

The core approach of the new system is to identify racial and gender-classified content in a StyleGAN latent space and use its latent codes to reproduce more examples within the same category (see image above).

Conceptual architecture for the new proposed pipeline.

The authors point out that directly modeling the latent space of StyleGAN is not possible, due to entanglement of various facets of the trained material, and that their system instead uses an autoencoder network to form an effective ancillary space where the material can be disentangled.

The autoencoder component handles the disambiguation of content.
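
The role of that ancillary space can be illustrated with a deliberately simplified stand-in: PCA is the optimal linear encoder/decoder pair, so the round trip below mimics the autoencoder's job of mapping StyleGAN-style latent codes into a compact ancillary space and back. The real system uses a trained non-linear network with a contrastive objective, and the 512-dimensional 'latent codes' here are random placeholders, not real StyleGAN3 codes.

```python
# Linear stand-in for the paper's autoencoder: PCA provides a matched
# encode/decode pair, illustrating the round trip from a StyleGAN-style
# latent space to a compact ancillary space and back. The "latents"
# below are random placeholders, not real StyleGAN3 codes.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latents = rng.normal(size=(500, 512))          # stand-in w codes

ae = PCA(n_components=32).fit(latents)         # "encoder/decoder" pair
ancillary = ae.transform(latents)              # compact ancillary space
reconstructed = ae.inverse_transform(ancillary)

print(ancillary.shape)        # (500, 32)
print(reconstructed.shape)    # (500, 512)
```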

The autoencoder is itself trained to recognize and classify semantically relevant material, based on the desired class. The ‘cleaned’ latent codes are then passed to a Gaussian Mixture Modeling (GMM) module, which models the distribution of those codes so that distinct semantic aspects, such as race and gender, can be combined and sampled together.

Here we see the disentangled autoencoder latent codes being assembled by GMM before being passed to StyleGAN3.

The decoder module in the autoencoder is used to obtain a refreshed latent code that encapsulates the target demographic within the latent space of StyleGAN3. These final codes are then passed to the StyleGAN3 generator module.
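
The sampling stage described above can be sketched with scikit-learn, which the paper also uses for its GMM module. Everything else here is a placeholder: the 2-D 'disentangled codes' and the group names are synthetic illustrations, and a real pipeline would decode the sampled codes back into StyleGAN3 latents rather than stopping at the codes themselves.

```python
# Sketch of the per-demographic sampling stage: one GMM is fitted per
# group on that group's disentangled codes, then sampled to obtain
# fresh codes for the target demographic. The 2-D codes and group
# names are synthetic placeholders, not data from the paper.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# Pretend disentangled codes, one cluster per demographic group.
codes = {
    "female_black": rng.normal(loc=[0, 0], scale=0.5, size=(300, 2)),
    "male_white":   rng.normal(loc=[4, 4], scale=0.5, size=(300, 2)),
}

# One GMM per group models that group's region of the ancillary space.
gmms = {g: GaussianMixture(n_components=2, random_state=0).fit(x)
        for g, x in codes.items()}

# Sampling the target group's GMM yields fresh codes that the decoder
# would then map back to StyleGAN3 latents for image generation.
new_codes, _ = gmms["female_black"].sample(5)
print(new_codes.shape)   # (5, 2)
```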

The system’s primary encoder uses the pixel2style2pixel (pSp) technique developed for the Encoding in Style project from Penta-AI and Tel Aviv University in 2021.

The pixel2style2pixel project can perform a number of innovative transformations and recognitions. Source: https://arxiv.org/pdf/2008.00951.pdf

Data and Tests

The researchers tested the new system by devising an image classification task using a module from the Discrimination aware decision tree learning project.

The MORPH facial age estimation dataset was used to train the autoencoder modules. The base dataset for the StyleGAN3 latent space was FFHQ (which only last month was one of the targets of another, NYU-created system for re-balancing datasets). Therefore neither the training data for the autoencoder nor the primary latent space had received prior exposure to the test images used in the trials.

PyTorch was used for the autoencoder implementation, while the inversion technique (i.e., the method for inserting the source pictures into the latent space) was borrowed from the 2022 project Third Time’s the Charm? Image and Video Editing with StyleGAN3, from Tel-Aviv University, the Hebrew University of Jerusalem, and Adobe Research.

The GMM module was implemented with scikit-learn, and the autoencoder itself was trained on an NVIDIA RTX 3090 Ti with 24GB of VRAM. The batch size was set to a broadly representative 192, a high figure for many use cases, but necessary when seeking to capture broader qualities, such as the demographic characteristics in this case.

The authors generated 1000 images each for the male and female classes, to perform gender classification. For race, the labels white, black and Latino-Hispanic were used, omitting the two other classifications available in MORPH, Asian and Unknown, for logistical reasons.

A confusion matrix for the gender classification task.
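
In spirit, such a classification check can be reproduced with scikit-learn's confusion_matrix; the synthetic features and the logistic-regression classifier below are placeholders, not the discrimination-aware module the authors actually used.

```python
# Illustrative version of the gender-classification check: train a
# simple classifier on synthetic features and tabulate a confusion
# matrix. Features and classifier are placeholders for the module
# used in the paper.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(2)
n = 1000                                   # 1000 images per class, as in the paper
X = np.vstack([rng.normal(0, 1, (n, 8)), rng.normal(2, 1, (n, 8))])
y = np.array([0] * n + [1] * n)            # 0 = female, 1 = male (arbitrary)

clf = LogisticRegression().fit(X, y)
cm = confusion_matrix(y, clf.predict(X))
print(cm)            # rows: true class, columns: predicted class
```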

Initially, the researchers tested to ensure that the novel identities being created were truly unique, and to this end performed facial recognition on the synthetic images, confirming that the new faces did not simply reproduce identities present in the original MORPH dataset.

Faces were extracted from the source images using a ResNet50 network trained on the WebFace4M face dataset, using the ArcFace loss function. The score distributions of the original MORPH images were compared against those of the synthetic images. The graph below demonstrates the level of parity obtained:

The scores for generated images compared to scores for MORPH source images, with genuine pairs in green, zero-effort imposters in blue, and synthetic imposters in orange.

Regarding these results, the authors state:

‘The genuine score distribution is represented by a single bin because only a single synthetic image is available per identity. The zero-effort distribution (blue) moves toward the genuine score distribution (green). This shift indicates the identity difference is smaller than in the original dataset. However, the distance between the distributions remains large enough to discriminate between identities.’
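
The genuine/imposter comparison works by scoring pairs of face embeddings, typically with cosine similarity. The sketch below uses random vectors in place of the ResNet50/ArcFace embeddings, but the mechanics of the comparison are the same.

```python
# How genuine / imposter score distributions are typically compared:
# cosine similarity between embedding pairs. Random vectors stand in
# for the ResNet50/ArcFace embeddings used in the paper.
import numpy as np

rng = np.random.default_rng(3)

def cosine(a, b):
    return (a * b).sum(-1) / (np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1))

base = rng.normal(size=(200, 128))
genuine_pair = base + 0.1 * rng.normal(size=base.shape)   # same identity, mild noise
imposter = rng.normal(size=(200, 128))                    # unrelated identities

genuine_scores = cosine(base, genuine_pair)
imposter_scores = cosine(base, imposter)

# Well-separated distributions mean identities remain discriminable.
print(genuine_scores.mean() > imposter_scores.mean())   # True
```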

In broad qualitative tests, the reconstruction and racial fidelity of the synthetic images were sampled. In the image below, the left-most column shows original data from MORPH; the middle column shows a standard pSp transformation; and the right-most column shows the pSp result when passed through the new system’s ancillary autoencoder:

The authors note that in these tests, the target demographics are well-preserved.

To demonstrate the extent to which the novel autoencoder can disentangle information, the authors compared t-distributed Stochastic Neighbor Embedding (t-SNE) plots from the original MORPH dataset to plots of the reconstructed synthetic data.

t-SNE plots for gender and race in the original MORPH data, compared to plots for the novel synthetic data.

Here the authors note:

‘We can observe that the AE’s latent space with the applied contrastive loss is better disentangled according to possible demographics.’
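
A minimal version of such a t-SNE check, with synthetic codes and labels standing in for the autoencoder's latent space, might look like this:

```python
# Minimal t-SNE run in the spirit of the paper's disentanglement
# check: embed labeled high-dimensional codes into 2-D for plotting.
# The codes and group labels are synthetic placeholders.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(4)
codes = np.vstack([rng.normal(0, 1, (100, 32)),
                   rng.normal(5, 1, (100, 32))])   # two demographic groups
labels = np.array([0] * 100 + [1] * 100)

xy = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(codes)
print(xy.shape)   # (200, 2) – each row is a 2-D point, colored by its label when plotted
```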

Conclusion

The current drive to re-balance racially or gender-skewed datasets is interesting from the standpoint of deployment and use cases. In many instances, datasets are intended to train models aimed at one particular demographic, which could be as broad as ‘Asian’ or as specific as ‘Chinese’. In such cases, in a potential future where properly-balanced hyperscale datasets could become the norm, it may be necessary either to apply active filters for specific use cases, or (as is common practice) to siphon off subsets from the broader set.

The aforementioned controversies around facial recognition have largely centered on use cases in western urban environments, where a wide variety of races need to be considered, and where the potential for bias (and ensuing abuse) is very much a ‘local’ matter, despite the plethora of international races represented within training datasets.

Yet there are many cases where the host country for a facial recognition (or face-based) application is essentially a ‘monoculture’, where the overwhelming majority of data in a truly international and artificially-balanced dataset would essentially be ballast, and where systems training on that data will need procedures and methodologies to drill down to the required target demographics.

In the meantime, for use in racially-diverse cities, and in applications where gender imbalance is likely to genuinely cause incorrect results at inference time, it may indeed prove more effective to use generative systems such as this to create bespoke datasets; the ideal demographic configuration for NYC, after all, is going to be radically different from that for a deployment in a small Chinese city with minimal tourism or immigration.

It’s possible, therefore, that systems such as those presented in this new paper will be used not only to re-balance datasets, but to unbalance them, by intention, to accommodate local needs.
