There is a kind of cultural entropy around historical face image datasets; many of them were scraped from the web ad hoc, years ago, from sources where white people – and in particular white males – tended, at least at the time, to predominate.

In the image research sector, however, such ancestral issues have a far-reaching effect that is difficult to address, primarily because the largest and often least-balanced datasets naturally became the most influential, embedding themselves in year-on-year metrics. The more they predominated in the literature as the sector developed, the more the metrics for new work came to rely on them.
In turn, many of the most fundamental algorithms in the current computer vision research sector, such as certain loss functions, still reflect the demographics of these unbalanced datasets. Time has allowed these older collections to impose a significant technical debt on new work.
Though the costs associated with fixing these issues are not insignificant, the fact that racial and gender imbalance can affect the efficacy and fairness of facial recognition (FR) systems – and that this has become a news flashpoint in recent years – has given the industry a strong motivation to address the problem.
Several important FR datasets have been pulled in recent years, partially due to controversies centering on demographic inequality, including MS-Celeb-1M, MegaFace, and the University of Oxford’s VGGFace2.

As others have noted, pulling a troublesome dataset does not represent a definitive withdrawal of the data, since it will often already have been used to train models, create subsets (which do not get pulled when the ‘mother’ set is cancelled), or even to originate algorithms that may end up contributing to and shaping a far wider range of models.
Creating new hyperscale datasets to replace these historical collections is not a trivial matter, however; even though web-scraping can ingest terabytes of face images in a short amount of time, the scale of the data makes it impossible to manually curate – and, ironically, many of the algorithms that could help in identifying race and gender classes within the collection are themselves affected by having been trained on imbalanced datasets.
This represents a formidable logistical problem. Any labels and metadata associated with a web-scraped image may themselves be biased; more often, though, such labels are either absent or unhelpful in terms of racial and gender classification, not least because such annotation could be interpreted as marginalizing in the context of the original website.
One possible solution that has been explored in recent years is the creation of hyper-realistic synthetic datasets that provide a fairer distribution of races and genders, using generative systems such as Generative Adversarial Networks (GANs) and Latent Diffusion Models (LDMs) like Stable Diffusion.
In 2022 the Generative Visual Prompt project used the FairFace dataset in combination with the text/image semantic encoding system CLIP to minimize racial and gender divergence.

Further attempts have been made to re-balance older dataset content by changing race and/or gender (see image below), either by manipulating the latent space of a GAN or by utilizing an autoencoder to create visual distinctions.

Other approaches have included the use of traditional CGI, though it is not possible to obtain the same level of photorealism with this approach, as evidenced by initiatives such as DigiFace-1M and DreamFace.

Now, researchers from France and Switzerland are proposing to augment existing datasets with racially and gender-balanced synthetic data based on the NVIDIA StyleGAN latent space, allowing the project to synthesize novel examples from the desired categories.

Since, as the authors observe, a notable domain gap can occur across diverse types of StyleGAN architecture, the StyleGAN3 architecture was adopted for the new work, as it has, the researchers assert, been proven to be less prone to this issue.
The authors state:
‘In contrast to previous works that are using much more complex modeling schemes we used simple modeling technique. Our method can be employed to model and later on generate synthetic images according to arbitrary demographic groups.
‘One can categorize our proposed method as pre-processing method for addressing bias in existing models.’
The new paper is titled Toward responsible face datasets: modeling the distribution of a disentangled latent space for sampling face images from demographic groups, and comes from three researchers across the Idiap Research Institute, the École Polytechnique Fédérale de Lausanne (EPFL), and the School of Criminal Justice at the Université de Lausanne (UNIL).
Approach
The core approach of the new system is to identify racial and gender-classified content in a StyleGAN latent space and use its latent codes to reproduce more examples within the same category (see image above).

The authors point out that directly modeling the latent space of StyleGAN is not possible, due to entanglement of various facets of the trained material, and that their system instead uses an autoencoder network to form an effective ancillary space where the material can be disentangled.
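As a rough illustration of this arrangement, the sketch below shows what such an ancillary autoencoder might look like in PyTorch (the framework the authors report using). The 512-dimensional W-space input, the layer widths, and the form of the contrastive term (which the authors mention later in the paper) are assumptions for illustration, not the paper’s actual configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentAutoencoder(nn.Module):
    """Maps StyleGAN W-space codes into an ancillary space in which
    demographic attributes are easier to separate, then back again.
    The 512-d input and the layer widths are illustrative guesses."""
    def __init__(self, w_dim: int = 512, z_dim: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(w_dim, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, z_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(z_dim, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, w_dim),
        )

    def forward(self, w: torch.Tensor):
        z = self.encoder(w)          # disentangled ancillary code
        w_rec = self.decoder(z)      # reconstruction back into W space
        return z, w_rec

def training_step(model, w, labels, contrastive_loss_fn):
    # Reconstruction keeps the codes usable by the generator; the
    # contrastive term (exact form assumed here) pulls same-demographic
    # codes together in the ancillary space and pushes other groups apart.
    z, w_rec = model(w)
    return F.mse_loss(w_rec, w) + contrastive_loss_fn(z, labels)
```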

The autoencoder is itself trained to recognize and classify semantically relevant material, based on the desired class. The ‘cleaned’ latent codes are then passed to a Gaussian Mixture Model (GMM) module, which learns the distribution of codes for each demographic group, so that new codes can be sampled for any desired combination of semantic attributes, such as race + gender.
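In practice, this stage can be as simple as fitting one mixture per demographic group and drawing from it on demand. The following sketch uses scikit-learn (the library the authors report using for this module, as noted below); the per-group fitting strategy and the component count are assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_group_gmms(z_codes: np.ndarray, labels: np.ndarray, n_components: int = 8):
    """Fit one Gaussian mixture per demographic group over the
    disentangled codes produced by the autoencoder's encoder."""
    return {
        group: GaussianMixture(n_components=n_components).fit(z_codes[labels == group])
        for group in np.unique(labels)
    }

def sample_group(gmms: dict, group: str, n: int) -> np.ndarray:
    """Draw n new ancillary codes for the target demographic."""
    z_new, _ = gmms[group].sample(n)
    return z_new
```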

The decoder module in the autoencoder is used to obtain a refreshed latent code that encapsulates the target demographic within the latent space of StyleGAN3. These final codes are then passed to the StyleGAN3 generator module.
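Putting the stages together, generation for a requested demographic might look like the sketch below. Here `generator` stands in for a pretrained StyleGAN3 network, and broadcasting a single code across all style layers is an assumption about how the refreshed codes are fed to it; the helper functions carry over from the sketches above:

```python
import torch

@torch.no_grad()
def generate_for_group(gmms, autoencoder, generator, group: str, n: int):
    # Sample fresh disentangled codes for the target demographic...
    z_new = torch.from_numpy(sample_group(gmms, group, n)).float()
    # ...decode them back into the latent space of StyleGAN3...
    w_new = autoencoder.decoder(z_new)
    # ...and broadcast across the style layers before synthesis
    # (generator.num_ws and generator.synthesis follow NVIDIA's
    # reference StyleGAN3 code; treat this wiring as illustrative).
    ws = w_new.unsqueeze(1).repeat(1, generator.num_ws, 1)
    return generator.synthesis(ws)

# e.g. images = generate_for_group(gmms, ae, G, 'black_female', 16)
```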
The system’s primary encoder uses the pixel2style2pixel (pSp) technique developed for the Encoding in Style project from Penta-AI and Tel Aviv University in 2021.

Data and Tests
The researchers tested the new system by devising an image classification task using a module from the Discrimination aware decision tree learning project.
The MORPH facial age estimation dataset was used to train the autoencoder modules. The base dataset for the StyleGAN3 latent space was FFHQ (which only last month was one of the targets of another, NYU-created system for re-balancing datasets). Therefore neither the training data for the autoencoder nor the primary latent space had received prior exposure to the test images used in the trials.
PyTorch was used for the autoencoder implementation, while the inversion technique (i.e., the method for inserting the source pictures into the latent space) was borrowed from the 2022 project Third Time’s the Charm? Image and Video Editing with StyleGAN3, from Tel-Aviv University, the Hebrew University of Jerusalem, and Adobe Research.
The GMM module was implemented with scikit-learn, and the autoencoder itself was trained on an NVIDIA RTX 3090 Ti with 24GB of VRAM. The batch size was set to a broadly representative 192 – a high figure for many use cases, but necessary when seeking to capture broader qualities such as, in this case, demographic characteristics.
The authors generated 1000 images each for the male and female classes, to perform gender classification. For race, the labels white, black and Latino-Hispanic were used, omitting the two other classifications available in MORPH (Asian and Unknown) for logistical reasons.

Initially the researchers tested to ensure that the novel identities being created were truly unique, and to this end performed facial recognition on the synthetic images, comparing them against the source identities in the MORPH dataset to confirm that the new faces were genuinely new.
Face embeddings were extracted from the images using a ResNet50 network trained on the WebFace4M face dataset with the ArcFace loss function, and the score distribution of the original MORPH images was compared against that of the synthetic images. The graph below shows the resulting distributions:
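For readers who want a concrete picture, such a check boils down to pairwise similarity between the two sets of embeddings. The sketch below assumes the embeddings have already been extracted (the ArcFace-trained ResNet50 itself is not shown):

```python
import numpy as np

def cosine_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """All pairwise cosine similarities between two embedding sets."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

# emb_synth, emb_morph: embeddings for the synthetic and original faces.
scores = cosine_matrix(emb_synth, emb_morph)

# If the synthetic identities are genuinely novel, even the best MORPH
# match for each synthetic face should fall well below a typical
# same-identity decision threshold.
best_match = scores.max(axis=1)
```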

Regarding these results, the authors state:
‘The genuine score distribution is represented by a single bin because only a single synthetic image is available per identity. The zero-effort distribution (blue) moves toward the genuine score distribution (green). This shift indicates the identity difference is smaller than in the original dataset. However, the distance between the distributions remains large enough to discriminate between identities.’
In broad qualitative tests, the reconstruction and racial fidelity of the synthetic images were assessed. In the image below, the left-most column shows original data from MORPH; the middle column shows a standard pSp transformation; and the right-most column shows the pSp result when passed through the new system’s ancillary autoencoder:

The authors note that in these tests, the target demographics are well-preserved.
To demonstrate the extent to which the novel autoencoder can disentangle information, the authors compared t-distributed Stochastic Neighbor Embedding (t-SNE) plots of the original MORPH dataset to plots of the reconstructed synthetic data.
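A plot of this kind can be produced with scikit-learn’s TSNE; the sketch below is a generic recipe (the perplexity value and point styling are arbitrary choices, not the paper’s settings):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(codes: np.ndarray, labels: np.ndarray, title: str):
    """Project latent codes to 2-D and color points by demographic label."""
    xy = TSNE(n_components=2, perplexity=30).fit_transform(codes)
    for group in np.unique(labels):
        mask = labels == group
        plt.scatter(xy[mask, 0], xy[mask, 1], s=4, label=str(group))
    plt.title(title)
    plt.legend()
    plt.show()

# Well-separated per-demographic clusters in the autoencoder's latent
# space would indicate successful disentanglement.
```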

Here the authors note:
‘We can observe that the AE’s latent space with the applied contrastive loss is better disentangled according to possible demographics.’
Conclusion
The current drive to re-balance racially or gender-skewed datasets is interesting from the standpoint of deployment and use cases. In many instances, datasets are intended to train models aimed at one particular demographic, which could be as broad as ‘Asian’ or as narrow as ‘Chinese’. In such cases, in a potential future where properly-balanced hyperscale datasets could become the norm, it may be necessary either to apply active filters for specific use-cases, or (as is a common practice) to siphon off sub-sets from the broader set.
The aforementioned controversies around facial recognition have largely centered on use cases in western urban environments, where a wide variety of races need to be considered, and where the potential for bias (and ensuing abuse) is very much a ‘local’ matter, despite the plethora of international races represented within training datasets.
Yet there are many cases where the host country for a facial recognition (or face-based) application is essentially a ‘monoculture’, and where the overwhelming majority of data in a truly international and artificially-balanced dataset would essentially be ballast; and where systems training on that data will need procedures and methodologies in order to drill down to the target demographics required.
In the meantime, for use in racially-diverse cities, and in applications where gender imbalance is likely to genuinely cause incorrect results at inference time, it may indeed prove more effective to use generative systems such as this to create bespoke datasets; it’s enough to consider that the ideal demographic configuration for NYC is going to be radically different from that of a deployment in a minor Chinese city with a minimal tourist or immigrant population.
It’s possible, therefore, that systems such as those presented in this new paper will be used not only to re-balance datasets, but to unbalance them, by intention, to accommodate local needs.