A Unified System for Facial Analysis


About the author


Martin Anderson

I'm Martin Anderson, a writer occupied exclusively with machine learning, artificial intelligence, big data, and closely-related topics, with an emphasis on image synthesis, computer vision, and NLP.


Most new projects dealing with facial synthesis must bind in an extraordinary number of ancillary libraries to cope with the many facets of the human face: pose estimation deals with the angle of the face/head relative to the camera; identity concerns the continuity of particular facial features; semantic segmentation is frequently needed both to determine areas that might lead to generative heads, and to isolate faces in images that contain other content; and, among several additional considerations, race and age estimation are often necessary components in neural synthesis and analysis workflows.

Examples of just four of the many tasks necessary for facial workflows, each of which often requires burdensome and mutually unrelated libraries that must be individually hooked into new and emerging systems*.

Such is the culture of ad hoc assembly in the synthesis research scene that interoperability is rarely the chief objective in new developments for any one of these tasks. Frequently, libraries and modules that handle these specialized facets require dedicated sub-environments, or even conflicting Python versions, in order to operate. Further, they are likely to need to run sequentially, slowing down inference or training times.

Worse yet, they may operate with dedicated token systems or other methodologies that necessitate additional translation or extraction modules in order to pass the necessary variables to a subsequent or otherwise incompatible phase. In extreme cases, the only way to access such processed variables may be to apply them to a CGI head and then re-evaluate that head with the next system in line.

It would be most useful to the research scene if these multiple functionalities could be incorporated into an ‘all-in-one’ solution capable not only of performing all these tasks, but of providing a lingua franca between the variables that each task undertakes, so that the burden of development and inference could be greatly reduced.

Such is the scope of a new work from Johns Hopkins University, titled FaceXFormer, a potential ‘Swiss army knife’ for neural facial tasks.

Some of the transformational processes now possible in the single FaceXFormer framework. Source: https://arxiv.org/pdf/2403.12960.pdf

Though some projects have come forth in recent years to at least integrate some of the broken-up sub-tasks in each category, very few have attempted to conglomerate a wider array of tasks into a unified framework.

The new system, instead, offers a generalized face representation capable of handling in-the-wild images (i.e., images that were not seen during the training of the model); of producing variables that can be passed among the diverse tasks; and of improving the general synergy of such workflows.

The use of FaceXFormer on unseen in-the-wild images, featuring segmentation, landmark estimation and pose estimation (with source image far left).

The new paper is titled FaceXFormer: A Unified Transformer for Facial Analysis, and comes from four researchers at JHU. The release also has a project page.


The new system makes extensive use of Transformers, where attention is orchestrated across the disparate variables coming in from each task, the results of which are ultimately fed to a novel FaceX module – a decoder that transforms each analysis task into a token, and concatenates these for co-processing within the unified system. The authors state that this leads to improved generalization.

Conceptual schema for the FaceXFormer workflow.

The system uses a multi-scale encoding strategy to accommodate the differing requirements of the various utilities provided. The authors state:

‘For instance, age estimation requires a global representation, while face parsing necessitates a fine-grained representation. Given an input image I, it is processed through a set of encoder layers. For each encoder layer, the output captures information at varying levels of abstraction and detail, generating multi-scale [features] where i ranges from 1 to 4.

‘This results in a hierarchical structure of features, wherein each feature [map] transitions from a coarse to a fine-grained representation suitable for diverse facial tasks.’
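
The hierarchy the authors describe can be sketched with a toy multi-scale encoder in PyTorch (the framework the paper trains with); the stage widths and convolutional design below are illustrative stand-ins, not the paper's actual backbone:

```python
import torch
import torch.nn as nn

class TinyMultiScaleEncoder(nn.Module):
    """Toy stand-in for FaceXFormer's image encoder: each stage halves
    the spatial resolution, yielding coarse-to-fine feature maps F_1..F_4."""
    def __init__(self, channels=(16, 32, 64, 128)):
        super().__init__()
        self.stages = nn.ModuleList()
        in_ch = 3
        for out_ch in channels:
            self.stages.append(nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
                nn.GELU(),
            ))
            in_ch = out_ch

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)  # F_i, where i ranges from 1 to 4
        return feats

img = torch.randn(1, 3, 224, 224)  # the resolution used in the paper
feats = TinyMultiScaleEncoder()(img)
print([f.shape[-1] for f in feats])  # spatial sizes: [112, 56, 28, 14]
```

Each successive map carries more channels over a smaller grid, which is the coarse-to-fine hierarchy the quote describes.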

Inspired by the 2021 SegFormer project, a lightweight Multilayer Perceptron (MLP) fusion module is used to create a facial representation concatenated from the various feature maps, with each map passed through a separate, standardizing MLP layer.

These transformed features are then re-concatenated and passed to the fusion MLP layer to obtain a single unified representation comprising the multiple scales.
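
A minimal sketch of this SegFormer-style fusion step, with illustrative channel widths (the per-scale 'MLP' is expressed here as a 1×1 convolution, a common equivalent formulation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLPFusion(nn.Module):
    """Sketch of SegFormer-style fusion: each multi-scale map is projected
    to a common width by its own layer, resized to one resolution,
    concatenated, and passed through a fusion layer."""
    def __init__(self, in_channels=(16, 32, 64, 128), dim=64):
        super().__init__()
        self.proj = nn.ModuleList(nn.Conv2d(c, dim, 1) for c in in_channels)  # per-scale MLP
        self.fuse = nn.Conv2d(dim * len(in_channels), dim, 1)                 # fusion MLP

    def forward(self, feats):
        target = feats[0].shape[-2:]  # resize everything to the finest scale
        proj = [F.interpolate(p(f), size=target, mode='bilinear', align_corners=False)
                for p, f in zip(self.proj, feats)]
        return self.fuse(torch.cat(proj, dim=1))

feats = [torch.randn(1, c, s, s) for c, s in zip((16, 32, 64, 128), (112, 56, 28, 14))]
fused = MLPFusion()(feats)
print(fused.shape)  # torch.Size([1, 64, 112, 112])
```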

Inspired by the Detection Transformer (DETR) initiative, which uses tokens to learn bounding box predictions for objects, FaceXFormer employs task tokens, each of which represents the extracted knowledge from one of the specific facial tasks. The new iteration is more parameter-efficient than DETR, according to the authors, and consists of three facets: face-to-task cross-attention; task self-attention; and task-to-face cross-attention.
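
The three attention stages can be sketched as follows; the dimensions, head counts, and residual wiring are illustrative assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn

class FaceXDecoderBlock(nn.Module):
    """Sketch of the three attention facets described for the FaceX decoder
    (layer sizes and wiring here are illustrative, not the paper's)."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.face_to_task = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.task_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.task_to_face = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, face_tokens, task_tokens):
        # 1. Face-to-task: face tokens query the task tokens
        face_tokens = face_tokens + self.face_to_task(face_tokens, task_tokens, task_tokens)[0]
        # 2. Task self-attention: task tokens exchange information with each other
        task_tokens = task_tokens + self.task_self(task_tokens, task_tokens, task_tokens)[0]
        # 3. Task-to-face: task tokens query the refined face tokens
        task_tokens = task_tokens + self.task_to_face(task_tokens, face_tokens, face_tokens)[0]
        return face_tokens, task_tokens

face = torch.randn(1, 196, 64)   # flattened face features
tasks = torch.randn(1, 8, 64)    # one learnable token per task
face_out, task_out = FaceXDecoderBlock()(face, tasks)
print(face_out.shape, task_out.shape)
```

The key design point is that every task token attends to the same shared face representation, which is what lets the tasks co-process in one pass.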

Regarding the unified head (middle right in the schema illustration above), the authors state:

‘[The] output face [tokens] and task [tokens] are processed through a Task-to-Face Cross-Attention mechanism to obtain final refined features.

‘Then, the output tokens are fed into their corresponding task heads. The task head for landmark detection and head pose estimation is a regression MLP, while the tasks of estimating age, gender, race, visibility, and attributes recognition utilize classification MLPs.

‘For face parsing, we leverage the [output] and process it through an upsampling layer, then perform a cross-product with the face parsing token to obtain a segmentation map.

‘Note, the number of tokens for segmentation corresponds to the total number of classes, and for regression and classification tasks, one token is used per task.’
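
The segmentation 'cross-product' the authors describe amounts to an inner product between each class token and every pixel feature; a sketch with hypothetical shapes:

```python
import torch

# Hypothetical shapes: D=64 channels, K=11 parsing classes, a 56x56 output grid
D, K, H, W = 64, 11, 56, 56
features = torch.randn(1, D, H, W)       # upsampled decoder output
parse_tokens = torch.randn(1, K, D)      # one token per segmentation class

# Cross-product of class tokens with pixel features -> per-class logits per pixel
seg_map = torch.einsum('bkd,bdhw->bkhw', parse_tokens, features)
labels = seg_map.argmax(dim=1)           # per-pixel class assignment
print(seg_map.shape, labels.shape)
```

This is why the number of segmentation tokens must equal the number of classes, while regression and classification tasks need only one token each.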

Task tokens at work on in-the-wild images in the FaceXFormer system.

For training, different loss functions are (of necessity) used for each task. Many of the sub-tasks being corralled together into the system have conflicting needs, requiring multiple approaches for rationalization and integration into a cohesive, multi-faceted processing workflow.

In FaceXFormer’s unified framework, face parsing uses Cross-Entropy loss; landmark estimation uses wing loss; head pose estimation uses geodesic loss; and various combinations of these are used for attributes recognition and age/race/gender attribution.
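
These per-task losses can be sketched as follows; the wing-loss constants and the equal weighting of the terms are illustrative, not the paper's settings:

```python
import torch
import torch.nn.functional as F

def wing_loss(pred, target, w=10.0, eps=2.0):
    """Wing loss for landmark regression: logarithmic near zero, linear beyond w."""
    x = (pred - target).abs()
    c = w - w * torch.log(torch.tensor(1 + w / eps))
    return torch.where(x < w, w * torch.log(1 + x / eps), x - c).mean()

def geodesic_loss(r_pred, r_gt):
    """Geodesic distance between rotation matrices, used for head pose."""
    m = torch.bmm(r_pred, r_gt.transpose(1, 2))
    cos = (m.diagonal(dim1=1, dim2=2).sum(-1) - 1) / 2
    return torch.acos(cos.clamp(-1 + 1e-6, 1 - 1e-6)).mean()

# Per-task losses combined into a single training objective (weights illustrative)
landmarks = wing_loss(torch.randn(4, 68, 2), torch.randn(4, 68, 2))
pose = geodesic_loss(torch.eye(3).repeat(4, 1, 1), torch.eye(3).repeat(4, 1, 1))
parsing = F.cross_entropy(torch.randn(4, 11, 56, 56), torch.randint(0, 11, (4, 56, 56)))
total = landmarks + pose + parsing
print(float(total) > 0)
```

Backpropagating through the summed objective is what forces the shared representation to serve all the conflicting tasks at once.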

The system can perform tasks selectively, and does not need to engage all possible functionalities, generating outputs only in line with the user-requested tasks.
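
Selective execution of this kind can be sketched as a registry of task heads from which only the requested ones run (the head names and dimensions here are hypothetical):

```python
import torch
import torch.nn as nn

# Hypothetical registry of per-task heads; only the requested ones are executed
heads = nn.ModuleDict({
    'landmarks': nn.Linear(64, 68 * 2),  # regression head
    'pose': nn.Linear(64, 3),            # regression head
    'age': nn.Linear(64, 9),             # classification head
})

def infer(task_tokens, requested):
    # task_tokens: dict of per-task decoder outputs, each of shape (B, 64)
    return {name: heads[name](task_tokens[name]) for name in requested}

tokens = {name: torch.randn(2, 64) for name in heads}
out = infer(tokens, ['landmarks', 'age'])
print(sorted(out))  # ['age', 'landmarks'] -- the 'pose' head is never run
```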

Data and Tests

A total of ten datasets were used in the training and testing of FaceXFormer. For face parsing, training used the LaPa dataset along with CelebAMaskHQ; for landmark and head-pose estimation, the authors used 300W, Wider Facial Landmarks in the Wild (WFLW), 300W-LP, and AFLW2000; for attribute recognition, the project employs the popular CelebA collection; for age, gender and race estimation, FaceXFormer used UTKFace and FairFace; and for landmark visibility prediction, COFW.

For intra- and cross-dataset testing, additional collections used were 300VW; BIWI; LFWA (Labeled Faces in the Wild); and FFHQ.

Evaluation metrics used were Area Under the Curve (AUC) for landmark estimation, Mean Absolute Error (MAE) for head pose estimation, and accuracy for attribute recognition, landmark visibility, and age, race and gender attribution.

The models were trained with PyTorch's distributed framework across eight A5000 GPUs, each with 24GB of VRAM, and were initialized with pretrained ImageNet weights. Images were processed at a resolution of 224x224px, and the AdamW optimizer was used.

All the models were trained for 12 epochs (i.e., twelve complete ‘looks’ at the available data) at a batch size of 48 per GPU. Data augmentation was also used, including the random application of Gaussian blur, conversion to grayscale, gamma adjustment, rotation, occlusion (blacking out sections of the image), flipping, and affine transformations.
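
A toy version of such an augmentation pipeline, operating on an image tensor (the probabilities and patch geometry are illustrative, not the paper's):

```python
import torch

def augment(img):
    """Toy sketch of random grayscale, horizontal flip, and occlusion
    on a (3, H, W) image tensor. Probabilities are illustrative."""
    if torch.rand(1).item() < 0.2:   # grayscale: average channels, re-expand
        img = img.mean(dim=0, keepdim=True).repeat(3, 1, 1)
    if torch.rand(1).item() < 0.5:   # horizontal flip
        img = img.flip(-1)
    if torch.rand(1).item() < 0.5:   # occlusion: black out a central patch
        _, h, w = img.shape
        img = img.clone()
        img[:, h // 4:3 * h // 4, w // 4:3 * w // 4] = 0.0
    return img

torch.manual_seed(0)
out = augment(torch.rand(3, 224, 224))
print(out.shape)  # augmentations preserve the (3, 224, 224) shape
```

Note that for landmark and pose tasks, geometric augmentations such as flips and affine transforms must also be applied to the labels, which is part of what makes multi-task pipelines fiddly to assemble.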

Rival methods tested included** EAGR; AGRNET; DML-CSR; FP-LIIF; Wing; SBR; DeCaFa; HRNet; STAR Loss; FDN; WHENet; TriNet; img2pose; TokenHPE; SSP+SSG; SSPL; Hetero-FAE; FairFace; MiVOLO; MTL-CNN; ProS; FaRL; HyperFace; and AllinOne.

A comparison between representative similar methods.

Of the quantitative results, the researchers comment:

‘FaceXFormer achieves state-of-the-art performance in face [parsing] and attributes recognition with a mean F1 of 90.46 over LaPa and CelebAMaskHQ, and a mean accuracy of 91.79 on CelebA. We observe that the performance on age/gender/race estimation is on par with some specialized bias-mitigation methods, showing that it exhibits minimal bias on these factors, even when trained on a small number of data points.

‘[…] FaceXFormer demonstrates competitive performance at par with leading methods like STAR Loss and TokenHPE for landmarks detection and head pose estimation respectively.’

The authors attribute their system's weaker performance in landmark detection to its avoidance of the time-consuming heatmap-based routines used in rival methods, along with other resource-draining approaches such as rotation matrices and 6D representations, which exceed the ambit of a generalized facial analysis system such as FaceXFormer.

They state:

‘We avoid using auxiliary information and advanced representations as our goal is not to specialize in a single task, but to learn a unified representation suitable for a wide spectrum of tasks.’

The tests included comparisons with two recent models that share an objective of unification of face-analysis tasks, i.e., the aforementioned ProS and FaRL. The researchers note that their work is also ‘tangential’ to the HyperFace and AllinOne initiatives, which also seek to develop a unified decoder across multiple tasks:

‘We observe that FaceXFormer outperforms other unified model baselines across multiple benchmarks. However, HyperFace performs better in head pose estimation. We observe that different tasks complement each other and boost each other’s performance, as can be seen in the case of HyperFace.

‘HyperFace is jointly trained for landmarks detection and head pose estimation, both of which are regression-based tasks that mutually enhance their performance. We observe a similar performance increase when FaceXFormer is trained exclusively on regression-based tasks.

‘This enhancement is primarily because the model has learned regression-specific feature representations, which are optimized for these tasks. However, when these tasks are combined with others that do not share the same feature requirements, we notice a decrement in performance.’

The authors additionally conducted a qualitative round of tests against HyperFace and AllinOne:

Qualitative tests against two similarly-scoped systems.

The authors observe that in the second row of the image above, superior performance in landmark detection is obtained, and further that the range of semantic segmentation classes (topmost row) exceeds the abilities of the two rival systems.


FaceXFormer is a promising Transformer-based approach to addressing the Babel of differing methods for related sub-tasks. The authors declare at the paper’s conclusion that, with the future addition of interactivity, it could form the basis of a possible foundation model.

The system’s under-performance for landmark detection, relative to tested rivals, perhaps demonstrates the fragility of the resource budget for systems of this type. Though the addition of heatmap systems such as GradCAM and other more sophisticated approaches could bring the entirety of the framework up to par, it would also slow it down, and it remains to be seen if that trade-off would be worthwhile.

Currently, the assortment of facial analysis tasks undertaken by the new project exists as a related but unassimilated set of objectives, with interoperability a severe point of friction for new projects that require more than one operation to be run on source faces. If FaceXFormer can advance the case for a drop-in architecture that addresses several of these needs on a level playing field, that is a worthwhile goal.

* Contributing papers, from left to right:





** Abbreviations used for readability.

Inline citations omitted as redundant; all links have already been provided.
