One Landmark Estimator to Rule Them All


Keypoints and landmarks have always been important in image synthesis, and particularly in facial synthesis. This is because the un-mapped face is basically just a ‘bunch of pixels’ to most computer vision systems; by adding a standard schema and number of dots to identify and delineate common key areas of the face, the system can then understand which pixels are associated with ‘eyes’, ‘mouth’, ‘eyebrows’, and so forth.

Two popular implementations of facial landmarks – Mediapipe (often used in Stable Diffusion pipelines) and FAN Align, which is at the heart of traditional deepfake architectures. Sources: https://github.com/1adrianb/face-alignment/ and https://arxiv.org/pdf/1907.06724.pdf.

Landmark estimators of this type are typically trained on relatively large-scale datasets, with exemplary manual annotations indicating to the system where the correct landmarks are likely to be found in new examples, so that the resulting model can evaluate landmarks accurately and automatically.

Two of the most influential such systems at present are Google’s MediaPipe, not least because it is used as a facial pose estimator (along with the more broadly applicable OpenPose system) in many implementations of Stable Diffusion; and Adrian Bulat’s FAN Align, which provides facial pose estimation for historically popular autoencoder deepfake systems such as DeepFaceLab, DeepFaceLive, and FaceSwap.

At the bottom of the second column, in this example from the DeepFaceLive application, we can see FAN Align landmarks being evaluated in real time. The model being used here was trained on the actual subject’s face as the ‘Before’ state. Before training took place, landmarks were estimated for thousands of images of each subject, so that the generalized system could eventually perform such evaluation in real time, and transpose the subject’s landmarks to those of the target. Source: https://www.youtube.com/watch?v=tY_uitc7JAE

Though facial landmarks are a popular method of ID verification (among other security applications), not all landmark frameworks are oriented toward the face. OpenPose has a dedicated hand landmark estimator, which is popular with Stable Diffusion users trying to combat that system’s historical inability to render accurate human hands – and MediaPipe also has a dedicated (but separate) hand landmarker:

The Google MediaPipe series of landmark annotators includes a dedicated entry for hand pose estimation. Source: https://developers.google.com/mediapipe/solutions/vision/hand_landmarker

Besides these, there are other hand landmark estimation approaches such as BodyHands and Nonparametric Structure Regularization Machine (NSRM Hand).

Moving away from faces and hands, full-body pose estimation is a popular research thread, not least from the standpoint of security applications such as gait recognition, which needs to impose common points onto an identified body mass in order to match known movement patterns when attempting to identify a person by the way that they walk.

There are more full-body pose estimation systems than we can even gloss over here, though some popular ones include 3D Human Pose Estimation, DECA, the Probabilistic Monocular 3D Human Pose Estimation with Normalizing Flows framework, EventHPE, Look into Person, and the Learning Skeletal Graph Neural Networks for Hard 3D Pose Estimation project, among many, many others.

Comparison of the 2021 DECA system against prior networks. Source: https://arxiv.org/pdf/2108.08557.pdf

So already we are at three separate systems, just for our own species. To date, no unified landmark system has been invented which offers the level of granularity and specificity that any one of these frameworks can; and ‘bolting’ a face or hand framework onto a full-body framework is a cumbersome procedure, due to the possibility of landmark overlap between competing systems, and a general lack of common standards in framework development.

Beyond Mankind

People are not the only life forms that may need to have landmarks applied. Animal recognition systems are frequently developed, often on a per-species basis, for diverse purposes such as conservation or general wildlife monitoring.

This animal identification system cannibalized a standard human landmarking framework in order to become potentially applicable to a wide range of wildlife. Source: https://arxiv.org/pdf/2001.02801.pdf

In general, not least because cats and dogs are unusually well-represented in open source datasets, many animal recognition systems that rely on landmarking are limited to these two species – Apple’s Vision Framework, for instance, can detect both types of animal, but no others.

Apple's Vision Framework can detect cats and dogs, but not a wider range of arbitrary animal species. Source: https://developer.apple.com/videos/play/wwdc2023/10045/

In theory, four-legged creatures could be amenable to a common landmarking framework, though this could only tell you if something is ‘an animal’ (and presumably would likewise identify a human on all fours as such) – which is of relatively limited use, without additional visual evaluation and verification.

Moreover, from the standpoint of image synthesis, this compartmentalized state of affairs makes it difficult to develop comprehensive movement-evaluation and generative systems that accurately represent even the entirety of a human body, never mind an unknown animal – or an unusual design, such as a squid, or a multi-legged configuration, such as any number of insect species.

What would be ideal is an aggregated landmarking system with some capability to roll in the latest SOTA frameworks and studies on known species, and perhaps even some capacity to automatically assign landmarks that conform to whichever part of the creature (or even thing, such as a train) is moving and demonstrating some level of articulation.

The UniPose Approach

Though it does not quite reach these dizzy heights of utility yet, a new proposal from China, titled UniPose, does indeed put forward a unified framework intended to derive keypoints from arbitrary subjects that either pivot from at least one point, or have recognizable boundaries.

UniPose can infer landmarks almost at will on any object that features some kind of articulation. Source: https://raw.githubusercontent.com/IDEA-Research/UniPose/master/asset/in-the-wild.jpg

Interestingly, UniPose, which evolved from several prior works, makes use of language prompts to improve detection, bridging the gap between traditional, pre-generative systems such as FAN Align and modern approaches such as Stable Diffusion, which leverage distilled information from text/image pairs during training, instead of corralling indiscriminate pixels with keypoints.

Indeed, the current trend towards modern generative systems such as Latent Diffusion Models (LDMs) has momentarily taken emphasis away from landmarking technologies, since the popular paradigm now is to use joint training of words and images (as in CLIP and OpenCLIP) to make semantic ‘sense’ of the training data, and to turn (for instance) facial characteristics into computer vision features – extracted nodes of text/image correlation which do not require the skeletal armatures that landmarking provides.

Until, that is, you try to animate anything. The use of motion priors is the current vogue, i.e., studying hundreds, or even thousands of clips of (for instance) ‘a person walking’, and finding common movement vectors which can arbitrarily be applied to novel characters or figures.

However, the difference between controlling a set of landmarks and coaxing motion priors into exactly the configuration that you want is akin to the difference between steering a bicycle and an oil tanker: just as CGI is increasingly encroaching on neural synthesis as a layer of ‘known’ instrumentality, landmarking can still, potentially, offer a greater level of control, not least because it can be hotlinked to real-world and real-time motion capture data, and even to highly controllable CGI output.

Therefore the authors of UniPose have gathered together the existing knowledge of multiple specific keypoint systems into a new concatenated dataset, titled UniKPT, despite the challenges involved in straddling the diverse distributions and standards of a multitude of prior datasets.

Examples of various contributing datasets to the new unified UniKPT collection. Source: https://arxiv.org/pdf/2310.08530.pdf

The authors state:

‘As keypoint detection tasks are unified in this framework, we can leverage 13 keypoint detection datasets with 338 keypoints across 1,237 categories over 400K instances to train a generic keypoint detection model. UniPose can effectively align text-to-keypoint and image-to-keypoint due to the mutual enhancement of textual and visual prompts based on the cross-modality contrastive learning optimization objectives.

‘Our experimental results show that UniPose has strong fine-grained localization and generalization abilities across image styles, categories, and poses. Based on UniPose as a generalist keypoint detector, we hope it could serve fine-grained visual perception, understanding, and generation.’

Inference code and a demo are currently scheduled for the end of October, along with the release of checkpoints at the project’s GitHub repository, though it remains to be seen if a fully implementable architecture is released to the public.

The new paper is titled UniPose: Detecting Any Keypoints, and comes from four authors across the International Digital Economy Academy (IDEA, at Shenzhen), and the School of Data Science at the Shenzhen Research Institute of Big Data.

Method

UniPose is an end-to-end system that decodes instance-level representations from an input source image by defining bounding boxes, and transforms the extracted data into pixel-based representations characterized by object keypoints. The incorporated prompt encoder makes use of semantic recognition to develop these keypoints.

The central mechanism for coarse-to-fine keypoint detection is a Cross-Modality Interactive Encoder (CMIE), which uses Transformers (cross attention) to coordinate the workflows of the three distinct strategies incorporated: the central processing backbone; a textual prompt encoder; and a visual prompt encoder.
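
The sketch below is a rough, hypothetical illustration of how such a cross-modality interactive encoder might be wired up in PyTorch – the module and variable names are our own assumptions, and the real UniPose implementation will differ – but it shows the essential idea of image tokens attending to concatenated textual and visual prompt tokens:

```python
# A minimal, hypothetical sketch (names assumed, not from the UniPose code):
# image tokens from the backbone attend to concatenated textual and visual
# prompt tokens, then refine themselves with self-attention and a feed-forward layer.
import torch
import torch.nn as nn

class CrossModalityInteractiveEncoder(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, image_tokens, text_prompts, visual_prompts):
        # image_tokens:   (B, N_img, dim) from the central backbone
        # text_prompts:   (B, N_txt, dim) from the textual prompt encoder
        # visual_prompts: (B, N_vis, dim) from the visual prompt encoder
        prompts = torch.cat([text_prompts, visual_prompts], dim=1)
        x = image_tokens
        x = self.norm1(x + self.cross_attn(x, prompts, prompts)[0])  # cross-modality attention
        x = self.norm2(x + self.self_attn(x, x, x)[0])               # intra-image attention
        return self.norm3(x + self.ffn(x))
```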

Conceptual architecture for UniPose.

The textual prompt encoder uses a hierarchical mapping structure to drill down to specifics in a recognized entity. In much the same way that CGI models contain child objects (body > head > eye, etc.), the semantic structure of the encoder breaks a recognized entity down into sub-entities at the text level (i.e., ‘A [IMAGE STYLE] photo of a [OBJECT]’s [PART]’s [KEYPOINT]’).

During training, random dropout is used as a ‘masking’ strategy, periodically hiding parts of the obtained text so that the system does not memorize phrases or text segments by rote, makes superior inferences, and therefore generalizes better to unseen data.
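
As a purely illustrative sketch (the helper name, dropout probability and exact template handling are assumptions, not taken from the paper), the hierarchical prompt construction and random text-dropout might look something like this:

```python
# Illustrative only: a hypothetical helper that fills the hierarchical template
# 'A [IMAGE STYLE] photo of a [OBJECT]'s [PART]'s [KEYPOINT]' and randomly drops
# parts of the text during training, as a dropout-style masking strategy.
import random

def build_keypoint_prompt(image_style, obj, part, keypoint, drop_prob=0.3, training=True):
    if training:
        if random.random() < drop_prob:
            image_style = ""   # occasionally hide the style descriptor
        if random.random() < drop_prob:
            part = ""          # occasionally hide the intermediate body-part level
    pieces = ["A", image_style, "photo of a", f"{obj}'s"]
    if part:
        pieces.append(f"{part}'s")
    pieces.append(keypoint)
    return " ".join(p for p in pieces if p)

# build_keypoint_prompt("oil painting", "person", "face", "left eye", training=False)
# -> "A oil painting photo of a person's face's left eye"
```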

The UniPose Visual Prompt Encoder redresses some of the limitations of CLIP, which uses a Vision Transformer (ViT) encoder that can only derive image representations from learnable tokens and patch tokens. To these, UniPose adds keypoint positional encodings (represented on the right of the left-most column of the image below).

Schema for the Visual Prompt Encoder in UniPose.

Two additional token initialization routines are incorporated into this part of the system: NeRF-derived Fourier embedding and a shared learnable mask token.
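
The following is a minimal sketch of how these two routines could be combined; the tensor shapes and module names are assumptions for illustration, not the paper's own code:

```python
# A minimal sketch (shapes and names assumed): NeRF-style Fourier embedding of
# normalized 2D keypoint coordinates, with a single shared learnable token
# standing in for keypoints that are missing or deliberately masked.
import math
import torch
import torch.nn as nn

class KeypointPromptTokens(nn.Module):
    def __init__(self, dim=256, num_freqs=8):
        super().__init__()
        self.num_freqs = num_freqs
        self.proj = nn.Linear(2 * 2 * num_freqs, dim)           # 2 coords x (sin, cos) x F
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))  # shared learnable mask token

    def fourier(self, xy):
        # xy: (B, K, 2) keypoint coordinates normalized to [0, 1]
        freqs = (2.0 ** torch.arange(self.num_freqs, device=xy.device)) * math.pi
        angles = xy.unsqueeze(-1) * freqs                        # (B, K, 2, F)
        return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(-2)

    def forward(self, xy, visible):
        # visible: (B, K) boolean; invisible keypoints receive the shared mask token
        tokens = self.proj(self.fourier(xy))                     # (B, K, dim)
        return torch.where(visible.unsqueeze(-1), tokens, self.mask_token.expand_as(tokens))
```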

From Facebook's FAIR initiative, where the Masked Autoencoder (MAE) architecture obscures parts of the training data, as a form of dropout, to improve generalization. Source: https://arxiv.org/pdf/2111.06377.pdf

UniPose needs to provide prompts across a range of modalities (text and image, for instance), and across the diverse approaches that the system draws together, so that text/image data pairings can be leveraged uniformly. To this end, for the cross-modality interactive encoder, novel self-attention layers are added to prior layer designs taken from the Pose Estimation framework with Transformers (PETR) project and the ED-Pose research initiative (also from IDEA, contributors to the new paper).

For the cross-modality interactive decoder, the authors have unhooked the instance-level decoder from the keypoint-level decoder, which allows iterative improvement of keypoints without affecting ancillary data. Prompt representations are also added to the queries at this point, through cross attention.
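
A hypothetical structural sketch of this decoupling follows; the query counts, layer choices and weight sharing are our own simplifications rather than the authors' design:

```python
# Hypothetical structure (not the authors' code): separate query sets and decoder
# layers for instances and keypoints, so that keypoints can be refined iteratively
# without disturbing instance-level predictions. Prompt information is injected
# into both query sets via cross-attention.
import torch
import torch.nn as nn

class DecoupledDecoder(nn.Module):
    def __init__(self, dim=256, heads=8, num_instances=100, num_keypoints=68):
        super().__init__()
        self.instance_queries = nn.Parameter(torch.randn(num_instances, dim))
        self.keypoint_queries = nn.Parameter(torch.randn(num_keypoints, dim))
        self.prompt_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.instance_decoder = nn.TransformerDecoderLayer(dim, heads, batch_first=True)
        self.keypoint_decoder = nn.TransformerDecoderLayer(dim, heads, batch_first=True)

    def forward(self, memory, prompts):
        # memory: (B, N, dim) encoder output; prompts: (B, P, dim) prompt tokens
        B = memory.size(0)
        inst_q = self.instance_queries.unsqueeze(0).expand(B, -1, -1)
        kpt_q = self.keypoint_queries.unsqueeze(0).expand(B, -1, -1)
        inst_q = inst_q + self.prompt_attn(inst_q, prompts, prompts)[0]
        kpt_q = kpt_q + self.prompt_attn(kpt_q, prompts, prompts)[0]
        instances = self.instance_decoder(inst_q, memory)   # boxes / categories downstream
        keypoints = self.keypoint_decoder(kpt_q, memory)    # keypoint coordinates downstream
        return instances, keypoints
```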

During training, in contrast to previous approaches (which center on closed-set object detection), UniPose encodes multi-modality prompts (either text or image) into the apposite object prompt, using contrastive loss and prompt tokens for classification purposes.
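
In outline, such a contrastive classification objective might be expressed as follows – this is an assumed formulation for illustration, not the paper's exact loss:

```python
# Assumed formulation, for illustration: a contrastive objective that aligns decoded
# object (or keypoint) features with the embeddings of candidate prompts, replacing
# a fixed closed-set classification head.
import torch
import torch.nn.functional as F

def prompt_contrastive_loss(object_feats, prompt_feats, targets, temperature=0.07):
    # object_feats: (N, dim) decoded instance/keypoint features
    # prompt_feats: (C, dim) one embedding per candidate prompt (category or keypoint name)
    # targets:      (N,)    index of the prompt that matches each object
    object_feats = F.normalize(object_feats, dim=-1)
    prompt_feats = F.normalize(prompt_feats, dim=-1)
    logits = object_feats @ prompt_feats.t() / temperature    # cosine similarities
    return F.cross_entropy(logits, targets)
```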

Pivotal to the new system is its novel prompt-to-keypoint alignment, based on a unified set of keypoint definitions. This functionality is central to the development of the UniKPT keypoint detection dataset, which incorporates 13 previous, topic-specific (rather than topic-agnostic) keypoint datasets. These prior collections straddle animal and human subjects, as well as insects and objects (such as cars).

The contributing datasets for UniKPT.

The final unified database contains 226,547 images featuring 418,487 instances, with 338 keypoint types and 1,237 instance categories. The authors note:

‘In particular, for articulated objects like humans and animals, we further categorize them based on biological taxonomy, resulting in 1,216 species, 66 families, 23 orders, and 7 classes.’

The researchers further observe that a great deal of work was necessary to rationalize the shortcomings or over-specified aspects of the various datasets, and that certain datasets had to be augmented by adding location descriptions for keypoints and by standardizing orientation directions, among numerous other administrative chores.

Data and Tests

UniPose was evaluated against multiple criteria, and multiple former frameworks, in an unusually comprehensive, even exhaustive slate of tests. For a full review of all the experiments, we refer the reader to the source paper and appendix; here we broadly cover the basic categories of experiments conducted.

To test keypoint detection on unseen (novel) objects, the rival frameworks used were ProtoNet, Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (MAML), Fine-tune, POMNet and CapeFormer. The data used was the Multi-category Pose (MP-100) dataset. Inference was conducted on an A100, and training was done at a batch size of 1.

Results from the 'unseen objects and keypoint detection' round. The terms 'TD' and 'E2E' refer to 'top-down' and 'end-to-end' methodologies.

For tests with ground-truth bounding boxes, and without being asked to generalize to unseen objects, UniPose achieves state-of-the-art results. Regarding the second part of this test, the authors state:

‘[In] the absence of ground-truth bounding boxes, UniPose exhibits a significant improvement over CapeFormer in terms of average PCK, achieving a significant increase of 42.8%, thanks to UniPose’s generalization ability for both unseen instance and keypoint detection for multiple objects.’
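
For reference, PCK (Percentage of Correct Keypoints) counts a predicted keypoint as correct when it falls within a threshold distance of the ground truth, normalized by a measure of instance size. The sketch below uses the longest bounding-box side and a 0.2 threshold – common conventions, though not necessarily the exact protocol of the paper:

```python
# Minimal PCK sketch; the normalization choice (longest box side) and the 0.2
# threshold are assumptions, as benchmarks differ in how they define them.
import numpy as np

def pck(pred, gt, bbox_size, alpha=0.2):
    # pred, gt: (K, 2) predicted and ground-truth keypoint coordinates
    # bbox_size: scalar normalization, e.g. the longest side of the instance box
    dists = np.linalg.norm(pred - gt, axis=-1)
    return float((dists <= alpha * bbox_size).mean())
```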

In tests for generic keypoint detection, the rival frameworks were ViTPose++ and the aforementioned ED-Pose.

Results for generic keypoint detection. Multiple rival models were tested against several datasets (listed in the columns heading right). Expert models are indicated by dark indigo, and trained methods which cannot handle unseen datasets are indicated in lighter indigo. As usual, best results are bold; unusually, second-best results are underlined. 'T' and 'V' indicate textual or visual prompts used.

Of these results, the authors state:

‘The results demonstrate that UniPose consistently delivers superior performance across all datasets. Notably, compared to ViTPose++, which lacks the capability to handle unseen datasets with different keypoint structures, UniPose excels by detecting more objects and keypoints in an end-to-end manner.’

The next test was a baseline comparison with ED-Pose, a major contributor to UniPose. Both UniPose and ED-Pose were trained on the datasets COCO, Human-Art, AP-10K, and APT-36K.

Results for the baseline comparison, using a Swin-T backbone.

Here the authors comment:

‘[Results] show that UniPose outperforms ED-Pose across all datasets in terms of both instance-level and keypoint-level detection. Moreover, for the AP-10K dataset, which involves the classification of 54 different species, UniPose surpasses ED-Pose with a 27.7 AP improvement, thanks to instance-level and keypoint-level alignments.’

Next, as also illustrated earlier in this article, a qualitative round was conducted.

Some examples from qualitative testing.

The authors comment ‘Given an input image and textual prompts, UniPose can perform well for any articulated, rigid, and soft [objects].’

Further tests were conducted to compare UniPose to the generalist models Unified-IO, Painter, and InstructDiffusion, in regard to keypoint evaluation outcomes. UniPose was able to lead the board in this test as well:

Comparison to generalist models.

Finally, two tests were conducted for open-vocabulary models, the first contender being CLIP. The models were tested on 54 animal categories and on Human-Art, which contains 15 image styles.

Results for a comparison with CLIP.

The authors state:

‘Results show that UniPose consistently provides higher-quality text-to-image similarity scores at the instance and keypoint levels.’

Last of all, UniPose was compared with the SOTA open-vocabulary object detector GroundingDINO, in regard to instance and keypoint detection.

Comparison across two concatenated results tables for tests against the unified GroundingDINO model, arguably the nearest competitor to UniPose. See source paper for better resolution.

In the image above, the results against the COCO dataset are seen at the top. Below that are the results for other datasets (listed in column second-from-right).

The paper states:

‘Grounding-DINO fails to localize fine-grained keypoints, UniPose successfully addresses these challenges, achieving dramatic improvements across all datasets. For the instance detection, UniPose has slight superiority. Although the fine-tuning on instance detection of Grounding-DINO can help, the keypoint detection will worsen.’

Conclusion

Whether or not efforts such as UniPose represent quixotic gestures, in the face of newer technologies, depends on the extent to which approaches such as motion priors and feature-level manipulation ever become more accessible, controllable and instrumentalized.

So long as neural rendering and generative technologies continue to produce projects which – though dazzling – output random and non-repeatable content (such as wonderful faces and poses stumbled upon in the latent space of the latest and greatest network), their output can only remain a random grab-bag for one-off Reddit posts, or Instagram cliplets.

Rather, professional workflows need professional reliability – something that none of recent years’ crop of new methods provides, with the possible exception of NeRF (which is hard to edit and/or hard to render at production resolutions).

Therefore, one can see the prospect of a unified keypoint detector in the same category as the 3DMM and SMPL CGI models that are increasingly being used to give more form and consistency to neural output. In one sense, landmarks are no different to nodes on a CGI-based mesh. If the neural scene wishes to dispense with decades-old approaches such as these, it is going to have to come up with something at least as reliable and replicable as landmark evaluation or mesh-based instrumentality.

The research scene and the world alike were dazzled by the output of Generative Adversarial Networks (GANs) when they emerged, and it seemed that reining them in would be trivial. Years later, it is proving rather more difficult. It is possible that latent diffusion models, despite their apparent promise, may likewise prove intractable, in this respect.

The latent space is dazzling, but chaotic and willful; and – reversing the conventional wisdom – it may transpire that new tricks require an old dog.
