Keypoints and landmarks have long been important in image synthesis, and particularly in facial synthesis. This is because the unmapped face is basically just a ‘bunch of pixels’ to most computer vision systems; by adding a standard schema and number of dots to identify and delineate common key areas of the face, the system then understands which pixels are associated with ‘eyes’, ‘mouth’, ‘eyebrows’, and so forth.
Landmark estimators of this type are typically trained on relatively large-scale datasets, in which exemplary manual annotations show the system where the correct landmarks are likely to be found, so that the resulting model can estimate landmarks automatically and accurately in unseen examples.
Two of the most influential such systems at the present are Google’s MediaPipe, not least because it is used as a facial pose estimator (along with the more broadly applicable OpenPose system) in many implementations of Stable Diffusion; and Adrian Bulat’s FAN Align, which provides facial pose estimation for historically popular autoencoder deepfake systems such as DeepFaceLab, DeepFaceLive, and FaceSwap.
At the bottom of the second column, in this example from the DeepFaceLive application, we can see FAN Align landmarks being evaluated in real time. The model being used here was trained on the actual subject’s face as the ‘Before’ state. Before training took place, landmarks were estimated for thousands of images of each subject, so that the generalized system could eventually perform such evaluation in real time, and transpose the subject’s landmarks to those of the target. Source: https://www.youtube.com/watch?v=tY_uitc7JAE
Though facial landmarks are a popular method of ID verification (among other security applications), not all landmark frameworks are oriented at the face. OpenPose has a dedicated hand landmark estimator, which is popular with Stable Diffusion users trying to combat that system’s historical inability to render accurate human hands; and MediaPipe also has a dedicated (but separate) hand landmarker.
Moving away from faces and hands, full-body pose estimation is a popular research thread, not least from the standpoint of security applications such as gait recognition, which needs to impose common points onto the identified mass of a body in order to recognize known patterns when attempting to identify a person by the way that they walk.
There are more full-body pose estimation systems than we can even gloss over here, though some popular ones include 3D Human Pose Estimation, DECA, the Probabilistic Monocular 3D Human Pose Estimation with Normalizing Flows framework, EventHPE, Look into Person, and the Learning Skeletal Graph Neural Networks for Hard 3D Pose Estimation project, among many, many others.
So already we are at three separate systems, just for our own species. To date, no unified landmark system has been invented which offers the level of granularity and specificity that any one of these frameworks can; and ‘bolting’ a face or hand framework onto a full-body framework is a cumbersome procedure, due to the possibility of landmark overlap between competing systems, and a general lack of common standards in framework development.
People are not the only life forms that may need to have landmarks applied. Animal recognition systems are frequently developed, often on a per-species basis, for diverse purposes such as conservation or general wildlife monitoring.
In general, not least because cats and dogs are unusually well-represented in open source datasets, many animal recognition systems that rely on landmarking are limited to these two species, such as Apple’s Vision Framework, which can detect both types of animal.
In theory, four-legged creatures could be amenable to a common landmarking framework, though this could only tell you if something is ‘an animal’ (and presumably would likewise identify a human on all fours as such) – which is of relatively limited use, without additional visual evaluation and verification.
To boot, from the standpoint of image synthesis, this compartmentalized state of affairs makes it difficult to develop comprehensive movement evaluation systems and generative systems that accurately represent the entirety of even a human body, never mind an unknown animal, an unusual design such as a squid, or a multi-legged configuration such as any number of insect species.
What would be ideal would be an aggregated landmarking system that has some capability to roll in the latest SOTA frameworks and studies on known species, and maybe even develop some capacity to automatically assign landmarks that conform to the part of the creature (or even thing, such as a train) that is moving and demonstrating some level of articulation.
The UniPose Approach
Though it does not quite reach these dizzy heights of utility yet, a new proposal from China, titled UniPose, does indeed put forward a unified framework intended to derive keypoints from any arbitrary subjects that either pivot from at least one point, or have recognizable boundaries.
Interestingly, UniPose, which is evolved from several prior works, makes use of language prompts to improve detection, bridging the gap between traditional, pre-generative systems such as FAN Align and modern approaches such as Stable Diffusion, which leverage distilled information from text/image pairs during training, instead of corralling indiscriminate pixels with keypoints.
Indeed, the current trend towards modern generative systems such as Latent Diffusion Models (LDMs) has momentarily taken emphasis away from landmarking technologies, since the popular paradigm now is to use joint training of words and images (such as CLIP and OpenCLIP) to make semantic ‘sense’ of the training data, and to form (for instance) facial features into computer vision features – extracted nodes of text/image correlation which do not require the skeletal armatures which landmarking provides.
Until, that is, you try and animate anything. The use of motion priors is the current vogue, i.e., studying hundreds, or even thousands of clips of (for instance) ‘a person walking’, and finding common movement vectors which can arbitrarily be applied to novel characters or figures.
However, the difference between controlling a set of landmarks and coaxing motion priors into exactly the configuration that you want is akin to the difference between steering a bicycle and steering an oil tanker. Much as CGI is increasingly encroaching on neural synthesis as a layer of ‘known’ instrumentality, landmarking can potentially offer a greater level of control, not least because it can be hotlinked to real-world and real-time motion capture data, and even to highly controllable CGI output.
Therefore the authors of UniPose have gathered together the existing knowledge of multiple specific keypoint systems into a new concatenated dataset, titled UniKPT, despite the challenges involved in straddling the diverse distributions and standards of a multitude of prior datasets.
The authors state:
‘As keypoint detection tasks are unified in this framework, we can leverage 13 keypoint detection datasets with 338 keypoints across 1,237 categories over 400K instances to train a generic keypoint detection model. UniPose can effectively align text-to-keypoint and image-to-keypoint due to the mutual enhancement of textual and visual prompts based on the cross-modality contrastive learning optimization objectives.
‘Our experimental results show that UniPose has strong fine-grained localization and generalization abilities across image styles, categories, and poses. Based on UniPose as a generalist keypoint detector, we hope it could serve fine-grained visual perception, understanding, and generation.’
Inference code and a demo are currently scheduled for the end of October, along with the release of checkpoints at the project’s GitHub repository, though it remains to be seen if a fully implementable architecture is released to the public.
The new paper is titled UniPose: Detecting Any Keypoints, and comes from four authors across the International Digital Economy Academy (IDEA, at Shenzhen), and the School of Data Science at the Shenzhen Research Institute of Big Data.
UniPose is an end-to-end system that decodes instance-level representations from an input source image, by defining bounding boxes, and transforming the extracted data into pixel-based representations characterized by object keypoints. The incorporated prompt encoder makes use of semantic recognition to develop these keypoints.
The central mechanism for coarse-to-fine keypoint detection is a Cross-Modality Interactive Encoder (CMIE), which uses Transformers (cross attention) to coordinate the workflows of the three distinct strategies incorporated: the central processing backbone; a textual prompt encoder; and a visual prompt encoder.
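To make the role of cross attention more concrete, the following is a minimal sketch of single-head scaled dot-product cross attention, in which queries from one stream (for instance, image features) gather information from another stream (for instance, prompt tokens). It is not the paper's CMIE implementation: the function names are our own, and the learned projection matrices and multiple heads of a real Transformer layer are omitted for brevity.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: List[Vector],
                    keys: List[Vector],
                    values: List[Vector]) -> List[Vector]:
    """Each query attends over all keys and returns a weighted mix of values.

    Hypothetical helper: queries come from one modality, keys/values from
    another. No learned projections are applied in this sketch.
    """
    dim = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
    return outputs


# Two identical keys receive equal weight, so the output is the mean of the values.
result = cross_attention([[1.0, 0.0]],
                         [[1.0, 0.0], [1.0, 0.0]],
                         [[0.0, 2.0], [4.0, 0.0]])
print(result)
```

In a full model, the queries, keys and values would each pass through trained linear layers first, and several such heads would run in parallel.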
The textual prompt encoder uses a hierarchical mapping structure to drill down to specifics in a recognized entity. In much the same way that CGI models contain child objects (body > head > eye, etc.), the semantic structure of the encoder breaks down sub-entities in a recognized entity at a text level (i.e., ‘A [IMAGE STYLE] photo of a [OBJECT]’s [PART]’s [KEYPOINT]’).
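The hierarchical template quoted above can be illustrated with a small helper that assembles a prompt from the coarsest level (the object) down to the finest (the keypoint). This is only a sketch of the string structure; the function name and arguments are hypothetical, and the way in which UniPose actually tokenizes and encodes such prompts is not shown.

```python
from typing import Optional


def build_prompt(obj: str,
                 part: Optional[str] = None,
                 keypoint: Optional[str] = None,
                 style: Optional[str] = None) -> str:
    """Compose a text prompt from coarse (object) to fine (keypoint) levels."""
    head = f"A {style} photo of a" if style else "A photo of a"
    levels = [obj]
    # Walk down the hierarchy; stop at the first missing level, so that a
    # keypoint is never attached without its parent part.
    for level in (part, keypoint):
        if level is None:
            break
        levels.append(level)
    # Each deeper level is chained to its parent with a possessive.
    return f"{head} " + "'s ".join(levels)


print(build_prompt("cat", "head", "left eye", style="cartoon"))
# prints: A cartoon photo of a cat's head's left eye
print(build_prompt("cat"))
# prints: A photo of a cat
```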
During training, random dropout is used as a ‘masking’ strategy to aid generalization by periodically hiding parts of the obtained text, so that the system generalizes better, is able to make superior inferences, and does not memorize phrases or text-segments by rote. This means that the system should be able to work well later on unseen data.
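As a rough illustration of this kind of masking, the sketch below hides each level of a text prompt independently, with a fixed probability. The article does not say which parts of the text UniPose drops or how often, so the field names, the per-field granularity and the drop probability here are all assumptions made for the sake of the example.

```python
import random
from typing import Dict, Optional


def mask_prompt_fields(fields: Dict[str, str],
                       p_drop: float,
                       rng: random.Random) -> Dict[str, Optional[str]]:
    """Hide each field independently with probability p_drop (None = hidden)."""
    return {name: (None if rng.random() < p_drop else text)
            for name, text in fields.items()}


# Hypothetical prompt levels, mirroring the template described earlier.
fields = {"style": "cartoon", "object": "cat",
          "part": "head", "keypoint": "left eye"}

rng = random.Random(0)  # fixed seed so that the sketch is repeatable
for _ in range(3):
    # Each training step would see a differently masked version of the text.
    print(mask_prompt_fields(fields, p_drop=0.3, rng=rng))
```

Because the model cannot count on any particular phrase being present, it is pushed to rely on whatever cues remain, rather than on memorized wording.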
The UniPose Visual Prompt Encoder redresses some of the limitations of CLIP, which uses a Vision Transformer (ViT) encoder that can only derive image representations from learnable tokens and patch tokens. To these, UniPose adds keypoint positioning encodings (represented on the right of the left-most column of the image below).
UniPose needs to provide prompts across a range of modalities (text and image, for instance), and across the diverse approaches that the system draws together, so that text/image data pairings can be leveraged uniformly. To this end, for the cross-modality interactive encoder, novel self-attention layers are added on to prior layer designs taken from the Pose Estimation framework with Transformers (PETR) project, and the ED-Pose research initiative (also from IDEA, contributors to this new paper).
For the cross-modality interactive decoder, the authors have unhooked the instance-level decoder from the keypoint-level decoder, which allows iterative improvement of keypoints without affecting ancillary data. Prompt representations are also added to the queries at this point, through cross attention.
During training, in contrast to previous approaches (which center on detecting closed-set objects), UniPose encodes multi-modality prompts (either text or image) into the apposite object prompt, using contrastive loss and prompt tokens for classification purposes.
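A generic contrastive classification objective of this kind can be sketched as follows: an object embedding is compared against a set of candidate prompt embeddings by cosine similarity, and a softmax cross-entropy loss rewards a high score for the matching prompt. This is a textbook formulation rather than the paper's exact loss; the function names are ours, and the temperature value is purely illustrative.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(obj_emb: Sequence[float],
                     prompt_embs: List[Sequence[float]],
                     target: int,
                     temperature: float = 0.07) -> float:
    """Cross-entropy over prompt similarities; low when the target prompt wins.

    The temperature of 0.07 is an assumed, illustrative value.
    """
    logits = [cosine(obj_emb, p) / temperature for p in prompt_embs]
    peak = max(logits)
    # Log-sum-exp, computed stably.
    log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_norm - logits[target]


obj = [1.0, 0.0]                      # embedding of a detected object
prompts = [[1.0, 0.0], [0.0, 1.0]]    # embeddings of two candidate prompts
print(contrastive_loss(obj, prompts, target=0))  # matching prompt: near zero
print(contrastive_loss(obj, prompts, target=1))  # mismatched prompt: large
```

Since the set of candidate prompts can be changed at inference time, a classifier built in this way is not tied to a fixed list of categories.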
Pivotal to the new system is the novel use of prompt-to-keypoint alignment that uses a unified set of keypoint definitions. This functionality is central to the development of the UniKPT keypoint detection dataset, which incorporates 13 previous, topic-specific (rather than topic-agnostic) keypoint datasets. These prior collections straddle animal and human subjects, as well as insects, and objects (such as cars).
The final unified database contains 226,547 images featuring 418,487 instances, with 338 keypoint types and 1,237 instance categories. The authors note:
‘In particular, for articulated objects like humans and animals, we further categorize them based on biological taxonomy, resulting in 1,216 species, 66 families, 23 orders, and 7 classes.’
The researchers further observe that a great deal of work was necessary to rationalize the shortcomings or over-specced aspects across the datasets, and that it was necessary to augment certain datasets by adding location descriptions for keypoints, as well as standardizing orientation directions, among numerous other administrative chores.
Data and Tests
UniPose was evaluated against multiple criteria, and multiple former frameworks, in an unusually comprehensive, even exhaustive slate of tests. For a comprehensive review of all experiments, we refer the reader to the source paper and appendix, and here broadly cover the basic categories of experiments conducted.
To test keypoint detection in unseen (novel) objects, the rival frameworks used were ProtoNet, Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (MAML), Fine-tune, POMNet and Capeformer. The data used was the Multi-category Pose (MP-100) dataset. Inference was conducted on an A100, and training was done at a batch size of 1.
For tests with ground-truth bounding boxes, and without being asked to generalize to unseen objects, UniPose achieves state-of-the-art results. Regarding the second part of this test, the authors state:
‘[In] the absence of ground-truth bounding boxes, UniPose exhibits a significant improvement over CapeFormer in terms of average PCK, achieving a significant increase of 42.8%, thanks to UniPose’s generalization ability for both unseen instance and keypoint detection for multiple objects.’
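For readers unfamiliar with the metric, PCK (Percentage of Correct Keypoints) counts a predicted keypoint as correct when it falls within a set distance of the ground-truth location, with that distance scaled to the size of the object. The sketch below assumes a threshold of 0.2 times the longest side of the bounding box, which is a common convention; the article does not state the exact threshold or normalization used in these tests, and the function name is our own.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def pck(pred: Sequence[Point],
        gt: Sequence[Point],
        bbox_size: float,
        alpha: float = 0.2) -> float:
    """Fraction of predictions within alpha * bbox_size of the ground truth.

    bbox_size is assumed to be the longest side of the object's bounding box;
    alpha=0.2 is an assumed, commonly used threshold.
    """
    threshold = alpha * bbox_size
    correct = sum(
        1
        for (px, py), (gx, gy) in zip(pred, gt)
        if math.hypot(px - gx, py - gy) <= threshold
    )
    return correct / len(gt)


ground_truth = [(10.0, 10.0), (50.0, 50.0), (90.0, 90.0)]
predicted = [(12.0, 10.0), (50.0, 80.0), (90.0, 95.0)]

# Errors are 2, 30 and 5 pixels; with a 100-pixel box the threshold is 20,
# so two of the three keypoints count as correct.
print(pck(predicted, ground_truth, bbox_size=100.0))
```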
Of these results, the authors state:
‘The results demonstrate that UniPose consistently delivers superior performance across all datasets. Notably, compared to ViTPose++, which lacks the capability to handle unseen datasets with different keypoint structures, UniPose excels by detecting more objects and keypoints in an end-to-end manner.’
Here the authors comment:
‘[Results] show that UniPose outperforms ED-Pose across all datasets in terms of both instance-level and keypoint-level detection. Moreover, for the AP-10K dataset, which involves the classification of 54 different species, UniPose surpasses ED-Pose with a 27.7 AP improvement, thanks to instance-level and keypoint-level alignments.’
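The article reports AP figures without defining them. In keypoint benchmarks that follow the COCO convention, AP is built on Object Keypoint Similarity (OKS), which plays the role that IoU plays in box detection. Assuming that convention applies here, a minimal OKS sketch looks like this; the function name and the per-keypoint constants in the example are illustrative values, not taken from any dataset.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def oks(pred: Sequence[Point],
        gt: Sequence[Point],
        visible: Sequence[bool],
        area: float,
        k: Sequence[float]) -> float:
    """COCO-style Object Keypoint Similarity over the labelled keypoints.

    area is the object's area (the squared scale term), and k holds one
    per-keypoint falloff constant. Unlabelled keypoints are ignored.
    """
    total, count = 0.0, 0
    for (px, py), (gx, gy), is_visible, ki in zip(pred, gt, visible, k):
        if not is_visible:
            continue
        dist_sq = (px - gx) ** 2 + (py - gy) ** 2
        # Gaussian falloff: 1.0 for a perfect hit, approaching 0 for a miss.
        total += math.exp(-dist_sq / (2.0 * area * ki ** 2))
        count += 1
    return total / count if count else 0.0


gt_points = [(10.0, 10.0), (50.0, 50.0)]
pred_points = [(11.0, 10.0), (50.0, 53.0)]
score = oks(pred_points, gt_points, visible=[True, True],
            area=100.0 * 100.0, k=[0.05, 0.1])  # illustrative constants
print(score)
```

A prediction is then treated as a true positive when its OKS exceeds a given threshold, and AP is averaged over a range of such thresholds.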
Next, as also illustrated earlier in this article, a qualitative round was conducted.
The authors comment: ‘Given an input image and textual prompts, UniPose can perform well for any articulated, rigid, and soft [objects].’
Further tests were conducted to compare UniPose to the generalist models Unified-IO, Painter, and InstructDiffusion, with regard to keypoint evaluation outcomes. UniPose was able to lead the board in this test as well.
Finally, two tests were conducted for open-vocabulary models, the first contender being CLIP. The models were tested on 54 animal categories and on Human-Art, which contains 15 image styles.
The authors state:
‘Results show that UniPose consistently provides higher-quality text-to-image similarity scores at the instance and keypoint levels.’
In the image above, the results against the COCO dataset are seen at the top. Below that are the results for other datasets (listed in column second-from-right).
The paper states:
‘Grounding-DINO fails to localize fine-grained keypoints, UniPose successfully addresses these challenges, achieving dramatic improvements across all datasets. For the instance detection, UniPose has slight superiority. Although the fine-tuning on instance detection of Grounding-DINO can help, the keypoint detection will worsen.’
Whether or not efforts such as UniPose represent Quixotic gestures, in the face of newer technologies, depends on the extent to which approaches such as motion priors and feature-level manipulation ever become more accessible, controllable and instrumentalized.
So long as neural rendering and generative technologies continue to produce projects which – though dazzling – output random and non-repeatable content (such as wonderful faces and poses stumbled upon in the latent space of the latest and greatest network), they can only remain a random grab-bag for one-off Reddit posts, or Instagram cliplets.
Rather, professional workflows need professional reliability, something that is absent from recent years’ entire crop of new methods, with the possible exception of NeRF (which is hard to edit and/or hard to render at production resolutions).
Therefore, one can see the prospect of a unified keypoint detector in the same category as the 3DMM and SMPL CGI models that are increasingly being used to give more form and consistency to neural output. In one sense, landmarks are no different to nodes on a CGI-based mesh. If the neural scene wishes to dispense with decades-old approaches such as these, it is going to have to come up with something at least as reliable and replicable as landmark evaluation or mesh-based instrumentality.
The research scene and the world alike were dazzled by the output of Generative Adversarial Networks (GANs) when they emerged, and it seemed that reining them in would be trivial. Years later, it is proving rather more difficult. It is possible that latent diffusion models, despite their apparent promise, may likewise prove intractable, in this respect.
The latent space is dazzling, but chaotic and willful; and – reversing the conventional wisdom – it may transpire that new tricks require an old dog.