Improving Pose Estimation for Generative AI


About the author

Martin Anderson

I'm Martin Anderson, a writer occupied exclusively with machine learning, artificial intelligence, big data, and closely-related topics, with an emphasis on image synthesis, computer vision, and NLP.

There is currently much discussion around the possibility of AI-generated TV and production output in the relatively near future – not just the amending of details (such as faces) that were captured conventionally in a studio or on location, but the actual generation of complete photoreal human stories, where AI-generated people walk in and out of situations, interact with the world and each other, act in accordance with logic and physics, and in general appear as if they were conventional screen actors.

One of the challenges of creating the full-body synthetic people that will need to be considered for such a prospect is giving them convincing methods of moving, and authentically depicting them even from acute or difficult, non-conventional angles.

This requires machine learning systems that understand the general kinetic relationship that the individual parts of the human body have to each other. In the older field of CGI-based synthesis, entire paradigms such as inverse kinematics have developed over the past three or so decades to provide traditional VFX practitioners with simulated human subjects that conform to the same basic laws that limit our own movements in the real world:

A pair of representative human legs walking in accordance with strictures and limitations set up in an inverse kinematics system. The movement of these ‘matchstick’ people can be used to power far more convincing and fully-textured human depictions. Source:

In the example above we see a type of lower-body ‘matchstick figure’ obeying the limitations of inverse kinematics. Simple stick-like representations of this type are, in neural synthesis, a convenient method of distilling the human joint relationship, including inter-related motion in temporal space, into a base framework that can be understood, and later used to power actual photorealistic depictions.
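
To make the idea of inverse kinematics concrete, here is a minimal sketch of the classic two-segment planar case, which could stand in for a single leg: given a desired foot position, it solves for the hip and knee angles using the law of cosines. The function names, the segment lengths and the 2D simplification are all assumptions made for illustration, and are not taken from any system discussed here.

```python
import math


def two_link_ik(x, y, upper_len, lower_len):
    """Solve a planar two-segment chain (e.g. thigh + shin) for a foot target.

    Returns (hip_angle, knee_angle) in radians, with the hip at the origin.
    The knee angle is measured relative to the direction of the upper segment.
    """
    dist_sq = x * x + y * y
    dist = math.sqrt(dist_sq)
    # The target must lie within the reachable annulus of the chain.
    if dist > upper_len + lower_len or dist < abs(upper_len - lower_len):
        raise ValueError("target is out of reach for these segment lengths")

    # Law of cosines gives the bend at the knee.
    cos_knee = (dist_sq - upper_len ** 2 - lower_len ** 2) / (2 * upper_len * lower_len)
    cos_knee = max(-1.0, min(1.0, cos_knee))  # guard against rounding drift
    knee = math.acos(cos_knee)

    # Hip angle: direction to the target, corrected for the knee bend.
    hip = math.atan2(y, x) - math.atan2(
        lower_len * math.sin(knee), upper_len + lower_len * math.cos(knee)
    )
    return hip, knee


def forward_kinematics(hip, knee, upper_len, lower_len):
    """Recover the knee and foot positions from the joint angles."""
    knee_pos = (upper_len * math.cos(hip), upper_len * math.sin(hip))
    foot_pos = (
        knee_pos[0] + lower_len * math.cos(hip + knee),
        knee_pos[1] + lower_len * math.sin(hip + knee),
    )
    return knee_pos, foot_pos
```

Running the solved angles back through `forward_kinematics` should land the foot on the requested target, which is a convenient sanity check; production IK systems add joint limits and work in three dimensions.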

Prior Restraint

In terms of generative AI, the creation of ‘fake people’ that look real, and can move around and interact with environments, requires machine systems that have trained on high-volume video datasets and succeeded in generalizing multiple types of human motion. This generalized knowledge of movement is known as a motion prior.

From the Microsoft project ‘Learning Motion Priors for 4D Human Body Capture in 3D Scenes’, we see a full-bodied CGI capture obtained from a real-world video. Source:

In the video embedded above, we see priors being generated, with some effort, by capturing movement from a real-world video, which can then be neurally interpreted into complex animation. This kind of dense capture is typical of workflows that use 3D Morphable Models (3DMMs) and FLAME, among other kinds of CGI/neural interfaces.

This approach puts all the effort in the front end, at the capture and processing stages, and is suitable for complex simulacra such as recreating facial expressions, or approximations of cloth-based physics.

Sometimes, conversely, the brunt of the work is done at the end of the process, by a neural rendering engine that already has an understanding of volume, texture, and even physics. What such systems need is simple point-based data, similar to motion-capture (MoCap) data – a long-established method of converting real-world indicators of joints (i.e., the actual joints of a real person who is moving in a motion capture studio) into 3D space, providing a lightweight temporal sequence that can be worked up into a full render.

Though modern demo videos rarely show the ‘joined-up’ dot-style stick figure any longer, we can see it in action here, as the simplest representation of skeletal joint representations in MoCap footage. Source:

So, if the resulting ‘stick figure’ is all you have (i.e., pure, raw MoCap data), then the burden of rendering out a realistic figure falls entirely on the back end system.

One such usage of this minimal data is in the Openpose module of the ControlNet framework for the Stable Diffusion text-to-image generative denoising system.

ControlNet's Openpose module interprets raw and approximated pose information, illustrated above in the form of the stick figure in upper-central right of the image, into a realistic render. Source:

As we can see in the above illustration, ControlNet’s Openpose has extracted a figure pose approximation (upper center-right) from a user-provided photo (left), and used this pose information to generate text-prompted ‘chef’ characters.

However, as users of this module will know, Openpose can often experience some difficulty in interpreting complex poses. If the user is utilizing a LoRA or DreamBooth model, or some other customization/personalization method designed to render a particular character, this may be because the trained model did not train on enough information related to the pose that’s being asked, and the generalized knowledge in base Stable Diffusion is not able to bridge the gap.

Often, however, the shortfall is due to limitations in ControlNet, which can struggle with extreme poses, and also struggle to understand that the rendered character should, for instance, have their back to the camera (even if this information is provided by the user’s text prompt at inference time).

In the image below, with the base figure on the left as a guidance template, and with the accompanying text prompt ‘a fashion model in blue posing against a white backdrop’, we can see that both ControlNet’s Openpose (third from left) and the popular similar framework T2I-Adapter (second from left) have erred somewhat in their interpretation of this multimodal prompt:

Converting pose information into a model picture – but only the rightmost result gets it right. Source:

The one model that gets closer to both the pose information and the prompt is a new offering from German researchers called Stable-Pose.

Stable-Pose is a development from principles governing prior works such as ControlNet and T2I-Adapter, with the addition of a bespoke Vision Transformer (ViT) and an augmented methodology that allows the system to achieve a 13% improvement on ControlNet. The researchers have released the code for this new system at a GitHub repository for the project.

Let’s take a look at this new offering, and consider that improved pose>render methods of this nature are likely to prove crucial to the ‘purely generative’ age of AI motion picture and television development, as well as a boon for medical research.

The new paper is titled Stable-Pose: Leveraging Transformers for Pose-Guided Text-to-Image Generation, and comes from five researchers across the Lab for AI in Medical Imaging, CompVis at LMU Munich, and the Munich Center for Machine Learning (MCML).


The trainable ViT module used by Stable-Pose operates on an input ‘skeleton’ image, and the base (V1.5) Stable Diffusion model is frozen throughout training so that no loss in rendering quality need be expected, and so that Stable-Pose can potentially operate as an ancillary framework in a standard installation (though there is currently no clear support for Stable Diffusion GUIs such as ComfyUI or AUTOMATIC1111, and the system operates as a standalone installation).

Conceptual schema for Stable-Pose.

As we can see in the above schema, the diffusion process and the denoising U-Net (both to left of image) remain frozen. In the center-top, we see the extracted pose and the text component (the prompt) being passed through the novel ancillary ViT, where the computation of masked attention takes place, as the system iterates through the observed and inferred armature joints.

Stable-Pose then conditions the latent encoding on the input pose skeleton, while the text encoder maps the user-supplied prompt to an interstitial stage.

A masked self-attention module further iterates through the data in transit, with pose-masked self-attention (PMSA), in a feed-forward network, initially run through a Gaussian filter, in order to generalize the result.

The output from frozen Stable Diffusion passes through the pose-masked self-attention layers in the feed-forward network of the Stable-Pose system.

The pose image and the latent encoding passed through from the frozen Stable Diffusion system are parsed by two blocks in Stable-Pose: a PMSA module and a pose encoder, the latter providing high-level features for the passed-through pose, through self-attention and binary-masked iteration of the same image.

The coarse-to-fine framework gradually improves the quality of the latent encoding, with additional guidance, emphasizing the development of the latent code towards the conditioned pose.

The pose encoder comprises six convolutional layers, with Sigmoid-Weighted Linear Unit (SiLU) activation layers. During this stage the passed-through pose image is downsampled by a factor of 8.
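
As a rough illustration of the two facts stated here, the sketch below defines the SiLU activation (x multiplied by its sigmoid) and traces how a stack of six convolutions could reduce a pose image by a factor of 8. The stride layout, kernel size, padding and the 512-pixel input are assumptions chosen so that the arithmetic works out, since the article does not specify them; this is not the authors' encoder.

```python
import math


def silu(x):
    """Sigmoid-Weighted Linear Unit: x * sigmoid(x), computed stably."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)


def conv_out_size(size, kernel=3, stride=1, padding=1):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


# Hypothetical layout: six layers, three of which halve the resolution.
ASSUMED_STRIDES = [1, 2, 1, 2, 1, 2]


def encoder_out_size(size, strides=ASSUMED_STRIDES):
    """Trace the spatial size through the assumed six-layer stack."""
    for stride in strides:
        size = conv_out_size(size, stride=stride)
    return size


if __name__ == "__main__":
    # A 512-pixel side becomes 64 under these assumptions (512 / 8).
    print(encoder_out_size(512))
```

Three stride-2 layers are the simplest way to reach an overall factor of 8, which would bring a 512×512 skeleton image to the 64×64 resolution of the Stable Diffusion V1.5 latent space.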

The PMSA module hunts out potential relationships within the patches of the latent encoding, orchestrating an overview of the interrelations between the diverse parts of the human body, via self-attention.

With the pose masks generated, these are divided into patches, and the non-pose-related areas are demoted in the attention computation. The authors state:

‘For all other regions not associated with pose, we assign an extremely small integer value. The attention mask helps to enhance the focus of PMSA on the pose-specific regions.’

The authors note that Stable-Pose emphasizes the weight of masked region contents, which comprise pose information, to a level far higher than the default Stable Diffusion process does.
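
The masking idea described above can be sketched in a few lines. In this simplified version, each patch is flagged as pose-related or not; an additive mask then leaves pose patches untouched and pushes all other patches towards negligible attention weight before the softmax. The choice of -1e9 as the ‘extremely small’ value, the masking of keys only, and all function names are assumptions for illustration rather than details from the paper.

```python
import math

# Stand-in for the "extremely small" value assigned to non-pose regions.
NON_POSE_VALUE = -1e9


def build_attention_mask(pose_patches):
    """Build an additive mask from per-patch flags (1 = pose, 0 = not pose).

    Every query row shares the same mask: pose patches add 0, others add
    NON_POSE_VALUE, so they receive almost no weight after the softmax.
    """
    n = len(pose_patches)
    return [
        [0.0 if pose_patches[j] else NON_POSE_VALUE for j in range(n)]
        for _ in range(n)
    ]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def masked_attention_weights(scores, mask):
    """Add the mask to raw attention scores, then normalize each row."""
    return [
        softmax([s + m for s, m in zip(score_row, mask_row)])
        for score_row, mask_row in zip(scores, mask)
    ]


if __name__ == "__main__":
    flags = [1, 0, 1]                    # middle patch is background
    raw = [[0.0] * 3 for _ in range(3)]  # uniform scores for clarity
    for row in masked_attention_weights(raw, build_attention_mask(flags)):
        print(row)
```

With uniform raw scores, all of the attention mass ends up shared between the two pose patches, which is the behaviour the quoted passage describes.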

Data and Tests


To test the system, the researchers pitted Stable-Pose and prior architectures against five high-volume, human-focused datasets: the Chinese 2023 collection Human-Art; the LAION subset LAION-Human (aka, Human-SD, which presumably has some advantage, since it contains data likely already to have been trained into the base Stable Diffusion model); the Canadian/Italian collaboration UBC Fashion; the 2022 Hong Kong-led collaboration Dance Track; and the 2016 ETH/Disney collaboration the DAVIS dataset.

The Stable-Pose model, in line with prior similar works, was trained on the enduringly popular V1.5 release, which, we have to note, continues to dominate the new literature, perhaps since it predates a number of issues around censorship that were to influence the performance and public acceptance of some of the later iterations.

Settings and Prior Frameworks

For training, the researchers used the Adam optimizer at a learning rate of 1×10⁻⁵. For the PMSA ViT module, they used a depth of 2 and a patch size of 2, with the two related sequential Gaussian filters having kernel sizes of 23 and 13, respectively.
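
To give a feel for what the two kernel sizes imply, the following sketch builds normalized one-dimensional Gaussian kernels of size 23 and 13 and applies them to a thin ‘line’ signal. The sigma rule (kernel size divided by six), the zero padding and the one-dimensional simplification are assumptions, since the article gives only the kernel sizes.

```python
import math


def gaussian_kernel_1d(size, sigma):
    """Return a normalized 1D Gaussian kernel with an odd number of taps."""
    if size % 2 == 0:
        raise ValueError("kernel size must be odd")
    half = size // 2
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma)) for i in range(-half, half + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_1d(signal, kernel):
    """Same-length convolution with zero padding at the borders."""
    half = len(kernel) // 2
    blurred = []
    for i in range(len(signal)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(signal):
                acc += signal[j] * weight
        blurred.append(acc)
    return blurred


# Kernel sizes from the article; sigma = size / 6 is an assumed convention.
coarse_kernel = gaussian_kernel_1d(23, 23 / 6.0)
fine_kernel = gaussian_kernel_1d(13, 13 / 6.0)

if __name__ == "__main__":
    # A one-pixel-wide 'skeleton line' in the middle of a 41-pixel row.
    line = [0.0] * 41
    line[20] = 1.0
    print(max(blur_1d(line, coarse_kernel)), max(blur_1d(line, fine_kernel)))
```

The larger kernel spreads the thin line over a wider neighbourhood and so has a lower peak, while the smaller kernel keeps the response tighter around the original skeleton.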

Images were generated by the DDIM sampler from a Stanford 2022 release.

For the Human-Art dataset, all techniques were trained for 10 epochs; on the LAION-Human dataset, rival networks Human-SD, GLIGEN and Uni-ControlNet were trained for 10 epochs. Other rival networks trialed were T2I-Adapter and base Stable Diffusion (SD, V1.5).

Training took place on two NVIDIA A100 GPUs (the 40GB or 80GB VRAM specification was not indicated in the paper), with the native method completing in 145 hours for Human-Art, and 70 hours for the LAION-Human subset. The authors note that this is a substantial decrease in necessary training time, compared to T2I-Adapter, which required around 300 GPU hours to train on a large-scale dataset.


Metrics used for pose accuracy were Mean Average Precision (mAP); Pose Cosine Similarity-based AP (CAP); and People Count Error (PCE), the latter using the pretrained pose estimator HigherHRNet.
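
Two of these metrics rest on simple ideas that can be sketched directly. The code below computes a cosine similarity between two sets of 2D keypoints (after centering each pose on its own mean), and a people count error as the mean absolute difference between detected person counts. Both are simplified stand-ins: the exact formulations, thresholds and keypoint handling used in the paper's evaluation are not given in this article, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def flatten_centered(pose):
    """Center (x, y) keypoints on their mean and flatten to one vector."""
    mean_x = sum(x for x, _ in pose) / len(pose)
    mean_y = sum(y for _, y in pose) / len(pose)
    return [c for x, y in pose for c in (x - mean_x, y - mean_y)]


def pose_cosine_similarity(pose_a, pose_b):
    """Similarity of two poses whose keypoints are listed in the same order."""
    return cosine_similarity(flatten_centered(pose_a), flatten_centered(pose_b))


def people_count_error(reference_counts, generated_counts):
    """Mean absolute difference between per-image person counts."""
    diffs = [abs(r - g) for r, g in zip(reference_counts, generated_counts)]
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    source = [(0.0, 0.0), (0.0, 1.0), (1.0, 2.0)]
    shifted = [(x + 5.0, y - 3.0) for x, y in source]
    print(pose_cosine_similarity(source, shifted))
    print(people_count_error([1, 2, 3], [1, 1, 5]))
```

Centering makes the similarity insensitive to where the figure sits in the frame, so a pose that has merely been shifted still scores as a near-perfect match.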

To assess image quality, metrics used were Fréchet Inception Distance (FID) and Kernel Inception Distance (KID).
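
For reference, FID is conventionally defined as the Fréchet distance between two Gaussians fitted to Inception features of the real and generated images. The symbols below are introduced here for illustration: μ_r and Σ_r are the mean and covariance of the real-image features, and μ_g and Σ_g those of the generated images.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

Lower values indicate that the two feature distributions are closer, which is why small differences in FID between methods are read as comparable image quality.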

To measure text/image alignment, the authors have included the CLIP scores, which reveal the extent to which the CLIP model considers that the text prompt describes the generated image.


For the quantitative, metric-driven tests, Stable-Pose proved superior for pose alignment:

Results for the initial quantitative round.

Regarding these results, the authors comment:

‘[These results] show that Stable-Pose achieved the highest AP (48.87 on Human-Art and 57.41 on LAION-Human) and CAP (71.04 on Human-Art and 68.06 on LAION-Human), surpassing the SOTA methods by more than 10%. This highlights Stable-Pose’s superiority in pose alignment. In terms of image quality and text-image alignment, Stable-Pose achieved comparable results against other methods, with only marginal discrepancy in FID/KID scores, yet the difference is negligible and the resulting quality remains high.

‘Overall, these results underscore Stable-Pose’s exceptional accuracy and robustness in both pose control and visual fidelity.’

In the qualitative round, some clear differences between the systems are evident:

Results for the qualitative round. Please refer to the source paper for better resolution.

Here the researchers state:

‘Consistent with the quantitative results, Stable-Pose demonstrates superior control compared to the other SOTA methods in both pose accuracy and text alignment, even in scenarios involving complex poses (the first row of Figure 4, which is a back view of the figure), and multiple individuals (the third row of [image above]), while the other methods fail to consistently maintain the integrity of the original pose instructions.

‘This is particularly evident in dynamic poses (e.g., yoga poses and athletic activities), where Stable-Pose manages to capture the pose dynamism more faithfully than others.’ *

The authors state that initial experiments revealed that the rival methods did not always fare well when attempting more challenging human poses in less common orientations, such as side or back poses.

Therefore they curated a small set of 2,650 images from the UBC Fashion dataset, featuring exclusively side and back-facing poses, and evaluated the checkpoint for each technique from the LAION-Human dataset to assess quality of pose alignment:

Testing the prior methods for challenging side and back-facing poses.

Here the paper states:

‘Stable-Pose significantly outperforms other methods in recognizing and generating humans in all pose orientations, especially for rarer poses in side and back views, which surpasses the other methods by around 20% in AP. This further validates the robust controllability of Stable-Pose.’

As a further and final test besides ablative studies (not covered here), the researchers investigated outdoor and indoor T2I pose generation, using around 2,000 frames from the DAVIS dataset (outdoor activities) and around 2,000 images from the Dance Track dataset (where most individuals were dancing indoors in complex poses):

Definitive superiority for this test, for Stable-Pose.

Here the results were conclusive:

‘[The] consistently highest AP and CAP scores achieved by Stable-Pose demonstrate its robustness in pose-controlled T2I generation across diverse environments, highlighting its potential as a backbone for pose-guided video generation.’


With an improvement of roughly 10-13% on ControlNet, Stable-Pose represents a worthwhile incremental advance on the state of the art, and its authors should be commended for a timely code release, which is uncommon in the literature.

Weighed against that is the added complexity at inference time, though the authors concede this as an inevitable result of the weight of the self-attention mechanisms that bring the improvements.

* Despite the subsequent section reported above, and regarding the latter statement, we have to observe that the prompt in the sole yoga pose example does not stipulate that the figure should be posed with their back to the camera – though Stable-Pose is the only framework to render it so; and that only HumanSD deviates notably from the source skeletal pose.
