Repairing the Nightmarish Hands Produced by Stable Diffusion



For as long as art has existed, gaining a working knowledge of the human hand has been one of the primary challenges for an artist-in-training.

Middle, interlocked fingers may be the most difficult subject for even an accomplished artist to draw from memory or domain knowledge. Source: https://www.adobe.com/creativecloud/illustration/discover/how-to-draw-hands.html

With 14 cylindrical finger segments, varying upper and lower base topology, thousands of possible pose and lighting permutations, interlocking and other kinds of hand-to-hand relationship, as well as enormous variation across genders and ages, an instinctive understanding of the geometry of hands is easy for people to develop as a perceptual skill – but much harder to translate into an interpretive skill, such as drawing or otherwise visually depicting hands, even in a default or canonical (i.e., neutral or resting) stance.

Michelangelo's 'The Creation of Adam' represents the work of an artist renowned for his skill at depicting this most challenging aspect of human anatomy. Source: https://en.wikipedia.org/wiki/File:Michelangelo_-_Creation_of_Adam_(cropped).jpg

As it transpires, it’s a skill that AI also has extraordinary difficulty learning. The open source Latent Diffusion Model (LDM) Stable Diffusion is legendary for its capacity to produce malformed hands and limbs in general, frequently exceeding or under-cutting the requisite number of digits, and generating Bosch-style figurative nightmares almost as often as a basically acceptable hand representation.

Stable Diffusion frequently fails to produce even remotely acceptable hand renderings. Source: https://old.reddit.com/r/StableDiffusion/comments/z3a4ye/prompt_woman_showing_her_hands_on_stable/

As far as can be understood, Stable Diffusion simply cannot obtain enough data for hands, because a standard training image dataset such as LAION (on which the most popular Stable Diffusion models are based) places no particular emphasis on hands, much in accordance with the general run of photography.

Further, excepting unusual situations where a photo might emphasize the hand (such as the classic Picard ‘facepalm’; the ‘talk to the hand’ memes of the 1990s; or other situations where the hand gravitates to the face, which holds far greater cultural attraction than the hand), the hand itself usually occupies a relatively small area of a photo, and frequently appears in repetitive or banal poses that do little to teach the system about the full range of possible variations in disposition and lighting, among numerous other factors.

Therefore the problem is, at least in part, one of distribution and emphasis, and in theory can only be solved by massively increasing the number and variety of hand photos in a general-purpose dataset; however, this would undermine other important sub-domains and skew the functionality of the model in various undesired ways.

Instead, the last 18 months have brought forth a growing number of post facto remedies from the research sector, as well as the efforts of many private practitioners in developing LoRAs and training checkpoints dedicated to producing better hands, or to fixing the chronic hand/limb issues that beset Stable Diffusion.

The 2023 paper HandRefiner: Refining Malformed Hands in Generated Images by Diffusion-based Conditional Inpainting offered a post-processing solution to malformed hands in Stable Diffusion. Source: https://arxiv.org/pdf/2311.17957.pdf

The latest research project to address this problem comes from China, in a paper published on the 22nd of April. Here, the researchers have approached the problem by devising and exploiting multiple datasets covering both good and bad hand renderings in a variety of possible styles, together with a two-pronged training approach that lavishly exploits some of the most powerful GPUs available. The resulting system, the authors claim, improves notably on the current state of the art across the widest possible range of domains, from the photorealistic through to the stylistic.

The new RHanDS system uses multiple datasets and CGI-based neural interfaces to help resolve hand-rendering issues in Stable Diffusion. Source: https://arxiv.org/pdf/2404.13984.pdf

The system makes extensive use of the SMPL CGI-based neural interface in order to generate example hands, as well as of ControlNet, the ancillary Stable Diffusion guidance system, in addition to a whole host of secondary libraries and contributing systems.

The RHanDS system uses ControlNet and extensive curated datasets to fix poorly-created hand renders.

The paper states:

‘Our [system] ensures the correctness of the structure with a 3D hand model, maintains the correctness of the style with reference hands, and avoids the mutual influence that exists between structure and style guidance. As a result, RHanDS can produce plausible hand [images] with the correct structure while preserving the hand style.’

‘…[The system] is a post-processing method that leverages 3D hand mesh as the pixel-level condition to control hand structure. Compared to existing methods, RHanDS performs specifically on the hand region rather than the entire image, which facilitates more precise refinement.

‘Moreover, RHanDS enhances the perception of hand style, ensuring that the style of the refined hand seamlessly aligns with that of the original image for a coherent and authentic representation.’

The new paper is titled RHanDS: Refining Malformed Hands for Generated Images with Decoupled Structure and Style Guidance, and comes from six researchers across Alibaba Group in Beijing, and Xiamen University. The related code and datasets are currently promised to be made available ‘soon’.

Method and Data

The RHanDS framework is diffusion-based, and consists of four modules: a style encoder, to accommodate the various domains in which hands might need to be created; a structure encoder, to preserve the topology and generality of the hand/s; a Variational Autoencoder (VAE); and a masking component.

Conceptual workflow schema for the RHanDS system.
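To make the division of labor concrete, here is a minimal PyTorch sketch of how those four modules might relate to one another. Everything here is an illustrative stand-in based on the description above: the class names, dimensions, and call signatures are my own assumptions, not the authors' code.

```python
# A minimal, illustrative composition of the four components (not the authors' code).
import torch
import torch.nn as nn

class StyleEncoder(nn.Module):
    """Stand-in for the style encoder: maps a cropped reference hand to a style embedding."""
    def __init__(self, dim=768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=4, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, dim))
    def forward(self, hand_crop):
        return self.net(hand_crop)

class StructureEncoder(nn.Module):
    """Stand-in for the structure encoder: maps a hand-depth image to spatial guidance features."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(1, 4, kernel_size=3, padding=1)
    def forward(self, depth):
        return self.net(depth)

class HandRepairer(nn.Module):
    """Toy wrapper showing how the four modules relate; 'vae' and 'unet' are user-supplied stand-ins."""
    def __init__(self, vae, unet):
        super().__init__()
        self.style_encoder = StyleEncoder()
        self.structure_encoder = StructureEncoder()
        self.vae = vae    # encodes the masked image into latents and decodes the repaired result
        self.unet = unet  # denoising UNet, conditioned on style and structure guidance

    def repair(self, noisy_latent, hand_crop, depth, mask, timestep):
        style = self.style_encoder(hand_crop)         # global style embedding
        structure = self.structure_encoder(depth)     # pixel-aligned structure guidance
        return self.unet(noisy_latent, timestep, style, structure, mask)
```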

Input for the framework, as visualized in the conceptual schema above, is an automatically recognized and cropped-out hand, removed from its context in the original image. The structural reference is a hand-depth image obtained from a hand model that can be created with the OpenPose editor, or else automatically inferred from the misshapen hand, and then encoded by the structure encoder.

The OpenPose editor allows the end-user to manipulate hands and create reference pose images that can be interpreted into photorealistic or stylistic renderings by Stable Diffusion. Source: https://github.com/ZhUyU1997/open-pose-editor

The style reference is drawn from the style of the cropped hand image itself, while the mask is algorithmically determined by a hand recognition module.

To decouple the structure and style guidance during the development of the system (i.e., not at inference time), training is broken into two stages: in the first, a UNet and style encoder are trained on multi-style paired hand images; in the second, the structure encoder is trained on an analogous multi-style hand-mesh dataset (i.e., CGI-style imagery). A wide range of potential styles is catered for in these stages:

The various domains featured across the contributing curated datasets. Of the data contributing to the training of RHanDS, some of the material was hand-generated or automatically generated by the authors, while other examples were taken from existing datasets.

The mask-guided hand repair process is facilitated by a UNet inside the Stable Diffusion framework that contains five additional channels: four for the encoded masked image, and one for the actual mask itself. During the denoising process, the mask image remains unchanged while the decoder reconstitutes the image from the amended latent code.
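As a rough illustration of the channel arithmetic described above (my own sketch, not the authors' code): the standard Stable Diffusion latent has four channels, the VAE-encoded masked image contributes four more, and the binary mask contributes one, giving a nine-channel UNet input.

```python
import torch

# Toy tensors standing in for real Stable Diffusion latents (4 channels at 1/8 resolution).
noisy_latent  = torch.randn(1, 4, 64, 64)   # the latent being denoised
masked_latent = torch.randn(1, 4, 64, 64)   # VAE encoding of the image with the hand region blanked out
mask          = torch.ones(1, 1, 64, 64)    # 1 inside the hand region to be repaired, 0 elsewhere

# The inpainting UNet sees all three concatenated along the channel axis: 4 + 4 + 1 = 9 channels.
unet_input = torch.cat([noisy_latent, masked_latent, mask], dim=1)
print(unet_input.shape)  # torch.Size([1, 9, 64, 64])
```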

The authors note that prior solutions to hand issues have a limited ability to cope with stylized types of generation, and generalize poorly to styles such as anime and certain painting genres:

From the paper, a comparison of the prior HandRefiner framework with results obtained by the new method.

This is solved in RHanDS by encoding style directly into the UNet, so that a style embedding can be exploited at inference time.

First stage training for the RHanDS system, where the style encoder and UNet are jointly trained on variegated styles of hand images.

A diverse variety of image styles are used for this joint training (see below), so that the system does not fixate solely on photorealistic output.

The authors observe that their use of a CLIP image encoder to extract a style from the input image operates in a similar way to the IP-Adapter Stable Diffusion ancillary system. Essentially, the output of the CLIP image encoder is passed through a linear projection network (i.e., a learned layer that maps the extracted image features into the conditioning space expected by the UNet), and the resulting style embedding is passed on directly, with CLIP’s text facets stripped out.
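Below is a minimal sketch of that kind of image-to-embedding projection, using the Hugging Face transformers CLIP vision model. The projection head's dimensions and token count are assumptions on my part, and the checkpoint shown (the CLIP ViT-L/14 familiar from Stable Diffusion 1.x) is simply illustrative.

```python
import torch
import torch.nn as nn
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection
from PIL import Image

# Hypothetical projection head: maps CLIP image features into the cross-attention
# space that the UNet normally receives from the text encoder (768-d for SD 1.x).
class StyleProjection(nn.Module):
    def __init__(self, clip_dim=768, cross_attn_dim=768, num_tokens=4):
        super().__init__()
        self.num_tokens = num_tokens
        self.proj = nn.Linear(clip_dim, cross_attn_dim * num_tokens)

    def forward(self, image_embeds):
        tokens = self.proj(image_embeds)            # (B, num_tokens * cross_attn_dim)
        return tokens.view(-1, self.num_tokens, tokens.shape[-1] // self.num_tokens)

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
clip_vision = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-large-patch14")

reference_hand = Image.new("RGB", (224, 224))       # stand-in for the cropped style-reference hand
pixels = processor(images=reference_hand, return_tensors="pt").pixel_values
with torch.no_grad():
    image_embeds = clip_vision(pixels).image_embeds  # (1, 768) global image features

style_tokens = StyleProjection()(image_embeds)       # (1, 4, 768) pseudo-'text' tokens
# These style tokens would stand in for the usual CLIP text embeddings as the UNet's
# cross-attention conditioning -- the IP-Adapter-like mechanism the article describes.
```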

The second of the two phases, after Style Guidance, is Structure Guidance. Here the MANO CGI-based framework, a collaboration between Body Labs Inc. and the Max Planck Institute, is used to represent an anatomically correct form and pose for hands.

Pose blendshapes in the MANO system. Source: https://dl.acm.org/doi/pdf/10.1145/3130800.3130883

The authors observe that the aforementioned OpenPose editor can also optionally be used for this purpose.

RHanDS follows the methodology of the prior framework HandRefiner (also the primary point of comparison in the paper’s experiments section, see below), in rendering a hand depth image from a hand mesh (i.e., an inferred CGI-style mesh) as a structure reference.

Examples from the original HandRefiner paper, with indications of grayscale depth images on the left. Source: https://arxiv.org/pdf/2311.17957.pdf

For the training of the second stage, the RHanDS structure encoder receives variegated styles and hand-depth images in tandem, with the UNet frozen so that the existing capability for style guidance is not affected by this work on structure.

Schema for the second phase of training, concentrating on structure, with the existing style domain knowledge frozen and preserved as the structure-based weights develop.
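In code, that freezing pattern is straightforward. The toy loop below (with placeholder modules and random tensors standing in for the real UNet, structure encoder, and data) shows the shape of the second stage: the UNet's weights stay fixed while gradients flow back to the structure encoder alone.

```python
import torch
import torch.nn as nn

# Minimal stand-ins for the frozen UNet and the trainable structure encoder.
unet = nn.Conv2d(4, 4, kernel_size=3, padding=1)               # placeholder for the stage-one UNet
structure_encoder = nn.Conv2d(1, 4, kernel_size=3, padding=1)  # placeholder for the structure encoder

# Stage two, as described in the article: freeze the UNet so that the style guidance
# learned in stage one is untouched, and train only the structure encoder.
unet.requires_grad_(False)
unet.eval()

optimizer = torch.optim.AdamW(structure_encoder.parameters(), lr=2e-5)  # stage-two learning rate

for _ in range(3):                                   # a few toy iterations
    depth = torch.randn(2, 1, 64, 64)                # stand-in hand-depth images
    target = torch.randn(2, 4, 64, 64)               # stand-in denoising target
    pred = unet(structure_encoder(depth))            # structure features steer the frozen UNet
    loss = nn.functional.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()                                  # gradients reach only the structure encoder
    optimizer.step()
```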

Datasets

Underpinning these phases are the diverse datasets curated and partially created by the authors. The three resulting datasets are Multi-Style Hand, Multi-Style Hand Mesh, and Multi-Style Malformed Hand Mesh.

Example images from the three datasets created for RHanDS.

The Multi-Style Paired Hand Dataset consists of 517,096 hand pairs generated by the SMPL-H CGI mesh system (introduced in the Embodied Hands paper), with body poses orchestrated by VPoser, the body pose prior developed for the SMPL-X system.

To extract the desired portions of the images, MMPose is used for the detection of general human pose, and YOLOv8 for the detection of human hands within the resulting synthesized images. All the hands in this dataset are from the same person, the authors state.

Hand gesture recognition under YOLOv8. Source: https://pyimagesearch.com/2023/05/15/hand-gesture-recognition-with-yolov8-on-oak-d-in-near-real-time/
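For illustration, the snippet below sketches how a YOLOv8 detector might be used with the ultralytics package to locate and crop hand regions. The weights file and image name are placeholders, since the paper does not specify which checkpoint was used, and the stock COCO-trained weights have no 'hand' class.

```python
from ultralytics import YOLO
from PIL import Image

# 'hand_yolov8.pt' is a placeholder: a YOLOv8 checkpoint fine-tuned for hand detection
# would be needed here, as the default COCO classes do not include hands.
detector = YOLO("hand_yolov8.pt")

image = Image.open("generated_person.png")     # hypothetical synthesized image
results = detector(image)

crops = []
for box in results[0].boxes.xyxy.tolist():      # each box is [x1, y1, x2, y2]
    x1, y1, x2, y2 = map(int, box)
    crops.append(image.crop((x1, y1, x2, y2)))  # cropped hand regions for later refinement
```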

For the multi-style hand mesh dataset, seven different style categories were created, and 8,000 hand-mesh couplets generated for each chosen style.

The paper states*:

‘[We] crop out the hand region from the full human mesh and use a camera with a field-of-view of 45 degrees to render the hand depth images with and without an arm, respectively. Next, We extract canny images from the hand depth images with an arm and feed them into pre-trained canny and depth [ControlNet] to guide the structure generation.

‘To synthesize hand images of various styles, we manually customize the text prompt for each style to guide the generation process to match each target style. We finally synthesized 10K images for a totaling 7 styles (natural human, oil painting, watercolor, cartoon, sculpture, digital art, garage kit) and filtered out 20% images for each style using the [Mean Per Joint Position Error (MPJPE)] metric…’

Diverse styles from the initial multi-style hand mesh dataset.
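The dual-conditioned synthesis described in the quote above can be approximated with off-the-shelf components. The sketch below uses the diffusers library with publicly available canny and depth ControlNets for Stable Diffusion 1.5; the exact checkpoints and prompts used by the authors are not specified, so the identifiers and prompt here are illustrative.

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from PIL import Image

# Illustrative only: public SD 1.5 ControlNet checkpoints for canny and depth conditioning.
canny_cn = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
depth_cn = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16)

pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=[canny_cn, depth_cn],              # both conditions applied together
    torch_dtype=torch.float16,
).to("cuda")

canny_image = Image.open("hand_canny.png")        # canny edges extracted from the depth render with arm
depth_image = Image.open("hand_depth.png")        # rendered hand depth image

# A per-style prompt, echoing the paper's manually customised prompts for each target style.
result = pipe(
    prompt="an oil painting of a human hand",
    image=[canny_image, depth_image],
    num_inference_steps=30,
).images[0]
result.save("styled_hand.png")
```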

For the multi-style malformed hand-mesh dataset, the authors curated images from the Human-Art dataset, and used SDEdit and Stable Diffusion to regenerate the cropped hand regions.

Examples from the Human-Art dataset, used to populate a sub-set for RHanDS. Source: https://idea-research.github.io/HumanArt/
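SDEdit amounts, in essence, to partially noising an existing image and then re-denoising it, which is what the diffusers image-to-image pipeline does. The hedged sketch below shows how a cropped hand region might be regenerated in that manner; the strength value, prompt, and filenames are my own guesses rather than details from the paper.

```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

cropped_hand = Image.open("human_art_hand_crop.png")   # hypothetical crop from a Human-Art image

# SDEdit-style regeneration: partially noise the crop and re-denoise it, so the result keeps
# the rough layout of the original hand but typically acquires new structural errors --
# which is what a 'malformed hand' training example needs. The strength value is a guess.
regenerated = pipe(
    prompt="a hand",
    image=cropped_hand,
    strength=0.6,
    num_inference_steps=30,
).images[0]
regenerated.save("malformed_hand.png")
```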

The style reference, in this case, is obtained from features derived from the generated images, while the structure reference is obtained with the use of the MobRecon hand-mesh reconstruction system.

The MobRecon hand-mesh reconstruction system in action. Source: https://github.com/SeanChenxy/HandMesh

After removing reconstruction failures (though it is not specified whether or not this was done manually), the dataset was left with 1,440 pairs, including natural styles and thirteen categories of artificial styles.

Tests

Training for the two stages of the system was resource-intensive: for the multi-style paired dataset, the UNet was trained for 15,000 iterations at a learning rate of 1e-5 (a notably low, fine-grained rate). The batch size was a formidably high 256, spread across 8 NVIDIA A100 GPUs (though the paper does not specify whether each GPU was a 40GB or 80GB model).

Hand images were resized to 512x512px, and the style images to 224x224px, with standard data augmentation of horizontal and vertical flipping.

For the training of the second stage, the multi-style hand-mesh dataset was combined with material from the Static Gestures dataset for 15,000 iterations, this time at a learning rate of 2e-5, but once again with a 256 batch size across the same configuration of A100 GPUs.
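For convenience, the reported hyperparameters can be summarized as below. The per-GPU batch figure is simple arithmetic on the stated global batch size rather than something the paper specifies, and assumes no gradient accumulation.

```python
# A compact summary of the training hyperparameters reported in the paper.
NUM_GPUS = 8                       # NVIDIA A100s (40GB or 80GB unspecified)
GLOBAL_BATCH = 256

stage_one = {
    "dataset": "multi-style paired hand",
    "iterations": 15_000,
    "learning_rate": 1e-5,
    "hand_resolution": (512, 512),
    "style_resolution": (224, 224),
    "augmentation": ["horizontal_flip", "vertical_flip"],
}

stage_two = {
    "dataset": "multi-style hand mesh + Static Gestures",
    "iterations": 15_000,
    "learning_rate": 2e-5,
}

per_gpu_batch = GLOBAL_BATCH // NUM_GPUS   # 32 samples per GPU, assuming no gradient accumulation
print(per_gpu_batch)
```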

Samples from the Static Gestures dataset, used in the second phase of training. Source: https://synthesis.ai/static-gestures-dataset/

Evaluation metrics, besides the previously-quoted MPJPE, consisted of Fréchet Inception Distance (FID) and style loss.
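MPJPE is simply the mean Euclidean distance between corresponding predicted and ground-truth joints. A minimal implementation, assuming the standard 21-keypoint hand skeleton, might look like this:

```python
import numpy as np

def mpjpe(predicted_joints: np.ndarray, target_joints: np.ndarray) -> float:
    """Mean Per Joint Position Error: average Euclidean distance between
    corresponding joints, e.g. over the 21 keypoints of a hand skeleton."""
    assert predicted_joints.shape == target_joints.shape
    return float(np.linalg.norm(predicted_joints - target_joints, axis=-1).mean())

# Toy example: 21 hand joints in 3D, with the prediction offset by 5mm on one axis.
gt = np.zeros((21, 3))
pred = gt + np.array([0.005, 0.0, 0.0])
print(mpjpe(pred, gt))   # 0.005
```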

RHanDS was evaluated across two datasets proposed by the prior HandRefiner project: Text2Image, which contains 12,000 images generated with text descriptions from the HAnd Gesture Recognition Image Dataset (HAGRID); and Image2Image, which includes 2,000 images from HAGRID.

Examples from HAGRID. Source: https://arxiv.org/pdf/2206.08219

HandRefiner was used to refine hands on the RHanDS multi-style malformed hand dataset. Repeating the earlier image, we can see the difference in performance between HandRefiner and RHanDS:

From the paper, a comparison of the prior HandRefiner framework with results obtained by the new method.

The paper states:

‘[The] hands refined by HandRefiner are almost fixed in style, which is different from the original hand style. The quantitative experiments in [the table below] show that our RHanDS outperforms HandRefiner in all metrics in terms of the style and structure of refined hands.’

Quantitative metric results comparing HandRefiner to RHanDS.

The authors also conducted qualitative tests to discern the extent to which RHanDS can preserve style and structure in tandem:

Hands generated under differing style and structure references.

Here the authors state:

‘During inference, we mask the entire image to achieve the more intuitive result in [the image above]. Since the style and structure guidance are decoupled during training, we can generate hands with specified structures under any style reference without worrying about structure leakage.’

A user study was additionally conducted for subjective comparison of results across the two systems, with 173 image pairs taken from the multi-style malformed hand dataset, across diverse categories visualized in the graph below. Users were asked to choose a preferred hand image, featuring better structure and style consistency, across results from the competing frameworks.

Overwhelming favor for RHanDS in the user study.

Here the researchers comment:

‘Only for one extreme style of shadow play, both [HandRefiner] and RHanDS failed to refine malformed hands with appropriate style and structure, and most users select “none” (1, 0, 9). Despite that, users prefer our RHanDS over most other styles.

‘These results validate the conclusion that our method can refine malformed hands with more style consistency and structure quality.’

Conclusion

Though the results from RHanDS stand out in the recent crop of hand-improvement systems, the architecture is complex and tortuous, and the training resources required are considerable. We can only hope that further investigation of the internal semantic challenge that hands present to LDMs will eventually unearth a more elegant and refined solution, perhaps even at training time of the base models involved.

* My substitution of hyperlinks for the authors’ inline citations, and additional hyperlinks as necessary (this paper defers a good deal of material to its closing supplementary section, and is therefore not the most linear read).
