Reshaping the Human Body With CGI and Stable Diffusion

Examples of DiffBody in action

About the author

Martin Anderson

I'm Martin Anderson, a writer occupied exclusively with machine learning, artificial intelligence, big data, and closely-related topics, with an emphasis on image synthesis, computer vision, and NLP.

As we recently indicated in our examination of the difference between CGI and AI, one of the current aims of human neural synthesis research is to use older CGI techniques to rein in and govern neural rendering approaches – modern, AI-based methods which are capable of far more realism than the decades-old techniques used by Hollywood, but which are also far more difficult to control.

One strong strand of research over the last few years has been to investigate the possibility of mapping real images or videos of people into special CGI polygon meshes, changing the appearance or pose of the ‘CGI-ed’ person, and then passing the result back to neural rendering processes, which can make a far more convincing representation of the altered person than could be achieved by old-school texture-mapping and CGI-based refinement.

Body estimation techniques seek to capture 3D data from single images in order to create traditional meshes which can be deformed and/or textured. Source: https://arxiv.org/pdf/2304.07389.pdf

In effect, the goal here is to apply systematic transformations to human images, so that one could change their pose or appearance, using venerable CGI techniques as an interstitial stage between the flat photo and the neurally-altered image.

This particular line of inquiry is of notable interest to fashion-based AI, which is currently concerned with developing ‘virtual try-on’ techniques that can accommodate a variety of body sizes and types, without the need to seek out particular types of real-world people to actually model the clothing. In this sense, the fashion synthesis research sector has been one of the earliest proponents of full-body deepfakes.

Click to play. The DiffSynth framework can impose fashion choices on target models. Source: https://anonymous456852.github.io/

Though changing the apparent body shape in images and video is of pivotal interest to fashion synthesis systems that wish to tailor try-ons to the diverse shapes and sizes of potential customers, it is also an active line of inquiry for social media applications that could allow uploaders to look more ‘socially conforming’ in their photos and videos than they actually are in real life:

One body-changing app allows uploaders to 'refine' their appearance. Source: https://www.youtube.com/watch?v=-DUY4Jk5aAo

Generating 3D meshes from single images has proved to be an ill-posed problem in image synthesis, since the reconstruction typically has only a single image from which to create a manipulable model. Occlusions and ‘unknown areas’ are therefore inevitable. One clear and quite famous example of this is the difficulty of guessing what a person looks like from the side, based only on front-facing images.

The advent of Latent Diffusion Models (LDMs), and particularly of Stable Diffusion, has breathed new life into this difficult task, since this strand of generative AI, trained on multiple millions of diverse images, is more-than-averagely capable of at least attempting to ‘fill in the gaps’ of unseen viewpoints when estimating a 3D synthesis, even from as little as a single image.

One recent project from Japan uses a variety of techniques, including the SMPL-X CGI body mesh, DreamBooth, and Stable Diffusion, to offer a novel method of achieving such transformations, claiming a general improvement on the current state of the art.

The new DiffBody technique, from researchers in Japan, can not only create novel poses from a single source photo, but change body type, among other potential transformations. Source: https://arxiv.org/pdf/2401.02804.pdf

Titled DiffBody, the new approach solves several of the bottlenecks and bugbears of recent attempts of similar scope, using various types of iterative refinement to improve the quality of the estimated geometry and texture.

Tested across a variety of datasets, DiffBody outperforms a range of comparable SOTA projects, the researchers claim.

Examples of transformations of body type created by the DiffBody method, featuring the CGI mesh elements in grayscale.

The new paper is titled DiffBody: Diffusion-based Pose and Shape Editing of Human Images, and comes from three researchers at the University of Tsukuba.

Method

The new work relies on the SMPL-X parametric capture model, which is capable of estimating a traditional CGI-based polygon mesh from a single image.

A showcase example of the SMPL-X body estimation system. Source: https://smpl-x.is.tue.mpg.de/

A reference texture for the source image is then created by projection mapping, and body shape parameters such as weight and height are imposed, together with keypoints for the armature (see sample image above).
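
As a rough illustration of how body shape is manipulated at this interstitial CGI stage, the sketch below uses the official smplx Python package to build an SMPL-X body and perturb its shape (‘betas’) parameters. This is a minimal, generic example rather than the authors’ pipeline; it assumes the SMPL-X model files have already been downloaded to a local folder, and the specific beta values are illustrative only.

```python
import torch
import smplx  # official SMPL-X package: pip install smplx

# Folder containing the downloaded SMPL-X model files (obtained separately,
# under license, from https://smpl-x.is.tue.mpg.de/) -- placeholder path
MODEL_PATH = "./models"

# Build a neutral SMPL-X body model with ten shape components
body_model = smplx.create(MODEL_PATH, model_type="smplx",
                          gender="neutral", num_betas=10)

# 'betas' are the low-dimensional shape parameters; moving individual
# components away from zero alters the overall build. The values below are
# purely illustrative -- the mapping of betas to height/weight is learned
# from body-scan data, not hand-labelled.
slim_betas = torch.zeros([1, 10])
heavy_betas = torch.zeros([1, 10])
heavy_betas[0, 1] = -2.0

# A forward pass produces the shaped mesh vertices plus 3D joints
slim = body_model(betas=slim_betas, return_verts=True)
heavy = body_model(betas=heavy_betas, return_verts=True)

print(slim.vertices.shape)   # torch.Size([1, 10475, 3]) for SMPL-X
```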

Naturally, as noted earlier, there are going to be huge gaps in a projected texture. If you project an image of your face onto a showroom dummy, for instance, the texture will streak at the sides and be completely absent at the back of the head.

Thus DiffBody uses Stable Diffusion to infer what the texture ought to be in these occluded areas, by creating a DreamBooth model for which the only data consists of the sole reference image.

Typically, DreamBooth models are trained on multiple images depicting varying viewpoints; however, even with a sole reference image, as in this case, the underlying Stable Diffusion base model contains so much prior information about human anatomy that the resulting DreamBooth model can effectively ‘look round the corners’.
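
For readers unfamiliar with the mechanics, the core of a DreamBooth-style fine-tune on a single image can be sketched with the Hugging Face diffusers library, as below. This is a deliberately minimal illustration under assumed settings (no prior-preservation loss, a hypothetical 'reference.jpg' input, a rare-token prompt), not the authors’ training code.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from torchvision import transforms
from transformers import CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler

model_id = "CompVis/stable-diffusion-v1-4"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder").to(device)
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae").to(device)
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet").to(device)
scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")

vae.requires_grad_(False)           # only the UNet is fine-tuned in this sketch
text_encoder.requires_grad_(False)
unet.train()

# The single reference image, paired with a rare-token prompt ('sks')
image = Image.open("reference.jpg").convert("RGB")        # hypothetical filename
pixels = transforms.Compose([
    transforms.Resize((512, 512)),
    transforms.ToTensor(),
    transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5]),
])(image).unsqueeze(0).to(device)

prompt_ids = tokenizer("a photo of sks person", padding="max_length",
                       max_length=tokenizer.model_max_length,
                       truncation=True, return_tensors="pt").input_ids.to(device)

with torch.no_grad():
    latents = vae.encode(pixels).latent_dist.sample() * vae.config.scaling_factor
    cond = text_encoder(prompt_ids)[0]                    # CLIP text embeddings

optimizer = torch.optim.AdamW(unet.parameters(), lr=2e-6)

for step in range(400):                                   # a few hundred steps is typical
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps, (1,), device=device).long()
    noisy_latents = scheduler.add_noise(latents, noise, t)
    noise_pred = unet(noisy_latents, t, encoder_hidden_states=cond).sample
    loss = F.mse_loss(noise_pred, noise)                  # standard epsilon-prediction loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```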

The LDM (i.e., Stable Diffusion) model is also conditioned via the T2I-Adapter module – a ControlNet-style adapter framework designed to help Stable Diffusion perform image-to-image operations more accurately (for instance, painting crude glasses onto an image of a person and then passing the image through Stable Diffusion with the text-prompt ‘wearing sunglasses’).

An example of the extraordinary transformative powers of the T2I-Adapter module. Source: https://arxiv.org/pdf/2302.08453.pdf
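
In practice, this kind of adapter conditioning can be exercised through the diffusers library; the sketch below assumes a publicly-released keypoint adapter checkpoint and a hypothetical pose image, and is intended only as a generic illustration of the technique, not DiffBody’s own conditioning code.

```python
import torch
from diffusers import StableDiffusionAdapterPipeline, T2IAdapter
from diffusers.utils import load_image

# Keypoint-conditioned adapter for Stable Diffusion 1.4 (assumed checkpoint name)
adapter = T2IAdapter.from_pretrained("TencentARC/t2iadapter_openpose_sd14v1",
                                     torch_dtype=torch.float16)

pipe = StableDiffusionAdapterPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", adapter=adapter, torch_dtype=torch.float16
).to("cuda")

# An OpenPose-style skeleton image supplies the pose; the prompt supplies appearance
pose_image = load_image("pose_keypoints.png")             # hypothetical conditioning image
result = pipe("a photo of sks person, full body",
              image=pose_image, num_inference_steps=30).images[0]
result.save("posed_person.png")
```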

At this point, images can be generated that go beyond the source image in terms of angles and poses. Inspired by the methodology of the Stochastic Differential Editing (SDEdit) framework, image-to-image translation is then used to modify fine details of the initial coarse images produced earlier in the workflow.
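
SDEdit-style refinement amounts to partially noising an existing image and then denoising it again under guidance, which corresponds closely to image-to-image generation with a low ‘strength’ value in diffusers. Below is a minimal, hedged sketch of that idea (the input filename is hypothetical), not the paper’s own refinement code.

```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from diffusers.utils import load_image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

coarse = load_image("coarse_render.png").resize((512, 512))   # hypothetical coarse image

# 'strength' controls how much noise is added before re-denoising: low values
# preserve the overall layout while letting the model repair fine detail such
# as texture seams and projection artefacts
refined = pipe(prompt="a photo of sks person", image=coarse,
               strength=0.3, guidance_scale=7.5).images[0]
refined.save("refined_render.png")
```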

Full Body Refinement

Next, two separate refinement modules are applied to the workflow: one to address shortcomings in the rendering of the entire body, and another to improve the depiction of the face and help retain the source identity.

Workflow for the refinement module.

The authors state*:

‘The refinement module takes as input an image to be modified, a refinement mask for invisible areas, a conditioning vector extracted from a prompt and keypoints. The input image is converted into a latent feature map using the VAE encoder of the LDM. Inspired by Blended Diffusion, we perform denoising in the refinement mask to modify only invisible areas.’

Unusually weak noise is used in the reverse diffusion process (the heart of LDM generation, depicted in the image above), because excessive noise would tend to destroy essential detail.

The process therefore continues into iterative refinement across multiple reverse processes, again using weak noise. The loss function used here is the Adaptive Wing (AW) loss. OpenPose-style keypoint conditioning is supplied by the aforementioned T2I-Adapter, while CLIP similarity between the output and reference images is used to judge when adequate accuracy has been achieved.
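
Conceptually, this masked, weak-noise refinement can be sketched as below, loosely following the latent blending recipe popularized by Blended Diffusion rather than the authors’ own implementation: only the tail end of the diffusion schedule is run, and at every step the region outside the refinement mask is reset to a re-noised copy of the original latents, so that only the invisible areas actually change. The `denoise_fn` callable is a hypothetical stand-in for a conditioned UNet.

```python
import torch
from diffusers import DDIMScheduler

def masked_weak_refinement(latents, mask, denoise_fn, strength=0.25, num_steps=50):
    """Refine only the masked (invisible) regions of a latent image.

    latents:    source latents, e.g. shape (1, 4, 64, 64) for a 512px image
    mask:       1.0 where refinement is allowed, 0.0 where the source is kept
    denoise_fn: hypothetical callable (x, t) -> predicted noise (a conditioned UNet)
    strength:   fraction of the schedule to run; kept low, i.e. 'weak noise'
    """
    scheduler = DDIMScheduler(num_train_timesteps=1000)
    scheduler.set_timesteps(num_steps)
    timesteps = scheduler.timesteps[int(num_steps * (1 - strength)):]  # weak tail only

    noise = torch.randn_like(latents)
    x = scheduler.add_noise(latents, noise, timesteps[:1])   # lightly-noised start point

    for t in timesteps:
        noise_pred = denoise_fn(x, t)
        x = scheduler.step(noise_pred, t, x).prev_sample
        # re-noise the source to the current level and lock the visible region
        src = scheduler.add_noise(latents, noise, t.reshape(1))
        x = mask * x + (1.0 - mask) * src

    # final composite keeps the visible areas exactly as they were
    return mask * x + (1.0 - mask) * latents
```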

Iterative refinement schema for the full-body workflow.

Additionally, the text embeddings (i.e., the semantic relationship between visual material and words) are optimized with each iteration, with the input image reinitialized at each stage to avoid the solution converging to an unnatural state (i.e., the iteration is forced to reconsider the source image at each stage, rather than some ‘evolved’ version of that image).
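
The per-iteration text-embedding optimization can be pictured as treating the prompt’s embedding as a learnable tensor, in the spirit of textual inversion. The sketch below is a hedged, generic illustration in which `compute_refinement_loss` is a hypothetical placeholder for whatever combination of keypoint and CLIP losses is in play:

```python
import torch

def optimize_text_embedding(text_embedding, compute_refinement_loss, lr=1e-3, steps=20):
    """Nudge the conditioning embedding so that the refined output better
    matches the keypoint and CLIP targets. `compute_refinement_loss` is a
    hypothetical callable that runs one refinement pass with the given
    embedding and returns a scalar loss."""
    embedding = text_embedding.detach().clone().requires_grad_(True)
    optimizer = torch.optim.Adam([embedding], lr=lr)

    for _ in range(steps):
        loss = compute_refinement_loss(embedding)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return embedding.detach()
```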

Face Refinement

To refine the face, a facial region is automatically cropped from the image obtained in the first step of the workflow. Thereafter the face goes through a similar procedure to the body (see ‘Full Body Refinement’, above), except that an MSE keypoint loss is also used, in which facial landmarks are considered during the training process. Though the ‘entire body’ refinement phase also features a face, this level of granular attention is reserved for the face-only refinement phase.

Identity loss is also evaluated during this section of the pipeline, by the MagFace framework, to ensure that the original identity of the source image is not degraded in the refinement process.

Schema for face refinement. The reference to 'SKS' simply denotes the trigger word for the subject that was trained into the DreamBooth (i.e., the word you need to use to trigger the subject you trained, so that they appear in the generated images). 'SKS' is used because it is a unique keyword, and has no corresponding entries in the Stable Diffusion V1.4/5 models; therefore the system will not become confused in regard to what it is meant to depict when it is fed this trigger.

RetinaFace and CLIP are also used to keep tabs on accuracy during iterations. Finally, the end result is a merge between the source image and the output of steps 1 and 2, composited together using Poisson blending.
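
Poisson blending of this kind is available off-the-shelf in OpenCV; the snippet below is a generic illustration with hypothetical filenames and coordinates, not the authors’ compositing code, showing how a refined face crop might be merged seamlessly back into the full-body output.

```python
import cv2
import numpy as np

# Hypothetical inputs: the refined face crop and the full-body render
face_patch = cv2.imread("refined_face.png")
body_image = cv2.imread("refined_body.png")

# A white mask marks which pixels of the face patch should be blended in
mask = 255 * np.ones(face_patch.shape[:2], dtype=np.uint8)

# Centre of the face region within the body image (placeholder coordinates)
center = (256, 128)

# Poisson (gradient-domain) blending avoids visible seams at the crop boundary
composite = cv2.seamlessClone(face_patch, body_image, mask, center, cv2.NORMAL_CLONE)
cv2.imwrite("final_composite.png", composite)
```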

Data and Tests

For testing of DiffBody, the researchers used a wide variety of datasets, including DeepFashion, MonoPerfCap, Everybody Dance Now (EDN), SMPL-X’s EHF, and YouTube 18 Dancers.

A demonstration of MonoPerfCap, the dataset for which was used in tests for DiffBody. Source: https://www.youtube.com/watch?v=Zg0Zaiarlpk

A total of 51 reference images and 963 target images were obtained from these datasets.

Evaluation metrics used were Structural Similarity Index (SSIM), Peak Signal-to-Noise Ratio (PSNR), Fréchet Inception Distance (FID, which has been challenged lately), and the aforementioned AW and ID loss methods.

For ID loss, the cosine similarity between facial features was estimated with MagFace.
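
For readers unfamiliar with these measures, SSIM and PSNR are available in scikit-image, while the identity check reduces to a cosine similarity between embedding vectors. The sketch below is generic, with `embed_face` standing in as a hypothetical wrapper for a MagFace-style feature extractor.

```python
import numpy as np
import torch.nn.functional as F
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def image_metrics(reference: np.ndarray, generated: np.ndarray) -> dict:
    """Per-image reconstruction metrics; inputs are HxWx3 uint8 arrays."""
    return {
        "ssim": structural_similarity(reference, generated, channel_axis=2),
        "psnr": peak_signal_noise_ratio(reference, generated),
    }

def identity_similarity(embed_face, ref_image, gen_image) -> float:
    """Cosine similarity between face embeddings; `embed_face` is a
    hypothetical callable wrapping a MagFace-style feature extractor."""
    a, b = embed_face(ref_image), embed_face(gen_image)
    return F.cosine_similarity(a.flatten(), b.flatten(), dim=0).item()
```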

Rival frameworks tested were Liquid Warping GAN with Attention (LWG), Pose-Guided Human Animation (PGHA), Neural Texture Extraction and Distribution (NTED), Person Image Synthesis via Denoising Diffusion Model (PIDM), T2I Adapter, and Diffusion Inpainting of Neural Textures (DINAR).

DiffBody was implemented in Python and PyTorch on an NVIDIA RTX A6000 GPU with 48GB of VRAM, with training images resized to 512x512px.

A pretrained Stable Diffusion V1.4 base model was fine-tuned on the sole reference image via DreamBooth, using the AdamW optimizer. Text embedding optimization, by contrast, used the base Adam optimizer, and PyMaF was used for the estimation of SMPL-X parameters.

For the quantitative pose-editing comparison, the rival frameworks were either fine-tuned on Stable Diffusion via the T2I-Adapter, run with their official models where available, or else their official models were trained on the DeepFashion dataset.

Quantitative results for the Everybody Dance Now, EHF, MonoPerfCap, YouTube 18 Dancers (Y18D), and DeepFashion datasets, including average scores across all datasets. Please refer to the cited source paper for better resolution.

Of these results, the authors state:

‘In the results on DeepFashion, although NTED performs the best, our method also outperforms the methods not trained on the DeepFashion dataset.

‘Furthermore, our method shows the best average scores across all datasets. These findings suggest that our method works across multiple datasets.’

Below are results for the qualitative tests (please see source paper for better resolution).

Qualitative comparisons of DiffBody ('ours') against rival frameworks. Please refer to the cited source paper for better detail and resolution.

Regarding these results, the paper notes that the warping-based methods LWG, PGHA, and DINAR produce distorted and stretched textures in the ‘unknown’ or invisible regions of the reference images. The authors also note that these approaches tend to weld the hand texture into the torso area.

Though the image-to-image results for the NTED and PIDM methods are good, they do not adequately preserve clothing textures and facial identity on alternate datasets, the authors observe:

‘The text-to-image approach, T2IA [24], also suffers from this problem and even generates different backgrounds from the reference images. Our method, on the other hand, consistently produces satisfactory results on all of the datasets.

‘Our method successfully achieves a wide range of pose editing for a variety of person images, which is difficult to achieve with the existing methods.’

In the case of body reshaping, the task is so new that no reference datasets are currently available. The researchers therefore adopted the qualitative testing method devised for the 2022 paper Structure-Aware Flow Generation for Human Body Reshaping.

From the 2022 paper 'Structure-Aware Flow Generation for Human Body Reshaping', testing criteria for body reshaping are applied, judging the authenticity of the system's ability to (in this case) increase the body weight of the individual depicted in the source image (leftmost image). Source: https://arxiv.org/pdf/2203.04670.pdf

The prior method uses warping directly (i.e., algorithmic manipulation of the source image, without a 3D interstitial stage); therefore, for fairness, the researchers of the new paper estimated a like-for-like warping strength for their SMPL-X-based approach.

Qualitative results against this prior framework are depicted below:

Qualitative results for Structure Aware Flow Generation vs. DiffBody. Please refer to the cited source paper for better detail and resolution.

The authors comment:

‘In the results of the existing method, increasing body size often causes significant distortion in the torso regions. In contrast, our method can create plausible images.

‘In addition, our method can handle changes in the facial appearance that occur with changes in body weight.’

Conclusion

It’s unusual that a potentially broadly-applicable human synthesis method should be driven by such a specific industry as fashion; but the extraordinary level of funding that the clothing sector can provide for further innovations in body-editing may eventually benefit the VFX sector as well.

Straightforward warping has been available for decades, at gradually improving quality levels, in prosumer packages such as After Effects, and in professional visual effects and post-processing applications and frameworks. In such a case, a region of a moving image is mapped and ‘pinned’, so that as the actor changes pose, the warp is applied continuously. This method can also be used to ‘stick’ non-existent textures (such as tattoos and wounds) to faces and bodies, among many other applications, often without the use of CGI meshes.

Though these older techniques are increasingly 3D-aware, none are able to resolve ‘unseen’ areas of the original capture in the way that generative systems such as Stable Diffusion potentially can.

For the fashion industry, the possible ultimate objective is that a user be able to upload one or more images, and then see visualizations of themselves (at their correct body weight and height) moving around and demonstrating potential clothes purchases. Ideally, fashion houses would only need to upload the new season’s fashions to update the system.

The current state of the art is some fair way off this objective, though more limited systems are beginning to emerge. But here, as in the general trend in neural human synthesis, it seems that interstitial CGI systems such as SMPL-X and FLAME are going to prove indispensable in the very near future.

* My conversion of the authors’ inline citations to hyperlinks.
