As we recently indicated in our examination of the difference between CGI and AI, one of the current aims of human neural synthesis research is to use older CGI techniques in order to rein in and govern neural rendering approaches – modern, AI-based methods which are capable of far more realism than the decades-old techniques used by Hollywood, but which also are far more difficult to control.
One strong strand of research over the last few years has been to investigate the possibility of mapping real images or videos of people into special CGI polygon meshes, changing the appearance or pose of the ‘CGI-ed’ person, and then passing the result back to neural rendering processes, which can make a far more convincing representation of the altered person than could be achieved by old-school texture-mapping and CGI-based refinement.
In effect, the goal here is to apply systematic transformations to human images, so that one could change their pose or appearance, using venerable CGI techniques as an interstitial stage between the flat photo and the neurally-altered image.
This particular line of inquiry is of notable interest to fashion-based AI, which is currently focused on developing ‘virtual try-on’ techniques that can accommodate a variety of body sizes and types, without the need to recruit particular types of real-world people to model the clothing. In this sense, the fashion synthesis research sector has been one of the earliest proponents of full-body deepfakes.
Click to play. The DiffSynth framework can impose fashion choices on target models. Source: https://anonymous456852.github.io/
Though changing the appearance of body shape in images and video is of pivotal interest to fashion synthesis systems that wish to tailor try-ons to the diverse shapes and sizes of potential customers, it is also an enthusiastic line of inquiry for social media applications that can potentially allow uploaders to look more ‘socially conforming’ in their photos and videos than they actually are in real life:
Generating 3D meshes from single images has proved to be an ill-posed problem in image synthesis, since the reconstruction usually has only one single image from which to attempt to create a manipulable model. Occlusions and ‘unknown areas’ are therefore inevitable. One clear and quite famous example of this is the difficulty of guessing what a person looks like from the side, based only on front-facing images.
The advent of Latent Diffusion Models (LDMs), and particularly of Stable Diffusion, has breathed new life into this difficult task, since this strand of generative AI, trained on multiple millions of diverse images, is more-than-averagely capable of at least attempting to ‘fill in the gaps’ of unseen viewpoints when estimating a 3D synthesis, even from as little as a single image.
A recent project from Japan combines a variety of techniques, including the SMPL-X CGI body mesh, DreamBooth and Stable Diffusion, to offer a novel method of achieving such transformations, claiming a general improvement on the current state of the art.
Titled DiffBody, the new approach addresses several of the bottlenecks and bugbears of recent attempts of similar scope, using various types of iterative refinement to improve the quality of the estimated geometry and texture.
Tested across a variety of datasets, DiffBody outperforms a range of equivalent state-of-the-art (SOTA) projects, the researchers claim.
A reference texture for the source image is then created by projection mapping, and body shape parameters such as weight and height are imposed, together with keypoints for the armature (see sample image above).
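The core idea of projection mapping can be sketched in a few lines of Python: cast each 3D mesh vertex through a pinhole camera into the source photo and sample a color there. This is a toy illustration only, not the paper’s actual SMPL-X rasterization pipeline, and all values here are invented for the example:

```python
import numpy as np

def project_texture(vertices, image, K):
    """Toy projection mapping: project 3D mesh vertices into an image
    through a pinhole camera (intrinsics K) and sample a colour for
    each vertex. Vertices behind the camera or outside the frame get
    no colour -- these are the inevitable 'unknown' areas."""
    h, w, _ = image.shape
    colors = np.zeros((len(vertices), 3))
    visible = np.zeros(len(vertices), dtype=bool)
    for i, v in enumerate(vertices):
        if v[2] <= 0:                      # behind the camera
            continue
        p = K @ v                          # pinhole projection
        u, vv = p[0] / p[2], p[1] / p[2]
        if 0 <= u < w and 0 <= vv < h:     # inside the frame
            colors[i] = image[int(vv), int(u)]
            visible[i] = True
    return colors, visible

K = np.array([[2.0, 0.0, 2.0],
              [0.0, 2.0, 2.0],
              [0.0, 0.0, 1.0]])            # toy camera intrinsics
img = np.zeros((4, 4, 3)); img[..., 0] = 1.0    # uniform red 4x4 'photo'
verts = np.array([[0.0, 0.0, 1.0],        # in front of the camera
                  [0.0, 0.0, -1.0]])      # behind it: stays unknown
colors, visible = project_texture(verts, img, K)
```

The `visible` flags are what matter downstream: every vertex they mark as unseen corresponds to a hole in the reference texture that later stages must fill in.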
Naturally, as noted earlier, there are going to be huge gaps in a projected texture. If you project an image of your face onto a showroom dummy, for instance, the texture will streak at the sides and be completely absent at the back of the head.
Thus DiffBody uses Stable Diffusion to infer what the texture ought to be in these occluded areas, by creating a DreamBooth model for which the only data consists of the sole reference image.
Typically, DreamBooth models contain multiple images depicting varying viewpoints; however, even with one sole image, as in this case, the supporting information in the underlying Stable Diffusion base model contains so much data about human anatomy that it is possible to ‘look round the corners’ by using a DreamBooth model trained on just one image.
The LDM (i.e., Stable Diffusion) is also conditioned via the T2I-Adapter module – a ControlNet-style framework designed to help Stable Diffusion perform image-to-image operations more accurately (for instance, painting crude glasses onto an image of a person and passing the result through Stable Diffusion with the text-prompt ‘wearing sunglasses’).
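The mechanism behind adapter-style conditioning can be caricatured very simply: features extracted from the condition image (a rendered pose map, an edge map, etc.) are added to the denoiser’s intermediate activations. The sketch below is a deliberately minimal stand-in, not the real T2I-Adapter architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in 'adapter': a fixed linear map from the condition image's
# three channels into the denoiser's eight feature channels. The real
# T2I-Adapter uses a small convolutional network instead.
W = rng.standard_normal((8, 3)) * 0.1

def adapter_condition(features, condition_image):
    """Caricature of adapter-style conditioning: encode the condition
    image (shape (3, H, W)) into the feature space and add it to the
    denoiser's activations (shape (8, H, W))."""
    injected = np.einsum('oc,chw->ohw', W, condition_image)
    return features + injected

features = np.zeros((8, 16, 16))          # stand-in UNet feature map
pose_map = rng.random((3, 16, 16))        # stand-in rendered pose map
out = adapter_condition(features, pose_map)
```

The key property is that the base model is untouched: all the steering lives in the small injected term, which is why adapters can be trained cheaply and swapped per task.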
At this point, images can be generated that go beyond the source image in terms of angles and poses. Inspired by the methodology of the Stochastic Differential Editing (SDEdit) framework, image-to-image translation is then used to refine the details of the initial coarse images produced earlier in the workflow.
Full Body Refinement
Next, two separate refinement modules are applied to the workflow: one to address shortcomings in the rendering of the entire body, and another to improve depiction of the face and help retain the source identity.
The authors state*:
‘The refinement module takes as input an image to be modified, a refinement mask for invisible areas, a conditioning vector extracted from a prompt and keypoints. The input image is converted into a latent feature map using the VAE encoder of the LDM. Inspired by Blended Diffusion, we perform denoising in the refinement mask to modify only invisible areas.’
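The Blended Diffusion trick the authors refer to reduces to a masked blend at each denoising step: the model’s output is kept only inside the refinement mask, while everywhere else the original latent (re-noised to the current noise level) is re-imposed, so visible areas survive untouched. A schematic numpy sketch, with invented values:

```python
import numpy as np

def blended_step(denoised, original_latent, mask, noise_level, rng):
    """One Blended Diffusion-style step (schematic): keep the model's
    prediction only inside the refinement mask; outside it, re-impose
    the original latent at the current noise level, so the visible
    areas of the image are never modified."""
    renoised = original_latent + noise_level * rng.standard_normal(original_latent.shape)
    return mask * denoised + (1.0 - mask) * renoised

rng = np.random.default_rng(0)
original = np.ones((4, 4))                   # stand-in latent of the input
denoised = np.zeros((4, 4))                  # stand-in model prediction
mask = np.zeros((4, 4)); mask[:, 2:] = 1.0   # right half = invisible area

out = blended_step(denoised, original, mask, noise_level=0.0, rng=rng)
```

With `noise_level=0.0` the effect is easy to see: the masked (invisible) right half takes the model’s prediction, while the visible left half is passed through unchanged.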
Unusually weak noise is used in the reverse diffusion process (the heart of LDM generation, depicted in the image above), since excessive noise would destroy essential detail.
The process then continues into iterative refinement through multiple reverse processes, again using weak noise. The loss function used here is Adaptive Wing (AW) loss. The OpenPose ControlNet-style functionality is implemented by the aforementioned T2I-Adapter, while CLIP similarity between the output and reference images is used to gauge when adequate accuracy has been achieved.
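The overall loop – add a little noise, denoise, check similarity against the reference, repeat – can be sketched schematically. Here `denoise_step` and `similarity` are toy stand-ins for the LDM reverse process and CLIP similarity; none of this is the paper’s code:

```python
import numpy as np

def refine(image, denoise_step, similarity, target=0.9,
           noise_strength=0.1, max_iters=20, rng=None):
    """Schematic weak-noise iterative refinement: each pass adds only
    a little noise (strong noise would destroy essential detail),
    denoises, and stops once a similarity score against the reference
    clears the target threshold."""
    rng = rng or np.random.default_rng(0)
    for _ in range(max_iters):
        noised = image + noise_strength * rng.standard_normal(image.shape)
        image = denoise_step(noised)
        if similarity(image) >= target:
            break
    return image

# Toy stand-ins: 'denoising' pulls values halfway toward an all-ones
# reference, and 'similarity' is closeness to that reference.
reference = np.ones((8, 8))
denoise_step = lambda x: 0.5 * x + 0.5 * reference
similarity = lambda x: 1.0 - np.abs(x - reference).mean()

result = refine(np.zeros((8, 8)), denoise_step, similarity)
```

The weak `noise_strength` is the point of the exercise: each pass perturbs the image just enough for the denoiser to reconsider it, without erasing the detail accumulated so far.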
Additionally, the text embeddings (i.e., the semantic relationship between visual material and words) are optimized with each iteration, with the input image reinitialized at each stage to avoid the solution converging to an unnatural state (i.e., the iteration is forced always to reconsider the source image at each stage, rather than some ‘evolved’ version of that image).
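Optimizing an embedding rather than the image itself is a standard pattern, and can be illustrated with a stand-in linear ‘generator’: gradient descent nudges the embedding so the generator’s output matches a reference, while the output is always re-derived from scratch rather than fed back on itself. Everything below is invented for the illustration:

```python
import numpy as np

# Schematic per-iteration embedding optimisation: a stand-in linear
# 'generator' G maps a 4-dimensional text embedding to 16 output
# features; gradient descent on the embedding pulls the output toward
# a fixed reference. The output is recomputed from the source inputs
# on every round, never chained from the previous output.
rng = np.random.default_rng(0)
G = rng.standard_normal((16, 4)) * 0.5   # stand-in generator weights
reference = rng.standard_normal(16)      # stand-in reference features

embedding = np.zeros(4)
lr = 0.02
for _ in range(200):
    output = G @ embedding                    # always re-derived afresh
    grad = 2 * G.T @ (output - reference)     # gradient of squared error
    embedding -= lr * grad

final_error = np.linalg.norm(G @ embedding - reference)
```

Because the embedding has far fewer degrees of freedom than the output, the error shrinks but never vanishes entirely – the same reason embedding optimization steers a diffusion model without letting it drift arbitrarily far from the source.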
To refine the face, a facial region is automatically cropped from the image obtained in the first step of the workflow. Thereafter the face goes through a similar procedure to the body (see ‘Full Body Refinement’, above), except that MSE keypoint loss is also used, with facial landmarks considered in the training process. Though the full-body refinement stage also features a face, this level of granular attention is reserved for the face-only refinement phase.
Identity loss is also evaluated during this section of the pipeline, by the MagFace framework, to ensure that the original identity of the source image is not degraded in the refinement process.
Data and Tests
A demonstration of MonoPerfCap, the dataset for which was used in tests for DiffBody. Source: https://www.youtube.com/watch?v=Zg0Zaiarlpk
A total of 51 reference images and 963 target images were obtained from these datasets.
Evaluation metrics used were Structural Similarity Index (SSIM), Peak Signal-to-Noise Ratio (PSNR), Fréchet Inception Distance (FID, which has been challenged lately), and the aforementioned AW and ID loss methods.
For ID loss, the cosine similarity between facial features was estimated with MagFace.
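For reference, the two simplest of these measures can be reproduced in a few lines – illustrative implementations only, since the paper’s actual figures rely on standard library code and MagFace features:

```python
import numpy as np

def psnr(a, b, max_val=1.0):
    """Peak Signal-to-Noise Ratio between two images in [0, max_val]."""
    mse = np.mean((a - b) ** 2)
    return 10 * np.log10(max_val ** 2 / mse)

def identity_similarity(feat_a, feat_b):
    """Cosine similarity between two face-embedding vectors -- the
    shape of the ID metric (here with toy vectors, not MagFace)."""
    return float(np.dot(feat_a, feat_b) /
                 (np.linalg.norm(feat_a) * np.linalg.norm(feat_b)))

a = np.full((4, 4), 0.5)
b = np.full((4, 4), 0.6)          # uniform 0.1 error -> MSE = 0.01
print(round(psnr(a, b), 1))       # prints 20.0 (dB)

# Parallel embeddings score a perfect 1.0
same = identity_similarity(np.array([1.0, 2.0]), np.array([2.0, 4.0]))
```

SSIM and FID are considerably more involved (windowed statistics and Inception-network features respectively) and are best taken from established implementations.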
Rival frameworks tested were Liquid Warping GAN with Attention (LWG), Pose-Guided Human Animation (PGHA), Neural Texture Extraction and Distribution (NTED), Person Image Synthesis via Denoising Diffusion Model (PIDM), T2I Adapter, and Diffusion Inpainting of Neural Textures (DINAR).
DiffBody was implemented in Python and PyTorch on an NVIDIA RTX A6000 GPU with 48GB of VRAM, with training images resized to 512×512px.
A standard (V1.4) Stable Diffusion model was fine-tuned on the sole reference image via DreamBooth, using the AdamW optimizer. Text embedding optimization, by contrast, was performed with the base Adam optimizer, and PyMaF was used for the estimation of SMPL-X parameters.
For the quantitative pose-editing comparison, the rival frameworks were either fine-tuned on Stable Diffusion under the T2I-Adapter, run with their official models where available, or else trained from their official code on the DeepFashion dataset.
Of these results, the authors state:
‘In the results on DeepFashion, although NTED performs the best, our method also outperforms the methods not trained on the DeepFashion dataset.
‘Furthermore, our method shows the best average scores across all datasets. These findings suggest that our method works across multiple datasets.’
Below are results for the qualitative tests (please see source paper for better resolution).
Regarding these results, the paper notes that the warping-based methods LWG, PGHA and DINAR produce distorted and stretched textures in the ‘unknown’ or invisible regions of the reference images. The authors also note that these approaches tend to weld the hand texture into the torso area.
Though the image-to-image results for the NTED and PIDM methods are good, they do not adequately preserve clothing textures and facial identity on alternate datasets, the authors observe:
‘The text-to-image approach, T2IA, also suffers from this problem and even generates different backgrounds from the reference images. Our method, on the other hand, consistently produces satisfactory results on all of the datasets.
‘Our method successfully achieves a wide range of pose editing for a variety of person images, which is difficult to achieve with the existing methods.’
Body reshaping is so new a pursuit that no reference datasets are currently available for testing it. The researchers therefore adopted the qualitative testing method devised for the 2022 paper Structure-Aware Flow Generation for Human Body Reshaping.
The prior method uses warping directly, i.e., algorithmic manipulation of the source image, without a 3D interstitial method, and therefore, for fairness, the researchers of the new paper estimated a like-for-like warping strength for their SMPL-X-based approach.
Qualitative results against this prior framework are depicted below:
The authors comment:
‘In the results of the existing method, increasing body size often causes significant distortion in the torso regions. In contrast, our method can create plausible images.
‘In addition, our method can handle changes in the facial appearance that occur with changes in body weight.’
It’s unusual that a potentially broadly applicable human synthesis method should be driven by so specific an industry as fashion; but the extraordinary level of funding that the clothing sector can provide for further innovations in body-editing may eventually benefit the VFX sector as well.
Straightforward warping has been available for decades, at gradually improving levels of quality, in prosumer packages such as After Effects, and in professional visual effects and post-processing applications and frameworks. In these systems, a region of a moving image is mapped and ‘pinned’, so that the warp is applied continuously as the actor changes pose. This method can also be used to ‘stick’ non-existent textures (such as tattoos and wounds) to faces and bodies, among many other applications, often without the use of CGI meshes.
Though these older techniques are increasingly 3D-aware, none are able to resolve ‘unseen’ areas of the original capture in the way that generative systems such as Stable Diffusion potentially can.
For the fashion industry, the possible ultimate objective is that a user be able to upload one image or more, and then be able to see visualizations of themselves (at their correct body weight and height) moving around and demonstrating potential clothes purchases. Ideally, fashion houses would only need to upload the new season’s fashions to update the system.
The current state-of-the-art is some fair way off this objective, though more limited systems are beginning to emerge. But here, as in the general trend in neural human synthesis, it seems that interstitial CGI systems such as SMPL-X and FLAME are going to prove indispensable in the very near future.
* My conversion of the authors’ inline citations to hyperlinks.