Editable Clothing Layers for Gaussian Splat Human Representations

LayGA - Source: https://arxiv.org/pdf/2405.07319


We have mentioned before that one of the advantages that old-school CGI has over the emerging research strands in neural human synthesis is that everything is a distinct object in the CGI world: a shirt is a separate mesh, painted with bitmapped textures that can be swapped out, and secondary ID characteristics such as hair and eye-color can be ‘dropped in’ relatively easily.

Lower left, an expanded CGI mesh, including high detail on the ears; lower right, the corresponding bitmapped facial texture that can be used to paint the mesh.

By contrast, human rendering methods such as Neural Radiance Fields (NeRF), Generative Adversarial Networks (GANs) and Gaussian Splatting tend to treat captured source material as a single entity. If the person in a source video is wearing a brown shirt, that shirt will usually become inextricably baked into any neural representation obtained from the video, as if it were part of the person’s body. Re-texturing is problematic, and substituting the item of clothing for another one is usually far more difficult in a neural workflow than in a CGI one.

Click to play. Animatable Gaussian avatars are impressive representations. Gaussian Splatting, an older rasterization technique originally used in medical imaging and now taking the synthesis research scene by storm, can produce convincing clothes – but the clothes are baked into the representation, NeRF-style, and can’t be changed. Source: https://animatable-gaussians.github.io/

Above we can see an example of a fairly recent outing in the growing research strand for Gaussian Splatting. Titled Animatable Gaussians, this method obtains characteristics from a source video and, typical of the literature, uses a Skinned Multi-Person Linear Model (SMPL), a parametric CGI model, to translate between the obtained data and the neural synthesis.

However, as the authors of the work conceded when it was published in March of this year, the impressive avatars that result from the process have baked-in clothing, which puts the much sought-after task of video-based fashion try-on out of reach, and leaves the method not really versatile enough to be useful in a VFX workflow that would require more control and editability.

Joining the very limited number of systems that have cropped up recently attempting to separate bodies and clothing into distinct and editable layers, a new work from China offers an evolution of the Animatable Gaussians project into a system that allows for the interposition of novel clothing as a distinct and separate layer, one that can potentially offer full editing, at least of relatively non-billowy items of clothing:

Click to play. Left, the original source video; left-to-right, the interpreted neural representation, together with the discretized clothing and body layer; right-most, a free-point rendering of the resulting representation. Source: https://jsnln.github.io/layga/index.html

Titled Layered Gaussian Avatars (LayGA), the approach utilizes the training and inference of two models in parallel, in order to achieve the second layer of clothing.

The same shirt interposed onto different identities with LayGA. Source: https://arxiv.org/pdf/2405.07319

The methodology used for LayGA involves multiple prior libraries or approaches, and significant effort, as we’ll see; but any new work that offers disentanglement of obtained source material, a problem that has especially plagued NeRF and GANs, is a welcome step forward, hopefully towards smoother and easier workflows in the future.

The new paper is titled LayGA: Layered Gaussian Avatars for Animatable Clothing Transfer, and comes from four researchers across Tsinghua University, NNKosmos Technology, and Beijing Normal University.

Method

A body pose obtained from a source video is first converted into a position map, which serves as the pose condition. A StyleUNet (from the StyleAvatar project) is then used to predict a pose-dependent 3D Gaussian representation, deformed from the canonical pose of the SMPL-X parametric CGI body model.
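In broad terms, this stage can be sketched as below. This is a minimal, hypothetical PyTorch outline rather than the authors' code: the `PoseToGaussians` module, the tensor shapes, and the 14-channel output layout are all assumptions made purely to illustrate how a pose-conditioned position map might be mapped to per-pixel Gaussian attributes.

```python
import torch
import torch.nn as nn

class PoseToGaussians(nn.Module):
    """Toy stand-in for the StyleUNet used in Animatable Gaussians / LayGA.

    Input:  a posed SMPL-X position map (3 channels of canonical-space XYZ
            per pixel; a single 256x256 view here, for simplicity).
    Output: per-pixel Gaussian attributes: offset (3), rotation quaternion (4),
            log-scale (3), opacity (1) and RGB colour (3) = 14 channels.
    """
    def __init__(self, in_ch=3, out_ch=14, width=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, out_ch, 3, padding=1),
        )

    def forward(self, position_map):
        feats = self.net(position_map)                      # (B, 14, H, W)
        offset, quat, log_scale, opacity, rgb = torch.split(
            feats, [3, 4, 3, 1, 3], dim=1)
        # Gaussian centres = canonical surface points + small predicted offsets
        centres = position_map + 0.01 * torch.tanh(offset)
        return centres, quat, log_scale, torch.sigmoid(opacity), torch.sigmoid(rgb)

# Usage: a (batch, 3, 256, 256) position map rendered from the posed SMPL-X mesh
pos_map = torch.randn(1, 3, 256, 256)
centres, quat, log_scale, opacity, rgb = PoseToGaussians()(pos_map)
```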

The SMPL-X model framework interprets real images into CGI representations. The X/Y/Z coordinates from the CGI model can then be used as anchor points for neural overlays. Source: https://arxiv.org/pdf/1904.05866

The resulting variables are then posed with Linear Blend Skinning (LBS) and rendered as a 3D Gaussian Splatting (3DGS) representation, in accordance with the 2023 paper that set off the new enthusiasm for Gaussian rendering.
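The LBS step itself is standard: each canonical point is transformed by every joint, and the results are blended with per-point skinning weights, i.e. x' = Σⱼ wⱼ(Rⱼx + tⱼ). A minimal sketch, with invented variable names and without the full 4×4 transform chain of a real SMPL-X implementation, might look like this:

```python
import torch

def lbs_pose_gaussians(centres, skin_weights, joint_rotations, joint_translations):
    """Pose canonical Gaussian centres with Linear Blend Skinning.

    centres:            (N, 3)    canonical-space Gaussian means
    skin_weights:       (N, J)    per-Gaussian blend weights (rows sum to 1)
    joint_rotations:    (J, 3, 3) world rotation of each joint
    joint_translations: (J, 3)    world translation of each joint
    """
    # Transform every centre by every joint: result is (J, N, 3)
    per_joint = torch.einsum('jab,nb->jna', joint_rotations, centres) \
                + joint_translations[:, None, :]
    # Blend the per-joint results with the skinning weights: (N, 3)
    posed = torch.einsum('nj,jna->na', skin_weights, per_joint)
    return posed
```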

Since, the authors attest, single-layer reconstruction has proved inadequate for tracking garment motion, training is divided into two stages: base human depiction, and segmented training for the garments:

Training schema for the new system.

For the single-layer reconstruction stage, segmented reconstruction is achieved with the use of a novel series of geometric constraints (to ensure conformity of clothing to the base figure) and with the aid of garment mask supervision.

Garment masking is provided by methods developed for the 2019 paper 'Self-Correction for Human Parsing'. Source: https://arxiv.org/pdf/1910.09777

The paper states*:

‘In the multi-layer fitting stage, we train two separate models to represent the body and clothing, thus enabling clothing transfer across different identities. Overall, our method outperforms the state-of-the-art baseline and realizes photorealistic virtual try-on.

‘Moreover, our geometrically constrained Gaussian rendering scheme, if considered as a stand-alone method, can also be used for multi-view geometry reconstruction of humans.’

The core schema of the new method is built upon Animatable Gaussians, but adds a second layer of training for clothing, converting the SMPL-X joint angles into position maps by rendering the canonical SMPL-X model at front and back views (canonical representations of this kind being best known generally from Leonardo da Vinci’s Vitruvian Man).

Canonical SMPL-X figure representations are the basis of delineating the boundaries of clothing in the new system.

3D Gaussians are extracted via the template masks derived from the canonical references.
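The position maps and template masks can be thought of along the following lines. This is a deliberately crude, hypothetical sketch: it splats canonical vertices orthographically rather than rasterizing triangles with a z-buffer, and all names and resolutions are assumptions rather than the paper's implementation.

```python
import numpy as np

def canonical_position_map(vertices, res=512, front=True):
    """Rasterise canonical SMPL-X vertices into a crude orthographic
    position map (canonical XYZ stored per pixel) plus a template mask.

    vertices: (V, 3) canonical-pose mesh vertices, roughly centred at origin.
    """
    pos_map = np.zeros((res, res, 3), dtype=np.float32)
    mask = np.zeros((res, res), dtype=bool)

    # Keep front-facing or back-facing points according to their z sign
    keep = vertices[:, 2] >= 0 if front else vertices[:, 2] < 0
    v = vertices[keep]

    # Orthographic projection of (x, y) into pixel coordinates
    xy = v[:, :2]
    lo, hi = xy.min(0), xy.max(0)
    px = ((xy - lo) / (hi - lo + 1e-8) * (res - 1)).astype(int)

    pos_map[px[:, 1], px[:, 0]] = v          # store canonical XYZ per pixel
    mask[px[:, 1], px[:, 0]] = True          # pixels covered by the template
    return pos_map, mask
```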

The authors observe that basic 3DGS approaches do not provide adequate geometric constraints, making it difficult to suspend a second, parallel clothing layer above the base figure model. Therefore a variety of loss functions are applied during training. The paper states:

‘[Geometric] constraints force the 3D Gaussians to converge to a smooth surface. However, we empirically found that these geometric constraints negatively impacted the rendering quality of 3DGS, possibly because these geometric constraints lower the flexibility of Gaussians to model high-fidelity appearance.

‘To prevent the adverse impact brought by geometric constraints, while still preserving a smooth geometry for collision handling in clothing transfer, we propose to separate the geometry layer and the rendering layer.’

Image-based normal loss is the first of the loss functions used. In traditional CGI, surface normals are indicators of which direction textures and geometry are facing (which is not necessarily implicit in a carelessly-assembled model). In the case of the new system, these are estimated from images as an ancillary supervision signal during training, but only where they fall inside the above-illustrated template masks.

Normal estimation from canonical points to 3D space.
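Such a masked normal loss might be expressed, in simplified form, roughly as follows; the function name, tensor shapes, and the choice of a plain L1 penalty are assumptions for illustration, not the paper's exact formulation.

```python
import torch

def masked_normal_loss(rendered_normals, estimated_normals, mask):
    """L1 loss between rendered and image-estimated normals, restricted
    to pixels inside the template/garment mask.

    rendered_normals, estimated_normals: (B, 3, H, W) unit-length vectors
    mask:                                (B, 1, H, W) binary template mask
    """
    diff = (rendered_normals - estimated_normals).abs() * mask
    return diff.sum() / (mask.sum() * 3 + 1e-8)
```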

Stitching Loss is also considered. This does not refer to the seams in clothes, but rather to the points at which the frontal and rear canonical views meet up. For this, simple L1 and L2 losses are used between the boundary maps.
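As a rough illustration of what a stitching term of this kind can look like, the sketch below applies L1 and L2 penalties to corresponding boundary points from the front and back maps; the sampling of those points and the loss weights are assumed for illustration.

```python
import torch

def stitching_loss(front_boundary, back_boundary, w_l1=1.0, w_l2=1.0):
    """Encourage the front and back canonical maps to agree where they meet.

    front_boundary, back_boundary: (K, 3) 3D points sampled along the
    corresponding boundary pixels of the two position maps.
    """
    diff = front_boundary - back_boundary
    l1 = diff.abs().mean()          # L1 term
    l2 = (diff ** 2).mean()         # L2 term
    return w_l1 * l1 + w_l2 * l2
```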

Further, regularization is used to penalize large distortions during training, with the authors making use of multiple regularization techniques, including a technique developed for the 2017 ClothCap project.

The ClothCap project, from 2017, developed a method to regularize the edges of canonically-represented (i.e., flattened pattern) garments, used in the new system†.

Additional losses are applied for clothing segmentation, with the SCHP model, trained on the ATR human parsing dataset, used to generate segmentation masks for items of clothing.

Data and Training

(Please note that the structure of this paper is unconventional, with training and data details interspersed throughout, and with very little material devoted to experiments, making it unclear at which point the ‘Data and Training’ coverage should be considered to begin.)

The labels for the first frame of each sequence obtained for the base human figure model are used to develop the second-layer avatar, resulting in a subset of Gaussians that are labeled as clothing.
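One simple way to obtain such labels, sketched hypothetically below, is to project each Gaussian centre into the first frame's parsing mask and keep the label found at that pixel; the camera model, label IDs, and function name are all assumptions, not the authors' pipeline.

```python
import numpy as np

def label_gaussians(centres, K, RT, parsing_map, clothing_ids=(5, 6, 7)):
    """Mark each Gaussian as clothing or body by projecting its centre into
    the first frame's parsing mask.

    centres:      (N, 3) Gaussian centres in world space
    K:            (3, 3) camera intrinsics
    RT:           (3, 4) camera extrinsics [R|t]
    parsing_map:  (H, W) integer label map from a parser such as SCHP
    clothing_ids: parser label IDs counted as clothing (hypothetical values)
    """
    homo = np.concatenate([centres, np.ones((len(centres), 1))], axis=1)
    cam = (RT @ homo.T).T                       # camera-space points
    uv = (K @ cam.T).T
    uv = (uv[:, :2] / uv[:, 2:3]).round().astype(int)

    H, W = parsing_map.shape
    u = np.clip(uv[:, 0], 0, W - 1)
    v = np.clip(uv[:, 1], 0, H - 1)
    labels = parsing_map[v, u]
    return np.isin(labels, clothing_ids)        # True = clothing Gaussian
```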

The training of the two models shares a common architecture, with additional modifications for reconstruction and rendering quality. The authors note that the aforementioned geometric constraints that force a defined surface tend to lower the flexibility of Gaussians, in terms of high-quality representations.

There is some provision made for collision detection, a common routine in CGI-based workflows, which seeks to prevent two separate entities from literally crossing into each other. However, the authors assert that collisions are only handled at inference time, and are not baked into the training architecture.
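A naive version of inference-time collision handling, given here purely as an illustrative sketch and not as the paper's formulation, is to push any clothing Gaussian that falls beneath the body surface back out along the nearest body normal:

```python
import torch

def resolve_collisions(cloth_centres, body_points, body_normals, margin=0.003):
    """Naive inference-time collision handling: if a clothing Gaussian lies
    beneath its nearest body surface point, push it back out along the body
    normal by the penetration depth plus a small margin.

    cloth_centres: (M, 3)   body_points: (N, 3)   body_normals: (N, 3)
    """
    # Nearest body point for each clothing Gaussian (brute force, fine for a sketch)
    d = torch.cdist(cloth_centres, body_points)            # (M, N)
    idx = d.argmin(dim=1)
    nearest, normals = body_points[idx], body_normals[idx]

    # Signed distance along the normal: values below the margin mean penetration
    signed = ((cloth_centres - nearest) * normals).sum(dim=1, keepdim=True)
    push = torch.clamp(margin - signed, min=0.0)            # how far to move out
    return cloth_centres + push * normals
```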

In any case, the two contributing layers, base human and clothing, are siloed, by various formulae, into discrete geometric and rendering layers, with the clothing layer supervised by a Chamfer distance loss between the base geometric model and the segmented clothing reconstruction.
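The Chamfer term itself is standard: each point in one set is matched to its nearest neighbour in the other set, and the average distances in both directions are summed. A minimal sketch:

```python
import torch

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two point sets a: (N, 3) and b: (M, 3)."""
    d = torch.cdist(a, b)                                   # (N, M) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()
```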

Data for training and tests consisted of two datasets: a subset of the AvatarRex collection, and three sequences from the ActorsHQ dataset.

Samples from the ActorsHQ dataset, which features multi-view recordings from 160 synchronized cameras. Source: https://www.actors-hq.com/

Additionally, the authors provided their own original data, in the form of a capture of a dancing man wearing a tight white t-shirt, comprising 15 views across 800 frames. For the AvatarRex sequence, 13 views were used for training, and for the ActorsHQ subset, 39 full-body views were used.

The single-layer model was trained for 200,000 iterations, and the multi-layer model for 550,000. The paper does not provide details of the hardware used. For parity with the prior method, parametric templates and view-dependent appearance effects were not imposed on the older technique.

The authors note that in one case they trained not two but three models, in order to avoid collision between a ‘pants’ and a ‘shirt’ model.

An instance in which three models were trained, to account for two garments: pants and shirt.

The very limited tests conducted for the study pit the new method against the Animatable Gaussians baseline, effectively meaning that the researchers were testing against a method that they themselves had improved upon. Since there is, to say the least, a paucity of comparable projects that attempt to discretize clothing, under any architecture, the options can be considered to have been limited.

Metrics used were Peak Signal-to-Noise Ratio (PSNR); Structural Similarity Index (SSIM); Learned Perceptual Image Patch Similarity (LPIPS); and Fréchet Inception Distance (FID).
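PSNR and SSIM are straightforward to compute with off-the-shelf tooling; LPIPS and FID require learned networks (e.g. the `lpips` and `pytorch-fid` packages) and are omitted from this illustrative sketch, which assumes scikit-image is available.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def image_metrics(pred, gt):
    """PSNR and SSIM for two uint8 RGB images of equal size."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=255)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=255)
    return psnr, ssim

# Usage with dummy images
gt = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
pred = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
print(image_metrics(pred, gt))
```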

The authors state*:

‘To evaluate the effectiveness of our geometric constraints, we compare the geometry of the Gaussians reconstructed by the baseline method (single-layer without geometric constraints) and ours. Note that the baseline method does not provide normals. Thus, we compute normals using the method illustrated in [the above-included image of ‘normal computation’].’

Geometric reconstruction using the original Animatable Gaussians method (left) and the newly-adapted method (right).

Of these results, the authors comment:

‘[The image above] shows geometric results of both methods, where Gaussians are rendered as ordinary point clouds with normals. For the baseline method, the messy shading near the leg area indicates incorrect normal orientation. This suggests their reconstructed Gaussians do not lie on the actual geometric surface. On the other hand, our method produces clean point cloud reconstructions using Gaussians.’

The authors further state:

‘Given trained LayGA models of different subjects, we can animate one subject or use the garment model of 𝐴 and the body model of 𝐵 to generate a mixed avatar. [The image below] shows novel pose animation results. Note that our layered representation can model tangential motions of clothing.’

Qualitative examples of novel pose animation, with the LayGA representations exhibiting tangential motions between models and clothing, such as when a belt is revealed, or a t-shirt lifted.

Though the paper restricts itself to static samples, we can see clearer evidence of the difference between the two methods in the examples at the project page:

Click to play. A comparison of the new method (left) over the old (right).

The authors concede that their method suffers from the most common affliction of ‘virtual try-on’ systems – that it cannot account for loose and billowy clothing, such as skirts – a shortcoming that presumably will eventually be addressed with the use of physics modeling, as has long been the case with CGI implementations of similar challenges.

Conclusion

This is not the first time we’ve come across research that has to jump through many hoops in order to replicate the ease of use of traditional CGI modeling. The formulae and architectural strategies used in LayGA have been considerably pared down for this review, to maintain readability – but the amount of work involved in separating neural representations is clearly quite considerable, at the current state of the art.

As we have noted before, fashion-industry funding for neural virtual try-on systems is likely to maintain and accelerate the impetus towards discretized clothing in neural human synthesis, and to benefit practitioners in other sectors, such as VFX.

As it stands, we’ve noted an increase in recent months of new papers addressing this challenge, such as the MOSS system, which, though it does not achieve garment separation, offers some improvement in clothing rendering quality.

Examples from the new MOSS system. Source: https://wanghongsheng01.github.io/MOSS/

Other recent offerings in this area, which pay attention to garment rendering, include VIVID, Edit-Your-Motion, Tunnel Try-On, and TELA.
