Trying out New Clothes in Stable Diffusion-Based Videos

Martin Anderson

I'm Martin Anderson, a writer occupied exclusively with machine learning, artificial intelligence, big data, and closely-related topics, with an emphasis on image synthesis, computer vision, and NLP.

Though it may not be obvious to the casual observer of the image and video synthesis scene, the fashion industry is funding considerable research into removing the need for people to actually come into stores and try on clothing that they might want to purchase.

Typical output from the vast range of try-on models released into the literature over the last 5-6 years, typified by static image results, and not very different to Photoshop-style output. Source: https://diffuse2choose.github.io/

The ‘try-on’ strand of computer vision research is well-funded and well-pursued; but, until lately, it has been limited either to static images that show the consumer what they might look like wearing a potential purchase, or else to rudimentary and rather crude video synthesis systems that cannot handle billowy or non-tight clothing, fail to produce consistent renders, or are unable to reproduce more elaborate new designs in a realistic context.

Thus, our attention has been drawn this week to a new offering from China that uses Stable Diffusion as a backbone to ‘project’ single images from new product lines into real footage, effectively substituting the clothing from the source video with an adroitness that’s rather impressive:

The new system can adapt to a wide variety of clothing styles, and does not require that the clothing in question is tight-fitting to the body. This is a montage of videos available at higher resolution at the project page. Source: https://mengtingchen.github.io/tunnel-try-on-page/

The new approach uses a method dubbed ‘tunneling’ by the authors – really, a straightforward recognition of the area to be affected (i.e., the region where a new sweatshirt, for example, is to be superimposed), which is then routed into a dedicated process that conforms the garment through Stable Diffusion and various ancillary methods, until the clothing is credibly superimposed into the new footage.

Examples of some of the novel features available in the new system, which present a challenge for the very small number of prior attempts to create this kind of functionality. This is a montage of videos available at higher resolution at the project page.

Contextual information about the environment is also handled via a separate module, so that the projected clothing maintains a natural appearance.

There are very few existing projects that have attempted this, and the authors’ tests report that their approach, dubbed ‘Tunnel Try-On’, notably defeats the scant current state-of-the-art for this task.

They consider Tunnel Try-On to be the first attempt towards commercially viable clothing imposition of this nature, and further state:

‘By integrating the focus tunnel strategy and focus tunnel enhancement, our method demonstrates the ability to effectively adapt to different types of human movements and camera variations, resulting in high-detail preservation and temporal consistency in the generated try-on videos.

‘Moreover, unlike previous video try-on methods limited to fitting tight-fitting tops, our model can perform try-on tasks for different types of tops and bottoms based on the user’s choices.’

The new paper is titled Tunnel Try-on: Excavating Spatial-temporal Tunnels for High-quality Virtual Try-on in Videos, and comes from nine researchers across Huazhong University of Science and Technology and Alibaba. The initiative comes with an accompanying project site, from which the videos in this article have been extrapolated.

Method

The system uses a latent encoder to project an image of the new fashion item into the latent space of the model. The Stable Diffusion inpainting model is used for the task.

At this point, Stable Diffusion itself transforms the passed-on latent representation into Gaussian noise through a Markov process, and eventually passes the processed data through to CLIP, so that the image/text bindings trained into the model are used as an interpretive layer for apposite output.
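As a rough illustration of this encoding and noising step, the sketch below uses the Hugging Face diffusers library; the checkpoint identifier, image pre-processing and timestep sampling are assumptions for demonstration, not details taken from the paper:

```python
import numpy as np
import torch
from PIL import Image
from diffusers import AutoencoderKL, DDPMScheduler

# Load the VAE and noise scheduler from a public SD inpainting checkpoint
# (assumed here; the paper does not name its exact checkpoint).
vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-inpainting", subfolder="vae")
scheduler = DDPMScheduler.from_pretrained("runwayml/stable-diffusion-inpainting", subfolder="scheduler")

def encode_to_latent(pil_image: Image.Image) -> torch.Tensor:
    """Project an RGB garment image into the Stable Diffusion latent space."""
    x = torch.from_numpy(np.array(pil_image)).float() / 127.5 - 1.0   # scale to [-1, 1]
    x = x.permute(2, 0, 1).unsqueeze(0)                               # (1, 3, H, W)
    with torch.no_grad():
        latent = vae.encode(x).latent_dist.sample() * vae.config.scaling_factor
    return latent                                                     # (1, 4, H/8, W/8)

# Forward (noising) process: Gaussian noise is added at a sampled timestep.
garment = Image.open("garment.png").convert("RGB").resize((512, 512))
latent = encode_to_latent(garment)
noise = torch.randn_like(latent)
t = torch.randint(0, scheduler.config.num_train_timesteps, (1,))
noisy_latent = scheduler.add_noise(latent, noise, t)
```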

Schema for the architectural workflow of Tunnel Try-On.

As mentioned, the obscure term ‘tunneling’ in this project really refers to the extraction of the target area by means of image recognition, similar to zooming in on and cropping out a relevant area of a video.

Tunnel extraction is essentially extraction of the area of interest, which is usually a face or person in image synthesis pipelines, but in this case is a 'garment area'.

We can see in the image above, on the right, that the tunneling pipeline concludes in a masked-off area representing the part of the video that will be cloistered away for processing, until a later environmental module will help to integrate it better into the context of the clip.

The environmental feature encoding consists of a frozen CLIP image encoder and a learnable (non-frozen) mapping layer. The output from this part of the process is subsequently fine-tuned through a learnable projection layer (i.e., a layer that concatenates the results obtained so far).
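A minimal sketch of such an environment encoder, assuming a public CLIP vision checkpoint and an arbitrary output width (neither of which is specified in the paper), might look like this:

```python
import torch
import torch.nn as nn
from transformers import CLIPVisionModelWithProjection

class EnvironmentEncoder(nn.Module):
    """Sketch of a frozen CLIP image encoder followed by a learnable mapping layer."""
    def __init__(self, clip_id: str = "openai/clip-vit-large-patch14", out_dim: int = 768):
        super().__init__()
        self.clip = CLIPVisionModelWithProjection.from_pretrained(clip_id)
        self.clip.requires_grad_(False)                  # frozen, as described above
        self.mapping = nn.Linear(self.clip.config.projection_dim, out_dim)   # learnable

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            feats = self.clip(pixel_values=pixel_values).image_embeds        # (B, projection_dim)
        return self.mapping(feats)                                           # (B, out_dim)
```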

The baseline module (in grey in the schema above) consists of two U-Nets: a ‘main’ U-Net initialized from an inpainting model, and a ‘ref’ U-Net that utilizes the popular Reference Only component of the ControlNet system, an ancillary control framework for Stable Diffusion that allows users to generate images based on existing images, rather than having the images pulled from the imaginative faculties of the trained model on the strength of a text prompt alone.

ControlNet's 'Reference Only' module can force generations to stick to the content of a source image, while still producing variations on that image. Source: https://github.com/Mikubill/sd-webui-controlnet/discussions/1236

With the Reference Only module preserving the image detail from the input garment, CLIP is also used to capture high-level semantic details of the target garment (i.e., a descriptive text component, which is a typical usage in Stable Diffusion generation, is considered additionally).

The main U-Net takes a nine-channel input: four channels are extracted from the target area in the source video, four consist of the related latent noise, and one is reserved for the generated mask in which the synthesis will occur. Pose maps (estimations of figure disposition) are also incorporated at this stage to increase accuracy, and concatenated into the resulting embeddings.
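In the standard Stable Diffusion inpainting layout on which this appears to be based, the nine-channel input amounts to a simple channel-wise concatenation in latent space; the shapes below assume a 512x512px frame and are illustrative only:

```python
import torch

# For a 512x512 frame, SD latents are 4 x 64 x 64, and the garment mask is
# downsampled to 1 x 64 x 64 before concatenation.
noisy_latent  = torch.randn(1, 4, 64, 64)   # noised latent of the current frame
masked_latent = torch.randn(1, 4, 64, 64)   # latent of the frame with the garment area masked out
mask          = torch.rand(1, 1, 64, 64)    # region in which synthesis will occur

unet_input = torch.cat([noisy_latent, masked_latent, mask], dim=1)   # (1, 9, 64, 64)
# Pose-map features are produced separately and folded into the resulting embeddings.
```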

Temporal attention is then applied after each stage of the primary U-Net, in order to adapt the garment try-on to the processing of video; the feature maps generated at this stage ultimately yield the denoising features used for output.

The so-called ‘focus tunnel’ obtained by this point is then subject to Kalman filtering, a linear quadratic estimation (LQE) technique that combines a series of measurements observed over time, accounting for statistical noise, in order to arrive at cleaner and more meaningful estimates.
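As a toy example of how such filtering can stabilize the per-frame tunnel, the sketch below runs a bare-bones one-dimensional Kalman filter over each bounding-box coordinate; the filter parameters are placeholders, since the paper does not publish its own:

```python
import numpy as np

def kalman_smooth(measurements, process_var=1e-2, measurement_var=1.0):
    """Minimal 1D Kalman filter over a sequence of noisy scalar measurements,
    e.g. one coordinate of the focus-tunnel bounding box across frames."""
    x, p = measurements[0], 1.0            # initial state estimate and variance
    smoothed = [x]
    for z in measurements[1:]:
        p = p + process_var                # predict (state assumed roughly constant)
        k = p / (p + measurement_var)      # Kalman gain
        x = x + k * (z - x)                # correct with the new measurement
        p = (1.0 - k) * p
        smoothed.append(x)
    return np.array(smoothed)

# Apply independently to each of the four box coordinates across frames.
boxes = np.array([[100, 80, 420, 500], [104, 83, 425, 505], [99, 79, 418, 498]], dtype=float)
smoothed_boxes = np.stack([kalman_smooth(boxes[:, i]) for i in range(4)], axis=1)
```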

At the same time, an environment encoder captures the outlying and contextual information into which the synthesized material will ultimately be projected, so that the final result does not look like a crude ‘overlay’, but is instead integrated well into the clip.

The criteria for cropping the segments for the tunnel come with a number of pitfalls. The authors explain:

‘In typical image virtual try-on datasets, the target person is typically centered and occupies a large portion of the image. However, in video virtual try-on, due to the movement of the person and camera panning, the person in video frames may appear at the edges or occupy a smaller portion.

‘This can lead to a decrease in the quality of video generation results and reduce the model’s ability to maintain clothing identity. To enhance the model’s ability to preserve details and better utilize the training weights learned from image try-on data, we propose the “focus tunnel” [strategy.]’

The aforementioned pose maps are therefore used to identify the smallest possible bounding box for the upper or lower body (i.e., shirts or jeans, etc.). Using this as a base, the system then enlarges the bounding area until the entirety of necessary clothing is covered in the crop, and ultimately the video frames will be cropped, padded and resized to the target training input resolution in the main U-Net.
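A rough sketch of that crop-and-resize step is shown below; the expansion margin, the square padding and the choice of keypoints are illustrative assumptions rather than the paper’s exact procedure:

```python
import numpy as np
from PIL import Image

def garment_tunnel_crop(frame: Image.Image, keypoints: np.ndarray,
                        expand: float = 0.2, size: int = 512) -> Image.Image:
    """Take the tightest box around the relevant pose keypoints (e.g. torso joints
    for a top), enlarge it by a margin so the whole garment is covered, pad it to a
    square, and resize to the U-Net's training resolution."""
    x0, y0 = keypoints.min(axis=0)
    x1, y1 = keypoints.max(axis=0)
    w, h = x1 - x0, y1 - y0
    x0, y0 = max(0.0, x0 - expand * w), max(0.0, y0 - expand * h)
    x1, y1 = min(float(frame.width), x1 + expand * w), min(float(frame.height), y1 + expand * h)
    side = max(x1 - x0, y1 - y0)                        # pad to square to avoid distortion
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    box = (int(cx - side / 2), int(cy - side / 2), int(cx + side / 2), int(cy + side / 2))
    return frame.crop(box).resize((size, size), Image.BICUBIC)
```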

Regarding the tunnel embedding stage, the authors state*:

‘The input form of the focus tunnel has increased the magnitude of the camera movement. To mitigate the challenge faced by the temporal-attention module in smoothing out such significant camera movements, we introduce the Tunnel Embedding.

‘Tunnel Embedding accepts a three-tuple input, comprising the original image size, tunnel center coordinates, and tunnel size. Inspired by the design of resolution embedding in SDXL, Tunnel Embedding first encodes the three-tuple into 1D absolute position encoding, and then obtains the corresponding embedding through linear mapping and activation functions.

‘Subsequently, the focus tunnel embedding is added to the temporal attention as position encoding.’
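Read literally, this describes a small conditioning network: the three-tuple is expanded into sinusoidal position encodings and passed through linear mappings and an activation. The sketch below is one interpretation of that description; the encoding width and hidden dimension are assumptions:

```python
import math
import torch
import torch.nn as nn

def sinusoidal_embedding(values: torch.Tensor, dim: int = 256) -> torch.Tensor:
    """1D absolute (sinusoidal) position encoding for each scalar in `values`."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = values.float().unsqueeze(-1) * freqs            # (B, 6, half)
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)   # (B, 6, dim)

class TunnelEmbedding(nn.Module):
    """Illustrative sketch: encode (original image size, tunnel centre, tunnel size)
    and map the result to the temporal-attention channel width."""
    def __init__(self, hidden_dim: int = 1280):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(6 * 256, hidden_dim), nn.SiLU(),
                                 nn.Linear(hidden_dim, hidden_dim))

    def forward(self, image_hw, center_xy, tunnel_hw):
        # Six scalars per sample: (H, W, cx, cy, th, tw), each sinusoidally encoded.
        scalars = torch.cat([image_hw, center_xy, tunnel_hw], dim=-1)   # (B, 6)
        enc = sinusoidal_embedding(scalars).flatten(1)                  # (B, 6 * 256)
        return self.mlp(enc)   # added to the temporal attention as a position encoding
```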

The final generated output is, at this stage, blended into the source video using Gaussian blur as a means of avoiding excessive sharpness that would make the new content stand out excessively.
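A minimal version of such a blend, using Pillow and an arbitrary blur radius (the paper does not state one), might be:

```python
from PIL import Image, ImageFilter

def blend_tunnel(original: Image.Image, generated: Image.Image,
                 mask: Image.Image, blur_radius: int = 8) -> Image.Image:
    """Paste the generated tunnel region back into the source frame, softening the
    mask edge with a Gaussian blur so the insert does not stand out.
    The blur radius here is an illustrative value, not the paper's."""
    soft_mask = mask.convert("L").filter(ImageFilter.GaussianBlur(blur_radius))
    return Image.composite(generated, original, soft_mask)
```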

Data and Tests

The training of the model is divided into two phases: in the first, the researchers freeze the SD-native Variational Autoencoder (VAE) encoder and decoder, temporal attention, the environment encoder and the tunnel embedding functionality, and update solely the parameters of the primary U-Net, the reference (i.e., ControlNet-related) U-Net, and the pose estimation guidance unit.

Here the model is trained on paired data relating to image try-ons. The authors state:

‘The objective of this stage is to learn the extraction and preservation of clothing features using larger, higher-quality, and more diverse paired image data compared to the video data, aiming to achieve high-fidelity image-level try-on generation results as a solid foundation.’

In the second stage the model is trained on datasets featuring video-based try-on material, with only temporal attention and the environment encoder updated, and other facets frozen. The objective of this is to translate the static features learned in the first phase into the temporal realm, with apposite video data.
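In training-code terms, this two-stage schedule amounts to toggling which parameter groups receive gradients; the module names below are placeholders for illustration, not the authors’ actual code:

```python
def configure_stage(model, stage: int) -> None:
    """Freeze everything, then unfreeze only the modules trained in the given stage."""
    for p in model.parameters():
        p.requires_grad_(False)
    if stage == 1:
        # Stage 1 (paired image data): train main U-Net, reference U-Net, pose guidance.
        trainable = [model.main_unet, model.ref_unet, model.pose_guider]
    else:
        # Stage 2 (video data): train only temporal attention and the environment encoder.
        trainable = [model.temporal_attention, model.environment_encoder]
    for module in trainable:
        for p in module.parameters():
            p.requires_grad_(True)
```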

To test the system, the authors used two video try-on datasets – their own curated collection, and the FW-GAN VVT dataset.

Examples from the prior dataset, VVT, used in experiments for the new paper. Source: †

The VVT dataset contains 791 paired videos of people and clothing, at a resolution of 192x256px. The models in the videos have an unchallenging pure white background (see example images above). The researchers of the new paper note that the content is equally unchallenging, since the people in the videos perform ‘simple’ poses in fitted (i.e., tight) tops, which falls short of the scope targeted by the new system proposed in Tunnel Try-On.

Because of these limitations, the authors’ own curated dataset is drawn from real-life ecommerce applications, featuring 5,350 video-image pairings, split by the researchers into 4,280 training videos and 1,070 testing videos. This bespoke dataset features more complex backgrounds and more problematic (i.e., non-tight) types of clothing.

The main U-Net utilizes the aforementioned Stable Diffusion inpainting model; the reference U-Net is initialized with the standard text-to-image functionality of the V1.5 release of Stable Diffusion; and the temporal attention module is initialized from the motion module in the Stable Diffusion adjunct system AnimateDiff.
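For reference, comparable public checkpoints can be loaded with the diffusers library as follows; the specific checkpoint identifiers are assumptions, since the paper does not name them:

```python
from diffusers import UNet2DConditionModel, MotionAdapter

# Hedged sketch of the initializations described above, using public checkpoints.
main_unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-inpainting", subfolder="unet"   # 9-channel inpainting U-Net
)
ref_unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"         # standard SD V1.5 text-to-image U-Net
)
motion_adapter = MotionAdapter.from_pretrained(
    "guoyww/animatediff-motion-adapter-v1-5-2"                 # AnimateDiff motion-module weights
)
```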

Two training stages were undertaken, during each of which the disparately-sized source data was resized to a standard 512x512px.

The models were trained on eight A100 GPUs (the paper does not specify whether these GPUs were the 40GB or 80GB VRAM models – the A100 comes in both configurations).

Initially the researchers used image try-on paired images extracted from their video sources, merging them with prior equivalent data from the VITON-HD dataset.

Examples from the VITON-HD dataset, used in the tests for Tunnel Try-On. Source: https://arxiv.org/pdf/2103.16874

After this, a 24-frame clip was sampled as input for the second stage. During testing, longer video clips were facilitated by the temporal aggregation technique developed for the 2022 EDGE project.

EDGE was designed to produce choreography directly from music, and its video concatenation technique was used in the training of Tunnel Try-On. Source: https://edge-dance.github.io/

The researchers compared Tunnel Try-On with alternative and prior methods on the VVT dataset, across qualitative comparisons, quantitative comparisons, and a user study.

Methods examined included the GAN-based approaches FW-GAN, PBAFN, and the influential ClothFormer, as well as the diffusion-based frameworks AnyDoor and StableVITON.

For fair comparison, the authors used the VITON-HD dataset for first-stage training, and VVT for the second stage, avoiding the use of their own dataset (which might have excessively and unfairly challenged the prior frameworks tested).

Qualitative comparisons over the VVT dataset. Please refer to the source paper for better resolution.

Above are featured the initial qualitative tests. Of these, the authors comment:

‘[It] is evident that GAN-based methods like FW-GAN and PBAFN, which utilize warping modules, struggle to adapt effectively to variations in the sizes of individuals in the video. Satisfactory results are achieved only in close-up shots, with the warping of clothing producing acceptable outcomes.

‘However, when the model moves farther away and becomes smaller, the warping module produces inaccurately wrapped clothing, resulting in unsatisfactory single-frame try-on results.

‘ClothFormer can handle situations where the person’s proportion is relatively small, but its generated results are blurry, with significant color deviation.’

The authors further observe that the diffusion-based methods, including StableVITON, are unable to maintain continuity of lettering on the text of a t-shirt (though the paper does not mention that later versions of Stable Diffusion than V1.5 have largely solved the typography issue).

The complex pattern, including lettering, is not maintained consistently across generations under rival diffusion-based methods.

As can be imagined, when applied to videos, these and similar aberrations cause a notable ‘jitter’ or ‘sizzle’, as observed by the authors of the new work.

The authors also conducted quantitative tests. Metrics used for static images were Structural Similarity Index (SSIM) and Learned Perceptual Image Patch Similarity (LPIPS).

For video, the metric used was Video Fréchet Inception Distance (VFID, which appears to be an alternative take on Fréchet Video Distance, or FVD), here with feature extraction provided by I3D and 3D-ResNeXt101.
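For the frame-level metrics, off-the-shelf implementations exist; the sketch below uses the lpips package and scikit-image, and is not the authors’ evaluation code:

```python
import numpy as np
import torch
import lpips                                            # pip install lpips
from skimage.metrics import structural_similarity as ssim

lpips_fn = lpips.LPIPS(net='alex')                      # perceptual metric network

def frame_metrics(pred: np.ndarray, target: np.ndarray):
    """pred/target: HxWx3 uint8 frames; returns (SSIM, LPIPS)."""
    s = ssim(pred, target, channel_axis=2)              # 'channel_axis' requires scikit-image >= 0.19
    to_tensor = lambda a: torch.from_numpy(a).permute(2, 0, 1).float().unsqueeze(0) / 127.5 - 1.0
    l = lpips_fn(to_tensor(pred), to_tensor(target)).item()
    return s, l
```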

Results from the quantitative test round.

Of the quantitative tests, the authors state:

‘[On] the VVT dataset, our Tunnel Try-on outperforms others in terms of SSIM, LPIPS, and VFID metrics, further confirming the superiority of our model in image visual quality (similarity and diversity) and temporal continuity compared to other methods.

‘It’s worth noting that we have a substantial advantage in LPIPS compared to other methods. Considering that LPIPS is more in line with human visual perception compared to SSIM, this highlights the superior visual quality of our approach.’

A user study was also conducted, wherein 10 annotators were asked to compare 130 samples from the VVT test set, with different video generation methods used from the same input. Evaluation criteria were ‘smoothness’, ‘fidelity’, and ‘quality’.

Results from the user study.

The authors emphasize, as can be seen from the results table above, that Tunnel Try-On comfortably leads the results in the user study.

Referring to the initial qualitative tests at the head of the new paper (see image below), the authors conclude:

‘By integrating the focus tunnel strategy and focus tunnel enhancement, our method demonstrates the ability to effectively adapt to different types of human movements and camera variations, resulting in high-detail preservation and temporal consistency in the generated try-on videos.

‘Moreover, unlike previous video try-on methods limited to fitting tight-fitting tops, our model can perform try-on tasks for different types of tops and bottoms based on the user’s choices.’

Initial showcase results from the new paper. Please refer to the source paper for better resolution.

Conclusion

After a long and relatively arid winter of papers that either cannot fulfill their promise, or else only produce incremental improvement over prior works (often at great expense of resources), it has been refreshing to see a breakthrough as impressive as Tunnel Try-On, as the season begins to warm up in anticipation of the coming array of computer vision and AI conferences.

Of particular note for the new system is its ability to handle clothing that is not ‘sprayed on’ to the person in the video, since fashion is a volatile medium, and systems of this nature will need to be able to handle volume better than prior works have done. Here, a very encouraging start is made to meet that challenge, even if there is the occasional glitch with hand rendering (a problem which has always plagued Stable Diffusion).

However, the new system has not taken on the most voluminous possible types of clothing, such as wide-skirted dresses or wedding gowns, which would likely need to be addressed via physics models of some kind.

Yet, disregarding the tendency of fashion models to flourish elaborately with any loose items that they might be modeling, such as long scarves, Tunnel Try-On appears to have made a notable leap towards a genuinely versatile and photorealistic fashion video try-on system.

* My substitution of hyperlinks for the authors’ inline citation/s.

† https://openaccess.thecvf.com/content_ICCV_2019/papers/Dong_FW-GAN_Flow-Navigated_Warping_GAN_for_Video_Virtual_Try-On_ICCV_2019_paper.pdf
