Better Stable Diffusion Deepfakes With an Object-Oriented Programming Approach

About the author

Martin Anderson

I'm Martin Anderson, a writer occupied exclusively with machine learning, artificial intelligence, big data, and closely-related topics, with an emphasis on image synthesis, computer vision, and NLP.

For visual effects practitioners, one of the most fruitful capabilities of the text-to-image system Stable Diffusion is its ability to create neurally-synthesized, highly convincing versions of real actors – their faces and bodies alike.

Various celebrity likenesses captured through LoRA training, using free LoRAs from places such as civit.ai.

By contrast, older technologies such as CGI have been plagued by the uncanny valley effect, notwithstanding huge effort from some of the biggest VFX houses in the world.

ILM's painstaking CGI reproduction of actor Peter Cushing was quite impressive in 2016, when 'Rogue One' was released – but only a year later the advent of deepfakes would date this approach terribly. Sources: https://www.indiewire.com/awards/industry/rogue-one-visual-effects-ilm-digital-grand-moff-tarkin-cgi-princess-leia-1201766597/ and https://www.youtube.com/watch?v=xMB2sLwz0Do

Stable Diffusion and similar latent diffusion models (LDMs), such as Kandinsky, can however achieve photorealistic faces and bodies of actors when ancillary systems such as LoRA and DreamBooth are trained (with all rights and permissions obtained) on a relatively small number of source images of the actor – even 1,000 images, a very high amount for such systems, is a fraction of the data needed for older autoencoder systems such as DeepFaceLab, DeepFaceLive, and FaceSwap.

For DreamBooth and LoRA training, a plausible and reasonably flexible neural representation can be obtained by training on as few as four images for a fairly short time, even on moderate consumer-level hardware. Source: https://huggingface.co/docs/diffusers/training/dreambooth

While the VFX research scene has not yet produced a deepfake-style LDM video-production system that’s entirely as temporally stable as DeepFaceLab, it seems this innovation is imminent; and in the meantime, the use of LoRA and DreamBooth output in Stable Diffusion can potentially produce the kind of high-quality synthetic training data and extreme face angles that cannot be obtained, or at least obtained easily, any other way.

Class Prejudice

There is, however, one major drawback in using LoRA (which has, for the most part, succeeded DreamBooth) to generate such images: it is often quite difficult to get a LoRA representation of an actor to do anything that was not represented in the training data, such as unusual expressions, or poses that did not feature in the source images on which the LoRA was trained.

Above, basic Stable Diffusion V1.5 with the prompt 'A color photo of [man | woman | person ] with open mouth', but the negative prompt 'smile' forces Stable Diffusion to not default to the easy option of using a smile to represent an open mouth. Below, a LoRA of the actor Ryan Gosling with the same adapted prompt – the system is unable to accommodate the request. Source: https://civitai.com/models/22431/ryan-gosling-lora

In the top row of the image above, we can see that Stable Diffusion V1.5 is quite capable of producing a photo of a person (or, if desired, more specifically a male or female person) with an open mouth, without requiring that it be accompanied by a smile.

But in the LoRA-driven representations of Ryan Gosling, the actor keeps his mouth firmly closed. Since there were clearly no open mouths in the Gosling training data, they cannot be summoned up at inference time.
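For those who wish to reproduce this kind of check, the minimal sketch below shows one way of issuing the same prompt and negative prompt through the Hugging Face Diffusers library; the model ID, the (commented-out) LoRA filename and the sampling settings are illustrative assumptions, not a prescription:

```python
# A minimal sketch of the prompt/negative-prompt test described above,
# using Hugging Face Diffusers. The LoRA path is a hypothetical placeholder;
# any Stable Diffusion V1.5-compatible LoRA could be substituted.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Optional: attach a subject LoRA (hypothetical filename).
# pipe.load_lora_weights("./ryan_gosling_lora.safetensors")

image = pipe(
    prompt="A color photo of a man with open mouth",
    negative_prompt="smile",   # discourage the 'easy' smiling interpretation
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]

image.save("open_mouth_test.png")
```

With the base model alone, prompts of this kind readily produce non-smiling open mouths; once a typical subject LoRA is loaded, the behavior described above tends to reassert itself.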

Yet this makes no sense, considering the way that systems such as LoRA and Stable Diffusion itself use classes to train subjects, and text/image pairs to bind data to these classes.

For instance, in the dominant Kohya_ss distribution, by far the most popular, powerful and flexible method of training LoRAs, the source images are required to be named in the form [token] [class] – e.g., RyanGosling person, or RyanGosling man – so that the data will consist of a sequence of images something like RyanGosling man 0001.png, RyanGosling man 0002.png, and so on.
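As a purely illustrative sketch (not part of Kohya_ss itself), a small script along these lines could copy a folder of source images into the token/class naming convention described above; the paths, token and class names are placeholders:

```python
# A hypothetical helper that copies a folder of source images into the
# '[token] [class] NNNN.png' naming convention described above.
# Paths and names are placeholders, not official Kohya_ss tooling.
from pathlib import Path
import shutil

def rename_for_lora(src_dir: str, dst_dir: str, token: str, cls: str) -> None:
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    images = sorted(p for p in src.iterdir()
                    if p.suffix.lower() in {".png", ".jpg", ".jpeg"})
    for i, path in enumerate(images, start=1):
        # e.g. 'RyanGosling man 0001.png'
        shutil.copy(path, dst / f"{token} {cls} {i:04d}{path.suffix.lower()}")

rename_for_lora("raw_photos", "train_data", "RyanGosling", "man")
```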

Since Gosling is in this case being bound to a ‘human’ class of some type or other, and since Stable Diffusion is clearly capable of producing humans with open mouths, that are not smiling, why is it that the actor will only have this capability if the source data contains images of him with an open mouth? Why is it that Stable Diffusion cannot reach back into latent codes from the higher-level human/person/man classes, and append this functionality to the LoRA’s capabilities?

SuDe

A new paper from China offers some insight into the problem, and a possible solution, based on the rigid taxonomies of Object-Oriented Programming.

The new work suggests that the bespoke personalization data is being introduced too late into the class taxonomy during training, and cannot therefore benefit from everything that Stable Diffusion knows about that class.

The current method of text/image pair training in systems such as LoRA (and certain iterations of DreamBooth, though neither explicitly requires captioning) places the customized information at the very end of a class chain, so that it can exploit only the broadest attributes of the parent class (such as ‘man’):

Class: Ryan Gosling [MAN]

The paper suggests that the injected information should instead become a sub-class of a public class:

Class: Person > Man | Ryan Gosling
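To make the object-oriented analogy concrete, the toy sketch below (not taken from the paper's code) expresses the two arrangements as Python classes: in the first, the subject is a discrete token merely associated with 'man'; in the second, it is a derived class that inherits the public attributes of 'Man' while adding its own private ones.

```python
# A toy illustration (not from the paper) of the two modeling choices.

class Man:
    # 'Public' attributes the pre-trained model already knows for this category.
    public_attributes = {"open mouth", "smile", "running", "standing"}

# Current approach: a discrete token loosely associated with 'man';
# it does not automatically inherit behaviours such as 'open mouth'.
class RyanGoslingToken:
    related_class = Man
    private_attributes = {"facial identity", "poses seen in training"}

# SuDe's proposal: the subject is a derived class of its semantic category,
# inheriting public attributes while keeping its private ones.
class RyanGosling(Man):
    private_attributes = {"facial identity", "poses seen in training"}

print("open mouth" in RyanGosling.public_attributes)  # True: inherited from Man
```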

In experiments with injecting bespoke data into a public class, the researchers were able to create customized characters and objects that exist as sub-classes in the same way that ‘open mouth’ is a sub-class of ‘person’, instead of being rudely appended as end-of-line information in a broad class.

In this way, they have succeeded in creating generative customized systems that can exploit a far broader range of attributes of the parent class, without sacrificing the identity information in the source data:

Experiments with the FaceChain-SuDe system produce personalized results that can exploit a broader range of characteristics of the parent class. Source: https://arxiv.org/pdf/2403.06775.pdf

The paper states:

‘Typical works focus on learning the new subject’s private attributes. However, an important fact has not been taken seriously that a subject is not an isolated new concept but should be a specialization of a certain category in the pre-trained model.

‘This results in the subject failing to comprehensively inherit the attributes in its category, causing poor attribute-related generations. [Motivated] by object-oriented programming, we model the subject as a derived class whose base class is its semantic category.

‘This modeling enables the subject to inherit public attributes from its category while learning its private attributes from the user-provided example.’

Examples of customization that fails (or does not completely succeed) when the customized data is cloistered in a private class (middle row), and succeeds when the customized data is integrated into the parent class.

The technique is called Subject-Derived regularization (SuDe), with the code available at GitHub. SuDe operates as a plug-and-play method that can in theory be attached to a number of training and customization methodologies.

Testing with three baseline frameworks across two backbones, the authors found that SuDe is capable of achieving disentanglement in a novel and effective way. Additionally, the system offers potential for one-shot customization, using only a single source image.

The paper is titled FaceChain-SuDe: Building Derived Class to Inherit Category Attributes for One-shot Subject-Driven Generation, and comes from six researchers across Peking University, Alibaba Group, Tsinghua University, and Pengcheng Laboratory.

Method

The paper states:

‘[We] propose to model a subject as a derived class of its semantic category, the base class. This helps the subject inherit the public attributes of its category while learning its private attributes and thus improves attribute-related generation while keeping subject fidelity.’

The SuDe pipeline learns private attributes by reconstructing the subject example in novel ways, forcing the customized data into the public attributes of what would normally be the parent class.

The new approach regularizes the subject-driven generated images so that the subject inherits the attributes of its class category, effectively becoming a sub-category of that class, instead of a discrete token or sequence of latent codes that merely bears a relation to the class.

This is facilitated by the pre-trained BERT language model used as a text encoder. An additional classifier is not useful in this case, since its semantic logic may not align with that of the trained target model.

For optimization, the new approach simply introduces one new loss function, termed Lsude. In this way, the characteristics of the custom-trained data are moved from their usual cloister in the adjunct system directly into the higher class structure, gaining access to the broader traits of the class.
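As a rough, assumption-laden sketch of how such a plug-and-play regularizer might slot into an existing fine-tuning objective (the exact formulation of Lsude is given in the paper; only the weighting term ws follows the authors' notation, and the function names below are invented for illustration):

```python
# Illustrative only: how a plug-and-play regularization term such as Lsude
# might be folded into a customization training step. 'reconstruction_loss'
# and 'sude_loss' stand in for the baseline objective (DreamBooth, Custom
# Diffusion or ViCo) and the paper's derived-class term; this is not the
# authors' actual code.
import torch

def total_loss(reconstruction_loss: torch.Tensor,
               sude_loss: torch.Tensor,
               ws: float = 1.0) -> torch.Tensor:
    # The baseline objective is kept unchanged; the derived-class
    # regularizer is simply added with a weight ws.
    return reconstruction_loss + ws * sude_loss
```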

Data and Tests

For initial tests, the SuDe system was evaluated across three frameworks: DreamBooth; Custom Diffusion; and ViCo, each under two separate backbones: Stable Diffusion V1.4, and V1.5.

Only Lsude is added as a training loss. Fine-tuning, under these minimal requirements, took as little as seven minutes on a single NVIDIA 3090 GPU with 24GB of VRAM.

For quantitative tests, the researchers used the DreamBench dataset from the original DreamBooth project, which contains 30 subjects across 15 categories, with five images dedicated to each subject; the tests focused on one-shot customization, using only a single source image per subject.

Each of the baselines explored had their own prompt templates, but the researchers collected five attribute-related prompts for each subject tackled, to test the extent of disentanglement that could be achieved by SuDe.

For metrics, DINO-I was used in conjunction with CLIP-I; these represent the averaged cosine similarity between DINO (respectively CLIP) embeddings of generated and real images. The authors note that DINO-I has an advantage in that it can better evaluate differences between subjects within the same category/class.

Further, the researchers used two metrics related to attribute alignment: CLIP-T, which captures image/text matching only at a fairly coarse level, and BLIP-T, the average cosine similarity between BLIP embeddings of prompts and generated images (BLIP is most commonly used by Stable Diffusion enthusiasts and practitioners as a captioning or image-analysis system).
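A minimal sketch of how a CLIP-T-style score can be computed with the Hugging Face Transformers library appears below; the model ID and pre-processing details are assumptions, and this is not the paper's exact evaluation code:

```python
# A minimal CLIP-T-style sketch: cosine similarity between the CLIP text
# embedding of a prompt and the CLIP image embedding of a generated image.
# Model ID and details are illustrative, not the paper's evaluation code.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_t(prompt: str, image_path: str) -> float:
    image = Image.open(image_path)
    inputs = processor(text=[prompt], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        text_emb = model.get_text_features(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"])
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    return torch.nn.functional.cosine_similarity(text_emb, image_emb).item()

print(clip_t("a photo of a dog playing with a ball", "generated.png"))
```

BLIP-T follows the same pattern, swapping the CLIP encoders for BLIP's image and text encoders.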

The researchers initially produced a qualitative round:

Qualitative results for the SuDe tests. Please refer to the source paper for better detail and resolution.

Of these results, the authors comment:

‘Qualitatively, we see that generations with our SuDe align the attribute-related texts better. For example, in the 1st row, Custom Diffusion cannot make the dog playing ball, in the 2nd row, DreamBooth cannot let the cartoon character running, and in the 3rd row, ViCo cannot give the teapot a golden material.

‘In contrast, after combining with our SuDe, their generations can reflect these attributes well. This is because our SuDe helps each subject inherit the public attributes in its semantic category.’

The authors further note, citing the dog in the first row of the image above, that SuDe maintains identity well, and is not ‘flooded’ with collateral attributes that threaten continuous identity, and that these particular indicators of identity remain as private attributes in the class/category.

Quantitative results for SuDe, pitted against some notable prior methods.

Of the quantitative round, the paper asserts*:

‘[SuDe] achieves stable improvement on attribute alignment, i.e., BLIP-T under SD-v1.4 and SD-v1.5 of 4.2% and 2.6% on ViCo, 0.9% and 2.0% on Custom Diffusion, and 1.2% and 1.5% on Dreambooth. Besides, we show the performances (marked by ) of a flexible ws (best results from the [0.5, 1.0, 2.0] · ws).

‘We see that this low-cost adjustment could further expand the improvements, i.e., BLIP-T under SD-v1.4 and SD-v1.5 of 5.3% and 3.9% on ViCo, 1.1% and 2.3% on Custom Diffusion, and 3.2% and 2.0% on Dreambooth.’

The researchers note once again that, since the DINO-I score fluctuates only slightly from the baseline, the approach does not damage the generations’ fidelity to the target subject (i.e., the customized person, object, animal, etc.).

Further tests were conducted with increasing influence of Lsude (via its weight ws), and it becomes clear that the optimal value lies in the middle of the tested range. In the image below, for instance, we can see the aesthetic quality of the generations become compromised as the ws value is taken up to 4x:

Visual comparisons using different intensities of the new method, with some failure occurring at the higher strengths (far right).

Of particular interest in the results visualized above is that the baseline fails to disentangle the fruit in the source image (far left) from the bowl which is the sole content of the prompt (it does not mention fruit), whereas all the various SuDe versions pull out only the bowl content for visualization.

Further tests were conducted to see if diverse concepts could be cumulatively added to generations and still produce discrete characteristics:

Experiments with attribute-unrelated prompts for SuDe.

Despite what appear to be some CFG artifacts (particularly in the dog examples in the first row, indicating that CFG values needed to be pushed in certain cases in order to obtain fidelity to the prompt), most of the results are of high quality, and obey the specifics of the attribute-agnostic additional prompting material, suggesting a level of disentanglement that is hard to achieve even with much more intensive training in methods such as DreamBooth and LoRA.

Extensive examples in the supplementary section of the paper demonstrate that the core subject can be extracted discretely from its context in the source image, as well as allowing on-subject changes (such as clothing) and stylization, all whilst retaining the core concept extracted from a single image:

Diverse challenges for SuDe.

Though we refer the reader to the source paper for further examples of experiments undertaken to demonstrate the disentangled quality of generations created for the work, one notable final example is the extent to which SuDe can undertake object and action editing very effectively, including the retention of identity under what, in the general run of image synthesis papers, is extreme duress:

Editing examples. Note that the fundamental shape of the teapot is preserved even when an extreme deformation, such as 'cube-shaped', is requested.

Conclusion

SuDe’s initial results, as presented in the paper, represent a quite extraordinary level of disentanglement, at remarkably little cost. The general run of such methods tends to involve days or even weeks of intensive training on the highest level of hardware available, and many more layers of schematic and architectural chicanery.

By contrast, the new work provides evidence that something quite obvious has been overlooked in the training of image/text pairs for customization, and that moving up custom content from a cloister into the broad run of the governing class is capable of achieving unparalleled dexterity and improvement in results, with an impressive economy of execution.

It would have been most interesting to see the authors tread more boldly and frequently into the realm of human image synthesis and editing in this paper; however, the current climate of fear around growing legislation against deepfakes, and the possibility of negative association with what is coming to be perceived as a destructive technology, means that most of the results presented avoid human representation in favor of the more limited range of topics promoted by API-only systems such as the DALL-E series, and Adobe Firefly.

Nonetheless, SuDe may have great implications for the visual effects professional, if eventually implemented into mainstream, human-centered solutions such as Kohya. Currently on the ‘to do’ list at the project’s GitHub site is ‘Support more style lora (such as those on [Civit.ai])’. If the SuDe methodology does indeed evolve that far into the mainstream, it may represent a quantum leap in fidelity to prompt, and in the long-term struggle against entanglement.

From the project's GitHub page, a rare example of human subjects under SuDe.

* ws is a weighting factor applied to Lsude.
