Editing Porn, Bias, Objects and Artists Out of Stable Diffusion Models

About the author

Martin Anderson

I'm Martin Anderson, a writer occupied exclusively with machine learning, artificial intelligence, big data, and closely-related topics, with an emphasis on image synthesis, computer vision, and NLP.

New research from the US and Israel offers an all-in-one solution for restricting controversial output from the generative text-to-image AI model Stable Diffusion.

With the new framework, edited models can no longer create NSFW material; recreate art in the style of any artist who has been intentionally removed from the model; or generalize mistakenly about gender or other demographic attributes in response to a prompt (such as ‘successful person’ generally producing male examples). Objects and concepts can even be removed from the model entirely, so that a ‘denied’ prompt produces no content for that input.

Unified Concept Editing (UCE) can perform a variety of amnesiac interventions on a latent diffusion model. Source: https://arxiv.org/pdf/2308.14761.pdf

The system’s amendments to gender representation tend to concentrate on roles most associated with males, such as ‘sheriff’, as we can see in the image below:

UCE allows for changing the core concept (i.e., 'man', 'woman') without changing the environmental context. Note that the bold red background for the male sheriff is little altered in the gender-swapped transformed version.

The authors of the new project are able to erase the artistic styles of specific artists en masse, so that a request for work in the style of any artist affected by these changes will lead to a generic version of the prompt.

Comparisons of the new approach with prior methods of 'editing' machine learning models so that they do not produce undesired content, such as reproducing the style of certain artists.

The system is able to cancel out common objects from existence, such as a gas pump:

The common gas pump no longer exists in the UCE-edited model.

The method devised is called Unified Concept Editing (UCE), so titled because it gathers together individual prior attempts at model-censoring into a single framework, overcoming in the process some of the disadvantages of the contributing prior works.

The authors state:

‘Our approach enables targeted debiasing, erasure of potentially copyrighted content, and moderation of offensive concepts, using only text descriptions. Our measurements suggest that our method offers three key benefits over prior methods. First, it can mitigate multifaceted gender, racial, and other biases simultaneously while preserving model capabilities.

‘Second, it is scalable, modifying hundreds of concepts in one pass without expensive retraining. Third, extensive experiments demonstrate superior performance on real-world use cases. Together, our findings suggest that UCE is significant step towards democratizing access to ethical and socially-responsible generative models.

‘The ability to seamlessly unify debiasing, erasure, and moderation will be an important tool for building AI that benefits our diverse global society.’

The new paper (which also has a project site, though this currently contains no obvious additional material) is titled Unified Concept Editing in Diffusion Models, and comes from five researchers across Northeastern University, the Massachusetts Institute of Technology (MIT), and the Technion (Israel Institute of Technology).

Challenges and Prior Approaches

Due to the interdependent nature of the way that data is arrayed in the latent space of a trained model, it’s not possible to just go in there, search for information, select it and delete it, since there are no clear demarcations between concepts and domains in hyperscale models that have been trained on a wide range of subjects.

For instance, removing the concept of ‘dog’ from a model is likely to knock on into the concept of ‘wolf’, and even into objects and concepts that have appropriated the term ‘dog’, such as ‘hot dog’. Likewise, numerous ancillary concepts may lead back to the ‘dog’ concept embedding, such as ‘leash’ and ‘bark’.

Similarly, it would become impossible to refer to Quentin Tarantino’s first cinematic outing, Reservoir Dogs, since that title also contains the erased concept.

By analogy, it is equally difficult to edit the human genome so that it stops producing cancerous cells, because the processes that allow this are essential to many other operations in cell formation and regeneration.

So the information stays in the database – but it doesn’t have to stay in the (generated) picture. By surgically altering or erasing the specific connections that bind a concept into the wider trained model, it’s possible to ‘reroute’ any ‘banned’ inquiries to content other than that intended.

By analogy, it’s equivalent to removing access to a property by destroying or amending the map that shows you how to get there.

Schema for 'closed form editing' in the new UCE system.

This is not the only way to block content access to users of generative image systems; DALL-E 2 and the emerging beta of the Firefly generative system in Photoshop perform a number of checks when users input a text prompt. In terms of resource usage, the cheapest of these is to blacklist words and/or phrases:

Denial messages in, left to right, the Firefly cloud-based generative system in the current beta of Photoshop, and in DALL-E 2.

This means that the generative process is halted before any computing resources are engaged, and the user is notified about what prevented the request from completing.
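A prompt blocklist of this kind can be sketched in a few lines. The example below is a minimal illustration, not the filter used by Firefly or DALL-E 2 (neither publishes its list or its logic); the `BLOCKED_TERMS` set and the `check_prompt` function are hypothetical names, and matching is on whole words only.

```python
import re

# Hypothetical blocklist for illustration; real services do not publish theirs.
BLOCKED_TERMS = {"nudity", "gore"}

def check_prompt(prompt, blocked=BLOCKED_TERMS):
    """Return (allowed, matched_terms) without invoking any image model."""
    # Lower-case the prompt and split it into simple word tokens.
    words = set(re.findall(r"[a-z0-9']+", prompt.lower()))
    matched = sorted(words & set(blocked))
    return (not matched, matched)

# The request is rejected before any generation resources are used,
# and the matched terms can be reported back to the user.
allowed, matched = check_prompt("A portrait with nudity")
if not allowed:
    print("Prompt rejected; blocked terms:", matched)
```

Whole-word matching keeps the check cheap, but it cannot catch misspellings or multi-word phrases, which is one reason services add a second line of defense.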

The second line of defense is to use CLIP, Vision Transformers, AWS screening services or other AI-based evaluation methods and content scanning architectures to examine the image before it is returned to the viewer. If banned content is found, the image will have briefly existed, and have taken some resources to generate, but will not be seen by the user or made available to them.

Photoshop beta decides not to supply the generated image to the user.

DALL-E 2 doesn’t let things get that far, and appears to filter entirely on the prompt, whereas Photoshop Firefly passes the images through an evaluation process, and frequently decides to block them based on the returned ratings.

A lesser-used approach is to actually change the content of the user’s text prompt before it reaches the generative system, so that the user receives a bowdlerized or amended version of their prompt.

At the end of 2022, research from Korea and the US proposed a 'clean-up' system for undesirable material in user prompts. Source: https://arxiv.org/pdf/2212.03507.pdf

One alternative broader approach is to ensure that the training data contains no material that one would want to end up in a generated image.

The problem here, besides the formidable task of carefully curating (potentially) billions of images, is that of demarcation and the synergistic nature of the latent space.

A study in 2022 revealed that removing undesired layers of data from a trained model can reveal further undesired layers of data; another, from 2018, that expurgating toxic material can create additional new biases; additionally, complete and discrete excision of offending content frequently leaves some of that content behind.

Finally, in general, fine-tuning a model (i.e., resuming training on a finished model, but with additional data intended to block undesired content) is always to some extent destructive of the core weights that were originally created for the model, which must ‘move aside’ in favor of the new data being introduced, reducing the efficacy and accuracy of the system.

Approach

UCE gathers together a number of prior alternative approaches, each designed to address only a single issue. The first of these is the Text-to-Image Model Editing (TIME) system, a very recent innovation from two of the authors of the new paper (together with other researchers), which updates the text-based cross-attention layers of a Stable Diffusion model.

The TIME system interferes with routing access to specific material within the latent matrices. Source: https://arxiv.org/pdf/2303.08084.pdf

The second primary method used is Mass-Editing Memory in a Transformer (MEMIT), another recent outing, again featuring two of the authors of the new work. Following on from a related prior work called Rank-One Model Editing (ROME), MEMIT targets the weights of Transformer modules that govern certain parameters of factual recall, allowing for thousands of entries to be simultaneously ‘re-assigned’ or rewritten.

MEMIT reassigns relationships with bulk edits of thousands of parameters inside a model. Source: https://arxiv.org/pdf/2210.07229.pdf

However, UCE exceeds the capabilities of MEMIT in that it can edit text-to-image generative models, whereas MEMIT is limited to Large Language Models; it can better specify the concepts to be edited; and it offers a novel debiasing approach (see above), with the authors reporting that UCE outperforms prior methodologies by ‘a wide margin’.

The modeling methodology adopted by UCE can be applied to any linear projection layer (LPL – a deep learning layer that applies a learned matrix multiplication to map its input into another vector space) in the model.

The edits performed on the cross-attention projections enable erasure, moderation, and debiasing in a single architecture.

Regarding the procedure to effect an erasure, the authors state:

‘To erase a concept $c_i$, we want to prevent the model from generating it. If the concept is abstract like an artistic style (eg. “Kelly Mckernan”), this can be accomplished by modifying the weights so the target output $v_i$ aligns with a different concept $c_*$ (e.g. “art”) – $v_i^* \leftarrow W^{\text{old}} c_*$

‘This updates the weights such that the output no longer reflects concept $c_i$, effectively erasing that concept from the model’s generations and eliminating generations of the undesired characteristics.’
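The erasure rule quoted above can be illustrated numerically. The following is a minimal NumPy sketch, not the authors' code: it assumes the edit is posed as a ridge-regularized least-squares problem over a single projection matrix, in which each erased concept embedding is mapped to the old layer's output for a replacement concept, while a set of preserved embeddings keep their old outputs. The function name, argument names, the `lam` regularizer and the random toy embeddings are all assumptions made for illustration.

```python
import numpy as np

def closed_form_edit(W_old, erase, targets, preserve, lam=1e-3):
    """Edit one linear projection in closed form (illustrative sketch).

    W_old:    (d_out, d_in) original projection matrix.
    erase:    (n_e, d_in) embeddings of the concepts to erase.
    targets:  (n_e, d_in) embeddings of the replacement concepts.
    preserve: (n_p, d_in) embeddings whose outputs should not change.
    lam:      assumed ridge term keeping W_new close to W_old.
    """
    d_in = W_old.shape[1]
    V_star = targets @ W_old.T    # desired outputs: old outputs of the replacements
    V_keep = preserve @ W_old.T   # outputs to hold fixed
    # Minimizes sum ||W c_i - v_i*||^2 + sum ||W c_j - W_old c_j||^2
    #           + lam * ||W - W_old||_F^2
    A = V_star.T @ erase + V_keep.T @ preserve + lam * W_old
    B = erase.T @ erase + preserve.T @ preserve + lam * np.eye(d_in)
    # W_new = A B^{-1}; B is symmetric, so solve B X = A^T and transpose.
    return np.linalg.solve(B, A.T).T

# Toy example with random vectors standing in for text embeddings.
rng = np.random.default_rng(0)
W_old = rng.normal(size=(4, 6))
c_style, c_art, c_keep = rng.normal(size=(3, 6))
W_new = closed_form_edit(W_old, c_style[None], c_art[None], c_keep[None], lam=1e-6)

# The erased concept should now produce roughly what the replacement used to produce.
print(np.abs(W_new @ c_style - W_old @ c_art).max())
```

Because the solution is a single linear solve, many concepts can be stacked into `erase` and edited in one pass without any gradient-based retraining.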

In order to achieve debiasing, the model’s relevant parameters are adjusted likewise in a desired direction. The replacement concepts being inserted are chosen in such a way as to affect the desired probability across each pertinent attribute, which, the authors state, distinguishes the initiative from TIME, which can only debias across a smaller number of attributes.

Regarding moderation of NSFW material such as ‘nudity’, the operation is much simpler, and effectively constitutes a ‘rick roll’ to an unconditional (and unrelated) prompt, such as the blank ‘ ’.

In this sense, it could be argued that the NSFW-removal aspect of the architecture, which can take advantage of Stable Diffusion’s own internal censor (a mechanism that is not difficult to overcome, and is widely turned off or removed in distributions such as the AUTOMATIC1111 webui), is merely a second and more entrenched padlock on access to NSFW output.

Data and Tests

For testing purposes, the authors compared UCE to analogous methods ESD-x (which once again features authors from the current paper), Concept Ablation, and Safe self-Distillation Diffusion (SDD).

Concepts and artistic styles erased from a latent diffusion model with the Concept Ablation framework. Source: https://arxiv.org/pdf/2303.13516.pdf

In a second round of experiments testing object erasure, the ESD-u variant was used as a rival framework, since this freezes all parameters in the model besides cross-attention, thus enabling it to attempt multiple erasures.

For the initial round, the authors attempted to erase the style of various artists at scale:

These artists' styles were NOT erased, yet are affected more in the rival frameworks after the edit than with UCE.

The authors observe that UCE preserves the general state of the model better, after the edit, than the rival frameworks. In the image above, we see generations from non-erased artists, and it is clear that in the case of the rival frameworks, there has been some collateral damage, as the edits have affected them more than with UCE. Scores are measured with the LPIPS perceptual similarity metric.

Rival frameworks' ability to effectively and discretely erase styles does not scale as well as UCE's, the authors of the new paper assert.

The authors comment:

‘[We] are able to consistently erase multiple artistic styles, while other methods maintain a lot of characteristics of the artistic styles and impair the model’s capabilities as the number of erased concepts increases…

‘… Our method also demonstrates reduced interference with neighboring, non-erased concepts compared to other techniques.’

The paper notes that diffusion models have been shown to be able to imitate over 1,800 artistic styles. The researchers decided to test UCE’s bulk-editing capabilities by erasing up to 1,000 artists from Stable Diffusion. In practice, they found that they could convincingly erase about 100 artists at a time before damaging the ability of the model to reproduce non-affected artists’ styles.

In these results (see image above), as the number of erased artists increases, we can see the CLIP, LPIPS and Fréchet Inception Distance (FID) scores getting worse.

To test UCE’s erasure capabilities, the researchers conducted experiments using Imagenette, a subset of ten relatively easy-to-classify classes from the influential ImageNet dataset, comprising the classes tench, English springer, cassette player, chain saw, church, French horn, garbage truck, gas pump, golf ball, and parachute.

The authors generated 500 images for each of these classes, and tested for top-1 classification accuracy under ResNet-50. Unlike the artists’ classes, the objects were erased individually, in order to better detect whether non-targeted classes might be affected.
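Top-1 accuracy, as used in this evaluation, is simply the share of images whose highest-scoring predicted class matches the intended class. The sketch below assumes the classifier's outputs are already available as an array of per-class scores; the function name and toy numbers are illustrative and are not taken from the paper.

```python
import numpy as np

def top1_accuracy(scores, labels):
    """Fraction of rows whose highest-scoring class equals the true label."""
    scores = np.asarray(scores)
    labels = np.asarray(labels)
    predicted = np.argmax(scores, axis=1)
    return float(np.mean(predicted == labels))

# Toy scores for four generated images over three classes (invented values).
scores = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.6, 0.3, 0.1],
]
labels = [0, 1, 1, 0]
print(top1_accuracy(scores, labels))
```

For an erased class, a low accuracy on images prompted with that class indicates effective erasure; for the remaining classes, accuracy should stay close to that of the unedited model.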

Results from the object removal task.

The authors state:

‘Without explicit preservation, our approach exhibited superior erasure capability while minimizing interference on non-targeted classes…Erasing all 10 Imagenette classes together reduced image generation accuracy to just 4.0% and COCO-CLIP score to 31.02 (original SD is 31.32), quantitatively showing effective single and multi-object removal while limiting interference.’

Finally, the researchers tested the debiasing of ‘profession’ concepts (i.e., jobs that may be more associated with a particular race or gender, etc.), using a surprisingly prosaic algorithm, and a traditional for/if loop:

The UCE debiasing algorithm.

Since multiple edits on this species of class are more likely to have negative collateral effects, it was necessary to address the instances individually, and to use edit and freeze concept lists (i.e., when a concept has been debiased, it is added to a ‘do not affect further’ list). Admittedly, this is quite a manual approach, though presumably it could lend itself to automation as necessary.
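The edit-and-freeze bookkeeping described here can be sketched as a plain loop. This is only an illustration of the idea as the article describes it, not the paper's algorithm: `measure_ratio` and `apply_edit` are hypothetical callables (in practice the ratio would come from classifying generated images, and the edit would be a weight update), and the toy versions below merely nudge a stored number toward the target.

```python
def debias_concepts(concepts, measure_ratio, apply_edit,
                    target=0.5, tol=0.06, max_rounds=10):
    """Repeatedly edit biased concepts; freeze each one once it is balanced."""
    frozen = []               # the 'do not affect further' list
    to_edit = list(concepts)  # concepts still considered biased
    for _ in range(max_rounds):
        still_biased = []
        for concept in to_edit:
            if abs(measure_ratio(concept) - target) <= tol:
                frozen.append(concept)
            else:
                # Edit this concept while asking the edit to preserve frozen ones.
                apply_edit(concept, preserve=list(frozen))
                still_biased.append(concept)
        to_edit = still_biased
        if not to_edit:
            break
    return frozen, to_edit

# Toy stand-ins: a stored attribute ratio per profession (invented numbers).
ratios = {"sheriff": 0.9, "nurse": 0.1, "teacher": 0.5}

def measure_ratio(concept):
    return ratios[concept]

def apply_edit(concept, preserve):
    # Pretend each edit halves the remaining distance to a balanced ratio.
    ratios[concept] = 0.5 + (ratios[concept] - 0.5) * 0.5

frozen, remaining = debias_concepts(list(ratios), measure_ratio, apply_edit)
print(frozen, remaining)
```

Passing the frozen list into each edit reflects the intent that concepts already balanced should not be disturbed by later edits.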

The approach was, nonetheless, found to improve gender and race bias:

Results of editing Stable Diffusion's bias in gender and race.

The results for this section are extensive in comparison to the other modules tested for UCE, perhaps because the target action is less potentially controversial than the other objectives of the system; we refer the reader to the original paper for extended details in this regard.

Finally, in a very small results-set, compared to the debiasing section, the authors test the system’s ability to intervene in NSFW throughput in a generative system.

To conduct this test, the authors generated 4,703 images using prompts taken from the Safe Latent Diffusion project. Using the NudeNet classifier, they found that the new method was more or less in parity with the rival frameworks. They further observe that ESD-x has a ‘more aggressive’ erasure, and note that this is because it actually fine-tunes the model, rather than intervening specifically into its parameters.

Results for the intervention against NSFW material.

However, the researchers point out that UCE induces ‘substantially lower’ distortion to subsequent model generations, compared to the state that the affected model is in after being operated on by the prior frameworks.

Metric scores for the NSFW intervention round.

The authors observe:

‘This indicates our method better preserves image quality while moderating sensitive concepts. Additionally, the CLIP score indicates that our technique maintains better text-image alignment post editing.’

The researchers are making the code for UCE available at GitHub.

Conclusion

It could be argued that UCE is an attempt to avoid the far more difficult problem of data curation, which remains an ad hoc and legally oblivious pursuit – albeit that the ‘wild west’ era of data scraping is likely coming to an end in this period.

From a practical point of view, the use case for UCE is to maintain the liberal distribution of trained text-to-image models, but with a form of ‘censorship’ (or ‘protection’ – it’s a semantically volatile point) that can’t be as easily overridden as is currently the case – at least with the older Stable Diffusion V1.5 base model, around which an enormous hobbyist ecosystem has developed over the last year.

Generative systems accessed via API, such as Firefly, DALL-E 2, and the ROOP faceswapping system (which only works at full resolution via a Discord bot), don’t need to harden their models against what can only be described as ‘domestic abuse’, since they can sanitize inputs, outputs, or both.

Finally, in regard to which concepts are ‘allowed’ to remain in models, it’s worth weighing the negative effects of training arbitrarily on hyperscale and under-curated data against the implications of giving rise to a gate-keeping culture with no electoral basis.

In this respect, the kind of ‘editing decisions’ facilitated by UCE could easily creep across, from the fight against the generation of universally abhorred imagery, into greyer and more divisive areas – where this type of intervention could in itself become a new form of tacit repression.
