Personalized Protection Against Stable Diffusion Deepfaking

About the author

Martin Anderson

I'm Martin Anderson, a writer occupied exclusively with machine learning, artificial intelligence, big data, and closely-related topics, with an emphasis on image synthesis, computer vision, and NLP.

There is currently notable interest in preventing images found on the internet, or in social feeds, from being used as training material for image synthesis systems such as Stable Diffusion. Perhaps the best-publicized reason for this is the way that such systems can ‘appropriate’ and replicate the style of traditional artists, based on analysis of posted images of their work, potentially diminishing the artist’s livelihood.

Additionally, third-party systems such as DreamBooth make it possible for amateur users to develop sophisticated character models of celebrities, or of unknown people, potentially allowing such identities to be included in scenarios – including porn – to which the owners of these identities would not have consented.

Such personalized ‘character systems’ can also be used to generate fake news, in the form of misleading videos and images, and this growing trend is also driving the research into making the original source images ‘immune’ to training.

An AI-altered image of Elon Musk and Mary Barra briefly fooled a number of social media users before being debunked. Source: https://twitter.com/blovereviews/status/1639988583863042050?lang=en

Many of the solutions currently being presented could be considered as draconian measures: mandating ‘backdoors’ into generative systems such as Stable Diffusion; preventing AI companies from using freely-available, web-facing data; and the (now-venerable) idea of including ‘privacy bits’ in images that will impede them from being imported into transformational systems.

From 2015, the inclusion of a 'privacy bit' is mooted, to alert transformational systems that the incoming image should not be imported. Naturally this involves the cooperation, mandated if necessary, of the systems in question, which must implement such safety filters. Source: https://tinyurl.com/av35u63r

However, by far the largest locus of academic interest is in ‘poisoning’ the source data by adding adversarial noise to training images.

The low-level adversarial perturbations of the TAFIM framework result in training images that produce degraded output in generative systems. Source: https://arxiv.org/pdf/2112.09151.pdf

As we’ve discussed before, many such ‘systemic’ adversarial solutions (i.e., the ones that still allow open source generative technologies to be distributed) require notable changes in core infrastructure, either for the way that the systems are deployed or made accessible, or the way that the training images themselves are accessible on the internet – with the latter a more far-reaching and expensive issue to address.

One of the most severe implications of using adversarial data to hobble the training of generative systems is that such an approach generally prevents any of the image's content from being ingested into generative systems – a ‘scorched earth’ policy that may not ultimately suit all parties, once the ‘wild west’ phase of generative AI has given way to lucrative licensing discussions, in which individuals may wish to selectively allow their images to be used in generative systems.

What might be more useful, therefore, is a way for individuals to exercise more granular control over their own image-based identities.

GenWatermark – A Personal Watermarking Framework

One such approach has just been proposed, in an academic collaboration between Germany’s CISPA Helmholtz Center for Information Security (in Saarbrücken), and China’s Xi’an Jiaotong University. Titled GenWatermark, the new approach is intended to create identity-specific encoders and recognition systems that permit an individual to inject almost imperceptible ‘watermarks’ into their own images.

Here we see subject syntheses of the late singer and rapper Aaron Carter. The syntheses on the left are informed by 'clean' (non-perturbed) images, and those on the right by amended images. The differences in quality are due to the two methods used (Textual Inversion and DreamBooth), rather than signifying any quality change from the process itself, which is intended to help disclose the source images used in training, rather than to interfere with the synthesis process.

These perturbations survive the training and generalization processes of transformational AI systems well, allowing users to run any unknown (generated or amended) image through the trained detector and see whether any of their watermarked images contributed to the generative model’s ability to reproduce that individual*.

What this means, in effect, is that you could upload an arbitrary number of images into the system, which would shortly thereafter return watermarked versions that you could distribute as you please (perhaps with notices about restrictions on use in AI systems).

If you wanted to have your own image run through generative AI systems, GenWatermark would not prevent this, nor affect the quality of output in any meaningful way. But the detector module would be able to confirm that the ‘amended’ versions contributed to AI-generated pictures featuring your identity. You may be fine with that, or not; but you’d know, either way, and have some form of redress in cases of infraction and unauthorized use.

Though it has been confirmed that GenWatermark is intended only as an informational tool, to let the originator of the training photos know that their images have been used in a generative system, the survival of the watermark naturally opens up other possibilities – such as the inclusion of filter systems in generative AI frameworks, which could conceivably decide to reject training images that have such signifiers buried in them. Additionally, if widely adopted, the correct identification of these watermarks could also serve as a sign that an image is not an original photo but an AI creation, and thereby make some contribution to the sphere of deepfake detection.

The authors state:

‘Subject-driven synthesis serves as an easy-to-use and creative tool for users to synthesize images based on their own needs and potentially benefit from the [synthesis].

‘For example, artists may authorize third-party AI-powered design services to help accelerate the artwork production process, especially in the animation industry. Individuals can also authorize third-party AI-powered photo editing services to efficiently synthesize images depicting them in varied novel scenes, e.g., tourist attractions.’

Approach

To accomplish this, a corpus of hundreds of thousands of photos is used to train a watermark generator, which uses a GAN methodology to inject the identifying material into synthetic and altered real images.

At the same time, a recognition model is trained, which is informed both by ‘clean’ and ‘watermarked’ versions of the user identity. Mixing the sources in this way is necessary for good generalization, and to prevent model training from overfitting on the added watermark characteristics.

The joint training of the watermark generator and detector.

The intention is that the user in question would train the system personally (presumably via easy-to-use APIs, and so on), and curate the input material themselves.

The two primary challenges for systems like this are robustness and invisibility.

Robustness means getting enough altered data into the image that the perturbing material survives the intense manipulations of the training process and essentially gets treated as a feature; invisibility means that neither the amended images nor any generated syntheses should be marred by clear evidence of the added watermark content.
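As a rough illustration of how these two competing objectives can be balanced during joint training, the sketch below shows a single cooperative update in PyTorch: a toy watermark generator and a binary detector are both rewarded when the watermark is detectable, while an invisibility penalty keeps the perturbation small. The module architecture, loss weights and budget value here are illustrative assumptions, not the authors' published implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WatermarkGenerator(nn.Module):
    """Toy stand-in for the GAN-based watermark generator: it predicts a
    small residual (the watermark) to be added to the input image."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, x):
        return self.net(x)  # residual in [-1, 1]

def joint_training_step(generator, detector, images, gen_opt, det_opt,
                        epsilon=8 / 255, invis_weight=10.0):
    """One cooperative update: the generator learns a bounded watermark that
    the detector can recognize, while staying close to the clean image.
    `detector` can be any classifier with two output logits (e.g. a ResNet)."""
    # Scale the residual into a small perturbation budget (the robustness vs.
    # invisibility trade-off discussed above).
    delta = epsilon * generator(images)
    watermarked = (images + delta).clamp(0, 1)

    # The detector sees both versions; label 0 = clean, 1 = watermarked.
    logits = detector(torch.cat([images, watermarked]))
    labels = torch.cat([torch.zeros(len(images)),
                        torch.ones(len(images))]).long().to(images.device)
    detect_loss = F.cross_entropy(logits, labels)

    # Invisibility term: the watermarked image should look like the original.
    invis_loss = F.mse_loss(watermarked, images)

    loss = detect_loss + invis_weight * invis_loss
    gen_opt.zero_grad()
    det_opt.zero_grad()
    loss.backward()
    gen_opt.step()
    det_opt.step()
    return detect_loss.item(), invis_loss.item()
```

In practice both networks would be trained over many batches of the pre-training corpus; the single-step function above is only meant to show where the two loss terms pull against each other.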

Very often, watermark systems are specific to the architecture that they’re being used in, such as GANs or latent diffusion models. This has the advantage that the ‘embedding’ of watermark content is concordant with the underlying code; but it has several disadvantages.

One disadvantage is that the resultant watermarks are architecture/build-specific, limiting the general utility of the system. Another is that fundamental changes to the code-base can cause changes in the ability of the host system to recognize watermarks created with prior versions. Therefore, with GenWatermark, the authors are seeking to imbue and recognize platform-agnostic adversarial characteristics.

Regarding this, the authors state:

‘[The] watermark system should generalize well to different target models and text prompts because the subject owner has no control of the subject-driven model and text prompts that will be used by the malicious subject synthesizer for their unauthorized image synthesis.

‘In addition, the watermark system should generalize well to different data distributions. For example, using only a few pieces of artwork from a specific artist may not be enough to train a powerful watermark system for that artist.’

GenWatermark follows the methodologies of prior systems (such as LowKey, PAT, and Augmented Lagrangian Adversarial Attacks) to limit the ‘perturbation budget’ spent during the process (i.e., the extent to which the needs for robustness and invisibility are challenged by the core objective of achieving a persistent identifier).

From a prior work, LowKey, we see the extent to which greater perturbation magnitudes can increase robustness at the expense of invisibility. The first row shows 'untainted' source images; the second row, 'medium' strength; the lowest row, high strength. Source: https://arxiv.org/pdf/2101.07922.pdf
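In practice, such a budget is usually enforced as a hard cap on how far any pixel may drift from its original value. A minimal sketch of that projection step follows; the epsilon value is an arbitrary illustrative choice, not a figure taken from GenWatermark or LowKey.

```python
import torch

def project_to_budget(original: torch.Tensor, perturbed: torch.Tensor,
                      epsilon: float = 8 / 255) -> torch.Tensor:
    """Project a perturbed image back into an L-infinity ball of radius
    `epsilon` around the original, then back into the valid pixel range."""
    delta = torch.clamp(perturbed - original, -epsilon, epsilon)
    return torch.clamp(original + delta, 0.0, 1.0)
```

Raising epsilon makes the embedded signal more likely to survive training, at the cost of increasingly visible artifacts – the trade-off illustrated in the LowKey figure above.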

The images used to jointly train both components include synthesized images of the subject, obtained from diverse text prompts. Since these are similar in nature to the output of the generative systems being targeted here, they help to condition the adversarial material in a context ‘native’ to the actual end use-case.

Unusually, the Learned Perceptual Image Patch Similarity (LPIPS) metric is used at this stage not to assess the general image quality, as is usually the case, but rather to assess the continuing effectiveness and presence of the adversarial perturbations, and to aid in creating an invisible layer of identifying watermark data that is likely to become deeply associated with the identity being trained.
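For reference, the metric itself is available off the shelf via the `lpips` Python package; the snippet below simply scores a watermarked image against its clean counterpart, with random tensors standing in for real images.

```python
import torch
import lpips  # pip install lpips

# LPIPS with an AlexNet backbone; inputs are expected in the range [-1, 1].
loss_fn = lpips.LPIPS(net='alex')

clean = torch.rand(1, 3, 256, 256) * 2 - 1                           # placeholder image
watermarked = (clean + 0.02 * torch.randn_like(clean)).clamp(-1, 1)  # faintly perturbed copy

# Lower scores mean the watermark is perceptually harder to notice.
distance = loss_fn(clean, watermarked)
print(f"LPIPS distance: {distance.item():.4f}")
```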

Describing the development of the detector module in the joint training process, the paper’s authors observe:

‘First, we use a clean image set X and its corresponding watermarked set Xw to train two subject-driven models, i.e., a clean model M and a watermarked model Mw. Then, we use these two models to synthesize images with multiple prompts, resulting in two corresponding synthesized image sets, denoted as S and Sw. Finally, we use S and Sw to fine-tune the detector. Note that the generator remains unchanged.

‘In this way, protecting each individual subject becomes a downstream task that can be solved efficiently.’
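In code terms, this subject-specific step amounts to retraining only the classifier on the two synthesized sets, with the watermark generator left frozen. The PyTorch sketch below is a simplified reading of that description: it trains full-batch on in-memory tensors, whereas a real pipeline would use mini-batches and a data loader.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def finetune_detector(detector: nn.Module,
                      synth_clean: torch.Tensor,        # S:  syntheses from the clean model M
                      synth_watermarked: torch.Tensor,  # Sw: syntheses from the watermarked model Mw
                      epochs: int = 5, lr: float = 1e-4) -> nn.Module:
    """Fine-tune the detector to separate clean-model syntheses from
    watermarked-model syntheses; the watermark generator is not touched."""
    opt = torch.optim.Adam(detector.parameters(), lr=lr)
    images = torch.cat([synth_clean, synth_watermarked])
    labels = torch.cat([torch.zeros(len(synth_clean)),
                        torch.ones(len(synth_watermarked))]).long()
    for _ in range(epochs):
        logits = detector(images)
        loss = F.cross_entropy(logits, labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return detector
```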

Data and Tests

The test scenarios undertaken by the authors are extensive, and cover multiple methodologies, as well as two end use-cases: a ‘deepfake’ usage, where the identities are used to create or superimpose faces; and a ‘style’-based use case, which aims to protect the visual lexicon of a particular artist. As the results are extensive and complex, we will concentrate primarily on the deepfake use case here, and we refer the reader to the paper for the complete array of results.

All experiments were run on a single NVIDIA A100 GPU, with 40GB of VRAM. The pre-training phase, in which hundreds of thousands of abstractly-selected pictures of celebrities are used to build up a general understanding of human faces, takes six hours; presumably, however, the fruits of this pre-training can to some extent be reused for other identities later on.

The two personalized synthesis frameworks used are DreamBooth and Textual Inversion. Textual Inversion (TI) is more suited to the extraction of abstract styles from images. Though DreamBooth can also accomplish this, it tends to embed specifics from the training data more than Textual Inversion does; for instance, training a ‘Dali’ style on data that includes ‘melting watches’ is more likely to lead to instances of melting watches in the generated output than is the case with Textual Inversion, which attempts to distill the artist’s style at a higher, more conceptual level.

For the task of re-generating trained identities, the authors used the Celeb-A dataset, which contains 202,599 images across 10,177 celebrities.

The synthesis process for human faces includes a text component, since text prompts are an essential feature of the targeted systems. The authors describe how text is incorporated into the training process:

‘In the human face task, we randomly select 30 prompts from Lexica, a popular search engine that contains millions of AI-synthesized images and their corresponding prompts. We refine each prompt by removing irrelevant information but only keeping a target context to form our final prompt following “A photo of [V] target context”. For example, for an original prompt “smiling softly, 8k, iraklinadar, hyperrealism, hyperdetailed, ultra realistic”, our refined prompt is “A photo of [V] smiling softly”.

‘Since DreamBooth requires an additional term denoting the category of the subject, the prompt becomes “A photo of [V] face smiling softly”’
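Stripped of the NLP details, this refinement reduces to slotting a short 'target context' into a fixed template, with DreamBooth additionally requiring a class noun after the [V] token. A minimal sketch of that templating step follows; the helper name and defaults are illustrative rather than taken from the paper's code.

```python
def build_prompt(target_context: str, use_dreambooth: bool = False,
                 subject_class: str = "face") -> str:
    """Wrap a refined 'target context' in the prompt template described above."""
    if use_dreambooth:
        return f"A photo of [V] {subject_class} {target_context}"
    return f"A photo of [V] {target_context}"

# "smiling softly, 8k, iraklinadar, hyperrealism, ..." -> "smiling softly"
print(build_prompt("smiling softly"))                       # Textual Inversion
print(build_prompt("smiling softly", use_dreambooth=True))  # DreamBooth
```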

For the human face (deepfake) task, the training set was constructed by randomly choosing four celebrities from Celeb-A. The pre-training process obtains eight models for the human face – four of which are ‘clean’, and four watermarked. Thirty prompts were selected, and images synthesized at 256x256px.

For the watermark generator, a vanilla Generative Adversarial Network (GAN) is used, derived directly from the original 2014 paper. A ResNet34 network was used for the detector. The pre-training for the face model used 200,000 images from Celeb-A. For each subject in the fine-tuning phase for the detector, 1000 synthesized images were used (covering 40 different images across 25 text prompts) for both the clean and the watermarked model.
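A ResNet34 detector with a two-class (clean vs. watermarked) head of this kind can be instantiated in a few lines with torchvision; the sketch below is an assumption about the setup rather than the authors' exact configuration.

```python
import torch.nn as nn
from torchvision import models

def build_detector(num_classes: int = 2) -> nn.Module:
    """ResNet34 backbone trained from scratch, with a binary head that
    distinguishes watermark-derived images from clean ones."""
    detector = models.resnet34(weights=None)
    detector.fc = nn.Linear(detector.fc.in_features, num_classes)
    return detector
```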

The researchers considered four possible attack scenarios – an important consideration, since a great deal of research into adversarial steganography presumes an inordinate and often improbable amount of access to the core technologies (i.e., white box attacks).

Scenario 1 assumes that both the model and prompts are known to the attacker; scenario 2 assumes that the model is known, but not the prompts; scenario 3 assumes that the prompts are known, but not the model; and scenario 4, the most ‘black box’ approach, and the most challenging, presumes that neither the prompts nor the model are known to the attacker.
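Viewed as an evaluation grid, the four scenarios simply toggle two independent assumptions; the snippet below enumerates them in the same order as the numbering above, purely for illustration.

```python
from itertools import product

# (model_known, prompts_known): Scenario 1 = both known ... Scenario 4 = neither.
scenarios = [
    {"scenario": i + 1, "model_known": m, "prompts_known": p}
    for i, (m, p) in enumerate(product([True, False], repeat=2))
]
for s in scenarios:
    print(s)
```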

In tests, the images produced were evaluated for watermark detection accuracy, and also using Fréchet Inception Distance (FID).
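FID itself can be reproduced with off-the-shelf tooling; the sketch below uses the `torchmetrics` implementation with random placeholder tensors standing in for real and synthesized face images, and a reduced feature size to keep the toy example fast.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance  # pip install torchmetrics[image]

fid = FrechetInceptionDistance(feature=64)  # 2048 is the standard feature size

# Placeholder uint8 images; in practice these would be real photos of the
# subject and the corresponding synthesized images.
real_images = torch.randint(0, 256, (100, 3, 299, 299), dtype=torch.uint8)
synth_images = torch.randint(0, 256, (100, 3, 299, 299), dtype=torch.uint8)

fid.update(real_images, real=True)
fid.update(synth_images, real=False)
print(f"FID: {fid.compute().item():.2f}")  # lower means closer to the real distribution
```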

Results for watermark accuracy. Human face results are the right-most two columns, covering Textual Inversion and DreamBooth. The four aforementioned scenarios are color-coded (see index above).

Predictably, the most challenging scenarios in the human face tests achieve the lowest results, though it is interesting to note that the absolute ‘black box’ scenario (scenario 4, where neither the model nor the prompts are known) fares only a little worse than scenario 3.

Of this, the researchers comment:

‘Even in Scenario 4, the most challenging one with unknown models and prompts, GenWatermark still maintains an accuracy of about 74%, much higher than the random chance for binary classification, i.e., 50%. As expected, in Scenario 2 or Scenario 3, where either the model or prompts are known, the performance is in between the two extreme scenarios.

‘When comparing Scenario 2 and Scenario 3, we can see that knowing the model is much more helpful than knowing the prompts. In other words, transferring between models is more difficult than between prompts.’

In terms of FID scores, the human face tests performed worse than the artistic style tests, though the difference between results from clean and watermarked images was quite low.

Results for the FID scores.

In tests of the watermarks’ resilience, measured by classification accuracy, DreamBooth fared slightly better for the human face task than Textual Inversion. However, the authors note that uniqueness is not necessarily the most meaningful metric for this particular task, since it’s expected that the person behind the identity in question only has control over their own face, nullifying the usefulness of testing the procedure with images that feature other people.

Results for watermark uniqueness.

The researchers also tested for general image quality (included in the FID scores illustrated above), since the procedure is intended to have minimal impact on the generative process:

‘[The] FID score changes only by less than 1% on average, indicating that injecting watermarks indeed has little impact on the original synthesis quality.’

They further note that DreamBooth obtains better results in this respect than Textual Inversion, a standpoint borne out by the anecdotal comments of many casual practitioners in the Stable Diffusion community, where DreamBooth (and more advanced non-invasive methods such as LoRA and LyCORIS) has become a more popular method of identity synthesis.

For a qualitative analysis, the researchers chose pictures of Aaron Carter (illustrated earlier in this article), with the authors asserting that the syntheses informed by watermarked images are similar in quality and style to those informed by the clean input images.

The paper offers further studies into the resilience of the new technique against common attacks, such as the use of Gaussian noise and JPEG compression artifacts to ‘confuse’ such systems and eradicate the watermarks. In this respect, the authors note that JPEG compression is a stronger countermeasure than Gaussian noise, which is the perturbation more commonly considered in the literature on steganographic adversarial methodologies.

The authors further conclude that the effectiveness of the new technique may be aided by the relatively sparse choices currently available for realistic synthesis, such as Stable Diffusion and various rival latent diffusion models. They concede that the rapidly-developing synthesis scene may bring forth a greater multiplicity of new architectures, which could reduce the effectiveness of such a generic adversarial technique as the one that they propose.

They suggest also that fine-tuning the generator as well as the detector could improve the resilience of GenWatermark, but note that ‘this would inevitably cost more computational resources’.

Additionally, like prior schemes, GenWatermark becomes more effective and survivable as the intensity of the watermark increases, which necessitates a careful trade-off between altering the image in a way that the average person would notice, and risking that the watermark data is so faint that it is discarded as ‘noise’ during the training process.

The effect on image quality of varying levels of watermarking intensity.

Conclusion

It’s encouraging to see a research direction that aims to provide flexibility and agency to end users who may wish to selectively allow AI-generated synthesis of their own identities, without resorting to the ‘nuclear’ options that have begun to be proposed in recent months, as the power of generative image systems has increasingly engaged the public’s attention.

It’s possible that systems such as GenWatermark are looking at the problem from the wrong end of the telescope, and that the culture is eventually headed towards more centralized identification procedures. However, such overarching systems may take years, or even decades, to overcome the political and individual concerns of participating nations, and face a number of technical and infrastructural challenges as well.

Nonetheless, it may eventually prove easier to authenticate consenting identities through central databases than to slipstream consent (or even accurate, per-ID watermarks) in the way envisaged by GenWatermark.

The new system suffers from many of the shortcomings common to prior adversarial watermarking schemes: resilience comes with at least some cost in image quality; the procedure entails some notable and time-consuming processing and infrastructure; and the system may be ineffective against new or upgraded architectures, notwithstanding the authors’ efforts to create a ‘platform-agnostic’ and identity-specific watermarking regimen – which is an admirable aim.

 

* In an email to us, lead author Yihan Ma confirmed: ‘[Our] generative model won’t be used to prevent such images being used, but let the author to check if the images are based on their artworks or photos when they saw potential unauthorised images.’
