Better Deepfakes by Ripping Out Skip Connections

About the author

Martin Anderson

I'm Martin Anderson, a writer occupied exclusively with machine learning, artificial intelligence, big data, and closely related topics, with an emphasis on image synthesis, computer vision, and NLP.

Researchers from China have discovered that entirely removing one of the apparently essential core technologies in image synthesis and generative systems can notably improve the quality of facial deepfakes.

Improved likenesses are obtained in the new system, which entirely removes skip connections and the shared encoder that are common to many face-swapping systems. Source: https://arxiv.org/ftp/arxiv/papers/2307/2307.07928.pdf

By removing skip connections (which we’ll get to in a moment) from a re-imagined deepfaking framework, the researchers were able to significantly outperform rival state-of-the-art face-swapping systems.

The new system beats some formidable rivals, by removing skip connections and dispensing with a shared encoder (practically the central principle of traditional deepfake methods). We'll take a deeper look at these results later.

Referring to the thorny problem of entanglement in image synthesis, and the 2020 FaceShifter framework (a collaboration between Peking University and Microsoft that is central to the new paper), the authors assert*:

‘We noticed that in [FaceShifter] and most follow-up works [InfoSwap, Smooth-Swap, and Hififace], skip connections play an important role in preserving non-identity attributes.

‘Hence, we delve deep into this design. After diagnostic evaluations, we realized that because shallow convolutional features contain both ID and non-ID information, the skip connection that introduces shallow features to image decoding could be the “reason” behind such entanglement.’

Let’s now take a look at the researchers’ new insights and (quite revolutionary) approach to one of the most persistent obstacles in the development of neural human synthesis systems.

The new paper is titled Reinforced Disentanglement for Face Swapping without Skip Connection, and comes from five researchers at Xiaobing.AI.

Skipping Skip Connections

In a Deep Neural Network (DNN), a skip connection is essentially a shortcut from a shallow layer to a deep layer.

The data in transit, by analogy, is taking a bus that is blazing past several scheduled stops.

Skip connections can bypass several 'scheduled' layers in the network. Source: https://www.analyticsvidhya.com/blog/2021/08/all-you-need-to-know-about-skip-connections/

Skip connections build a bridge between the input of a convolutional block (aka a ‘residual module’) and its output, entirely avoiding the ‘scenic route’.

One supposed advantage of this is that information that has been gathered at the start of this journey arrives at the end-point uncompromised.

If the same information had taken the ‘complete’ and unexpurgated route through the network, it might have become notably altered by the time it arrived at the destination, because the network is by nature transformative and interpretive.

Thus, on the plus side, skip connections allow important low-level information to remain intact at the later processing or generative stage. This can mean that the training data has a more subtle and useful influence on the output.

But it can also mean that information one would actually like to get rid of becomes persistent and entangled in other, non-related processes.

So skip connections are architecturally ‘amoral’: they’re either helping, or getting in the way, depending on what it is you want to achieve.
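To make the principle concrete, here is a minimal sketch – plain Python with toy numbers standing in for tensors, not a real network – of the difference between a forward pass with a skip connection and one without:

```python
# Minimal illustrative sketch of a skip (residual) connection.
# Lists stand in for feature tensors; 'conv_block' stands in for
# a learned transformation.

def conv_block(x):
    """Stand-in for a convolutional block: some learned transformation."""
    return [0.5 * v for v in x]  # toy transformation

def residual_forward(x):
    # y = F(x) + x : the '+ x' term is the skip connection, carrying
    # the original signal past the transformation unchanged.
    fx = conv_block(x)
    return [a + b for a, b in zip(fx, x)]

def plain_forward(x):
    # Without the skip connection, only the transformed signal survives.
    return conv_block(x)

x = [1.0, 2.0, 3.0]
print(residual_forward(x))  # [1.5, 3.0, 4.5] - the input is still present in the output
print(plain_forward(x))     # [0.5, 1.0, 1.5] - the input has been fully 'interpreted' away
```

The `+ x` term is the whole story: whatever the block does, the untouched input rides along to the output – which is exactly why unwanted information can prove so persistent.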

The new paper argues that skip connections are indeed obstructive to neural identity transfer, and represent one of the central reasons for identity bleed – the inability of a deepfake architecture to completely ‘overwrite’ the target content, and the tendency of the original identity in that content to ‘bleed through’ and spoil the deepfaking effect. Removing skip connections, the authors contend, leads to improved results and better face-swaps.

This kind of selective skipping is common practice in the Stable Diffusion community, where the CLIP Skip feature can help to improve image quality by taking the output of an earlier layer of the CLIP text encoder, over-leaping the extreme granularity of interpretation that the final layers can impose on some terms.

As per the principles outlined above, this preserves the ‘purity’ of an idea present in a text-prompt: the concept is not dragged through the complexity of irrelevant layers, nor welded irretrievably into the prompt’s other associated semantics, which could mutilate or ‘over-cook’ the original intention of what a semantic image/text pair might bring to the final generated output.
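As a toy sketch of this idea – strings stand in for tensors, the encoder is fake, and the ‘skip=1 means final layer’ convention is an assumption, since interfaces vary – CLIP Skip amounts to selecting an earlier layer’s output instead of the last:

```python
# Toy sketch of the 'CLIP Skip' idea: take the hidden state of an
# earlier text-encoder layer rather than the final one. Strings stand
# in for tensors; this is not real CLIP code.

def encode_text(prompt, num_layers=12):
    """Fake CLIP-like text encoder: returns the hidden state after each layer."""
    states = []
    features = prompt
    for i in range(1, num_layers + 1):
        features = f"{features}|L{i}"  # each layer further refines (and specializes) the features
        states.append(features)
    return states

def clip_skip(states, skip=1):
    # skip=1 -> final layer; skip=2 -> penultimate layer, and so on.
    return states[-skip]

states = encode_text("a cat")
print(clip_skip(states, skip=1))  # fully 'interpreted' final-layer features
print(clip_skip(states, skip=2))  # the less-specialized penultimate layer
```

Choosing `skip=2` hands the downstream model a representation that has not passed through the most specialized final layer – the textual analogue of not letting information travel the full ‘scenic route’.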

Method

As we recently discussed at some length, the common run of deepfake applications, since the technology burst onto the scene in late 2017, has used a shared encoder to store features that are derived from source data for two identities, during the training of the model.

As the two separate folders, each containing a different identity, are processed, the resulting features are dumped into a common central repository – the shared encoder. Source: https://blog.metaphysic.ai/combating-identity-bleed-in-deepfakes/

Under this schema, skip connections are used to negotiate the various pose/identity facets that are drawn from the training data. These facets of information are all mixed in together in a central ‘soup’, with skip connections used to address the necessary aspects of the data.

The traditional schema for a deepfakes architecture.

The researchers for the new paper have devised an alternate scheme which abandons the idea of a shared encoder and gives dedicated and separate encoder space to each identity:

The new schema for WSC-Swap.

The new approach is called WSC-Swap (with ‘WSC’ presumably standing for Without Skip Connections, though this is not made explicit in the paper).

The authors state:

‘[Instead] of using skip connections, we propose a novel framework consisting of a Facial Non-ID (FNID) network and a Non-Facial attributes (NFA) network to perform face swapping. The skip-connection-free framework exhaustively preserves target non-identity information, while at the same time preventing the target facial identity from leaking into the swap image decoder.’

The novel modules created for WSC-Swap consist of a Facial Non-ID (FNID) network and a Non-Facial attributes (NFA) network, which together perform the face-swapping.

Conceptual architecture for WSC-Swap.

The authors observe that without skip connections, the system is able to exhaustively preserve the identity that is to be superimposed, and to prevent the target identity (i.e., the one that will be overwritten) from leaking through into the final output.
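As an illustration only – the function names below echo the paper’s module names, but the implementations are hypothetical placeholders, with dicts standing in for feature tensors – the disentangled pipeline might be sketched like this:

```python
# Conceptual sketch (not the authors' code) of a WSC-Swap-style split:
# separate encoders for ID and non-ID information, and no skip
# connection feeding shallow features into the decoder.

def id_encoder(source_img):
    """Extracts only identity features from the source face."""
    return {"identity": source_img["identity"]}

def fnid_encoder(target_img):
    """Facial Non-ID network: expression and pose of the target face."""
    return {"expression": target_img["expression"], "pose": target_img["pose"]}

def nfa_encoder(target_img):
    """Non-Facial Attributes network: background, lighting, etc."""
    return {"background": target_img["background"]}

def decode(id_feats, fnid_feats, nfa_feats):
    # The decoder sees only these three disentangled streams; with no
    # skip connection, no shallow target features (which mix ID and
    # non-ID information) can leak into the result.
    out = {}
    out.update(id_feats)
    out.update(fnid_feats)
    out.update(nfa_feats)
    return out

source = {"identity": "A", "expression": "neutral", "pose": "front", "background": "studio"}
target = {"identity": "B", "expression": "smile", "pose": "left", "background": "street"}

swap = decode(id_encoder(source), fnid_encoder(target), nfa_encoder(target))
print(swap)  # identity from the source; expression, pose and background from the target
```

The point of the sketch is structural: the target’s identity simply has no route into the decoder, because no single shared representation (and no skip connection) ever carries it there.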

The paper notes that prior works (such as SimSwap, the aforementioned InfoSwap, and Hififace, among many others) which have investigated diverse methodologies for the disentanglement of identity and non-identity attributes, have all failed to achieve three central goals outlined in the new work: 1) the preservation of the source identity; 2) the disentangled preservation of non-identity material (such as locations and clothing, etc.); and 3) output which is free from artifacts and measurably and perceptibly photorealistic.

Though the paper is short on examples, and the supplementary material mentioned was not available at the time of writing (we have reached out to the authors), here is one of the few examples given in the paper, where we see the difference in quality of identity preservation in the FaceShifter system, with and without skip connections.

The authors researched these prior structures, observing that all contained a bottleneck encoder/decoder structure (as pictured earlier), and recognized that this central architectural glitch is almost certain to lead to ‘misclassified’ information (i.e., either identity or non-identity information) being delivered to the wrong place, due to the lack of discrete pipelines for these separate facets at training and synthesis time.

In this under-tooled scenario, the authors note, the bottleneck encoder/decoder architecture is burdened not only with fully removing the facial identity from its context (everything that is not directly ID-related), but also with preserving everything else about that context, with permanent and transient facets essentially ‘fused’: ‘eye direction’ is not a defining characteristic of a face, but instead describes a pose, whereas ‘eye color’ is indeed linked with identity – yet both are handled by the same internal processes.

The researchers state:

‘[We] argue that it is very difficult, if not impossible, to simultaneously achieve the above two goals using only one single compressed bottleneck [encoder]. In addition, the skip connections used in the decoder would inevitably bring the target identity information into the results together with other non-identity attributes, therefore, further hurting the disentanglement learning.’

Data and Tests

To test WSC-Swap, the authors used three facial image datasets to train the system: CelebA-HQ; FFHQ; and VGGFace. Since each of these datasets is differently configured, subsets and derivations were used selectively, to ensure parity in testing. The researchers followed the testing methodology of FaceShifter, with the training images cropped to 256x256px.

For metrics and evaluation data, the authors used the FaceForensics++ dataset, against which three metrics were computed: ID retrieval, pose error and expression error.

The researchers note the frequent leveraging in recent literature of the 2018 CosFace model as the ID vector extractor:

Overview of the popular CosFace identity extractor, which introduces the Large Margin Cosine Loss (LMCL) function to minimize intra-class variance and maximize inter-class variance. Source: https://arxiv.org/pdf/1801.09414.pdf

However, since popular recent methods vary in the way that they estimate expression and pose errors, the authors instead use the 2022 DAD-3DHeads dataset for pose and expression prediction, using Euclidean distance to quantify expression and pose errors.
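By way of illustration, the two kinds of measurement at work here – cosine similarity between identity embeddings, and Euclidean distance between pose/expression vectors – can be sketched with toy values (not real CosFace or DAD-3DHeads outputs):

```python
import math

# Illustrative sketch of the two distance measures described above,
# using toy 2D/3D vectors in place of real model embeddings.

def cosine_similarity(a, b):
    """Similarity of two ID embedding vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def euclidean_distance(a, b):
    """Error between predicted and target pose/expression vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

swap_id   = [0.6, 0.8]  # toy ID embedding of the swapped face
source_id = [0.6, 0.8]  # toy ID embedding of the source face
print(cosine_similarity(swap_id, source_id))  # ~1.0: identity preserved

pose_swap, pose_target = [10.0, 0.0, 5.0], [12.0, 0.0, 5.0]
print(euclidean_distance(pose_swap, pose_target))  # 2.0: a small pose error
```

A higher cosine similarity between the swap and the source indicates better identity retrieval, while lower Euclidean distances indicate that the target’s pose and expression have survived the swap.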

The rich facial data in the DAD-3DHeads dataset. Source: https://arxiv.org/pdf/2204.03688.pdf

Inspired by 2022’s DFA-NeRF, the authors used the 2018 initiative Fine-Grained Head Pose Estimation Without Keypoints to evaluate additional pose metrics, and Google’s 2019 research Compact Embedding for Facial Expression Similarity for further expression metrics.

Additionally, a variation on the CelebA-HQ test split was used for further analysis of identity preservation from the source data, and ID removal from the target image (i.e., how ‘pure’ and uncorrupted the face swap is).

The initial comparison to prior approaches pitted WSC-Swap against standard implementations of FSGAN, SimSwap, InfoSwap, MegaFS, Arithmetic Face Swapping (AFS), and Uniface.

Initial quantitative results in comparison to comparable recent frameworks.

Of these results, the authors state:

‘[Our] method achieves superior performances in terms of ID retrieval, pose error, and expression error. When compared with [InfoSwap] which has strong ID swap ability, our results are still significantly better on all three metrics.’

The authors note that SimSwap achieves a lower expression error, but that WSC-Swap offers better ID retrieval and lower pose errors, and provide a visualization of the general performance of the top eight methods:

The authors compare face-swapping performance on the FF++ dataset, with each axis depicting the performance of a metric. Larger polygons indicate better performance, and the mean illustration places the new method as the most effective overall.

For qualitative evaluation, the paper offers a comprehensive comparison of final output across the methods. Though this is featured below, please see the original paper for better resolution (the other results from this stage of testing are shown as the first image in this article).

Qualitative comparisons. Please refer to the original paper for better resolution. The top-left corner represents the source image, and the other faces are target images.

The authors observe ‘[Our] results are noticeably better in terms of ID consistency across various target images’.

A user study was also conducted, though extensive details of this are presumably to be found in the forthcoming supplementary material.

Results from the user study.

The study evaluated target non-ID preservation (the extent to which the target identity was completely overwritten), source-ID similarity (the extent to which the superimposed identity entirely survived the face-swapping procedure), and full image fidelity (the extent to which the image is free of artifacts, and other indications of being a generative rather than a ‘real’ image).

Of this, the authors observe: ‘The results indicate that our method significantly surpasses prior works on overall face swapping quality. Refer suppl. material for more details’.

The paper concludes:

‘[We] unveil that the skip connection that was widely used in prior works is one root cause for poor disentanglement between ID and non-ID representation. We proposed a new framework to address this issue from both network structure and regularization loss perspectives.

‘The experimental results confirm both our hypothesis and the effectiveness of our method.’

Conclusion

Though the authors of the new work are not the first to question the necessity of skip connections in a U-Net, WSC-Swap, if it proves to be a valid approach in the long-term, may hold implications not only for other types of image synthesis network, but also for other kinds of convolutional network, where alternatives to skip connections could potentially yield equally interesting and innovative results.

The general run of the disentanglement literature in the image synthesis research sector rarely proposes an architectural or conceptual modification as radical as the authors put forward in the new work. WSC-Swap stands out from the crowd in terms of boldly tackling this most challenging and crucial roadblock to effective neural identity-swapping.

It’s a shame that the supplementary material is laggard; but the work offers an approach that may well be worth pursuing further, and in broader contexts.

* My conversion of the authors’ inline citations to hyperlinks.
Such as ensuring that facial landmarks do not smuggle in facial topography information.
