Researchers from China have discovered that entirely removing one of the apparently essential core technologies in image synthesis and generative systems can notably improve the quality of facial deepfakes.

By removing skip connections (which we’ll get to in a moment) from a re-imagined deepfaking framework, the researchers were able to significantly outperform rival state-of-the-art face-swapping systems.

Referring to the thorny problem of entanglement in image synthesis, and the 2020 FaceShifter framework (a collaboration between Peking University and Microsoft that is central to the new paper), the authors assert*:
‘We noticed that in [FaceShifter] and most follow-up works [InfoSwap, Smooth-Swap, and Hififace], skip connections play an important role in preserving non-identity attributes.
‘Hence, we delve deep into this design. After diagnostic evaluations, we realized that because shallow convolutional features contain both ID and non-ID information, the skip connection that introduces shallow features to image decoding could be the “reason” behind such entanglement.’
Let’s now take a look at the researchers’ new insights and (quite revolutionary) approach to one of the most persistent obstacles in the development of neural human synthesis systems.
The new paper is titled Reinforced Disentanglement for Face Swapping without Skip Connection, and comes from five researchers at Xiaobing.AI.
Skipping Skip Connections
In a Deep Neural Network (DNN), a skip connection is essentially a shortcut from a shallow layer to a deep layer.
The data in transit, by analogy, is taking a bus that is blazing past several scheduled stops.

Skip connections build a bridge between the input of a convolutional block (aka a ‘residual module’) and its output, entirely avoiding the ‘scenic route’.
One supposed advantage of this is that information that has been gathered at the start of this journey arrives at the end-point uncompromised.
If the same information had taken the ‘complete’ and unexpurgated route through the network, it might have become notably altered by the time it arrived at the destination, because the network is by nature transformative and interpretive.
Thus, on the plus side, skip connections allow important low-level information to remain intact at the later processing or generative stage. This means the training data can exert a more direct and useful influence on the output.
But it can also mean that information one would actually like to get rid of becomes persistent and entangled in other, non-related processes.
So skip connections are architecturally ‘amoral’: they’re either helping or getting in the way, depending on what you want to achieve.
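As an illustration of the principle (not code from the paper), here is a minimal PyTorch sketch of a convolutional residual block. The skip connection is the single addition in `forward`; the `use_skip` flag is purely a device of this example, showing what removing the shortcut entails:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A minimal convolutional block with an optional skip connection."""

    def __init__(self, channels: int, use_skip: bool = True):
        super().__init__()
        self.use_skip = use_skip
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.body(x)    # the 'scenic route': transformed features
        if self.use_skip:
            out = out + x     # the 'bus': the input re-joins, unaltered
        return out

x = torch.randn(1, 64, 32, 32)
with_skip = ResidualBlock(64, use_skip=True)(x)
without_skip = ResidualBlock(64, use_skip=False)(x)
```

With `use_skip=False`, the input can only reach the output via the convolutions, so nothing arrives uncompromised; this is precisely the property that the new paper exploits.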
The new paper argues that skip connections are indeed obstructive to neural identity transfer, and that they are one of the central causes of identity bleed – the inability of a deepfake architecture to completely ‘overwrite’ the target content, so that the original identity ‘bleeds through’ and spoils the deepfaking effect. Removing skip connections, the paper contends, leads to improved results and better face-swaps.
This kind of manipulation of skip connections is common practice in the Stable Diffusion community, where the CLIP Skip feature can help to improve image quality outcomes, by over-leaping the extreme granularity of information that can occur in some terms interpreted by the CLIP encoder.
As per the principles outlined above, this preserves the ‘purity’ of an idea present in a text-prompt: the concept is not dragged through the complexity of irrelevant layers, nor welded irretrievably into the prompt’s other associated semantics, where it could be mutilated or ‘over-cooked’ on the way to the final generated output.
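As a rough sketch of how this works in practice (not the code of any particular Stable Diffusion front-end), a ‘CLIP skip’ of 2 is commonly realized by reading the text encoder’s penultimate hidden state rather than its final one, for example via the Hugging Face transformers library:

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

tokens = tokenizer("a portrait photo of a woman", return_tensors="pt")
with torch.no_grad():
    outputs = text_encoder(**tokens, output_hidden_states=True)

# 'CLIP skip = 2' (in the convention popularized by community front-ends):
# take the penultimate layer's hidden state, skipping the final block's
# extra round of semantic entanglement, then apply the final layer norm.
clip_skip = 2
hidden = outputs.hidden_states[-clip_skip]
embeddings = text_encoder.text_model.final_layer_norm(hidden)
```

The resulting `embeddings` tensor would then be fed to the denoising model in place of the encoder’s default output.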
Method
As we recently discussed at some length, the common run of deepfake applications, since the technology burst on the scene in late 2017, has used a shared encoder to store features that are derived from source data for two identities during the training of the model.

Under this schema, skip connections are used to negotiate the various pose/identity facets drawn from the training data. These facets of information are all mixed together in a central ‘soup’, with skip connections used to address the necessary aspects of the data.

The researchers for the new paper have devised an alternate scheme which abandons the idea of a shared encoder and gives dedicated and separate encoder space to each identity:

The new approach is called WSC-Swap (with ‘WSC’ presumably standing for Without Skip Connections, though this is not made explicit in the paper).
The authors state:
‘[Instead] of using skip connections, we propose a novel framework consisting of a Facial Non-ID (FNID) network and a Non-Facial attributes (NFA) network to perform face swapping. The skip-connection-free framework exhaustively preserves target non-identity information, while at the same time preventing the target facial identity from leaking into the swap image decoder.’
The novel modules created for WSC-Swap consist of a Facial Non-ID (FNID) network and a Non-Facial attributes (NFA) network, which together perform the face-swapping.

The authors observe that without skip connections, the system is able to exhaustively preserve the target’s non-identity attributes, while preventing the target identity (i.e., the one that will be overwritten) from leaking through into the final output.
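The paper provides no reference code, but the arrangement described can be sketched in broad strokes: a source ID vector (from, say, a frozen face-recognition model) is concatenated with the outputs of two dedicated target encoders, standing in here for the FNID and NFA networks, and decoded without any shortcut back to shallow target features. All shapes, names, and layer choices below are illustrative assumptions, not the authors’ actual design:

```python
import torch
import torch.nn as nn

def conv_encoder(out_dim: int) -> nn.Module:
    # Illustrative stand-in for the paper's encoders; the real networks
    # are considerably more elaborate.
    return nn.Sequential(
        nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(128, out_dim),
    )

class WSCSwapSketch(nn.Module):
    """Hypothetical outline of a skip-connection-free swap pipeline."""

    def __init__(self, id_dim: int = 512, attr_dim: int = 256):
        super().__init__()
        self.fnid = conv_encoder(attr_dim)  # facial non-ID attributes (pose, expression)
        self.nfa = conv_encoder(attr_dim)   # non-facial attributes (background, hair)
        self.decoder = nn.Sequential(       # decodes from compact codes only
            nn.Linear(id_dim + 2 * attr_dim, 128 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (128, 8, 8)),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, source_id: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # No shallow target features reach the decoder, so the target's
        # identity has no shortcut path through which to 'bleed'.
        code = torch.cat([source_id, self.fnid(target), self.nfa(target)], dim=1)
        return self.decoder(code)

swap = WSCSwapSketch()
source_id = torch.randn(1, 512)          # e.g., from a frozen recognition model
target = torch.randn(1, 3, 256, 256)
fake = swap(source_id, target)           # 1x3x32x32 in this toy version
```

The essential design point is that the decoder sees only compressed codes, never the encoders’ intermediate feature maps.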
The paper notes that prior works (such as SimSwap, the aforementioned InfoSwap, and Hififace, among many others) have investigated diverse methodologies for the disentanglement of identity and non-identity attributes†, but have all failed to achieve three central goals outlined in the new work: 1) the preservation of the source identity; 2) the disentangled preservation of non-identity material (such as locations and clothing); and 3) output which is free from artifacts, and measurably and perceptibly photorealistic.

The authors examined these prior structures, observing that all contained a bottleneck encoder/decoder structure (as pictured earlier), and recognized that this central architectural limitation is almost certain to lead to ‘misclassified’ information (i.e., either identity or non-identity information) being delivered to the wrong place, due to the lack of discrete pipelines for these separate facets at training and synthesis time.
In this under-tooled scenario, the authors note, the bottleneck encoder/decoder architecture is burdened not only with fully removing the facial identity from its context (everything that is not directly ID-related), but also with preserving all of that remaining context, even though permanent and transient facets are essentially ‘fused’ (‘eye direction’, for instance, is not a defining characteristic of a face, but describes a pose, whereas ‘eye color’ is indeed linked with identity – yet both are handled by the same internal processes).
The researchers state:
‘[We] argue that it is very difficult, if not impossible, to simultaneously achieve the above two goals using only one single compressed bottleneck [encoder]. In addition, the skip connections used in the decoder would inevitably bring the target identity information into the results together with other non-identity attributes, therefore, further hurting the disentanglement learning.’
Data and Tests
To test WSC-Swap, the authors used three facial image datasets to train the system: CelebA-HQ, FFHQ, and VGGFace. Since each of these datasets is differently configured, subsets and derivations were used selectively, to ensure parity in testing. The researchers followed the testing methodology of FaceShifter, with the training images cropped to 256x256px.
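As a trivial stand-in for this preprocessing step (the actual FaceShifter pipeline aligns faces via detected landmarks before cropping, which is omitted here, and the file path is hypothetical), the stated 256x256px preparation might look like this in torchvision:

```python
from PIL import Image
from torchvision import transforms

img = Image.open("face.jpg").convert("RGB")   # hypothetical input image

preprocess = transforms.Compose([
    transforms.CenterCrop(min(img.size)),     # naive square crop; real pipelines align first
    transforms.Resize((256, 256)),            # the paper's stated training resolution
    transforms.ToTensor(),
    transforms.Normalize([0.5] * 3, [0.5] * 3),  # map pixel values to [-1, 1]
])

tensor = preprocess(img)                      # 3x256x256 training-ready tensor
```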
For metrics and evaluation data, the authors used the FaceForensics++ dataset, reporting three metrics: ID retrieval, pose error, and expression error.
The researchers note that recent literature has frequently leveraged the 2018 CosFace model as the ID vector extractor:

However, since popular recent methods vary in the way that they estimate expression and pose errors, the authors instead use the 2022 DAD-3DHeads dataset for pose and expression prediction, using Euclidean distance to quantify expression and pose errors.
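The paper’s exact evaluation code is not given, but both kinds of metric reduce to simple operations: ID retrieval asks whether a swapped face’s embedding (from a recognition model such as CosFace) is closest, by cosine similarity, to the embedding of its true source, while pose and expression errors are Euclidean distances between the parameter vectors predicted for the swap and for the target. A schematic version, with random tensors standing in for real embeddings and predictions:

```python
import torch
import torch.nn.functional as F

def id_retrieval_accuracy(swap_emb: torch.Tensor, source_emb: torch.Tensor) -> float:
    """swap_emb, source_emb: (N, D) ID embeddings, e.g. from CosFace.
    A swap 'retrieves' its ID if its most cosine-similar source
    embedding is the one it was actually swapped from."""
    sims = F.normalize(swap_emb, dim=1) @ F.normalize(source_emb, dim=1).T  # (N, N)
    return (sims.argmax(dim=1) == torch.arange(len(swap_emb))).float().mean().item()

def pose_error(swap_pose: torch.Tensor, target_pose: torch.Tensor) -> float:
    """Mean Euclidean distance between predicted pose vectors (lower is better)."""
    return (swap_pose - target_pose).norm(dim=1).mean().item()

# Toy usage with random stand-ins for real model outputs:
swaps, sources = torch.randn(8, 512), torch.randn(8, 512)
print(id_retrieval_accuracy(swaps, sources))
print(pose_error(torch.randn(8, 3), torch.randn(8, 3)))
```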

Inspired by 2022’s DFA-NeRF, the authors used the 2018 work Fine-Grained Head Pose Estimation Without Keypoints to evaluate additional pose metrics, and Google’s 2019 paper Compact Embedding for Facial Expression Similarity for further expression metrics.
Additionally, a variation on the CelebA-HQ test split was used for further analysis of identity preservation from the source data, and ID removal from the target image (i.e., how ‘pure’ and uncorrupted the face swap is).
The initial comparison to prior approaches pitted WSC-Swap against standard implementations of FSGAN, SimSwap, InfoSwap, MegaFS, Arithmetic Face Swapping (AFS), and Uniface.

Of these results, the authors state:
‘[Our] method achieves superior performances in terms of ID retrieval, pose error, and expression error. When compared with [InfoSwap] which has strong ID swap ability, our results are still significantly better on all three metrics.’
The authors note that SimSwap achieves a lower expression error, but that WSC-Swap offers better ID retrieval and lower pose errors, and provide a visualization of the general performance of the top eight methods:

For qualitative evaluation, the paper offers a comprehensive comparison of final output across the methods. Though this is featured below, please see the original paper for better resolution (the other results from this stage of testing are shown as the first image in this article).

The authors observe: ‘[Our] results are noticeably better in terms of ID consistency across various target images’.
A user study was also conducted, though extensive details of this are presumably to be found in the forthcoming supplementary material.

The study evaluated target non-ID preservation (the extent to which the target identity was completely overwritten), source-ID similarity (the extent to which the superimposed identity entirely survived the face-swapping procedure), and full image fidelity (the extent to which the image is free of artifacts, and other indications of being a generative rather than a ‘real’ image).
Of this, the authors observe: ‘The results indicate that our method significantly surpasses prior works on overall face swapping quality. Refer suppl. material for more details’.
The paper concludes:
‘[We] unveil that the skip connection that was widely used in prior works is one root cause for poor disentanglement between ID and non-ID representation. We proposed a new framework to address this issue from both network structure and regularization loss perspectives.
‘The experimental results confirm both our hypothesis and the effectiveness of our method.’
Conclusion
Though the authors of the new work are not the first to question the necessity of skip connections in a U-Net, WSC-Swap, if it proves to be a valid approach in the long-term, may hold implications not only for other types of image synthesis network, but also other kinds of convolutional network, where alternatives to skip connections could potentially yield equally interesting and innovative results.
The general run of the disentanglement literature in the image synthesis research sector rarely proposes an architectural or conceptual modification as radical as the authors put forward in the new work. WSC-Swap stands out from the crowd in terms of boldly tackling this most challenging and crucial roadblock to effective neural identity-swapping.
It’s a shame that the supplementary material lags behind the paper; but the work offers an approach that may well be worth pursuing further, and in broader contexts.
* My conversion of the authors’ inline citations to hyperlinks.
† Such as ensuring that facial landmarks do not smuggle in facial topography information.