In the field of image synthesis, entanglement is the ‘enmeshing’ of data properties with adjacent or intrinsically-related properties. This can make it challenging to isolate a particular aspect of an image: change one property, and you end up changing other facets of the generation along with it:

In AI-driven human synthesis, architectures such as Generative Adversarial Networks (GANs), Neural Radiance Fields (NeRF), latent diffusion and autoencoders are all affected to some extent by entanglement, and all sub-sectors of research related to these technologies are actively investigating ways to ‘split up’ constituent parts or traits of neural representations.
Here, ‘traditional’ CGI techniques offer a massive advantage, since every single component in CGI imagery is essentially ‘hand-crafted’ and discrete.

If anything, the problem is reversed with CGI, and more related to the challenge of orchestrating the disparate contributing facets (such as hair and cloth physics, texturing, lighting and body motion) into a cohesive and natural representation.
Conversely, a system such as NeRF gathers every single piece of contributing data in one momentary and unordered ‘blast’, from a limited series of photos:

To accomplish this, the NeRF acquisition pipeline shoots ‘virtual rays’ through each pixel of the source images, simultaneously estimating the geometry needed to recreate the subject in 3D – an advanced form of photogrammetry, where the ‘missing’ source views are estimated from the available views, creating a complete 3D interpretation:
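For the technically curious, the sketch below is a minimal NumPy illustration of that per-pixel ray-marching and volume-rendering step; the ‘radiance_mlp’ function is a stand-in for a trained NeRF network, and the camera conventions and parameters are purely illustrative rather than taken from any particular implementation:

```python
import numpy as np

def get_rays(height, width, focal, cam2world):
    """Build one ray (origin, direction) per pixel from a pinhole camera model."""
    i, j = np.meshgrid(np.arange(width), np.arange(height), indexing='xy')
    # Camera-space directions, using the common OpenGL-style convention
    dirs = np.stack([(i - width * 0.5) / focal,
                     -(j - height * 0.5) / focal,
                     -np.ones_like(i, dtype=np.float64)], axis=-1)
    rays_d = dirs @ cam2world[:3, :3].T                    # rotate directions into world space
    rays_o = np.broadcast_to(cam2world[:3, 3], rays_d.shape)
    return rays_o, rays_d

def render_pixel(ray_o, ray_d, radiance_mlp, near=2.0, far=6.0, n_samples=64):
    """March along one ray, query the network, and alpha-composite the result."""
    t = np.linspace(near, far, n_samples)
    points = ray_o + t[:, None] * ray_d                    # sample positions along the ray
    rgb, sigma = radiance_mlp(points, ray_d)               # per-sample color and density
    deltas = np.diff(t, append=t[-1] + 1e10)               # spacing between samples
    alpha = 1.0 - np.exp(-sigma * deltas)                  # opacity of each segment
    transmittance = np.cumprod(np.concatenate(([1.0], 1.0 - alpha[:-1] + 1e-10)))
    weights = alpha * transmittance                        # contribution of each sample
    return (weights[:, None] * rgb).sum(axis=0)            # final composited pixel color
```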

Loaded, But Locked
However, this is an indiscriminate and non-granular method of acquisition; the result is what CGI and video-game artists would call a ‘baked’ texture/poly map, where all editability has been removed.
If you need a ‘shinier’ surface, an altered geometry or even a change of lighting, there are no trivial or innate solutions that are native to neural technologies.
Likewise the geometry in a NeRF is ‘static’ by default, and lacks the joints and rigging that CGI animators have used with relative ease, and a high level of control, for thirty years – and which must, with great difficulty, somehow be recreated by other means.
This is not to say that NeRF offers only a static explorable representation; a whole slew of projects over the last 2-3 years has enabled the recording and reproduction of continuous and even ad hoc motion in NeRF representations:

However, in general, these mobile NeRF ‘replays’ are not directly editable either: whatever motion was captured at the time is what you have to work with, and the projects that are most flexible tend to run at the lowest resolution, or with problematic latency (i.e., the results do not render fast enough).
It must be admitted that within these parameters, one can perform mind-boggling transformations, such as nesting NeRFs inside other NeRF representations, and even mixing and matching different playback speeds:

But if you want to change the movement itself, it’s already too late; the motion was frozen into the data at the time of recording, in a system which has no intrinsic understanding of body movement, facial expressiveness, or hair and cloth dynamics – all facets that are routinely controllable in a CGI workflow.
Projects aimed at increasing access to NeRF content include Editing Conditional Radiance Fields, Learning Object-Compositional Neural Radiance Field for Editable Scene Rendering, EditNeRF, and the face-focused FENeRF:
In terms of disentangling at least the lighting from the captured material of a NeRF, projects that have made some headway include Neural Radiance Transfer Fields for Relightable Novel-view Synthesis with Global Illumination and Neural Radiance Fields for Outdoor Scene Relighting, among others.
Ungainly GANs
In terms of content editing, Generative Adversarial Networks are almost as difficult to navigate and control as NeRF. While the process of estimating the 3D geometry of a captured object is entirely native to NeRF, almost from the moment of data acquisition, GANs have no such innate mechanism, and are usually trained on entirely ad hoc, non-sequential data (such as hundreds of thousands of faces).
Though a GAN is well able to generalize the central traits of such diverse training data, and offers a power of invention that’s entirely missing in NeRF, this lack of native 3D understanding is an additional obstacle to creating even very limited movement.
In theory, GANs are potentially well-disposed towards disentanglement; the 2020 paper GANSpace: Discovering Interpretable GAN Controls, a collaboration between Aalto University, NVIDIA and Adobe, found that latent directions discovered through Principal Component Analysis (PCA), applied either in latent space or feature space, can provide a usable degree of control over disentangled aspects of the trained data.
Yet even the official video for this work (embedded directly below) demonstrates the extent to which non-targeted material is ‘dragged into’ targeted transformations.
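To illustrate the general principle (rather than GANSpace’s exact code), the sketch below samples latent codes from a pre-trained generator, runs PCA over them, and then nudges a single code along one of the discovered components; the ‘generator’ object and its ‘mapping’/‘synthesis’ methods are assumptions standing in for a real StyleGAN-type model:

```python
import numpy as np
from sklearn.decomposition import PCA

def find_latent_directions(generator, latent_dim=512, n_samples=10_000, n_components=20):
    """Discover candidate edit directions by running PCA over sampled latents.
    'generator' is a stand-in for a pre-trained StyleGAN-style model."""
    z = np.random.randn(n_samples, latent_dim)
    w = generator.mapping(z)                     # intermediate latent codes (W space)
    pca = PCA(n_components=n_components)
    pca.fit(w)
    return pca.components_                       # each row is one candidate direction

def edit_image(generator, w, directions, component_idx, strength):
    """Shift a latent code along one principal direction and re-synthesize.
    In practice, correlated attributes tend to be dragged along with the edit."""
    w_edited = w + strength * directions[component_idx]
    return generator.synthesis(w_edited)
```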
The GAN research scene has been saturated with disentanglement projects over the last 3-4 years, none of which has produced a definitive breakthrough, and most of which leverage third-party technologies such as 3DMM (discussed below).
Approaches to GAN disentanglement include ByteDance’s use of (superimposed) semantic segmentation in its recent paper SemanticStyleGAN: Learning Compositional Generative Priors for Controllable Image Synthesis and Editing.
Grad-CAM
One technology that features frequently in GAN-centered disentanglement is Gradient-weighted Class Activation Mapping (Grad-CAM), a 2016 initiative from the Georgia Institute of Technology.
Grad-CAM uses the gradients of any target concept (such as ‘dog’) flowing into a late convolutional layer of a network to produce a coarse localization map, highlighting the regions of the image that relate to that concept.
The ‘heat maps’ produced by the system essentially repurpose signals originally intended to improve a network’s training process – signals which, in a GAN, the discriminator can use to tell the generator component how well it did on its previous attempt at reconstruction.
In Grad-CAM, instead, these pathways are used to create activation visualizations rather than to drive optimization.
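As a rough guide to the mechanism (shown here for a standard image classifier, rather than any of the GAN pipelines discussed in this article), the PyTorch sketch below hooks a chosen convolutional layer, back-propagates the score for a target class, and pools the resulting gradients into a heat map:

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx):
    """Produce a coarse Grad-CAM heat map for one target class.
    'model' is any CNN classifier; 'image' is a (1, 3, H, W) tensor."""
    activations, gradients = [], []
    fwd = target_layer.register_forward_hook(lambda m, inp, out: activations.append(out))
    bwd = target_layer.register_full_backward_hook(lambda m, gin, gout: gradients.append(gout[0]))

    logits = model(image)
    model.zero_grad()
    logits[0, class_idx].backward()              # gradient of the target concept's score

    acts, grads = activations[0], gradients[0]
    weights = grads.mean(dim=(2, 3), keepdim=True)            # global-average-pool the gradients
    cam = F.relu((weights * acts).sum(dim=1, keepdim=True))   # weighted sum of activation maps
    cam = F.interpolate(cam, size=image.shape[2:], mode='bilinear', align_corners=False)

    fwd.remove()
    bwd.remove()
    return (cam / (cam.max() + 1e-8)).squeeze()  # normalized (H, W) heat map
```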

Grad-CAM is a relatively coarse tool (see the 2016 presentation video here); nonetheless, it’s one of the most popular of a very limited range of available latent space mapping libraries and frameworks, and features in a number of GAN disentanglement research projects.
For instance, in 2021 a research group led by the Chinese University of Hong Kong leveraged Grad-CAM in its paper Improving GAN Equilibrium by Raising Spatial Awareness (see video below), which enabled a user interface that lets an end-user ‘scrub through’ the latent space of a GAN.
Though morphing through churches and adjusting the angles of cats is amusing, we can see that the ‘editing’ enabled by the research essentially allows only minor adjustments, or the registration of transitional states between existing ‘frozen’ latent codes, rather than real access to the central disentangled assets of the network.
Grad-CAM has also been used to repair GAN-generated faces, for generalizing adversarial explanations, visualizing deep networks, and as an explanatory tool for the decisions made by the YOLO object detection series. It is also a popular tool for generating saliency maps in medical research, among other applications.
Grad-CAM is one of the very few ‘purely neural’ solutions to disentanglement – most current GAN-based image synthesis approaches are leaning towards CGI-based interface solutions (see Faux Disentanglement Through CGI below).
Grad-CAM’s ‘paper trail’ approach (i.e., ‘marking’ the path of the data as it enters the network) has also been used in BlobGAN, which allows the user to move ‘pre-marked’ sections of trained data around by manipulating objects in a grid:

Disentanglement in Latent Diffusion Generative Systems
The extent to which entanglement affects latent diffusion systems such as Stable Diffusion is currently a major research obsession, with 4-8 papers emerging weekly at the time of writing, each offering new solutions that attempt to isolate facets of a generated image for discrete editing.
One recent approach, from Zhejiang University, proposes a two-stage framework, called text-guided mask-free local image retouching, which converts a text token (e.g., ‘bear’) recognized in a generated image into an addressable object that can be isolated for editing purposes:

A collaboration between the University of Rochester and Adobe Research also recently proposed Structure-Guided Image Completion with Image-level and Object-level Semantic Discriminators – a system that imposes an object-level discriminator framework which, again, can isolate those annoyingly entangled facets of a generated image.

In the same week, a German collaboration, with involvement from LAION, offered a Semantic Guidance (SEGA) system dubbed The Stable Artist. The system operates directly in the latent space of a diffusion-based generative system, organizing concepts and orchestrating appropriate paths through the latent directions related to those concepts, without dragging non-targeted facets in as collateral damage:

These are just three examples from a single week in December of 2022. Other recent proposals include SmartBrush, the Google-backed project Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis, Sony’s initiative Fine-grained Image Editing by Pixel-wise Guidance Using Diffusion Models, and the Meta AI-backed Shape Guided Diffusion with Inside-Outside Attention.
And, if any further evidence of the academic fervor around latent diffusion disentanglement were necessary, we still haven’t cited any projects that appeared outside of December 2022.
Don't Label Me
The quality of Stable Diffusion’s generated images is closely related to the standards and characteristics of the captions processed through the CLIP-based (now OpenCLIP, since SD V2.0) text-encoding mechanism. Therefore, if a picture of a human face has been trained into Stable Diffusion with minimal captioning (‘woman’, ‘man’, etc.), the entanglement is clear, since the caption does not even specify ‘face’.
In effect, however, the trained SD network still understands that ‘eyes’ (for instance) are normal components of a face, and can transfer its more deeply-ingrained knowledge about faces to the under-captioned (or even miscaptioned) face image.
The problem is that so many of the other contributing images in the training data are also not optimally captioned. What for us would be an image containing a collection of facial features (‘a young female face with full lips, blue eyes and a short nose’) can become essentially a ‘bag of pixels’ for Stable Diffusion – one that’s hard for the system to pick apart or make editable at a semantic level, if the text data is lacking.
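The conditioning path itself is easy to inspect. The sketch below, using the Hugging Face transformers library and the CLIP text encoder employed by SD 1.x (V2.0+ swaps in an OpenCLIP model), encodes a sparse caption and a rich caption into identically-shaped conditioning tensors – the difference lies purely in how much semantic structure the denoiser is given to work with:

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# The text encoder used by Stable Diffusion 1.x (V2.0+ uses an OpenCLIP model instead)
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

def encode_caption(caption):
    tokens = tokenizer(caption, padding="max_length",
                       max_length=tokenizer.model_max_length,
                       truncation=True, return_tensors="pt")
    with torch.no_grad():
        return text_encoder(input_ids=tokens.input_ids).last_hidden_state  # (1, 77, 768)

sparse = encode_caption("woman")
rich = encode_caption("a young female face with full lips, blue eyes and a short nose")
# Both tensors have the same shape; the sparse caption simply gives the
# denoising U-Net far less semantic structure to attach to regions of the image.
```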

Going to Seed
To date, much of the research into maintaining compositional stability for Stable Diffusion editing has relied on relatively coarse and abstract mechanisms, such as the random seed functionality.
The random seed in an SD generation represents a unique and ad hoc path through the many possible ways that the latent diffusion architecture might interpret a user’s text prompt. If you take note of the seed that was used in a prior generation, and deliberately use that seed for a subsequent generation, without changing any other parameters, it’s usually possible to reproduce the original image exactly.
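A minimal sketch of this behavior, using the Hugging Face diffusers library (the checkpoint name and prompt here are illustrative), shows that re-using a seed fixes the initial latent noise, and therefore reproduces the composition:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")

prompt = "a photo of a red-brick farmhouse at sunset"    # illustrative prompt

# The generator's seed determines the initial Gaussian latent noise.
generator = torch.Generator("cuda").manual_seed(1234)
first = pipe(prompt, generator=generator, num_inference_steps=30).images[0]

# Re-seeding with the same value (and identical parameters) recreates the image;
# changing the prompt, resolution, sampler or step count usually breaks it.
generator = torch.Generator("cuda").manual_seed(1234)
second = pipe(prompt, generator=generator, num_inference_steps=30).images[0]
```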
However, this is a fragile method of composition preservation, and practically any change is likely to break it:

The brittleness of the seed is proving a major impediment to temporal coherence in video generated with Stable Diffusion, and the difficulty of changing any minor aspect of a seed-driven generation (i.e., one where you specify the prior seed from a generation that you would like to modify) illustrates the extent to which entanglement restricts editability in Stable Diffusion, compared to traditional digital workflows.
At the time of writing, a new paper – from UC Santa Barbara, Adobe Research, and the MIT-IBM Watson AI Lab – offers a seed-driven approach to disentanglement that hinges on revising the input text embeddings from a neutral description (such as ‘photo of a person’) to a stylized description (such as ‘a photo of a person with smile’) while fixing the Gaussian noise generated during the denoising process; this preserves the semantic content of the image while allowing for modification of specific content inside that image.

The new approach only optimizes around 50 parameters, and does not require fine-tuning (i.e., resuming model training with additional data, which can damage the overall generative capabilities of the model). Though the open source code for the model generalizes well to unseen data (i.e., you can use it on any image, and don’t have to ‘teach’ the system to adapt to images you want to edit), the results presented are not entirely free of entanglement artifacts.
Depth Maps as Boundaries for Editing in Stable Diffusion
The recently-released Stable Diffusion V2.0 introduced a new and promising method of disentanglement based on depth maps.

The system, titled Depth2img, uses Intel’s MiDaS library to generate depth maps as faux 3D bounding areas outlining which parts of the image content should be addressed (i.e., edited, or in some way altered), allowing for the modification of discrete areas in otherwise reproducible images.
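In practice, the feature is exposed through a dedicated pipeline in the Hugging Face diffusers library; the sketch below (with a hypothetical source image file) passes an existing image in, lets the MiDaS-derived depth constrain the layout, and allows the prompt to restyle the content:

```python
import torch
from PIL import Image
from diffusers import StableDiffusionDepth2ImgPipeline

pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth", torch_dtype=torch.float16).to("cuda")

init_image = Image.open("man_on_stairs.png")     # hypothetical source image

# The pipeline estimates a MiDaS depth map from init_image and uses it to
# preserve the spatial layout while the prompt redefines the content.
result = pipe(
    prompt="a bronze statue of a man sitting on stone steps",
    image=init_image,
    negative_prompt="blurry, deformed",
    strength=0.7,
).images[0]
result.save("restyled.png")
```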
As it stands, we can see from the bottom row of official example images in the illustration above that the depth map itself is entangled, in that the image of the man sitting on the stairs includes the environment; therefore the stairs themselves are transformed together with the conceptual transformations of the man. Presumably, this could be further addressed by the use of semantic segmentation, or other techniques designed to isolate specific elements inside depth maps.
Though depth-based mapping of this type is a notable step forward in disentangling edited content from overall compositionality, it should be noted that it does not enable the isolation of individual, object-level (or character-level) attributes.
An alternative depth-based approach has, at the time of writing, just been suggested by Korea University. Titled DAG: Depth-Aware Guidance with Denoising Diffusion Probabilistic Models, the approach of the new paper differs from the SD V2.0+ Depth2Img method in that it incorporates depth-aware guidance directly into the sampling process as an unconditional generation*, rather than using a third-party library to extract depth-map information from the final result.

Additionally, unlike the native depth-map functionality in Stable Diffusion V2.0+, the Korean system can automatically generate normal maps – a traditional CGI technique in which the colors of an additional texture image encode surface orientation, allowing fine geometric detail to be simulated at render time without adding any explicit 3D geometry.
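As a generic illustration of what a normal map encodes (and not the DAG paper’s own method), surface normals can be approximated directly from a depth map via finite differences and packed into the familiar purple-blue RGB encoding:

```python
import numpy as np

def depth_to_normal_map(depth):
    """Approximate a normal map from a single-channel depth map.
    Purely illustrative; production pipelines usually derive normals from geometry."""
    dz_dx = np.gradient(depth, axis=1)           # horizontal depth change
    dz_dy = np.gradient(depth, axis=0)           # vertical depth change
    normals = np.dstack([-dz_dx, -dz_dy, np.ones_like(depth)])
    normals /= np.linalg.norm(normals, axis=2, keepdims=True)
    # Remap unit normals from [-1, 1] to the conventional 0-255 RGB encoding
    return ((normals * 0.5 + 0.5) * 255).astype(np.uint8)
```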
Faux Disentanglement Through CGI-Based Approaches
The ideal situation for neural modeling and representation would be to gain a better mastery of the latent space generated for the model during training – to know where all the relevant information is (i.e., traits such as ‘blond’, ‘old’, ‘male’, ‘female’, etc.) and to learn how to adroitly combine or negate these qualities according to the desired result.
In the video embedded below, we see one such ‘pure’ approach, in a 2021 collaboration led by Adobe, where the user can perform real-time attribute editing – however, we can note that the application is restricted to individual images, and that the temporal element is missing; though we can explore changed static scenes, we are not seeing actual movement in the face representations:
In practice, as we have seen, trait elements tend to enter the model ‘pre-fused’, so that additional technologies and approaches are needed to disentangle them – if, indeed, that’s an achievable goal at all.
Therefore a notable new trend in research over the past four years or so has concentrated on using traditional CGI techniques to help ‘split’ and control the diverse facets in a trained model.
Most of these have used 3D Morphable Models (3DMMs) – a technique that uses a CGI model as a parametric mapping interface, allowing creators to work with traditional and familiar, controllable tools while leveraging the superior representative qualities of neural representations.
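Conceptually, a 3DMM is a statistical mesh: a mean shape plus linear combinations of learned identity and expression bases, with each coefficient corresponding to a human-readable axis of variation. The sketch below uses placeholder arrays (real bases would come from a model such as the Basel Face Model or FLAME) to show how low-dimensional parameters become an editable mesh:

```python
import numpy as np

n_vertices = 35_709                               # illustrative face-mesh resolution
mean_shape = np.zeros(n_vertices * 3)             # flattened (x, y, z) per vertex
identity_basis = np.zeros((n_vertices * 3, 80))   # placeholder identity components
expression_basis = np.zeros((n_vertices * 3, 64)) # placeholder expression components

def morphable_face(identity_coeffs, expression_coeffs):
    """Turn low-dimensional, interpretable coefficients into a full mesh.
    This parametric layer is what gets bolted onto GANs and NeRFs as a
    familiar, controllable interface."""
    verts = (mean_shape
             + identity_basis @ identity_coeffs
             + expression_basis @ expression_coeffs)
    return verts.reshape(n_vertices, 3)

# Each coefficient controls one nominally disentangled axis of variation
# (jaw width, smile intensity, etc.) - the controllability raw latent spaces lack.
mesh = morphable_face(np.random.randn(80) * 0.1, np.random.randn(64) * 0.1)
```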

Leading research organizations such as the Max Planck Institute (a preeminent developer of new advances in CGI-to-neural interfaces) are developing CGI/NeRF-based systems which can restore some level of control over the qualities of a representation. These innovations include the Sparse Trained Articulated Human Body Regressor (STAR), a successor to its less capable but still popular Skinned Multi-Person Linear Model (SMPL) framework.
Though 3DMM approaches were initially dominant in GAN-based research, the NeRF research community eventually began to realize that applying CGI instrumentality to NeRF could be a valid method of developing more tractable and versatile generation systems:

The most prominent example of the ‘CGI concession’ in recent years has been Disney Research’s intense interest in the use of Morphable Face Models to control neural representations – for example, the MoRF project, which extends the 3DMM-style approach to create a framework that can generate diverse identities which can be puppeted through parametric controls.
MoRF also offers the user improved ability to control and generate fine-grained aspects of the rendering, such as diffuse and specular separation, as well as native (rather than inferred, or ‘guessed’) depth maps:

Beyond Defeat
Such approaches are comforting for a VFX industry that’s curious but circumspect about bleeding-edge AI technologies, and long since accustomed to high levels of control over the rendering pipeline; but, arguably, they also smack of defeat, signaling a concession to the opacity of the latent space – a concession that’s hopefully only temporary, and which may eventually yield to increased and ongoing efforts towards more ‘native’ and less hybrid solutions.
In the meantime, the extraordinary ability of GANs and other neural approaches to provide hyper-real facial output – and to overcome at a stroke the uncanny valley syndrome that has dogged CGI for over 30 years – has made the incorporation of neural workflows potentially worth the effort of creating intermediary systems; effectively, however, this reduces the role of machine learning to that of a very advanced texture-renderer amidst what is otherwise a relatively old set of technologies (and which come with their own pain-points and bottlenecks).
* Confirmed in an email to us of 20th December 2022, by the paper’s corresponding author, Gyeongnyeon Kim.