Researchers from Peking University have developed a neural face-editing system that’s capable of discretely editing facets of a human face, including identifying characteristics such as eye color, hair color and age, as well as makeup transfer and the editing of facial expressions.
The results demonstrated in the new publication are among the best we’ve seen for expression editing, in a research sub-sector that’s hamstrung by limited data sources and inadequate methodologies.
The new system, titled ChatFace, leverages the power of Large Language Models (LLMs) to achieve a more fine-grained and accurate re-representation of faces, by manipulating the association between words and images in a more powerful way than default methodologies such as CLIP, and by acting as an intermediary between a user’s intent and the parameters for transformation.
ChatFace allows the user to conduct a natural language dialogue with the system in order to request edits, and to refine the results of processed edits – a user interface approach that will be familiar to users of ChatGPT and the recent crop of advanced Natural Language Processing (NLP) systems that depend on LLMs.
The highly-disentangled system proposed by the researchers permits the user to apply arbitrary and concatenated changes to face images, including the addition or removal of glasses:
The edits are applied in the latent space of the model, where the fundamental and resolution-independent priors have been indexed, so that any changes made are applied at the most intrinsic layer of the embeddings, and not in pixel-space.
ChatFace leverages Diffusion Autoencoders (DAEs) to address some of the issues with entanglement that can emerge when attempting to change discrete parts of an image without unwittingly altering other parts of the photo, and offers a novel Stable Manipulation Strategy (SMS) that remediates some of the shortcomings of prior approaches.
While many of the current breed of expression-editing systems rely explicitly on the Facial Action Coding System (FACS) – whose shortcomings we have commented on before – ChatFace bundles in expression data as a facet of more general information about possible face-edits (i.e., age, eye color, hair color, etc.).
However, this does not directly create a novel or necessarily improved Facial Expression Recognition (FER)/Synthesis methodology, notwithstanding the apparent superiority of the published results, as we’ll see. Further, the system’s FER capabilities are only one part of a larger ambit for the project, whose efforts in regard to makeup transfer and ageing, while impressive, are more in line with the current general state-of-the-art.
The new paper is titled ChatFace: Chat-Guided Real Face Editing via Diffusion Latent Space Manipulation, and comes from six researchers at the university. Though the project has an associated dedicated website, it is a placeholder at the time of writing.
ChatFace comprises a multimodal interpretive system for face editing, consisting of an LLM, which is used as an interpreter for the user’s requests, and a DAE-powered pipeline for generative manipulation.
The input face image is initially encoded into the latent space via a residual mapping network, which obtains the base features from the face. Once these features have been decomposed into essential and discrete facets, the zedit state is achieved (see image right, above), and the ‘toolkit’ of features becomes available to later stages in the process.
After this, a noise-based latent code is obtained, which also distills the fundamental characteristics of the image. Here the researchers have leveraged prior work on DAEs from the Vidyasirimedhi Institute of Science and Technology in Thailand (which itself builds on the 2020 Stanford research paper Denoising Diffusion Implicit Models, or DDIMs).
Thereafter a compact multilayer perceptron (MLP) is used to infer directions for manipulation within the latent space, such as changes to makeup, ageing, glasses, expressions, etc.
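To make the idea of an MLP-inferred latent direction concrete, here is a minimal sketch in NumPy. It is not the authors' code: the latent size, hidden width, ReLU activation, the residual form z + strength * direction, and all function names are assumptions made for illustration, and the weights are random rather than trained.

```python
import numpy as np

LATENT_DIM = 512  # assumed size of the semantic latent code


def init_mlp(rng, dim=LATENT_DIM, hidden=256, layers=4):
    """Create random weights for a small MLP (untrained, for illustration only)."""
    sizes = [dim] + [hidden] * (layers - 1) + [dim]
    return [
        (rng.normal(0.0, 0.02, size=(m, n)), np.zeros(n))
        for m, n in zip(sizes[:-1], sizes[1:])
    ]


def predict_direction(params, z):
    """Map a latent code to an edit offset of the same shape."""
    h = z
    for i, (w, b) in enumerate(params):
        h = h @ w + b
        if i < len(params) - 1:
            h = np.maximum(h, 0.0)  # ReLU on hidden layers only
    return h


def edit_latent(params, z, strength):
    """Residual edit: move z along the predicted direction, scaled by strength."""
    return z + strength * predict_direction(params, z)


rng = np.random.default_rng(0)
params = init_mlp(rng)
z_sem = rng.normal(size=LATENT_DIM)  # stand-in for an encoded face
z_edit = edit_latent(params, z_sem, strength=0.5)
```

In this formulation a strength of zero returns the original code unchanged, and larger values push further along the same direction, which is one simple way to obtain graded edits such as a slight versus a pronounced change in age.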
The next problem to overcome is to stop the diffusion process from ‘over-committing’ in its early stages.
Essentially, when an image is being generated in multiple passes through a latent diffusion model, the early stages tend to form a ‘rigid’ foundation which can’t be changed much by the time the details begin to form. The process is notably ‘imaginative’ or ‘hallucinatory’ by default in these early phases, which can lead to text-based instructions (such as ‘make her hair curly’, etc.) ‘bleeding out’ into other facets of the generated image than were intended.
One can see this bleeding in action by imposing a color-based text prompt on a latent diffusion model: the palette.fm colorization platform, based on Stable Diffusion, imposes color into monochrome images by noise-based diffusion, and by parsing the recognized components in an image into a colorized equivalent.
However, when you begin to order the system to colorize particular facets of the image, that directive tends to end up enacted at more than just the restricted region that you specified:
Therefore the researchers for ChatFace have devised a Stable Manipulation Strategy (SMS) which conditions the temporal features of the diffusion model to align with the (text-based) semantic condition at each time step, instead of predominantly at the later stages of the process.
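The general idea of supplying a condition at every time step, rather than one fixed condition for the whole run, can be sketched as follows. This is only an illustration of per-step conditioning and should not be read as the paper's SMS: the sinusoidal embedding is a common diffusion convention, while the single-layer mapper, the tanh activation, the dimensions and every function name here are hypothetical.

```python
import numpy as np


def timestep_embedding(t, dim=16):
    """Sinusoidal embedding of an integer time step (a common diffusion convention)."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])


def conditioned_offset(weights, z, t):
    """Hypothetical mapper: the edit offset depends on both z and the time step."""
    features = np.concatenate([z, timestep_embedding(t)])
    return np.tanh(features @ weights)


def per_step_conditions(weights, z, strength, num_steps):
    """Return one edited latent per denoising step, from t = num_steps - 1 down to 0."""
    return {
        t: z + strength * conditioned_offset(weights, z, t)
        for t in reversed(range(num_steps))
    }


rng = np.random.default_rng(0)
z_sem = rng.normal(size=32)                         # toy-sized latent code
weights = rng.normal(0.0, 0.1, size=(32 + 16, 32))  # untrained, illustrative
conditions = per_step_conditions(weights, z_sem, strength=0.3, num_steps=8)
```

Because the offset is recomputed from the time-step embedding at each step, the early, structure-forming steps and the late, detail-forming steps each receive their own version of the semantic condition.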
To accommodate this process further, the researchers also developed three novel loss functions, including a face identity loss that draws inspiration from the ArcFace network and a CLIP-direction loss based around StyleGAN-NADA.
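The two named losses are commonly written as cosine-based terms, and a minimal sketch under that assumption follows. The vectors here are random stand-ins for the outputs of a face-recognition network and of CLIP's image and text encoders; the function names are hypothetical, and the paper's exact formulation and weighting (including its third loss, which is not described here) may differ.

```python
import numpy as np


def cosine(a, b, eps=1e-8):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


def identity_loss(id_src, id_edit):
    """1 - cosine similarity between face-recognition embeddings."""
    return 1.0 - cosine(id_src, id_edit)


def clip_direction_loss(img_src, img_edit, txt_src, txt_tgt):
    """1 - cosine between the image-space and text-space edit directions."""
    return 1.0 - cosine(img_edit - img_src, txt_tgt - txt_src)


rng = np.random.default_rng(0)
id_src = rng.normal(size=512)  # stand-in for an identity embedding
img_src, txt_src, txt_tgt = (rng.normal(size=512) for _ in range(3))

# An edit whose image embedding moves exactly along the text direction.
img_edit = img_src + (txt_tgt - txt_src)

print(identity_loss(id_src, id_src))
print(clip_direction_loss(img_src, img_edit, txt_src, txt_tgt))
```

The directional term rewards an edit for moving the image embedding in the same direction that the text embedding moves between a source and a target description, while the identity term penalizes drift in the face-recognition embedding.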
The central premise of ChatFace is to use LLMs for parsing the user’s intent, as derived from the user-supplied prompt, into a semantic formulation that’s most likely to achieve the intended result.
Effectively this constitutes a language-based semantic interpretive layer, with the LLM finding the best path between the features drawn from the user’s prompt, the features in the pre-trained model, and the matrix of these two represented by the features drawn from the specific input data (i.e., a face to be edited). The authors define this process as an Attribute Mapping Network.
Regarding the core mechanics of the process, the authors state:
‘The large language model takes a request from user and decomposes it into a sequence of structured facial attributes. We design a unified template for this task. Specifically, ChatFace designs three shots for editing intent parsing: desired editing attribute A, editing strength S, and diffusion sample step T.
‘To this end, we inject demonstrations to “teach” LLM to understand the editing intention, and each demonstration consist of a user’s request and the target facial attribute sequence.’
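The demonstration-based parsing described in the quote can be sketched as a few-shot prompt builder plus a parser for the structured reply. The paper's actual template is not reproduced here, so the JSON layout, the example requests, the numeric values and all names below are hypothetical; no API call is made, and a canned reply stands in for the LLM's response.

```python
import json
from dataclasses import dataclass


@dataclass
class EditIntent:
    attribute: str   # A: desired editing attribute
    strength: float  # S: editing strength
    steps: int       # T: diffusion sample steps


# Hypothetical demonstrations: each pairs a request with its target attribute sequence.
DEMONSTRATIONS = [
    ("Make her smile a little", {"A": "smile", "S": 0.3, "T": 20}),
    ("Give him very curly hair", {"A": "curly hair", "S": 0.9, "T": 20}),
]


def build_prompt(request):
    """Assemble a few-shot prompt that ends where the model should answer."""
    lines = ["Decompose the request into JSON with keys A, S and T."]
    for user_text, target in DEMONSTRATIONS:
        lines.append(f"Request: {user_text}")
        lines.append(f"Answer: {json.dumps(target)}")
    lines.append(f"Request: {request}")
    lines.append("Answer:")
    return "\n".join(lines)


def parse_reply(reply):
    """Turn the model's JSON reply into typed editing parameters."""
    data = json.loads(reply)
    return EditIntent(str(data["A"]), float(data["S"]), int(data["T"]))


prompt = build_prompt("Add glasses")
canned_reply = '{"A": "glasses", "S": 1.0, "T": 50}'  # stand-in for an LLM response
intent = parse_reply(canned_reply)
```

Keeping the reply in a fixed, machine-readable shape is what lets the downstream editing pipeline consume the LLM's output without further interpretation.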
Data and Tests
The code for Asyrp was used as the foundational Diffusion Autoencoder for the tests; in a vanilla install, it was also one of the systems against which ChatFace was tested.
The experiments comprised 54 text prompts formulated for facial images, covering hairstyles, style, glasses, gender, expressions, makeup, and various other alterations. Training used the Ranger deep learning optimizer at a (fairly high) learning rate of 0.2, with each attribute trained for 10,000 iterations at a batch size of 8, on eight NVIDIA 3090 GPUs, each with 24GB of VRAM. For the LLM, the GPT-3.5 turbo model was employed, accessed through the API supplied by OpenAI.
The aforementioned MLP consists of only four layers, with each text prompt trained just once, in order to perform semantic editing on the related attributes of a real input image.
Regarding the initial results in the qualitative tests (partial results seen in the image above, for expression editing and the adding of glasses), the authors state:
‘We observe that StyleCLIP struggles to faithfully reconstruct real images, and local attribute modifications result in unintended [change]. For [example], manipulating the “blue eyes” attribute also changes the girl’s clothing color to blue.
‘Furthermore, while DiffusionCLIP improves image reconstruction results of StyleCLIP, editing fine-grained facial attributes often affects the global visual features of the [image]. In contrast, our ChatFace perform efficient real image editing based on the input queries while preserving visual fidelity.’
The second, non-combative qualitative round involved editing diverse identity and facial attributes, such as makeup, hair and further expressions, and changing age and expression across a gradient of strengths, as seen in the image below.
Here the authors comment:
‘[ChatFace] successfully preserves the identity of the face and generates high-quality edited images. The diverse manipulation results showcase the robustness of our approach.’
It should be noted that the authors have ensconced some of the less convincing and bolder examples of ChatFace’s expression-editing in the depths of the supplementary section:
While it has to be conceded that these expression manipulation examples are quite crude, they are generally better than the current run of similar frameworks, whether FACS-based or not. The complexity of facial affect is, inevitably, boiled down to cartoon-like and painfully rudimentary facial signifiers for emotions, with ‘surprise’ and ‘sad’ particularly exaggerated and non-specific to the subject.
However, right now, in terms of semantic expression editing (i.e., where text labels are used to perform the transformations), this is probably as good as it gets for the state-of-the-art.
Other systems have succeeded in allowing users to perform non-semantic facial manipulations in GAN space and other systems, such as ‘eyes more open’, ‘mouth pursed’, etc. However, such systems are essentially cut-down neural versions of CGI-based ‘clay modeling’ frameworks such as ZBrush, where the user basically keeps moving elements of the face around until they perceive that the desired expression is obtained – a methodology that does not enable reproducible or deployable semantic expression synthesis, but rather extends ‘traditional’ VFX tool-kits into a machine learning space. It could be argued that such systems are ultimately potential artists’ tools, rather than independent, AI-based interpretive frameworks.
Notwithstanding these reservations, the authors’ claim that their expression edits retain the identity of the source image, certainly in comparison to rival systems, seems a valid one, even if the imposed expressions themselves may not be accurate to the source identity (after all, the system has only seen one single image of the identity, and cannot know for certain how that person’s face may be disposed in a variety of expressions).
The researchers also tested multi-attribute editing – truly a challenge for such systems, which are inclined to lose accurate identity even with just a single edit.
Of these results, the authors state:
‘It’s clear that ChatFace can generate progressive multi-attribute edits based on the user’s queries, thereby demonstrating the continuous editing capability of our proposed method.’
For the quantitative rounds, the authors followed the metric scheme adopted by DiffusionCLIP: directional CLIP similarity (Sdir), which measures the distance between the semantic value of the text prompt and the manipulated image; Segmentation Consistency (SC), which uses semantic segmentation to evaluate whether structural features have remained intact; and Face Identity Similarity (ID), which seeks to evaluate whether the defining characteristics of the person depicted have survived the transformations.
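Of the three metrics, Segmentation Consistency is perhaps the least self-explanatory. One plausible way to score it, assumed here for illustration rather than taken from either paper, is the fraction of pixels whose semantic class is the same before and after the edit. The label maps below are toy arrays standing in for the output of a face-parsing network, and the function name is hypothetical.

```python
import numpy as np


def segmentation_consistency(labels_src, labels_edit):
    """Fraction of pixels whose semantic class is unchanged by the edit."""
    if labels_src.shape != labels_edit.shape:
        raise ValueError("label maps must have the same shape")
    return float(np.mean(labels_src == labels_edit))


# Toy 4x4 label maps (e.g. 0 = background, 1 = hair, 2 = skin, 3 = glasses).
src = np.array([
    [0, 0, 1, 1],
    [0, 2, 2, 1],
    [0, 2, 2, 1],
    [0, 0, 1, 1],
])
edit = src.copy()
edit[0, 0] = 3  # a single pixel changes class after the edit

score = segmentation_consistency(src, edit)  # 15 of 16 pixels agree
```

A score near 1.0 indicates that the edit left the layout of the face largely intact, which is the property the metric is meant to capture.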
The authors note that ChatFace outperforms the rival frameworks across all tested facets, while maintaining better consistency with the original images than the prior approaches.
The researchers also re-presented the quantitative data (30 CelebA-HQ images that were manipulated for smile, curly hair, makeup and glasses) in a human evaluation study, depicted in the right-most columns of the table above, where ChatFace obtained similarly leading results.
It should be noted that CelebA-HQ largely represents popular ‘catwalk-style’ and junket press images, with innumerable smiling and well-lit celebrities dominating the dataset, and that, considering this, imposing a smile using this data is probably the lowest-hanging fruit available, in terms of expression editing.
Further diverse studies, not included in the main section of the paper, can be found in the appendix materials, and we refer the reader to these.
The current narrative in neural face-editing centers around discrete and non-destructive editing of facial aspects – a task that is massively impeded by the entangled nature of data in the latent space of any current system. In this respect, ChatFace seems to have equaled or exceeded the known state of the art, apparently by use of the SMS module which regulates the denoising process, so that the early stages do not crudely set the generation down a wayward path.
It could be argued that the scope and potential of the various manipulations attempted by ChatFace and rival systems should not be lumped in together. For instance, it is relatively simple to change the color of facial facets such as eyes, makeup, and hair, since these changes do not usually affect the actual topology of the face.
On the other hand, the enormously diverse changes that occur in a face when it smiles or frowns cannot be gleaned from a single input image, but will surely require a reasonable selection of subject-specific samples that represent that particular emotional affect. In other words, there’s probably no avoiding something akin to ‘training’ if you’re looking to explore the latent directions of one particular person’s facial expressiveness.
In the case of ageing, such data is, of course, not available at all, making age-up manipulations a matter of pure speculation, much as is the case with expression-guessing (i.e., when the system has only been fed one single image of the source subject).
What’s characteristic of the recent crop of ‘all-in-one’ neural facial editing systems is that they approach these very different tasks as if a single reductive approach could ever be effective.
At this stage of the game, however, the sector is still concentrating on basic disentanglement, which, despite the impressive gains of several aspects of ChatFace, remains an unsolved problem in pure neural approaches that do not resort to 3DMM techniques or other CGI-based ‘crutches’.