The end user can not only ‘invent’ ad hoc avatars and representations by describing the person to be depicted (text-to-avatar), but can also use text prompts to revise the appearance of the final product.
Crucially, the new system, titled Roll-Out Diffusion Network (RODIN), is trained not on web-scraped faces, but on synthetic data created with the open source Blender project (via the Microsoft Fake It Til You Make It initiative). This obviates the legal implications that could eventually emerge from developing synthesis systems on random, publicly posted images, or on datasets whose legality and applicability for computer vision uses are beginning to be questioned.
For the project, 100,000 synthetic, Blender-originated human avatars were used as training data.
Tackling the 3D Avatar
In the last few years there has been a proliferation of projects using Generative Adversarial Networks (GANs) to achieve these kinds of syntheses, using portrait inversion (a broadly applicable technique that can ‘project’ a novel image of a person into a trained network so that it can be viewed and, in some cases, edited within the latent space).
However, GAN-based projects along similar lines have struggled either to develop adequate instrumentality (i.e., the means to control and manipulate the output), or to adequately disentangle the various attributes of an image (i.e., changing the hair color of a person may also change non-hair elements of the image).
Conversely, most of the NeRF-based approaches, while offering greater editability (because NeRF contains more explicit and addressable 3D-centric information than GANs), have tended to be computationally resource-intensive and/or time-consuming to train; have failed to reproduce detail adequately, or to output a suitably high-resolution product; or else have had similar problems to GANs in terms of entanglement.
The researchers compared their new approach to three prior works that blended GAN methodologies with NeRF output: Stanford University’s Pi-GAN; GIRAFFE, from the Max Planck Institute for Intelligent Systems and the University of Tübingen; and EG3D, an academic collaboration also led by Stanford. They also compared it to a generic autoencoder approach, obtaining notably lower Fréchet Inception Distance (FID) results than the older works.
Addressing the Quality Gap
RODIN, among the earliest systems to incorporate latent diffusion into an avatar-compatible generative pipeline, is able to output 1024×1024 resolution, thanks to a complex hierarchical generation architecture, which includes upscaling modules, as well as the leveraging of a large array of prior works.
In the last few months, Google’s Imagen Video project illustrated the current trend for multilayer upscaling architectures, which bridge the resolution gap between training pipelines that are almost universally native to 512px or lower, and the need to generate HD content. Imagen Video upscales from a paltry 24x48px native resolution to 1280x768px across three layers of upscaling modules.
Likewise, RODIN uses a hierarchy of upscaling algorithms to arrive at 1024px resolution, preceded directly by upscaling layers that increase resolution from 64px to 256px, and then 512px.
During the ‘fitting’ stage, where the input material (such as a photo of a user) is being adapted to the network, the upscaling routines randomly resample the image data to 64px and 256px, in order to make sure that the encoder (which will generate the ‘useful imagery’ that comprises the avatar) is robust to this ‘triplane’ workflow.
Among several other ‘traditional’ bottlenecks addressed by RODIN, the researchers have made economies by adopting patch-wise training, where the convolutional neural network (CNN) being used operates on representative sections of an image, instead of on entire images.
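As a rough illustration of the patch-wise idea, the sketch below crops a random square patch from a toy image, so that a training step would see only that region rather than the full frame. The `random_patch` helper, the 8x8 image and the patch size are hypothetical, and are not taken from the RODIN paper.

```python
import random


def random_patch(image, patch_size, rng):
    """Return a random patch_size x patch_size crop of a 2D image (list of rows)."""
    height, width = len(image), len(image[0])
    top = rng.randrange(height - patch_size + 1)
    left = rng.randrange(width - patch_size + 1)
    return [row[left:left + patch_size] for row in image[top:top + patch_size]]


# Toy 8x8 "image" whose pixel values encode their own position.
rng = random.Random(0)
image = [[r * 8 + c for c in range(8)] for r in range(8)]
patch = random_patch(image, 4, rng)
```

Each call yields a different 4x4 region, so the cost per step depends on the patch size rather than on the full image resolution.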
The authors believe that RODIN’s approach may eventually be implementable for more than just avatars, and conclude:
‘While this paper only focuses on avatars, the main ideas behind the Rodin model are applicable to the diffusion model for general 3D scenes. Indeed, the prohibitive computational cost has been a challenge for 3D content creation. An efficient 2D architecture for performing coherent and 3D-aware diffusion in 3D is an important step toward tackling the challenge.’
The system operates by taking a neural volume representation of a face and unpacking it into a series of 2D feature planes:
RODIN relies on three core architectural features: 3D-aware convolution; latent conditioning; and, as we have already seen, hierarchical synthesis.
Regarding 3D-aware convolution, this is the process where a CNN rationalizes the 2D inputs and enables cross-plane communication to help to assemble the source material into 3D-aware data. This helps to synchronize the details that are common across all the images and to form a coherent 3D representation (see images above for examples of ‘planes’, from a prior and unrelated project).
However, this is not enough for coherency in avatar output. Therefore RODIN uses an additional image encoder trained on the Blender-generated avatars. The extensive latent information drawn from 100,000 images provides a consistent rationale by which the user-submitted image (whether a real image or a text-to-image avatar) can be conformed to a consistent visual standard. This process is called, by the researchers, latent conditioning.
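The roll-out and the cross-plane communication can be sketched in a few lines. This is a deliberately simplified sketch, assuming single-channel planes and mean pooling; the function names are hypothetical, and the operator used in RODIN itself may differ in its details.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)


def roll_out(plane_xy, plane_xz, plane_yz):
    """Place three N x N planes side by side, giving one N x 3N image."""
    return [row_xy + row_xz + row_yz
            for row_xy, row_xz, row_yz in zip(plane_xy, plane_xz, plane_yz)]


def cross_plane_context_xy(plane_xy, plane_xz, plane_yz):
    """Attach context pooled from the xz and yz planes to every xy cell.

    Indexing convention (an assumption): plane_xy[x][y], plane_xz[x][z],
    plane_yz[y][z], each holding one scalar feature.
    """
    ctx_x = [mean(row) for row in plane_xz]  # pool over z: one value per x
    ctx_y = [mean(row) for row in plane_yz]  # pool over z: one value per y
    return [[(plane_xy[x][y], ctx_x[x], ctx_y[y])
             for y in range(len(plane_xy[x]))]
            for x in range(len(plane_xy))]


plane_xy = [[1.0, 2.0], [3.0, 4.0]]
plane_xz = [[0.0, 2.0], [4.0, 6.0]]
plane_yz = [[10.0, 20.0], [30.0, 50.0]]

rolled = roll_out(plane_xy, plane_xz, plane_yz)
features = cross_plane_context_xy(plane_xy, plane_xz, plane_yz)
```

In this sketch, every cell of the xy plane ends up carrying a summary of the features that share its x or y coordinate on the other two planes, which is the kind of information an ordinary 2D convolution over the rolled-out image would not see.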
Regarding latent conditioning, the authors state:
‘The latent conditioning not only leads to higher generation quality but also permits a disentangled latent space, thus allowing semantic editing of generated results.’
Keeping the source data (i.e., for a submitted portrait inversion) in the same domain requires some additional effort, so that the fitting stage produces coherent output. This is accomplished in RODIN with a shared multi-layer perceptron (MLP) decoder (see our article on autoencoder synthesis for more details on shared encoders) that ‘pushes’ the tri-plane features into the shared latent space.
Since the data is likely to exhibit at least some inconsistencies, the MLP decoder has to be tolerant of these. The aforementioned random upscaling and downscaling helps this component to become more robust to abstract differences between the planes, pulling all the data into a coherent representation.
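To make the role of the shared decoder more concrete, the following sketch queries a toy tri-plane at a 3D point and passes the result through a tiny MLP. It assumes nearest-neighbor sampling and summation of the three plane features, as is common in tri-plane pipelines; the plane contents, weights and function names are invented for illustration, and are not RODIN’s.

```python
def sample_plane(plane, u, v):
    """Nearest-neighbor lookup; u and v are coordinates in [0, 1]."""
    size = len(plane)
    row = min(int(u * size), size - 1)
    col = min(int(v * size), size - 1)
    return plane[row][col]


def triplane_feature(planes, x, y, z):
    """Project a 3D point onto the three planes and sum the sampled features."""
    f_xy = sample_plane(planes["xy"], x, y)
    f_xz = sample_plane(planes["xz"], x, z)
    f_yz = sample_plane(planes["yz"], y, z)
    return [a + b + c for a, b, c in zip(f_xy, f_xz, f_yz)]


def mlp_decode(feature, hidden_weights, output_weights):
    """One ReLU hidden layer followed by a linear output layer (no biases)."""
    hidden = [max(0.0, sum(w * f for w, f in zip(row, feature)))
              for row in hidden_weights]
    return [sum(w * h for w, h in zip(row, hidden)) for row in output_weights]


# Toy 2 x 2 planes with two feature channels per cell.
planes = {
    "xy": [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 2.0]]],
    "xz": [[[0.5, 0.5], [0.5, 0.5]], [[1.0, 1.0], [1.0, 1.0]]],
    "yz": [[[0.0, 0.0], [1.0, 1.0]], [[0.0, 0.0], [3.0, 3.0]]],
}
hidden_weights = [[1.0, -1.0], [0.5, 0.5]]  # hypothetical weights
output_weights = [[1.0, 1.0]]

feature = triplane_feature(planes, 0.9, 0.1, 0.9)
decoded = mlp_decode(feature, hidden_weights, output_weights)
```

Because the same decoder weights are applied to every avatar’s planes, the fitted planes are pushed to encode their content in a common way.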
One useful feature of having control over the training data (i.e., not having to rely on unbalanced, web-scraped datasets of real people, such as LAION or ImageNet), is that the resulting trained systems exhibit genuine diversity, and are capable of creating truly diverse representations.
The opposite of this is memorization, where the data is either too scant or too similar to provide the system with enough choices to generate diverse output, and where, instead, it will tend to repeat the data that it knows about.
The RODIN researchers have tested this capacity by generating the ‘nearest neighbors’ for a series of avatars (seen in the image below). If the system had become subject to memorization (a form of overfitting), the adjacent images to the avatars would be ‘variations’ on them; but as we can see, the nearest neighbors are very diverse indeed:
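The article does not say how the neighbors were retrieved, so the following is only a generic sketch of such a check: it ranks stored training embeddings by cosine similarity to a generated sample. The embedding vectors and avatar names are made up for the example.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbors(query, training_set, k):
    """Return the k training items most similar to the query embedding."""
    scored = [(name, cosine_similarity(query, vec))
              for name, vec in training_set.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


training_set = {
    "avatar_a": [1.0, 0.0],
    "avatar_b": [0.0, 1.0],
    "avatar_c": [1.0, 1.0],
}
generated = [0.9, 0.1]
neighbors = nearest_neighbors(generated, training_set, k=2)
```

If the top similarities sat very close to 1.0 for most generated samples, that would suggest the model was reproducing its training data rather than producing new avatars.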
CLIP, CFG and Editability
To allow the end-user to make text-based adjustments to generations, the researchers have used OpenAI’s CLIP (Contrastive Language-Image Pre-training) system. CLIP equates images and derived features with related text, so that it’s possible to use natural language to create or alter images.
The same mechanism allows RODIN to create avatars out of thin air by simple description:
The CLIP encoder used is ‘frozen’, which means that its weights are not updated during training: rather than being affected by new data passing through the workflow, it provides a fixed outcome based on its own prior training. Used centrally in OpenAI’s DALL-E 2, CLIP is a core feature in latent diffusion models.
Classifier-free guidance (CFG) allows the user to boost the fidelity of the generative system by forcing it to adhere to the user’s text prompts, and restricting its ability to freely interpret the submitted text. The flipside of this useful feature is that as the amount of CFG is increased, the quality of the output is likely to become more taut, stylized, or even to begin to tear and notably degrade (see our article on full-body deepfakes for examples). Used with restraint, however, CFG allows the user to strike a balance between accuracy of interpretation (i.e., fidelity to the prompt) and authenticity of the result.
In a qualitative ablation study, the RODIN researchers found that removing CFG had a deleterious effect on output:
The diffusion model used by RODIN makes use of the U-Net model employed in OpenAI’s Guided Diffusion research. The diffusion model was trained with the AdamW optimizer at a batch size of 48 and a learning rate of 5e-5, while the upsampling diffusion model was trained similarly, but on a batch size of 16.
The base diffusion model used 1,000 steps (i.e., the number of noising timesteps in the diffusion process), while the upsampling model used 100 steps, with a linear noise schedule.
For inference, both models used 100 diffusion steps (somewhere between 75 and 150 is a common range for inference). All the tests were performed on NVIDIA Tesla V100 GPUs with 32GB of VRAM.
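Assuming that CFG here refers to classifier-free guidance, the trade-off can be sketched with the commonly used formulation, in which the model’s unconditional noise prediction is pushed toward (and beyond) its text-conditioned prediction. The function name and the toy values are illustrative, and RODIN’s exact implementation may differ.

```python
def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Blend unconditional and conditional noise predictions.

    guidance_scale = 0 ignores the prompt, 1 uses the conditional
    prediction as-is, and larger values extrapolate past it.
    """
    return [u + guidance_scale * (c - u)
            for u, c in zip(eps_uncond, eps_cond)]


eps_uncond = [0.0, 1.0]  # prediction without the text prompt
eps_cond = [1.0, 3.0]    # prediction with the text prompt

guided = classifier_free_guidance(eps_uncond, eps_cond, guidance_scale=3.0)
```

With a scale of 3.0, the guided prediction lands well outside the range spanned by the two inputs, which is one way to see why very high values can produce over-stylized or degraded results.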
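A linear noise schedule can be sketched as follows. The endpoint values (1e-4 and 0.02) are the defaults popularized by the original DDPM work and are an assumption here, since the article does not state which values RODIN uses.

```python
def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Noise variances that grow linearly from beta_start to beta_end."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """Running product of (1 - beta): the fraction of signal left at each step."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


base_betas = linear_beta_schedule(1000)      # base model: 1,000 steps
upsample_betas = linear_beta_schedule(100)   # upsampling model: 100 steps
base_alpha_bars = cumulative_alphas(base_betas)
```

The cumulative product falls steadily toward zero, meaning that by the final step almost none of the original signal remains and the sample is close to pure noise.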