A new academic/industry collaboration from China and Singapore proposes a novel method of injecting ‘custom’ people into latent diffusion-based text-to-image systems such as Stable Diffusion without the need for tedious and resource-intensive fine-tuning. Furthermore, it can accomplish (the authors assert) SOTA-beating facial synthesis in Stable Diffusion using just one image as the input source, eliminating the need for laborious dataset curation.
Titled PhotoVerse, the new approach draws on a number of existing adjunct image-to-image SD technologies, including Low-Rank Adaptation (LoRA), in order to bypass the need to load up a full model and either directly modify its weights (as with DreamBooth) or painstakingly train differential weights, as occurs with slightly lighter bolt-on personalization solutions such as Textual Inversion and LoRA.
The authors state:
‘[Our] proposed PhotoVerse eliminates the need for test-time tuning and relies solely on a single facial photo of the target identity. This significantly reduces the resource costs associated with image generation. Following a single training phase, our approach enables the generation of high-quality images within just a few seconds.
‘Moreover, our method excels in producing diverse images encompassing various scenes and styles.’
Though there is currently no evidence of a code release, the authors also note that PhotoVerse is amenable to being incorporated into currently popular image-control solutions for Stable Diffusion, such as ControlNet.
In tests, the authors claim ‘superior performance’ over analogous state-of-the-art approaches.
The new paper is titled PhotoVerse: Tuning-Free Image Customization with Text-to-Image Diffusion Models, and comes from 11 researchers across ByteDance, Beihang University, and the National University of Singapore. There is also an associated project page.
PhotoVerse uses what the authors define as a dual-branch conditioning mechanism, which governs both the textual and image domains. These two domains are orchestrated in Stable Diffusion via CLIP, which creates semantic correlative relationships between trained images and associated concepts, enabling the user to prompt ‘Tom Cruise’ and get a likeness out of even the basic (V1.5) Stable Diffusion release (because many photos of the actor were present in the subset of the LAION database on which SD was trained).
The workflow includes adapters that can ‘project’ the reference image (i.e., the sole image of a face) into a combined pseudo-word/image feature. A pseudo-word is a word that is either non-existent in the Stable Diffusion trained model, or so underrepresented as to be effectively ‘available’ for coopting. By using a unique or under-used word of this kind, the concept being trained won’t bring with it any additional baggage from the training process. The best-known example of this in the Stable Diffusion community is the sks identifier, used in early personalization processes (though this particular token has since fallen under criticism).
PhotoVerse adds ‘concept conditions’ to the standard Stable Diffusion workflow, and trains the system (once, and to provide a durable solution) to embed this concept scope, which is bolstered by an additional face identity loss function.
However, this training is not full-fledged fine-tuning. The authors explain:
‘Rather than fine-tuning the entire UNet, which can be computationally expensive and potentially reduce model editability due to overfitting, we only add conditions and fine-tune the weights in the cross-attention module.
‘Previous [studies] have also highlighted that the most expressive model parameters reside in attention layers.’
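As a rough illustration of what ‘only the cross-attention module’ means in practice, the sketch below splits a list of parameter names into trainable and frozen groups. The parameter names are hypothetical examples written in the style of the diffusers UNet, where attn2 conventionally denotes cross-attention; the paper does not publish its code, so this is an assumption about the general approach rather than the authors' implementation.

```python
def select_cross_attention(param_names, marker="attn2"):
    """Return the parameter names that sit inside a cross-attention block.

    Assumption: cross-attention blocks are identified by a path component
    called `marker` (diffusers-style naming); this is not from the paper.
    """
    return [name for name in param_names if marker in name.split(".")]


# Hypothetical parameter names, for illustration only.
param_names = [
    "down_blocks.0.resnets.0.conv1.weight",
    "down_blocks.0.attentions.0.transformer_blocks.0.attn1.to_q.weight",
    "down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_k.weight",
    "down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_v.weight",
]

trainable = select_cross_attention(param_names)
frozen = [name for name in param_names if name not in trainable]

print("trainable:", trainable)
print("frozen:", frozen)
```

Everything in the frozen group would keep its pre-trained values, which is what makes this kind of training far cheaper than fine-tuning the whole UNet.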
Prior to this injection stage, the material needs a great deal of preprocessing. The authors state that a face detection algorithm is used to extract human faces from the input material, and that, unusually, the bounding box for the face (which normally hugs the outer lineaments very closely) is artificially enlarged, by a scaling factor of 1.3.
Further (though this part of the process is not visually illustrated in the new paper), the face is automatically masked, so that background material, extraneous accessories and other non-face content are excluded from the process.
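The bounding-box enlargement described above is simple to sketch. The paper gives only the scaling factor of 1.3; scaling about the center of the box and clamping to the image borders are assumptions made here for the sake of a working example, and the function name is hypothetical.

```python
def enlarge_box(box, scale=1.3, image_size=None):
    """Scale an (x0, y0, x1, y1) box about its center.

    Assumptions (not stated in the paper): the box grows symmetrically
    around its center, and is clamped to the image if a size is given.
    """
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    half_w = (x1 - x0) * scale / 2
    half_h = (y1 - y0) * scale / 2
    nx0, ny0, nx1, ny1 = cx - half_w, cy - half_h, cx + half_w, cy + half_h
    if image_size is not None:
        width, height = image_size
        nx0, ny0 = max(0.0, nx0), max(0.0, ny0)
        nx1, ny1 = min(float(width), nx1), min(float(height), ny1)
    return nx0, ny0, nx1, ny1


# A 100x100 face box becomes a 130x130 crop around the same center.
print(enlarge_box((100, 100, 200, 200)))
```

The looser crop keeps hair and the outline of the head in frame, which a tight detector box would usually cut off.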
(NOTE: The citations for prior modules and contributing frameworks used in the new work are not correctly ascribed in this paper, and often only list authors and year, without the name of the work, or any way of connecting the citation to the paper’s under-written end references. To avoid ambiguity, since groups of authors may publish multiple papers in any year, we omit reference to any contributing modules that we are not able to identify with certainty, and we apologize if this makes our evaluation less specific than usual.)
In line with prior work, PhotoVerse embeds the reference (source) image into the textual word embedding space, initially using a CLIP image encoder (CLIP being the model that also handles text conditioning in Stable Diffusion).
As far as the paper makes clear, part of the technique used in Google’s 2023 HyperDreamBooth initiative (which also attempts to use a sole image, though it requires additional fine-tuning) is then adopted to sift and select features extracted by CLIP that best capture the spatial and semantic information needed to effect the transformation.
Then, a multi-adapter architecture translates these filtered features from each layer into multi-word embeddings. Each adapter consists of just two Multi-Layer Perceptron (MLP) layers, a lightweight design that keeps the workflow fast.
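To give a sense of how small such an adapter is, here is a minimal pure-Python sketch of a two-layer MLP applied to the features from each selected encoder layer, yielding one pseudo-word embedding per layer. The dimensions, the ReLU activation, the random initialization and all names are assumptions for illustration; the paper does not specify them at this level of detail.

```python
import random


def linear(x, weights, bias):
    """Compute W x + b, with W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def init_linear(n_in, n_out, rng):
    """Small random weights and zero bias (an arbitrary choice for this sketch)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


class TwoLayerAdapter:
    """Hypothetical adapter: feature vector -> hidden layer -> word embedding."""

    def __init__(self, feat_dim, hidden_dim, embed_dim, seed=0):
        rng = random.Random(seed)
        self.w1, self.b1 = init_linear(feat_dim, hidden_dim, rng)
        self.w2, self.b2 = init_linear(hidden_dim, embed_dim, rng)

    def __call__(self, feature):
        hidden = relu(linear(feature, self.w1, self.b1))
        return linear(hidden, self.w2, self.b2)


# Toy sizes: 3 selected encoder layers, 8-dim features, 4-dim word embeddings.
layer_features = [[0.5] * 8 for _ in range(3)]
adapters = [TwoLayerAdapter(8, 16, 4, seed=i) for i in range(3)]
word_embeddings = [adapter(f) for adapter, f in zip(adapters, layer_features)]

print(len(word_embeddings), "pseudo-word embeddings of size", len(word_embeddings[0]))
```

With only two small matrices per adapter, the projection adds very little work at inference time compared with the diffusion model itself.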
There are some shortcomings in the standard Stable Diffusion pipeline after this point, according to the researchers:
‘Despite the advantages of condition in the textual embedding space, there are certain limitations to consider. For instance, the performance can be influenced by the encoder ability of the following text encoder, and the semantic information in the text space tends to be abstract which leads to higher requirements for token representation capabilities.’
To counteract this, PhotoVerse also maps the features that CLIP extracts directly into the image space, using an image adapter that is structurally similar to the text adapter. The authors assert that this ‘doubling up’ facilitates a more accurate projection of the source image into the pipeline.
The next stage is to give Stable Diffusion some functionality that it lacks, natively – the ability to inject new concepts directly into the workflow. This is normally handled by directly intervening into the source model, as with DreamBooth, or by interfering with the way that SD interprets the weights of a model, as with Textual Inversion and LoRA.
Indeed, PhotoVerse directly incorporates part of the LoRA methodology into the creation of this hybrid version of Stable Diffusion, which will thereafter have some genuinely creative capacity, rather than merely exploring the latent space of existing trained models.
Only the cross-attention module of Stable Diffusion is trained here, and the authors repeat their assertion that previous studies have demonstrated this part of the framework to be the most ‘expressive’, or likely to directly influence the outcome at inference time.
For this, PhotoVerse uses LoRA, a Parameter-Efficient Fine-Tuning (PEFT) technique. LoRA freezes the weights of the pre-trained model and introduces new trainable rank decomposition matrices into the layers being adapted (here, those of the cross-attention module), effectively creating a type of on-demand LoRA generator. The authors note:
‘Overall, the lightweight adapters and UNet are jointly trained on public datasets within a concept scope, such as [FFHQ]. In this way, PhotoVerse can learn how to recognize novel concepts. At inference time, fine-tuning the model is obviated, thereby enabling fast personalization based on user-provided textual prompts.’
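For readers unfamiliar with rank decomposition, the following is a minimal sketch of the general LoRA idea on a single linear layer: the frozen base weight is left untouched, and a low-rank update (an ‘up’ matrix times a ‘down’ matrix, scaled by alpha over the rank) is added to its output. It follows the original LoRA formulation (Gaussian-initialized down matrix, zero-initialized up matrix) and is not PhotoVerse's own code; the class name and toy sizes are hypothetical.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen base weight plus a trainable low-rank update (generic LoRA sketch)."""

    def __init__(self, base_weight, rank=2, alpha=2.0, seed=0):
        rng = random.Random(seed)
        self.n_out, self.n_in = len(base_weight), len(base_weight[0])
        self.rank = rank
        self.base_weight = base_weight  # frozen: never updated in training
        # Trainable 'down' matrix (rank x n_in), small Gaussian initialization.
        self.down = [[rng.gauss(0.0, 0.01) for _ in range(self.n_in)] for _ in range(rank)]
        # Trainable 'up' matrix (n_out x rank), zero initialization, so the
        # layer starts out behaving exactly like the pre-trained one.
        self.up = [[0.0] * rank for _ in range(self.n_out)]
        self.scale = alpha / rank

    def __call__(self, x):
        base = matvec(self.base_weight, x)
        update = matvec(self.up, matvec(self.down, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_parameters(self):
        return self.rank * (self.n_in + self.n_out)


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
layer = LoRALinear(identity, rank=1, alpha=1.0)
x = [1.0, 2.0, 3.0, 4.0]
before = layer(x)                     # identical to the frozen layer's output
layer.up = [[0.5] for _ in range(4)]  # stand-in for a training update
after = layer(x)                      # now shifted by the low-rank term

print(layer.trainable_parameters(), "trainable values vs", 4 * 4, "in the frozen weight")
```

The saving grows with layer size: the trainable count scales with the rank times the sum of the layer's dimensions, rather than with their product.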
Data and Tests
To test the system, the researchers fine-tuned the permanent model on three publicly available datasets: FairFace; CelebA-HQ; and FFHQ. These are baseline datasets, but the actual evaluation material used was gathered by the authors into a private dataset containing 326 images with a fair balance of race and gender.
The system employs base Stable Diffusion and LoRA, as well as visual attention weights. The model was trained at a learning rate of 1e-4 and at a batch size of 64 on a V100 GPU (VRAM amount unspecified – 16GB and 32GB are available), for 60,000 steps. Training parameters were further customized to optimize for the task.
For evaluation metrics in quantitative tests, the researchers leveraged functionality from the 2020 ArcFace facial recognition model, to test whether the output results contain the same recognizable features as the single input source images.
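The article's source does not spell out exactly how the ArcFace features were compared, but the usual approach with face recognition embeddings is cosine similarity, so the sketch below assumes that. The vectors are made-up stand-ins for real embeddings (no face model is loaded here), and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_identity_score(reference, generated):
    """Average similarity between one reference face and several outputs."""
    return sum(cosine_similarity(reference, g) for g in generated) / len(generated)


# Toy 2-D 'embeddings': one output matches the reference, one is unrelated.
reference = [1.0, 0.0]
generated = [[1.0, 0.0], [0.0, 1.0]]
score = mean_identity_score(reference, generated)
print(score)
```

A score near 1 would indicate that the generated faces sit close to the source identity in the recognition model's feature space, while a score near 0 would indicate little resemblance.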
For quantitative testing, rival frameworks trialed were base Stable Diffusion; Textual Inversion; DreamBooth; Textual Inversion combined with DreamBooth; Encoder-based Domain Tuning for Fast Personalization of Text-to-Image Models (E4T); and ProFusion.
All the rival methods require test-time tuning, which is to say that for each use-case, such as converting a particular identity, some individual and case-specific fine-tuning is necessary.
The researchers observe that methods such as Textual Inversion and DreamBooth require 3-5 photos for each subject, and note the further burden of balancing, curating and storing these photos, at least at scale. DreamBooth requires, the authors say, approximately five minutes of training time (though actual practitioners may radically disagree with this, as typical training sessions, even in powerful Colabs, rarely fall far short of an hour), while Textual Inversion requires 5,000 steps, which can lead to varying training times, depending on the power of the GPU in use.
Though E4T and ProFusion also accept just one image for each conversion, they still require per-case fine-tuning, which the authors state takes around thirty seconds.
‘In contrast, our proposed approach is test time tuning-free, enabling the synthesis of 5 images within a mere 25 seconds. This remarkable efficiency makes our method exceedingly user-friendly, significantly enhancing the user experience.’
Regarding the quality of the results, the paper states:
‘[PhotoVerse] exhibits exceptional proficiency in capturing facial identity information. Our approach successfully retains crucial subject features, including facial features, expressions, hair color, and hairstyle. For instance, when compared to alternative techniques, our proposed method outperforms in restoring intricate hair details while effectively preserving facial features, as evident in the first row.’
The researchers contend that PhotoVerse best captures the characteristic traits of the input images, including specific expressions, such as frowning.
Though we do not usually cover ablation studies, here they touch on a controversial topic in the Stable Diffusion community – whether or not regularization images are really needed in order to obtain good results in diffusion-based training and synthesis.
Regularization images, when included in a training round, are there to ‘offset’ the specificity of the source training material, and to help the obtained weights to blend in with the source model, by effectively showing the system the context into which the novel material must integrate.
Thus, in DreamBooth, LoRA, and other systems, it has long been a practice to provide class-based regularization images for the process, usually in far greater numbers, to ensure that the source images are always being compared to novel (regularization) images during the estimation process, and that they don’t become ‘paired’ with a reg image that recurs too often.
Therefore, if training a DreamBooth or LoRA of Tom Cruise, with the class being ‘man’ or ‘person’, a few thousand typical example images of this class would be used as regularization images. This practice has lately come to be seen as possible ballast: unnecessary, and even obstructive, to obtaining well-generalized but character-specific models.
In their ablation studies for PhotoVerse, however, the paper’s authors found otherwise:
‘The experimental results illustrate the importance of regularizations for visual values and textual facial embeddings during concept injection. It can promote the sparsity of representations, thereby retaining key values and mitigating overfitting issues, as well as enhancing the generalization capability of our model.’
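The link between regularization and sparsity mentioned in the quote can be illustrated with an L1-style penalty, which is the standard way of pushing values towards zero. Whether the paper uses exactly this form is an assumption here, and the loss weights below are placeholders, not the paper's values; the sketch only shows how such terms are typically combined with a reconstruction loss and a face identity loss.

```python
def l1_penalty(values):
    """Mean absolute value; minimizing it drives entries towards zero."""
    return sum(abs(v) for v in values) / len(values)


def total_loss(reconstruction, face_identity, text_embedding, visual_values,
               w_face=0.1, w_text=0.01, w_visual=0.01):
    """Weighted sum of loss terms.

    The weights are hypothetical placeholders chosen for this example;
    they are not taken from the PhotoVerse paper.
    """
    return (reconstruction
            + w_face * face_identity
            + w_text * l1_penalty(text_embedding)
            + w_visual * l1_penalty(visual_values))


# Toy numbers: the regularizers add a small cost for large embedding values.
loss = total_loss(
    reconstruction=1.0,
    face_identity=0.5,
    text_embedding=[1.0, -1.0],
    visual_values=[0.0, 2.0],
)
print(loss)
```

Note that this is regularization of the learned embeddings and attention values, which is a different mechanism from the class-based regularization images used in DreamBooth and LoRA training.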
The researchers additionally conducted a qualitative user study, though the paper does not state the number of participants or other circumstances.
The subjects were provided with examples from rival frameworks and asked to compare them to the output of PhotoVerse, and to rate them according to subject fidelity (does it look like the original person?) and text fidelity (does it do what the prompt asked it to do?).
As the paper's graphs of the user study results demonstrate, PhotoVerse outperformed the equivalent (single source) methods E4T and ProFusion by a significant margin.
The paper concludes:
‘The results highlight the potential and effectiveness of PhotoVerse as a promising solution for personalized text-to-image generation, addressing the limitations of existing methods and paving the way for enhanced user experiences in this domain.’
What’s missing from the results shown in the paper are the more extreme angles that are needed to create a diverse and convincing character – though it has to be admitted that in this respect, the rival networks tested suffer the same limitation. It is very difficult, arguably impossible, to infer what a person’s profile will look like from a ‘passport’-style or straight-on image alone.
Perhaps the best exponent to date of this kind of one-to-many pose extrapolation is the ROOP framework, which can quite effectively produce profile views in video, based solely on single images. However, as is increasingly the case this year, the ROOP project is doubling down on monetization, and allows high-resolution inference only through strict (and censored) API-based methods, usually via Discord.
Nonetheless, it’s possible that truly effective profile interpretation will eventually hit the FOSS space. Though it doesn’t appear to be a facility that PhotoVerse offers, the project’s core aim of ‘train once, infer forever’ is a worthwhile goal for this strand of the image synthesis research sector. It remains to be seen, of course, whether the project’s potential to be utilized in open source software such as ControlNet ever actually manifests with a full code release.