Without adequate quality control, consumer goods and services would be impossible: food would be randomly poisonous or unpleasant; any number of cars might hit the road in a dangerous state; spaceships could explode on launch; and toys could become deadly for our children.
Where the product or service operates in a less critical context, entropy and/or apathy often become the normal state: it’s generally acknowledged that the product could be better, but nobody necessarily knows exactly how it could be improved, and enough people still buy it that nothing much happens to improve the situation. The first solution more-or-less worked, and therefore standardization tends to embed it into subsequent systems.
Further, outdated technologies may outstay their welcome, even by decades, because it’s too hard or expensive to replace them.
In image synthesis, where AI systems train data into generative models such as Stable Diffusion, or into Generative Adversarial Networks (GANs), that quality control comes in the form of a handful of the most popular loss functions – algorithms that, during training, assess the system’s developing ability to create useful transformations or replications of the source data, and that force the system to keep training until it gets better at these processes.
Latent Diffusion Models iterate random noise into coherent and highly realistic pictures based on text-prompts, which retrieve text/image embeddings trained into the source model and guide the generation of novel variations. Source: https://ai.googleblog.com/2021/07/high-fidelity-image-generation-using.html
Therefore, loss functions almost entirely dictate how good such resulting trained models will be. Even having good data is, arguably, less important, since a deficient loss function won’t deal with any kind of data correctly.
If a metric fails to adapt to new conditions, it is likely not only to become less effective (if not entirely ineffective), but also risks actively undermining the evolution of new frameworks. Just as candlepower gave way to the candela and the lumen, the scale, intensity or quality of a property may require new methods of quantification as the related technologies change.
FID: The Incumbent
Fréchet Inception Distance (FID), proposed in 2017, has by now become the gold-standard metric both for evaluating the progress of images as a model improves during training, and for evaluating how well the trained model performs at inference time. Not only does its use permeate the current literature, but it has been extended into a video-equivalent metric titled Fréchet Video Distance (FVD).
How well a new framework performs under FID is therefore a well-established litmus test for the efficacy and viability of a new system.
Recently, Google Research published a paper claiming to demonstrate that the widespread use of FID is likely holding back the development of new computer vision and generative systems, while proposing a new and more advanced metric that is not only computationally less expensive, but far more effective at evaluating the new generation of AI-based image-creation systems.
The paper states*:
‘We encourage image generation researchers to rethink the use of FID as a primary evaluation metric for image quality. Our findings that FID correlates poorly with human raters, that it does not reflect gradual improvement of iterative text-to-image models and that it does not capture obvious distortions add to a growing body of criticism.
‘We are concerned that reliance on FID could lead to flawed rankings among the image generation methods, and that good ideas could be rejected prematurely.’
Continuing use of an ineffective metric (if FID is indeed ineffective) has ramifications beyond research into new generative systems: the older method is likely also to be used for ancillary tasks, such as ranking the quality of images on the internet for search algorithms, or filtering new datasets, where (if the new paper holds true) it is likely to prioritize images incorrectly, affecting the viability of any dataset so curated in trained systems.
The new work is titled Rethinking FID: Towards a Better Evaluation Metric for Image Generation, and comes from six researchers at Google Research, New York.
Theory, Method and Results
(Note that this paper does not present its findings or frame its challenges in the usual linear fashion, and therefore the customary distinct sections of Method + Data and Tests must be commingled here.)
Fréchet Inception Distance, as the authors point out, is used to evaluate discrepancies between two image sets. Typically, the ‘sample’ dataset is a real one, such as COCO, ImageNet, FaceForensics, or any other set where the general subject matter is an appropriate match for the target system.
The second group of data comes from the system being tested. This can be in the form of dynamic comparison during training (where the judgements aided by FID can determine the forward route of the training and directly affect the outcome by examining the system’s latest on-the-fly attempts), or evaluation of generated images from the finished (new) framework.
Typically, the two datasets are matched for domain parity, which is to say that if the target (new) framework is intended to generate mainly human faces, an apposite dataset such as FaceForensics is likely to be chosen as the benchmark.
The problem, according to the new work, is the way that FID goes about rationalizing a schema for this comparison: it extracts InceptionV3 embeddings from both datasets.
Even where ImageNet is not being used as the benchmark dataset, InceptionV3 embeddings are informed by having been trained on 1 million ImageNet images, across 1,000 classes (e.g., ‘woman’, ‘boat’, ‘building’).
We have indicated in earlier articles that the now-aged ImageNet dataset has garnered enormous influence in the research community, and in the development of loss metrics, which ‘embeds’ the collection’s arguable shortcomings at the algorithmic level, even in new projects that do not directly use ImageNet at all.
The fact that FID is dependent on data from a collection of only 1 million images (and a more limited number of classes than have since been developed), in an era of hyperscale datasets that run into the multiple billions, may give one clue as to why FID’s judgements could be a little myopic for modern purposes.
The new paper observes that the distributions obtained from modern datasets do not have the normal (Gaussian) dispersal that FID anticipates, and that the high-dimensional (2048×2048) covariance matrices derived during FID analysis can lead to large errors when a dataset is small.
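To see why those covariance matrices matter, FID's closed form can be sketched in a few lines: it fits a Gaussian (a mean and a covariance) to each set of embeddings and computes the Fréchet distance between the two Gaussians. The sketch below is illustrative only, using small random vectors in place of real 2048-dimensional InceptionV3 embeddings:

```python
import numpy as np
from scipy import linalg

def frechet_distance(x, y):
    # x, y: (n_samples, dim) arrays of embeddings; a real FID pipeline
    # would extract these from InceptionV3 (dim = 2048)
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.cov(x, rowvar=False)
    cov_y = np.cov(y, rowvar=False)
    # Fréchet distance between the two fitted Gaussians:
    # ||mu_x - mu_y||^2 + Tr(cov_x + cov_y - 2*(cov_x @ cov_y)^(1/2))
    covmean = linalg.sqrtm(cov_x @ cov_y)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary numerical noise
    diff = mu_x - mu_y
    return float(diff @ diff + np.trace(cov_x + cov_y - 2.0 * covmean))

rng = np.random.default_rng(0)
a = rng.normal(size=(5000, 8))           # stand-in for 'real' embeddings
b = rng.normal(loc=0.5, size=(5000, 8))  # stand-in for 'generated' embeddings
print(frechet_distance(a, a))            # zero (or very close): identical sets
print(frechet_distance(a, b))            # clearly larger: shifted distribution
```

Note that everything after the first two lines of the function rests on the fitted Gaussians alone: whatever structure in the embeddings is not captured by a mean and a covariance is invisible to the metric.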
Basically, FID, the paper broadly suggests, is a production line that’s expecting a certain scale and a variety of conditions that are no longer certain (or even likely) to manifest in modern computer vision research projects.
When the authors of the new work tested both FID and its unbiased variant FID∞ against their new CMMD method, they found that neither of the older techniques behaved correctly when the normality assumption on the distribution of the extracted embeddings was violated:
This means that every judgement that FID makes after this point is possibly proceeding from a false assumption.
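A simple thought experiment illustrates the limitation: two distributions can share identical means and covariances while being entirely different, and a metric built only on those two moments cannot tell them apart. A minimal numpy demonstration, with toy data rather than real embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
gauss = rng.normal(size=(10_000, 4))              # unit Gaussian samples
rade = rng.choice([-1.0, 1.0], size=(10_000, 4))  # ±1 coin flips: also zero
                                                  # mean, identity covariance
# The first two moments are (near-)indistinguishable...
print(np.allclose(gauss.mean(0), rade.mean(0), atol=0.1))
print(np.allclose(np.cov(gauss, rowvar=False),
                  np.cov(rade, rowvar=False), atol=0.1))
# ...so a metric that only compares means and covariances, as FID does,
# scores the two sets as essentially identical, although the underlying
# distributions differ greatly.
```

Real embedding distributions are not this pathological, but the example shows the kind of difference that a moments-only comparison is structurally unable to detect.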
On a practical level, this suggests the possibility that any generative system (of types such as Stable Diffusion) that relies on FID could be more effective if it were trained either with CMMD itself, or with some other new metric that does not operate under FID’s possibly outdated presumptions.
The Human Factor
The current trend in the development of new loss functions tends to involve human perception, rather than purely algorithmic determinations of the difference between compared images. One example of this is Learned Perceptual Image Patch Similarity (LPIPS), which utilizes 484,000 human evaluations of the perceived distortion of images; thus LPIPS is essentially the sum of those multiple individual human diagnoses†.
To illustrate this for their own use case, the new paper’s authors conducted a human study, in which participants were presented with two different examples from a Muse model trained on the WebLI dataset (introduced with PaLI): Model-A and Model-B. The second, Model-B, was configured to produce lower-quality images that one would expect any human rater to recognize as such.
Identical random seeds were used to generate all the images, each of which was evaluated by three different participants:
The authors report that ‘FID contradicts human evaluation while CMMD agrees’:
‘We observed that Model-A was preferred in 92.5% of the comparisons, while Model-B was preferred only 6.9% of the time. The raters were indifferent 0.6% of the time. It is therefore clear that human raters overwhelmingly prefer Model-A to Model-B.
‘However, COCO 30K FID and its unbiased variant FID∞, unfortunately say otherwise. On the other hand, the proposed CMMD metric correctly aligns with the human preference.’
CMMD, the new metric that apparently outperforms FID, takes its name from CLIP embeddings and Maximum Mean Discrepancy (i.e., CLIP-MMD).
The central difference here is the use of CLIP, which is trained on 400 million image/text pairs. The paper states*:
‘CLIP embeddings are better suited for representing the diverse and complex content we see in images generated by modern image generation algorithms and the virtually infinite variety of prompts given to text-to-image models.
‘To compute the distance between two distributions we use the MMD distance. MMD was originally developed as a part of a two-sample statistical test to determine whether two samples come from the same distribution. The MMD statistic calculated in this test can also be used to measure the discrepancy between two distributions.’
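In outline, the MMD statistic compares two samples through kernel evaluations rather than fitted moments. Below is a minimal sketch of the (biased) squared-MMD estimator with a Gaussian RBF kernel, using random vectors as stand-ins for CLIP embeddings; the kernel bandwidth here is illustrative, not the paper's exact setting:

```python
import numpy as np

def mmd2_rbf(x, y, sigma=10.0):
    # Biased estimate of squared MMD with a Gaussian RBF kernel.
    # x, y: (n, d) embedding arrays. CMMD pairs MMD with CLIP embeddings
    # and its own bandwidth choice; both are simplified here.
    def sq_dists(a, b):
        # Pairwise ||a_i - b_j||^2 via matrix multiplications -- the
        # 'trivially parallelizable' operations the authors refer to
        return (a * a).sum(1)[:, None] + (b * b).sum(1)[None, :] - 2.0 * a @ b.T

    k = lambda a, b: np.exp(-sq_dists(a, b) / (2.0 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()

rng = np.random.default_rng(1)
same = mmd2_rbf(rng.normal(size=(500, 16)), rng.normal(size=(500, 16)))
diff = mmd2_rbf(rng.normal(size=(500, 16)), rng.normal(loc=1.0, size=(500, 16)))
print(same < diff)  # samples from differing distributions score higher
```

Because the estimator works directly on pairwise kernel values, no Gaussian (or any other distributional form) is ever fitted to the embeddings.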
One advantage of MMD (i.e., the base method, without CLIP) is that, unlike FID, which expects a certain ‘spread’ of data among the extracted embeddings, it makes no assumptions about the underlying distributions. ‘In contrast,’ the authors say, ‘Inception-v3 is trained on ImageNet, which has on the order of 1 million images which are limited to 1000-classes and only one prominent object per image.’
Another advantage is that when working with high-dimensional vectors such as embeddings, CMMD is considerably more efficient:
‘[Calculating] FID requires estimating a 2048 × 2048 covariance matrix with 4 million entries. This requires a large number of images causing FID to have poor sample efficiency…The proposed CMMD metric does not suffer from this problem thanks to its usage of MMD distance instead of the Fréchet distance.’
The authors tested this by evaluating a Stable Diffusion model at varying sample sizes across the two methods, with images taken randomly from the commonly-used COCO dataset.
More than 20,000 images were needed to estimate FID accurately, in contrast to the MMD method, which, according to the authors, can perform just as well even with ‘small datasets’.
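The sample-efficiency point is easy to reproduce in miniature: estimating a d×d covariance matrix takes many more than d samples before the estimate stabilizes. A toy numpy check, with 256 dimensions standing in for Inception's 2048, and Gaussian noise standing in for embeddings:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 256  # stand-in for Inception's 2048-dimensional embeddings
errors = {}
for n in (100, 1_000, 20_000):
    x = rng.normal(size=(n, d))  # true covariance is the identity
    errors[n] = np.abs(np.cov(x, rowvar=False) - np.eye(d)).max()
    print(n, round(errors[n], 3))
# With n=100 < d=256 the estimate is not even full-rank; the worst-case
# entry error shrinks only slowly as the sample count grows.
```

At FID's full 2048 dimensions the covariance matrix has over 4 million entries, so the sample counts required scale up accordingly.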
The paper further notes the considerable resource savings involved in using MMD over FID, since MMD relies on matrix multiplications which are ‘trivially parallelizable’, and which can be further optimized in modern and popular deep learning libraries such as JAX, PyTorch, and TensorFlow.
A runtime comparison (image below), using JAX and PyTorch/NumPy approaches from two prior projects, also shows the runtime for Inception and CLIP features for a batch of 32 images.
It’s noteworthy that the authors observed how deeply FID can apparently err when images are getting progressively better. This ‘gradual improvement’ is a recent phenomenon due to the way that LDMs such as Stable Diffusion begin an image generation with pure Gaussian noise (see video at start of article), and iterate through multiple versions of an image until clarity (and fidelity to the text-prompt) gradually emerge.
The authors found similar shortcomings for FID when gradually applying distortion to images, with FID suggesting that the images are getting better rather than (as they actually are) getting worse:
Though many loss functions are far older than FID, it is striking to reflect, if the paper’s assertions hold, that such a comparatively recent algorithm has failed to adapt to a major innovation like LDMs, and to recent changes in scale and approach in the training of computer vision models.
The new work portrays FID as having been the best solution for its time, rather than an enduring loss function for the years ahead. Since CMMD (and base MMD) are not widespread solutions in generative systems, it would take a great deal of further testing to know the extent to which this more modern approach could improve current frameworks.
* My conversion of the authors’ inline citations to hyperlinks.
† While many papers have asserted that LPIPS is a more accurate method of determining loss, humans are themselves biased, even across diverse geographical groups. So there is no suggestion that LPIPS is an ‘unbiased’ algorithm – just that it accords more with the way humans would rate images, if there was time and scale for them to do it (there isn’t – training a complex model this way would literally take centuries).