Colorization Is an Obstacle to Recreating Actors of the Past


Recreating long-dead personalities, or generating AI-created younger versions of current actors, both require training data – images of the people as they were at the desired target age – which can be fed into a neural network (such as an autoencoder, Generative Adversarial Network, Neural Radiance Fields or Gaussian Splatting architecture) that can then learn to recreate the trained faces.

Usually, the target resolution is modern, with the expectation of producing, at a minimum, 4K footage. The data is practically never available either at these dimensions, or at equivalent resolution density, quality and clarity.

Additionally, the further back you go, the more certain it is that the available material will exist either exclusively or most abundantly in black and white, which was, during the golden age of photo-reportage and the paparazzi (up until the early 2000s), a cheaper, generally more light-sensitive and more resilient medium for press photographers.

Even for stars that are still alive and working today, such as Clint Eastwood, Meryl Streep and Robert Redford, the majority of available press and stock photos of their younger incarnations are in monochrome, or else in color that may be too grainy or too unresolved to be usable as source data. Source: gettyimages.com

It may be, in some cases, that AI dataset curation teams get lucky, and that the actor in question happened to feature significantly in a long-running TV series, preferably filmed on fine-grained, low ISO 35mm color stock, the exploitation rights of which can perhaps be purchased.

However, this isn’t helpful if you can’t get (or can’t afford) the usage rights; if the show featured extensive exterior photography back when low-light film emulsions were very grainy (forcing the producers to use the same consistent and grainy stock even in well-lit studio situations); or if the actor is more than 3-4 years younger or older in the show than the desired target age (since people used to age more quickly than they now do) – among other possible roadblocks.

In many cases (and particularly for models that do not require thousands of source shots, such as LoRAs), the best resolution available may be in the form of publicity stills. Actors promoting a TV show or movie that they featured in may look ‘off camera’ for such pictures; but most head-shots feature them looking straight-to-camera, which does not provide the diverse facial topology information, or the variation of head and expression poses, needed to train a versatile AI.

Stock archive photos of the actors Robert Redford and Juliet Mills – the latter of which featured in a long-running 1970s TV show that produced years of both abundant 'standard portraiture' and 'narrative action' stills, all extremely well-lit and with low-ISO, low-grain photos; but this is rarely the case when AI archive-hunters are seeking out historical source material for a project. Source: gettyimages.com

In the upper rows for each actor in the image above, we see the same kind of portraiture that may form part of our own archives, such as official high-school photos or company portraits. In the rows underneath, we see the kind of feigned pose that fell out of fashion for consumer-level portraiture in the mid-20th century, in which the actor adopts a ‘narrative stance’, even though the images are not extracted from real episodes or movies, but were purposely staged for a photographer.

The Limited Utility of Old Color Photos for AI Recreations

While a fair few of the available pictures exemplified above may be in color, they’re not in compatible color; and in that sense, they might as well all be in black and white.

We can see, just in the limited selection of photos featured above, how film resolution (and later, digital resolution) has evolved over time, how narrow the range of each color is, how skin tones tend to resolve into a single and monotonous solution (depending on the era of the emulsion), and how easy it is, based only on our perception of the color gamut, and excluding any other clues, to tell that these are ‘old’ photos.

Therefore, if we train models on such images, the limited color matrices are likely to entangle themselves, at least to an extent, into the neural recreations – potentially risking that the AI simulation looks ‘out of time’ in the context we wish to put it in, because the color quality is ‘too accurate’ to the source material.

While old color photos of this kind can clear up any major ambiguities in regard to accuracy of eye color and hair color, the general quality of color they display has to be updated to modern standards (unless the target context is ‘archive’ or ‘contemporary’ footage, such as much of the simulated archive material in Forrest Gump).

What we can obtain from old photos, if their relative quality is high enough, is surface detail. Depending on the exposure chosen for a photo, skin may reveal abundant detail, or progressively less of it at higher exposures. These days, it is relatively trivial to use High Dynamic Range (HDR) photography, which simultaneously incorporates multiple exposure levels into a single RAW camera image, allowing the photographer to select an exposure after the fact.

High Dynamic Range (HDR) photography incorporates simultaneous and cohesive exposure bracketing into a single image, so that the photographer can later explore and even mix various depths of tonality. Source: https://www.cl.cam.ac.uk/~rkm38/pdfs/mantiuk15hdri.pdf

We can extrapolate similar functionality from older photos, to a limited extent, using a range of applications, from Photoshop through to professional VFX processing tools.
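As a rough illustration of this kind of synthetic exposure extrapolation, the sketch below fakes an exposure bracket from a single archival scan by applying different gamma curves, then blends the results with OpenCV's Mertens exposure fusion. The file name and gamma values are arbitrary examples, and this is only a crude stand-in for the tools mentioned above, not any specific product's workflow.

```python
import cv2
import numpy as np

# Fake an exposure bracket from a single archival scan by applying different
# gamma curves, then fuse the results with Mertens exposure fusion.
scan = cv2.imread("archive_still.png").astype(np.float32) / 255.0

def pseudo_exposure(img, gamma):
    # gamma < 1 lifts shadows (a longer 'exposure'); gamma > 1 recovers highlights
    return np.clip(img ** gamma, 0.0, 1.0)

bracket = [(pseudo_exposure(scan, g) * 255).astype(np.uint8) for g in (0.5, 1.0, 2.0)]

# Mertens fusion does not need true HDR input: it blends the best-exposed
# regions of each frame into a single, tonally richer image.
fused = cv2.createMergeMertens().process(bracket)
cv2.imwrite("archive_still_fused.png", np.clip(fused * 255, 0, 255).astype(np.uint8))
```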

Though HDR was not available for archive photography, it is possible to extrapolate a similar gamut of exposure details synthetically. Source: https://www.gettyimages.com/detail/news-photo/gallery-juliet-mills-news-photo/109941926

This leaves us, in any case, with only fundamental indications as to how the person would have looked if they had been photographed in color by modern methods. If the actor is still alive, even if considerably older, iris color can usually be established (though this can vary considerably across archival color photos of the same person, and also across variations in lighting conditions) – but practically everything else will need to be estimated and simulated.

AI Colorization

Most current interest in AI-based colorization centers around the popular Stable Diffusion latent diffusion model, which many adherents combine with various modules of the ControlNet ancillary system, allowing Stable Diffusion to perform, at varying levels of authenticity and effectiveness, photo restoration that includes colorization.
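As a hedged sketch of the kind of workflow such hobbyist restorations typically use (not any specific contributor's recipe), the example below conditions Stable Diffusion 1.5 on a Canny edge map of the monochrome source via a ControlNet module, letting the text prompt supply the color. The model IDs are real public checkpoints, but the input file name, prompt and parameters are illustrative.

```python
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Edge-conditioned colorization sketch: the Canny map pins the composition,
# while the text prompt supplies the color 'guess'.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

bw = load_image("vintage_portrait.png")                    # hypothetical archival scan
edges = cv2.Canny(np.array(bw.convert("L")), 100, 200)     # structure from the B/W photo
control = Image.fromarray(np.stack([edges] * 3, axis=-1))  # 3-channel conditioning image

result = pipe(
    prompt="modern color photograph of a woman, natural skin tones, sharp detail",
    image=control,
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]
result.save("colorized_attempt.png")
```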

Various user-contributed attempts at using Stable Diffusion to restore quality and color to vintage or archival photos. Sources: https://old.reddit.com/r/StableDiffusion/comments/11scd1v/im_amazed_at_how_great_stable_diffusion_is_for/ https://old.reddit.com/r/StableDiffusion/comments/159snk5/photo_restoration_using_controlnet_of_a_german/ https://old.reddit.com/r/StableDiffusion/comments/11uo7ex/restoration_of_an_old_photo_of_my/ https://old.reddit.com/r/StableDiffusion/comments/11hzsrs/controlnet_did_a_good_job_rejuvenating_a_stained/

The problem here is that Stable Diffusion will borrow adjunct data freely from the hundreds of millions of web-scraped images on which it was trained, and will make contextual associations that could be unhelpful.

For instance, in the top-left image pair above we see a plausible simulation of color stock gamut from around 30-40 years ago, particularly for the uniformity of skin color, which may have been chosen as contextually correct, based on era-specific objects or traits recognized by the CLIP implementation in Stable Diffusion (i.e., the component that associates text content with image content).

Since older photos tend to have less dynamic range, it’s likely that Stable Diffusion will associate the limited exposure detail with thousands (or hundreds of thousands) of similar trained image latents which bear the same characteristics. It can therefore be an uphill struggle to separate these associations and obtain a ‘modern’ look during colorization.

Likewise, in the upper-right image, we see evidence that Stable Diffusion is basing this colorization scheme on the multitude of hand-colored photos in the LAION database (manual colorization with airbrushing and other techniques was common from the 19th to the middle of the 20th century).

Arguably, in this case, with its flat wash of army green and simplified colors, the result is true to the dominant data that Stable Diffusion has seen (i.e., it knows what colorized WWI photos look like), but not to the objective (it did not discard irrelevant context and create a better, ‘modern’ colorized version, even though adequate surface texture was available).

Ironically, this represents a kind of ‘domain perpetuation’: the colorists of the era were using artificial methods to impose onto B/W material the best standards of color photography that were available at the time, including monotonous color gamuts.

Improving ControlNet's Colorization Capabilities

Latent Diffusion Models such as Stable Diffusion face additional difficulties in colorizing images: as we have mentioned before, when target colors are specified in a prompt (such as ‘a woman with blonde hair and blue eyes wearing a green coat’), the first color mentioned tends to dominate the others, and even sleight-of-hand tricks such as prompt-weighting (an attempt to give tokens that occur later in the prompt equivalent weight to earlier tokens) do not improve the situation much:

The Stable Diffusion-based commercial colorization resource palette.fm exemplifies the difficulties that Stable Diffusion has with containing color to target areas. Here we see that specifying the color of the taxi-cab in the original B/W photo has also 'stained' the woman's clothing yellow, even though the wool dress is evidently white in the original picture. Sources: https://palette.fm/color/edit and https://www.gettyimages.com/detail/news-photo/portrait-of-american-actress-raquel-welch-as-she-waves-news-photo/1216409693

This problem of color localization has been addressed in a number of ‘layout’ modules in ControlNet, all of which seek to confine prompted content to one particular area of an image, and which can be used to target colors discretely into portions of the picture – though such segmented approaches tend to be used more for deepfake-style transformations and to control the disposition of objects in generated images than for colorization purposes, at present.
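The sketch below illustrates the general principle of confining a prompted color to one region by other means – a masked inpainting pass, rather than the ControlNet layout modules themselves – so that the 'yellow' concept cannot leak outside the masked area. The file names and the mask are hypothetical; the model ID is a real public checkpoint.

```python
import torch
from diffusers import StableDiffusionInpaintPipeline
from diffusers.utils import load_image

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

base = load_image("street_scene_colorized.png")  # hypothetical first-pass colorization
taxi_mask = load_image("taxi_mask.png")          # white where the taxi is, black elsewhere

# Only the masked (white) region is regenerated, so the 'yellow' token cannot
# bleed into the woman's dress outside the mask.
fixed = pipe(
    prompt="a yellow taxi cab on a city street",
    image=base,
    mask_image=taxi_mask,
    num_inference_steps=30,
).images[0]
fixed.save("street_scene_taxi_fixed.png")
```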

The LayoutDiffuse system, which we covered in February, can 'pre-compose' Stable Diffusion transformations, leaving less to chance. Source: https://arxiv.org/pdf/2302.08908.pdf

One interesting recent paper offers an improvement on ControlNet’s general colorization capabilities, by processing relevant text-prompts with the Large Language Model (LLM) GPT-4, and by finetuning the InstructPix2Pix module (a popular system that began as a standalone project and was eventually made available in ControlNet) on a diverse dataset, obtaining quantifiable improvement in colorization results:

Figure 3 from the paper. Above, the black and white original photos; in the middle row, the standard colorization results of InstructPix2Pix; and in the bottom row, the results from the new method. Source: https://arxiv.org/pdf/2312.04780.pdf

The new paper is led by Zifeng An, currently a machine learning research scientist for Apple (though the paper predates this role), together with Zijing Xu, Eric Fan and Qi Cao (whose roles are also not specified in the preprint), and is titled Enhancing Visual Realism: Fine-Tuning InstructPix2Pix for Advanced Image Colorization.

Method

The researchers used a subset of 766 images from the IMDB-WIKI dataset, which, the paper states, is the largest publicly available dataset of face images (containing 460,723 images from the Internet Movie Database, and 62,328 images from Wikipedia).

The images were converted to grayscale, with the original color versions used as ground truth, to measure how accurately the final system could guess ‘authentic’ color.
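A minimal sketch of this preparation step might look as follows, assuming a local folder of the color source images (directory names are illustrative):

```python
from pathlib import Path
from PIL import Image

# Each color source image becomes a grayscale training input, with the
# untouched original kept as the ground-truth target.
src_dir, gray_dir = Path("imdb_wiki_subset"), Path("imdb_wiki_gray")
gray_dir.mkdir(exist_ok=True)

for path in sorted(src_dir.glob("*.jpg")):
    color = Image.open(path).convert("RGB")   # ground-truth target (kept as-is)
    gray = color.convert("L").convert("RGB")  # three-channel grayscale input
    gray.save(gray_dir / path.name)
```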

The LLM component of the project is an adaptation of the Finetuned Language Models are Zero-Shot Learners (FLAN-V2) initiative, wherein 30 synonymous variants of the text-prompt ‘colorize the image’ were generated by GPT-4 and associated with the images at training time. The researchers comment:

‘This approach was designed not only to fortify the robustness of testing procedures but also to optimize overall performance. For the validation dataset, we deliberately adhered to employing solely the prompt “colorize the image,” ensuring a consistent basis for meaningful comparisons.’

Examples of the diverse captions generated in association with GPT-4, to bolster the semantic power of the model to force rational and effective colorization. Source: https://huggingface.co/datasets/annyorange/colorized_people-dataset

We can see in the examples in the image above, just a few from those made available in an explorable dataset, the extent to which abstract or colloquial terms and commands can have different influences on colorization implementations. The central idea of the new work is that the reductive sum of many such diverse prompts per image can potentially produce a consistently-improved text-prompted colorization.
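In practical terms, the prompt-augmentation scheme amounts to sampling one paraphrase per training example while holding the validation prompt fixed. The sketch below shows one plausible way to wire this up; the variant list is a short invented example, not the authors' actual GPT-generated set.

```python
import random

# Invented examples standing in for the thirty GPT-generated paraphrases
# described in the paper.
PROMPT_VARIANTS = [
    "colorize the image",
    "add realistic color to this photo",
    "turn this black and white picture into a color photograph",
    # ...up to 30 synonymous instructions in the authors' setup
]

def training_prompt() -> str:
    # Each training example is paired with a randomly sampled paraphrase,
    # to stop the model over-fitting to one literal instruction string.
    return random.choice(PROMPT_VARIANTS)

def validation_prompt() -> str:
    # Validation always uses the canonical instruction, for comparability.
    return "colorize the image"
```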

For the finetuning of InstructPix2Pix, the Variational Autoencoder (VAE) and the CLIP text encoder were frozen (meaning that their weights would be unaffected by the training). This is because these two components only encode the text/image data into the latent space of the model, and so have a limited effect on its ability to colorize images.

Instead, the active finetuning took place on the architecture’s U-Net, which governs the denoising of the Gaussian latent noise that begins each Stable Diffusion image.
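A hedged sketch of this freezing scheme, using the public InstructPix2Pix checkpoint in the diffusers library (this is not the authors' training script, and the optimizer choice is an assumption, though the learning rate matches the default reported later in the article):

```python
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained("timbrooks/instruct-pix2pix")

pipe.vae.requires_grad_(False)           # frozen: image <-> latent encoder/decoder
pipe.text_encoder.requires_grad_(False)  # frozen: CLIP text encoder
pipe.unet.requires_grad_(True)           # fine-tuned: the denoising U-Net

# Optimizer choice is an assumption; the learning rate is the article's reported default.
optimizer = torch.optim.AdamW(pipe.unet.parameters(), lr=5e-6)
```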

Conceptual architecture for the project.

The two loss functions deployed in training were training loss (the difference between predicted and actual noise in the latent diffusion process) and validation loss (the difference between the colorized output and the ground-truth images – the latter containing ‘real’ rather than simulated color). Mean Squared Error (MSE) was used for the latter, within the LAB color space. Of this, the authors state:

‘This approach ensures that the colorization process faithfully reproduces accurate and realistic colors by closely mirroring the true color values of the original images.’
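The two losses could be re-implemented along the following lines; this is an illustrative reconstruction from the description above, not the paper's code:

```python
import numpy as np
from skimage.color import rgb2lab

def diffusion_training_loss(predicted_noise: np.ndarray, true_noise: np.ndarray) -> float:
    # Training loss: MSE between the noise the U-Net predicts and the noise
    # actually added during the latent diffusion step.
    return float(np.mean((predicted_noise - true_noise) ** 2))

def lab_validation_loss(pred_rgb: np.ndarray, truth_rgb: np.ndarray) -> float:
    # Validation loss: MSE between the colorized output and the ground-truth
    # color image, computed in the LAB color space (inputs as [0, 1] float RGB).
    return float(np.mean((rgb2lab(pred_rgb) - rgb2lab(truth_rgb)) ** 2))
```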

During the trials, the authors adjusted diverse hyperparameters, including the batch size, the learning rate, and the number of text prompts used (i.e., how many of the GPT-generated prompt variants were drawn on during training).

Data and Tests

Fine-tuning took place over a total of 50 variations. The results can be seen in the ‘Figure 3’ image above, which compares the original InstructPix2Pix model with the most performant of the fine-tuned models produced by the researchers.

The authors note that the training loss, as depicted in the graphs below, shows oscillation towards convergence, and they observe that this is normal.

Statistics for the training loss of the researchers' system.

However, the validation loss, the authors contend, is more interesting:

The validation loss statistics for the training process.

Regarding this, the paper states:

‘[The] validation loss, measuring the MSE loss between the predicted colorized image and the original image, shows an increasing trend with the number of training steps. This phenomenon is attributed to the limitation of MSE in perfectly reflecting the absolute quality of colorization.

‘For instance, in colorization tasks, different colors can be valid for the same black-and-white photo under similar lighting conditions.’

The latter comment points to one of the frequent phenomena seen in ‘blind’ colorization, where a system is asked to guess personal characteristics such as hair color without the aid of text prompts. Certain shades of grey, and other ‘tell-tale’ signals, may indicate red rather than even deep blonde hair; but this kind of luminosity value can vary enormously with the exposure used, and with other factors, making the ad hoc colorization of red-headed people a rather random operation.

There are several other associated syndromes of this kind, such as the difficulty of determining eye color when the eyes in a B/W source photo present pale, ‘clear’ irises. In cases like these, bias will emerge naturally from ground-truth correlation, but the low statistical probability of any given person being red-headed or blue/green-eyed makes such guesses more of an interpretive art than a science.
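A tiny worked example makes the luminance ambiguity concrete: under the standard BT.601 luma weighting, an auburn red and a dark blonde (the RGB triples below are invented for illustration, not measured from any photo) land on almost the same grayscale value, leaving a blind colorizer with nothing to choose between them.

```python
def bt601_luma(r: float, g: float, b: float) -> float:
    # Standard luma weighting used when converting RGB to grayscale.
    return 0.299 * r + 0.587 * g + 0.114 * b

auburn_red = (140, 70, 50)    # illustrative RGB triple
dark_blonde = (110, 85, 55)   # illustrative RGB triple

print(bt601_luma(*auburn_red))   # ~88.7
print(bt601_luma(*dark_blonde))  # ~89.1 – nearly identical in grayscale
```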

In the five examples above†, the first column is the original image, the second a colorization attempt by Photoshop's neural filters (based on the Adobe Firefly engine), and the third a colorization by palette.fm, which is based on Stable Diffusion. It should be borne in mind that in many cases the training data for the underlying engine may even have included the target image. Here, the color versions were specifically converted to grayscale before colorization – though this is, in any case, an automatic first step in all known current colorization methodologies.

The only way round such statistical anomalies is to account for them with manual labeling. This is usually impractical to undertake at scale, and less effective in smaller datasets that must produce models which can make effective predictions on novel data.

The authors also present images taken at the early, middle and late stages of the fine-tuning, and they contend that these are indicators of the improving quality of the model’s colorization capabilities:

Comparison examples across three phases of training, with the desaturated source image above left, and the ground truth image above right.

For a quantitative comparison, the authors used three metrics: Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM), and Mean Absolute Error (MAE).
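These three metrics can be reproduced with standard libraries; the sketch below assumes uint8 RGB arrays for the colorized output and its ground-truth counterpart, and is not the authors' evaluation code.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(pred: np.ndarray, truth: np.ndarray) -> dict:
    # pred and truth are uint8 RGB arrays of the same shape.
    return {
        "PSNR": peak_signal_noise_ratio(truth, pred, data_range=255),
        "SSIM": structural_similarity(truth, pred, channel_axis=-1, data_range=255),
        "MAE": float(np.mean(np.abs(truth.astype(np.float64) - pred.astype(np.float64)))),
    }
```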

Quantitative comparison on the IMDB-WIKI datasets, for the project.

The authors observe that their approach outperforms baseline InstructPix2Pix on all metrics.

The researchers conducted three further test types: hyperparameter tuning, learning rate variation, and batch size variation. The first of these essentially represents the ‘tournament’ phase leading to the optimal model selected for the main tests. The default learning rate used was 5×10⁻⁶, at a batch size of 4, over thirty GPT-4-generated prompts, with each hyperparameter altered separately in turn.
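The one-factor-at-a-time sweep implied here could be expressed as follows; only the default values (a learning rate of 5×10⁻⁶, a batch size of 4, and thirty prompts) are reported in the article, so the alternative values in the sketch are placeholders.

```python
DEFAULTS = {"learning_rate": 5e-6, "batch_size": 4, "num_prompts": 30}

SWEEPS = {
    "learning_rate": [1e-6, 5e-6, 1e-5],  # placeholder low / default / high values
    "batch_size":    [2, 4, 8],           # placeholder values
    "num_prompts":   [1, 30],             # 1 vs 30 is the comparison discussed below
}

def runs():
    # Vary one hyperparameter at a time, keeping the others at their defaults.
    for name, values in SWEEPS.items():
        for value in values:
            config = dict(DEFAULTS)
            config[name] = value
            yield config

for config in runs():
    print(config)
```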

The effect of varying hyperparameters on the input image. No better resolution currently available.

Regarding changes to the learning rate (second row from the top in the image above), the authors comment:

‘[The] lower learning rate produces an image that is dull and is very close to InstructPix2Pix without much changes, while the higher learning rate produces an image that has very high contrast with vibrant colors. We found the balanced learning rate at 5 × 10−6 with the best results.’

In regard to changes in batch size (third row from top in image above), the paper states:

‘The facial colorization looks better in smaller batch sizes, but larger batch sizes provide some more coloring to the background. However, larger batch sizes update the model more given the same number of training steps, so they might overfit the model and cause the original model to start forgetting.’

Finally, regarding varying the number of prompts, the researchers found the effect to be lacking in potential*:

‘[There] is not much difference between using 1 and 30 prompts. We believe the CLIP text encoder will project similar prompts to similar embeddings, resulting in low differences between the training and validation prompts.

‘One potential approach to improve the prompting mechanism is to introduce soft prompting, thus eliminating the need for hard-coded prompts.’

In overview, the researchers conclude:

‘[The results] demonstrate a significant enhancement in the model’s ability to realistically colorize images. From a qualitative perspective, the colorization results of our model stand out in their visual perception. The colors in the images generated by our model strike a harmonious balance – they are neither too dull and close to grayscale nor too sharp and overly vibrant.’

Conclusion

At the very least this all illustrates how incredibly subjective colorization is, as a pursuit, thanks to the intrinsic relationship between language and image that’s present in latent diffusion models, and in other AI synthesis techniques that profit from semantically trained text-image pairings.

While there is no doubt that plausible color can be devised for AI-based recreations, automating and optimizing the process may represent a larger challenge.

It’s possible that future research needs to take a step back from the paradigms used in latent diffusion models – and, perhaps, in any architecture that relies on text/image pairings – in order to disengage from the hope that some truly ‘authentic’ solution lies within the current methodologies for hyperscale training of such models (which would certainly save a lot of effort). Photo colorization was an artisanal pursuit for about 120 years from the mid-19th century onward, and threatens to remain so unless such a pivot occurs.

A true ‘updating’ of archival monochrome material into modern photo reproduction standards seems unlikely to emerge from text/image systems which are algorithmically rewarded more for authenticity (making old monochrome photos look like old color photos) than currency (making old monochrome photos look like modern photos).

Why would the models do otherwise? There is nothing in the data that reflects this particular transformation, unless we take the trouble to make it ourselves – and at daunting scale.

* My conversion of the authors’ inline citations to hyperlinks.

Sources:
Ann Margret – https://www.gettyimages.com/detail/news-photo/portrait-of-actor-ann-margret-posing-next-to-lavender-news-photo/3200716
Christina Hendricks – https://www.gettyimages.com/detail/news-photo/christina-hendricks-news-photo/533520800
Jill St. John – https://www.gettyimages.com/detail/news-photo/american-actress-jill-st-john-circa-1965-news-photo/153462495
Julianne Moore – https://www.gettyimages.com/detail/news-photo/julianne-moore-is-on-hand-at-news-conference-at-the-four-news-photo/97321025
Emma Stone – https://www.pinterest.com/pin/emma-stone-closeup-in-2023–914230793087575137/
