Restoring Archival Video with CLIP

About the author

Martin Anderson

I'm Martin Anderson, a writer occupied exclusively with machine learning, artificial intelligence, big data, and closely-related topics, with an emphasis on image synthesis, computer vision, and NLP.

For over fifty years, storage of video on analog magnetic tape represented one of the most fragile repositories of a changing culture in the history of archiving. Besides the fatal temptation to re-use video for subsequent projects as a cost-saving measure in terms of storage and logistics (this did not affect celluloid, which has no re-use value once exposed), video was prone to all kinds of unique types of quality degradation over time, depending on storage, usage, and other factors.

Restoring historical magnetic videotape to something nearer its original quality is undertaken for various motives: state or private archival projects designed to preserve history; the wish to commercialize archived content for disc sales or streaming services; and, perhaps of most interest to us, the need to supplement scant reference material of people, such as celebrities and actors, whom we may now wish to reproduce as they appeared earlier in their careers, but for whom little or no media material is available from that period.

It is a shame to suffer definition loss and quality loss in a media format (defined in scanlines rather than pixels, and further compromised by interlacing) that was already operating at pixel-equivalent output sizes now considered to be ‘low resolution’. If we need to rely on such formerly-unloved sources for precious facial data of people who now either look older or are no longer alive, machine learning approaches can be useful.

TAPE

One such new initiative, from Italy, offers a novel method of restoring the frequently (and quite randomly) damaged frames of archival video by using the CLIP system, which contributed to the effectiveness of text-to-image frameworks such as OpenAI's DALL-E series and Stable Diffusion. However, there is no generative AI, in the sense that the term is currently understood, involved in the process:

Above, real frames from a dataset. Below left, the current state of the art in video frame restoration with machine learning; below right, the results obtained by the new system, TAPE. Source: https://arxiv.org/pdf/2310.14926.pdf

Many previous and even recent approaches have used optical flow to address inconsistencies and errors in videotape restoration. Optical flow essentially ‘unwraps’ a short video sequence until it can, at least from the point of view of the system, be viewed at a single glance, so that visible errors can be addressed, in much the same way that Adobe Audition and similar software can turn an entire sound-clip into a pixel-based image that can be arbitrarily edited (for instance, to identify and clone out background noise, visually):

Left, Adobe Audition unwraps a temporal sound clip into a non-temporal and editable image; right, optical flow likewise makes changes in movement within a video clip viewable at a glance. Sources: https://www.pcmag.com/reviews/adobe-audition and https://www.researchgate.net/figure/Optical-flow-field-vectors-shown-as-green-vectors-with-red-end-points-before-and-after_fig6_290181771

However, the nature of analog magnetic tape glitches and anomalies is too random and staccato for optical flow to be an applicable approach. Without consistency, anomalies must be addressed where they occur, not where it would be convenient for them to occur.

Therefore the new method, titled (perhaps a little tortuously) resToration of digitized Analog videotaPEs (TAPE), seeks to identify the cleanest and least-degraded frames in a sequence extracted from a video excerpt. To this end it uses text-based prompts combined with CLIP, with prompt ensembling (effectively pooling the influence of multiple prompts) to score frames in an aggregated way, so that the clean frames can serve as references for restoring the damaged ones.

After this, a SWIN-UNet network is used to restore the affected frames, which are inserted into the extracted frame sequence prior to re-rendering.

Since ground truth for such anomalies is not easy to obtain, the researchers developed their own synthetic dataset, using the Adobe After Effects video processing and visual effects software to add artificial perturbations to video, to provide the system with an effective ‘before’ state, from which point it can repair issues, based on the content of the CLIP-identified unaffected frames.

Excerpts from the results section (see below), demonstrating TAPE's ability to restore affected frames.

In tests, the researchers of the new work found that TAPE obtained state-of-the-art results against prior analogous methods of restoration.

The new paper is titled Reference-based Restoration of Digitized Analog Videotapes, and comes from four researchers at the Media Integration and Communication Center (MICC) at the University of Florence.

Method

The TAPE system was to be tested in a typically challenging scenario, with no available ground truth, and with the material ‘as is’. Therefore source material was obtained from the Archivio Storico Luce (ASL), the biggest Italian historical video archive, containing magnetic tape media spanning the life of the technology throughout the 20th century.

Using high-quality examples from ASL’s Harmonic dataset, the researchers followed the methodology of a prior paper (featuring several of the new paper’s authors) in using After Effects to simulate degradations such as chroma loss along scanlines; tape mistracking and VHS edge waving; tape noise, which is not dissimilar to Gaussian noise; and under-saturation, where the image suddenly drops to near-grayscale.

A real-world example. Severe tape mistracking, which occurs when the spinning playback head fails to align correctly with the helical scan system during copying or recording. Source: https://www.thepixelfarm.co.uk/identifying-common-tape-defects/

Finally, the researchers obtained 26,392 frames across 40 clips, which were divided 75%/25% into training and test sets.

Synthetic degradations applied in After Effects for the project.

The researchers state:

‘At this point, our synthetic artifacts look similar to real ones. However, we also need to replicate the time-varying nature of real degradation. Indeed, artifacts in real-world videos change abruptly between consecutive frames and occur at the same time, leading to a disruption of temporal consistency.

‘For this reason, we randomize all the effects we apply to the synthetic videos to make the degradations appear with different intensities, positions, and combinations for each frame.

‘This way, we obtain a dataset of mainly temporally inconsistent videos composed of both almost clean and severely degraded frames, thus resembling real-world videos.’
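The per-frame randomization described in the quote above can be pictured with a short sketch. It only plans which effects hit which frame and does not render anything. The effect names mirror the degradations listed earlier in this article, while the inclusion probability, intensity range, frame height, and function names are all illustrative assumptions rather than details taken from the paper.

```python
import random

# Effect names follow the degradations listed in the article; the parameter
# ranges and the 0.5 inclusion probability are illustrative assumptions.
EFFECTS = ["chroma_loss", "mistracking", "edge_waving", "tape_noise", "under_saturation"]


def sample_frame_degradation(rng, frame_height):
    """Draw an independent random set of effects for a single frame."""
    plan = {}
    for effect in EFFECTS:
        if rng.random() < 0.5:  # each effect is switched on independently
            plan[effect] = {
                "intensity": round(rng.uniform(0.1, 1.0), 2),
                "scanline": rng.randrange(frame_height),  # vertical position
            }
    return plan


def plan_clip(num_frames, frame_height=576, seed=0):
    """Build one degradation plan per frame, with no temporal smoothing."""
    rng = random.Random(seed)
    return [sample_frame_degradation(rng, frame_height) for _ in range(num_frames)]


for index, plan in enumerate(plan_clip(num_frames=5)):
    print(index, sorted(plan))
```

Because every frame is sampled independently, consecutive frames can jump from nearly clean to heavily damaged, which is the temporal inconsistency the authors say they want to reproduce.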

The next task is frame classification, for which the project uses the aforementioned prompt ensembling via CLIP. All prompt ensembles used employed one or more of the following prompts:

1) “an image with color artifacts along rows”
2) “an image with interlacing artifacts”
3) “an image of a noisy photo”
4) “an image of a degraded photo”
5) “a photo with distortions”
6) “an image of a bad photo”
7) “a jpeg corrupted image of a photo”
8) “a pixelated image of a photo”
9) “a blurry image of a photo”
10) “a jpeg corrupted photo”
11) “a pixelated photo”
12) “a blurry photo”

We can see here that the digital-focused terminology of CLIP is being hijacked a little to accommodate image degradation in magnetic tape, even though these prompts refer to technologies (such as JPEG) that are not applicable to analog tape. In effect, the authors found that the problem was (literally) purely semantic, and that the existing associations could be easily adapted to the task at hand.

The similarity between each frame and the ensembled prompts is measured, with lower similarity signifying a less degraded frame. The ultimate objective is to find the least-degraded frame that is as near as possible in the sequence to a target degraded frame, so that this selected frame can serve as a template for subsequent reconstruction.
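As a rough illustration of this scoring step, the sketch below assumes that text and image embeddings have already been produced by a CLIP encoder (the toy three-dimensional vectors are made up), and that the ensemble is formed by averaging the normalized prompt embeddings. Averaging is a common way to ensemble prompts, but treating it as the paper's exact scheme is an assumption, and the function names are hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def ensemble_prompts(prompt_embeddings):
    """Average the unit-normalized prompt embeddings, then re-normalize.

    This is one common form of prompt ensembling; treating it as the
    paper's exact recipe is an assumption.
    """
    unit = [normalize(e) for e in prompt_embeddings]
    dim = len(unit[0])
    mean = [sum(e[i] for e in unit) / len(unit) for i in range(dim)]
    return normalize(mean)


def degradation_score(frame_embedding, ensemble_embedding):
    """Cosine similarity between a frame and the 'degraded image' ensemble."""
    frame = normalize(frame_embedding)
    return sum(a * b for a, b in zip(frame, ensemble_embedding))


# Toy 3-D stand-ins for CLIP embeddings (real ones have hundreds of dimensions).
prompt_embeddings = [
    [0.9, 0.1, 0.0],  # e.g. "an image with interlacing artifacts"
    [0.8, 0.2, 0.1],  # e.g. "an image of a noisy photo"
]
frame_embeddings = {
    "frame_000": [0.1, 0.9, 0.2],  # hypothetical clean-looking frame
    "frame_001": [0.9, 0.2, 0.1],  # hypothetical degraded-looking frame
}

ensemble = ensemble_prompts(prompt_embeddings)
scores = {name: degradation_score(emb, ensemble)
          for name, emb in frame_embeddings.items()}

# The frame with the lowest similarity is treated as the least degraded.
cleanest = min(scores, key=scores.get)
print(cleanest)
```

In a real pipeline the embeddings would come from CLIP's text and image encoders; only the comparison logic is shown here.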

Overview of the classification process, and the subsequent passing of pertinent frames to the SWIN-UNet.

Evaluation is aided by the reference-free quality assessment metrics Blind/Referenceless Image Spatial Quality Evaluator (BRISQUE); Natural Image Quality Evaluator (NIQE); and CONTRastive Image QUality Evaluator (CONTRIQUE).

Some threshold had to be set to distinguish degraded frames from good ones, and for this the researchers used the method outlined in the paper A Threshold Selection Method for Gray-Level Histograms (widely known as Otsu's method), which, the authors state, is commonly used in such thresholding tasks.
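To make the thresholding and reference-selection steps concrete, here is a minimal sketch. It applies the between-class variance criterion from the cited threshold-selection paper directly to a short list of per-frame scores instead of a binned histogram, which is a simplification. The scores are invented, the helper names are hypothetical, and interpreting ‘nearest’ as nearest in frame index is an assumption.

```python
def otsu_threshold(scores):
    """Return the cut that maximizes between-class variance.

    Candidate cuts are midpoints between consecutive sorted scores, instead
    of histogram bins, which is a simplification for short 1-D lists.
    """
    ordered = sorted(scores)
    n = len(ordered)
    total = sum(ordered)
    best_cut, best_variance = ordered[0], -1.0
    running = 0.0
    for i in range(1, n):
        running += ordered[i - 1]
        if ordered[i] == ordered[i - 1]:
            continue  # no valid cut between identical values
        w0, w1 = i / n, (n - i) / n
        mu0, mu1 = running / i, (total - running) / (n - i)
        variance = w0 * w1 * (mu0 - mu1) ** 2
        if variance > best_variance:
            best_variance = variance
            best_cut = (ordered[i - 1] + ordered[i]) / 2
    return best_cut


def nearest_clean_frames(target_index, scores, threshold, count=2):
    """Pick up to `count` clean frames closest in time to the target frame."""
    clean = [i for i, s in enumerate(scores) if s < threshold and i != target_index]
    clean.sort(key=lambda i: abs(i - target_index))
    return clean[:count]


# Hypothetical per-frame similarity to the 'degraded' prompt ensemble.
scores = [0.21, 0.23, 0.80, 0.78, 0.22, 0.75, 0.20]
threshold = otsu_threshold(scores)
references = nearest_clean_frames(target_index=3, scores=scores,
                                  threshold=threshold, count=2)
print(threshold, references)
```

With these toy scores the cut falls between the low cluster and the high cluster, and the degraded frame at index 3 borrows its references from the closest frames that score below the cut.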

Next, the selected frames are passed to the SWIN-UNet Transformer architecture, which operates simultaneously on multiple degraded frames. The Transformer architecture compensates for the temporal issues that are likely to emerge in such restoration projects, and can account for the randomness of artifacts and perturbations better than approaches based on optical flow.

Initially, shallow features are extracted via a convolutional layer, after which the patch size is reduced and the number of channels increased through blocks in the SWIN architecture.

Overview of the SWIN-UNet architecture used in TAPE.

The architecture uses skip connections to add the output from the encoder to the processed images, forcing the network to learn a prior for each frame.

The paper states:

‘[By] having the processed features attend themselves, we intuitively make each region of the input frames look at similar parts of the other frames. This is particularly useful due to the time-varying nature of the artifacts, as a highly degraded portion of one frame may be less damaged in one of the neighboring ones.

‘However, if a given region is severely degraded in all the input frames, some details will be permanently lost. For this reason, we propose employing clean reference frames that do not belong to the window of the input frames.’

At this point, the SWIN-UNet Transformer cannot help any further, since the reference frames are not contiguous temporally, and cannot be re-inserted into the video clip in their current state and configuration. Therefore the authors have devised a novel Multi-Reference Spatial Feature Fusion Block (MRSFF) module, which uses multi-head shifted window cross-attention and attention pooling to align the output back into a workable context – an approach inspired by SWIN 2D Transformer blocks.
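The core idea of letting a degraded frame ‘look at’ a reference frame can be illustrated with plain scaled dot-product cross-attention. This is not the MRSFF block: the sketch is single-head, has no shifted windows, no learned projections, and no attention pooling, and the toy feature vectors and function names are invented purely for illustration.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    Queries come from the degraded frame; keys and values come from a
    reference frame. Windows, shifts, multiple heads, and learned
    projections are all omitted here.
    """
    scale = math.sqrt(len(queries[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        logits = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(logits)
        outputs.append([sum(w * v[i] for w, v in zip(weights, values))
                        for i in range(dim)])
    return outputs


# Toy features: one degraded-frame patch and two reference-frame patches.
degraded_patch = [[4.0, 0.0]]
reference_keys = [[4.0, 0.0], [0.0, 4.0]]
reference_values = [[1.0, 0.0], [0.0, 1.0]]

fused = cross_attention(degraded_patch, reference_keys, reference_values)
print(fused)
```

The degraded patch matches the first reference patch far more strongly than the second, so the fused output is dominated by the first reference value. Content is pulled from wherever the reference is most similar, without needing the frames to be temporally adjacent.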

Two successive MRSFF blocks, where the term 'LN' means 'layer normalization'.

Data and Tests

The TAPE model was trained on the aforementioned synthetic dataset for 100 epochs (complete passes over the data) with the AdamW optimizer, at a learning rate of 2e-5. During training, 128×128px patches were randomly cropped from the source images to aid generalization, and the model was trained with a weighted sum of losses, including the Charbonnier loss and perceptual loss functions from the DeePSiM class, among others.
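For readers unfamiliar with the Charbonnier loss, the sketch below shows its usual form, the mean of sqrt(diff² + eps²), on flat lists of pixel values. The eps value, the placeholder perceptual weight, and the function names are assumptions for illustration; the paper's actual loss weights are not reproduced here.

```python
import math


def charbonnier_loss(prediction, target, eps=1e-3):
    """Mean of sqrt(diff^2 + eps^2) over all pixel values.

    Behaves like L1 for large errors but stays smooth near zero.
    """
    terms = [math.sqrt((p - t) ** 2 + eps ** 2) for p, t in zip(prediction, target)]
    return sum(terms) / len(terms)


def total_loss(pixel_term, perceptual_term, perceptual_weight=0.1):
    """Weighted sum of loss terms; the weight here is a placeholder."""
    return pixel_term + perceptual_weight * perceptual_term


restored = [0.50, 0.40, 0.90]
ground_truth = [0.50, 0.50, 0.70]
pixel = charbonnier_loss(restored, ground_truth)
print(round(pixel, 4))
```

The small eps term keeps the gradient well defined when a prediction exactly matches its target, which is the usual reason for preferring this loss over a plain L1 penalty.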

Additionally, during testing, the output was cropped to 512×512px, a common size for training.

The model took a formidable two days to train, even on an NVIDIA A100 with 40GB of VRAM, and ultimately runs at 15 FPS at inference time.

The six baselines used were Memory-Augmented Non-Local Attention for Video Super-Resolution (MANA); Multi-Scale Memory-Based Video Deblurring (MemDeblur); BasicVSR++; Recurrent Video Restoration Transformer (RVRT); Recurrent Transformer Network (RTN); and the prior method co-authored by some of the authors of the current paper (signaled as ‘Agnolucci et al.’).

All the baselines were trained on the aforementioned synthetic dataset, using the official repositories of each project.

Evaluation metrics used were Peak Signal-to-Noise Ratio (PSNR); Structural Similarity Index (SSIM); Learned Perceptual Image Patch Similarity (LPIPS); and Video Multimethod Assessment Fusion (VMAF), the video quality metric developed by Netflix.
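Of these metrics, PSNR is simple enough to compute by hand. The sketch below uses the standard definition, 10·log10(MAX² / MSE), on two tiny made-up lists of 8-bit pixel values; the function name and sample data are illustrative only.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak Signal-to-Noise Ratio in dB for two equally sized pixel lists."""
    mse = sum((r - x) ** 2 for r, x in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


reference = [52, 55, 61, 59]
restored = [54, 55, 60, 57]
print(round(psnr(reference, restored), 2))
```

Higher values mean the restored frame is numerically closer to the ground truth, which is why PSNR can only be reported on the synthetic dataset, where clean originals exist.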

Left to right, results for the synthetic and real world datasets in quantitative tests.

Regarding quantitative results on the synthetic data (left, in image above), TAPE considerably outperforms all rival baselines, on all metrics. Regarding this, the authors state:

‘[In] particular, LPIPS shows that our method produces results that are more perceptually accurate than the other techniques, while VMAF proves that our restored videos are more temporally consistent and with less motion jitter. Furthermore, we observe that [BasicVSR++] and [RVRT] perform poorly, even though they represent the state of the art in standard video restoration.

‘We attribute this result to the use of optical flow for frame alignment, which is detrimental for analog videos, as the degradation is so severe that it completely disrupts the temporal consistency. This outcome further shows the difference between standard and analog video restoration.’

For the real-world dataset, TAPE outperforms all rivals on CONTRIQUE, but is bested by MANA on BRISQUE and NIQE. However, the authors observe that MANA introduces artifacts which are neither pleasing nor accurate, but which are sharp, and which may therefore be interpreted by these metrics as re-introduced salient detail.

Examples of the way that MANA may be boosting its scores in inauthentic ways, by adding detail that achieves a good score, but does not qualitatively improve the image.

The authors opine:

‘We argue that BRISQUE and NIQE are misled by these artifacts and mistake them for high-frequency details that are instead typical of high-quality images.’

Regarding the qualitative tests carried out, we refer the reader to the source PDF for the best method of comparing results, but include versions of these tests below for quick reference:

Qualitative results on the synthetic dataset.
Qualitative results for the real-world dataset.

Regarding the qualitative round, the authors state that MANA and RTN generate ‘many unpleasant artifacts’, while RVRT is unable to tackle the degradation effectively, which the researchers state is due to its inability to generalize adequately.

The authors continue:

‘MemDeblur, BasicVSR++, and Agnolucci et al. yield acceptable results, but with visible artifacts. Regarding the synthetic dataset, TAPE generates the most detailed and photorealistic images. In our results, the eyes of the subjects in the first and second rows show fewer artifacts, and the overall images are considerably [cleaner.]’

The authors tell us that they intend to release video samples that will offer comparative views. They also state in the new paper that they will release the synthetic dataset used in the study.

They conclude:

‘Extensive experiments show the effectiveness of TAPE compared to several state-of-the-art techniques on both synthetic and real-world datasets. Our results demonstrate the differences between standard and analog video restoration, highlighting the need for approaches specifically designed for this task.’

Conclusion

The novel use of Transformer attention and the SWIN-UNet in TAPE, while interesting for the stated purposes of the project, may also offer a novel approach to digital restoration for data extraction purposes, in the very frequent cases where optical flow is not a suitable answer.

As it stands, the professional VFX arm of the generative AI movement is sorely in need of better methods for improving archival data that has either deteriorated over time, or was not that great in the first place (frequently both these factors apply).

Methods that inspired and impressed the public and industry alike over the past five years, such as DeOldify and the current GAN-based upscaling and detailing algorithms, have some quite severe limitations, while the initial promise of Latent Diffusion Models such as Stable Diffusion has devolved into almost interminable attempts to make this a governable rather than random technology.
