If you have ever watched a movie in relatively low resolution, you may have noticed that non-moving parts of the video seem ‘static’ and painted in comparison to sections of the image that feature movement. This is the effect of re-saving the source file with an optimized video codec – a lossy compression algorithm that discards as much information as possible in order to bring a movie or TV episode down from multiple gigabytes to a more manageable size that’s suitable for streaming or for storing on local media.
For this reason, during a sustained and fixed shot of (for instance) a person in front of a wall, the ‘wall’-data is likely to be repeated multiple times in order to keep the final file size low, along with numerous other tricks that the codec performs to obtain an optimal balance between quality and file-size:
In modern and popular codecs such as the ubiquitous H.264, this process is effected by the evaluation of Motion Vectors (MVs). In the image above, we can see a static representation of the (black) areas that are economically ‘repeated’ from other frames.
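The core mechanism can be sketched in a few lines: for each macroblock in the current frame, the encoder searches a window of the previous frame for the best-matching block, and stores only the offset (the motion vector) plus a small residual. The following is a minimal, illustrative version using exhaustive search over a small window – not the heavily optimized search strategies that real H.264 encoders use:

```python
import numpy as np

def find_motion_vector(prev, cur, y, x, block=16, search=8):
    """Exhaustive block-matching: find the offset (dy, dx) into `prev`
    that best matches the `block` x `block` patch of `cur` at (y, x)."""
    target = cur[y:y + block, x:x + block].astype(int)
    best, best_mv = np.inf, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            py, px = y + dy, x + dx
            if py < 0 or px < 0 or py + block > prev.shape[0] or px + block > prev.shape[1]:
                continue
            cand = prev[py:py + block, px:px + block].astype(int)
            sad = np.abs(cand - target).sum()  # sum of absolute differences
            if sad < best:
                best, best_mv = sad, (dy, dx)
    return best_mv

# A frame whose content shifts 3px right between frames: the best match
# for a block in the current frame lies 3px to the *left* in the previous one.
rng = np.random.default_rng(0)
prev = rng.integers(0, 255, (64, 64), dtype=np.uint8)
cur = np.roll(prev, 3, axis=1)
print(find_motion_vector(prev, cur, 16, 16))  # → (0, -3)
```

A static wall produces motion vectors of (0, 0) with near-zero residual, which is exactly why such regions are so cheap to store – and why they look ‘painted’ at low bitrates.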
In the image below, we can see a temporal representation of how the need to spend data on movement can play out over a video clip, with the directional arrows indicating areas of the screen where new data must be written, and where old data cannot be re-used:
Motion vectors have a number of applications in computer vision besides compressing movies and TV shows. In static video feeds, where the system expects zero change in the content of the video, any such movement registers as a potentially novel and dynamic element, which can facilitate the detection of new objects, people, and animals.
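For a fixed camera, this kind of detection can be as simple as thresholding the magnitude of the per-block motion-vector field – a minimal sketch, with the field layout assumed rather than taken from any particular decoder:

```python
import numpy as np

def motion_mask(mvs, threshold=1.0):
    """Given a per-block motion-vector field of shape (H, W, 2), flag blocks
    whose displacement magnitude exceeds `threshold` as potential movement."""
    mag = np.linalg.norm(mvs, axis=-1)
    return mag > threshold

# Hypothetical 4x4-block field: everything static except one block.
field = np.zeros((4, 4, 2))
field[2, 1] = (5.0, -2.0)   # e.g. a person walking through the frame
print(np.argwhere(motion_mask(field)))  # → [[2 1]]
```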
It’s possible to ‘unwrap’ motion vectors so that they can be viewed at a glance. Much as the Adobe Audition audio software can render a sound clip as a single static view (image left, below), allowing the user to copy, paste and erase particular facets of a recording, optical flow (OF) approaches can concatenate motion vector changes so that they are made explicit, and so that it is not necessary to consider the video one frame at a time:
Being a lossy process, encoding of this type is irreversible, and carries unique traits of compression artefacts. These artefacts cannot easily be imitated by deepfake video software such as DeepFaceLab, DeepFaceLive, and FaceSwap, which concentrate instead on reproducing a target likeness, and which frequently do not have access to raw and uncompressed source video (which does not exhibit these artefacts, and where file sizes can reach into the terabyte range).
Consequently, mismatches between the motion vector artefacts in an original video and those in an AI-manipulated video derived from it can be exploited via optical flow as a signifier of ‘untrue’ content.
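In the system described later, a trained classifier learns these cues, so no paired before/after videos are needed at inference time; but the underlying signal can be illustrated naively as a distance between two motion-vector fields (the field shapes and the perturbation here are invented for the example):

```python
import numpy as np

def mv_discrepancy(mv_a, mv_b):
    """Naive discrepancy between two motion-vector fields of shape (H, W, 2):
    mean Euclidean distance between corresponding vectors."""
    return np.linalg.norm(mv_a - mv_b, axis=-1).mean()

rng = np.random.default_rng(1)
original = rng.normal(size=(8, 8, 2))
# A face region re-rendered by a deepfake pipeline tends to carry motion
# vectors inconsistent with how the original encoder described that region.
tampered = original.copy()
tampered[2:5, 2:5] += rng.normal(scale=3.0, size=(3, 3, 2))
print(mv_discrepancy(original, original) < mv_discrepancy(original, tampered))  # → True
```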
However, optical flow needs a lot of resources to run, since it has to be calculated for each frame of the video. Though NVIDIA has provided a dedicated hardware solution, in the form of the Optical Flow SDK, designed to optimize OF throughput, and though a small number of projects have sought to address the bottleneck, it’s an exhaustive, if not exhausting, technique.
Now, a new paper from Switzerland is offering a computationally much cheaper method of accomplishing this – by comparing the motion vectors of the H.264 codec to deepfaked versions, and obtaining a difference in MVs that would signify interference with the original video.
One does not need a ‘before’ and ‘after’ video for this – rather, these comparisons are evaluated when the model is trained, so that it can recognize (on novel, deepfaked data) the tell-tale signs of deepfakery, based on unusual motion vector activity.
Since H.264, also known as the Advanced Video Coding (AVC) codec, is the most popular in the world at this time, it’s a suitable target for such a method. However, the authors of the new work point out that their approach is generalized and broadly applicable enough to be amenable also to H.264’s as-yet less popular successor, H.265, which uses a Coding Tree Unit (CTU) structure instead of H.264’s more traditional macroblocks, but still relies on motion vectors.
Essentially, the paper contends, H.264 already did the work necessary to provide a basis for comparison and deepfake detection at the time the video was encoded, and study of the baseline and deepfaked motion vectors can reveal the disparity without the need to unfold the clip via optical flow, at significant cost in local resources.
In experiments on the FaceForensics++ dataset, the new approach achieved a 14% increase in accuracy, even against methods that are more resource-intensive, and the authors believe that their work could eventually evolve into a real-time consumer-level hardware solution capable of discerning deepfake content in video calls and live streaming environments.
The new paper is titled Efficient Temporally-Aware DeepFake Detection using H.264 Motion Vectors, and comes from five researchers at the Image and Visual Representation Lab of the École Polytechnique Fédérale de Lausanne.
The H.264 codec, not uniquely, features three types of frame: I-Frames (‘intra-coded’ frames), which are complete, full-data frames whose interval and frequency can be set in the codec’s encoding preferences, and which are of no use for deepfake detection, since these full-fledged, ‘luxury’ frames do not exhibit the artefacts useful for this purpose; Predicted (P) frames; and Bi-directionally predicted (B) frames.
The latter two both recycle data from adjacent/redundant frames (though at no more than 16 frames’ distance); B-Frames can sample from both future and past frames, while P-Frames can only sample from past frames.
Some macroblocks in P and B frames contain no motion vector information, and are encoded entirely with intra-frame coding (thus called I-Macroblocks, making their parent frames hybrids between full-data frames and partially derivative frames). Since these, like pure I-Frames, are irrelevant to the project’s methodology, they are screened out of consideration via Information Masks (IMs).
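Conceptually, the mask just records which blocks actually carry motion vectors. A simplified sketch, where the per-block type array is a hypothetical stand-in for what a real decoder would report:

```python
import numpy as np

def information_mask(block_types):
    """Build an Information Mask (IM): 1 where a macroblock carries motion
    vectors, 0 where it is intra-coded (an 'I-Macroblock') and carries none.
    `block_types` is a hypothetical per-block array of 'I', 'P' or 'B'."""
    return (block_types != 'I').astype(np.float32)

types = np.array([['P', 'I', 'B'],
                  ['B', 'P', 'I']])
print(information_mask(types))  # 1s everywhere except the two 'I' blocks
```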
As shown in the conceptual architecture graphic above, a subset of RGB frames are then subject to face cropping through a face detection network. The I-Macroblocks and motion vectors are then taken directly from the decoder, before being cropped to a bounding box and passed to a neural network, which processes the results and produces a probability of deepfaked content.
Let’s take a closer look at the modules and procedures used.
In accordance with two prior works, the face detector used is a Multi-Task Cascaded CNN (MTCNN) which, if multiple faces are found, will select the largest of them. Once identified, the face is constrained to a 224×224px bounding box, and all frames are normalized at the end.
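The largest-face selection and square crop can be sketched as follows; the box format mirrors typical MTCNN output, but the centering logic is illustrative only (real pipelines also pad and align the box):

```python
import numpy as np

def largest_face_crop(frame, boxes, size=224):
    """Pick the largest detected face box ([x1, y1, x2, y2], as MTCNN-style
    detectors report) and return a square crop of side `size` around it."""
    areas = [(x2 - x1) * (y2 - y1) for x1, y1, x2, y2 in boxes]
    x1, y1, x2, y2 = boxes[int(np.argmax(areas))]
    cy, cx = (y1 + y2) // 2, (x1 + x2) // 2
    half = size // 2
    y0, x0 = max(0, cy - half), max(0, cx - half)
    return frame[y0:y0 + size, x0:x0 + size]

frame = np.zeros((480, 640, 3), dtype=np.uint8)
boxes = [(10, 10, 60, 60), (100, 100, 300, 320)]  # the second face is larger
print(largest_face_crop(frame, boxes).shape)  # → (224, 224, 3)
```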
After this, the motion vectors and information masks are stacked together into either a four-dimensional or six-dimensional input, depending on what type of frames emerge, and whether or not they are referencing future and/or past I-frames or I-Macroblocks – and these entities too are then normalized.
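The stacking step amounts to concatenating the available fields along the channel axis and normalizing. The paper describes four- or six-dimensional inputs depending on frame type; the exact channel composition is not reproduced here, so the layout and normalization below are a simplified assumption:

```python
import numpy as np

def stack_inputs(mv_past, im_past, mv_future=None, im_future=None):
    """Stack a motion-vector field (H, W, 2) with its information mask (H, W)
    into a channels-last input; if future-referencing data exists (B-frames),
    stack both directions. Illustrative layout, not the paper's exact recipe."""
    parts = [mv_past, im_past[..., None]]
    if mv_future is not None:
        parts += [mv_future, im_future[..., None]]
    x = np.concatenate(parts, axis=-1).astype(np.float32)
    # per-channel normalization to zero mean / unit variance
    return (x - x.mean(axis=(0, 1))) / (x.std(axis=(0, 1)) + 1e-6)

mv = np.random.default_rng(2).normal(size=(14, 14, 2))
im = np.ones((14, 14))
print(stack_inputs(mv, im).shape)          # P-frame case: (14, 14, 3)
print(stack_inputs(mv, im, mv, im).shape)  # B-frame case: (14, 14, 6)
```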
Data augmentation is used to avoid overfitting in training the applicable model, through the use of the Albumentations library, which applies Gaussian noise and blur, RGB and hue/saturation shifts, random changes in brightness and contrast, and grayscale transformations.
Additionally, more traditional ML augmentation methods are used, including horizontal and vertical flipping of the images, and patch-based removal of random image sections, in order to prevent the training process from memorizing the data instead of adapting to later, novel data.
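The geometric side of this pipeline can be sketched with numpy alone (the photometric transforms being handled by Albumentations); the flip probabilities and 32px patch size are assumptions for the example:

```python
import numpy as np

def augment(img, rng):
    """Geometric augmentations: random horizontal / vertical flips and
    patch-based erasure (cutout), so the model cannot memorize fixed pixels."""
    if rng.random() < 0.5:
        img = img[:, ::-1]            # horizontal flip
    if rng.random() < 0.5:
        img = img[::-1, :]            # vertical flip
    img = img.copy()
    # erase a random 32x32 patch
    y = rng.integers(0, img.shape[0] - 32)
    x = rng.integers(0, img.shape[1] - 32)
    img[y:y + 32, x:x + 32] = 0
    return img

rng = np.random.default_rng(3)
out = augment(np.full((224, 224, 3), 255, dtype=np.uint8), rng)
print(out.shape, (out == 0).any())  # shape preserved; a patch was erased
```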
The backbone classifier used in the project is MobileNetV3, which, as the name suggests, is optimized for low latency and accuracy under fairly constrained resources. However, the architecture needed some modification first: the classifier was replaced by a fully connected layer outputting a scalar that indicates whether the input has been found to be fake; additionally, the number of input channels was increased, since the varied input types (RGB, motion vectors, etc.) required extra channels.
Data and Tests
The FaceForensics++ dataset features 1,000 YouTube videos manipulated with various types of deepfake technologies, and the HQ version of the dataset was used in tests for the new project. The five different deepfake approaches used are: FaceShifter (FS in results); FaceSwap (FSwap in results); DeepFakes (DF in results, aka the aforementioned DeepFaceLab basis); Face2Face (F2F in results); and NeuralTexture (NT in results).
The results were compared not only with base RGB models, but also against the state-of-the-art OF estimator RAFT.
Since prior works did not report all pertinent quantitative factors, the authors created their own baseline.
The model was implemented in PyTorch and PyTorch Lightning, and was trained on two NVIDIA GTX Titan X GPUs, each with 12GB of memory, paired with an Intel Xeon E5-2860 V3 processor running at 2.50GHz.
The Adam optimizer was used, and the model trained for eight epochs (an epoch being a complete review of all the data during training), until convergence was reached (i.e., until the model was optimally trained, and unlikely to improve further with additional training).
Initially all the models were evaluated on FaceForensics++ against the RGB baseline, with the aim of making the most accurate prediction possible.
Of these results, the authors state:
‘Once we introduce RGB based models into the mix, we see an immediate saturation in accuracy for the models at around 96%. Additionally, the RGB models once again quickly overfit on the dataset.
‘As we see no changes in general accuracy whether the model uses RGB or a combination of RGB and motion information, we surmise that this is due to the poor quality of current DeepFake datasets available for research. Meaning the model gets as high of an accuracy it can get, already by just using RGB input.’
The central focus of the new work was to maintain this high level of applicability when using optical flow models, and therefore the researchers trained models individually on each type of deepfake approach, and ran tests for cross-forgery generalization:
Here the paper comments:
‘[We] can see MVs achieving higher accuracies on the specific datasets, indicating that they might add additional information that is not included in optical flows. Secondly, for cross-forgery and on average we see equally strong generalization results for all temporal augmentations, with OF performing slightly better on some subsets, namely Face2Face and DeepFakes.
‘Confirming previous research, we see purely RGB trained networks perform poorly on all generalization tasks. Thereby, we confirm the cross-forgery detection ability of MV in DeepFake [detection.]’
The authors additionally tested the systems against purely temporal data, evaluating model performance against RAFT’s optical flow methodology, and also on motion vector stemming.
The authors assert:
‘These results show that using the MVs and IMs only, strongly outperforms the RAFT based model, even only using MV Ps. Meaning that, while maybe not all motion is accounted for by MVs as it is by the [OF], there is additional information in MVs and IMs that allows for a better classification of whether a video is a forgery or not.’
As the computer vision research sector faces increasingly unfeasible dataset sizes and resource costs for the training and inference of novel models, and as much of the western world faces up to a growing energy crisis, a growing number of papers addressing optimization have been released as 2023 has matured.
In the case of the current work, the central motivation is to offer a more adroit and nimble solution to prior optical flow approaches to deepfake video detection; but the collateral benefits indicate that new and valuable solutions may continue to emerge from current and otherwise undesirable constraints upon resources.