Low-Cost Deepfake Video Detection With H.264 Motion Vectors


If you have ever watched a movie in relatively low resolution, you may have noticed that non-moving parts of the video seem ‘static’ and painted in comparison to sections of the image that feature movement. This is the effect of re-saving the source file with an optimized video codec – a lossy compression algorithm that throws away as much information as possible in order to reduce a movie or TV episode from multiple gigabytes down to a more manageable size, suitable for streaming or for storing on local media.

For this reason, during a sustained and fixed shot of (for instance) a person in front of a wall, the ‘wall’-data is likely to be repeated multiple times in order to keep the final file size low, along with numerous other tricks that the codec performs to obtain an optimal balance between quality and file-size:

4 x 4 to 16 x 16 DCT transforms at work in codec compression. In the image on the right, we can see highlighted the areas which require extra information in each frame to convey continuous movement. The black areas change very little in this frame sequence, and thus may 'borrow' data from previous or future frames, instead of bulking up the video file with redundant information. Source: https://www.researchgate.net/figure/First-image-of-the-TABLE-sequence-using-4-x-4-to-16-x-16-DCT-transforms-intra-coded_fig1_224725106

In modern and popular codecs such as the ubiquitous H.264, this process is effected by the evaluation of Motion Vectors (MVs). In the image above, we can see a static representation of the (black) areas that are economically ‘repeated’ from other frames.

In the image below, we can see a temporal representation of how the need to spend data on movement can play out over a video clip, with the directional arrows indicating areas of the screen where new data must be written, and where old data cannot be re-used:

Arrows indicate movement over time – for these areas, the codec is going to have to spend some data. Where there is no movement, it can re-use data from adjacent frames. Source: https://www.researchgate.net/figure/A-typical-Motion-Vector-Map-in-H264-encoding-procedure_fig2_221449884
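For readers who want to see these vectors on their own footage, FFmpeg can export and draw them at decode time; the sketch below simply wraps the documented command in Python (the file names are placeholders):

```python
# A minimal sketch: FFmpeg can export H.264 motion vectors at decode time and
# overlay them as arrows with the 'codecview' filter. File names are placeholders.
import subprocess

subprocess.run([
    "ffmpeg",
    "-flags2", "+export_mvs",          # ask the decoder to export motion vectors
    "-i", "input.mp4",                 # placeholder input clip
    "-vf", "codecview=mv=pf+bf+bb",    # draw P-frame and B-frame (fwd/back) vectors
    "overlay_mvs.mp4",                 # placeholder output with the arrows drawn in
], check=True)
```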

Motion vectors have a number of applications in computer vision besides compressing movies and TV shows. In static video feeds, where the system expects zero change in the content of the video, any such movement is registered as a potential dynamic object, which can facilitate the detection of novel objects, people, and animals.

In overhead surveillance of a room, the motion vectors of a person are recognized and translated into a motion heat map. Source: https://courses.engr.illinois.edu/ece417/fa2017/ece417fa2017lecture23.pdf
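As a loose illustration of the same principle (and not the surveillance system pictured above), a rough motion heat map can be accumulated from a static camera feed with OpenCV's background subtraction; the path and parameters below are placeholders:

```python
# A rough sketch: accumulate a motion 'heat map' from a static camera feed by
# summing foreground masks over time. Path and parameters are placeholders.
import cv2
import numpy as np

cap = cv2.VideoCapture("static_feed.mp4")
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, detectShadows=False)
heat = None

while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = subtractor.apply(frame)            # 255 where the scene has changed
    if heat is None:
        heat = np.zeros(mask.shape, dtype=np.float32)
    heat += mask.astype(np.float32) / 255.0   # accumulate per-pixel motion counts

cap.release()
heat_vis = cv2.applyColorMap(
    cv2.normalize(heat, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8),
    cv2.COLORMAP_JET)
cv2.imwrite("motion_heatmap.png", heat_vis)
```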

It’s possible to ‘unwrap’ motion vectors so that they can be viewed at a glance. Much as the Adobe Audition audio software can render a sound clip as a single visual overview (image left, below), allowing the user to copy, paste and erase particular facets of audio in a recording, optical flow (OF) approaches can concatenate motion vector changes so that they become explicit, and so that it is not necessary to consider the video one frame at a time:

On the left, a sound clip has been 'unwrapped' in the waveform and spectrum editor of Adobe Audition. On the right, optical flow reveals the changes of motion vectors over time, in a single image. Sources: https://www.pcmag.com/reviews/adobe-audition and https://www.researchgate.net/figure/Optical-flow-field-vectors-shown-as-green-vectors-with-red-end-points-before-and-after_fig6_290181771.
A more dynamic example of optical flow's estimation of motion vectors. Source: https://docs.opencv.org/3.4/d4/dee/tutorial_optical_flow.html
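As a minimal sketch, following the pattern of the OpenCV tutorial linked above, dense optical flow can be estimated with the Farneback method and rendered as one colour-coded image per frame pair (the input path is a placeholder):

```python
# Minimal dense optical-flow sketch with OpenCV's Farneback estimator, rendered
# in the hue-for-direction, brightness-for-magnitude style of the OpenCV tutorial.
import cv2
import numpy as np

cap = cv2.VideoCapture("clip.mp4")                     # placeholder path
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
hsv = np.zeros_like(prev)
hsv[..., 1] = 255                                      # full saturation

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    hsv[..., 0] = ang * 180 / np.pi / 2                # direction -> hue
    hsv[..., 2] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX)
    vis = cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)
    cv2.imwrite(f"flow_{int(cap.get(cv2.CAP_PROP_POS_FRAMES)):05d}.png", vis)
    prev_gray = gray

cap.release()
```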

Being a lossy process, encoding of this type is irreversible, and it leaves behind characteristic compression artefacts. These artefacts cannot easily be imitated by deepfake video software such as DeepFaceLab, DeepFaceLive, and FaceSwap, which concentrate instead on reproducing a target likeness, and which frequently do not have access to the raw, uncompressed source video (which is free of these artefacts, and whose file sizes can approach or exceed the terabyte range).

Consequently, mismatches between the motion vector artefacts in an original video and those in an AI-manipulated video derived from it can be exploited via optical flow as a signifier of ‘untrue’ content.

From the 2019 paper 'Deepfake Video Detection through Optical Flow based CNN', incongruence between original and deepfaked motion vectors can be used as a tell-tale indicator of deepfaked content (here the target subject is the original subject from the source video, rather than an alternate identity). Source: https://openaccess.thecvf.com/content_ICCVW_2019/papers/HBU/Amerini_Deepfake_Video_Detection_through_Optical_Flow_Based_CNN_ICCVW_2019_paper.pdf
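As a simplified, hedged illustration of the idea (and not the method of the 2019 paper above), per-frame flow statistics from an original clip and a suspect derivative of it can be compared directly; the file names are placeholders:

```python
# Illustrative only: compare mean optical-flow magnitude, frame by frame, between
# an original clip and a suspect derivative. A large divergence in these statistics
# is the kind of incongruence the figure above depicts.
import cv2
import numpy as np

def mean_flow_magnitudes(path):
    cap = cv2.VideoCapture(path)
    ok, prev = cap.read()
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    mags = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mags.append(np.linalg.norm(flow, axis=2).mean())
        prev_gray = gray
    cap.release()
    return np.array(mags)

orig = mean_flow_magnitudes("original.mp4")   # placeholder paths
fake = mean_flow_magnitudes("suspect.mp4")
n = min(len(orig), len(fake))
print("mean per-frame flow divergence:", np.abs(orig[:n] - fake[:n]).mean())
```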

However, optical flow needs a lot of resources to run, since it has to be calculated for each frame of the video. Though NVIDIA has provided a dedicated hardware-accelerated solution, in the form of the Optical Flow SDK, designed to optimize OF throughput, and though a small number of projects have sought to address the bottleneck, it’s an exhaustive, if not exhausting, technique.

Now, a new paper from Switzerland offers a computationally much cheaper method of accomplishing this – by comparing the H.264 motion vectors of original videos to those of deepfaked versions, and obtaining a difference in MVs that signifies interference with the original video.

In an exaggerated example of deepfaking, intended to illustrate the procedure, the new method can discern areas of deepfaked content which contain dissonant motion vectors without needing to subject the video to optical flow methodologies. Source: https://arxiv.org/pdf/2311.10788.pdf

One does not need a ‘before’ and ‘after’ video for this – rather, these comparisons are evaluated when the model is trained, so that it can recognize (on novel, deepfaked data) the tell-tale signs of deepfakery, based on unusual motion vector activity.

Since H.264, also known as the Advanced Video Coding (AVC) codec, is currently the most popular in the world, it’s a suitable target for such a method. However, the authors of the new work point out that their approach is general enough to apply also to H.264’s as-yet less popular successor, H.265, which uses a Coding Tree Unit (CTU) structure instead of H.264’s more traditional macroblocks, but which still uses motion vectors.

Essentially, the paper contends, H.264 already did the necessary work to provide a basis for comparison and for deepfake detection at the time the video was encoded; studying the baseline and deepfaked motion vectors can reveal the disparity without having to ‘unfold’ the clip via optical flow, at significant cost in local resources.
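Since the vectors already sit in the bitstream, they can be pulled straight from a decoder, with no flow estimation at all. The hedged sketch below uses the PyAV bindings to FFmpeg, assuming a build that exposes motion-vector side data (the field names follow FFmpeg's AVMotionVector struct); it is an illustration, not the authors' implementation:

```python
# Hedged sketch: read H.264 motion vectors straight from the decoder with PyAV,
# assuming a PyAV/FFmpeg build that exposes motion-vector side data.
import av

container = av.open("clip.mp4")                            # placeholder path
stream = container.streams.video[0]
stream.codec_context.options = {"flags2": "+export_mvs"}   # same flag the CLI uses

for frame in container.decode(stream):
    side_data = frame.side_data.get("MOTION_VECTORS")      # None on I-frames
    if side_data is None:
        continue
    mvs = side_data.to_ndarray()   # structured array: motion_x, motion_y, src_x, ...
    print(frame.pts, len(mvs), "vectors")
```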

In experiments on the FaceForensics++ dataset, the new approach achieved a 14% increase in accuracy, even against methods that are more resource-intensive, and the authors believe that their work could eventually evolve into a real-time consumer-level hardware solution capable of discerning deepfake content in video calls and live streaming environments.

The new paper is titled Efficient Temporally-Aware DeepFake Detection using H.264 Motion Vectors, and comes from five researchers at the Image and Visual Representation Lab of the École Polytechnique Fédérale de Lausanne.

Method

The H.264 codec, not uniquely, features three types of frame: I-Frames (‘intra-coded’ frames), which are complete, full-data frames whose interval and frequency can be set in the codec’s encoding preferences; Predicted (P) frames; and Bi-directionally predicted (B) frames. I-Frames are of no use for deepfake detection, since they are full-fledged, ‘luxury’ frames that do not exhibit the artefacts useful for this purpose.

In orange, the I-Frames are full-data frames, with no reuse of information from adjacent frames. Source: https://www.researchgate.net/figure/Diagram-of-the-relationship-among-I-frame-P-frame-and-B-frame_fig1_335633523

The latter two both recycle data from adjacent/redundant frames (though at no more than 16 frames’ distance), but the B-Frames can sample from both future and past frames, while the P-Frames can only sample from past frames.

Some macroblocks within P and B frames contain no motion vector information, and are encoded entirely with intra-frame coding (these are known as I-Macroblocks, making their host frames hybrids between full-data frames and partially derivative frames). Since these blocks, like pure I-Frames, are irrelevant to the project’s methodology, their locations are recorded and shielded from scrutiny as Information Masks (IMs).
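As a loose illustration of how such a mask might be assembled (the helper below is hypothetical, and not the paper's code), the per-block vectors can be rasterized onto a macroblock grid, with blocks that received no vectors marked as intra-coded:

```python
# Hypothetical helper, not the paper's code: build a crude 'information mask'
# marking 16x16 macroblocks for which the decoder reported no motion vectors
# (i.e. blocks that were intra-coded within a P- or B-frame).
import numpy as np

def information_mask(mvs, width, height, block=16):
    """mvs: iterable of (dst_x, dst_y) motion-vector destinations for one frame."""
    grid = np.ones((height // block, width // block), dtype=np.uint8)  # 1 = no MV
    for dst_x, dst_y in mvs:
        gy, gx = int(dst_y) // block, int(dst_x) // block
        if 0 <= gy < grid.shape[0] and 0 <= gx < grid.shape[1]:
            grid[gy, gx] = 0      # this block carries motion information
    return grid                   # remaining 1s approximate the I-Macroblock locations

# Example: a 64x32 frame with vectors landing in two macroblocks
print(information_mask([(8, 8), (40, 24)], width=64, height=32))
```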

Conceptual architecture for the new project.

As shown in the conceptual architecture graphic above, a subset of RGB frames is then subjected to face cropping through a face detection network. The I-Macroblocks and motion vectors are taken directly from the decoder, cropped to the same bounding box, and passed to a neural network, which processes the results and produces a probability that the content has been deepfaked.

Let’s take a closer look at the modules and procedures used.

In accordance with two prior works, the face detector used is a Multi-Task Cascaded CNN (MTCNN), which, if multiple faces are found, will select the largest of them. Once identified, the face is constrained to a 224 × 224px bounding box, with all frames normalized at the end.
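The paper's exact MTCNN implementation is not specified here; as a hedged sketch, the facenet-pytorch package provides one with the same 'largest face' selection behaviour:

```python
# Hedged sketch using the facenet-pytorch MTCNN implementation (one common build,
# not necessarily the authors' choice): detect the largest face in a frame and
# crop it to a 224 x 224 patch. The frame path is a placeholder.
from facenet_pytorch import MTCNN
from PIL import Image

mtcnn = MTCNN(image_size=224, select_largest=True, post_process=True)

img = Image.open("frame_0001.png")      # placeholder frame exported from a clip
face = mtcnn(img)                       # 3 x 224 x 224 tensor, or None if no face found
boxes, probs = mtcnn.detect(img)        # bounding boxes, reusable for cropping MVs too
print(None if face is None else face.shape, boxes)
```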

After this, the motion vectors and information masks are stacked together into either a four-channel or six-channel input, depending on what type of frames emerge, and on whether they reference future and/or past I-frames or I-Macroblocks – and these entities too are then normalized.
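A rough sketch of what such a stacked input might look like follows; the channel ordering and composition are assumptions for illustration, not the paper's exact layout:

```python
# Illustrative only: stack motion-vector fields and information masks into a
# multi-channel input and normalize each channel. The exact channel composition
# of the paper's four- and six-channel variants is not reproduced here.
import numpy as np

H, W = 224, 224
mv_past   = np.random.randn(2, H, W).astype(np.float32)   # (dx, dy) toward past references
mv_future = np.random.randn(2, H, W).astype(np.float32)   # (dx, dy) toward future references
im_past   = np.random.randint(0, 2, (1, H, W)).astype(np.float32)   # information mask
im_future = np.random.randint(0, 2, (1, H, W)).astype(np.float32)   # information mask

stacked = np.concatenate([mv_past, mv_future, im_past, im_future], axis=0)  # 6 channels

# per-channel normalization to zero mean / unit variance
mean = stacked.mean(axis=(1, 2), keepdims=True)
std = stacked.std(axis=(1, 2), keepdims=True) + 1e-6
normalized = (stacked - mean) / std
print(normalized.shape)   # (6, 224, 224)
```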

Data augmentation is used to avoid overfitting when training the model, through the use of the Albumentations library, which applies Gaussian noise and blur, RGB and hue/saturation shifts, random changes in brightness and contrast, and grayscale transformations.

Examples of some of the data augmentations performed during training of the model.

Additionally, more traditional ML augmentation methods are used, including horizontal and vertical flipping of the images, and patch-based removal of random image sections, in order to prevent the training process from memorizing the data rather than adapting to later, novel data.
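A hedged sketch of such an augmentation pipeline, built with Albumentations, might look as follows (the probabilities and parameter values are assumptions, not the paper's settings):

```python
# Hedged sketch of an augmentation pipeline along the lines described above,
# built with Albumentations. Probabilities are assumptions, not the paper's values.
import albumentations as A

augment = A.Compose([
    A.GaussNoise(p=0.3),
    A.GaussianBlur(p=0.3),
    A.RGBShift(p=0.3),
    A.HueSaturationValue(p=0.3),
    A.RandomBrightnessContrast(p=0.3),
    A.ToGray(p=0.1),
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
    A.CoarseDropout(p=0.3),      # patch-based removal of random image sections
])

# usage: augmented = augment(image=face_crop_numpy_array)["image"]
```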

The backbone classifier used in the project is MobileNetV3, which, as the name suggests, is optimized for low latency and accuracy under fairly constrained resources. However, the architecture needed some modification first: the classifier head was replaced by a fully connected layer that outputs a scalar indicating whether the input has been found to be fake; additionally, the number of input channels was expanded, since the varied input types (RGB, motion vectors, information masks, etc.) required extra channels.

Since deepfake detection is a binary proposition (it’s either real or it’s not), binary cross entropy loss was used as the loss function.
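A hedged sketch of these modifications using torchvision follows; the MobileNetV3 variant, the six-channel input, and the layer indices are assumptions about one common build, not the authors' exact code:

```python
# Hedged sketch, not the authors' code: adapt torchvision's MobileNetV3-Large to
# accept a 6-channel motion-vector/mask input and emit a single fake/real logit.
import torch
import torch.nn as nn
from torchvision import models

model = models.mobilenet_v3_large(weights=None)

# widen the stem convolution from 3 RGB channels to 6 input channels
old_stem = model.features[0][0]
model.features[0][0] = nn.Conv2d(6, old_stem.out_channels,
                                 kernel_size=old_stem.kernel_size,
                                 stride=old_stem.stride,
                                 padding=old_stem.padding,
                                 bias=False)

# replace the 1000-class head with a single-logit output
model.classifier[-1] = nn.Linear(model.classifier[-1].in_features, 1)

criterion = nn.BCEWithLogitsLoss()    # binary cross-entropy on the raw logit

x = torch.randn(2, 6, 224, 224)       # dummy batch of stacked MV/IM inputs
labels = torch.tensor([[1.0], [0.0]])
loss = criterion(model(x), labels)
print(loss.item())
```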

Data and Tests

The FaceForensics++ dataset features 1,000 YouTube videos manipulated with various types of deepfake technologies, and the HQ version of the dataset was used in tests for the new project. The five different deepfake approaches used are: FaceShifter (FS in results); FaceSwap (FSwap in results); DeepFakes (DF in results, aka the aforementioned DeepFaceLab basis); Face2Face (F2F in results); and NeuralTexture (NT in results).

The results were compared not only with base RGB models, but also against the state-of-the-art OF estimator RAFT.

Since prior works did not report all pertinent quantitative factors, the authors created their own baseline.

The model was implemented in PyTorch and PyTorch Lightning, and was trained on two NVIDIA GTX Titan X GPUs, each with 12GB of memory, alongside an Intel Xeon E5-2860 V3 processor running at 2.50GHz.

The Adam optimizer was used, and the model trained for eight epochs (an epoch being a complete review of all the data during training), until convergence was reached (i.e., until the model was optimally trained, and unlikely to improve further with additional training).
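A minimal PyTorch Lightning sketch of this training setup might look as follows; the learning rate and data loading are placeholders, while the Adam optimizer, binary cross-entropy loss, and eight-epoch budget follow the description above:

```python
# Minimal PyTorch Lightning sketch of the described training setup. The learning
# rate and dataloader are placeholders; Adam, BCE loss, and the 8-epoch budget
# follow the article's description.
import torch
import torch.nn as nn
import pytorch_lightning as pl

class DeepfakeDetector(pl.LightningModule):
    def __init__(self, backbone):
        super().__init__()
        self.backbone = backbone                 # e.g. the modified MobileNetV3 above
        self.criterion = nn.BCEWithLogitsLoss()

    def training_step(self, batch, batch_idx):
        inputs, labels = batch                   # stacked MV/IM tensors, 0/1 labels
        logits = self.backbone(inputs)
        loss = self.criterion(logits, labels.float().unsqueeze(1))
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-4)   # lr is an assumption

# trainer = pl.Trainer(max_epochs=8, accelerator="gpu", devices=2)
# trainer.fit(DeepfakeDetector(model), train_dataloaders=train_loader)  # placeholder loader
```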

Initially, all the models were evaluated on FaceForensics++ against the RGB baseline, with the aim of making the most accurate prediction possible.

Initial results from an RGB run against FaceForensics, for all tested models.

Of these results, the authors state:

‘Once we introduce RGB based models into the mix, we see an immediate saturation in accuracy for the models at around 96%. Additionally, the RGB models once again quickly overfit on the dataset.

‘As we see no changes in general accuracy whether the model uses RGB or a combination of RGB and motion information, we surmise that this is due to the poor quality of current DeepFake datasets available for research. Meaning the model gets as high of an accuracy it can get, already by just using RGB input.’

The central focus of the new work was to maintain this high level of applicability when using optical flow models, and therefore the researchers trained models individually on each type of deepfake approach, and ran tests for cross-forgery generalization:

Cross-forgery evaluation results covering all models used in the trials.

Here the paper comments:

‘[We] can see MVs achieving higher accuracies on the specific datasets, indicating that they might add additional information that is not included in optical flows. Secondly, for cross-forgery and on average we see equally strong generalization results for all temporal augmentations, with OF performing slightly better on some subsets, namely Face2Face and DeepFakes.

‘Confirming previous research, we see purely RGB trained networks perform poorly on all generalization tasks. Thereby, we confirm the cross-forgery detection ability of MV in DeepFake [detection.]’

The (uncaptioned in the paper) two tables relating to the cross-forgery evaluation tests.

The authors additionally tested the systems against purely temporal data, evaluating model performance against RAFT’s optical flow methodology, and also on motion vector stemming.

Above (table 1, though not annotated thus in paper), the evaluation for motion vectors, information masks and optical flow classification accuracies on the respective deepfake approaches tested. Below (table 2, though not annotated in paper), the computational costs of RAFT's optical flow approach, compared to the new H.264 analysis method.

The authors assert:

‘These results show that using the MVs and IMs only, strongly outperforms the RAFT based model, even only using MV Ps. Meaning that, while maybe not all motion is accounted for by MVs as it is by the [OF], there is additional information in MVs and IMs that allows for a better classification of whether a video is a forgery or not.’

Conclusion

As the computer vision research sector faces unfeasible dataset sizes and resource costs for the training and inference of novel models, and as much of the western world faces up to a growing energy crisis, an increasing number of papers addressing optimization have been released as 2023 has matured.

In the case of the current work, the central motivation is to offer a more adroit and nimble alternative to prior optical flow approaches to deepfake video detection; but the collateral benefits indicate that new and valuable solutions may continue to emerge from current and otherwise undesirable constraints upon resources.
