How Easy Is It to Make and Detect a Deepfake?

The Evolution of Deepfake Technology

How to Make a Deepfake and How Hard It Is

Deepfakes can be harmful, but creating a deepfake that is hard to detect is not easy. Creating a deepfake today requires the use of a graphics processing unit (GPU); a gaming-type GPU costing a few thousand dollars can be sufficient to create a persuasive deepfake. Software for creating deepfakes is free, open source, and easily downloaded. However, the significant graphics-editing and audio-dubbing skills needed to create a believable deepfake are not common. Moreover, creating such a deepfake requires a time investment of several weeks to months to train the model and fix imperfections.

The two most widely used open-source software frameworks for creating deepfakes today are DeepFaceLab and FaceSwap. Both are supported by large and committed online communities with thousands of users, many of whom actively participate in the evolution and improvement of the software and models. This ongoing development will make deepfakes progressively easier for less sophisticated users to create, with greater fidelity and greater potential to produce believable fake media.

As shown in Figure 1, creating a deepfake is a five-step process. The computer hardware required for each step is noted.

Figure 1: Steps in Creating a Deepfake

  1. Gathering of source and destination video (CPU)—A minimum of several minutes of 4K source and destination footage is required. The videos should demonstrate similar ranges of facial expressions, eye movements, and head turns. Finally, the source and destination subjects should already look similar: they should have similar head and face shape and size, similar head and facial hair, similar skin tone, and the same gender. If not, the swapping process will render these differences as visual artifacts, and even significant post-processing may not be able to remove them.
  2. Extraction (CPU/GPU)—In this step, each video is broken down into frames. Within each frame, the face is identified (usually using a DNN model), and approximately 30 facial landmarks are located to serve as anchor points for the model to learn the location of facial features. An example image from the FaceSwap framework is shown in Figure 2 below, and a minimal code sketch of this step appears after step 5.

Figure 2: Face after extraction step showing bounding box (green) and facial landmarks (yellow dots).

Reprinted with permission from Faceswap.

3. Training (GPU)—Each set of aligned faces is then input to the training network. A general schematic of an encoder-decoder network for training and conversion is shown in Figure 1 above, and a minimal code sketch of this architecture appears after step 5. Notice that batches of aligned and masked input faces A and B (after the extraction step) are both fed into the same encoder network. The output of the encoder network is a representation of all the input faces in a lower dimensional vector space, called the latent space. These latent-space representations are then passed separately through decoder networks for the A and B faces, which attempt to regenerate each set of faces. The generated faces are compared to the originals, the loss function is calculated, backpropagation occurs, and the weights of the encoder and decoder networks are updated. This process repeats, batch after batch, until the desired number of epochs is reached. The user decides when to terminate training, either by visually inspecting the generated faces for quality or by observing that the loss value no longer decreases. Sometimes the resolution or quality of the input faces prevents the loss from reaching a desired value; in that case, no amount of training or post-processing is likely to yield a convincing deepfake.

4. Conversion (CPU/GPU)—The deepfake is created in the conversion step. To swap face A with face B, the flow in the lower portion of Figure 1 above is used. Here, the aligned, masked input faces A are fed into the encoder. Recall that this encoder has learned a representation of both faces A and B. When the output of the encoder is passed to the decoder for B, the decoder generates a face with B's identity that preserves the expression and pose of A. No learning or training occurs in this step; conversion is a one-way pass of a set of input faces through the encoder-decoder network. The output of the conversion process is a set of frames that must then be assembled into a video by other software.

5. Post-processing (CPU)—This step requires extensive time and skill. Minor artifacts may be removable, but large differences likely cannot be edited out. Post-processing can be performed with the deepfake frameworks’ built-in compositing and masking, but the results are less than desirable. While DeepFaceLab provides the ability to incrementally adjust color correction, mask position, mask size, and mask feather for each frame of video, the granularity of adjustment is limited. To achieve photorealistic post-processing, traditional media FX compositing is required: the deepfake framework is used only to export an unmasked deepfake composite, and all adjustments to the composite are made with a variety of video post-production applications. DaVinci Resolve can be used to color correct and chroma key the composite to the target video. Mocha can then be used to planar motion track both the target video and the composite video, creating a custom keyframe mask. The Mocha tracking data can then be imported into Adobe After Effects for the final compositing and masking of the deepfake with the target. Finally, shadows and highlights are filtered from the target video and overlaid on the deepfake. Should the masking accidentally remove pixels of the target’s background, Photoshop can be used to recreate the lost pixels. The finished result is a motion-tracked, color-corrected, photorealistic deepfake with few of the traditional blending artifacts.
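
The extraction step (step 2 above) can be sketched in a few lines of code. The snippet below is a minimal illustration using OpenCV and dlib rather than the internals of FaceSwap or DeepFaceLab; the video path and the dlib 68-point landmark model are placeholder assumptions, and the production frameworks use their own DNN-based detectors and alignment code.

```python
import cv2
import dlib

# Hypothetical paths: any destination clip and the standard dlib 68-point
# landmark model (a separate download from the dlib project).
VIDEO_PATH = "destination.mp4"
PREDICTOR_PATH = "shape_predictor_68_face_landmarks.dat"

detector = dlib.get_frontal_face_detector()       # face bounding boxes
predictor = dlib.shape_predictor(PREDICTOR_PATH)  # facial landmarks

cap = cv2.VideoCapture(VIDEO_PATH)
frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break                                      # end of video
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for box in detector(gray):                     # one box per detected face
        shape = predictor(gray, box)               # landmarks inside the box
        landmarks = [(p.x, p.y) for p in shape.parts()]
        # A real pipeline would align and crop the face here and save the
        # crop plus its landmarks for the training step.
    frame_idx += 1
cap.release()
```

Both frameworks perform an analogous detect-and-landmark pass and save aligned face crops for the training step.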
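
The training and conversion steps (steps 3 and 4) revolve around one shared encoder and two identity-specific decoders. The minimal Keras sketch below shows that architecture; the layer sizes, 64x64 face crops, and the faces_a and faces_b arrays are illustrative assumptions, not the actual DeepFaceLab or FaceSwap models.

```python
from tensorflow.keras import layers, Model

IMG = 64      # face-crop resolution (illustrative)
LATENT = 256  # latent-space dimension (illustrative)

def build_encoder():
    inp = layers.Input(shape=(IMG, IMG, 3))
    x = layers.Conv2D(64, 5, strides=2, padding="same", activation="relu")(inp)
    x = layers.Conv2D(128, 5, strides=2, padding="same", activation="relu")(x)
    x = layers.Conv2D(256, 5, strides=2, padding="same", activation="relu")(x)
    latent = layers.Dense(LATENT)(layers.Flatten()(x))
    return Model(inp, latent, name="shared_encoder")

def build_decoder(name):
    inp = layers.Input(shape=(LATENT,))
    x = layers.Dense(8 * 8 * 256, activation="relu")(inp)
    x = layers.Reshape((8, 8, 256))(x)
    x = layers.Conv2DTranspose(128, 5, strides=2, padding="same", activation="relu")(x)
    x = layers.Conv2DTranspose(64, 5, strides=2, padding="same", activation="relu")(x)
    out = layers.Conv2DTranspose(3, 5, strides=2, padding="same", activation="sigmoid")(x)
    return Model(inp, out, name=name)

encoder = build_encoder()
decoder_a, decoder_b = build_decoder("decoder_A"), build_decoder("decoder_B")

# One shared encoder, two identity-specific decoders.
face_in = layers.Input(shape=(IMG, IMG, 3))
autoencoder_a = Model(face_in, decoder_a(encoder(face_in)))
autoencoder_b = Model(face_in, decoder_b(encoder(face_in)))
autoencoder_a.compile(optimizer="adam", loss="mae")
autoencoder_b.compile(optimizer="adam", loss="mae")

# Training (step 3): each identity is reconstructed by its own decoder.
# autoencoder_a.fit(faces_a, faces_a, batch_size=16, epochs=100)
# autoencoder_b.fit(faces_b, faces_b, batch_size=16, epochs=100)

# Conversion (step 4): encode faces A, then decode with B's decoder (the swap).
# swapped_frames = decoder_b.predict(encoder.predict(faces_a))
```

Because both identities pass through the same encoder, the latent space encodes pose and expression in a form both decoders understand; the swap amounts to decoding A's latent codes with B's decoder.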

Each open-source tool has a large number of settings and neural-network hyperparameters, with some general commonalities between tools and some differences, mainly in neural-network architecture. Given a range of available GPUs, from a dedicated machine-learning GPU server to individual gaming-type GPUs, a higher quality deepfake can be made in less time on a single gaming-type GPU than on a dedicated machine-learning GPU server.

Hardware requirements vary based on the deepfake media complexity; standard-definition media require less robust hardware than ultra-high-definition (UHD) 4K media. The most critical hardware component for deepfake creation is the GPU, which must be NVIDIA CUDA and TensorFlow compliant; in practice this means an NVIDIA GPU. Deepfake media complexity is affected by the following factors (a hypothetical settings sketch follows this list):

  • video resolution for source and destination media
  • deepfake resolution
  • auto-encoding dimension
  • encoding dimensions
  • decoding dimensions
  • tuning parameters such as these, from DeepFaceLab: Random Warp, Learning Rate Dropout, Eye Priority Mode, Background Style Power, Face Style Power, True Face Power, GAN Power, Clip Gradients, Uniform Yaw, etc.
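
As an illustration only, the dictionary below shows what a snapshot of such settings might look like. The names echo the DeepFaceLab options listed above, but this is not DeepFaceLab's actual configuration format, the values are invented, and the comments only roughly paraphrase each option's intent.

```python
# Hypothetical, illustrative settings; not DeepFaceLab's real configuration format.
model_settings = {
    "resolution": 256,              # deepfake output resolution in pixels
    "autoencoder_dims": 512,        # size of the latent (auto-encoding) space
    "encoder_dims": 64,             # width of the encoder network
    "decoder_dims": 64,             # width of the decoder network
    "random_warp": True,            # warp training samples to help generalization
    "lr_dropout": False,            # learning-rate dropout, often enabled late in training
    "eyes_prio": True,              # extra loss weight on the eye region
    "face_style_power": 0.0,        # blend style from the destination face
    "background_style_power": 0.0,  # blend style from the destination background
    "gan_power": 0.0,               # adversarial sharpening; costly in GPU memory
    "true_face_power": 0.0,         # pull the generated face toward the source identity
    "clipgrad": True,               # gradient clipping for training stability
    "uniform_yaw": False,           # sample head-yaw angles uniformly during training
}
```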

The greater each of these parameters, the more GPU resources are needed to perform a single deepfake iteration (one iteration is one batch of faces fed through the network, with one optimization cycle performed). To compensate for complex media, deepfake software is sometimes multithreaded, distributing batches over multiple GPUs.
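
As a hedged sketch of what distributing batches over multiple GPUs can look like in TensorFlow (the framework these tools build on), the snippet below uses tf.distribute.MirroredStrategy with a toy stand-in model; it is not how DeepFaceLab or FaceSwap implement their multi-GPU support.

```python
import tensorflow as tf

# Claims all GPUs visible to TensorFlow and keeps their model replicas in sync.
strategy = tf.distribute.MirroredStrategy()
print("GPU replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    # The model and optimizer must be created inside the strategy scope.
    # A toy stand-in autoencoder; see the earlier encoder-decoder sketch.
    autoencoder = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 5, strides=2, padding="same",
                               activation="relu", input_shape=(64, 64, 3)),
        tf.keras.layers.Conv2DTranspose(3, 5, strides=2, padding="same",
                                        activation="sigmoid"),
    ])
    autoencoder.compile(optimizer="adam", loss="mae")

# Each global batch of 64 aligned faces is split across the replicas,
# e.g., 16 faces per GPU on a four-GPU machine:
# autoencoder.fit(faces_a, faces_a, batch_size=64, epochs=10)
```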

Once the hardware is properly configured with all needed dependencies, there are limited processing differences between operating systems. A GUI-based operating system does consume more system resources, but the effect on achievable batch size is small. Different GPUs, however, even from the same manufacturer, can perform very differently.

Time per iteration is also a factor in creating deepfakes. The larger the batch size, the longer each iteration takes; however, larger batch sizes produce lower pixel-loss values per iteration, reducing the number of iterations needed to complete training. Distributing a batch over multiple GPUs also increases time per iteration. It is therefore best to run large batch sizes on a single GPU with ample VRAM and a high core clock. Although one might reasonably expect a GPU server with 16 GPUs to outperform a couple of GPUs in a workstation, someone with access to a few thousand dollars’ worth of gaming GPUs can potentially make a higher quality deepfake video than a GPU server can.
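
Some back-of-the-envelope arithmetic makes this tradeoff concrete. Every number below is invented for illustration; real per-iteration times depend on the GPU, the model settings, and the batch size.

```python
# Invented numbers, purely to illustrate the batch-size/iteration tradeoff.
configs = {
    "batch 8, single GPU":  {"sec_per_iter": 0.45, "iters_needed": 500_000},
    "batch 16, single GPU": {"sec_per_iter": 0.75, "iters_needed": 280_000},
    "batch 16, two GPUs":   {"sec_per_iter": 0.95, "iters_needed": 280_000},
}

for name, cfg in configs.items():
    hours = cfg["sec_per_iter"] * cfg["iters_needed"] / 3600
    print(f"{name}: roughly {hours:.0f} hours of training")
```

Under these made-up numbers, the large batch on a single GPU finishes soonest, while splitting the same batch across two GPUs takes the longest, which mirrors the behavior described above.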

The current state of the art of deepfake video creation entails a long process of recording or identifying existing source footage, training neural networks, trial and error to find the best parameters, and video post-processing. Each of these steps is required to make a convincing deepfake. The following are important factors for creating the most photorealistic deepfake:

  • adequate GPU hardware
  • source footage of sufficient length, with even lighting and high resolution
  • adequate lighting matched between source and destination footage
  • source and destination subjects with similar appearance (head shape and size, facial-hair style and quantity, gender, and skin tone)
  • video capturing all head angles and the mouth shapes of all phonemes
  • using the right model for training
  • performing post-production editing of the deepfake

This process involves much trial and error, with information gathered from many disparate sources (forums, articles, publications, etc.). Creating a deepfake is therefore as much an art as a science, and because of the non-academic nature of deepfake creation, it may remain that way for some time.

State of Detection Technology: A Game of Cat and Mouse

A rush of new research has introduced several deepfake video-detection (DVD) methods. Some of these methods claim detection accuracy in excess of 99 percent in special cases, but such accuracy reports should be interpreted cautiously. The difficulty of detecting video manipulation varies widely based on several factors, including the level of compression, image resolution, and the composition of the test set.

A recent comparative analysis of the performance of seven state-of-the-art detectors on five public datasets commonly used in the field showed a wide range of accuracies, from 30 percent to 97 percent, with no single detector significantly better than another. The detectors typically had wide-ranging accuracies across the five test datasets. Detectors are typically tuned to look for a certain type of manipulation, and when they are applied to novel data, they often do not perform well. So, while many efforts are underway in this area, no current detector is vastly better than the others.

Regardless of the accuracy of current detectors, DVD is a game of cat and mouse. Advances in detection methods alternate with advances in deepfake-generation methods. Successful defense will require repeatedly improving on DVD methods by anticipating the next generation of deepfaked content.

Adversaries will probably soon extend deepfake methods to produce videos that are increasingly dynamic. Most existing deepfake methods produce videos that are static in the sense that they depict stationary subjects with constant lighting and unmoving background. But deepfakes of the future will incorporate dynamism in lighting, pose, and background. The dynamic attributes of these videos have the potential to degrade the performance of existing deepfake-detection models. Equally concerning, the use of dynamism could make deepfakes more credible to human eyes. For example, a video of a foreign leader talking as she rides past on a golf cart would be more engaging and lifelike than if the same leader were to speak directly to the camera in a static studio-like scene.

To confront this threat, the academic and corporate worlds are engaged in creating detector models, based on DNNs, that can detect various types of deepfaked media. Facebook has been a major contributor, holding the Deepfake Detection Challenge (DFDC) in 2019, which awarded a total of US$1 million to the top five winners.

Participants were charged with creating a detector model trained and validated on a curated dataset of 100,000 deepfake videos, created by Facebook with help from Microsoft and several academic institutions. While the dataset was originally available only to competition participants, it has since been released publicly. Of the more than 35,000 models submitted, the winning one achieved an accuracy of 65 percent on a reserved test set of 10,000 videos, which was not available to participants during training, and 82 percent on the validation set used during model training. The discrepancy in accuracy between the validation and test sets indicates some amount of overfitting, and therefore a lack of generalizability, an issue that tends to plague DNN-classification models.

Knowing the many elements required to make a photorealistic deepfake (high-quality source footage of the proper length, similar appearance between source and destination, the proper model for training, and skilled post-production) suggests how to identify one. One major element would be training a detector on enough deepfakes of different types and qualities, covering the range of possible flaws, using a model complex enough to extract this information. A possible next step would be to supplement that dataset of deepfakes with a public source, such as the dataset from the Facebook DFDC, to build a detector model.
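
A hedged sketch of that approach follows: a frame-level binary classifier built on a pretrained backbone and trained on face crops labeled real or fake. The directory layout, image size, and the choice of EfficientNetB0 are illustrative assumptions, not a method prescribed by the DFDC.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

IMG = 224  # input resolution expected by the backbone (illustrative)

# Hypothetical directory layout: face crops sorted into "real" and "fake" folders,
# mixing in-house deepfakes with public data such as the DFDC release.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "detector_data/train", image_size=(IMG, IMG), batch_size=32, label_mode="binary")
val_ds = tf.keras.utils.image_dataset_from_directory(
    "detector_data/val", image_size=(IMG, IMG), batch_size=32, label_mode="binary")

# Pretrained backbone with a small binary head: real vs. deepfake frame.
backbone = tf.keras.applications.EfficientNetB0(
    include_top=False, weights="imagenet", pooling="avg", input_shape=(IMG, IMG, 3))
backbone.trainable = False  # train only the classification head at first

inp = layers.Input(shape=(IMG, IMG, 3))
x = backbone(inp, training=False)
out = layers.Dense(1, activation="sigmoid")(x)
detector = Model(inp, out)

detector.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
detector.fit(train_ds, validation_data=val_ds, epochs=5)
```

Evaluating such a detector on datasets it was never trained on is essential, given the generalization problems described above.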

Looking Ahead

Network defenders need to understand the state of the art and the state of the practice of deepfake technology from the side of the perpetrators. Our SEI team has begun examining the detection side of deepfake technology, and we plan to use our knowledge of deepfake generation to improve on existing deepfake-detection models and software frameworks.