With the advent of deepfakes, it’s becoming even harder to separate truth from fiction. Deepfakes have the potential to damage almost anything they target—the 2020 political campaigns, local or international news and current events, even someone’s divorce proceedings. Virtually every news organization, social media platform, and technology company working in video is exploring how to detect and counteract deepfakes, but when we began researching this article, we found that few wanted to talk about it yet.
“If technology can create something, technologies should be able to detect whether it’s an authentic video or whether it has been edited,” says Petr Peterka, chief technology officer at RadioMobile and former CTO at Verimatrix, where he focused on anti-piracy and forensic video security.
“Deepfakes take your real image, knowledge about how your lips should move when you are pronouncing different sounds, and then it artificially, like CGI graphics, creates the movement of your lips to make it sound like you’ve said something,” says Peterka. “The same thing happens with audio. If somebody analyzes your voice, then they can synthesize something that you said that has your tone of voice, your accent, and everything else.”
This is a nascent field, but in principle the content security business Peterka was previously in had very similar problems to solve. “Knowing some of the video capturing and editing and compression technologies, I think there must be a way to detect it.”
Deepfakes’ Troubling Spread
Deepfakes have primarily been used to place the faces of famous female entertainers onto pornography, according to “The State of Deepfakes,” a research study conducted by Deeptrace, a technology company that detects and monitors synthetic videos.
“Our research revealed that the deepfake phenomenon is growing rapidly online, with the number of deepfake videos almost doubling over the last seven months to 14,678,” wrote Giorgio Patrini, Deeptrace’s founder, CEO, and chief scientist, in September 2019.
“This increase is supported by the growing commodification of tools and services that lower the barrier for non-experts to create deepfakes.” The company’s research identified multiple forums, open source code, apps, and platforms for deepfake content creation. According to the report, one service needs 250 images and two days of processing to create a realistic deepfake.
A whopping 96% of the deepfake videos Deeptrace found online were porn, with the top four websites garnering more than 134 million views. The remaining 4% of deepfake videos were found on YouTube and ran the gamut from politicians to corporate figures. The quality of many of the current deepfakes, including content manipulated to appear as if it features Nancy Pelosi, Donald Trump, Barack Obama (notably in the video created by BuzzFeed shown as the opening image to this article), Mark Zuckerberg and others, is poor enough that some refer to them as “cheapfakes.” The concern, though, is that the technology is improving so fast that it will soon be much more difficult to distinguish a deepfake from the real thing.
That’s Not a Real Cat—Or a Real Person
The term “deepfake” comes from “deep learning” and “fake.” The AI technique associated with creating deepfakes is rooted in a class of machine learning called generative adversarial networks (GANs), in which two neural networks compete: a generator learns from a training set and produces new content, while a discriminator tries to tell the generated content from the real thing.
“It’s all about the artificial neural network’s capability to capture either actual entities and learn the representations or even learn the stylistic representations of the data and encode this information,” says Dr. Mika Rautiainen, CEO/CTO of Finnish company Valossa Labs. “That information [is used to create] generative models, that basically creates new content.”
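To make that adversarial setup concrete, here is a minimal sketch of a GAN training loop in PyTorch. It uses toy two-dimensional data rather than faces and is illustrative only; production deepfake generators apply the same generator-versus-discriminator contest to much larger convolutional networks.

```python
import torch
import torch.nn as nn

# Toy generator and discriminator over 2-D points; real face generators
# (e.g., StyleGAN) use deep convolutional versions of the same idea.
G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))
D = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(1000):
    real = torch.randn(32, 2) + 3.0      # stand-in for the "training set"
    fake = G(torch.randn(32, 16))        # generator's attempt at new content

    # Discriminator: learn to tell real samples from generated ones
    opt_d.zero_grad()
    d_loss = loss_fn(D(real), torch.ones(32, 1)) + \
             loss_fn(D(fake.detach()), torch.zeros(32, 1))
    d_loss.backward()
    opt_d.step()

    # Generator: learn to fool the discriminator into scoring fakes as real
    opt_g.zero_grad()
    g_loss = loss_fn(D(fake), torch.ones(32, 1))
    g_loss.backward()
    opt_g.step()
```

As the two networks push against each other, the generator’s output drifts toward the statistics of the training data, which is exactly why well-trained fakes become hard to distinguish from the real thing.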
Valossa has developed image recognition software that identifies people, objects, and sentiment within video content and extracts metadata that describes what’s happening or can be used for highlights extraction. “Deepfakes have emerged as a side interest for the company, and we are looking into the technology and ways of developing it,” says Rautiainen.
To develop new content, whether it’s images, video, or audio, machine learning requires training data. To learn what a cat looks like, a system must be shown enough cats to learn all their permutations: cats lying down, sleeping, fighting, sitting, and standing, as well as things which are not cats. To manufacture footage of an existing person doing or saying something they never did, or to create a synthetic person, the system needs data to learn from. Early efforts weren’t very realistic (as you can see from the image at left).
Thispersondoesnotexist.com is an archive of simulated people, built by Phillip Wang. The site shows high-resolution images of fake people and has been viewed more than 50 million times. “We’ve found a way for computers to disentangle sets of data, understand all the patterns beneath it, and then be able to reconstruct and generate new data that’s indistinguishable from the training data. In this case, it’s faces, but it could be anything. You can have architecture diagrams or dental implants,” says Wang.
“High resolution is extremely important. The better the data quality, the better the output of this specific architecture [because] they have more to go on.” Thispersondoesnotexist.com was trained for a week on eight GPUs using 70,000 high-res images from Flickr’s open source data set. “[Video is] not at the point where it can be as shocking as images,” says Wang. “[Because of] computational limits, and also moving through time, there needs to be consistency in each frame.”
“I think solving the fake images problem is probably a proxy for solving the big video problems. A video is just a collection of image frames, so it’s in the same line of research,” says Wang. “I think if you do forensics on each of the frames, there’s definitely different tells you can find to figure out if it’s been modified.”
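As an illustration of the per-frame forensics Wang describes, the sketch below samples frames from a video with OpenCV and aggregates the scores of a frame-level classifier. The frame_classifier here is hypothetical, standing in for whatever manipulation detector a researcher has trained; this is a harness for the idea, not a detector itself.

```python
import cv2
import numpy as np

def score_video(path, frame_classifier, stride=10):
    """Run a per-frame detector and aggregate its scores across a video.

    frame_classifier is a hypothetical model that returns the probability
    that a single RGB frame has been manipulated."""
    cap = cv2.VideoCapture(path)
    scores, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:  # sample every Nth frame to bound compute
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            scores.append(frame_classifier(rgb))
        idx += 1
    cap.release()
    # The mean gives an overall verdict; high variance can expose videos
    # where only some frames (e.g., the swapped face) were altered.
    return float(np.mean(scores)), float(np.var(scores))
```

Per-frame scoring ignores the temporal consistency problem Wang mentions, which is why stronger detectors also compare adjacent frames rather than judging each one in isolation.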
One of hundreds of AI-generated deepfake images at the aptly named website This Person Does Not Exist.
Obtaining Training Data
The first step in countering deepfakes is identifying what actually is or is not a deepfake. “The industry doesn’t have a great data set or benchmark for detecting [deepfakes],” said Mike Schroepfer, CTO of Facebook, in a blog post. “We want to catalyze more research and development in this area and ensure that there are better open source tools to detect deepfakes. That’s why Facebook, the Partnership on AI, Microsoft, and academics from Cornell Tech, MIT, University of Oxford, UC Berkeley, University of Maryland, College Park, and University at Albany-SUNY are coming together to build the Deepfake Detection Challenge (DFDC).”
Participants will compete to create new ways of detecting and preventing manipulated media. Facebook is funding this to the tune of more than $10 million and is using paid actors, not Facebook user data. Amazon also joined the DFDC and is donating AWS credits and technical support. “The new dataset generated by Facebook [comprises] tens of thousands of example videos, both real and fake. Competitors will use this dataset to design novel algorithms which can detect a real or fake video,” wrote Michelle Lee, vice president of the Machine Learning Solutions Lab at AWS.
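For a sense of what working with such a dataset looks like, here is a brief sketch that tallies real and fake clips from a metadata file. The layout shown is illustrative, assuming a simple mapping from clip name to label; it is not necessarily the official DFDC schema.

```python
import json
from collections import Counter

# Assumed, illustrative layout: a metadata file mapping each clip to a
# "REAL" or "FAKE" label, as detection challenges typically provide.
with open("train/metadata.json") as f:
    metadata = json.load(f)

labels = Counter(entry["label"] for entry in metadata.values())
print(labels)  # e.g., Counter({'FAKE': ..., 'REAL': ...})

# Collect the manipulated clips for closer inspection
fake_clips = [name for name, entry in metadata.items()
              if entry["label"] == "FAKE"]
```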
“They are using actors to create deepfake videos, so the research community can try to find ways to dissect the media [at] pixel-level granularity, using different ways and methods to be able to decode or predict whether the media was actually real or fake,” says Rautiainen.
The question is whether using actors will create a good enough data set. A Canadian company called Dessa used Google’s reference set of synthetic videos created with paid actors. The New York Times reported that it failed 40% of the time using the reference data. When Dessa culled additional training content from other real-life scenarios, it was able to improve the success rate.
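The Dessa result suggests a simple sanity check: evaluate a detector on clips from a different source than the one it was trained on. The sketch below assumes a hypothetical detector interface that returns the probability a clip is fake; the names are illustrative, not any company’s actual API.

```python
import numpy as np

def accuracy(detector, videos, labels):
    # detector(video) -> probability the clip is fake (hypothetical interface)
    preds = np.array([detector(v) > 0.5 for v in videos], dtype=int)
    return float(np.mean(preds == np.array(labels)))

def generalization_gap(detector, staged_set, wild_set):
    # Each set is a (videos, labels) pair. A large gap between the two
    # accuracies means the detector learned artifacts of one production
    # pipeline (e.g., staged, actor-based clips) rather than deepfakes in
    # general -- the failure mode the Dessa experiment exposed.
    staged_acc = accuracy(detector, *staged_set)
    wild_acc = accuracy(detector, *wild_set)
    return staged_acc, wild_acc, staged_acc - wild_acc
```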
What Do We Know for Sure?
In order to detect something, knowing its common traits is essential. Today’s deepfakes are marked by “slight indications where there’s discontinuities in the motion or de-synchrony of the audio and visuals. There may be disjoint regions between the face and the head movement, coloring or lighting differences (average lighting color between the background and the foreground),” says Rautiainen. “By careful analysis of spoken tracks, you can map information and try to find correlations or de-correlations within the video.”
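One way to operationalize the audio-visual de-synchrony cue Rautiainen describes is to correlate mouth-region motion with the energy of the speech track. The sketch below is a crude heuristic, not Valossa’s method; it assumes the audio has been extracted to a separate file and that the mouth bounding box has already been located, for example by a face-landmark detector.

```python
import numpy as np
import cv2
import librosa

def lip_audio_correlation(video_path, audio_path, mouth_box):
    """Correlate mouth-region motion with speech energy.

    mouth_box is (x, y, w, h), assumed to come from a landmark detector."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    x, y, w, h = mouth_box
    motion, prev = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        roi = cv2.cvtColor(frame[y:y+h, x:x+w],
                           cv2.COLOR_BGR2GRAY).astype(np.float32)
        if prev is not None:
            motion.append(np.mean(np.abs(roi - prev)))  # frame-difference energy
        prev = roi
    cap.release()

    # Speech energy sampled at (roughly) the video frame rate
    audio, sr = librosa.load(audio_path, sr=None)
    rms = librosa.feature.rms(y=audio, hop_length=int(sr / fps))[0]
    n = min(len(motion), len(rms))

    # Genuine speech tends to correlate with lip motion; a weak or negative
    # correlation is one signal that the visuals and audio are out of sync.
    return float(np.corrcoef(np.array(motion[:n]), rms[:n])[0, 1])
```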
One issue is that detectors and manipulators look for the same things. If detection tools are looking for how crow’s feet appear at people’s eyes when they smile, we know that the deepfake creators are considering the same thing. “No one’s come up with a universal scanner to check for all the different techniques … in the way that we think of virus detection today,” says Charles Chu, chief product officer at Brightcove.
One potential method of detection is similar to what’s used for content protection today. “If I apply a watermark or if I apply a digital signature to a video, anybody trying to alter that video would either damage the signature or it would damage the watermark, and it would give me some indication that there’s something wrong with this video,” says Peterka.
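The signature idea Peterka describes can be illustrated in a few lines of Python. This sketch computes an HMAC over a video file’s raw bytes: any tampering changes the tag, so verification fails. It is a toy; a production system would need public-key signatures or watermarks robust enough to survive legitimate transcoding, which a byte-level check like this is not.

```python
import hashlib
import hmac

def sign_video(path: str, key: bytes) -> str:
    # Compute an HMAC-SHA256 tag over the raw file bytes; the publisher
    # would distribute this tag alongside the video.
    with open(path, "rb") as f:
        return hmac.new(key, f.read(), hashlib.sha256).hexdigest()

def verify_video(path: str, key: bytes, expected_tag: str) -> bool:
    # Any edit to the file, even a single frame, changes the bytes and
    # therefore the tag, signaling that the video has been altered.
    return hmac.compare_digest(sign_video(path, key), expected_tag)

# Example usage with a shared secret between publisher and verifier
key = b"shared-secret-key"
tag = sign_video("broadcast.mp4", key)
print(verify_video("broadcast.mp4", key, tag))  # False if the file was modified
```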