In 2021, in the midst of the coronavirus pandemic, fraudsters began using videoconferencing in their business email compromise (BEC) campaigns. In this version of the imposter scam, the attacker donned the persona of a business executive, such as the CEO, and used video to trick employees into sending money.
Attackers used a still picture of the business executive and deep-neural-network-generated audio mimicking the executive’s voice to create deepfakes that “instruct employees to initiate wire transfers or use the executives’ compromised email to provide wiring instructions,” the FBI stated in its 2022 Congressional report on wire fraud. The technique helped fraudsters inflict more than $2.7 billion in business email compromise losses in 2022, according to the FBI’s Internet Crime Report 2022, published by the Internet Crime Complaint Center (IC3).
The data underscores that attackers are increasingly using generative AI technologies to create better scams, fueling the need for more widespread use of defenses that can detect AI-generated images and audio, says Vijay Balasubramaniyan, co-founder and CEO of Pindrop, a voice-identity security firm.
“Right now, we’re able to detect a lot of these deepfakes — in fact, not just detecting the deepfakes, but detect the engine that’s creating the deepfakes … with 99% accuracy,” he says. “We are very, very good at detecting them right now, but this space, it changes so drastically and so quickly — it’s something where you have to constantly work hard to keep up.”
So far, current defenses outperform humans. Telltale artifacts of machine generation — the weird hands in images generated by DALL-E and the strange intonations of audio created by AI synthesis — are rapidly disappearing and, with them, a person’s ability to quickly spot machine-generated content. At present, humans can correctly detect deepfake videos only 57% of the time, while a leading machine-learning detection model could accurately identify a deepfake 84% of the time, according to research published in January 2022 by the Massachusetts Institute of Technology and Johns Hopkins University.
With potential million-dollar paydays at stake, humans will have to rely on their own AI assistants to combat AI-enabled fraudsters.
First Defense: Detect Liveness
The state of the art is not just facial or voice matching, but detecting whether a live human is on the other side of the microphone or camera. The technology goes beyond analyzing an image or audio file for artifacts that could indicate AI generation; it also examines the file’s metadata and the recording environment.
Detecting whether an image was injected into the camera’s pipeline, whether the background noise is consistent throughout the clip, and whether there are other signs of spoofing tells you a lot, says Stuart Wells, chief technology officer of Jumio, an identity verification firm.
“It’s layered defense,” he says. “The architecture that we have and are pursuing is basically using multimodal biometrics, multimodal liveness detection, and then a battery of anti-spoofing techniques, because it doesn’t matter how good your anti-spoofing techniques are, the fraudsters find a hole in them.”
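To make the idea concrete, here is a rough sketch of what one layer of such a battery might look like: a check that the background-noise floor stays consistent across an audio clip, since abrupt shifts can indicate splicing, injection, or playback from a different environment. This is an illustration only, not Jumio’s or Pindrop’s implementation; the function name, 20 ms framing, quantile-based noise estimate, and 6 dB threshold are all assumptions made for the example.

```python
# Hypothetical sketch of one "liveness" layer: flag audio whose estimated
# background-noise floor jumps between segments, a possible sign of splicing
# or replayed/synthesized audio. Thresholds are illustrative, not vendor values.
import numpy as np
from scipy.io import wavfile

def noise_floor_is_consistent(path: str, segment_sec: float = 1.0,
                              quantile: float = 0.1,
                              max_spread_db: float = 6.0) -> bool:
    """Return True if the clip's background-noise floor stays roughly constant."""
    rate, samples = wavfile.read(path)
    if samples.ndim > 1:                          # mix stereo down to mono
        samples = samples.mean(axis=1)
    samples = samples.astype(np.float64)
    samples /= np.abs(samples).max() + 1e-12      # normalize to [-1, 1]

    frame = int(0.02 * rate)                      # 20 ms analysis frames
    seg = int(segment_sec * rate)
    floors = []
    for start in range(0, len(samples) - seg + 1, seg):
        segment = samples[start:start + seg]
        n_frames = len(segment) // frame
        frames = segment[:n_frames * frame].reshape(n_frames, frame)
        rms = np.sqrt((frames ** 2).mean(axis=1) + 1e-12)
        # a low quantile of frame energy approximates the background noise
        floors.append(np.quantile(rms, quantile))

    if len(floors) < 2:
        return True                               # clip too short to judge
    floors_db = 20 * np.log10(np.array(floors))
    return floors_db.max() - floors_db.min() <= max_spread_db
```

In a layered system of the kind Wells describes, a signal like this would be only one input among many, weighed alongside biometric matching, artifact detection, and metadata checks before any decision is made.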
Liveness detection is not just a defense against deepfakes but also against replay attacks, in which a voice is recorded and played back at the right moment, a technique famously employed in the movie Sneakers (“My voice is my passport. Verify me.”) and still used in fraud today. In a variety of cases, AI-generated voices have bypassed voice checks at financial institutions, from a Wall Street Journal reporter defeating financial firm Chase’s voice identification to Do Not Pay’s founder using a deepfake and generative AI to get a refund from Wells Fargo.
Yet companies often do not have technology in place to detect liveness, says Pindrop’s Balasubramaniyan.
“If you look at all of these banks, where journalists have gotten through their own technology, the technologies they’ve used don’t have liveness detection — [they] don’t check to see if it is a deepfake,” he says. “In fact, you know, we’ve tested against these organizations behind the scenes, and we’ve seen that many of them don’t even prevent a replay attack, so you could just create a replay and beat the system comfortably.”
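As a sense of how basic such a check can be, the sketch below compares a new voice sample against fingerprints of earlier authentication attempts; genuine repeat utterances are never nearly identical, so a near-perfect spectral match suggests a recording is being replayed. Again, this is an assumption-laden illustration, not any vendor’s method; the fingerprint scheme and the 0.98 correlation threshold are invented for the example.

```python
# Hypothetical replay check: compare a new sample's coarse spectral fingerprint
# against fingerprints stored from earlier authentications. A correlation near
# 1.0 suggests the exact same recording is being played back. Illustrative only;
# assumes clips of at least roughly one second of audio.
import numpy as np

def fingerprint(samples: np.ndarray, rate: int, n_bands: int = 32) -> np.ndarray:
    """Coarse fingerprint: normalized log energy in logarithmically spaced bands."""
    spectrum = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / rate)
    edges = np.logspace(np.log10(50.0), np.log10(rate / 2.0), n_bands + 1)
    bands = [spectrum[(freqs >= lo) & (freqs < hi)].mean() + 1e-12
             for lo, hi in zip(edges[:-1], edges[1:])]
    vec = np.log(np.array(bands))
    return (vec - vec.mean()) / (vec.std() + 1e-12)

def looks_like_replay(new_sample: np.ndarray, rate: int,
                      stored_fingerprints: list[np.ndarray],
                      threshold: float = 0.98) -> bool:
    """Flag the sample if it correlates almost perfectly with a past sample."""
    new_fp = fingerprint(new_sample, rate)
    for old_fp in stored_fingerprints:
        correlation = float(np.dot(new_fp, old_fp) / len(new_fp))
        if correlation > threshold:
            return True
    return False
```

A real deployment would pair such a check with liveness analysis rather than rely on it alone, since a fraudster can trivially perturb a recording to slip past naive fingerprint matching.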
Tactics Will Continue to Evolve
Prerecorded clips, however, will give way to voice conversion attacks, in which a generative AI system is trained to turn a fraudster’s voice into the target’s voice in real time, as well as to more efficient tools, such as Microsoft’s VALL-E, which can create a credible, though not undetectable, audio deepfake from a three-second clip.
For defenders, that means there will be no permanent solution, just the cat-and-mouse game that has become a matter of course for cybersecurity defenders, says Jumio’s Wells.
“This technology is just going to get better and better,” he says. “The number of people working in this particular area — both in terms of using deepfakes and also in terms of detecting deepfakes — is going to continue to be an area of big investment, both for good and for evil.”
Balasubramaniyan maintains that undetectable deepfakes will be possible, but only for attackers with extensive resources and a great deal of source material from which to create the fake. Rather than trying to detect those bespoke attacks, defenders are better served by focusing on stopping deepfakes deployed at scale, he says.
“There is a path where you raise the bar to such a high extent that only someone with incredible compute and incredible storage and algorithms that are not public could create a super-secret deepfake,” he says. “If you raise the bar that hard, it isn’t an attack at scale. … But at scale, I’m still fairly confident we will be able to detect these attacks.”