SenseTime's AI creates realistic deepfake videos

22 January 2020 · By Horst Buchwald

Hong Kong, 22 January 2020

Earlier this week, SenseTime and academic researchers presented a method for editing target portraits that uses an audio sequence to synthesize photorealistic video. In an online report, the researchers described the new method for creating deepfakes as unique "because it is highly dynamic".

Thus, deepfakes – media in which a person in an existing image, audio or video recording is replaced with someone else's likeness – are becoming increasingly convincing. In this lucrative market, competition is particularly fierce in Asia.

In late 2019, researchers at Seoul-based Hyperconnect developed a tool (MarioNETte) that can manipulate the facial features of a historical figure, a politician or a CEO with nothing more than a webcam and still images.

A team from Hong Kong-based technology giant SenseTime, Nanyang Technological University and the Institute of Automation of the Chinese Academy of Sciences recently proposed a method for manipulating target portrait photos that uses audio sequences to synthesize photorealistic video. Unlike MarioNETte, SenseTime's technology is dynamic, meaning it is better able to handle media it has never encountered before. And the results are impressive, albeit disturbing in light of recent developments regarding deepfakes.

The study's co-authors note that the task of "many-to-many" audio-to-video translation – that is, translation that is not restricted to a single identity shared between the source audio and the target video – is a challenge. Typically, only a small number of videos are available to train an AI system, and any method has to cope with large audio-video variations between subjects and with a lack of knowledge about scene geometry, materials, lighting and dynamics.

To overcome these challenges, the team uses the expression parameter space – the values describing facial expressions in a parametric face model, defined before training begins – as the target space for the audio-to-video mapping. They say this helps the system learn the mapping more effectively than regressing raw pixels, because expressions are more semantically relevant to the audio source and the parameters can be generated and manipulated by machine-learning algorithms.
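
The article does not spell out the network itself, so here is a minimal, hypothetical sketch of the idea of regressing expression parameters rather than raw pixels. All layer choices, names and dimensions below are illustrative assumptions, not the authors' design.

```python
# Minimal sketch (assumptions, not the authors' code): map a window of
# audio features to per-frame expression parameters of a parametric
# 3D face model, instead of predicting raw pixels.
import torch
import torch.nn as nn

class AudioToExpression(nn.Module):
    """Regresses expression parameters from audio features (e.g. MFCCs)."""

    def __init__(self, audio_dim: int = 28, hidden_dim: int = 128,
                 expr_dim: int = 64):
        super().__init__()
        # A recurrent encoder captures temporal context in the audio.
        self.encoder = nn.LSTM(audio_dim, hidden_dim, batch_first=True)
        # A linear head emits one expression vector per audio frame.
        self.head = nn.Linear(hidden_dim, expr_dim)

    def forward(self, audio: torch.Tensor) -> torch.Tensor:
        # audio: (batch, frames, audio_dim)
        features, _ = self.encoder(audio)
        return self.head(features)  # (batch, frames, expr_dim)

# Usage: expression parameters for two clips of 100 audio frames each.
model = AudioToExpression()
mfcc = torch.randn(2, 100, 28)
print(model(mfcc).shape)  # torch.Size([2, 100, 64])
```

Because the target space is a compact, semantically meaningful set of parameters rather than millions of pixels, a small model along these lines can plausibly be trained on far less video data.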

The expression parameters generated by the researchers – combined with geometry and pose parameters of the target person – are used to reconstruct a three-dimensional face mesh with the same identity and head pose as the target, but with lip movements that match the phonemes (perceptually distinct units of sound) of the source audio. A specialized component ensures that the translation from audio to expression remains agnostic to the identity of the source speaker, so that the translation is robust to variations in different people's voices and in the source audio. Finally, the system extracts landmark features from the person's mouth region to ensure that every movement is accurately mapped: it first renders the landmarks as heat maps and then combines those heat maps with frames from the source video, using both as input to a network that completes the mouth region.
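
The mouth-region heat maps are easy to illustrate. The sketch below is an illustrative assumption (the article does not give the exact recipe): it renders mouth landmarks as Gaussian heat-map channels that could be stacked channel-wise with source-video frames as input to a mouth-completion network.

```python
# Minimal sketch (assumptions, not the paper's implementation): render
# mouth landmarks as Gaussian heat maps for a mouth-completion network.
import numpy as np

def landmarks_to_heatmaps(landmarks: np.ndarray, size: int = 64,
                          sigma: float = 2.0) -> np.ndarray:
    """Returns one Gaussian heat-map channel per (x, y) landmark."""
    ys, xs = np.mgrid[0:size, 0:size]
    channels = [
        np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
        for (x, y) in landmarks
    ]
    return np.stack(channels)  # (num_landmarks, size, size)

# Usage: three hypothetical mouth landmarks in a 64x64 crop.
mouth = np.array([[20.0, 40.0], [32.0, 44.0], [44.0, 40.0]])
heatmaps = landmarks_to_heatmaps(mouth)
print(heatmaps.shape)  # (3, 64, 64)
# In the described pipeline, these channels would be concatenated with
# the corresponding video frames before mouth-region completion.
```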

The researchers say that in a study in which 100 volunteers were asked to rate the realism of 168 video clips, half of them synthesized by the system, the synthesized video was judged "real" in 55% of cases, while ground-truth footage was judged real 70.1% of the time. They attribute this to their system's superior ability to capture teeth and details of facial texture, as well as features such as the corners of the mouth and the nasolabial folds (the indentation lines on both sides of the mouth that run from the edge of the nose to the outer corners of the mouth).

The researchers admit that their system could be abused or misused for "various malicious purposes", such as media manipulation or the "dissemination of malicious propaganda". As remedies, they suggest "protective measures" and the adoption and enforcement of laws requiring edited videos to be marked as such. "Because we are at the forefront of developing creative and innovative technologies, we are working to develop methods to identify edited videos as a countermeasure," they wrote. "We also encourage the public to act as watchdogs and report suspicious-looking videos to the authorities. By working together, we can promote cutting-edge and innovative technologies without compromising the public's personal interest."

Unfortunately, these proposals hardly seem adequate to stem the flood of AI-generated deepfakes like the ones described above. The Amsterdam-based cybersecurity start-up Deeptrace found 14,698 deepfake videos on the internet at its last count in June and July, up from 7,964 the previous December – an increase of 84% in just seven months. This is worrying not only because deepfakes could be used, for example, to influence public opinion during an election or to implicate someone in a crime they didn't commit, but also because the technology has already been used to generate pornographic material and to defraud companies out of hundreds of millions of dollars.
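
The cited growth rate follows directly from Deeptrace's two counts:

```python
# Sanity check of the growth figure cited above.
december, summer = 7_964, 14_698
print(f"{(summer - december) / december:.1%}")  # 84.6%, i.e. roughly 84%
```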