Different voice spoofing attacks - summary by Bhusan Chettri

Voice authentication systems: are they secure? Can AI be used to fool them?

Bhusan Chettri explains how voice authentication systems can be fooled using AI and how they can be protected.

Although today's speaker verification systems, driven by deep learning and big data, show superior performance in verifying a speaker, they are not secure: they are prone to spoofing attacks. In this article, Dr. Bhusan Chettri gives an overview of the techniques used to spoof a voice authentication system built on automatic speaker verification (ASV) technology.

Spoofing attacks in ASV: an overview by Dr Bhusan Chettri

A spoofing attack (or presentation attack) is an attempt to gain illegitimate access to the personal data of a targeted user. These attacks are performed on a biometric system to provoke an increase in its false acceptance rate. The security threats posed by such attacks are now well acknowledged within the speech community.

As identified in the ISO/IEC 30107-1 standard, a biometric system can potentially be attacked at nine different points. Fig. 1 provides a summary of these. The first two attack points are of specific interest, as they are particularly vulnerable in terms of enabling an adversary to inject spoofed biometric data. Attacks at these two points are commonly referred to as physical access (PA) and logical access (LA) attacks. As illustrated in the figure, PA attacks involve a presentation attack at the sensor level (the microphone, in the case of ASV), whereas LA attacks involve modifying biometric samples to bypass the sensor. Text-to-speech and voice conversion techniques are used to produce artificial speech to bypass an ASV system; these two methods are examples of LA attacks. Mimicry and playing back speech recordings (replay), on the other hand, are examples of PA attacks.
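To make the link between spoofing and the false acceptance rate concrete, here is a minimal sketch in Python. It assumes an ASV system that outputs one similarity score per trial and accepts the trial when the score reaches a fixed threshold. The function name, the scores, and the threshold are all hypothetical values chosen for illustration; they do not come from any real system or dataset.

```python
def false_acceptance_rate(non_target_scores, threshold):
    """Fraction of non-target trials (impostors or spoofs) that are accepted."""
    if not non_target_scores:
        raise ValueError("at least one score is required")
    accepted = sum(1 for score in non_target_scores if score >= threshold)
    return accepted / len(non_target_scores)


# Hypothetical ASV scores (higher means "more like the target speaker").
zero_effort_impostors = [0.10, 0.20, 0.35, 0.40, 0.55]  # no spoofing attempt
spoofed_trials = [0.45, 0.60, 0.70, 0.80, 0.90]         # e.g. replayed or synthetic speech
threshold = 0.5                                         # assumed decision threshold

far_baseline = false_acceptance_rate(zero_effort_impostors, threshold)
far_spoofed = false_acceptance_rate(spoofed_trials, threshold)

print(f"FAR without spoofing: {far_baseline:.2f}")
print(f"FAR under spoofing:   {far_spoofed:.2f}")
```

With these made-up numbers, one of five zero-effort impostor trials is accepted, while four of five spoofed trials are accepted. This is the kind of increase in false acceptance rate that a spoofing attack aims to provoke.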

Figure 1: Possible locations [ISO/IEC, 2016] to attack an ASV system. 1: microphone point, 2: transmission point, 3: override feature extractor, 4: modify features, 5: override classifier, 6: modify speaker database, 7: modify biometric reference, 8: modify score and 9: override decision.

Below, Bhusan Chettri provides a brief summary of the four different spoofing methods used to fool an ASV system.

1. Mimicry (or Impersonation)

This form of attack involves an attacker attempting to modify their voice characteristics to sound like a target speaker. In other words, an attacker aims to transform their lexical and prosodic properties to sound as close as possible to the target speaker. This form of attack can therefore be highly effective when the attacker's voice is already similar to the target speaker's, as less effort is required to adjust it than when the two voices are less similar. The success of mimicry attacks thus often depends on the quality of the impersonation, suggesting that professional impersonators may be better at mimicking a target speaker's voice than inexperienced ones. Research has shown that successful attackers were able to shift their F0 (fundamental frequency), and sometimes their formants, close to those of the target speaker.

2. Speech synthesis

Speech synthesis, or text-to-speech (TTS), is a method of generating speech from a given text input that sounds as natural and intelligible as possible. It has a wide range of applications, including spoken dialogue systems, speech-to-speech translation, assisting people with vocal disorders, and automatic e-book reading, to name a few. Text analysis and speech waveform generation are the two main components of a typical TTS system. The text analysis component analyses the input text and produces a sequence of phonemes defining the linguistic specification of the text. Using these phonemes, the speech waveform generation module produces the speech waveform. In end-to-end deep learning frameworks, however, speech waveforms are generated directly from the input text.

3. Voice conversion

Voice conversion (VC) aims at converting the voice of one speaker to that of another. In the context of ASV spoofing, the source voice is the attacker's, which is converted to that of a target speaker to fool an ASV system. Typical VC systems operate directly on the speech signals of the source and target speakers, using a parallel corpus of the two speakers (speaking the same utterances) on which a transformation function is learned to convert the attacker's acoustic parameters to those of the target speaker. Applications of VC technologies include producing natural-sounding voices for people with speech disabilities and voice dubbing in the entertainment industry, to name a few.

4. Replay attacks

A replay spoofing attack involves playing back recorded speech samples of a target speaker (enrolled speaker) to bypass an ASV system. This type of attack requires physical transmission of the spoofed speech through the system microphone, shown as point 1 in Fig. 1. Replay is the simplest form of spoofing attack: it can be implemented using smartphones and does not require specific expertise in either speech processing or machine learning.

Bonafide (genuine) speech is speech spoken by a target speaker during enrollment (or the verification phase) and acquired by the ASV system's microphone. Replayed speech, on the other hand, is the speech signal obtained by playing back pre-recorded bonafide speech, which is then acquired by the system's microphone. The acoustic environment for the acquisition of bonafide speech and replayed speech can be the same, in situations where an attacker manages to launch the attack from the same physical space. In practice, however, the acoustic space is usually different (e.g. a different closed room or office with no background noise), as an attacker would not want to risk getting caught while launching such an attack. The factors of interest in detecting replay attacks are therefore the changes and noise induced in the bonafide speech by the loudspeaker of the playback device, the recording device, and the acoustic environment where the replay attack is simulated.
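The replay factors described above can be illustrated with a deliberately simplified sketch in Python. It assumes that each stage of the replay chain (the attacker's recording device, the playback loudspeaker, and the room where the attack takes place) behaves as a linear time-invariant filter described by a short impulse response, with optional additive Gaussian noise at the end. Real devices and rooms are more complex than this, and the function names, impulse responses, and signal values below are hypothetical, chosen only for illustration.

```python
import random


def convolve(signal, impulse_response):
    """Plain discrete convolution of two finite sequences."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def simulate_replay(bonafide, recorder_ir, loudspeaker_ir, room_ir,
                    noise_std=0.0, seed=0):
    """Pass bonafide speech through an assumed replay chain.

    Each *_ir argument is a hypothetical impulse response; noise_std is the
    standard deviation of additive Gaussian noise in the replay environment.
    """
    rng = random.Random(seed)
    x = convolve(bonafide, recorder_ir)   # attacker's recording device
    x = convolve(x, loudspeaker_ir)       # loudspeaker of the playback device
    x = convolve(x, room_ir)              # acoustic environment of the replay
    return [sample + rng.gauss(0.0, noise_std) for sample in x]


# Toy example: a unit impulse stands in for a bonafide speech signal.
bonafide = [1.0, 0.0, 0.0, 0.0]
replayed = simulate_replay(
    bonafide,
    recorder_ir=[1.0],            # ideal recorder (assumption)
    loudspeaker_ir=[0.5, 0.5],    # smoothing, low-pass-like loudspeaker
    room_ir=[1.0, 0.0, 0.25],     # direct path plus one weak echo
)
print(replayed)
```

Even in this toy setting, the replayed signal is longer and smoother than the original and carries an echo. Distortions of this kind, left by the devices and the acoustic environment, are what a replay countermeasure looks for.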

It is therefore very important to secure these systems against manipulation. To this end, spoofing countermeasures are often integrated within the verification pipeline, and voice spoofing countermeasures are currently an active research topic within the speech research community. In the next article, Dr Bhusan Chettri will talk more about how AI and big data can be used to design anti-spoofing solutions to protect voice authentication systems from spoofing attacks.

References

[1] Bhusan Chettri scholar and personal website.
[2] M. Sahidullah et al. Introduction to Voice Presentation Attack Detection and Recent Advances, 2019.
[3] Bhusan Chettri. Voice biometric system security: Design and analysis of countermeasures for replay attacks. PhD thesis, Queen Mary University of London, August 2020.
[4] ASVspoof: The automatic speaker verification spoofing and countermeasures challenge website.
