Beyond the Illusion: Ensemble Learning for Effective Voice Deepfake Detection

Ghulam Ali, Javed Rashid, Muhammad Rameez Ul Hussnain, Muhammad Usman Tariq, Anwar Ghani, Daehan Kwak

Research output: Contribution to journal › Article › peer-review

Abstract

Deepfakes, synthetic media manipulated using artificial intelligence to mimic authenticity, have become more dangerous in the modern digital era. Despite significant progress in video deepfake detection, audio deepfake detection still relies on specific datasets and machine learning algorithms. This study addresses this limitation by developing a deep ensemble learning approach for audio deepfake detection, called Ensemble Convolutional Neural Network-Mel Frequency Cepstral Coefficient (ECN-MF). The ECN-MF model combines a recurrent neural network (RNN), a 1D convolutional neural network (1D CNN), a long short-term memory (LSTM) network, and a convolutional long short-term memory (ConvLSTM) network, and draws on a wider range of audio features, including mel-frequency cepstral coefficients (MFCCs), chroma features, and zero-crossing rate. Data were processed through a preprocessing pipeline comprising feature extraction, dimensionality standardization, and data normalization. The summarized features were combined into a feature vector, which was normalized and then standardized to improve consistency and stability across audio samples. The proposed ECN-MF model was evaluated on the Fake-or-Real dataset, which comprises four sub-datasets ('for-original', 'for-norm', 'for-2sec', and 'for-rerec') categorized by audio duration and bit rate. To evaluate the performance of the proposed ensemble model, we used all sub-datasets, each containing fake and real audio. The proposed ensemble approach achieved a state-of-the-art accuracy of 99.5% on the 'for-original' sub-dataset, closely matching the CNN model at 99.6%. It also achieved accuracies of 98% on 'for-norm', 96.9% on 'for-2sec', and 92.8% on 'for-rerec'. Applying the proposed ensemble model to the 'for-merged' dataset, which comprises all sub-datasets, yielded a state-of-the-art accuracy of 98%. These results demonstrate the effectiveness of the proposed approach, which outperforms the results of the individual models.
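
The pipeline described in the abstract (per-frame features summarized into one vector per clip, standardized across clips, then scored by several models) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the function names are hypothetical, the z-score step is one plausible reading of "normalized and then standardized", and averaging the models' probabilities (soft voting) is an assumption, since the abstract does not state how the ensemble combines its members. Only the zero-crossing rate is computed here; MFCC and chroma features would normally come from an audio library.

```python
from statistics import mean, pstdev


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def summarize(frame_features):
    """Collapse per-frame feature rows into one fixed-length vector
    by taking the mean of each feature column (assumed summary statistic)."""
    return [mean(col) for col in zip(*frame_features)]


def standardize(vectors):
    """Z-score each feature column across clips.
    Columns with zero spread are left centred at 0 (divisor falls back to 1)."""
    cols = list(zip(*vectors))
    mus = [mean(c) for c in cols]
    sigmas = [pstdev(c) or 1.0 for c in cols]
    return [
        [(x - m) / s for x, m, s in zip(v, mus, sigmas)] for v in vectors
    ]


def soft_vote(prob_fake_per_model, threshold=0.5):
    """Assumed combination rule: average the per-model probabilities that a
    clip is fake and threshold the mean."""
    p = mean(prob_fake_per_model)
    return ("fake" if p >= threshold else "real"), p


# Hypothetical scores from four ensemble members (e.g. RNN, 1D CNN, LSTM, ConvLSTM).
label, score = soft_vote([0.9, 0.8, 0.4, 0.7])
```

In this sketch a single dissenting model (0.4) is outvoted by the other three, which is the usual motivation for combining models with different inductive biases.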

Original language: English
Journal: IEEE Access
State: Accepted/In press - 2024

Keywords

  • 1D CNN
  • ASVSpoof
  • ASVSpoof-2019
  • ConvLSTM
  • ensemble model
  • LSTM
  • machine learning approach
  • RNN
  • Voice deepfake
