TY - JOUR
T1 - Beyond the Illusion: Ensemble Learning for Effective Voice Deepfake Detection
AU - Ali, Ghulam
AU - Rashid, Javed
AU - Hussnain, Muhammad Rameez Ul
AU - Tariq, Muhammad Usman
AU - Ghani, Anwar
AU - Kwak, Daehan
N1 - Publisher Copyright: © 2013 IEEE.
PY - 2024
Y1 - 2024
AB - Deepfakes, synthetic media manipulated using artificial intelligence to mimic authenticity, have become more dangerous in the modern digital era. Despite significant progress in video deepfake detection, audio deepfake detection still relies on specific datasets and machine learning algorithms. This study addresses this limitation by developing a deep ensemble learning approach for audio deepfake detection, called Ensemble Convolutional Neural Network-Mel Frequency Cepstral Coefficient (ECN-MF). The ECN-MF model comprises a recurrent neural network, a 1D convolutional neural network, a long short-term memory network, and a convolutional long short-term memory network to extract a wider range of audio features, including mel-frequency cepstral coefficients, chroma features, and zero-crossing rate. Data were processed through a preprocessing pipeline comprising feature extraction, dimensionality standardization, and data normalization. The summarized features were combined into a feature vector, which was then normalized and standardized to enhance consistency and stability across the audio samples. The proposed ECN-MF model was evaluated on the Fake-or-Real dataset, which comprises four sub-datasets ('for-original', 'for-norm', 'for-2sec', and 'for-rerec') categorized by audio duration and bit rate. All sub-datasets, containing both fake and real audio, were used to evaluate the proposed ensemble model. The proposed ensemble approach achieved a state-of-the-art accuracy of 99.5% on the 'for-original' sub-dataset, closely matching the CNN model at 99.6%. It also achieved accuracies of 98% on 'for-norm', 96.9% on 'for-2sec', and 92.8% on 'for-rerec'. On the 'for-merged' dataset, which comprises all sub-datasets, the proposed ensemble model obtained a state-of-the-art accuracy of 98%. These results demonstrate the effectiveness of the proposed approach, which outperforms the individual models.
KW - 1D CNN
KW - ASVSpoof
KW - ASVSpoof-2019
KW - ConvLSTM
KW - ensemble model
KW - LSTM
KW - machine learning approach
KW - RNN
KW - Voice deepfake
UR - http://www.scopus.com/inward/record.url?scp=85204108124&partnerID=8YFLogxK
U2 - 10.1109/ACCESS.2024.3457866
DO - 10.1109/ACCESS.2024.3457866
M3 - Article
AN - SCOPUS:85204108124
SN - 2169-3536
JO - IEEE Access
JF - IEEE Access
ER -