TY - JOUR
T1 - Beyond the Illusion
T2 - Ensemble Learning for Effective Voice Deepfake Detection
AU - Ali, Ghulam
AU - Rashid, Javed
AU - Ul Hussnain, Muhammad Rameez
AU - Tariq, Muhammad Usman
AU - Ghani, Anwar
AU - Kwak, Daehan
N1 - Publisher Copyright: © 2024 The Authors.
PY - 2024
Y1 - 2024
N2 - Deepfake synthetic media, manipulated using artificial intelligence to mimic authenticity, has become more dangerous in the modern digital era. Despite significant progress in video deepfake detection, audio deepfake detection relies on specific datasets and machine learning algorithms. This study addresses this limitation by developing a deep ensemble learning approach. The proposed deep ensemble approach for audio deepfake detection is called the Ensemble Convolutional Neural Network-Mel Frequency Cepstral Coefficient (ECN-MF). The ECN-MF model comprises a recurrent neural network, a 1D convolutional neural network, long short-term memory networks, and a convolutional long short-term memory network to extract a wider range of audio features, including mel-frequency cepstral coefficients, chroma features, and zero-crossing rate. Data were processed through a preprocessing pipeline comprising feature extraction, dimensionality standardization, and data normalization. The summarized features were incorporated into a feature vector, which was normalized and then standardized to enhance consistency and stability across audio samples. The proposed ECN-MF model was evaluated on the Fake-or-Real dataset. The dataset comprises four sub-datasets ('for-original', 'for-norm', 'for-2sec', and 'for-rerec'), categorized by audio duration and bit rate. To evaluate the performance of the proposed ensemble model, we utilized all sub-datasets with fake and real audio. The proposed ensemble approach achieved state-of-the-art accuracies of 99.5% on the 'for-original' sub-dataset, closely matching the CNN model at 99.6%. It also achieved accuracies of 98% on 'for-norm', 96.9% on 'for-2sec', and 92.8% on 'for-rerec' sub-datasets. By applying the proposed ensemble model to the 'for-merged' dataset, which comprises all sub-datasets, we obtained a state-of-the-art accuracy of 98%. These results demonstrate the effectiveness of the proposed approach, which outperforms the results of the individual models.
AB - Deepfake synthetic media, manipulated using artificial intelligence to mimic authenticity, has become more dangerous in the modern digital era. Despite significant progress in video deepfake detection, audio deepfake detection relies on specific datasets and machine learning algorithms. This study addresses this limitation by developing a deep ensemble learning approach. The proposed deep ensemble approach for audio deepfake detection is called the Ensemble Convolutional Neural Network-Mel Frequency Cepstral Coefficient (ECN-MF). The ECN-MF model comprises a recurrent neural network, a 1D convolutional neural network, long short-term memory networks, and a convolutional long short-term memory network to extract a wider range of audio features, including mel-frequency cepstral coefficients, chroma features, and zero-crossing rate. Data were processed through a preprocessing pipeline comprising feature extraction, dimensionality standardization, and data normalization. The summarized features were incorporated into a feature vector, which was normalized and then standardized to enhance consistency and stability across audio samples. The proposed ECN-MF model was evaluated on the Fake-or-Real dataset. The dataset comprises four sub-datasets ('for-original', 'for-norm', 'for-2sec', and 'for-rerec'), categorized by audio duration and bit rate. To evaluate the performance of the proposed ensemble model, we utilized all sub-datasets with fake and real audio. The proposed ensemble approach achieved state-of-the-art accuracies of 99.5% on the 'for-original' sub-dataset, closely matching the CNN model at 99.6%. It also achieved accuracies of 98% on 'for-norm', 96.9% on 'for-2sec', and 92.8% on 'for-rerec' sub-datasets. By applying the proposed ensemble model to the 'for-merged' dataset, which comprises all sub-datasets, we obtained a state-of-the-art accuracy of 98%. These results demonstrate the effectiveness of the proposed approach, which outperforms the results of the individual models.
KW - 1D CNN
KW - ConvLSTM
KW - LSTM
KW - RNN
KW - Voice deepfake
KW - ensemble model
KW - fake-or-real dataset
KW - machine learning approach
UR - https://www.scopus.com/pages/publications/85204108124
U2 - 10.1109/ACCESS.2024.3457866
DO - 10.1109/ACCESS.2024.3457866
M3 - Article
AN - SCOPUS:85204108124
SN - 2169-3536
VL - 12
SP - 149940
EP - 149959
JO - IEEE Access
JF - IEEE Access
ER -