TY - GEN
T1 - Leveraging Data Augmentation and Large Language Models for Enhanced COVID-19 Tweet Classification
AU - Landaverde, Eric
AU - Spencer, Adam
AU - Kwak, Daehan
N1 - Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2025.
PY - 2025
Y1 - 2025
N2 - This study explores the integration of Large Language Models with traditional machine learning techniques to classify COVID-19-related tweets, leveraging the lightweight Mistral model developed by Mistral AI. Enhanced by word2vec-based data augmentation, the approach generates dynamic, class-defining attributes to produce high-quality, contextually relevant text. This dual-model framework combines Mistral’s preprocessing strengths with BERT and a Random Forest classifier to effectively address themes such as vaccines, masks, quarantine, and social distancing. This exploratory case study demonstrates that LLM labeling and strategic data augmentation can significantly improve accuracy on small datasets and provide a scalable solution for social media content analysis. The BERT model achieved 94% accuracy with simple augmentation and 91% accuracy with advanced augmentation, while the Random Forest model showed lower performance, classifying fewer examples correctly. The study highlights the effectiveness of LLM-generated labels and advanced data augmentation, particularly with the BERT model, in enhancing classification accuracy and semantic relevance and in reducing uncertainty. Future research will focus on expanding the range of data classification and improving the quality of dynamically generated class attributes to better capture semantic complexity and further enhance model performance.
AB - This study explores the integration of Large Language Models with traditional machine learning techniques to classify COVID-19-related tweets, leveraging the lightweight Mistral model developed by Mistral AI. Enhanced by word2vec-based data augmentation, the approach generates dynamic, class-defining attributes to produce high-quality, contextually relevant text. This dual-model framework combines Mistral’s preprocessing strengths with BERT and a Random Forest classifier to effectively address themes such as vaccines, masks, quarantine, and social distancing. This exploratory case study demonstrates that LLM labeling and strategic data augmentation can significantly improve accuracy on small datasets and provide a scalable solution for social media content analysis. The BERT model achieved 94% accuracy with simple augmentation and 91% accuracy with advanced augmentation, while the Random Forest model showed lower performance, classifying fewer examples correctly. The study highlights the effectiveness of LLM-generated labels and advanced data augmentation, particularly with the BERT model, in enhancing classification accuracy and semantic relevance and in reducing uncertainty. Future research will focus on expanding the range of data classification and improving the quality of dynamically generated class attributes to better capture semantic complexity and further enhance model performance.
KW - Data augmentation
KW - Large language models (LLM)
KW - Social media content analysis
KW - Text classification
UR - https://www.scopus.com/pages/publications/105005255326
U2 - 10.1007/978-3-031-86623-4_14
DO - 10.1007/978-3-031-86623-4_14
M3 - Conference contribution
AN - SCOPUS:105005255326
SN - 9783031866227
T3 - Communications in Computer and Information Science
SP - 174
EP - 190
BT - Artificial Intelligence and Applications - 26th International Conference, ICAI 2024, Held as Part of the World Congress in Computer Science, Computer Engineering and Applied Computing, CSCE 2024, Revised Selected Papers
A2 - Arabnia, Hamid R.
A2 - Deligiannidis, Leonidas
A2 - Amirian, Soheyla
A2 - Shenavarmasouleh, Farzan
A2 - Ghareh Mohammadi, Farid
A2 - de la Fuente, David
PB - Springer Science and Business Media Deutschland GmbH
T2 - 26th International Conference on Artificial Intelligence and Applications, ICAI 2024, held as part of the World Congress in Computer Science, Computer Engineering and Applied Computing, CSCE 2024
Y2 - 22 July 2024 through 25 July 2024
ER -