Leveraging Data Augmentation and Large Language Models for Enhanced COVID-19 Tweet Classification

Eric Landaverde, Adam Spencer, Daehan Kwak

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

Abstract

This study explores the integration of Large Language Models with traditional machine learning techniques to classify COVID-19 related tweets, leveraging the lightweight Mistral model developed by Mistral AI. Enhanced by word2vec-based data augmentation, the approach generates dynamic, class-defining attributes to produce high-quality, contextually relevant text. This dual-model framework combines Mistral’s preprocessing strengths with BERT and a Random Forest classifier to effectively address themes such as vaccines, masks, quarantine, and social distancing. This exploratory case study demonstrates that LLM labeling and strategic data augmentation can significantly improve accuracy on small datasets and provide a scalable solution for social media content analysis. The BERT model achieved 94% accuracy with simple augmentation and 91% accuracy with advanced augmentation, while the Random Forest model showed lower performance, classifying fewer examples correctly. The study highlights the effectiveness of LLM-generated labels and advanced data augmentation, particularly with the BERT model, in enhancing classification accuracy, semantic relevance, and reducing uncertainty. Future research will focus on expanding the range of data classification and improving the quality of dynamically generated class attributes to better capture semantic complexity and further enhance model performance.
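The abstract does not give implementation details for the word2vec-based augmentation step. As a rough illustration of the general idea only, the sketch below replaces a token with its nearest neighbor in an embedding space to produce a paraphrased training example. The toy three-dimensional vectors, the names `TOY_VECTORS`, `nearest_neighbor`, and `augment`, and the `replace_prob` parameter are all assumptions made for this example; they stand in for trained word2vec embeddings and are not the authors' method.

```python
import math
import random

# Hypothetical toy vectors standing in for trained word2vec embeddings.
# Real embeddings would have hundreds of dimensions and a large vocabulary.
TOY_VECTORS = {
    "vaccine": [0.9, 0.1, 0.0],
    "shot": [0.85, 0.15, 0.05],
    "mask": [0.1, 0.9, 0.0],
    "covering": [0.15, 0.85, 0.1],
    "quarantine": [0.0, 0.1, 0.9],
    "isolation": [0.05, 0.15, 0.85],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbor(word, vectors):
    """Return the most similar other word in the vocabulary, or None if unknown."""
    if word not in vectors:
        return None
    candidates = [
        (cosine(vectors[word], vec), other)
        for other, vec in vectors.items()
        if other != word
    ]
    return max(candidates)[1] if candidates else None


def augment(text, vectors, replace_prob=1.0, rng=None):
    """Create a variant of `text` by swapping known tokens for their nearest neighbor."""
    rng = rng or random.Random(0)
    out = []
    for token in text.split():
        neighbor = nearest_neighbor(token.lower(), vectors)
        if neighbor is not None and rng.random() < replace_prob:
            out.append(neighbor)
        else:
            out.append(token)  # Out-of-vocabulary tokens are kept unchanged.
    return " ".join(out)


print(augment("got my vaccine today", TOY_VECTORS))
```

In a setup like this, each augmented tweet would keep the label of the tweet it was derived from, which is how a small labeled set can be enlarged before training a classifier. Lowering `replace_prob` keeps more of the original wording.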

Original language: English
Title of host publication: Artificial Intelligence and Applications - 26th International Conference, ICAI 2024, Held as Part of the World Congress in Computer Science, Computer Engineering and Applied Computing, CSCE 2024, Revised Selected Papers
Editors: Hamid R. Arabnia, Leonidas Deligiannidis, Soheyla Amirian, Farzan Shenavarmasouleh, Farid Ghareh Mohammadi, David de la Fuente
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 174-190
Number of pages: 17
ISBN (Print): 9783031866227
State: Published - 2025
Event: 26th International Conference on Artificial Intelligence and Applications, ICAI 2024, held as part of the World Congress in Computer Science, Computer Engineering and Applied Computing, CSCE 2024 - Las Vegas, United States
Duration: 22 Jul 2024 to 25 Jul 2024

Publication series

Name: Communications in Computer and Information Science
Volume: 2252 CCIS
ISSN (Print): 1865-0929
ISSN (Electronic): 1865-0937

Conference

Conference: 26th International Conference on Artificial Intelligence and Applications, ICAI 2024, held as part of the World Congress in Computer Science, Computer Engineering and Applied Computing, CSCE 2024
Country/Territory: United States
City: Las Vegas
Period: 22/07/24 to 25/07/24

Keywords

  • Data augmentation
  • Large language models (LLM)
  • Social media content analysis
  • Text classification
