TY - JOUR
T1 - A Comprehensive Review of AI Advancement Using testFAILS and testFAILS-2 for the Pursuit of AGI
AU - Kumar, Yulia
AU - Lin, Mengtian
AU - Paredes, Christopher
AU - Li, Dan
AU - Yang, Guohao
AU - Kruger, Dov
AU - Li, Juan
AU - Morreale, Patricia
N1 - Publisher Copyright: © 2024 by the authors.
PY - 2024/12
Y1 - 2024/12
N2 - In a previous paper, we defined testFAILS, a set of benchmarks for measuring the efficacy of Large Language Models in various domains. This paper defines a second-generation framework, testFAILS-2, to measure how current AI engines are progressing towards Artificial General Intelligence (AGI). The testFAILS-2 framework offers enhanced evaluation metrics that address the latest developments in Artificial Intelligence Linguistic Systems (AILS). A key feature of this review is the “Chat with Alan” project, a Retrieval-Augmented Generation (RAG)-based AI bot inspired by Alan Turing, designed to distinguish between human- and AI-generated interactions, thereby emulating Turing’s original vision. We assess a variety of models, including ChatGPT-4o-mini and other Small Language Models (SLMs), as well as prominent Large Language Models (LLMs), utilizing expanded criteria that encompass result relevance, accessibility, cost, multimodality, agent creation capabilities, emotional AI attributes, AI search capacity, and LLM-robot integration. The analysis reveals that testFAILS-2 significantly enhances the evaluation of model robustness and user productivity, while also identifying critical areas for improvement in multimodal processing and emotional reasoning. By integrating rigorous evaluation standards and novel testing methodologies, testFAILS-2 advances the assessment of AILS, providing essential insights that contribute to the ongoing development of more effective and resilient AI systems towards achieving AGI.
AB - In a previous paper, we defined testFAILS, a set of benchmarks for measuring the efficacy of Large Language Models in various domains. This paper defines a second-generation framework, testFAILS-2, to measure how current AI engines are progressing towards Artificial General Intelligence (AGI). The testFAILS-2 framework offers enhanced evaluation metrics that address the latest developments in Artificial Intelligence Linguistic Systems (AILS). A key feature of this review is the “Chat with Alan” project, a Retrieval-Augmented Generation (RAG)-based AI bot inspired by Alan Turing, designed to distinguish between human- and AI-generated interactions, thereby emulating Turing’s original vision. We assess a variety of models, including ChatGPT-4o-mini and other Small Language Models (SLMs), as well as prominent Large Language Models (LLMs), utilizing expanded criteria that encompass result relevance, accessibility, cost, multimodality, agent creation capabilities, emotional AI attributes, AI search capacity, and LLM-robot integration. The analysis reveals that testFAILS-2 significantly enhances the evaluation of model robustness and user productivity, while also identifying critical areas for improvement in multimodal processing and emotional reasoning. By integrating rigorous evaluation standards and novel testing methodologies, testFAILS-2 advances the assessment of AILS, providing essential insights that contribute to the ongoing development of more effective and resilient AI systems towards achieving AGI.
KW - AI evaluation
KW - AI linguistic systems
KW - artificial general intelligence
KW - multimodal AI
KW - testFAILS-2
UR - http://www.scopus.com/inward/record.url?scp=85213274505&partnerID=8YFLogxK
U2 - 10.3390/electronics13244991
DO - 10.3390/electronics13244991
M3 - Article
AN - SCOPUS:85213274505
SN - 2079-9292
VL - 13
JO - Electronics (Switzerland)
JF - Electronics (Switzerland)
IS - 24
M1 - 4991
ER -