TY - GEN
T1 - A Testing Framework for AI Linguistic Systems (testFAILS)
AU - Kumar, Y.
AU - Morreale, P.
AU - Sorial, P.
AU - Delgado, J.
AU - Li, J. Jenny
AU - Martins, P.
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023
AB - This paper introduces testFAILS, an innovative testing framework designed for the rigorous evaluation of AI Linguistic Systems, with a particular emphasis on various iterations of ChatGPT. Leveraging orthogonal array coverage, this framework provides a robust mechanism for assessing AI systems, addressing the critical question, 'How should we evaluate AI?' While the Turing test has traditionally been the benchmark for AI evaluation, we argue that current publicly available chatbots, despite their rapid advancements, have yet to meet this standard. However, the pace of progress suggests that achieving Turing test-level performance may be imminent. In the interim, the need for effective AI evaluation and testing methodologies remains paramount. Our research, which is ongoing, has already validated several versions of ChatGPT, and we are currently conducting comprehensive testing on the latest models, including ChatGPT-4, Bard, Bing Bot, and the LLaMA model. The testFAILS framework is designed to be adaptable, ready to evaluate new bot versions as they are released. Additionally, we have tested available chatbot APIs and developed our own application, AIDoctor, utilizing the ChatGPT-4 model and Microsoft Azure AI technologies.
KW - AI Linguistic Systems Testing Framework (testFAILS)
KW - AIDoctor
KW - Bot Technologies
KW - Chatbots
KW - Validation of Chatbots
UR - http://www.scopus.com/inward/record.url?scp=85172264804&partnerID=8YFLogxK
DO - 10.1109/AITest58265.2023.00017
M3 - Conference contribution
AN - SCOPUS:85172264804
T3 - Proceedings - 5th IEEE International Conference on Artificial Intelligence Testing, AITest 2023
SP - 51
EP - 54
BT - Proceedings - 5th IEEE International Conference on Artificial Intelligence Testing, AITest 2023
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 5th IEEE International Conference on Artificial Intelligence Testing, AITest 2023
Y2 - 17 July 2023 through 20 July 2023
ER -