Attached Paper
In-person November Annual Meeting 2026
The Future of the Past: Evaluating Large Language Model Reliability in Early Christian Historical Studies
Papers Session: History of the Present, the Futures of Christianity's Past
Abstract for Online Program Book (maximum 150 words)

Large Language Models (LLMs) are increasingly used for historical and theological inquiry, yet their reliability in specialized scholarly domains remains largely unexamined. This paper presents a systematic empirical evaluation of LLM accuracy in early Christian studies, using two fourth-century figures as case studies: Macrina the Younger (c. 327-379 CE) and Olympias of Constantinople (c. 368-408 CE). These figures were selected to probe LLM behavior across axes of scholarly versus popular reception, source type, and gender representation. Using a structured benchmarking methodology that tests biographical accuracy, chronological precision, theological positioning, and source-critical reasoning across multiple models, we aim to identify consistent failure patterns, including factual conflation, hallucination, and what we term "association collapse": the systematic narration of women's significance through male contemporaries. We conclude with practical guidance for educators on integrating critical AI literacy into religious studies pedagogy and a replicable framework for evaluating LLMs in other historical and theological contexts.
