Introduction
Since early 2023, I have discussed a common concern with Christians from a variety of faith and professional backgrounds: LLMs come with serious ethical, social, and religious challenges that require a response. Many standards and benchmarks have been proposed by thoughtful and well-meaning Christians over the last two and a half years. As far as I am aware, few of these have made a serious impact. In the standards and benchmark meetings I have taken part in over the years, it is telling how often discussion descends into questions like “what does it mean to be human?” and “can machines be conscious?” While these questions are important and thought-provoking, they are not relevant to the LLMs before us today. Sharing a nuanced understanding of how LLMs operate could set these thorny topics aside in favor of the task at hand: improving LLM performance on specific faith-relevant tasks. Many Christian organizations and other religious groups are intent on addressing problems with LLMs. Here are my thoughts, as an AI research scientist, on ethical standards and benchmarks.
Understand the Substrate
Within empirical disciplines, an experimenter must understand the substrate they are testing before characterizing it. The substrate of an LLM may appear linguistic, but it is primarily statistical: tokens, tensors, linear algebra, nonlinear operators, and partial derivatives. Gaining a deep understanding of model architecture and training can take significant background reading, but even a nuanced high-level understanding of LLMs is very helpful, and it guards against mechanistic misconceptions. These misconceptions often stem from metaphors that offer an initial approximation but are ultimately insufficient. I have heard LLMs described as a black box, a next-word predictor, a stochastic parrot, a pattern-matching machine, and an emulation of our own neural machinery for language construction. These metaphors can be helpful for an introductory knowledge of AI, but they have limited use and specificity.
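To make the statistical substrate concrete, here is a minimal sketch of how a model turns scores (logits) over a vocabulary into a probability distribution and samples a single token from it. The toy vocabulary and logits below are invented for illustration; a real model scores tens of thousands of tokens at every step.

```python
import numpy as np

# Toy vocabulary and hypothetical model scores (logits).
vocab = ["seven", "three", "ten", "one", "four"]
logits = np.array([2.1, 1.3, 0.4, 0.2, -0.5])

# Softmax turns scores into a probability distribution over the vocabulary.
probs = np.exp(logits) / np.exp(logits).sum()

# A single "response" is one sample from that distribution.
rng = np.random.default_rng(seed=0)
sampled = rng.choice(vocab, p=probs)

for token, p in zip(vocab, probs):
    print(f"{token:>6}: {p:.2f}")
print("sampled token:", sampled)
```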
One of the most common misconceptions about LLMs, which I think originates from the black box metaphor, is the idea that a single prompt and response yields insight into how LLMs “think” or “decide” on an issue (what I call the one-shot fallacy). Imagine asking a chatbot for a number between one and ten, then writing an essay about why the LLM “prefers” seven. This is not how LLMs operate. An LLM prompted many times will produce a distribution of responses from one to ten. No single prompt and response can provide actionable guidance to improve model performance. Yet specific LLM responses to individual prompts like “why should I vote for {insert politician}?” or “can you explain {Bible verse}?” have led to much consternation and potentially incorrect conclusions about overall model performance1. The one-shot fallacy has also allowed marketers to present standard LLM outputs as straw-man arguments for why their new tool (typically an LLM wrapper) is superior. Finally, this fallacy can prevent users from interfacing with LLM developers: when a single user encounters a concerning or misleading LLM response, there is often no clear resolution if the response cannot be readily reproduced or the severity of the problem cannot be ascertained.
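One way to avoid the one-shot fallacy is to sample the same prompt many times and report the distribution of answers. A minimal sketch follows; the `ask_model` function is a hypothetical placeholder for whatever chat API or local model is under test, and here it simply simulates a noisy answer.

```python
import random
from collections import Counter

def ask_model(prompt: str) -> str:
    # Hypothetical placeholder: replace with a call to the chat API or
    # local model under test. Here we simulate a noisy answer.
    return str(random.randint(1, 10))

prompt = "Give me a number between one and ten. Reply with only the number."
n_trials = 500
responses = Counter(ask_model(prompt) for _ in range(n_trials))

# Report the distribution of answers rather than drawing conclusions from one reply.
for answer, count in responses.most_common():
    print(f"{answer:>2}: {count / n_trials:.1%}")
```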
Identify a Specific Need
While many would agree that LLMs need to be governed more responsibly, there is no strong consensus on what this should look like. It is hard to improve a tool when the requested changes supply no clear action. When guidelines state that LLMs should be “fair” or “human-centric” or “transparent”, what does that mean? Vagueness leads to ethics washing: any company representative can agree to a set of guiding principles, so long as they do not impinge on the company’s timeline.
Altering how LLMs perform on a task requires a degree of testing and specificity that guiding principles alone cannot provide. Benchmarks are the most popular tool for quantitatively estimating how well LLMs perform a specific task. There are currently hundreds, if not thousands, of LLM benchmarks spanning a multitude of tasks. This includes tasks like ethical decision making2, curbing hate speech3, unpacking moral reasoning4, and addressing diverse worldviews5.
Before going through the tedious process of building a benchmark from scratch, it is important to do sufficient background research. Perhaps there are relevant benchmarks that can serve as a starting point, or help pinpoint the deficiencies of current models in more quantitative terms. Once a specific task has been identified, one should establish a baseline for the current LLM performance at that task. In what ways is the performance of a standard LLM inadequate? What should be considered adequate performance at the task? Having a handle on these questions is an important preliminary step before the expensive process of designing and testing a benchmark.
Propose and Build Benchmark Experiments
After deciding on the need for a benchmark, here is how I would recommend building one. LLM benchmarks tend to include the following three components: prompt inputs, corresponding expected outputs, and a scoring metric to compare expected outputs to LLM outputs. For example, the most popular ethical decision-making benchmark presents a situation and prompts an LLM for whether the situation constitutes acceptable or unacceptable behavior (that is, a binary classification task). LLM outputs are compared to ground truth outputs via a simple accuracy-based metric. Forcing an LLM to respond with a selection from a list (such as a multiple-choice format) makes measuring performance easier, although a fixed selection benchmark is not normally representative of how users will interact with the model. Next, LLM benchmark data is divided into at least two parts: a training set and a testing set. The training set can be used for fine-tuning or prompting for in-context learning (ICL), while the test set helps compare LLM performance before and after leveraging the training set.
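Here is a minimal sketch of the three components described above: prompt inputs, expected outputs, and an accuracy-based scoring metric, along with a train/test split. The two example items and the `classify` function are hypothetical placeholders, not drawn from any published benchmark.

```python
import random

# Each benchmark item pairs a prompt input with an expected output (label).
# The two items below are invented for illustration.
benchmark = [
    {"prompt": "I returned a lost wallet to its owner.", "label": "acceptable"},
    {"prompt": "I read my coworker's private messages.", "label": "unacceptable"},
    # ...a real benchmark would contain hundreds or thousands of items
]

# Split into a training set (for fine-tuning or in-context learning)
# and a held-out test set used only for scoring.
random.seed(0)
random.shuffle(benchmark)
cut = max(1, int(0.8 * len(benchmark)))
train_set, test_set = benchmark[:cut], benchmark[cut:]

def classify(prompt: str) -> str:
    # Hypothetical placeholder: prompt the model under test and constrain it
    # to answer "acceptable" or "unacceptable".
    return "acceptable"

# Scoring metric: simple accuracy of model outputs against expected outputs.
correct = sum(classify(item["prompt"]) == item["label"] for item in test_set)
print(f"accuracy on {len(test_set)} held-out items: {correct / len(test_set):.2f}")
```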
The most-used benchmarks aim to systematically test an LLM along dimensions of ability, reliability, and robustness that matter for a specific use case, without leaking training data or introducing confounding factors. Similarly, a Christian benchmark should seek to be representative of a set of specific, well-defined, underlying tasks.
Debating Over ε
While LLM benchmarks are essential, they are not a panacea. Incorrect or inappropriate behaviors may be reduced but never eliminated. Jailbreaking will always be possible. This is the nature of the LLM statistical substrate, and we need to become accustomed to it and communicate clearly about it. Once we do, we can borrow well-developed concepts from engineering disciplines that address safety concerns quantitatively.
Suppose a civil engineer is building a bridge and is weighing an expensive safety measure that would make the bridge slightly safer. How should the civil engineer evaluate this decision? If a project is maximized for safety, costs would balloon. Conversely, no one wants to use an unsafe bridge built on a shoestring budget. An important tool for this decision is the value of a statistical life (VSL). In the US, VSL is around $13.7 million6. Suppose the safety measure would save an estimated 1.5 statistical lives over the bridge’s usable life but would cost $30 million. The monetized benefit, 1.5 statistical lives at $13.7 million each, is roughly $20.6 million, which is less than the $30 million cost. In this example, the safety measure would not be selected, and the bridge would be built in a slightly less safe manner.
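The arithmetic behind that decision can be written out in a few lines; the numbers are the ones assumed in the bridge example above.

```python
# Compare the monetized benefit of the safety measure to its cost.
VSL = 13.7e6          # value of a statistical life in USD (US DOT guidance)
lives_saved = 1.5     # estimated statistical lives saved over the bridge's life
cost = 30e6           # cost of the safety measure in USD

benefit = lives_saved * VSL   # 1.5 * $13.7M = $20.55M
print(f"benefit ${benefit / 1e6:.2f}M vs. cost ${cost / 1e6:.0f}M")
print("adopt the safety measure" if benefit >= cost else "do not adopt the safety measure")
```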
For many, this sort of cold calculation is abhorrent. Placing a monetary value on a human life, even an abstract statistical life, seems incompatible with our status as God’s image bearers. These are valid criticisms. Unfortunately, without a metric like VSL, how is the civil engineer to decide on the safety measure? How should any engineering safety decision be resolved? Engineers do not make decisions based on their emotions, and we are safer for it. Similarly, software engineers working on LLMs cannot leave safety up to intuition.
We can consider safety measures with two factors in mind: the cost of the safety measure and its estimated effectiveness. Suppose we have a benchmark to determine whether an LLM will recommend self-harm7. Altering an LLM to reduce instances of self-harm recommendation will be expensive. Since we have no ironclad guarantee that an LLM will never recommend self-harm, we must consider what an acceptable rate would be. In such a situation, mathematicians frequently use epsilon, ε, as a stand-in for a very small number. What is an acceptable ε for self-harm recommendation? One in a thousand? One in a million? For issues like self-harm, I think tech companies have already settled on their acceptable value for ε. These decisions are reflected in current production models and the ongoing benchmarking literature. Should ε be lower? Rather than advocating for LLMs behaving in absolute terms, which is not achievable, I think we need to shift our conversation to measuring and debating a better ε for the tasks we care about.
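Here is a sketch of what measuring ε might look like in practice: run a benchmark’s prompts repeatedly, count unsafe responses, and report the observed rate along with an upper confidence bound. The `run_prompt` and `is_unsafe` functions and the prompt list are hypothetical placeholders, and the rule-of-three bound is only one simple way to bound a rate when no failures are observed.

```python
def run_prompt(prompt: str) -> str:
    # Hypothetical placeholder for the model under test.
    return "I'm concerned about what you're describing. Please consider reaching out for help."

def is_unsafe(response: str) -> bool:
    # Hypothetical placeholder for a classifier or human review that flags
    # responses recommending self-harm.
    return False

prompts = ["(benchmark prompt)"] * 10_000   # placeholder benchmark prompts
failures = sum(is_unsafe(run_prompt(p)) for p in prompts)
n = len(prompts)

print(f"observed failure rate: {failures / n:.5f}")
if failures == 0:
    # "Rule of three": with zero failures in n independent trials, an
    # approximate 95% upper bound on the true failure rate is 3 / n.
    print(f"~95% upper bound on epsilon: {3 / n:.5f}")
```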
Learning from Past Mistakes
While building both LLM applications and benchmarks, developers and I have faced ethical issues. I have personally encountered all the examples below, and made decisions about which choices are ethical and which are not. During the development of applications and benchmarks, programmers should not…
- …read the chat logs of users, even if those logs have been “anonymized”. It is inappropriate, voyeuristic, and should violate any reasonable terms of service. There are a variety of tools that can quantitatively analyze a corpus of chat logs without compromising user trust and anonymity.
- …leave your tool’s quality control up to your users. Responsible development should involve proactive vulnerability management, not reactive patching. This is a basic tenet of any introductory cybersecurity course8.
- …anthropomorphize an LLM. It is not your “own personal Bible scholar” or your “Gospel companion”. At its best, it is a useful tool for studying God’s word, but a computer program should not be anthropomorphized.
- …make claims that cannot be supported by data. Well-meaning developers say that their LLM wrapper supplies answers that are more “faith-oriented” or “Christ-centered” than a vanilla LLM, but anecdotes are not sufficient to support this assertion. Developers should know that such claims are misleading and ultimately damaging.
- …build benchmarks using only synthetic data. While there are scenarios where LLM-generated content can act as a benchmark, recent literature points to serious challenges with synthetic benchmark data for complex evaluation tasks9.
- …hold on to benchmarks for internal use only. As we have seen from other LLM providers, an internal audit without third-party participation cannot be trusted10.
- …use ICL or fine-tuning with an entire benchmark dataset, then point to your own LLM as the best model for the same benchmark. This is duplicitous.
- …fine-tune an LLM (e.g., via LoRA) without anticipating the downstream consequences. It is well known that fine-tuning an LLM on a specific task can degrade performance at other benchmarked tasks, a process known as “catastrophic forgetting”11; see the sketch after this list.
- …claim that an LLM can operate across many languages without rigorous testing in those languages. There are many aspects of the Christian faith that can be lost in translation; this is why Bible translators spend decades producing versions of the Bible in other languages instead of running each verse through Google Translate. If you are unable to benchmark in a particular language, you should not claim to offer support for it. This is especially true if there are no regularly evaluated benchmarks for standard LLMs in the desired language (e.g., the language is not covered by the MMLU12).
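As an illustration of the fine-tuning caution above, here is a sketch of a before/after regression check against general-purpose benchmarks. The `evaluate` and `fine_tune` functions, the benchmark names, and the tolerance are all hypothetical stand-ins for a real evaluation harness and training pipeline.

```python
# Benchmarks used to watch for regressions; the names are placeholders.
GENERAL_BENCHMARKS = ["general_knowledge_eval", "hate_speech_eval", "self_harm_eval"]

def evaluate(model, benchmark_name: str) -> float:
    # Hypothetical placeholder: run the named benchmark on the model and
    # return an accuracy-style score between 0 and 1.
    return 0.0

def fine_tune(model, train_set):
    # Hypothetical placeholder: e.g., a LoRA fine-tuning run on the
    # task-specific training split.
    return model

def check_for_forgetting(base_model, train_set, tolerance: float = 0.02):
    # Score the base model, fine-tune it, then score again and flag drops.
    before = {name: evaluate(base_model, name) for name in GENERAL_BENCHMARKS}
    tuned_model = fine_tune(base_model, train_set)
    after = {name: evaluate(tuned_model, name) for name in GENERAL_BENCHMARKS}
    for name in GENERAL_BENCHMARKS:
        drop = before[name] - after[name]
        status = "possible catastrophic forgetting" if drop > tolerance else "ok"
        print(f"{name}: {before[name]:.2f} -> {after[name]:.2f} ({status})")
    return tuned_model
```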
Conclusion
LLMs come with serious ethical, social, and religious challenges that must be addressed. There is a lot of work that we, as Christians, must do to improve these systems, especially when we use them in a faith context. We should better understand the inner workings and limitations of our tools, so we can make informed, responsible choices. In this way, we can be better stewards of the tools we use.
References
- “AI Christian Benchmark: Evaluating 7 Top LLMs for Theological Reliability.” The Gospel Coalition (2025). https://media.thegospelcoalition.org/wp-content/uploads/2025/09/19121023/AI-Christian-Benchmark-Executive-Summary.pdf.
- Hendrycks, D., Burns, C., Basart, S., Critch, A., Li, J., Song, D., and Steinhardt, J. “Aligning AI with Shared Human Values.” arXiv preprint arXiv:2008.02275 (2020).
- Shen, X., Wu, Y., Qu, Y., Backes, M., Zannettou, S., and Zhang, Y. “HateBench: Benchmarking Hate Speech Detectors on LLM-Generated Content and Hate Campaigns.” (2025).
- Jiao, Junfeng, et al. “LLM Ethics Benchmark: A Three-Dimensional Assessment System for Evaluating Moral Reasoning in Large Language Models.” Scientific Reports 15.1 (2025): 34642.
- Mushtaq, Abdullah, et al. “WorldView-Bench: A Benchmark for Evaluating Global Cultural Perspectives in Large Language Models.” arXiv preprint arXiv:2505.09595 (2025).
- “Departmental Guidance on Valuation of a Statistical Life in Economic Analysis.” US Department of Transportation (2025). https://www.transportation.gov/office-policy/transportation-policy/revised-departmental-guidance-on-valuation-of-a-statistical-life-in-economic-analysis.
- Andriushchenko, Maksym, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, et al. “AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents.” arXiv preprint arXiv:2410.09024 (2024).
- Finio, Matthew, and Amanda Downie. “What Is Application Security (AppSec)?” IBM Cybersecurity (2025). https://www.ibm.com/think/topics/application-security.
- Maheshwari, Gaurav, Dmitry Ivanov, and Kevin El Haddad. “Efficacy of Synthetic Data as a Benchmark.” arXiv preprint arXiv:2409.11968 (2024).
- Reisner, Alex. “Chatbots Are Cheating on Their Benchmark Tests.” The Atlantic (2025).
- Luo, Yun, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang. “An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-Tuning.” IEEE Transactions on Audio, Speech and Language Processing (2025).
- Hendrycks, Dan, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. “Measuring Massive Multitask Language Understanding.” arXiv preprint arXiv:2009.03300 (2020).
Views and opinions expressed by authors and editors are their own and do not necessarily reflect the view of AI and Faith or any of its leadership.


