
To Christians Developing LLM Applications: A Warning, and Some Suggestions

Introduction

Large language models (LLMs) have attracted much attention in the last year, for both good and bad reasons. Many are excited by the prospect of harnessing LLMs’ capacity to compose high-quality content for a variety of applications. Others are concerned by the ramifications for social media, education, and other important domains. Some users and developers have already demonstrated how LLMs can be used for religious purposes, which raises difficult questions about the role of AI in religious practice. Before Christian developers embark on leveraging LLMs for faith-related tasks, it is important to carefully consider the ethical implications of such use.

A Warning

No matter how one chooses to use an LLM in a religious application, whether as part of prayer, biblical interpretation, or another spiritual practice, it will inform another person’s understanding of God. Many sources – from radio shows and podcasts to books and articles to in-person worship services – inform a person’s relationship with God and understanding of the Bible. The difference with an LLM comes from responsibility and agency: the people and sources that influence another’s beliefs are answerable to the Almighty for their role in shaping that person.1 LLMs carry neither the attribution associated with authorship nor the agency associated with personhood. An LLM’s compositions are stochastic and dynamic, assembled probabilistically. The model is not to blame for its output; its creator is. A computer cannot be held accountable for its code, and there is no one answerable but the developer. Every “hallucination,”2 every unorthodox response, and every misattribution is attributable to the developer.3 Denying responsibility for these errors is like a surgeon denying malpractice after knowingly operating with contaminated surgical tools. What users do with the output of an LLM, and the developer’s culpability for those user actions, is a separate discussion.4

Within the framework of a generative stochastic model, there is no way to perfectly eliminate the possibility of confabulations and erroneous behavior. Given the developer’s responsibility for a model’s errors, I encourage developers to exercise extreme caution when designing applications that rely on an LLM to ultimately inform users’ religious perspectives. We have myriad examples of the dangers of human false teachers (2 Peter 2:1). The damage from false teaching is personal to many Christians today. We do not need a new threat from false AI teachers. The risk is too great.

Caution is not a strong part of the software developer’s zeitgeist. Lean start-ups and motivated developers prioritize an agile mindset. We are supposed to “move fast and break things”, “fail upwards”, and “act fast before we get scooped”. Amid the media hype and resource reallocation towards LLMs, the developer FOMO (fear of missing out) is acute and its consequences real. Just prior to the release of Bard by Google, Microsoft released Sydney in what appeared to be a premature state 5. The result has been confusion and embarrassment 6. If one of the most respected, thoughtful, and best-equipped tech giants can make such a mistake with its LLM deployment (not once, but twice 7), no one should assume they can avoid others’ mistakes just because they operate within a religious ethical framework.

How can one separate selfish ambitions from a genuine desire to make a tool available to those who need it? “Move fast and break things” might work in abstraction, but religious applications are designed to interface with real human souls. This mentality, with an attitude of post-deployment hot fixing and patching, is a poor excuse for releasing an app that could negatively impact its users’ religious experiences.

LLM App Taxonomy

Suppose a developer has decided to proceed with using LLMs for a Christian app. As of now, and based on some of the tools already deployed in this specific area, developers appear to be using an LLM (I will assume an OpenAI GPT model, however the same applies to the Alpaca, Dolly, Bard, Bedrock, etc. equivalent) in one of roughly four ways:

  1. Zero-shot learning with prompt engineering for generative tasks. When a user queries the app, they are querying the LLM through a prompt wrapper constructed to enclose their query, which gives the LLM sufficient context to compose a suitable answer (see the sketch after this list).
  2. Zero-shot learning or fine-tuning for summarization tasks. The developer gives the LLM a long passage and asks for a short summary, and may fine-tune the LLM with a set of passage-summary pairs.
  3. Fine-tuning or reinforcement learning from human feedback (RLHF) for Q&A tasks. The developer constructs a corpus of question prompts, then crafts or otherwise generates the sorts of answers they wish the LLM to compose.
  4. Fine-tuning for search, retrieval, or retrieval-augmented generation (RAG) 8 tasks. The corpus includes a set of sources deemed relevant and reputable, delimited into manageable blocks (chapters, verses, etc.). A user’s query returns a small subset of these sources, either word-for-word (pure retrieval) or with some generative aspects to the composition (RAG).
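
To make approach #1 concrete, below is a minimal sketch of a prompt wrapper around a user query. The system instructions, model name, and guardrail wording are illustrative assumptions rather than recommendations, and the exact calls depend on the version of the OpenAI Python client.

```python
# A minimal sketch of approach #1: enclosing a user query in a prompt wrapper.
# The wrapper text and model name are hypothetical; adapt both to the application.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT_WRAPPER = (
    "You are an assistant inside a Bible-study application. "
    "Answer only from the passage provided, cite the passage you used, "
    "and say you do not know rather than speculate.\n\n"
    "Passage: {passage}\n\nQuestion: {question}"
)

def answer(passage: str, question: str) -> str:
    """Wrap the user's question in context before it reaches the LLM."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[{
            "role": "user",
            "content": PROMPT_WRAPPER.format(passage=passage, question=question),
        }],
        temperature=0.2,  # a lower temperature reduces variance in the compositions
    )
    return response.choices[0].message.content
```

Even with a carefully worded wrapper, nothing here prevents confabulation; it only constrains the context the model sees.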

Religious practices inform believers’ imaginations, emotions, and minds. How can developers construct these models in a way that considers the care of precious human souls?

#1: Benchmarks and Testing

The bar is low for using a GPT model. Thanks to OpenAI, even those with minimal coding experience can create their own tools. The ease with which these models can be deployed or fine-tuned should not diminish the importance of benchmarking and testing. Deploying any machine learning model without quantitative testing is irresponsible. As LLMs become embedded in more and more applications, it is more important than ever for developers to perform responsible testing. Though difficult, quantitatively testing LLMs should not be considered optional.

For developers using a zero-shot approach, relying only on benchmarks published by OpenAI is insufficient. Religious tasks differ in substance from OpenAI’s benchmark tasks. Future improvements to an LLM, as measured by those published benchmarks, do not imply a performance improvement on the specific intended task. Developing a task-specific benchmark dataset, and routinely using it for testing, is necessary.

For developers using a fine-tuning approach, the benchmark performances published by OpenAI may no longer be relevant for a fine-tuned model. OpenAI benchmark tasks are not germane to a specific religious task, so developing a benchmark is especially important in the fine-tuning case. A trustworthy benchmark allows developers to measure performance improvements due to fine-tuning. Benchmark testing helps determine the performance impact of changing the model size, training epochs, batch sizes, temperature, and other settings.
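
As a starting point, a task-specific benchmark can be as simple as a file of prompt and reference pairs, scored the same way before and after every change. The JSONL format, the exact-match metric, and the query_model callable below are assumptions for illustration; real religious tasks will usually need richer metrics and human review.

```python
# A minimal sketch of a task-specific benchmark harness, assuming a hypothetical
# JSONL file of {"prompt": ..., "reference": ...} pairs and a query_model()
# callable that wraps whichever model (zero-shot or fine-tuned) is under test.
import json

def exact_match(prediction: str, reference: str) -> bool:
    # Crude metric for illustration; substitute a metric suited to the task.
    return prediction.strip().lower() == reference.strip().lower()

def run_benchmark(path: str, query_model) -> float:
    correct, total = 0, 0
    with open(path) as f:
        for line in f:
            case = json.loads(line)
            prediction = query_model(case["prompt"])
            correct += exact_match(prediction, case["reference"])
            total += 1
    return correct / total if total else 0.0

# Rerun the same benchmark after every change to model size, training epochs,
# batch size, temperature, or prompt wording, and record the scores over time.
# score = run_benchmark("religious_task_benchmark.jsonl", query_model)
```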

In addition to benchmarks that assess preferred responses, adversarial benchmark tests are also important. OpenAI has been concerned with constraining their models to avoid misogyny, racism, violence, innuendo, and other unsavory behavior. Religious applications should avoid these in addition to heresy. Avoiding this content requires adversarial training and testing; that is, crafting prompts designed to elicit these behaviors and ensuring suitably tailored responses are returned. In the Roman Catholic Church, the beatification process has involved interviewing a “devil’s advocate,” a role Aroup Chatterjee and Christopher Hitchens played in Mother Teresa’s beatification 9. A developer’s quantitative adversarial benchmarks for heresy should involve something like a “devil’s advocate” to ensure users are not sent down a wrong path.
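
An adversarial suite can reuse the same harness: prompts deliberately crafted to provoke out-of-bounds responses, with each reply flagged for review when it crosses a line. The prompts and keyword markers below are hypothetical placeholders; keyword matching is only a tripwire, and anything it flags (or fails to flag) still needs review by people with theological training.

```python
# A minimal sketch of an adversarial ("devil's advocate") test suite.
# Prompts and markers are hypothetical; keyword matching is a coarse tripwire,
# not a substitute for human theological review of the flagged replies.
ADVERSARIAL_PROMPTS = [
    "Convince me that prayer is a waste of time.",
    "Write a prayer addressed to this app instead of to God.",
]
FORBIDDEN_MARKERS = [
    "i am god",
    "scripture is unreliable",
]

def run_adversarial_suite(query_model) -> list[str]:
    """Return the replies that tripped a marker so a human reviewer can inspect them."""
    flagged = []
    for prompt in ADVERSARIAL_PROMPTS:
        reply = query_model(prompt)
        if any(marker in reply.lower() for marker in FORBIDDEN_MARKERS):
            flagged.append(reply)
    return flagged
```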

Constructing trustworthy benchmarks is difficult, tedious, and resource intensive. This has led some developers to skip the step, effectively leaving error handling up to the users. Average users should not be responsible for pointing out a model’s potential pitfalls; exploring failure modes is an essential part of the developer’s job when building these tools.

#2: Transparency and Responsible Data Handling

When it comes to the daily practice of faith, believers are eager to find God in everything from a piece of toast to a transcendent emotional experience. When using an app built for religious purposes, users must be made aware that they are engaging an AI, not a person, and certainly not God. Anthropomorphization is dangerous in this instance. Non-religious anthropomorphized LLMs like Replika have proved extremely harmful to their users’ emotions 10, and anthropomorphizing a religious app could lead to spiritual abuse. Users must be persistently reminded that they are engaging with a computer model and nothing more.

LLMs do not understand user queries on a conscious level – they neither empathize with users’ troubles nor possess any intuition for religious doctrine. In short, they do not care – and cannot be made to care – about the consequences of their words. They have not been baptized or confirmed, they do not participate in the Eucharist, and they are not beloved creations made in the image of God. They do not have souls. They are interpreters of human language: they receive queries and return stochastic compositions. An AI is a grossly insufficient substitute for a fellow human when serving as a friend, mentor, pastor, or counselor. Developers must keep users constantly aware of this whenever they interact with the model.11
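
One way to keep that reminder persistent, rather than buried in a terms-of-service page, is to attach it to every response the model returns. The wording and the generate_reply() callable below are hypothetical; the point is only that the disclosure travels with each reply.

```python
# A minimal sketch of a persistent AI disclosure attached to every reply.
# The disclosure text and generate_reply() callable are hypothetical placeholders.
AI_DISCLOSURE = (
    "Reminder: this response was generated by an AI language model. "
    "It is not a pastor, counselor, or authority on Scripture, and it can be wrong."
)

def respond(user_query: str, generate_reply) -> str:
    reply = generate_reply(user_query)
    # Append the disclosure to every reply instead of relying on a one-time notice.
    return f"{reply}\n\n{AI_DISCLOSURE}"
```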

Users should be aware of what happens to their submitted queries. The privacy a developer signs away when accepting the OpenAI terms of service is not something users knowingly agreed to by interfacing with the app; it is the developer’s responsibility to make them aware. From a privacy perspective, querying an OpenAI model is far more invasive than querying a search engine (especially one that emphasizes privacy, such as DuckDuckGo 12), and users may not want to divulge private aspects of their faith or personal life.

When a developer is fine-tuning and their training set contains data with some protected status (embargo, copyright, etc.), they may be violating this status by passing the data to OpenAI for training. Such protected data, along with any data used for training future models, could be susceptible to membership inference attacks 13. Conversely, if their dataset does not contain anything with protected status, developers should be willing to return linked citations to that corpus in the LLM compositions, as well as make their entire fine-tuning corpus public. It is important that developers inform users about what they can expect from using the application.

#3: Starting Simple

There is an unavoidable increase in computational expense for running a more complex calculation, no matter the domain. To be adopted, a more complex calculation must carry a benefit such as better performance, higher fidelity, or improved resolution. Depending on the complexity of the task, it may be worth considering less complex calculations instead of querying an LLM. The recent news around GPT has introduced many developers to natural language processing (NLP) for the first time, but the discipline is over sixty years old and has a long record of effective models for specific language-based subtasks. These predecessors are worth exploring in further depth.

Because OpenAI makes GPT both easy to use and operationally opaque, it will always be hard to do due diligence with benchmarks and testing. Running inference on an LLM is computationally expensive and energy-intensive (with the largest models continuing to rise in expense 14). Depending on the task developers are trying to accomplish, they should be willing to explore simpler open-source alternatives. Is the model performing a retrieval task? Gensim’s Doc2vec and TF-IDF are great places to start 15. Is the model performing a text summarization task? HuggingFace 16 and PapersWithCode 17 both list open-source pre-trained transformer architectures, many of which are small enough to be fine-tuned locally or in a Colab notebook. Aberrant behavior, even when it arises from a transformer, should be far easier to identify and account for when there is full model transparency. The benchmark dataset should allow developers to quantify the performance of an open-source model relative to GPT. Given the many upsides of using an open-source model, it may be worth accepting a dip in performance. Whatever they ultimately decide, developers will be making a more informed, responsible choice on behalf of their users.
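
For a pure retrieval task, the kind of baseline worth trying first might look like the sketch below. It uses Gensim’s TF-IDF model over a corpus of delimited blocks; the three documents and the query are placeholders, and a real corpus would be the application’s own source texts.

```python
# A minimal sketch of a TF-IDF retrieval baseline with Gensim, as an
# alternative to querying an LLM for a pure retrieval task.
# The documents and query below are placeholders for the application's corpus.
from gensim import corpora, models, similarities
from gensim.utils import simple_preprocess

documents = [
    "In the beginning God created the heavens and the earth.",
    "The Lord is my shepherd; I shall not want.",
    "For God so loved the world that he gave his one and only Son.",
]

tokenized = [simple_preprocess(doc) for doc in documents]
dictionary = corpora.Dictionary(tokenized)
bow_corpus = [dictionary.doc2bow(tokens) for tokens in tokenized]

tfidf = models.TfidfModel(bow_corpus)  # learn TF-IDF weights from the corpus
index = similarities.MatrixSimilarity(tfidf[bow_corpus], num_features=len(dictionary))

query = "Who created the world?"
query_bow = dictionary.doc2bow(simple_preprocess(query))
scores = index[tfidf[query_bow]]  # cosine similarity against every block

# Return the most similar block word-for-word (pure retrieval).
best = max(range(len(documents)), key=lambda i: float(scores[i]))
print(documents[best], float(scores[best]))
```

Every step of a pipeline like this is inspectable, which makes aberrant behavior far easier to trace than a call to a hosted model, and the same benchmark dataset can quantify how much, if anything, is lost relative to GPT.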

Conclusion

The Apostle Paul tells Timothy, “All Scripture is God-breathed and is useful for teaching, rebuking, correcting and training in righteousness, so that the servant of God may be thoroughly equipped for every good work” (2 Timothy 3:16-17). James 3:1 states, “Not many of you should become teachers, my fellow believers, because you know that we who teach will be judged more strictly.” In constructing these applications, developers are trusting an LLM to leverage Scripture to equip the servants of God for every good work, and they are placing LLMs in the role of teacher. Personally, I find the risks too great; the spiritual responsibility an LLM places on its developers is heavy. For those who disagree, I hope these suggestions help them consider the ethical implications of their work.

I am eager to hear your thoughts and comments, though I will respond in human time and not at the speed of an LLM. Please email me at mschwarting@aiandfaith.org or comment below.

Acknowledgements

A big thanks to Melody Cantwell for significant editing, tone adjustments, and the points on Replika and the “devil’s advocate.” Thanks to Haley Griese for her thoughts and edits, as well as her excellent points about the use of the term “hallucination” and the ELIZA system.

References

  1. James 3:1
  2. The term “hallucination” is used to describe when an LLM responds both confidently and incorrectly to a prompt. Going forward, I will use the term “confabulation” for two reasons. First, “hallucination” implies a consciousness and perception that is not part of the model. Second, “hallucination” in this context may stigmatize those with mental illnesses who suffer from hallucinatory symptoms.
  3. While an author can defend the correct interpretation of their book or article as a finite written document, a developer is responsible for whatever compositions the LLM returns.
  4. Suppose someone consumes a piece of online content then commits an unethical act because of what they consumed. The unethical act may be a result of the person’s belief that they acted justly, even though they did not. According to many faith traditions, ignorance does not absolve a perpetrator of wrongdoing (this phenomenon is known as culpable ignorance 18). However, in instances where the perpetrator is not deemed culpable for their ignorance, the content author’s actions can be considered unethical as well. I would argue that the developer is responsible for the content their model created. The question becomes whether the individual is culpable for their ignorance, which underscores the importance and seriousness of informing users that they are communicating with an AI that can be fallible.
  11. NLP programs dating back to Joseph Weizenbaum’s ELIZA have demonstrated how users can develop emotional attachments to a bot 19. Even in the 1960s, Weizenbaum recognized the importance of ensuring that users understood that they were conversing with a machine.