Chatbots are an increasingly integral part of society. They can be developed in many ways, with the model architecture and evaluation criteria chosen to suit each application. Although these language models are currently divided into open-domain and task-oriented models, there is significant research into combining them into a single, all-encompassing model. Other research predicts future user inputs, tailors chatbot responses to the user’s personality, and more. Chatbot development is a rapidly evolving field.

1 Introduction

Chatbots are language models that simulate conversations with people. With their wide breadth of knowledge and human-like responses, models like ChatGPT are pushing the bounds of what language modeling can do. With a reported IQ between 120 and 147, natural language AIs like OpenAI’s ChatGPT may surpass even university professors on this measure of intelligence (de Souza et al., 2023). This degree of knowledge and adaptability is a key characteristic of modern chatbots.

1.1 Historical Developments

ELIZA is an early language model developed in 1966 by Joseph Weizenbaum. It uses decomposition rules to identify keywords in the input query, apply grammatical transformations, and generate a response (Weizenbaum, 1966). Because it only applies linguistic templates to the input query, ELIZA is not an intelligent model; it simply exploits question-answer grammar structures. This approach reflects the theories and developments of the 1960s and 1970s, when studies focused on pattern matching and logical deduction (Enea and Colby, 1973). The primary goal of these language models was to understand and hard-code language structures.

Beyond improving model outputs, there was research into making chatbots act human-like. A study in the late 1990s attempted to simulate human typing speeds and errors by randomly pausing the algorithm between character inputs and making occasional typos in the query response (Bastin and Cordier, 1998). Modern reinterpretations of this post-processing include Google Assistant’s automated calling agent, which intersperses pauses and “ums” into its conversation to replicate human speech (Wamsley, 2018).

1.2 Current Applications

Chatbots have fast reaction times and are accessible at any time of day (Chakrabortty et al., 2023). For industries that value these qualities, chatbots are an invaluable asset. In software engineering, chatbots are used for preliminary code generation and code testing; in marketing, they are used in customer service, content creation, and data collection (Fraiwan and Khasawneh, 2023; George et al., 2023). Beyond industry, chatbots have a substantial impact on three aspects of society: healthcare, education, and companionship.

Świątoniowska-Lonc et al. (2020) found that “the more satisfied the patient is with communication, the better their adherence [to] self-care”; physician communication is therefore key to improving patient outcomes. Another study found that, for patient questions, a chatbot’s answer was preferred over a physician’s response in nearly 80 percent of evaluations (Ayers et al., 2023). In that study, the generated responses were rated as both more empathetic and higher quality than those the physicians provided.

Another application is education. Although a meta-analysis of 32 empirical studies between 2010 and 2022 found that chatbots did not improve students’ critical thinking, educational engagement, or motivation, they had a “significant and positive influence on explicit reasoning, learning achievement, knowledge retention, and learning interest” (Deng and Yu, 2023). With the Chinese Ministry of Education rolling out such chatbots, the study suggests that they would serve best as teaching assistants or tutors. Chatbots are no substitute for a dedicated teacher, but their flexibility and personalization make them an invaluable educational tool.

A final application is companionship. Many elderly people suffer from social isolation and need social support. A study found that Google Assistant, Amazon Alexa, Apple Siri, and Microsoft Cortana could help them with various tasks, such as greetings, interpersonal communication, and games (Reis et al., 2017). Combined, these services provide connection and entertainment.

2 Chatbot Design

2.1 Training Datasets

A chatbot's training set composition and size determine the scope and capabilities of the language model. While specialized chatbots are trained on specialized datasets, general chatbots often use general-purpose datasets like Wikipedia, BookCorpus, and WebText (Tan and Celis, 2019). These datasets are used because of their size and content diversity. Movie scripts are also used because their structured conversational format approximates chatbot dialogue generation (Roghult, 2014).

2.2 Chatbot Development Schemes

Chatbots fall into two primary classes: open-domain and task-oriented. Open-domain chatbots are general and designed to answer any question, while task-oriented chatbots focus on accomplishing a specific task. One subset of task-oriented chatbots is knowledge-graph chatbots, which query an external data lake, or knowledge graph, for the most up-to-date information (Omar et al., 2023).

Open-domain language models are typically large neural networks trained on large datasets. There are three common chatbot models: retrieval, generative, and retrieve-and-refine. A retrieval model uses the context (dialogue history) to select and return the highest-scoring sentence from its training corpus. A generative model tokenizes the context, feeds it into a sequence-to-sequence neural network such as a transformer, and returns the generated output. A retrieve-and-refine model retrieves some knowledge based on the context, inputs both the context and the retrieved data into a neural network, then returns its output (Roller et al., 2021). These neural networks are often sequence-to-sequence, hierarchical, or transformer-based models (Verma et al., 2022). They require millions to billions of parameters and enormous training sets, but the end result is a chatbot like ChatGPT.
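The retrieval approach described above can be illustrated with a minimal sketch. It assumes a bag-of-words cosine similarity as the scoring function, which is a stand-in chosen for brevity; the systems surveyed here score candidates with trained neural encoders. The function names (`tokenize`, `cosine`, `retrieve`) and the three-sentence corpus are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(context: str, corpus: list[str]) -> str:
    """Return the corpus sentence that scores highest against the context."""
    context_vec = Counter(tokenize(context))
    return max(corpus, key=lambda s: cosine(context_vec, Counter(tokenize(s))))


# Hypothetical toy corpus standing in for a training corpus.
corpus = [
    "The weather is sunny today.",
    "I enjoy reading science fiction novels.",
    "You can reset your password from the settings page.",
]
reply = retrieve("How do I reset my password?", corpus)
print(reply)
```

A retrieve-and-refine model would pass both the context and a retrieved sentence like `reply` to a generative network instead of returning the sentence directly.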

In contrast, task-oriented language models are more specific, such as a model that acts as a vendor and negotiates a purchase with a potential customer. The overall architecture is similar to that of open-domain chatbots, except that the training set is focused on the specific assignment, allowing the model to excel at that task. For example, a model that was trained to negotiate prices on Craigslist was able to significantly outperform other baseline models in ability, coherency, topicality, and human-like responses (Verma et al., 2022).

One way to augment the training process is to preprocess the training data, rather than use it directly, so that it covers more variations of the input. By using a “generative adversarial model [to] convert noisy feedback into a plausible natural response,” models like FEED2RESP convert a single sentence into a range of potential interpretations (Sreedhar et al., 2020). Feeding each of these interpretations into the chatbot model increases the number of data points the model can be trained on, and thus makes it more adaptable and responsive to a wider range of sentence structures and linguistic styles.

Rather than using models to parse and decipher the range of potential user meanings, a Stanford study suggested using social engineering to control the flow of conversation and make information extraction easier. This is because in human-to-human conversations, both “participants take turns leading the conversation[, but in most human-to-bot] dialogues[,] the bot either leads unilaterally or responds passively” (Hardy et al., 2021). To fix this conversational imbalance, the study engineered the chatbot to use back-channeling (i.e., giving responses like “Mm-hmm”), use open-ended statements, and provide unprompted information. This conversational structure increased user initiative, response diversity, and information extraction.

2.3 Design Limitations

Chatbots are characterized by two key aspects: a large training set and a complex language model. Both introduce limitations to the overall chatbot design. Because the model is unable to perfectly infer facts outside its training set, it is highly biased towards its training data. There are also concerns that limited model transparency and explainability make model behavior hard to interpret (Fraiwan and Khasawneh, 2023). Further limitations are that the vocabulary of generative models is biased towards more common words, that models can repeat or forget prior dialogue, and that model outputs can jump between topics without prompting (Roller et al., 2021).

2.4 Chatbot Evaluation Criteria

Because chatbots are expected to accomplish a range of tasks, there are several complementary criteria for evaluating a chatbot: quality, accuracy, human-ness, and more. One of the most well-known evaluations is the Turing Test, where the goal is to convince the user that the chatbot is a human (Bastin and Cordier, 1998). Instead of quantifying what makes humans human, this section focuses on more granular evaluation criteria, some of which address the limitations highlighted in the prior section.

A 2021 study by Doğruöz and Skantze categorized human-chatbot conversations into three categories: informal small talk, relationship talk, and goal-directed talk. Because chatbots tend to jump between topics, their disjointed conversational flow is suitable only for small talk. This is reflected in the study’s findings and conclusions, where the authors argue that open-domain chatbots need to be evaluated on a “wider repertoire of speech events” so that language models can successfully engage in a wider range of topics and conversation types (Doğruöz and Skantze, 2021).

A study on the Afrikaans language expands on this, highlighting three key evaluation metrics for judging chatbot responses: dialogue efficiency (containing the necessary information), dialogue quality (being reasonable and understandable), and user satisfaction (minimizing conversational repetitiveness) (Shawar and Atwell, 2007). Because chatbots have different areas of focus, the paper advocates a more holistic evaluation based on individual use cases and user needs. The evaluation criteria should reweight the three aforementioned metrics according to the specific chatbot’s goals.
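As a rough illustration of reweighting the three metrics per chatbot, the sketch below combines them into a single weighted average. It assumes each metric has already been normalized to the range 0 to 1; the class `DialogueScores`, the function `weighted_score`, and all numeric values are hypothetical and are not taken from Shawar and Atwell (2007).

```python
from dataclasses import dataclass


@dataclass
class DialogueScores:
    efficiency: float    # did the response contain the necessary information?
    quality: float       # was the response reasonable and understandable?
    satisfaction: float  # did the conversation avoid repetitiveness?


def weighted_score(scores: DialogueScores, weights: dict[str, float]) -> float:
    """Weighted average of the three metrics; weights need not sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(w * getattr(scores, name) for name, w in weights.items()) / total


# Hypothetical scores for one chatbot, each assumed to lie in [0, 1].
scores = DialogueScores(efficiency=0.9, quality=0.6, satisfaction=0.5)

# A task-oriented bot might emphasize efficiency; a companion bot, satisfaction.
task_weights = {"efficiency": 3.0, "quality": 1.0, "satisfaction": 1.0}
companion_weights = {"efficiency": 1.0, "quality": 1.0, "satisfaction": 3.0}

task_result = weighted_score(scores, task_weights)
companion_result = weighted_score(scores, companion_weights)
print(task_result, companion_result)
```

The same raw scores rank differently under the two weightings, which is the point of tailoring the evaluation to the chatbot’s goals.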

Building on the dialogue efficiency criterion of the Afrikaans language study, user intent is also a key aspect of the question-answering process. Beyond understanding the explicit meaning of the sentence, a language model should ideally also understand why a specific question is being asked. For example:

“If an e-commerce website does not sell women's Reebok shoes of size 10, its chatbot might answer “correctly” to a user who asked if those shoes are available, by responding “No we do not have women’s Reebok shoes of size 10.” However, this answer is not “helpful,” that is, a user shown this answer will not be tempted to search for other products on the website.” (Gupta et al., 2022)

A technically correct answer does not necessarily help the chatbot meet its goal: to make a sale. A good metric for dialogue efficiency should therefore account for context and reward responses that address both the user’s explicit question and their implicit reason for asking.

3 Current and Future Developments

A key area of development is making language models empathetic. One major difficulty is the lack of labeled datasets. Existing datasets use overly broad emotional labels (positive, neutral, and negative), whereas these models require more granular labels (questioning, consoling, suggesting, etc.). There is a proposed three-step process to semi-automate this dataset generation: take preexisting fine-grained labels, train a weak labeler to classify by broad emotions, then use data augmentation to train a strong (granular) labeler (Welivita et al., 2021). Choosing a more general label first reduces the number of choices that the strong classifier needs to differentiate between, thus increasing overall accuracy. This granular emotion dataset would be useful with the global-local model that Wang and Feng designed, where a global model identifies the endurance of emotions and a local BERT model categorizes emotions based on localized prior data (Wang and Feng, 2023).
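The benefit of choosing a general label first can be sketched as a two-stage classifier in which the broad label restricts which fine-grained labels are considered. This is only an illustration of the coarse-to-fine idea, not the pipeline of Welivita et al. (2021): the keyword cue sets, the grouping of fine labels under broad emotions, and the function names are all hypothetical, and a real system would use trained classifiers rather than keyword counts.

```python
import re

# Hypothetical keyword cues for the broad (coarse) emotions.
COARSE_CUES = {
    "positive": {"great", "happy", "glad"},
    "negative": {"sad", "sorry", "lost"},
}

# Hypothetical grouping of fine-grained labels under each broad emotion.
FINE_CUES = {
    "positive": {
        "congratulating": {"congratulations", "great"},
        "agreeing": {"agree", "yes"},
    },
    "negative": {
        "consoling": {"sorry", "loss"},
        "sympathizing": {"understand", "hard"},
    },
    "neutral": {
        "questioning": {"what", "why", "how"},
        "suggesting": {"should", "try", "maybe"},
    },
}


def tokens(text: str) -> set[str]:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z']+", text.lower()))


def coarse_label(text: str) -> str:
    """Stage 1: pick the broad emotion; fall back to neutral with no cues."""
    words = tokens(text)
    scores = {label: len(words & cues) for label, cues in COARSE_CUES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "neutral"


def fine_label(text: str) -> str:
    """Stage 2: choose only among fine labels under the predicted broad label."""
    words = tokens(text)
    candidates = FINE_CUES[coarse_label(text)]
    return max(candidates, key=lambda label: len(words & candidates[label]))


print(fine_label("I am so sorry for your loss"))
print(fine_label("Maybe you should try a walk"))
```

In this sketch the second stage compares two candidate labels instead of six, which mirrors the reduction in choices described above.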

Another area of development is the combination of open-domain and task-oriented chatbots into a single augmented knowledge graph (KG) chatbot system. While existing open-domain chatbots like ChatGPT can answer various complex questions, they can do so only because the relevant knowledge graphs were part of their training data. In addition to standard chatbot characteristics, KG chatbots will need to query relevant information (extract data from external data sources) and generalize across different domains (find information from various data sources) (Omar et al., 2023). Because data sources are not necessarily stored in a tabular format, information retrieval systems need to create a knowledge graph from the raw text and perform data extraction on the graph (Galitsky et al., 2020). Existing parsers and chatbots often make errors when parsing such texts, leading to incorrect relationships between various objects (Mitsuda et al., 2022). Improving this scheme is key to maximizing the data sources available to the KG chatbot, increasing its overall domain knowledge, and minimizing error rates.
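To make the querying step concrete, the following minimal sketch stores a knowledge graph as (subject, relation, object) triples and answers pattern queries against it. It assumes the triples have already been extracted from raw text, which is the error-prone step discussed above; the `query` function and the relation names are hypothetical, and production KG chatbots typically query a dedicated graph store rather than a Python list.

```python
from typing import Optional

Triple = tuple[str, str, str]

# Toy graph built from facts stated earlier in this paper.
graph: list[Triple] = [
    ("ELIZA", "developed_by", "Joseph Weizenbaum"),
    ("ELIZA", "developed_in", "1966"),
    ("ChatGPT", "developed_by", "OpenAI"),
]


def query(
    triples: list[Triple],
    subject: Optional[str] = None,
    relation: Optional[str] = None,
    obj: Optional[str] = None,
) -> list[Triple]:
    """Return every triple matching the pattern; None acts as a wildcard."""
    return [
        (s, r, o)
        for s, r, o in triples
        if (subject is None or s == subject)
        and (relation is None or r == relation)
        and (obj is None or o == obj)
    ]


# "Who developed ELIZA?" expressed as a pattern with the object left open.
answers = [o for _, _, o in query(graph, subject="ELIZA", relation="developed_by")]
print(answers)
```

Because the facts live outside the model, updating the graph changes the chatbot’s answers without retraining.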

One way that chatbots are being improved is by augmenting the traditional forward dialogue model with a backward dialogue model that predicts future conversations and uses those predicted futures to inform the current response (Liu et al., 2022). Dubbed ProphetChat, this model provides a more informative response by preemptively addressing follow-ups the user might have. Although predicted future texts carry an unavoidable degree of noise, these predictions allow ProphetChat to outperform forward-only dialogue models in terms of readability, sensibleness, and specificity.

An additional area of development is adjusting the conversation to the individual. The first approach is a vocabulary filter. Although there are many different ways to restrict lexical complexity, a 2022 study from the University of Cambridge found that the optimal method is re-ranking, where the model chooses among candidate responses based on CEFR proficiency levels (i.e., sentence difficulty) (Tyen et al., 2022). Re-ranking potential responses by readability increases overall response comprehension. Another way to tailor conversations is based on the user’s Myers-Briggs personality type. A preliminary study trained two different chatbots: one with introversion thinking, another with extraversion feeling. When users interacted with both, there was a statistically significant difference in their preferred chatbot; with a p-value of 0.003, the chatbot that matched their personality was perceived to be more natural than the personality-mismatched chatbot (Fernau et al., 2022). As such, providing users with a personality-matched chatbot would help enhance the user experience.
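The re-ranking idea can be sketched as follows: generate several candidate responses, score each for difficulty, and return the one closest to the user’s level. The sketch assumes mean word length as a crude difficulty proxy, which is a substitution made for brevity; Tyen et al. (2022) rank candidates by CEFR level, and the names `difficulty` and `rerank` and the example sentences are hypothetical.

```python
import re


def difficulty(sentence: str) -> float:
    """Crude difficulty proxy: mean word length (a stand-in for a CEFR score)."""
    words = re.findall(r"[A-Za-z']+", sentence)
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)


def rerank(candidates: list[str], target_difficulty: float) -> list[str]:
    """Order candidates by how close their difficulty is to the target level."""
    return sorted(candidates, key=lambda s: abs(difficulty(s) - target_difficulty))


# Hypothetical candidate responses produced by a dialogue model.
candidates = [
    "Precipitation is anticipated throughout the afternoon.",
    "It will rain this afternoon.",
]

beginner_reply = rerank(candidates, target_difficulty=4.0)[0]
advanced_reply = rerank(candidates, target_difficulty=8.0)[0]
print(beginner_reply)
print(advanced_reply)
```

Re-ranking leaves the underlying dialogue model unchanged; only the selection among its candidates depends on the user’s level.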

4 Conclusion

There are two main forms of chatbots: open-domain and task-oriented. Each is designed with specific goals in mind, and thus is evaluated on a different set of criteria. Future work involves merging these chatbots into a more encompassing language model, as well as other developments to improve the user experience and information accuracy.