Chatbots are an increasingly integral part of society. There are multiple ways to develop them, with the model architecture and evaluation criteria chosen for each specific application. Although current language models are divided into open-domain and task-oriented models, there is significant research into combining them into a single, all-encompassing model. Further research predicts future user inputs, tailors chatbot responses to personality, and more. Chatbot development is a rapidly evolving field.
Chatbots are language models that simulate conversations with people. With their wide breadth of knowledge and human-like responses, models like ChatGPT are pushing the bounds of what language modeling can do. With reported IQ scores between 120 and 147, natural language AIs like OpenAI’s ChatGPT can surpass the measured intelligence of even university professors (de Souza et al., 2023). This degree of knowledge and adaptability is a key characteristic of modern chatbots.
1.1 Historical Developments
ELIZA is an early language model developed in 1966 by Joseph Weizenbaum. It uses decomposition rules to identify keywords in the input query, apply grammatical transformations, and generate a response (Weizenbaum, 1966). Because it only applies linguistic templates to the input query, ELIZA is not an intelligent model; it simply exploits question-answer grammar structures. This approach reflects the theories and developments of the 1960s and 1970s, when studies focused on pattern matching and logical deduction (Enea and Colby, 1973). The primary goal of these language models was to understand and hard-code language structures.
Beyond improving model outputs, there was research into making chatbots act human-like. A study in the late 1990s attempted to simulate human typing speeds and errors by randomly pausing the algorithm between character inputs and making occasional typos in the query response (Bastin and Cordier, 1998). Modern reinterpretations of this post-processing include Google Assistant’s automated calling agent, which intersperses pauses and “ums” into its conversation to replicate human speech (Wamsley, 2018).
1.2 Current Applications
Chatbots have fast reaction times and are accessible during any time of the day (Chakrabortty et al., 2023). For industries that value these qualities, chatbots are an invaluable asset. In software engineering, chatbots are used for preliminary code generation and code testing; in marketing, they are used in customer service, content creation, and data collection (Fraiwan and Khasawneh, 2023; George et al., 2023). Beyond industry, chatbots have a massive human impact on three aspects of society: healthcare, education, and companionship.
Świątoniowska-Lonc et al. (2020) found that “the more satisfied the patient is with communication, the better their adherence [to] self-care”; physician communication is key to improving patient outcomes. Another study compared physician and chatbot answers to patient questions and found that in nearly 80 percent of evaluations, the chatbot’s answer was rated as more empathetic and of higher quality than the physician’s response (Ayers et al., 2023).
Another application is education. Although a meta-analysis of 32 empirical studies between 2010 and 2022 found that chatbots did not improve students’ critical thinking, educational engagement, or motivation, they had a “significant and positive influence on explicit reasoning, learning achievement, knowledge retention, and learning interest” (Deng and Yu, 2023). With the Chinese Ministry of Education rolling out such chatbots, the study suggests they would be best deployed as teaching assistants or tutors. Chatbots are no substitute for a dedicated teacher, but their flexibility and personalization make them an invaluable educational tool.
A final application is companionship. Elderly people often suffer from social isolation and benefit from social support. A study found that Google Assistant, Amazon Alexa, Apple Siri, and Microsoft Cortana could help them perform various tasks, such as greetings, interpersonal communication, and games (Reis et al., 2017). Together, these services provide connection and entertainment.
2 Chatbot Design
2.1 Training Datasets
A chatbot's training set composition and size determines the scope and capabilities of the language model. While specialized chatbots are trained on specialized datasets, general chatbots often use generalized datasets like Wikipedia, BookCorpus, and WebText (Tan and Celis, 2019). These datasets are used because of their size and content diversity. Movie scripts are also used because of their structured conversational format, which approximates chatbot dialogue generation (Roghult, 2014).
2.2 Chatbot Development Schemes
There are two primary ways to classify chatbots: open-domain and task-oriented. Open-domain chatbots are general and designed to answer any question, while task-oriented chatbots focus on accomplishing a specific task. A subset of task-oriented chatbots are knowledge-graph chatbots, which query an external data source – a knowledge graph – for the most up-to-date information (Omar et al., 2023).
Open-domain language models are typically large neural networks that have been trained on large datasets. There are three common chatbot models: retrieval, generative, and retrieve-and-refine. For retrieval models, the model uses the context (dialogue history) to select and return the highest-scoring sentence from its training corpus. For generative models, the model tokenizes the context, feeds it into a sequence-to-sequence transformer, and returns the decoded output. For retrieve-and-refine models, the model retrieves knowledge based on the context, feeds both the context and the retrieved data into a neural network, and returns its output (Roller et al., 2021). These neural networks are often sequence-to-sequence, hierarchical, or transformer-based models (Verma et al., 2022). They require millions of parameters and enormous training sets – but the end result is a chatbot like ChatGPT.
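The retrieval approach can be illustrated with a minimal sketch: score every corpus sentence against the dialogue context and return the best match. Real retrieval models use learned neural encoders; the bag-of-words cosine similarity, the toy corpus, and the function names here are all illustrative simplifications.

```python
import math
from collections import Counter

def vectorize(text):
    # Bag-of-words term frequencies; punctuation is stripped so that
    # "novels." and "novels" count as the same term.
    return Counter(w.strip(".,!?") for w in text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse term-frequency vectors.
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(context, corpus):
    # Return the corpus sentence that scores highest against the context.
    ctx = vectorize(context)
    return max(corpus, key=lambda s: cosine(ctx, vectorize(s)))

corpus = [
    "The weather today is sunny and warm.",
    "I enjoy reading science fiction novels.",
    "My favorite food is homemade pasta.",
]
print(retrieve("i like to read novels", corpus))
```

A generative model would instead decode a new sentence token by token; retrieve-and-refine would pass the retrieved sentence and the context together into such a decoder.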
In contrast, task-oriented language models are more specific – like acting as a vendor and negotiating a purchase with a potential customer. The overall architecture is similar to open-domain chatbots, except its training set is more focused on its specific assignment, allowing it to excel at that specific task. For example, a model that was trained to negotiate prices on Craigslist was able to significantly outperform other baseline models in ability, coherency, topicality, and human-like responses (Verma et al., 2022).
Another way to augment model training is to preprocess the training data to account for the many possible phrasings of each input, rather than using the data directly. By using a “generative adversarial model [to] convert noisy feedback into a plausible natural response,” models like FEED2RESP convert a single sentence into a range of potential interpretations (Sreedhar et al., 2020). Feeding each of these interpretations into the chatbot model increases the number of data points the model can be trained on, making it more adaptable and responsive to a wider range of sentence structures and linguistic styles.
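The idea of expanding one sentence into many training variants can be sketched simply. FEED2RESP learns its rewrites with a generative adversarial model; the fixed synonym table below is only a stand-in for that learned component, used to show how a single input multiplies into several training examples.

```python
import itertools

# Hypothetical synonym table; a real system like FEED2RESP learns these
# rewrites with a generative adversarial model rather than a fixed lexicon.
SYNONYMS = {
    "buy": ["buy", "purchase", "get"],
    "cheap": ["cheap", "inexpensive", "affordable"],
}

def augment(sentence):
    # Expand one sentence into every combination of synonym substitutions.
    words = sentence.split()
    options = [SYNONYMS.get(w, [w]) for w in words]
    return [" ".join(combo) for combo in itertools.product(*options)]

variants = augment("i want to buy a cheap laptop")
print(len(variants))  # 3 buy-variants x 3 cheap-variants = 9
```

Each of the nine variants would then be paired with the original target response, multiplying the effective size of the training set.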
Rather than using models to parse and decipher the range of potential user meanings, a Stanford study suggested using social engineering to control the flow of conversation and make information extraction easier. In human-to-human conversations, both “participants take turns leading the conversation[, but in most human-to-bot] dialogues[,] the bot either leads unilaterally or responds passively” (Hardy et al., 2021). To fix this conversational imbalance, the study engineered the chatbot to use back-channeling (i.e., giving responses like “Mm-hmm”), use open-ended statements, and provide unprompted information. This conversational structure increased user initiative, response diversity, and information extraction.
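A mixed-initiative policy of this kind can be sketched as a simple stochastic choice between a direct answer, a back-channel, and an open-ended prompt. The probabilities, phrase lists, and function names below are assumptions for illustration, not the study's actual implementation.

```python
import random

# Illustrative response pools; a real system would condition these on
# the dialogue state rather than sampling uniformly.
BACKCHANNELS = ["Mm-hmm.", "I see.", "Right."]
OPEN_ENDED = ["Tell me more about that.", "What happened next?"]

def respond(user_turn, direct_answer, rng=random):
    # With small probability, hand initiative back to the user instead
    # of answering directly: 20% back-channel, 20% open-ended prompt.
    roll = rng.random()
    if roll < 0.2:
        return rng.choice(BACKCHANNELS)
    if roll < 0.4:
        return rng.choice(OPEN_ENDED)
    return direct_answer
```

By occasionally declining to lead, the bot invites longer, more informative user turns, which is the effect the study measured as increased user initiative.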
2.3 Design Limitations
Chatbots are characterized by two key aspects: a large training set and a complex language model. Both introduce limitations to the overall chatbot design. Because the model is unable to perfectly infer facts outside its training set, the language model is highly biased towards its training data. There are also concerns over model transparency and explainability that can decrease model interpretability and understandability (Fraiwan and Khasawneh, 2023). Further limitations include generative models’ vocabulary bias toward more common words, models repeating or forgetting prior dialogue, and model outputs jumping between topics without prompting (Roller et al., 2021).
2.4 Chatbot Evaluation Criteria
Because chatbots are expected to accomplish a range of tasks, there are several complementary ways to evaluate a chatbot: quality, accuracy, human-ness, and more. One of the most well-known is the Turing Test, in which the goal is to convince the user that the chatbot is human (Bastin and Cordier, 1998). Instead of quantifying what makes humans human, this evaluation section will focus on more granular evaluation criteria, some of which address the limitations highlighted in the prior section.
A 2021 study by Doğruöz and Skantze divided human-chatbot conversations into three categories: informal small talk, relationship talk, and goal-directed talk. Because chatbots tend to jump between topics, their disjointed conversational flow is suitable only for small talk. This is reflected in the study’s findings and conclusions, where the authors argue that open-domain chatbots need to be evaluated on a “wider repertoire of speech events” so that language models can successfully engage in a wider range of topics and conversation types (Doğruöz and Skantze, 2021).
A study on the Afrikaans language expands on this, highlighting three key evaluation metrics for judging chatbot responses: dialogue efficiency (containing the necessary information), dialogue quality (being reasonable and understandable), and user satisfaction (minimizing conversational repetitiveness) (Shawar and Atwell, 2007). Because chatbots have different areas of focus, the paper advocates for a more holistic evaluation based on individual use cases and user needs. The evaluation criteria should reweigh the three aforementioned metrics according to the specific chatbot’s goals.
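Such a use-case-specific reweighting can be expressed as a simple weighted average over the three metrics. The weights and scores below are invented for illustration; in practice each metric would come from human judgments or automated measures.

```python
def evaluate(scores, weights):
    # Weighted average of the three Shawar-Atwell metrics; the weights
    # encode how much each metric matters for this chatbot's goals.
    metrics = {"efficiency", "quality", "satisfaction"}
    assert set(scores) == set(weights) == metrics
    total = sum(weights.values())
    return sum(scores[m] * weights[m] for m in metrics) / total

# A task-oriented bot might weight dialogue efficiency most heavily,
# while a companionship bot would emphasize user satisfaction instead.
task_weights = {"efficiency": 3, "quality": 1, "satisfaction": 1}
scores = {"efficiency": 0.9, "quality": 0.6, "satisfaction": 0.7}
print(round(evaluate(scores, task_weights), 2))  # (2.7 + 0.6 + 0.7) / 5 = 0.8
```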
Building on the dialogue efficiency evaluation criteria of the Afrikaans language study, user intent is also a key aspect of the question-answering process. Beyond just understanding the explicit meaning of the sentence, a language model should ideally also understand why a specific question is being asked. For example:
“If an e-commerce website does not sell women's Reebok shoes of size 10, its chatbot might answer “correctly” to a user who asked if those shoes are available, by responding “No we do not have women’s Reebok shoes of size 10.” However, this answer is not “helpful,” that is, a user shown this answer will not be tempted to search for other products on the website.” (Gupta et al., 2022)
A technically correct answer does not necessarily help the chatbot meet its goal: to make a sale. By adjusting the model to account for context, a good metric for dialogue efficiency should answer both the explicit user question and their implicit reason for asking.
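One way to operationalize this is a fallback that answers the explicit question while also serving the implicit goal of keeping the user shopping. The inventory schema, item fields, and wording below are hypothetical; the point is the contrast between a bare “No” and a helpful answer.

```python
# Toy inventory; a real e-commerce bot would query a product database.
INVENTORY = [
    {"brand": "Reebok", "gender": "women", "size": 9},
    {"brand": "Nike", "gender": "women", "size": 10},
]

def answer(brand, gender, size):
    exact = [i for i in INVENTORY if i["brand"] == brand
             and i["gender"] == gender and i["size"] == size]
    if exact:
        return f"Yes, we have {gender}'s {brand} shoes in size {size}."
    # No exact match: offer the closest alternatives instead of a bare "No",
    # addressing the implicit reason the user asked.
    similar = [i for i in INVENTORY if i["gender"] == gender
               and (i["brand"] == brand or i["size"] == size)]
    alts = ", ".join(f"{i['brand']} size {i['size']}" for i in similar)
    return f"We don't have that in stock, but you might like: {alts}."

print(answer("Reebok", "women", 10))
```

Both responses are factually correct, but only the second one keeps the user engaged with the website.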
3 Current and Future Developments
A key area of development is making language models empathetic. A key difficulty is a lack of labeled datasets. Existing datasets are overly vague, with overly broad emotional labels (positive, neutral, and negative). Instead, these models require more granular labels (questioning, consoling, suggesting, etc.). There is a proposed three-step process to semi-automate this dataset generation: take preexisting fine-grained labels, train a weak labeler to classify by broad emotions, then use data augmentation to train a strong (granular) labeler (Welivita et al., 2021). By first choosing a more general label, the pipeline reduces the choices the strong classifier needs to differentiate between, increasing overall accuracy. This granular emotion dataset would be useful with the global-local model that Wang and Feng designed, where a global model identifies the endurance of emotions and a local BERT model categorizes emotions based on localized prior data (Wang and Feng, 2023).
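The coarse-then-fine structure of that pipeline can be sketched as two nested classifiers: a weak labeler picks a broad class, and the strong labeler only chooses among the fine labels under it. The keyword rules and label sets here are toy placeholders for trained models.

```python
# Illustrative label hierarchy; the real taxonomy is far more granular.
BROAD_TO_FINE = {
    "positive": ["encouraging", "consoling"],
    "negative": ["complaining", "questioning"],
}

def weak_label(text):
    # Toy coarse classifier; a real weak labeler would be trained
    # on broad emotion annotations.
    return "positive" if "thank" in text or "great" in text else "negative"

def strong_label(text, broad):
    # Fine-grained choice restricted to the broad class's label set,
    # which shrinks the strong classifier's decision space.
    fine = BROAD_TO_FINE[broad]
    return fine[-1] if "?" in text else fine[0]

text = "why did that happen?"
print(strong_label(text, weak_label(text)))
```

Restricting the second stage to two candidates instead of four is exactly the accuracy-by-decomposition effect the proposal relies on.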
Another area of development is the combination of open-domain and task-oriented chatbots into a single augmented knowledge graph (KG) chatbot system. While existing open-domain chatbots like ChatGPT can answer various complex questions, this is only because the relevant knowledge graphs were part of their training data. In addition to standard chatbot characteristics, KG chatbots will need to query relevant information (extract data from external data sources) and generalize across different domains (find information from various data sources) (Omar et al., 2023). Because data sources are not necessarily stored in a tabular format, information retrieval systems need to create a knowledge graph from the raw text and perform data extraction on the graph (Galitsky et al., 2020). Existing parsers and chatbots often err when parsing such texts, leading to incorrect relationships between various objects (Mitsuda et al., 2022). Improving this scheme is key to maximizing the data sources available to the KG chatbot, increasing its overall domain knowledge, and minimizing error rates.
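The core distinction, facts queried at answer time rather than baked into training data, can be shown with a minimal triple store. The facts and the query interface below are illustrative; production KG systems use graph databases and query languages like SPARQL.

```python
# Minimal knowledge graph: facts as (subject, relation, object) triples.
TRIPLES = [
    ("Paris", "capital_of", "France"),
    ("Berlin", "capital_of", "Germany"),
    ("France", "currency", "Euro"),
]

def query(subject=None, relation=None, obj=None):
    # Return every triple matching the fields that were specified;
    # None acts as a wildcard.
    return [t for t in TRIPLES
            if (subject is None or t[0] == subject)
            and (relation is None or t[1] == relation)
            and (obj is None or t[2] == obj)]

# "What is the capital of France?" becomes a pattern match on the graph.
print(query(relation="capital_of", obj="France"))
```

Updating a fact means editing one triple, not retraining the model, which is the main advantage KG chatbots have over purely parametric ones.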
One way that chatbots are being improved is to augment the traditional forward dialogue model with a backwards dialogue model that predicts future conversations, then uses those predicted futures to augment the current response (Liu et al., 2022). Dubbed ProphetChat, this model provides a more informative response by preemptively addressing whatever follow-ups the user might have. Although there is an unavoidable degree of noise with predicted future texts, these predictions allow ProphetChat to outperform forward-only dialogue models in terms of readability, sensibleness, and specificity.
An additional area of development is adjusting the conversation based on the individual. The first approach is a vocabulary filter. Although there are many different ways to restrict lexical complexity, a 2022 study from the University of Cambridge found that the optimal method is re-ranking, where the model chooses candidate responses based on CEFR proficiency levels (i.e., sentence difficulty) (Tyen et al., 2022). By re-ranking potential responses based on readability, the model increases overall response comprehension. Another way to tailor conversations is based on the user’s Myers-Briggs personality type. A preliminary study trained two different chatbots: one with introversion thinking, another with extraversion feeling. When users interacted with both, there was a statistically significant difference in their preferred chatbot; with a p-value of 0.003, the chatbot that matched their personality was perceived to be more natural than the personality-mismatched chatbot (Fernau et al., 2022). As such, providing users with a personality-matched chatbot would help enhance the user experience.
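The re-ranking idea can be sketched as a post-hoc filter over candidate responses. Real systems classify sentences into CEFR levels with trained models; the mean-word-length proxy, threshold, and tie-breaking rule below are crude stand-ins chosen only to make the mechanism concrete.

```python
def readability(sentence):
    # Crude difficulty proxy: mean word length. Real re-rankers use
    # CEFR-level classifiers trained on graded learner text.
    words = sentence.split()
    return sum(len(w) for w in words) / len(words)

def rerank(candidates, max_difficulty):
    # Keep candidates at or below the learner's level, then prefer the
    # longest (most informative) one; fall back to the simplest overall.
    suitable = [c for c in candidates if readability(c) <= max_difficulty]
    return max(suitable, key=len) if suitable else min(candidates, key=readability)

candidates = [
    "Photosynthesis transforms electromagnetic radiation into chemical energy.",
    "Plants use light to make their food.",
]
print(rerank(candidates, 5.0))
```

Because the generator is untouched, the same model can serve learners at different proficiency levels just by changing the difficulty threshold.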
There are two main forms of chatbots: open-domain and task-oriented. Each is designed with specific goals in mind, and thus is evaluated on a different set of criteria. Future work involves merging these chatbots into a single, more encompassing language model, as well as other developments to improve the user experience and information accuracy.
- A. Seza Doğruöz and Gabriel Skantze. 2021. How “open” are the conversations with open-domain chatbots? A proposal for Speech Event based evaluation. In Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 392–402, Singapore and Online. Association for Computational Linguistics.
- A. Shaji George, A.S. Hovan George, and A.S. Gabrio Martin. 2023. A Review of ChatGPT AI's Impact on Several Business Sectors. Partners Universal International Innovation Journal (PUIIJ), 1(1):9–23, February.
- Alexander Roghult. 2014. Chatbot trained on movie dialogue. KTH, School of Computer Science and Communication (CSC), December.
- Amelia Hardy, Ashwin Paranjape, and Christopher Manning. 2021. Effective Social Chatbot Strategies for Increasing User Initiative. In Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 99–110, Singapore and Online. Association for Computational Linguistics.
- Anuradha Welivita, Yubo Xie, and Pearl Pu. 2021. A Large-Scale Dataset for Empathetic Response Generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 1251–1264, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Arsénio Reis, Dennis Paulino, Hugo Paredes, and João Barroso. 2017. Using intelligent personal assistants to strengthen the Elderlies’ social bonds. Universal Access in Human–Computer Interaction. Human and Technological Environments, 10279:593–602, May.
- Bayan Abu Shawar and Eric Atwell. 2007. Different Measurements Metrics to Evaluate a Chatbot System. In Proceedings of the Workshop on Bridging the Gap: Academic and Industrial Research in Dialog Technologies, pages 89–96, Rochester, New York. Association for Computational Linguistics.
- Boris Galitsky, Dmitry Ilvovsky, and Elizaveta Goncharova. 2020. On a Chatbot Navigating a User through a Concept-Based Knowledge Model. In Proceedings of Workshop on Natural Language Processing in E-Commerce, pages 53–65, Barcelona, Spain. Association for Computational Linguistics.
- Bruno Campello de Souza, Agostinho Serrano Andrade Neto, and Antonio Roazzi. 2023. Are the New AIs Smart Enough to Steal Your Job? IQ Scores for ChatGPT, Microsoft Bing, Google Bard and Quora Poe. SSRN Electronic Journal, April.
- Chang Liu, Xu Tan, Chongyang Tao, Zhenxin Fu, Dongyan Zhao, Tie-Yan Liu, and Rui Yan. 2022. ProphetChat: Enhancing Dialogue Generation with Simulation of Future Conversation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 962–973, Dublin, Ireland. Association for Computational Linguistics.
- Daniel Fernau, Stefan Hillmann, Nils Feldhus, Tim Polzehl, and Sebastian Möller. 2022. Towards Personality-Aware Chatbots. In Proceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 135–145, Edinburgh, UK. Association for Computational Linguistics.
- Gladys Tyen, Mark Brenchley, Andrew Caines, and Paula Buttery. 2022. Towards an open-domain chatbot for language practice. In Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022), pages 234–249, Seattle, Washington. Association for Computational Linguistics.
- Horace Enea and Kenneth Mark Colby. 1973. Idiolectic Language-Analysis for Understanding Doctor-Patient Dialogues. In International Joint Conference on Artificial Intelligence, pages 278–284.
- John W. Ayers, Adam Poliak, Mark Dredze, Eric C. Leas, Zechariah Zhu, Jessica B. Kelley, Dennis J. Faix, Aaron M. Goodman, Christopher A. Longhurst, Michael Hogarth, and Davey M. Smith. 2023. Comparing Physician and Artificial Intelligence Chatbot Responses to Patient Questions Posted to a Public Social Media Forum. JAMA Internal Medicine, April.
- Joseph Weizenbaum. 1966. Eliza—A Computer Program For the Study of Natural Language Communication Between Man And Machine. Communications of the ACM, 9(1):36–45, January.
- Koh Mitsuda, Ryuichiro Higashinaka, Tingxuan Li, and Sen Yoshida. 2022. Investigating person-specific errors in chat-oriented dialogue systems. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 464–469, Dublin, Ireland. Association for Computational Linguistics.
- Laurel Wamsley. 2018. Google's New Voice Bot Sounds, Um, Maybe Too Real. NPR, May.
- Makesh Narsimhan Sreedhar, Kun Ni, and Siva Reddy. 2020. Learning Improvised Chatbots from Adversarial Modifications of Natural Language Feedback. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2445–2453, Online. Association for Computational Linguistics.
- Mohammad Fraiwan and Natheer Khasawneh. 2023. A Review of ChatGPT Applications in Education, Marketing, Software Engineering, and Healthcare: Benefits, Drawbacks, and Research Directions. ArXiv.
- Natalia Świątoniowska-Lonc, Jacek Polański, Wojciech Tański, and Beata Jankowska-Polańska. 2020. Impact of satisfaction with physician–patient communication on self-care and adherence in patients with hypertension: Cross-sectional study. BMC Health Services Research, 20(1), November.
- Pranav Gupta, Anand A. Rajasekar, Amisha Patel, Mandar Kulkarni, Alexander Sunell, Kyung Kim, Krishnan Ganapathy, and Anusua Trivedi. 2022. Answerability: A custom metric for evaluating chatbot performance. In Proceedings of the 2nd Workshop on Natural Language Generation, Evaluation, and Metrics (GEM), pages 316–325, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
- Reham Omar, Omij Mangukiya, Panos Kalnis, and Essam Mansour. 2023. ChatGPT versus Traditional Question Answering for Knowledge Graphs: Current Status and Future Directions Towards Knowledge Graph Chatbots. ArXiv, 2302.06466, February.
- Renxi Wang and Shi Feng. 2023. Global-Local Modeling with Prompt-Based Knowledge Enhancement for Emotion Inference in Conversation. In Findings of the Association for Computational Linguistics: EACL 2023, pages 2120–2127, Dubrovnik, Croatia. Association for Computational Linguistics.
- Ripon K. Chakrabortty, Mohamed Abdel-Basset, and Ahmed M. Ali. 2023. A multi-criteria decision analysis model for selecting an optimum customer service chatbot under uncertainty. Decision Analytics Journal, 6:100168, March.
- Siddharth Verma, Justin Fu, Sherry Yang, and Sergey Levine. 2022. CHAI: A CHatbot AI for Task-Oriented Dialogue with Offline Reinforcement Learning. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4471–4491, Seattle, United States. Association for Computational Linguistics.
- Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Eric Michael Smith, Y-Lan Boureau, and Jason Weston. 2021. Recipes for Building an Open-Domain Chatbot. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 300–325, Online. Association for Computational Linguistics.
- V. Bastin and D. Cordier. 1998. Methods and tricks used in an attempt to pass the Turing Test. In New Methods in Language Processing and Computational Natural Language Learning.
- Xinjie Deng and Zhonggen Yu. 2023. A meta-analysis and systematic review of the effect of chatbot technology use in Sustainable Education. Sustainability, 15(4):2940, February.
- Yi Chern Tan and L. Elisa Celis. 2019. Assessing Social and Intersectional Biases in Contextualized Word Representations. arXiv, 1911.01485, November.