Keywords
Artificial Intelligence, Narrative Inquiry, Qualitative Research, WhatsApp Interviews, Conversational AI, Prompt Engineering, Digital Research Methodologies
Researchers are leading the development of AI designed to conduct interviews. These developments imply that AI's role is expanding from mere data analysis to becoming a tool for social researchers to interact with and comprehend their subjects. Yet, academic discussions have not addressed the potential impacts of AI on narrative interviews. In narrative interviews, data collection is a collaborative effort: the interviewer also contributes to exploring and shaping the interviewee's story. A competent narrative interviewer has to display critical skills, such as maintaining a specific questioning order, showing empathy, and helping participants delve into and build their own stories.
This case study configured an OpenAI Assistant on WhatsApp to conduct narrative interviews with a human participant. The participant shared the same story in two distinct conversations: first, following a standard cycle and answering questions earnestly, and second, deliberately sidetracking the assistant from the main interview path as instructed by the researcher, to test how well the metrics could reflect the deliberate differences between different conversations. The AI's performance was evaluated through conversation analysis and specific narrative indicators, focusing on its adherence to the interview structure, empathy, narrative coherence, complexity, and support for human participant agency. The study sought to answer these questions: 1) How can the proposed metrics help us, as social researchers without a technical background, understand the quality of the AI-driven interviews in this study? 2) What do these findings contribute to our discussion on using AI in narrative interviews for social research? 3) What further research could these results inspire?
The findings show to what extent the AI maintained structure and adaptability in conversations, illustrating its potential to support personalized, flexible narrative interviews based on specific needs.
These results suggest that social researchers without a technical background can use observation-based metrics to gauge how well an AI assistant conducts narrative interviews. They also prompt reflection on AI's role in narrative interviews and spark further research.
In this extensively revised manuscript, we have made key changes based on the constructive feedback received, including the suggestion to frame the study as a pilot. We refined the article to improve clarity, address gaps, and better define the research aims while maintaining the original dataset. The manuscript now focuses on how AI chatbots, specifically GPT-4, can be applied in social work for conducting narrative interviews. Using an autoethnographic case study, we developed and tested a chatbot on WhatsApp, which followed a structured interview plan and demonstrated empathy. While challenges in replicating human skills were noted, the study concludes that AI tools can be tailored by social workers, though further research is needed. Our emphasis remains on democratizing AI and empowering social workers to use these tools effectively, highlighting the need for interdisciplinary collaboration in developing robust assessment methods.
The use of AI chatbots in various real-life applications is rapidly increasing due to their ability to reduce reliance on humans, lower costs, improve efficiency, and streamline service experiences (Adamopoulou & Moussiades, 2020; Bendig et al., 2019; Chan & Li, 2023; Greer et al., 2019; He et al., 2022; Liu et al., 2022; Omarov et al., 2023; Shah et al., 2017; Tamayo et al., 2020; Tanana et al., 2019; Xu & Zhuang, 2022).
Before we proceed, it is important to clarify some terminologies related to conversational user interfaces. Research on these interfaces, including dialogue systems, embodied conversational agents, and social robotics, is converging towards the development of improved conversational computing systems. These systems are increasingly referred to as “chatbots” in both industry and research, a term that broadly encompasses conversational agents designed for various purposes, such as task completion, information delivery, entertainment, and social interactions, through text, voice, or both (Følstad et al., 2021). Unlike more narrowly defined uses of the term, this understanding of chatbots focuses on the design, development, and implications of these systems rather than specific technologies. Given the rapid evolution of technology and usage patterns, maintaining distinctions between types of conversational agents is challenging, as there often is significant overlap in functionality and modality. In this study, we generally use the term “chatbot,” but we specifically refer to ‘AI Assistant’ when discussing OpenAI’s Assistant API (Application Programming Interface).
Central to this AI chatbot proliferation is GPT (Generative Pretrained Transformer), a deep learning model that uses a self-attention mechanism to weigh different parts of input sentences and determine their importance. By converting sentences into tokens and predicting the next token’s probability, GPTs can generate coherent text that often appears human-written. Various studies have demonstrated this capability well (Brown et al., 2020; Wei et al., 2021; Wu et al., 2021). GPT-based chatbots, for example ChatGPT, can understand natural language inputs and produce responses that are contextually appropriate and coherent, thereby improving interactivity and efficiency (Bendig et al., 2019; Nath et al., 2021; Shah et al., 2017; Xu & Zhuang, 2022).
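To make the notion of tokens more concrete for readers without a technical background, the snippet below is a minimal illustrative sketch (not part of our study pipeline) that uses OpenAI’s open-source tiktoken library to show how a sentence is split into the tokens a GPT model operates on; the example sentence is hypothetical.

```python
# Illustrative only: inspect the tokens a GPT-4-class model would see
# before predicting the next token. Requires `pip install tiktoken`.
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-4")
sentence = "Tell me more about your first week in Hong Kong."

token_ids = encoding.encode(sentence)
print(token_ids)                                   # integer IDs fed to the model
print([encoding.decode([t]) for t in token_ids])   # the text fragment behind each ID
print(len(token_ids), "tokens")
```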
Since 2023, advancements in customizable features have further transformed the landscape of GPT applications, enabling not just programmers but also non-technical users to become GPT designers rather than merely end-users, thanks to simplified coding and prompt engineering (Stallbaumer, 2023; UC Today, 2023). For example, OpenAI introduced ChatGPT Builder and the Assistant API in November 2023, enabling users to build custom GPTs without technical expertise or coding experience (DeVon, 2023; UC Today, 2023). ChatGPT Builder provides a user-friendly interface for creating custom chatbot versions without coding, featuring built-in tools that OpenAI fully hosts. However, it does not support integration with other application platforms. On the other hand, the Assistant API also offers a Playground interface for building custom chatbots without coding. Unlike ChatGPT Builder, it allows developers to integrate these chatbots into various applications, such as instant messaging platforms like WhatsApp or Facebook Messenger, provided they have some coding skills and knowledge of these third-party platforms.
These backend platforms allow users to use natural language instructions to define the AI’s role and context, set its tone and behavior, and design stepwise procedures, making the chatbot’s responses relevant and coherent. For example, a customer service chatbot can be instructed to greet users politely, answer frequently asked questions, and escalate issues to a human agent if it cannot provide a satisfactory answer. Additionally, a chatbot for educational purposes can be guided to explain concepts in simple terms, provide examples, and quiz users to reinforce learning. By tailoring the AI’s behavior and responses, chatbot designers can create a more personalized and effective experience for various applications. This high level of customization expands and popularizes chatbot applications.
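As a minimal sketch of how such natural-language instructions are attached to a chatbot on the back end, the snippet below uses the OpenAI Python SDK to create an assistant; the name, model, and instruction text are illustrative placeholders rather than the configuration used in this study.

```python
# Illustrative sketch: defining a chatbot's role, tone, and stepwise behavior
# purely through natural-language instructions.
# Requires `pip install openai` and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

assistant = client.beta.assistants.create(
    name="Customer Support Helper",   # hypothetical example assistant
    model="gpt-4",
    instructions=(
        "You are a polite customer-service agent. "
        "Step 1: greet the user and ask how you can help. "
        "Step 2: answer frequently asked questions in plain language. "
        "Step 3: if you cannot resolve the issue, apologize and offer to "
        "escalate the conversation to a human agent."
    ),
)

print(assistant.id)  # this ID is later referenced when running conversations
```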
Against this background, chatbot researchers have identified several areas for future development, with democratizing chatbots for social good being one of the core themes (Følstad et al., 2021; Powell et al., 2023; Van Brummelen, 2019). This research theme aims to make chatbots more accessible and useful for people with non-technical backgrounds, including marginalized communities and domain experts.
Although there is potential for democratizing AI chatbots, it remains uncertain whether advancements in customizable GPTs will enable social workers to effectively use AI tools. A significant challenge, however, is integrating specialized knowledge from the profession into the development of chatbots and assessing their usability and effectiveness in real-world settings (Følstad et al., 2021; Powell et al., 2023). Integrating domain knowledge into chatbot development requires bridging the gap between AI technology and the expertise of domain professionals. This gap often leads to tensions, as AI developers’ and domain experts’ goals and methodologies can differ significantly. AI developers might focus on technical feasibility and innovation, while domain experts, such as social workers, emphasize chatbots’ practical applicability and ethical implications in their fields. Effective collaboration is needed to reconcile these differences, ensuring that the chatbot is both technologically advanced and aligned with the nuanced needs of its target users.
The challenge goes beyond the simple lack of technical knowledge among domain experts and the need for one-sided training; it requires addressing the deeper, multidimensional barriers to effective collaboration between AI developers and professionals in specialized fields. The concept of affordance may provide a conceptual reference point for understanding how AI technologies may interact with domain experts like social workers. The concept of affordance refers to an object’s potential uses or actions—its latent utility—regardless of whether these possibilities are recognized by a person (Gibson, 1977). In the context of technology, affordances describe the action possibilities of a device that are perceivable by a user, indicating that the use of AI technologies is influenced not just by their design or technical features but also by how users perceive and interact with them based on their social and cultural backgrounds, experiences, and needs (Norman, 1999). Such a concept has been widely discussed in the context of technology in education, inspiring research into the dynamic relationship between technology and human activity (Bower, 2008; Chan & Holosko, 2017; Hammond, 2010; Leonardi, 2011).
These discussions highlight that technologies like AI are not neutral tools; their use and limitations are shaped by their users’ preconceptions and creative adaptations. Therefore, we cannot assume that users will employ AI features exactly as their designers intended. Users often interpret and repurpose technologies based on their unique perspectives and needs, which can extend beyond the original design intentions. For instance, while AI chatbots are typically designed for one-time customer support or information, social workers might repurpose them for ongoing needs assessments in continuous dialogues, leading to unexpected or innovative applications in different contexts.
The extent to which social work can embrace this customizable AI technology remains unclear due to the current research environment in the field. A 2016 review of social work interventions enhanced by information and communication technology (ICT) identified several critical issues (Chan & Holosko, 2016). One primary concern was the “black-box” problem, where unclear processes make developing transferable skillsets difficult. Additionally, the use of ICT has often shifted the focus from skilled practitioners to standardized programs, raising concerns about diminishing the role of social work professionals in such technology-based interventions (Chan & Holosko, 2016). Nearly a decade later, this situation remains essentially unchanged. Discussions around AI in social work do not focus on technical features or processes but primarily on conceptual and ethical issues (Reamer, 2023; Robila & Robila, 2020).
In addition, technical evaluations of chatbots often involve complex algorithms and backend data (Deriu et al., 2021; Finch & Choi, 2020; Lee et al., 2023), which are usually beyond the understanding of lay users and those outside the technical field. This creates a challenge for social workers when it comes to integrating chatbots into their practice. It is paradoxical if they lack the knowledge and ability to evaluate a product they have helped create. This disconnect can make it challenging for social workers to effectively assess and contribute to the development process.
Given these rapid and user-friendly technological developments, an important question arises: Can these new customizable AI tools be meaningfully integrated with social work’s domain knowledge and practices? The current literature lacks discussion on how social work professionals can instruct and evaluate customizable AI chatbots in practical settings. While innovative features like creative writing, text-to-image transformation, answering queries, and assisting students with paper writing are becoming more prevalent, they do not directly align with the practical needs of social work. Social work should focus on how well chatbots can meet specific communication needs, support client engagement, and enhance service delivery rather than on novelty or creativity.
This study specifically chose OpenAI’s Assistant API as the platform for developing and testing AI chatbots because it offers high customization and flexibility (DeVon, 2023; Stallbaumer, 2023; Wang et al., 2023), allowing social work professionals to tailor the tool to their unique requirements. OpenAI Assistant is widely recognized for its user-friendly interface and robust capabilities, making it accessible even for those with minimal coding experience. By leveraging these features, social workers can integrate AI more easily into their practice without extensive technical knowledge.
We recognize that OpenAI Assistant is not the only product on the market, and technical innovations continually evolve. However, we chose OpenAI because its readiness for deployment, widespread adoption in various industries, comprehensive support resources, and active user community provide a solid foundation for exploring, testing, and refining the use of AI in social work settings. This socio-technological landscape inspires and enables us to explore how AI chatbots can be practically integrated and evaluated within social work.
The focus on interviews in this study is a choice for research purposes because many duties in social work involve conversations, such as training, assessments, counseling, and client interactions. Interviews are fundamental for gathering information, building rapport, and understanding clients’ unique situations and needs (Bogo, 2006; Kadushin & Kadushin, 1997). Furthermore, the essence of generative AI lies in its ability to engage in human-like conversations, making it a potential tool for facilitating these interactions. The need to conduct interviews and the capabilities of generative AI make a perfect match. By focusing on AI-driven interviews, the research aims to explore how generative AI can be utilized to enhance conversational tasks in social work, supporting practitioners in their efforts to connect with clients more effectively and empathetically.
By “interview,” we refer to dynamic, open-ended, and exploratory interactions rather than online forms with a chatbot facade. This study uses the term “narrative interview” to describe a conversational approach facilitating deeper exploration of stories. In narrative interviews, we expect the chatbot to engage in meaningful dialogue, respond to the natural flow of conversation, and adapt to the nuances of participants’ stories. This flexible, client-centered approach contrasts with structured online questionnaires, which follow a predetermined script and do not consider the unique context or needs of the individual.
Existing chatbot products in social and healthcare services do not support such contextualized narrative interviews as they are often therapy-oriented, primarily based on Cognitive Behavioral Therapy (CBT). For example, chatbots like Wysa and Woebot rely on CBT techniques and follow structured treatment plans with preset responses (Omarov et al., 2023; Xu & Zhuang, 2022). Social work interviews often involve asking questions, advocating, building rapport, conducting outreach, and using flexible, narrative-based approaches to explore and address a range of social and emotional issues (Bogo, 2006; Kadushin & Kadushin, 1997). Most existing healthcare chatbot products leave little room for the involvement of social work practitioners, as they are pre-made products based on predefined therapeutic protocols. This creates a gap between current healthcare chatbot products and social work practice’s needs.
For social work, chatbots that support narrative interviews could be highly relevant to a wide range of direct practices (Chan & Li, 2023). Narrative interviews in social work settings may include applications such as counseling intake, initial needs assessments, community studies, and contextualized inquiries. These types of interviews require chatbots that can handle a variety of conversational contexts and adapt to the specific needs of each situation. By configuring chatbots to facilitate narrative interviews, social workers may leverage AI to gather more nuanced information and engage more efficiently with clients.
Some researchers are now at the forefront of developing AI tools specifically for interview purposes (Cordero et al., 2022; Han et al., 2021). Additionally, commercial entities have begun to harness the power of generative AI to conduct qualitative customer interviews, streamlining the process of gathering and analyzing feedback (e.g., Listen Labs https://listenlabs.ai/). These AI tools can conduct interviews, record responses, and use advanced algorithms to interpret data, uncovering insights that might be overlooked. By automating repetitive tasks, AI may accelerate the interview process and ensure high precision.
Given these advantages, customizable AI-driven narrative interviews may substantially benefit social work. However, this emerging trend is largely under-researched or virtually nonexistent, presenting exciting opportunities and significant challenges. Narrative interviews are inherently complex because they require interaction, continuity, responsiveness, and deep engagement with participants’ stories. In these interviews, the dialogue goes beyond merely collecting information; it involves collaboratively constructing a narrative, which can influence the interview subjects (Chase, 2018; Clandinin, 2007). If AI chatbots are to conduct narrative interviews, what essential competencies must they have to capture the depth and richness of participants’ experiences adequately?
This study specifically focuses on asynchronous text-based chatbots because, at the current stage of technology development, these types of chatbots are more accessible, flexible, and easier to implement in diverse settings compared to real-time, synchronous verbal communication systems (Luo et al., 2022). Asynchronous chatbots allow users to engage at their own pace, which can be particularly beneficial in social work and counseling contexts, where participants might need time to reflect and compose their responses. Additionally, asynchronous interactions reduce the pressure on users to respond immediately, fostering a more thoughtful and reflective dialogue that aligns with the principles of narrative inquiry. Furthermore, these chatbots create a retrievable text-based record of interactions, offering a valuable resource for both reflection and documentation, which can be useful for tracking progress and outcomes in therapeutic or consultative sessions.
As an inquiry method, narrative interviewing is closely informed by constructivist and social constructionist worldviews, which suggest that reality is constructed through social interactions and discursive contexts (Abkhezr et al., 2020; Denicolo et al., 2016). Participants engaging in meaningful conversations during interviews can contribute to content construction and reflection (Gergen, 2001; Shotter, 1993). As such, narrative inquiry generally assumes a more active role for interviewers and interviewees in which their “active narrativity” (Gubrium et al., 2012, p. 28) is highlighted.
As AI chatbots continue to evolve, researchers and industry experts have proposed various metrics to assess the performance of AI (Katic et al., 2024; Sharma, 2020; Xu & Jiang, 2024). Existing literature primarily measures the overall technical performance of dialogue systems, such as response time, language proficiency, repetitions, and human likeness (Deriu et al., 2021; Finch & Choi, 2020; Lee et al., 2023), rather than focusing on the user end experience at the conversational level. While evaluating the technical features of language models is fundamental, these assessments can be overly complex and may not directly apply to interview contexts. Some researchers have provided practical measures at the conversation level; for example, Concannon and Tomalin (2023) suggest using an Empathy Scale for Human–Computer Communication, based on raters using a Likert scale to assess the textual content of conversations. This instrument is relevant and valuable, but empathy is just one domain in narrative interviews. Other areas also need to be addressed to ensure the quality of interviews.
Enabling non-technical background developers, such as social workers, to assess AI’s role in narrative interviews is crucial because it ensures that these technologies are used in ways that genuinely enhance social work practices supported by their domain knowledge. By focusing on competencies relevant to social work, such as empathy, adaptability, and the ability to foster meaningful dialogue, social workers can better evaluate whether AI chatbots effectively support their goals. This approach empowers social workers to shape AI chatbots to meet their unique needs actively, ultimately bridging the gap between technological innovation and practical application in human services.
Identifying the full range of competencies expected from asynchronous text-based chatbots is a complex task that extends beyond the scope of this study. Nonetheless, we have selected two core competencies to focus on—fidelity to the interview guide and empathetic engagement—to facilitate the discussion and guide the empirical aspects of our research. The rationales are as follows:
First, text-based online narrative interviewing is much more than just filling out a self-report questionnaire. In narrative interviews, the sequence of questioning is crucial. Unlike self-report questionnaires, where each question is independent and can be arranged randomly, narrative interviews involve a continuous dialogic process that reflects the participant’s evolving story. A narrative inherently implies a sequence and consequence of events, and the order of questions can significantly affect the outcomes (Riessman, 2008; Riessman & Quinney, 2005). Interviewers must be sensitive to this sequence and guide the conversation to remain focused on the specific topic of inquiry while allowing for the natural flow of the participant’s story. Several counselor competence frameworks have highlighted questioning as one of the top skills required for interviewers (Fowler et al., 2021; Rice et al., 2022; Swank et al., 2012). In narrative practice, the structure or progression of questions is often seen as essential to achieving the intended effects (Duvall & Béres, 2011; White, 2007). Furthermore, narrative interviewers need to facilitate participants in exploring and constructing meanings (Abkhezr et al., 2020). This expectation is well-suited for asynchronous text-based chatbots because OpenAI Assistants can be instructed to follow specific sequences and adapt their responses based on interviewee input (DeVon, 2023; UC Today, 2023). Additionally, all interactions are text-based, making the records retrievable and analyzable.
Second, narrative interviewers need to demonstrate empathy. In narrative interviews, participants are not merely providing factual descriptions but actively producing data through their narratives. This requires the interviewer to be able to exercise “active narrativity” (Chase, 2018; Clandinin, 2007). As such, the narrative interviewer’s capacity to be empathic, nonjudgmental, concerned, tolerant, and emotionally responsive is prioritized (Josselson, 2007, p. 539). Demonstrating empathy helps create a safe and supportive environment where participants feel comfortable sharing their stories and actively contributing to knowledge production. Empathy is also frequently highlighted in counselor competence frameworks as a critical skill for effective interviewing (Fowler et al., 2021; Rice et al., 2022; Swank et al., 2012). This expectation is reasonable and suitable for asynchronous text-based chatbots because OpenAI Assistants can be designed to recognize and respond to emotional cues in the participant’s text, providing empathetic responses that align with the client’s conversational flow and demonstrate understanding and support.
In summary, among various potential competencies for AI chatbots, we have chosen to focus on two key competencies—fidelity and empathy—because they are particularly relevant to conducting narrative interviews. This focus allows us to explore the minimal requirements for effective performance in this context, serving both heuristic and research purposes. Fidelity involves adhering to the guidelines set by designers and practitioners, ensuring that the chatbot maintains the intended structure and focus of the narrative interview. Empathy requires aligning with the interviewee’s conversational flow and responding to demonstrate genuine understanding and support for the participant’s emotional and narrative journey. By meeting these fundamental competencies, AI chatbots may have the potential to facilitate meaningful interactions during narrative interviews. With this foundational understanding, our study aimed to answer the following questions:
1. Can OpenAI Assistant be instructed by social workers with minimal coding experience to function as a narrative interview chatbot?
2. Can the chatbot’s performance in terms of fidelity and empathy be assessed using simple, observation-based rubrics?
3. How do the findings contribute to the broader discussion on democratizing AI for specialized fields like social work?
In this study, we adopted an autoethnographic case study method to research customizing generative AI for narrative therapy interviews. Autoethnography is a qualitative research method where researchers analyze their own personal experiences to understand broader contexts. It combines autobiography and ethnography, using personal narratives to explore and reflect on cultural phenomena (Ellis & Bochner, 2000). We chose this method for two primary reasons. First, autoethnography is increasingly recognized as a practical approach for researching new technologies and social robotics, as it allows for a critical exploration of personal experiences within broader contexts (Chun, 2019; Mao et al., 2023). This approach provides a unique, firsthand perspective on the subjective aspects of technological interaction. Second, autoethnography allows us to evaluate new tools before introducing them to clients or other human participants. This approach allows us to move cautiously and thoughtfully, considering the technology’s impact and limitations to ensure more ethical decisions in future experiments with service users.
We conducted a collaborative autoethnography, where researchers collectively reflect on and analyze their personal experiences to gain deeper insights into a phenomenon (Chang et al., 2016). We incorporated a jigsaw approach, where each teammate focused on a specific aspect of the study and then shared their insights with the group. This method encouraged a more thorough exploration of the topic by ensuring that different angles were covered and integrated into the overall analysis.
This autoethnographic journey involved several stages. First, we set up an AI chatbot using OpenAI’s customizable GPT system. Next, we explored whether the chatbot could function properly in real-life situations. In this study, the first author designed and deployed the chatbot, carefully crafting the instructions, while the second author participated in blind testing conversations, initially unaware of the specific instructions given to the chatbot by the first author. The third author was responsible for collecting and organizing the data from these conversations, ensuring accurate dialogue transcription, and systematically coding the data for analysis. We also conducted a preliminary review of the transcripts, noting any deviations from the expected conversational flow and highlighting instances of empathetic responses by the AI. This preliminary analysis provided a foundation for the utterance analysis method we later used to evaluate whether the human-chatbot conversations followed the intended progression and displayed the required empathetic attitudes. Finally, all authors came together to discuss their gut feelings, observations, and insights from the experience.
We are social work researchers exploring the integration of technology into social work practice. The first author has an undergraduate background in social work and over 20 years of frontline experience in Hong Kong. He has also conducted research on the use of technology in social work interventions and is currently an associate professor at a university. The second author, originally from mainland China, is pursuing her PhD under the supervision of the first author. The third author is also a PhD student in the social work department. Our social work background is an asset because it allows us to approach the integration of AI into social work from a practitioner perspective, ensuring that the technology developed is accessible and relevant to social workers. Such an autoethnographic study helps test the boundaries of our current knowledge and skills and raises questions about social work domain knowledge in an era increasingly influenced by AI chatbots. The project was approved by the Hong Kong Baptist University Research Ethics Committee project REC/23-24/0385. This study is an autoethnography, so the authors have indicated their consent to participate.
As social researchers with minimal or almost no programming background, we could configure a chatbot with OpenAI’s Assistant API (https://platform.openai.com/) using its GPT-4 model. It is worthwhile to note that this Assistant API differs from the highly user-friendly ChatGPT, as it offers more robust features tailored for developers, and it enabled us to deploy chatbots on websites and various messaging platforms, offering flexibility and user accessibility without relying on OpenAI’s user interface. The OpenAI Assistant API allows users to set long, stepwise, and systematic instructions for the chatbot on the back end. All of this can be done without coding, using the Playground on OpenAI.
We deployed our chatbot on WhatsApp using coding instructions sourced from developer forums. Using WhatsApp offers several advantages. First and foremost, the use of WhatsApp in research and interviews is widely discussed, as it can potentially enhance communication and collaboration outside traditional lab settings (Gasaymeh, 2017; Suárez-Lantarón et al., 2022), and it allows for convenient, accessible, and flexible communication that fits into users’ everyday lives (Ahmad & Purwaningrum, 2022; Lin et al., 2022). Second, many people are familiar with WhatsApp, so chatting with a chatbot does not require them to pick up new skills. Third, because the interaction is immediate yet asynchronous, users can get quick replies and support while chatting with the chatbot anytime and anywhere.
We adopted a structured conversational framework that prompted participants to engage thoroughly with their personal stories, fostering a deeper self-reflection and integration of insights. The framework used in this study was partially informed by questioning skills used in narrative therapy (Au-yeung, 2023; Chan, 2012, 2023; Chan et al., 2020; Chan et al., 2012; Ramey et al., 2009; Ramey et al., 2010; White, 2007). In this study, the conversation plan involves unfolding details and elaborating connections. It then invited the participant to propose a name based on inductive reflection and explore deeper core values. The stages are as follows: 1. Orientation, 2. Unfolding, 3. Naming, 4. Explaining, 5. Exploring core values, 6. Aspirations, and 7. Closing. Instructions for the AI, specifically the interview guide, are available in the published dataset (Chan, 2024).
In this study, we utilized this framework to balance complexity and manageability in our testing. The conversation plan was designed to be neither too simple nor too complex, allowing us to effectively test the AI’s ability to navigate nuanced dialogues while maintaining continuity and demonstrating interview skills. Unlike more superficial, disjointed question-and-answer dialogues, this structured approach provides a cohesive flow that supports a deeper exploration of the participant’s experiences. This interview plan was tailored to meet the specific needs of our study, but practitioners or researchers can develop their own structured steps or modify these steps to suit various purposes, such as conducting specific assessments, managing service inquiries, or performing different types of tests.
While an in-depth theoretical discussion of interview strategies is beyond the scope of this study, our primary objective is to evaluate the chatbot’s adherence to these instructions and its capability to facilitate meaningful conversations. This interview plan is adequate for us to assess the AI chatbot’s performance in a practical context.
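For illustration only, the block below sketches how the seven stages might be compressed into stepwise instructions for the assistant; it is not the 31-step instruction set actually used in this study, which is available in the published dataset (Chan, 2024).

```python
# Illustrative only: a compressed rendering of the 7-stage interview plan as
# back-end instructions. The full 31-step instructions used in the study are
# published in the dataset (Chan, 2024). A string like this can be passed as
# the `instructions` argument when creating an OpenAI Assistant.
INTERVIEW_INSTRUCTIONS = """
You are Nico, a narrative interview companion. Keep a warm, non-judgmental tone
and ask one question at a time, waiting for the participant's reply.
Follow these stages in order:
1. Orientation: introduce yourself and invite the participant to choose a topic.
2. Unfolding: ask for concrete episodes; follow up on details and connections.
3. Naming: invite the participant to propose a name or metaphor for the episodes.
4. Explaining: ask why that name or metaphor fits; if the participant is unsure,
   offer two or three example metaphors and let them choose or propose their own.
5. Exploring core values: ask what the story reveals about what matters to them.
6. Aspirations: ask what hopes or next steps follow from these values.
7. Closing: summarize, thank the participant, and invite final comments.
If the participant drifts off topic, acknowledge the comment briefly and gently
return to the current stage.
"""
```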
We engaged Yunmeng, the second author of this article, for an AI-facilitated narrative interview to discuss her cultural adaptation and academic challenges in Hong Kong. The first author designed an AI chatbot to conduct the interview, allowing Yunmeng to respond independently. Notably, Yunmeng was unaware of the specific steps of the interview plan before engaging with the chatbot, although she knew the interviewer was a bot and not a human being. Yunmeng was requested to share the same story across two distinct conversations. She followed a standard process in the first round, earnestly answering the questions as instructed. In the second conversation, she spent more time elaborating on different events related to her issue and deliberately attempted to divert the chatbot from the main interview path. These two versions were developed for research purposes: the first was designed to be serious, while the second was deliberately non-cooperative. We intentionally crafted these differences to test how well the chatbot could capture and reflect these variations. We collected the data for the study via the AI chatbot, which conducted the narrative interviews via WhatsApp on 2024-01-17. Yunmeng interacted with the chatbot asynchronously, which allowed her to respond at her own pace.
This study used rating rubrics that social workers typically use to evaluate the performance of trainees. By using these familiar methods, domain experts could effectively assess the competencies of the chatbots. This approach helps ensure that the chatbots are relevant and useful for practical applications in social work.
Assessing fidelity
To assess the fidelity of the chatbot’s performance, we used a three-level rubric (poor, fair, good) to determine the extent to which the assessor (the first author) judged that the interviewer followed the conversation strategy as planned (see Table 1).
The rubric for assessing fidelity.
Building on the “Expected Competencies” outlined in the introductory section, fidelity is operationally defined as the degree to which the chatbot follows the interview steps—1) Orientation, 2) Unfolding, 3) Naming, 4) Explaining, 5) Exploring core values, 6) Aspirations, and 7) Closing—logically and sequentially (see Table 2 for a detailed breakdown of each interview step and its specific criteria).
Based on this operational definition, we developed a rubric to evaluate the chatbot’s ability to facilitate narrative interviews by asking appropriate questions guiding each step. A “Poor” rating indicates ineffective questioning that disrupts the flow, “Fair” suggests basic but sometimes unclear or incomplete questions, and “Good” reflects effective questioning that ensures a logical, comprehensive interview. This rubric guides evaluators in assessing the chatbot’s adherence to the interview process—covering Orientation, Unfolding, Naming, Explaining, Exploring Core Values, Aspirations, and Closing—rather than serving as a strict scoring tool. Evaluators mark the relevant performance level for each step and form an overall judgment of how well the chatbot follows the interview sequence, allowing for a nuanced assessment of its effectiveness. The first and third authors read the interview texts. They then applied the rubric to evaluate the degree to which the AI chatbot adhered to the interview plan. To support these assessments, specific dialogue examples from the conversations were highlighted to showcase instances of effective adherence, missteps, or deviations in following the structured interview process.
To present the conversation flow more vividly, we also employed a simplified sequential analysis to visualize the development of the conversation utterances. To represent progression in asynchronous conversations, we assigned a sequence number to each dialogue turn. Additionally, we used a proportional timeline approach, dividing the conversation into four equal quarters based on the number of utterances rather than actual time. This method allows us to account for varying lengths of conversations. For instance, in a conversation with 100 utterances, each quarter would consist of 25 utterances, with the first 25 as Quarter 1 and the last 25 as Quarter 4.
The utterances from these interviews were organized in a spreadsheet for detailed tagging conducted on 2024-05-14. We manually assigned progression scores to indicate the stage reflected by each conversation utterance, tagging each utterance based on the operational definitions noted earlier. We used ‘1’ for utterances related to orientation, ‘2’ for unfolding story episodes, and so on. A score of ‘0.5’ was assigned to utterances that did not fit into any of these categories, such as misunderstandings, irrelevant comments, or instances where the conversation got sidetracked (see Table 2).
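As a minimal sketch of this tagging-and-quartering procedure (assuming the tagged scores for one speaker are held in a plain Python list; the sample tags below are hypothetical, not our study data):

```python
# Illustrative sketch: compute mean progression scores per quarter from
# manually tagged utterances. Stage tags follow Table 2: 1 = Orientation,
# 2 = Unfolding, ... 7 = Closing, and 0.5 = off-track or irrelevant utterances.
from statistics import mean

def quarter_means(progression_scores):
    """Split one speaker's utterance scores into four equal quarters
    (by utterance count, not elapsed time) and average each quarter."""
    n = len(progression_scores)
    q = n // 4
    boundaries = [0, q, 2 * q, 3 * q, n]  # the last quarter absorbs any remainder
    return [round(mean(progression_scores[boundaries[i]:boundaries[i + 1]]), 2)
            for i in range(4)]

# Hypothetical tags for a short 12-utterance conversation by one speaker.
assistant_tags = [1, 1, 2, 2, 0.5, 2, 3, 4, 5, 6, 7, 7]
print(quarter_means(assistant_tags))   # -> [1.33, 1.5, 4.0, 6.67]
```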
To ensure consistency in the ratings, we conducted inter-rater reliability tests between the manual tagging results and the machine tagging results. The first and third authors independently tagged the two conversations on 2024-05-14. Any differences in the tagging were reviewed and discussed until all disagreements were resolved, ensuring a reliable and accurate assessment. These quantitative ratings are not meant to be absolute measures but serve as heuristic tools to guide our evaluation process.
Quantifying the progressions for visualization.
To assess empathy, we used a three-level rubric (poor, fair, good) to determine the extent to which the assessor (the first author) judged the chatbot’s ability to demonstrate empathy in a conversation based on an overall interpretation of that conversation (see Table 3). The empathy level score used in this study is partly informed by the affective component of the Empathy Scale for Human-Computer Communication proposed by Concannon and Tomalin (2023), which advocates using a set of observable indicators to evaluate empathetic behavior. These indicators are assessed using a Likert scale to measure how effectively the chatbot’s responses align with the user’s emotional tone or intensity.
The rubric for assessing empathy.
Building on the “Expected Competencies” outlined in the introductory section, empathy is operationally defined as the chatbot’s ability to demonstrate concern, compassion, or empathy that aligns with the user’s emotional tone or intensity. This is assessed by examining the textual content of dialogues to observe how the chatbot uses empathetic language, expressions of understanding, and supportive feedback that acknowledges and validates the user’s emotions.
Based on this operational definition, we created a rubric to assess the chatbot’s ability to show empathy through its language, understanding, and emotional alignment. “Poor” performance reflects a lack of empathy or inappropriate responses, “Fair” suggests some empathy with inconsistencies, and “Good” shows consistent, appropriate empathy, ensuring the user feels understood and supported. The rubric serves as a flexible guide rather than a rigid scoring tool, evaluating the chatbot’s ability to recognize emotions, use empathetic language, offer supportive feedback, and validate the user’s feelings. Assessors select the relevant performance level for each area and form a holistic judgment of the chatbot’s empathy competence, allowing for a nuanced, context-sensitive evaluation.
The first and third authors manually tagged the interview text for meaning units reflecting empathy. They then applied the rubric to evaluate the degree to which the AI chatbot demonstrated empathy. To support these assessments, specific dialogue examples from the conversations were highlighted to showcase the chatbot’s use of empathetic language and responses, illustrating how the chatbot engaged with the user’s emotional state and provided appropriate support throughout the conversation.
To ensure a comprehensive evaluation of the AI chatbot’s performance, we compared the observer assessments with the interviewee’s personal experiences. After the interview sessions, Yunmeng was asked to reflect on her experiences with the AI chatbot. She provided feedback on various aspects of the interaction, such as the AI’s responsiveness, its ability to stay on topic, and the emotional resonance of its responses. This subjective feedback was then compared with our rating of the AI’s empathy based on the entire conversations and specific dialogue examples. Yunmeng’s experience provided insight into whether the AI’s empathetic expressions were genuinely felt or seemed formulaic. Similarly, the observers’ tagging of meaning units for empathy was cross-checked against Yunmeng’s own reactions to determine if the AI’s responses matched her emotional needs during the conversation.
In this study, we set up an AI chatbot called Nico (short for Narrative Inquiry Companion). Yunmeng was able to add Nico’s account on WhatsApp and chat with it. We used the name Nico in the actual operation rather than a label like “chatbot” or “AI assistant” because giving the AI a human-like name helps create a more personal and relatable interaction, which can make users feel more comfortable and engaged during their conversations. It reduces the sense of formality and fosters a more natural and empathetic communication environment.
We set up the AI chatbot using OpenAI’s Assistant API, which is the system endpoint of ChatGPT using its GPT-4 model. It is important to note that this Assistant API differs from the user-friendly ChatGPT interface as it offers more robust features tailored for developers. This API allows us to deploy chatbots on websites and various messaging platforms, providing flexibility and user accessibility without relying on OpenAI’s user interface. The OpenAI Assistant API allows users to set long, stepwise, and systematic instructions for the chatbot on the back end. All of this can be done without coding using the Playground on OpenAI (see Figure 1).
Our experience suggested that, with tools like OpenAI’s customizable GPT models, it is feasible for social workers to instruct AI Assistant without needing extensive technical expertise. As discussed in the literature, the democratization of AI technology enables non-technical users to configure AI-driven chatbots, tailoring them to meet specific intervention goals and aligning them with the values central to social work.
Initially, our plan for designing the instructions for the AI Assistant was straightforward. We aimed to directly use the 7-step interview plan outlined in Table 2, believing that a simple set of instructions would suffice to guide the AI through the basic structure of narrative therapy. However, as we began testing the AI, it quickly became apparent that our initial instructions were too vague and lacked the necessary detail to navigate the nuances of therapeutic conversations effectively. The AI’s responses were often off-track, failing to engage users in meaningful dialogue. This realization led us to revisit and revise the instructions multiple times, refining them with each iteration to better align with the therapy’s goals and ensure the AI could handle various conversational scenarios.
What began as a simple framework evolved into a more complex, 31-step set of instructions (Chan, 2024). Each new version required us to think more deeply about the therapeutic process. We had to expand our understanding beyond traditional social work practices to enhance the AI’s ability to mimic human interaction. This process involved setting instructions for the AI in particular ways, such as instructing it to adopt certain attitudes and tones, outlining step-by-step procedures using behavioral terms, creating conditional actions for different user responses, and providing example utterances to guide its language. These tasks stretched our capabilities and required us to learn new skills beyond our initial training and experience in social work.
Despite these challenges, the effort was worthwhile. The more detailed instructions allowed the AI to respond more systematically and human-like, which is crucial for fostering a therapeutic environment where users feel comfortable sharing personal stories. The experience underscored the delicate balance between our existing expertise in social work and the new skills needed to configure AI technologies effectively. It also highlighted the importance of adaptability and continuous learning in integrating technology into practice.
However, deploying the chatbot on a usable platform was not particularly easy. We deployed our chatbot on WhatsApp using straightforward Python code designed to be user-friendly, which we sourced from developer forums. We connected the AI to WhatsApp for realistic, human-like conversations and efficient recording of these chats. WhatsApp essentially serves as a shell or user interface, while our settings on cloud-based service platforms control the actual automation. We ran a simple Python file on a cloud service platform. In this study, we used Replit (https://replit.com/), a dynamic, user-friendly software creation platform that caters to a broad spectrum of developers with various tools and features. Through Replit, we could access third-party AI models, such as OpenAI’s Assistant API. The Python file sent requests to and received responses from OpenAI’s Assistant API. The source code can be provided upon request. We connected the files on Replit to WhatsApp through a service platform called ManyChat (https://manychat.com/), which allows users to build and manage chatbots for messaging apps like WhatsApp (and many others).
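For readers curious about what this relay involves, the block below is a heavily simplified sketch of the kind of script we ran, assuming ManyChat forwards each WhatsApp message to a webhook as JSON; the field names, thread-handling strategy, and absence of error handling are illustrative simplifications rather than our production code, which is available on request.

```python
# Illustrative sketch of the relay: ManyChat forwards each WhatsApp message
# to this endpoint, which passes it to an OpenAI Assistant and returns the
# reply. Requires `pip install flask openai`; the JSON fields are assumptions.
import time
from flask import Flask, request, jsonify
from openai import OpenAI

app = Flask(__name__)
client = OpenAI()                     # reads OPENAI_API_KEY from the environment
ASSISTANT_ID = "asst_XXXX"            # placeholder for the configured assistant
threads = {}                          # maps each user ID to their conversation thread

@app.route("/whatsapp", methods=["POST"])
def whatsapp_webhook():
    payload = request.get_json()
    user_id = payload["user_id"]      # assumed field supplied by ManyChat
    text = payload["text"]            # assumed field containing the user's message

    # Reuse one thread per user so the assistant keeps conversational context.
    if user_id not in threads:
        threads[user_id] = client.beta.threads.create().id
    thread_id = threads[user_id]

    client.beta.threads.messages.create(thread_id=thread_id, role="user", content=text)
    run = client.beta.threads.runs.create(thread_id=thread_id, assistant_id=ASSISTANT_ID)

    # Poll until the run reaches a terminal state (a production script would
    # also handle timeouts and the "requires_action" state for tool calls).
    while run.status not in ("completed", "failed", "cancelled", "expired"):
        time.sleep(1)
        run = client.beta.threads.runs.retrieve(thread_id=thread_id, run_id=run.id)

    messages = client.beta.threads.messages.list(thread_id=thread_id)
    reply = messages.data[0].content[0].text.value   # newest (assistant) message first
    return jsonify({"reply": reply})                 # ManyChat maps this back to WhatsApp
```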
This “simple” coding experience was a challenge that pushed us beyond our comfort zone. Initially, we considered avoiding WhatsApp due to the additional complexities of coding and integrating the AI across multiple platforms. We were tempted to choose an easier, more straightforward deployment method requiring less technical work. However, after weighing the pros and cons, we recognized that using WhatsApp could significantly enhance the user experience by making it more natural and familiar. Instead of feeling like a controlled, formal session in a clinical setting, interacting with the AI on WhatsApp would mimic the casual, asynchronous chatting style people are accustomed to when communicating with friends or family.
Deciding to proceed with the WhatsApp deployment, we encountered unexpected technical challenges. We found ourselves diving into a largely unfamiliar programming world. We relied heavily on online resources, adopting Python codes shared by developers and tweaking them to suit our specific needs. This trial-and-error approach often left us unsure about how specific lines of code functioned or why particular changes were necessary. We spent hours troubleshooting and adjusting. Eventually, we realized we needed expert help and hired a programmer to debug the scripts and ensure everything ran smoothly.
This experience made it clear that while detailed programming knowledge may not be essential for social workers, having a basic understanding of coding structures and logic is incredibly beneficial. It allows for more effective communication and collaboration with technical professionals, making troubleshooting issues or necessary adjustments easier. Although these skills are not typically included in social work training, they represent an essential extension of our professional capabilities. As technology services continue to advance, it is expected that more accessible solutions will replace this coding process. Nonetheless, embracing basic coding knowledge can help social workers keep pace with the evolving landscape of digital tools and resources.
We rated the fidelity level as “good.” When Yunmeng was asked to describe how well the chatbot guided her in exploring a focused topic, she described the experience as very good and highly satisfactory. These ratings suggest a strong agreement that the chatbot followed the planned conversation strategy. The utterance sequence analysis can further illustrate this. The analysis indicated the progression scores across various time intervals in two separate conversations, offering quantitative insights into how well the AI chatbot adhered to the expected 7-step progression plan during narrative interviews. Table 4 shows a consistent increase in progression scores from Quarter 1 to Quarter 4 in both conversations, indicating that the conversations advanced methodically through the planned stages. In Conversation 1, the progression scores of the chatbot move from 1.71 to 6.17, and in Conversation 2, from 1.56 to 6.00, reflecting a structured and coherent unfolding of the narrative as intended by the conversation plan. Similar patterns were observed in the participant progressions.
Progression scores across time intervals in different conversations.
Time Interval | Assistant@C1 | Participant@C1 | Assistant@C2 | Participant@C2 |
---|---|---|---|---|
Quarter 1 | 1.71 | 1.50 | 1.56 | 1.22 |
Quarter 2 | 2.64 | 2.14 | 2.78 | 1.94 |
Quarter 3 | 4.21 | 3.57 | 5.00 | 4.89 |
Quarter 4 | 6.17 | 6.00 | 6.00 | 5.88 |
It is worth noting that participants’ progression consistently aligns with the chatbot’s across various intervals, although the chatbot’s scores are typically higher. For example, during Quarter 2 of Conversation 1, the chatbot scored 2.64, while the participant scored 2.14 on average. Similar patterns of discrepancy were observed in Conversation 2 and at different time quarters. This consistent variance indicates that the chatbot frequently leads by posing questions that advance the conversation to the next stage, resulting in higher progression scores.
Notably, the AI chatbot demonstrated flexibility within this structure, as evidenced by slightly varying scores at similar stages across different conversations. This variability implies that while the chatbot followed the predefined conversational path, it also adapted responses based on the participant’s input, maintaining a balance between following the planned sequence and responding dynamically to the flow of the conversation. This shows the chatbot’s ability to effectively manage structured narrative interviews, aligning closely with the progression expectations while integrating the necessary responsiveness to participant interactions.
The progression scores facilitate data visualization that presents such similarities and differences. In simple line charts, we depicted the progression of the chatbot’s conversation utterances and Yunmeng’s utterances in the two conversations. This simple visualization allows the general patterns of these conversations to be easily observed and compared. For example, the progressions of the chatbot and Yunmeng in the two conversations are in synchrony, demonstrating an upward development (see Figure 2 and Figure 3). However, the general progression pattern of Conversation 1 (see Figure 2) differs from that of Conversation 2 (see Figure 3). The lines in Conversation 1 (Figure 2) rise evenly and gradually. In contrast, the lines in Conversation 2 (Figure 3) remain relatively flat in Quarter 1 and become steeper in Quarters 2 and 3 due to a sudden topic change and subsequent return to the original conversation track. These quantitative measures are not meant to be absolute assessments but serve as visualization tools to facilitate our exploration process.
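A minimal sketch of how such line charts can be produced from the tagged scores (the score lists below are hypothetical; in the study, the tagged spreadsheet data were used):

```python
# Illustrative sketch: plot the progression of chatbot and participant
# utterances across one conversation. Requires `pip install matplotlib`.
# The score lists are hypothetical, not the study's actual tags.
import matplotlib.pyplot as plt

assistant_scores = [1, 1, 2, 2, 2, 3, 4, 5, 6, 7]
participant_scores = [1, 1, 1, 2, 2, 3, 4, 5, 6, 7]

plt.plot(range(1, len(assistant_scores) + 1), assistant_scores,
         marker="o", label="Chatbot")
plt.plot(range(1, len(participant_scores) + 1), participant_scores,
         marker="s", label="Participant")
plt.xlabel("Utterance sequence")
plt.ylabel("Progression score (1 = Orientation ... 7 = Closing)")
plt.title("Conversation progression (illustrative data)")
plt.legend()
plt.show()
```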
We rated the empathy demonstrated by the chatbot as “good.” When Yunmeng was asked how well the chatbot resonated with her feelings and respected her perspectives, she described the experience as very satisfactory. Throughout these conversations, the chatbot consistently demonstrated concern and understanding, actively inviting Yunmeng to share her experiences and showing appreciation for her stories. For example, in Conversation 1, the chatbot uses affirming language, such as, “That’s a wonderful story! It’s heartwarming to hear about your interaction with people from different backgrounds during the celebration.” This kind of response not only acknowledges Yunmeng’s emotions but also encourages her to continue sharing more experiences.
The chatbot’s use of metaphor examples in Conversation 2 further highlights its empathetic engagement by helping Yunmeng articulate her experiences in a relatable way. For instance, the chatbot responded to Yunmeng when she expressed that she had difficulty finding the right words to describe her multicultural experiences:
Certainly! Considering your experiences, here are a few metaphors to consider: 1. A cultural mosaic where each piece represents a different culture or festival that comes together to create a vibrant picture of life here. 2. A bridge symbolizing the connections you’ve built between people of different backgrounds and your family, blending the new with the familiar. 3. A tapestry weaving together threads of different traditions, celebrations, and friendships you’ve experienced here. Remember, these are just suggestions to spark your imagination. Feel free to choose one that resonates with you or propose your own metaphor. What do you think?
This response demonstrates perceived empathy by actively engaging in Yunmeng’s narrative and offering thoughtful, personalized suggestions that reflect a deeper understanding of her experiences. Instead of simply echoing Yunmeng’s words or emotions, the chatbot provides metaphors that are specifically tailored to her experiences, such as “a cultural mosaic,” “a bridge,” and “a tapestry.” These metaphors show that the chatbot has been listening attentively and understands the themes of diversity, connection, and blending of cultures that Yunmeng has described.
The chatbot goes further by encouraging Yunmeng to think creatively about her experiences. It introduces metaphors that serve as a framework for Yunmeng to reflect on and express her experiences in a new way, fostering a space where she feels encouraged to explore and articulate her feelings and thoughts more deeply. This approach demonstrates empathy by creating an environment that supports meaningful self-expression, a key aspect of empathetic communication.
Additionally, the metaphors offered by the chatbot are not generic but closely relate to the specific context of Yunmeng’s experiences. For example, the “cultural mosaic” metaphor directly connects to her diverse cultural interactions, while the “bridge” metaphor highlights her connections with people from different backgrounds. This personalized response shows that the chatbot is aware of the emotional and situational nuances of Yunmeng’s experiences, indicating a deeper understanding.
The chatbot facilitates reflection and meaning-making by presenting multiple metaphor options and inviting Yunmeng to choose or create her own. This method respects Yunmeng’s autonomy and encourages her to critically engage with her experiences, fostering a more collaborative and dynamic dialogue. It shows empathy by recognizing Yunmeng’s capacity for self-reflection and supporting her in making sense of her experiences.
Finally, the chatbot’s invitation for Yunmeng to choose a metaphor or propose her own, and asking, “What do you think?” demonstrates that it values her input and is open to her perspective. This openness and flexibility indicate that the chatbot is co-creating the narrative with Yunmeng, rather than imposing its own, which is a hallmark of empathetic communication.
In fact, throughout the conversations, the chatbot balanced guiding the dialogue and respecting Yunmeng’s autonomy, an ability that, in our teaching experience, even social work students can find hard to master. In Conversation 1, Yunmeng sometimes took control of the conversation direction, as when she asserted, “I’d like to share more experiences,” while she followed the chatbot’s lead at other times. This balance illustrates that the chatbot neither dominated the conversation nor remained passive, allowing for a dynamic exchange where both the chatbot and Yunmeng contributed to the flow and development of the narrative. The chatbot’s empathetic engagement is also evident in keeping the conversations coherent. In Conversation 1, Yunmeng’s narrative was coherent, with clear sequences and contextual understanding, discussing themes like cultural exchange and personal growth. Conversation 2 had occasional shifts, such as an abrupt turn to a news piece about North Korea, which disrupted the flow. Despite these shifts, the chatbot’s responses remained empathetic, supporting Yunmeng’s storytelling and encouraging reflection and meaning-making.
The following excerpt from Conversation 1 demonstrates the dynamic interplay between the chatbot and Yunmeng. These observations highlight how Yunmeng experienced perceived empathy, which contributed to the collaborative construction of the dialogue. This excerpt was specifically selected by Yunmeng during her post-interview reflection:
Yunmeng: I’d like to take a break now.
Chatbot: Absolutely, ZY [ZY was the alias Yunmeng used in the test]. Take all the time you need. Feel free to reach out when you’re ready to continue our conversation. Have a good break!
Yunmeng: Hi, I am back!
Chatbot: Welcome back, ZY! I hope you had a refreshing break. If you’re ready, shall we move on to summarizing the stories you’ve shared? Or would you like to share more about your experiences first?
Yunmeng: Let’s move on.
Chatbot: Great! Now, let’s try to summarize your experiences with a metaphor. If you were to represent all the episodes you’ve shared in a theme or metaphor, what would come to mind?
Yunmeng: Could you give me some suggestions?
Chatbot: Certainly! Considering your experiences, here are a few metaphors to consider: 1. A cultural mosaic, where each piece represents a different culture or festival that comes together to create a vibrant picture of life in Hong Kong. 2. A bridge, symbolizing the connections you’ve built between people of different backgrounds and your family, blending the new with the familiar. 3. A tapestry, weaving together threads of different traditions, celebrations, and friendships you’ve experienced in Hong Kong. Remember, these are just suggestions to spark your imagination. Feel free to choose one that resonates with you or propose your own metaphor. What do you think?
Yunmeng: What about “Living in a lively global village?”
Chatbot: “Living in a lively global village” is a fitting metaphor for your experiences, ZY. It evokes a sense of community, diversity, and the dynamic cultural interactions you’ve described. Could you explain this metaphor using the stories you have mentioned or any new stories that come to mind right now?
Yunmeng: Considering all my experiences in Hong Kong, I think HK is like a scaled-down version of the globe, embracing cultures from all over the world. Whether you speak Cantonese, Chinese, or English, and no matter where you are from, you can find something that you are familiar with and interested in Hong Kong. You can also make many friends from all over the world, share stories about each other’s hometowns and experience different cultures. This is a colorful and vibrant city.
Chatbot: Your metaphor perfectly captures the essence of a cosmopolitan city like Hong Kong, where its diversity becomes its identity. It’s lovely to see how you’ve found familiarity, curiosity, and camaraderie in this “lively global village.” Would you like to share more stories related to this metaphor, or do you feel ready to move on to the next part of our interview?
Yunmeng: I also remembered a story that happened during the last Christmas holiday. Our dormitory held an activity to make Christmas decorations during Christmas. Six students who live in dormitories participated. That day happened to be the Winter Solstice, and one of the participants was a friend from Kazakhstan. She didn’t know what the Winter Solstice was. So, during the crafting process, we told her that the Winter Solstice is one of the 24 traditional Chinese solar terms. It is the shortest day of the year. In the north, we usually eat dumplings on that day. That day, we shared our interesting experiences in Hong Kong and our Christmas holiday plans for the next few days.
Reflecting on the dialogue with the AI chatbot, Yunmeng felt that the interaction was a collaborative process where the chatbot and she co-constructed content through their exchanges. One moment that stood out to her was when she requested a break. The chatbot immediately responded with, “Absolutely, ZY. Take all the time you need. Feel free to reach out when you’re ready to continue our conversation. Have a good break!” This response made her feel that her needs and comfort were being prioritized. The chatbot’s flexibility in pausing and resuming the conversation at her convenience showed genuine empathy and attentiveness, creating a supportive environment where she felt valued and understood.
When she was ready to continue, the chatbot smoothly guided the conversation back on track, asking, “If you’re ready, shall we move on to summarizing the stories you’ve shared? Or would you like to share more about your experiences first?” This approach resonated with Yunmeng because it allowed her to control the direction of the conversation, choosing how she wanted to proceed. She appreciated this flexibility, as it encouraged her to think reflectively and delve deeper into her experiences. By suggesting that they summarize the stories using metaphors, the chatbot offered a structured yet adaptable way to move forward, which helped her organize her thoughts and explore their meanings more comprehensively.
Yunmeng found the chatbot’s introduction to metaphorical thinking particularly helpful. When she asked for suggestions, the chatbot offered metaphors like “a cultural mosaic,” “a bridge,” and “a tapestry,” each reflecting different aspects of her experiences in Hong Kong. This showed that the chatbot understood her diverse cultural interactions and supported her in finding creative ways to express her thoughts. Engaging with these metaphors helped her articulate her experiences in a more meaningful and nuanced way, which she found enriching.
As the conversation progressed, Yunmeng chose the metaphor “Living in a lively global village” to describe her experience in Hong Kong as a “scaled-down version of the globe” that embraces cultures worldwide. The chatbot’s response, “Your metaphor perfectly captures the essence of a cosmopolitan city like Hong Kong, where its diversity becomes its identity,” was affirming and prompted her to elaborate further. This back-and-forth dialogue felt like a true collaboration, where her inputs were respected, and the chatbot’s prompts helped deepen her exploration of her experiences.
Throughout their conversation, Yunmeng felt that the chatbot was responsive to her needs and could take her perspective by offering metaphors that aligned with her experiences. This respect for her autonomy, combined with the empathy and understanding demonstrated by the chatbot, fostered a rich, collaborative dialogue. The chatbot’s ability to adapt to her inputs and encourage her to reflect deeply on her experiences contributed to a dynamic and meaningful exchange, ultimately fostering a deeper connection and a richer narrative.
Overall, this interaction exemplified a dynamic interplay between the AI chatbot and Yunmeng, demonstrating how meanings were co-constructed through thoughtful and empathetic engagement. Even though these communications were perceived as empathetic by Yunmeng, it is essential to note that there is no evidence that the AI was conscious or truly capable of empathy. Nevertheless, the effect of these interactions was that the human participant further developed and negotiated meanings within the conversation. The chatbot’s respect for her needs and its suggestions for summarizing her experiences using metaphors allowed her to reflect more deeply and articulate her thoughts meaningfully. This experience highlighted how perceived empathy, perspective-taking, and respect can foster rich content development and shared understanding in a narrative interview.
Before discussing the potential of assessing the performance of the AI chatbot in narrative interviews, it is crucial to acknowledge the limitations of this study. First, the primary goal of this case study was to explore ideas and stimulate discussion. As a result, it does not directly validate the assessment tools in an experimental setup. Second, the AI chatbot is limited in handling non-textual inputs, such as images, voice messages, or video content. This restriction can hinder the richness of interactions, which may be crucial in narrative interviews, where non-verbal cues and multimedia inputs can play an important role. Third, the conversation analysis method used in this study struggles with analyzing long speeches. A single lengthy speech composed of many complex parts might not be effectively analyzed as a single unit. However, such extended dialogues are uncommon on platforms like WhatsApp, where the typical brevity of messages somewhat mitigates these issues. Fourth, this study focused exclusively on OpenAI’s product; therefore, the findings may not be generalizable to AI tools provided by other companies. Although it is believed that many of the principles and challenges identified here could apply broadly to other AI systems, variations in technology and implementation could lead to different outcomes. Fifth, introducing AI tools into practice brings a range of considerations—such as navigating the complexities of risk management, data privacy, and the uneven development of language models across diverse sociocultural landscapes (Campolo & Crawford, 2020; Navigli et al., 2023). While these risks and ethical concerns are significant and have been thoroughly examined by others, we have chosen not to repeat these well-established discussions. Instead, we are forging a new path, exploring how social workers can expand their skill sets and actively rethink their expertise to engage with AI technologies. Despite these limitations, exploring AI’s competencies in narrative interviews remains worthwhile.
This study explored the use of chatbots for narrative interviews in social work by addressing three main questions: 1) whether OpenAI Assistant can be instructed by social workers with minimal coding experience to function as a narrative interviewer, 2) whether simple, observation-based rubrics can be used to assess the AI’s performance in terms of fidelity and empathy, and 3) how the findings contribute to the broader conversation on democratizing AI for specialized fields like social work. The results show that social work researchers without technical backgrounds can effectively customize OpenAI Assistant to meet specific practice needs, demonstrating the potential for broader adoption of AI chatbots in the field. Moreover, the rubric assessments aligned with the interviewee’s subjective experiences, suggesting that it may be feasible to adapt rubrics from social work training to capture the nuanced aspects of narrative interviews conducted by AI, such as fidelity and empathy. These findings suggest challenges and opportunities for integrating AI into social work practice, which are discussed in the following sections.
1. At the outset, translating human skills into generative AI competencies is not entirely straightforward. One of the main challenges for social workers is converting their human-centric skills into competencies that AI chatbots can follow and replicate. Skills such as questioning or empathy are inherently complex and context-dependent. Developing language that accurately captures these nuanced competencies requires a deep understanding of both social work practice and AI technology, which can be daunting for many practitioners.
In our study, we developed some insights into effective prompt engineering for the AI chatbot, including setting clear contexts by defining the chatbot’s identity, mission, and attitudes; breaking down actions into distinct statements using behavioral terms; crafting step-by-step instructions with conditional (if-then-else) logic; and providing examples to guide AI responses (see Chan, 2024 for the instructions we finally set for the chatbot). This approach is essential for ensuring the AI can mimic the nuanced behavior of a skilled interviewer.
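To make this concrete, the sketch below shows how an instruction set of this kind might be passed to an OpenAI Assistant through the openai Python SDK. The instruction text is an abridged, hypothetical illustration of the principles just listed (identity, behavioral steps, if-then-else logic), not the exact prompt used in this study (archived in Chan, 2024), and the model name is an assumption.

```python
# Sketch: configuring an OpenAI Assistant as a narrative interviewer.
# The instruction text is an abridged, hypothetical example of the
# prompt-engineering principles described above, not the study's full prompt.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

interviewer_instructions = """
Identity: You are a warm, patient narrative interviewer.
Mission: Invite the participant to tell stories about living in a new city.
Attitude: Show appreciation; never judge; follow the participant's pace.

Steps:
1. Greet the participant and ask for a story about a recent experience.
2. If the participant shares a story, ask one follow-up question about
   feelings or meaning; otherwise, gently offer an example to get started.
3. If the participant asks for a break, agree immediately and wait.
4. When the participant is ready to finish, invite them to summarize their
   stories with a theme or metaphor, offering suggestions on request.
"""

assistant = client.beta.assistants.create(
    name="Narrative Interviewer",
    model="gpt-4-turbo",  # model choice is an assumption for this sketch
    instructions=interviewer_instructions,
)
print(assistant.id)
```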
That said, there is also a need to expand language-based communication skills. Our experience designing instructions for the AI chatbot underscored the necessity of enhancing these skills within social work practice. Unlike human assistants, AI requires precise, structured prompts and behavior-based instructions to function effectively in a narrative interview setting. Initially, our approach used a basic framework, but it quickly became apparent that more detailed, step-by-step instructions were needed to capture the subtleties of narrative interviews. This evolution highlights a critical shift from viewing AI as a passive tool to recognizing the active role practitioners must play in configuring AI to shape conversation processes. Developing a comprehensive instruction set for AI differed from traditional social work practices, requiring a deeper understanding of how language influences AI behavior. This journey demonstrated that while tools like the OpenAI Assistant API offer a user-friendly interface, maximizing their potential still requires some foundation in crafting effective prompts.
2. Another critical aspect is considering the extent to which basic coding skills and technical knowledge are necessary for social workers. While platforms like OpenAI and others claim to be user-friendly, there are still significant technical challenges that social workers may encounter. For example, deploying the AI chatbot on WhatsApp revealed the importance of incorporating basic coding skills and technical knowledge into actual practice. Initially, we were cautious about the technical complexities involved in using a familiar messaging platform like WhatsApp. Ultimately, however, this approach facilitated a more natural and user-friendly interaction, resembling a casual conversation rather than a lab experiment. Although we believe that as technology advances, platforms will become more genuinely user-friendly, potentially achieving a true “no coding” experience, there will still be a need for some technical understanding in the foreseeable future. Social workers may not need to be coding experts, but a basic understanding of technical concepts can significantly enhance their ability to use these tools effectively and facilitate communication with programmers and developers. This foundational knowledge is crucial as we progress towards a time when AI can interact with humans as seamlessly as we communicate with each other. Until we reach that point, the ability to bridge the gap between social work practice and technical expertise will remain necessary for integrating AI into social work in a meaningful and impactful way.
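As one illustration of the kind of glue code involved, the sketch below relays WhatsApp messages to a chatbot backend through a small Flask webhook using Twilio’s WhatsApp integration. This is a generic pattern rather than a description of our actual deployment, and the helper ask_assistant is a hypothetical placeholder for whatever function forwards the message to the configured assistant.

```python
# Sketch: a minimal Flask webhook relaying WhatsApp messages (via Twilio)
# to a chatbot backend. A generic illustration, not the study's deployment;
# ask_assistant() is a hypothetical placeholder.
from flask import Flask, request
from twilio.twiml.messaging_response import MessagingResponse

app = Flask(__name__)

def ask_assistant(user_id: str, text: str) -> str:
    """Hypothetical helper: forward the message to the configured
    AI assistant and return its reply."""
    return "Thank you for sharing. Could you tell me more about that?"

@app.route("/whatsapp", methods=["POST"])
def whatsapp_webhook():
    incoming = request.form.get("Body", "")   # message text sent by the user
    sender = request.form.get("From", "")     # user's WhatsApp identifier
    reply = ask_assistant(sender, incoming)

    twiml = MessagingResponse()
    twiml.message(reply)
    return str(twiml)

if __name__ == "__main__":
    app.run(port=5000)
```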
Throughout this process, it became apparent that while detailed programming knowledge is not required, a basic ability to collaborate with technical experts is crucial. As AI technology evolves, social workers may need to remain adaptable and open to acquiring new skills. Including basic coding knowledge in social work education can enhance practitioners’ ability to utilize AI tools effectively, bridging the gap between professional knowledge and technical implementation. This adaptability ensures that social workers can integrate AI into their practice without losing the human touch that is vital for social work.
1. Prompt engineering: To begin with, social workers may have an advantage when it comes to contributing to prompt engineering. Prompt engineering is the process of designing and refining the inputs given to AI systems to elicit specific, desired responses (Wang et al., 2023). Social workers understand the essential skills and qualities needed for effective narrative interviewing, such as questioning skills, empathy, and adaptability. By leveraging this domain knowledge, they can play a crucial role in developing and evaluating AI chatbots, ensuring these tools are tailored to the specific needs of social work practice.
For example, we developed appropriate interview guidelines in this study and instructed the AI to adopt the right attitudes, demonstrating that natural language instructions can effectively guide AI behavior. This approach ensures that the AI responds in a way that is consistent with interview principles, fostering a supportive environment for interviewees. The ability of social workers to craft such prompts means they can help design guidelines for AI that align with best practices in narrative interviews, ultimately enhancing the AI’s ability to facilitate meaningful conversations and engage with clients more effectively. By integrating their expertise into the prompt engineering process, social workers can ensure that AI tools are technically proficient and aligned with their profession's core values.
2. Evaluation tools: Next, social workers may be able to help develop evaluation tools for practical chatbots that are easy to use in everyday social work settings. Traditional evaluation methods often focus on technical performance and rely on complex algorithms or data analytics (Deriu et al., 2021; Finch & Choi, 2020; Lee et al., 2023), which can be overly complicated and not well-suited for use by social workers who may not have advanced technical expertise. For example, traditional metrics might assess an AI’s accuracy in parsing language or its processing speed, but these measures do not capture the qualitative aspects crucial in social work, such as empathy and rapport-building.
Social workers, however, can develop straightforward assessment tools that are easily accessible to lay users, allowing for a more direct and immediate evaluation of AI performance. For instance, simple rubrics used to assess social work trainees could be adapted to include criteria that reflect the qualitative aspects of AI-led narrative interviews, such as the AI’s ability to build rapport, respond empathetically, and maintain a coherent conversation. These customized rubrics would ensure that the evaluation of AI tools aligns with the unique demands of real-world practice, making the integration of AI into social work more effective and relevant. By focusing on these qualitative measures, social workers can ensure that AI technologies are used in ways that genuinely enhance their practice and support their clients.
3. Interview practice through AI collaboration: Customizable generative AI enables social work practitioners to develop interview practices tailored to specific contexts. Instead of simply using preset AI products, social workers can actively participate in designing and utilizing these AI tools, applying their expertise to create solutions more aligned with their clients' unique needs. This approach also opens up the possibility of blended interview practices, such as using AI to facilitate broad intake and engagement and then transitioning to human-led conversations for more nuanced discussions. Our experience with AI-assisted narrative interviews underscores the potential for creating adaptable AI chatbots that can be customized and refined rather than relying on standardized chatbot products that may not fully meet the needs of social work practice.
In our study, we employed a specific framework to guide our research. Our findings suggest that there is transferrable knowledge and skills that practitioners and researchers can use to create or adapt structured steps tailored to various objectives, such as conducting assessments, handling service inquiries, or performing different evaluations. To fully leverage these opportunities, social workers may need to expand their understanding and skills in working with AI, actively participating in designing and implementing intervention processes. By taking a proactive role, social workers can use AI as a tool to enhance their practice rather than simply adopting predefined AI applications. Developing AI chatbots that can accommodate diverse conversation plans ensures that technology complements and supports, rather than replaces, human empathy and expertise.
Our exploration into the use of AI chatbots for narrative interviews is like navigating untested waters at the dynamic intersection of social work and AI technology. While we found potential in using AI to support narrative conversations, this early version of the AI chatbot also has limitations, such as its inability to handle images, voice messages, and consecutive user inputs. However, our voyage is not about reaching a specific port or proving the tool’s effectiveness; it’s about sailing into the unknown and testing new possibilities.
Rather than concentrating on validating the AI’s current capabilities, our exploration aimed to probe the possibilities of customizable AI and challenge the boundaries of social work knowledge. Introducing AI tools into practice presents a variety of considerations that this paper cannot fully address. Instead, we are steering into uncharted waters, advocating for social workers to broaden their skill sets and reimagine their expertise to actively engage with AI technologies.
The progress in democratizing AI comes with a reciprocal dynamic: users are encouraged to learn new skills, while they, in turn, shape the development of AI by posing new demands and expectations. This evolving relationship enhances our understanding of AI affordance and its role in democratizing technology. As noted in the introductory section, affordance refers to the potential actions users can take with a tool, shaped by its design and how users perceive its utility (Gibson, 1977; Norman, 1999). In this context, the study shows that non-technical professionals can effectively use advanced AI technologies to meet specific needs, demonstrating that the value of AI is not just in its technical features but also in how users creatively adapt it to their circumstances. This broadens the discussion on affordance, highlighting that the usefulness of AI technologies is co-created through user interaction and adaptation to specific contexts.
The study emphasizes the importance of enhancing user-friendliness, particularly in deploying AI chatbots across platforms and managing complex, step-by-step tasks (as noted in Reflective Observation 1). Simplifying these processes would help make AI more accessible and adaptable, promoting wider use across different fields. This sets a specific demand on the chatbot configuration process. As such, the study illustrates the possibility of co-evolution between AI development and professional practice. As professionals provide feedback and shape the technology, AI tools evolve to create new opportunities for innovation. This reciprocal relationship suggests that both technology and professional practices may grow together, continuously influencing each other’s advancement.
In essence, this study serves as the authors’ nautical adventure and invites all social workers and researchers to explore beyond the familiar boundaries of their expertise while remaining grounded in core principles. By embracing this collective endeavor, we may actively influence how AI is integrated into social work, thereby enriching our practice and enhancing client services in an increasingly digital world. As social work and AI evolve, we encourage ongoing research, discussion, and debate to explore new possibilities and shape the profession's future collaboratively.
DANS: Transcript of a Conversation Between a Customized AI and Human Users, https://doi.org/10.17026/SS/KCPEDX (Chan, 2024).
Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).