Practices and Challenges of Using Think-Aloud Protocols in Industry: An International Survey

Peer-reviewed Article

pp. 85-102

 

Abstract

Think-aloud protocols are one of the classic methods often taught in universities for training UX designers and researchers. Although previous research reported how these protocols were used in industry, the findings were typically based on the practices of a small number of professionals in specific geographic regions or on studies conducted years ago. As UX practices continuously evolve to address new challenges emerging in industry, it is important to understand the challenges faced by current UX practitioners around the world when using think-aloud protocols. Such an understanding is beneficial for UX professionals to reflect on and learn from the UX community’s practices. It is also invaluable for academic researchers and educators to understand the challenges faced by professionals when carrying out the protocols in a wide range of practical contexts and to better explore methods to address these challenges. We conducted an international survey study with UX professionals working in companies of various sizes around the world. We found that think-aloud protocols are widely and almost equally used in controlled lab studies and remote usability testing, and that concurrent protocols are more popular than retrospective protocols. Most UX practitioners probe participants during test sessions, explicitly request them to verbalize particular types of content, and do not administer practice sessions. The findings also offer insights into practices and challenges in analyzing think-aloud sessions. In sum, UX practitioners often have to balance validity and efficiency in their analysis and need analysis methods that are faster and more reliable than merely reviewing observation notes or session recordings.

Keywords

Think-aloud protocols, usability test, user experience, industry practices and challenges, international survey

Introduction

Think-aloud protocols, in which participants verbalize their thoughts when performing tasks, are used in usability testing to elicit insights into participants’ thought processes that are hard to obtain from mere observation. Think-aloud protocols are often taught in UX courses to train professionals (Dumas & Redish, 1999; Nielsen, 1993; Preece, Rogers, & Sharp, 2015; Rubin & Chisnell, 2008) and are considered the “gold standard” for usability evaluation (Hornbæk, 2010). Boren and Ramey (2000) were probably the first to note the discrepancies between the theory introduced by Ericsson and Simon (1984) and the practice of using think-aloud protocols in the UX field. These discrepancies, however, were identified through field observations and a review of usability guidebooks and literature, so empirical reports on how the protocols are used in industry were still lacking.

Previous research has examined the practices of using think-aloud protocols in local geographic regions. For example, Nørgaard and Hornbæk (2006) studied a small number of UX practitioners’ practices in Danish enterprises and offered insights on how they conducted and analyzed think-aloud sessions. Similarly, Shi (2008) reported the practices of, and particular challenges faced by, Chinese evaluators using think-aloud protocols. In contrast, McDonald, Edwards, and Zhao (2012) conducted an international survey study, distributed to UX professional and academic listservs, to understand how think-aloud protocols were used on a broader scale. However, as that survey was conducted in 2011 and new UX testing software and tools have emerged since then, the extent to which think-aloud protocols are currently being used in industry is unclear. Moreover, recent research has also urged the community to learn more about current UX practices in industry (MacDonald & Atwood, 2013).

To better understand how think-aloud protocols are currently used in industry, we designed and conducted a survey study with UX practitioners who had different levels of experience and worked in different industries around the world. In this paper, we present and discuss the key findings and implications of the survey study to inform UX practitioners and researchers about the practices and challenges surrounding the use of think-aloud protocols in industry.

Methods

The goal of this study was to understand how think-aloud protocols are being used by UX professionals in different fields around the world. We chose survey over other methods (e.g., interview, focus groups) because it allowed us to gather data from a broad range of UX practitioners located in different geographic regions who work in different industrial fields.

Respondents

We contacted the organizers of local chapters of the User Experience Professionals Association (UXPA), the largest organization of UX professionals around the world, to promote the survey study. We received support from the organizers of UXPA local chapters in Asia, Europe, and North America, who helped us distribute the survey link to their listservs. We also promoted the survey link in UX-related LinkedIn groups and on other social media platforms. Thus, the members of these UXPA local chapters and LinkedIn groups formed our potential sample. We conducted the survey study over about three months, from July to September 2018. The inclusion criterion was that respondents must work in industry as UX practitioners.

Survey Design

The survey was administered as an online questionnaire using Google Forms. It contained a list of required multiple-choice questions and optional short-answer questions designed to capture whether and how UX professionals are currently using think-aloud protocols, in addition to their basic profile information (i.e., the organization and/or the usability testing team that they work in and their current positions). No personally identifiable information was collected.

We were inspired by the previous survey study conducted in 2010 by McDonald et al. (2012) but at the same time made important changes. The previous survey was distributed to UX practitioners working in both academia and industry, which made it hard to isolate the use and practical impact of think-aloud protocols in industry. Instead, our survey study focused on the practices around the use of think-aloud protocols in industry and thus was only distributed to UX practitioners in industry. We also collected the respondents’ years of experience as UX professionals, which allowed us to examine the effect of years of experience on their usage patterns. Furthermore, as new tools and procedures for conducting usability test sessions have entered the market since 2010, such as Agile-UX design practices (Jurca, Hellmann, & Maurer, 2014), we wanted to understand how the use of think-aloud protocols has evolved in light of these new practices.

Data Analysis

Answers to multiple-choice questions are quantitative data and were analyzed to identify statistical trends in the use of think-aloud protocols. Answers to short-answer questions are qualitative data. Two researchers first independently analyzed the qualitative data using open coding and then discussed and resolved any conflicts. They then used affinity diagramming to identify common themes that emerged from the data.
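For readers who want a concrete picture of the quantitative tabulation step, the following is a minimal sketch in Python. The CSV file and column names are hypothetical and for illustration only; this is not our actual analysis script.

```python
# Minimal sketch: tabulating multiple-choice survey responses.
# Assumes a hypothetical CSV export in which multi-select answers are
# semicolon-separated strings; file and column names are illustrative only.
from collections import Counter

import pandas as pd

responses = pd.read_csv("survey_responses.csv")  # hypothetical export

# Single-choice question: frequency and percentage per option.
freq = responses["years_of_experience"].value_counts()
print((freq / len(responses) * 100).round(1))

# Multi-select question: split the delimited options before counting.
counts = Counter(
    option.strip()
    for cell in responses["detection_methods"].dropna()
    for option in cell.split(";")
)
for option, n in counts.most_common():
    print(f"{option}: {n} ({n / len(responses):.0%})")
```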

Results

We received valid responses from 197 UX practitioners working in industry around the world. Next, we report aggregated information about the respondents’ profiles and their practices of conducting and analyzing think-aloud usability tests.

Respondents’ Profile

Work role: We asked respondents about their current job titles and allowed them to report more than one title if applicable. The majority of the respondents reported their current job title as UX researcher (54%) or UX designer (36%). Others identified their job title as UX team lead (11%), UX manager (8%), or design strategist (6%).

Location: In terms of the geographic locations, the majority of the respondents worked in North America 63.5% (n = 125), followed by Asia 19.3% (n = 38) and Europe 14.7% (n = 29). Other respondents worked in Australia 1.5% (n = 3), Africa 0.5% (n = 1), and South America 0.5% (n = 1).

Companies or organizations: The respondents worked in various sized companies or organizations (see results in Table 1). Furthermore, 81 respondents also reported the actual companies that they worked in. These companies covered a wide range of industrial fields, including ads and marketing, banking, gaming, health care, IT and software, professional services, supply chain, telecommunication, and UX consulting.

Table 1. Number of Employees in the Companies/Organizations that Respondents Worked In

Self-employed: 6.1% (n = 12)

< 100: 15.2% (n = 30)

100–999: 21.3% (n = 42)

1,000–9,999: 21.3% (n = 42)

≥ 10,000: 36.1% (n = 71)

 

UX team size: We asked respondents about the size of the UX team that they worked in and found that they worked in different sized UX teams: 1 (n = 21), 2–5 (n = 55), 6–10 (n = 42), 11–15 (n = 22), 16–20 (n = 16), 21–30 (n = 16), 31–50 (n = 5), and >50 (n = 20).

Experience: We asked respondents about the number of years that they had worked in HCI/UX/usability testing fields (see results in Table 2). The distribution of the years of experience in industry covered all ranges among the respondents.

Table 2. Number of Years that Respondents Had Spent in HCI/UX/Usability Testing Fields

< 1 year: 12.7% (n = 25)

1–2 years: 20.3% (n = 40)

3–5 years: 22.8% (n = 45)

6–9 years: 16.8% (n = 33)

≥ 10 years: 27.4% (n = 54)

Methods for detecting usability problems: We asked respondents about their three most frequently used methods for detecting usability problems (see results in Figure 1). The most frequently used methods for detecting usability problems among the respondents were as follows: usability testing (n = 168, 86%), interview (n = 118, 60%), heuristic evaluation (n = 81, 41%), field studies/observation (n = 66, 34%), A/B testing (n = 53, 27%), cognitive walkthrough (n = 46, 23%), card sorting (n = 26, 13%), and focus groups (n = 25, 13%).


Figure 1. The frequently used methods for detecting usability problems among our respondents.

General Use of Think-Aloud Protocols

Where respondents learned think-aloud protocols: Among the 197 respondents, 91% (n = 179) reported that they had learned think-aloud protocols, and the remaining 9% (n = 18) reported that they were unfamiliar with think-aloud protocols. For the 179 respondents who had learned think-aloud protocols, 49% of them (n = 87) reported that they had learned the protocols in university/college, 36% (n = 65) at work, and 15% (n = 27) from UX online/offline bootcamps.

General use and non-use of think-aloud protocols: When conducting usability tests, 86% of all respondents (n = 169) reported that they used think-aloud protocols. In other words, 95% of the respondents who had learned think-aloud protocols (169 out of 179) used them. We carried out the following analysis based on the responses of these 169 respondents who used think-aloud protocols because the remaining survey questions were about how UX practitioners used think-aloud protocols.

We also asked those respondents who had learned think-aloud protocols but did not use them (n = 10) about their reasons for not using the protocols as an optional short-answer question and received seven responses. The reasons were as follows: conducting think-aloud sessions is not part of their role (n = 2), their study subjects may not verbalize their thoughts easily (e.g., children) or unbiasedly (e.g., internal users; n = 2), conducting think-aloud sessions takes too much time (n = 1), think-aloud protocols may distract their users (n = 1), and there are alternative methods (n = 1).

The frequency of using concurrent and retrospective think-aloud protocols: Concurrent think-aloud protocols, in which users verbalize their thoughts while working on tasks, and retrospective think-aloud protocols, in which users verbalize their thoughts only after they have completed the tasks (usually while watching their session recordings), are the two types of protocols. We asked respondents about their frequency of using concurrent and retrospective think-aloud protocols (see results in Figure 2). Specifically, 61% of them (n = 103) used the concurrent think-aloud protocols in almost every usability test, and 91% of them (n = 154) used the concurrent think-aloud protocols in at least half of their usability tests. In contrast, only 21% of them (n = 36) used the retrospective think-aloud protocols in almost every usability test, and the majority of them (61%, n = 104) almost never or only occasionally (i.e., in roughly a quarter of their tests) used the retrospective think-aloud protocols.


Figure 2. The frequency of using concurrent think-aloud protocols and retrospective think-aloud protocols among the respondents.

Motivation: We asked respondents about their motivation for using think-aloud protocols and found that 51% of the respondents (n = 86) used the think-aloud protocols both to inform the design (e.g., problem discovery) and to measure performance (e.g., success rate); 48% of them (n = 81) used the protocols only to inform the design, and only 1% of them (n = 2) used the protocols only to measure performance.

Testing environments: We asked respondents about the test environments in which they used think-aloud protocols (see results in Figure 3). Specifically, 75% of the respondents (n = 127) used the protocols in controlled lab studies, 72% of them (n = 121) used the protocols in remote usability testing, and 48% of them (n = 81) used the protocols in field studies. The percentages sum to more than 100% because respondents could use think-aloud protocols in more than one test environment.


Figure 3. The testing environments in which UX practitioners use think-aloud protocols.

Conducting Think-Aloud Sessions

Types of tasks for think-aloud sessions: We asked respondents about the types of tasks that they ask their participants to work on during think-aloud sessions (see results in Figure 4). Specifically, 27% of them (n = 46) only ask their participants to work on tasks without instruction steps to follow (e.g., navigating a website), while 12% of them (n = 20) only ask their participants to work on tasks with instruction steps to follow (e.g., setting up a TV with its manual). The majority of the respondents (61%, n = 103), however, used both types of tasks during think-aloud sessions.


Figure 4. The types of tasks that UX practitioners ask their participants to work on during think-aloud sessions.

Practice sessions: Ericsson and Simon (1984) suggested that practitioners should ask their participants to practice thinking aloud before the actual think-aloud sessions. We asked the respondents about the frequency of conducting a practice session before starting the actual think-aloud test sessions (see results in Figure 5). Specifically, the majority of the respondents (61%, n = 103) almost never do it, 7% (n = 12) only do it roughly a quarter of the time, 6% (n = 10) do it roughly half of the time, 2% (n = 4) do it roughly three-quarters of the time, and 24% (n = 40) do it almost all the time. The result shows that the majority of UX practitioners seldom ask their participants to practice thinking aloud before conducting the actual think-aloud sessions.


Figure 5. The frequency of conducting practice sessions before actual think-aloud sessions.

Instructions for requesting verbalizations: When using the classic think-aloud protocol (Ericsson & Simon, 1984), moderators are required to ask their participants only to say out loud everything that naturally comes into their minds. We asked respondents what else they explicitly ask their participants to verbalize during think-aloud sessions in addition to the thoughts that naturally come into the mind (see results in Figure 6). Specifically, only 7% of the survey respondents (n = 12) reported that they do not ask their participants to verbalize anything beyond what naturally comes into their mind. In contrast, 80% (n = 136) mentioned that they also explicitly ask their participants to verbalize their feelings, 70% (n = 119) explicitly ask their participants to verbalize their feedback, 55% (n = 93) explicitly ask their participants to verbalize their actions on the interface, and 33% (n = 55) explicitly ask their participants to verbalize their design recommendations.


Figure 6. The content that respondents ask their participants to verbalize in addition to the thoughts that come naturally into the mind.

To better understand which types of content respondents often ask their participants to verbalize together, we counted the number of occurrences of different combinations of content that they ask their participants to verbalize in addition to the thoughts that come naturally into the mind (see results in Figure 7).


Figure 7. The percentages of different combinations of the content that respondents ask their participants to verbalize.

Prompting participants: When using the classic think-aloud protocol (Ericsson & Simon, 1984), moderators are required to keep the interaction with their participants to a minimal level and only remind them to keep talking if they fall into silence. We asked respondents whether they prompt their participants during think-aloud sessions and found that only 22% of the respondents (n = 37) keep the interaction minimal and do not prompt their participants with questions. In contrast, 78% of the respondents (n = 132) prompt their participants.

In addition, 91% of the respondents (n = 154) also reported how the frequency of prompting their participants had changed compared to when they just started their UX career (see results in Figure 8). Among these respondents, 44% (n = 67) felt that the frequency with which they prompt their participants remained roughly the same, 41% (n = 64) felt that the frequency for prompting their participants had only slightly changed, and 15% (n = 23) felt that the frequency had changed significantly.


Figure 8. How the frequency with which respondents prompted their participants during think-aloud sessions had changed compared to when they just started their UX career.

Correlation analysis: We examined whether there was any correlation between respondents’ profile information and their practices of using think-aloud protocols. Specifically, we performed Spearman’s rank-order correlation test when both variables were ordinal data and the Chi-square test when one of the variables was categorical data (see results in Table 3). In sum, the tests found no significant correlation for any pair except between the size of respondents’ companies and whether respondents request their participants to verbalize content beyond what comes into the mind, χ²(4, N = 169) = 14.403, p = 0.006.

Table 3. Correlation Analysis Between Respondents’ Profile Information and Their Practices of Conducting Think-Aloud Sessions

The size of their companies (ordinal data)

  vs. frequency of conducting practice sessions (ordinal data): rs(167) = -0.0294, p = 0.7043

  vs. whether asking users to verbalize content beyond what comes into the mind (categorical data): χ²(4, N = 169) = 14.403, p = 0.006*

  vs. whether prompting users during the study session (categorical data): χ²(4, N = 169) = 1.3939, p = 0.8453

The UX experience (ordinal data)

  vs. frequency of conducting practice sessions (ordinal data): rs(167) = -0.0166, p = 0.8308

  vs. whether asking users to verbalize content beyond what comes into the mind (categorical data): χ²(4, N = 169) = 2.6906, p = 0.6109

  vs. whether prompting users during the study session (categorical data): χ²(4, N = 169) = 2.7057, p = 0.6082

* indicates significance
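For illustration, the following is a minimal sketch of the two kinds of tests reported in Table 3, written in Python with SciPy. The variable values and contingency counts are hypothetical codings invented for the example; this is not our actual analysis code or data.

```python
# Minimal sketch of the statistical tests reported in Table 3.
# All values below are hypothetical codings of survey responses.
import numpy as np
from scipy.stats import spearmanr, chi2_contingency

# Spearman's rank-order correlation for two ordinal variables,
# e.g., company size (1-5) vs. frequency of practice sessions (1-5).
company_size = np.array([1, 2, 3, 4, 5, 3, 2, 5, 4, 1])
practice_freq = np.array([1, 1, 2, 5, 3, 2, 1, 4, 5, 1])
rho, p_value = spearmanr(company_size, practice_freq)
print(f"Spearman rs = {rho:.3f}, p = {p_value:.3f}")

# Chi-square test of independence when one variable is categorical,
# e.g., company size (5 bins) vs. whether practitioners request
# verbalizations beyond what naturally comes to mind (yes/no).
# Rows: company-size bins; columns: yes / no (illustrative counts).
contingency = np.array([
    [ 8,  4],
    [20, 10],
    [30, 12],
    [28, 14],
    [35,  8],
])
chi2, p_value, dof, expected = chi2_contingency(contingency)
print(f"chi2({dof}) = {chi2:.3f}, p = {p_value:.3f}")
```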

Analyzing Think-Aloud Sessions

Activities performed for analyzing sessions: We asked respondents about specific activities they did when analyzing think-aloud sessions. The activities were the following: review observation notes of the usability test, review the test session recording, review post-task interview data, review post-task questionnaire data, or transcribe and review the transcript of the session. These options were based on a prior survey (McDonald et al., 2012) and were updated via a pilot study (see results in Figure 9). Specifically, 89% of the respondents (n = 151) review observation notes, 77% of them (n = 130) review the session recordings (e.g., audio/video recordings), 70% of them (n = 118) review post-task interview data, 60% of them (n = 102) review the questionnaire/survey data, and 56% of them transcribe and review the transcripts (i.e., what participants said).


Figure 9. The activities that UX practitioners perform when analyzing think-aloud sessions.

Information for locating usability problems: We asked respondents about the types of information they thought would help locate usability problems (see results in Figure 10). Specifically, when reviewing think-aloud sessions to identify usability problems, 94% of them (n = 159) thought what participants were doing (e.g., user actions on the interface) is helpful, 86% of them (n = 145) thought what participants said during the sessions is helpful, and 76% of them (n = 128) also thought how participants said it (e.g., pauses, tone) is helpful.


Figure 10. The types of information that are helpful for UX practitioners to locate usability problems.

Information sought out from users’ verbalizations: We asked respondents about the information that they looked for when analyzing their participants’ verbalizations (i.e., utterances; see results in Figure 11). Specifically, 94% of them (n = 153) looked for expressions of feelings (e.g., excitement, frustration), 89% (n = 145) looked for their participants’ comments (e.g., feedback), 74% (n = 119) looked for their participants’ action descriptions, 70% (n = 116) looked for their participants’ explanations, and 30% (n = 49) looked for their participants’ design recommendations.


Figure 11. The types of information that UX practitioners seek in users’ verbalizations.

Delivering analysis results: We asked respondents what activities they performed when delivering analysis results. The following were the three activities: write an informal usability test report, write a formal usability test report, and have a data analysis discussion meeting. We did not provide definitions for these activities to make them open to interpretation. They could choose multiple options if applicable (see results in Figure 12). Specifically, when analyzing a think-aloud session, 69% of them (n = 116) wrote an informal usability test report, 58% (n = 98) wrote a formal usability test report, and 57% of them (n = 97) had a data analysis discussion meeting.


Figure 12. The ways in which UX practitioners deliver their analysis results.

Participation in the three types of data analysis: We asked the respondents who write formal and informal usability reports about how they did this. We gave them the following six options: Only myself, UX designers/researchers, UX team lead, Lead of non-UX teams (e.g., engineering, marketing), Other non-UX team members (e.g., engineers), and C-level executives (e.g., CEO). In addition, we also asked respondents who would attend data analysis discussion meetings with the same set of options except “Only myself.” They could choose multiple options if applicable (see results in Figure 13). More than half of the respondents (56%, n = 95) wrote informal usability testing reports alone and nearly half of the respondents (42%, n = 71) also wrote formal usability testing reports alone. In addition, UX team members were the primary authors of informal/formal reports with occasional help from outside of the UX team. In contrast, non-UX team members were more involved in data analysis discussion meetings.


Figure 13. Participation in three types of data analysis activities: writing an informal usability test report, writing formal usability test report, and having a data analysis discussion meeting.

Challenges of Using Think-Aloud Protocols

We asked respondents what their biggest inefficiencies or difficulties had been in conducting and analyzing think-aloud sessions as an optional short-answer question. We present the key findings from the responses in the following paragraphs.

Challenges for conducting sessions: Our qualitative analysis reveals three main challenges that respondents encountered when conducting think-aloud sessions. First, getting their participants to think aloud is a challenge. Participants’ personality, their ability to verbalize thoughts, and the complexity and duration of the tasks all influence the amount of content that they verbalize. For example, some people tend to be able to verbalize more readily than others, which can create an unbalanced representation of potential users. For some products, the target population (e.g., children) may not be able to verbalize properly. Participants may also feel less comfortable verbalizing their thoughts when the task is complex. Furthermore, it may also be fatiguing for users to verbalize their thoughts if the task takes too long to complete.

Another challenge facing respondents is to create a comfortable and neutral environment that encourages participants to honestly verbalize their thought processes. This is challenging because participants might want to say nice things or may be reluctant to offer criticism during the test sessions, which could preclude UX practitioners from identifying usability bugs.

Finally, being patient and knowing when to interrupt participants is challenging. It is valuable to observe and understand how participants deal with the tasks themselves and recover from errors. Interrupting the process with prompts too early could change their way of interacting with the test interface. Moreover, because part of the goal of usability evaluations is to gather data on what is difficult/impossible for users, it is often necessary to observe users struggle a bit during the evaluation to understand their “pain points.” That being said, it is also bad to let participants get stuck for too long as they can become overly frustrated, which could, in turn, affect the rest of the test session and consequently the amount of feedback that can be gained from it.

Challenges for analyzing sessions: While previous research reported general practices in analyzing usability evaluation (Følstad, Law, & Hornbæk, 2012), our survey study found specific challenges that respondents faced when analyzing think-aloud test sessions. This survey study showed that respondents reviewed think-aloud session notes (89%) more often than the session recordings (77%; see Figure 9). Respondents felt that reviewing think-aloud session video recordings was arduous because recorded think-aloud sessions often contain so much data that transcribing and coding them takes a significant amount of time. Consequently, instead of transcribing sessions and reviewing transcripts, respondents often rely on “their memory of participants’ sentiments and actions” or the notes.

Despite the convenience of observation notes, respondents realized that it is “easy to make judgments that might be off if they don’t refer back to actual transcripts or recordings” and thus considered reviewing think-aloud session recordings a necessary part of their analysis process. First, it is necessary to match the observation notes with the corresponding segments in the session recordings to understand the context of the notes. Second, it is necessary to review the session recordings to capture points that might have been missed by observation notes because notetakers can only write down the points that seem to be important from their perspective, and any individual perspective can be incomplete or biased. Indeed, previous research also suggested that while some of the usability problems may be captured by notes, much of the insight is often lost and needs to be reconstructed by conducting video data analysis later (Kjeldskov, Skov, & Stage, 2004).

This survey study further identifies two challenges associated with reviewing think-aloud sessions. One challenge is to compare users’ verbalization data with other streams of data to triangulate the issues that users encountered. One such comparison is to pair the user’s actions on the interface with what they are saying (i.e., utterances) during the session. In scenarios where multiple streams of data are acquired, respondents had to correlate the verbalizations with other sensor data. Recent research has shown that considering verbalizations together with other sensor data, such as eye-tracking data (Elbabour, Alhadreti, & Mayhew, 2017; Elling, Lentz, & de Jong, 2012), EEG data (Grimes, Tan, Hudson, Shenoy, & Rao, 2008), or functional near-infrared spectroscopy (fNIRS; Lukanov, Maior, & Wilson, 2016), can potentially increase the reliability and validity of the findings. Another challenge is to match the observation notes with the context in which the notes are taken. It is not always possible to record the exact timestamps at which notes are taken. Consequently, matching notes (e.g., observations about users’ facial expressions) with the audio stream often requires evaluators to watch the entire recording. Another example of this challenge comes from the emerging VR and AR applications. To make sense of users’ verbalizations when they interact with a VR or AR application, evaluators need to correlate the verbalizations with the visual content that participants observed during the sessions.
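Where timestamps are available, part of this alignment can be scripted. Below is a minimal sketch that matches timestamped observation notes to nearby transcript segments so an evaluator can jump to the relevant part of a recording; the Note and Segment structures, the example data, and the 15-second window are assumptions for illustration, not a tool used by our respondents.

```python
# Minimal sketch: matching timestamped observation notes to the
# transcript segments (and hence recording positions) they refer to.
# Data structures are hypothetical; times are seconds from session start.
from dataclasses import dataclass
from typing import List

@dataclass
class Note:
    time: float          # when the note was taken
    text: str

@dataclass
class Segment:
    start: float         # segment start in the recording
    end: float
    utterance: str       # what the participant said

def segments_near(note: Note, segments: List[Segment], window: float = 15.0) -> List[Segment]:
    """Return transcript segments within `window` seconds of a note, so the
    evaluator can jump to that part of the recording instead of re-watching
    the whole session."""
    return [s for s in segments
            if s.start - window <= note.time <= s.end + window]

notes = [Note(time=125.0, text="Frowned at the checkout form")]
segments = [
    Segment(start=118.2, end=131.6, utterance="Hmm, where do I enter the promo code?"),
    Segment(start=131.6, end=140.0, utterance="Oh, I guess it is on the next page."),
]
for note in notes:
    for seg in segments_near(note, segments):
        print(f"{note.text!r} -> [{seg.start:.0f}s-{seg.end:.0f}s] {seg.utterance}")
```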

Reviewing think-aloud sessions is time-consuming. Our respondents reported that they often had limited time to complete the analysis and faced the tension between achieving high reliability and validity in their analysis and completing their analysis efficiently. To cope with the tension, respondents reported using strategies such as developing better note-taking skills or having a team of UX professionals observe a think-aloud test session. Respondents also proposed to discuss the session afterward with their peers.

In addition to reviewing sessions, respondents also pointed out that it can be valuable to keep track of the examples of different types of usability problems that they had observed over time and develop a taxonomy to describe the patterns in the data that commonly occur when users encountered usability problems. Such patterns, examples, and the taxonomy could act as templates that potentially help them quickly identify common issues that users encounter and the solutions that they had accumulated in a new test context.

Discussion

Our study respondents worked in different geographic locations, in different industrial fields, and in UX teams of different sizes. They also played different roles and possessed different levels of experience as UX professionals. Thus, the survey responses have uncovered a wide range of UX practitioners’ practices surrounding the conduct and analysis of think-aloud sessions. Next, we discuss the implications of the survey responses.

General Use of Think-Aloud Protocols

This survey study found that 86% of all respondents (169 out of 197) used think-aloud protocols when conducting usability tests; usability testing was also the method that respondents most frequently used for detecting usability problems. Among the 91% of all respondents (n = 179) who had learned think-aloud protocols, 95% (169 out of 179) actually used the protocols in their usability tests. This result shows that think-aloud protocols are still widely used in industry. It is consistent with the survey study conducted in 2010 (McDonald et al., 2012), which showed that 90% of usability practitioners often used think-aloud protocols.

Our study shows that concurrent think-aloud protocols are much more popular than the retrospective think-aloud protocols among UX practitioners. Of the respondents, 91% used the concurrent think-aloud protocols in at least half of their usability tests (see results in Figure 2). In contrast, only 39% of the respondents used the retrospective think-aloud protocols in at least half of their usability tests.

Our study also shows that think-aloud protocols are widely and almost equally used in both controlled lab studies (75%) and remote usability testing (72%). Compared to the most recent survey study conducted by McDonald et al. (2012), our survey study identified that remote usability testing is increasingly popular and that think-aloud protocols are widely used in remote usability testing as well as in controlled lab studies. Remote usability testing allows UX practitioners to recruit geographically distributed and diverse users to participate in usability testing in their native work environments. Previous research has shown that remote synchronous usability testing, in which there is a test facilitator, is virtually equivalent to conventional lab-based controlled user studies in terms of the number of identified usability problems and the task completion time (Andreasen, Nielsen, Schrøder, & Stage, 2007). Furthermore, previous research also showed that although participants experienced a higher workload in remote synchronous usability testing (i.e., a web-based two-dimensional screen-sharing approach and a three-dimensional virtual world) than in conventional lab-based user studies, as measured by a post-task NASA TLX questionnaire, they generally enjoyed the remote synchronous usability testing (Madathil & Greenstein, 2011). Similarly, although remote asynchronous usability testing, in which there is no test moderator, may reveal fewer problems than conventional lab-based user studies, it requires significantly less time and is thus cost-effective (Bruun, Gull, Hofmeister, & Stage, 2009).

Conducting Think-Aloud Sessions

To ensure the validity of participants’ verbalizations, Ericsson and Simon (1984) provided three guidelines for conducting classic think-aloud sessions: keep the interaction minimal (i.e., only remind users to think aloud if they fall into silence for a period of time), use neutral instructions (i.e., instructions that do not ask for specific types of content), and have practice session(s). A meta-analysis of 94 think-aloud studies showed that an artificial change in performance can happen if these guidelines are breached (Fox, Ericsson, & Best, 2011). However, previous research has documented that the gap between the theory and the practice of using think-aloud protocols existed (Boren & Ramey, 2000), and our survey study provides evidence that such a gap between the theory and the practice still exists. Specifically, we found that respondents did not always adhere to the three guidelines. We analyze potential reasons for violating each guideline in the following paragraphs.

Our study shows that only 16% of the respondents limited themselves to reminding their participants to keep talking when they fell silent for a substantial period, without actively probing them with questions while they were thinking aloud. Previous research has attributed the reason for not adhering to this guideline to the differences between the original goal of think-aloud protocols and the goal of using them in usability testing (Boren & Ramey, 2000). The original goal is to study unaltered human thought processes. Numerous studies have shown that probing or intervention (i.e., interaction with participants) could potentially alter the participant’s thought processes, which could mean that the reported verbalizations are not an authentic representation of their thoughts (Alhadreti & Mayhew, 2017; Ericsson & Simon, 1984; Fox et al., 2011). Thus, UX practitioners should keep their intervention or probing minimal if possible. However, we also acknowledge that the goal of using think-aloud protocols in usability testing is mainly to identify usability bugs or to evaluate potential users’ performance rather than just to acquire unaltered thought processes. Because of this difference, previous research suggests that UX practitioners may deviate from the guidelines and interact with their participants in two situations (Nielsen, 1993). One is when participants are frustratingly stuck. In this situation, interacting with them to help them recover from the error would allow the test to continue, which would, in turn, allow UX practitioners to identify further usability issues. Another situation is when participants are struggling with a familiar problem whose impact has already been identified and well understood with previous test participants. In this situation, it is less meaningful to sit and observe participants struggle with the problem again. Furthermore, as previous research suggested that audio interruptions (e.g., a beeping sound) during think-aloud sessions may affect participants more than visual interruptions (e.g., an on-screen notification; Hertzum & Holmegaard, 2013), future research should examine the possibility of probing participants through the visual modality, such as showing an on-screen notification with a question, to acquire richer data while minimizing the risk of altering their thought processes.

Although Ericsson and Simon’s guidelines recommend that practitioners use neutral instructions (i.e., only ask participants to report the content that naturally comes into their mind), our study reveals that only 7% of the respondents adhered to this guideline. Most of the respondents explicitly asked their participants to verbalize other types of content, such as feelings, comments, actions, and even design recommendations. This is concerning because research has shown that explicitly asking participants to verbalize a particular type of content can change their task-solving behavior (McDonald & Petrie, 2013), which may mask potential usability problems.

Our study also shows that UX practitioners do not always follow the third guideline. For example, most of the respondents (61%) rarely asked their participants to practice thinking aloud before conducting the actual sessions. Unfortunately, previous research showed that without practicing thinking aloud, participants often have difficulty verbalizing their thought processes (Charters, 2003). Consequently, instead of treating the practice session as a burden, UX practitioners should treat it as an opportunity to help their participants become familiar with thinking aloud, which would help the participants verbalize their thoughts more naturally and frequently. This would potentially reduce the need for probing participants or explicitly asking them to verbalize their thoughts and feelings, which could, in turn, enhance adherence to the other two guidelines. In sum, allowing participants to practice thinking aloud would ultimately help UX practitioners acquire richer data to understand their user experiences.

Analyzing Think-Aloud Sessions

When analyzing think-aloud sessions, UX practitioners reviewed observation notes more often than the session recordings and the transcriptions. One potential reason was that transcribing and reviewing the session recordings is arduous and time-consuming. Previous research pointed out that UX practitioners often face time pressure in their analysis (Chilana, Wobbrock, & Ko, 2010). Indeed, the qualitative feedback from our survey respondents echoed this finding. Although the survey respondents largely knew that their judgments might be inaccurate if they did not refer to the actual session recordings, they often had to make trade-offs between achieving high reliability and validity and being efficient in their analysis. Currently, there are no known methods to deal with this tension effectively. The methods that survey respondents used include developing better note-taking skills and referring to the notes during analysis, or having multiple UX practitioners observe a test session and then discuss and recap it afterward. However, it remains unknown whether these methods are effective or whether other more effective methods are available. Indeed, recent research also suggested gaining a richer understanding of the trade-offs that evaluators make and the impact of their decisions (MacDonald & Atwood, 2013). Therefore, future research should investigate methods and processes that can better balance the reliability, validity, and efficiency of the analysis of think-aloud sessions.

Our study also reveals a need to identify common patterns from users’ data that point to the moments when they experience problems in think-aloud sessions. Research has shown that users’ verbalizations can be classified into different categories (Cooke, 2010; Hertzum, Borlund, & Kristoffersen, 2015). Recently, Fan et al. found subtle patterns that tend to occur when users encounter problems in concurrent think-aloud sessions (Fan, Lin, Chung, & Truong, 2019). Specifically, when users encounter problems, their verbalizations tend to be in the observation category (e.g., comments and remarks) and include negative sentiments, questions, more verbal fillers, and abnormal pitches and speech rates (Fan et al., 2019). In addition to subtle verbalization and speech patterns, do users’ eye movements also exhibit certain patterns when they experience problems in think-aloud sessions? Similarly, do users’ facial expressions and physiological signals (e.g., heartbeat, skin conductance) tend to change in a predictable way when they encounter usability problems or enjoy the interaction? Future research should explore whether such patterns exist. If these patterns do exist, they could be leveraged to design systems that automatically highlight portions of a think-aloud test session in which the user more likely experienced a problem, which in turn could help UX practitioners better allocate their attention during analysis.
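As one illustration of how such patterns might be leveraged, the following is a minimal, hypothetical sketch that flags transcript segments combining several of the verbalization cues reported by Fan et al. (2019). The word lists, thresholds, and scoring are assumptions for illustration, not a validated detector.

```python
# Minimal, hypothetical sketch: flag think-aloud segments that show several
# of the verbalization cues associated with usability problems (negative
# sentiment, questions, repeated verbal fillers). Word lists and thresholds
# are illustrative assumptions only.
import re

FILLERS = {"um", "uh", "hmm", "er"}
NEGATIVE_WORDS = {"confusing", "annoying", "stuck", "wrong", "frustrating", "weird"}

def problem_score(utterance: str) -> int:
    """Count simple cues that tend to co-occur with usability problems;
    a higher score means more cues are present in the utterance."""
    tokens = re.findall(r"[a-z']+|\?", utterance.lower())
    cues = 0
    cues += int(any(t in NEGATIVE_WORDS for t in tokens))   # negative sentiment
    cues += int("?" in tokens)                               # question
    cues += int(sum(t in FILLERS for t in tokens) >= 2)      # repeated fillers
    return cues

segments = [
    "Okay, I will click the blue button at the top.",
    "Um... hmm, this is confusing, where did the cart go?",
]
for seg in segments:
    if problem_score(seg) >= 2:          # flag segments with multiple cues
        print(f"Review this segment: {seg}")
```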

Conclusion

We conducted an international survey study to understand the practices and challenges of using think-aloud protocols in industry. Based on the responses from 197 UX practitioners who worked in different industrial fields and different geographic locations, we have identified the practices and challenges surrounding the conduct and analysis of think-aloud sessions. The findings of the survey study could potentially inform UX practitioners about how their peers perceive and use think-aloud protocols. Our survey study also reveals opportunities in developing better methods and tools to make conducting and analyzing think-aloud sessions more effective, for example, by identifying patterns in users’ data (e.g., verbalizations, actions, and physiological measures) that commonly occur when they encounter problems and by developing a taxonomy of the patterns that would allow UX practitioners to improve the efficiency and communication of usability analyses in a field- and scale-invariant manner (i.e., independent of industrial fields or the number of test sessions).

Tips for Usability Practitioners

Our survey study discovered that many UX practitioners’ current practices deviate from Ericsson and Simon’s three guidelines (1984). Considering the number of empirical studies that examined the effect of these deviations, we offer the following recommendations for UX practitioners to consider:

  • Conduct a practice session before actual study sessions. This would help participants practice and get used to verbalizing their thoughts more frequently and reduce the need for prompting or intervention during the study sessions (Charters, 2003).
  • Use neutral instructions to ask participants to report whatever comes into their mind naturally and avoid instructing them to report a particular type of content (McDonald, McGarry, & Willis, 2013; McDonald & Petrie, 2013).
  • Keep interaction with participants minimal (Alhadreti & Mayhew, 2017; Fox et al., 2011; Hertzum, Hansen, & Andersen, 2009).
  • Consider using think-aloud protocols in both controlled lab studies and remote (synchronous and asynchronous) usability testing (Andreasen et al., 2007; Bruun et al., 2009).

We have further derived the following tips and recommendations based on the current practices and challenges facing UX practitioners when using think-aloud protocols:

  • Pay attention to participants’ actions, their verbalizations, and how they verbalize (e.g., speech rate, pitch) when analyzing think-aloud sessions.
  • Acknowledge that tension exists between achieving high validity and reliability and maintaining high efficiency in analyzing large amounts of think-aloud sessions. Create more open dialogues to discuss fast-paced and reliable analysis methods to cope with the tension.
  • Design methods to understand, capture, and categorize patterns that emerge from participants’ data (e.g., verbalizations, actions, eye-tracking, and physiological measures) and that commonly occur when users encounter usability problems. These patterns could then be organized into a taxonomy that would allow UX practitioners to improve the efficiency and communication of usability analyses in a field- and scale-invariant manner.

These tips are based on the practices of a large percentage of UX practitioners. In practice, UX practitioners conduct and analyze think-aloud sessions in different contexts (e.g., different user groups and different types of products) with different constraints. Therefore, we suggest UX practitioners evaluate the motivations and justifications underlying these tips to make informed decisions.

Acknowledgements

This work was completed when the first author was a PhD student at the University of Toronto. We would like to thank UXPA organizers Chris Bailey, Adams Banks, Bendte Fagge, Yvonne Liu, Nicole Maynard, Hannes Robier, Nabil Thalmann, Rik Williams, and Jackeys Wong for helping us distribute the survey to their associated UXPA local chapters in Asia, Europe, and North America. We would also like to thank all the anonymous UX professionals who participated in the survey study. Finally, we would like to thank our anonymous reviewers and Editor-in-Chief Dr. James Lewis for their constructive reviews and feedback and also thank Sarah Harris for copyediting the manuscript. We have made a PDF version of the survey available here: http://mingmingfan.com/doc/ThinkAloudSurvey-FAN-Mingming.pdf.

References

Alhadreti, O., & Mayhew, P. (2017). To intervene or not to intervene: An investigation of three think-aloud protocols in usability testing. Journal of Usability Studies, 12(3), 111–132.

Andreasen, M. S., Nielsen, H. V., Schrøder, S. O., & Stage, J. (2007). What happened to remote usability testing?: An empirical study of three methods. In Proceedings of the SIGCHI conference on Human factors in computing systems (pp. 1405–1414). ACM. https://doi.org/10.1145/1240624.1240838

Boren, T., & Ramey, J. (2000). Thinking aloud: Reconciling theory and practice. IEEE Transactions on Professional Communication, 43(3), 261.

Bruun, A., Gull, P., Hofmeister, L., & Stage, J. (2009). Let your users do the testing: A comparison of three remote asynchronous usability testing methods. In Proceedings of the 27th international conference on Human factors in computing systems – CHI 09 (pp. 1619–1628). ACM Press. https://doi.org/10.1145/1518701.1518948

Charters, E. (2003). The use of think-aloud methods in qualitative research an introduction to think-aloud methods. Brock Education Journal, 12(2), 68–82. https://doi.org/10.26522/brocked.v12i2.38

Chilana, P. K., Wobbrock, J. O., & Ko, A. J. (2010). Understanding usability practices in complex domains. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 2337–2346). ACM Press. https://doi.org/10.1145/1753326.1753678

Cooke, L. (2010). Assessing concurrent think-aloud protocol as a usability test method: A technical communication approach. IEEE Transactions on Professional Communication, 53(3), 202–215. https://doi.org/10.1109/TPC.2010.2052859

Dumas, J. S., & Redish, J. (1999). A practical guide to usability testing. Intellect Books.

Elbabour, F., Alhadreti, O., & Mayhew, P. (2017). Eye tracking in retrospective think-aloud usability testing: Is there added value? Journal of Usability Studies, 12(3), 95–110.

Elling, S., Lentz, L., & de Jong, M. (2012). Combining concurrent think-aloud protocols and eye-tracking observations: An analysis of verbalizations. IEEE Transactions on Professional Communication, 55(3), 206–220. https://doi.org/10.1109/TPC.2012.2206190

Ericsson, K. A., & Simon, H. A. (1984). Protocol analysis: Verbal reports as data. MIT Press.

Fan, M., Lin, J., Chung, C., & Truong, K. N. (2019). Concurrent think-aloud verbalizations and usability problems. ACM Transactions on Computer-Human Interaction, 26(5), 28:1–28:35. https://doi.org/10.1145/3325281

Følstad, A., Law, E., & Hornbæk, K. (2012). Analysis in practical usability evaluation: A survey study. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 2127–2136). ACM Press. https://doi.org/10.1145/2207676.2208365

Fox, M. C., Ericsson, K. A., & Best, R. (2011). Do procedures for verbal reporting of thinking have to be reactive? A meta-analysis and recommendations for best reporting methods. Psychological Bulletin, 137(2), 316.

Grimes, D., Tan, D. S., Hudson, S. E., Shenoy, P., & Rao, R. P. N. (2008). Feasibility and pragmatics of classifying working memory load with an electroencephalograph. In Proceeding of the twenty-sixth annual CHI conference on Human factors in computing systems – CHI ’08 (p. 835). ACM Press. https://doi.org/10.1145/1357054.1357187

Hertzum, M., Borlund, P., & Kristoffersen, K. B. (2015). What do thinking-aloud participants say? A comparison of moderated and unmoderated usability sessions. International Journal of Human-Computer Interaction, 31(9), 557–570. https://doi.org/10.1080/10447318.2015.1065691

Hertzum, M., Hansen, K. D., & Andersen, H. H. K. K. (2009). Scrutinising usability evaluation: Does thinking aloud affect behaviour and mental workload? Behaviour & Information Technology, 28(2), 165–181. https://doi.org/10.1080/01449290701773842

Hertzum, M., & Holmegaard, K. D. (2013). Thinking aloud in the presence of interruptions and time constraints. International Journal of Human-Computer Interaction, 29(5), 351–364. https://doi.org/10.1080/10447318.2012.711705

Hornbæk, K. (2010). Dogmas in the assessment of usability evaluation methods. Behaviour and Information Technology, 29(1), 97–111. https://doi.org/10.1080/01449290801939400

Jurca, G., Hellmann, T. D., & Maurer, F. (2014). Integrating agile and user-centered design: A systematic mapping and review of evaluation and validation studies of agile-UX. In Proceedings – 2014 Agile Conference, AGILE 2014 (pp. 24–32). IEEE. https://doi.org/10.1109/AGILE.2014.17

Kjeldskov, J., Skov, M. B., & Stage, J. (2004). Instant data analysis: Conducting usability evaluations in a day. In Proceedings of the third Nordic conference on Human-computer interaction – NordiCHI ’04 (pp. 233–240). ACM Press. https://doi.org/10.1145/1028014.1028050

Lukanov, K., Maior, H. A., & Wilson, M. L. (2016). Using fNIRS in usability testing. Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems – CHI ’16 (pp. 4011–4016). https://doi.org/10.1145/2858036.2858236

MacDonald, C. M., & Atwood, M. E. (2013). Changing perspectives on evaluation in HCI: Past, present, and future. In CHI ’13 Extended Abstracts on Human Factors in Computing Systems on – CHI EA ’13 (pp. 1969–1978). ACM Press. https://doi.org/10.1145/2468356.2468714

McDonald, S., Edwards, H. M., & Zhao, T. (2012). Exploring think-alouds in usability testing: An international survey. IEEE Transactions on Professional Communication, 55(1), 2–19. https://doi.org/10.1109/TPC.2011.2182569

McDonald, S., McGarry, K., & Willis, L. M. (2013). Thinking-aloud about web navigation: The relationship between think-aloud instructions, task difficulty and performance. Proceedings of the Human Factors and Ergonomics Society Annual Meeting, 57(1), 2037–2041. https://doi.org/10.1177/1541931213571455

McDonald, S., & Petrie, H. (2013). The effect of global instructions on think-aloud testing. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems – CHI ’13 (pp. 2941–2944). ACM Press. https://doi.org/10.1145/2470654.2481407

Madathil, K. C., & Greenstein, J. S. (2011). Synchronous remote usability testing: A new approach facilitated by virtual worlds. In Proceedings of the 2011 annual conference on Human factors in computing systems – CHI ’11 (pp. 2225–2234). ACM Press. https://doi.org/10.1145/1978942.1979267

Nielsen, J. (1993). Usability engineering. Academic Press, Inc.

Nørgaard, M., & Hornbæk, K. (2006). What do usability evaluators do in practice? An explorative study of think-aloud testing. In Proceedings of the 6th ACM conference on Designing Interactive systems – DIS ’06 (p. 209). ACM Press. https://doi.org/10.1145/1142405.1142439

Preece, J., Rogers, Y., & Sharp, H. (2015). Interaction design: Beyond human-computer interaction. Wiley.

Rubin, J., & Chisnell, D. (2008). Handbook of usability testing: How to plan, design and conduct effective tests. John Wiley & Sons.

Shi, Q. (2008). A field study of the relationship and communication between Chinese evaluators and users in thinking aloud usability tests. In Proceedings of the 5th Nordic conference on Human-computer interaction building bridges – NordiCHI ’08 (p. 344). ACM. https://doi.org/10.1145/1463160.1463198