Enhancing Card Sorting Dendrograms through the Holistic Analysis of Distance Methods and Linkage Criteria

Abstract

Keywords

Introduction

Method

Data Characterization

Distance Analysis

Clustering Analysis

Peer-reviewed Article

pp. 73-90

[:en]

Card sorting has been widely used in information architecture to analyze and improve web content and navigation. It is an intuitive and cost-effective technique that is also useful in user research and evaluation. However, while carrying out sorting tasks is a constructive and easy-to-accomplish process, the quantitative analysis of the resulting card-sorting data can be a challenge for non-skilled evaluators. Several tools exist to support sorting tasks and data analysis, but some users still rely on custom spreadsheets or statistical packages in order to enhance the analysis and obtain more expressive and comparable results.

One of the most utilized diagrams for analyzing card-sorting results is the dendrogram, also known as a tree diagram, which is commonly based on an agglomerative clustering representation depicting groupings of related cards. However, several issues have to be considered by evaluators in order to produce meaningful dendrograms for decision-making. In fact, the distance method and the linkage criterion greatly influence the final dendrogram obtained.

In this paper, an analysis on distance methods and linkage criteria for obtaining suitable dendrograms is proposed. The main aim is to guide evaluators and usability engineers to produce appropriate dendrograms based on available card-sorting data. In this sense, the provided clues can be widely applied to any card-sorting domain and sample size, improving card-sorting analysis by comparing different solutions through goodness indicators.

Analyses applied to a publicly available dataset indicate that the results are highly dependent on the data type, so the right selection of both distance method and linkage criterion is essential for obtaining a suitable dendrogram and correctly interpreting the results.

Information architecture, card sorting, quantitative analysis, agglomerative clustering, distance method, linkage criterion, dendrogram

Card sorting is a widespread technique that has been used in different domains that range from social sciences to software development (Hudson, 2014). In fact, it has been successfully applied to analyze and improve the information architecture of a website (Rosenfeld & Morville, 2002) through a user-centered approach (Cayola & Macías, 2018; Rojas & Macías, 2019). In general, card sorting can be seen as an intuitive and cost-effective technique very useful in user research and evaluation (Albert & Tullis, 2013; Sauro & Lewis, 2016). Card sorting can be utilized in the early phases of the software development process for eliciting a user’s mental model (Paea & Baird, 2018), but also in advanced stages in order to evaluate and compare different design solutions (Paul, 2008, 2014) in user-centered development processes (Macías, 2012). Nowadays, card sorting has become popular in human-computer interaction fields (Macías et al., 2009), such as user experience (Veral & Macías, 2019) and design thinking. In addition, it has been successfully utilized in user research (Spencer, 2009).

In an effective card-sorting implementation, sorting tasks have to be arranged and then accomplished by recruited participants, often under the supervision of an evaluator or software engineer. The main aim is to analyze how participants classify each card into different categories. This provides important clues about a user’s criteria, exploring the categories of information that best fit a concrete domain for a specific design (Macías & Castells, 2002), content layout, or navigation through the information involved.

Once the sorting tasks have been completed, a second important step is the analysis of the resulting data. While qualitative analysis, principally based on a user’s behavior, can be achieved promptly and more easily, quantitative analysis implies the statistical study of the resulting card-sorting data. This becomes a much more complex task, where certain knowledge about the statistical methods to apply is essential for the right interpretation of the results. One of the most widely used representations for the quantitative analysis of card-sorting data is the dendrogram, which comprises a taxonomical hierarchy (Borges & Macías, 2010) of the related cards. The agglomerative dendrogram, obtained through a hierarchical agglomerative clustering, is the most common representation. It comprises an unsupervised bottom-up algorithm (Hastie et al., 2017) intended to produce the best clustering possible for the variables selected (e.g., cards). However, a first step toward the generation of the dendrogram is the selection of a suitable distance method and linkage criterion. A wrong selection may produce inaccurate or misleading results (Pawliczek & Dzwinel, 2013), which may lead the evaluator to make incorrect decisions (Saraçli et al., 2013) based on a wrong or confusing dendrogram. Although different software tools exist today (Chaparro et al., 2018), they mostly provide specific data representations and run standard algorithms with customized parameters and settings, which intrinsically limits the range of the outcome obtained, thus reducing the expressivity of the quantitative analysis. In fact, most card-sorting analyses are still carried out manually, often utilizing custom spreadsheets that can obtain only basic information about raw data.

In this paper, an analysis of the principal parameters to produce suitable dendrograms for the quantitative analysis of card-sorting data is presented. Specifically, a study on different distance methods and linkage criteria for agglomerative clustering is detailed. To carry out this task, different distance methods are studied and compared together with the analysis of different linkage criteria for the agglomerative clustering. The main aim is to provide knowledge for producing suitable dendrograms, thus improving the quantitative analysis of card-sorting data. Besides, goodness indicators are also provided in order to compare different dendrograms. Different clues are provided in order to guide evaluators according to the type of the data obtained from sorting tasks.

This paper is structured as follows. The next section presents a method for obtaining suitable dendrograms from card-sorting data. This way, the analysis of the raw data and the suitable selection of the distance method and linkage criterion are described. Also, goodness indicators are proposed for validation purposes. Then, the results section applies the proposed method to an existing card-sorting dataset that is publicly available. Finally, conclusions and some useful tips for usability practitioners are provided.

The following sections provide information on the data characterization, distance analysis, and clustering analysis determined by this study.

In order to produce effective dendrograms for the quantitative analysis of card-sorting results, the first step is to study the characteristics of the data obtained from sorting tasks. This implies analyzing the type of the data and the variables involved. In most card-sorting analyses, cards are considered as variables or stimuli (p), whereas categories are considered as observations (n). However, depending on the intended analysis, categories can be utilized as variables and cards as observations.

In most cases, card-sorting data are represented using working matrices. Figure 1 depicts three different examples of card sorting, where cards (p = 6) are classified into different categories (n = 5). Each representation in Figure 1 (a, b, and c) has its own working matrix represented in Tables 1, 2, and 3, respectively. This way, Figure 1.a depicts an example of an open card sorting where three users have classified each card into a different category, and this can be represented by the working matrix in Table 1. Similarly, Figure 1.b depicts an example of card sorting where a user has attempted to classify cards into nested categories, instead of into independent ones. This is represented by the working matrix in Table 2. Also, Figure 1.c depicts an example of card sorting where three users have attempted to classify the cards into categories, and this has been normalized and represented by the working matrix in Table 3.

As shown, working matrices indicate straight relationships among cards and categories, where each cell contains numbers that can be of two different types:

Binary: They usually come from open card sorts in which categories are not normalized (i.e., they have not been reduced and regrouped into a smaller number). This is a common representation where only 1 and 0 values are utilized, indicating (see Table 1) that a card has been classified into a concrete category (1) or not (0). It is worth mentioning that binary data may produce symmetric (see Table 1) or asymmetric (see Table 2) binary variables, denoting that each card can be classified into a single category or into a nested hierarchy of categories, respectively. Symmetric binary variables have the same weight and preference (invariant characteristics). This way, given a card, the lack of membership in a category is the opposite of membership in the rest of the categories. By contrast, asymmetric binary variables can codify more complex memberships, having variables of different weight. As shown in Figure 1.b and Table 2, a card (Card4) classified into a subcategory belongs to both the subcategory (Category1.1) and the parent category (Category1). This is an important issue, as asymmetric binary variables require noninvariant similarity measures, and they greatly affect the kind of analysis that can be carried out at a later stage.
Interval-scaled: They come from closed card sorts or open ones where categories have been normalized. In this case (Table 3), aggregated binary values, consisting of positive integers, are utilized. Each cell thus denotes the number of times that participants classified a card into a given category.
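As a rough illustration of how the two data types arise, the following Python sketch builds a binary working matrix for one participant and then aggregates several participants into an interval-scaled matrix. The card names, category names, and the `sorts` structure are hypothetical, and the sketch assumes the categories have already been normalized to shared labels.

```python
from collections import Counter

# Hypothetical sorts: one dict per participant, mapping card -> category.
sorts = [
    {"Card1": "Category1", "Card2": "Category1", "Card3": "Category2"},
    {"Card1": "Category1", "Card2": "Category2", "Card3": "Category2"},
    {"Card1": "Category1", "Card2": "Category1", "Card3": "Category3"},
]
cards = ["Card1", "Card2", "Card3"]
categories = ["Category1", "Category2", "Category3"]


def binary_matrix(sort, cards, categories):
    """Rows are categories, columns are cards; 1 marks membership."""
    return [[1 if sort.get(card) == cat else 0 for card in cards]
            for cat in categories]


def count_matrix(sorts, cards, categories):
    """Add up the per-participant binary matrices cell by cell."""
    counts = Counter((cat, card) for sort in sorts for card, cat in sort.items())
    return [[counts[(cat, card)] for card in cards] for cat in categories]


first = binary_matrix(sorts[0], cards, categories)   # binary data
totals = count_matrix(sorts, cards, categories)      # interval-scaled data

for cat, row in zip(categories, totals):
    print(cat, row)
```

Each cell of `totals` is the number of participants who placed that card in that category, which is the aggregated form described for interval-scaled data.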

Figure 1. Different examples of card sorting (p = 6; n = 5) representing (a) an open unnormalized card sorting, (b) a nested-category card sorting, and (c) a normalized-category card sorting.

Table 1. Example of Symmetric Binary Variables (p = 6; n = 5)

Table 2. Example of Asymmetric Binary Variables (p = 6; n = 5)

Table 3. Example of Interval-Scaled Variables (p = 6; n = 5)

A second step toward producing the dendrogram is the selection of a suitable distance method according to the type of the card-sorting data. As the dendrogram is based on a hierarchical clustering approach, it requires a distance or dissimilarity matrix containing the pairwise distances among variables. There are several methods to obtain a distance matrix (Drost, 2018), and the choice mostly depends on the type of the data. In the case of card-sorting data, there are different distance metrics to apply:

Lq metrics (also known as Minkowski family): The general formula of the Minkowski distance (d) between two variables (v₁ and v₂) of length n can be defined as the following:
$d(v_1, v_2) = \sqrt[q]{\sum_{i=1}^{n} |v_{1,i} - v_{2,i}|^q}$

This family includes the most used distance metrics such as Euclidean (q = 2) or Manhattan (q = 1), where the latter is also known as city-block, taxicab, or the snake metric. These metrics are based on minimizing the sums of dissimilarities. Euclidean distance, representing the straight-line distance between two points, is probably one of the most utilized, and it can be used with interval-scaled data but also with symmetric binary data (Kaufman & Rousseeuw, 2009), providing suitable results in both cases.

- L₁ metrics: They are generally based on minimizing the average sums of dissimilarities, and include metrics such as Sorensen, Soergel, and Gower, to cite a few. One of the most used is the Gower distance metric, which is more suitable for symmetric binary variables. It is based on calculating the average of partial dissimilarities among variables. This way, the general formula of the Gower distance (d) between two variables (v₁ and v₂) of length n and range R can be defined as the following:
  $d(v_1, v_2) = \frac{1}{n} \sum_{i=1}^{n} \frac{|v_{1,i} - v_{2,i}|}{R_i}$
Inner Product metrics: They are based on the inner product theory of linear algebra, and they can be applied to asymmetric binary variables. This family includes metrics such as Harmonic Mean, Cosine, Dice, Jaccard, and so on. The most utilized metric is the Jaccard distance, based on the Jaccard Coefficient (i.e., S-coefficient), which can be calculated as the ratio of the size of the intersection of two sets to the size of their union. More specifically, the Jaccard distance (d) between two variables (v₁ and v₂) of length n can be defined as the following:
$d(v_1, v_2) = \frac{\sum_{i=1}^{n} (v_{1,i} - v_{2,i})^2}{\sum_{i=1}^{n} v_{1,i}^2 + \sum_{i=1}^{n} v_{2,i}^2 - \sum_{i=1}^{n} v_{1,i} v_{2,i}}$

More specific metrics can be customized when asymmetric weighted variables are considered. This might imply assigning specific weights or creating customized contingency tables to calculate the Jaccard distance.
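For 0/1 data, the Jaccard formula above reduces to one minus the ratio of the intersection to the union of the two sets of cards. The short Python sketch below checks this on two hypothetical vectors that mimic a parent category and a nested subcategory over six cards; the vectors and function names are illustrative assumptions, not data from this study.

```python
def jaccard_distance(v1, v2):
    """Jaccard distance written term by term as in the formula above."""
    numerator = sum((a - b) ** 2 for a, b in zip(v1, v2))
    denominator = (sum(a * a for a in v1) + sum(b * b for b in v2)
                   - sum(a * b for a, b in zip(v1, v2)))
    # Two all-zero vectors share nothing and differ in nothing: treat as 0.
    return numerator / denominator if denominator else 0.0


def jaccard_from_sets(v1, v2):
    """Set-based form for binary data: 1 - |intersection| / |union|."""
    s1 = {i for i, a in enumerate(v1) if a}
    s2 = {i for i, b in enumerate(v2) if b}
    union = s1 | s2
    return 1 - len(s1 & s2) / len(union) if union else 0.0


# Hypothetical asymmetric binary rows: the subcategory holds one of the
# three cards that belong to its parent category.
category1 = [1, 1, 0, 1, 0, 0]
category1_1 = [0, 0, 0, 1, 0, 0]

print(jaccard_distance(category1, category1_1))
print(jaccard_from_sets(category1, category1_1))
```

Both functions ignore positions where the two vectors are 0, which is why the measure suits asymmetric binary variables: joint absences do not make two categories look more alike.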

A simple example to understand the application of the principal metrics is provided. Let us consider the following six variables representing different categories in which users have attempted to classify cards. Specifically, the following representation is a working matrix where categories (six in this case) are represented in rows and cards (a total of 40) in columns. This is similar to the example described in Figure 1.a and Table 1, where card-sorting data indicate the relationship between each card and the categories where it has been classified. This is done by using symmetric binary values, where 1 indicates that a card was classified into a specific category and 0 the opposite.

Categories	Cards
fruits:	0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0
vegetables:	1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0
produce:	1 1 1 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 1 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0
quick_energy:	0 0 0 1 0 0 1 1 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 1 1 0 0 1 0 1 1 1 0 0 1 0 1 0 0 0
dinners:	0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0
main_dishes:	0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0

Considering such categories, it is possible to calculate their distances or dissimilarities based on appropriate metrics for symmetric binary variables. This way, Lq metrics such as Euclidean, Manhattan, and Minkowski, and L₁ metrics such as Sorensen, Gower, and Soergel have been applied (see Table 4).

Table 4. Examples of Distance Calculations Based on the Previously Presented Categories

Distance among variables	Euclidean	Manhattan	Minkowski	Sorensen	Gower	Soergel
d(fruits, vegetables)	3.46	12.00	1.64	1.00	0.30	1.00
d(produce, quick_energy)	4.79	23.00	1.87	0.92	0.57	0.95
d(dinners, main_dishes)	0.00	0.00	0.00	0.00	0.00	0.00

As shown in Table 4, and taking into account the results of most metrics (Euclidean, Manhattan, Minkowski, and Gower), it can be concluded that the categories produce and quick_energy are strongly dissimilar (very distant) from each other, whereas fruits and vegetables can be considered as dissimilar, and dinners and main_dishes are strongly similar (closer). Note that similarity and dissimilarity in this context are related to the cards classified in each category, which implies that, for instance, dinners and main_dishes can be considered as similar categories as they classify the same cards (i.e., cards 16, 27, and 35, representing hamburger, pizza, and spaghetti, respectively, in this example).
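The distances in Table 4 can be recomputed from the working matrix with a few lines of Python. The sketch below is only an illustration: the function names are hypothetical, the Sorensen and Soergel definitions are the usual L₁-family forms (sum of absolute differences divided by the sum of sums or the sum of maxima), and the Minkowski order q = 5 is an assumption inferred from the tabulated values, since the text does not state it. The results agree with Table 4 up to what appears to be truncation to two decimals.

```python
# Working-matrix rows copied from the example above (6 categories x 40 cards).
ROWS = {
    "fruits": "0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0",
    "vegetables": "1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0",
    "produce": "1 1 1 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 1 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0",
    "quick_energy": "0 0 0 1 0 0 1 1 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 1 1 0 0 1 0 1 1 1 0 0 1 0 1 0 0 0",
    "dinners": "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0",
    "main_dishes": "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0",
}
matrix = {name: [int(x) for x in bits.split()] for name, bits in ROWS.items()}


def minkowski(v1, v2, q):
    """L_q distance; q = 1 is Manhattan and q = 2 is Euclidean."""
    return sum(abs(a - b) ** q for a, b in zip(v1, v2)) ** (1 / q)


def sorensen(v1, v2):
    return (sum(abs(a - b) for a, b in zip(v1, v2))
            / sum(a + b for a, b in zip(v1, v2)))


def gower(v1, v2, ranges=None):
    ranges = ranges or [1] * len(v1)  # binary variables have range 1
    return sum(abs(a - b) / r for a, b, r in zip(v1, v2, ranges)) / len(v1)


def soergel(v1, v2):
    return (sum(abs(a - b) for a, b in zip(v1, v2))
            / sum(max(a, b) for a, b in zip(v1, v2)))


PAIRS = [("fruits", "vegetables"), ("produce", "quick_energy"),
         ("dinners", "main_dishes")]
Q_ASSUMED = 5  # assumption: order that matches the Minkowski column

for a, b in PAIRS:
    v1, v2 = matrix[a], matrix[b]
    values = [minkowski(v1, v2, 2), minkowski(v1, v2, 1),
              minkowski(v1, v2, Q_ASSUMED), sorensen(v1, v2),
              gower(v1, v2), soergel(v1, v2)]
    print(f"d({a}, {b})", " ".join(f"{v:.4f}" for v in values))
```

Because fruits and vegetables share no cards, Sorensen and Soergel both reach their maximum of 1 for that pair, while Gower stays at 0.30 because it also counts the 28 cards that belong to neither category as agreement.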

In the case of multivariate methods, such as the hierarchical agglomerative clustering, metrics based on the minimization of sums or averages of dissimilarities (L_q and L₁ families, including metrics such as Euclidean, Manhattan, Minkowski, and Gower) may prove more robust than others based on the sums of squares (L₂ family, including metrics such as Squared Euclidean, Pearson, Neyman, and so on), being also more consistent for larger sample sizes and less sensitive to outliers (Kaufman & Rousseeuw, 2009).

Once the distance matrix has been calculated, the next step is to select the right linkage criterion for producing the dendrogram. In general, hierarchical clustering can be implemented through a bottom-up (agglomerative) or top-down (divisive) approach. The agglomerative approach is the most commonly used, and it initially considers each variable as a single cluster. The method then successively merges pairs of clusters until all the clusters have been merged into a single one containing all the variables. The linkage criterion defines the way the distance between two clusters is calculated to decide whether to merge them. There exist different approaches to consider. The following are the most utilized:

Centroid: Known as UPGMC (Unweighted Pair Group Method using Centroids), it is based on the between-cluster dissimilarity. This way, the dissimilarity between two clusters is calculated as the distance between their geometric centroids.
Median: Known as WPGMC (Weighted Pair Group Method using Centroids), it is based on the cluster median. It works like the Centroid method, except that the centroid of a merged cluster is taken as the midpoint of the centroids of the two clusters being joined, regardless of their sizes. This way, the dissimilarity between two clusters is calculated as the distance between these median points.
Mcquitty: Also known as WPGMA (Weighted Pair Group Method using Arithmetic Averages) or the simple-average method, it is based on the average distance. This way, when two clusters are merged, the dissimilarity between the new cluster and any other cluster is calculated as the simple average of the dissimilarities from each of the two merged clusters to that cluster, regardless of their sizes.
Complete-linkage: Also known as the furthest neighbor method, it is based on the minimum inter-similarity. This way, the dissimilarity between two clusters is calculated as the distance between their two most distant variables, one taken from each cluster.
Single-linkage: Also known as the nearest neighbor method, it is based on the maximum inter-similarity. This way, the dissimilarity between two clusters is calculated as the distance between their two closest variables, one taken from each cluster.
Ward: Also known as MISSQ (Minimization of the Increase of Sum of Squares), it is based on the minimum variance, trying to minimize the within-cluster sum of squares. In this sense, the dissimilarity between two clusters is calculated as the increase in the sum of squared deviations from the variables to the centroids that merging the two clusters would produce.

All the linkages can be utilized and no specific restrictions have to be initially considered. However, analyzing the different linkage criteria according to the way they operate, Single and Complete linkages reduce the fitness to a single similarity between a pair of variables, which usually fails to reflect the distribution of the variables in a cluster. This may produce undesirable clusters. In addition, Single-linkage produces chaining effects when variables are close together, identifying long chain-like clusters. Complete-linkage overcomes this problem, but it is sensitive to outliers. On the other hand, Centroid and Median linkages may produce inversions, which occur when two clusters being merged appear closer to each other than pairs of clusters merged earlier.

In general, Ward is the most commonly used approach, and it can be used with both interval-scaled and symmetric binary variables. It minimizes the loss of information between the individual variables and the cluster centroid level. Previous studies have shown that, in a broad context, Ward behaves better than other linkage methods (Saraçli et al., 2013).
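To make the effect of the linkage criterion concrete, the following Python sketch runs a naive agglomerative clustering on a small distance matrix and prints the merge heights for several criteria. It relies on the Lance-Williams update formula, a standard way of expressing these linkages that is not described in this paper, and it covers only four of the six criteria. The four one-dimensional points are hypothetical, and Ward is fed squared Euclidean distances, as that update assumes.

```python
from itertools import combinations


def lance_williams(method, ni, nj, nk):
    """Coefficients (alpha_i, alpha_j, beta, gamma) of the update formula."""
    if method == "single":
        return 0.5, 0.5, 0.0, -0.5
    if method == "complete":
        return 0.5, 0.5, 0.0, 0.5
    if method == "mcquitty":
        return 0.5, 0.5, 0.0, 0.0
    if method == "ward":  # expects squared Euclidean distances as input
        total = ni + nj + nk
        return (ni + nk) / total, (nj + nk) / total, -nk / total, 0.0
    raise ValueError(f"unsupported linkage: {method}")


def agglomerate(dist, method):
    """Cluster a full symmetric distance matrix bottom-up.

    Returns the merges in order as (members_a, members_b, height).
    """
    clusters = {i: (i,) for i in range(len(dist))}
    d = {frozenset(p): dist[p[0]][p[1]] for p in combinations(clusters, 2)}
    merges = []
    next_id = len(dist)
    while len(clusters) > 1:
        # Pick the closest pair of active clusters.
        i, j = min(combinations(sorted(clusters), 2),
                   key=lambda p: d[frozenset(p)])
        dij = d[frozenset((i, j))]
        merges.append((clusters[i], clusters[j], dij))
        ni, nj = len(clusters[i]), len(clusters[j])
        # Distance from every other cluster k to the merged cluster.
        for k in clusters:
            if k in (i, j):
                continue
            dki, dkj = d[frozenset((k, i))], d[frozenset((k, j))]
            ai, aj, b, g = lance_williams(method, ni, nj, len(clusters[k]))
            d[frozenset((k, next_id))] = (ai * dki + aj * dkj + b * dij
                                          + g * abs(dki - dkj))
        clusters[next_id] = clusters.pop(i) + clusters.pop(j)
        next_id += 1
    return merges


points = [0, 1, 5, 7]  # hypothetical one-dimensional variables
dist = [[abs(a - b) for b in points] for a in points]
squared = [[(a - b) ** 2 for b in points] for a in points]

for method in ("single", "complete", "mcquitty"):
    print(method, [h for _, _, h in agglomerate(dist, method)])
print("ward", [h for _, _, h in agglomerate(squared, "ward")])
```

On this toy input all four criteria produce the same grouping and differ only in the height of the final merge, which is the height at which the two pairs join in the dendrogram. On real card-sorting data the groupings themselves can differ.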

As for asymmetric binary data, the Jaccard distance can be utilized together with the previously discussed linkage criteria. However, it is preferable to utilize monothetic approaches such as DIVCLUS-T (Chavent et al., 2007) or MONA (Kaufman & Rousseeuw, 2009).

Once the dendrogram has been produced, a final step consists in analyzing the goodness of the solution obtained. To this end, the Cophenetic Correlation Coefficient (CCC) can be used to analyze the goodness of a dendrogram (Saraçli et al., 2013). CCC can be defined as follows:

$CCC = \frac{\sum_{i<j}(x(i,j) - \bar{x})(t(i,j) - \bar{t})} {\sqrt{[\sum_{i<j}(x(i,j)-\bar{x})^2][\sum_{i<j}(t(i,j)-\bar{t})^2]}}$

where x(i,j) represents the ordinary Euclidean distance between variables i and j; t(i,j) represents the dendrogrammatic distance between the model points T_i and T_j (i.e., the height of the node at which these two variables are first joined together); $\bar{x}$ represents the average value of x(i,j); and $\bar{t}$ represents the average value of t(i,j).

In brief, CCC describes the linear correlation between the dissimilarities of each pair of variables and their corresponding cophenetic distances. The cophenetic distances represent an intergroup dissimilarity measure of two variables that were merged in the same cluster. This method has been used in different domains to analyze whether a dendrogram represents an appropriate solution or not. This way, a high correlation (closer to 1) between the original distances and the cophenetic ones indicates a high goodness for a given dendrogram.
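A minimal Python sketch of the CCC computation is given below. It assumes the dendrogram is available as an ordered list of merges of the form (members_a, members_b, height); that representation, the four one-dimensional points, and the merge heights are all hypothetical and chosen only to keep the example small.

```python
from itertools import combinations
from math import sqrt


def cophenetic_distances(merges):
    """t(i, j): height of the merge at which i and j first share a cluster."""
    t = {}
    for left, right, height in merges:
        for i in left:
            for j in right:
                t[frozenset((i, j))] = height
    return t


def ccc(x, t):
    """Pearson correlation between original and cophenetic distances."""
    pairs = list(x)
    mean_x = sum(x[p] for p in pairs) / len(pairs)
    mean_t = sum(t[p] for p in pairs) / len(pairs)
    cov = sum((x[p] - mean_x) * (t[p] - mean_t) for p in pairs)
    var_x = sum((x[p] - mean_x) ** 2 for p in pairs)
    var_t = sum((t[p] - mean_t) ** 2 for p in pairs)
    return cov / sqrt(var_x * var_t)


# Hypothetical variables on a line and their pairwise distances x(i, j).
points = [0, 1, 5, 7]
x = {frozenset((i, j)): abs(points[i] - points[j])
     for i, j in combinations(range(len(points)), 2)}

# Hypothetical dendrogram for these points: variables 0 and 1 join at
# height 1, variables 2 and 3 at height 2, and the two pairs at height 7.
merges = [((0,), (1,), 1), ((2,), (3,), 2), ((0, 1), (2, 3), 7)]
t = cophenetic_distances(merges)

print(round(ccc(x, t), 4))
```

Computing this value for dendrograms built with different linkage criteria over the same distance matrix gives a simple way of ranking the candidate solutions.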

There exist other approaches intended to compare two different dendrograms in order to analyze which solution is the best suited. On the one hand, the Robinson-Foulds metric, also known as the symmetric difference metric (Pattengale et al., 2007), can be used to calculate the topological distance between two dendrograms. This metric counts the branches of one dendrogram whose combination of variables does not appear in any subtree of the other dendrogram, plus the same count the other way round. This way, two dendrograms can be considered similar if they are isomorphic and their isomorphism preserves the labeling. On the other hand, the Baker-Hubert Gamma index (Desgraupes, 2013) can be used to obtain the correlation between two dendrograms. This index provides a concordance measure: two dendrograms are concordant if the values are classified in the same order in both dendrograms. Like CCC, the Gamma index takes values between -1 and 1. Values closer to 0 indicate that two dendrograms are not statistically similar.
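The symmetric difference idea can be sketched in a few lines of Python. This is a simplified rooted variant that compares the sets of variables grouped by each merge and ignores merge heights. The two dendrograms, the card labels, and the (members_a, members_b, height) representation are hypothetical.

```python
def clusters_of(merges):
    """Set of variable groups created by the merges of a dendrogram."""
    return {frozenset(left) | frozenset(right) for left, right, _ in merges}


def robinson_foulds(merges_a, merges_b):
    """Number of groups that appear in exactly one of the two dendrograms."""
    return len(clusters_of(merges_a) ^ clusters_of(merges_b))


# Hypothetical dendrogram 1: two pairs of cards that join at the top.
balanced = [
    (("Card1",), ("Card2",), 1.0),
    (("Card3",), ("Card4",), 2.0),
    (("Card1", "Card2"), ("Card3", "Card4"), 7.0),
]

# Hypothetical dendrogram 2: cards are chained onto one growing cluster.
chained = [
    (("Card1",), ("Card2",), 1.0),
    (("Card1", "Card2"), ("Card3",), 4.0),
    (("Card1", "Card2", "Card3"), ("Card4",), 6.0),
]

print(robinson_foulds(balanced, balanced))
print(robinson_foulds(balanced, chained))
```

A distance of 0 means both dendrograms group the cards identically at every level. Here the two trees differ in one group each, {Card3, Card4} versus {Card1, Card2, Card3}, so the distance is 2.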