Journal of Biomedical Informatics 63 (2016) 66–73
Using concept hierarchies to improve calculation of patient similarity

Dominic Girardi (a), Sandra Wartner (a), Gerhard Halmerbauer (b), Margit Ehrenmüller (b), Hilda Kosorus (c), Stephan Dreiseitl (d,*)

(a) RISC Software GmbH, Research Unit Medical Informatics, Johannes Kepler University Linz/Hagenberg, Austria
(b) Department of Process Management in Health Care, University of Applied Sciences Upper Austria, Steyr, Austria
(c) Institute for Application Oriented Knowledge Processing, Johannes Kepler University Linz, Austria
(d) Department of Software Engineering, University of Applied Sciences Upper Austria, Hagenberg, Austria
(*) Corresponding author.
Article info

Article history: Received 4 March 2016; Revised 31 May 2016; Accepted 26 July 2016; Available online 28 July 2016.

Keywords: Distance measure using concept hierarchy; ICD-10 taxonomy; Patient similarity calculation
Abstract

Objective: We introduce a new distance measure that is better suited than traditional methods to detecting similarities in patient records by referring to a concept hierarchy.
Materials and methods: The new distance measure improves on distance measures for categorical values by taking the path distance between concepts in a hierarchy into account. We evaluate and compare the new measure on a data set of 836 patients.
Results: The new measure shows marked improvements over the standard measures, both qualitatively and quantitatively. Using the new measure for clustering patient data reveals structure that is otherwise not visible. Statistical comparison of distances within patient groups with similar diagnoses shows that the new measure is significantly better at detecting these similarities than the standard measures.
Conclusion: The new distance measure is an improvement over the current standard whenever a hierarchical arrangement of categorical values is available.

© 2016 Elsevier Inc. All rights reserved.
Corresponding author: S. Dreiseitl, [email protected]. http://dx.doi.org/10.1016/j.jbi.2016.07.021

1. Introduction

The ubiquitous availability of data processing systems has led to an ever-increasing amount of data in healthcare environments [1,2]. While some of this data is available only in unstructured formats and is thus difficult to process automatically (recent developments in image understanding [3] and natural language processing [4] notwithstanding), several areas of biomedicine have seen great strides towards standardized formats for data representation. Examples of such formats include DICOM for imaging data, the Gene Ontology for molecular biology terms, SNOMED CT for clinical terminology, and ICD-10 for the classification of diseases.

While standardized data formats were mainly developed to facilitate information exchange between computers, structured data representation also benefits the human understanding of data. These benefits may be minor (for example, providing descriptive summary statistics of the data) or major, as when visualizing complex gene activation pathways in cells. The ability to graphically represent data may in fact be the biggest advantage of structured data formats, because it allows the human pattern recognition apparatus to derive meaning from the data
[5,6]. In many instances, this requires a similarity measure to be defined on the data, so that similarities in the original data space can be mapped (in a meaningful manner) to a 2D representation on screen.

One context in which similarity information can be obtained from data is when the data concepts are arranged in a hierarchical manner [7,8]. An example of such a concept hierarchy is the International Classification of Diseases catalog ICD-10, maintained by the World Health Organization [9]. It contains over 12,000 disease classifications organized in three levels, with 22 level 1 elements.

Consider how a concept hierarchy can help to detect similarities between two patients A and B. Patient A suffers from influenza due to identified avian influenza virus (ICD-10 code J09.0) and a fracture of the fibula alone (ICD-10 code S82.4). Patient B suffers from pneumonia (ICD-10 code J18.9) and a fracture of the lateral malleolus (ICD-10 code S82.6). Regarding these diagnoses only as nominal (categorical) values, patient A shows no similarity to patient B, because their diseases and disease codes are all different. From a real-world point of view, however, it is obvious that the diagnoses of both patients are quite similar, because influenza is similar to pneumonia, and a broken fibula is similar to a broken malleolus. What is needed to accurately reflect this similarity is a distance measure that takes into account the hierarchical structure of the
concept hierarchy. Furthermore, this measure must also calculate the distance between sets of concepts.

The work presented here is motivated by the problem of finding and visualizing similarities in categorical clinical patient data, where every patient is represented as a set of ICD-10 codes. Two patients are considered similar if they show similar or overlapping sets of diagnoses. The main hypothesis is that patients with similar diagnoses (meaning the same diagnoses on a high level of the hierarchy) form clearer clusters when the hierarchical structure is incorporated into the distance measure than when it is not. We provide two ways to show that this hypothesis holds. First, we give a graphical representation of clusterings and show that patients with similar diagnoses form clearly visible clusters with the new hierarchical distance measure, while they do not when the hierarchical structure is ignored. Second, we calculate the inter-record distances of patients with similar diagnoses (same diagnoses on ICD-10 level 2) and compare the results of the new hierarchical distance measure with those of standard distance measures.

The improved, more realistic distance calculation contributes to a number of applications. The data presented in this paper are taken from a clinical benchmarking program; the improved distance calculation allows a more accurate visualization of the benchmarking data and subsequently more reliable and understandable benchmarks. More generally, distance-based data visualization methods (e.g., dimensionality reduction or non-linear mappings) are important tools for exploratory data analysis, and a more accurate distance measure leads to more expressive data visualizations. Moreover, the calculation of distance or similarity between two data sets is at the core of any case-based reasoning approach [10], with many applications in biomedicine, one example being decision support systems.
Since such systems depend so heavily on reasoning by similarity (using similar old cases to reason about new ones), improvements in the assessment of case similarities directly lead to improvements in the capabilities of such systems.
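The motivating example above is easy to reproduce in code: treating the ICD-10 codes as opaque categorical labels, a plain set-overlap measure (the Jaccard similarity, formally introduced in Section 2.3) reports zero similarity for the two patients. A minimal sketch, using the codes quoted in the introduction:

```python
# The two example patients from the introduction, as plain sets of codes.
patient_a = {"J09.0", "S82.4"}  # avian influenza; fracture of fibula
patient_b = {"J18.9", "S82.6"}  # pneumonia; fracture of lateral malleolus

def jaccard_similarity(x, y):
    """|X intersect Y| / |X union Y| for two sets of categorical codes."""
    return len(x & y) / len(x | y)

# No shared code, so the similarity is 0 despite the clinical similarity.
print(jaccard_similarity(patient_a, patient_b))  # 0.0
```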
2. Related work

The notions of patient distance and patient similarity have been widely investigated in the biomedical literature. In this work, we focus on the notion of semantic similarity, where the semantics of concepts are arranged in a hierarchical manner. Two problems arise in this context: how to measure the distance between individual concepts, and how to measure the distance between sets of concepts.

2.1. Semantic similarity between concepts

One can distinguish two ways of using an ontology or taxonomy to determine the semantic similarity between concepts: the edge-based approach and the information content-based approach [11]. Our approach, however, deals only with hierarchical structure in the data, and ignores frequency distributions in the data set.

The similarity, dissimilarity or distance between two concepts in a hierarchy is calculated by considering the hierarchy tree as an acyclic graph and applying graph distance measures to it. Boutsinas and Papastergiou [12] present an algorithm that calculates the distance between two concepts in a hierarchy by the hierarchy level of their nearest common ancestor node. The lower the nearest common ancestor node is located in the tree (with the tree root considered as the top), the more similar the two concepts are. A similar approach can be observed in the work of Hammer et al. [13], who extend the concept of a Self-Organizing Map (SOM) [14]. Their generalized SOM treats any enumeration as a hierarchy; flat lists are hierarchies with only one level.

Intuitively, the similarity of different concepts in an ontology is measured by computing their edge distance within the ontology. This means that the closer two concepts are in the ontology, the more similar we consider them to be [15]:

sim(c_1, c_2) = minimum number of edges separating c_1 and c_2,

where c_1 and c_2 are the node representations of the two concepts in the ontology. Wu and Palmer [16] redefined the edge-based similarity measure, taking into account the depth of the nodes in the hierarchical graph:

sim(c_1, c_2) = \frac{2 N_3}{N_1 + N_2 + 2 N_3}, \qquad (1)

where N_1 and N_2 are the number of nodes from c_1 and c_2, respectively, to c_3, the least common superconcept (LCS) of c_1 and c_2, and N_3 is the number of nodes on the path from c_3 to the root node. Li et al. [17] defined the similarity between two concepts as:

sim(c_1, c_2) = \begin{cases} e^{-\alpha(N_1 + N_2)} \cdot \frac{e^{\beta N_3} - e^{-\beta N_3}}{e^{\beta N_3} + e^{-\beta N_3}} & \text{if } c_1 \neq c_2, \\ 1 & \text{otherwise}, \end{cases} \qquad (2)

where the parameters \alpha and \beta scale the contribution of the two values N_1 + N_2 and N_3. On a benchmark data set, they obtained the optimal parameter settings \alpha = 0.2 and \beta = 0.6.

The information content-based approach for computing the semantic similarity between concepts was introduced by Resnik [18]. It assumes that the frequency with which one term appears with another within a given ontology represents the similarity of the two terms. Resnik [19] showed that by associating probabilities with concepts in the taxonomy, it is possible to capture the same idea of edge-based similarity, but avoid the unreliability of uniform edges. Resnik [18] defines the similarity of two concepts as

sim(c_1, c_2) = \max_{c_3 \in S(c_1, c_2)} \left[ -\log(p(c_3)) \right],

where S(c_1, c_2) is the set of all superconcepts of c_1 and c_2, and p(c_3) is the relative frequency of concept c_3. Compared to the edge-counting method, the similarity measure introduced by Resnik [19] is conceptually quite simple; however, it is not sensitive to the problem of varying link distances. In addition, by combining an ontological structure with empirical probability estimates, it provides a way of adapting a static knowledge structure to multiple contexts. This similarity measure was further improved by Lin [20], who introduced an information-theoretic definition of similarity. Based on this notion, he defined the semantic similarity in a taxonomy as
sim(c_1, c_2) = \frac{2 \log(p(c_3))}{\log(p(c_1)) + \log(p(c_2))},

where c_3 is the LCS of c_1 and c_2. Here, one can notice the similarities with the measure in Eq. (1).

2.2. Semantic similarity between sets of concepts

Defining a semantic similarity measure between sets of concepts was the next step in computing semantic similarity, mainly for information retrieval purposes. In Bouquet et al. [21], the ontological distance between sets of concepts X and Y is computed by summing up the distances between every pair (c_1, c_2), where c_1 \in X and c_2 \in Y. Haase et al. [22] used the edge-based similarity measure between concepts defined in Eq. (2) to introduce the similarity between sets of concepts as
\mathrm{Sim}(X, Y) = \frac{1}{|X|} \sum_{c_1 \in X} \max_{c_2 \in Y} \, sim(c_1, c_2),

which computes the average similarity between each c_1 \in X and the most similar concept in Y. In the following, we will use the associated notion of concept distance

d_C(X, Y) = 1 - \mathrm{Sim}(X, Y). \qquad (3)
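To make the chain from Eq. (2) to Eq. (3) concrete, the sketch below implements Li's concept similarity and the Haase-style set distance d_C. The toy tree, its concept names, and the edge-counting convention for N_1, N_2, N_3 are illustrative assumptions (counting conventions for nodes vs. edges vary between papers), not taken from the paper's data:

```python
import math

# Hypothetical toy hierarchy as a child -> parent map; "root" is the top.
PARENT = {
    "disease": "root",
    "respiratory": "disease", "digestive": "disease",
    "influenza": "respiratory", "pneumonia": "respiratory",
    "hernia": "digestive",
}

def path_to_root(c):
    """Nodes from c up to and including the root."""
    path = [c]
    while c in PARENT:
        c = PARENT[c]
        path.append(c)
    return path

def depth(c):
    return len(path_to_root(c)) - 1

def lcs(c1, c2):
    """Least common superconcept: first ancestor of c2 that is also above c1."""
    ancestors = set(path_to_root(c1))
    for c in path_to_root(c2):
        if c in ancestors:
            return c

def li_sim(c1, c2, alpha=0.2, beta=0.6):
    """Eq. (2), Li et al.; N1, N2 counted here as edges from c1, c2 to
    their LCS, N3 as the depth of the LCS (an assumed convention)."""
    if c1 == c2:
        return 1.0
    c3 = lcs(c1, c2)
    n1, n2, n3 = depth(c1) - depth(c3), depth(c2) - depth(c3), depth(c3)
    # (e^{bN3} - e^{-bN3}) / (e^{bN3} + e^{-bN3}) is tanh(b * N3)
    return math.exp(-alpha * (n1 + n2)) * math.tanh(beta * n3)

def haase_sim(X, Y, sim=li_sim):
    """Sim(X, Y): average over c1 in X of the best match in Y."""
    return sum(max(sim(c1, c2) for c2 in Y) for c1 in X) / len(X)

def concept_distance(X, Y):
    """Eq. (3): d_C(X, Y) = 1 - Sim(X, Y)."""
    return 1.0 - haase_sim(X, Y)

print(concept_distance({"influenza", "hernia"}, {"pneumonia", "hernia"}))
```

Note that Sim (and hence d_C) is not symmetric in X and Y, since the maximum is taken only over the second set; the measure introduced in Section 3 avoids this asymmetry.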
2.3. Jaccard distance

In the absence of a hierarchical structure linking concepts, the standard method to calculate the similarity of two sets X and Y is the Jaccard similarity

J(X, Y) = \frac{|X \cap Y|}{|X \cup Y|},

which is also known as Tanimoto similarity when the sets are represented as bit vectors [23,24]. The complementary notion of Jaccard distance is readily obtained from the similarity measure as 1 - J(X, Y), or equivalently as

d_J(X, Y) = \frac{|X \,\triangle\, Y|}{|X \cup Y|}, \qquad (4)
where X \triangle Y denotes the symmetric difference (X \setminus Y) \cup (Y \setminus X) of all elements that are in one, but not both, of the sets. This perspective allows us to focus directly on the distance of sets, which makes it easier to extend the concept to sets with hierarchical structure.

3. Method

When master data catalogs of information systems are organized hierarchically and referenced by transaction data, these concept hierarchies need to be considered when calculating the similarity of two transaction data sets. In the following, we consider a hierarchy of concepts represented as a tree. The distance between two nodes (concepts) x and y in the tree is
d(x, y) = \frac{p_{\min}(x, y)}{l(x) + l(y)}, \qquad (5)
where p_{\min}(x, y) denotes the minimum number of edges between nodes x and y, and l(x) the depth (level) of a node in the hierarchy. Note that the denominator l(x) + l(y) is the length of the longest possible path between any two nodes at these depths in the tree. This means that if the LCS of two nodes x and y is the root node, then p_{\min}(x, y) = l(x) + l(y), and consequently d(x, y) = 1, the same as the Jaccard distance between two singleton sets with unequal elements. Also in analogy to the Jaccard distance, d(x, x) = 0, because p_{\min}(x, x) = 0. Note that our proposed hierarchical distance measure can be extended in a straightforward manner to concept hierarchies with edge weights, simply by substituting sums of edge weights for the number of edges in Eq. (5). Building on this notion of distance between nodes in a tree, we define the hierarchical distance d_H between sets X and Y of nodes as
d_H(X, Y) = \frac{1}{|X \cup Y|} \left( \sum_{x \in X \setminus Y} \frac{1}{|Y|} \sum_{y \in Y} d(x, y) + \sum_{y \in Y \setminus X} \frac{1}{|X|} \sum_{x \in X} d(y, x) \right). \qquad (6)

Note the similarity of this expression to the formula for the Jaccard distance in Eq. (4): both feature the factor 1/|X \cup Y|, and both measure in the numerator how unequal the sets are, albeit in a much more complicated manner for the hierarchical distance introduced here. Furthermore, observe that summing over x \in X \setminus Y and y \in Y \setminus X means that we do not consider elements in the intersection of the two sets. Every element not in the intersection contributes to the overall distance calculation its average tree node distance to all the elements of the other set.

We next provide an example that illustrates how the introduction of a hierarchical structure allows us to calculate a more fine-grained distance between sets than the cruder Jaccard distance. For this, consider two sets

S_1 = \{a, c, n\} and S_2 = \{c, i, o\}.

By Eq. (4), their Jaccard distance is |S_1 \triangle S_2| / |S_1 \cup S_2| = 4/5. Viewed graphically as below, one can observe that every mismatch between elements contributes an additive term of 1 (to be then scaled by 1/5) to the Jaccard distance:

Element        a   c   i   n   o
S_1            x   x       x
S_2                x   x       x
Contribution   1   0   1   1   1

We now embed these two sets in the larger hierarchy shown in Fig. 1. The hierarchy contains six levels ranging from 0 to 5, with only the root node on level 0. The basic idea of the hierarchical distance measure is to reduce the mismatch penalty points of the Jaccard distance by accounting for the distance of the elements in the tree according to Eq. (5). Although all the set elements are placed at the leaves of the tree in Fig. 1, the algorithm works the same for inner nodes. The summation symbols in Eq. (6) indicate that the distance calculation iterates over all pairs of distinct elements in the two sets. When an element occurs in both sets, its distance contribution stays at 0. For two distinct elements, the term d(x, y) potentially reduces the mismatch penalty; by how much is determined by the relative positions of the elements in the tree. Applying the formula in Eq. (6) to this example results in the expression

\frac{1}{3}\bigl(d(a, c) + d(a, i) + d(a, o)\bigr) + \frac{1}{3}\bigl(d(n, c) + d(n, i) + d(n, o)\bigr) + \frac{1}{3}\bigl(d(i, a) + d(i, c) + d(i, n)\bigr) + \frac{1}{3}\bigl(d(o, a) + d(o, c) + d(o, n)\bigr),

still to be scaled by 1/|S_1 \cup S_2| = 1/5. The individual contributions to this sum are given by Eq. (5); note that since d(x, y) = d(y, x), we omit the redundant entries:

d(a, c) = 2/(5+5) = 1/5,    d(a, i) = 8/(5+3) = 1,    d(a, o) = 7/(5+2) = 1,
d(n, c) = 7/(2+5) = 1,      d(n, i) = 5/(2+3) = 1,    d(n, o) = 2/(2+2) = 1/2,
d(i, c) = 8/(3+5) = 1,      d(o, c) = 7/(2+5) = 1.

The following table illustrates how the mismatch penalties in the Jaccard distance are reduced by the hierarchical distance measure. For every element not in the intersection of the two sets, the table gives the hierarchical contribution, i.e., the average tree distance to the elements of the other set in Eq. (6):

Element                     a      c   i   n    o
S_1                         x      x       x
S_2                                x   x        x
Jaccard contribution        1      0   1   1    1
Hierarchical contribution   11/15  0   1   5/6  5/6

As an example, the entry 11/15 for the hierarchical contribution of element a is computed as (1/3)(1/5 + 1 + 1). Summing up, we obtain
Fig. 1. An example concept hierarchy. The elements of set S_1 = \{a, c, n\} are circled, and the elements of set S_2 = \{c, i, o\} are underlined.
d_H(S_1, S_2) = \frac{1}{5} \left( \frac{11}{15} + 0 + 1 + \frac{5}{6} + \frac{5}{6} \right) = \frac{17}{25} = 0.68.
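The worked example can be checked with a short script. The parent map below is a stand-in for Fig. 1: any tree reproducing the node depths and path lengths used above gives the same distances (the unnamed inner nodes u*, v*, w* are assumptions, not the figure's actual labels):

```python
from fractions import Fraction

# A tree consistent with the example: a, c at level 5 (siblings),
# i at level 3, n, o at level 2 (siblings); u*, v*, w* are inner nodes.
PARENT = {
    "u1": "root", "u2": "u1", "u3": "u2", "u4": "u3", "a": "u4", "c": "u4",
    "v1": "root", "v2": "v1", "i": "v2",
    "w1": "root", "n": "w1", "o": "w1",
}

def path_to_root(x):
    path = [x]
    while x in PARENT:
        x = PARENT[x]
        path.append(x)
    return path

def level(x):
    return len(path_to_root(x)) - 1

def tree_distance(x, y):
    """Eq. (5): minimum edge count between x and y over l(x) + l(y)."""
    px, py = path_to_root(x), path_to_root(y)
    lca = next(c for c in px if c in set(py))   # least common superconcept
    p_min = px.index(lca) + py.index(lca)
    return Fraction(p_min, level(x) + level(y))

def jaccard_distance(X, Y):
    """Eq. (4): |X symmetric-difference Y| / |X union Y|."""
    return Fraction(len(X ^ Y), len(X | Y))

def hierarchical_distance(X, Y):
    """Eq. (6): each element outside the intersection contributes its
    average tree distance to the other set, scaled by 1 / |X union Y|."""
    total = sum(sum(tree_distance(x, y) for y in Y) / len(Y) for x in X - Y)
    total += sum(sum(tree_distance(y, x) for x in X) / len(X) for y in Y - X)
    return total / len(X | Y)

S1, S2 = {"a", "c", "n"}, {"c", "i", "o"}
print(jaccard_distance(S1, S2))       # 4/5
print(hierarchical_distance(S1, S2))  # 17/25
```

Exact rational arithmetic (fractions.Fraction) makes the comparison with the hand calculation above direct, with no floating-point rounding.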
Note that this corresponds to a reduction of 15% from the Jaccard distance d_J(S_1, S_2) = 0.8; the closer the nodes in the sets are in the tree, the larger this reduction will be.

4. Results

We evaluated the viability of this approach to distance calculation on a data set of 836 records, and compared the results with those of two other distance measures: the Jaccard distance d_J of Eq. (4) as the traditional, non-hierarchical choice of distance measure between sets, and the Haase-Li concept distance d_C of Eq. (3) as a different hierarchical distance measure. Each of the 836 patients whose data records are analyzed here suffers from exactly one of four primary diagnoses (at ICD-10 level 2), and a varying number of secondary diagnoses. Three of the four primary diagnoses belong to the ICD-10 top level category Diseases of the digestive system (K00-K93) and one to the top level category Endocrine, nutritional and metabolic diseases (E00-E90). The results of a graphical comparison between the three distance measures are shown in Section 4.2; a quantitative comparison is given in Section 4.3.

4.1. Data set

Test data was gathered at eight small to medium-sized Austrian hospitals. All patients with inguinal or femoral hernia repair, appendectomy, strumectomy, or thyroidectomy performed in the surgical wards of the participating hospitals were considered for entry into the study. The eligible sample included all consecutive discharges between October 1, 2010 and March 31, 2011, as well as between October 1, 2011 and October 31, 2012, at the participating hospitals. Patients with prior operative procedures performed before hernia repair were excluded from the analysis. Patients who underwent any other surgical procedure after hernia repair were excluded only if the subsequent procedure was not related to a complication of the hernia repair.
ICD-10 codes of the participating patients were gathered from hospital databases, along with an extensive panel of other information about the patients. In this data set, each patient suffers from exactly one primary diagnosis (either hernia, or a disease of the appendix, gall bladder, or thyroid gland) and a varying number of secondary diagnoses. The diagnoses are stored as ICD-10 codes at leaf level
of the ICD-10 tree. So, a patient with the stored diagnosis of K35.2 Acute appendicitis with generalized peritonitis and a patient with K35.8 Acute appendicitis, other and unspecified are both considered appendix patients. The hierarchical nature of the four primary diagnoses in the data set is shown in Fig. 2. Taking this particular hierarchy into account, we expect four lower-level clusters in our data, and two higher-level clusters. The data set contains data from 836 patients, with a total of 1487 ICD-10 encoded diagnoses. The primary diagnoses of the 836 patients are distributed as follows: 291 hernia diagnoses, 239 gall-bladder diagnoses, 91 thyroid diagnoses, and 215 appendix diagnoses.
4.2. Visual comparison of clusterings

The data set was visualized using the nonlinear Sammon mapping algorithm [25], which creates a 2D cloud visualization of the relative positions of all patients. Sammon mapping is a widely used method in exploratory data analysis; it minimizes a normalized version of the sum-of-squares error between the original and projected distances over all pairs of data points. The algorithm computes the 2D projections of the original data points in such a way that this error is minimized. The Sammon mapping algorithm requires a distance matrix of the data set to display. These distances were calculated between the patients' sets of diagnoses, once using the Jaccard metric d_J, once using the Haase-Li concept distance d_C, and once using the hierarchical distance function d_H.

Fig. 3 shows three clusterings of the same data set, where each visualization was generated using a different distance metric. The cloud arrangement in part (a) shows a high fragmentation of patients with the same primary diagnosis. The reason for this clearly visible dispersion is the underlying Jaccard metric: it considers two sets of concepts as equal only if they contain exactly the same diagnoses, and does not take into account the similarity information between concepts that is provided by the hierarchy of the ICD-10 catalog. As a result, the visualization shows, for example, four dark blue clusters of appendix patients (one cluster for each appendix sub-type) scattered across the visualization, although they all contain appendix patients and should be located close to one another. The same can be observed for all other main diagnoses.

Parts (b) and (c) of Fig. 3 display a clear separation between cases with different primary diagnoses. The patients with the same
Fig. 2. The location of the four primary diagnosis groups in the hierarchy of the ICD-10 catalog.
primary diagnosis (regardless of their diagnosis sub-type) form clearly recognizable and clearly separable clusters. The expected picture of four clusters can be recognized. In part (c), the clusters show a dense core area and a wide dispersion of cases belonging to the same category. The dispersion is not as clear in the clustering shown in part (b). Analysis of the underlying data showed that the dense cores of the clusters in part (c) represent patients without secondary diagnoses; the more widely dispersed cases of the same category represent patients with one or more secondary diagnoses, which causes them to drift away from the cluster cores and from each other. Furthermore, the clustering in part (c) shows that the blue (appendix), the cyan (gall bladder), and the green (hernia) clusters are located close to one another, while the red (thyroid) cluster is located further away. This cannot be seen in the clustering in part (b). This is remarkable, since the first three clusters are clusters of patients with diseases of the digestive system, and the latter represents patients with endocrine diseases. The clustering in part (c) thus more closely mimics the hierarchical concept structure given by the ICD-10 catalog.
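Sammon mapping itself is compact enough to sketch. The gradient-descent version below is a simplified stand-in for the implementation used in the paper (learning rate, iteration count, and initialization are ad hoc assumptions); it accepts any precomputed distance matrix, for instance one filled with d_H values, and returns 2D coordinates:

```python
import numpy as np

def sammon_stress(D, Y, eps=1e-9):
    """Normalized sum-of-squares error between the original distances D
    and the pairwise distances of the projected points Y."""
    iu = np.triu_indices(D.shape[0], 1)
    d = np.sqrt(((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1))[iu]
    return (((D[iu] - d) ** 2) / np.maximum(D[iu], eps)).sum() / D[iu].sum()

def sammon(D, n_iter=2000, lr=0.01, seed=0, eps=1e-9):
    """Project points described by a symmetric distance matrix D to 2D
    by plain gradient descent on the Sammon stress (a simplified sketch)."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    Y = rng.normal(scale=0.5, size=(n, 2))       # random initial layout
    c = D[np.triu_indices(n, 1)].sum()           # normalization constant
    for _ in range(n_iter):
        diff = Y[:, None, :] - Y[None, :, :]     # pairwise vectors
        d = np.sqrt((diff ** 2).sum(-1)) + eps   # projected distances
        coef = (D - d) / (np.maximum(D, eps) * d)
        np.fill_diagonal(coef, 0.0)              # no self-interaction
        grad = (-2.0 / c) * (coef[:, :, None] * diff).sum(axis=1)
        Y -= lr * grad
    return Y

# Toy distance matrix: points 0, 1 close, points 2, 3 close, groups far apart.
D = np.array([[0.0, 0.2, 1.0, 1.0],
              [0.2, 0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0, 0.2],
              [1.0, 1.0, 0.2, 0.0]])
Y = sammon(D)
```

On the toy matrix, the two within-group pairs end up closer in the projection than any across-group pair, which is exactly the cluster structure read off the figures in this section.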
4.3. Quantitative comparison of clusterings

Using the data set described above, we analyzed how the choice of distance measure influences the distances between the 836 cases when grouped into the four primary (level 2) diagnoses. The distance matrices D_J, D_C, and D_H, corresponding to the three choices of distance measure d_J, d_C, and d_H, respectively, are shown in Table 1. The numbers in the matrices, such as 0.6523 as the first (A-A) entry in D_J, denote the average distance between two diseases in the same category (e.g., diseases of the appendix). For these calculations, we considered only the primary diagnoses (those in one of the four groups), and no additional diagnoses the patients might have. Thus, out of all possible pairs of the 215 appendicitis patients, 65.23% had different diagnoses. Of course, all appendicitis patients had a different primary diagnosis than all other patients; this results in the three 1.0000 entries that comprise the remainder of the first row in D_J.

The numbers in D_C and D_H, the other two matrices in Table 1, are smaller than or equal to the corresponding numbers in D_J. This can be seen most easily for D_H, the matrix corresponding to the new hierarchical distance measure d_H: two different diagnoses with the same primary diagnosis (e.g., diseases of the appendix) have distance 2/(3+3) = 1/3. Because there are also patients with exactly the same diagnosis, and thus distance 0, the average distance among all patients with the same primary diagnosis is always less than 1/3. This can be observed along the diagonal of D_H. Two diagnoses with different primary diagnoses in the same branch of Fig. 2 have distance 4/(3+3) = 0.6667; those with different primary diagnoses in different branches have distance 6/(3+3) = 1.0000. These values make up the remainder of the entries in D_H.

A statistical comparison of the values in D_H with the corresponding values in D_J reveals that all the values in D_H (except the 1.0000 entries) are significantly smaller than those in D_J (Wilcoxon signed-rank test, p < 0.0001). Comparing the entries in D_H with the other hierarchical distances in D_C, we observe the same: all non-1.0000 entries in D_H are again significantly smaller than the corresponding entries in D_C (p < 0.0001).

A graphical display of the effect that the choice of distance measure has on the distances between patient records is given in Fig. 4. One can observe that the two hierarchical distance measures d_C and d_H result in values that are always smaller than, or at most equal to, those produced by the Jaccard distance measure d_J. The plot on the left shows that incorporating hierarchical information results in a better stratification of distances. When we consider more than one diagnosis per patient, as shown on the right of Fig. 4, distances between patients are no longer restricted to a few discrete values. We can indeed observe that incorporating all the diagnoses of a patient results in more fine-grained distance information; this effect is more pronounced when using the hierarchical distance measures.

5. Discussion

In this paper, we introduced a new hierarchical distance measure and compared it to two existing methods (the Jaccard metric and the Haase-Li concept distance) for calculating the similarity between two sets of concepts for which a hierarchical arrangement is available.
Our new hierarchical distance measure allows the grouping of cases in a manner that is not possible with distance measures previously available in the literature: we extend the notion of the well-established Jaccard distance by taking concept hierarchies into account in such a manner that, in the absence of a concept hierarchy, our new distance measure reverts to the Jaccard distance. In contrast to other hierarchical distance measures, such as those proposed by Li et al. [17] and Haase et al. [22], our measure does not introduce nonlinearities, and can thus be argued to be a more direct mapping of hierarchical information to distances.

5.1. Graphical representation

Given a data set of 836 cases, each with one of four primary diagnoses, we calculated three distance matrices, each corresponding to a different distance measure. We then displayed the clusters given by these distance measures through a nonlinear mapping algorithm that reduces the data dimensionality to two.
Fig. 3. Three cloud visualizations of the same data set using the Jaccard metric (a), the Haase-Li distance (b), and the proposed hierarchical distance measure (c). The patients are colored by the category of their primary diagnosis. Appendix patients are colored dark blue, thyroid patients are red, patients with hernia diagnoses are green, and gall bladder patients are cyan. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Table 1 Average distances between patients with diseases in the four level 2 groups hernia (H), and diseases of the appendix (A), gall bladder (G) and thyroid (T). The three distance matrices use the Jaccard distance dJ (left), the Haase-Li concept distance dC (middle), and the new hierarchical distance dH (right).
D_J     A       H       G       T
A       0.6523  1.0000  1.0000  1.0000
H               0.4095  1.0000  1.0000
G                       0.7206  1.0000
T                               0.4895

D_C     A       H       G       T
A       0.2878  0.7587  0.7587  1.0000
H               0.1807  0.7587  1.0000
G                       0.3179  1.0000
T                               0.2160

D_H     A       H       G       T
A       0.2174  0.6667  0.6667  1.0000
H               0.1365  0.6667  1.0000
G                       0.2402  1.0000
T                               0.1632
Visualization based on the Jaccard distance (Fig. 3, part (a)) produces a number of small, dense clusters, each corresponding to patients with a single primary diagnosis, i.e., exactly the same ICD-10 code. Patients with widely varying secondary diagnoses can be detected in a straightforward manner, as they drift away from these clusters. Contrary to expectations, one cannot make out four main clusters. Instead, the visualization shows several small clusters, with clusters of different categories sometimes being closer than those of the same category.

By contrast, the Haase-Li and the new hierarchical distance measures take a concept hierarchy into account; hence, the four expected main clusters are easily visible in Fig. 3, parts (b) and (c). Beyond that, only the new hierarchical distance measure depicts patients with one single diagnosis in dense clusters and patients with secondary diagnoses in less dense areas around these clusters. Additionally, it can be observed that there is almost no blending of different case categories. However, there is a notable difference in the way that patients with a varying number of diagnoses are displayed. The visualization based on the Haase-Li distance measure tends to move patients with multiple diagnoses closer to patients with a single diagnosis; in contrast, the visualization based on the new hierarchical distance measure forms a clearly arranged boundary between them. This transition zone arises from denser cluster cores and the movement of patients with multiple diagnoses outwards, which makes it much easier to distinguish between cases with single and multiple diagnoses. This is highly relevant for the application area of case-based reasoning for clinical decision support systems: patients with secondary diagnoses are clearly separated from patients without them, so cases from one group do not influence decisions for the other group.

Furthermore, our hierarchical distance measure reveals another structural feature of the data set that is not easily visible when using the Haase-Li distance measure. As visualized in Fig. 2, three of the four diagnosis groups (namely appendix, hernia, and gall bladder diseases) belong to the same top-level category Diseases of the digestive system, while the fourth diagnosis group Disorders of the thyroid gland belongs to the top-level category Endocrine, nutritional and metabolic diseases. Consequently, patients with diagnoses of the digestive system are more similar to one another than to patients who suffer from endocrine diseases, and vice versa. Only in Fig. 3, part (c) does this high-level clustering become visible: the digestive patients (green, cyan, and blue) form a closer group, and the endocrine patients (colored red) are located remotely on the left side. This high-level clustering is not visible at all in parts (a) and (b) of Fig. 3.

Summarizing, the clusterings of our example data set, shown in Fig. 3, illustrate that the Jaccard distance measure is all but useless for identifying the underlying structure of the data. The Haase-Li distance measure, on the other hand, leads to an oversimplification in the sense that clusters appear much more dense and clearly delineated than they should be, given that a large number of cases share little but their primary diagnoses.

5.2. Quantitative issues

In Fig. 4, the left plot corresponds to distances between patients when only their primary diagnoses are considered. In the plot, the dotted line represents the output of the Jaccard distance. For two patients with the same primary diagnosis, the plot shows a distance of 0.0.
If their primary diagnoses differ, the distance increases to 1.0, corresponding to the maximum distance between cases. In contrast to the Jaccard distance, the Haase-Li distance takes hierarchical concepts into account, which can be observed especially in the finer gradations (dashed line) in the plot on the right. The new hierarchical distance measure (solid line) improves on both methods described above: the distances it calculates in the right plot are much more fine-grained than those of the other two methods.

5.3. Runtime behavior

Consider the case of calculating the similarity of two sets S1 and S2, with n and m items respectively. When using the Jaccard distance measure, this can be accomplished in O(m + n) steps. Both the Haase-Li and the new hierarchical distance measure, however, require substantially more computational effort: the distance in
Fig. 4. Distances between diagnoses on the y-axis, for the Jaccard distance dJ (dotted line), the Haase-Li concept distance dC (dashed line), and the new hierarchical distance dH (solid line). The x-axis gives all pairs of diagnoses in the data set; the pairs are sorted by the distance between the entries in the pair. The plot on the left shows distances when only the primary diagnosis of a patient is considered; the plot on the right shows distances when all diagnoses of a patient are considered.
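To make the gradations in Fig. 4 and the complexity argument concrete, the following is a minimal sketch in Python. The toy ICD-10 fragment, the helper names, and the averaging scheme are illustrative assumptions, not the paper's exact formulas; it shows how a path-based concept distance yields values between 0 and 1, and why comparing two sets of diagnoses costs O(m · n · d).

```python
# Sketch (hypothetical codes and helpers) of set- vs. hierarchy-based distances.
# The ICD-10 hierarchy is modeled as child -> parent links up to a common root.
PARENT = {
    "K35.8": "K35", "K35": "K00-K93",    # appendicitis -> digestive diseases
    "K40.9": "K40", "K40": "K00-K93",    # inguinal hernia -> digestive diseases
    "E04.1": "E04", "E04": "E00-E90",    # thyroid nodule -> endocrine diseases
    "K00-K93": "ROOT", "E00-E90": "ROOT",
}

def path_to_root(code):
    """Return the list of ancestors from a code up to the root (length <= d)."""
    path = [code]
    while path[-1] != "ROOT":
        path.append(PARENT[path[-1]])
    return path

def concept_distance(a, b):
    """Path-based distance in [0, 1]: steps to the deepest common ancestor,
    normalized by the combined path lengths. Runs in O(d)."""
    pa, pb = path_to_root(a), path_to_root(b)
    common = set(pa) & set(pb)
    steps = (min(i for i, c in enumerate(pa) if c in common)
             + min(i for i, c in enumerate(pb) if c in common))
    return steps / (len(pa) + len(pb) - 2) if steps else 0.0

def jaccard_distance(s1, s2):
    """Set-based distance: O(m + n), but blind to the hierarchy."""
    s1, s2 = set(s1), set(s2)
    return 1.0 - len(s1 & s2) / len(s1 | s2)

def hierarchical_distance(s1, s2):
    """Average best-match concept distance over all pairs of items.
    All m * n pairs are compared at O(d) each, giving O(m * n * d)."""
    def directed(a_set, b_set):
        return sum(min(concept_distance(a, b) for b in b_set)
                   for a in a_set) / len(a_set)
    return (directed(s1, s2) + directed(s2, s1)) / 2.0
```

On this toy fragment, two hernia and appendicitis codes come out as partially similar (they share the digestive-system ancestor), whereas the Jaccard distance can only report 0 or 1 for single-diagnosis patients — the coarse steps visible in the dotted line of Fig. 4.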
the concept hierarchy has to be calculated for every pair of items in S1 and S2. This distance calculation requires O(d) steps, where d is the depth of the concept hierarchy. Consequently, a hierarchical similarity calculation of S1 and S2 requires O(m · n · d) steps.

5.4. Applicability to other areas

The new distance measure is not only relevant in a biomedical environment. It could, for example, be used in the field of market basket research to improve predictions and customer clusterings. Available products are often organized and standardized in hierarchies, such as the Global Product Classification (GPC), which is maintained by the GS1 consortium. Comparable to patients with multiple diagnoses, customers who purchased multiple products can be clustered into more meaningful groups by considering product hierarchies. For example, two different variants of milk are distinct products that can only be identified as similar (and, furthermore, as members of the "dairy" category) by referring to a concept hierarchy. Predictions of purchasing behavior and product recommendations can be made more accurate by taking this hierarchical notion of similarity into account.

Conflict of interest

We declare that we do not have any conflict of interest.

References

[1] L.B. Madsen, Data-Driven Healthcare: How Analytics and BI are Transforming the Industry, Wiley, 2014.
[2] W. Raghupathi, V. Raghupathi, Big data analytics in healthcare: promise and potential, Health Inform. Sci. Syst. 2 (2014) 3.
[3] Y. Bengio, A. Courville, P. Vincent, Representation learning: a review and new perspectives, IEEE Trans. Pattern Anal. Mach. Intel. 35 (8) (2013) 1798–1828.
[4] R. Pivovarov, N. Elhadad, Automated methods for the summarization of electronic health records, J. Am. Med. Inform. Assoc. 22 (5) (2015) 938–947.
[5] S.K. Card, J.D. MacKinlay, B. Shneiderman (Eds.), Readings in Information Visualisation: Using Vision to Think, Morgan Kaufman, 1999.
[6] D.A. Keim, F. Mansmann, J. Schneidewind, J. Thomas, H. Ziegler, Visual analytics: scope and challenges, in: Visual Data Mining, Lecture Notes in Computer Science, vol. 4404, Springer, Berlin, Heidelberg, 2008, pp. 76–90.
[7] W. Lee, N. Shah, K. Sundlass, M. Musen, Comparison of ontology-based semantic-similarity measures, in: Proceedings of the AMIA Annual Symposium, 2008, pp. 384–388.
[8] T. Mabotuwana, M.C. Lee, E.V. Cohen-Solal, An ontology-based similarity measure for biomedical data—application to radiology reports, J. Biomed. Inform. 46 (5) (2013) 857–868.
[9] WHO, ICD-10, 2015.
[10] J. Kolodner, Case-Based Reasoning, Elsevier Science, Burlington, 2014.
[11] S. Boriah, V. Chandola, V. Kumar, Similarity measures for categorical data: a comparative evaluation, in: Proceedings of the 8th SIAM International Conference on Data Mining, 2008, pp. 243–254.
[12] B. Boutsinas, T. Papastergiou, On clustering tree structured data with categorical nature, Pattern Recog. 41 (12) (2008) 3613–3623.
[13] B. Hammer, A. Micheli, A. Sperduti, M. Strickert, A general framework for unsupervised processing of structured data, Neurocomputing 57 (2004) 3–35.
[14] T. Kohonen, Self-Organizing Maps, third ed., Springer, 2001.
[15] R. Rada, H. Mili, E. Bicknell, M. Blettner, Development and application of a metric on semantic nets, IEEE Trans. Syst. Man Cyb. 19 (1) (1989) 17–30.
[16] Z. Wu, M. Palmer, Verb semantics and lexical selection, in: Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, 1994, pp. 133–138.
[17] Y. Li, Z.A. Bandar, D. McLean, An approach to measuring semantic similarity between words using multiple information sources, IEEE Trans. Knowl. Data Eng. 15 (4) (2003) 871–882.
[18] P. Resnik, Using information content to evaluate semantic similarity in a taxonomy, in: Proceedings of the 1995 International Joint Conference on AI, 1995, pp. 448–453.
[19] P. Resnik, Semantic similarity in a taxonomy: an information-based measure and its application to problems of ambiguity in natural language, J. Artif. Intel. Res. 11 (1999) 95–130.
[20] D. Lin, An information-theoretic definition of similarity, in: Proceedings of the 15th International Conference on Machine Learning, 1998, pp. 296–304.
[21] P. Bouquet, G. Kuper, M. Scoz, S. Zanobini, Asking and answering semantic queries, in: Proceedings of the ISWC-04 Workshop on Meaning Coordination and Negotiation, in conjunction with the 3rd International Semantic Web Conference, 2004, pp. 25–36.
[22] P. Haase, R. Siebes, F. van Harmelen, Peer selection in peer-to-peer networks with semantic topologies, in: International Conference on Semantics of a Networked World: Semantics for Grid Databases, 2004, pp. 108–125.
[23] P. Jaccard, The distribution of the flora in the alpine zone, New Phytol. 11 (1912) 37–50.
[24] D.J. Rogers, T.T. Tanimoto, A computer program for classifying plants, Science 132 (3434) (1960) 1115–1118.
[25] J.W. Sammon, A nonlinear mapping for data structure analysis, IEEE Trans. Comput. 18 (1969) 401–409.