Complex overlapping concepts: An effective auditing methodology for families of similarly structured BioPortal ontologies


Journal of Biomedical Informatics 83 (2018) 135–149


Journal of Biomedical Informatics journal homepage: www.elsevier.com/locate/yjbin




Ling Zheng a,*, Yan Chen b, Gai Elhanan c, Yehoshua Perl a, James Geller a, Christopher Ochs d

a Department of Computer Science, New Jersey Institute of Technology, Newark, NJ 07102, United States
b CIS Department, Borough of Manhattan Community College, CUNY, NY 10007, United States
c Applied Innovation Center, Desert Research Institute, Reno, NV 89512, United States
d Nokia Bell Labs, Murray Hill, NJ 07974, United States

ARTICLE INFO

Keywords: National Cancer Institute thesaurus; SNOMED CT; Ontology auditing; Ontology quality assurance; Abstraction network; Family-based ontology quality assurance

ABSTRACT

In previous research, we have demonstrated for a number of ontologies that structurally complex concepts (for different definitions of “complex”) in an ontology are more likely to exhibit errors than other concepts. Thus, such complex concepts often become fertile ground for quality assurance (QA) in ontologies. They should be audited first. One example of complex concepts is given by “overlapping concepts” (to be defined below.) Historically, a different auditing methodology had to be developed for every single ontology. For better scalability and efficiency, it is desirable to identify family-wide QA methodologies. Each such methodology would be applicable to a whole family of similar ontologies. In past research, we had divided the 685 ontologies of BioPortal into families of structurally similar ontologies. We showed for four ontologies of the same large family in BioPortal that “overlapping concepts” are indeed statistically significantly more likely to exhibit errors. In order to make an authoritative statement concerning the success of “overlapping concepts” as a methodology for a whole family of similar ontologies (or of large subhierarchies of ontologies), it is necessary to show that “overlapping concepts” have a higher likelihood of errors for six out of six ontologies of the family. In this paper, we are demonstrating for two more ontologies that “overlapping concepts” can successfully predict groups of concepts with a higher error rate than concepts from a control group. The fifth ontology is the Neoplasm subhierarchy of the National Cancer Institute thesaurus (NCIt). The sixth ontology is the Infectious Disease subhierarchy of SNOMED CT. We demonstrate quality assurance results for both of them. Furthermore, in this paper we observe two novel, important, and useful phenomena during quality assurance of “overlapping concepts.” First, an erroneous “overlapping concept” can help with discovering other erroneous “non-overlapping concepts” in its vicinity. 
Secondly, correcting erroneous “overlapping concepts” may turn them into “non-overlapping concepts.” We demonstrate that this may reduce the complexity of parts of the ontology, which in turn makes the ontology more comprehensible, simplifying maintenance and use of the ontology.

1. Introduction

In recent years, ontologies have been playing an important role in the biomedical field to support the rapid increase of data processing in healthcare and basic research [1]. Biomedical ontologies have been used for data annotation, information integration, knowledge discovery and other applications [2–6]. Errors in biomedical ontologies impede their usefulness. Hence, quality assurance (QA) of biomedical ontologies must be an essential part of their life cycle [7]. Some practically useful biomedical ontologies are large in terms of their numbers of concepts and are complex due to the many links between concepts. For example, the National Cancer Institute thesaurus (NCIt) [8] contains



* Corresponding author. E-mail address: [email protected] (L. Zheng).

https://doi.org/10.1016/j.jbi.2018.05.015 Received 10 October 2017; Received in revised form 25 May 2018; Accepted 26 May 2018 Available online 28 May 2018 1532-0464/ © 2018 Elsevier Inc. All rights reserved.

more than 100,000 concepts and over 400,000 links connecting them. Biomedical ontologies need to be updated at regular intervals, in order to stay in sync with frequent extensions and changes of expert biomedical knowledge. Due to these inherent characteristics of biomedical ontologies, QA is a challenging and resource-intensive task. Without the help of automatic or semi-automatic techniques and tools, it is impossible to maintain large ontologies of a high quality. For example, Jiang and Chute [9] utilized a formal concept analysis-based model to audit the semantic completeness of SNOMED CT. We have introduced abstraction networks, which are automatically derived compact summaries of the content and structure of ontologies, to support ontology summarization and quality assurance [10]. One


The Neoplasm subhierarchy contains 8166 concepts and has similar structural features as “the four ontologies,” thus, it belongs to the same family and qualifies as “the fifth ontology.” Therefore, we are evaluating the QA technique based on overlapping concepts on the Neoplasm subhierarchy. For this purpose, we first derived the partial-area taxonomy [17] for the NCIt Neoplasm subhierarchy in this study. Then we derived from it the disjoint partial-area taxonomy [23]. Based on the latter we conducted a QA study, with the help of two domain experts, on a random sample of its “overlapping concepts” and a random control sample. The goal was to determine whether the error rate of overlapping neoplasm concepts is statistically significantly higher than that of a random control sample of non-overlapping neoplasm concepts. For the sixth ontology, we conducted a QA study on the Infectious Disease subhierarchy of SNOMED CT. During recent years, this subhierarchy has undergone an intensive remodeling by editors of SNOMED International [24]. In our study, we considered all concepts that were changed in this process as erroneous, since their modeling was changed for good reasons. A similar idea was previously used by other researchers [25,26]. In this way, no additional domain expert was necessary to review any samples of the sixth ontology. The study considered all the overlapping concepts in the subhierarchy and an equal number of randomly chosen non-overlapping concepts. Statistical significance was obtained for both the Neoplasm subhierarchy and the Infectious Disease subhierarchy. This extends our previous results to six out of six ontologies from the same family, as required. Furthermore, we observed two new useful phenomena for overlapping concepts. The first is that some erroneous overlapping concepts facilitate the discovery of erroneous non-overlapping concepts outside of the study and control samples. 
The second new phenomenon is that the correction of some erroneous overlapping concepts turns them into non-overlapping concepts. Since overlapping concepts are “complex” (in a sense defined below), this implies that such corrections reduce the overall complexity of the ontology, which makes it easier for ontology curators to maintain it, and for users and application developers to utilize it.

advantage of the abstraction network-based QA methodology is that it can support auditors to identify both semantic errors and structural errors while automatic methods can only detect structural errors, e.g., redundant parents or relationship targets. By errors we mean concepts’ modeling errors, including omission errors (e.g., a concept is missing an is-a relationship to another concept, which should be the concept’s parent) and commission errors (e.g., a concept has an erroneous is-a relationship to the wrong parent or an erroneous semantic relationship pointing to an incorrect target concept). Examples will be given below. We note that both NCIt and SNOMED CT were developed using description logic support. Some existing errors could be related to either inaccurate logical definitions or lack of sufficient conditions (i.e., primitive concepts). Wei and Bodenreider [11] showed that abstraction networks can support finding of errors that are not exposed by automatic classifiers [12]. The derivation of an abstraction network from an ontology takes into account the structural features of the ontology. Ontologies of different structures require different derivation algorithms. A special kind of abstraction network, the partial-area taxonomy, had been developed for auditing the SNOMED CT Specimen subhierarchy [13]. In previous work, we introduced a family-based QA framework that supports QA techniques applicable to whole families of ontologies with similar structure [14]. The more than 680 ontologies hosted at the National Center for Biomedical Ontologies (NCBO) BioPortal [15] in Stanford can be mapped into many families [14]. The methodology of auditing “overlapping concepts” [16] was shown to find a statistically significantly higher ratio of errors than for a control sample, for four ontologies on BioPortal. These four ontologies belong to the BioPortal family of ontologies with (a) object properties used only in restrictions and (b) with multiple parents allowed. 
“Overlapping concepts” are concepts that belong to multiple partial-areas in the partial-area taxonomy of the ontology. (These technical terms will be clarified in the Background Section.) The four ontologies for which the studies were conducted [16–19] are (1) the Specimen subhierarchy of SNOMED CT, (2) the Bleeding subhierarchy in the Clinical Finding subhierarchy of SNOMED CT [20], (3) the Uber Anatomy Ontology (Uberon) [21], and (4) the Gene subhierarchy of NCIt [22]. In the balance of this paper we will refer to (1)–(4) as “the four ontologies.” Below we will briefly review the derivation method for partial-area taxonomies. It was possible to derive partial-area taxonomies for “the four ontologies,” because the concepts in these ontologies (or subhierarchies of ontologies) have “outgoing” semantic relationships, the first structural feature of the above-mentioned family. The reason for the higher error rate is that overlapping concepts are complex, because they inherit information from two or more sources. This is the second structural feature of this BioPortal family of ontologies (see Background for details). Statistically, in order to correctly draw the conclusion that a QA technique is likely to work for at least half of the ontologies in a family, it is necessary to show that this technique succeeds for six out of six similar ontologies, or for eight out of nine similar ontologies, as shown by Ochs et al. [14]. The QA technique that relies on the hypothesis that the error rate of overlapping concepts is higher than that of non-overlapping concepts had been successfully applied to “the four ontologies.” To establish the required “six out of six,” we still need to demonstrate its success for two more ontologies of the same family, which is achieved in this paper. The National Cancer Institute thesaurus (NCIt) is a large complex reference ontology dedicated to the treatment of and research on cancer. 
It consists of 19 subhierarchies, the largest of which is Disease, Disorder or Finding. The Neoplasm subhierarchy is, in turn, a subhierarchy of Disease, Disorder or Finding. Due to this cancer focus, the Neoplasm subhierarchy appears to receive special attention by the NCIt editorial team, as manifested, for example, by the higher relationship density of this subhierarchy, compared to the density for the whole NCIt.
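The family-level criterion cited above from Ochs et al. [14] ("six out of six" or "eight out of nine") is numerically consistent with an exact two-sided binomial test at the 5% level against a success probability of one half. That reading is an assumption made here for illustration and is not taken from [14]; the function name is hypothetical. A minimal sketch:

```python
from math import comb


def two_sided_sign_test_p(successes: int, trials: int) -> float:
    """Exact two-sided binomial p-value against a success probability of 0.5.

    Assumption for illustration: each ontology in the family is one
    independent trial in which the QA technique either succeeds or fails.
    """
    extreme = max(successes, trials - successes)
    # Upper-tail probability of seeing at least `extreme` successes.
    tail = sum(comb(trials, k) for k in range(extreme, trials + 1)) / 2 ** trials
    # The null distribution is symmetric, so the two-sided value doubles the tail.
    return min(1.0, 2 * tail)


for successes, trials in [(5, 5), (6, 6), (7, 8), (8, 9)]:
    p = two_sided_sign_test_p(successes, trials)
    print(f"{successes} of {trials}: p = {p:.4f}, below 0.05: {p < 0.05}")
```

Under this assumed test, five out of five and seven out of eight do not reach the 5% level, whereas six out of six and eight out of nine do, which matches the thresholds quoted in the text.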

2. Background

2.1. Ontologies and the National Cancer Institute thesaurus (NCIt)

The National Cancer Institute thesaurus (NCIt) [27] is a cancer-focused reference ontology developed and published by the National Cancer Institute (NCI) with the goal to facilitate interoperability and data sharing among various information systems in the NCI. It is released at the beginning of each month for free public access in OWL and flat file formats and has been used in an increasing number of information systems outside the NCI, both nationally and internationally [28]. Having referred to NCIt as an ontology, as opposed to, for example, a terminology, it is necessary to state that the exact definition and differentiation between ontology, terminology, and controlled vocabulary have been the subject of much academic debate. Schulz and Jansen [6] write that in prior cited work “seven different definitions were proposed” for the meaning of “ontology.” They also “identify the following trends:” “The best starting point to get an up-to-date overview of biomedical ontologies (and terminologies) is BioPortal…” The definition of ontology on the Protégé website [29] is “An ontology defines a common vocabulary for researchers who need to share information in a domain. It includes machine-interpretable definitions of basic concepts in the domain and relations among them. An ontology is a formal explicit description of concepts in a domain of discourse (classes (sometimes called concepts)), properties of each concept describing various features and attributes of the concept (slots (sometimes called roles or properties)), and restrictions on slots (facets (sometimes called role restrictions)).” We are adopting an extensional (in the philosophical sense) stance that any collection of terms listed by BioPortal


is either an ontology, or sufficiently similar to an ontology, to be the subject of this investigation. In order to avoid cluttering up the text of this paper with the repeated use of phrases such as “ontology and ontology-like controlled terminology” we will take the inclusive view of ontologies and use only this designation. NCIt covers medical terms in different domains that are important for cancer research, including clinical care, basic research, public information and administrative activities. The content of NCIt is modeled based on a version of description logic [22]. A “concept” is the basic unit in NCIt, as in many other ontologies. Pairs of concepts may be connected by is-a links or roles. Together the concepts and is-a links form a connected directed graph (in the mathematical sense) without cycles that may be referred to as a heterarchy, a DAG (Directed Acyclic Graph), or, for simplicity in this paper, as a hierarchy. Any part of a hierarchy that is itself structurally a DAG will be referred to as a subhierarchy in this paper. Thus, the complete NCIt is a hierarchy, and any part of it is a subhierarchy. A subhierarchy can have its own subhierarchies. The inferred February 2015 release of NCIt in OWL format, utilized for this research, has 108,376 active concepts organized into 19 major disjoint is-a subhierarchies (direct children of NCIt’s root), e.g., Disease, Disorder or Finding; Molecular Abnormality; Abnormal Cell; Chemotherapy Regimen or Agent Combination; and Biological Process. Concepts in each subhierarchy are connected by is-a relationships to their parent concept(s), i.e., a concept may have multiple parents. Fig. 1 shows the 19 major subhierarchies, with some concepts listed for the two subhierarchies Biological Process and Disease, Disorder or Finding. Unfortunately, different ontologies use slightly different meta-terminology to name basic components. 
As we worked first with NCIt, we will adopt its usage and use the name “role” to describe a semantic connection between two concepts. Thus, Disease Has Associated Anatomic Site is considered a role in NCIt (Fig. 2). It connects a disease concept with the anatomical location where this disease is manifested. Alternative closely related meta-terms for connections between pairs of concepts are lateral relationship, semantic relationship, and restriction. The name semantic relationship is specifically used to express that this connection of two concepts has different semantics from the is-a connection of two concepts. This semantics is expressed explicitly by the role name. The is-a connection, in turn, is closely related, but philosophically not necessarily identical, to the meta-terms subclass-of, a-kind-of, etc. Domain and range are basic components of description logics. We will use the term source to describe the concept at the starting point of a role (Fig. 2) and the term target to describe the end point (“the concept at the arrow head”). We will use the term domain to describe the starting subhierarchy of a role and the term range for the subhierarchy of its end point. For example, for the above-mentioned role Disease Has Associated Anatomic Site the domain is the Disease, Disorder or Finding

Fig. 2. Illustration of the is-a hierarchical relationship, the role Disease Has Associated Anatomic Site (the solid arrow) and the effect of inheritance of this role (the dashed arrow) to the concept Benign Breast Neoplasm.

subhierarchy and the range is the Anatomic Structure, System, or Substance subhierarchy. Roles are inherited along the is-a hierarchy. For example, in Fig. 2, the source concept Breast Neoplasm in the Disease, Disorder or Finding subhierarchy has the role Disease Has Associated Anatomic Site pointing (as a solid arrow) to the target concept Breast, which is in the Anatomic Structure, System, or Substance subhierarchy. Since the concept Benign Breast Neoplasm “is-a” Breast Neoplasm, it inherits the role Disease Has Associated Anatomic Site (shown as a dashed arrow) with the target concept Breast from Breast Neoplasm. The inherited role is shown explicitly in Fig. 2 for the purpose of explication; in practice, it would not be displayed. Below we will make one more important distinction, namely between role and role type. Use cases determine to a large extent the modeling priorities of the NCIt. Hence, not every subhierarchy is modeled with roles. Concepts in eight of the 19 major subhierarchies, such as the Organism subhierarchy and the Biochemical Pathway subhierarchy, only serve as the targets of roles and do not function as sources of roles. NCIt has provided terminology support for a broad range of users from NCI’s various internal divisions to domestic and international organizations in the cancer research and biomedical communities outside of NCI. For example, the anatomy concepts in NCIt serve as a resource for the Mouse–Human Anatomy Project (MHAP), facilitating the mapping of anatomical terms used for mouse and human models [30] by the NCI Division of Cancer Biology. The eMERGE (Electronic Medical Records and Genomics) Network study of Pathak et al. [31] mapped phenotype data dictionaries from five different sites to standardized biomedical ontologies including NCIt. The Pharmacogenomics Research Network (PGRN) applied controlled terminologies such as NCIt as sources for pharmacogenomics data semantic annotation [32]. 
The Tissue Microarray Database (TMAD), an important public resource for tissue microarray experiment data, uses NCIt to annotate and query tissue data [33]. The neXtProt knowledgebase of human proteins,

Fig. 1. An excerpt of the 19 major subhierarchies in NCIt.


Fig. 3. The derivation of area taxonomy and partial-area taxonomy from an ontology: (a) An excerpt of 13 Neoplasm concepts in the Disease, Disorder or Finding subhierarchy of NCIt. Concepts represented by boxes with rounded corners are connected by is-a relationships shown as thin upward arrows. Roles are written in bold inside dashed, colored boxes that indicate sets of concepts with the same role types. (b) The area taxonomy for the ontology excerpt in (a). Area nodes are color coded by the number of role types, i.e., area nodes with the same number of role types have the same color. Hierarchical child-of links are displayed as bold upward arrows. (c) The partial-area taxonomy for the ontology excerpt in (a). The partial-area nodes are represented as white boxes within area nodes. Hierarchical child-of links are again displayed as bold upward arrows. Numbers in () indicate how many concepts are summarized.

developed at the Swiss Institute of Bioinformatics (SIB), includes a mapping from its data source to NCIt to support interoperability with standard vocabularies [34]. To explore, ensure and improve the quality of NCIt, results of various QA studies applying different QA techniques to NCIt have been reported. NCIt has an internal QA process that is operational during its whole life cycle [28]. In one study, de Coronado et al. [35] used the UMLS Semantic Network to validate NCIt’s structure. A random sample of NCIt axioms was reviewed by two domain experts and they found that about half of the sample was incorrect [36]. A method based on semantic web technologies was presented to audit hierarchical and associative relations (i.e., roles) in NCIt, with the result that consistency of the associative relations is better than that of the hierarchical relations [37]. Zhu et al. [38], in a first journal special issue on auditing of terminologies [39], performed a review of auditing methods applied to controlled biomedical terminologies including NCIt.

2.1.1. Neoplasm subhierarchy of NCIt

At the early development stages of NCIt, there were more neoplasm concepts (5901) than non-neoplasm concepts (4620) in the Disease, Disorder or Finding subhierarchy [27]. Due to the needs expressed by users for the inclusion of non-neoplasm concepts, there are now more non-neoplasm concepts in this subhierarchy, but most of the non-neoplasm concepts are primitive (in the technical sense of description logic). The Disease, Disorder or Finding subhierarchy is the largest subhierarchy in NCIt, with 25,360 concepts in the February 2015 release. As NCIt is focused on cancer, the Neoplasm subhierarchy, containing 8166 concepts, is an important component of the Disease, Disorder or Finding subhierarchy and is modeled with more detail, compared to the other concepts in the latter subhierarchy. We investigated two quantitative measures, the average number of parents and the average number of roles per concept, to numerically demonstrate the detailed modeling of the Neoplasm subhierarchy. The average number of parents for concepts in the Neoplasm subhierarchy is 1.73, while it is 1.10 for the remaining concepts in the Disease, Disorder or Finding subhierarchy. The average number of roles of the neoplasm concepts is 23.02. However, the corresponding average is 0.55 for non-neoplasm concepts, because only 2858 non-neoplasm concepts (16.6%


node also contains an annotation indicating how many concepts are summarized by it. Area nodes are connected by child-of links that are derived from the underlying is-a links. We omit the technical details of this step, which was developed in previous publications (e.g., Wang et al. [13]). Area nodes and child-of links together form the area taxonomy. One issue with the area taxonomy is that it sometimes “over-summarizes,” because it is based solely on structure, as defined above. To illustrate what we mean by “over-summarize,” consider the lowest (red) dashed box in Fig. 3(a). The three concepts in that box Papillary Cystic Neoplasm, Cystadenoma, and Serous Neoplasm have the same set of six role types (i.e., the same structure). However, each has a different meaning as reflected by their names, i.e., each is a different kind of neoplasm. Hence, the red area in Fig. 3(b) derived from this box contains concepts with substantially different meanings. We would like to refine an area into smaller units such that a unit contains a group of concepts with similar meaning (semantics). For example, each of the three concepts Cystadenoma, Papillary Cystadenoma, and Serous Cystadenoma is a kind of cystadenoma. We will say that these three concepts have the semantics of Cystadenoma. Fig. 3(c) shows the partial-area taxonomy, which defines a middle ground between the original ontology and the area taxonomy. To explain this part of the process, refer back to Fig. 3(a) where some of the concepts are marked by bold frames. Those concepts are called roots, because they have no parents within their areas. In Fig. 3(c), every root is transformed into a white partial-area node. One partial-area node summarizes one root and all concepts in its area “under it” (in the graph-theoretical sense of a DAG). Below in Fig. 4(a), dotted boxes indicate the concepts in partial-areas. A dotted box of one color contains all concepts of one partial-area. 
Concepts of different partial-areas are distinguished by using different colors of dots.
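The derivation described in this section (areas as groups of concepts with identical role-type sets, roots as concepts without a parent inside their area, and partial-areas as a root plus the concepts of the area below it) can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the data covers only Cystic Neoplasm and the five concepts of the six-role-type area of Fig. 3, the within-area is-a links are inferred from the concept counts in Fig. 3(c), and the link from Papillary Cystic Neoplasm to Cystic Neoplasm is assumed solely to show a parent that lies outside the area.

```python
from collections import defaultdict

# Role-type sets as named in Fig. 3.
TWO_ROLE_TYPES = frozenset({"Disease Has Abnormal Cell", "Disease Has Finding"})
SIX_ROLE_TYPES = frozenset({
    "Disease Excludes Abnormal Cell", "Disease Excludes Finding",
    "Disease Has Abnormal Cell", "Disease Has Finding",
    "Disease Has Normal Cell Origin", "Disease Has Normal Tissue Origin",
})

ROLE_TYPES = {
    "Cystic Neoplasm": TWO_ROLE_TYPES,
    "Papillary Cystic Neoplasm": SIX_ROLE_TYPES,
    "Cystadenoma": SIX_ROLE_TYPES,
    "Serous Neoplasm": SIX_ROLE_TYPES,
    "Papillary Cystadenoma": SIX_ROLE_TYPES,
    "Serous Cystadenoma": SIX_ROLE_TYPES,
}

# is-a parents restricted to this excerpt; other parents are omitted.
PARENTS = {
    "Cystic Neoplasm": [],
    "Papillary Cystic Neoplasm": ["Cystic Neoplasm"],  # assumed, illustration only
    "Cystadenoma": [],
    "Serous Neoplasm": [],
    "Papillary Cystadenoma": ["Papillary Cystic Neoplasm", "Cystadenoma"],
    "Serous Cystadenoma": ["Cystadenoma", "Serous Neoplasm"],
}


def derive_areas(role_types):
    """Group concepts that share exactly the same set of role types."""
    areas = defaultdict(set)
    for concept, types in role_types.items():
        areas[types].add(concept)
    return dict(areas)


def area_roots(area, parents):
    """Return the concepts of an area that have no parent inside that area."""
    return {c for c in area if not any(p in area for p in parents[c])}


def partial_area(root, area, parents):
    """Return the root plus every concept of the area below it along is-a links."""
    members = {root}
    changed = True
    while changed:
        changed = False
        for concept in area - members:
            if any(p in members for p in parents[concept]):
                members.add(concept)
                changed = True
    return members


areas = derive_areas(ROLE_TYPES)
six_area = areas[SIX_ROLE_TYPES]
sizes = {r: len(partial_area(r, six_area, PARENTS)) for r in area_roots(six_area, PARENTS)}
for root in sorted(sizes):
    print(f"{root} ({sizes[root]})")
```

With this data the three roots summarize 3, 2, and 2 concepts (Cystadenoma, Papillary Cystic Neoplasm, Serous Neoplasm), in line with the numbers shown for these partial-areas in Fig. 3(c).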

= 2858/17,194) have roles. These 2858 non-neoplasm concepts have an average number of 3.33 roles. Based on NCI’s modeling emphasis on neoplasm concepts, we focused our research on the Neoplasm subhierarchy.

2.2. SNOMED CT

SNOMED CT (SNOMED Clinical Terms) [40] is the most comprehensive clinical terminology. It is used in more than fifty countries, with multiple language versions. It is maintained and distributed by an international non-profit organization named SNOMED International, which is the trading name of the International Health Terminology Standards Development Organization (IHTSDO) [41]. Each year, IHTSDO publishes two new releases of the SNOMED CT International Edition, in January and in July respectively. SNOMED CT covers a wide range of clinical specialties, disciplines and requirements so that it enables consistent representation of clinical content in electronic health records [42] and facilitates the semantic interoperability of health records. Concepts are SNOMED CT’s basic components to represent healthcare data [43]. SNOMED CT’s concepts are organized into 19 major subhierarchies (e.g., Clinical Finding and Specimen) through is-a relationships. A concept may have multiple parents in a subhierarchy, i.e., a concept may have multiple is-a relationships pointing to other concepts in the same subhierarchy. The lateral relationships (called “attribute relationships” by SNOMED CT) provide formal definitions for concepts. For example, Finding site for the Clinical Finding subhierarchy specifies the body site affected by a condition. There are 341,105 active concepts connected by more than 511,000 is-a hierarchical relationships and 550,308 lateral relationships in SNOMED CT’s January 2018 release. The Infectious Disease subhierarchy contains 6681 concepts in that release. Examples of infectious disease concepts include Tuberculosis, Viral hepatitis type B, and Human immunodeficiency virus infection. 
The Infectious Disease subhierarchy is a part of the large Clinical Finding subhierarchy of SNOMED CT, which contains 111,081 concepts.

2.4. Disjoint partial-area taxonomy

We have demonstrated, in a long-range research program, that abstraction networks (especially the partial-area taxonomy) are powerful tools to support quality assurance of ontologies. Abstraction networks are useful in quality assurance, because they make it possible to characterize certain subsets of concepts that have been found to have a relatively high likelihood of errors. Two such characterizations are “complex concepts” and “uncommonly modeled concepts,” as defined by Halper et al. [10]. In this research, we will use an abstraction network called disjoint partial-area taxonomy [23], which is a further refinement of a partial-area taxonomy, and will now be explained. As a starting point, we observe that the red (bottom) area in Fig. 3(a) contains five concepts, while the sum of the numbers of concepts in the corresponding three partial-areas in Fig. 3(c) is 7 (= 2 + 3 + 2). That is the case, because both Papillary Cystadenoma and Serous Cystadenoma (Fig. 3(a)) have two parents each that are roots of the red area. Because partial-areas were defined to consist of roots and concepts “under them,” both concepts are simultaneously summarized by two partial-areas. Definition: Overlapping concept. We call a concept that is summarized by two or more partial-areas an overlapping concept. Therefore, the two concepts Papillary Cystadenoma and Serous Cystadenoma in Fig. 3(a) are overlapping concepts. One way to recognize an overlapping concept is to note that there are paths of is-a links from it to two or more roots of the area. We note that overlapping concepts cause an ambiguity in the summarization of the ontology due to their belonging to multiple partial-areas. In order to eliminate this phenomenon of summarization ambiguity, we derive the disjoint partial-area taxonomy from the partial-area taxonomy [23]. 
The basic idea is to extract the overlapping concepts from their multiple original partial-areas and place them into their own dedicated partial-area. As a result, both the original partial-areas and the newly extracted partial-area become disjoint; that means that no two partial-areas share any concepts.
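The recognition rule stated above (an overlapping concept has is-a paths to two or more roots of its area) amounts to a reachability check. The sketch below applies it to the five concepts of the red area of Fig. 3(a). It is an illustration only: the function names are hypothetical, the within-area parent links are inferred from the partial-area sizes 2, 3, and 2, and the code stops at identifying overlapping concepts; it does not reproduce the derivation of disjoint roots from [23].

```python
# Within-area is-a parents for the five concepts of the red area in Fig. 3(a).
# Parents outside the area are omitted, so a root has an empty parent list.
AREA_PARENTS = {
    "Papillary Cystic Neoplasm": [],
    "Cystadenoma": [],
    "Serous Neoplasm": [],
    "Papillary Cystadenoma": ["Papillary Cystic Neoplasm", "Cystadenoma"],
    "Serous Cystadenoma": ["Cystadenoma", "Serous Neoplasm"],
}


def roots_above(concept, parents):
    """Return the area roots reachable from a concept along within-area is-a links."""
    direct = parents[concept]
    if not direct:
        return {concept}  # no parent inside the area: the concept is a root
    found = set()
    for parent in direct:
        found |= roots_above(parent, parents)
    return found


def overlapping_concepts(parents):
    """Return the concepts summarized by two or more partial-areas."""
    return {c for c in parents if len(roots_above(c, parents)) >= 2}


root_sets = {c: roots_above(c, AREA_PARENTS) for c in AREA_PARENTS}
overlapping = overlapping_concepts(AREA_PARENTS)
# Each concept counts once per partial-area that summarizes it.
memberships = sum(len(roots) for roots in root_sets.values())

print(sorted(overlapping))
print(len(AREA_PARENTS), memberships)
```

The membership total of 7 against 5 distinct concepts reproduces the discrepancy noted in the text; removing the two overlapping concepts from the original partial-areas leaves each remaining concept in exactly one of them.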

2.3. Partial-area taxonomy

Fig. 3 demonstrates the derivation of an area taxonomy and a partial-area taxonomy from an excerpt of an ontology. Wherever terms are defined in the text below, they are typeset in bold. Fig. 3(a) shows an excerpt of 13 Neoplasm concepts in the Disease, Disorder or Finding subhierarchy of NCIt. Concepts are represented by boxes with rounded corners. Concepts are connected by is-a relationships shown as thin upward arrows. Concepts also have roles that may be shown as arrows, as previously demonstrated in Fig. 2. However, as a first step of summarization, these arrows are not shown in Fig. 3(a). Consequently, the targets of roles are also not shown. Furthermore, in preparation of summarization, the roles are not listed separately for each concept. The dashed, colored boxes in Fig. 3(a) are not from the original ontology. Rather, we have overlaid these boxes to group together concepts that have exactly the same role types, listed inside those boxes. Thus, there are five concepts with the role type Disease Has Abnormal Cell in Fig. 3(a), e.g., Papillary Neoplasm. There is an important distinction between roles and role types. A concept may have one single role type, e.g., Disease Has Finding, but it may have two or more roles of this role type pointing to two or more targets, respectively. In Fig. 3(a) the concept Cystic Neoplasm has the two role types Disease Has Abnormal Cell and Disease Has Finding. A set of concepts with exactly the same structure, namely the same set of role types, is called an area. In other words, a dashed box delineates an area. The dashed, colored boxes form the basis for deriving the area taxonomy in Fig. 3(b). Every dashed, colored box becomes one area node, shown with a solid color fill. The text in an area node indicates the set of role types of all concepts in the corresponding dashed box. Thus, an area node summarizes the concepts of an area. The area


[Fig. 4: panels (a), (b), and (c). The disjoint partial-area nodes in panel (c) are Serous Neoplasm (1), Cystadenoma (2), Papillary Cystic Neoplasm (1), Serous Cystadenoma (1), Papillary Cystadenoma (2), Borderline Serous Cystadenoma (1), Borderline Papillary Cystadenoma (1), Papillary Serous Cystadenoma (1), and Borderline Papillary Serous Cystadenoma (1).]

Fig. 4. The derivation of a disjoint partial-area taxonomy: (a) An excerpt of 11 neoplasm concepts from the area with the role types Disease Excludes Abnormal Cell, Disease Excludes Finding, Disease Has Abnormal Cell, Disease Has Finding, Disease Has Normal Cell Origin, and Disease Has Normal Tissue Origin. These 11 concepts are distributed over three partial-areas enclosed by three differently colored dotted boxes. Some concepts appear in several dotted boxes. (b) The area roots and disjoint roots are colored. Area roots have a single color and disjoint roots have multiple striped colors according to the colors of their multiple ancestor area roots. (c) The disjoint partial-area taxonomy for the excerpt in (a). Disjoint partial-area nodes are color coded according to the colors of their roots. Disjoint partial-area nodes with the same number of colors are placed at the same level. The bold arrows represent child-of links between disjoint partial-area nodes. Details of their derivation were previously explicated [13]. There may be child-of relationships between disjoint partial-area nodes at the same level.

Fig. 4 illustrates the derivation of the disjoint partial-area taxonomy for an excerpt of 11 concepts in the area with the six role types Disease Excludes Abnormal Cell, Disease Excludes Finding, Disease Has Abnormal Cell, Disease Has Finding, Disease Has Normal Cell Origin, and Disease Has Normal Tissue Origin. Fig. 4(a) shows the complicated inner structure of the area. The three roots, e.g., Serous Neoplasm, are again marked by bold frames. The dotted blue box contains the five concepts of the partial-area Serous Neoplasm (5). The partial-area Papillary Cystic Neoplasm (6) is enclosed by a red dotted box. Two concepts are clearly within the green, blue and red dotted boxes: Papillary Serous Cystadenoma and its child Borderline Papillary Serous Cystadenoma. Four kinds of concepts need to be defined to understand the disjoint partial-area taxonomy.

Definition: Area root. An area root has no parents in its own area. In Fig. 3 we referred to “area roots” simply as roots. However, now a finer distinction has become necessary. In Fig. 4(a), Cystadenoma is an area root.

Definition: Disjoint root. A disjoint root is a special kind of overlapping concept. A disjoint root has at least two parents. Furthermore, at least two parents must be different area roots or descendants of different sets of area roots. In Fig. 4(a), Serous Cystadenoma and Papillary Cystadenoma are two examples of disjoint roots. Whenever we refer to either an area root or a disjoint root, we will use the term root concept.

Definition: Non-root overlapping concept. An overlapping concept that is not a root concept. Such a concept is a descendant of only one disjoint root. In Fig. 4(a) Clear Cell Papillary Cystadenoma is a non-root overlapping concept.

Definition: Non-root, non-overlapping concept. Such a concept is neither a root concept nor an overlapping concept. Such concepts are descendants of only one area root. In Fig. 4(a) Borderline Cystadenoma is a non-root, non-overlapping concept. A non-overlapping concept may thus be either an area root or a non-root, non-overlapping concept.

In Fig. 4(b), the root concepts are colored according to the (multiple) color(s) of the dotted boxes containing them in Fig. 4(a). Both kinds of non-root concepts appear in white, to distinguish between root concepts and non-root concepts. Thus, Serous Cystadenoma that appears in the blue and green dotted boxes in Fig. 4(a) appears itself as blue/green in Fig. 4(b). In Fig. 4(b), the disjoint roots appear “striped” in different colors. Two disjoint roots may be striped in the same way. For example, Serous Cystadenoma and Borderline Serous Cystadenoma are both disjoint roots that have the same stripe pattern. We note that the meaning of fill color in Fig. 4(b) is also different from the meaning in Fig. 3. Now the area itself remains white and every root concept in the area is assigned a color (combination) in Fig. 4(b).

For further clarification, we will elaborate on Fig. 4(b). The three concepts in the first row (=Level 1) in different solid colors (one blue, one green, one red) are the area roots. The concept Borderline Cystadenoma (white fill, at Level 2) and the three area root concepts are the non-overlapping concepts. The concept Serous Cystadenoma (Level 2) is a child of two area root concepts. Previously, in the partial-area taxonomy, this concept would have been counted twice, namely once in the partial-area defined by each of the two area roots. This created an unwanted ambiguity. However, in the disjoint partial-area taxonomy, Serous Cystadenoma is


complex than the area roots. It was indeed demonstrated that the error rate of overlapping concepts is statistically significantly higher than that of non-overlapping concepts for the following ontologies: the Specimen subhierarchy [16] and the Bleeding subhierarchy [17] of SNOMED CT, the Uberon ontology [18], and the Gene subhierarchy of NCIt [19] (“the four ontologies”).

In this paper, we demonstrate three impacts of the complexity of overlapping concepts.

1. We start by formulating the following hypothesis.

Hypothesis 1. Overlapping concepts are likely to have statistically significantly more errors than non-overlapping concepts.

The motivation for Hypothesis 1 is that the modeling of overlapping concepts is more challenging due to their complexity, and thus more errors are expected than for non-overlapping concepts. Hypothesis 1 was previously confirmed for the above-mentioned four ontologies. It will be demonstrated for the fifth and sixth ontologies below. A natural question is for how many ontologies of a family of similar ontologies such a hypothesis needs to be confirmed in order to guarantee that it is true for at least half of the ontologies of this family [14]. We derived from the binomial distribution that it is sufficient to show the correctness of Hypothesis 1 for six out of six or for eight out of nine ontologies of the family [14]. To achieve six out of six, we demonstrate Hypothesis 1 for two more ontologies, the Neoplasm subhierarchy of NCIt and the Infectious Disease subhierarchy of SNOMED CT.

2. We explore, with an extensive example, the power of an erroneous overlapping concept to serve as a trigger for discovering many other similar errors in its vicinity. The complex semantics derived from the semantics of such a concept’s multiple ancestors triggers the discovery of errors in other related concepts that are not overlapping. Hence, auditing overlapping concepts may help with finding more errors in other concepts that do not carry the complex semantics of overlapping concepts.

3. We illustrate, with an example, the possible simplification of an ontology achieved by transforming overlapping concepts into non-overlapping concepts, by applying error corrections to the former. Hence, discovering overlapping concepts not only helps to detect and correct errors, but also further simplifies the ontology by reducing its complexity, making it more maintainable and easier to use.
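The thresholds "six out of six" and "eight out of nine" can be reproduced with a short calculation. The sketch below assumes one particular reading of the argument: the null hypothesis is that the hypothesis holds for a given ontology with probability 0.5, and it is rejected when the one-sided binomial tail probability falls below 0.025 (equivalent to a two-sided test at the 0.05 level). This reading is an assumption, chosen because it yields exactly these two thresholds; the actual derivation is given in [14] and may differ. The function names are illustrative.

```python
from math import comb


def upper_tail(successes: int, trials: int, p: float = 0.5) -> float:
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


def min_trials(failures_allowed: int, alpha: float = 0.025,
               p: float = 0.5, max_trials: int = 50):
    """Smallest family sample size n for which (n - failures_allowed)
    confirmations out of n have a tail probability below alpha.

    alpha = 0.025 is an ASSUMED threshold, not a value stated in the paper.
    """
    for n in range(failures_allowed + 1, max_trials + 1):
        if upper_tail(n - failures_allowed, n, p) < alpha:
            return n
    return None


if __name__ == "__main__":
    print(min_trials(0))  # all ontologies confirm the hypothesis
    print(min_trials(1))  # one ontology fails to confirm it
```

Under this assumed threshold, five out of five confirmations (tail probability 1/32) and seven out of eight (9/256) both fall short, which is why the sketch lands on six and nine.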

promoted to become a disjoint root of its own partial-area. This has the following reason. Serous Cystadenoma inherits multiple semantics from its parent concepts and therefore deserves to be in its own disjoint partial-area, since it has different semantics than any of its parents. The same extraction process is repeated for other concepts that have two or more ancestors that are root concepts. Note that these root concepts may be either area roots or disjoint roots. Thus, Borderline Papillary Cystadenoma is a descendant of one root Cystadenoma through its left parent. Through its right parent it is a descendant of two area roots and also a descendant of the disjoint root Papillary Cystadenoma. As it is a descendant of two area roots, it is shown in their two colors. As hinted above, the concept Clear Cell Papillary Cystadenoma (Level 3) is not a root, however, it is an overlapping concept, as it inherits from Cystadenoma and from Papillary Cystic Neoplasm, through its parent Papillary Cystadenoma. Because it is a non-root, its fill “color” is white. It is necessary to perform the above extraction operations recursively at every level of the partial-area taxonomy, because a concept could be a child of multiple disjoint roots. Papillary Serous Cystadenoma (at Level 3) is a child of two concepts that have now become disjoint roots. Thus, it would have to appear in both disjoint partial-areas of these two disjoint roots. To avoid this, the same method of extraction of a joint concept is applied again one level down, and Papillary Serous Cystadenoma is itself promoted to a disjoint root. The recursive extraction process is complex and is beyond the scope of this paper. For details, see work of Wang et al. [23]. Fig. 4(b) still represents concepts, just as Fig. 4(a). To arrive at the final disjoint partial-area taxonomy in Fig. 4(c), three more steps have to be taken. The disjoint partial-area taxonomy consists of nodes connected by child-of links. 
Each node represents a root concept in Fig. 4(b) with its non-root descendants. Thus, all non-root concepts have to be deleted, because they are summarized by the nodes representing their ancestor root concepts. After all, the whole purpose of deriving a disjoint partial-area taxonomy is to achieve a degree of unambiguous summarization. Any is-a link pointing to a deleted concept is redirected to the node that will represent (i.e., summarize) the deleted concept. Thus, the two uncolored concepts of Fig. 4(b) are eliminated in Fig. 4(c). A disjoint partial-area is named after its root concept followed by the number of its concepts in (). This extensive Background Section was necessary to make the quality assurance method comprehensible and clarify Table 3 in the Results Section. With the disjoint partial-area taxonomy in place, we can embark on the quality assurance studies of the Neoplasm subhierarchy and the Infectious Disease subhierarchy.
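The four kinds of concepts defined for Fig. 4 can be illustrated with a small sketch. It is not the OAF implementation and does not perform the recursive extraction of Wang et al. [23]; it only applies the definitions directly to the 11-concept excerpt. The is-a edges are reconstructed from the prose, and the parents marked "assumed" are not named in the text, so they should be read as hypothetical. All function names are illustrative.

```python
from collections import defaultdict

# child -> parents inside the area of Fig. 4 (reconstruction; see lead-in)
PARENTS = {
    "Serous Neoplasm": [],
    "Cystadenoma": [],
    "Papillary Cystic Neoplasm": [],
    "Serous Cystadenoma": ["Serous Neoplasm", "Cystadenoma"],
    "Borderline Cystadenoma": ["Cystadenoma"],  # assumed
    "Papillary Cystadenoma": ["Cystadenoma", "Papillary Cystic Neoplasm"],
    "Borderline Serous Cystadenoma": [
        "Serous Cystadenoma", "Borderline Cystadenoma"],  # assumed
    "Borderline Papillary Cystadenoma": [
        "Borderline Cystadenoma", "Papillary Cystadenoma"],  # assumed
    "Clear Cell Papillary Cystadenoma": ["Papillary Cystadenoma"],
    "Papillary Serous Cystadenoma": [
        "Serous Cystadenoma", "Papillary Cystadenoma"],
    "Borderline Papillary Serous Cystadenoma": [
        "Papillary Serous Cystadenoma",
        "Borderline Serous Cystadenoma",     # assumed
        "Borderline Papillary Cystadenoma",  # assumed
    ],
}


def root_sets(parents):
    """Map each concept to the set of area roots it descends from.
    An area root (no parents in the area) maps to itself."""
    memo = {}

    def visit(concept):
        if concept not in memo:
            ps = parents[concept]
            if not ps:
                memo[concept] = frozenset([concept])
            else:
                memo[concept] = frozenset().union(*(visit(p) for p in ps))
        return memo[concept]

    for concept in parents:
        visit(concept)
    return memo


def partial_areas(parents):
    """A partial-area is an area root plus its descendants in the area."""
    areas = defaultdict(set)
    for concept, roots in root_sets(parents).items():
        for root in roots:
            areas[root].add(concept)
    return dict(areas)


def classify(parents):
    """Label every concept with one of the four kinds defined in the text."""
    roots = root_sets(parents)
    kinds = {}
    for concept, ps in parents.items():
        if not ps:
            kinds[concept] = "area root"
        elif len(roots[concept]) == 1:
            kinds[concept] = "non-root, non-overlapping"
        elif len({roots[p] for p in ps}) >= 2:
            # parents descend from different sets of area roots
            kinds[concept] = "disjoint root"
        else:
            kinds[concept] = "non-root overlapping"
    return kinds


if __name__ == "__main__":
    for concept, kind in classify(PARENTS).items():
        print(f"{concept}: {kind}")
```

With these edges, the partial-areas of Serous Neoplasm and Papillary Cystic Neoplasm have five and six concepts, matching the labels Serous Neoplasm (5) and Papillary Cystic Neoplasm (6) in the text.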

3. Methods

3.1. QA methodology applicable to a member of a family of ontologies with similar structure

Ochs et al. [14] classified 373 biomedical ontologies from the NCBO BioPortal into 81 families according to their structural features and presented a family-based ontology quality assurance framework (BioPortal has significantly grown since this study). This framework reduces the manual QA efforts necessary for ontologies stored on BioPortal, because it is possible to derive abstraction networks for all ontologies in the same family with one single algorithm, followed by applying a single QA methodology to all members of the family. Ochs et al. [14], in their analysis of different families of ontologies, determined that seven subhierarchies of SNOMED CT, Uberon, and eleven subhierarchies (those with roles) of NCIt belong to the same family (of 76 ontologies) with object properties used only in restrictions and multiple inheritance allowed. (The term “object properties used only in restrictions” is taken from the framework of Ochs et al. [14], who were referring to ontologies hosted in BioPortal and represented in OWL. However, the native NCIt term for “object property” is “role.” In NCIt, all roles are used only in restrictions. Details were provided in previous work [14].) In our previous studies, we have demonstrated that overlapping concepts in the partial-area taxonomies for “the four ontologies” (enumerated in the Introduction) are more likely to exhibit errors than non-overlapping concepts. The Neoplasm subhierarchy of

In the long-range research of the SABOC team [44], a repeated theme in QA of ontologies has been that “complex” concepts tend to have a significantly higher error rate than “simple” concepts. There are various interpretations of “complex concept” for different methodologies and different ontologies. A likely explanation is that the human activity of modeling complex concepts is more challenging and thus there is more room for errors in the modeling of a complex concept. Overlapping concepts are complex, because they derive semantics from two or more concepts, which are roots of partial-areas in the partial-area taxonomy. It is doubtful that when those source concepts were originally modeled, there was a plan to have (possibly many) other concepts inherit from several of such roots at the same time. For example, the concept Papillary Serous Cystadenoma in Fig. 4(a) inherits semantics from three area roots Serous Neoplasm, Cystadenoma, and Papillary Cystic Neoplasm. The complex semantics of this concept is manifested by including in its name words from all the area roots’ names that it is “under.” Both of its two parents Serous Cystadenoma and Papillary Cystadenoma are themselves overlapping concepts. Hence, from the viewpoint of hierarchy complexity, Papillary Serous Cystadenoma is more complex than its two parents, which in turn are more


inferred view of the subhierarchy was generated. In this study, we concentrate on all the inferred changes made to the Infectious Disease subhierarchy between the January 2015 release and the July 2015 release. During this period, 4308 concepts were changed. Any time a concept changes during such a remodeling process it is apparent that the concept was previously erroneous. A similar idea was extensively used by Ceusters et al. [25] and by Zhang et al. [26]. In evaluating Hypothesis 1 for the SNOMED CT Infectious Disease subhierarchy, we considered only “severe” and “moderate” errors, ignoring “non-critical” errors, just as was done for the Neoplasm subhierarchy. Since there is no domain expert involved in determining what is considered a severe or moderate error, the judgment of what makes an error “severe” or “moderate” has to be arrived at indirectly. Previous feedback of ontology curators has indicated that commission errors are considered more severe than omission errors, because commission errors indicate that some part of the modeling of a concept is outright wrong. Omissions are sometimes done on purpose by ontology curators, because there is no use case for the omitted information. Such errors are generally considered non-critical. For this study, we generated a sample containing all the overlapping Infectious Disease concepts and a random control sample consisting of an equal number of non-overlapping concepts from the Infectious Disease subhierarchy. To assure a fair comparison, the control concepts were randomly taken from the same areas as the overlapping concepts. Since concepts in small partial-areas are prone to have more errors, the control population excluded such concepts as a confounding factor. We calculated the two-tailed p-value of Fisher’s exact test [48] to evaluate the statistical significance of the different error rates for overlapping Infectious Disease concepts and for non-overlapping Infectious Disease concepts.
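The release-comparison idea described above can be sketched in a few lines. The data layout (a dictionary from concept identifier to a set of inferred relationships), the concept identifiers, and the relationship values are all hypothetical; they do not reflect the SNOMED CT release format. The `change_kind` heuristic is likewise only an assumption about how commission and omission errors might be told apart from a diff; the paper does not state the rule that was used.

```python
def changed_concepts(old_release, new_release):
    """Concepts present in both releases whose inferred definition differs.
    Each release maps a concept id to a frozenset of (attribute, target)
    pairs, with is-a parents included as ("is-a", parent)."""
    shared = old_release.keys() & new_release.keys()
    return {c for c in shared if old_release[c] != new_release[c]}


def change_kind(old_def, new_def):
    """HYPOTHETICAL heuristic: a removed relationship suggests that the old
    modeling asserted something wrong (commission); a pure addition suggests
    that something was missing (omission)."""
    if old_def - new_def:
        return "commission"
    if new_def - old_def:
        return "omission"
    return "unchanged"


def error_rate(sample, erroneous):
    """Fraction of the sampled concepts that are flagged as erroneous."""
    sample = set(sample)
    return len(sample & erroneous) / len(sample)


# Toy releases with made-up identifiers and relationships.
january = {
    "C1": frozenset({("is-a", "P1"), ("Causative agent", "Virus")}),
    "C2": frozenset({("is-a", "P1")}),
    "C3": frozenset({("is-a", "P2")}),
}
july = {
    "C1": frozenset({("is-a", "P1"), ("Causative agent", "Bacterium")}),
    "C2": frozenset({("is-a", "P1"), ("Finding site", "Lung")}),
    "C3": frozenset({("is-a", "P2")}),
}

if __name__ == "__main__":
    for concept in sorted(changed_concepts(january, july)):
        print(concept, change_kind(january[concept], july[concept]))
```

In a study of this kind, `error_rate` would be evaluated once for the overlapping sample and once for the control sample, using only the changes judged to be severe or moderate.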

NCIt has the same structural features (object properties used only in restrictions and multiple inheritance) as “the four ontologies.” Thus, in Section 3.1.1 we apply the same methodology to it. In Section 3.1.2 we apply a different methodology to derive results for the sixth ontology, the Infectious Disease subhierarchy of SNOMED CT, where overlapping concepts also exhibit more errors than control concepts.

3.1.1. Methodology for the Neoplasm subhierarchy

The disjoint partial-area taxonomy derived from the Neoplasm subhierarchy provides the theoretical framework to easily distinguish between overlapping concepts and non-overlapping concepts. To do this in practice requires software support. Ochs et al. [45] described the Ontology Abstraction Framework (OAF) software tool for deriving various abstraction networks and the overlapping concepts. In this study, we consider overlapping concepts as complex and non-overlapping concepts as simple; the latter serve as a source of control group concepts. We investigate Hypothesis 1 for the Neoplasm subhierarchy of the Disease, Disorder or Finding subhierarchy of NCIt. Hypothesis 1 is of practical importance. If Hypothesis 1 is confirmed with statistical significance, then the disjoint partial-area taxonomy can be viewed as a fully automatic screening test that identifies concepts with a higher error yield than other neoplasm concepts. The error yield is defined by the ratio of the number of discovered errors to the number of reviewed concepts. Thus, it is justified to invest QA resources, such as the time of domain experts, into a careful review of overlapping concepts.

We conducted a randomized controlled trial on a sample of neoplasm concepts to evaluate Hypothesis 1. The Neoplasm disjoint partial-area taxonomy was generated by our OAF software tool [45]. It contains exactly 225 overlapping concepts, which we used as the study concepts.
We randomly picked a sample of 350 non-overlapping concepts from the same areas that the study concepts came from, as a control group. Since concepts in small partial-areas are prone to have more errors, as shown in previous studies [7,46,47], the control population excluded such concepts as a confounding factor. The study concepts and control group concepts were combined into a list. The order of the concepts in the list was randomized and the resulting list was presented to two domain experts for review. The two domain experts (GE) and (YC) were trained in medicine and have extensive ontology QA experience. The QA study consisted of three steps. First, the two experts reviewed the 575 concepts independently. Each of the reviewers generated a report of errors with reasons, error severities (moderate or severe) and suggested corrections. Non-critical errors were not reported. In the second step, we created a combined list of errors reported by the two experts in the first step and presented the combined list to the same two reviewers. They had to express agreement or disagreement with each error in the list. We did not include the information of who had marked a concept as erroneous in the combined list, although we cannot exclude that reviewers recognized concepts that they had previously reported as erroneous. In the third step, we eliminated all concepts that were considered erroneous by only one reviewer in the second step. Concepts on the list were then divided according to whether they came from the study group (overlapping concepts) or from the control group (non-overlapping concepts) and the numbers of errors were counted. We calculated the two-tailed p-value of Fisher’s exact test [48] to evaluate the statistical significance of the different error rates for overlapping concepts and for non-overlapping concepts.
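The review protocol described above (random control sample, shuffled combined list, and a consensus step that keeps only concepts both reviewers agree on) can be sketched as follows. This is a minimal illustration and not the authors' tooling. The function names, the seed, and the toy identifiers are hypothetical, and the control pool is assumed to be already restricted to the same areas as the study concepts, with concepts from small partial-areas excluded.

```python
import random


def build_review_list(study, control_pool, n_control, seed=0):
    """Draw a random control sample and return a shuffled review list,
    so that reviewers cannot tell study concepts from control concepts
    by their position in the list."""
    rng = random.Random(seed)
    control = rng.sample(sorted(control_pool), n_control)
    review = sorted(study) + control
    rng.shuffle(review)
    return review, set(control)


def confirmed_errors(agreed_by_first, agreed_by_second):
    """Keep only concepts that both reviewers accept as erroneous in the
    second round; concepts supported by one reviewer only are dropped."""
    return set(agreed_by_first) & set(agreed_by_second)


def split_counts(confirmed, study):
    """Number of confirmed erroneous concepts in the study group and in
    the control group, respectively."""
    in_study = len(confirmed & set(study))
    return in_study, len(confirmed) - in_study


if __name__ == "__main__":
    study = {f"S{i}" for i in range(5)}     # toy overlapping concepts
    pool = {f"N{i}" for i in range(20)}     # toy non-overlapping concepts
    review, control = build_review_list(study, pool, n_control=8)
    confirmed = confirmed_errors({"S1", "S2", "N3"}, {"S1", "N3", "N7"})
    print(len(review), split_counts(confirmed, study))
```

The two counts returned by `split_counts`, together with the group sizes, form the contingency table to which Fisher’s exact test is applied.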

3.2. Erroneous overlapping concepts trigger discovery of related non-overlapping erroneous concepts

Granulosa cell tumors of the ovary usually have a favorable outcome, but should be treated as malignant neoplasms [50]. They may be hormonally active and secrete sex steroids, such as estrogen, that may be related to their clinical presentation. Two forms, adult and juvenile, are recognized. The NCIt concept Ovarian Granulosa Cell Tumor was amongst our test sample of overlapping concepts. The concept has three children: Adult Type Ovarian Granulosa Cell Tumor, Juvenile Type Ovarian Granulosa Cell Tumor, and Malignant Ovarian Granulosa Cell Tumor. All four concepts have the property Neoplastic Status: Malignant. However, Ovarian Granulosa Cell Tumor and its three children, unlike many other malignant process concepts in NCIt, lack any indication of their malignancy status through a role target, and only Malignant Ovarian Granulosa Cell Tumor has an is-a relationship to Malignant Granulosa Cell Tumor. In contrast, Ovarian Lymphoma has an is-a relationship to Malignant Ovarian Neoplasm, as well as the Disease Has Abnormal Cell role with the target Malignant Cell. We propose that a similar modeling should be applied to Ovarian Granulosa Cell Tumor and, as a result, be inherited by its children.

Moreover, as indicated above, ovarian granulosa cell tumors often secrete estrogen and may result in clinical manifestations such as abnormal uterine bleeding, endometrial hyperplasia or even uterine cancer as a result of prolonged exposure to tumor-derived estrogen. We propose that such potential symptoms should be added as the targets Metrorrhagia, Postmenopausal Hemorrhage and Endometrial Hyperplasia for the role Disease May Have Finding, and as the target Uterine Neoplasm for the role Disease May Have Associated Disease.
Additionally, when considering the child concept Malignant Ovarian Granulosa Cell Tumor, it becomes clear that the name implies “An aggressive granulosa cell tumor that arises from the ovary and metastasizes to other anatomic sites,” as indicated by the concept definition, rather than a histological characterization of a neoplasm. This may be a cause for ambiguity, as the concept does not have any more granular

3.1.2. Methodology for the SNOMED CT Infectious Disease subhierarchy

During the year 2015, editors of SNOMED International conducted a project of remodeling the Infectious Disease subhierarchy of SNOMED CT. Details of this work were published by Ochs et al. [24]. Due to scheduling difficulties, the project was not completed. In the process they remodeled the stated concepts, and by using a classifier [49] the


Hence, after correction by adding the missing role in NCIt, these two concepts (in fact three concepts, including the other child Anterior Pituitary Gland Neoplasm due to inheritance) in the newly derived corresponding disjoint partial-area taxonomy appear in the bottom area with four role types in Fig. 5(b). The name of the added role in NCIt is again italicized. Furthermore, the two partial-areas in Fig. 5(a) Pituitary Gland Neoplasm (3) and Recurrent Anterior Pituitary Gland Neoplasm (1) are merged into a new partial-area Pituitary Gland Neoplasm (4) in Fig. 5(b). The three concepts of the partial-area Pituitary Gland Neoplasm (3) in Fig. 5(a) are not overlapping concepts anymore in the new area in Fig. 5(b). Specifically, Pituitary Gland Neoplasm became an area root in the new area. That is, after the correction these three concepts are not “complex” anymore, because they are not overlapping concepts since they are in a separate area with one root. Fig. 5 demonstrates that the corrections of erroneous overlapping concepts may transform overlapping concepts into non-overlapping concepts. Thus, the complexity of the disjoint partial-area taxonomy is reduced. For example, in Fig. 5(b) this is expressed by the elimination of one disjoint partial-area (Pituitary Gland Neoplasm) in the disjoint partial-area taxonomy, leading to a simpler summary. Hence, correcting erroneous overlapping concepts may reduce the complexity of the ontology. The simplification in Fig. 5 is expressed by eliminating the “striped” node of Fig. 5(a) when generating Fig. 5(b). This reduces the total number of boxes and makes it unnecessary to color any of the partial-area nodes.

descendants to indicate whether it is a malignant behavior of the adult-type or the juvenile-type ovarian granulosa cell tumor. Indeed Malignant Ovarian Granulosa Cell Tumor has three parents, two of which have “malignant” in their name: Malignant Granulosa Cell Tumor and Malignant Ovarian Sex Cord-Stromal Tumor. Based on their NCIt definitions, the first refers to an aggressive behavior by the tumor whereas the latter is more histological, further contributing to possible ambiguity. Therefore, we suggest that Malignant Ovarian Granulosa Cell Tumor may be a redundant concept that might be considered for elimination. The above example demonstrates how an intensive review process, initiated by an algorithmic suggestion of a concept, can result in numerous modeling improvements for many concepts, either through inherited changes to descendant concepts or through observations regarding related neighboring concepts. Such errors were not counted when computing the statistical significance in our study, although they contribute to the error yield.

3.3. Correcting erroneous overlapping concepts may reduce the complexity of an ontology

Fig. 5 shows an interesting error case, in which the corrections of three erroneous overlapping concepts transform them into non-overlapping concepts in another area, by adding a new role Disease Has Primary Anatomic Site suggested by our domain experts. Fig. 5(a) shows an excerpt of the disjoint partial-area taxonomy consisting of three disjoint partial-areas for the area with the three role types Disease Excludes Primary Anatomic Site, Disease Has Abnormal Cell, Disease Has Associated Anatomic Site and the area with an additional role Disease Has Primary Anatomic Site (italic and underline).
Notably, there is a child-of link between the two partial-areas Recurrent Anterior Pituitary Gland Neoplasm (1) and Pituitary Gland Neoplasm (3) because the concept Recurrent Anterior Pituitary Gland Neoplasm is a child concept of Anterior Pituitary Gland Neoplasm in the partial-area Pituitary Gland Neoplasm (3). Fig. 5 follows the graphical convention for disjoint partial-area taxonomies of Fig. 4(c). The two concepts Pituitary Gland Neoplasm and its child Posterior Pituitary Gland Neoplasm were two overlapping concepts in the audited sample (Fig. 5(a)). Our auditors reported that both concepts missed the role Disease Has Primary Anatomic Site with the target Pituitary Gland.

[Fig. 5: panels (a) and (b). Panel (a) contains the nodes Brain Neoplasm (17), Skull Base Neoplasm (2), Pituitary Gland Neoplasm (3), and Recurrent Anterior Pituitary Gland Neoplasm (1); panel (b) contains Brain Neoplasm (17), Skull Base Neoplasm (2), and Pituitary Gland Neoplasm (4).]

Fig. 5. Simplification of the complexity of the disjoint partial-area taxonomy due to correction of overlapping concepts: (a) Excerpt from disjoint partial-area taxonomy before correction of three erroneous overlapping concepts in the partial-area Pituitary Gland Neoplasm (3) with the error “missing the role Disease Has Primary Anatomic Site”; (b) after correction by adding the missing role (italic and underline) to the three erroneous overlapping concepts. The two partial-areas in Fig. 5(a) Pituitary Gland Neoplasm (3) and Recurrent Anterior Pituitary Gland Neoplasm (1) are merged together to become a new partial-area Pituitary Gland Neoplasm (4), because Recurrent Anterior Pituitary Gland Neoplasm (1) is child-of Pituitary Gland Neoplasm (3). All three partial-areas are not colored, since they do not contain overlapping concepts.

4. Results

4.1. Results for the Neoplasm subhierarchy of NCIt

We first derived the partial-area taxonomy for the Neoplasm subhierarchy of the February 2015 release of NCIt. (Reminder: This is the precursor of the disjoint partial-area taxonomy.) The 8166 neoplasm concepts are summarized by 920 areas and 4824 partial-areas in this partial-area taxonomy. The partial-area taxonomy for the complete Disease, Disorder or Finding subhierarchy of 25,360 concepts contains 986 areas and 5080 partial-areas. Comparing the numbers of areas and partial-areas for the Neoplasm subhierarchy versus the whole Disease, Disorder or Finding subhierarchy, 95% (4824/5080) of the partial-areas in the Disease, Disorder or Finding partial-area taxonomy summarize the neoplasm concepts, which account for only 32% (8166/25,360) of the complete Disease, Disorder or Finding subhierarchy. The remaining 68% of the subhierarchy are covered by only 5% of the partial-areas.

In order to perform a direct quantitative comparison, we define the abstraction ratio of a partial-area taxonomy as the average number of concepts summarized per partial-area. The abstraction ratio for the Neoplasm subhierarchy is 1.69 (=8166/4824) with a standard deviation of 6.49, while the abstraction ratio for the whole Disease, Disorder or Finding subhierarchy is 4.99 (=25,360/5080) with a standard deviation of 201.55. A lower number is indicative of more structural and semantic diversity, which is the result of detailed modeling efforts. The structural diversity is due to the large average number (23) of roles per neoplasm concept, since every combination of roles defines a different area. Thus, the structural diversity is reflected in the large number of areas. The semantic diversity is borne out by the many partial-areas.

The partial-area taxonomy for the complete Disease, Disorder or Finding subhierarchy has 396 overlapping concepts. Among those, 225 overlapping concepts are in the Neoplasm partial-area taxonomy, and they appear in 45 areas. To how many partial-areas do overlapping concepts belong? Most overlapping concepts are summarized by two partial-areas each. Only six overlapping concepts appear in three partial-areas simultaneously. There are six areas with more than 10 overlapping neoplasm concepts in the partial-area taxonomy.

We are now switching to the partial-area taxonomy of the Neoplasm subhierarchy. The largest area contains 137 partial-areas, 463 concepts, and 27 overlapping concepts. These overlapping concepts are distributed over 18 partial-areas. The second-largest area contains 100 partial-areas, 321 concepts and 25 overlapping concepts. These overlapping concepts are distributed over 24 partial-areas.
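The ratios quoted above follow directly from the published counts. The short sketch below only redoes that arithmetic; the function name is illustrative, and the standard deviations are not recomputed because they require the individual partial-area sizes, which are not listed here.

```python
def abstraction_ratio(n_concepts: int, n_partial_areas: int) -> float:
    """Average number of concepts summarized per partial-area."""
    return n_concepts / n_partial_areas


# Counts taken from the text (February 2015 release of NCIt).
NEOPLASM_CONCEPTS, NEOPLASM_PARTIAL_AREAS = 8166, 4824
DDF_CONCEPTS, DDF_PARTIAL_AREAS = 25360, 5080  # Disease, Disorder or Finding

if __name__ == "__main__":
    print(round(abstraction_ratio(NEOPLASM_CONCEPTS, NEOPLASM_PARTIAL_AREAS), 2))
    print(round(abstraction_ratio(DDF_CONCEPTS, DDF_PARTIAL_AREAS), 2))
    # share of partial-areas and share of concepts that belong to Neoplasm
    print(round(100 * NEOPLASM_PARTIAL_AREAS / DDF_PARTIAL_AREAS))
    print(round(100 * NEOPLASM_CONCEPTS / DDF_CONCEPTS))
```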
The two largest areas described above contain the two largest sets of overlapping concepts among all areas in the Neoplasm subhierarchy. Fig. 6 shows the disjoint partial-area taxonomy for the area with the six role types Disease Excludes Abnormal Cell, Disease Excludes Finding, Disease Has Abnormal Cell, Disease Has Finding, Disease Has Normal Cell Origin, and Disease Has Normal Tissue Origin that summarizes 98 concepts in 26 partial-areas. Of these 98, 20 concepts are overlapping concepts. The overlapping concepts appear in nine partial-areas. An excerpt of this disjoint partial-area taxonomy was also shown in Fig. 4(c), where the meaning of the color-coding was explained. In Fig. 6, Level 2 had to be distributed over two rows, as there are 15 disjoint partial-areas at this level that do not fit into one row.

For the QA study, we included all 225 overlapping neoplasm concepts as the study sample. As a control sample, we randomly selected 350 non-overlapping neoplasm concepts from the same areas that the study concepts were taken from. The two domain expert reviewers (GE) and (YC) agreed that 71 concepts (12.3% = 71/575) had errors of a “moderate” or “severe” error type. Among the 71 erroneous concepts, 36 concepts (16% = 36/225) were overlapping concepts in 16 areas, with 48 errors (1.33 = 48/36 errors per erroneous overlapping concept). In contrast, we observed 35 (10% = 35/350) non-overlapping concepts with 39 errors (1.11 = 39/35 errors per erroneous non-overlapping concept). Table 1 shows the area distribution of overlapping concepts and erroneous overlapping concepts. Table 2 is the contingency table for the p-value calculation between erroneous overlapping concepts and erroneous non-overlapping concepts. We calculated the two-tailed p-value of Fisher’s exact test [48] to evaluate the statistical significance of the study. The p-value is 0.0377 (p < 0.05), which means the study result has statistical significance. In other words, the overlapping concepts are significantly more likely to exhibit errors than non-overlapping concepts. Thus, Hypothesis 1 was supported by the results.

Of the 225 overlapping concepts, 195 came from disjoint partial-areas containing only one concept. The remaining 30 overlapping concepts came from disjoint partial-areas with at most four concepts. Altogether, only 18 overlapping concepts are non-roots. (See the Background Section and Fig. 4(b) for the definition and examples of disjoint roots.) Table 3 shows the error rate comparison between disjoint roots and non-root overlapping concepts. Out of the 36 erroneous overlapping concepts, two concepts (11.1% = 2/18) are non-root concepts and the other 34 concepts (16.4% = 34/207) are disjoint root concepts. The difference is statistically not significant (p = 0.7449).

We will now present example errors separately for overlapping concepts and non-overlapping concepts. Table 4 illustrates five examples of errors found in overlapping concepts with suggested corrections and reasons. There are two main error types for the overlapping concepts. There are 14 concepts with missing roles and 23 concepts with incorrect roles. The concept Pancreatic VIPoma has a missing role error and an incorrect role error at the same time.

Fig. 6. The disjoint partial-area taxonomy of the area with the six role types mentioned above. To reduce the density of the figure, the child-of links for the disjoint partial-areas at the second row of Level 2 are not shown. The graphical notation for disjoint partial-area taxonomies was explained for Fig. 4(c).

[Figure not reproduced. Nodes are arranged in Levels 1 to 3; node labels with concept counts, in extraction order: Adenocarcinoma (28), Mucinous Cystadenoma (1), Papilloma (7), Papillary Cystadenoma (2), Mucinous Adenocarcinoma (1), Oncocytic Neoplasm (3), Cystadenoma (2), Borderline Papillary Cystadenoma (1), Mucinous Cystadenocarcinoma (1), Papillary Mucinous Cystadenoma (1), Acinar Cell Neoplasm (2), Borderline Serous Cystadenoma (1), Mucinous Adenocarcinoma, Endocervical Type (1), Serous Cystadenoma (1), Oxyphilic Adenocarcinoma (1), Borderline Papillary Mucinous Cystadenoma (1), Mucinous Neoplasm (1), Benign Squamous Cell Neoplasm (1), Squamous Papilloma (1), Serous Cystadenocarcinoma (1), Papillary Serous Cystadenoma (1), Papillary Cystic Neoplasm (1), Inverted Squamous Cell Papilloma (1), Serous Adenocarcinoma (1), Serous Neoplasm (1), Acinar Cell Carcinoma (1), Micropapillary Serous Carcinoma (1), Borderline Papillary Serous Cystadenoma (1).]

Table 1 The distribution of overlapping concepts and erroneous overlapping concepts.

| # of overlapping concepts in an area | # of areas | # of areas with errors | # of erroneous concepts |
|---|---|---|---|
| 1 | 15 | 5 | 5 |
| 2 | 5 | 1 | 2 |
| 3 | 6 | 1 | 2 |
| 4 | 3 | 1 | 1 |
| 5 | 4 | 1 | 5 |
| 6 | 3 | 1 | 5 |
| 7 | 2 | 1 | 2 |
| 10 | 1 | 1 | 1 |
| 12 | 3 | 3 | 12 |
| 20 | 1 | 0 | 0 |
| 25 | 1 | 0 | 0 |
| 27 | 1 | 1 | 1 |
| Total | 45 | 16 | 36 |
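As a consistency check, the rows of Table 1 can be summed and compared with the totals quoted in the text (45 areas, 16 areas with errors, 36 erroneous overlapping concepts, 225 overlapping concepts, and six areas with more than 10 overlapping concepts). The following is a minimal sketch; the tuple layout and variable names are our own and only serve the illustration.

```python
# Each row: (overlapping concepts per area, # of areas,
#            # of areas with errors, # of erroneous concepts), as in Table 1.
TABLE_1 = [
    (1, 15, 5, 5), (2, 5, 1, 2), (3, 6, 1, 2), (4, 3, 1, 1),
    (5, 4, 1, 5), (6, 3, 1, 5), (7, 2, 1, 2), (10, 1, 1, 1),
    (12, 3, 3, 12), (20, 1, 0, 0), (25, 1, 0, 0), (27, 1, 1, 1),
]

total_areas = sum(n_areas for _, n_areas, _, _ in TABLE_1)
total_areas_with_errors = sum(n_err for _, _, n_err, _ in TABLE_1)
total_erroneous = sum(n_bad for _, _, _, n_bad in TABLE_1)

# Weighting each row by its area count recovers the number of overlapping concepts.
total_overlapping = sum(size * n_areas for size, n_areas, _, _ in TABLE_1)

# Areas holding more than 10 overlapping concepts.
large_areas = sum(n_areas for size, n_areas, _, _ in TABLE_1 if size > 10)

print(total_areas, total_areas_with_errors, total_erroneous,
      total_overlapping, large_areas)
```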


Table 6 is the contingency table for the p-value calculation distinguishing between erroneous overlapping and erroneous non-overlapping concepts. We count erroneous concepts that have commission errors, such as wrong parent, wrong role type, or wrong role target. A sample of commission errors of different kinds appears in Table 7. We calculated the two-tailed p-value of Fisher’s exact test [48] to evaluate the statistical significance of the study. The p-value is 0.0067 (p < 0.05), which means the study result has statistical significance. Thus, Hypothesis 1 was supported by the results. To summarize, the results support Hypothesis 1 for the Neoplasm subhierarchy of NCIt and for the Infectious Disease subhierarchy of SNOMED CT. Thus, Hypothesis 1 was supported for two more ontologies beyond “the four” previously investigated ontologies, fulfilling the six out of six requirement for this family.

Table 2 The 2×2 contingency table for erroneous overlapping neoplasm concepts and erroneous non-overlapping neoplasm concepts in NCIt (with a two-tailed p-value = 0.0377 < 0.05 by Fisher’s exact test).

| | # Erroneous concepts | # Concepts w/o errors | % Errors |
|---|---|---|---|
| Overlapping concepts | 36 | 189 | 16 |
| Non-overlapping concepts | 35 | 315 | 10 |

Table 3 The 2×2 contingency table for the error rate comparison between disjoint root concepts and non-root overlapping concepts (with a two-tailed p-value of 0.7449 by Fisher’s exact test).

| | # Erroneous concepts | # Concepts w/o errors | % Errors |
|---|---|---|---|
| Disjoint root concepts | 34 | 173 | 16.4 |
| Non-root overlapping concepts | 2 | 16 | 11.1 |
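The significance tests behind Tables 2 and 3 (and Table 6 for SNOMED CT) can be sketched with the Python standard library. The implementation below is an illustration, not the authors' code: it assumes the common two-tailed convention that sums the probabilities of all tables with the same margins that are no more likely than the observed one. If the authors used this convention, the results should be close to the reported values of 0.0377, 0.7449, and 0.0067. The function name is ours.

```python
from math import comb


def fisher_exact_two_tailed(a: int, b: int, c: int, d: int) -> float:
    """Two-tailed Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins whose probability does not exceed that of the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Probability that the top-left cell equals x, given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # A small relative tolerance guards against floating-point ties.
    total = sum(
        p for p in map(prob, range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-7)
    )
    return min(1.0, total)


# Rows: (erroneous, without errors), using the counts from the tables.
p_table2 = fisher_exact_two_tailed(36, 189, 35, 315)  # overlapping vs. control, NCIt
p_table3 = fisher_exact_two_tailed(34, 173, 2, 16)    # disjoint roots vs. non-roots
p_table6 = fisher_exact_two_tailed(76, 120, 50, 146)  # overlapping vs. control, SNOMED CT

print(f"Table 2: p = {p_table2:.4f}")
print(f"Table 3: p = {p_table3:.4f}")
print(f"Table 6: p = {p_table6:.4f}")
```

Exact integer binomial coefficients from `math.comb` avoid overflow and rounding problems that factorial-based floating-point formulas would have for samples of this size.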

5. Discussion

There are never enough (human) resources available for a complete quality assurance audit of any major ontology. Thus, we have been developing methods for identifying subsets of concepts that are most likely to be erroneous. However, developing a separate QA method for each of the hundreds of biomedical ontologies in BioPortal is daunting. As noted above, the characterization “overlapping concepts in a partial-area taxonomy” was previously successful for “the four ontologies” (of the Introduction) in finding increased error rates. In order to establish that the family-based methodology works for six out of six sample ontologies for the same family of similar ontologies in BioPortal, we successfully applied, in this paper, the methodology to the Neoplasm subhierarchy of NCIt and the Infectious Disease subhierarchy of SNOMED CT. In other words, the main point of this paper is not showing that our previous method works for two additional ontologies. The point is that we are showing that family-wide ontology quality assurance is possible.

Among the six ontologies, there are two from NCIt and three from SNOMED CT. However, we stress the differences between them. The NCIt Gene subhierarchy is different from the Neoplasm subhierarchy, since all the genes are modeled as leaves or as parents of leaves in cases where they have alleles. In contrast, diseases can appear anywhere in the Neoplasm subhierarchy. Regarding SNOMED CT, Specimen is a small subhierarchy, while Clinical Finding is the largest subhierarchy of SNOMED CT. It is two orders of magnitude larger than Specimen. Because it is so large, we reviewed two subhierarchies of it, the small Bleeding subhierarchy and the medium-sized Infectious Disease subhierarchy, to assess the validity of Hypothesis 1 for ontologies of different sizes.
The implication of confirming the efficacy of the above uniform QA methodology for six ontologies is that for at least half of the other ontologies in the substantial BioPortal family studied in this paper the error rate for overlapping concepts will be significantly higher than the error rate for non-overlapping concepts [14]. Hence, by concentrating QA efforts on overlapping concepts in the ontologies of that family, a higher QA yield is expected in terms of the number of concepts

We observed 21 non-overlapping concepts with missing role errors and 12 non-overlapping concepts with incorrect role errors. Our auditors also found other kinds of errors for non-overlapping concepts. These include incorrect parents, missing parents and incorrect neoplastic status. This is illustrated in Table 5. “Neoplastic status” is a data property for neoplasm concepts in NCIt [51], with possible values “Benign,” “Malignant,” “Precancerous,” “Uncertain Malignant Potential,” and “Undetermined.” It defines a neoplastic growth as non-cancerous, cancerous, or of uncertain cancerous potential. A discussion of data properties is beyond the scope of this paper. The concept Basophilic Adenocarcinoma has both a missing parent error and a missing role error.

4.2. Results for the Infectious Disease subhierarchy of SNOMED CT

The SNOMED CT Infectious Disease subhierarchy contained 6099 concepts in January 2015. The partial-area taxonomy for it contains 80 areas and 1305 partial-areas. It contains 196 overlapping concepts distributed over eight areas. The area with the most overlapping concepts has three role types Associated morphology, Finding site, and Pathological process with 665 concepts among which there are 83 overlapping concepts. The overlapping concepts were found by the Ontology Abstraction Framework (OAF) software tool [45]. The concepts that underwent a change between two releases were found by the SNOMED CT Visual Semantic Delta tool [52]. Both tools were developed previously as part of this research program. The concepts with commission errors were obtained from the sample of 196 overlapping concepts and from the control group of 196 randomly chosen non-overlapping concepts.

Table 4 Five examples of errors in overlapping concepts identified in the QA study.

| Concept | Error type | Correction | Reason |
|---|---|---|---|
| Childhood Central Nervous System Mature Teratoma | Incorrect role | Remove the role Disease Has Abnormal Cell with the target Malignant Cell | Mature Teratoma is a benign neoplasm |
| Occult Adenosquamous Lung Carcinoma | Incorrect role | Remove the role Disease Excludes Finding with the target No Evidence of Radiologic Finding or change the role to Disease Has Finding with the same target | According to the definition “The primary tumor is undetectable radiographically or during bronchoscopy” |
| Testicular Granulosa Cell Tumor | Missing role | Add the role Disease Has Normal Cell Origin with a more refined target Granulosa Cell | According to the definition “It is characterized by the presence of granulosa-like cells” |
| Pancreatic VIPoma | Missing role | Add the role Disease May Have Associated Disease with the target Multiple Endocrine Neoplasia Type 1 | This concept has the role Disease Mapped To Gene with the target MEN1 Gene |
| Stage IVA Oral Cavity Cancer | Missing role | Add the role Disease Is Stage with the target AJCC v7 Stage | According to the definition, it is an AJCC 7th stage concept |


Table 5 Three other error types identified in non-overlapping concepts of the QA study.

| Concept | Error type | Correction | Reason |
|---|---|---|---|
| Basophilic Adenocarcinoma | Missing parent | Add is-a link directed to Anterior Pituitary Gland Neoplasm | According to the definition “A malignant epithelial neoplasm of the anterior pituitary gland” |
| Papillary Hidradenoma | Incorrect parent | Replace the parent Benign Sweat Gland Neoplasm with Hidradenoma | Hidradenoma is more relevant |
| Gallbladder Goblet Cell Carcinoid | Incorrect neoplastic status | Change the value “Undetermined” to “Malignant” | According to the definition “An invasive mixed adenoneuroendocrine carcinoma of the gallbladder” |

Concept drift is another well-known issue. The periodic inclusion of new medical knowledge will require repeated reevaluation of even the best-designed ontologies. Thus, the recall value (R) of the errors found depends on the modeling quality and the prior QA regimen of the ontology itself. Our methodology guides the domain experts to review concepts that tend to have a relatively higher concentration of errors versus a control sample. However, the actual recall value depends on the quality of the ontology itself. For example, the error rate of overlapping concepts in the Specimen subhierarchy, the Bleeding subhierarchy, and the Infectious Disease subhierarchy of Clinical Finding in SNOMED CT are about 60%, 40%, and 39%, respectively [16,17]. Those rates are much higher than the 16% error rate found in the current study for NCIt. Furthermore, in the current study of Neoplasm concepts, we looked only for errors with a “moderate” or “high” severity. This was based on the feedback from the NCIt manager. Obviously, limiting reviewers’ attention only to such errors affects the error rate for this study. In both cases that we investigated in this study, the error rate (16%) of the overlapping concepts is 60% higher than the error rate (10%) for the control sample. Our two domain experts reported spending on average 500 min on reviewing 100 neoplasm concepts. To find 16 erroneous concepts in 100 overlapping neoplasm concepts would require 500 min. Using the same time for reviewing 100 non-overlapping neoplasm concepts would find on the average only ten erroneous concepts. To find 16 erroneous concepts one would need 60% more time or 800 min, i.e., five more hours, compared to the suggested methodology. We reiterate that for the study regarding SNOMED CT Infectious Disease, we did not use reviewers, but the post-mortem results of the remodeling performed by SNOMED CT editors. Thus, we have no data concerning auditing effort. 
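The effort estimate above follows from proportional reasoning, which can be sketched as follows. The error rates and the figure of 500 min per 100 concepts are taken from the text; the function name and the assumption that review time scales linearly with the number of concepts reviewed are ours, for illustration only.

```python
from fractions import Fraction

# Reviewers reported about 500 min per 100 neoplasm concepts.
MINUTES_PER_CONCEPT = Fraction(500, 100)


def minutes_to_find(target_errors: int, error_rate: Fraction) -> Fraction:
    """Expected review time needed to find `target_errors` erroneous concepts,
    assuming time grows linearly with the number of concepts reviewed."""
    concepts_needed = target_errors / error_rate
    return concepts_needed * MINUTES_PER_CONCEPT


overlapping_minutes = minutes_to_find(16, Fraction(16, 100))  # 16% error rate
control_minutes = minutes_to_find(16, Fraction(10, 100))      # 10% error rate

extra_hours = (control_minutes - overlapping_minutes) / 60
relative_extra = control_minutes / overlapping_minutes - 1

print(f"Overlapping sample: {overlapping_minutes} min")
print(f"Control sample: {control_minutes} min")
print(f"Extra effort: {extra_hours} h ({float(relative_extra):.0%} more time)")
```

Exact fractions are used instead of floats so that the quoted round numbers come out without rounding noise.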
However, independent of the low error rate in some ontologies, we showed in this paper in Section 3.2 a new phenomenon, namely that an erroneous overlapping concept can trigger the discovery of additional erroneous non-overlapping concepts in its vicinity. To provide more anecdotal evidence for this phenomenon, we will elaborate on another example triggering the finding of about 20 erroneous, non-overlapping concepts in the vicinity of the erroneous overlapping concept AIDS-Related Gastric Kaposi Sarcoma in NCIt. We present a description of the review process of this overlapping concept by the domain experts. This review illustrates the depth and breadth of the quality assurance practiced in this study. Such a review can help discover similar errors in related non-overlapping concepts. In this way, the reviewers can find more errors than given by the number of overlapping concepts. The domain expert tried to determine the root cause of the errors in the above overlapping concept AIDS-Related Gastric Kaposi Sarcoma, similar to what was done by Rector et al. [53]. He found that the origin was the grandparent Kaposi sarcoma (KS) (see Fig. 7). Kaposi sarcoma is a tumor associated with an oncogenic herpesvirus, the Kaposi sarcoma-associated herpesvirus (KSHV), also known as human herpesvirus 8 (HHV8) [54]. Despite its name, KS is in general not considered a true sarcoma, which is a tumor arising from mesenchymal tissue. The histogenesis of KS is not clear, with some researchers attributing it to lymphatic endothelial cells while others do not consider KS as a pure vascular tumor [55]. It is of note that in NCIt Sarcoma is a “far” ancestor

Table 6 The 2×2 contingency table for erroneous overlapping Infectious Disease concepts versus erroneous non-overlapping Infectious Disease concepts in SNOMED CT (with a two-tailed p-value = 0.0067 < 0.05 by Fisher’s exact test).

| | # Erroneous concepts | # Concepts w/o errors | % Errors |
|---|---|---|---|
| Overlapping concepts | 76 | 120 | 38.8 |
| Non-overlapping concepts | 50 | 146 | 25.5 |

Table 7 Different kinds of commission errors for overlapping versus non-overlapping concepts.

| Error kind | Overlapping concepts | Non-overlapping concepts |
|---|---|---|
| Wrong parent | Tuberculous enteritis | Tuberculous ascites |
| Wrong parent | Oculoglandular tularemia | Mumps nephritis |
| Wrong role type | Tuberculous peritonitis | Anal candidiasis |
| Wrong role type | Bullous staphylococcal impetigo | Bacterial peritonitis |
| Wrong target | Beta lactam resistant bacterial infection | Infection by Diplodinium |
| Wrong target | Superficial foreign body of anus without major open wound but with infection | Infection by Theileria parva |

(Table 5, reason for the concept Gallbladder Goblet Cell Carcinoid: According to the definition “An invasive mixed adenoneuroendocrine carcinoma of the gallbladder”)

identified as erroneous for a given number of reviewed concepts, exercising the best possible use of scarce human resources. Thus, when embarking on quality assurance for members of this family under resource constraints, overlapping concepts should be audited first. At the very least, all overlapping concepts should be audited for every member of this family of ontologies. Besides higher yield, another advantage of the family-based QA approach [14] is that it is supported by the OAF software tool [45] that finds the overlapping concepts for each ontology of the family, rather than having to develop algorithms separately for each member of the family. Hence, this methodology is semi-automatic, because the overlapping concepts are found automatically by the OAF software and the manual review is only performed for those concepts. Finding a method that prioritizes among the overlapping concepts would be beneficial for ontologies with many such concepts. Even though no statistical significance was established, the observed error rates in Table 3 suggest that whenever QA resources are insufficient to audit all overlapping concepts, the auditors should concentrate on the disjoint roots.

We note that the percentage of errors among the overlapping concepts can vary from one ontology to another and among different subhierarchies of the same ontology. After all, errors in ontologies are human-made, and some ontologies are better curated than others. Furthermore, large ontologies, like SNOMED CT with many large subhierarchies, were developed over many years by different teams, and these teams potentially used different modeling styles. In contrast, Neoplasm concepts in NCIt were modeled by one person, who worked as a curator for many years (personal communication). As a result, the error rate for NCIt Neoplasm concepts is expected to be low.
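To make the automatic step concrete, the following minimal sketch shows one way overlapping concepts could be identified. It is not the OAF implementation. It assumes the usual partial-area taxonomy definitions: an area groups the concepts that have exactly the same set of role types, a partial-area root is a concept with no parent in its own area, and a partial-area consists of a root and its descendants within the area. The toy hierarchy, the concept names, and the role name `role_x` are hypothetical.

```python
from collections import defaultdict


def find_overlapping_concepts(parents, role_types):
    """Return {concept: set of partial-area roots} for every concept that
    belongs to more than one partial-area of its area."""
    # An area groups all concepts that have exactly the same set of role types.
    areas = defaultdict(set)
    for concept, roles in role_types.items():
        areas[frozenset(roles)].add(concept)

    overlapping = {}
    for area in areas.values():
        roots_reaching = {}  # memo: concept -> roots of partial-areas containing it

        def roots_of(concept):
            if concept not in roots_reaching:
                in_area_parents = [p for p in parents.get(concept, ()) if p in area]
                if not in_area_parents:
                    # No parent inside the area: the concept is a partial-area root.
                    roots_reaching[concept] = {concept}
                else:
                    # Otherwise it belongs to every partial-area of its in-area parents.
                    roots_reaching[concept] = set().union(
                        *(roots_of(p) for p in in_area_parents)
                    )
            return roots_reaching[concept]

        for concept in area:
            roots = roots_of(concept)
            if len(roots) > 1:
                overlapping[concept] = roots
    return overlapping


# Hypothetical toy hierarchy: C has two parents (A and B) in the same area.
TOY_PARENTS = {"Top": [], "A": ["Top"], "B": ["Top"],
               "C": ["A", "B"], "D": ["C"], "E": ["A"]}
TOY_ROLE_TYPES = {"Top": set(), "A": {"role_x"}, "B": {"role_x"},
                  "C": {"role_x"}, "D": {"role_x"}, "E": {"role_x"}}

result = find_overlapping_concepts(TOY_PARENTS, TOY_ROLE_TYPES)
print(sorted(result))
```

In this toy hierarchy, C and D both belong to the partial-areas rooted at A and B. Under the definitions assumed here, C would correspond to a disjoint root and D to a non-root overlapping concept.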


Fig. 7. The modeling of Kaposi Sarcoma-related concepts in NCIt.

[Figure not reproduced. Concepts shown: Sarcoma, Malignant Blood Vessel Neoplasm, Virus-Related Sarcoma, Kaposi Sarcoma, Skin Kaposi Sarcoma, Lung Kaposi Sarcoma, Anal Kaposi Sarcoma, Cardiac Kaposi Sarcoma, Gastric Kaposi Sarcoma, AIDS-Related Gastric Kaposi Sarcoma. Role targets shown: Vascular System, Cardiovascular System, Blood Vessel, Connective and Soft Tissue, Dermis, Integumentary System, Skin, Respiratory System, Thorax, Lung, Heart, Anus, Digestive System, Stomach. Legend: concept; is-a relationship; is-a chain; Disease Has Primary Anatomic Site role; Disease Has Associated Anatomic Site role.]

This example, similar to the technique of Rector et al. [53], shows that discovering the root cause of a problem in an overlapping concept, highlighted by our methodology, can result in improved modeling quality of numerous descendants as well as ancestors. Note that these extra errors are not counted in assessing the statistical significance for the sample, although such a phenomenon increases the QA yield of the overlapping concept-based QA methodology. Table 2 supported Hypothesis 1 about a statistically significantly higher error rate for overlapping concepts, promising a higher QA yield for them. We also explored the minor hypothesis that among the overlapping concepts the disjoint roots would have more errors than non-root overlapping concepts. The motivation for such an exploration is that the complexity is increased in a disjoint root concept relative to the complexity of each parent due to inheriting information from multiple parents. In contrast, a non-root overlapping concept inherits only the complexity of its unique disjoint root concept. As noted above, in this and previous studies, we have found support for the fact that a concept of higher complexity is more likely to have errors. On the other hand, no statistically significant support could be established for the distinction between disjoint root concepts and non-root overlapping concepts. In Table 3, although the error rate of disjoint root concepts is higher than that of non-root overlapping concepts, there is no statistical significance with a two-tailed p-value of 0.7449 of Fisher’s exact test. This is due to the small number of non-root overlapping concepts. In Section 3.3, we showed an example that demonstrated the further strengthening of the connection among overlapping concepts, complexity of concepts, and error discovery.
Previously we showed that complex overlapping concepts are likely to have more errors, and in Section 3.2 we added that complex concepts can trigger the discovery of additional errors. The example in Section 3.3 showed that the correction of erroneous overlapping concepts could lead to a reduction in the complexity of an ontology, making it more comprehensible for editors and users. It was possible to “read off” this simplification from a disjoint partial-area taxonomy diagram.

of KS, rather than a direct parent, and the implied is-a relationship may be the result of an oversight. The NCIt concept AIDS-Related Gastric Kaposi Sarcoma and all other associated KS concepts have the two roles, Disease Has Associated Anatomic Site and Disease Has Primary Anatomic Site (among others), both of which have the targets Vascular System, as well as Cardiovascular System for the former role and Blood Vessel for the latter role, as shown in Fig. 7. NCIt defines the Disease Has Associated Anatomic Site role as “A role used to relate a disease to the general site, structure or system where the specific pathological process is located.” With the focus on the location of the tumor, despite the possible endothelial cell origin of KS, the various types of KS are associated with the skin, mucous membranes, and various visceral organs such as the stomach. Thus, we believe, associating the Vascular System and Cardiovascular System as anatomic sites for Gastric Kaposi Sarcoma is not appropriate and, at best, they should play a role as targets for potential roles dealing with the histogenesis of the disease. Likewise, NCIt defines the Disease Has Primary Anatomic Site role as “A role used to relate a disease to the anatomical site where the originating pathological process is located.” Thus, and in light of the lack of clarity around the histogenesis of KS, Vascular System and Blood Vessel may be questionable targets for the role. As indicated before, the target values of these roles are inherited from Kaposi Sarcoma, and cannot be blocked. For Kaposi Sarcoma the Disease Has Associated Anatomic Site role has targets Cardiovascular System, Connective and Soft Tissue, and Vascular System. The more specific role Disease Has Primary Anatomic Site has the targets Blood Vessel and Vascular System only. Malignant Blood Vessel Neoplasm is one of the three parents of Kaposi Sarcoma and the contributor for the abovementioned inherited values. 
Although KS is highly vascular, as discussed earlier, the is-a association with Malignant Blood Vessel Neoplasm may not be warranted. Removal of this is-a association, as shown in Fig. 8, will remove the questionable role target values from Kaposi Sarcoma and will prevent their inheritance down the hierarchy to more than 20 descendants (not all are shown in Fig. 7), such as Prostate Kaposi Sarcoma, Appendix Kaposi Sarcoma and others. Fig. 8 shows the corrected modeling.


Fig. 8. The remodeling of Kaposi Sarcoma-related concepts in order to correct errors: remove the concept Malignant Blood Vessel Neoplasm as a parent of Kaposi Sarcoma. Also remove Blood Vessel, Vascular System and Cardiovascular System as role target concepts, and the roles Disease Has Primary Anatomic Site and Disease Has Associated Anatomic Site pointing to these removed concepts.

[Figure not reproduced. Concepts shown: Sarcoma, Virus-Related Sarcoma, Kaposi Sarcoma, Skin Kaposi Sarcoma, Lung Kaposi Sarcoma, Anal Kaposi Sarcoma, Cardiac Kaposi Sarcoma, Gastric Kaposi Sarcoma, AIDS-Related Gastric Kaposi Sarcoma. Role targets shown: Connective and Soft Tissue, Dermis, Integumentary System, Skin, Respiratory System, Thorax, Lung, Heart, Anus, Digestive System, Stomach. Legend: concept; is-a relationship; is-a chain; Disease Has Primary Anatomic Site role; Disease Has Associated Anatomic Site role.]

6. Conclusions

In this paper, we derived the partial-area taxonomy and the disjoint partial-area taxonomy for the Neoplasm subhierarchy of the Disease, Disorder or Finding subhierarchy of NCIt. A sample of 575 neoplasm concepts consisting of overlapping concepts and non-overlapping concepts was selected from the Neoplasm disjoint partial-area taxonomy. In a three-step manual QA study of the sample, we found that overlapping concepts have a statistically significantly higher error rate than non-overlapping concepts (16% vs. 10%). The two most common error types of neoplasm concepts are missing role errors and incorrect role errors. The study supports the hypothesis that overlapping concepts tend to have more severe or moderate errors than control concepts, due to their complexity, the modeling of which is more challenging. We also conducted a study on the Infectious Disease subhierarchy of SNOMED CT with a sample of 196 overlapping concepts and 196 randomly chosen non-overlapping concepts from the Infectious Disease disjoint partial-area taxonomy, in which the overlapping concepts are again demonstrated to have statistically significantly more commission errors than non-overlapping concepts (38.8% vs. 25.5%). Hence, the results in this paper suggest that the methodology of reviewing overlapping concepts is an effective QA methodology for ontologies of one family of the BioPortal ontologies, as this methodology has been demonstrated successfully for six out of six ontologies in the chosen BioPortal family. This means that the overlapping concept methodology can be applied to the whole BioPortal family of 76 similar ontologies and is likely to be successful for at least half of the members of this family. This study also introduces two new useful features of overlapping concepts. First, an erroneous overlapping concept can help QA domain experts with finding additional erroneous non-overlapping concepts in the vicinity of this concept. Secondly, correcting overlapping concepts may lead to a simplification of the complexity of the ontology, measured by the number of overlapping concepts.

Competing interests

The authors declare that they have no competing interests.

Acknowledgement

Research reported in this publication was partially supported by the National Cancer Institute of the National Institutes of Health under Award Number R01CA190779. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

References

[1] D.L. Rubin, N.H. Shah, N.F. Noy, Biomedical ontologies: a functional perspective, Brief Bioinform. 9 (1) (2008) 75–90.
[2] O. Bodenreider, Biomedical ontologies in action: role in knowledge management, data integration and decision support, Yearb Med. Inform. (2008) 67–79.
[3] B. Smith, R.H. Scheuermann, Ontologies for clinical and translational research: introduction, J. Biomed. Inform. 44 (1) (2011) 3–7.
[4] R. Hoehndorf, P.N. Schofield, G.V. Gkoutos, The role of ontologies in biological and biomedical research: a functional perspective, Brief Bioinform. 16 (6) (2015) 1069–1080.
[5] F. Shen, Y. Lee, Knowledge discovery from biomedical ontologies in cross domains, PLoS One 11 (8) (2016) e0160005.
[6] S. Schulz, L. Jansen, Formal ontologies in biomedical knowledge representation, Yearb Med. Inform. 8 (2013) 132–146.
[7] H. Min, Y. Perl, Y. Chen, M. Halper, J. Geller, Y. Wang, Auditing as part of the terminology design life cycle, J. Am. Med. Inform. Assoc. 13 (6) (2006) 676–690.
[8] NCI Thesaurus. Available from: https://nciterms.nci.nih.gov/ncitbrowser/.
[9] G. Jiang, C.G. Chute, Auditing the semantic completeness of SNOMED CT using formal concept analysis, J. Am. Med. Inform. Assoc. 16 (1) (2009) 89–102.
[10] M. Halper, H. Gu, Y. Perl, C. Ochs, Abstraction networks for terminologies: supporting management of “big knowledge”, Artif. Intell. Med. 64 (1) (2015) 1–16.
[11] D. Wei, O. Bodenreider, Using the abstraction network in complement to description logics for quality assurance in biomedical terminologies – a case study in SNOMED CT, Stud. Health Technol. Inform. 160 (Pt 2) (2010) 1070–1074.
[12] R. Shearer, B. Motik, I. Horrocks, HermiT: A Highly-Efficient OWL Reasoner, in: Proceedings of the Fifth OWLED Workshop on OWL: Experiences and Directions, 2008.
[13] Y. Wang, M. Halper, H. Min, Y. Perl, Y. Chen, K.A. Spackman, Structural methodologies for auditing SNOMED, J. Biomed. Inform. 40 (5) (2007) 561–581.
[14] C. Ochs, Z. He, L. Zheng, J. Geller, Y. Perl, G. Hripcsak, et al., Utilizing a structural meta-ontology for family-based quality assurance of the BioPortal ontologies, J. Biomed. Inform. 61 (2016) 63–76.
[15] P.L. Whetzel, N.F. Noy, N.H. Shah, P.R. Alexander, C. Nyulas, T. Tudorache, et al., BioPortal: enhanced functionality via new Web services from the National Center for Biomedical Ontology to access and use ontologies in software applications, Nucl. Acids Res. 39 (Web Server issue) (2011) W541–W545.
[16] Y. Wang, M. Halper, D. Wei, H. Gu, Y. Perl, J. Xu, et al., Auditing complex concepts of SNOMED using a refined hierarchical abstraction network, J. Biomed. Inform. 45 (1) (2012) 1–14.
[17] C. Ochs, J. Geller, Y. Perl, Y. Chen, J. Xu, H. Min, et al., Scalable quality assurance for large SNOMED CT hierarchies using subject-based subtaxonomies, J. Am. Med. Inform. Assoc. 22 (3) (2015) 507–518.
[18] G. Elhanan, C. Ochs, J.L.V. Mejino Jr., H. Liu, C.J. Mungall, Y. Perl, From SNOMED CT to Uberon: transferability of evaluation methodology between similarly structured ontologies, Artif. Intell. Med. (2017).
[19] L. Zheng, H. Min, Y. Perl, J. Geller, Discovering Additional Complex NCIt Gene Concepts with High Error Rate, in: 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2017, pp. 653–657.
[20] M.Q. Stearns, C. Price, K.A. Spackman, A.Y. Wang, SNOMED clinical terms: overview of the development process and project status, AMIA Annu. Symp. Proc. (2001) 662–666.
[21] C.J. Mungall, C. Torniai, G.V. Gkoutos, S.E. Lewis, M.A. Haendel, Uberon, an integrative multi-species anatomy ontology, Genome Biol. 13 (1) (2012) R5.
[22] N. Sioutos, S. de Coronado, M.W. Haber, F.W. Hartel, W.L. Shaiu, L.W. Wright, NCI Thesaurus: a semantic model integrating cancer-related clinical and molecular information, J. Biomed. Inform. 40 (1) (2007) 30–43.
[23] Y. Wang, M. Halper, D. Wei, Y. Perl, J. Geller, Abstraction of complex concepts with a refined partial-area taxonomy of SNOMED, J. Biomed. Inform. 45 (1) (2012) 15–29.
[24] C. Ochs, J.T. Case, Y. Perl, Tracking the remodeling of SNOMED CT's bacterial infectious diseases, AMIA Annu. Symp. Proc. 2016 (2016) 974–983.
[25] W. Ceusters, J.P. Bona, Analyzing SNOMED CT's historical data: pitfalls and possibilities, AMIA Annu. Symp. Proc. 2016 (2016) 361–370.
[26] G.-Q. Zhang, Y. Huang, L. Cui, Can SNOMED CT changes be used as a surrogate standard for evaluating the performance of its auditing methods? AMIA Annu. Symp. Proc. (2017) 1886–1895.
[27] S. de Coronado, M.W. Haber, N. Sioutos, M.S. Tuttle, L.W. Wright, NCI Thesaurus: using science-based terminology to integrate cancer research results, Stud. Health Technol. Inform. 107 (Pt 1) (2004) 33–37.
[28] S. de Coronado, L.W. Wright, G. Fragoso, M.W. Haber, E.A. Hahn-Dantona, F.W. Hartel, et al., The NCI thesaurus quality assurance life cycle, J. Biomed. Inform. 42 (3) (2009) 530–539.
[29] N.F. Noy, D.L. McGuinness, Ontology Development 101: A Guide to creating your first ontology. Available from: https://protege.stanford.edu/publications/ontology_development/ontology101.pdf.
[30] T.F. Hayamizu, S. de Coronado, G. Fragoso, N. Sioutos, J.A. Kadin, M. Ringwald, The mouse-human anatomy ontology mapping project, Database (Oxford) 2012 (2012) bar066.
[31] J. Pathak, J. Wang, S. Kashyap, M. Basford, R. Li, D.R. Masys, et al., Mapping clinical phenotype data elements to standardized metadata repositories and controlled terminologies: the eMERGE Network experience, J. Am. Med. Inform. Assoc. 18 (4) (2011) 376–386.
[32] Q. Zhu, R.R. Freimuth, Z. Lian, S. Bauer, J. Pathak, C. Tao, et al., Harmonization and semantic annotation of data dictionaries from the Pharmacogenomics Research Network: a case study, J. Biomed. Inform. 46 (2) (2013) 286–293.
[33] N.H. Shah, D.L. Rubin, I. Espinosa, K. Montgomery, M.A. Musen, Annotation and query of tissue microarray data using the NCI Thesaurus, BMC Bioinform. 8 (2007) 296.
[34] P. Gaudet, P.A. Michel, M. Zahn-Zabal, I. Cusin, P.D. Duek, O. Evalet, et al., The neXtProt knowledgebase on human proteins: current status, Nucl. Acids Res. 43 (Database issue) (2015) D764–D770.
[35] S. de Coronado, M.S. Tuttle, H.R. Solbrig, Using the UMLS semantic network to validate NCI Thesaurus structure and analyze its alignment with the OBO relations ontology, AMIA Annu. Symp. Proc. (2007) 165–170.
[36] S. Schulz, D. Schober, I. Tudose, H. Stenzhorn, The pitfalls of thesaurus ontologization – the case of the NCI thesaurus, AMIA Annu. Symp. Proc. 2010 (2010) 727–731.
[37] F. Mougin, O. Bodenreider, Auditing the NCI thesaurus with semantic web technologies, AMIA Annu. Symp. Proc. (2008) 500–504.
[38] X. Zhu, J.W. Fan, D.M. Baorto, C. Weng, J.J. Cimino, A review of auditing methods applied to the content of controlled biomedical terminologies, J. Biomed. Inform. 42 (3) (2009) 413–425.
[39] J. Geller, Y. Perl, M. Halper, R. Cornet, Special issue on auditing of terminologies, J. Biomed. Inform. 42 (3) (2009) 407–411.
[40] SNOMED CT Starter Guide. Available from: https://confluence.ihtsdotools.org/display/DOCSTART.
[41] SNOMED CT. Available from: https://www.nlm.nih.gov/healthit/snomedct/index.html.
[42] J. Millar, The need for a global language – SNOMED CT introduction, Stud. Health Technol. Inform. 225 (2016) 683–685.
[43] SNOMED CT Basics. Available from: https://confluence.ihtsdotools.org/display/DOCSTART/4.+SNOMED+CT+Basics.
[44] SABOC. Available from: https://saboc.njit.edu/.
[45] C. Ochs, J. Geller, Y. Perl, M.A. Musen, A unified software framework for deriving, visualizing, and exploring abstraction networks for ontologies, J. Biomed. Inform. 62 (2016) 90–105.
[46] M. Halper, Y. Wang, H. Min, Y. Chen, G. Hripcsak, Y. Perl, et al., Analysis of error concentrations in SNOMED, AMIA Annu. Symp. Proc. (2007) 314–318.
[47] C. Ochs, Y. Perl, J. Geller, M. Halper, H. Gu, Y. Chen, et al., Scalability of abstraction-network-based quality assurance to large SNOMED hierarchies, AMIA Annu. Symp. Proc. 2013 (2013) 1071–1080.
[48] P.I. Good, Permutation, Parametric, and Bootstrap Tests of Hypotheses: A Practical Guide to Resampling, third ed., Springer, New York, NY, 2005.
[49] M.J. Lawley, C. Bousquet, Fast classification in Protégé: Snorocket as an OWL 2 EL reasoner, in: Proceedings of the 6th Australasian Ontology Workshop (IAOA'10): Conferences in Research and Practice in Information Technology, 2010, pp. 45–49.
[50] S.T. Schumer, S.A. Cannistra, Granulosa cell tumor of the ovary, J. Clin. Oncol. 21 (6) (2003) 1180–1189.
[51] NCI Thesaurus property definitions. Available from: https://evs.nci.nih.gov/ftp1/ThesaurusSemantics/Properties.pdf.
[52] C. Ochs, J.T. Case, Y. Perl, Analyzing structural changes in SNOMED CT's Bacterial infectious diseases using a visual semantic delta, J. Biomed. Inform. 67 (2017) 101–116.
[53] A.L. Rector, S. Brandt, T. Schneider, Getting the foot out of the pelvis: modeling problems affecting use of SNOMED CT hierarchies in practical applications, J. Am. Med. Inform. Assoc. 18 (4) (2011) 432–440.
[54] S. Gramolelli, T.F. Schulz, The role of Kaposi sarcoma-associated herpesvirus in the pathogenesis of Kaposi sarcoma, J. Pathol. 235 (2) (2015) 368–380.
[55] S. Gurzu, D. Ciortea, T. Munteanu, I. Kezdi-Zaharia, I. Jung, Mesenchymal-to-endothelial transition in Kaposi sarcoma: a histogenetic hypothesis based on a case series and literature review, PLoS One 8 (8) (2013) e71530.
