yOWL: An ontology-driven knowledge base for yeast biologists

yOWL: An ontology-driven knowledge base for yeast biologists

Journal of Biomedical Informatics 41 (2008) 779–789 Contents lists available at ScienceDirect Journal of Biomedical Informatics journal homepage: ww...

265KB Sizes 1 Downloads 36 Views

Journal of Biomedical Informatics 41 (2008) 779–789

Contents lists available at ScienceDirect

Journal of Biomedical Informatics journal homepage: www.elsevier.com/locate/yjbin

yOWL: An ontology-driven knowledge base for yeast biologists Natalia Villanueva-Rosales a, Michel Dumontier a,b,* a b

School of Computer Science, Carleton University, Ottawa, Ont., Canada K1S 5B6 Department of Biology, Carleton University, Ottawa, Ont., Canada K1S 5B6

a r t i c l e

i n f o

Article history: Received 1 September 2007 Available online 11 May 2008 Keywords: Semantic web Data integration Semantic query answering Knowledge management OWL Ontology Biological data Yeast Saccharomyces cerevisiae Mashup

a b s t r a c t Knowledge management is an ongoing challenge for the biological community such that large, diverse and continuously growing information requires more sophisticated methods to store, integrate and query their knowledge. The semantic web initiative provides a new knowledge engineering framework to represent, share and discover information. In this paper, we describe our efforts towards the development of an ontology-based knowledge base, including aspects from ontology design and population using ‘‘semantic” data mashup, to automated reasoning and semantic query answering. Based on yeast data obtained from the Saccharomyces Genome Database and UniProt, we discuss the challenges encountered during the building of the knowledge base and how they were overcome. Ó 2008 Elsevier Inc. All rights reserved.

1. Introduction Online biological information is available via web pages, stored in ftp sites or relational databases, and textually described in publications. However, web search engines are unable to answer questions about this massive knowledge base other than identifying resources that contain some subset of the specified attributes. The main reason for this limitation is that the representation of biological information on the web is not machine understandable, in the sense that computers cannot interpret words, sentences or diagrams so as to correctly reason about the objects and the relations between them that are implicitly stated in those documents [1]. The primary goal of the semantic web is to add semantics to the current Web, by designing ontologies which explicitly describe and relate objects using formal, logic-based representations that a machine can understand and process [2,3]. This ongoing effort is expected to facilitate data representation, integration and question answering, of critical importance in the life sciences. Ontologies already play an important role in managing medical terminology [4–6], and more recently in the discovery and execution of grid [7] and semantic web services [8]. The Open Biomedical Ontologies (OBO) is a shared portal of biological/medical ontologies that includes the popular Gene Ontology (GO) [9]. By provid-

* Corresponding author. Address: Department of Biology, Carleton University, 1125 Colonel By Drive, Ottawa, Ont., Canada K1S 5B6. Fax: +1 613 520 3539. E-mail address: [email protected] (M. Dumontier). 1532-0464/$ - see front matter Ó 2008 Elsevier Inc. All rights reserved. doi:10.1016/j.jbi.2008.05.001

ing a standardized vocabulary, OBO controlled vocabularies and taxonomies are used in the annotation of biological information, which helps make information more accessible for computer interpretation. Through the OBO Foundry effort, OBO ontologies are being redesigned and mapped to the Basic Formal Ontology (BFO) [10], a domain independent ontology that provides distinction between objects and processes and can be linked using basic relations [11]. Together, they should provide a powerful platform to describe and annotate domain specific knowledge, and open the possibility of making queries at various levels of ontological granularity and potentially from diverse domains. This in turn, will enable knowledge discovery by tapping into expert knowledge. For instance, a biochemical ontology might help one discover enzymes as the set of proteins that catalyze reactions, even though these have not been asserted as enzymes. While the first generation of OBO ontologies cannot be used in this way because they do not contain explicit logical descriptions to define class membership in terms of their properties, work towards a second generation of OBO ontologies using formal, logic-based knowledge representation languages will provide enhanced functionality toward this goal [12–14]. OWL, the Web Ontology Language [15], is a knowledge representation language for designing ontologies on the Semantic Web. One variant, OWL-DL, is based on description logics (DL), a subset of First Order Logic that allows description of complex concepts from simpler ones with an emphasis on decidability of reasoning tasks [16] (i.e. the results will be returned in a finite

780

N. Villanueva-Rosales, M. Dumontier / Journal of Biomedical Informatics 41 (2008) 779–789

amount of time). Reasoning tasks like checking ontology consistency, computing inferences, and realization (classifying real world objects into their most specific category) can be executed by a computer program called a reasoner (e.g. Pellet [17], FACT++ [18] and Racer Pro [19]) over DL ontologies [20]. The design of OWLDL bio-ontologies favorable to reasoning may be achieved by the application of semantic web best practices [21], relation formalisms [11], normalization [22], design patterns and workflows [23]. Sophisticated biomedical ontologies such as the Foundational Model of Anatomy are being converted to OWL and this has proven useful in simplifying the ontology and identifying inconsistencies [24,25]. The FungalWeb project involved the design of an ontology to reason about enzymes of importance to the yeast biotechnology industry [26]. The BioPax OWL ontology [27] enables data sharing via an ontology for pathways, interactions and molecular participants, for data exchange between pathway data providers such as HumanCyc [28] and Reactome [29]. A considerable challenge in bioinformatics is the resolution of multiple identifiers for the same entities. The proliferation of identifiers stems from (1) direct user submissions to a general database (i.e. NCBI), (2) the construction of ‘‘boutique” databases (i.e. SGD) and (3) value added annotations fuels a need for each provider to issue new identifiers so as to keep track of their contributions [30–32]. However, keeping track of these identifiers is such a problem that it becomes necessary to create databases of database identifiers [33,34]. In fact, identifiers have such a pervasive influence in the life sciences that people talk about identifiers instead of the entities they are meant to identify. While LinkHub [35] offers a first step at navigating this confusing network of identifiers, YeastHub [36] provides an RDF-based data warehouse which lets one add data and create queries between resources. While flexible, the lack of ontology requires users to indicate which user-contributed resources are equivalent, a difficult task, and this precludes the automatic knowledge discovery. In this paper, we create a semantic knowledge by designing and populating expressive ontologies primarily using the information from the Saccharomyces Genomes Database [37]. We illustrate semantic data mashup with OBO ontologies and the UniProt RDF knowledge base and provide solutions for the resolution of multiple resource identifiers. We describe the formulation of semantic queries for question answering using two query languages that require reasoning services. Finally, we explain how this semantic knowledge base can be extended by others such that they can build their own custom knowledge base to track developments in recent literature or add their own experimental results. 2. Methods Fig. 1 shows the three main methodologies used to create and extract information from the yOWL knowledge base: ontology design, ontology population and semantic query answering. 2.1. Ontology design All ontologies were created using the OWL editor Protégé Ontology Editor (v 4.53 alpha).1 Proposals to design ontologies from the theoretical and practical point of view can be found in [38–40] and they include brain storming, scope definition, motivating scenarios, competency questions, creation of bag of concepts, creating concept hierarchy, properties, domain and range, amongst others. Authors of these papers agree in the fact that there is no a correct methodology to design an ontology, but they also agree in seeing this process iterative

and application oriented. In the ontology design process described in this paper their proposed criteria is reused when possible and mapped to the OWL language. The yOWL ontology was designed following semantic web best practices [11,21–23] and implemented using the OWL-DL. Briefly, we follow the methodology described by [38], but apply a three layer ontology design with increasing expressivity [41]. The ontology design process can be summarized in the following steps: 1. Define the scope and requirements (use cases) of the ontology. 2. Determine essential concepts, reusing existing ontologies when possible. 3. Construct a primitive hierarchy of concepts. 4. Map to an upper level ontology. 5. Assign relations between concepts and attributes. 6. Design a second layer of complex definitions that imposes additional logical restrictions and establishes necessary and/or necessary and sufficient conditions. 7. Repeat steps 2–6 until requirements are satisfied from semantic query answering over populated ontologies. 2.1.1. Step 1. Define the scope and requirements of the ontology The scope of the yOWL project was to model the entities and their relations found in the Saccharomyces Genome Database (SGD) database2 such that one could effectively search this knowledge base using expected relations rather than with a collection of keywords. One example would be to identify a gene involved in a genetic interaction experiment and whose gene product is involved in transcription (Query 8). SGD is a free and widely used resource for yeast genomic and proteomic information, from which structural and functional chromosome features (telomeres, genes, etc), database cross references, molecular function, cellular component, biological process, interactions, pathways, phenotypes and literature references as listed in Table 1. 2.1.2. Step 2. Determine essential concepts, reusing existing ontologies when possible An initial set of concepts was manually collected to represent the type and attributes of the data located in the data files. Concept definitions were obtained from the SGD glossary, WordNet and Wikipedia and were manually added to the class using the ‘‘rdfs:comment” annotation property for human consumption. To illustrate the conceptual process, we will illustrate the methodology for the ‘‘interactions” file. This file describes results from physical or genetic interaction experiments and each row contains columns for the pair of interacting genes/proteins, the type of interaction experiment, the impact on viability and a reference to a publication in which it was reported. To represent this information in the ontology, the following concepts were created: (a) the class of InteractionExperiment and more specific classes from an enumeration of types of interaction experiments (e.g. SyntheticLethalityExperiment, AffinityCapture-MS, etc). Values captured in the ‘‘viability” column were captured as viability states (viable, nonviable) or growth states (slow growth, etc). References to journal articles identified by PubMed identifier are captured by a Publication class. Data found in the ‘‘go_slim_mapping” file corresponds to information defined by the Gene Ontology. However, no OWL document was available for this subset of GO. Thus, we constructed an OWL ontology3 with concepts that are equivalent (owl:EquivalentClass)

2 1

http://www.co-ode.org/downloads/protege-x/.

3

http://www.yeastgenome.org/. http://ontology.dumontierlab.com/goslim.

N. Villanueva-Rosales, M. Dumontier / Journal of Biomedical Informatics 41 (2008) 779–789

Ontology Design (Terminology Extraction, Integration and Extension)

Data Mashup text file

RDB

RDF

OBO

asdfasdf

wrapper

781

BFO

...

KB

wrapper

wrapper

ABox

OWL

TBox

user contributions

Query Answering

Reasoner

GUI

yOWL Knowledge Base Fig. 1. yOWL knowledge base.

Table 1 yOWL source data Data type

File

Chromosome features DB cross references Function, localization, process Interactions Complex Pathways Phenotypes Literature

SGD_features.taba dbxref.taba go_slim_mapping.tabb

a b

Interactions.tabb go_protein_complex_slim.tabb biochemical_pathways.tab2 phenotypes.tabb gene_literature.tabb

Number of records 16,781 68,313 21,176 148,777 3105 946 15,505 142,777

ftp://genome-ftp.stanford.edu/pub/yeast/chromosomal_feature/. ftp://genome-ftp.stanford.edu/pub/yeast/literature_curation/.

to those in the complete OWL4 GO ontology, but with English class names like ‘‘Nucleus” rather than alphanumeric class names favored by OBO ontologies (i.e. GO_0005634). 2.1.3. Step 3. Construct a primitive hierarchy of concepts A hierarchy of concepts for the yOWL primitive ontology5 was developed by iteratively categorizing the set of necessary classes. For instance, a SyntheticLethalityExperiment is a type of SyntheticGeneticExperiment which is a type of GeneticInteractionExperiment which is a type of InteractionExperiment which is a type of Experiment. The result is the design of an taxonomic branch that is homogenous and increasingly specialized in that each child term can be easily differentiated from its parent. In line with general normalization techniques, all ontological terms are asserted to have but a single parent. 2.1.4. Step 4. Map to an upper level ontology Upper level ontologies promise increased semantic coherency by helping to identify the basic type of domain entities and impos-

4 5

http://www.co-ode.org/downloads/protege-x/. http://ontology.dumontier.com/yowl-primitive.

ing restrictions on the relationships that these entities may hold. An upper level ontology should be carefully selected according to the purpose and scope of the ontology designed. The Basic Formal Ontology (BFO) provides a simplified framework to distinguish from qualities, objects and processes. We mapped classes defined in the yOWL ontology to the BFO ontology. For example, the validation status of open reading frame is a type of quality, a Publication is a type of object and an Experiment is a type of Process. In doing so, we anticipate that our classes (and the restrictions placed on them) will be compatible with second generation OBO ontologies that also adopt the BFO framework. 2.1.5. Step 5. Assign relations between concepts and attributes An essential part of the modeling process is to establish which relations exist between the set of concepts. To do this, we need two things: (1) a set of basic relations to draw from and (2) generate a kind of topics map. We have designed the Basic Relation Ontology (BRO)6 to hierarchically organize basic object relations [10,11] for use in OWL ontologies. A root relation isRelatedTo provides a generic relationship between any entity and also provides object-process, object-quality, mereological, spatial and temporal relations. NULO,7 an integrated upper level ontology, maps the domain and ranges of BRO relations with BFO concepts so as to semantically constrain the assignment of relations and provide enhanced reasoning and inference capabilities. Taken together, NULO consists of 36 classes, 50 object properties and 2 datatype properties, 17 annotation properties. Similarly to the use of upper level ontologies, the use of NULO in several ontologies imposes extra restrictions that generating further inferences may facilitate semantic integration in an abstract level. New object properties were added to describe the more specific relations required (but not restricted to) in this domain. The first pair hasReference/isReferencedIn provides a relation between a publication and the entity it references. The second pair hasSource/isSourceOf, links an entity with its source of origin, and provides a means to assign data provenance. A pair of quality relations hasSta-

6 7

http://ontology.dumontierlab.com/bro. http://ontology,dumontierlab.com/nulo.

782

N. Villanueva-Rosales, M. Dumontier / Journal of Biomedical Informatics 41 (2008) 779–789 rdfs:subClassOf

bro:isPartOf

yowl:Complex

yowl: ORF Status Of

bro:isPartOf

yowl: Molecule

rdfs :subClassOf

bfo:Entity yowl:hasSource

yowl:DB Source

yowl: hasStatus

goslim: Chromosome

yowl:Chromosome Strand

bro

yowl:hasPerturbation

yowl:Interaction Experiment

yowl: Perturbation

bro : isParticipantIn

bro:hasFunction

bro:isParticipantIn

yowl:hasPhenotype

goslim:Molecular _function

yowl: Pathway

yowl:Phenotype

bro:isParticipantIn rdfs: subClassOf

yowl:Publication

yOWL

bro:isLocatedIn

:is

yowl:DNA Region

yowl:hasReference

rt Pa

goslim:Cellular_ component

goslim:Biological _process

rdfs:subClassOf

bfo:Process rdfs: subClassOf

Namespaces xsd: http://www.w3.org/2001/XMLSchema rdfs : http://www.w3.org/2000/01/rdf-schema owl: http://www.w3.org/2002/07/owl dc: http:/purl.org/dc/elements/1.1// bfo : http://www.ifomis.org/bfo/1.0.1 bro: http://ontology.dumontierlab.com /bro goslim : http:/ontology.dumontierlab.com /goslim -2.0 yowl:http://ontology.dumontierlab.com /yowl-primitive

Fig. 2. Selected classes and relations from yOWL ontology.

tus/isStatusOf describes the status (verified, dubious, uncharacterized) of an open reading frame. Finally, two pairs of object relations hasOutcome/isOutcomeOf and hasPerturbation/isPerturbationOf describe the relationship between an entity and an outcome (such as a phenotype) and a perturbation (i.e. deletion of a gene), respectively. Ideally, these properties would be associated with the participants of some experiment, but given the form of the data, they are associated directly with experiments. Several datatype properties were also introduced to relate information that is intrinsic to the specific entity. For instance, the date version properties (sequence version and coordinate version), in which multiple values should be maintained. Most chromosomal features are associated with a start and end coordinate which delineate a continuous region located on chromosomal strand. Another property provides the association of a citation with a publication hasCitation. A datatype property was established to relate a gene with the biochemical reaction that its gene product catalyzes hasECNumber, although it is expected that this relation will later be converted to one between a protein product and a biochemical reaction. The entities and their relations are then defined in a topics map. A part of this map is illustrated in Fig. 2. This topics map also servers as an initial data model in which to instantiate the ontology. 2.1.6. Step 6. Design a second layer of complex definitions that imposes additional logical restrictions and establishes necessary and/or necessary and sufficient conditions The primitive hierarchy of concepts was augmented and refined with logical restrictions to reflect knowledge about genome structure and function. To ensure the proper classification of data and knowledge discovery, necessary and necessary and sufficient conditions were added to the ontology. Necessary conditions were added where obvious (i.e. a chromosome strand is a single stranded DNA molecule that is part of a chromosome). Necessary and sufficient conditions were added for single property varying entities (i.e. a dosage rescue experiment is a dosage interaction experiment that has a viable outcome) or in the design of a value partition (i.e. the quality of viability is defined by a state of viable or non-viable). The yOWL complex ontology8 is comprised of 257 classes, 60 object properties, 12 data properties and 22 annotation properties. These include the BFO and GOSLIM ontologies.

2.2. Ontology Population and Mashup In this work, the ontology population involved: (1) assigning names, (2) asserting class membership, and (3) determining proper relations between entities. In the assignment of unique names, we constructed Universal Resource Identifiers (URI) from an assigned base namespace (Table 2) plus the identifier. In the assignment of types, the yOWL ontology sufficiently covers the genomic and proteomic data obtained from SGD tab files. However, given that there was no relational database schema available, the specific relations between the data had to be manually determined based on the concept and relations extraction results of steps 2 and 4. We designed PHP-based parsers to automatically populate (instantiate classes and relations) the yOWL ontology from SGD tab files, obtaining an OWL document in RDF/XML format. A necessary first step was to normalize the content such that multi-valued column entries delineated by ‘‘/” or ‘‘|” delimiters were separated and treated as distinct entities. Another problem was in that these files lacked references to the primary key, the SGD identifier (SGDID; a numeric identifier prefixed with an ‘S’—chromosomal features file). Instead some contained references to gene names or ORF names, and we found additional references to resources outside of the SGD. In particular, information about genes and their open reading frames could be obtained from UniProt records in RDF9 RDF documents were fetched for each entity for which information was available. Local identifiers for blank nodes in these RDF documents were replaced with globally identifiers by appending the name of the entity, thereby avoiding the incorrect integration of triples. GO URIs were also mapped to the GOSLIM ontology. The curated RDF document10 can be seamlessly imported and integrated with the yOWL populated ontology,11 providing a mashup with additional information such as type (UniProt schema), biological source, and GO terms, among others. Using an Intel Pentium 4 computer with 3GB RAM, we were able to load the entire data-instantiated ontology using Protégé 4 alpha (build 53) and RacerPorter 1.9.1, but unable to execute realization for query answering. Thus, due to limited resources available at the time of writing this paper, the query examples described in this paper were selected from a representative subset of the complete yOWL data. This subset was obtained by performing a search for genes with links across the ontology and contains approximately 2% of the whole dataset (5770 instances). While 9 10

8

http://ontology.dumontierlab.com/yowl-complex

11

http://www.yeastgenome.org/ http://ontology.dumontierlab.com/yow/yowl-jbi-unitprot.owl http://ontology.dumontierlab.com/yowl-jbi.owl

N. Villanueva-Rosales, M. Dumontier / Journal of Biomedical Informatics 41 (2008) 779–789 Table 2 Resource namespaces Source

URI

yOWL ontology GOslim ontology GO SGDa Genbanka PubMeda EBIa DIPa CGDa CandidaDBa IUBMBa EUROSCARFa BioGrida MetaCyca GermOnlinea UniProt UniParc

http://ontology.dumontierlab.com/yowlprimitive http://ontology.dumontierlab.com/goslim http://purl.org/obo/owl/GO urn:lsid:yeastgenome.org: urn:lsid:ncbi.nlm.nih.gov:genbank: urn:lsid:ncbi.nlm.nih.gov:pubmed urn:lsid:ebi.ac.uk: urn:lsid:dip.doe-mbi.ucla.edu: urn:lsid:candidagenome.org: urn:lsid:candidadb: urn:lsid:iubmb:reaction: urn:lsid:euroscarf: urn:lsid:thebiogrid.org: urn:lsid:metacyc.org: urn:lsid:germonline.org: http://purl.uniprot.org/uniprot/ http://purl.uniprot.org/uniprot/

a

LSID assigned in the absence of known authority.

smaller, this approach still enabled a demonstrated of proof of principle on how OWL-DL ontologies can be used to enable semantic query answering over data, rather than undertake performance testing on available tools. 2.3. Question answering The design of reasoning capable applications can facilitate information retrieval as well as aid in the discovery of new knowledge about a subject of interest. We demonstrate, by means of examples, how a scientist could retrieve information from yOWL such that they can query at various levels of ontological granularity, query across data having multiple synonymous identifiers, and make semantically constrained queries. We focus on two main categories of queries: class queries and graph pattern based queries. Class queries are constrained to the concepts, properties and individuals contained in the ontology. While class queries have been around for a long time, the Manchester OWL syntax [42] directly maps to the OWL knowledge representation language and can be easily formulated using the Protégé 4 DL Query plug-in and the Pellet/Fact++ reasoners that are embedded in this application. Graph pattern queries are more powerful than class queries in that they allow multiple variables to be specified and restricted, and hence the query returns a sub-graph of the underlying knowledge base. For example, we might ask to retrieve genes/gene products, pathway and source where the gene is involved in a protein modification pathway. These type of queries can formulated using nRQL query language, a lisp based query language [19] supported in Racer Pro 1.9.1. Intuitively, these queries are based on graph patterns composed by nodes and edges. Each node represents a class, an individual or a variable to be bound to an individual. Edges represent relations through properties or restrictions in the ontology. For the answering of nRQL queries we used Racer Pro (via the Racer Porter interface) with the instance indexing feature enabled for better performance and unique name assumption. 3. Results 3.1. Heterogeneous biological data integration 3.1.1. Resource integration and provenance SGD assigns a unique identifier (SGDID) for every chromosomal feature it provides. All other identifiers including gene names, gene aliases, ORF names, and all database cross-references were assigned one of the namespaces in Table 2 and made an instance of owl:-

783

Thing. Identifiers that point to the same resource were made equivalent by asserting the owl:sameAs between them. This allows statements made using any one of the identifiers to be resolved and greatly simplify the data integration process. For instance, UniProt data was seamlessly integrated to SGD data, given that there existed a cross-reference list between the two. When a resolvable URI for the entities described was not available, provenance was maintained by assigning the LSID identifiers to imported data, as well as linking resources to their databases with a hasSource object property. This then enables querying data from a specific source either by filtering namespaces or by querying object properties. While using an object property is less elegant than RDF statement reification, the latter is not possible in a OWL-DL ontology. 3.1.2. Instantiation of the Gene Ontology By adopting the BFO in this work, we subscribe to the idea that molecular functions, cellular components and biological processes really do exist and as such each gene/protein instantiates its own GO instance. For simplicity of reasoning about GO slim terms from SGD data file, we designed an OWL-DL version of the yeast GO slim ontology12 comprises of 79 classes spanning the three hierarchies (molecular function, cellular component and biological process). The use of this GO slim ontology substantially reduces the complexity of reasoning over the full GO ontology (with over 19,000 terms), but it is still possible to import the full ontology to reason with as each class is equivalent to a GO class. We created an instance of the correspondent GO term for each GO annotated gene/protein, which opens the door to making future statements about those particular functions, processes and components. 3.2. Question answering 3.2.1. Types of queries posed to yOWL Table 3 lists example queries (class and graph pattern based) and the OWL features they use in order to demonstrate the basic functionality and advantages of OWL ontology-driven queries. 3.3. Query results 3.3.1. Query 1. Find all individuals that have a molecular function Class Expression: hasFunction some Molecular_function This query returns the set of individuals that have some (at least one) known molecular function. While molecular functions assigned to genes/proteins are generally more specific, these are resolve to this parent term defined in the GO slim ontology. Among the results, we find that the Elongin A, F-box protein encoded by the open reading frame identified by the SGD identifier S000005174 has been associated with an instance of transcription regulator activity. Additional functional annotation can be obtained from the UniProt RDF knowledge base. We imported yeast specific UniProt RDF records into our yOWL knowledge base. In doing so, we discover new information beyond the original SGD data, such that YDR363w (ESC2) and YGL127c (SOH1) exhibit DNA processing function, but these are described in free text entries. Since UniProt assigns its own GO URIs, it was necessary to create a mapping13 from those URIs to the PURL URIs used by GO. By additionally importing the GOslim ontology, we can query the enhanced knowledge base with classifiedWith some Molecular_function to retrieve genes/proteins that have GO functional annotation. Among the results, we discover that the pre-mRNA-splicing factor CWC21 (Q03375|S000002890) is annotated with protein binding function.

12 13

http://ontology.dumontierlab.com/goslim. http://ontology.dumontierlab.com/yowl-jbi-go-uniprot2obo.owl.

784

N. Villanueva-Rosales, M. Dumontier / Journal of Biomedical Informatics 41 (2008) 779–789

Table 3 Example queries posed to yOWL Type

Query

Query feature

Class queries

(1) Find all individuals that have a molecular function

Existential restriction, ontology integration Conjunctive query, ontology integration Conjunctive query, defined classes Transitive relations Cardinality restrictions Negation, defined classes

(2) Find all uncharacterized open reading frames that have a known molecular function (3) Find pathway participants that are also physical interaction participants (4) Find all open reading frames on chromosome 5 (5) Find all the interaction experiments that are referred in at least four publications (6) Find all DNA regions that are not physically mapped Graph pattern based queries

(7) Find genes that play a role in transcription and are participants in some genetic interaction experiment. Return both the genes and their associated publications (8) Give the set of identifiers and their database sources for genes involved in a protein modification pathway (9) Find all information related to Gene NSA3

3.3.2. Query 2. Find all uncharacterized open reading frames that have a known molecular function Class Expression: OpenReadingFrame that (hasFunction some Molecular_function) and (hasStatus some (ORFStatus and not (Verified or Dubious))) This conjunctive query illustrates how (i) multiple restrictions (OpenReadingFrame, hasFunction, hasStatus) on an individual must be satisfied and (ii) requires background knowledge of status types (Verified and Dubious are known and different ORF status types from other types—i.e. uncharacterized). The result to this query contains among others, a putative F-box protein encoded by the ORF with the SGD identifier S000005255 and whose molecular function is protein binding. Since uncharacterized open reading frames are those that likely encode a protein, but for which there are no specific experimental data demonstrating that a gene product is produced in Saccharomyces cerevisiae, such queries open new avenues for experimental investigation and validation. 3.3.3. Query 3. Find pathway participants that are also physical interaction participants Class Expression: PathwayParticipant and PhysicalInteraction Participant This query illustrates the use of conjunctive queries with defined classes for knowledge discovery. A defined class relies on necessary and sufficient conditions to logically describe its membership requirements. An OWL-DL reasoner (e.g. Pellet, Racer, FACT++) will discover which individuals satisfy the class restrictions and will classify such individuals as instances of that defined class. The yOWL ontology contains the class PathwayParticipant for which the necessary and sufficient conditions for membership are: (i) be an instance of an independent continuant and (ii) is a participant in some pathway. The class InteractionParticipant is defined to be (i) an instance of an independent continuant and (ii) participant in some physical interaction experiment. The full query can be posed in terms of primitive (not defined) classes: Find all continuants that participate in a pathway and participate in a physical interaction experiment. Both queries return the same set of individuals as an answer, which includes among others: the BET4 gene (YJL031C; SGDID S000003568) whose protein product has a known role in a protein modification pathway and has been shown to interact in five physical interaction experiments (affinityCapture-MS, FRET, reconstituted complex, dosage rescue and copurification). Thus, knowledge that spans different curated information can be easily queried. 3.3.4. Query 4. Find all open reading frames (ORF) on chromosome 5 Class Expression: OpenReadingFrame that isPartOf value chromosome5

Conjunctive queries, variable binding DB Cross references, variable binding Property/role hierarchy, variable binding

An important aspect of biological modeling is the use of transitive parthood relations. Within the BFO framework, both objects and processes can have parts, although they are restricted to the same type (i.e. an object can have another object as its part). In the yOWL ontology, open reading frames are parts of chromosome strands which themselves are part of a chromosome. The result of this query includes the YEL059C-A ORF (SGDID S000002954) which is asserted to be part of the Crick strand that is part of chromosome 5. Given that the part of relation is transitive, the reasoner can infer that this ORF is also part of the chromosome 5. 3.3.5. Query 5. Find all the interaction experiments that are referred in at least 4 publications Class Expression: InteractionExperiment that (hasReference min 4 Publication) This query imposes cardinality restrictions over the property has Reference. The result contains the AffinityCapture-MS interaction experiment with SGD identifier interaction_173, which is referred in the publications identified by PMID 1805837, 12374754, 16429126 and 16554755. In Racer, queries including cardinality restrictions cannot contain transitive relations [43]. When the cardinality restrictions involve operators like ‘‘at most” or ‘‘exactly”, it is necessary to state that we have all the knowledge required to answer the question (i.e. close the world at query time), and the semantics of the query would change from at most four references to at most four known references. Such queries typically require the use of the ‘‘unique name assumption” in order to distinguish between individuals that have not been explicitly asserted to be different. This configuration option is offered by both, Racer and Pellet reasoners. Otherwise, axioms containing the OWL differentFrom property axioms would need to be added to the ontology. 3.3.6. Query 6. Find all DNA regions that are not physically mapped Class Expression: DNARegion and not (hasStartCoordinate some int) and not (hasEndCoordinate some int) All physically mapped DNA regions have a known start coordinate and end coordinate along the chromosome. The result of this query contains, among others, negative regulator gene in general amino acid biosynthetic pathway (SGDID S000029174), which is an instance of ‘‘NotPhysicallyMappedFeature”. In t‘he yOWL ontology, ‘‘NotPhysicallyMappedFeature” is equivalent to the query, and can be answered either under normal open world assumption or with negation as failure, as implemented by Racer Pro. The following queries are considered under the category of graph pattern based query, and therefore, the result will contain

785

N. Villanueva-Rosales, M. Dumontier / Journal of Biomedical Informatics 41 (2008) 779–789

the set of values (bindings) for each variable (node) in the pattern graph queried that satisfy the conditions described in such a graph. 3.3.7. Query 7. Find genes that play a role in transcription and are participants in some genetic interaction experiment. Return both the genes and their associated publications The graph pattern for this query is illustrated in Fig. 3. The result of this query is a set of tuples, each one containing the values for publications and the genes they reference that satisfy both conditions: (i) play a role in transcription and (ii) are participant in a genetic interaction experiment. This result contains among others the tuple: publication (PMID:16431986) with the gene (S000005705) whose gene product is a part of the APT subcomplex of cleavage and polyadenylation factor. 3.3.8. Query 8. Give the set of identifiers and their database sources for genes involved in a protein modification pathway The graph pattern for this query is illustrated in Fig. 4. This query illustrates the use of OWL sameAs property to assert that cross references are identifiers to the same object, thereby facilitating the aggregation of information from multiple resources. This query returns, among others, the individual with the identifiers and sources (id from source): S000003568, YJL031C and BET4 from SGD, PWY30-11 from MetaCyc, YJL031C (also found as YJL031c) from EUROSCARF, 4994 from DIP, orf10.1039 from CGD, CA1034 from CandidaDB, CAA89323.1 and AAA21386.1 from GenBank/ EMBL/DDBJ, 33728 from BioGRID, UPI000034F5CE and Q00618 from EBI, NP_012503.2 and 853421 from NCBI. Thus, even when sources assign multiple identifiers, or there exists multiple spellings, yOWL integrates this knowledge and facilitates query answering. 3.3.9. Query 9. Find all information related to Gene NSA3 The graph pattern for this query is illustrated in Fig. 5. As mentioned in the Ontology design and ontology population section, yOWL contains a property hierarchy, whose top property is isRelatedTo. Therefore, every relation (asserted or inferred) between NSA3 and any other individual, will imply that NSA3 is related to that individual. This query retrieves all the individuals NSA3 is related to at the most general level of granularity: (sources) EBI, CGD, NCBI, GenBank/EMBL/DDBJ, BioGRID, DIP and CandidaDB, (ORFStatus) verified, chromosome8_Watson, proteasome_complex, (GO identifiers) GO_S000001094_Ribosome_biogenesis_and_assembly, GO_S000001094_Nucleolus, GO_S000001094_Protein_catabolic_ process and GO_S000001094_Protein_binding. It is also related to a large set of interactions, and a set of publications including PMID:16922378. This exploratory query can later be refined searching for more specific types of relations between NSA3 and other individuals (e.g. find all the molecular functions related with NSA3 or the location of NSA3).

yowl:hasReference

yowl:Gene bro:isParticipantIn

goslim: Transcription

yowl: Publication bro:isParticipantIn yowl:Genetic Interaction Experiment

Fig. 3. Graph pattern for Query 7. Bound variables (dotted circle), unbound variables (unfilled circle), individuals (square).

owl:sameAs

yowl:Gene

bfo:Entity yowl:hasSource

bro:isParticipantIn

sgd:protein_ modifications_ pathway

yowl:DB Source

Fig. 4. Graph pattern for Query 8. Bound variables (dotted circle), unbound variables (unfilled circle), individuals (square).

bro:isRelatedTo

sgd:NSA3

owl:Thing

Fig. 5. Graph pattern for Query 9. Bound variables (dotted circle), unbound variables (unfilled circle), individuals (square).

3.3.10. Query 10. Find genes/proteins with transferase activity that are part of a complex, have rescued a non-viable phenotype by overexpression and have a known role in some pathway. Retrieve the genes, sources, pathways, experiments, chromosome and complex that satisfy these requirements The graph pattern for this query is illustrated in Fig. 6. This query represents a more sophisticated query that spans a greater portion of the yOWL knowledge base. It requires the integration of all the information including ontology-based inferences (e.g. be part of a chromosome) and ontology integration (e.g. the transferase activity GO term). The answer to this query is the set of tuples containing a gene, source, pathway, experiment, chromosome and complex that together with the relations satisfy the restrictions defined in the query. An answer to this query is the gene BET4, with source SGD that is a participant in the pathway_1 and also in the experiment_30. This experiment has as outcome the phenotype_30 that has the quality of being non-viable. BET4 also participates in the interaction_871 obtained from a Dosage Rescue Genetic Interaction Experiment. This gene is annotated with the S000003568_GO_Transferase_activity, an instance of the GO transferase activity molecular function and is also part of the chromosome10 and the Rab-protein_geranylgeranyltransferase_complex, which is an instance of complex. Each unbound variable of the query (unfilled circle) is bounded to an individual that satisfies the restrictions imposed in the query. 4. Discussion In previous sections we presented the prototype of an ontology driven knowledge management system in which the semantics of RDF and OWL are used to integrate and query over heterogeneous biological knowledge. We will now discuss in greater detail some features, lessons learned and remaining challenges. 4.1. Ontology design and ontology population While the approach used to design the ontology (generate bag of concepts, place in hierarchy, enhance with additional domain knowledge, establish relations, add necessary conditions) follows general practices, there are two distinguishing aspects in this work. The first is that we used the conceptual framework provided by the BFO upper level ontology to guide the design of

786

N. Villanueva-Rosales, M. Dumontier / Journal of Biomedical Informatics 41 (2008) 779–789

goslim:Transferase_activity

yowl:Complex bro: isPartOf

bro:hasFunction bro:isParticipantIn

bro:isParticipantIn

yowl:Dosage Rescue

yowl:Gene

bro:isPartOf

goslim : Chromosome

yowl:Whole GeneDeletion

yowl:Pathway yowl: hasSource

bro:isParticipantIn

yowl: Experiment

yowl: hasPerturbation

yowl: hasOutcome

yowl:DB Source

yowl: Nonviable

Fig. 6. Graph pattern for Query 10. Bound variables (dotted circle), unbound variables (unfilled circle), individuals (square).

classes and the use of the BRO basic relationship ontology to establish the basic relationships between the classes. The use of an upper level ontology will also permit semantic mapping of classes with other ontologies by establishing, in the very least, whether the class is a type of quality, object or process. A second point is that our methodology was specifically applied to remodel information that is presumably stored in a relational database without having access to the schema, which required a high human intervention. However, our manual modeling process forms the basis by which classes and relations may be automatically extracted from a database model (e.g. database schema, data files, etc). An important aspect of this process is to clearly identify when a relation is of the type is a or that of parthood, both of which are widely used in the biology domain in taxonomies and mereologies. Following an automatic extraction would be the more challenging aspect of identifying existing ontologies with compatible semantics. In this paper, we trivially generated a GO slim ontology that references classes defined in a much larger GO document, thereby specializing an existing ontology for practical considerations, and supporting the idea that a mapping to existing ontologies could be done after the automatic generation of classes in ones own namespace. The task of ontology population is labor intensive and requires intervention by a technical and domain expert. However, the development of RDF data converters for a variety of heterogeneous data sources, as being done with the Bio2RDF project,14 is a clear first step towards the population of OWL ontologies. In this work, we integrated SGD and UniProt RDF data into our OWL ontology so as to execute queries across these information sources.

hanced vision of biological data management. The use of OWL ontologies and easy to use interfaces facilitates the management of a semantic knowledge base to keep track of information and execute novel queries. For instance, let us add the recent discovery of 11 genes involved in meiotic DNA processing based on genetic interaction experiments [44]. We create a new ontology using Protege4, we import the yOWL ontology, add functional annotation to each of the genes using the GO terminology, and publish the document at some web accessible location.15 Thus, scientists are now empowered to create their own knowledge bases on literature, experiments, etc, and can easily share them, thus creating a new network of distributed, machine understandable, scientific knowledge bases. 4.3. Gene–protein resolution Despite SGD’s recent thrust to improve its annotation of proteins [45], this resource does not clearly differentiate between genes and their gene products (i.e. proteins). This is enormously problematic in reasoning about processes that involve proteins and not genes, as these should be considered disjoint. In particular, experiments that deal exclusively with genes (microarray), or proteins (two hybrid), and making strong statements via necessary conditions will lead to inconsistencies. While it will ultimately become necessary to differentiate between genes and proteins, this will require active curation or integration of knowledge which links these for yeast. However, since yeast biologists routinely use gene/protein names interchangeably, they might expect to see all relevant information when asking about either one, thus making the distinction unnecessary and inconvenient.

4.2. Knowledge management 4.4. Semantic knowledge integration in the biological sciences Although a large amount of information about yeast genes and proteins is available at SGD, new experimental information is being generated everyday in laboratories across the world. Clearly, facile methods to publish information and rapid access to new knowledge would be incredible assets in scientific knowledge discovery. Semantic web languages such as RDF and OWL facilitate publication of machine understandable information (i.e. RSS) and allow automated integration and querying from distributed web sources, and hence these are suitable in moving forward towards this en-

14

http://bio2rdf.org/

Given the broad scope of this ontology, we recognize that there may be subsets that overlap with other ontologies, particularly community-driven ontologies that are part of the Open Biomedical Ontologies (OBO). Two major issues that precluded the use of the first generation of OBO ontologies were: (i) OBO ontologies did not adopt OWL semantics, (ii) few of them adhered to logical subsumption. Efforts via the OBO Foundry aim to redesign OBO ontologies by adopting the formal ontology of the BFO and by providing

15

http://ontology.dumontierlab.com/yowl-jbi-example.owl

N. Villanueva-Rosales, M. Dumontier / Journal of Biomedical Informatics 41 (2008) 779–789

clearer guidelines for ontology design. Recent work has also established formal semantics to transform OBO ontologies into OWL [12–14]. Thus, we anticipate that we will be able to map our concepts to OBO ontologies in the near future. A controversial aspect of our work relies on the use of a self-declared base URI for all third part identifiers of sources without resolvable URLs, in order to facilitate future data integration. This problem is solely rooted in the fact that currently there is no global directory for base URIs of public biological data providers. Since OWL inherits the semantics of RDF, instances may be assigned to different namespaces. While we can state that the resource identified by UPI0000052DF0 is an instance of http://ontology.dumontierlab.com/yowl-1.0#OpenReadingFrame, we would expect that the proper namespace of that individual belongs to the original data provider, SGD in this case. Some data providers, such as UniProt will provide RDF or OWL documents with these identifiers (e.g. http://beta.uniprot.org/uniparc/UPI0000036C3F.rdf). Unfortunately, many data providers, including the SGD, do not make their information available in RDF, and do not provide a base URI from which their resources may be identified. Thus, it falls on a third party to assign a unique URI. One way to do this is assign an HTTP URI (URL) in the data provider’s namespace i.e. http://yeastgenome.org/UPI0000052DF0. But this choice is wholly arbitrary, and it falls on the data provider to resolve that URI, even if there is no resolver that exists today or in the future. Without a proper URL resolver, semantic web clients would query the host server and waste network bandwidth in trying every single identifier and never learning more about the resource. Another possibility is to assign a URI in the third party’s namespace, i.e. http://dumontierlab.com/PI0000052DF0, and potentially offer a service to resolve such a URI. Aside from whether the third party actually wants to serve the data, a more fundamental problem with this approach is that different URIs might be assigned for each third party data provider for the very same resource, thereby effectively dismantling the potential of RDF semantics to trivially integrate what should be an identical resource. Another possibility (the one we adopted) is the use of the Life Science Identifier (LSID), a location-independent encoding of resource (URN) [46]. The advantage of this approach is that the URL resolution of the entity is done via another protocol, therefore allowing changes in URL end-points. Unfortunately, there are two issues with the LSID approach: data providers must (i) register with an LSID authority directory and (ii) implement a resolver that will convert the URN into a URL internet resource. Problematically, many data providers have not subscribed to the LSID resolution mechanism and therefore there may never be resolution for these entities. Compounding this problem, the LSID authority directory was not available at the time of this study, and we were forced to assign LSIDs based on the data providers root DNS. Should the data providers register with the LSID authority some future date with a different LSID, we can add another sameAs statement to our knowledgebase to enable data integration. Alternatively, a case might be made for new OWL semantics to make namespaces equivalent. In any case, yOWL is ready to integrate data containing LSID identifiers or data providers URLs when available. Realizing the vision of data integration requires that statements made in different sources about a single resource be considered equivalent. As outlined in the introduction, bioinformatics databases are particularly keen on maintaining their own identifiers to maintain provenance about value added contributions. This approach results in a number of equivalent identifiers for the same resources. Using the OWL property sameAs, database cross references are made equivalent to the SGD resource. Thus, a user may query the knowledge base using any of the equivalent identifiers and return the union of statements about that resource. Some reasoners, such as Racer, provide the means to query only asserted

787

knowledge, thereby retrieving knowledge of some subset of data providers. Such behavior is particularly well suited for users wishing to filter the knowledge base depending on the data or data provider they trust. 4.5. Knowledge discovery The ability to define classes in OWL (e.g. PathwayParticipant and PhysicalInteractionParticipant) given a set of (logically described) necessary and sufficient conditions, allows a reasoner to infer the individuals that belong to that defined class. This is a significant difference between RDF and OWL, where the former has no such vocabulary to express this concept. A reasoner can also determine that a user’s query corresponds to a class already defined in an ontology. Thus, individuals that belong to defined classes will be classified by the reasoner in the realization process and will not require on the fly query evaluation, which will play a role in improving query performance over greater amount of data. 4.6. Granular semantic search The yOWL ontology supports semantic query formulation across various levels of granularity in both class and property hierarchies. In the case of class hierarchies, a scientist may generally ask about DNA regions or query specialized DNA regions (e.g. Open Reading Frame). Most of the available biological data models contain information in a specific granularity level, the population of classes using their definitions, as explained in previous paragraph, will allow a reasoner to infer the membership of individuals to classes that otherwise would be gaps precluding the integration of data at these levels of granularity (e.g. Query 3 that includes a PathwayParticipant, a class that does not contain any asserted individual). In the case of property hierarchies that are rooted on a non-transitive relation, one can ask whether there is any relation between two or more objects with multiple unknown concepts between them. This provides a general data mining approach to discover relations between two or more resources. 4.7. Transitive relations OWL ontologies provide the ability to define transitive relations (e.g. if an Open Reading Frame is part of a Chromosome Strand, and the Chromosome Strand is part of the Chromosome, then the Open Reading Frame is part of the Chromosome). These relations are very useful in knowledge discovery. Transitive relations in relational databases are not straightforward as they require recursive SQL queries that extend relational algebra. This is hard to maintain given the information needed a priori (e.g. database schema, datatypes) that may limit the scope of the application, making it domain dependent. Also, the user will need to have a previous training on SQL queries, which is uncommon among biological scientists. The use of transitive and composed properties among other is a clear advantage of the use of ontologies over relational databases. 4.8. Closed world reasoning and unique name assumption Life sciences terminology often requires cardinality restrictions over properties (e.g. a carbon atom has exactly 6 protons) and negation (e.g. individuals that are DNA Regions but are not physically mapped). Life sciences ontologies are populated from databases where different names represent different entities. For these reasons, the ability of using negation as failure (implemented by RacerPro and its query language nRQL) and the ability to apply Unique Name Assumption (RacerPro and Pellet) played an important role in the query answering, especially for queries for knowl-

788

N. Villanueva-Rosales, M. Dumontier / Journal of Biomedical Informatics 41 (2008) 779–789

edge discovery. These features are also important in the population of defined classes containing cardinality restrictions containing the ‘‘at least”, ‘‘at most” and ‘‘exactly” operators. Given that new hypothesis are typically generated using the available evidence, a scientist would like to pose queries with the assumption that all the knowledge is available at certain point, and therefore ‘‘closed-world” reasoning is necessary and applicable in the biology domain. 4.9. Query answering interfaces The construction of semantically correct queries is facilitated by user-friendly interfaces. Protégé 4 offers to the users the ability to construct class queries using English phrase-like phrases (the Manchester OWL Syntax [42]). The Protégé 4.0 DL query plug-in aids in the construction of the query by dynamically suggesting the phrase grammar and available entities, relations and individuals. This kind of interface helped in the construction of sophisticated queries with no training required. Unfortunately, class queries do not return the individuals that bound the graph pattern based queries, which is essential when users want to identify multiple data that they are interested in. While RacerPro returns the set of individuals that related satisfy the graph pattern based query, and is generally quite powerful, we found it difficult to construct the nRQL queries using the RacerPorter interface. We are aware that some efforts have been made to implement more intuitive interfaces to nRQL [26] including a graphical query language [47]. We expect this trend to continue and will facilitate the use of semantic web applications by the scientific community. 4.10. Scalable data management A major challenge remains with the efficient storage and retrieval of ontological data. Current applications necessarily store all data in memory to execute reasoning tasks. It will be difficult, if not impossible to store large databases with all their inferences in memory without sophisticated hardware. In our study, due to limited resources, we necessarily restricted both the ontology size (GO slim instead of the full GO) and amount of data to reason about (2% subset of about 320,000 instances). Databases such as Instance Store [48] provide a first step towards reasoning databases, but to our knowledge this is currently limited to role-free queries. More sophisticated solutions are clearly required, like the efforts towards ontology modularization described in [49,50]. Our contribution towards this goal is the generation of decidable ontologies with large amounts of data (individuals) as use cases to represent the needs for scientist for the testing and development of new and data management (including reasoning) approaches. 5. Conclusions In this work, we described a first approach to logically describe, integrate and query yeast biological data from the SGD database using the OWL-DL ontology language. Several features make this work unique. First, we designed a domain specific ontology that incorporates concepts drawn from raw data and expert knowledge. We maintained upper level semantics by differentiating between objects and processes, using basic relations and instantiating functions, components and processes for each gene/protein based on the Gene Ontology. We described the methodology used for semantically augmenting biological databases in absence of domain specific ontologies and discuss its scope and characteristics. We also made use of OWL-DL semantics to integrate identical resources from different data providers and overcome the problem of heterogeneous identifiers integration. We illustrated the process of adding additional knowledge from UniProt (automatically) and

a relevant publication (manually) in a new ontology for modularity purposes. Finally, we illustrated the use of diverse queries at various levels of ontological granularity using DL reasoners, highlighting the difference of using open and closed world semantics and their importance in the biology domain. This work marks a beginning for using the semantic web framework in yeast knowledge discovery. However, significant challenges remain in realizing the potential of the semantic web, such as the automated creation and population of ontologies (via ontology lifting), the efficient storage and reasoning with large amounts of ontological data for reasoning and the development of intuitive interfaces among others. Acknowledgments This paper is an extension of the paper [51] published in WWW2007/HCLSDI workshop proceedings. We thank the anonymous reviewers for providing constructive feedback that has improved the quality of this paper. Funding for N.V.R. and M.D. was with CONACYT scholarship #150581 and NSERC, respectively. Appendix A. Supplementary data Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.jbi.2008.05.001. References [1] Robu I, Robu V, Thirion B. An introduction to the Semantic Web for health sciences librarians. J Med Libr Assoc 2006;94(2):198–205. [2] Berners-Lee T, Hall W, Hendler J, Shadbolt N, Weitzner DJ. Computer science. Creating a science of the Web. Science 2006;313(5788):769–71. [3] Berners-Lee , Hendler J. Publishing on the Semantic Web. Nature 2001;410(6832):1023–4. [4] Spackman KA. Normal forms for description logic expressions of clinical concepts in SNOMED RT. In: Proc AMIA Symp, 2001. p. 627–31. [5] Kashyap V. The UMLS semantic network and the semantic web. In: AMIA Annu Symp Proc, 2003. p. 351–5. [6] Bodenreider IO, Mitchell CJA, McCray JAT. Biomedical ontologies. Pac Symp Biocomput 20022005:76–8. [7] Foster I, Kesselman C, Nick J, Tuecke S. The physiology of the grid: an open grid services architecture for distributed systems integration. Globus Project, 2002. [8] Martin D, Burstein M, Hobbs J, Lassila O, McDermott D, McIlraith S, et al. OWLS: semantic markup for web services, 2004. Available from: http:// www.w3.org/Submission/OWL-S/. [9] The Gene Ontology (GO) project in 2006. Nucleic Acids Res 2006;1(34):D322-6 [Database issue]. [10] Grenon P, Smith B, Goldberg L. Biodynamic ontology: applying BFO in the biomedical domain. Stud Health Technol Inform 2004;102:20–38. [11] Smith B, Ceusters W, Klagges B, Kohler J, Kumar A, Lomax J, et al. Relations in biomedical ontologies. Genome Biol 2005;6(5):R46. [12] Golbreich C, Horrocks I. The OBO to OWL mapping, GO to OWL 1.1! OWLED, 2007. [13] Wroe CJ, Stevens R, Goble CA, Ashburner M. A methodology to migrate the gene ontology to a description logic environment using DAML+OIL. Pac Symp Biocomput 2003:624–35. [14] Moreira D, Musen M. OBO to OWL: a protege OWL tab to read/save OBO ontologies. Bioinformatics 2007;23(14):1868–70. [15] W3C. OWL Web Ontology Language Guide, 2004. Available from: http:// www.w3.org/TR/owl-guide/. [16] Horrocks I. Applications of description logics: state of the art and research challenges. In: International conference on computational science, Kassel, Germany; 2005: p. 78–90. [17] Sirin E, Parsia B, Grau BC, Kalyanpur A, Katz Y. Pellet: a practical owl-dl reasoner. In: Third international Semantic Web conference, 2004. [18] Tsarkov D, Horrocks I. FaCT++ description logic reasoner: system description. In: Joint conf on automated reasoning, 2006. p. 292–7. [19] Haarslev V, Möller R, Wessel M. Querying the Semantic Web withRacer+nRQL. In: KI-04 workshop on applications on description logics. Ulm, 2004. [20] Wolstencroft K, Lord P, Tabernero L, Brass A, Stevens R. Protein classification using ontology classification. Bioinformatics 2006;22(14):e530–538e538. [21] Group. Semantic Web Best Practices and Deployment Working Group: Semantic Web Best Practices, 2004. Available from: http://www.w3.org/ 2001/sw/BestPractices/. [22] Rector AL. Modularisation of domain ontologies implemented in description logics and related formalisms including OWL. Knowledge Capture. Sanibel Island, FL, USA: ACM Press; 2003. [23] Aranguren ME. Ontology design patterns for the formalisation of biological ontologies. Manchester: University of Manchester, 2005.

N. Villanueva-Rosales, M. Dumontier / Journal of Biomedical Informatics 41 (2008) 779–789 [24] Heja G, Varga P, Pallinger P, Surjan G. Restructuring the foundational model of anatomy. Stud Health Technol Inform 2006;124:755–60. [25] Zhang , Bodenreider O, Golbreich C. Experience in reasoning with the foundational model of anatomy in OWL DL. Pac Symp Biocomput 2006:200–11. [26] Christopher JOB, Arash S-N, Xiao S, Volker H, Greg B. Semantic Web infrastructure for fungal enzyme biotechnologists. Web Semant 2006;4(3):168–80. [27] Luciano JS. PAX of mind for pathway researchers. Drug Discov Today 2005;10(13):937–42. [28] Romero P, Wagg J, Green ML, Kaiser D, Krummenacker M, Karp PD. Computational prediction of human metabolic pathways from the complete human genome. Genome Biol 2005;6(1):R2. [29] Joshi-Tope, Gillespie M, Vastrik I, D’Eustachio P, Schmidt E, de Bono B, et al. Reactome: a knowledgebase of biological pathways. Nucleic Acids Res 2005Jan 1;33(Database issue):D428-32.;1(33):D428–32 [Database issue]. [30] T.Kulikova, Akhtar R, Aldebert P, Althorpe N, Andersson M, Baldwin A, et al. EMBL Nucleotide Sequence Database in 2006. Nucleic Acids Res 2007Jan;35(Database issue):D16-20.;35:D16–20 [Database issue]. [31] Sugawara , Abe T, Gojobori T, Tateno Y. DDBJ working on evaluation and classification of bacterial genes in INSDC. Nucleic Acids Res 2007Jan;35(Database issue):D13-5.;35:D13–5 [Database issue]. [32] Wheeler, Barrett T, Benson DA, Bryant SH, Canese K, ChetverninV, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 2007Jan;35(Database issue):D5-12.;35:D5–D12 [Database issue]. [33] Babnigg G, Giometti CS. A database of unique protein sequence identifiers for proteome studies. Proteomics 2006;6(16):4514–22. [34] The Universal Protein Resource (UniProt). Nucleic Acids Res 2007;35:D193-7 [Database issue]. [35] Smith AK, Cheung KH, Yip KY, Schultz M, Gerstein MK. LinkHub: a Semantic Web system that facilitates cross-database queries and information retrieval in proteomics. BMC Bioinform 2007;8(Suppl. 3):S5. [36] Cheung KH, Yip KY, Smith A, Deknikker R, Masiar A, Gerstein M. YeastHub: a Semantic Web use case for integrating data in the life sciences domain. Bioinformatics 2005;21(Suppl. 1):i85–96i96. [37] Dwight SS, Harris MA, Dolinski K, Ball CA, Binkley G, Christie KR, et al. Saccharomyces Genome Database (SGD) provides secondary gene annotation using the Gene Ontology (GO). Nucleic Acids Res 2002;30(1):69–72.

789

[38] Noy NF, McGuinness DL. Ontology development 101: a guide to creating your first ontology; 2001. Report No.: Stanford Knowledge Systems Laboratory Technical Report KSL-01-05, Stanford Medical Informatics Technical Report SMI-2001-0880. [39] Gruninger M, Fox MS. Methodology for the design and evaluation of ontologies. In: IJCAI, Montreal, 1995. [40] Uschold M. Towards a methodology for building ontologies. In: Workshop on basic ontological issues in knowledge sharing, 1995. [41] Dumontier M, Villanueva-Rosales N. Three-layer OWL ontology design. In: Second international workshop on modular ontologies (WOMO07), Whistler, Canada, 2007. [42] M, Drummond N, Goodwin J, Rector A, Stevens R, Wang HH. The Manchester OWL syntax. In: OWLED, Athens, Georgia, 2006.. [43] Horrocks I, Sattler U, Tobies S. Reasoning with individuals for the description logic shiq. In: MacAllester D, editor. 17th international conference on automated deduction (CADE-17). Germany: Springer Verlag, 2000. p. 482–96. [44] Jordan PW, Klein F, Leach DR. Novel roles for selected genes in meiotic DNA processing. PLoS Genet 2007;3(12):e222. [45] Nash, S.WengB.Hitz, Balakrishnan TR, Christie SKR, Costanzo TMC, et al. Expanded protein information at SGD: new pages and proteome browser. Nucleic Acids Res 20042007;35:59D468–70D471. [Database issue]. [46] Clark T, Martin S, Liefeld T. Globally distributed object identification for biological knowledge bases. Brief Bioinform ;5(1):59–70. [47] Fadhil A, Haarslev V. GLOO: a graphical query language for OWL ontologies. OWLED, 2006. [48] Horrocks I, Li L, Turi D, Bechhofer S. The instance store: DL reasoning with large numbers of individuals. In: Haarslev V, Möller R, editors. Description Logics, Whistler, British Columbia, Canada: CEUR-WS.org, 2004. [49] Grau BC, Horrocks I, Kazakov Y, Sattler U. Just the right amount: extracting modules from ontologies. In: Banff AB, editor. World WideWeb, Canada: ACM Press, 2007. p. 717–27. [50] Grau BC, Halaschek-Wiener C, Kazakov Y. History matters: incremental ontology reasoning using modules. In: International Semantic Web conference, Busan, South Korea: Springer, 2007. [51] Villanueva-Rosales N, Osbahr K, Dumontier M. Towards a semantic knowledge base for yeast biologists. In: First international workshop on health care and life sciences data integration for the Semantic Web (HCLS 2007), Banff, Alberta; 2007.