The state of the art in text mining and natural language processing for pharmacogenomics

Journal of Biomedical Informatics 45 (2012) 825–826

Contents lists available at SciVerse ScienceDirect

Journal homepage: www.elsevier.com/locate/yjbin

Guest Editorial

Pharmacogenomics researchers need high-quality information about the genetic modulators of drug responses. Although PharmGKB (the Pharmacogenomics Knowledge Base, http://www.pharmgkb.org/) [1] and other public resources provide such knowledge, most of it remains within the free text of scientific articles and biomedical reports. It is crucial to provide Natural Language Processing (NLP) tools to extract and encode published facts to support the design of new experiments and the discovery of new knowledge in pharmacogenomics [2]. Most pharmacogenomics knowledge takes the form of ternary relationships among (i) a drug treatment, (ii) a genetic variant, and (iii) a drug response. Information extraction in pharmacogenomics has primarily focused on the binary relationships that comprise parts of this triad.

This special section of JBI presents recent progress in NLP in support of pharmacogenomics. Some of the work was presented at a symposium dedicated to this topic at the annual Pacific Symposium on Biocomputing in 2010 (http://www.psb.stanford.edu/); other articles were contributed in response to an open call for papers for this journal.

The last decade has witnessed great progress in NLP applied to the general field of molecular biology. NLP technologies can create gene and protein networks that are useful for systems biology. The field has been spurred by international challenges, or "shared tasks", that provide comparative evaluations of different approaches under controlled circumstances [3,4]. Organizers of these competitions create annotated corpora, reference vocabularies, and evaluation metrics that guide participants as they train, tune and test their approaches. These resources have improved with each competition and have allowed advances to be quantified. Similar evaluations may be useful for pharmacogenomics in order to catalyze and document similar progress.
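
As an illustration of the evaluation metrics mentioned above, the following minimal sketch scores a set of extracted relations against a gold standard using precision, recall and F1, the measures reported throughout this editorial. The function name `score_relations` and the toy drug-gene pairs are hypothetical and chosen only for illustration; they are not taken from any of the shared tasks or articles discussed here.

```python
def score_relations(predicted, gold):
    """Return (precision, recall, F1) for extracted vs. gold-standard relations.

    Both arguments are iterables of hashable relation tuples,
    e.g. (drug, gene). Duplicates are ignored.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Toy data (hypothetical, for illustration only).
gold = {("warfarin", "CYP2C9"), ("warfarin", "VKORC1"), ("clopidogrel", "CYP2C19")}
predicted = {
    ("warfarin", "CYP2C9"),
    ("clopidogrel", "CYP2C19"),
    ("warfarin", "TP53"),
    ("aspirin", "CYP2C9"),
}

precision, recall, f1 = score_relations(predicted, gold)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

Because F1 is the harmonic mean of precision and recall, a system that returns every co-occurring pair can reach a recall near 1 and still obtain a low F1 when its precision is poor.
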
At the same time, researchers in translational bioinformatics [5] have used NLP with marked success. For example, Campillos et al. extracted drug side effect information from drug labels (also known as package inserts) to compute similarities of side effects and to infer whether two drugs share a target [6]. The authors made several in silico predictions and validated 11 novel drug-target relationships in vitro. Percha et al. used a random forest classifier, trained with drug-gene interactions extracted from the literature, to study drug-drug interactions [7]. On the basis of textual features extracted from articles, the classifier succeeded in predicting novel drug-drug interactions.

1532-0464/$ - see front matter © 2012 Published by Elsevier Inc. http://dx.doi.org/10.1016/j.jbi.2012.08.001

The eight articles in this special section highlight recent successes in applying NLP to key tasks within pharmacogenomics, with the goal of providing reliable tools to support translational bioinformatics. The first five articles describe approaches to extracting relevant pharmacogenomics relationships from the literature. The last three articles describe the evaluation and the construction of new resources for future research in pharmacogenomics NLP, especially the creation of new annotated corpora.

The article by R. Xu and Q. Wang demonstrates the extraction of drug-gene relationships from text [8]. Their extraction method recognizes exact mentions of drugs and genes that co-occur in the same sentence. Importantly, they demonstrate the feasibility of performing this extraction on all MEDLINE abstracts, processing 100 million sentences (grouped in 20 million abstracts) using a cloud-based named entity recognizer. The authors demonstrate that the precision and F1 measures obtained by their approach increase substantially (from 0.11 to 0.345 for the precision and from 0.201 to 0.402 for the F1 measure) when the extraction is done only from sentences where one gene-drug relationship is present.

The article by Rance et al. differs from the first in that the information extraction is focused on the relationship between drugs and gene variants instead of genes more generally [9]. Features of genetic variants, such as their position in the sequence, are identified using regular expressions and subsequently compared to the entries within dbSNP in order to map the variants to standard identifiers. The authors use the drug-variant relationships manually curated in PharmGKB to evaluate the recall of their approach (0.33 on a corpus of 104 MEDLINE articles) and to compare it to competing approaches.
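
The regular-expression step described for variant recognition can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of Rance et al.: the three patterns, the `PATTERNS` table, the `find_variants` function and the example sentence are all hypothetical, and the subsequent lookup against dbSNP is deliberately left out.

```python
import re

# Three-letter amino acid codes used by the protein-substitution pattern.
AA3 = ("Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|Leu|Lys|Met|"
       "Phe|Pro|Ser|Thr|Trp|Tyr|Val")

# Hypothetical patterns for illustration; real systems cover many more forms.
PATTERNS = {
    # dbSNP-style identifier, e.g. rs1799853
    "rsid": re.compile(r"\brs(?P<id>\d+)\b"),
    # Protein substitution, e.g. Arg144Cys
    "protein_sub": re.compile(rf"\b(?P<ref>{AA3})(?P<pos>\d+)(?P<alt>{AA3})\b"),
    # Nucleotide substitution, e.g. 430C>T
    "dna_sub": re.compile(r"\b(?P<pos>\d+)(?P<ref>[ACGT])>(?P<alt>[ACGT])\b"),
}


def find_variants(text):
    """Return (kind, mention, features) tuples in order of appearance."""
    hits = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), kind, match.group(0), match.groupdict()))
    hits.sort(key=lambda hit: hit[0])  # order by character offset
    return [(kind, mention, features) for _, kind, mention, features in hits]


# Made-up example sentence, for illustration only.
sentence = "Carriers of rs1799853 (Arg144Cys) and the 430C>T change needed lower doses."
for kind, mention, features in find_variants(sentence):
    print(kind, mention, features)
```

The named groups expose the position and the reference and alternate residues, which are the kind of features that a later normalization step could compare against database entries.
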
The next article, by Hakenberg et al., applies an exhaustive information extraction approach to a corpus of 179,935 articles with the goal of automatically populating a database of twelve different relationship types among nine different entity types (genes, drugs, diseases, adverse effects, mutations, refSNP, alleles, population and frequency) [10]. The method uses distinct adapted tools for the recognition of each entity type (e.g., BANNER for genes [11] and MutationFinder for mutations [12]) and combines seven methods for relation extraction, ranging from simple co-occurrence to syntactic parse trees. The resulting resource contains 233,964 relationships extracted with a precision of 0.48–0.84 and a recall of 0.73–1, depending on the type of relationship. The full set of resulting relationships between entities is presented in a new database called SNPshot.

The next two articles explore the feasibility of using manually curated relationships within PharmGKB as a gold standard for evaluating information extraction and as a training set for predicting new relations. Rinaldi et al. adapted their OntoGene tool, developed for the extraction of protein–protein interactions, to the extraction of pharmacogenomics relations. The authors evaluate the performance of their tool using metrics from the BioCreative shared task, and using PharmGKB gene-drug relationships as a gold standard [13]. The results are important because they provide an available benchmark for extracting pharmacogenomics relationships.

Pakhomov et al. take advantage of the PharmGKB labels of positive and negative relationships between drugs and genes to create a machine learning system to predict drug-gene relationships [14]. The authors extract textual features (words or groups of two words) from articles where relationships have been manually curated, and subsequently train a Support Vector Machine (SVM) to distinguish text that mentions positive associations from text that does not. When a new abstract that mentions a drug and a gene is provided to the SVM, it classifies the pair as likely being related or unrelated. The paper identifies several candidate gene targets not mentioned in PharmGKB and concludes that the existing PharmGKB corpus can be used to predict novel pharmacogenomic relationships.

The first of the three articles dealing with the evaluation and construction of new resources, by Li and Lu, searches for pharmacogenomics knowledge outside of MEDLINE by focusing on the 93,661 clinical trial records included in ClinicalTrials.gov [15]. The lexicons of PharmGKB are used to recognize records containing a gene and a drug. Manual evaluation of the approach shows a precision of 0.74 and demonstrates that ClinicalTrials.gov is a rich source of known and novel pharmacogenomic relationships.

The last two papers of the special section describe the development of two corpora manually annotated with drug-disease-gene relationships and drug-adverse effect relationships.
The corpus developed by Van Mulligen et al. is the EU-ADR Corpus [16]. It consists of 300 MEDLINE abstracts with annotated relationships among drugs, disorders, and targets. The entities were initially annotated with a named entity recognition system, and then corrected and completed by three annotators. Relationships were proposed to annotators when two entities co-occurred in a sentence; annotators evaluated the relationship and associated it with a type (target-disease, target-drug or drug-disease) and a level of certainty (positive, negative or speculative). Only 1037 of 2436 annotated relationships were ultimately approved by a group of three curators.

The second corpus, described by Gurulingappa et al. and named ADE, is much larger (2972 documents). It focuses specifically on drug-adverse effect and drug-dosage relationships [17]. Three individuals annotated MEDLINE case reports manually at the sentence level. In the end the resource contains 6821 drug-adverse event relationships. The authors use the resulting annotations to train classifiers that distinguish sentences mentioning drug-adverse event relationships. They achieve a precision between 0.75 and 0.91. Both of these corpora, EU-ADR and ADE, will be of great help in the development and evaluation of future information extraction systems.

Even without the current exponential increase in the rate of scientific publications, manual curation without machine assistance is demonstrably not capable of populating pharmacogenomics databases [19]. It is notable that when the call for papers for the first (in 2010) Pacific Symposium on Biocomputing workshop on the extraction of gene-drug-disease relationships from text [18] was published, there were no annotated corpora and no properly evaluated gold standards for pharmacogenomic NLP.
It is hoped that the material presented in this special section will further catalyze the development of improved and efficient NLP systems and enable the emergence of a new set of NLP challenges relevant to the critical field of personalized medicine and pharmacogenomics.

References

[1] McDonagh EM, Whirl-Carrillo M, Garten Y, Altman RB, Klein TE. From pharmacogenomic knowledge acquisition to clinical applications: the PharmGKB as a clinical pharmacogenomic biomarker resource. Biomark Med 2011;5(6):795–806.
[2] Garten Y, Coulet A, Altman RB. Recent progress in automatically extracting information from the pharmacogenomic literature. Pharmacogenomics 2010;11(10):1467–89.
[3] Arighi CN, Lu Z, Krallinger M, Cohen KB, Wilbur WJ, Valencia A, Hirschman L, Wu CH. Overview of the BioCreative III Workshop. BMC Bioinformatics 2011;12(Suppl 8):S1.
[4] Kim JD, Pyysalo S, Ohta T, Bossy R, Nguyen N, Tsujii J. Overview of BioNLP shared task 2011. In: Proceedings of BioNLP shared task 2011 workshop; 2011. p. 1–6.
[5] Butte AJ. Translational bioinformatics: coming of age. J Am Med Inform Assoc 2008;15(6):709–14.
[6] Campillos M, Kuhn M, Gavin A-C, Jensen LJ, Bork P. Drug target identification using side-effect similarity. Science 2008;321(5886):263–6.
[7] Percha B, Garten Y, Altman RB. Discovery and explanation of drug-drug interactions via text mining. In: Proceedings of the Pacific Symposium on Biocomputing; 2012. p. 410–21.
[8] Xu R, Wang QQ. A knowledge-driven conditional approach to extract pharmacogenomics specific drug-gene relationships from free text. J Biomed Inform 2012;45(5):827–34.
[9] Rance B, Doughty E, Demner-Fushman D, Kann MG, Bodenreider O. A mutation-centric approach to identifying pharmacogenomic relations in text. J Biomed Inform 2012;45(5):835–41.
[10] Hakenberg J, Voronov D, Nguyen VH, Liang S, Anwar S, Lumpkin B, et al. A SNPshot of PubMed to associate genetic variants with drugs, diseases, and adverse reactions. J Biomed Inform 2012;45(5):842–50.
[11] Leaman R, Gonzalez G. BANNER: an executable survey of advances in biomedical named entity recognition. In: Proceedings of the Pacific Symposium on Biocomputing; 2008. p. 652–63.
[12] Caporaso JG, Baumgartner Jr WA, Randolph DA, Cohen KB, Hunter L. MutationFinder: a high-performance system for extracting point mutation mentions from text. Bioinformatics 2007;23(14):1862–5.
[13] Rinaldi F, Schneider G, Clematide S. Relation mining experiments in the pharmacogenomics domain. J Biomed Inform 2012;45(5):851–61.
[14] Pakhomov S, McInnes BT, Lamba J, Liu Y, Melton GB, Ghodke Y, et al. Using PharmGKB to train text mining approaches for identifying potential gene targets for pharmacogenomic studies. J Biomed Inform 2012;45(5):862–9.
[15] Li J, Lu Z. Systematic identification of pharmacogenomics information from clinical trials. J Biomed Inform 2012;45(5):870–8.
[16] van Mulligen EM, Fourrier-Reglat A, Gurwitz D, Molokhia M, Nieto A, Trifiro G, et al. The EU-ADR corpus: annotated drugs, diseases, targets, and their relationships. J Biomed Inform 2012;45(5):879–84.
[17] Gurulingappa H, Rajput AM, Roberts A, Fluck J, Hofmann-Apitius M, Toldo L. Development of a benchmark corpus to support the automatic extraction of drug-related adverse effects from medical case reports. J Biomed Inform 2012;45(5):885–92.
[18] Coulet A, Shah NH, Hunter L, Baral C, Altman RB. Extraction of genotype-phenotype-drug relationships from text: from entity recognition to bioinformatics application. In: Proceedings of the Pacific Symposium on Biocomputing; 2010. p. 485–7.
[19] Baumgartner Jr WA, Cohen KB, Fox L, Acquaah-Mensah G, Hunter LE. Manual annotation is not sufficient for curating genomic databases. Bioinformatics 2007;23:i41–8.

Commentary:

Adrien Coulet
LORIA – Inria Nancy-Grand Est, University of Lorraine, 54506 Vandœuvre-lès-Nancy, France
E-mail address: [email protected]

Guest Editors

K. Bretonnel Cohen
Computational Bioscience Program, University of Colorado School of Medicine, Mail Stop 8303, 12801 E. 17th Ave., RC1S-L18 6104, Aurora, CO 80045, USA
E-mail address: [email protected]

Russ B. Altman
Department of Genetics, Department of Bioengineering, Stanford University, 318 Campus Drive, S172, Stanford, CA 94305, USA
E-mail address: [email protected]