17th IFAC Symposium on System Identification
Beijing International Convention Center
October 19-21, 2015. Beijing, China

Available online at www.sciencedirect.com
ScienceDirect
IFAC-PapersOnLine 48-28 (2015) 012–016

Predicting drug-target interaction based on sequence and structure information

Wei Lan*, Jianxin Wang*, Min Li*, Ruiqing Zheng*, Fang-Xiang Wu**, Yi Pan***

* School of Information Science and Engineering, Central South University, Changsha, 410083, China (e-mail: weilan, jxwang, limin, [email protected]).
** Division of Biomedical Engineering and Department of Mechanical Engineering, University of Saskatchewan, Saskatoon, SK S7N 5A9, Canada (e-mail: [email protected])
*** Department of Computer Science, Georgia State University, Atlanta, GA 30303, USA (e-mail: [email protected])

Abstract: It is well known that discovering a new drug is a cumbersome, time-consuming and expensive process. Computational approaches for identifying interactions between drug compounds and target proteins have become important in drug discovery, as they help to reduce these obstacles. The difficulties of drug-target interaction identification include the lack of known drug-target associations and the absence of experimentally verified negative examples. In this study, we present a method, called PUDT, to predict drug-target interactions. Instead of treating unknown interactions as negative examples, we consider unknown interactions as unlabeled examples. The unlabeled examples are divided into two parts, reliable negative examples and likely negative examples, based on protein structure similarity. Then, a weighted support vector machine is used to build a classifier to predict drug-target interactions based on protein sequence and drug structure information. Four data sets (enzymes, ion channels, GPCRs and nuclear receptors) are used to evaluate the performance of the proposed method PUDT. The experimental results demonstrate that our method PUDT outperforms recent state-of-the-art approaches.

© 2015, IFAC (International Federation of Automatic Control) Hosting by Elsevier Ltd. All rights reserved.

Keywords: Drug, Target protein, Sequence, Structure, Positive-unlabeled learning.

Peer review under responsibility of International Federation of Automatic Control. ISSN 2405-8963. DOI 10.1016/j.ifacol.2015.12.092

1. INTRODUCTION

In the field of pharmacology, drug discovery is a costly and time-consuming process. According to the Food and Drug Administration's statistical data, the cost of discovering a new molecular entity is approximately $1.8 billion and the process takes 13 years on average (Hopkins, 2012). Therefore, reducing these expenses is an important issue in pharmacology. Computational methods provide an effective strategy to address this issue.

With the development of high-throughput techniques, a great deal of drug-target interaction data has been generated (Moffat et al. 2014). Several databases have been established to store interaction information and provide relevant retrieval servers. For example, the comprehensive database DrugBank (Law et al. 2014) provides information on drugs and targets together with their interactions. ChEMBL (Gaulton et al. 2012) is a web resource of bioactive molecules, which contains 10,579 targets, 1,637,862 compound records and 2,843,338 items of bioactivity evidence. SuperTarget (Hecker et al. 2012) is an online drug-target database which includes approximately 6000 target proteins, 196,000 drugs and 330,000 drug-target associations.

Recently, some computational methods have been proposed to predict drug-target interactions from available interaction data. The traditional computational methods for identifying drug-target interactions can be classified into three categories: ligand-based methods (Keiser et al. 2007, Pérot et al. 2013), docking-based methods (Cheng et al. 2007, Combs et al. 2013) and literature text mining methods (Zhu et al. 2005).

These approaches have achieved great success in drug-target interaction prediction. However, they still have some limitations. For example, the ligand-based methods rely on the number of known ligands, the docking-based methods need protein structure information, and the literature text mining based methods are unable to find unknown and interesting interactions.

In the past years, more and more statistical methods have been proposed to predict drug-target interactions by integrating multiple sources of biological knowledge such as drug chemical structures, target protein sequences, gene expression and known drug-target interactions. Chen et al. (2012) present a network-based random walk with restart method, called NRWRH, to predict relationships between drugs and targets by integrating the drug-drug chemical structure similarity network, the protein-protein sequence similarity network and the known drug-target interaction network into a heterogeneous network. Machine learning methods have been employed to identify associations between drugs and targets. The assumption of these approaches is that similar drugs show similar patterns of interaction with targets in the drug-target interaction network. Cheng et al. (2012) propose three inference methods, drug-based similarity inference (DBSI), target-based similarity inference (TBSI) and network-based inference (NBI), to predict drug-target interactions. Similar work has been accomplished by Alaimo et al. (2013). They present a DT-hybrid approach which extends a network-based inference method by using domain-based knowledge to detect drug-target interactions. In addition, Chen and Zhang (2013) present a NetCBP method by maximizing the rank coherence with respect to known knowledge to identify
associations between drugs and targets. Bleakley and Yamanishi (2009) employ a bipartite local model to predict relationships between drugs and targets. Further work has been completed by Mei et al. (2013), who integrate neighbour information into the bipartite local model for drug-target interaction identification. The Gaussian interaction profile kernel and weighted nearest neighbour are integrated for drug-target interaction prediction (Van Laarhoven 2013). In addition, Bayesian matrix factorization and binary classification (Gönen 2012) and probabilistic matrix factorization (Cobanoglu et al. 2013) have been proposed to detect drug-target interactions.
Although these approaches have achieved good performance, there are some limitations and difficulties for drug-target interaction prediction. Firstly, most methods adopt sequence information to measure the similarity of two proteins. However, studies demonstrate that structure information is more conserved than sequence information (Volkamer et al. 2014). Therefore, the structure information of target proteins may be better suited for drug-target interaction identification. Secondly, there are no experimentally verified negative examples. These methods treat the non-interaction data as negative samples, which is unreasonable as those non-interaction data may include undiscovered drug-target interactions. Thirdly, known drug-target interaction data are rare.

In this paper, we propose a method, called PUDT, to predict drug-target interactions based on positive-unlabeled learning. Sequence and structure information are utilized to measure the similarity between two targets. In addition, we treat unknown drug-target interactions as an unlabeled set U instead of a negative set N. The random walk with restart is used to divide the unlabeled data U into two sets, named the reliable negative set RN and the likely negative set LN, based on protein structure information. A weighted support vector machine is used to build a multi-level classifier to predict drug-target interactions based on the positive set, the reliable negative set and the likely negative set. Four datasets (Enzymes, Ion Channels, GPCRs and Nuclear Receptors) are used to test the effectiveness of the proposed method. The experimental results demonstrate that our method outperforms state-of-the-art approaches.

2. MATERIALS AND METHODS

2.1 Data Preparation

In this paper, we use four drug-target interaction networks involving Enzymes, Ion Channels, GPCRs and Nuclear Receptors, which were first analyzed by Yamanishi et al. (2010). These datasets can be downloaded from http://web.kuicr.kyoto-u.ac.jp/supp/yoshi/drugtarget/. Table 1 shows the detailed information of the four datasets. The drug-target interaction data are collected from KEGG BRITE (Kanehisa et al. 2006), BRENDA (Schomburg et al. 2013), SuperTarget (Hecker et al. 2012) and DrugBank (Law et al. 2014).

Table 1. The information of four datasets

Dataset        Enzyme   Ion Channel   GPCR     Nuclear Receptor
Drug           445      210           223      54
Target         664      204           95       26
Nd/Nt          0.67     1.03          2.35     2.08
Interactions   2926     1467          635      90
Sparsity       0.0099   0.0344        0.0299   0.0641

Drug chemical structure information is retrieved from the DRUG and COMPOUND sections of KEGG LIGAND. The chemical structure similarity between compounds is calculated using SIMCOMP, which gives a score based on the size of the common substructures found by graph alignment (Hattori et al. 2003). Chemical structure similarity has been widely applied in drug-target interaction prediction (Cobanoglu et al. 2013). The sequence similarity between targets is calculated by the normalized Smith-Waterman algorithm (Smith et al. 1981), based on the amino acid sequences of the target proteins extracted from the KEGG GENES database. Given two target proteins $A_i$ and $A_j$, their sequence similarity is calculated as:

$$\mathrm{Sim\_seq}(A_i, A_j) = \frac{SW(A_i, A_j)}{\sqrt{SW(A_i, A_i)}\,\sqrt{SW(A_j, A_j)}} \qquad (1)$$

where $SW(A_i, A_j)$ is the score of the Smith-Waterman algorithm. The 3D structures of target proteins are obtained from the PDB database (Rose et al. 2013). There are over 98000 3D structures in the PDB database. For target proteins without a known 3D structure in PDB, SWISS-MODEL (Biasini et al. 2014), a popular method for generating reliable three-dimensional protein structure models based on homology modelling, is employed to predict their 3D structures. The structure similarity between targets is calculated using TM-align (Zhang et al. 2005).

2.2 Positive-Unlabeled learning for drug-target prediction (PUDT)

Our method for drug-target prediction is based on the assumption that similar drugs often target similar proteins. Traditional methods consider unknown drug-target interactions as negative examples. However, this may result in bias, as the unknown drug-target interactions may contain undiscovered interactions. Instead of treating unlabeled examples as negative examples, the random walk with restart is employed to partition the unlabeled examples into reliable negative examples and likely negative examples based on protein structure similarity.
$$P_{n+1} = (1 - a)\,W'_{ij}\,P_n + a\,P_0 \qquad (2)$$

$$W'_{ij} = D^{-1}\,W_{ij} \qquad (3)$$

$$D_{ii} = \sum_{n=1}^{j} W_{in} \qquad (4)$$

where W denotes the protein structure similarity matrix and a denotes the probability of returning to the initial nodes at each iteration. The initial vector $P_0$ denotes the prior probability vector of the positive examples, whose probabilities sum to 1. The steady state is obtained by iterating until the difference between $P_{n+1}$ and $P_n$ (measured by the L1 norm) is less than $10^{-6}$.

According to the posterior probability, the unlabeled examples are classified into two groups, LN (likely negatives) and RN (reliable negatives):

$$label(T_i) = \begin{cases} LN, & P_n(T_i) \geq ave\_score \\ RN, & P_n(T_i) < ave\_score \end{cases} \qquad (5)$$

The weighted support vector machine is used to build a classifier from the multi-level training examples (Vapnik 1998). The objective function is defined as:

$$\min: \frac{1}{2}\|w\|^2 + c' \sum_{i \in P} \xi_i + c' \sum_{i \in RN} \xi_i + c'' \sum_{i \in LN} \xi_i \qquad (6)$$

subject to:

$$y_i (w^T x_i + b) \geq 1 - \xi_i \qquad (7)$$

where $\xi_i$ denotes a slack variable that allows some training examples to be misclassified, and c', c' and c'' denote the penalty factors for misclassification of P, RN and LN, respectively. We set c' > c'' because we are more confident in RN than in LN.

3. EXPERIMENTS AND RESULTS

In order to demonstrate the performance of our method, we compare it with four state-of-the-art methods, DBSI (Cheng et al. 2012), Yamanishi et al. (2010), KBMF2K (Gönen 2012) and NetCBP (Chen et al. 2013), using five-fold cross validation on four datasets: enzymes, ion channels, GPCRs and nuclear receptors. In each experiment, the dataset is randomly divided into five subsets. Each subset is used in turn as the test set, and the remaining four subsets are used as the training set. The AUC (area under the receiver operating characteristic curve) is used to evaluate the prediction performance. The higher the AUC value, the better the prediction performance; an AUC of 0.5 indicates random performance. Table 2 presents the average AUC of DBSI, Yamanishi et al. (2010), KBMF2K, NetCBP and PUDT. It can be observed that PUDT performs better than the other four methods on all four datasets.

Table 2. The average AUC of five methods

Data               DBSI    Yamanishi (2010)   KBMF2K   NetCBP   PUDT
Enzymes            0.806   0.821              0.832    0.825    0.872
Ion Channels       0.803   0.692              0.799    0.803    0.807
GPCRs              0.803   0.811              0.857    0.823    0.878
Nuclear receptor   0.759   0.814              0.824    0.839    0.843

In order to test the influence of the restart parameter of the random walk, we vary a from 0.1 to 0.9. Fig. 1 shows the effect of a on the prediction performance. From Fig. 1, it can be seen that the best performance is obtained with a = 0.8 for the enzyme, ion channel and nuclear receptor datasets and with a = 0.9 for the GPCR dataset.

Fig. 1. The effect of different parameters on prediction performance (AUC versus parameter a for the Enzyme, Ion channel, GPCRs and Nuclear receptor datasets).

To illustrate the ability of PUDT to predict potential drug-target interactions, we conduct experiments on the four benchmark datasets. In these experiments, all known drug-target interactions are used as training data. We verify the predicted interactions manually by looking them up in the latest versions of the KEGG DRUG, ChEMBL and DrugBank databases. Most of the predicted drug-target interactions are annotated in at least one database. For example, in the nuclear receptor dataset, 57% of the predicted drug-target interactions (16 out of 26) are verified when only the top-1 scoring drug is considered. This demonstrates that our method is able to predict potential drug-target interactions. Table 3 shows the verified drug-target interactions predicted by PUDT in the four datasets. For example, nicotinamide adenine dinucleotide (NADH) is an important coenzyme composed of ribosylnicotinamide 5'-diphosphate coupled to adenosine 5'-phosphate by a pyrophosphate linkage. It is found widely in mammals and plays important roles in metabolism, for example in redox reactions. In medical applications, it is widely used in diseases such as tuberculosis, Alzheimer's disease and Parkinson's disease. According to the experimental results, its top predicted target is hsa:124 (alcohol dehydrogenase 1A). Alcohol dehydrogenase has been shown to have a strong preference for NADH and can reduce NAD to NADH in the nucleus and cytosol (Bieganowski et al. 2006).
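Picking the highest-scoring candidate targets for a drug, while skipping interactions that are already known, can be sketched as follows. The scores, the target names and the helper name top_k_candidates are hypothetical placeholders and do not correspond to any result reported here.

```python
def top_k_candidates(scores, known, k=3):
    """Return the k highest-scoring (target, score) pairs that are not already known."""
    candidates = [(target, s) for target, s in scores.items() if target not in known]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:k]


# Hypothetical classifier scores for one drug against four targets.
scores = {"target_a": 0.91, "target_b": 0.15, "target_c": 0.78, "target_d": 0.42}
known = {"target_c"}  # interactions already present in the training data

top = top_k_candidates(scores, known, k=2)
print(top)  # [('target_a', 0.91), ('target_d', 0.42)]
```

The returned pairs are the candidates that would then be looked up manually in the databases.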
Table 3. Newly verified drug-target interactions predicted by PUDT in the four datasets

Dataset            Drug ID   Target ID   Source
Enzyme             D00002    hsa:124     DrugBank
                   D00002    hsa:131     DrugBank
                   D00691    hsa:5150    DrugBank
                   D00537    hsa:759     ChEMBL
                   D00448    hsa:5742    KEGG
Ion Channel        D00438    hsa:779     KEGG, DrugBank
                   D00456    hsa:9177    KEGG
                   D00533    hsa:6331    KEGG
                   D00553    hsa:6328    KEGG
                   D03365    hsa:1137    ChEMBL, DrugBank
GPCR               D00049    hsa:8843    DrugBank
                   D00079    hsa:5731    DrugBank
                   D00106    hsa:5739    KEGG, DrugBank
                   D00715    hsa:1129    KEGG
                   D01358    hsa:150     KEGG
Nuclear Receptor   D00348    hsa:5915    KEGG, ChEMBL
                   D00348    hsa:5916    KEGG, ChEMBL
                   D00094    hsa:8914    KEGG
                   D00129    hsa:190     ChEMBL
                   D00554    hsa:2011    KEGG

In order to demonstrate the comprehensive ability of our method to predict unknown drug-target interactions, we select the top three predicted interactions on each of the four benchmark datasets. Table 4 lists these top three predicted drug-target interactions. We check the KEGG, DrugBank and ChEMBL databases to verify the predicted interactions. Three out of the 12 predicted drug-target interactions are found to be annotated in these databases, which shows that our method is useful in practical applications. The predicted drug-target interactions without annotations may simply not have been discovered yet.

Table 4. The top three drug-target interactions predicted by PUDT in the four datasets

Dataset            Drug ID   Target ID   Rank
Enzyme             D00383    hsa:1548    1
                   D00330    hsa:1178    2
                   D03218    hsa:1046    3
Ion Channel        D00658    hsa:3778    1
                   D02173    hsa:1134    2
                   D01287    hsa:6323    3
GPCR               D02340    hsa:1813    1
                   D00480    hsa:1815    2
                   D00443    hsa:367     3
Nuclear Receptor   D00554    hsa:2140    1
                   D00930    hsa:5916    2
                   D00557    hsa:2100    3

4. CONCLUSIONS

A systematic understanding of the associations between chemical compounds and target proteins is conducive to new drug design and discovery. Because traditional experimental methods for drug discovery are time-consuming and expensive, it is common for biological scientists to predict drug-target interactions by computational methods. Many computational approaches have been developed to predict drug-target interactions. However, these methods have some limitations: 1) Some methods treat unlabeled examples as negative examples, although some of them may be undiscovered positive examples. 2) Most methods use only the sequence information of target proteins to predict drug-target interactions, although it is well known that the structure of a target protein is more conserved than its sequence and more relevant than sequence to drug function. In this paper, we have proposed a new method, named PUDT, based on positive-unlabeled learning for predicting drug-target interactions. Compared with previous approaches, our method combines the structure information and sequence information of target proteins to predict drug-target interactions. The structure information is employed to divide the unlabeled examples into two categories: reliable negative examples (RN) and likely negative examples (LN). Different penalty factors are then applied to these categories when training the classifier. Four benchmark datasets (enzymes, ion channels, GPCRs and nuclear receptors) are used to evaluate the performance of our method. The results show that our method is superior to state-of-the-art approaches in prediction performance. Checks against available databases and the literature demonstrate that our method is able to discover potential drug-target interactions.

ACKNOWLEDGEMENTS

This work is supported in part by the National Natural Science Foundation of China under Grant No. 61232001, No. 61428209 and No. 61420106009; the Program for New Century Excellent Talents in University (NCET-12-0547).

REFERENCES

Alaimo S., Pulvirenti A., Giugno R., et al. (2013). Drug-target interaction prediction through domain-tuned network-based inference. Bioinformatics, 29(16), 2004-2008.
Biasini M., Bienert S., Waterhouse A., et al. (2014). SWISS-MODEL: modelling protein tertiary and quaternary structure using evolutionary information. Nucleic Acids Research, 42(1), 252-258.
Bieganowski P., Seidle H., Wojcik M., et al. (2006). Synthetic lethal and biochemical analyses of NAD and NADH kinases in Saccharomyces cerevisiae establish separation of cellular functions. J Biol Chem, 281(32), 22439-22445.
Bleakley K., and Yamanishi Y. (2009). Supervised prediction of drug-target interactions using bipartite local models. Bioinformatics, 25(18), 2397-403.
Chen H., Zhang Z. (2013). A semi-supervised method for drug-target interaction prediction with consistency in networks. PLoS One, 8(5), e62975.
Chen X., Liu M., Yan G. (2012). Drug-target interaction prediction by random walk on the heterogeneous network. Mol Biosyst, 8(7), 1970-1978.
Cheng A., Coleman R., Smyth K., et al. (2007). Structure-based maximal affinity model predicts small-molecule druggability. Nat Biotechnol, 25(1), 71-75.
Cheng F., Liu C., Jiang J., et al. (2012). Prediction of drug-target interactions and drug repositioning via network-based inference. PLoS Comput Biol, 8(5), e1002503.
Cobanoglu M., Liu C., Hu F., et al. (2013). Predicting drug-target interactions using probabilistic matrix factorization. J Chem Inf Model, 53(12), 3399-409.
Combs S., Deluca S., et al. (2013). Small-molecule ligand docking into comparative models with Rosetta. Nat Protoc, 8(7), 1277-1298.
Gaulton A., Bellis L., Bento A., et al. (2012). ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res, 40, 1100-1107.
Gönen M. (2012). Predicting drug-target interactions from chemical and genomic kernels using Bayesian matrix factorization. Bioinformatics, 28(18), 2304-2310.
Hattori M., Okuno Y., Goto S., et al. (2003). Development of a chemical structure comparison method for integrated analysis of chemical and genomic information in the metabolic pathways. J Am Chem Soc, 125(39), 11853-11865.
Hecker N., Ahmed J., von Eichborn J., et al. (2012). SuperTarget goes quantitative: update on drug-target interactions. Nucleic Acids Res, 40, 1113-1117.
Hopkins A.L. (2012). Drug discovery: Predicting promiscuity. Nature, 462(7270), 167-8.
Kanehisa M., Goto S., Hattori M., et al. (2006). From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res, 34, 354-357.
Keiser M., Roth B., Armbruster B., et al. (2007). Relating protein pharmacology by ligand chemistry. Nat Biotechnol, 25(2), 197-206.
Law V., Knox C., Djoumbou Y., et al. (2014). DrugBank 4.0: shedding new light on drug metabolism. Nucleic Acids Res, 42, D1091-1097.
Mei J., Kwoh C., Yang P., et al. (2013). Drug-target interaction prediction by learning from local information and neighbours. Bioinformatics, 29(2), 238-45.
Moffat J.G., Rudolph J.R., and Bailey D. (2014). Phenotypic screening in cancer drug discovery - past, present and future. Nat Rev Drug Discov, 13(8), 588-602.
Pérot S., Regad L., Reynès C., et al. (2013). Insights into an original pocket-ligand pair classification: a promising tool for ligand profile prediction. PLoS One, 8(6), e63730.
Rose P., Bi C., Bluhm W., et al. (2013). The RCSB Protein Data Bank: new resources for research and education. Nucleic Acids Res, 41, 475-482.
Schomburg I., Chang A., Placzek S., et al. (2013). BRENDA in 2013: integrated reactions, kinetic data, enzyme function data, improved disease classification: new options and contents in BRENDA. Nucleic Acids Res, 41, 764-772.
Smith T., and Waterman M. (1981). Identification of common molecular subsequences. J Mol Biol, 147(1), 195-197.
Van Laarhoven T., Marchiori E. (2013). Predicting Drug-Target Interactions for New Drug Compounds Using a Weighted Nearest Neighbor Profile. PLoS One, 8(6), e66952.
Vapnik V. (1998). Statistical Learning Theory. Wiley, New York.
Volkamer A., and Rarey M. (2014). Exploiting structural information for drug-target assessment. Future Med Chem, 6(3), 319-331.
Yamanishi Y., Kotera M., Kanehisa M., et al. (2010). Drug-target interaction prediction from chemical, genomic and pharmacological data in an integrated framework. Bioinformatics, 26(12), 246-54.
Zhang Y., Skolnick J. (2005). TM-align: A protein structure alignment algorithm based on TM-score. Nucleic Acids Research, 33, 2302-2309.
Zhu S., Okuno Y., Tsujimoto G., et al. (2005). A probabilistic model for mining implicit 'chemical compound-gene' relations from literature. Bioinformatics, 21, 245-251.