SFM: A novel sequence-based fusion method for disease genes identification and prioritization


Journal of Theoretical Biology ∎ (∎∎∎∎) ∎∎∎–∎∎∎


Journal of Theoretical Biology journal homepage: www.elsevier.com/locate/yjtbi

Abdulaziz Yousef, Nasrollah Moghadam Charkari*
Faculty of Electrical & Computer Engineering, Tarbiat Modares University, Tehran, Iran

HIGHLIGHTS

A novel sequence-based fusion method for disease gene identification is presented.
The sequences of the proteins are used as prior knowledge to represent genes.
Likely non-disease genes (negative data) are selected using distance metrics.
The results of four SVM predictors are fused by a Decision Tree to make the final decision.
Protein sequences are found to be useful in determining the similarity of disease genes.

ARTICLE INFO

Article history: Received 13 February 2015; Received in revised form 27 June 2015; Accepted 13 July 2015

Keywords: Classification; Disease gene; Protein; Physicochemical properties of amino acid; Fusion method

ABSTRACT

The identification of disease genes in the human genome is of great importance for improving the diagnosis and treatment of disease. Several machine learning methods have been introduced to identify disease genes. These methods mostly differ in the prior knowledge used to construct the feature vector for each instance (gene), in the way negative data (non-disease genes) are selected, since there is no experimental approach to find them, and in the classification methods used to make the final decision. In this work, a novel sequence-based fusion method (SFM) is proposed to identify disease genes. Unlike existing methods, instead of using noisy and incomplete prior knowledge, the amino acid sequence of the proteins, which is universally available, is used to represent the genes (proteins) as four different feature vectors. To select more likely negative data from the candidate genes, the intersection of four negative sets generated using a distance approach is taken. Then, a Decision Tree (C4.5) is applied as a fusion method to combine the results of four independent state-of-the-art predictors based on the support vector machine (SVM) algorithm and to make the final decision. The proposed method has been evaluated with standard measures. The results indicate a precision, recall and F-measure of 82.6%, 85.6% and 84%, respectively. These results confirm the efficiency and validity of the proposed method.
© 2015 Published by Elsevier Ltd.

* Corresponding author. Tel.: +98 21 82883301; fax: +98 21 82884325. E-mail address: [email protected] (N. Moghadam Charkari).

http://dx.doi.org/10.1016/j.jtbi.2015.07.010
0022-5193/© 2015 Published by Elsevier Ltd.

Please cite this article as: Yousef, A., Moghadam Charkari, N., SFM: A novel sequence-based fusion method for disease genes identification and prioritization. J. Theor. Biol. (2015), http://dx.doi.org/10.1016/j.jtbi.2015.07.010

1. Background

Identifying causative genes for human diseases is an important step towards understanding disease mechanisms and improving diagnosis and therapeutic practice. Genome-wide linkage and association studies are often used to identify genomic regions that potentially contain hundreds of candidate genes related to genetic diseases (Glazier et al., 2002). Identifying disease-associated genes among this vast number of candidates using experimental methods is expensive and time consuming. Hence, the need for computational approaches has emerged. These approaches are based on the guilt-by-proximity principle: unknown genes whose features are similar to those of confirmed disease genes are candidate disease genes with high probability. The similarity between unknown and known disease genes is computed from a variety of genomic data generated by high-throughput technologies.

Two types of computational approaches have been proposed in this area. The first type attempts to prioritize candidate disease genes and to return a shorter, ranked list for further investigation. Some of these methods prioritize candidate disease genes using functional similarity data, such as gene ontology (GO) (Freudenberg and Propping, 2002) and gene expression profiles (Ala et al., 2008). Some methods employ protein–protein interaction (PPI) data for disease gene prioritization (Kohler et al., 2008; Navlakha and Kingsford, 2010; Yang et al., 2011; Zhang et al., 2011). Here, the motivation is that proteins which lead to similar phenotypes have a higher chance of being connected in the network (Xu and Li, 2006). Finally, other methods integrate multiple data sources to prioritize candidate disease genes (Aerts et al., 2006). The main difference between the integration methods is the type of data sources that are combined. Aerts et al. (2006) and De Bie et al. (2007) integrate nine data sources, e.g. sequence data, gene annotation data, DNA sequence, etc., while Li and Patra (2010) combine PPI data and GO data to prioritize candidate genes. The fundamental flaw of these methods is the difficulty of finding an appropriate threshold that separates disease genes from non-disease genes.

Therefore, a second type of approach was developed. In addition to prioritizing the candidate genes, these methods attempt to provide a proper threshold to decide whether a specific gene is disease related or not. Here, disease gene identification is defined as a two-class classification problem. Xu and Li (2006) employed a k-nearest-neighbor classifier based on PPI topological properties of genes, such as the percentage of disease genes in a protein's neighborhood, protein degree, etc. Smalter et al. (2007) used PPI topological properties to build the feature vector and an SVM classifier to predict disease genes. De Bie et al. (2007) introduced a novel kernel method that finds a hyperplane separating the disease genes from the unknown genes with the largest possible margin. They assumed that a gene is more likely to be a disease gene if it lies farther from this hyperplane. Radivojac et al. (2008) proposed a combined method that constructs three SVM classifiers based on three types of feature vectors; the final predictor combines the prediction results of the three individual classifiers.
The above-mentioned approaches used disease genes as the positive set and unknown genes as the negative set. Since this negative set is generally noisy and may contain some disease genes, it confuses the classification process and reduces accuracy.

Mordelet and Vert (2011) predicted disease genes using a positive-unlabeled method named ProDiGe. Some negative instances were selected randomly from the unknown genes. Then, multiple SVM-based classifiers were trained and their prediction results were combined to produce the final prediction. However, as the random negative set is drawn from unknown genes, it still suffers from noisy data, which decreases the performance of the final classifier. Yang et al. (2012) (PUDI) selected negative instances from the unknown genes using the Euclidean distance between a ‘positive representative vector’ and each unknown gene. They also defined three other sets, namely likely negative, likely positive, and weakly negative, based on their likelihood of belonging to the positive or negative class. Finally, they applied a multi-level weighted SVM classifier for disease gene prediction and prioritization. However, the feature vector used in this method depends on data sources that contain errors and missing values, i.e. topological properties of the PPI network, gene ontology, and protein domains, so the classifier built on this high-dimensional feature vector (more than 4000 features) has difficulty predicting new disease genes precisely.

All of the above methods may be difficult to apply in practice, because they rely on prior knowledge that may be expensive to acquire or that is unavailable for some training and testing genes. For example, current methods based on functional annotation are limited, since only a small fraction of the genes in the genome is currently annotated. Therefore, using universal prior knowledge is essential to solve this problem.
On the other hand, amino acids are simple descriptors of protein sequence features (Shepherd et al., 2003). They are available for all proteins and have frequently been used for predicting protein structural and functional classes (Cai et al., 2003; Cai et al., 2004; Han et al., 2004), protein–protein interactions (Lo et al., 2005; Yousef and Moghadam Charkari, 2013; Yu et al., 2010), subcellular locations (Chou and Cai, 2004; Fukasawa et al., 2014), etc.

Fig. 1. The schema of the proposed SFM method. As shown, SFM includes four layers: the representation layer, the negative selection layer, the SVM-based predictor layer, and the fusion layer.


In this article, we present a novel sequence-based fusion method (SFM), which comprises four layers, for disease gene identification and prioritization. In the first layer, four sequence-translation methods based on the physicochemical properties of amino acids are employed to construct four different feature vectors for each gene product (protein). Since no information is available about negative data (non-disease genes), in the second layer we select reliable negative data from the unknown genes using the PUDI approach (Yang et al., 2012) with cosine similarity instead of Euclidean distance. In the third layer, an SVM is trained separately on each feature vector. In the fourth layer, a Decision Tree (C4.5) fuses the prediction results of the SVM classifiers and makes the final decision. We have compared the proposed method with Smalter's method, Xu's method, ProDiGe, and PUDI. The experimental results show that our method outperforms these state-of-the-art methods, achieving 82.6%, 85.6% and 84.1% for precision, recall and F-measure, respectively.

2. Method

In this section, we describe our method for identifying and prioritizing disease genes. The proposed method consists of four steps: (1) translating the corresponding gene products (proteins) into four numerical feature vectors using four types of protein sequence translator; (2) selecting negative data from the unknown genes; (3) modeling each feature vector with the SVM algorithm; (4) fusing the prediction results of the base SVM classifiers with a Decision Tree algorithm to make the final decision. The schema of the proposed method is depicted in Fig. 1.

2.1. Protein sequence translation

One of the most important challenges in identifying disease genes with machine learning algorithms is extracting feature vectors for disease and unknown genes. In this work, we use the corresponding gene products (proteins) to characterize genes. Four representation methods are employed to encode the important information of each protein: normalized Moreau–Broto autocorrelation (NA) (Feng and Zhang, 2000), Geary autocorrelation (GA) (Sokal and Thomson, 2006), auto covariance (AC) (Guo et al., 2008), and Moran autocorrelation (MA) (Xia et al., 2010). These methods account for the neighboring effect between amino acids that are a certain number of residues apart in the sequence, using a specific physicochemical property, and make it possible to discover patterns that run through entire sequences. These representation methods are used to avoid losing important information hidden in the protein sequences. All of them are based on the physicochemical properties of amino acids. Instead of the six or seven physicochemical properties used in the works cited above, we employ twelve physicochemical properties as descriptors to provide more information about the amino acid sequence. These properties are entropy of formation (EOF) (Chothia, 1992), partition coefficient (PC) (Quinlan, 1996), polarity (POL) (Grantham, 1974), amino acid composition (AAC) (Grantham, 1974), residue accessible surface area in tripeptide (RAS) (Chothia, 1976), transfer free energy (TFE) (Janin, 1979), CC in regression analysis (CC) (Prabhakaran and Ponnuswamy, 1982), hydrophilicity (HY-PHIL) (Hopp and Woods, 1981), polarizability (POL2) (Charton and Charton, 1982), hydrophobicity (HY-PHOB) (Sweet and Eisenberg, 1983), solvation free energy (SFE) (Eisenberg and McLachlan, 1986), and graph shape index (GSI) (Fauchere et al., 1988). Min–Max normalization is used to normalize these physicochemical properties. Table 1 shows the normalized values. It is important to mention that the dimensions of the proposed feature vectors are lower than those of the feature vectors presented in recent disease gene identification studies.

2.2. Negative data generation

After generating the feature vectors for all genes using the representation methods, a negative protein set must be selected from the unknown proteins to build a dataset with both positive and reliable negative instances. For this purpose, we propose a six-step algorithm. First, four negative sets, $NP_{AC}$, $NP_{GA}$, $NP_{MA}$ and $NP_{NA}$, are initialized as empty sets, one for each feature vector. Second, each protein $P_i$ (disease and unknown proteins) is represented as four vectors, $V^{AC}_{p_i}$, $V^{GA}_{p_i}$, $V^{MA}_{p_i}$ and $V^{NA}_{p_i}$, using the AC, GA, MA, and NA representation methods. Third, the positive mean vector (Pm) of all positive proteins is computed for each of the four representations.

Table 1 The normalized values of the twelve physicochemical properties for each amino acid (A–Y).

Property  A      C      D      E      F      G      H      I      K      L      M      N      P      Q      R      S      T      V      W      Y
HY-PHOB   0.281  0.458  0      0.027  1      0.198  0.207  0.792  0.198  0.783  0.721  0.12   0.253  0.123  0.222  0.235  0.318  0.687  0.56   0.922
HY-PHIL   0.453  0.375  1      1      0.14   0.531  0.453  0.25   1      0.25   0.328  0.562  0.531  0.562  1      0.578  0.468  0.296  0      0.171
POL       0.395  0.074  1      0.913  0.037  0.506  0.679  0.037  0.79   0      0.098  0.827  0.382  0.691  0.691  0.53   0.456  0.123  0.061  0.16
POL2      0.112  0.312  0.256  0.369  0.709  0      0.562  0.454  0.535  0.454  0.54   0.327  0.32   0.44   0.711  0.151  0.264  0.342  1      0.728
SFE       0.589  0.527  0.191  0.285  0.936  0.446  0.582  0.851  0.325  0.851  0.957  0.319  0.702  0.4    0      0.448  0.557  0.765  1      0.787
GSI       0.305  0.422  0.381  0.372  0.701  0      0.713  1      0.451  0.618  0.56   0.381  0.637  0.372  0.558  0.312  0.723  0.875  0.766  0.701
TFE       0.777  1      0.444  0.407  0.851  0.777  0.629  0.925  0      0.851  0.814  0.481  0.555  0.407  0.148  0.629  0.592  0.888  0.777  0.518
AAC       0      1      0.501  0.334  0      0.269  0.21   0      0.12   0      0      0.483  0.141  0.323  0.236  0.516  0.258  0      0.047  0.072
CC        0.942  0      0.82   0.902  0.697  0.904  0.735  0.668  0.32   0.617  0.144  0.502  0.748  0.586  0.726  0.953  1      0.591  0.82   0.515
RAS       0.222  0.333  0.416  0.638  0.75   0      0.666  0.555  0.694  0.527  0.611  0.472  0.388  0.583  0.833  0.222  0.361  0.444  1      0.861
PC        0.033  0.033  0.021  0.042  0.372  0.014  0.021  0.13   0      0.162  0.115  0.028  0.053  0.046  0.001  0.005  0.021  0.09   1      0.208
EOF       0.124  0.431  0.314  0.447  0.36   0      0.537  0.494  0.809  0.489  0.35   0.375  0.244  0.504  1      0.216  0.365  0.373  0.511  0.475
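As an illustration of how one row of Table 1 can be turned into sequence descriptors, the following Python sketch encodes a protein with the HY-PHOB values and computes the four autocorrelation-type descriptors at a given lag. The formulas are the commonly used textbook forms of the normalized Moreau–Broto, Moran and Geary autocorrelations and of auto covariance; the exact variants and the lag range used in this work are not stated here, so the function names, `max_lag` and the example peptide are assumptions made for illustration only.

```python
# Normalized hydrophobicity (HY-PHOB) values copied from Table 1.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
HY_PHOB = dict(zip(AMINO_ACIDS, [
    0.281, 0.458, 0.0, 0.027, 1.0, 0.198, 0.207, 0.792, 0.198, 0.783,
    0.721, 0.12, 0.253, 0.123, 0.222, 0.235, 0.318, 0.687, 0.56, 0.922,
]))


def encode(sequence, scale):
    """Map each residue to its property value (unknown letters raise KeyError)."""
    return [scale[aa] for aa in sequence]


def moreau_broto(p, lag):
    """Normalized Moreau-Broto autocorrelation: mean of p[i] * p[i + lag]."""
    n = len(p)
    return sum(p[i] * p[i + lag] for i in range(n - lag)) / (n - lag)


def auto_covariance(p, lag):
    """Auto covariance: mean of the centered products at distance `lag`."""
    n = len(p)
    mean = sum(p) / n
    return sum((p[i] - mean) * (p[i + lag] - mean) for i in range(n - lag)) / (n - lag)


def moran(p, lag):
    """Moran autocorrelation: auto covariance divided by the variance."""
    n = len(p)
    mean = sum(p) / n
    variance = sum((x - mean) ** 2 for x in p) / n
    return auto_covariance(p, lag) / variance


def geary(p, lag):
    """Geary autocorrelation: squared differences relative to the sample variance."""
    n = len(p)
    mean = sum(p) / n
    numerator = sum((p[i] - p[i + lag]) ** 2 for i in range(n - lag)) / (2 * (n - lag))
    denominator = sum((x - mean) ** 2 for x in p) / (n - 1)
    return numerator / denominator


def descriptor(sequence, scale, max_lag, func):
    """One value per lag 1..max_lag for a single physicochemical property."""
    p = encode(sequence, scale)
    return [func(p, lag) for lag in range(1, max_lag + 1)]


# Hypothetical short peptide; real protein sequences are far longer.
example = descriptor("MKTAYIAKQR", HY_PHOB, max_lag=3, func=moran)
```

Under these assumptions, concatenating such per-property vectors over all twelve rows of Table 1 would give one feature vector per representation method, with twelve times `max_lag` dimensions.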


Fourth, calculate the similarity, $sim(j)$, between each unknown protein ($P_j \in UP$) and the mean vector (Pm). Fifth, for each of the feature vectors, select r negative proteins from the UP set by choosing the r proteins farthest from Pm. The cosine similarity measure is used to compute the distance between $P_j$ and Pm. Since the number of unknown proteins is much larger than the number of disease proteins, the number (r) of selected negative proteins has a direct effect on the construction of the prediction model. In this article, we have attempted to build the prediction model on a balanced dataset. Finally, the proteins in the intersection of the selected negative protein sets are chosen as reliable negative data (RN_Set). Fig. 2 shows the steps of the algorithm for selecting the reliable negative data set.

2.3. SVM learning

Support Vector Machine (SVM) (Cortes and Vapnik, 1995) is a popular and promising method for data classification in many application areas. An SVM with a Gaussian kernel has two hyper-parameters: C, the penalty parameter, and γ, the Gaussian kernel parameter. The choice of these values directly affects SVM performance. Finding appropriate hyper-parameter values is therefore an important issue; a grid search over the ranges $C = 2^{-2}, \ldots, 2^{4}$ and $\gamma = 2^{-4}, \ldots, 2^{4}$ is used.

2.4. C4.5 fusion layer

Since using the same classifier (SVM) to classify different feature vectors of the same instances produces some uncertainty and some individual errors, a reasonable fusion of these classifiers is likely to reduce the overall prediction error and provide a better prediction result (Jolliffe, 2005). Classifier fusion has been widely studied in applications such as data mining from noisy data streams, fraud detection, and image analysis. The objective of classifier fusion is to obtain better classification accuracy by combining the results of several classifiers. In this work, the C4.5 algorithm is used as the fusion method in the fourth layer. C4.5 is a rule post-pruning algorithm which builds a Decision Tree by recursively partitioning the input attribute vector. A rule is obtained by traversing the tree from the root node to a leaf node. The SVM-based classifiers in the third layer of the proposed method build models for the same dataset using different feature vectors. Thus, using C4.5 to fuse the outputs of the SVM-based predictors allows the concurrent use of different feature descriptors and classification procedures. It is a powerful solution for tough classification problems (such as disease gene identification) that involve noisy data.
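The six-step selection of reliable negatives described in Section 2.2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy two-dimensional vectors and the protein ids `u1` to `u4` are hypothetical, and ties in similarity are broken arbitrarily by the sort order.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of equally sized vectors (the positive mean, Pm)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def farthest_from_positives(positives, unknown, r):
    """Return the ids of the r unknown proteins least similar to Pm."""
    pm = mean_vector(positives)
    ranked = sorted(unknown, key=lambda pid: cosine_similarity(unknown[pid], pm))
    return set(ranked[:r])


def reliable_negatives(positives_by_rep, unknown_by_rep, r):
    """Intersect the per-representation negative sets to obtain RN_Set."""
    negative_sets = [
        farthest_from_positives(positives_by_rep[rep], unknown_by_rep[rep], r)
        for rep in positives_by_rep
    ]
    return set.intersection(*negative_sets)


# Toy 2-D vectors standing in for the AC, GA, MA and NA feature vectors.
positives = [[1.0, 0.0], [0.9, 0.1]]
unknown_a = {"u1": [1.0, 0.1], "u2": [0.0, 1.0], "u3": [0.1, 1.0], "u4": [0.5, 0.5]}
unknown_b = {"u1": [1.0, 0.0], "u2": [0.0, 1.0], "u3": [0.9, 0.2], "u4": [0.2, 1.0]}
positives_by_rep = {rep: positives for rep in ("AC", "GA", "MA", "NA")}
unknown_by_rep = {"AC": unknown_a, "GA": unknown_b, "MA": unknown_a, "NA": unknown_b}

rn_set = reliable_negatives(positives_by_rep, unknown_by_rep, r=2)
```

In this toy example only `u2` is among the two farthest proteins in every representation, so it is the only reliable negative. Because the intersection can contain fewer than r proteins, r would have to be chosen with the target size of the balanced dataset in mind.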

Fig. 2. Pseudo-code of the algorithm for selecting reliable negative data from the unknown gene set.

3. Results and discussions

In this section, we evaluate the performance of the proposed method on both balanced and imbalanced datasets. First, we investigate the effect of fusing the base classifiers (SVMs) using C4.5 on a balanced dataset. Then, to confirm the validity of the method, its performance is evaluated on imbalanced datasets. Finally, we compare our method with state-of-the-art techniques for general disease gene identification and prioritization.

3.1. Experimental data

In our experiments, we have employed the data used by Yang et al. (2012). This data has 5405 known disease genes spanning 2751 disease phenotypes, where all the genes were extracted by combining GENECARD (Safran et al., 2010) and OMIM (McKusick, 2007) disease gene data. In addition, 16 k genes from Ensembl (Flicek et al., 2011) were selected as the unknown gene set. The protein sequence of each gene was queried from UniProt (Li et al., 2006). Then, each sequence was translated into four feature vectors using the AC, GA, MA, and NA algorithms.

3.2. Evaluation metrics

The classification performance of the proposed disease gene identification method was validated on two dataset structures: balanced and imbalanced. We measure performance with Recall, Precision, and F-measure, defined as follows:

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{1}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{2}$$

$$F\text{-measure} = \frac{2 \times \mathrm{Recall} \times \mathrm{Precision}}{\mathrm{Recall} + \mathrm{Precision}} \tag{3}$$

where FP is the number of false positives (non-disease genes, i.e. negative data, identified as disease genes), TP is the number of true positives (disease genes that were correctly identified), and FN is the number of false negatives (disease genes identified as non-disease genes). These metrics are used because they focus on the performance on the positive class (disease genes): both precision and recall are defined with respect to the positive data, and the F-measure is the harmonic mean of recall and precision.
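The three measures can be written directly from Eqs. (1) to (3). The short Python sketch below does so; the confusion counts are hypothetical values chosen only to illustrate the arithmetic and are not taken from the experiments.

```python
def recall(tp, fn):
    """Eq. (1): fraction of true disease genes that were recovered."""
    return tp / (tp + fn)


def precision(tp, fp):
    """Eq. (2): fraction of predicted disease genes that are correct."""
    return tp / (tp + fp)


def f_measure(recall_value, precision_value):
    """Eq. (3): harmonic mean of recall and precision."""
    return 2 * recall_value * precision_value / (recall_value + precision_value)


# Hypothetical confusion counts, chosen only for illustration.
tp, fp, fn = 856, 180, 144
r = recall(tp, fn)
p = precision(tp, fp)
f = f_measure(r, p)
```

Applying Eq. (3) to the precision and recall reported for SFM (82.6% and 85.6%) gives about 84.07%, in line with the reported F-measure.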


3.3. Performance of the proposed method

To evaluate the robustness of the proposed method and to limit overfitting of the classification model, 5-fold cross-validation was employed on a balanced dataset with 10,000 instances (i.e. 5000 positive and 5000 negative instances). Table 2 compares the prediction performance of the proposed fusion predictor with that of the SVM-based classifiers. The fusion predictor achieves the highest precision and F-measure, although GA-SVM and MA-SVM reach a higher recall. Since classifying different feature vectors of the same data with the same classifier produces some uncertainty, fusing the results of the classifiers reduces the overall classification error.

Table 2 Comparison between the SVM-based predictors and the SFM method.

Methods   Precision (%)   Recall (%)   F-measure (%)
AC-SVM    81.6            74.2         77.7
GA-SVM    74              88.6         80.6
NA-SVM    77.2            83.2         80
MA-SVM    72.7            88           79.6
SFM       82.6            85.6         84

3.4. Prediction using imbalanced dataset

As the number of unknown genes is generally much larger than the number of disease genes, we have investigated the performance of the proposed method on imbalanced datasets, constructed with different ratios of positive to negative genes. This experiment shows how the fusion stage reduces the negative effect of noisy data, which rises proportionally with the negative data ratio. Fig. 3 illustrates how the overall performance decreases as the ratio of negative data increases. However, the performance of the fusion method decreases more slowly than that of the SVM-based classifiers.

Fig. 3. Comparison of the F-measure reduction of the SFM algorithm and the SVM-based predictors on datasets with different positive-to-negative ratios.

3.5. Comparison with other works

We have compared our approach with four state-of-the-art methods: Smalter's method (Smalter et al., 2007), Xu's method (Xu and Li, 2006), ProDiGe (Mordelet and Vert, 2011), and PUDI (Yang et al., 2012). The proposed method differs from these methods in two main respects. First, the prior knowledge used to extract the feature vector differs. SFM generates feature vectors from universal prior knowledge, while the other methods employ prior knowledge that might be incomplete and noisy. For example, in the PUDI method, some features do not exist for some genes, while other features are undefined for other genes. Since a protein in this work is defined by its sequence, our prior knowledge is universal and available for all proteins. Second, the selection of the negative gene set from the unknown gene set differs. As mentioned above, both Xu's and Smalter's methods used the unknown gene set as the negative set, while ProDiGe randomly drew multiple negative gene subsets from the unknown gene set. PUDI determined the negative set using the Euclidean distance between each gene feature vector and the positive representative vector, selecting the genes with the largest distances as reliable negative genes. This strategy depends on the feature vector and on the distance measure. Since the feature vector proposed by PUDI suffers from incompleteness, the negative data are expected to be noisy. In addition, using Euclidean distance might not be the best choice for


selecting negative genes. In this paper, a more reliable strategy is used for selecting negative genes: another distance metric is applied and the results obtained from the four different feature vectors are intersected. The effectiveness of the feature vector representations used in this paper is evident from the prediction results.

For a fair comparison with other approaches for disease gene prediction, the previous methods were applied to the same data with the same cross-validation scheme; the difference lies in the selection of the negative set. Table 3 shows the comparison between SFM and the state-of-the-art methods. The results indicate that SFM outperforms the other state-of-the-art methods, which shows that fusing multiple base predictors is more accurate and powerful than using a single classifier. Also, the results achieved by the other methods improved when they used the proposed sequence-based feature vectors. For example, PUDI achieves an F-measure of 81.1%, which is 4.6% better than the result reported in the PUDI paper. For a fair comparison with the ProDiGe, Smalter, and Xu methods, which randomly choose a fraction of the unknown data as negatives, we also tested our method with randomly selected unknown genes as negative data. The results are shown in Table 4. The prediction performance of the proposed method is better than that of the other methods. This improvement is explained by the fusion method used in this work. The fusion of multiple classifiers allows the concurrent use of arbitrary feature descriptors and classification procedures; it is a powerful solution for difficult decision-making problems (such as disease gene identification) that involve noisy data sets.

3.6. Comparison between distance metrics

Euclidean distance was used to generate the negative set in PUDI (Yang et al., 2012). Using Euclidean distance implicitly assumes that the positive data have identity covariance; if they do not, Euclidean distance is less appropriate. The distance metric therefore has a direct effect on the validity of the extracted negative set, so we have attempted to select a more effective distance metric to construct the negative set. Table 5 indicates that cosine distance performs better than the other three distance metrics on this type of dataset. The cosine measure quantifies the similarity between two vectors and is most commonly used in high-dimensional positive spaces, such as text mining (Singhal, 2001).
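The remark about identity covariance can be made concrete with a small Python sketch. It assumes a diagonal covariance matrix for the Mahalanobis distance, which is a simplification made only to keep the example short; the vectors and function names are hypothetical. With unit variances the Mahalanobis distance reduces to the Euclidean distance, and the cosine distance, unlike the Euclidean one, ignores vector magnitude.

```python
import math


def euclidean(u, v):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms


def mahalanobis_diagonal(u, v, variances):
    """Mahalanobis distance for a diagonal covariance matrix (a simplification)."""
    return math.sqrt(sum((a - b) ** 2 / s for a, b, s in zip(u, v, variances)))


mean = [1.0, 2.0]    # hypothetical positive mean vector
scaled = [2.0, 4.0]  # same direction as `mean`, twice the magnitude
other = [2.0, 1.0]   # different direction, same magnitude as `mean`

# Euclidean distance ranks `other` as closer to `mean`; cosine distance ranks `scaled` as closer.
euclid_prefers_other = euclidean(mean, other) < euclidean(mean, scaled)
cosine_prefers_scaled = cosine_distance(mean, scaled) < cosine_distance(mean, other)
```

The two metrics therefore can rank the same unknown proteins differently, which is why the choice of metric changes which proteins end up in the negative set.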

Q4 68

Methods

Precision (%)

Recall (%)

F-measure (%)

PUDI Yang et al. (2012) ProDiGe Mordelet and Vert (2011) Smalter's method Smalter et al. (2007) Xu's method (1) Xu and Li (2006) Xu's method (5) Xu and Li (2006) SFM

78.2 72.4 67.9

84.3 79.8 61.5

81.1 75.9 64.5

68.4 66.8 82.6

54.2 56.3 85.6

60.4 61.1 84

Table 4 Comparison between disease gene identification methods using negative data randomly selected from unknown data. Method

Precision (%)

Recall (%)

F-measure (%)

ProDiGe Mordelet and Vert (2011) Smalter's method Smalter et al. (2007) Xu's method Xu and Li (2006) SFM

65.2 66.2

83.9 58.7

73.3 62.2

67.4 77.9

56.8 81.4

61.6 79.6

Table 5 Comparison between distances methods used in selecting negative genes process. Methods

Precision (%)

Recall (%)

F-measure (%)

cosine Euclidean Mahalanobis Correlation

82.6 79.6 78.8 60.1

85.6 86.6 85.03 46.3

84 83 81.8 52.3

Table 6 Cancer, diabetes, and Anemia disease gene classification.

| Disease class | Method | Precision (%) | Recall (%) | F-measure (%) |
|---|---|---|---|---|
| Prostate cancer | PUDI (Yang et al., 2012) | 71.6 | 79.3 | 76.2 |
| Prostate cancer | ProDiGe (Mordelet and Vert, 2011) | 67.2 | 73.8 | 70.3 |
| Prostate cancer | Smalter's method (Smalter et al., 2007) | 58.4 | 69.3 | 63.3 |
| Prostate cancer | SFM | 76.9 | 79.8 | 78.3 |
| Diabetes | PUDI (Yang et al., 2012) | 75.3 | 80.7 | 77.9 |
| Diabetes | ProDiGe (Mordelet and Vert, 2011) | 72.6 | 60 | 66 |
| Diabetes | Smalter's method (Smalter et al., 2007) | 63.5 | 73.2 | 68 |
| Diabetes | SFM | 77.3 | 79.9 | 78.5 |
| Anemia | PUDI (Yang et al., 2012) | 79.4 | 71.2 | 75 |
| Anemia | ProDiGe (Mordelet and Vert, 2011) | 69.6 | 60.2 | 64.5 |
| Anemia | Smalter's method (Smalter et al., 2007) | 63.3 | 70.6 | 66.7 |
| Anemia | SFM | 74.8 | 78.4 | 76.5 |

3.7. Identification of specific disease classes

Predicting the genes of a specific disease is of particular value to the pharmaceutical industry. In this work, three specific disease classes, ‘Prostate cancer’, ‘Diabetes’, and ‘Anemia’, were used to show the capability of the proposed method in predicting specific diseases. A total of 52 prostate cancer genes, 83 diabetes genes, and 72 anemia genes were collected from the OMIM database and treated as positive instances. We then built three balanced datasets (the cancer, diabetes, and anemia datasets) by selecting negative instances from the unknown genes for each disease class. Three-fold cross-validation was performed, and the average values are reported as the final result. Table 6 indicates that the F-measure of the proposed method is 2.1, 0.6, and 1.5 percentage points higher than the best result of the other methods for cancer, diabetes, and anemia, respectively.
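The evaluation protocol described above (a balanced dataset followed by 3-fold cross-validation with averaged scores) can be sketched as follows. This is a minimal illustration under stated assumptions, not the original implementation: the gene identifiers are placeholders, the negative instances are assumed to have been selected beforehand, and `evaluate` is a hypothetical callback that trains on one part and returns a score on the other.

```python
import random

def make_balanced(positives, negatives):
    """Label positives 1 and an equal number of pre-selected negatives 0."""
    if len(negatives) < len(positives):
        raise ValueError("not enough negative instances")
    chosen = negatives[:len(positives)]
    return [(g, 1) for g in positives] + [(g, 0) for g in chosen]

def k_fold_indices(n, k=3, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(data, evaluate, k=3, seed=0):
    """Average evaluate(train, test) over k folds."""
    folds = k_fold_indices(len(data), k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        test = [data[j] for j in test_idx]
        train = [data[j] for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(evaluate(train, test))
    return sum(scores) / k

# Placeholder identifiers: 52 positives, as in the prostate cancer dataset.
positives = [f"pos_gene_{i}" for i in range(52)]
negatives = [f"neg_gene_{i}" for i in range(60)]
data = make_balanced(positives, negatives)

# Dummy evaluation: fraction of positive labels in each test fold.
mean_score = cross_validate(data, lambda train, test: sum(y for _, y in test) / len(test))
print(len(data), mean_score)
```

In practice `evaluate` would fit the classifier on `train` and return precision, recall, or F-measure on `test`; the dummy callback here only demonstrates the data flow.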


Table 3 Comparison between SFM with other methods.


3.8. Candidate gene prioritization

In this section, we explain our strategy for prioritizing the predicted positive genes and for evaluating the resulting ranked list. The predicted positive instances are ranked on the basis of their probability of belonging to the disease gene class (positive class).

Please cite this article as: Yousef, A., Moghadam Charkari, N., SFM: A novel sequence-based fusion method for disease genes identification and prioritization. J. Theor. Biol. (2015), http://dx.doi.org/10.1016/j.jtbi.2015.07.010i


Table 7 Comparison between the SFM method and other methods in terms of specific disease class identification.

| Method | AUC (%), Cancer | AUC (%), Metabolic | AUC (%), Neurological |
|---|---|---|---|
| PUDI (Yang et al., 2012) | 83.3 | 90.0 | 85.4 |
| ProDiGe (Mordelet and Vert, 2011) | 76.7 | 72.2 | 64.6 |
| Smalter's method (Smalter et al., 2007) | 79.1 | 78.2 | 73.9 |
| SFM | 86.6 | 90.3 | 88.2 |

Area under the ROC curve (AUC) is the ranking metric used in this experiment. The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings:

$$\mathrm{TPR} = \frac{TP}{TP + FN} \tag{4}$$

$$\mathrm{FPR} = \frac{FP}{FP + TN} \tag{5}$$

AUC measures how reliably positive instances are ranked above negative ones. If TPR grows faster than FPR as the threshold descends, the more reliable positive instances are expected to appear at the top of the ranked list. Table 7 reports the performance of the different methods in detecting disease genes for less general disease classes (Cancer, Metabolic, and Neurological). SFM achieves the highest AUC in all three classes (86.6%, 90.3%, and 88.2%), which demonstrates its capability to properly rank the predicted disease genes.
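The ranking evaluation can be made concrete with a short sketch that sweeps a descending threshold over the predicted probabilities, computes TPR and FPR as in Eqs. (4) and (5), and integrates the resulting curve with the trapezoidal rule. The scores and labels below are invented for illustration, the function names are hypothetical, and the sketch assumes that both classes are present in the list.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs as the threshold descends over the scores.

    labels: 1 for disease genes (positive), 0 for non-disease genes.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Instances with the same score cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        # TPR = TP / (TP + FN) and FPR = FP / (FP + TN).
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Invented probabilities of belonging to the positive class.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
print(auc(roc_points(scores, labels)))
```

In this toy list, eight of the nine positive-negative pairs are ordered correctly, so the area is 8/9; a ranking that places every positive instance above every negative one gives an area of 1.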

4. Conclusions

Many machine learning (ML) algorithms have been applied to identify disease genes. Most of these methods treat the problem as a binary classification problem. The main challenges can be summarized as: (1) selecting more complete prior knowledge about genes to generate the feature vectors; (2) selecting negative data from unknown genes to build and evaluate the classification model; and (3) selecting a proper classification method. In this article, we proposed a fusion method (SFM) for disease gene identification that uses solely the primary structure of the proteins as prior knowledge. Four different feature representation methods are used to transform the amino acid sequences into numerical feature vectors. Then, four sequence-based individual classifiers using the SVM algorithm are employed. The outputs of these SVM-based predictors are fused using the C4.5 algorithm. The results reveal the importance of using protein sequences as prior knowledge. In addition, a significant improvement in classification performance was observed when the SVM-based predictors were fused. Furthermore, using a Decision Tree in the fusion layer helps biologists interpret the prediction rules. In comparison with other methods, SFM achieved noticeably better results than the previous methods. For future work, to achieve better classification performance, feature extraction methods will be considered to extract the effective features from the feature vector. We will also consider how to generate the rules for disease gene identification.

References

Aerts, S., et al., 2006. Gene prioritization through genomic data fusion. Nat. Biotechnol. 24, 537–544.
Ala, U., et al., 2008. Prediction of human disease genes by human-mouse conserved coexpression analysis. PLoS Comput. Biol. 4, e1000043.


Cai, C.Z., et al., 2003. SVM-Prot: web-based support vector machine software for functional classification of a protein from its primary sequence. Nucleic Acids Res. 31, 3692–3697.
Cai, C.Z., et al., 2004. Enzyme family classification by support vector machines. Proteins 55, 66–76.
Charton, M., Charton, B.I., 1982. The structural dependence of amino acid hydrophobicity parameters. J. Theor. Biol. 99, 629–644.
Chothia, C., 1976. The nature of the accessible and buried surfaces in proteins. J. Mol. Biol. 105, 1–12.
Chothia, C., 1992. Proteins. One thousand families for the molecular biologist. Nature 357, 543–544.
Chou, K.C., Cai, Y.D., 2004. Prediction of protein subcellular locations by GO–FunD–PseAA predictor. Biochem. Biophys. Res. Commun. 320, 1236–1239.
Cortes, C., Vapnik, V., 1995. Support-vector networks. Mach. Learn. 20, 273–297.
De Bie, T., et al., 2007. Kernel-based data fusion for gene prioritization. Bioinformatics 23, i125–i132.
Eisenberg, D., McLachlan, A.D., 1986. Solvation energy in protein folding and binding. Nature 319, 199–203.
Fauchere, J.L., et al., 1988. Amino acid side chain parameters for correlation studies in biology and pharmacology. Int. J. Pept. Protein Res. 32, 269–278.
Feng, Z.P., Zhang, C.T., 2000. Prediction of membrane protein types based on the hydrophobic index of amino acids. J. Protein Chem. 19, 269–275.
Flicek, P., et al., 2011. Ensembl 2011. Nucleic Acids Res. 39, D800–D806.
Freudenberg, J., Propping, P., 2002. A similarity-based method for genome-wide prediction of disease-relevant human genes. Bioinformatics 18 (Suppl 2), S110–S115.
Fukasawa, Y., et al., 2014. Plus ça change – evolutionary sequence divergence predicts protein subcellular localization signals. BMC Genomics 15, 46.
Glazier, A.M., Nadeau, J.H., Aitman, T.J., 2002. Finding genes that underlie complex traits. Science 298, 2345–2349.
Grantham, R., 1974. Amino acid difference formula to help explain protein evolution. Science 185, 862–864.
Guo, Y., et al., 2008. Using support vector machine combined with auto covariance to predict protein–protein interactions from protein sequences. Nucleic Acids Res. 36, 3025–3030.
Han, L.Y., et al., 2004. Prediction of RNA-binding proteins from primary sequence by a support vector machine approach. RNA 10, 355–368.
Hopp, T.P., Woods, K.R., 1981. Prediction of protein antigenic determinants from amino acid sequences. Proc. Natl. Acad. Sci. USA 78, 3824–3828.
Janin, J., 1979. Surface and inside volumes in globular proteins. Nature 277, 491–492.
Jolliffe, I., 2005. Principal component analysis. Wiley Online Library.
Kohler, S., et al., 2008. Walking the interactome for prioritization of candidate disease genes. Am. J. Hum. Genet. 82, 949–958.
Li, Y., Patra, J.C., 2010. Integration of multiple data sources to prioritize candidate genes using discounted rating system. BMC Bioinform. 11, S20.
Li, Z.R., et al., 2006. PROFEAT: a web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence. Nucleic Acids Res. 34, W32–W37.
Lo, S.L., et al., 2005. Effect of training datasets on support vector machine prediction of protein–protein interactions. Proteomics 5, 876–884.
McKusick, V.A., 2007. Mendelian Inheritance in Man and its online version, OMIM. Am. J. Hum. Genet. 80, 588–604.
Mordelet, F., Vert, J.P., 2011. ProDiGe: prioritization of disease genes with multitask machine learning from positive and unlabeled examples. BMC Bioinform. 12, 389.
Navlakha, S., Kingsford, C., 2010. The power of protein interaction networks for associating genes with diseases. Bioinformatics 26, 1057–1063.
Prabhakaran, M., Ponnuswamy, P.K., 1982. Shape and surface features of globular proteins. Macromolecules 15, 314–320.
Quinlan, J.R., 1996. Improved use of continuous attributes in C4.5, arXiv preprint cs/9603103.
Radivojac, P., et al., 2008. An integrated approach to inferring gene-disease associations in humans. Proteins 72, 1030–1037.
Safran, M., et al., 2010. GeneCards Version 3: the human gene integrator. Database: J. Biol. Databases Curation 2010, baq020.
Shepherd, A.J., Gorse, D., Thornton, J.M., 2003. A novel approach to the recognition of protein architecture from sequence using Fourier analysis and neural networks. Proteins: Struct. Funct. Bioinform. 50, 290–302.
Singhal, A., 2001. Modern information retrieval: a brief overview. IEEE Data Eng. Bull. 24, 35–43.
Smalter, A., Lei, S.F., Chen, X.-w., 2007. Human disease-gene classification with integrative sequence-based and topological features of protein–protein interaction networks. In: Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine. pp. 209–216.
Sokal, R.R., Thomson, B.A., 2006. Population structure inferred by local spatial autocorrelation: an example from an Amerindian tribal population. Am. J. Phys. Anthropol. 129, 121–131.
Sweet, R.M., Eisenberg, D., 1983. Correlation of sequence hydrophobicities measures similarity in three-dimensional protein structure. J. Mol. Biol. 171, 479–488.
Xia, J.F., Han, K., Huang, D.S., 2010. Sequence-based prediction of protein–protein interactions by means of rotation forest and autocorrelation descriptor. Protein Pept. Lett. 17, 137–145.
Xu, J., Li, Y., 2006. Discovering disease-genes by topological features in human protein–protein interaction network. Bioinformatics 22, 2800–2805.


Yang, P., et al., 2011. Inferring gene-phenotype associations via global protein complex network propagation. PLoS One 6, e21502.
Yang, P., et al., 2012. Positive-unlabeled learning for disease gene identification. Bioinformatics 28, 2640–2647.
Yousef, A., Moghadam Charkari, N., 2013. A novel method based on new adaptive LVQ neural network for predicting protein–protein interactions from protein sequences. J. Theor. Biol. 336, 231–239.
Yu, C.-Y., Chou, L.-C., Chang, D.T.H., 2010. Predicting protein–protein interactions in unbalanced data using the primary structure of proteins. BMC Bioinform. 11, 167.
Zhang, W., Sun, F., Jiang, R., 2011. Integrating multiple protein–protein interaction networks to prioritize disease genes: a Bayesian regression approach. BMC Bioinform. 12 (Suppl 1), S11.
