Journal of Theoretical Biology 437 (2018) 149–162
S-FLN: A sequence-based hierarchical approach for functional linkage network construction

A. Jalilvand a, B. Akbari a,∗, F. Zare Mirakabad b

a Department of Electronic and Computer Engineering, Tarbiat Modares University, Tehran, Iran
b Department of Mathematics and Computer Science, Amirkabir University of Technology, Tehran, Iran
Article info

Article history: Received 31 December 2016; Revised 27 July 2017; Accepted 18 October 2017; Available online 26 October 2017

Keywords: Network modeling; Network construction; Ensemble learning; Functional linkage network (FLN); Link prediction
Abstract

Functional linkage network (FLN) construction is a primary and important step in drug discovery and disease gene prioritization methods. In order to construct an FLN, several methods have been introduced based on the integration of various biological data. Although there are impressive ideas behind these methods, they suffer from the low quality of the biological data. In this paper, a hierarchical sequence-based approach is proposed to construct the FLN. The proposed approach, denoted as S-FLN (Sequence-based Functional Linkage Network), uses the sequences of proteins as the primary data in three main steps. Firstly, the physicochemical properties of amino-acids are employed to describe the functionality of proteins. As the sequence of proteins is a more comprehensive and accurate primary data source, more reliable relations are achieved. Secondly, seven different descriptor methods are used to extract feature vectors from the protein sequences. Using different descriptor methods leads to diverse ensemble learners in the next step. Finally, a two-layer ensemble learning structure is proposed to calculate the score of protein pairs. The proposed approach has been evaluated using two biological datasets, S. cerevisiae and H. pylori, and resulted in 93.9% and 91.15% precision rates, respectively. The results of various experiments indicate the efficiency and validity of the proposed approach.

© 2017 Elsevier Ltd. All rights reserved.
1. Introduction

Recent research shows that perturbations of cellular systems, especially molecular networks, are the main cause of human diseases (Barabási et al., 2011; Goh et al., 2007). Meanwhile, genes associated with the same or similar diseases commonly reside in the same neighborhoods of molecular networks (Goh et al., 2007). These observations form the basis of many computational methods that associate unknown genes with certain diseases, in which machine learning methods play an important role in systems biology approaches (Hedberg, 2006; Kell, 2006). The majority of these methods are based on functional linkage networks (FLNs) (Wang et al., 2011). FLNs are well-defined data structures which are used to identify disease-related genes. They can be extended to investigate gene cooperation in complex diseases and drug discovery (Apolloni et al., 2011). Moreover, FLNs can be used to assign function classes to unknown genes or proteins, which is known as a fundamental task in biological research (Apolloni et al., 2011; Manimaran et al., 2009).
∗ Corresponding author. E-mail address: [email protected] (B. Akbari).
https://doi.org/10.1016/j.jtbi.2017.10.021
An FLN is defined as a graph in which the nodes represent genes or corresponding proteins and the edges denote functional associations between them. In other words, two proteins are connected in an FLN if some experimental or computational methods indicate that they share the same functionality. In this regard, the process of identifying functional relationships among the proteins is called FLN construction. Fig. 1 shows an overview of the FLN construction process. Many computational methods have been proposed in the literature to predict the links between the proteins and construct the biological networks. Generally, they can be divided into two categories. In the first category, the main idea is to integrate different data sources to construct the biological network. The data sources include the Protein-Protein Interaction (PPI) network, gene fusion, gene neighborhoods, literature mining knowledge, Gene Ontology (GO), and other data (Franke et al., 2006; Lei et al., 2012; Linghu et al., 2009; Wang et al., 2014; Wu et al., 2010; You et al., 2010). In Franke et al. (2006), an FLN is constructed by integrating various types of biological data. The authors employed PPI, microarray co-expression, and GO data and applied a Bayesian approach to predict gene pairs that participate in the same GO biological process. Similarly, in Köhler et al. (2008) multiple biological data
Fig. 1. Functional linkage network reconstruction: Section 1 shows the gold standard (GS) network created from the Gene Ontology. Section 2 shows the feature vector extraction step based on amino acid sequences for the positive and negative linkages of the GS network. Section 3 shows the use of the proposed method to approximate the natural network based on the extracted features of protein pairs.
sources were used with a random walk algorithm for disease gene prioritization. In Linghu et al. (2008), a six-dimensional feature vector based on six biological data sources has been proposed. Then, multiple machine learning methods such as Support Vector Machine (SVM), Linear Discriminant Analysis (LDA), Naive Bayes, and Neural Network are applied to construct a reliable FLN. A human FLN has been constructed by Linghu et al. (2009) by integrating 16 biological data features from 6 model organisms. Afterwards, they used a Naive Bayes classifier to predict functional linkages between genes. Wang et al. (2014) built an FLN of mitochondrial proteins by integrating biological features such as genomic context, gene expression profiles, metabolic pathways and the PPI network. A recent and interesting survey on functional linkage network construction methods is provided in Linghu et al. (2013). Although the results of these methods might be acceptable in non-biological networks, they face major challenges in biological networks due to the low quality of biological data sources. The reason is that the employed feature vectors include missing values, because most of the proteins do not appear in all data sources. Missing values in integration-based methods cause a negative impact on the prediction accuracy of the protein pairs. In the second category, a number of methods have been developed to derive information directly from amino acid sequences (Guo et al., 2008; Mei and Zhu, 2014; Shen et al., 2007; Xia et al.,
2010b; You et al., 2013; 2014; Yousef and Charkari, 2013). These methods are generally divided into two classes: alignment-based and alignment-free methods. Although the alignment-based methods obtain high accuracy for some sequences, their results are inaccurate in the presence of inversions, translocations at the substring level, and diverse sequences with the same functionality or unequal lengths (Borozan et al., 2015; Li et al., 2016; Otu and Sayood, 2003). In this regard, alignment-free methods have been proposed to overcome these issues (Aguiar-Pulido et al., 2012; Agüero-Chapin et al., 2009; Dea-Ayuela et al., 2008; Munteanu et al., 2008a, 2009; Perez-Bello et al., 2009; Vilar et al., 2009; Vinga, 2014). The alignment-free methods include two steps: (i) the protein sequences are transformed into fixed-length feature vectors; (ii) the feature vectors are employed as the training set in machine learning algorithms (Fernandez-Lozano et al., 2014; González-Díaz and Riera-Fernández, 2012; Munteanu et al., 2008b; Yao et al., 2014). A number of methods have been developed that use the sequence information of proteins to predict links in biological networks (Shen et al., 2007; Xia et al., 2010b; Yang et al., 2010; Yousef and Charkari, 2013; Zhang et al., 2011). Some of these methods use physicochemical properties of amino acids to enrich the extracted feature vectors (Huang et al., 2016; Xia et al., 2010b; Yousef and Charkari, 2013). In Zhang et al. (2011), a computational approach based on compressed sensing theory is proposed to predict yeast PPIs. They have used the Auto Covariance (AC) method (Guo et al., 2008) and 7 physicochemical properties to extract the features. In
Huang et al. (2016), a computational model has been proposed to predict PPIs by combining a global encoding representation of sequences with a weighted sparse representation based classifier. In Xia et al. (2010b), You et al. (2013) and Yousef and Charkari (2013), six feature descriptor methods have been applied in an ensemble approach, including Auto Covariance (AC) (Guo et al., 2008), Geary Autocorrelation (GA) (Sokal and Thomson, 2006), Conjoint Triad (CT) (Shen et al., 2007), Local Descriptor (LD) (Yang et al., 2010), Moran Autocorrelation (MA) (Xia et al., 2010a) and Normalized Moreau-Broto Autocorrelation (NMA) (Feng and Zhang, 2000). The obtained results show that employing data fusion and ensemble learning structures provides acceptable accuracy in PPI link prediction.

Many studies have found that the information of amino acid sequences is sufficient to predict protein-protein interactions (Guo et al., 2008; Mei and Zhu, 2014; Xia et al., 2010b). On the other hand, the lack of accurate data sources is the most important challenge in FLN construction. In this regard, we propose a novel approach to construct the FLN using solely the information of amino acid sequences. The proposed approach, denoted as sequence-based FLN (S-FLN), is a hierarchical sequence-based approach with three steps:

1. A gold standard network is constructed using GO data.
2. Seven feature vectors are extracted from each pair of protein sequences based on twelve physicochemical properties of amino-acids. Seven statistical descriptor methods are used in order to obtain fixed-length feature vectors.
3. Each protein pair is scored by a two-layer ensemble learning structure. In the first layer, an initial score is calculated by a random forest learning method. In the second layer, a multilayer perceptron is applied to combine the results.

We evaluate S-FLN using the S. cerevisiae and H. pylori datasets. The proposed approach obtains 93.9% and 91.15% precision rates on these datasets, respectively. In summary, the main contributions of our paper are:

• We employed the amino acid sequences as a single primary data source to construct the FLN. As a result, the issue of missing values that appears when integrating different biological data sources in conventional methods is not a major concern.
• Various descriptor methods have been proposed to capture different dependencies between the proteins. This leads to a set of diverse feature sets extracted from the amino acid sequences, which is essential in the ensemble-based learning step.
• To deal with the imbalanced data issue, we took advantage of the bagging technique, which has not been reported in the construction of FLNs.
The remainder of this paper is organized as follows. In Section 2, the basic concepts and biological network construction approaches are briefly reviewed. In Section 3, the proposed approach is introduced. The experimental results are presented in Section 4. In Section 5, the performance of S-FLN is discussed and the paper is concluded.

2. Basic concepts and problem definition

2.1. Gold standard

Gene Ontology is a hierarchical dictionary for describing the functionality of gene products (proteins) that enables semantic analysis of the protein network. GO is composed of three ontology sets, each of which is considered as a different aspect of cell biology: biological processes (BP), molecular functions (MF), and
cellular components (CC). Each set is represented by hierarchical relationships between the biological concepts (term ontologies). GO can be used to define the functional similarity between genes in link prediction and network construction tasks. The resulting network is called the gold standard. Graph structure-based and information content-based (IC) measures are two well-known classes of methods to measure the functional similarity of genes based on their GO annotations (Ovaska et al., 2008). While the graph structure-based methods employ the hierarchical structure of the GO graph for similarity measurement (Wang et al., 2007), the IC-based methods consider the information contents of GO terms. It has been found that the IC-based methods provide a better performance compared to the graph-based methods (Li et al., 2013; Mukhopadhyay et al., 2012; Resnik, 1995; Teng et al., 2013).

2.2. Principal component analysis

Principal component analysis (PCA) is a popular statistical technique that has been widely used to reduce multidimensional data sets and extract a new feature vector by choosing a subset of components that contains most of the essential information (Abdi and Williams, 2010). To do this, PCA finds the directions of most variance in the feature space and represents each data point by its coordinates along each of these directions. PCA has a simple computational procedure. In the first step, the covariance matrix is computed for the whole dataset. Then, the eigenvectors and eigenvalues of the covariance matrix are computed. Finally, the eigenvectors are sorted in descending order of their eigenvalues and a new feature space is created by selecting the most representative eigenvectors.

2.3. Ensemble learning

Stacking is a machine learning strategy which has been used in a wide range of network reconstruction and link prediction studies (Bock and Gough, 2003; Martin et al., 2005; Nanni and Lumini, 2006; Shi et al., 2010; You et al., 2013). Stacking, as an ensemble learning approach, combines the results of several classifiers with different parameter values to obtain a better performance than a single classifier (Galar et al., 2012). Several stacking methods have been proposed to construct FLNs and PPIs based on different classifiers and schemes, including phylogenetic bootstrap (Bock and Gough, 2003), boosting (Shi et al., 2010), signature products (Martin et al., 2005), E-HKNN (Nanni and Lumini, 2006) and ensemble extreme learning (You et al., 2013).

2.4. Random forest

The Random Forest (RF) model is one of the well-known ensemble learning techniques and has been employed in many areas of computational biology (Caruana et al., 2008; Jia et al., 2015; Kandaswamy et al., 2011). RF consists of several decision trees, increasing the classification accuracy while reducing the prediction variance of the individual decision trees (Ho, 1995). After generating the trees, the results are aggregated via a voting strategy as:
$$Cls(X) = Vote(Cls_1(X), Cls_2(X), \ldots, Cls_m(X))$$   (1)

where $Cls_i(X)$ is defined as the label predicted for sample X by the i-th primary decision tree classifier (Breiman, 2001).

2.5. Problem definition

Definition 1. Natural FLN is a graph $G_N = (V_p, E_p)$ where $V_p = P = \{p_1, p_2, \ldots, p_n\}$ is a set of n proteins or corresponding genes and $E_p \subseteq P \times P$ is the set of functional links between the protein pairs.
Table 1. The table of symbols and their descriptions.

Symbol/Variable        Definition
$G_N = (V_p, E_p)$     The natural FLN graph, with proteins as vertices $V_p$ and functional linkages as edges $E_p$
$G_S = (V_g, E_g)$     The gold standard graph, with proteins as vertices $V_g$ and functional linkages as edges $E_g$
$G = (V, E)$           The constructed graph, with proteins as vertices $V$ and predicted functional linkages as edges $E$
$S_s$                  Similarity weight between a pair of proteins
$P$                    Protein set
$GO_{p_i}$             GO terms for the i-th protein
$t_{p_i}^{l_i}$        The $l_i$-th term of protein $p_i$
Definition 2. Gold Standard is a graph $G_S = (V_g, E_g)$ where $V_g = G = \{g_1, g_2, \ldots, g_m\}$ is a set of m ≤ n genes, each corresponding to a protein (its product) in the natural FLN. Here n is the number of all proteins from Definition 1. $E_g \subseteq G \times G$ is the set of functional links between pairs of genes.

Definition 3. Constructed FLN is a predicted graph $G = (V, E)$ where $V = P = \{p_1, p_2, \ldots, p_n\}$ is a set of n proteins and $E \subseteq E_p$ is the set of predicted functional links between pairs of proteins.

The aim of FLN construction is to build a graph $G = (V, E)$ that is as similar as possible to the natural FLN graph $G_N = (V_p, E_p)$, as stated in Definitions 1 and 3. In order to build G, we need to predict E, i.e., the links among all pairs of proteins. Thus, the main step in our FLN construction is link prediction between the protein pairs. In this regard, given $G_S$ and $G_N$, the output of the proposed FLN construction approach is a model F that assigns a functional similarity score $S_s$ to each protein pair $(p_i, p_j)$ as:

$$S_s = F(p_i, p_j) = \begin{cases} 1, & \text{if } (p_i, p_j) \in E \\ 0, & \text{if } (p_i, p_j) \notin E \end{cases}$$   (2)
Table 1 summarizes the notation used in this paper.
3. The proposed method

In this section, we introduce the proposed hierarchical sequence-based approach to construct the FLN using the gold standard and additional biological information. In summary, the proposed approach has three main steps.

• Gold standard construction: in the first step, we reconstruct the gold standard network based on the GO to compose the training and test sets. For this purpose, a context-based similarity measure is employed to capture both the semantic and topological properties of the GO graph.
• Feature extraction: in the second step, seven descriptor methods are used to capture different biological information from the amino-acid sequences. Accordingly, seven feature vectors are extracted, which are then fed to different learners in the next step.
• Ensemble learning: in the third step, a two-layer learning structure is proposed. Firstly, a random forest algorithm is applied to the feature set extracted by each descriptor method. Secondly, an MLP is employed to combine the outputs of the random forest learners and obtain the final decisions.

In the following subsections, we introduce these steps in more detail.

3.1. Gold standard construction

Since we employ a supervised learning approach, a data set is required to train and test the learning method. Therefore, the first step is to obtain an initial network with the highest similarity to the natural FLN. We consider this network as the gold standard and use it for both the training and testing phases. A common solution to define functional associations between the proteins is to calculate the functional similarity of proteins based on the biological process (Huang et al., 2007; Hughes and Roth, 2008; Linghu et al., 2013; 2009; Zhou et al., 2002). For this purpose, we develop Algorithm 1 to construct the gold standard sets by aggregating in-

Algorithm 1: Construction of gold standard.
Input: Gene-Ontology, protein-list
Output: Gold Standard Graph
/* Create a full graph of proteins: (n*(n-1))/2 links will be created */
1   foreach protein-A in protein-list do
2       Create a new link from protein-A to all other proteins;
3       Add the created links to the PrimaryGraph;
4   end
/* GO and AIC are used to weight the PrimaryGraph */
5   foreach link in PrimaryGraph do
6       A = first protein of the link;
7       B = second protein of the link;
8       foreach Term-A in Gene-Ontology terms of A do
9           foreach Term-B in Gene-Ontology terms of B do
10              Compute the information content (IC) of Term-A and Term-B;
11              Compute the semantic value of all ancestors of Term-A and Term-B via IC and K;
12              Compute similarity[Term-A, Term-B] as the sum of twice the semantic weight of all common ancestors of Term-A and Term-B, divided by the weights of Term-A and Term-B;
13          end
14      end
15      Compute similarity[A, B] as the average of similarity[Term-A, Term-B];
16  end
17  GoldStandard = weight the PrimaryGraph based on the similarity measure;
18  return GoldStandard;
formation content (AIC) in IC-based methods. This algorithm computes the semantic information of a GO term by considering the information content of all its ancestor terms in the graph (Song et al., 2014). Also, we use the "biological process" ontology among the mentioned GO sets. In Algorithm 1, a list of n proteins and their annotations, extracted from the GO data (Ashburner et al., 2000), is given as:
$$P = \{p_1, p_2, \ldots, p_n\}$$   (3)

$$GO_P = \{GO_{p_1}, \ldots, GO_{p_n}\}$$   (4)
where $GO_{p_i}$ is the set of GO terms for each protein, as:

$$GO_{p_i} = \{t_{p_i}^{1}, \ldots, t_{p_i}^{l_i}\}$$   (5)
which denotes the $l_i$ GO terms associated with protein $p_i$. In lines 1–4, a complete graph is constructed based on the given proteins. Then, it is weighted based on the GO terms, as stated in lines 5–16. The idea behind the weighting step is to capture similar GO annotations between two proteins, which would represent similar functionalities or biological pathways between the proteins. For each pair of proteins $p_i, p_j \in P$, the similarity function is defined as (see lines 6 and 7):
$$Sim(p_i, p_j) = \frac{\sum_{k_1=1}^{l_i} \sum_{k_2=1}^{l_j} Sim_{GO}(t_{p_i}^{k_1}, t_{p_j}^{k_2})}{l_i \times l_j}$$   (6)

where the semantic similarity between each pair of GO terms $t_{p_i}^{k_1} \in GO_{p_i}$, $k_1 = 1, 2, \ldots, l_i$, and $t_{p_j}^{k_2} \in GO_{p_j}$, $k_2 = 1, 2, \ldots, l_j$, is calculated as:

$$Sim_{GO}(t_{p_i}^{k_1}, t_{p_j}^{k_2}) = \frac{\sum_{t \in ancestors(t_{p_i}^{k_1}) \cap ancestors(t_{p_j}^{k_2})} 2 \cdot SW(t)}{SV(t_{p_i}^{k_1}) + SV(t_{p_j}^{k_2})},$$   (7)

where $t$ ranges over the common ancestors of the two terms $t_{p_i}^{k_1}$ and $t_{p_j}^{k_2}$ in the GO directed acyclic graph. The function SW(t) obtains a semantical weight for each term t and is defined as follows:
$$SW(t) = \frac{1}{1 + e^{\frac{1}{IC(t)}}},$$   (8)
where IC(t) is the basic requirement for the IC-based methods and is defined as follows:

$$IC(t) = -\log \frac{freq(t)}{freq(root)},$$   (9)
In the above equation, freq(t) is the frequency of GO term t, which is recursively defined as:

$$freq(t) = annotation(t) + \sum_{i \in child(t)} freq(i),$$   (10)
where annotation(t) is the number of genes annotated with term t in GO, and child(t) is the set of the children of term t. After obtaining the semantic weight of term t, the semantic value SV(x) of a GO term x is computed by adding SW(t) over all ancestors of x, as follows:

$$SV(x) = \sum_{t \in ancestors(x)} SW(t).$$   (11)

The outputs of the algorithm are the gold standard positive (GSP) and negative (GSN) sets.
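The computation above can be summarized in a short sketch. The following Python fragment is a minimal illustration of Eqs. (8)–(11) and of the term- and protein-level similarities of Eqs. (6) and (7), assuming a parsed GO graph. The dictionaries `annotation_count`, `children` and `ancestors` are hypothetical inputs (not part of the paper) that would have to be built from the GO data; whether `ancestors` includes the term itself is an assumption of this sketch.

```python
import math
from functools import lru_cache

# Hypothetical GO structures: term -> direct annotation count,
# term -> set of child terms, term -> set of ancestor terms (here taken to include the term itself).
annotation_count = {}   # e.g. {"GO:0008150": 12000, ...}
children = {}           # e.g. {"GO:0008150": {"GO:0009987", ...}, ...}
ancestors = {}          # e.g. {"GO:0006915": {"GO:0006915", "GO:0008219", ...}, ...}
ROOT = "GO:0008150"     # biological process root

@lru_cache(maxsize=None)
def freq(term):
    # Eq. (10): own annotations plus the recursive frequencies of the children.
    return annotation_count.get(term, 0) + sum(freq(c) for c in children.get(term, ()))

def ic(term):
    # Eq. (9): information content relative to the root frequency (assumes freq(term) > 0).
    return -math.log(freq(term) / freq(ROOT))

def sw(term):
    # Eq. (8): semantic weight; the root (IC = 0) is given weight 0 here as an assumption.
    ic_t = ic(term)
    if ic_t <= 0:
        return 0.0
    x = min(1.0 / ic_t, 700.0)      # clamp to avoid overflow for near-root terms
    return 1.0 / (1.0 + math.exp(x))

def sv(term):
    # Eq. (11): semantic value = sum of semantic weights over the ancestors of the term.
    return sum(sw(t) for t in ancestors[term])

def sim_go(t1, t2):
    # Eq. (7): aggregate-IC similarity of two GO terms via their common ancestors.
    common = ancestors[t1] & ancestors[t2]
    return sum(2.0 * sw(t) for t in common) / (sv(t1) + sv(t2))

def sim_proteins(terms_i, terms_j):
    # Eq. (6): average term-pair similarity over the annotations of two proteins.
    total = sum(sim_go(a, b) for a in terms_i for b in terms_j)
    return total / (len(terms_i) * len(terms_j))
```

In this sketch, `sim_proteins` plays the role of similarity[A, B] in line 15 of Algorithm 1, and applying it to every protein pair yields the weighted gold standard graph.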
3.2. Feature extraction

Firstly, a feature vector is extracted for each protein in a pair by employing the physicochemical properties of each amino-acid of the protein. Afterwards, the feature vectors of the two proteins are concatenated to compose an entity in the training set (see Fig. 2). To extract an alignment-free feature vector for each pair of proteins, seven descriptor methods are employed: Auto Covariance (AC) (Guo et al., 2008), Geary Autocorrelation (GA) (Sokal and Thomson, 2006), Conjoint Triad (CT) (Shen et al., 2007), Local Descriptor (LD) (Yang et al., 2010), Moran Autocorrelation (MA) (Xia et al., 2010a), Normalized Moreau-Broto Autocorrelation (NMA) (Feng and Zhang, 2000) and a new Modified Geary Autocorrelation (MGA). Moreover, twelve physicochemical properties of amino acids are employed to extract an enriched feature vector according to the selected feature descriptor methods.
Fig. 2. Work flow of feature extraction: (A) The GSP and GSN sets of the gold standard are created, where a pair is considered as GSN when its semantic similarity weight is 0 and as GSP when its semantic similarity weight is greater than a threshold. (B) The physicochemical properties of each amino-acid in a protein sequence are extracted and added to the physicochemical properties matrix (PPM); a PPM of the amino-acid sequence is created for each protein. (C) The seven feature vectors are extracted for each protein pair, where each pair consists of two PPMs that are used as inputs of the feature descriptor methods independently.
Table 2. Normalized values of 12 physicochemical properties of amino-acids. These properties include hydrophobicity (HY-PHOB) (Sweet and Eisenberg, 1983), hydrophilicity (HY-PHIL) (Hopp and Woods, 1981), polarity (POL) (Grantham, 1974), polarizability (POL2) (Charton and Charton, 1982), solvation free energy (SFE) (Eisenberg and McLachlan, 1986), graph shape index (GSI) (Fauchère et al., 1988), transfer free energy (TFE) (Janin, 1979), amino acid composition (AAC) (Grantham, 1974), CC in regression analysis (CC) (Prabhakaran and Ponnuswamy, 1982), residue accessible surface area in tripeptide (RAS) (Chothia, 1976), partition coefficient (PC) (Garel et al., 1973) and entropy of formation (EOF) (Hutchens, 1970).

     HY-PHOB  HY-PHIL  POL    POL2   SFE    GSI    TFE    AAC    CC     RAS    PC     EOF
A    0.281    0.453    0.395  0.112  0.589  0.305  0.777  0      0.942  0.222  0.033  0.124
C    0.458    0.375    0.074  0.312  0.527  0.422  1      1      0      0.333  0.033  0.431
D    0        1        1      0.256  0.191  0.381  0.444  0.501  0.82   0.416  0.021  0.314
E    0.027    1        0.913  0.369  0.285  0.372  0.407  0.334  0.902  0.638  0.042  0.447
F    1        0.14     0.037  0.709  0.936  0.701  0.851  0      0.697  0.75   0.372  0.36
G    0.198    0.531    0.506  0      0.446  0      0.777  0.269  0.904  0      0.014  0
H    0.207    0.453    0.679  0.562  0.582  0.713  0.629  0.21   0.735  0.666  0.021  0.537
I    0.792    0.25     0.037  0.454  0.851  1      0.925  0      0.668  0.555  0.13   0.494
K    0.198    1        0.79   0.535  0.325  0.451  0      0.12   0.32   0.694  0      0.809
L    0.783    0.25     0      0.454  0.851  0.618  0.851  0      0.617  0.527  0.162  0.489
M    0.721    0.328    0.098  0.54   0.957  0.56   0.814  0      0.144  0.611  0.115  0.35
N    0.12     0.562    0.827  0.327  0.319  0.381  0.481  0.483  0.502  0.472  0.028  0.375
P    0.253    0.531    0.382  0.32   0.702  0.637  0.555  0.141  0.748  0.388  0.053  0.244
Q    0.123    0.562    0.691  0.44   0.4    0.372  0.407  0.323  0.586  0.583  0.046  0.504
R    0.222    1        0.691  0.711  0      0.558  0.148  0.236  0.726  0.833  0.001  1
S    0.235    0.578    0.53   0.151  0.448  0.312  0.629  0.516  0.953  0.222  0.005  0.216
T    0.318    0.468    0.456  0.264  0.557  0.723  0.592  0.258  1      0.361  0.021  0.365
V    0.687    0.296    0.123  0.342  0.765  0.875  0.888  0      0.591  0.444  0.09   0.373
W    0.56     0        0.061  1      1      0.766  0.777  0.047  0.82   1      1      0.511
Y    0.922    0.171    0.16   0.728  0.787  0.701  0.518  0.072  0.515  0.861  0.208  0.475
As the physicochemical properties of the amino-acids have different scales, they must be brought to a common scale before extracting a unique feature vector. Therefore, an additional normalization step is needed; for this purpose, the min-max normalization method is used (Mathura and Kolippakkam, 2005; Qiu et al., 2014; Yousef and Charkari, 2013). The normalized form of the physicochemical properties is shown in Table 2. Thus, seven different feature vectors are obtained for each protein pair by employing the seven different feature extraction methods on the amino-acid sequences. Since the number of dimensions of the concatenated feature vectors varies from 360 to 1260, the PCA technique is applied to the normalized data to reduce the undesirable redundancies as well as the feature vector dimensions. Algorithm 2 describes the feature extraction step. It should be noted that the gold standard network, including the negative and positive pairs, is imported from the previous step as the input of this algorithm. After executing this algorithm, a number of reduced-dimension feature sets are obtained.
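As an illustration of this step, the sketch below outlines how a single descriptor could be computed in Python. It uses a generic Auto Covariance formulation and is not the exact parameterization of the cited descriptor papers; `PROPS` stands in for the normalized values of Table 2, and `lag` and `n_components` are hypothetical settings.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import minmax_scale

# PROPS: 20 amino acids x 12 normalized physicochemical properties (as in Table 2).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PROPS = minmax_scale(np.random.rand(20, 12), axis=0)   # placeholder for the real table

def property_matrix(sequence):
    """Map a protein sequence to its L x 12 physicochemical property matrix (ppMatrix)."""
    idx = [AMINO_ACIDS.index(a) for a in sequence if a in AMINO_ACIDS]
    return PROPS[idx, :]

def auto_covariance(ppm, lag=30):
    """Generic auto-covariance descriptor: one value per (property, lag) pair.
    Assumes the sequence is longer than the chosen lag."""
    centered = ppm - ppm.mean(axis=0)
    feats = []
    for d in range(1, lag + 1):
        # covariance between positions i and i+d, averaged over the sequence
        feats.extend((centered[:-d] * centered[d:]).mean(axis=0))
    return np.asarray(feats)            # length = lag * number_of_properties

def pair_feature(seq_a, seq_b, lag=30):
    """Concatenate the descriptors of the two proteins of a pair."""
    return np.concatenate([auto_covariance(property_matrix(seq_a), lag),
                           auto_covariance(property_matrix(seq_b), lag)])

def reduce(features, n_components=40):
    """PCA-based dimensionality reduction of the stacked pair features."""
    return PCA(n_components=n_components).fit_transform(features)
```

With 12 properties and a lag of 30 this yields 360 values per protein and 720 per pair, in line with the AC row of Table 5; the other descriptors would replace `auto_covariance` with their own transformations of the same ppMatrix.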
3.3. Ensemble learning

The third step is to apply a learning algorithm on the extracted feature sets. To do this, a stacking two-layer learning structure is proposed to construct the FLN. The major concern in this step is the imbalanced data problem: the positive pairs are a small fraction of the total pairs, which causes a high imbalance ratio. To address this problem, a number of methods have been proposed (Bertoni et al., 2011; Galar et al., 2012). As stated by Galar et al. (2012), bagging is one of the most efficient techniques to deal with the imbalanced data issue. Accordingly, we use a bagging method to balance the data set. Firstly, for each feature descriptor, we randomly select m subsets from the negative data. The obtained negative subsets are denoted by $GSN_1$ to $GSN_m$, where the size of each negative subset is equal to that of the positive data. Then, a replica of the positive data is integrated with each of the m negative subsets to create m independent data sets, and both the feature extraction and learning steps are executed in m iterations, once for each data set. Therefore, the balanced data sets are composed as:

$$\text{balanced datasets} = \begin{cases} Set^1 = GSN_1 \cup GSP \\ Set^2 = GSN_2 \cup GSP \\ Set^3 = GSN_3 \cup GSP \\ \vdots \\ Set^m = GSN_m \cup GSP \end{cases}$$   (12)
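A minimal sketch of this balancing step is given below, assuming GSP and GSN are simply lists of protein-pair identifiers; the function and variable names are illustrative, and the negative subsets are drawn here without replacement, which is one possible reading of "randomly select m subsets".

```python
import random

def make_balanced_sets(gsp, gsn, m, seed=0):
    """Split the (much larger) negative set GSN into m random subsets of size |GSP|
    and pair each of them with a copy of the positive set GSP, as in Eq. (12).
    Requires m * len(gsp) <= len(gsn)."""
    rng = random.Random(seed)
    negatives = gsn[:]
    rng.shuffle(negatives)
    size = len(gsp)
    balanced = []
    for k in range(m):
        gsn_k = negatives[k * size:(k + 1) * size]   # disjoint negative subset GSN_k
        balanced.append(gsn_k + gsp)                 # Set_k = GSN_k ∪ GSP
    return balanced
```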
The random forest is applied on the seven distinct data sets in the first layer of our ensemble learning scheme. This learning algorithm has been regarded as one of the most efficient methods in biological network construction (Jia et al., 2015; Kandaswamy et al., 2011). Specifically, it has a number of significant properties for link prediction in biological networks: (1) it is applicable to large datasets (Breiman, 2001; Caruana et al., 2008; Galar et al., 2012); (2) it is not strongly affected by outlier data (Breiman, 2001; Brown et al., 2005; Caruana et al., 2008); (3) it provides more diversity in ensemble methods by performing further randomization in the obtained model (Brown et al., 2005; Ho, 1995); (4) random forest outperforms the simple decision tree, which might grow to arbitrary complexity with a possible loss of generalization accuracy on unseen data (Ho, 1995); and (5) it does not require much parameter tuning. In this way, each protein pair has seven individual feature vectors, and thus seven predicted scores are obtained for each of them. In the second layer, we use a multi-layer perceptron (MLP) classifier to fuse the prediction scores of the protein pairs. It has been found that weighted averaging leads to high prediction performance when combining several learners in ensemble methods (Brown et al., 2005; Woźniak et al., 2014). Accordingly, the prediction scores of each protein pair are integrated into a new 7-dimensional feature vector. Then, these new features are fed to the MLP classifier as the inputs to obtain the final score of each protein pair. In Fig. 3, the work flow of the learning process is shown.
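The two-layer structure described above can be sketched with standard scikit-learn components as follows. This is an illustrative pipeline rather than the exact configuration used in the paper: the number of trees, the MLP hidden-layer size and the use of in-sample scores for the meta-learner are all assumptions of this sketch.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

class TwoLayerSFLN:
    """First layer: one random forest per descriptor feature set.
    Second layer: an MLP that fuses the seven per-descriptor scores."""

    def __init__(self, n_descriptors=7):
        self.forests = [RandomForestClassifier(n_estimators=200, random_state=i)
                        for i in range(n_descriptors)]
        self.fuser = MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000)

    def fit(self, feature_sets, y):
        # feature_sets: list of arrays, one (n_pairs x d_i) matrix per descriptor.
        scores = []
        for rf, X in zip(self.forests, feature_sets):
            rf.fit(X, y)
            scores.append(rf.predict_proba(X)[:, 1])   # score of the positive class
        meta = np.column_stack(scores)                  # n_pairs x 7 meta-features
        self.fuser.fit(meta, y)
        return self

    def predict(self, feature_sets):
        meta = np.column_stack([rf.predict_proba(X)[:, 1]
                                for rf, X in zip(self.forests, feature_sets)])
        return self.fuser.predict(meta)
```

For clarity the meta-learner is trained on in-sample first-layer scores here; in practice, out-of-fold predictions would typically be used so that the fusion layer does not overfit the first layer.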
Fig. 3. Work flow of the learning approach: first, the GSP and GSN sets are balanced by the bagging technique into m balanced data sets. Afterwards, the proposed stacking learning is run in m iterations, once for each balanced data set.
4. Results

4.1. Dataset

We employed different data sets for performance evaluation of the proposed approach. In the first one, the H. pylori dataset (Martin et al., 2005) is used to compare S-FLN with some other well-known methods in a similar scope. This data set consists of 1458 positive pairs and 1458 negative pairs. We also employed three data sets, including the Uniprot Consortium (Boutet et al., 2016), the GO database (Consortium et al., 2015), and the NCBI database (Geer et al., 2009), to obtain the protein IDs of S. cerevisiae, the GO annotations, and the amino-acid sequences, respectively. Furthermore, the Saccharomyces Genome Database (SGD) (Cherry et al., 2011) is used as an integrated resource to complete the GO annotations of yeast (Saccharomyces cerevisiae). Finally, a set of 6739 Saccharomyces cerevisiae proteins is extracted. Afterwards, all protein pairs in the set are weighted via the GO semantic similarity measure to obtain the GSP and GSN sets, where two strategies based on the similarity weights are used to create the GSP and GSN sets of the weighted graph. In the first one, Dataset-A consists of 105,980 pairs as GSP and 384,964 pairs as GSN, in which every pair of genes with a weight greater than 0.4 is defined as a Gold-Standard Positive (GSP), and gene pairs that do not share any term (semantic weight equal to zero) are defined as Gold-Standard Negatives (GSN). In the second one, Dataset-B includes GSP and GSN sets in which weights greater than 0.6 are considered as GSP, consisting of 58,759 gene pairs, and pairs with GO weights equal to zero, i.e., annotated by GO biological process terms without any shared terms, are considered as GSN, comprising 236,037 gene pairs.
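The two labeling strategies can be expressed compactly. In the sketch below, `weights` is assumed to be a dictionary mapping a protein pair to its GO semantic similarity weight; the function name is illustrative.

```python
def build_gold_standard(weights, positive_threshold):
    """Label pairs with similarity above the threshold as GSP and pairs with zero
    similarity as GSN; pairs in between are used for neither set.
    positive_threshold = 0.4 yields Dataset-A, 0.6 yields Dataset-B."""
    gsp = [pair for pair, w in weights.items() if w > positive_threshold]
    gsn = [pair for pair, w in weights.items() if w == 0.0]
    return gsp, gsn
```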
Algorithm 2: The feature extraction phase.
Input: gold standard graph, physicochemical matrix, amino-acid sequences of proteins
Output: feature vectors of all protein pairs
initialization;
/* Create negative and positive sets */
foreach protein pair e in the gold standard graph do
    if the weight of e is smaller than a threshold value then
        Add e to the gold standard negative set (GSN);
    else
        Add e to the gold standard positive set (GSP);
    end
end
/* Create a physicochemical properties matrix for each protein, then extract features from the matrix */
foreach pair e in the GSN and GSP sets do
    A = first protein of the pair (link);
    B = second protein of the pair (link);
    Fetch the amino-acid sequences of proteins A and B and save them in AS and BS;
    Create a ppMatrix of physicochemical properties for each amino-acid in AS and BS;
    /* ppMatrix is an m*n matrix where m equals the length of the protein sequence and n equals the number of physicochemical properties */
    for i = 1 to number of feature descriptor methods do
        Compute FeatureVector[i1] of ppMatrix[AS];
        Compute FeatureVector[i2] of ppMatrix[BS];
        Concatenate FeatureVector[i1] and FeatureVector[i2] into FeatureVector[i];
    end
    Apply PCA in order to reduce the dimension of FeatureVector;
end
return the independent feature vectors for each pair of proteins;
4.2. Evaluation metrics

In order to evaluate the proposed approach, five metrics are used: precision, recall, accuracy, F-measure and ROC-AUC. These evaluation metrics are defined as follows:

$$Precision = \frac{TP}{TP + FP}$$   (13)

$$Recall = \frac{TP}{TP + FN}$$   (14)

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$   (15)

$$F\text{-}measure = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall},$$   (16)
where TP is true positive, FP is false positive, TN is true negative and FN is false negative. Positive and negative refer to linked and unlinked protein pairs, respectively. In other words, the label of a link in the FLN is 1 when the protein pair shares the same function or pathway, and 0 otherwise.

4.3. Prediction accuracy comparison

In order to show the advantages of S-FLN and to compare it with state-of-the-art methods in biological network construction, the Helicobacter pylori dataset has been used. We compared
the results of S-FLN based on cross-validation with the results reported by Yousef and Charkari (2013), You et al. (2013) and Xia et al. (2010a) on the same data set. As shown in Table 3, the average prediction performances obtained by S-FLN are 0.890, 0.911, 0.901, and 0.900 for recall, precision, accuracy, and f-measure, respectively. The phylogenetic bootstrap (Bock and Gough, 2003) method shows the worst performance. The reason is that it employs a single classifier with a single feature set; thus, its ability to detect the positive pairs as well as to retrieve the latent links is not high. HKNN (Nanni, 2005), Signature products (Martin et al., 2005), and Boosting (Shi et al., 2010) show moderate performances with some enriched feature sets. However, their results are not as high, because they apply single or basic learning algorithms without any specialization for constructing the networks. As can be observed from the table, the best results are obtained by NLVQ (Yousef and Charkari, 2013), PCA-EELM (You et al., 2013), and E-HKNN (Nanni and Lumini, 2006), whose results are competitive with S-FLN. However, thanks to employing multiple feature sets obtained from different feature descriptors as well as developing an ensemble-based learning algorithm, S-FLN obtains more accurate results. Moreover, the issue of the imbalanced data set has been addressed in S-FLN by applying the bagging technique before the learning process starts. In order to evaluate the effect of handling the imbalanced data issue, we have divided the primary negative data of Dataset-A into four subsets and duplicated the positive part. Fig. 4 shows the results of S-FLN based on two descriptors, Auto Covariance (AC) and Local Descriptor (LD), on the four subsets, known as bags. The precision, recall and f-measure obtained on the balanced data sets indicate that S-FLN provides high accuracy in functional link prediction even with a non-exhaustive parameter tuning process. As mentioned above, we have used cross validation to report the results. In order to clarify this evaluation strategy, the results of S-FLN based on 5-fold cross validation are reported in Table 4. For this purpose, the data set is divided into 5 parts; at each iteration, four subsets are used to train the model, and the remaining subset is used to test the model. As shown in Table 4, a variation of less than 1% is obtained on all metrics. This indicates that the S-FLN approach has not been overfitted on the dataset.
4.4. Analyzing the S-FLN

In this subsection, different aspects of S-FLN, including the similarity weight, the employed features, the different descriptors, dimensionality reduction, and fusion methods, are discussed.

Similarity weight. As mentioned in Section 4.1, the set of 6739 proteins is extracted from Saccharomyces cerevisiae and the protein pairs are weighted as a fully-connected graph based on GO similarity. In order to show the impact of the semantic similarity weight on the performance of S-FLN, two data sets are created, namely Dataset-A and Dataset-B. In Dataset-B, we intend to increase the accuracy of the gold standard set; to this end, we adopted a more stringent approach and reduced the number of protein pairs in Dataset-B. As shown in Fig. 5, the F-measure and recall decrease by about 2% when Dataset-B is used. These results indicate that some valuable pairs might be missed if a high threshold value is selected to construct the GS network. On the other hand, selecting small values for the similarity weight causes undesirable noise in the obtained positive or negative sets. In practice, we have observed that selecting the weight in the range of [0.4, 0.6] provides acceptable results.
Table 3. Comparison of S-FLN with state-of-the-art methods in biological network construction in terms of precision, recall, accuracy, and f-measure. The best performance is achieved at 1.0 for all metrics.

Methods                                         Recall   Precision   Accuracy   F-measure
Phylogenetic bootstrap (Bock and Gough, 2003)   0.698    0.802       0.758      0.746
HKNN (Nanni, 2005)                              0.860    0.840       0.840      0.849
E-HKNN (Nanni and Lumini, 2006)                 0.867    0.850       0.866      0.858
Signature products (Martin et al., 2005)        0.799    0.857       0.834      0.826
Boosting (Shi et al., 2010)                     0.803    0.816       0.795      0.810
Meta (Xia et al., 2010a)                        0.840    0.900       0.879      0.869
PCA-EELM (You et al., 2013)                     0.889    0.861       0.875      0.869
NLVQ (Yousef and Charkari, 2013)                0.870    0.890       0.900      0.882
Proposed Method                                 0.890    0.911       0.901      0.900
Fig. 4. Comparison of precision, recall and f-measure values in the four datasets balanced by bagging on the AC and LD feature sets. AC shows a more steady result with little variation.
Table 4. Prediction performances for 5-fold cross validation of S-FLN. The best performance is achieved at 1.0 for all metrics.

        Precision   Recall   F-measure   AUC-ROC   Accuracy
CV-1    0.936       0.935    0.935       0.979     0.935
CV-2    0.948       0.948    0.948       0.983     0.948
CV-3    0.943       0.942    0.942       0.982     0.942
CV-4    0.931       0.931    0.931       0.979     0.931
CV-5    0.935       0.935    0.935       0.981     0.935
Avg.    0.939       0.938    0.938       0.981     0.938
Features. In order to evaluate the impact of the number of properties on the performance of S-FLN, we have extracted two feature sets. The first one uses six physicochemical properties to create the feature vectors, while all physicochemical properties are used in the second data set. The results are shown in Fig. 6 as a radar diagram. Clearly, an improvement of 1% in all metrics is obtained when we employ all physicochemical properties.

Descriptors. In normalized units, the area under the ROC curve (AUC) is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one (Fawcett, 2006). In this work, we compare the descriptors using the AUC measure. Fig. 7 (A–I) illustrates the AUC for each of the mentioned descriptors, while Fig. 7 (J)
illustrates the fusion of all employed descriptors. The CT and LD descriptors use the structures and sequence chains of amino acids to represent the feature vectors. Moreover, the LD can extract effective feature vectors in both continuous and discontinuous regions of sequences. As shown in Fig. 7-G, the LD gives efficient results in predicting the functional links between the proteins by employing more structural information from the sequences. In the other descriptors, the physicochemical properties of amino acids are used to extract useful feature vectors. Due to the natural difference between the FLN and the PPI network, the physicochemical properties are more useful for constructing the FLN than the PPI network, because the FLN specifically models the functional relationships between the proteins. In Fig. 7, the results demonstrate the efficiency of these properties in predicting functional links. Moreover, as observed in Fig. 7-J, the integration of the descriptors, which uses all the features, provides an AUC of 0.98. It can be inferred that a better performance is obtained compared to employing a single type of feature set.

Dimensionality reduction. One of the most important steps in feature extraction is to reduce the redundancy in the feature vectors. To show the impact of the number of principal components, denoted by k, on the performance of S-FLN, the value of k has been varied from 2 to the length of the feature vector. Then, the PCA is executed on the features obtained by each feature descriptor
Fig. 6. Impact of physiochemical properties on the performance of the S-FLN. Twelve physiochemical properties are used in the first dataset and six ones are used in the second data set to extract the feature vectors. These are indicated in blue and red colors, respectively. Each spoke on the plot represents an evaluation metric including precision, recall, F-measure, and ROC-AUC. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Fig. 5. A radar plot that depicts the influence of the semantic similarity weight on the performance of S-FLN using the two created datasets. It should be noticed that Dataset-B is a subset of Dataset-A. Each spoke on the plot represents an evaluation metric. Also, each plot shows the performance of S-FLN on the positive and negative sets.

Table 5. The result of applying PCA on different feature descriptors, where LD obtains high accuracy with a smaller number of features.

Feature descriptor   Original feature length   PCA feature length   Accuracy
GA                   720                       30                   0.84
MGA                  720                       35                   0.86
AC                   720                       40                   0.93
CT                   686                       30                   0.89
LD                   1260                      20                   0.92
MA                   720                       30                   0.90
NMA                  720                       30                   0.90
to select the best value of k. The result of this evaluation is shown in Fig. 8. Moreover, Table 5 shows the original number of features and the number of extracted features that lead to high accuracy for each descriptor method. It can be observed from Table 5 that the LD feature set achieves the best performance with the shortest feature vector. Additionally, this feature set contains the maximum data redundancy, as its length has been reduced from 1260 to 20 using PCA. In the other feature sets, we observed a significant decrease in feature
length, while the accuracy is still high.

Fusion. It has been found that the weighted voting strategy can obtain the best results in many engineering applications (Brown et al., 2005). Accordingly, in this part, we have used three well-known classifiers, including the support vector machine (SVM), the multilayer perceptron (MLP) and Naive Bayes, to combine the outputs of the first layer of S-FLN. The results are shown in Fig. 9. Clearly, the MLP shows a better performance compared to the other two meta-classifiers. The reason is that the MLP can assign more accurate weights to each separate classifier when there is no expert or prior knowledge about the features or the importance of each classifier.

5. Concluding remarks

In this paper, the problem of network construction has been addressed as a classification problem. Commonly, more than one data source is employed to solve this problem due to the lack of a single and reliable biological data source. However, this approach suffers from missing or invalid information, which causes an undesirable decrease in prediction accuracy. In order to cope with this problem, a hierarchical approach is proposed in three main steps. Firstly, a reliable network is constructed using the GO data. Secondly, different types of feature descriptors are employed to create diverse feature sets. Finally, a set of learners is obtained based on the extracted feature sets to train the global learner. The proposed approach provides some interesting achievements. As the sequence of proteins is a more comprehensive and accurate data source, a set of efficient and informative feature sets is obtained. Besides, thanks to employing different types of feature descriptors, a set of diverse training sets is obtained, which leads to an effective ensemble in the learning step. Also, the issue of missing values and invalid information is addressed as well. Furthermore, using an ensemble learning structure as well as handling the imbalanced data problem brings high prediction performance. In the proposed approach, we have considered the single-type link prediction task to construct the FLN. However,
Fig. 7. The AUCs of the seven feature descriptor methods and their integration, shown in (a) GA, (b) MGA, (c) AC, (d) CT, (g) LD, (h) MA, (i) NMA, (j) Integrated, where the true positive rate is plotted on the y-axis and the false positive rate on the x-axis; each plot presents the 5-fold cross validation curves and the average AUC.
Fig. 8. The impact of the PCA on the performance of S-FLN in an incremental manner.
Fig. 9. Comparison of three different classifiers as a meta-classifier in the second layer of proposed approach.
a typical FLN naturally includes various types of links and protein dependencies. In this regard, constructing the FLN according to a heterogeneous network model and consequently performing heterogeneous link prediction has been considered as our future plan.

References

Abdi, H., Williams, L.J., 2010. Principal component analysis. Wiley Interdiscip. Rev. Comput. Stat. 2 (4), 433–459. Aguiar-Pulido, V., Munteanu, C.R., Seoane, J.A., Fernandez-Blanco, E., Perez-Montoto, L.G., González-Díaz, H., Dorado, J., 2012. Naive Bayes qsdr classification based on spiral-graph Shannon entropies for protein biomarkers in human colon cancer. Mol. BioSyst. 8 (6), 1716–1722. Agüero-Chapin, G., Varona-Santos, J., de la Riva, G.A., Antunes, A., González-Villa, T., Uriarte, E., González-Díaz, H., 2009. Alignment-free prediction of polygalacturonases with pseudofolding topological indices: experimental isolation from Cof-
fea arabica and prediction of a new sequence. J. Proteome Res. 8 (4), 2122–2128. doi:10.1021/pr800867y. Apolloni, B., et al., 2011. Learning functional linkage networks with a cost-sensitive approach. In: Neural Nets WIRN10: Proceedings of the 20th Italian Workshop on Neural Nets, 226. IOS Press, p. 52. Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., et al., 20 0 0. Gene ontology: tool for the unification of biology. Nat. Genet. 25 (1), 25–29. Barabási, A.-L., Gulbahce, N., Loscalzo, J., 2011. Network medicine: a network-based approach to human disease. Nat. Rev. Genet. 12 (1), 56–68. Bertoni, A., Frasca, M., Grossi, G., Valentini, G., 2011. Learning functional linkage networks with a cost-sensitive approach. Front. Artif. Intell. Appl. 226, 52–61. doi:10.3233/978- 1- 60750- 692- 8- 52. Bock, J.R., Gough, D.A., 2003. Whole-proteome interaction mining. Bioinformatics 19 (1), 125–134. Borozan, I., Watt, S., Ferretti, V., 2015. Integrating alignment-based and alignment-free sequence similarity measures for biological sequence classification. Bioinformatics btv006. Boutet, E., Lieberherr, D., Tognolli, M., Schneider, M., Bansal, P., Bridge, A.J., Poux, S., Bougueleret, L., Xenarios, I., 2016. UniProtKB/Swiss-Prot, the manually annotated section of the UniProt knowledgebase: how to use the entry view. In: Plant Bioinformatics: Methods and Protocols, pp. 23–54. Breiman, L., 2001. Random forests. Mach. Learn. 45 (1), 5–32. Brown, G., Wyatt, J., Harris, R., Yao, X., 2005. Diversity creation methods: a survey and categorisation. Inf. Fusion 6 (1), 5–20. Caruana, R., Karampatziakis, N., Yessenalina, A., 2008. An empirical evaluation of supervised learning in high dimensions. In: Proceedings of the 25th International Conference on Machine Learning. ACM, pp. 96–103. Charton, M., Charton, B.I., 1982. The structural dependence of amino acid hydrophobicity parameters. J. Theor. Biol. 99 (4), 629–644. Cherry, J.M., Hong, E.L., Amundsen, C., Balakrishnan, R., Binkley, G., Chan, E.T., Christie, K.R., Costanzo, M.C., Dwight, S.S., Engel, S.R., et al., 2011. Saccharomyces genome database: the genomics resource of budding yeast. Nucleic Acids Res. gkr1029. Chothia, C., 1976. The nature of the accessible and buried surfaces in proteins. J. Mol. Biol. 105 (1), 1–12. Consortium, G.O., et al., 2015. Gene ontology consortium: going forward. Nucleic Acids Res. 43 (D1), D1049–D1056. Dea-Ayuela, M.A., Pérez-Castillo, Y., Meneses-Marcel, A., Ubeira, F.M., Bolas-Fernández, F., Chou, K.-C., González-Díaz, H., 2008. Hp-lattice qsar for dynein proteins: experimental proteomics (2d-electrophoresis, mass spectrometry) and theoretic study of a leishmania infantum sequence. Bioorg. Med. Chem. 16 (16), 7770–7776.
A. Jalilvand et al. / Journal of Theoretical Biology 437 (2018) 149–162 Eisenberg, D., McLachlan, A.D., 1986. Solvation energy in protein folding and binding. Nature 319 (6050), 199–203. Fauchère, J.-L., Charton, M., Kier, L.B., Verloop, A., Pliska, V., 1988. Amino acid side chain parameters for correlation studies in biology and pharmacology. Chem. Biol. Drug Des. 32 (4), 269–278. Fawcett, T., 2006. An introduction to roc analysis. Pattern Recognit. Lett. 27 (8), 861–874. Feng, Z.-P., Zhang, C.-T., 20 0 0. Prediction of membrane protein types based on the hydrophobic index of amino acids. J. Protein Chem. 19 (4), 269–275. Fernandez-Lozano, C., Gestal, M., González-Díaz, H., Dorado, J., Pazos, A., Munteanu, C.R., 2014. Markov mean properties for cell death-related protein classification. J. Theor. Biol. 349, 12–21. doi:10.1016/j.jtbi.2014.01.033. Franke, L., van Bakel, H., Fokkens, L., de Jong, E., Egmont-Petersen, M., Wijmenga, C., 2006. Reconstruction of a functional human gene network, with an application for prioritizing positional candidate genes. Am. J. Hum. Genet. 78 (June), 1011—-1025. Galar, M., Fernandez, A., Barrenechea, E., Bustince, H., Herrera, F., 2012. A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Trans. Syst. Man. Cybern. Part C (Appl. Rev.) 42 (4), 463–484. Garel, J.P., Filliol, D., Mandel, P., 1973. Coefficients de partage d’aminoacides, nucléobases, nucléosides et nucléotides dans un systéme solvant salin. J. Chromatogr. A 78 (2), 381–391. Geer, L.Y., Marchler-Bauer, A., Geer, R.C., Han, L., He, J., He, S., Liu, C., Shi, W., Bryant, S.H., 2009. The ncbi biosystems database. Nucleic Acids Res. gkp858. Goh, K.-I., Cusick, M.E., Valle, D., Childs, B., Vidal, M., Barabási, A.-L., 2007. The human disease network. Proc. Natl. Acad. Sci. 104 (21), 8685–8690. González-Díaz, H., Riera-Fernández, P., 2012. New Markov-autocorrelation indices for re-evaluation of links in chemical and biological complex networks used in metabolomics, parasitology, neurosciences, and epidemiology. J. Chem. Inf. Model. 52 (12), 3331–3340. doi:10.1021/ci300321f. Grantham, R., 1974. Amino acid difference formula to help explain protein evolution. Science 185 (4154), 862–864. Guo, Y., Yu, L., Wen, Z., Li, M., 2008. Using support vector machine combined with auto covariance to predict protein–protein interactions from protein sequences. Nucleic Acids Res. 36 (9), 3025–3030. Hedberg, S.R., 2006. Machine learning in biology: a profile of David Haussler. IEEE Intell. Syst. 21 (1), 8–10. Ho, T.K., 1995. Random decision forests. In: Document Analysis and Recognition, 1995., Proceedings of the Third International Conference on, 1. IEEE, pp. 278–282. Hopp, T.P., Woods, K.R., 1981. Prediction of protein antigenic determinants from amino acid sequences. Proc. Natl. Acad. Sci. 78 (6), 3824–3828. Huang, Y., Li, H., Hu, H., Yan, X., Waterman, M.S., Huang, H., Zhou, X.J., 2007. Systematic discovery of functional modules and context-specific functional annotation of human genome. Bioinformatics 23 (13), i222–i229. Huang, Y.-A., You, Z.-H., Chen, X., Chan, K., Luo, X., 2016. Sequence-based prediction of protein-protein interactions using weighted sparse representation model combined with global encoding. BMC Bioinform. 17 (1), 184. Hughes, T.R., Roth, F.P., 2008. A race through the maze of genomic evidence. Genome Biol. 9 (1), S1. Hutchens, J.O., 1970. Heat capacities, absolute entropies, and entropies of formation of amino acids and related compounds. 
Handbook of Biochemistry. Janin, J., 1979. Surface and inside volumes in globular proteins. Nature 277 (5696), 491–492. Jia, J., Liu, Z., Xiao, X., Liu, B., Chou, K.-C., 2015. ippi-esml: an ensemble classifier for identifying the interactions of proteins by incorporating their physicochemical properties and wavelet transforms into pseaac. J. Theor. Biol. 377, 47–56. Kandaswamy, K.K., Chou, K.-C., Martinetz, T., Möller, S., Suganthan, P., Sridharan, S., Pugalenthi, G., 2011. Afp-pred: a random forest approach for predicting antifreeze proteins from sequence-derived properties. J. Theor. Biol. 270 (1), 56–62. Kell, D.B., 2006. Metabolomics, modelling and machine learning in systems biology–towards an understanding of the languages of cells. FEBS J. 273 (5), 873–894. Köhler, S., Bauer, S., Horn, D., Robinson, P.N., 2008. Walking the interactome for prioritization of candidate disease genes. Am. J. Hum. Genet. 82 (4), 949–958. Lei, Y.-K., You, Z.-H., Ji, Z., Zhu, L., Huang, D.-S., 2012. Assessing and predicting protein interactions by combining manifold embedding with multiple information integration. BMC Bioinform. 13 (7), 1. Li, M., Wu, X., Pan, Y., Wang, J., 2013. hf-measure: a new measurement for evaluating clusters in protein–protein interaction networks. Proteomics 13 (2), 291–300. Li, Y., Song, T., Yang, J., Zhang, Y., Yang, J., 2016. An alignment-free algorithm in comparing the similarity of protein sequences based on pseudo-Markov transition probabilities among amino acids. PLoS ONE 11 (12), 1–14. Linghu, B., Franzosa, E.A., Xia, Y., 2013. Construction of functional linkage gene networks by data integration. In: Data Mining for Systems Biology. Springer, pp. 215–232. Linghu, B., Snitkin, E.S., Holloway, D.T., Gustafson, A.M., Xia, Y., DeLisi, C., 2008. High-precision high-coverage functional inference from integrated data sources. BMC Bioinform. 9, 119. doi:10.1186/1471-2105-9-119. Linghu, B., Snitkin, E.S., Hu, Z., Xia, Y., Delisi, C., 2009. Genome-wide prioritization of disease genes and identification of disease-disease associations from an integrated human functional linkage network. Genome Biol. 10 (9), R91. doi:10.1186/gb- 2009- 10- 9- r91. Manimaran, P., Hegde, S.R., Mande, S.C., 2009. Prediction of conditional gene essen-
tiality through graph theoretical analysis of genome-wide functional linkages. Mol. Biosyst. 5 (12), 1936–1942. Martin, S., Roe, D., Faulon, J.-L., 2005. Predicting protein–protein interactions using signature products. Bioinformatics 21 (2), 218–226. Mathura, V.S., Kolippakkam, D., 2005. Apdbase: amino acid physico-chemical properties database. Bioinformation 1 (1), 2–4. Mei, S., Zhu, H., 2014. Adaboost based multi-instance transfer learning for predicting proteome-wide interactions between salmonella and human proteins. PLoS ONE 9 (10), e110488. Mukhopadhyay, A., Ray, S., De, M., 2012. Detecting protein complexes in a ppi network: a gene ontology based multi-objective evolutionary approach. Mol. BioSyst. 8, 3036–3048. doi:10.1039/C2MB25302J. Munteanu, C.R., González-Díaz, H., Borges, F., de Magalhães, A.L., 2008. Natural/random protein classification models based on star network topological indices. J. Theor. Biol. 254 (4), 775–783. Munteanu, C.R., González-Díaz, H., Magalhães, A.L., 2008. Enzymes/non-enzymes classification model complexity based on composition, sequence, 3D and topological indices. J. Theor. Biol. 254 (2), 476–482. doi:10.1016/j.jtbi.20 08.06.0 03. Munteanu, C.R., Magalhães, A.L., Uriarte, E., González-Díaz, H., 2009. Multi-target qpdr classification model for human breast and colon cancer-related proteins using star graph topological indices. J. Theor. Biol. 257 (2), 303–311. Nanni, L., 2005. Hyperplanes for predicting protein–protein interactions. Neurocomputing 69 (1), 257–263. Nanni, L., Lumini, A., 2006. An ensemble of k-local hyperplanes for predicting protein–protein interactions. Bioinformatics 22 (10), 1207–1210. Otu, H.H., Sayood, K., 2003. A new sequence distance measure for phylogenetic tree construction. Bioinformatics 19 (16), 2122–2130. Ovaska, K., Laakso, M., Hautaniemi, S., 2008. Fast gene ontology based clustering for microarray experiments. BioData Min. 1 (1), 1. Perez-Bello, A., Munteanu, C.R., Ubeira, F.M., Lopes De Magalhães, A., Uriarte, E., González-Díaz, H., 2009. Alignment-free prediction of mycobacterial DNA promoters based on pseudo-folding lattice network or star-graph topological indices. J. Theor. Biol. 256 (3), 458–466. doi:10.1016/j.jtbi.2008.09.035. Prabhakaran, M., Ponnuswamy, P., 1982. Shape and surface features of globular proteins. Macromolecules 15 (2), 314–320. Qiu, W.-R., Xiao, X., Chou, K.-C., 2014. irspot-tncpseaac: identify recombination spots with trinucleotide composition and pseudo amino acid components. Int. J. Mol. Sci. 15 (2), 1746–1766. Resnik, P., 1995. Using information content to evaluate semantic similarity in a taxonomy. In: IJCAI’95 Proceedings of the 14th International Joint Conference on Artificial Intelligence, pp. 448–453. arXiv preprint cmp-lg/9511007 Shen, J., Zhang, J., Luo, X., Zhu, W., Yu, K., Chen, K., Li, Y., Jiang, H., 2007. Predicting protein–protein interactions based only on sequences information. Proc. Natl. Acad. Sci. 104 (11), 4337–4341. Shi, M.-G., Xia, J.-F., Li, X.-L., Huang, D.-S., 2010. Predicting protein–protein interactions from sequence using correlation coefficient and high-quality interaction dataset. Amino Acids 38 (3), 891–899. Sokal, R.R., Thomson, B.A., 2006. Population structure inferred by local spatial autocorrelation: an example from an Amerindian tribal population. Am. J. Phys. Anthropol. 129 (1), 121–131. Song, X., Li, L., Srimani, P.K., Yu, P.S., Wang, J.Z., 2014. Measure the semantic similarity of go terms using aggregate information content. IEEE/ACM Trans. Comput. Biol. 
Bioinform. (TCBB) 11 (3), 468–476. Sweet, R.M., Eisenberg, D., 1983. Correlation of sequence hydrophobicities measures similarity in three-dimensional protein structure. J. Mol. Biol. 171 (4), 479–488. Teng, Z., Guo, M., Liu, X., Dai, Q., Wang, C., Xuan, P., 2013. Measuring gene functional similarity based on group-wise comparison of go terms. Bioinformatics btt160. Vilar, S., González-Díaz, H., Santana, L., Uriarte, E., 2009. A network-qsar model for prediction of genetic-component biomarkers in human colorectal cancer. J. Theor. Biol. 261 (3), 449–458. Vinga, S., 2014. Editorial: alignment-free methods in computational biology. Brief. Bioinform. 15 (3), 341. Wang, J., Yang, J., Mao, S., Chai, X., Hu, Y., Hou, X., Tang, Y., Bi, C., Li, X., 2014. MitProNet: a knowledgebase and analysis platform of proteome, interactome and diseases for mammalian mitochondria. PLoS ONE 9 (10), e111187. doi:10.1371/ journal.pone.0111187. Wang, J.Z., Du, Z., Payattakool, R., Philip, S.Y., Chen, C.-F., 2007. A new method to measure the semantic similarity of go terms. Bioinformatics 23 (10), 1274–1281. Wang, X., Gulbahce, N., Yu, H., 2011. Network-based methods for human disease gene prediction. Brief. Funct. Genomics 10 (5). doi:10.1093/bfgp/elr024. 280–93 ´ Wozniak, M., Graña, M., Corchado, E., 2014. A survey of multiple classifier systems as hybrid systems. Inf. Fusion 16, 3–17. Wu, M., Li, X., Chua, H.N., Kwoh, C.-K., Ng, S.-K., 2010. Integrating diverse biological and computational sources for reliable protein-protein interactions. BMC Bioinform. 11 (7), S8. Xia, J.-F., Han, K., Huang, D.-S., 2010. Sequence-based prediction of protein-protein interactions by means of rotation forest and autocorrelation descriptor. Protein Pept. Lett. 17 (1), 137–145. Xia, J.-F., Zhao, X.-M., Huang, D.-S., 2010. Predicting protein–protein interactions from protein sequences using meta predictor. Amino Acids 39 (5), 1595–1599. Yang, L., Xia, J.-F., Gui, J., 2010. Prediction of protein-protein interactions from protein sequence using local descriptors. Protein Pept. Lett. 17 (9), 1085–1090. Yao, Y., Yan, S., Han, J., Dai, Q., He, P.-a., 2014. A novel descriptor of protein sequences and its application. J. Theor. Biol. 347, 109–117. You, Z.-H., Lei, Y.-K., Gui, J., Huang, D.-S., Zhou, X., 2010. Using manifold embedding
for assessing and predicting protein interactions from high-throughput experimental data. Bioinformatics 26 (21), 2744–2751. You, Z.-H., Lei, Y.-K., Zhu, L., Xia, J., Wang, B., 2013. Prediction of protein-protein interactions from amino acid sequences with ensemble extreme learning machines and principal component analysis. BMC Bioinform. 14 (8), 1. You, Z.-H., Zhu, L., Zheng, C.-H., Yu, H.-J., Deng, S.-P., Ji, Z., 2014. Prediction of protein-protein interactions from amino acid sequences using a novel multi-scale continuous and discontinuous feature set. BMC Bioinform. 15 (Suppl 15), S9. Yousef, A., Charkari, N.M., 2013. A novel method based on new adaptive lvq neural
network for predicting protein–protein interactions from protein sequences. J. Theor. Biol. 336, 231–239. Zhang, Y.-N., Pan, X.-Y., Huang, Y., Shen, H.-B., 2011. Adaptive compressive learning for prediction of protein–protein interactions from primary sequence. J. Theor. Biol. 283 (1), 44–52. Zhou, X., Kao, M.-C.J., Wong, W.H., 2002. Transitive functional annotation by shortest-path analysis of gene expression data. Proc. Natl. Acad. Sci. 99 (20), 12783–12788.