Journal of Molecular Graphics and Modelling 73 (2017) 166–178
Contents lists available at ScienceDirect
Journal of Molecular Graphics and Modelling journal homepage: www.elsevier.com/locate/JMGM
Identifying the molecular functions of electron transport proteins using radial basis function networks and biochemical properties Nguyen-Quoc-Khanh Le ∗ , Trinh-Trung-Duong Nguyen, Yu-Yen Ou ∗ Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, Taiwan
Article info
Article history: Received 4 November 2016 Received in revised form 26 December 2016 Accepted 4 January 2017 Available online 2 February 2017 Keywords: Electron transport proteins Transporter Annotation Feature selection
Abstract
Electron transport proteins play an important role in storing and transferring electrons in cellular respiration, which is the most efficient process through which cells gather energy from consumed food. According to their molecular functions, the electron transport chain components can be grouped into five complexes with several different electron carriers and functions. Therefore, identifying the molecular functions in the electron transport chain is vital for helping biologists understand the electron transport chain process and energy production in cells. This work comprises two phases: discriminating electron transport proteins from transport proteins, and classifying electron transport proteins into the categories of the five complexes. In the first phase, the PSSM with AAIndex feature set identified electron transport proteins among transport proteins with a sensitivity of 73.2%, specificity of 94.1%, accuracy of 91.3%, and MCC of 0.64 on the independent dataset. In the second phase, our method provides a precise model for identifying the five complexes with different molecular functions in electron transport proteins. The PSSM with AAIndex properties achieved MCCs of 0.51, 0.47, 0.42, 0.74, and 1.00 for the five complexes, respectively, on the independent dataset. We suggest that our study could be a powerful model for determining which molecular function of electron transport proteins a new protein belongs to. © 2017 Elsevier Inc. All rights reserved.
1. Introduction
Cellular respiration is the process that generates adenosine triphosphate (ATP) and allows cells to obtain energy from food. All activities of a living organism depend on the energy that cellular respiration supplies in the form of ATP. During cellular respiration, cells break down food molecules, such as sugars, and release energy. The goal of cellular respiration is to harvest electrons from organic compounds to create ATP, which provides energy for most cellular reactions. As cells go through cellular respiration, they require a pathway to store and transport electrons (i.e., the electron transport chain). The electron transport chain produces a transmembrane proton electrochemical gradient through oxidation-reduction reactions. When protons flow back across the membrane through ATP synthase, the enzyme converts this mechanical energy into chemical energy by generating ATP, which supplies energy for many cellular processes.
∗ Corresponding author. E-mail addresses:
[email protected] (N.-Q.-K. Le),
[email protected] (Y.-Y. Ou). http://dx.doi.org/10.1016/j.jmgm.2017.01.003 1093-3263/© 2017 Elsevier Inc. All rights reserved.
The electron transport chain is a series of protein complexes embedded in the inner membrane of the mitochondria. Fig. 1 shows the electron transport chain system. Electrons captured from donor molecules are transferred through these complexes, which are organized into Complex I, Complex II, Complex III, Complex IV, and ATP synthase (which may be called Complex V). Each complex includes several specific electron carriers with different molecular functions. At the mitochondrial inner membrane, electrons from nicotinamide adenine dinucleotide (NADH) and succinate pass through the electron transport chain to oxygen. The best-known molecular functions in Complex I and Complex II are NADH dehydrogenase and succinate dehydrogenase, respectively. Electrons pass from Complex I to a carrier (coenzyme Q) that is embedded in the membrane. From coenzyme Q, electrons are handed to Complex III (cytochrome b, c1 complex). The pathway from Complex III leads to cytochrome c and then to Complex IV (cytochrome oxidase complex). Finally, the proton electrochemical gradient allows ATP synthase to use the flow of H+ to generate ATP.

Fig. 1. The process of the electron transport chain.

Electron transport proteins and membrane proteins have attracted the interest of many researchers because of their relevance to cellular respiration and to life. For example, Gromiha [1] provided a simple statistical method for discriminating outer membrane proteins with high accuracy. Moreover, Ou [2] discriminated beta-barrel membrane proteins by using radial basis function networks (RBFNs) and position specific scoring matrix (PSSM) profiles. Chen [3] divided electron transport proteins into four types of transport proteins for prediction and analysis. After the prediction and evaluation, Chen categorized the transport proteins and determined the functions of each protein type within the transport proteins using PSSM profiles and biochemical properties. Then, Ou [4] integrated significant amino acid pairs to identify O-linked glycosylation sites on transmembrane and non-transmembrane proteins.

This study proposes a method based primarily on PSSM profiles and biochemical properties for identifying the category of electron transport proteins from their molecular function. In the first stage, we used a set of 2277 transport proteins and 354 electron transport proteins to identify electron transport proteins among transport proteins. This stage achieved a sensitivity of 74.6%, specificity of 95.8%, and accuracy of 92.9%, with a Matthews correlation coefficient (MCC) of 0.70 on the cross-validation dataset. On the independent dataset, our method achieved a sensitivity of 73.2%, specificity of 94.1%, and accuracy of 91.3%, with an MCC of 0.64. In the second stage, we used electron transport proteins from the first stage, with 101 electron transport proteins as the training dataset and 31 electron transport proteins as the independent test dataset. We used the independent dataset to evaluate the performance of the proposed approach, which achieved MCCs of 0.51, 0.47, 0.42, 0.74, and 1.00, respectively, for the five complexes. In both stages, the essential step is using the F-score to select, from 544 biochemical properties, those to add to the PSSM profiles to improve prediction performance.
The proposed method provides useful information for biologists. It can serve as a powerful model for predicting the categories of electron transport proteins and may help biologists understand electron transport chain functions, especially the categories of electron transport proteins.

2. Materials and methods
This work consists of two stages: discriminating electron transport proteins from transport proteins, and classifying electron transport proteins into the categories of the five complexes. Fig. 2 shows the overall architecture of this work, which consists of three subprocesses in each stage: data collection, feature set generation, and model evaluation. Based on this architecture, we developed a novel approach that uses PSSM profiles and biochemical properties for both stages.

Fig. 2. The architecture of annotating electron transport proteins.

2.1. Data collection
First, we collected transport proteins from the UniProt database [5]. We eliminated the sequences without the annotation “evidence at protein level” or “complete”. Next, BLAST [6] was used to exclude sequences with a sequence identity of greater than 20% from the dataset. Finally, 2277 transport proteins and 354 electron transport proteins were used in this work. In addition, only the 132 proteins that include a complex annotation were used in the second stage to classify the categories of the five complexes in electron transport proteins. The complex annotation was retrieved from Gene Ontology, which contains descriptions of many gene products for biologists. We divided the collected protein sequences into two datasets: the training dataset and the independent test dataset. In both stages, the training dataset is used for identifying electron transport proteins and evaluating biochemical properties. The independent test dataset is used to assess the overall performance of the proposed approach. In the first stage, the training dataset consists of 1821 transport proteins and 283 electron transport proteins, and the independent test dataset consists of 456 transport proteins and 71 electron transport proteins. In the second stage, the training dataset and the independent test dataset consist of 101 and 31 electron transport proteins, respectively. The details of all datasets are listed in Tables 1 and 2. Table 3 summarizes
Table 1
Datasets for the discrimination of electron transport proteins.

                               Original data   Identity <20% data
                                               Training data   Independent data
Transport proteins             5121            1821            456
Electron transport proteins    820             283             71
Table 2
Datasets for classifying categories of 5 complexes in electron transport proteins.

               Original data   Training data   Independent data
Complex I      57              45              12
Complex II     9               5               4
Complex III    24              19              5
Complex IV     36              28              8
ATPase         6               4               2
the molecular functions for five complexes in electron transport proteins in this work.

Table 3
Molecular functions for 5 complexes in electron transport proteins.

Complex        Molecular function
Complex I      oxidoreductase activity, acting on NADH or NADPH
               NADH dehydrogenase activity
               oxidoreductase activity, acting on NADH or NADPH, quinone or similar compound as acceptor
               NADH dehydrogenase (quinone) activity
               NADH dehydrogenase (ubiquinone) activity
Complex II     succinate dehydrogenase activity
               succinate dehydrogenase (ubiquinone) activity
Complex III    ubiquinol-cytochrome-c reductase activity
               electron transporter, transferring electrons within CoQH2-cytochrome c reductase complex activity
Complex IV     cytochrome-c oxidase activity
ATPase         proton-transporting ATPase activity, rotational mechanism

Fig. 3. F-score illustration.

2.2. Compositions of amino acids and amino acid pairs
The compositions of amino acids and amino acid pairs are widely used feature sets in computational biology, and they were also used in our previous work [7,8]. This study used 20 values and 400 values (20 × 20) to represent each protein in the training data for the composition of amino acids and the composition of amino acid pairs, respectively [9,1]. The two compositions are calculated using the expressions:

$$\text{Amino acid composition}(i) = \frac{n(i) \times 100}{N} \qquad (1)$$

where n(i) is the number of residues of type i and N is the total number of residues in a protein; i ranges from 1 to 20 for the 20 amino acid residues.

$$\text{Composition of amino acid pairs}(i, j) = \frac{n(i, j) \times 100}{N} \qquad (2)$$

where n(i, j) is the number of residues of type i neighbored by a residue of type j and N is the total number of residues; i and j each range from 1 to 20, giving 400 combinations in total.

2.3. Position specific scoring matrix profiles
A PSSM is a matrix representation of an amino acid sequence. The position specific scoring matrix (PSSM) profile has been extensively used in protein secondary structure prediction, subcellular localization, classification of transporters, prediction of transport targets, and other bioinformatics problems, with considerable improvement [10–13,2,3]. We collected sequence information using PSI-BLAST and the non-redundant (NR) protein database and used it to build a PSSM for each sequence. To identify the category of electron transport proteins, we calculated the scores for each amino acid in the sequence. Grouping these scores by the 20 types of amino acids yields a 20 × 20 matrix. The central part of Fig. 1 shows how the 400 PSSM features are generated from the original PSSM profiles [3,13]. Each element of the 400-dimensional (400D) input vector was divided by the sequence length and then scaled using the expression

$$F(x) = \frac{1}{1 + \exp(-x)} \qquad (3)$$
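The feature construction of this section can be sketched in a few lines of Python. The sketch assumes the scheme commonly used in the cited work [3,13]: PSSM rows are summed according to the amino acid type of the residue at each position, which gives 20 × 20 = 400 values, and each value is divided by the sequence length and passed through Eq. (3). The function name `pssm_to_400d`, the column order in `AMINO_ACIDS`, and the handling of non-standard residues are illustrative assumptions rather than details stated in this paper.

```python
import math

# Assumed column order of a PSI-BLAST PSSM (the 20 standard amino acids).
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def sigmoid(x):
    """Eq. (3): F(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def pssm_to_400d(sequence, pssm):
    """Turn an L x 20 PSSM into a 400-dimensional feature vector.

    sequence: string of L one-letter residue codes.
    pssm: list of L rows, each holding 20 scores.
    """
    length = len(sequence)
    if length == 0 or len(pssm) != length:
        raise ValueError("sequence and PSSM must have the same non-zero length")

    # sums[a][b]: total score in column b over all positions whose residue is a.
    sums = [[0.0] * 20 for _ in range(20)]
    for residue, row in zip(sequence, pssm):
        a = AMINO_ACIDS.find(residue)
        if a < 0:
            continue  # assumption: skip non-standard residues such as 'X'
        for b, score in enumerate(row):
            sums[a][b] += score

    # Divide by the sequence length, then scale with Eq. (3).
    return [sigmoid(value / length) for row in sums for value in row]


if __name__ == "__main__":
    toy_sequence = "AAR"
    toy_pssm = [[3] * 20, [3] * 20, [0] * 20]  # made-up scores for illustration
    features = pssm_to_400d(toy_sequence, toy_pssm)
    print(len(features), round(features[0], 4))
```

Because every protein maps to the same 400 positions regardless of its length, the resulting vectors can be fed directly to a fixed-size classifier.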
2.4. F-score
In classification analysis, the F-score is a simple statistic that measures how well a feature discriminates between two sets of real numbers. The F-score of the ith feature is defined as:

$$F\text{-score}(i) = \frac{\left(\bar{x}_i^{(+)} - \bar{x}_i\right)^2 + \left(\bar{x}_i^{(-)} - \bar{x}_i\right)^2}{\dfrac{1}{n_+ - 1}\sum_{k=1}^{n_+}\left(x_{k,i}^{(+)} - \bar{x}_i^{(+)}\right)^2 + \dfrac{1}{n_- - 1}\sum_{k=1}^{n_-}\left(x_{k,i}^{(-)} - \bar{x}_i^{(-)}\right)^2} \qquad (4)$$

where $n_+$ is the number of positive instances and $n_-$ is the number of negative instances. Furthermore, $\bar{x}_i$, $\bar{x}_i^{(+)}$, and $\bar{x}_i^{(-)}$ are the averages of the ith feature over the entire, positive, and negative data sets, respectively; $x_{k,i}^{(+)}$ is the ith feature of the kth positive instance; and $x_{k,i}^{(-)}$ is the ith feature of the kth negative instance [1]. A high F-score indicates that the corresponding feature is more discriminative. Fig. 3 illustrates the F-score: the positive data set is represented by the left curve, the negative data set by the right curve, and the overall mean is the average over both data sets. The F-score is higher when the variance within each class is low (the two classes are far apart and their values do not overlap). Conversely, the F-score is lower when the variance within each class is high (the two classes are close and their values overlap).

2.5. Biochemical properties
In this study, we calculated F-score values for all biochemical properties and added selected properties to the feature sets of electron transport proteins to improve prediction performance. In total, 544 properties of amino acid residues were extracted from the AAIndex database [15]. Biochemical properties combine biology and chemistry; they describe the properties of internal chemical
processes. Biochemical properties have been widely used in published research with good results [7,13]. To investigate the contribution of the 544 properties of amino acid residues, we divided each protein into four equal parts and calculated the F-score for all of the amino acid properties based on Eq. (4). The idea is to select the region that carries the most discriminative information for amino acid residues. We observed that the highest F-score appears in the N-terminal region for almost all amino acid properties (Fig. 4a). Therefore, we conclude that the 40 N-terminal residues have more influence on the electron transport process. Consequently, we selected the residues from the region of the sequence with the higher F-score. We calculated the F-score value [14] for each biochemical property starting from the N-terminal residue with at least 10 amino acid residues. The average F-scores of the 12 top-ranking biochemical properties, computed for the N-terminal 10–150 residues, are shown in Fig. 4b.

Fig. 4. (a) Topmost F-scores of AAIndex for four parts of electron transport proteins. (b) Topmost F-scores of AAIndex for N-terminal 10–150 of electron transport proteins.

Table 4
Discrimination of electron transport proteins in transport proteins.

                   Cross-validation dataset            Independent test dataset
Feature set        Sens.   Spec.   Acc.    MCC         Sens.   Spec.   Acc.    MCC
AAC                58.3    94.7    89.8    0.55        59.2    94.5    89.8    0.55
DPC                48.1    95.2    88.9    0.48        45.1    94.5    87.9    0.44
AAC + DPC          58.7    95.1    90.2    0.56        49.3    95.0    88.8    0.48
PSSM               72.8    95.7    92.6    0.68        70.4    93.9    90.7    0.62
PSSM + AAIndex     74.6    95.8    92.9    0.70        73.2    94.1    91.3    0.64

The F-score is an established way to improve prediction performance in bioinformatics. In this work, we added the top-ranking biochemical properties one at a time to the PSSM feature set, in the order of their F-score values. If too few properties are added from the top of the ranking, the result does not improve easily because little additional information enters the feature set. On the other hand, if too many properties are added, the model also performs poorly, because properties with lower F-scores do not become significant features. We therefore kept only the properties that improved the performance in the five-fold cross-validation test. The detailed procedure is shown in the “additional biochemical properties” part of Fig. 2.
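As a minimal sketch of the ranking step described above, the following Python code computes Eq. (4) for one feature and orders candidate properties by their F-score. It assumes that each property has already been reduced to one number per protein (for example, a property averaged over a sequence region, which is an assumption here); the function names `f_score` and `rank_properties` and the property names in the example are hypothetical. The subsequent step, adding ranked properties one at a time and keeping those that improve cross-validation performance, depends on the classifier and is not shown.

```python
def f_score(positive, negative):
    """Eq. (4) for one feature, given its values in the two classes."""
    n_pos, n_neg = len(positive), len(negative)
    if n_pos < 2 or n_neg < 2:
        raise ValueError("each class needs at least two values")

    mean_pos = sum(positive) / n_pos
    mean_neg = sum(negative) / n_neg
    mean_all = (sum(positive) + sum(negative)) / (n_pos + n_neg)

    # Numerator: squared distances of the class means from the overall mean.
    numerator = (mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2

    # Denominator: sum of the sample variances within each class.
    var_pos = sum((x - mean_pos) ** 2 for x in positive) / (n_pos - 1)
    var_neg = sum((x - mean_neg) ** 2 for x in negative) / (n_neg - 1)
    denominator = var_pos + var_neg

    if denominator == 0.0:
        # Assumption: a feature that is constant within both classes is
        # treated as perfectly separating if the class means differ.
        return float("inf") if numerator > 0.0 else 0.0
    return numerator / denominator


def rank_properties(values_by_property, top_k):
    """Return the names of the top_k properties, highest F-score first.

    values_by_property maps a property name to a (positive, negative)
    pair of value lists.
    """
    scored = {
        name: f_score(pos, neg)
        for name, (pos, neg) in values_by_property.items()
    }
    return sorted(scored, key=scored.get, reverse=True)[:top_k]


if __name__ == "__main__":
    toy = {
        "PROP_A": ([1, 2, 3], [7, 8, 9]),  # well separated classes
        "PROP_B": ([1, 2, 3], [2, 3, 4]),  # overlapping classes
    }
    print(rank_properties(toy, top_k=2))
```

In the toy data, the well separated property ranks above the overlapping one, which mirrors the intuition illustrated in Fig. 3.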
2.6. Design of the radial basis function networks
We used the QuickRBF package [16] to construct RBFN classifiers. We assigned a fixed bandwidth of 5 to each kernel function in the network and used all training data as centers. Finally, the RBFN classifier assigned each protein to a category of electron transport proteins according to its output function values. We described the details of the network structure and design in our previous article [17]. RBFN-based classifiers have been used in numerous bioinformatics applications to predict cleavage sites in proteins [18], inter-residue contacts [19], and protein disorder [20]; moreover, they have been applied to discriminating β-barrel proteins [2], classifying transporters [3,13], identifying O-linked glycosylation sites [4], and identifying FAD binding sites in electron transport proteins [21]. The general mathematical form of the output nodes in an RBFN is as follows:

$$g_j(x) = \sum_{i=1}^{k} w_{ji}\,\phi\left(\lVert x - \mu_i \rVert; \sigma_i\right) \qquad (5)$$

where $g_j(x)$ is the function corresponding to the j-th output node and is a linear combination of k radial basis functions $\phi(\cdot)$ with center $\mu_i$ and bandwidth $\sigma_i$. The value of $\sigma_i$ can be estimated with data-driven methods [2,3]; we used a fixed bandwidth of 5 for each kernel function, which showed the best performance. In addition, $w_{ji}$ is the weight associated with the link between the j-th output node and the i-th hidden node.

2.7. Assessment of predictive ability
In this study, we divided the data into an independent test dataset and a training dataset, and performed five-fold cross-validation on the training dataset. The training dataset of protein sequences was divided into five equal parts; in turn, each part served as the test data while the other four served as the training data, until all five parts had been used for testing. Before using the independent dataset for the final validation experiment, we examined the training dataset through this cross-validation to check its relevance and fairness. Sensitivity, specificity, accuracy, and MCC (Matthews correlation coefficient) were used to measure the prediction performance. TP, FP, TN, and FN are the numbers of true positives, false positives, true negatives, and false negatives, respectively. These measures are threshold dependent, and we selected the threshold that optimizes the balance between sensitivity and specificity.

$$\text{Sensitivity} = \frac{TP}{TP + FN} \qquad (6)$$

$$\text{Specificity} = \frac{TN}{TN + FP} \qquad (7)$$

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \qquad (8)$$

$$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \qquad (9)$$
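Equations (5) to (9) can be made concrete with a short Python sketch. It is not the QuickRBF implementation: `rbfn_output` only evaluates Eq. (5) for given centers and weights, and it assumes a Gaussian kernel, which the text does not specify. `evaluate` computes Eqs. (6) to (9) from confusion-matrix counts. The counts in the example are not reported in the paper; they are inferred here from the reported percentages and the independent test set sizes (71 electron transport proteins, 456 other transport proteins).

```python
import math


def rbfn_output(x, centers, sigmas, weights):
    """Eq. (5): g_j(x) = sum_i w_ji * phi(||x - mu_i||; sigma_i).

    Assumption: phi is a Gaussian kernel, exp(-d^2 / (2 * sigma^2)).
    weights[j][i] links hidden node i to output node j.
    """
    activations = []
    for center, sigma in zip(centers, sigmas):
        squared_distance = sum((a - b) ** 2 for a, b in zip(x, center))
        activations.append(math.exp(-squared_distance / (2.0 * sigma ** 2)))
    return [sum(w * phi for w, phi in zip(row, activations)) for row in weights]


def evaluate(tp, fp, tn, fn):
    """Eqs. (6)-(9): sensitivity, specificity, accuracy, and MCC."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return sensitivity, specificity, accuracy, mcc


if __name__ == "__main__":
    # Toy network: two centers, fixed bandwidth of 5, identity output weights.
    outputs = rbfn_output(
        x=[0.0, 0.0],
        centers=[[0.0, 0.0], [10.0, 0.0]],
        sigmas=[5.0, 5.0],
        weights=[[1.0, 0.0], [0.0, 1.0]],
    )
    print(outputs)

    # Inferred counts (an assumption, see the lead-in): 52 of 71 positives
    # and 429 of 456 negatives classified correctly.
    sens, spec, acc, mcc = evaluate(tp=52, fp=27, tn=429, fn=19)
    print(round(100 * sens, 1), round(100 * spec, 1), round(100 * acc, 1), round(mcc, 2))
```

With the inferred counts, the four measures round to 73.2, 94.1, 91.3, and 0.64, which agrees with the independent test results reported for the PSSM with AAIndex feature set.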
3. Results and discussion
3.1. Predictive performance for identifying electron transport proteins in transport proteins with different feature sets
We evaluated several feature sets for identifying electron transport proteins in transport proteins. Table 4 shows the results using Amino Acid Composition (AAC), Dipeptide Composition (DPC), AAC with DPC, position specific scoring matrix (PSSM), and PSSM with AAIndex features. The PSSM with AAIndex feature set performed best, with a sensitivity of 74.6%, specificity of 95.8%, and accuracy of 92.9%, with an MCC of 0.70 on the cross-validation dataset. On the independent dataset, the PSSM with AAIndex feature set achieved a sensitivity of 73.2%, specificity of 94.1%, and accuracy of 91.3%, with an MCC of 0.64. Consequently, we are able to identify electron transport proteins among transport proteins from PSSM and AAIndex features with high accuracy.

3.2. Comparison of the performance for identifying electron transport proteins in transport proteins with different classifiers
The discrimination of electron transport proteins in transport proteins using different classifiers is shown in Table 5. We compared several classifiers, i.e., Decision Trees (J48), Naïve Bayes, KNN, Random Forest, and QuickRBF. The results show that the PSSM with AAIndex feature set using QuickRBF as the classifier gives better overall results than the other classifiers.

3.3. Predictive performance for identifying the category of electron transport proteins
Table 6 shows the performance for identifying the five complexes in electron transport proteins with PSSM and PSSM with
Table 5
Discrimination of electron transport proteins in transport proteins for different classifiers.

PSSM feature
                        Cross-validation                    Independent test
Classifier              Sens.   Spec.   Acc.    MCC         Sens.   Spec.   Acc.    MCC
Decision Trees (J48)    49.5    92.4    86.6    0.42        53.5    91.2    86.1    0.43
Naïve Bayes             68.6    82.1    80.3    0.40        69.0    81.6    79.9    0.40
KNN                     61.8    95.5    91.0    0.60        59.2    94.3    89.6    0.54
Random Forest           53.7    97.5    91.6    0.60        56.3    96.7    91.3    0.59
QuickRBF                72.8    95.7    92.6    0.68        70.4    93.9    90.7    0.62

PSSM+AAIndex feature
                        Cross-validation                    Independent test
Classifier              Sens.   Spec.   Acc.    MCC         Sens.   Spec.   Acc.    MCC
Decision Trees (J48)    50.9    92.5    86.9    0.44        53.5    91.2    86.1    0.43
Naïve Bayes             69.6    82.7    80.9    0.42        69.0    81.6    79.9    0.40
KNN                     64.7    95.4    91.3    0.62        60.6    94.1    89.6    0.55
Random Forest           56.5    97.7    92.2    0.63        60.6    97.1    92.2    0.64
QuickRBF                74.6    95.8    92.9    0.70        73.2    94.1    91.3    0.64
AAIndex feature set using QuickRBF as the classifier. The results show that PSSM with AAIndex properties gives better overall results than the PSSM feature set alone. PSSM with AAIndex properties achieved MCCs of 0.56, 0.66, 0.49, 0.44, and 0.81 for Complexes I–V, respectively, on the cross-validation dataset, and MCCs of 0.51, 0.47, 0.42, 0.74, and 1.00, respectively, on the independent dataset.

3.4. The statistical analysis in electron transporters and transport proteins
This part presents a statistical analysis of electron transport proteins and transport proteins, and shows how electron transport proteins can be distinguished from transport proteins. First, we calculated the amino acid composition of electron transport proteins and transport proteins (Fig. 5a). The F-score of the amino acid composition was then calculated and is shown in Fig. 5b. The highest F-scores appeared for the amino acids Ala, Pro, Gln, Ser, and Trp. For amino acid pairs, the highest F-scores appeared for NS, DS, LQ, SS, and SQ. Finally, we calculated the F-scores for the biochemical properties; the topmost scores are shown in Table 7. These properties were added to the PSSM profiles as feature sets for prediction, and they played an important role in improving the performance of identifying electron transport proteins from transport proteins.

To analyse the topmost F-scores of AAIndex for electron transport proteins, we divided each protein into four parts and also considered the N-terminal 10–150 residues. Fig. 4a shows the topmost F-scores of AAIndex for the four parts of electron transport proteins; the topmost F-scores appear mainly in the first and second parts (1–40%). Fig. 4b shows the topmost F-scores of AAIndex for N-terminal 10–150 of electron transport proteins; the topmost F-scores are in the N-terminal 40 residues. Fig. 6a shows the amino acid composition of the 40 N-terminal residues in electron transport proteins and transport proteins, and Fig. 6b shows the corresponding F-scores. The amino acid composition of the two kinds of proteins differs markedly within the 40 N-terminal residues, whereas the difference over the whole protein is small. The overall amino acid compositions are similar, which means that the two kinds of proteins are nearly the same in their physical and chemical properties; this is understandable because both belong to the transport proteins. Consequently, any physical or chemical property computed over the whole protein will be quite similar and will have little effect on the classification. Although the structures are similar, differences remain, e.g., in the position and function of transport. F-score analysis is therefore used to find the higher-scoring properties that can improve the prediction performance. This explains why property analysis and prediction on the 40 N-terminal residues give better results than on the entire protein.

3.5. The statistical analysis for identifying the category of electron transport proteins
The second experiment was designed to find effective properties for predicting which of the five electron transport complexes an unknown protein belongs to. We used the PSSM feature as the main attribute and an additional 544 AAIndex physical and chemical properties to enhance the performance. After calculating the F-scores, we found an effective fraction
Fig. 5. (a) Amino acid composition and (b) F-score of amino acid composition in electron transport proteins and transport proteins.

Table 6
Discrimination of five complexes in electron transport proteins.

                                     Accuracy                              MCC
Dataset             Feature set      C1     C2     C3     C4     C5        C1     C2     C3     C4     C5
Cross validation    PSSM             72.9   66.7   50.0   61.5   50.0      0.54   0.50   0.40   0.44   0.48
                    PSSM+AAIndex     73.5   75.0   62.5   61.5   66.7      0.56   0.66   0.49   0.44   0.81
Independent         PSSM             54.5   100    50.0   80.0   100       0.51   0.47   0.24   0.54   0.70
                    PSSM+AAIndex     54.5   100    100    100    100       0.51   0.47   0.42   0.74   1.00

C1: Complex I, C2: Complex II, C3: Complex III, C4: Complex IV, C5: ATPase.

Fig. 6. (a) Amino acid composition and (b) F-score for 40 residues of N-terminal in electron transport proteins and transport proteins.
in the 20 residues of the C-terminal. Physical and chemical properties computed in this region were added as attributes, and they improved the results. This part presents a statistical analysis for classifying the categories of electron transport proteins. First, we calculated the amino acid composition for the 5 complexes in electron transport proteins (Fig. 7a). The F-score of the amino acid composition was then calculated and is shown in Fig. 7b. The highest F-scores for classifying the categories of electron transport proteins appeared for the amino acids Glu, His, Ile, Met, and Trp. Next, we also analysed the composition of amino acid pairs; the topmost F-scores are listed in Table 8. The highest F-scores for classifying the categories of electron transport proteins appeared for the amino acid pairs SE, CC, HR, KI, and LM. Finally, we calculated the F-scores for the biochemical properties; the topmost scores are also shown in Table 8. These properties were added to the PSSM profiles as feature sets for prediction, and they played an important role in improving the performance of classifying the categories of electron transport proteins.
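The composition features analysed in this part, defined in Eqs. (1) and (2), can be computed with a short Python sketch. The function names are illustrative, and two points are assumptions: "neighbored" is read as "immediately followed by" (ordered pairs), and pairs containing non-standard residues are ignored. Both compositions are divided by N, the total number of residues, as the equations are written.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def amino_acid_composition(sequence):
    """Eq. (1): n(i) * 100 / N for each of the 20 amino acids."""
    total = len(sequence)
    if total == 0:
        raise ValueError("sequence must not be empty")
    return {aa: sequence.count(aa) * 100.0 / total for aa in AMINO_ACIDS}


def amino_acid_pair_composition(sequence):
    """Eq. (2): n(i, j) * 100 / N for each of the 400 ordered pairs."""
    total = len(sequence)
    if total == 0:
        raise ValueError("sequence must not be empty")
    counts = {a + b: 0 for a in AMINO_ACIDS for b in AMINO_ACIDS}
    # Walk over adjacent residues: (s[0], s[1]), (s[1], s[2]), ...
    for first, second in zip(sequence, sequence[1:]):
        pair = first + second
        if pair in counts:  # assumption: ignore non-standard residues
            counts[pair] += 1
    return {pair: n * 100.0 / total for pair, n in counts.items()}


if __name__ == "__main__":
    toy_sequence = "MKVLAAGIVK"  # made-up sequence for illustration
    aac = amino_acid_composition(toy_sequence)
    dpc = amino_acid_pair_composition(toy_sequence)
    print(len(aac), len(dpc), aac["K"], dpc["AA"])
```

To restrict the analysis to a region such as the 20 C-terminal residues, the same functions can be applied to a slice of the sequence, for example `sequence[-20:]`.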
Fig. 7. (a) Amino acid composition and (b) F-score in 5 complexes in electron transport proteins.

Table 7
Properties with topmost F-scores for discriminating electron transport proteins in transport proteins.

AAC    F-score    DPC    F-score    AAIndex       F-score
S      0.22       NS     0.10       PARS000102    0.19
W      0.12       DS     0.10       KOEP990102    0.16
Q      0.12       LQ     0.08       ZASB820101    0.15
A      0.07       SS     0.08       AURR980104    0.15
P      0.05       SQ     0.08       PALJ810112    0.15
To analyse the topmost F-scores of AAIndex for the five complexes of electron transport proteins, we divided each protein into four parts and also considered the C-terminal 10–150 residues. Fig. 8a shows the topmost F-scores of AAIndex for the four parts of the five complexes in electron transport proteins; the topmost F-scores appear mainly in the last part (70–100%). Fig. 8b shows the topmost F-scores of AAIndex for C-terminal 10–150 of electron transport proteins; the topmost F-scores are in the C-terminal 20 residues. Fig. 9a shows the amino acid composition of the 20 C-terminal residues for the five complexes in electron transport proteins, and Fig. 9b shows the corresponding F-scores.

Fig. 8. Topmost F-scores of AAIndex for four parts and C-terminal 10–150 of 5 complexes in electron transport proteins.

Fig. 9. (a) Amino acid composition and (b) F-score for 20 residues of C-terminal in 5 complexes in electron transport proteins.

The amino acid composition of the five complexes differs markedly within the 20 C-terminal residues, whereas the difference over the whole protein is small. The overall amino acid compositions of almost all complexes are similar, which means that they are nearly the same in their physical and chemical properties; this is understandable because they all belong to the electron transport proteins. Consequently, any physical or chemical property computed over the whole protein will be quite similar and will have little effect on the classification. Although the structures are similar, there are significant differences in the proportion and variance of the amino acid composition. F-score analysis is therefore used to find the higher-scoring properties that can improve the prediction performance. This explains why property analysis and prediction on the 20 C-terminal residues give better results than on the entire protein.

Finally, from Tables 9 and 10, we can see the differences between the biochemical properties added for the two problems in
Table 8
Properties with topmost F-scores for classifying categories of electron transport proteins.

AAC    F-score    DPC    F-score    AAIndex       F-score
W      0.64       SE     0.51       KUMS000102    1.44
E      0.38       CC     0.46       KUMS000101    1.43
H      0.33       HR     0.44       KUMS000104    1.41
M      0.30       KI     0.42       FUKS010110    1.33
I      0.25       LM     0.36       KUMS000103    1.08

Table 9
Properties used for classification based on forward feature selection.

Problem                                                                     AAIndex
Classification of electron transport proteins in transport proteins         PARS000102, KOEP990102
Classification of 5 complexes in electron transport proteins                KUMS000102

Table 10
Description of biochemical properties.

PARS000102    p-Values of thermophilic proteins based on the distributions of B values
KOEP990102    Beta-sheet propensity derived from designed sequences
KUMS000102    Distribution of amino acid residues in the 18 non-redundant families of mesophilic proteins
this study. In the first problem, the two selected properties are PARS000102 (p-values of thermophilic proteins based on the distributions of B values) and KOEP990102 (beta-sheet propensity derived from designed sequences). In the second problem, KUMS000102 (distribution of amino acid residues in the 18 non-redundant families of mesophilic proteins) is used to identify the categories of electron transport proteins. This shows that the best biochemical properties must be selected separately for each problem to enhance the prediction. It also shows that the F-score is very important in this method: without choosing the best biochemical properties by F-score calculation, the prediction performance could not be improved and a precise model could not be obtained.

3.6. Identification and classification of new electron transport proteins in transport proteins

In this part, we applied our method to predict electron transport proteins among transport proteins and to classify them into the five complexes. The testing dataset was retrieved from UniProt, a widely used protein database. We retrieved only transport proteins that do not have an electron transport annotation. After using BLAST to remove sequences with more than 20% similarity, the remaining dataset contained 500 proteins. Our model identified 102 electron transport proteins in this dataset and classified 66 of them into complex I, three into complex II, 17 into complex III, 15 into complex IV, and one into complex V. Thus, our research can help biologists discover new electron transport proteins among transport proteins and classify them.

4. Conclusion

We have built a consistent framework for identifying the categories of electron transport proteins using different features such as amino acid composition, dipeptide composition, PSSM profiles, and biochemical properties selected by F-score.
The performance was evaluated using 5-fold cross-validation and independent datasets with a radial basis function network. Our method achieved 5-fold cross-validation MCCs of 0.56, 0.66, 0.49, 0.44, and 0.81 for identifying each category of electron transport proteins, respectively. The MCCs on the independent datasets are 0.51, 0.47, 0.42, 0.74, and 1.00. With this study, we also provide a powerful model for determining which category of electron transport proteins a new protein belongs to.
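For readers who wish to reproduce the kind of per-category MCC values reported above, the following minimal Python sketch computes the Matthews correlation coefficient for one category against all others. The one-vs-rest treatment of the five complexes is an assumption of this sketch, and the function names and toy labels are hypothetical; they are not taken from the study's data.

```python
import math


def one_vs_rest_counts(true_labels, predicted_labels, target):
    """Confusion-matrix counts (tp, fp, tn, fn) for one category against the rest."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == target and pred == target:
            tp += 1
        elif truth != target and pred == target:
            fp += 1
        elif truth == target and pred != target:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # When any marginal sum is zero the coefficient is undefined;
    # returning 0.0 is a common convention, used here as an assumption.
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


if __name__ == "__main__":
    # Toy labels invented for illustration only.
    truth = ["I", "I", "II", "III", "I", "II"]
    predicted = ["I", "II", "II", "III", "I", "II"]
    for complex_name in ("I", "II", "III"):
        counts = one_vs_rest_counts(truth, predicted, complex_name)
        print(complex_name, round(mcc(*counts), 2))
```

An MCC of 1 corresponds to a category in which every protein is predicted correctly with no false positives, 0 to a prediction no better than chance, and negative values to systematic disagreement.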
Acknowledgement

This research is partially supported by Ministry of Science and Technology, Taiwan, R.O.C. under Grant no. MOST 104-2221-E-155037 and 105-2221-E-155-065.

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.jmgm.2017.01.003.