Identify Gram-negative bacterial secreted protein types by incorporating different modes of PSSM into Chou’s general PseAAC via Kullback–Leibler divergence


Accepted Manuscript

Yunyun Liang, Shengli Zhang

PII: S0022-5193(18)30282-0
DOI: 10.1016/j.jtbi.2018.05.035
Reference: YJTBI 9491

To appear in: Journal of Theoretical Biology

Received date: 10 April 2018
Revised date: 19 May 2018
Accepted date: 29 May 2018

Please cite this article as: Yunyun Liang, Shengli Zhang, Identify Gram-negative bacterial secreted protein types by incorporating different modes of PSSM into Chou’s general PseAAC via Kullback-Leibler divergence, Journal of Theoretical Biology (2018), doi: 10.1016/j.jtbi.2018.05.035

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Highlights

• A feature design model named ACCP-KL-NMF is proposed based on PSSM.

• Nonnegative matrix factorization based on Kullback-Leibler divergence is successfully applied to identify Gram-negative bacterial secreted protein types.

• The ACCP-KL-NMF model achieves satisfactory performance.


Yunyun Liang a,∗, Shengli Zhang b,†

a School of Science, Xi’an Polytechnic University, Xi’an 710048, P. R. China

b School of Mathematics and Statistics, Xidian University, Xi’an 710071, P. R. China

Abstract. Gram-negative bacterial secreted proteins are crucial for bacterial pathogenesis because they enable bacteria to interact with their environments. Identification of bacterial secreted proteins is therefore a significant step in research on various diseases and the corresponding drugs. In this paper, we develop a feature design model named ACCP-KL-NMF by fusing PSSM-based auto-cross correlation analysis for feature extraction with a nonnegative matrix factorization algorithm based on Kullback-Leibler divergence for dimensionality reduction. In this way, a 150-dimensional feature vector is constructed on the training set. A support vector machine is then adopted as the classifier, and the jackknife test, regarded as the most objective, is chosen for evaluating the accuracy. The ACCP-KL-NMF model achieves a satisfactory overall accuracy on the test set and outperforms three existing models. The numerical experimental results show that our model is effective and reliable for identifying Gram-negative bacterial secreted protein types. Moreover, it is anticipated that the proposed model could be beneficial for other biological sequences in future research.

Key words: Secreted proteins; Position-specific scoring matrix; Correlation analysis; Nonnegative matrix factorization; Support vector machine

∗ Corresponding author. Tel./Fax: +86-29-83116360. E-mail address: [email protected]

† Corresponding author. Tel./Fax: +86-29-88202860. E-mail address: [email protected]

1 Introduction

Protein secretion, which occurs in all organisms, plays an important role in maintaining the normal physiological activities of cells. All proteins that are synthesized in the cell and then transported to other organelles, to other locations within the cell, or to the extracellular environment are collectively referred to as secreted proteins. Gram-negative bacteria have a double-membrane envelope that encloses the periplasmic space and a peptidoglycan layer between the two lipid bilayers. Bacterial organisms have evolved dedicated secretion systems that aid in the transport of polypeptides across their outer membrane. Secretion systems in Gram-negative bacteria secrete a wide range of proteins across the cell membrane, such as those involved in the biogenesis of pili and flagella, nutrient acquisition, virulence, and efflux of drugs and other toxins [1, 2]. They release these secreted proteins into the extracellular environment or inject them directly into eukaryotic hosts in order to survive, multiply and proliferate in the host. This process can result in infection of human cells and eventually lead to various diseases in the human body [3–5]. Therefore, an in-depth study of the secreted proteins of Gram-negative bacteria not only helps to understand, analyze and interpret the secretion mechanism of proteins and various physiological and pathological phenomena, but also provides more reference for disease diagnosis and treatment and for the research and development of new drugs [6, 7].

On the basis of the molecular nature of the transport machineries and their catalyzed reactions, eight secretion systems have so far been discovered in Gram-negative bacterial cells, designated type I secretion system (T1SS) to type VIII secretion system (T8SS) [8]. Accordingly, the secreted proteins released from T1SS to T8SS are defined as type I secreted proteins (T1SP) to type VIII secreted proteins (T8SP), respectively. Based on the presence or absence of N-terminal signal peptides [9], secreted proteins can be classified into two categories: classically secreted proteins, which include T2SP, T5SP, T7SP and T8SP, and non-classically secreted proteins, which include T1SP, T3SP, T4SP and T6SP. Secreted proteins are helpful for understanding the pathogenic mechanism of bacteria; however, biologists have mainly focused on predicting the subcellular localization of Gram-negative bacterial proteins [10–12, 81] and on the identification and classification of proteins involved in bacterial secretion systems [1, 2, 4]. Very little attention has been paid to the identification of Gram-negative bacterial secreted protein types [13].

In 2013, Yu et al. [14] were the first to pay close attention to the different types of Gram-negative bacterial secreted proteins and to analyze the relationships and differences among them. They established a standard dataset and proposed an in silico identification method named SecretP to distinguish different types of secreted proteins from the primary sequence. Owing to the small numbers and high sequence similarity of T6SP and T8SP, this standard dataset includes only six types: T1SP, T2SP, T3SP, T4SP, T5SP and T7SP, and is divided into two parts: the training set and the test set. The prediction accuracies on the test set reach 86.05% and 90.12% using the one-to-rest strategy and the one-to-one strategy for the SVM, respectively. Subsequently, Ding and Zhang [15] developed a novel prediction method by combining the long-range and linear correlation information from the position-specific score matrix (PSSM) with a filter feature selection method; the prediction accuracy on the test set reaches 93.60% with the SVM classifier and the jackknife test. Although the accuracy has been greatly improved, there is still room for further improvement through more effective prediction models.

In this paper, we focus on developing a feature design method that combines correlation analysis of the PSSM for feature extraction with nonnegative matrix factorization (NMF) based on Kullback-Leibler (KL) divergence for feature dimension reduction, in order to identify Gram-negative bacterial secreted protein types. Firstly, using the auto-cross correlation analysis approach, we construct a 4000-dimensional nonnegative feature vector, which is too large to input into the SVM classifier: such a large dimension introduces redundancy and increases computational complexity. Hence, a suitable dimension reduction method is very important. The NMF, originally proposed by Lee and Seung [16] and applied to decompose facial images and derive parts-based representations of whole images, is an effective algorithm for reducing the dimensionality of extracted features. It is a matrix factorization technique that produces a useful decomposition in the analysis of data and results in a reduced representation of the original data. Then, with the help of the NMF based on KL divergence, a 150-dimensional feature vector is obtained for the SVM based on the training set from the standard dataset constructed by Yu et al. [14]. Finally, the bias-free jackknife cross-validation test is employed on the test set to evaluate our predictor. The experimental results show that our feature design method is effective and that the NMF is a powerful data mining and dimensionality reduction algorithm.

To develop a solid sequence-based statistical predictor for a biological system, as reported in a series of recent publications [17–29], one should observe Chou’s 5-step rule [30], i.e., make the following five steps very clear: (i) how to construct or select a valid benchmark dataset to train and test the predictor; (ii) how to formulate the biological sequence samples with an effective mathematical expression that can truly reflect their intrinsic correlation with the target to be predicted; (iii) how to introduce or develop a powerful algorithm (or engine) to operate the prediction; (iv) how to properly perform cross-validation tests to objectively evaluate the anticipated accuracy of the predictor; (v) how to establish a user-friendly web-server for the predictor that is accessible to the public. Below, we describe how to deal with these steps one by one.

2 Materials and methods

2.1 Datasets

To test the current method rigorously and to facilitate comparison with previous works, we adopt the standard dataset constructed by Yu et al. [14] in 2013, which was obtained by searching the Swiss-Prot [31], TrEMBL [32] and RefSeq [33] databases. This standard dataset contains six subsets, for T1SP, T2SP, T3SP, T4SP, T5SP and T7SP, and is divided into two parts: the training set and the test set. More details about the training set and the test set are listed in Table 1.

2.2 Feature design

Feature design is critical for successful identification of Gram-negative bacterial secreted protein types. It mainly includes two steps. The first step is feature extraction, which formulates the protein samples with an effective mathematical expression that truly reflects their intrinsic correlation. The second step is feature reduction, which can simplify the model, shorten the training time, avoid the curse of dimensionality, and enhance the generalization ability of the model. In this paper, we extract the sequence-order information with the auto-cross correlation analysis based on PSSM, and adopt the nonnegative matrix factorization algorithm based on Kullback-Leibler divergence for dimensionality reduction. Then, we fuse this novel feature information into the general pseudo amino acid composition.

2.2.1 General pseudo amino acid composition

With the explosive growth of biological sequences in the post-genomic era, one of the most important but also most difficult problems in computational biology is how to express a biological sequence with a discrete model or a vector while still keeping considerable sequence-order information or key pattern characteristics. This is because all the existing machine-learning algorithms can only handle vectors, not sequence samples, as elucidated in a comprehensive review [34]. However, a vector defined in a discrete model may completely lose all the sequence-pattern information. To avoid this for proteins, the pseudo amino acid composition [35], or PseAAC [36], was proposed. Ever since the concept of Chou’s PseAAC was proposed, it has been widely used in nearly all areas of computational proteomics (see, e.g., [37–39] as well as a long list of references cited in [40]). Because it has been widely and increasingly used, three powerful open-access software packages, called ‘PseAAC-Builder’, ‘propy’, and ‘PseAAC-General’, were recently established: the former two generate various modes of Chou’s special PseAAC, while the third generates those of Chou’s general PseAAC [30], including not only all the special modes of feature vectors for proteins but also higher-level feature vectors such as the “Functional Domain” mode (see Eqs. 9-10 of [30]), the “Gene Ontology” mode (see Eqs. 11-12 of [30]), and the “Sequential Evolution” or “PSSM” mode (see Eqs. 13-14 of [30]). Encouraged by the successes of using PseAAC to deal with protein/peptide sequences, the concept of PseKNC (Pseudo K-tuple Nucleotide Composition) [41] was developed for generating various feature vectors for DNA/RNA sequences [42, 43], which have proved very useful as well. In particular, a very powerful web-server called ‘Pse-in-One’ [44] and its updated version ‘Pse-in-One2.0’ [45] have recently been established; they can be used to generate any desired feature vectors for protein/peptide and DNA/RNA sequences according to the needs of users’ studies.

In the current study, we use the PSSM and auto-cross PSSM to define Chou’s pseudo components for analyzing Gram-negative bacterial secreted protein types. According to Eq. (6) of Chou (2011) [30], the feature vector for any protein, peptide or biological sequence is just the general form of PseAAC [35], which can be formulated as

\[ P = (\psi_1, \psi_2, \cdots, \psi_\mu, \cdots, \psi_\Omega)^T, \tag{2.1} \]

where the subscript $\Omega$ is an integer reflecting the feature vector’s dimension; its value, as well as the components $\psi_1, \psi_2, \cdots$, depends on how the desired feature information is extracted from the statistical samples concerned. Here, we use the auto-cross correlation analysis based on PSSM and the nonnegative matrix factorization algorithm for feature design, and the final dimension is $\Omega = 150$.

2.2.2 Evolutionary information via position-specific scoring matrix

To reflect the evolutionary information, we use each secreted protein sequence $P$ from both the training set and the test set as a seed to search and align homologous sequences from the Swiss-Prot dataset using the PSI-BLAST program [46] with parameters $h = 10^{-6}$ and $j = 3$. PSI-BLAST returns a position-specific scoring matrix (PSSM) whose $(i, j)$th entry represents the score of the amino acid residue in the $i$th position of the secreted protein sequence being mutated to amino acid type $j$ during the biological evolution process. To reduce bias and noise, we further normalize each element of the original PSSM using the sigmoid function

\[ f(s) = \frac{1}{1 + e^{-s}}, \tag{2.2} \]

where $s$ is the original PSSM score. We denote the PSSM as

\[ P_{PSSM} = (P_1, P_2, \cdots, P_j, \cdots, P_{20}), \tag{2.3} \]

where $P_j = (P_{1,j}, P_{2,j}, \cdots, P_{L,j})^T$ $(j = 1, 2, \cdots, 20)$, $L$ is the length of the secreted protein sequence $P$, 20 corresponds to the 20 native amino acids, and $T$ is the transpose operator.
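The normalization of Eq. (2.2) can be illustrated with a minimal sketch. It assumes the raw PSSM has already been parsed into an $L \times 20$ numeric array; the function name `normalize_pssm` and the toy scores are hypothetical and are not taken from the paper.

```python
import numpy as np

def normalize_pssm(raw_pssm):
    """Apply the sigmoid of Eq. (2.2) to every entry of an L x 20 PSSM."""
    raw = np.asarray(raw_pssm, dtype=float)
    if raw.ndim != 2 or raw.shape[1] != 20:
        raise ValueError("expected an L x 20 matrix of PSSM scores")
    return 1.0 / (1.0 + np.exp(-raw))

# Toy 3-residue "PSSM" with made-up scores (illustration only).
toy_pssm = np.array([[-3] * 20, [0] * 20, [5] * 20])
normalized = normalize_pssm(toy_pssm)
```

Every normalized entry lies strictly between 0 and 1, so the matrix is nonnegative, which is what the later factorization step requires.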

2.2.3 Auto-cross correlation analysis based on PSSM

A protein sequence can be regarded as a time series of the corresponding physicochemical properties. Here, we use the evolutionary information represented in the form of the PSSM as the property under consideration. Each protein sequence yields one PSSM, and each column of the PSSM can be seen as a time series for one property. Since each PSSM contains 20 columns, the PSSM can be seen as 20 time series corresponding to these 20 properties. However, different protein sequences have different lengths, so the number of rows of the corresponding PSSMs also differs. In order to convert PSSMs with different numbers of rows into feature vectors of the same dimension, we define the auto correlation function and the cross correlation function on the PSSM to obtain a consistent representation.

The auto correlation descriptor, proposed by Zhang et al. [47], is a powerful statistical method for describing the distribution of an amino acid index. It converts a protein sequence into a numerical sequence in which each value represents the corresponding amino acid index. The auto correlation factors, which serve as the extracted features, are then calculated by the auto correlation descriptor on this numerical sequence. Amino acid indices [48] include the hydrophobicity scale, the average flexibility index, the polarizability parameter, the relative mutability, the residue accessible surface area, the amino acid residue volume, etc. This method has been successfully applied in the field of protein structural class prediction [49].

The auto correlation only measures the correlation of the same property between two amino acid residues separated by a distance of lag along a protein sequence; the correlation between different properties is missing. Hence, the cross correlation descriptor is defined based on PSSM. Combined with the auto correlation descriptor, it can be expressed by

\[ C^{lag}_{j_1,j_2} = \frac{1}{L - lag} \sum_{i=1}^{L-lag} P_{i,j_1} \times P_{i+lag,j_2}, \quad (j_1, j_2 = 1, 2, \cdots, 20;\ 0 < lag < L), \tag{2.4} \]

where $i$ represents the $i$th row of the PSSM, and $j_1$ and $j_2$ represent the $j_1$th and the $j_2$th columns of the PSSM, respectively. When $j_1 = j_2$, $C^{lag}_{j_1,j_2}$ represents the auto correlation factor of the same amino acid type $j_1$; when $j_1 \neq j_2$, $C^{lag}_{j_1,j_2}$ represents the cross correlation factor of the two different amino acid types $j_1$ and $j_2$. Thus, auto-cross correlation analysis based on PSSM (ACCP) reflects the evolutionary sequence-order information. Let $lg$ be the maximum value of the lag; a protein sequence is then represented by a $(lg \times 400)$-dimensional feature vector. Here, we let $lg = 10$, so that 4000 ACCP features are obtained, comprising 200 PSSM-based auto correlation (ACP) features and 3800 PSSM-based cross correlation (CCP) features.

However, a 4000-dimensional (4000D) feature vector for each secreted protein sequence is too large to input into the classifier. Such a large dimension can lead to three problems: over-fitting, information noise and high computational complexity. Hence, dimensionality reduction plays a critical role in the classification task.
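As an illustration of Eq. (2.4), the following minimal sketch computes the ACCP features for one normalized PSSM. The function name `accp_features` and the ordering of the features in the output vector (lag by lag, then row-major over $j_1, j_2$) are assumptions made here for illustration; the paper does not specify an ordering.

```python
import numpy as np

def accp_features(pssm, max_lag=10):
    """Auto-cross correlation features of Eq. (2.4) for one L x 20 PSSM.

    Returns a vector of length max_lag * 400 (4000 when max_lag = 10).
    """
    P = np.asarray(pssm, dtype=float)
    L = P.shape[0]
    if max_lag >= L:
        raise ValueError("the sequence must be longer than max_lag")
    blocks = []
    for lag in range(1, max_lag + 1):
        # C[j1, j2] = 1/(L - lag) * sum_i P[i, j1] * P[i + lag, j2]
        C = P[:L - lag].T @ P[lag:] / (L - lag)
        blocks.append(C.ravel())
    return np.concatenate(blocks)

# Toy input: a random matrix standing in for a normalized PSSM of length 50.
rng = np.random.default_rng(0)
toy = rng.random((50, 20))
features = accp_features(toy, max_lag=10)
```

For each lag, the 20 diagonal entries of `C` are the auto correlation (ACP) factors and the 380 off-diagonal entries are the cross correlation (CCP) factors, which gives the 200 + 3800 split described above when the maximum lag is 10.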

2.3 Nonnegative matrix factorization algorithm

Nonnegative matrix factorization (NMF) is a data mining algorithm originally proposed by Lee and Seung in 1999 [16] to analyze facial images, and it can also be applied to reduce dimensionality. The NMF can be described as follows [16, 50]:

\[ V \approx WH, \tag{2.5} \]

where $V$ is a nonnegative matrix of size $m \times n$, and $m$ and $n$ represent the number of samples and the number of features, respectively. $V$ is approximately decomposed into a nonnegative matrix $W$ of size $m \times r$ and a nonnegative matrix $H$ of size $r \times n$. $W$ represents the reduced $r$ basis vectors of the factorization, and $H$ represents the coefficients of the linear combination of basis vectors, also called the coding vectors. The factorization rank $r$ satisfies $r < mn/(m + n)$, so that $W$ and $H$ together are smaller than the original matrix $V$. Therefore, the NMF can be regarded as a way to compress the original data $V$. The NMF can represent a large number of original data vectors with a small number of basis vectors, and proper basis vectors approximate the original data vectors well and achieve the goal of dimensionality reduction.

In order to obtain the approximate decomposition, a reasonable objective function is needed to measure how closely the product $WH$ approximates the original matrix $V$. At present, there are two commonly used objective functions, the Euclidean distance and the generalized Kullback-Leibler divergence (KL divergence, also called relative entropy), which are defined respectively by [50]:

\[ D(V \| WH) = \| V - WH \|^2 = \sum_{ij} \left( V_{ij} - (WH)_{ij} \right)^2, \tag{2.6} \]

and

\[ D_{KL}(V \| WH) = \sum_{ij} \left( V_{ij} \log \frac{V_{ij}}{(WH)_{ij}} - V_{ij} + (WH)_{ij} \right), \tag{2.7} \]

where $i = 1, 2, \cdots, m$ and $j = 1, 2, \cdots, n$; $m$ and $n$ are the numbers of rows and columns of $V$, that is, the numbers of samples and features, respectively. Here, the generalized KL divergence is adopted to construct the objective function. Thus, the NMF based on the KL divergence, denoted by KL-NMF, can be transformed into the following constrained optimization problem:

\[ \begin{cases} \min D_{KL}(V \| WH), \\ \text{s.t. } W, H \geq 0. \end{cases} \tag{2.8} \]

Lee and Seung solve this constrained optimization problem with multiplicative update rules, a compromise between the gradient descent method and the conjugate gradient method. The implementation steps of the NMF algorithm based on KL divergence are as follows:

Step 1. Initialize $W$ and $H$ with positive random numbers.

Step 2. Compute the new basis matrix $W$ by the update rule:

\[ W_{ik} \leftarrow W_{ik} \frac{\sum_j \left[ H_{kj} V_{ij} / (WH)_{ij} \right]}{\sum_j H_{kj}}. \tag{2.9} \]

Step 3. Normalize the columns of $W$.

Step 4. Compute the new coefficient matrix $H$ by the update rule:

\[ H_{kj} \leftarrow H_{kj} \frac{\sum_i \left[ W_{ik} V_{ij} / (WH)_{ij} \right]}{\sum_i W_{ik}}. \tag{2.10} \]

Step 5. Check whether the terminating condition (the maximum number of iterations) is met. If so, stop and output the basis matrix $W$ and the coefficient matrix $H$; otherwise, return to Step 2.

The biggest difference between the NMF and other types of matrix decomposition methods is that the basis vectors and the coding vectors are both constrained to be nonnegative. From a geometric perspective, the NMF projects the original data onto the subspace spanned by the basis vectors $W$. From an algebraic perspective, the NMF is an algebraic expression for excavating the intrinsic characteristics of nonnegative data. From the perspective of feature reduction, the matrix $W$ obtained by the NMF is substituted for the original matrix $V$; therefore, the NMF can be regarded as a reduction method based on feature transformation.
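Steps 1 to 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the random initialization range, the small constant `eps` that guards against division by zero, and the choice to rescale the rows of $H$ when the columns of $W$ are normalized (so that the product $WH$ is unchanged by Step 3) are all assumptions introduced here.

```python
import numpy as np

def kl_divergence(V, WH, eps=1e-12):
    """Generalized KL divergence of Eq. (2.7), with eps guarding log and division."""
    return float(np.sum(V * np.log((V + eps) / (WH + eps)) - V + WH))

def kl_nmf(V, r, n_iter=200, seed=0, eps=1e-12):
    """Multiplicative-update KL-NMF following Steps 1-5 (illustrative sketch)."""
    V = np.asarray(V, dtype=float)
    m, n = V.shape
    rng = np.random.default_rng(seed)
    W = rng.random((m, r)) + 0.1          # Step 1: positive random initialization
    H = rng.random((r, n)) + 0.1
    history = []
    for _ in range(n_iter):               # Step 5: fixed maximum number of iterations
        ratio = V / (W @ H + eps)
        W *= (ratio @ H.T) / (H.sum(axis=1) + eps)            # Step 2: Eq. (2.9)
        scale = W.sum(axis=0) + eps
        W /= scale                         # Step 3: columns of W sum to one
        H *= scale[:, None]                # assumption: compensate so W @ H is unchanged
        ratio = V / (W @ H + eps)
        H *= (W.T @ ratio) / (W.sum(axis=0)[:, None] + eps)   # Step 4: Eq. (2.10)
        history.append(kl_divergence(V, W @ H, eps))
    return W, H, history

# Toy nonnegative data: 30 samples with 20 features, reduced to rank 5.
rng = np.random.default_rng(1)
V_toy = rng.random((30, 20))
W_toy, H_toy, history = kl_nmf(V_toy, r=5)
```

In this sketch each row of `W_toy` plays the role of the reduced feature vector of one sample, matching the description of $W$ as the substitute for $V$. The `history` list records the objective of Eq. (2.7) after every iteration and can be used to check that it does not increase.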

2.4 Support vector machine

The support vector machine (SVM) is based on the structural risk minimization principle from statistical learning theory [51] and has been widely applied to classification. For binary classification, the SVM trains a classifier by mapping the input samples onto a high-dimensional space through kernel functions, and then seeking a separating hyperplane that differentiates the two classes with maximal margin and minimal error. Identification of Gram-negative bacterial secreted protein types for Yu’s standard dataset is a six-class classification problem, which can be converted into binary classification by using either the one-to-rest strategy or the one-to-one strategy. The SVM method has proven to be powerful in many fields of bioinformatics [52–64]. In this paper, the prediction model is trained and built with the LIBSVM package [65] using the one-to-one strategy, and the radial basis function (RBF) $K(x_i, x_j) = \exp(-\gamma \| x_i - x_j \|^2)$ is selected as the kernel function for our SVM prediction system. The kernel parameter $\gamma$ and the penalty parameter $C$ are optimized by 10-fold cross-validation in a grid-based manner, which selects the pair of $\gamma$ and $C$ values with the best classification accuracy. The parameters $\gamma$ and $C$ are searched exponentially in the ranges $[2^{-15}, 2^{5}]$ and $[2^{-5}, 2^{15}]$, respectively. Here, the factorization rank $r$ is optimized on the training set by the SVM.
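The RBF kernel and the exponential search grid can be sketched in a few lines. The step of the exponent is not stated in the text; the value 0.5 used below is an assumption, chosen because the reported optimum $\gamma = 22.6274$ equals $2^{4.5}$. The helper names are hypothetical, and the sketch only builds the candidate grid; it does not train an SVM.

```python
import math

def rbf_kernel(x, y, gamma):
    """K(x, y) = exp(-gamma * ||x - y||^2) for two equal-length vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

def exponential_grid(low_exp, high_exp, step=0.5):
    """Values 2**e for e = low_exp, low_exp + step, ..., high_exp (step is assumed)."""
    count = round((high_exp - low_exp) / step)
    return [2.0 ** (low_exp + k * step) for k in range(count + 1)]

gammas = exponential_grid(-15, 5)     # candidate kernel parameters
costs = exponential_grid(-5, 15)      # candidate penalty parameters C
candidate_pairs = [(c, g) for c in costs for g in gammas]
```

Each `(C, gamma)` pair would then be scored by 10-fold cross-validation with the chosen SVM implementation, and the best-scoring pair retained.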

2.5 Performance evaluation

In statistical prediction, the following three cross-validation methods are often used to examine a predictor for its effectiveness in practical application: the independent dataset test, the subsampling test, and the jackknife test. Of the three, the jackknife test is deemed the most objective, for the following reasons [64]: a) in the jackknife test, all the proteins in the benchmark dataset are singled out one by one and tested by the predictor trained on the remaining protein samples; b) during the process of jackknifing, both the training dataset and the testing dataset are actually open, and each protein sample is in turn moved between the two, so the jackknife test can exclude the ‘memory’ effect; c) the arbitrariness problem mentioned for the independent dataset test and the subsampling test is avoided, because the outcome obtained by the jackknife cross-validation is always unique for a given benchmark dataset [66].
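The jackknife (leave-one-out) procedure described in point a) can be sketched as below. The 1-nearest-neighbour rule is only a hypothetical stand-in classifier that keeps the example self-contained; it is not the SVM used in this paper, and the function names and toy data are assumptions.

```python
def jackknife_accuracy(samples, labels, train_and_predict):
    """Single out each sample in turn, train on the rest, and test on the one left out."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if train_and_predict(train_x, train_y, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)

def nearest_neighbour(train_x, train_y, query):
    """Stand-in classifier (assumption): label of the closest training sample."""
    def squared_distance(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    best = min(range(len(train_x)), key=lambda k: squared_distance(train_x[k], query))
    return train_y[best]

# Toy data: two well-separated classes in one dimension.
toy_samples = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
toy_labels = [0, 0, 0, 1, 1, 1]
accuracy = jackknife_accuracy(toy_samples, toy_labels, nearest_neighbour)
```

Because every sample is left out exactly once and no random split is involved, repeating the procedure on the same dataset always gives the same result, which is the uniqueness property noted in point c).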

The classification quality is evaluated by using the Sensitivity (Sens), Specificity (Spec), F-measure, Matthew’s correlation coefficient (MCC) and the Overall accuracy (OA). The Sens represents the percentage of positive samples predicted correctly, and is also called the Recall. The Spec represents the percentage of negative samples predicted correctly. The OA denotes the percentage of both positive and negative samples predicted correctly. The F-measure, the harmonic mean of precision and recall, is a more robust metric that avoids overestimation. The MCC is another comprehensive indicator that considers both positive and negative samples, and returns a value between −1 and +1. A coefficient of +1 represents a perfect prediction, 0 represents an average random prediction, and −1 represents the worst possible prediction. All of the above metrics are calculated under the jackknife cross-validation and defined as follows:

\[
\begin{cases}
Sens \text{ or } Recall = \dfrac{TP}{TP + FN} \\[2mm]
Spec = \dfrac{TN}{TN + FP} \\[2mm]
Precision = \dfrac{TP}{TP + FP} \\[2mm]
F\text{-}measure = 2 \times \dfrac{Precision \times Recall}{Precision + Recall} \\[2mm]
MCC = \dfrac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \\[2mm]
OA = \dfrac{TP + TN}{TP + FN + FP + TN}
\end{cases}
\tag{2.11}
\]

where $TP$ denotes the number of positive samples with positive prediction, $TN$ denotes the number of negative samples with negative prediction, $FP$ denotes the number of negative samples predicted as positive, and $FN$ denotes the number of positive samples predicted as negative. $TP + FN$ and $TN + FP$ are equal to the numbers of positive and negative samples in the dataset, respectively.

To provide a more intuitive and easier-to-understand method for most biologists to measure the prediction quality, Eq. 2.11 can be replaced with the four sub-equations in Eq. 19 of [67], or the four sub-equations in Eq. 14 of [68], which were derived based on Chou’s symbols introduced for studying protein signal peptides. In particular, the advantages of Chou’s intuitive metrics have been analyzed and concurred with by a series of studies published very recently (see, e.g., [17–26, 28, 29, 69–72]).

This set of metrics is valid only for single-label systems (in which each protein sequence belongs to only one Gram-negative bacterial secreted protein type). For multi-label systems (in which a protein sequence might belong to several Gram-negative bacterial secreted protein types), whose existence has become more frequent in system biology [78–83], system medicine [20, 73] and biomedicine [72], a completely different set of metrics as defined in [74] is needed.
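For a single secreted protein type treated as the positive class, the metrics of Eq. 2.11 can be computed directly from the four counts, as in the following minimal sketch. The function name `binary_metrics` and the example counts are hypothetical, and the sketch assumes that no denominator is zero.

```python
import math

def binary_metrics(tp, fn, fp, tn):
    """Metrics of Eq. 2.11 from the counts TP, FN, FP and TN of one class."""
    sens = tp / (tp + fn)                      # Sens, also called Recall
    spec = tn / (tn + fp)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * sens / (precision + sens)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    oa = (tp + tn) / (tp + fn + fp + tn)
    return {"Sens": sens, "Spec": spec, "Precision": precision,
            "F-measure": f_measure, "MCC": mcc, "OA": oa}

# Made-up counts for illustration: 10 positive and 10 negative samples.
example = binary_metrics(tp=8, fn=2, fp=1, tn=9)
```

In the six-class setting, each type is taken in turn as the positive class and the remaining five types as the negative class.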

3 Results and discussion

3.1 Selection of the factorization rank r

An important consideration in applying the classical KL-NMF model is the selection of the factorization rank $r$ that best represents the data. As a rule of thumb, this value is generally chosen so that $(m + n)r < mn$. However, this bound is not informative enough to make a proper decision. An appropriate value of $r$ depends on the concrete problem and is mostly influenced by the nature of the dataset itself. In this paper, we set the rank $r$ in turn to 50, 100, 150, 200, 250, 300, 350, 400, 450 and 500, and calculate the overall accuracy corresponding to each value of $r$ on the training set. As shown in Fig. 1, the optimal value is $r = 150$, judged by the accuracy on the training set. Thus, the 4000D feature vector is reduced to 150D by the KL-NMF algorithm.

3.2 Prediction performance of our model

In this paper, we construct a feature design model, denoted ACCP-KL-NMF, by combining auto-cross correlation analysis based on PSSM (ACCP) with $lg = 10$ and nonnegative matrix factorization based on Kullback-Leibler divergence (KL-NMF) with the factorization rank $r = 150$ determined on the training set. The resulting 150 reduced ACCP features are input into the SVM, which is constructed using the one-to-one strategy with the RBF kernel function. The simple grid-search approach and 10-fold cross-validation are executed on the test set to find the best pair of $C$ and $\gamma$. Finally, the best overall accuracy is obtained with the parameters $C = 4$ and $\gamma = 22.6274$. The results of the most rigorous and objective jackknife cross-validation for the test set are shown in Table 2.

As listed in Table 2, we report the Sens, Spec, F-measure and MCC for each type, as well as the OA. Relying solely on PSSM for feature extraction, we achieve an overall accuracy of 94.19% on the test set. For the six types, the values of Sens are 84.00%, 100.00%, 96.43%, 86.36%, 97.14% and 96.97% for the T1SP, T2SP, T3SP, T4SP, T5SP and T7SP types, respectively. Among them, all the positive examples of the T2SP type are correctly predicted, and the other five types all obtain satisfactory Sens values. Meanwhile, the values of Spec, F-measure and MCC for the six types are also excellent. These results indicate that the ACCP-KL-NMF model is powerful in distinguishing Gram-negative bacterial secreted protein types.

3.3 Contribution analysis of feature groups

In order to analyze the contribution of the ACP (200D) feature group and the CCP (3800D) feature group, we calculate the Sens for each type and the overall accuracies of the ACP feature group and the ACCP-KL-NMF for the test set; more detailed results are shown in Table 3. From Table 3, we can see that the OA of the ACP feature group is 90.70%; when the CCP feature group is added, the prediction accuracy reaches 94.19% with KL-NMF. That is to say, the CCP features increase the prediction accuracy by 3.49%, and the ACP features and the CCP features are both important and make a positive contribution to addressing our problem.

3.4

Performance comparison between KL-NMF and PCA

Principal component analysis (PCA) is one of the most classical dimensionality reduction methods. PCA applies an orthogonal transformation and selects a small number of dominant components that retain most of the information. It is widely used in face recognition and image compression. To investigate the superiority of the KL-NMF, we compare the Sens and OA obtained by the ACCP-KL-NMF with those obtained by the ACCP-PCA on the test set, using the same dimension of 150. With 150 principal components the cumulative contribution rate is greater than or equal to 99%, so the selected components can be considered to contain most of the information in the original data. The OA of the ACCP-PCA reaches 93.60%, which is 0.59 percentage points lower than that of the ACCP-KL-NMF. Fig. 2 shows that the Sens of the ACCP-PCA equals that of the ACCP-KL-NMF for the T1SP and T4SP types. The Sens of the ACCP-KL-NMF is higher than that of the ACCP-PCA for the T2SP and T7SP types and, conversely, lower for the T3SP and T5SP types. In other words, at the level of individual types each method has its own advantages and disadvantages, but from the overall perspective the ACCP-KL-NMF is more suitable for the identification of Gram-negative bacterial secreted protein types.
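The 99% criterion used for the ACCP-PCA can be stated explicitly. The notation here is an assumption and does not come from the original text: $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p$ are the eigenvalues of the covariance matrix of the feature vectors, $p$ is the total number of principal components, and $\eta(k)$ is the cumulative contribution rate of the first $k$ components.

```latex
\eta(k) = \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{p} \lambda_i},
\qquad
\eta(150) \ge 0.99 .
```

Read this way, keeping 150 components discards at most 1% of the total variance of the original features.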

3.5 Performance comparison with other models

To show the superiority of the ACCP-KL-NMF more conveniently, Table 4 summarizes, for the test set, the Sens and OA of the ACCP-KL-NMF, Yu et al. and Ding et al., as well as the Spec and MCC of the ACCP-KL-NMF and Ding et al. The results of Yu et al. comprise two parts, obtained with the one-to-rest strategy and the one-to-one strategy, which are denoted by Yu et al.a and Yu et al.b, respectively. As shown in Table 4, the ACCP-KL-NMF model achieves the highest overall accuracy of the four models on the test set: its OA is 8.14, 4.07 and 0.59 percentage points higher than those obtained by Yu et al.a, Yu et al.b and Ding et al., respectively. Compared with Yu et al.b, which uses the same one-to-one strategy, our model has a great advantage in predicting the T2SP, T4SP and T7SP types. For the Sens, our model and Ding et al. obtain the same results for five of the six types; for the remaining T3SP type, the Sens of our model is 3.57 percentage points higher than that of Ding et al. In other words, our ACCP-KL-NMF model distinguishes the T3SP type more readily than that of Ding et al. For the Spec and MCC, our model matches or outperforms Ding et al. for every type except T5SP. To sum up, our model obtains good prediction results from multiple angles.

4 Conclusions

Considering the importance of the sequence-order information in a given secreted protein sequence, the auto-cross correlation analysis descriptor is employed based on PSSM, and a 4000D feature vector is obtained for each sample. To eliminate the adverse effects of high-dimensional features on the classifier, nonnegative matrix factorization based on the Kullback-Leibler divergence is adopted to reduce the dimensionality and extract the most useful information. The factorization rank r is then optimized on the training set, which yields the feature design model ACCP-KL-NMF. The SVM with the one-to-one strategy and the rigorous jackknife test are used to predict and evaluate the results on the test set, respectively, and our model provides a more accurate computational tool for the identification of Gram-negative bacterial secreted protein types. As pointed out in [75] and demonstrated in a series of recent publications (see, e.g., [18, 20–26, 28, 76–87]), user-friendly and publicly accessible web-servers represent the future direction for developing practically more useful prediction methods and computational tools. Indeed, many practically useful web-servers have had increasing impacts on medical science [34], driving medicinal chemistry into an unprecedented revolution [40]. We shall therefore make efforts in our future work to provide a web-server for the prediction method presented in this paper.

Acknowledgements

The authors would like to thank the anonymous reviewers for their helpful comments on our manuscript. This work was supported by the National Natural Science Foundation of China (No. 11601407) and the Doctoral Scientific Research Foundation of Xi’an Polytechnic University (No. BS1710).

References

[1] S. Pundhir, A. Kumar, SSPred: A prediction server based on SVM for the identification and classification of proteins involved in bacterial secretion systems, Bioinformation 6 (2011) 380-382.

[2] V. T. Lee, O. Schneewind, Review: Protein secretion and the pathogenesis of bacterial infections, Genes Dev. 15 (2001) 1725-1752.

[3] M. de Chial, B. Ghysels, S. A. Beatson, et al., Identification of type II and type III pyoverdine receptors from Pseudomonas aeruginosa, Microbiology 149 (2003) 821-831.

[4] A. Blocker, K. Komoriya, S. Aizawa, Type III secretion systems and bacterial flagella: insights into their function from structural similarities, Proc. Natl. Acad. Sci. USA 100 (2003) 3027-3030.

[5] K. Omori, A. Idei, Gram-negative bacterial ATP-binding cassette protein exporter family and diverse secretory proteins, J. Biosci. Bioeng. 95 (2003) 1-12.

[6] M. E. Konkel, B. J. Kim, V. Rivera-Amill, et al., Bacterial secreted proteins are required for the internalization of Campylobacter jejuni into cultured mammalian cells, Mol. Microbiol. 32 (1999) 691-701.

[7] D. Buttner, U. Bonas, Common infection strategies of plant and animal pathogenic bacteria, Curr. Opinion Plant Biol. 6 (2003) 312-319.

[8] M. Desvaux, M. Hebraud, R. Talon, et al., Secretion and subcellular localizations of bacterial proteins: a semantic awareness issue, Trends Microbiol. 17 (2009) 139-145.

[9] J. D. Bendtsen, L. Kiemer, A. Fausboll, et al., Non-classical protein secretion in bacteria, BMC Microbiol. 5 (2005) 58-70.

[10] H. B. Shen, Large-scale predictions of Gram-negative bacterial protein subcellular locations, J. Proteome Res. 5 (2006) 3420-3428.

[11] H. B. Shen, Gneg-mPLoc: A top-down strategy to enhance the quality of predicting subcellular localization of Gram-negative bacterial proteins, J. Theor. Biol. 264 (2010) 326-333.

[12] X. Xiao, Z. C. Wu, K. C. Chou, A multi-label classifier for predicting the subcellular localization of gram-negative bacterial proteins with both single and multiple sites, PLoS ONE 6 (2011) e20592.

[13] B. Mudrak, M. J. Kuehn, Specificity of the type II secretion systems of enterotoxigenic Escherichia coli and Vibrio cholerae for heat-labile enterotoxin and cholera toxin, J. Bacteriol. 192 (2010) 1902-1911.

[14] L. Z. Yu, J. S. Luo, Y. Z. Guo, In silico identification of Gram-negative bacterial secreted proteins from primary sequence, Comput. Biol. Med. 43 (2013) 1177-1181.

[15] S. Y. Ding, S. L. Zhang, A Gram-negative bacterial secreted protein types prediction method based on PSI-BLAST profile, BioMed Res. Int. 3206741 (2016) 1-5.

[16] D. D. Lee, H. S. Seung, Learning the parts of objects by nonnegative matrix factorization, Nature 401 (1999) 788-791.

[17] J. H. Jia, Z. Liu, X. Xiao, et al., iPPI-Esml: an ensemble classifier for identifying the interactions of proteins by incorporating their physicochemical properties and wavelet transforms into PseAAC, J. Theor. Biol. 377 (2015) 47-56.

[18] W. Chen, P. M. Feng, H. Yang, et al., iRNA-AI: identifying the adenosine to inosine editing sites in RNA sequences, Oncotarget 8 (2017) 4208-4217.

[19] B. Liu, L. Y. Fang, F. Liu, et al., Identification of real microRNA precursors with a pseudo structure status composition approach, PLoS ONE 10 (2015) e0121501.

[20] X. Cheng, S. G. Zhao, X. Xiao, et al., iATC-mISF: a multi-label classifier for predicting the classes of anatomical therapeutic chemicals, Bioinformatics 33 (2017) 341-346.

[21] P. Feng, H. Ding, H. Yang, et al., iRNA-PseColl: Identifying the occurrence sites of different RNA modifications by incorporating collective effects of nucleotides into PseKNC, Mol. Ther. Nucleic Acids 7 (2017) 155-163.

[22] B. Liu, S. Y. Wang, R. Long, et al., iRSpot-EL: identify recombination spots with an ensemble learning approach, Bioinformatics 33 (2017) 35-41.

[23] B. Liu, F. Yang, K. C. Chou, 2L-piRNA: A two-layer ensemble classifier for identifying piwi-interacting RNAs and their function, Mol. Ther. Nucleic Acids 7 (2017) 267-277.

[24] L. M. Liu, Y. Xu, K. C. Chou, iPGK-PseAAC: identify lysine phosphoglycerylation sites in proteins by incorporating four different tiers of amino acid pairwise coupling information into the general PseAAC, Med. Chem. 13 (2017) 552-559.

[25] W. R. Qiu, S. Y. Jiang, Z. C. Xu, et al., iRNAm5C-PseDNC: identifying RNA 5-methylcytosine sites by incorporating physical-chemical properties into pseudo dinucleotide composition, Oncotarget 8 (2017) 41178-41188.

[26] W. R. Qiu, B. Q. Sun, X. Xiao, et al., iPhos-PseEvo: Identifying human phosphorylated proteins by incorporating evolutionary information into general PseAAC via grey system theory, Mol. Inform. 36 (2017) UNSP 1600010.

[27] Q. Su, W. Lu, D. Du, et al., Prediction of the aquatic toxicity of aromatic compounds to Tetrahymena pyriformis through support vector regression, Oncotarget 8 (2017) 49359-49369.

[28] Y. Xu, Z. Wang, C. Li, et al., iPreny-PseAAC: identify C-terminal cysteine prenylation sites in proteins by incorporating two tiers of sequence couplings into PseAAC, Med. Chem. 13 (2017) 544-551.

[29] J. Song, Y. Wang, F. Li, et al., iProt-Sub: a comprehensive package for accurately mapping and predicting protease-specific substrates and cleavage sites, Brief. Bioinform. (2018) doi:10.1093/bib/bby028.

[30] K. C. Chou, Some remarks on protein attribute prediction and pseudo amino acid composition (50th Anniversary Year Review), J. Theor. Biol. 273 (2011) 236-247.

[31] B. Boeckmann, A. Bairoch, R. Apweiler, et al., The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003, Nucleic Acids Res. 31 (2003) 365-370.

[32] UniProt Consortium, The universal protein resource (UniProt), Nucleic Acids Res. 36 (2008) 190-195.

[33] K. D. Pruitt, T. Tatusova, W. Klimke, et al., NCBI reference sequences: current status, policy and new initiatives, Nucleic Acids Res. 37 (2009) 32-36.

[34] K. C. Chou, Impacts of bioinformatics to medicinal chemistry, Med. Chem. 11 (2015) 218-234.

[35] K. C. Chou, Prediction of protein cellular attributes using pseudo amino acid composition, Proteins: Struct. Funct. Bioinform. 43 (2001) 246-255.

[36] K. C. Chou, Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes, Bioinformatics 21 (2005) 10-19.

[37] K. Ahmad, M. Waris, M. Hayat, Prediction of protein submitochondrial locations by incorporating dipeptide composition into Chou’s general pseudo amino acid composition, J. Membrane Biol. 249 (2016) 293-304.

[38] P. K. Meher, T. K. Sahu, V. Saini, et al., Predicting antimicrobial peptides with improved accuracy by incorporating the compositional, physico-chemical and structural features into Chou’s general PseAAC, Sci. Rep. 7 (2017) 42362.

[39] J. Mei, J. Zhao, Prediction of HIV-1 and HIV-2 proteins by using Chou’s pseudo amino acid compositions and different classifiers, Sci. Rep. 8 (2018) 2359.

[40] K. C. Chou, An unprecedented revolution in medicinal chemistry driven by the progress of biological science, Curr. Top. Med. Chem. 17 (2017) 2337-2358.

[41] W. Chen, T. Y. Lei, D. C. Jin, et al., PseKNC: a flexible web-server for generating pseudo K-tuple nucleotide composition, Anal. Biochem. 456 (2014) 53-60.

[42] W. Chen, H. Lin, K. C. Chou, Pseudo nucleotide composition or PseKNC: an effective formulation for analyzing genomic sequences, Mol. Biosyst. 11 (2015) 2620-2634.

[43] B. Liu, F. Yang, D. S. Huang, iPromoter-2L: a two-layer predictor for identifying promoters and their types by multi-window-based PseKNC, Bioinformatics 34 (2018) 33-40.

[44] B. Liu, F. Liu, X. Wang, et al., Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences, Nucleic Acids Res. 43 (2015) W65-W71.

[45] B. Liu, H. Wu, K. C. Chou, Pse-in-One 2.0: An improved package of web servers for generating various modes of pseudo components of DNA, RNA, and protein sequences, Natural Science 9 (2017) 67-91.

[46] S. F. Altschul, T. L. Madden, A. A. Schäffer, et al., Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Res. 25 (1997) 3389-3402.

[47] W. S. Bu, Z. P. Feng, Z. Zhang, et al., Prediction of protein (domain) structural classes based on amino-acid index, European Journal of Biochemistry 266 (1999) 1043-1049.

[48] S. Kawashima, M. Kanehisa, AAindex: Amino acid index database, Nucleic Acids Res. 28 (2000) 374-374.

[49] C. Chen, L. X. Chen, X. Y. Zou, et al., Predicting protein structural class based on multi-features fusion, J. Theor. Biol. 253 (2008) 388-392.

[50] D. D. Lee, H. S. Seung, Algorithms for non-negative matrix factorization, Advances in Neural Information Processing Systems, MIT Press (2001) 556-562.

[51] V. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.

[52] J. R. Wang, C. Wang, J. J. Cao, et al., Prediction of protein structural classes for low-similarity sequences using reduced PSSM and position-based secondary structural features, Gene 554 (2015) 241-248.

[53] S. L. Zhang, X. Duan, Prediction of protein subcellular localization with oversampling approach and Chou’s general PseAAC, J. Theor. Biol. 437 (2018) 239-250.

[54] L. Z. Yu, Y. Z. Guo, Z. Zhang, et al., SecretP: a new method for predicting mammalian secreted proteins, Peptides 31 (2010) 574-578.

[55] G. L. Fan, Q. Z. Li, Predict mycobacterial proteins subcellular locations by incorporating pseudo-average chemical shift into the general form of Chou’s pseudo amino acid composition, J. Theor. Biol. 304 (2012) 88-95.

[56] J. Y. Yang, X. Chen, Improving taxonomy-based protein fold recognition by using global and local features, Proteins 79 (2011) 2053-2064.

[57] J. Chen, H. M. Xu, P. A. He, et al., A multiple information fusion method for predicting subcellular locations of two different types of bacterial protein simultaneously, BioSystems 139 (2016) 37-45.

[58] C. Huang, J. Q. Yuan, Using radial basis function on the general form of Chou’s pseudo amino acid composition and PSSM to predict subcellular locations of proteins with both single and multiple sites, BioSystems 113 (2013) 50-57.

[59] A. Dehzangi, R. Heffernan, A. Sharma, et al., Gram-positive and Gram-negative protein subcellular localization by incorporating evolutionary-based descriptors into Chou’s general PseAAC, J. Theor. Biol. 364 (2015) 284-294.

[60] Y. H. Yao, Z. X. Shi, Q. Dai, Apoptosis protein subcellular location prediction based on position-specific scoring matrix, J. Comput. Theor. Nanosci. 11 (2014) 2073-2078.

[61] S. L. Li, H. Li, M. F. Li, et al., Improved prediction of lysine acetylation by support vector machines, Protein and Peptide Lett. 16 (2009) 977-983.

[62] Y. N. Zhang, D. J. Yu, S. S. Li, et al., Predicting protein-ATP binding sites from primary sequence through fusing bi-profile sampling of multi-view features, BMC Bioinformatics 13 (2012) 1-11.

[63] Y. C. Dou, B. Yao, C. Zhang, PhosphoSVM: prediction of phosphorylation sites by integrating various protein sequence attributes with a support vector machine, Amino Acids 46 (2014) 1459-1469.

[64] S. Niu, L. L. Hu, L. L. Zheng, Predicting protein oxidation sites with feature selection and analysis approach, J. Biomol. Struct. Dyn. 29 (2012) 650-658.

[65] C. C. Chang, C. J. Lin, LIBSVM: a library for support vector machines, 2001.

[66] K. C. Chou, H. B. Shen, Review: recent progress in protein subcellular location prediction, Anal. Biochem. 370 (2007) 1-16.

[67] Y. Xu, X. J. Shao, L. Y. Wu, et al., iSNO-AAPair: incorporating amino acid pairwise coupling into PseAAC for predicting cysteine S-nitrosylation sites in proteins, PeerJ 1 (2013) e171.

[68] W. Chen, P. M. Feng, H. Lin, et al., iRSpot-PseDNC: identify recombination spots with pseudo dinucleotide composition, Nucleic Acids Res. 41 (2013) e68.

[69] J. Jia, Z. Liu, X. Xiao, et al., iCar-PseCp: identify carbonylation sites in proteins by Monto Carlo sampling and incorporating sequence coupled effects into general PseAAC, Oncotarget 7 (2016) 34558-34570.

[70] B. Liu, R. Long, K. C. Chou, iDHS-EL: Identifying DNase I hypersensitive sites by fusing three different modes of pseudo nucleotide composition into an ensemble learning framework, Bioinformatics 32 (2016) 2411-2418.

[71] Z. Liu, X. Xiao, D. J. Yu, et al., pRNAm-PC: Predicting N6-methyladenosine sites in RNA sequences via physical-chemical properties, Anal. Biochem. 497 (2016) 60-67.

[72] W. R. Qiu, B. Q. Sun, X. Xiao, et al., iPTM-mLys: identifying multiple lysine PTM sites and their different types, Bioinformatics 32 (2016) 3116-3123.

[73] X. Cheng, S. G. Zhao, X. Xiao, et al., iATC-mHyb: a hybrid multi-label classifier for predicting the classification of anatomical therapeutic chemicals, Oncotarget 8 (2017) 58494-58503.

[74] K. C. Chou, Some Remarks on Predicting Multi-Label Attributes in Molecular Biosystems, Mol. Biosyst. 9 (2013) 1092-1100.

[75] H. B. Shen, Recent advances in developing web-servers for predicting protein attributes, Natural Science 1 (2009) 63-92.

[76] B. Liu, L. Fang, F. Liu, et al., Identification of real microRNA precursors with a pseudo structure status composition approach, PLoS ONE 10 (2015) e0121501.

[77] B. Liu, L. Fang, R. Long, et al., iEnhancer-2L: a two-layer predictor for identifying enhancers and their strength by pseudo k-tuple nucleotide composition, Bioinformatics 32 (2016) 362-369.

[78] X. Cheng, X. Xiao, K. C. Chou, pLoc-mPlant: predict subcellular localization of multi-location plant proteins by incorporating the optimal GO information into general PseAAC, Mol. Biosyst. 13 (2017) 1722-1727.

[79] X. Cheng, X. Xiao, K. C. Chou, pLoc-mVirus: predict subcellular localization of multi-location virus proteins via incorporating the optimal GO information into general PseAAC, Gene 628 (2017) 315-321.

[80] X. Cheng, X. Xiao, K. C. Chou, pLoc-mEuk: Predict subcellular localization of multi-label eukaryotic proteins by extracting the key GO information into general PseAAC, Genomics 110 (2018) 50-58.

[81] X. Cheng, X. Xiao, K. C. Chou, pLoc-mGneg: Predict subcellular localization of Gram-negative bacterial proteins by deep gene ontology learning via general PseAAC, Genomics (2017) doi:10.1016/j.ygeno.2017.10.002.

[82] X. Cheng, S. G. Zhao, W. Z. Lin, et al., pLoc-mAnimal: predict subcellular localization of animal proteins with both single and multiple sites, Bioinformatics 33 (2017) 3524-3531.

[83] X. Xiao, X. Cheng, S. Su, et al., pLoc-mGpos: Incorporate key gene ontology information into general PseAAC for predicting subcellular localization of Gram-positive bacterial proteins, Natural Science 9 (2017) 331-349.

[84] J. Song, F. Li, K. Takemoto, et al., an integrative approach for inferring catalytic residues using sequence, structural and network features in a machine learning framework, J. Theor. Biol. 443 (2018) 125-137.

[85] W. R. Qiu, B. Q. Sun, X. Xiao, et al., iKcr-PseEns: Identify lysine crotonylation sites in histone proteins with pseudo components and ensemble classifier, Genomics (2017) doi:10.1016/j.ygeno.2017.10.008.

[86] X. Cheng, X. Xiao, K. C. Chou, pLoc-mHum: predict subcellular localization of multi-location human proteins via general PseAAC to winnow out the crucial GO information, Bioinformatics 34 (2018) 1448-1456.

[87] J. Wang, B. Yang, J. Revote, et al., POSSUM: a bioinformatics toolkit for generating numerical sequence feature descriptors based on PSSM profiles, Bioinformatics 33 (2017) 2756-2758.

Table Captions:

Table 1: The compositions of the training set and the test set.
Table 2: The prediction results of the test set by the jackknife test.
Table 3: The contribution of features group for the prediction accuracy(%).
Table 4: Performance comparison of different methods on the test set.

Table 1

The compositions of the training set and the test set.

| Dataset | T1SP | T2SP | T3SP | T4SP | T5SP | T7SP | Total |
|---|---|---|---|---|---|---|---|
| Training set | 112 | 99 | 182 | 62 | 164 | 48 | 667 |
| Test set | 25 | 29 | 28 | 22 | 35 | 33 | 172 |

Table 2

The prediction results of the test set by the jackknife test.

| Type | Sens(%) | Spec(%) | F-measure | MCC |
|---|---|---|---|---|
| T1SP | 84.00 | 100.00 | 0.91 | 0.90 |
| T2SP | 100.00 | 99.30 | 0.98 | 0.98 |
| T3SP | 96.43 | 98.61 | 0.95 | 0.94 |
| T4SP | 86.36 | 99.33 | 0.89 | 0.87 |
| T5SP | 97.14 | 97.81 | 0.94 | 0.93 |
| T7SP | 96.97 | 98.56 | 0.96 | 0.94 |

OA(%): 94.19

C=4, γ=22.6274.

Table 3

The contribution of features group for the prediction accuracy(%).

| Features group | T1SP | T2SP | T3SP | T4SP | T5SP | T7SP | OA(%) |
|---|---|---|---|---|---|---|---|
| ACP | 88.00 | 93.10 | 96.43 | 86.36 | 97.14 | 81.81 | 90.70 |
| ACCP-KL-NMF | 84.00 | 100.00 | 96.43 | 86.36 | 97.14 | 96.97 | 94.19 |

Table 4

Performance comparison of different methods on the test set.

Prediction accuracy(%)

| Method | T1SP | T2SP | T3SP | T4SP | T5SP | T7SP | OA(%) |
|---|---|---|---|---|---|---|---|
| Yu et al. [14] a | 80.00 | 75.86 | 100.00 | 77.27 | 97.14 | 81.82 | 86.05 |
| Yu et al. [14] b | 88.00 | 79.31 | 100.00 | 81.82 | 100.00 | 87.88 | 90.12 |
| Ding et al. [15] | 84.00 | 100.00 | 92.86 | 86.36 | 97.14 | 96.97 | 93.60 |
| This paper | 84.00 | 100.00 | 96.43 | 86.36 | 97.14 | 96.97 | 94.19 |

Spec(%)

| Method | T1SP | T2SP | T3SP | T4SP | T5SP | T7SP |
|---|---|---|---|---|---|---|
| Ding et al. [15] | 100.00 | 97.90 | 98.61 | 98.67 | 99.27 | 97.84 |
| This paper | 100.00 | 99.30 | 98.61 | 99.33 | 97.81 | 98.56 |

MCC

| Method | T1SP | T2SP | T3SP | T4SP | T5SP | T7SP |
|---|---|---|---|---|---|---|
| Ding et al. [15] | 0.90 | 0.94 | 0.92 | 0.87 | 0.96 | 0.93 |
| This paper | 0.90 | 0.98 | 0.94 | 0.87 | 0.93 | 0.94 |

Yu et al.a denotes the results of the one-to-rest strategy for the SVM.

Yu et al.b denotes the results of the one-to-one strategy for the SVM.

Figure Captions:

Figure 1: The selection of the factorization rank r based on the training set. (Plot: the overall accuracy(%) of the training set, axis from 80 to 88, against the value of the factorization rank r, axis from 50 to 500.)

Figure 2: Performance comparison of the ACCP-KL-NMF and the ACCP-PCA. (Plot: the prediction accuracy(%), axis from 80 to 100, for T1SP, T2SP, T3SP, T4SP, T5SP, T7SP and OA, with one series for ACCP-KL-NMF and one for ACCP-PCA.)