Statistical analysis of proteomics data: A review on feature selection

Journal of Proteomics xxx (xxxx) xxx–xxx Contents lists available at ScienceDirect Journal of Proteomics journal homepage: www.elsevier.com/locate/j...


Journal of Proteomics xxx (xxxx) xxx–xxx

Contents lists available at ScienceDirect

Journal of Proteomics journal homepage: www.elsevier.com/locate/jprot

Statistical analysis of proteomics data: A review on feature selection

Marta Lualdi, Mauro Fasano⁎

Department of Science and High Technology (DiSAT), University of Insubria, Busto Arsizio, Italy

ARTICLE INFO

Keywords: Inductive reasoning; Dimensionality and sparsity; Feature selection; Proteomics signature

ABSTRACT

The spread of “-omics” strategies has strongly changed the way of thinking about the scientific method. Indeed, managing huge amounts of data requires replacing the classical deductive approach with a data-driven inductive approach, so as to generate mechanistic hypotheses from data. Data reduction is a crucial step in proteomics data analysis, because significant features are sparse in big datasets. Thus, feature selection methods are applied to obtain a set of features from which a proteomics signature with functional significance (e.g., classification, diagnosis, prognosis) can be drawn. In this context, the aim of the present review article is to give an overview of the methods available for proteomics data analysis, with a focus on biomedical translational research. Suggestions are presented for the choice of the most appropriate standard statistical procedures to perform data reduction by feature selection, cross-validation and functional analysis of proteomics profiles.

Significance: The proteome, including all so-called “proteoforms”, represents the highest level of complexity of biomolecules when compared to the other “-omes” (i.e., genome, transcriptome). For this reason, the use of proper data reduction strategies is mandatory for proteomics data analysis. However, the strategies to be employed for feature selection must be carefully chosen, since many different approaches exist, depending on both the input data and the desired output. So far, a well-established decision-making workflow for proteomics data analysis is lacking, which leaves room for misleading and incorrect data analysis and interpretation. In this review article many statistical approaches are described and compared for their application in the field of biomedical research, in order to suggest to the reader the most suitable analysis pathway and to avoid mistakes.

⁎ Corresponding author at: Dept. of Science and High Technology, University of Insubria, via Manara 7, I-21052 Busto Arsizio, Italy. E-mail address: [email protected] (M. Fasano).

https://doi.org/10.1016/j.jprot.2018.12.004 Received 12 October 2018; Received in revised form 27 November 2018; Accepted 5 December 2018 1874-3919/ © 2018 Elsevier B.V. All rights reserved.

Please cite this article as: Lualdi, M., Journal of Proteomics, https://doi.org/10.1016/j.jprot.2018.12.004

1. Data analysis workflows in science

Deductive reasoning has been the way science has worked for hundreds of years, based on the classical hypothesis-driven scientific method. Scientists formulate a hypothesis and perform experiments to test it, and the hypothesis is then confirmed or falsified. Eventually, based on both the initial hypothesis and the empirical results, a theoretical model is generated in order to draw general rules that describe the world around us. In this view, data are something that can be visualized in their totality, and correlations among data can be mechanistically analyzed to highlight causative relations and to generate models. Models are powerful tools to describe reality, from astrophysics to human biology, but they suffer from many limitations. As the statistician George E. P. Box said some decades ago, “All models are wrong, but some are useful” [1].

1.1. From deduction to induction: the “omics” revolution of scientific thought

“Omics” techniques (e.g., proteomics, genomics, metabolomics), from which we are now able to obtain huge datasets, require a different way of thinking about science that can be summarized with the idea that, when data are enough, they can speak for themselves. The approach is then data-driven, which means that experimental observations are used to draw inferences. This hypothesis-generating strategy can be applied either without any a priori hypothesis or starting from a precise idea (e.g., the presence of alterations in a specific biological pathway) that will then be confirmed or denied by data. In this view, correlation supplants causation, and complex statistical algorithms are able to find patterns where scientists cannot. This is true for “omics” techniques in general, but it is especially true for proteomics. The reason is that a set of “proteoforms” exists for each single protein (i.e., protein isoforms and post-translationally modified forms), so that proteins represent a higher level of complexity than genes and transcripts. This complexity can only be approached by means of inductive reasoning.

Journal of Proteomics (JoP), which is now celebrating its 10th Anniversary, has greatly contributed to the field of proteomics research as the official journal of the European Proteomics Association (EuPA). In our opinion, JoP has earned great credit for publishing scientific articles that had a high impact on both the research community and social media [2–4]. JoP not only publishes papers giving directions on how to choose the most appropriate proteomics data processing workflow [5], but also makes a number of specific protein datasets available [6–8], which represent the starting point for meta-analysis studies and for the application of systems biology approaches. This is of crucial importance for the inductive approach in proteomics to spread and grow. In this context, the aim of this review is to give the reader an overview of the methods available for proteomics data analysis, thus suggesting the most appropriate standard statistical procedures for data reduction, feature selection, cross-validation and functional analysis of proteomics profiles.

2. Inductive approach in proteomics

Inductive reasoning has dramatically changed bioinformatics in terms of modeling strategies, since model generation is now driven by data and not, or at least less, by theoretical presuppositions. This represents a great advantage in terms of new opportunities with respect to hypothesis-driven models, but also a big challenge [9]. Indeed, despite the big data generated almost daily by proteomics studies, a well-established statistical workflow for data analysis in proteomics is still lacking, which leaves room for misleading and incorrect data analysis and interpretation [10].

2.1. Emergent properties

Biological organisms are complex systems, which are able to cope with environmental changes by reorganizing their individual components, thus acquiring new properties. These properties of the system can only be explained by systems theory [11]. Indeed, an “emergent property” is characteristic of the system, but it is not present in its individual components. In other words, emergent properties arise when the elements of a system collaborate, while they cannot emerge when any of these elements acts alone [12]. To give a general example, human consciousness can be considered an emergent property of the human brain, since no single neuron holds it, but consciousness arises from the sum of all neurons acting together. The concept of “emergence” is perfectly suited to multifactorial diseases, in which multiple genetic and environmental factors contribute to the initiation and progression of the pathogenetic process. This clearly implies that studying the function of a single (or a few) disease gene/protein will never give a comprehensive view of the pathogenetic process.

2.2. Predictivity

Inductive reasoning also involves the ability to make predictions about novel conditions based on preexisting knowledge or observations [13,14]. The simplest way in which we perform inductive reasoning is category-based induction. This means that we infer some properties of a “final category” based on observations obtained from a “premise” or “set” category (e.g., if we find a specific enzyme expressed in Great Dane dogs, we are moderately confident that this can be generalized to all dogs). This concept, that is, the ability to make true predictions based on a set of empirical observations, is central in current biomedical research. For instance, in biomarker discovery “omics” approaches are applied to small sets of individuals in order to obtain large datasets, from which a small number of statistically discriminant features (a signature) can be extracted that is able to discriminate between, e.g., control individuals and patients [15–17]. The signature obtained must be predictive, in that the panel of discriminating features inferred from the “set” sample should reach the same sensitivity and specificity when applied to a “test” sample. To achieve this goal (i.e., obtaining predictive disease biomarkers), a robust statistical workflow must be used for data reduction and feature selection.

3. Proteomics data simplification by feature selection

Proteomics data suffer from two unavoidable and related issues: dimensionality and sparsity. The first refers to the fact that, in proteomics data, the number of variables/features is normally much higher than the number of samples. The larger the dimensionality, the larger the amount of data required to obtain a reliable analysis. This is known as the “curse of dimensionality” in machine learning, since computational costs also increase exponentially [18]. The latter issue has been mitigated using kernel algorithms for pattern analysis that, working in a high-dimensional feature space (principal components, rankings, clusters), are computationally less demanding than algorithms that operate in the “real” feature space. Still, the problem remains. Related to the increase in dimensionality is the difficulty of demonstrating that the results are statistically significant, which is due to the sparsity of meaningful data in a big dataset. The main issue with this kind of large dataset (many features, few samples) is that it is prone to overfitting, which means mistaking small differences for significant variance, leading to errors.

3.1. Feature selection

The solution to the problems of dimensionality and sparsity is offered by methods for data reduction, which envisage the selection of a specific subset of significant features, with the aims of maximizing relevance and minimizing redundancy. In order to obtain this subset, a feature selection (FS) approach must be employed [19], choosing from either feature subset selection or feature extraction methods [20]. Of note, big proteomics datasets require, before feature selection, a pre-processing step that usually includes normalization, missing value imputation and scaling. We will not focus on data pre-processing in this review, since this topic has already been extensively reviewed [17]. A schematic summary of the proteomics data analysis workflow is shown in Fig. 1.

Fig. 1. Proteomics signature workflow. Biological samples are processed to obtain a protein extract that is enzymatically digested and analyzed by LC-MS/MS. Raw MS data are then analyzed and a list of quantified and validated features is obtained for each sample, gathered in a complex matrix of data. The feature selection (FS) procedure is then performed for data reduction by either feature subset selection (FSS) or feature extraction (FE) methods. FSS generates a subset of original features, whereas FE transforms the original data and extracts features that are different from the original ones. A set of discriminating features is eventually obtained that, after stringent cross-validation procedures, represents the proteomics signature of interest.

3.1.1. Feature subset selection

Feature subset selection methods are based on the removal of those features that are not relevant or are redundant, following the Occam's Razor principle; in other words, they only select a subset among the original variables. Three main categories have been described, assuming feature independence or near-independence: filters, wrappers and embedded methods.

i) Filters. They select a subset of “the most important” features based on a score or correlation function, independently of the employed model, in one-shot non-iterative processes. Univariate filters (e.g., information gain [21], correlation [22], reliefF [23]) evaluate and rank a single feature, while multivariate filters (e.g., minimum redundancy maximum relevance mRmR [24], correlation-based feature selection CFS [25], multi-cluster feature selection MCFS [26]) analyse an entire subset of features.

ii) Wrappers. They use machine-learning algorithms to evaluate which features are the most important, using search strategies such as simulated annealing or genetic algorithms. They take the model into account by training and testing features, in an iterative process where several subsets of features are generated and tested in each iteration. Feature selection is here a part of the model training [27]. It has been empirically demonstrated that wrappers show better performance than filters [28].

iii) Embedded methods. They combine the feature selection procedure and the classification process, by performing selection during the execution of the modeling algorithm. The most used methods include various decision tree algorithms (e.g., CART and random forest), least square regression and support vector machines (SVM). They have been proposed to combine the best properties of filters and wrappers. Indeed, a filter is first used to reduce the dimension of the feature space, then a wrapper is used to select the best candidate subset.
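As a minimal sketch of a univariate filter of the kind described in point i) of Section 3.1.1, the following Python snippet scores each feature independently and keeps the top-ranked ones. The function names (`welch_t`, `filter_top_features`), the toy intensity matrix and the use of the absolute Welch t-statistic as the score are illustrative assumptions, not a procedure taken from the cited works.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch t-statistic between two lists of values (at least 2 values each)."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

def filter_top_features(rows, labels, n_keep):
    """Rank features by |t| between classes 0 and 1; return indices of the n_keep best."""
    n_features = len(rows[0])
    scores = []
    for j in range(n_features):
        group0 = [row[j] for row, y in zip(rows, labels) if y == 0]
        group1 = [row[j] for row, y in zip(rows, labels) if y == 1]
        scores.append(abs(welch_t(group0, group1)))
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:n_keep]

# Hypothetical toy data: 6 samples (rows) x 3 protein features (columns)
rows = [
    [10.1, 5.0, 7.2],
    [10.3, 5.2, 6.9],
    [9.9, 4.9, 7.1],
    [12.0, 5.1, 7.0],
    [12.2, 5.0, 7.3],
    [11.9, 5.3, 6.8],
]
labels = [0, 0, 0, 1, 1, 1]
selected = filter_top_features(rows, labels, n_keep=1)
```

Because the score is computed once per feature and no model is trained, this is a one-shot filter; a wrapper would instead re-train a classifier on each candidate subset.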

3.1.2. Feature extraction

On the other hand, feature extraction algorithms (linear and nonlinear) create new variables as combinations (aggregations, transformations, etc.) of the existing ones.

3.2. Univariate vs. multivariate methods

Intuitively, the simplest way to perform feature selection on proteomics data is to focus on those features that quantitatively change the most (differential proteins). However, a strong quantitative difference between two categories or classes (e.g., control vs. disease, treated vs. untreated) does not necessarily indicate a true mechanistic difference. In other words, statistical testing is always needed to assess the likelihood that the observed quantitative difference has arisen by chance. Statistical approaches can be categorized in two main groups: i) univariate methods, and ii) multivariate methods. As already mentioned when describing filters, the conceptual difference between the two is that univariate methods consider the input features one by one, while multivariate ones consider whole groups of variables together.

3.2.1. p-Value, statistical significance and errors

The most popular univariate methods include the t-test, analysis of variance (ANOVA) and logistic regression. All these methods provide a score, which can be used to rank the input features. Statistical tests provide the so-called p-value, which is a measure of the level of significance and is used to estimate the “importance” of a feature. An issue that is sometimes overlooked is that a p-value of 0.05 means that there is a 5% chance of obtaining a result at least as extreme as the observed one if the null hypothesis (no difference among groups) were true. In other words, if 100 statistical tests are performed and the null hypothesis is actually true for all of them, about 5 of them are expected to be significant at the p < 0.05 level by chance. In this case, five statistically significant results are obtained, all of them being false positives (type I errors). Indeed, type I errors (false positives) represent those situations in which the null hypothesis is rejected, but it was actually true. By contrast, type II errors (false negatives) occur when the null hypothesis is not rejected, but it was actually false.

3.2.2. Multiple testing correction

From a clinical proteomics point of view, falling into type I errors is worse than falling into type II errors. Thus, the problem of false positives when performing multiple statistical tests (i.e., when several variables are considered simultaneously) is managed by the use of multiple testing corrections (MTCs). However, this cannot be done without affecting the probability of type II errors, which concomitantly increases. Thus, care must always be taken when applying these kinds of corrections. The most used MTC methods are briefly described below.

i) Bonferroni correction [29]. The significance level (α) for an individual test (corresponding to a single feature) is found by dividing the threshold value (usually 0.05) by k, which is the total number of features. For instance, if you have 100 features (k = 100), α = 0.05/100 = 0.0005, so that only individual tests with p < 0.0005 would be considered significant. This clearly lowers the power of the statistical analysis (many false negatives), and this is why the Bonferroni correction is appropriate when even a single false positive would represent a problem.

ii) Holm correction [30]. Similar to the previous one, but α is adjusted step-by-step. The k features are ordered by their respective p-values, then the lowest p-value is compared to α/k. If the p-value is lower than α/k, then the null hypothesis is rejected. The same selection is performed with the remaining k − 1 tests, setting a new threshold of α/(k − 1), and so on. The iteration is performed until the selected p-value becomes higher than the threshold, at which point all the remaining p-values are considered not significant. Thanks to this step-by-step adaptation of the threshold, Holm's method is more powerful than Bonferroni's (the number of false negatives is reduced).

iii) Benjamini-Hochberg correction [31]. Instead of controlling the family-wise error rate (i.e., the probability of getting one or more false positives), this method aims at controlling the false discovery rate (FDR), which is “the expected proportion of errors among the rejected hypotheses” [31]. As before, individual p-values are put in order, from lowest to highest. The lowest p-value has a rank of r = 1, the next has r = 2, and so on. Then, each p-value is compared to (r/k) × Q, where Q is the chosen FDR. The largest rank for which p < (r/k) × Q is identified, and all p-values up to that rank are considered to be significant.

As shown in Table 1, where we applied the three methods to the same set of p-values, the Benjamini-Hochberg correction is the least stringent of the three (five p-values are significant, in contrast with the three and two obtained with the Holm and Bonferroni methods, respectively). Also for this reason, this is the most widely used correction in proteomics, and the issue of how to estimate and manage the FDR has already been reviewed elsewhere [32,33].
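The three corrections described in Section 3.2.2 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions: the function names are hypothetical, strict inequalities are used as in the text (the usual textbook statements of these procedures use ≤), and the p-values elided in Table 1 (ranks 9 to 100) are replaced by placeholder values evenly spaced between 0.03 and 0.9. Exact fractions are used so that the boundary case p = α/k is not affected by floating-point rounding.

```python
from fractions import Fraction

def bonferroni(pvals, alpha):
    """Indices of tests with p < alpha / k."""
    k = len(pvals)
    return [i for i, p in enumerate(pvals) if p < alpha / k]

def holm(pvals, alpha):
    """Step-down Holm procedure: stop at the first p-value that fails its threshold."""
    k = len(pvals)
    order = sorted(range(k), key=lambda i: pvals[i])
    rejected = []
    for step, i in enumerate(order):
        if pvals[i] < alpha / (k - step):
            rejected.append(i)
        else:
            break
    return rejected

def benjamini_hochberg(pvals, q):
    """Step-up BH procedure: reject all ranks up to the largest r with p < (r / k) * q."""
    k = len(pvals)
    order = sorted(range(k), key=lambda i: pvals[i])
    last = 0
    for r, i in enumerate(order, start=1):
        if pvals[i] < Fraction(r, k) * q:
            last = r
    return order[:last]

alpha = Fraction("0.05")
# The eight smallest p-values listed in Table 1
pvals = [Fraction(s) for s in
         ["0.0001", "0.0002", "0.0005", "0.001", "0.002", "0.005", "0.01", "0.02"]]
# Placeholder values for the elided ranks 9..100 (assumption, not from the source)
pvals += [Fraction(3, 100) + (r - 9) * Fraction(87, 9100) for r in range(9, 101)]

counts = (len(bonferroni(pvals, alpha)),
          len(holm(pvals, alpha)),
          len(benjamini_hochberg(pvals, alpha)))
print(counts)
```

With these inputs the three procedures reject 2, 3 and 5 null hypotheses respectively, in line with the counts reported for Table 1.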

3.3. Supervised vs. unsupervised methods

As already mentioned, extraction methods can be used instead of feature subset selection methods to achieve data reduction without discarding any feature. Indeed, all data in this case are transformed, while a subset, by definition, always comprises a number of features that is lower than that of the original dataset. Feature extraction can be performed in proteomics in order to reach two different goals: i) to perform data reconstruction and find hidden structures in the dataset, or ii) to predict data and improve the predictive power of a model signature. When the aim is data reconstruction, unsupervised methods are used, while supervised methods are employed to make predictions. More generally, “supervised” means that all input data are labelled, and the algorithms learn to predict the output data using the known information available from the input. By contrast, working “unsupervised” means that the input data are unlabelled. Semi-supervised methods also exist, which sit in between the other two. Here, some input data are labelled and some (usually the majority) are unlabelled.

When choosing between univariate and multivariate methods for feature extraction, both advantages and pitfalls should be considered. Univariate methods are usually simpler than multivariate ones and allow us to rank the input features according to their usefulness for prediction. They work well in practice, but they can make big errors. Multivariate methods are computationally more demanding, but they are also more resistant to errors during selection. The most widely used feature extraction methods in proteomics data analysis are linear discriminant analysis, partial least squares discriminant analysis, principal component analysis and clustering methods.

3.3.1. Linear discriminant analysis (LDA)

LDA is a multivariate method that is often used in proteomics to select a subset of variables (proteins) able to i) discriminate between two (or more) groups, and ii) derive a model for the classification of new observations into these known groups [34]. LDA works with continuous independent variables and a categorical dependent variable. The simplest type of LDA is that in which the categorical variable has two groups (e.g., healthy and diseased state, or treated and untreated). In this case, a linear discriminant function that passes through the means of the two groups (centroids) can be used to discriminate subjects between the two groups and to classify new subjects. As an example, Schultz and colleagues recently published a work based on the use of multivariate modeling with LDA, demonstrating the potential of proteomics as a diagnostic aid in psychiatry [35]. Briefly, by the analysis of serum proteins in subjects with schizophrenia, schizoaffective disorder, bipolar disorder and healthy controls with no psychiatric illnesses, fifty-seven proteins that differed significantly between groups were identified and used for modeling with LDA. A series of binary classification models including 8–12 proteins were identified that produced separation between all diseased subjects and controls, and between each diagnostic group and controls.

In addition to classification, LDA can be a tool for feature selection, since the weighted ranking coefficients associated with the features (scores) can be used, by recursive feature elimination, to draw a signature of the most significant features. Alberio and colleagues applied this kind of approach for the discovery and verification of a panel of lymphocyte proteins as peripheral biomarkers for Parkinson's disease [36].

3.3.2. Partial least squares discriminant analysis (PLS-DA)

Unlike PLS, which relates the data to a continuous variable, PLS-DA is conceptually similar to LDA, with the difference that it handles multiple dependent categorical variables (e.g., different disease classes) very well and is suitable for the analysis of highly noisy datasets [37,38]. This method is used to optimize the degree of separation among different groups of samples. More specifically, it aims at maximizing the covariance between the independent variables (features; proteins in proteomics studies) and the corresponding dependent variables (classes, groups, disease states) by finding a linear subspace of the independent variables. This subspace allows for the prediction of the dependent variable based on a small number of factors, called PLS components. These components describe the behavior of the dependent variables, spanning the subspace onto which the independent variables are projected. The method is widely used in genomics, proteomics and metabolomics data analysis, mainly because of its easy availability in the most popular software packages (e.g., R, SAS, MATLAB). Indeed, PLS-DA has been extensively used in proteomics studies applied to oncology, with the aim of uncovering proteins that could have prognostic value, being tightly associated with cancer differentiation and progression.

For instance, in a recent paper Simeone and colleagues identified forty-eight differentially expressed protein markers for glioblastoma by two-dimensional electrophoresis (2-DE). Then, using PLS-DA as a multivariate tool for protein clustering, they identified among them a unique network of deranged proteins that allowed for the discrimination between high-grade and low-grade glioblastomas, thus representing a novel signaling module controlling pathogenesis and malignant progression [39].
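To make the two-group case of LDA (Section 3.3.1) concrete, here is a minimal sketch for two features, using Fisher's linear discriminant with the decision threshold placed midway between the projected group centroids. The function names, the two-feature restriction (which allows a closed-form 2 × 2 matrix inverse) and the toy abundance values are assumptions made for illustration only.

```python
def mean_vec(points):
    """Centroid of a list of 2-D points."""
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]

def scatter(points, m):
    """Within-class scatter of 2-D points around centroid m, as (sxx, sxy, syy)."""
    sxx = sum((p[0] - m[0]) ** 2 for p in points)
    sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in points)
    syy = sum((p[1] - m[1]) ** 2 for p in points)
    return sxx, sxy, syy

def fit_lda(class0, class1):
    """Return the discriminant weights w and the midpoint threshold."""
    m0, m1 = mean_vec(class0), mean_vec(class1)
    a0, b0, c0 = scatter(class0, m0)
    a1, b1, c1 = scatter(class1, m1)
    a, b, c = a0 + a1, b0 + b1, c0 + c1  # pooled scatter Sw = [[a, b], [b, c]]
    det = a * c - b * b
    d = [m1[0] - m0[0], m1[1] - m0[1]]
    # w = Sw^-1 (m1 - m0), using the closed-form inverse of a 2x2 matrix
    w = [(c * d[0] - b * d[1]) / det, (-b * d[0] + a * d[1]) / det]
    midpoint = [(m0[0] + m1[0]) / 2, (m0[1] + m1[1]) / 2]
    threshold = w[0] * midpoint[0] + w[1] * midpoint[1]
    return w, threshold

def predict(x, w, threshold):
    """Assign class 1 if the projection of x exceeds the threshold, else class 0."""
    return 1 if w[0] * x[0] + w[1] * x[1] > threshold else 0

# Hypothetical abundances of two proteins in controls (class 0) and patients (class 1)
controls = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.5), (1.2, 2.2)]
patients = [(4.0, 4.5), (4.5, 4.0), (5.0, 5.2), (4.2, 4.8)]
w, threshold = fit_lda(controls, patients)
new_subject_class = predict((4.6, 4.4), w, threshold)
```

The magnitudes of the weights in `w` are the kind of per-feature coefficients that could be ranked when LDA is used for feature selection; with more than two features a general linear solver would be needed instead of the closed-form inverse.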

Table 1. Comparison among MTC methods working on the same ordered p-value list.

Bonferroni's method (selected α = 0.05)

n     r     p        Corrected α    Value     p vs. value   Result
1     1     0.0001   α/k            0.0005    <             Significant (null hyp. rejected)
2     2     0.0002   α/k            0.0005    <             Significant (null hyp. rejected)
3     3     0.0005   α/k            0.0005    =             Not significant
4     4     0.001    α/k            0.0005    >             Not significant
5     5     0.002    α/k            0.0005    >             Not significant
6     6     0.005    α/k            0.0005    >             Not significant
7     7     0.01     α/k            0.0005    >             Not significant
8     8     0.02     α/k            0.0005    >             Not significant
…     …     …        …              …         …             …
100   100   0.9      α/k            0.0005    >             Not significant

Holm's method (selected α = 0.05)

n     r     p        Corrected α    Value     p vs. value   Result
1     1     0.0001   α/k            0.0005    <             Significant (null hyp. rejected)
2     2     0.0002   α/(k − 1)      0.00051   <             Significant (null hyp. rejected)
3     3     0.0005   α/(k − 2)      0.00051   <             Significant (null hyp. rejected)
4     4     0.001    α/(k − 3)      0.00052   >             Not significant
5     5     0.002    α/(k − 4)      0.00052   >             Not significant
6     6     0.005    α/(k − 5)      0.00053   >             Not significant
7     7     0.01     α/(k − 6)      0.00053   >             Not significant
8     8     0.02     α/(k − 7)      0.00054   >             Not significant
…     …     …        …              …         …             …
100   100   0.9      α/(k − 99)     0.05      >             Not significant

Benjamini-Hochberg's method (selected Q = 0.05)

n     r     p        Corrected α    Value     p vs. value   Result                             FDR (p × k)/r
1     1     0.0001   (r/k) × Q      0.0005    <             Significant (null hyp. rejected)   0.01
2     2     0.0002   (r/k) × Q      0.001     <             Significant (null hyp. rejected)   0.01
3     3     0.0005   (r/k) × Q      0.0015    <             Significant (null hyp. rejected)   0.016666667
4     4     0.001    (r/k) × Q      0.002     <             Significant (null hyp. rejected)   0.025
5     5     0.002    (r/k) × Q      0.0025    <             Significant (null hyp. rejected)   0.04
6     6     0.005    (r/k) × Q      0.003     >             Not significant                    0.083333333
7     7     0.01     (r/k) × Q      0.0035    >             Not significant                    0.142857143
8     8     0.02     (r/k) × Q      0.004     >             Not significant                    0.25
…     …     …        …              …         …             …                                  …
100   100   0.9      (r/k) × Q      0.05      >             Not significant                    0.9

n: features. r: rank. p: p-value. k: total number of features.

However, although very useful, PLS-DA may provide misleading results when used by non-experts, as nicely reviewed elsewhere [40].

3.3.3. Principal component analysis (PCA)

Proposed by Harold Hotelling in 1933 [41], PCA is perhaps the most widely used multivariate method in proteomics data analysis. In general, PCA involves the transformation of the original variables into new variables, called principal components. These are linear combinations of the original variables, whose most important property is that they are orthogonal to each other. The transformation is performed so that the first principal component accounts for the highest possible variance, which means that it accounts for as much as possible of the variability in the dataset. Then, the second component has the highest possible variance, subject to the constraint that it must be orthogonal to the first one. The same holds for the subsequent components. Basically, the aim of PCA is to reduce the dimensionality of a huge dataset, “preserving as much variability (i.e., statistical information) as possible” [42].

Many examples of application of PCA in the proteomics field exist. In a very recent work by Ludvigsen and colleagues, unsupervised PCA is nicely used for feature selection [43]. Indeed, the separation of two subclusters in peripheral T-cell lymphoma samples was obtained, in addition to the separation of control vs. disease. These two lymphoma subclusters i) correlated with different potential overall survival rates for patients, ii) unveiled large biological diversity between the two subtypes of tumor, and iii) suggested ENO1 as a novel prognostic marker. The point here is that PCA can be a powerful tool not only for classification purposes, but also to extract hidden features (proteins) able to separate disease subtypes with different molecular backgrounds and different prognostic outcomes.

Even though PCA is able to reveal the main hidden patterns in proteomics data, this technique can also detect systematic biases that are either not biologically related or biologically related but unwanted (batch effects). Batch effects are not removed by data normalization and frequently mislead the interpretation of biological omics results. In order to avoid this, both the experimental study plan and the data analysis workflow must include the removal of each known confounding factor, either biological (e.g., age, pharmacological treatment, co-morbidity) or technical (e.g., change in reagents, instruments, technicians). Moreover, the final results must always be validated using new subjects and/or different quantitative techniques (e.g., Western blot, ELISA, IF/IHC).
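As an illustration of the PCA idea in Section 3.3.3 (the first principal component is the direction of highest variance), the sketch below computes the sample covariance matrix and extracts its leading eigenvector by power iteration. The function names, the choice of power iteration and the toy data are assumptions for illustration; in practice a linear algebra library would normally be used.

```python
from math import sqrt

def covariance_matrix(rows):
    """Sample covariance matrix of a samples x features table."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_principal_component(cov, n_iter=200):
    """Power iteration: return (unit direction, variance along that direction)."""
    d = len(cov)
    v = [1.0 / sqrt(d)] * d
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    var = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
    return v, var

# Hypothetical toy data: the second feature is exactly twice the first one
rows = [[t, 2.0 * t] for t in (-2.0, -1.0, 0.0, 1.0, 2.0)]
cov = covariance_matrix(rows)
pc1, pc1_variance = first_principal_component(cov)
total_variance = sum(cov[i][i] for i in range(len(cov)))
explained = pc1_variance / total_variance
```

In this toy case the two features are perfectly collinear, so a single component carries all the variability and the two-dimensional data can be reduced to one dimension without loss.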

3.3.4. Clustering methods

Clustering methods include those approaches in which a group of similar features (e.g., proteoforms belonging to the same gene) is replaced by a single feature, called the centroid. In order to generate the cluster, “similarity” must be statistically evaluated, and the most intuitive criterion to gather similar features is based on their proximity, for example in terms of distances within a tree or network. Examples of this kind of strategy are aggregative nesting [44] and neighbor-joining [45]. Discussing them in detail is beyond the scope of the present review.
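The replacement of a group of similar features by its centroid (Section 3.3.4) can be sketched as follows. Here the grouping is assumed to be already known (proteoforms mapped to a gene label); the function name, the feature and gene identifiers and the values are hypothetical, and the statistical evaluation of similarity that would normally produce the groups is not shown.

```python
from collections import defaultdict

def centroid_features(profiles, groups):
    """Replace each group of features by the mean profile (centroid) of its members.

    profiles: {feature name: [value per sample]}
    groups:   {feature name: group label}
    """
    members = defaultdict(list)
    for feature, values in profiles.items():
        members[groups[feature]].append(values)
    centroids = {}
    for label, vectors in members.items():
        n = len(vectors)
        # zip(*vectors) iterates over samples, averaging across group members
        centroids[label] = [sum(col) / n for col in zip(*vectors)]
    return centroids

# Hypothetical example: two proteoforms of one gene and a single form of another
profiles = {
    "GENE_A_form1": [1.0, 2.0, 3.0],
    "GENE_A_form2": [3.0, 4.0, 5.0],
    "GENE_B_form1": [10.0, 10.0, 10.0],
}
groups = {"GENE_A_form1": "GENE_A", "GENE_A_form2": "GENE_A", "GENE_B_form1": "GENE_B"}
reduced = centroid_features(profiles, groups)
```

Three original features are reduced to two centroid features, one per group.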

3.4. Cross-validation of a proteomics signature

When a proteomics signature (i.e., a panel of features that have been shown to work well as classifiers) is obtained, it is mandatory to assess its performance on new samples. To this aim, either external or internal validation strategies can be employed. External validation, which implies the presence of a test set of samples that has not been used for model construction, should always be considered the best choice. However, when working with proteomics datasets it is not so easy to exclude a set of samples from model construction. The reason is that the number of samples is typically much lower than the number of features, so that the exclusion of a fraction of them in this phase could be detrimental to the robustness of the model itself. Thus, internal validation strategies (i.e., cross-validation, CV), which involve the use of a test set that includes samples already used for model construction, are very frequently used in proteomics studies [46]. k-fold and leave-one-out cross-validations are the most popular methods used in proteomics.

The performance evaluation by ROC curve can not only be performed after the cross-validation of the proteomics signature, but can also be integrated into the feature selection process (see below).

4. Learning from literature: comparison among statistical workflows in proteomics

Recent scientific literature is full of good papers in which new statistical workflows for proteomics data analysis are described and where existing methods are compared to each other based on their performance on the same original dataset. Here we present an overview of some of them, suggesting some guidelines for decision-making when proteomics data must be analyzed.
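The k-fold and leave-one-out schemes named in Section 3.4 differ only in how the samples are partitioned, which the following sketch makes explicit. The function name `k_fold_indices` is hypothetical and, as a simplifying assumption, the samples are split in their given order; real workflows typically shuffle or stratify the samples first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists; fold sizes differ by at most one sample."""
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

# 3-fold cross-validation on 10 samples
folds = list(k_fold_indices(10, 3))
# Leave-one-out is the special case k = number of samples
loo = list(k_fold_indices(10, 10))
```

Each sample appears in exactly one test fold. If feature selection is part of the model, it would normally be repeated on each training fold only, so that the held-out samples do not influence which features are chosen.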

4.1. Classification methods in clinical proteomics: diagnostic protein signatures When talking about current clinical proteomics strategies, the most important issue is that proteomics signatures usually have a low diagnostic power, making it difficult the application to clinics. The main reason is that reproducible feature selection in clinical proteomics can hardly be achieved by classical statistical hypothesis testing (SHT), whose drawbacks have been recently revised [50]. As already described, SHT, which is traditionally used for multiple testing in proteomics [51], involves stating the null hypothesis, choose a statistical test to perform, select a priori significance level, obtain a p-value to be compared to the significance level chosen and then accept (not significant) or reject (significant) the null hypothesis. In this framework, the low predictive power of diagnostic signatures is mainly due to the pvalue variability issue (inevitable), which is worsen in proteomics by additional factors: small sample size [52], missing values [53] and low measurements accuracy [54]. Moreover, biological variability is equally unavoidable and represents a great challenge in clinical proteomics studies. This implies that the experimental setup should be carefully designed to improve this issue: i) the most appropriate model must be identified and the possibility to translate the results from one model to another must be an inclusion/exclusion criterion (e.g., cell lines, tissues, plasma); ii) power calculation must be performed so to identify the adequate sample size, based on both sample type to be analyzed and techniques to be employed. The power of clinical proteomics signatures could be raised empirically, by the increment in sample size and/or reducing measurement errors; unfortunately, this is almost always unfeasible. 
If the power of a proteomics signature cannot be increased empirically within a single proteomics experimental setup, one possibility can be explored that is the combination of that experimental dataset with others retrieved by the literature. We used this approach in a recent paper focused on Parkinson's disease plasma biomarkers [55]. Briefly, experimentally obtained proteomics data have been merged with data from literature, so to obtain a subset of discriminating 2-DE spots that have been subjected to LDA and aggregative nesting. On the other hand, the power can be increased analytically, using signal boosting transformations (SBTs), which increase signal-overnoise (i.e., boosting confident signals with low variability, while penalizing signals with high variability). Alternatively, network-based statistical testing (NBST) can also be used to improve proteomics signatures power, by considering the autocorrelations among proteins belonging to the same protein complex or participating to the same subnet. Examples of such approaches are ranked-based network approaches (RBNAs) [56,57], quantitative proteomics signature profiling (QPSP) [58] and Fuzzy-FishNET [59]. But there is also the possibility to improve the predictive power of protein signatures in clinical proteomics by carefully evaluate which among the existing approaches is the best suited to the specific experimental purpose. Christin and colleagues have recently compared the performance of six different feature selection methods (t-test; Mann-

3.4.1. k-fold CV k-fold cross-validation implies that the original dataset is split into k subsets of the same size. At each cycle of validation, one of the subsets is retained as the test set, while the remaining (k – 1) subsets are used as the training set. The CV process is then repeated k times, so that each of the subsets is used once as the test set. Eventually, all the subsets are used as both training and test sets. k results will be obtained, that can be averaged to obtain a single estimation. Agranoff and colleagues published a work in which they identified diagnostic markers for tuberculosis by proteomic fingerprinting of serum [47]. More in detail, they used a supervised machine-learning approach based on the support vector machine (SVM) to obtain a classifier and then they used k-fold CV and random sampling of the SVM classifier to assess it further. 3.4.2. Leave-one-out CV (LOOCV) LOOCV is a k-fold cross validation where the number of k subsets is equal to the number of samples in the dataset (n). This means that, at each cycle of validation, the training set comprises all the data included in the original dataset but one (leave-one-out), that is used as the test. The CV process is then performed n times. Eventually, as for k-fold, an average error is calculated and used to evaluate the performance of the model. Examples of the use of this CV strategy are widespread in proteomics literature. In a recent work, Yu and colleagues demonstrated that tumor proteomics profiles contain the information to predict the response of ovarian carcinoma patients to platinum drugs. To do so, they used supervised machine learning methods and evaluated the performance of their prediction models through LOOCV [48]. 3.4.3. 
Performance estimation After the CV process has been completed, the likelihood scores can be calculated for each of the samples included in the dataset and the robustness of the model in terms of predictivity (i.e., sensitivity and specificity) can be visualized with a receiver operating characteristics (ROC) curve. Here, the true positives rate is reported in the Y axis, while the false positives rate is in the X axis. The lowest the percentage of false positives with respect to the true ones, the highest the area under the curve (AUC) and the better is the performance of the model signature. An example of this step-by-step procedure can be found in a work of Alberio and colleagues, where the verification step of a protein signature in T-lymphocytes for Parkinson's disease is performed by multiple reaction monitoring [49]. 6

Journal of Proteomics xxx (xxxx) xxx–xxx

M. Lualdi, M. Fasano

Whitney-Wilcoxon test, mww-test; Nearest Shrunken Centroid, NSC; linear Support Vector Machine – Recursive Feature Elimination, SVMRFE; Principal Component Discriminant Analysis, PCDA; PLS-DA) for LC-MS based biomarker discovery in clinical proteomics [60]. The authors concluded that univariate t-test and mww-test with multiple testing corrections are not successfully applicable to small datasets, but their performance improves markedly with the increase of the sample size, reaching a point where these methods work better than the others. So, sample size must be carefully considered when choosing the feature selection method to be applied. On the other hand, PCDA and PLS-DA are very precise in selecting small feature sets, but many true positive features are missed. NSC represents a good compromise, showing good performance independent of the sample size. Eventually, linear SVMRFE performs poorly in feature selection, even if the classification error is quite low. Alternative methods proposed to be used for feature selection in biomarker discovery are Bayesian methods. Bayesian statistics is extensively employed in LC-MS/MS-based proteomics, where it is involved in raw data analysis for the identification of proteins starting from peptides [61]. However, Bayesian approaches have been recently proposed for feature selection, too. We won't go into the details here, but the topic has been extensively described elsewhere [62,63].
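To make the role of multiple testing correction in a univariate filter more concrete, the following minimal sketch applies a step-up false discovery rate procedure of the Benjamini-Hochberg type [31] to a list of per-feature p-values. The choice of this particular correction, the function name `benjamini_hochberg` and the default level `alpha=0.05` are assumptions made for illustration; the sketch does not reproduce the procedure of any of the studies cited above.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the indices of the features declared significant at FDR level alpha."""
    m = len(p_values)
    # Feature indices ordered by increasing p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k (1-based) with p_(k) <= (k / m) * alpha.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff = rank
    # Every feature ranked at or below the cutoff is selected, even if its own
    # p-value exceeded its own threshold.
    return sorted(order[:cutoff])
```

With few samples the p-values are themselves unstable, so the selected set can change markedly between replicate experiments; this is the reproducibility issue discussed in the text.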

4.2. Classification methods for high-dimension proteomics data of mixed quality

A big challenge in proteomics is the joint analysis of datasets deriving from different experimental setups, because of the high level of variability among experiments (mixed quality data). Marchiori and colleagues [64] designed an experimental setup by which they were able to assess the performance of two state-of-the-art feature selection methods (recursive feature elimination, RFE [65], and RELIEF [66]) for the classification of proteomics samples of mixed quality. Two types of classifiers were considered: support vector machines (SVM) and k-nearest neighbor (kNN). LOOCV was used for cross-validation. The results suggested that RELIEF was able to select more stable feature subsets than RFE, whereas RFE performed better than RELIEF in terms of accuracy. Interestingly, almost all the samples that were wrongly classified by the algorithms had been stored at high temperatures, thus indicating that experimental conditions (sample storage conditions in this case) can strongly influence the reproducibility of the analytic workflow. Moreover, the overall results indicated that, when samples of mixed quality are computationally analyzed together, the selection of only relevant features does not necessarily correspond to the highest classification accuracy.

Very recently a new algorithm has been proposed by Conrad and colleagues [67] for feature selection (FS) and classification of high-dimensional proteomics data. Such data are usually very noisy, so the ideal algorithm should be robust against noise and outliers. At the same time, the selected set of features should be as small as possible. This new algorithm, called Sparse Proteomics Analysis (SPA), showed a performance competitive with the most widely used algorithms and was shown to be highly robust against random and systematic noise.

5. Functional analysis of a proteomics profile

So far we have described and compared methods that are able to select, starting from a huge proteomics dataset, a subset of statistically validated “discriminating” features. These features can be useful in different ways, depending on the aim of the experimental setup. For instance, they can represent biomarkers for a particular disease, they can be useful for treatment monitoring and prognosis, or they can be used to distinguish among different disease states. This is crucial, but it is not enough. It can be very informative to try to transform a list of protein names into something that can achieve a functional significance. To this aim, a systems biology approach must be used, so as to identify the specific metabolic/signaling mechanisms involved. In other words, pathway analysis applied to proteomics is aimed at recognizing a short list of crucial biological events in a long list of protein identifiers.

The most widely used method for pathway analysis in proteomics is over-representation analysis (ORA) [68]. The basic hypothesis of ORA is that relevant pathways can be detected if the proportion of differentially expressed proteins belonging to a given pathway exceeds the proportion that would be expected by chance. The probability value for the null hypothesis is usually calculated by Fisher's exact test. An important feature of ORA is that it can also be used as a filter, to eliminate false positives from a set of results. For instance, if nine out of ten proteins in a discriminating list are involved in the apoptotic cell death pathway while only one is a structural cytoskeletal protein, we can be quite confident that the latter may represent a false positive. The choice of the pathway database is crucial in ORA; the most popular databases in proteomics are Reactome [69], KEGG [70] and BioCarta. Although ORA is very efficient at identifying the major biological meaning in large proteomics datasets, it also has several limitations, which can be partially overcome by combining ORA with network methods (e.g., using ORA platforms such as WebGestalt [71]).

An alternative approach for the analysis of large biological data is based on a computational method that determines whether an a priori defined set of genes/proteins shows statistically significant differences between two biological states. Two of the most popular tools are Gene Set Enrichment Analysis (GSEA) [72] and Protein Set Enrichment Analysis (PSEA) [73].

Several papers have been published on this topic in different biomedical research areas, such as neurodegeneration [74], cancer [75,76], and metabolic disorders [77,78]. Moreover, the application of ORA to proteomics data has already been reviewed elsewhere, for example in the context of neurodegeneration [79].

6. Concluding remarks: decision-making in statistical analysis of proteomics data

In this review we have explored several methods used for statistical data analysis to overcome the most important issues of proteomics datasets: high dimensionality and sparsity. We have also given some examples of how these feature selection methods have been employed, improved and compared in the proteomics literature. However, we have also stated that, despite the wide range of possibilities, it is still quite difficult to find the correct statistical workflow for the analysis of a specific dataset. More generally, it is difficult to define simple rules to be followed based on both the experimental setup and the question to be answered. The reason is probably that different analysis workflows can actually lead to the same conclusions.

Based on our experience in the field, we can try to suggest a decision-making process (Fig. 2) that starts from two simple questions: i) how many features, and ii) how many samples. In the most frequent scenario, a proteomics dataset includes many features and many subjects (but fewer subjects than features). In this case a multivariate unsupervised approach can be applied first, so as to select a subset of discriminating features. On the other hand, when the dataset contains many features and few subjects, a filter (univariate and tolerant) can be applied first. Then, discriminant analysis with recursive feature elimination can be used to construct the model, whose performance can be assessed by a ROC curve. Finally, when not so many (~200) features and many samples are present, a multivariate approach can be employed followed by RFE to select the best signature, which will then be cross-validated and assessed in its predictive performance by a ROC curve. Examples of such approaches have already been included in Section 3.

Fig. 2. Decision-making workflow to obtain a proteomics signature. Details are discussed in the main text.

Another question that must be carefully considered is the final goal of the proteomics analysis. If classification is the main purpose of the study, then multivariate supervised methods provide the best results. On the other hand, if the goal is to let hidden patterns within the features emerge, unsupervised methods must be chosen. In the latter situation, batch effects must be carefully taken into account and avoided when feasible.

Even if the right procedure for the statistical analysis of proteomics data may be difficult to define, we have tried here to highlight the pros and cons of several feature selection procedures. As a whole, we hope that this review article may provide sufficient indications to personalize the design of the statistical analysis, depending on the properties and quality of the dataset, the computational resources available, the sample size and the type of biomedical problem being addressed.
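As a closing illustration of the two validation steps that end the workflow (k-fold cross-validation as in Section 3.4.1, followed by ROC assessment as in Section 3.4.3), a minimal sketch is given below. All names (`k_fold_indices`, `roc_auc`, `cross_validated_auc`) and the toy one-feature model (`fit_midpoint`, `score_midpoint`) are hypothetical and introduced only for illustration; in practice the model would be the classifier built on the selected signature.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # fold sizes differ by at most one
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # A tie between a positive and a negative score counts as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def cross_validated_auc(X, y, fit, predict, k=5):
    """Score every sample with a model that never saw it, then compute one AUC."""
    scores = [None] * len(y)
    for train, test in k_fold_indices(len(y), k):
        model = fit([X[i] for i in train], [y[i] for i in train])
        for i in test:
            scores[i] = predict(model, X[i])
    return roc_auc(y, scores)


def fit_midpoint(X_train, y_train):
    """Toy model (hypothetical): midpoint between the class means of feature 0."""
    def mean(values):
        return sum(values) / len(values)
    m0 = mean([x[0] for x, y in zip(X_train, y_train) if y == 0])
    m1 = mean([x[0] for x, y in zip(X_train, y_train) if y == 1])
    return (m0 + m1) / 2


def score_midpoint(model, x):
    """Positive scores point to class 1, negative scores to class 0."""
    return x[0] - model
```

Setting `k` equal to the number of samples turns the same loop into LOOCV. Note that, for the estimate to be honest, any feature selection step should be repeated inside `fit` on the training folds only.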

Acknowledgements

We would like to thank Dr. Tiziana Alberio for helpful discussion.

Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

References

[1] G.E.P. Box, Science and statistics, J. Am. Stat. Assoc. 71 (1976) 791–799, https://doi.org/10.1080/01621459.1976.10480949.
[2] R. Banarjee, A. Sharma, S. Bai, A. Deshmukh, M. Kulkarni, Proteomic study of endothelial dysfunction induced by AGEs and its possible role in diabetic cardiovascular complications, J. Proteomics 187 (2018) 69–79, https://doi.org/10.1016/j.jprot.2018.06.009.
[3] P.E. Khoonsari, S. Musunri, S. Herman, C.I. Svensson, L. Tanum, T. Gordh, K. Kultima, Systematic analysis of the cerebrospinal fluid proteome of fibromyalgia patients, J. Proteomics (2018), https://doi.org/10.1016/j.jprot.2018.04.014.
[4] D. Kreutz, C. Sinthuvanich, A. Bileck, L. Janker, B. Muqaku, A. Slany, C. Gerner, Curcumin exerts its antitumor effects in a context dependent fashion, J. Proteomics 182 (2018) 65–72, https://doi.org/10.1016/j.jprot.2018.05.007.
[5] C. Ramus, A. Hovasse, M. Marcellin, A.-M. Hesse, E. Mouton-Barbosa, D. Bouyssié, S. Vaca, C. Carapito, K. Chaoui, C. Bruley, J. Garin, S. Cianférani, M. Ferro, A. Van Dorssaeler, O. Burlet-Schiltz, C. Schaeffer, Y. Couté, A. Gonzalez de Peredo, Benchmarking quantitative label-free LC-MS data processing workflows using a complex spiked proteomic standard dataset, J. Proteomics 132 (2016) 51–62, https://doi.org/10.1016/j.jprot.2015.11.011.
[6] D. Chiasserini, J.R.T. van Weering, S.R. Piersma, T.V. Pham, A. Malekzadeh, C.E. Teunissen, H. de Wit, C.R. Jiménez, Proteomic analysis of cerebrospinal fluid extracellular vesicles: a comprehensive dataset, J. Proteomics 106 (2014) 191–204, https://doi.org/10.1016/j.jprot.2014.04.028.
[7] B. Manconi, B. Liori, T. Cabras, F. Vincenzoni, F. Iavarone, L. Lorefice, E. Cocco, M. Castagnola, I. Messana, A. Olianas, Top-down proteomic profiling of human saliva in multiple sclerosis patients, J. Proteomics 187 (2018) 212–222, https://doi.org/10.1016/j.jprot.2018.07.019.
[8] X. Liu, Y. Song, Z. Guo, W. Sun, J. Liu, A comprehensive profile and inter-individual variations analysis of the human normal amniotic fluid proteome, J. Proteomics (2018), https://doi.org/10.1016/j.jprot.2018.04.023.
[9] C. Bruce, K. Stone, E. Gulcicek, K. Williams, Proteomics and the analysis of proteomic data: 2013 overview of current protein-profiling technologies, Curr. Protoc. Bioinformatics 13 (2013), https://doi.org/10.1002/0471250953.bi1321s41 Unit13.21.
[10] M.-S. Kim, J. Zhong, A. Pandey, Common errors in mass spectrometry-based analysis of post-translational modifications, Proteomics 16 (2016) 700–714, https://doi.org/10.1002/pmic.201500355.
[11] A. Ma'ayan, Complex systems biology, J. R. Soc. Interface 14 (2017), https://doi.org/10.1098/rsif.2017.0391.
[12] C.D. Broad, Mind and Its Place in Nature, Harcourt, Brace & Company, Inc., New York, 1925. http://archive.org/details/minditsplaceinna00broa (accessed September 20, 2018).
[13] B.K. Hayes, E. Heit, H. Swendsen, Inductive reasoning, Wiley Interdiscip. Rev. Cogn. Sci. 1 (2010) 278–292, https://doi.org/10.1002/wcs.44.
[14] B.K. Hayes, E. Heit, Inductive reasoning 2.0, Wiley Interdiscip. Rev. Cogn. Sci. 9 (2018) e1459, https://doi.org/10.1002/wcs.1459.
[15] Q.-Y. He, J.-F. Chiu, Proteomics in biomarker discovery and drug development, J. Cell. Biochem. 89 (2003) 868–886, https://doi.org/10.1002/jcb.10576.
[16] E.C. Kohn, N. Azad, C. Annunziata, A.S. Dhamoon, G. Whiteley, Proteomics as a tool for biomarker discovery, Dis. Markers 23 (2007) 411–417.
[17] A. Suppers, A.J. van Gool, H.J.C.T. Wessels, Integrated chemometrics and statistics to drive successful proteomics biomarker discovery, Proteomes 6 (2018), https://doi.org/10.3390/proteomes6020020.
[18] L. Bittner, R. Bellman, Adaptive control processes. A guided tour. XVI + 255 S. Princeton, N. J., 1961. Princeton University Press. Preis geb. $ 6.50, ZAMM J. Appl. Math. Mech. 42 (1962) 364–365, https://doi.org/10.1002/zamm.19620420718.
[19] I. Guyon, A. Elisseeff, An introduction to variable and feature selection, J. Mach. Learn. Res. 3 (2003) 1157–1182.
[20] Z.M. Hira, D.F. Gillies, A review of feature selection and feature extraction methods applied on microarray data, Adv. Bioinformatics (2015), https://doi.org/10.1155/2015/198363.
[21] N. Hoque, D.K. Bhattacharyya, J.K. Kalita, MIFS-ND: a mutual information-based feature selection method, Expert Syst. Appl. 41 (2014) 6371–6385, https://doi.org/10.1016/j.eswa.2014.04.019.
[22] L. Yu, H. Liu, Feature selection for high-dimensional data: a fast correlation-based filter solution, Proceedings, Twentieth International Conference on Machine Learning, 2003, pp. 856–863. https://asu.pure.elsevier.com/en/publications/feature-selection-for-high-dimensional-data-a-fast-correlation-ba (accessed October 6, 2018).
[23] M. Robnik-Šikonja, I. Kononenko, Theoretical and empirical analysis of ReliefF and RReliefF, Mach. Learn. 53 (2003) 23–69, https://doi.org/10.1023/A:1025667309714.
[24] M. Radovic, M. Ghalwash, N. Filipovic, Z. Obradovic, Minimum redundancy maximum relevance feature selection approach for temporal gene expression data, BMC Bioinformatics 18 (2017), https://doi.org/10.1186/s12859-016-1423-9.
[25] F. Azuaje, I.H. Witten, E. Frank, Data mining: practical machine learning tools and techniques 2nd edition, Biomed. Eng. Online 5 (2006) 51, https://doi.org/10.1186/1475-925X-5-51.
[26] S. Alelyani, J. Tang, H. Liu, Feature selection for clustering: a review, Data Clustering: Algorithms and Applications, 2013.
[27] R. Kohavi, G.H. John, Wrappers for feature subset selection, Artif. Intell. 97 (1997) 273–324, https://doi.org/10.1016/S0004-3702(97)00043-X.
[28] A. Jović, K. Brkić, N. Bogunović, A review of feature selection methods with applications, 38th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), 2015, pp. 1200–1205, https://doi.org/10.1109/MIPRO.2015.7160458.
[29] C. Bonferroni, Teoria statistica delle classi e calcolo delle probabilità, Pubblicazioni Del R Istituto Superiore Di Scienze Economiche e Commerciali Di Firenze 8 (1936) 3–62.
[30] S. Holm, A simple sequentially rejective multiple test procedure, Scand. J. Stat. 6 (1979) 65–70.
[31] Y. Benjamini, Y. Hochberg, Controlling the false discovery rate: a practical and powerful approach to multiple testing, J. R. Stat. Soc. Ser. B Methodol. 57 (1995) 289–300.
[32] S. Aggarwal, A.K. Yadav, False discovery rate estimation in proteomics, Methods Mol. Biol. 1362 (2016) 119–128, https://doi.org/10.1007/978-1-4939-3106-4_7.
[33] A.P. Diz, A. Carvajal-Rodríguez, D.O.F. Skibinski, Multiple hypothesis testing in proteomics: a strategy for experimental work, Mol. Cell. Proteomics 10 (2011), https://doi.org/10.1074/mcp.M110.004374.
[34] R.A. Fisher, The use of multiple measurements in taxonomic problems, Ann. Eugenics 7 (1936) 179–188, https://doi.org/10.1111/j.1469-1809.1936.tb02137.x.
[35] S.C. Schulz, S. Overgaard, D.J. Bond, R. Kaldate, Assessment of proteomic measures across serious psychiatric illness, Clin. Schizophr. Relat. Psychoses 11 (2017) 103–112, https://doi.org/10.3371/CSRP.SSSO.071717.
[36] T. Alberio, A.C. Pippione, M. Zibetti, S. Olgiati, D. Cecconi, C. Comi, L. Lopiano, M. Fasano, Discovery and verification of panels of T-lymphocyte proteins as biomarkers of Parkinson's disease, Sci. Rep. 2 (2012) 953, https://doi.org/10.1038/srep00953.
[37] S. Wold, M. Sjöström, L. Eriksson, PLS-regression: a basic tool of chemometrics, Chemom. Intell. Lab. Syst. 58 (2001) 109–130, https://doi.org/10.1016/S0169-7439(01)00155-1.
[38] R.G. Brereton, G.R. Lloyd, Partial least squares discriminant analysis: taking the magic away, J. Chemom. 28 (2014) 213–225, https://doi.org/10.1002/cem.2609.
[39] P. Simeone, M. Trerotola, A. Urbanella, R. Lattanzio, D. Ciavardelli, F. Di Giuseppe, E. Eleuterio, M. Sulpizio, V. Eusebi, A. Pession, M. Piantelli, S. Alberti, A unique four-hub protein cluster associates to glioblastoma progression, PLoS One 9 (2014) e103030, https://doi.org/10.1371/journal.pone.0103030.
[40] P.S. Gromski, H. Muhamadali, D.I. Ellis, Y. Xu, E. Correa, M.L. Turner, R. Goodacre, A tutorial review: Metabolomics and partial least squares-discriminant analysis – a marriage of convenience or a shotgun wedding, Anal. Chim. Acta 879 (2015) 10–23, https://doi.org/10.1016/j.aca.2015.02.012.
[41] H. Hotelling, Analysis of a complex of statistical variables into principal components, J. Educ. Psychol. 24 (1933) 417–441, https://doi.org/10.1037/h0071325.
[42] I.T. Jolliffe, J. Cadima, Principal component analysis: a review and recent developments, Philos. Trans. A Math. Phys. Eng. Sci. 374 (2016), https://doi.org/10.1098/rsta.2015.0202.
[43] M. Ludvigsen, M. Bjerregård Pedersen, K. Lystlund Lauridsen, T. Svenstrup Poulsen, S.J. Hamilton-Dutoit, S. Besenbacher, K. Bendix, M.B. Møller, P. Nørgaard, F. d'Amore, B. Honoré, Proteomic profiling identifies outcome-predictive markers in patients with peripheral T-cell lymphoma, not otherwise specified, Blood Adv. 2 (2018) 2533–2542, https://doi.org/10.1182/bloodadvances.2018019893.
[44] W. Dubitzky, M. Granzow, D.P. Berrar, Fundamentals of Data Mining in Genomics and Proteomics, Springer Science & Business Media, 2007.
[45] N. Saitou, M. Nei, The neighbor-joining method: a new method for reconstructing phylogenetic trees, Mol. Biol. Evol. 4 (1987) 406–425, https://doi.org/10.1093/oxfordjournals.molbev.a040454.
[46] J. Kuligowski, D. Pérez-Guaita, G. Quintás, Application of discriminant analysis and cross-validation on proteomics data, Methods Mol. Biol. 1362 (2016) 175–184, https://doi.org/10.1007/978-1-4939-3106-4_11.
[47] D. Agranoff, D. Fernandez-Reyes, M.C. Papadopoulos, S.A. Rojas, M. Herbster, A. Loosemore, E. Tarelli, J. Sheldon, A. Schwenk, R. Pollok, C.F.J. Rayner, S. Krishna, Identification of diagnostic markers for tuberculosis by proteomic fingerprinting of serum, Lancet 368 (2006) 1012–1021, https://doi.org/10.1016/S0140-6736(06)69342-2.
[48] K.-H. Yu, D.A. Levine, H. Zhang, D.W. Chan, Z. Zhang, M. Snyder, Predicting ovarian cancer patients' clinical response to platinum-based chemotherapy by their tumor proteomic signatures, J. Proteome Res. 15 (2016) 2455–2465, https://doi.org/10.1021/acs.jproteome.5b01129.
[49] T. Alberio, K. McMahon, M. Cuccurullo, L.A. Gethings, C. Lawless, M. Zibetti, L. Lopiano, J.P.C. Vissers, M. Fasano, Verification of a Parkinson's disease protein signature in T-lymphocytes by multiple reaction monitoring, J. Proteome Res. 13 (2014) 3554–3561, https://doi.org/10.1021/pr401142p.
[50] W. Wang, A.C.-H. Sue, W.W.B. Goh, Feature selection in clinical proteomics: with great power comes great reproducibility, Drug Discov. Today 22 (2017) 912–918, https://doi.org/10.1016/j.drudis.2016.12.006.
[51] J. Neyman, E.S. Pearson, On the problem of the most efficient tests of statistical hypotheses, Philos. Trans. R. Soc. Lond. A 231 (1933) 289–337, https://doi.org/10.1098/rsta.1933.0009.
[52] K. Jung, Statistical methods for proteomics, Methods Mol. Biol. 620 (2010) 497–507, https://doi.org/10.1007/978-1-60761-580-4_18.
[53] C. Lazar, L. Gatto, M. Ferro, C. Bruley, T. Burger, Accounting for the multiple natures of missing values in label-free quantitative proteomics data sets to compare imputation strategies, J. Proteome Res. 15 (2016) 1116–1125, https://doi.org/10.1021/acs.jproteome.5b00981.
[54] J.D. Egertson, A. Kuehn, G.E. Merrihew, N.W. Bateman, B.X. MacLean, Y.S. Ting, J.D. Canterbury, D.M. Marsh, M. Kellmann, V. Zabrouskov, C.C. Wu, M.J. MacCoss, Multiplexed MS/MS for improved data-independent acquisition, Nat. Methods 10 (2013) 744–746, https://doi.org/10.1038/nmeth.2528.
[55] T. Alberio, E.M. Bucci, M. Natale, D. Bonino, M. Di Giovanni, E. Bottacchi, M. Fasano, Parkinson's disease plasma biomarkers: an automated literature analysis followed by experimental validation, J. Proteomics 90 (2013) 107–114, https://doi.org/10.1016/j.jprot.2013.01.025.
[56] W.W.B. Goh, L. Wong, Evaluating feature-selection stability in next-generation proteomics, J. Bioinform. Comput. Biol. 14 (2016) 1650029, https://doi.org/10.1142/S0219720016500293.
[57] K. Lim, L. Wong, Finding consistent disease subnetworks using PFSNet, Bioinformatics 30 (2014) 189–196, https://doi.org/10.1093/bioinformatics/btt625.
[58] W.W.B. Goh, L. Wong, Advancing clinical proteomics via analysis based on biological complexes: a tale of five paradigms, J. Proteome Res. 15 (2016) 3167–3179, https://doi.org/10.1021/acs.jproteome.6b00402.
[59] W.W.B. Goh, Fuzzy-FishNET: a highly reproducible protein complex-based approach for feature selection in comparative proteomics, BMC Med. Genomics 9 (2016) 67, https://doi.org/10.1186/s12920-016-0228-z.
[60] C. Christin, H.C.J. Hoefsloot, A.K. Smilde, B. Hoekman, F. Suits, R. Bischoff, P. Horvatovich, A critical assessment of feature selection methods for biomarker discovery in clinical proteomics, Mol. Cell. Proteomics 12 (2013) 263–276, https://doi.org/10.1074/mcp.M112.022566.
[61] G. Alterovitz, J. Liu, E. Afkhami, M.F. Ramoni, Bayesian methods for proteomics, Proteomics 7 (2007) 2843–2855, https://doi.org/10.1002/pmic.200700422.
[62] B. Hernández, S.R. Pennington, A.C. Parnell, Bayesian methods for proteomic biomarker development, EuPA Open Proteomics 9 (2015) 54–64, https://doi.org/10.1016/j.euprot.2015.08.001.
[63] N. Dridi, A. Giremus, J.-F. Giovannelli, C. Truntzer, M. Hadzagic, J.-P. Charrier, L. Gerfault, P. Ducoroy, B. Lacroix, P. Grangeat, P. Roy, Bayesian inference for biomarker discovery in proteomics: an analytic solution, EURASIP J. Bioinform. Syst. Biol. 2017 (2017) 9, https://doi.org/10.1186/s13637-017-0062-4.
[64] E. Marchiori, N.H.H. Heegaard, M. West-Nielsen, C.R. Jimenez, Feature selection for classification with proteomic data of mixed quality, IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, 2005, pp. 1–7, https://doi.org/10.1109/CIBCB.2005.1594944.
[65] I. Guyon, J. Weston, S. Barnhill, V. Vapnik, Gene selection for cancer classification using support vector machines, Mach. Learn. 46 (2002) 389–422, https://doi.org/10.1023/A:1012487302797.
[66] K. Kira, L.A. Rendell, The feature selection problem: traditional methods and a new algorithm, Proceedings of the Tenth National Conference on Artificial Intelligence, AAAI Press, San Jose, California, 1992, pp. 129–134. http://dl.acm.org/citation.cfm?id=1867135.1867155 (accessed October 7, 2018).
[67] T.O.F. Conrad, M. Genzel, N. Cvetkovic, N. Wulkow, A. Leichtle, J. Vybiral, G. Kutyniok, C. Schütte, Sparse Proteomics Analysis - a compressed sensing-based approach for feature selection and classification of high-dimensional proteomics mass spectrometry data, BMC Bioinformatics 18 (2017) 160, https://doi.org/10.1186/s12859-017-1565-4.
[68] M.A. García-Campos, J. Espinal-Enríquez, E. Hernández-Lemus, Pathway analysis: state of the art, Front. Physiol. 6 (2015), https://doi.org/10.3389/fphys.2015.00383.
[69] A. Fabregat, S. Jupe, L. Matthews, K. Sidiropoulos, M. Gillespie, P. Garapati, R. Haw, B. Jassal, F. Korninger, B. May, M. Milacic, C.D. Roca, K. Rothfels, C. Sevilla, V. Shamovsky, S. Shorser, T. Varusai, G. Viteri, J. Weiser, G. Wu, L. Stein, H. Hermjakob, P. D'Eustachio, The reactome pathway knowledgebase, Nucleic Acids Res. 46 (2018) D649–D655, https://doi.org/10.1093/nar/gkx1132.
[70] M. Kanehisa, The KEGG database, Novartis Found. Symp. 247 (2002) 91–101; discussion 101–103, 119–128, 244–252.
[71] B. Zhang, S. Kirov, J. Snoddy, WebGestalt: an integrated system for exploring gene sets in various biological contexts, Nucleic Acids Res. 33 (2005) W741–W748, https://doi.org/10.1093/nar/gki475.
[72] A. Subramanian, P. Tamayo, V.K. Mootha, S. Mukherjee, B.L. Ebert, M.A. Gillette, A. Paulovich, S.L. Pomeroy, T.R. Golub, E.S. Lander, J.P. Mesirov, Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles, Proc. Natl. Acad. Sci. U. S. A. 102 (2005) 15545–15550, https://doi.org/10.1073/pnas.0506580102.
[73] M. Lavallée-Adam, N. Rauniyar, D.B. McClatchy, J.R. Yates, PSEA-quant: a protein set enrichment analysis on label-free and label-based protein quantification data, J. Proteome Res. 13 (2014) 5496–5509, https://doi.org/10.1021/pr500473n.
[74] C. Monti, I. Colugnat, L. Lopiano, A. Chiò, T. Alberio, Network analysis identifies disease-specific pathways for Parkinson's disease, Mol. Neurobiol. (2016), https://doi.org/10.1007/s12035-016-0326-0.
[75] L. Fu-Jun, J. Shao-Hua, S. Xiao-Fang, Differential proteomic analysis of pathway biomarkers in human breast cancer by integrated bioinformatics, Oncol. Lett. 4 (2012) 1097–1103, https://doi.org/10.3892/ol.2012.881.
[76] H. Xie, W. Wang, F. Sun, K. Deng, X. Lu, H. Liu, W. Zhao, Y. Zhang, X. Zhou, K. Li, Y. Hou, Proteomics analysis to reveal biological pathways and predictive proteins in the survival of high-grade serous ovarian cancer, Sci. Rep. 7 (2017) 9896, https://doi.org/10.1038/s41598-017-10559-9.
[77] F. Bertile, T. Raclot, Proteomics can help to gain insights into metabolic disorders according to body reserve availability, Curr. Med. Chem. 15 (2008) 2545–2558.
[78] O.A. Rangel-Zúñiga, A. Camargo, C. Marin, P. Peña-Orihuela, P. Pérez-Martínez, J. Delgado-Lista, L. González-Guardia, E.M. Yubero-Serrano, F.J. Tinahones, M.M. Malagón, F. Pérez-Jiménez, H.M. Roche, J. López-Miranda, Proteome from patients with metabolic syndrome is regulated by quantity and quality of dietary lipids, BMC Genomics 16 (2015) 509, https://doi.org/10.1186/s12864-015-1725-8.
[79] M. Fasano, C. Monti, T. Alberio, A systems biology-led insight into the role of the proteome in neurodegenerative diseases, Expert Rev. Proteomics 13 (2016) 845–855, https://doi.org/10.1080/14789450.2016.1219254.
