Expert Systems With Applications 90 (2017) 40–49
A semi-supervised learning approach for model selection based on class-hypothesis testing

Juan M. Gorriz a,c,*, Javier Ramirez a, John Suckling c, F.J. Martinez-Murcia a, I.A. Illán a, F. Segovia a, A. Ortiz b, D. Salas-González a, D. Castillo-Barnés a, C.G. Puntonet d

a Department of Signal Theory, Networking and Communications, Avda. Fuentenueva s/n 18071, University of Granada, Spain
b Department of Communication Engineering, Campus de Teatinos s/n 29071, University of Málaga, Spain
c Department of Psychiatry, Robinson Way, CB2 0SZ, University of Cambridge, UK
d Department of Computer Architecture and Technology, C/ Daniel Saucedo s/n 18071, University of Granada, Spain
Article history: Received 3 April 2017; Revised 13 July 2017; Accepted 1 August 2017; Available online 5 August 2017

Keywords: Statistical learning and decision theory; Semi-supervised learning; Support vector machines (SVM); Hypothesis testing; Partial least squares
Abstract

This paper deals with the topic of learning from unlabeled or noisy-labeled data in the context of a classification problem. In a classification problem the outcome takes one of a discrete set of values; thus, assumptions on those values can be established to obtain the most likely prediction model at the training stage. In this paper, a novel case-based model selection method is proposed that combines hypothesis testing from a discrete set of expected outcomes with feature extraction within a cross-validated classification stage. This wrapper-type procedure acts on fully observable variables under hypothesis testing and improves the classification accuracy on the test set, or at least keeps its performance at the level of the statistical classifier. The model selection strategy in the cross-validation loop allows building an ensemble classifier that could improve the performance of any expert and intelligent system, particularly on small-sample-size datasets. Experiments were carried out on several databases, yielding a clear improvement on the baseline, e.g., on the SPECT dataset Acc = 86.35 ± 1.51, with Sen = 91.10 ± 2.77 and Spe = 81.11 ± 1.61. In addition, the CV error estimate for the classifier under our approach was found to be an almost unbiased estimate (as with the baseline approach) of the true error that the classifier would incur on independent data.

© 2017 Elsevier Ltd. All rights reserved.
1. Introduction

Statistical learning theory (SLT) is a recently developed area in statistics that has been successfully applied to several fields including machine learning and artificial intelligence (Hastie, Tibshirani, & Friedman, 2001; James, Witten, Hastie, & Tibshirani, 2013; Vapnik, 1998). From least squares methods for linear regression, proposed at the very beginning of the nineteenth century, to the novel advances in machine learning of the early 90s, such as random forests, Support Vector Machines (SVM), bagging or boosting (Breiman, Friedman, Olshen, & Stone, 1984; Hastie et al., 2001; Vapnik, 2000), SLT has become a new paradigm focusing on supervised and unsupervised modeling and prediction, i.e., the
development of Computer-Aided Diagnosis (CAD) systems (Gur et al., 2004; Illán et al., 2010; Padilla et al., 2012; Suzuki, Li, Sone, & Doi, 2005). On the other hand, decision theory is the application of statistical hypothesis testing to the detection of signals in noise (Kay, 1993). Because under hypothesis testing we are essentially attempting to detect a desired pattern or to classify it as one of a set of possible patterns, it is also referred to as a pattern recognition or classification problem (Fukunaga, 1990).

The most common form of machine learning is supervised learning (LeCun, Bengio, & Hinton, 2015). In this case, a quantitative response Y and several predictors {Xk}, for k = 1,...,p, are observed, and the aim is to discover the relationship among them, which can be written in the general form Y = f(X) + ε, where f is an unknown function of the predictors and ε is a random error term. In this way, supervised learning refers to a set of approaches for estimating f based on a set of known predictors and responses (James et al., 2013). When supervised learning does not involve predicting a quantitative value but a qualitative response or class, it is known as a classification problem. In the latter case, once a final classifier f̂ has been estimated, it can be used to predict the classes of the test samples.
Fig. 1. Diagram of the model selection approach (top) versus the common supervised FE learning approach (bottom) on the training set.
Another field of research that has drawn the attention of the machine learning community is semi-supervised learning (SSL) (Chapelle, Scholkopf, & Zien, 2006). SSL belongs to the supervised category, but in this case we have access to an additional unlabeled sample {Xp+j}, for j = 1,...,m, or to samples with a few noisy initial labels (Lu, Gao, Wang, Wen, & Huang, 2015). In general, the solution to this subcategory is broadly based on two approaches: avoiding the use of unlabeled data, or treating the unobserved Y variables as latent-class variables in the estimation of the system parameters (Chapelle et al., 2006). In recent years, several feature selection methods based on Information Theory have been proposed for fully supervised data: filter methods (the one used in this paper as a baseline), embedded methods and wrapper methods (Brown, Pocock, Zhao, & Luján, 2012; Guyon, Gunn, Nikravesh, & Zadeh, 2006). Under these approaches the features X are selected by quantifying the information that they share with the class variable Y. However, on partially labeled datasets, surrogate variables can be introduced to derive ranking-equivalent approaches using all available information in an entirely classifier-independent and inference-free fashion (Sechidis, 2015). Some surrogate approaches assume the label of the unobservable variable Y and are found to be valid and informative for performing hypothesis testing for feature selection (Sechidis, Calvo, & Brown, 2014).

1.1. General outline

Before estimating the classifier f, relevant and non-redundant features are usually extracted from the raw data to facilitate the subsequent learning and generalization steps (Varol, Gaonkar, Erus, Schultz, & Davatzikos, 2012). Based on the previous ideas and feature extraction (FE) schemes, we investigate the possibility of using a semi-supervised model selection algorithm based on hypothesis testing applied to the responses or outcomes. In Fig. 1, we show the differences between our methodology (top) and the baseline (bottom) at the training stage used to derive the classifier f̂. Learning from data samples involves, at this stage, model fitting by the use of observed variables and their labels (outcomes), which are split into two groups, the training and the validation sets. The use of a validation set at the training stage allows selecting the classifier whose actual risk S(x) is minimal, i.e., by parameter tuning. Finally, a test set can be employed only to assess
the performance (generalization) of a fully specified classifier and to avoid overfitting (Ripley, 1996).

FE may be applied to surrogate variables, shown in the latter figure as the class information Y0,1, within the cross-validation (CV) loop, to obtain extended feature datasets by hypothesizing on the unknown outcomes of the validation patterns. The statistical consequences of each hypothesis can be analyzed in terms of probability within a Bayesian framework, as proposed in this paper (the likelihood ratio test (LRT) block in Fig. 1), that is, in a classifier-independent fashion, unlike embedded or wrapper methods. Another possibility is to evaluate the classifier configuration derived from the feature datasets, i.e., the probability map of the support vectors (Padilla et al., 2012). The influence of the validation pattern on the prediction models, i.e., a trained SVM, will depend on the relevance of the features that represent the validation samples in the feature space (Chapelle et al., 2006). Assuming the features to be relevant, a decision function can be formulated in terms of class-conditional probabilities following the Neyman-Pearson (NP) lemma. The resulting LRT is similar to the one used in classical linear (LDA) or quadratic (QDA) discriminant analysis, but evaluated on two different feature datasets. Finally, the overall system can be seen as a wrapper-type method since, although the feature selection is classifier-independent, the system builds the final ensemble classifier based on a maximization process over several feature subsets (Martinez-Murcia, Górriz, & Ramírez, 1999).

This paper is organized as follows. In Section 2, a background on the NP approach to signal detection is provided. In Section 3, classical FE methods, such as Least Squares (LS) and Partial LS (PLS), are applied to semi-supervised datasets to obtain two feature extractions of the training database. As a result, hypothesis testing theory is employed to provide a novel framework for model selection as part of the general tools for assessing statistical accuracy, such as CV or bootstrap methods (Varma & Simon, 2006). The resulting LRT is the optimal tradeoff between type I and type II errors, and is employed for FE under certain assumptions. The set of assumptions comprises Gaussian modeling of the conditional probabilities, feature relevance, and statistical independence among feature components. Finally, in Section 5, a full experimental framework is provided to demonstrate the benefits of the proposed approach acting on baseline filter-based approaches, i.e., using LS and PLS FE methods and an SVM learning algorithm that minimizes the leave-one-out (LOO) CV error. In Section 6, conclusions are drawn.

2. NP approach to signal detection: A background

Assume we observe a training set of random variables Z = {X ∈ R^p, Y ∈ R}. The realization of the outcome variable Y is modeled by normal distributions N(μi, σi), with mean μi and variance σi, for i = 1,...,C, where C denotes the number of outcomes or classes. In general, we must therefore determine whether μ = μi for a single observation under a multiple C-ary hypothesis test using an NP criterion. However, this is hardly used in practice, and the minimum probability of error criterion is used instead (Kay, 1993). For a binary problem the test is defined as:
H0: μ = 0;    H1: μ = 1    (1)

where every possible value of μ is thought of as one of two competing hypotheses. In terms of the observed variable x, the hypotheses can be reformulated as:

H0: x; Y = 0;    H1: x; Y = 1    (2)
Thus, we are implicitly assuming that the hypothesis test on the unobserved class Y can be reformulated in terms of the observed pattern value x, via the joint pdf p(X, Y) or an unknown function f: X → Y, an issue that is common in signal detection problems (Gorriz, Ramirez, Lang, & Puntonet, 2008).
Fig. 2. Schematic of the database extension in the CMS approach. The statistical accuracy of the quantity S(Z) is computed by selecting the most likely dataset Z*. 2^M data sets Zk, each of size [(N + M) × p, (N + M) × 1], are drawn by considering all the possibilities for the binary variable Y.
Under the NP approach, we try to maximize the probability of detection PD = P(H1; H1) of one of the hypotheses when it is true, for a given significance level or probability of false alarm P(H1; H0). In particular, H1 is decided if the LRT holds:
L(x) = p(x; H1)/p(x; H0) > γ    (3)
3. Semi-supervised feature extraction based on P-LS

The process of model fitting at the learning stage depends on how the input patterns are related to their outcomes. In neuroimaging, for example, the labeling process is performed via a careful examination by the physician, and affects how the learner is adjusted to obtain the best model, the one that provides the best estimation error, i.e., S(Z). Linear methods for classification, such as LDA or QDA, are based on an LRT similar to the one shown in Eq. (3), but evaluate the hypothesis test on the raw data, i.e., the input pattern X is assumed to be observed under the null and alternative hypotheses in order to check which state is more likely. In this paper, a different approach, the so-called case-based method for model selection (CMS), is considered: realizations of the input patterns are obtained under H0 and H1. This is achieved through a supervised FE scheme on extended datasets, in which we consider all the possibilities, equal to the number of classes, for a single validation pattern. By performing FE on these datasets, we select the one that maximizes a discriminant function, as defined in the following sections. A schematic representation of the general method for M input patterns is shown in Fig. 2.
3.1. LS for the class-hypothesis-based FE

Let Z = [S, Y] be a training set, where S is the set of training samples xi, for i = 1,...,N, with their respective outcomes Y = (y1,...,yN)^T. In the LS method, we estimate a vector of parameters w, which determines the hyperplane w^T x = 0, by minimizing a squared-error cost function (Theodoridis, Pikrakis, Koutroumbas, & Cavouras, 2010). Given the validation pattern [x, k], for k = 1, 0, the LS estimates for the extended training sets Zk = Z ∪ [x, k] are related to each other as:

wk = wk̄ ± (Xe^T Xe)^(-1) x    (4)

where

wk = (Xe^T Xe)^(-1) Xe^T Yk    (5)

is the optimum LS hyperplane, and Xe = [x^T, x1^T,...,xN^T]^T, Yk = [k, Y]^T are the extended dataset and outcomes, respectively; k̄ stands for the logical negation of class k. Once a w is estimated, an x is classified to class k (k̄) if the projection w^T x > 0 (< 0). This one-dimensional projection is used to extract novel feature datasets:

Z̃k = [wk^T Xe, Yk]    (6)

under the class-hypothesis k; then a log-LRT, as shown in the following section, is evaluated for model selection (H0 vs. H1). Once this Bayesian Information Criterion (BIC)-type model selection is applied, features are extracted accordingly and the final SVM model is fitted in the LOO-CV loop. As an example, consider N = 200 realizations from two Gaussian distributions with mean values m1 = [0, 0] and m2 = [1, 1] and covariance matrix Σ = [0.9 0.3; 0.3 0.8]; an LS FE-based model selection provides two models w1,2 and feature datasets Z̃1,2, depicted in Fig. 3. In the upper panel we represent the input data and the unlabeled test sample in red (class 1). The blue-shaded area represents the subspace between the two LS estimators in which they assign different classes to samples that fall within it. In the lower panel the two one-dimensional extended feature datasets are depicted, including the validation feature (black cross).

Fig. 3. CMS approach for LS-FE. Note the blue-shaded region, which defines the difference between the models under the two hypotheses ("xor" logical operation between them). A false hypothesis on the validation pattern (top: red circle; bottom: black cross labeled TP) results in a faulty classification of the training samples s that fall in the margin. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
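To make the LS step concrete, the following minimal sketch (Python/NumPy) builds the two extended 1-D feature datasets Z̃k of Eqs. (4)-(6); the function and variable names are illustrative, and the small ridge term eps added for numerical stability is our assumption, not part of the original formulation.

import numpy as np

def ls_hypothesis_features(X, y, x_val, eps=1e-8):
    """Return the two extended one-dimensional feature datasets Z~_k, k in {0, 1}.

    X     : (N, p) labeled training samples
    y     : (N,)   binary outcomes in {0, 1}
    x_val : (p,)   unlabeled validation pattern
    """
    X_e = np.vstack([x_val, X])                 # extended design matrix X_e
    Z = {}
    for k in (0, 1):
        y_k = np.concatenate([[k], y])          # hypothesised outcomes Y_k
        # optimum LS hyperplane, Eq. (5); eps regularises the inversion
        G = X_e.T @ X_e + eps * np.eye(X_e.shape[1])
        w_k = np.linalg.solve(G, X_e.T @ y_k)
        Z[k] = (X_e @ w_k, y_k)                 # Z~_k = [w_k^T X_e, Y_k], Eq. (6)
    return Z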
3.2. PLS for a class-hypothesis-based FE

PLS (Helland, 2001) is a statistical method which models relationships among sets of observed variables by means of latent variables. It includes regression analysis and classification tasks, and is intended as a dimension reduction technique or a modeling tool. The starting point for PLS is the very simple assumption that the observed data are generated by a system or process which is driven by a smaller number of latent (not directly observed or measured) variables. In its general form, PLS is a linear algorithm (e.g., SIMPLS (de Jong, 1993)) for modeling the relation between Xe and Yk by decomposing them into the form:

Xe = Xs,k Xl,k^T + Xr,k    (7)

Yk = Ys,k Yl,k^T + Yr,k    (8)

where Xs,k, Ys,k are (N + 1) × Ncomp matrices of the Ncomp extracted score vectors (components or latent vectors), Xl,k, Yl,k are p × Ncomp matrices of loadings, and Xr,k, Yr,k are (N + 1) × p matrices of residuals (or error matrices). The xs,k-scores in Xs,k are linear combinations of the x̃-variables and can be considered good "summaries" of them. Finally, we extract the novel feature datasets as Z̃k = [Xs,k, Yk].
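A corresponding sketch for the PLS case, using scikit-learn's PLSRegression; note this is a NIPALS-type implementation rather than SIMPLS, a substitution of convenience, and the names are again illustrative.

import numpy as np
from sklearn.cross_decomposition import PLSRegression

def pls_hypothesis_features(X, y, x_val, n_comp=3):
    """Return the two extended score datasets Z~_k = [X_s,k, Y_k], k in {0, 1}."""
    X_e = np.vstack([x_val, X])
    Z = {}
    for k in (0, 1):
        y_k = np.concatenate([[k], y]).astype(float)
        # n_comp must not exceed min(N + 1, p)
        pls = PLSRegression(n_components=n_comp).fit(X_e, y_k)
        # the x-scores X_s,k are the latent "summaries" of the extended data
        Z[k] = (pls.x_scores_, y_k)
    return Z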
4. LRT under the class-hypothesis: The case-based approach

Let us simplify notation by defining the feature training set as Z = [S, Y] and the unlabeled feature realization as x = (x1,...,xp) of the input feature pattern X, under the class-hypothesis Hk = Yk for Yk = {1, 0}. Assume the joint pdf p(X, S; Hk) is known under both hypotheses. To maximize the power of the test for a given significance level, the NP approach is transformed into:

L(X) = p(X = x, S = s; H1)/p(X = x, S = s; H0) = [p(x|s; H1) p(s; H1)] / [p(x|s; H0) p(s; H0)] > γ    (9)

where γ is a constant threshold and, if the inequality holds, we decide H1. Assuming the pdf of s does not depend on the hypothesis (or these joint distributions are equivalent) and the input pattern x is relevant to the detection problem, the LRT can be rewritten as:

L(X) = p(x|s; H1)/p(x|s; H0) > γ    (10)

The assumption of relevance is necessary to obtain Eq. (10), since then the inequality p(x|s; Y1) ≠ p(x|s; Y0) holds (see Appendix).¹ Suppose that we model each conditional density as a multivariate Gaussian:

p(x|s; Hk) = (2π)^(-p/2) |Σk|^(-1/2) exp(-(1/2)(x - μk)^T Σk^(-1) (x - μk))    (11)

where μk, Σk are the mean vector and the covariance matrix, which are estimated using the training set Z.² Computing the log-LRT, we finally get:

L(X) = δ1(x)/δ0(x)    (12)

where

δk(x) = -(1/2) log|Σk| - (1/2)(x - μk)^T Σk^(-1) (x - μk)    (13)

is a quadratic discriminant function similar to the one obtained in QDA, but defined on the two competing hypotheses in Eq. (2). Moreover, notice how LDA is a particular case of this expression for Σk = Σ and x; Yk = x, ∀k. This slight alteration makes the method a completely novel approach, since this model allows us to select the most likely class when one of the two hypotheses is true. When the dimension of the feature space is high, density estimation is unattractive, thus an independence assumption is usually employed (naive Bayes model). Given a class Yk, if the features xj are independent (p(x|s, Hk) = ∏_{j=1}^{p} pj(xj|s, Hk)), the quadratic discriminant function transforms into:

δk(x) = -(1/2) Σ_{j=1}^{p} log σjk² - (1/2) Σ_{j=1}^{p} (xj - μkj)²/σjk²    (14)

where μkj and σjk are the k-class mean and variance of the jth feature component. The whole process, comprising feature extraction and model selection, is depicted in Fig. 1 and summarized in Algorithm 1.

Algorithm 1 Case-based Model Selection Algorithm.
Input: collections S (p × N) of labeled and X (p × M) of unlabeled samples
Initialisation:
1: Build the subsets Zk = Z ∪ [X, Yk]
2: for k = 1 to 2^M do
3:    Feature-extract Z̃k using P-LS (E step)
4:    Use the current subsets to estimate p(x|s, Hk)
5: end for
6: (M step) Re-estimate the most likely subset by k̂ = arg max_k (δk(x))
7: return k̂
8: Estimate f_k̂ according to Z̃_k̂
Output: a classifier (Eq. (15)) that takes the test samples x and predicts their class label.

¹ In terms of the ability to classify, this ratio is equivalent to having the class posteriors for optimal classification (P(H|x)) (Hastie et al., 2001). It is worth mentioning that the resulting Bayes factor in Eq. (10) is connected to the BIC approach for model selection in Ripley (1996). Here the selected model is expressed in terms of the conditional probability of the input pattern x given the training patterns s.
² Again, this assumption follows the same line of reasoning as the Laplacian approximation proposed by Ripley (1996) for model selection.
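To make the selection step concrete, here is a minimal sketch of the hypothesis selection of Algorithm 1 under the naive-Bayes discriminant of Eq. (14). Estimating the per-hypothesis statistics directly from the extended feature sets (as returned by the FE sketches above) is a simplified reading of the method, and the small variance floor is our addition.

import numpy as np

def delta_naive(x, mu, var):
    # Eq. (14): quadratic discriminant under the independence assumption
    return -0.5 * np.sum(np.log(var)) - 0.5 * np.sum((x - mu) ** 2 / var)

def select_hypothesis(features):
    """features maps k -> (scores, labels); scores[0] is the validation feature."""
    deltas = {}
    for k, (S, y_k) in features.items():
        x = S[0]                      # validation feature under hypothesis k
        train, lbl = S[1:], y_k[1:]   # class statistics from the training part
        mu = train[lbl == k].mean(axis=0)
        var = train[lbl == k].var(axis=0) + 1e-9   # guard against zero variance
        deltas[k] = delta_naive(x, mu, var)
    return max(deltas, key=deltas.get)  # k_hat = argmax_k delta_k(x)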
4.1. Classification

In this section, we propose a simple classification stage to adjust the statistical classifiers (model fitting) using the improved datasets, i.e., a linear SVM classifier. In this sense, the LRT process is repeated in a CV loop, adjusting each classifier to the novel
datasets; the predictions from all of them can then be combined through a weighted majority vote to produce the final prediction (Hastie et al., 2001):

f(x) = sign( Σ_{k=1}^{N} αk fk(x) )    (15)

where N is the number of folds in the CV loop (the sample size in LOO-CV), fk is the linear SVM classifier tuned after model selection in each iteration, and the selection of α gives higher influence to the more accurate classifiers in the sequence. In the experiments αk = 1/N was selected, assuming that no prior knowledge on the validation labels is available, although the weights could also be computed by means of a boosting-type algorithm. This selection is also motivated by the aim of assessing the optimization due to the feature training sets defined in the previous section, isolating it from the benefits of boosting.
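As an illustration, a sketch of the voting rule of Eq. (15) with uniform weights αk = 1/N, assuming scikit-learn-style classifiers exposing decision_function (an illustrative choice, e.g., LinearSVC):

import numpy as np

def ensemble_predict(classifiers, X_test):
    """Weighted majority vote of Eq. (15) with alpha_k = 1/N."""
    N = len(classifiers)
    votes = np.zeros(X_test.shape[0])
    for f_k in classifiers:
        # each fold's classifier votes with weight 1/N
        votes += np.sign(f_k.decision_function(X_test)) / N
    return np.sign(votes)             # f(x) = sign(sum_k alpha_k f_k(x))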
Table 1. Demographic details of the SPECT dataset. AD1 = possible AD, AD2 = probable AD, AD3 = certain AD. μ and σ stand for population mean and standard deviation, respectively.

        #samples    Sex (M/F) (%)    Age: μ [range/σ]
NOR     41          32.95/12.19      71.51 [46-85/7.99]
AD1     29          10.97/18.29      65.29 [23-81/13.36]
AD2     22          13.41/9.76       65.73 [46-86/8.25]
AD3     4           0/2.43           76 [69-83/9.90]
5. Experiments

In this section, a set of experiments is carried out on several image databases where the small-sample-size problem is typically an issue, together with the presence of noisy-labeled data, e.g., a SPECT image database (Górriz, Segovia, Ramírez, Lassl, & Salas-González, 2011). The motivation for analyzing this kind of dataset is twofold. On the one hand, this is the typical scenario in biomedical datasets, i.e., structural/functional imaging, where the visual expert labeling is always prone to error. On the other hand, these datasets usually consist of millions of variables/predictors and a limited sample size, thus the effect of the proposed approach can be detected just by using a LOO-CV strategy. To achieve this analysis, a fair comparison using the same FE and statistical validation schemes for the proposed CMS approach and the baseline filter-based methods is performed. In both cases the error estimation is obtained by LOO-CV with a linear SVM classifier to avoid overfitting. A naive-Bayes model is tested in the log-LRT decision rule that defines the CMS approach, while the number of extracted components for the PLS approaches is varied, i.e., from 1 to 10, showing average results and standard deviations. Standard statistical measures are computed, such as accuracy (Acc), sensitivity (Sen), specificity (Spe), positive likelihood (PL), negative likelihood (NL) and the confusion matrix (ConfM), in order to assess the system performance, although other measures could be analyzed as well (Liang, Liang, Li, Li, & Wang, 2016). Finally, the bias of the error estimation is also evaluated for both approaches.

5.1. SPECT image database

Baseline SPECT data from 96 participants were collected from the "Virgen de las Nieves" hospital in Granada (Spain) (Górriz et al., 2011). The patients were injected with a gamma-emitting 99mTc-ECD radiopharmaceutical and the SPECT raw data were acquired by a three-head gamma camera Picker Prism 3000. A total of 180 projections were taken with a 2-degree angular resolution. The images of the brain volumes were reconstructed from the projection data using the filtered backprojection (FBP) algorithm in combination with a Butterworth filter for noise removal. The SPECT images were spatially normalized using the SPM software (Friston, Ashburner, Kiebel, Nichols, & Penny, 2007) in order to ensure that the voxels in different images refer to the same anatomical positions in the brain. After the spatial normalization, a 95 × 69 × 79 voxel representation of each subject was obtained, where each voxel represents a brain volume of 2.18 × 2.18 × 3.56 mm³. Finally, the intensities of the SPECT images were normalized with a maximum intensity value Imax, computed for each image by averaging over the 3% highest voxel intensities.
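As an illustration, this intensity normalization step could be sketched as follows (NumPy-based; the array and function names are ours):

import numpy as np

def normalize_intensity(vol, top_fraction=0.03):
    """Scale a SPECT volume by I_max, the mean of its 3% highest voxel values."""
    v = vol.ravel()
    n_top = max(1, int(top_fraction * v.size))
    i_max = np.sort(v)[-n_top:].mean()   # average of the highest 3% of voxels
    return vol / i_max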
Fig. 4. Axial example slices (# 30) of four subjects of the SPECT database. Left to right, top to bottom: NOR, AD1, AD2, AD3.
The SPECT images were visually classified by experts of the "Virgen de las Nieves" hospital using four different labels: normal (NOR) for patients without any symptoms of Alzheimer's disease (AD), and possible AD (AD1), probable AD (AD2) and certain AD (AD3) to distinguish between different levels of the presence of typical characteristics of AD. Overall, the database consists of 41 NOR, 29 AD1, 22 AD2 and 4 AD3 patients; however, they were classified considering the binary case NOR (41) vs. AD (55). Table 1 shows the demographic details of the database, and in Fig. 4 some examples from the dataset are depicted.

5.2. INRIA person dataset

The INRIA pedestrian database contains 1805 64 × 128 images of humans in any orientation and against a wide variety of backgrounds, including crowds (see some examples in Fig. 5). Many are bystanders taken from the image backgrounds, so there is no particular bias in their pose (Dalal & Triggs, 2005). In these applications, data preprocessing usually requires the computation of normalized histograms of image local gradient orientations (HOG) on a dense grid; then a classification stage is applied, based on several paradigms, such as a Bayesian framework (Wu & Nevatia, 2007), morphological analysis (Barnich, Jodogne, & Van Droogenbroeck, 2006), ensemble learning (Viola & Jones, 2001) or artificial neural networks (Geng et al., 2017; Zhao & Thorpe, 2000), to categorize the visual features. The detection window is decomposed into 50% overlapping blocks (7 × 15 blocks, 4 cells per block, 9 bins per cell), yielding positive and negative HOG feature vectors of size Nd = 3780. In our experiments, a randomly selected and balanced subset of this database is processed in order to extract the HOG features, which are further processed by selecting the 7 most discriminant of the 9 bins per cell. As a result, a 42 × 2940 dataset containing 21 positive and 21 negative HOG features is considered for FE and classification; a sketch of the HOG computation is given below.
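Under the stated window geometry, the 3780-dimensional descriptor can be reproduced with scikit-image's hog; the parameter values below are our reading of the Dalal-Triggs configuration, and the random window stands in for a real image crop.

import numpy as np
from skimage.feature import hog

def hog_descriptor(window):
    """HOG on a (128, 64) grayscale window: 9 bins, 8x8 cells, 2x2-cell blocks."""
    return hog(window, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm='L2-Hys')

feat = hog_descriptor(np.random.rand(128, 64))
assert feat.size == 3780   # (7 x 15 blocks) x (4 cells per block) x (9 bins)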
Fig. 5. Negative and positive examples extracted from the INRIA dataset (Dalal & Triggs, 2005).
Although this is not the typical scenario in the problem of pedestrian detection, the database is well suited to our purposes, as it amounts to dealing with the problem of small sample sizes and to analyzing the effect of a strongly correlated set of features.
Table 2. Statistical measures of performance for the proposed methods (C-LS, C-PLS) and the baseline approaches (LS, PLS) on the standardized databases. The values of Acc, Sen and Spe for the baseline and proposed methods are obtained by varying the number of PCA/PLS components n = 1,...,Ncomp; their std is in the order of 10^-2 and 10^-1 (INRIA database), respectively.

2D-Gaussians   LS          C-LS        PLS         C-PLS
Acc (%)        61.50       64.5        62          62.5
Spe (%)        63          68          65          66
Sen (%)        60          61          59          59
PL             1.62        1.90        1.68        1.74
NL             0.63        0.57        0.63        0.62
ConfM          [60 37;     [61 32;     [59 35;     [59 34;
                40 63]      39 68]      41 65]      41 66]

INRIA          LS          C-LS        PLS         C-PLS
Acc (%)        90.48       94.52       92.86       96.67
Spe (%)        90.48       97.05       90.91       95.60
Sen (%)        90.48       93          95          98.21
PL             9.50        31.55       10.45       22.31
NL             0.11        0.07        0.05        0.019
ConfM          [190 20;    [193 6;     [400 40;    [412 20;
                20 190]     17 204]     20 380]      8 400]

SPECT          LS          C-LS        PLS         C-PLS
Acc (%)        72.53       73.06       78.96       86.35
Spe (%)        63.12       63.22       73.35       81.11
Sen (%)        90.05       91.94       84.07       91.10
PL             2.45        2.49        3.15        4.82
NL             0.15        0.13        0.22        0.11
ConfM          [1123 684;  [1145 691;  [329 121;   [364 85;
                107 966]     85 959]     81 429]     46 465]

5.3. Experimental analysis and discussion

The statistical measures used to assess the performance of the CMS approach on the above-mentioned datasets are summarized in Table 2, where a linear SVM classifier in a LOO-CV loop is used for classification. The synthetic dataset is drawn randomly from the two-dimensional Gaussian distributions described in the previous example with N = 200, but with means m1 = [0, 0] and m2 = [0.5, 0.5]. The aim of this selection is to highlight the benefits of the proposed method when dealing with overlapping classes, which increase the blue-shaded area in Fig. 3. On this dataset, the number of PLS components is forced to be Ncomp = 1 for obvious reasons; thus the strength of this FE method cannot be exploited on this low-dimensional dummy dataset by any of the evaluated methods. In this case, the CMS using LS yields the best accuracy at the CV stage, outperforming all other evaluated methods, as shown in Table 2. Nevertheless, LS approaches on high-dimensional databases (e.g., the INRIA dataset) require the application of an additional FE scheme for dimensionality reduction, since the inversion of large matrices can be prohibitive. For this purpose, the well-known Principal Component Analysis (PCA)-based FE method (López et al., 2011) is selected, analyzing the effect of varying the number of PCs for n = 1,...,Ncomp = 20 and averaging over the results, as shown in Table 2. The maximum number of components is selected in terms of the amount of explained variance; based on this, the chosen number of components should explain at least 80% of the data variance (see Fig. 6).
The results provided by the PLS-based schemes are also averaged over the number of components n = 1,...,Ncomp = 10. As clearly shown by the results, the two versions of the CMS approach outperform the classical LS and PLS feature extraction methods, although the standard deviation (std) of the approaches is slightly higher, as shown in Table 2.
Fig. 6. Criterion for selecting the maximum number of components of the FE methods (INRIA dataset): a Pareto chart of the variance explained per component and the cumulative total. At least 80% of the total variance should be explained.
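A sketch of this 80%-variance criterion using scikit-learn's PCA (an illustrative implementation; the function name and cap n_max are ours):

import numpy as np
from sklearn.decomposition import PCA

def n_components_80(X, threshold=0.80, n_max=20):
    """Smallest number of PCs whose cumulative explained variance >= threshold."""
    pca = PCA(n_components=min(n_max, *X.shape)).fit(X)
    cum = np.cumsum(pca.explained_variance_ratio_)
    return int(np.searchsorted(cum, threshold) + 1)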
Fig. 7. PLS-FE method under the CMS approach (INRIA datasets) on hypothesis H0 .
Fig. 8. PLS-FE method under the CMS approach (INRIA datasets) on hypothesis H1 .
Figs. 7 and 8 show the different subsets Z̃1, Z̃0 generated by applying the PLS-FE CMS approach to the extended INRIA datasets. Observe how the different extended datasets provide different configurations of support vectors (SVs). In this example, it can be readily seen that the input pattern is more likely to belong to class 1 (Fig. 8). As a result, the number of SVs of the decision surface increases when the false hypothesis (labeling the input pattern as class 0) is assumed for dataset generation. As in the previous example, the improvement stems from the fact that the model fitted at the CV stage is well suited to the noisy-labeled data present in the sample, unlike the model yielded by the CV baseline. Finally, on a typical uncorrelated high-dimensional biomedical dataset (the SPECT database), the proposed approaches clearly outperform the baseline methods within the same framework, i.e., the estimation of the CV prediction error using a linear SVM classifier. In addition, the std of the results is lower for our proposed approaches, i.e., PLS-FE Acc = 78.96 ± 3.29, Sen = 84.07 ± 1.72, Spe = 73.35 ± 4.88; CMS PLS-FE Acc = 86.35 ± 1.51, Sen = 91.10 ± 2.77, Spe = 81.11 ± 1.61.
A further analysis of the outcome of the proposed method reveals that the improvements of the CMS approach are mainly found in NOR and AD1 subjects (Górriz et al., 2011), an issue that is clearly motivated by the overlapping nature of both image patterns (from the confusion matrix: 35 NOR patterns and 36 AD1 patterns). To conclude this part, we have demonstrated that the CMS approach outperforms the standard filter-based approach for feature selection by including class information on the validation pattern. This improvement is achieved across several scenarios (balanced and unbalanced, real and synthetic datasets with overlapping classes), sample sizes, numbers of predictors, and feature extraction and classification schemes. It is worth mentioning that the workbench of the proposed method for model selection is biomedical datasets, where the number of samples N is usually smaller than the data dimension d (predictors). The application of the proposed method, via a nonparametric implementation of the LRT, to a large dataset (d ≪ N) can be found in Górriz et al. (2017). Although the latter approach presents some limitations when d increases, i.e., the regions delimited by the wrongly classified support vectors are empty of samples, the performance of the system is preserved when the number of samples is increased in large synthetic 2D datasets.
Table 3. Statistical measures for the empirical distribution of the risk S_exp(n*).

10-D        μ        σ²       % bias
CMS         0.5163   0.0088   -3.25
Baseline    0.5132   0.0112   -2.64

1000-D      μ        σ²       % bias
CMS         0.5069   0.0051   -1.39
Baseline    0.5115   0.0090   -2.31
5.4. Bias in error estimation

Although it is reasonable to optimize parameters in the development of CAD systems by minimizing CV error rates, the resulting classification rates are usually biased estimates of the actual risk due to the small-sample-size problem. This is a common setting in healthcare database studies, where LOO-CV-based error estimation is usually selected as the validation method (Górriz et al., 2009). The aim of this simulation is to compare the bias and variance of the error estimation of the CMS approach with those obtained by the baseline approaches. 1000 simulations were run with 40 class-balanced samples per simulation, drawn from a multivariate Gaussian distribution with zero mean and a random semi-positive definite covariance matrix. Each sample consists of a 10- or a 1000-dimensional feature vector (two settings). In addition, an independent test set with 2 · 10^5 samples was generated with the same pdf. To determine the amount of bias for each method, we compute the LOO-CV error of the linear SVM classifier. The system parameter optimized during the simulation is the number of selected components n of the feature vector, in order to obtain the mean LOO-CV error estimate S_exp(n*). The resulting classifier was used to predict the class of the independently created samples, obtaining the distribution of the actual risk, i.e., the true error S(n*). Figs. 9 and 10 show the empirical distributions of S_exp(n*), the CV error estimate for the optimal n, and of S(n*), the actual risk, for the SVM classifier optimized by each method. The resulting mean and variance values for the distributions are given in Table 3. As a conclusion, the difference between the empirical and true errors is lower than 5%, and both are statistically similar over this simulation, where the mean estimator was considered following the same strategy as in the experimental part. Although the bias of the LOO-CV error estimate is not significant for either of the aforementioned methods on this classification task, a close-to-unbiased estimate of the actual risk could be obtained by using the results of several resampling and optimization methods (Efron & Tibshirani, 1997; Varma & Simon, 2006).
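A compact, scaled-down skeleton of this bias experiment (scikit-learn based; the sample sizes and the uninformative null-data setup are reduced stand-ins for the full simulation):

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score, LeaveOneOut
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 10)); y = np.repeat([0, 1], 20)          # null training data
X_test = rng.normal(size=(5000, 10)); y_test = np.repeat([0, 1], 2500)

best = None
for n in range(1, 11):                    # optimise n on the LOO-CV error
    pca = PCA(n_components=n).fit(X)
    err = 1 - cross_val_score(LinearSVC(max_iter=10000), pca.transform(X), y,
                              cv=LeaveOneOut()).mean()
    if best is None or err < best[1]:
        best = (n, err, pca)
n_star, cv_err, pca = best
clf = LinearSVC(max_iter=10000).fit(pca.transform(X), y)
true_err = 1 - clf.score(pca.transform(X_test), y_test)
print(f"S_exp(n*) = {cv_err:.3f}   S(n*) = {true_err:.3f}")       # bias = difference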
Fig. 9. Distribution of the experimental error estimate (LOO CV) and the actual risk (true error) for optimized Support Vector Machine (10-D feature vector).
Fig. 10. Distribution of the experimental error estimate (LOO CV) and the actual risk (true error) for the optimized Support Vector Machine (1000-D feature vector).

6. Conclusions

In this paper, the notion of the CMS approach, and some connections with previous approaches, is presented and evaluated using several synthetic and real image datasets (Dalal & Triggs, 2005; Górriz et al., 2011). The CMS approach combines FE, hypothesis testing on a discrete set of expected outcomes, and a cross-validated classification stage. This methodology provides extended datasets (one per class-hypothesis) by means of FE methods, which are scored probabilistically using an NP approach inside a LOO-CV loop. Our results demonstrate that, although the method only considers a binary classification problem and a LOO-CV scheme (with an inherent bias similar to that of the baseline), the resulting minimum CV error estimate outperforms the one obtained by baseline methods using several FE methods (LS and PLS) that do not consider such FE/S optimization. As an example, on a SPECT database the proposed method yields Acc = 86.35 ± 1.51, with Sen = 91.10 ± 2.77 and Spe = 81.11 ± 1.61, clearly outperforming the baseline method.

As ongoing research, the evaluation of the proposed method will be conducted on Magnetic Resonance Imaging (MRI) datasets for the detection of abnormal neurological patterns such as AD or autism. The validation procedure will include complex classifiers (decision trees, random forests) in different statistical cross-validation and ensemble learning schemes (k-fold, bagging, boosting).

Acknowledgments

This work was partly supported by the MINECO under the TEC2015-64718-R project, the Salvador de Madariaga Mobility Grants 2017 and the Consejería de Economía, Innovación, Ciencia y Empleo (Junta de Andalucía, Spain) under the Excellence Project P11-TIC-7103. Finally, thanks to the reviewers for all the time they have taken to revise the article and for their valuable comments.

Appendix A. The concept of relevance

The definition of relevance of a set of variables {X1,...,XN} with respect to their classes {Y1,...,YN} has been suggested in many works (Kohavi & John, 1997). As an example, Gennari, Langley, and Fisher (1989) define relevant features as those whose "values vary systematically with category membership". Another possible definition is the following:

Definition: Let S = {X1, X2,...,XN} be the set of all features except the remaining test pattern X. Denote by s a value-assignment to all features in S. X is relevant iff there exist some x, y and s for which p(X = x, S = s) > 0 such that p(Y = y | X = x, S = s) ≠ p(Y = y | S = s).

The latter definition is one of the degrees of relevance (a weaker definition is also admissible) that allows the application of learning algorithms on optimal feature subsets with a high rate of accuracy. Assuming Y to be conditionally dependent on X implies that the probability of the label (given all features S) can change when we eliminate knowledge about the value of X. If not, our LRT in Eq. (12) would exhibit random behavior, since both datasets would be equally probable. In that case, in terms of machine learning, the SVM classification stage would resolve this issue, since the current input pattern is not an SV. Bayes' rule connects the definition of relevance with the discussion following Eq. (10). For relevant patterns the LRT is expressed as:

p(y1 | X = x, S = s)/p(y0 | X = x, S = s) = [p(X = x, S = s | y1) p(y1)] / [p(X = x, S = s | y0) p(y0)]    (A.1)

where p(y1) = p(y0) is usually assumed. Within the Bayesian paradigm the latent-class variable y is the hypothesis H, thus p(x|H) p(H) = p(x; H).

References
Barnich, O., Jodogne, S., & Van Droogenbroeck, M. (2006). Robust analysis of silhouettes by morphological size distributions. In J. Blanc-Talon, W. Philips, D. Popescu, & P. Scheunders (Eds.), Proceedings of the eighth international conference on advanced concepts for intelligent vision systems (pp. 734-745). Berlin, Heidelberg: Springer. doi:10.1007/11864349_67.
Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and regression trees. Taylor & Francis.
Brown, G., Pocock, A., Zhao, M.-J., & Luján, M. (2012). Conditional likelihood maximisation: A unifying framework for information theoretic feature selection. Journal of Machine Learning Research, 13(1), 27-66.
Chapelle, O., Scholkopf, B., & Zien, A. (2006). Semi-supervised learning. MIT Press.
Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In C. Schmid, S. Soatto, & C. Tomasi (Eds.), Proceedings of the international conference on computer vision & pattern recognition: Vol. 2 (pp. 886-893).
Efron, B., & Tibshirani, R. (1997). Improvements on cross-validation: The .632+ bootstrap method. Journal of the American Statistical Association, 92(438), 548-560.
Friston, K. J., Ashburner, J., Kiebel, S. J., Nichols, T. E., & Penny, W. D. (Eds.) (2007). Statistical parametric mapping: The analysis of functional brain images. Academic Press.
Fukunaga, K. (1990). Introduction to statistical pattern recognition (2nd ed.). San Diego, CA: Academic Press.
Geng, Y., Li, W., Liang, G., Wang, J., Wu, Y., Patil, N., & Wang, J. J.-Y. (2017). A novel image tag completion method based on convolutional neural network. CoRR, abs/1703.00586.
Gennari, J. H., Langley, P., & Fisher, D. (1989). Models of incremental concept formation. Artificial Intelligence, 40(1-3), 11-61. doi:10.1016/0004-3702(89)90046-5.
Gorriz, J. M., Ramirez, J., Lang, E. W., & Puntonet, C. G. (2008). Jointly Gaussian pdf-based likelihood ratio test for voice activity detection. IEEE Transactions on Audio, Speech, and Language Processing, 16(8), 1565-1578. doi:10.1109/TASL.2008.2004293.
Górriz, J. M., Lassl, A., Ramírez, J., Salas-Gonzalez, D., Puntonet, C., & Lang, E. (2009). Automatic selection of ROIs in functional imaging using Gaussian mixture models. Neuroscience Letters, 460(2), 108-111.
Górriz, J. M., Ramírez, J., Suckling, J., Illán, I. A., Ortiz, A., Martinez-Murcia, F. J., ... Wang, S. (2017). Case-based statistical learning: A non-parametric implementation with a conditional-error rate SVM. IEEE Access, 5, 11468-11478. doi:10.1109/ACCESS.2017.2714579.
Górriz, J. M., Segovia, F., Ramírez, J., Lassl, A., & Salas-Gonzalez, D. (2011). GMM based SPECT image classification for the diagnosis of Alzheimer's disease. Applied Soft Computing, 11(2), 2313-2325. doi:10.1016/j.asoc.2010.08.012.
Gur, D., Stalder, J. S., Hardesty, L. A., Zheng, B., Sumkin, J. H., Chough, D. M., ... Rockette, H. E. (2004). Computer-aided detection performance in mammographic examination of masses: Assessment. Radiology, 233(2), 418-423.
Guyon, I., Gunn, S., Nikravesh, M., & Zadeh, L. (Eds.) (2006). Feature extraction, foundations and applications. Springer.
Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning. Springer.
Helland, I. S. (2001). Some theoretical aspects of partial least squares regression. Chemometrics and Intelligent Laboratory Systems, 58(2), 97-107. doi:10.1016/S0169-7439(01)00154-X.
Illán, I. A., Górriz, J. M., Ramírez, J., Salas-Gonzalez, D., López, M., Segovia, F., ... Puntonet, C. (2010). Projecting independent components of SPECT images for computer aided diagnosis of Alzheimer's disease. Pattern Recognition Letters, 31(11), 1342-1347.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning with applications in R. Springer.
de Jong, S. (1993). SIMPLS: An alternative approach to partial least squares regression. Chemometrics and Intelligent Laboratory Systems, 18(3), 251-263.
Kay, S. M. (1993). Fundamentals of statistical signal processing: Vol. II. Detection theory. Upper Saddle River, NJ: Prentice Hall.
Kohavi, R., & John, G. H. (1997). Wrappers for feature subset selection. Artificial Intelligence, 97(1), 273-324.
LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521, 436-444.
Liang, R.-Z., Liang, G., Li, W., Li, Q., & Wang, J. J.-Y. (2016). Learning convolutional neural network to maximize pos@top performance measure. CoRR, abs/1609.08417.
López, M., Ramírez, J., Górriz, J. M., Álvarez, I., Salas-Gonzalez, D., Segovia, F., ... Gómez-Río, M. (2011). Principal component analysis-based techniques and supervised classification schemes for the early detection of Alzheimer's disease. Neurocomputing, 74(8), 1260-1271.
Lu, Z., Gao, X., Wang, L., Wen, J.-R., & Huang, S. (2015). Noise-robust semi-supervised learning by large-scale sparse coding. In Proceedings of the twenty-ninth AAAI conference on artificial intelligence (AAAI'15) (pp. 2828-2834). AAAI Press.
Martinez-Murcia, F. J., Górriz, J. M., & Ramírez, J. (1999). Feature extraction. In Wiley encyclopedia of electrical and electronics engineering. John Wiley & Sons. doi:10.1002/047134608X.W5506.pub2.
Padilla, P., Lopez, M., Gorriz, J. M., Ramirez, J., Salas-Gonzalez, D., & Alvarez, I. (2012). NMF-SVM based CAD tool applied to functional brain images for the diagnosis of Alzheimer's disease. IEEE Transactions on Medical Imaging, 31(2), 207-216. doi:10.1109/TMI.2011.2167628.
Ripley, B. D. (1996). Pattern recognition and neural networks. Cambridge University Press.
Sechidis, K. (2015). Hypothesis testing and feature selection in semi-supervised data (Ph.D. thesis). University of Manchester.
Sechidis, K., Calvo, B., & Brown, G. (2014). Statistical hypothesis testing in positive unlabelled data. In Machine learning and knowledge discovery in databases (pp. 66-81). Berlin, Heidelberg: Springer.
Suzuki, K., Li, F., Sone, S., & Doi, K. (2005). Computer-aided diagnostic scheme for distinction between benign and malignant nodules in thoracic low-dose CT by use of massive training artificial neural network. IEEE Transactions on Medical Imaging, 24(9), 1138-1150.
Theodoridis, S., Pikrakis, A., Koutroumbas, K., & Cavouras, D. (2010). Classifiers based on cost function optimization. In Introduction to pattern recognition (pp. 29-77). Boston: Academic Press.
Vapnik, V. N. (1998). Statistical learning theory. New York: John Wiley and Sons.
Vapnik, V. N. (2000). The nature of statistical learning theory. Springer.
Varma, S., & Simon, R. (2006). Bias in error estimation when using cross-validation for model selection. BMC Bioinformatics, 7, 91.
Varol, E., Gaonkar, B., Erus, G., Schultz, R., & Davatzikos, C. (2012). Feature ranking based nested support vector machine ensemble for medical image classification. In Proceedings of the international symposium on biomedical imaging: From nano to macro (pp. 146-149). IEEE.
Viola, P., & Jones, M. (2001). Rapid object detection using a boosted cascade of simple features. In Proceedings of the IEEE computer society conference on computer vision and pattern recognition: Vol. 1 (pp. I-511-I-518). doi:10.1109/CVPR.2001.990517.
Wu, B., & Nevatia, R. (2007). Detection and tracking of multiple, partially occluded humans by Bayesian combination of edgelet based part detectors. International Journal of Computer Vision, 75(2), 247-266. doi:10.1007/s11263-006-0027-7.
Zhao, L., & Thorpe, C. E. (2000). Stereo- and neural network-based pedestrian detection. IEEE Transactions on Intelligent Transportation Systems, 1(3), 148-154. doi:10.1109/6979.892151.