Author’s Accepted Manuscript Inductive and Flexible Feature Extraction for SemiSupervised Pattern Categorization F. Dornaika, Y. El Traboulsi, A. Assoum
www.elsevier.com/locate/pr
PII: DOI: Reference:
S0031-3203(16)30092-9 http://dx.doi.org/10.1016/j.patcog.2016.04.024 PR5729
To appear in: Pattern Recognition Received date: 17 June 2015 Revised date: 10 December 2015 Accepted date: 28 April 2016 Cite this article as: F. Dornaika, Y. El Traboulsi and A. Assoum, Inductive and Flexible Feature Extraction for Semi-Supervised Pattern Categorization, Pattern Recognition, http://dx.doi.org/10.1016/j.patcog.2016.04.024 This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting galley proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Inductive and Flexible Feature Extraction for Semi-Supervised Pattern Categorization F. Dornaika1,2,∗ , Y. El Traboulsi 1,3 , and A. Assoum 3 University of the Basque Country UPV/EHU, Manuel Lardizabal, 1, 20018 San Sebastian, Spain IKERBASQUE, Basque Foundation for Science, Maria Diaz de Haro 3, 48013 Bilbao, Spain 3 Doctoral School of Sciences and Technology, Lebanese University, Mitein Street, Tripoli, Lebanon 1
2
∗
Corresponding Author: University of the Basque Country UPV/EHU, Manuel Lardizabal, 1, 20018 San Sebastian, Spain; Tel: +34 943018034; FAX: +34 943 015590 ; E-mail:
[email protected]
Abstract This paper proposes a novel discriminant semi-supervised feature extraction method for generic classification and recognition tasks. This method, called inductive flexible semisupervised feature extraction, is a graph-based embedding method that seeks a linear subspace close to a non-linear one. It is based on a criterion that simultaneously exploits the discrimination information provided by the labeled samples, maintains the graphbased smoothness associated with all samples, regularizes the complexity of the linear transform, and minimizes the discrepancy between the unknown linear regression and the unknown non-linear projection. We extend the proposed method to the case of nonlinear feature extraction through the use of kernel trick. This latter allows to obtain a nonlinear regression function with an output subspace closer to the learned manifold than that of the linear one. Extensive experiments are conducted on ten benchmark databases in order to study the performance of the proposed methods. Obtained results demonstrate a significant improvement over state-of-the-art algorithms that are based on label propagation or semi-supervised graph-based embedding. Keywords: Feature extraction, semi-supervised discriminant analysis, graph-based
Preprint submitted to Pattern Recognition
May 20, 2016
embedding, out-of-sample extension, pattern categorization 1. Introduction Feature extraction with dimensionality reduction is an important step and essential process in embedding data analysis. By computing an adequate representation of data that has a low dimension, more efficient learning and inference [1, 2, 3, 4] can be achieved. There are two main reasons for estimating a low-dimensional representation of high-dimensional data: reducing measurement cost of further data analysis and beating the curse of dimensionality. The dimensionality reduction can be achieved either by feature extraction or feature selection. Feature extraction refers to methods that create a set of new features based on transformations and/or combinations of the original features, while feature selection methods select the most representative and relevant subset from the original feature set [5]. Feature extraction methods can be classified into two main classes: (1) linear methods, and (2) non-linear methods. Besides this categorization, these methods can also be classified into three categories: (i) supervised, (ii) semi-supervised, and (iii) unsupervised. The linear techniques have been increasingly important in pattern recognition [6, 7, 8, 3] since they permit a relatively simple mapping of data onto a lower-dimensional subspace, leading to simple and computationally efficient classification strategies. The classical linear embedding methods (e.g., PCA, Linear Discriminant Analysis (LDA), Maximum Margin Criterion (MMC)[9]) and Locally LDA [10] are demonstrated to be computationally efficient and suitable for practical applications, such as pattern classification and visual recognition. PCA projects the samples along the directions of maximal variances and aims to preserve the Euclidean distances between the samples. Unlike PCA which is unsupervised, Linear Discriminant Analysis (LDA) [11, 8] is a supervised technique. One limitation of PCA and LDA is that they only see the linear global Euclidean structure. The non-linear methods such as Locally Linear Embedding (LLE) [12] and Laplacian eigenmaps [13] focus on preserving the local structures. Isomap [14] is a non-linear projection method that globally preserve the data. It also attempts to preserve the geodesic 2
distances between samples. Although the supervised feature extraction methods had been successfully applied to many pattern recognition applications, they require a full labeling of data samples. It is well-known that it is much easier to collect unlabeled data than labeled samples. The labeling process is often expensive, time consuming, and requires intensive human involvement. As a result, partially labeled datasets are more frequently encountered in real-world problems. In the last decade, semi-supervised learning algorithms have been developed to effectively use limited number of labeled samples and a large amount of unlabeled samples for real-world applications [15, 16]. In the past years, many graph-based methods for semisupervised learning have been developed. The main advantage of such methods is their ability to identify classes of arbitrary distributions. The use of data-driven graphs has led to many progresses in the field of semi-supervised learning [17, 18, 19, 20, 21, 22, 23, 24, 25]. Toward classification, an excellent subspace should be smooth as well as discriminative. Hence, a graph-theoretic learning framework is usually deployed to simultaneously meet the smoothness requirement among nearby points and the discriminative requirement among differently labeled points [26]. In addition to the use of partial labelling in semi-supervised learning, many researchers use pairwise constraints which can be seen as another form of side information [27]. These constraints simply indicate whether two instances are similar (must-link) or dissimilar (cannot-link). They are usually used for getting a linear or non-linear embedding by adding them to the criterion derived from unlabelled data samples [28, 29, 30, 31]. The final application is to help spectral clustering recover from an undesirable partition. From the point of view of manifold learning, semi-supervised extensions can generally improve the performance over their supervised counterparts. Nevertheless, despite the success of many graph-based algorithms in dealing with partially labeled problems [32], there are still some problems that are not properly addressed. Almost all semi-supervised feature extraction techniques can suffer from one of the following limitations:
3
1. The non-linear semi-supervised approaches do not have, in general, an implicit function that can map unseen data samples. In other words, the non-linear methods provide embedding for only the training data. This is the transductive setting, i.e., the test set coincides with the set of unlabeled samples in the training dataset. Indeed, solving the out-of-sample extension is still an open problem for those techniques adopting non-linear embedding. 2. Almost all proposed semi-supervised approaches target the estimation of a linear transform that maps original data into a low dimensional space. While this simplifies the learning processes and gets rid of the out-of-sample problem, there is no guarantee that such approaches will be optimal for all datasets. The main reason behind this is that the criterion used is already a rigid constraint that contains only the linear mapping. Thus, any coordinate in the low-dimensional space is supposed to be a linear combination of the original features. For that reason, these approaches have not the flexibility to adapt their linear model to a more generic non-linear model. In addition to the above limitations, it is not clear what would be the performance of the semi-supervised approaches when minimal labeling is used. For instance, in the domain of face recognition the so-called one sample problem can be a challenging issue. In this paper, we propose an Inductive Flexible Semi Supervised Feature Extraction method as well as its kernelized version. The aim is to combine the merits of Flexible Manifold Embedding and the non-linear graph based embedding. The proposed linear method will be flexible since it estimates a non-linear manifold that is the closest to a linear embedding. The proposed kernelized method will be also flexible since it estimates the non-linear manifold that is the closest to a kernel-based embedding. In both proposed methods, the non-linear manifold as well as the mapping (linear transform for the linear regression and the kernel multipliers for the non-linear regression) are simultaneously estimated. This simultaneous estimation is the main reason that makes the proposed frameworks superior to many existing algorithms as it will be shown in the sequel. We can also notice that the dimension of the final embedding obtained by the proposed methods is not limited to the
4
number of classes. This allows the application of any kind of classifiers once the data are embedded in new spaces. In contrast with non-linear dimensionality reduction approaches, our proposed methods have an obvious advantage that the learnt subspace has a direct out-of-sample extension to novel samples, and are thus easily generalized to the entire high-dimensional input space. The main differences between our proposed method and the other state-of-the-art ones are (1) our method is not based on label propagation and (2) it simultaneously exploits labeled discrimination information, uses graph-based smoothness, and estimates a regression function whose output is close at most to the non-linear model. The paper is structured as follows. In section 2, we briefly review the main methods for semi-supervised learning including the graph-based label propagation and the semisupervised embedding methods. In section 3, we introduce the IFSSFE method and its kernel variant. Section 4 depicts the experimental results obtained with ten real datasets and compares the performance of the proposed method with those of the competing ones. Finally, in section 5 we present our conclusions. In the sequel, capital bold letters denote matrices and small bold letters denote vectors.
2. Related work In order to make the paper self-contained, this section will briefly describe some stateof-the art semi-supervised methods. 2.1. Notation and preliminaries We define the training data matrix as X = [x1 , x2 , ..., xl , xl+1 , ..., xl+u ] ∈ RD×(l+u) , where xi |li=1 and xi |l+u i=l+1 are the labeled and unlabeled samples, respectively, with l and u being the total numbers of labeled and unlabeled samples and D being the feature dimension. Let N = l + u denote the total number of training samples and nc be the total number of labeled samples in the cth class and let XL = [x1 , x2 , ..., xl ] ∈ RD×l be the labeled samples matrix, with the label of xi as yi ∈ {1, 2, ..., C}, where C is the total number of classes. Let S ∈ R(l+u)×(l+u) denote the graph similarity matrix with S(i, j) 5
representing the similarity between xi and xj , i.e., S(i, j) = sim(xi , xj ). In a supervised context, one can also consider two similarity matrices Sw and Sb that encode the within class and between class graphs, respectively. Sw encodes the pairwise similarities among samples having the same label. Thus, Sw (i, j) = sim(xi , xj ) if xi and xj have the same class label; Sw (i, j) = 0, otherwise. Similarly, Sb encodes the pairwise similarities among samples having different labels. Thus, Sb (i, j) = sim(xi , xj ) if xi and xj have different labels; Sb (i, j) = 0, otherwise. The function sim(., .) can be any symmetric function that measures the similarity between two samples. This can be given by the cosine or the Gaussian kernel. For each similarity matrix S, a Laplacian matrix L can be computed. This latter is given by: L = D − S where D is a diagonal matrix whose elements are the row (or column since the similarity matrix is symmetric) sums of S matrix. Similar expression can be ˆ is defined by L ˆ = I − D−1/2 S D−1/2 found for Lb and Lw . The normalized Laplacian L where I denotes the identity matrix. We also define a binary label matrix Y ∈ BN ×C associated with the samples with Y (i, j) = 1 if xi has label yi = j; Y (i, j) = 0, otherwise. Similarly, we can define ⎛ ⎜ FL unknown label matrix denoted by F ∈ RN ×C . In a semi-supervised setting, F = ⎝ FU where FL = YL .
an ⎞ ⎟ ⎠
2.2. Graph-based label propagation methods In the last decade, the semi-supervised learning methods using graph-based label propagation attracted much attention. All of them impose that samples with high similarity should share similar labels. They differ by the regularization term as well as by the loss function used for fitting label information associated with the labeled samples. All of these methods take as input the weighted graph S associated with data and the label matrix Y. The state-of-the art label propagation algorithms (can also be called classifiers [33]) can be: Gaussian Fields and Harmonic Functions (GFHF) [34], Local and Global Consistency (LGC) [35], Laplacian Regularized Least Square (LapRLS) [36], Robust Multi-class
6
Graph Transduction (RMGT) [37], and Flexible Manifold Embedding (FME) [38]. 2.3. Graph-based embedding methods Many of the existing label propagation algorithms and non-linear embedding techniques can only work on transductive setting, which requires that both the training and test sets are available during the learning process. Therefore, these algorithms are not always suitable for recognition applications where the test set is generally not available during the training phase. Unlike label propagation techniques that seek label inference, the embedding techniques seek a general coordinate representation where the dimension of the mapped data is not necessarily limited to the number of classes. Two main techniques represent the state-of-the-art in semi-supervised graph-based embedding: 2.3.1. Semi-Supervised Discriminant Analysis (SDA) Cai et al. extended LDA to SDA [39] by adding a geometrically-based regularization term in the objective function of LDA. The core assumption in SDA is still the manifold smoothness, namely, nearby points will have similar representations in the lowerdimensional space. 2.3.2. Semi-Supervised Discriminant Embedding (SDE) SDE can be seen as the semi-supervised variant of the Local Discriminant Embedding (LDE) method [40]. In order to discover both geometrical and discriminant structure of the data manifold, SDE [41, 42] relies on three graphs: the within-class graph Gw (intrinsic graph), the between-class graph Gb (penalty), and the graph defined over the whole set (labeled and unlabeled samples). For each data sample xi , two subsets, Nw (xi ) and Nb (xi ) are computed. Nw (xi ) contains the neighbors sharing the same label with xi , while Nb (xi ) contains the neighbors having different labels. One simple possible way to compute these two sets of neighbors associated with the local sample is to use two nearest neighbor graphs: one for homogeneous samples (parameterized by K1 ) and one for heterogeneous samples (parameterized by K2 ). Note that K1 and K2 are user-defined parameters.
7
Each of the graphs Gw and Gb , mentioned above, is characterized by its corresponding similarity (weight) matrix Sw and Sb , respectively. The entries of these symmetric matrices are defined by the following formulas:
Sw (i, j) =
Sb (i, j) =
⎧ ⎪ ⎨ sim(xi , xj ) if xj ∈ Nw (xi ) or xi ∈ Nw (xj ) ⎪ ⎩ 0, otherwise ⎧ ⎪ ⎨ sim(xi , xj ) if xj ∈ Nb (xi ) or xi ∈ Nb (xj ) ⎪ ⎩ 0,
(1)
(2)
otherwise
Let Lw , Lb , and L denote the Laplacian matrices associated with the graph similarity matrices, Sw , Sb , and S, respectively. SDE seeks a linear mapping W such that the following three criteria should be optimized simultaneously: l l 1 WT (xi − xj )2 Sw (i, j) = min T r(WT XL Lw XTL W) min W W 2 i j
(3)
l l 1 WT (xi − xj )2 Sb (i, j) = max T r(WT XL Lb XTL W) max W W 2 i j
(4)
N N 1 min WT (xi − xj )2 S(i, j) = min T r(WT X L XT W) W 2 i j W
where T r(.) denotes the trace of a matrix. This leads to the following criterion: T r(WT XL Lb XTL W) W = arg max W T r(WT (XL Lw XTL + α X L XT + β I) W)
b ∈ RN ×N denote the augmented Laplacian matrices, namely:
w ∈ RN ×N and L Let L ⎛
⎞
⎛
⎞
Lw 0 ⎟ Lb 0 ⎟
w = ⎜
b = ⎜ L ⎠, L ⎠ ⎝ ⎝ 0 0 0 0
8
(5)
The above criterion becomes:
b XT W) T r(WT X L W = arg max
w XT + α X L XT + β I) W) W T r(WT (X L The above trace ratio can be approximated by:
w XT + α X L XT + β I)W)−1 (WT X L
b XT W)) W = arg max T r(WT (X L W
b XT ) associ w XT + α X L XT + β I)−1 (X L Thus, W is given by the eigenvectors of (X L ated with the largest eigenvalues.
3. Proposed methods 3.1. Inductive Flexible Semi Supervised Feature Extraction (IFSSFE) In this section, we propose an Inductive Flexible Semi Supervised Feature Extraction (IFSSFE) that can combine the merits of Flexible Manifold Embedding idea and non-linear graph-based embedding. It should be noticed that the dimension of the final embedding is not limited to the number of classes. We assume that the non-linear embedding of the training data samples is given by the matrix Z ∈ RN ×d , i.e., the row vector Zi. is the non-linear representation of the vector xi . We consider again the within class and between class graphs associated with the labeled data as well as the graph associated the whole set of training data (labeled and unlabeled). The expression of the criteria associated with the non-linear Semi-Supervised Discriminant Embedding will be:
w Z) max T r(ZT L
b Z) min T r(ZT L Z) min T r(ZT L Z Z Z
(6)
By combining the above criteria together with the regression and regularization terms we can define a criterion that should be minimized. This criterion depends on both the
9
non-linear embedding and the regression transform. It is given by:
w Z) − λ T r(ZT L
b Z) + μ (W2 + γ XT W + 1N bT − Z2 ) e(Z, W, b) = T r(ZT L Z) + T r(ZT L = T r(ZT L1 Z) + μ (W2 + γ XT W + 1N bT − Z2 )
(7)
w − λ L
b , μ, γ, and λ are three positive balance parameters, and 1N where L1 = L + L denotes a vector of N ones. The non-linear embedding as well as the regression should be estimated such that e is minimized. To obtain the optimal solution, we vanish the derivatives of the objective function e with respect to W and b. We have:
b =
1 T (Z 1N − WT X 1N ) N
W = γ (γ Xc XTc + I)−1 Xc Z = A Z
(8) (9)
where A = γ (γ Xc XTc + I)−1 Xc . We use the above expression for W and b in the regression function XT W + 1N bT , we get:
XT W + 1N bT
with B = Hc XT A +
1 T N 1N 1N
1 1 1N 1TN Z − 1N 1TN XT A Z N N 1 1 T T = (I − 1N 1N ) X A Z + 1N 1TN Z N N 1 T T = Hc X A Z + 1 N 1 N Z = B Z N = XT A Z +
and Hc = (I −
1 N 1N
1TN ). Thus, the criterion e becomes:
e(Z, W, b) = T r(ZT L1 Z) + μ [T r(ZT AT A Z) + γ T r((B Z − Z)T (B Z − Z))] = T r(ZT (L1 + μ AT A + μγ(B − I)T (B − I)) Z) = T r(ZT (L1 + E) Z)
(10)
where E = μ AT A + μγ(B − I)T (B − I).
10
Thus, the non-linear embedding Z is estimated by minimizing the above criterion with a constraint in order to avoid the trivial solution Z = 0.
Z = arg min T r(ZT (L1 + E) Z) s.t. ZT Z = I Z Thus Z is given by the eigenvectors of L1 + E associated with the smallest eigenvalues. Once Z is estimated the corresponding regression W and b are estimated by Eqs. (9) and (8). Given an unseen sample xtest its embedding (a column vector) is given by ztest = WT xtest + b . 3.2. Difference between the proposed method and existing methods Obviously, our proposed flexible method has several advantages compared with existing methods. Indeed, it combines the merits of the flexible framework for graph-based semi-supervised label propagation (FME) [38] and those of graph-based semi-supervised embedding methods. The advantages are as follows. First, unlike the FME method which estimates label distributions, our method estimates a non-linear embedding whose dimension is not limited to the number of classes as it is the case with many frameworks adopting the label propagation algorithm. Second, the proposed method is a kind of a non-linear feature extractor that lends itself nicely to all machine learning paradigms that can be used in the output space with any dimension in order to infer the class (classification) or the continuous label (regression). Third, the method is still inductive in the sense that it can work with unseen data. Fourth, it inherits the flexibility of FME in the sense that a non-linear embedding and a regression are found such that the former is close to the linear one obtained by the latter. Thus, the proposed method can better cope with the data sampled from a certain type of non-linear manifold that is somewhat close to a linear subspace. Existing works adopting spectral regression like [43, 44] provide the linear embedding using a cascaded estimation. First, they estimate a non-linear embedding given by
11
Laplacian Eigenmaps then they estimate the regression in a subsequent stage. Thus, our proposed method can better cope with the data sampled from a certain type of nonlinear manifold that is somewhat close to a linear subspace. This will be confirmed in the experimental results section. 3.3. Kernel IFSSFE In this section we propose the kernel version of the (IFSSFE). The motivation behind the use of kernel is that in some cases the non-linearity of data cannot be close to a linear subspace [45, 46]. In such cases, the flexibility introduced by the linear regression term may not lead to good approximation of the embedded data. The proposed kernel version aims at a flexibility in which the regression itself is non-linear. Thus, the role of the kernel trick is to seek an inductive non-linear embedding that is close to a real non-linear subspace. In [36], it is shown that kernelized version of label propagation based on the linear LapRLS can be written as:
min
l
L(f(xi ), yi ) + λA f2K + λI fG
i=1
where L(., .) is a loss function, fK and fG are the RKHS-norm and the graph-based smoothness of the model f, respectively. The linear homologue of the above criterion is obtained by using f(xi ) = WT xi + b. According to Belkin [36], the model f will expand over all the labeled and unlabeled points as follows (here f is a row vector having C elements): f(x) =
l+u
vTj K(x, xj ) = [K(x, x1 ), . . . , K(x, xl+u )] V
j=1
where K ∈ R(l+u)×(l+u) is a kernel Gram matrix, and V ∈ R(l+u)×C is the matrix of multipliers (V = [v1 , v2 , . . . , vl+u ]T ). The entries of the Gram matrix are given by K(xi , xj ) that represents a dot product in feature space. This kernel function can be Gaussian or polynomial.
12
For all samples, this condition can be written in matrix form as Z = K V where the rows of Z will be the predicted label. In our proposed kernel version of IFSSFE the non-linear embedding of the feature vector xi is represented by the row vector Zi. whose dimension is d. Note that the value of d can be any arbitrary number such that d ≤ l + u. The non-linear embedding of xi should be as close as possible to its kernel embedding given by f(xi ) = l+u j=1 vj K(xi , xj ). Thus, the global criterion that allows the simultaneous estimation of the matrix of multipliers V and the non-linear embedding Z will be given by: e(Z, V) = T r(ZT L1 Z) + μ [T r(VT K V) + γ T r((K V − Z)T (K V − Z))]
(11)
At the extremum of e the derivative of e w.r.t. V should vanish. This gives: 2K V + 2γ K (K V − Z) = 0
This can be written in the following form: V = γ (I + γK)−1 Z = A1 Z where A1 = γ (I + γK)−1 . By plugging the above expression in Eq. (11), it becomes: e(Z) = T r(ZT L1 Z) + μ T r(ZT AT1 A1 Z) + μ γ T r(ZT BT1 B1 Z) = T r(ZT (L1 + μAT1 KA1 + μ γ BT1 B1 ) Z)
(12)
with B1 = K A1 − I. The solution Z is given by the eigenvectors of L1 + μAT1 KA1 + μ γ BT1 B1 associated with the smallest eigenvalues. The matrix of multipliers is given by V = γ (I + γ K)−1 Z . Given an unseen sample xtest its embedding (a column vector) is given by ztest = VT [K(xtest , x1 ), . . . , K(xtest , xN )]T .
13
4. Performance evaluation We test our proposed methods on ten datasets. These datasets cover several types of objects that are usually used in the domain of pattern recognition. Indeed, we use seven face datasets ORL, UMIST, CMU PIE, Extended Yale, FacePix, FERET, and LFW-a, one Handwritten Digits database (USPS), one object database (COIL-20), and one text database (20-NEWS). 4.1. Datasets • The ORL face data set1 . There are 10 images for each of the 40 human subjects, which were taken at different times, varying the lighting, facial expressions (open/closed eyes, smiling/not smiling) and facial details (glasses/no glasses). The images were taken with a tolerance for some tilting and rotation of the face up to 20o . • The UMIST face data set2 . The UMIST data set contains 575 gray images of 20 different people. The images depict variations in head pose. • PIE3 : We use a reduced data set containing 1926 face images of 68 individuals. The images contain poses variations, illumination variations, and facial expression variations. The image size is 32×32 pixels. • Extended Yale4 : We use the cropped version contains 1774 face images of 28 individuals. The images of the cropped version contain illumination variations and facial expression variations. The image size is 192×168 pixels with 256-bit grey scale. The images are rescaled to 32×32 pixels in our experiments. • FacePix5 : This database includes a set of face images with pose angle variations. It is composed of 181 face images (representing yaw angles from −90◦ to +90◦ at 1 1
http : //www.cl.cam.ac.uk/research/dtg/attarchive/f acedatabase.html http : //www.shef.ac.uk/eee/research/vie/research/f ace.html 3 http : //www.ri.cmu.edu/projects/project 418.html 4 http : //vision.ucsd.edu/ ∼ leekc/ExtY aleDatabase/ExtY aleB.html 5 http : //www.f acepix.org/ 2
14
degree increments) of 30 different subjects, with a total of 5430 images. We used a subset of this dataset in which each person has 18 images. • FERET6 : We use a subset of FERET database, which includes 1400 images of 200 distinct subjects, each subject has seven images. The subset involves variations in facial expression, illumination and pose. In our experiment, the facial portion of each original image is cropped automatically based on the location of eyes and resized to 32×32 pixels. • Labeled Faces in the Wild (LFW) is a largescale database of face photographs designed for unconstrained face verification with variations of pose, illumination, expression, misalignment and occlusion, etc. We used a subset of aligned LFW-a [47] which consists of 143 subjects with no less than 10 images per subject. The total number of images is 4174. Despite the fact that LFW was mainly used for face verification, in our work it is used for identification purpose. • USPS Handwritten Digits7 : This dataset contains 8-bit grayscale images of “0” through “9”. Each class has 1100 images. We used one tenth of this dataset. • COIL-208 This dataset (Columbia Object Image Library) consists of 1440 images of 20 objets. Each object has underwent 72 rotations (each object has 72 images). The objects display a wide variety of complex geometry and reflectance characteristics. We used a subset of the database with 18 images for each object (one image for every 20 degree of rotation). • 20-Newsgroups9 : This is tiny version of the 20newsgroups data, with binary occurrence data for 100 words across 16242 postings. The number of classes is four and the total number of samples is 2000. 6
http : //www.itl.nist.gov/iad/humanid/f eret/ http : //www.cs.nyu.edu/ roweis/data.html 8 http : //www.cs.columbia.edu/CAV E/sof tware/sof tlib/coil − 20.php: 9 http : //www.cs.nyu.edu/ roweis/data.html 7
15
Table 1 summarizes some important features of the used datasets in our experiments. Table 1: The specifications of the ten used datasets.
Dataset ORL UMIST PIE Ext Yale FacePix FERET LFW-a USPS COIL-20 20-NEWS
Number of samples 400 575 1926 1774 1080 1400 4174 1100 360 2000
Number of classes 40 20 68 28 30 200 143 10 20 4
Dimension of a sample 32×32 28×23 32×32 32×32 32×32 32×32 32×32 16×16 51×51 100
∼Samples/class 10 29 28 63 181 7 >10 110 18 500
4.2. Semi-supervised learning and empirical setting We compare the proposed methods with GFHF [34], Class Mass Normalized GFHF (GFHF+CMN), RMGT [37], LapRLS [36], SDA [39], SDE [41], FME [38], and LDA [11]. It should be noticed that all these methods are semi-supervised except LDA which is supervised. For the embedding methods relying on a projection matrix (LDA, SDA, SDE, and the two proposed methods), any classifier can be used with the obtained mapped data in order to classify the unlabeled and unseen data samples. Since all compared semi-supervised methods used the graph Laplacian L associated with the training data, the graph was constructed using the classic KNN graph (symmetric KNN) and the RBF (or Gaussian) kernel for the edge weights. The associated weight with each neighboring pair is given by S(xi , xj ) = exp(−||xi − xj 2 /t0 ) where t0 ∈ R+ is the kernel bandwidth parameter. It is set as in many works to the average of squared distances in the training set. The value of neighborhood size was set to 10. For the proposed methods, we need to compute the within-class and the between class graph (built on the labeled subset). The weights associated are set to ones or zeros, i.e. the corresponding similarity matrices Sb and Sw are binary matrices. It is worth noticing that all compared methods used the same data graph. This makes sure that the difference in performance is due to the embedding method only and not to the data graph.
16
We randomly select 50% data as the training dataset and use the remaining 50% data as the test dataset. Among the training data, we randomly label P samples per class and treat the other training samples as unlabeled data. The above setting has been adopted to compare different methods. All the training data (labeled and unlabeled samples) are used to learn a subspace (i.e., a projection matrix) for semi-supervised embedding methods or a label mapping for the label propagation methods, except that we only use the labeled data for subspace learning in LDA. In all the experiments, PCA is used as a preprocessing step to preserve 98% energy of the data. 4.3. Method comparison For LapRLS, SDA, SDE, FME, two regularization parameters should be tuned. For our two proposed methods three parameters are used. Each of these parameters is set to a subset of values belonging to {10−9 , 10−6 , 10−3 , 1, 103 , 106 , 109 } as in [38]. For the kernel version, the same interval was used for λ and μ, however γ was {1, 10, 50, 100}. We report the top-1 recognition accuracy (best average recognition rate) from the best parameter configuration. Table 2 reports the best mean recognition accuracy (for nine datasets) over ten random splits on the unlabeled data and the test data, which are referred to as Unlabel and Test, respectively. For the embedding methods (LDA, SDA, SDE, IFSSFE, and kernel IFSSFE), the classification was performed using the Nearest Neighbor classifier. For the proposed methods (IFSSFE and kernel IFSSFE), the dimension of the embedding is bounded by the number of training samples N . Thus, for each parameter configuration associated with the criterion and for each split we have a curve for the recognition rate that depicts the rate at several sampled dimensions. Thus, for each parameter configuration, the performance is set to the best rate in the mean curve which was obtained by averaging the rate curves over the splits. For the kernel IFSSFE, we used the Gaussian kernel whose expression was similar to the similarity elements, i.e., K(xi , xj ) = exp(−||xi − xj 2 /(2m t0 )) in which t0 is set to the average of squared distances in the training set, and m is an integer chosen from the set {1, 2, 3, 4, 5, 6}. Figure 1 illustrates the average recognition rate curves as a function of feature dimen17
sions. These curves were obtained for the test part of data using one labeled sample per class. We recall that FME method does not depend on the dimension, the maximum dimension of SDA method is given by C − 1, and the maximum dimension of SDE is given by the dimension of input samples. For the proposed methods the maximum dimension is given by the number of training samples. Table 3 illustrates the average recognition rate obtained with LFW-a dataset. In total, the training data are composed by 1430 images (1287 labeled and 143 unlabeled) and the test set was formed by 2774 images. From the results depicted in Table 2, and Figure 1, we can draw the following conclusions: • The proposed method IFSSFE and its kernel version have given the best recognition rate in general. Out of 54 tested cases for IFSSFE, the latter was slightly outperformed 4 times. • In general, the kernel version of IFSSFE has given better performance than the nonkernel one. For PIE and Extended Yale, the performance of the kernel IFSSFE was lower than IFSSFE performance. However, when polynomial kernels were used for PIE and Extended Yale, the performance of the kernel version was improved. In the Table, all results were obtained by the Gaussian kernel. • For non-face datasets (COIL-20, USPS, and 20-NEWS), the performance improvement obtained by the proposed methods was very significant compared with the performance of the competing methods such as SDA, SDE, and FME. This holds true for all label percentages and for unlabeled and test data. • For some datasets, the performance obtained with the test part of the data was better than the one obtained with the unlabeled part. This can be explained by the fact that the captured model has a high generalization capacity. • The performance of GFHF, GFHF+CMN, and RMGT (direct label propagation methods) was not that good for face datasets. Indeed, the results obtained for USPS 18
and 20-NEWS are somehow good. This can be explained by the fact that face identification problem belongs to the “fine-grained categorization” problem considered as a hard problem for methods relying on a direct label propagation. • As can be seen from the recognition accuracy curves associated with IFSSFE (Figure 1), it is not necessary that by increasing the number of features in the projection space the recognition accuracy increases. This finding is very consistent with many non-linear embedding methods for which increasing the number of features would deteriorate the performance [48, 49]. Indeed, the IFSSFE is essentially a flexible non-linear method that explicitly incorporates a linear regression. Therefore, its performance as a function of feature dimension can be similar to either linear embedding methods (for which increasing the number of features can improve the performance) or to the non-linear embedding methods (for which increasing the number of features can deteriorate the performance). For instance, the IFSSFE curve exhibited a linear behavior for UMIST dataset (See Figure 1.(a)) and a non-linear behavior for COIL-20 dataset (See Figure 1.(f)) • More importantly, we can observe that the optimal performance can be reached with a relatively low dimension. This property makes the proposed method very appealing in practice and is very consistent with the objective of dimensionality reduction paradigm. Indeed, one needs to find a trade-off between a high recognition rate and a compact representation with a reduced number of dimensions. • The performance of the proposed methods on the challenging LFW-a dataset is superior to that of the competing methods. We stress the fact that the LFW-a dataset is mainly used for face verification evaluation. 4.4. Method performance with fixed dimension Based on the above results, we can observe that the FME method is, in general, the best competing method. In this section, we compare the performance of the FME method with that of our proposed methods for which the dimension of features is fixed to 19
the number of classes, C. Note that the FME method is essentially a label propagation algorithm that uses C features. We will show that even in the case where dimensionality of the embedding is fixed to C, IFSSFE and kernel IFSSFE are still superior to FME if the balance parameters were optimized at this fixed dimension. This indicates that the criteria used by the proposed methods were the main reason for this obtained superiority. Table 4 illustrates the average performance of FME and IFSSFE in such conditions. In this table, we report the results corresponding to the unlabeled samples. Since the proposed IFSSFE is a generic semi-supervised embedding, we used five classifiers: 1-NN, Support Vector Machines (SVM with RBF kernel), SVM (with polynomial degree equal to 1 and 3), and the Two Phase Test Sample Sparse Representation (TPTSSR) classifier [50]. As can be seen, even when IFSSFE is restricted to work with only C features, its performance is still better than that of FME. We can also observe that the use of other classifiers such as SVM and TPTSSR has enhanced the performance of the IFSSFE with respect to the Nearest Neighbor classifier. Figure 2 illustrates the average performance of FME and the proposed kernel IFSSFE as a function of feature dimension for four datasets: Extended Yale, PIE, FacePix, and COIL-20. The projection models associated with FME and IFSSFE were optimized on the fixed dimension given by C. In each plot, we show the average curve over ten splits. These curves depicts the performance on the test subset in which the number of labeled samples per class was set to three. For the proposed method IFSSFE, we used four classifiers: 1-NN, RBF SVM, polynomial SVM (degree = 3), and the Two Phase Test Sample Sparse Representation (TPTSSR) classifier. Despite that the performance was optimized at C, the performance of the kernel IFSSFE and the IFSSFE method (the corresponding plots are omitted due to space limitation) at other dimensions can be superior to that of FME for most cases. 4.5. Proposed method versus a Two Phase estimation In order to show the benefit of simultaneously estimating the non-linear embedding and the regression (IFSSFE), we have used a two phase scheme estimation of the embedding. 20
In this two phase scheme, we used two consecutive optimizations. The first optimization is simply a semi-supervised non-linear embedding which provides the embedding of the training data (the Z matrix) by minimizing minZ T r(ZT L1 Z). The second optimization performs a regularized linear regression in order to estimate the projection matrix and the offset vector by minimizing minW,b (W2 + γ XT W + 1N bT − Z2 ). Table 5 illustrates a comparison between the proposed IFSSFE and a two phase estimation in which a non-linear semi-supervised embedding is first estimated then a linear regression is applied via 2 -norm regularization. As can be seen, for all cases the proposed method in which the non-linear embedding and the linear regression are simultaneously estimated provided much better results than the two phase estimation. This can be explained by the fact that the proposed method explicitly uses flexibility by forcing the embedding to be a result of a simultaneous optimization on the non-linear embedding and on the linear regression. 4.6. Sensitivity to parameters In [33], it is suggested that for every new semi-supervised learning algorithm one should fully analyze the influence of its parameters. Due to the nature of semi-supervised learning, in which there exists only a limited number of labeled examples, tuning all parameters might be unfeasible. Therefore, there is a need for algorithms that are slightly dependable on parameter tuning, i.e., that have a stable performance over the parameter space. Our proposed methods have explicitly three balance parameters μ, γ, and λ. The study conducted in Table 2 has shown that the optimal λ has a constant value for all datasets except for COIL-20. This value was 106 . We study the performance of the proposed method as a function of μ by fixing the two other parameters. Figure 3 illustrates the recognition accuracy variation with different parameter μ for IFSSFE and its kernel version. The plots show the results on the test datasets. Three labeled samples per class are used in Extended Yale, PIE, FacePix, COIL20, UMIST, and USPS. We observe that IFSSFE is relatively robust to the parameter μ. We have similar observations on other databases as well as with different number of labeled 21
samples.
5. Conclusion This paper presented a novel semi-supervised dimensionality reduction method for classification tasks. We propose an Inductive Flexible Semi Supervised Feature Extraction as well as its kernelized version. The proposed schemes retained the merits of Flexible Manifold Embedding and the graph-based non-linear embedding. The proposed methods simultaneously estimate a non-linear embedding as well as a transform needed for mapping the unseen samples. If the transform is linear we get the linear variant. The kernel variant is obtained by adopting the kernel trick. The proposed criterion associated with the linear variant blends several useful criteria aiming at getting a smooth and discriminant nonlinear subspace that is very close to a linear subspace. The criterion of the kernelized variant aims at getting non-linear subspace that is very close to kernel-based mapping. The proposed methods were evaluated on ten benchmark databases. We have provided a comparison with several competing methods based on label propagation schemes as well as on semi-supervised graph-based embedding. Our proposed methods outperformed the competing methods in most cases. Although the proposed method has three balance parameters, we have observed that one of them can be kept constant to a given high value regardless of the dataset used. Thus, the proposed method has actually two balance parameters, which is exactly the same number as the one of existing semi-supervised graph-based embedding techniques. We have also shown that the optimal performance can be obtained at a relatively low number of features.
22
Table 2: The best average classification results on ten random splits using several methods. The methods GFHF, GFHF+CMN, RMGT, LapRLS, and FME are based on label propagation. LDA, SDA, SDE, and the two proposed methods are embedding methods for which nearest neighbor classifier was used after the projection. Dataset\Method
ORL UMIST PIE Ext Yale FacePix FERET USPS COIL-20 20-NEWS
ORL UMIST PIE Ext Yale FacePix FERET USPS COIL-20 20-NEWS
Labeled Samples P =1 P =3 P =1 P =3 P =1 P =3 P =1 P =3 P =1 P =3 P =1 P =3 P =1 P =3 P =1 P =3 P =1 P =3 P =1 P =3 P =1 P =3 P =1 P =3 P =1 P =3 P =1 P =3 P =1 P =3 P =1 P =3 P =1 P =3 P =1 P =3
GFHF
LDA
GFHF
31.4 65.8 58.3 80.2 15.3 38.6 36.6 64.4 26.0 48.2 21.7 43.9 30.3 50.0 43.2 58.5 41.8 48.7
44.2 59.2 55.4 72.0 10.3 22.5 19.0 42.9 17.3 36.6 17.8 29.6 37.5 56.7 52.8 63.2 30.4 46.2
47.6 63.1 59.1 72.3 13.1 22.7 26.3 45.9 21.6 38.0 23.6 38.4 46.3 61.3 58.9 68.0 54.3 61.8
47.9 62.7 57.1 72.1 11.4 22.9 23.0 45.6 18.1 38.8 19.2 31.1 45.1 59.9 57.1 65.1 53.6 61.5
37.7 77.8 34.9 45.6 19.6 44.2 34.8 60.1 26.9 46.6 21.0 55.0 30.8 48.4 43.5 66.9 41.5 48.2
-
-
-
+CMN
RMGT
23
LapRLS
SDA
Unlabeled (%) 61.0 31.4 69.1 70.8 63.9 58.3 85.4 83.2 29.3 15.3 30.3 48.2 44.9 36.6 61.3 65.0 31.3 26.0 48.4 53.6 39.0 21.7 59.6 46.8 39.7 30.3 53.1 49.8 58.9 43.2 69.3 60.4 44.3 41.8 54.3 51.5 Test (%) 59.7 37.7 68.6 79.0 37.3 34.9 50.3 50.9 32.4 19.6 32.5 52.8 41.7 34.8 59.1 61.5 30.0 26.9 45.6 50.9 35.6 21.0 60.2 56.2 39.4 30.8 51.7 48.8 54.1 43.5 71.7 66.9 43.8 41.5 54.0 51.4
Kernel
SDE
FME
IFSSFE
51.0 75.9 59.4 83.5 20.9 31.3 40.0 50.0 40.4 57.1 24.6 42.3 35.9 52.8 56.0 75.4 36.4 47.3
62.3 80.1 58.7 74.9 23.7 41.6 38.4 64.8 28.1 50.5 35.5 54.1 39.7 55.6 62.0 70.7 41.5 54.4
63.8 85.5 65.7 85.8 33.1 48.5 46.3 75.3 40.6 65.8 39.7 60.6 46.7 63.1 68.0 80.4 61.7 64.7
70.6 88.3 66.1 87.1 36.4 55.9 46.1 80.9 39.4 70.9 51.6 66.7 46.8 63.8 70.6 83.6 56.2 62.7
50.0 77.9 39.3 48.1 29.2 34.9 37.7 49.0 36.0 52.2 41.2 62.1 35.6 52.3 55.5 72.4 36.3 47.4
59.7 83.5 34.0 45.4 25.7 46.1 35.6 59.1 28.0 47.7 27.9 53.0 40.5 52.4 57.9 68.6 41.2 53.5
60.0 83.4 41.0 58.1 33.7 51.8 41.2 69.3 37.8 61.2 37.4 70.8 45.0 62.5 60.8 77.4 60.9 63.8
59.5 83.1 42.0 58.7 32.1 51.2 40.3 67.4 38.2 63.5 42.6 73.1 48.4 63.9 64.5 80.8 56.6 62.4
IFSSFE
Table 3: Classification results on LFW-a dataset. There are 143 unlabelled images and 2744 test images. FME is based on label propagation. LDA, SDA, SDE, and the two proposed methods are embedding methods for which the nearest neighbor classifier was used after the projection.
LFW-a Method LDA SDA SDE FME IFSSFE Kernel IFSSFE
Unlabel(%) 28.1 44.7 25.0 44.8 54.5 56.0
Test(%) 38.5 28.9 13.0 38.0 40.2 42.3
Table 4: Comparing the average performance of the FME method and the proposed IFSSFE method
obtained at dimension equal to C. For the proposed IFSSFE (a generic semi-supervised embedding), we used five classifiers: 1-NN, RBF SVM, polynomial SVM (degree=1 and 3), and the Two Phase Test Sample Sparse Representation (TPTSSR) classifier. Reported results correspond to unlabeled data. Dataset Ext Yale PIE FacePix COIL-20
Labeled Samples P =1 P =3 P =1 P =3 P =1 P =3 P =1 P =3
FME
38.4 64.8 23.7 41.6 28.1 50.5 62.3 70.7
IFSSFE
IFSSFE
IFSSFE
IFSSFE
IFSSFE
(1-NN)
(RBF SVM)
(SVM d=1)
(SVM d=3)
(TPTSSR)
38.5 65.4 26.3 43.6 31.6 56.8 63.6 78.2
38.5 59.5 26.3 48.0 31.6 57.1 60.2 76.9
38.5 66.6 30.4 48.5 31.8 57.4 63.6 79.8
44.9 67.3 32.3 49.4 32.1 56.7 63.7 79.2
49.3 71.6 33.0 56.2 31.9 61.3 62.0 70.7
24
UMIST
Extended Yale 42 40
40
Recognition rate %
Recognition rate %
38
35
30 FME SDA SDE Prop. method Prop. Kernel method
25
20
60
40
20
80
100
36 34 32 30 28
FME SDA SDE Prop. method Prop. Kernel method
26 24 22
120
40
20
60
Feature dimension
80
100
140
120
160
180
200
Feature dimension
(a)
(b)
PIE
FacePix 40
35
38 36
Recognition rate %
Recognition rate %
30
25
20 FME SDA SDE Prop. method Prop. Kernel method
15
10
150
100
50
200
250
34 32 30 28 FME SDA SDE Prop. method Prop. Kernel method
26 24 22 20
300
40
20
60
Feature dimension
80
100
140
120
160
180
Feature dimension
(c)
(d)
USPS
COIL−20 65
45
35
Recognition rate %
Recognition rate %
60 40
FME SDA SDE Prop. method Prop. Kernel method
30
25
20
55
50
45
FME SDA SDE Prop. method Prop. Kernel method
40
50
150
100
200
35
250
Feature dimension
20
40
60
80
100
120
Feature dimension
(e)
(f)
Figure 1: Recognition accuracy variation as a function of dimensions for UMIST, Extended Yale, PIE, FacePix, USPS and COIL-20 datasets. These curves correspond to the best average curves. One labeled sample per class is used.
25
200
PIE
Extended Yale 80 55
FME Kernel method (NN) Kernel method (SVM RBF) Kernel method (SVM Polynomial) Kernel method (TPTSSR)
70
65
50
Recognition rate %
Recognition rate %
75
60
55
45
40 FME Kernel method (NN) Kernel method (SVM RBF) Kernel method (SVM−Polynomial) Kernel method (TPTSSR)
35 50
45
30
250
200
150
100
50
50
150
100
FacePix
300
350
400
85
FME Kernel method (NN) Kernel method (SVM RBF) Kernel method (SVM Polynomial) Kernel method (TPTSSR)
60
80
Recognition rate %
65
Recognition rate %
250
COIL−20
70
55
50
45
200
Feature dimension
Feature dimension
75
70
FME Kernel method (NN) Kernel method (SVM RBF) Kernel method (SVM Polynomial) Kernel method (TPTSSR)
65
20
40
60
80
100
120
140
160
180
60
200
Feature dimension
20
40
60
80
100
120
Feature dimension
Figure 2: A performance comparison between FME and the proposed kernel IFSSFE as a function of features for four datasets: Extended Yale, PIE, FacePix, and COIL-20. The projection models associated with FME and kernel IFSSFE were optimized on the fixed dimension given by C, namely the number of classes. In each plot, we show the average curve over ten splits. These curves depicts the performance on the test subset in which the number of labeled samples was set to three per class. For the proposed kernel IFSSFE, we used four classifiers: 1-NN, RBF SVM, polynomial SVM (degree 3), and the Two Phase Test Sample Sparse Representation (TPTSSR) classifier.
26
Figure 3: Recognition accuracy variation with different parameter µ for IFSSFE. The plots show the results on the unseen test datasets. Three labeled samples per class are used in Extended Yale, PIE, FacePix, COIL-20, UMIST, and USPS.
27
Table 5: A comparison between the proposed method IFSSFE and a two phase estimation in which a non-linear semi-supervised embedding is first estimated then a linear regression is estimated via a ridge regression.
Method
1 labeled sample U(%) T(%)
2 labeled samples U(%) T(%)
3 labeled samples U(%) T(%)
ORL Two phase estimation IFSSFE
50.8 63.8
48.5 60.0
58.1 73.9
56.4 72.6
66.6 85.5
65.3 83.4
79.1 85.8
47.7 58.1
38.9 48.5
42.5 51.8
68.3 75.3
65.3 69.3
48.9 65.8
47.2 61.2
39.9 60.6
40.0 70.8
58.5 63.1
55.2 62.5
72.6 80.4
66.1 77.4
51.3 58.8
51.7 59.5
UMIST Two phase estimation IFSSFE
52.8 65.7
31.2 43.3
65.5 75.3
Two phase estimation IFSSFE
21.8 33.1
24.9 34.9
30.7 42.1
Two phase estimation IFSSFE
39.7 46.3
37.4 42.5
59.1 65.9
42.8 52.7
PIE 35.8 44.6
Extended Yale 58.6 62.6
FacePix Two phase estimation
IFSSFE
26.9 40.6
27.8 40.4
38.8 56.6
Two phase estimation IFSSFE
23.8 39.7
20.0 37.4
31.2 51.3
Two phase estimation IFSSFE
44.4 46.7
43.0 47.0
50.7 55.2
39.5 53.6
FERET 28.2 50.1
USPS 48.7 54.4
COIL-20 Two phase estimation IFSSFE
59.6 68.0
55.4 62.6
67.6 75.1
Two phase estimation IFSSFE
42.7 49.9
42.5 49.8
47.6 55.8
62.4 71.7
20-NEWS
28
47.4 56.5
References
[1] L. Maaten, E. Postma, and J. Herik, “Dimensionality reduction: A comparative review,” Tech. Rep. TiCC TR 2009005, TiCC, Tilburg University, 2009. [2] L. Saul, K. Weinberger, F. Sha, J. Ham, and D. Lee, Semisupervised Learning, ch. Spectral methods for dimensionality reduction. MIT Press, Cambridge, MA, 2006. [3] S. Yan, D. Xu, B. Zhang, H. Zhang, Q. Yang, and S. Lin, “Graph embedding and extension: a general framework for dimensionality reduction,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 29, no. 1, pp. 40–51, 2007. [4] T. Zhang, D. Tao, X. Li, and J. Yang, “Patch alignment for dimensionality reduction,” IEEE Trans. on Knowledge and Data Engineering, vol. 21, no. 9, pp. 1299–1313, 2009. [5] A. Jain, R. Duin, and J. Mao, “Statistical pattern recognition: a review,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 22, no. 1, pp. 4–37, 200. [6] X. Li, S. Lin, S. Yan, and D. Xu, “Discriminant locally linear embedding with highorder tensor data,” IEEE Trans. Syst., Man, Cybern. B: Cybern, vol. 32, pp. 342–352, April 2008. [7] A. M. Martinez and M. Zhu, “Where are linear feature extraction methods applicable?,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 12, pp. 1934–1944, 2005. [8] Y. Tao and J. Yang, “Quotient vs. difference: Comparison between the two discriminant criteria,” Neurocomputing, vol. 18, no. 12, pp. 1808–1817, 2010. [9] H. Li, T. Jiang, and K. Zhang, “Efficient and robust feature extraction by maximum margin criterion,” IEEE Trans. on Neural Networks, vol. 17, no. 1, pp. 157–165, 2006.
29
[10] T. Kim and J. Kittler, “Locally linear discriminant analysis for multimodally distributed classes for face recognition with a single model image,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 27, no. 3, pp. 318–327, 2005. [11] K. Fukunaga, Introduction to Statistical Pattern Recognition. Academic Press, New York, 1990. [12] S. Roweis and L. Saul, “Nonlinear dimensionality reduction by locally linear embedding,” Science, vol. 290, no. 5500, pp. 2323–2326, 2000. [13] M. Belkin and P. Niyogi, “Laplacian eigenmaps for dimensionality reduction and data representation,” Neural Computation, vol. 15, no. 6, pp. 1373–1396, 2003. [14] J. B. Tenenbaum, V. de Silva, and J. C. Langford, “A global geometric framework for nonlinear dimensionality reduction,” Science, vol. 290, no. 5500, pp. 2319–2323, 2000. [15] O. Chapelle, B. Scholkopf, and A. Zien, Semi-Supervised Learning. MIT Press, Cambridge MA, 2006. [16] T. Silva and L. Zhao, “Network-based stochastic semisupervised learning,” IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 3, pp. 451–466, 2012. [17] G. Camps-Valls, T. B. Marsheva, and D. Zhou, “Semi-supervised graph-based hyperspectral image classification,” IEEE Trans. Geoscience and Remote Sensing, vol. 45, no. 10, pp. 3044–3054, 2007. [18] Y. Zhang and D.-Y. Yeung, “Semisupervised generalized discriminant analysis,” Neural Networks, IEEE Transactions on, vol. 22, no. 8, pp. 1207–1217, 2011. [19] H. Huang, J. Li, and J. Liu, “Enhanced semi-supervised local fisher discriminant analysis for face recognition,” Future Generation Computer Systems, vol. 28, pp. 244– 253, 2012. 30
[20] W. Liu, J. He, and S. Chang, “Large graph construction for scalable semi-supervised learning,” in International Conference on Machine Learning, 2010. [21] F. Pan, J. Wang, and X. Lin, “Local margin based semi-supervised discriminant embedding for visual recognition,” Neurocomputing, vol. 74, pp. 812–819, 2011. [22] W. Yang, S. Zhang, and W. Liang, “A graph based subspace semi-supervised learning framework for dimensionality reduction,” in International Conference on Computer Vision, pp. 664–677, 2008. [23] Y. Zhang and D.-Y. Yeung, “Semi-supervised discriminant analysis via cccp,” in Machine Learning and Knowledge Discovery in Databases, pp. 644–659, Springer, 2008. [24] M. R.-T. L. Z. Xu, I. King and J. Rong, “Discriminative semi-supervised feature selection via manifold regularization,” IEEE Trans. on Neural Networks, vol. 21, no. 7, pp. 1033–1047, 2010. [25] T. Zhang, R. Ji, W. Liu, D. Tao, and G. Hua, “Semi-supervised learning with manifold fitted graphs,” in International Joint Conference on Artificial Intelligence, 2013. [26] W. Liu, D. Tao, and J. Liu, “Transductive component analysis,” in IEEE International Conference on Data Mining, pp. 433–442, 2008. [27] H. Cevikalp, “Semi-supervised dimensionality reduction using pairwise equivalence constraints,” in International Conference on Computer Vision Theory and Applications, pp. 489–496, 2009. [28] I. Davidson, “Knowledge driven dimension reduction for clustering,” in International Joint Conference on Artificial Intelligence, pp. 1034–1039, 2009. [29] G. Wacquet, E. P. Caillault, D. Hamad, and P. A. H´eBert, “Constrained spectral embedding for k-way data clustering,” Pattern Recognition Letters, vol. 34, no. 9, pp. 1009–1017, 2013. 31
[30] X. Wang, B. Qian, and I. Davidson, “On constrained spectral clustering and its applications,” Data Mining and Knowledge Discovery, vol. 28, no. 1, pp. 1–30, 2014. [31] S. Yan, S. Bouaziz, D. Lee, and J. Barlow, “Semi-supervised dimensionality reduction for analyzing high-dimensional data with constraints,” Neurocomputing, vol. 76, no. 1, pp. 114–124, 2012. [32] Y. Song, F. Nie, C. Zhang, and S. Xiang, “A unified framework for semi-supervised dimensionality reduction,” Pattern Recognition, vol. 41, pp. 2789–2799, 2008. [33] C. Sousa, S. Rezende, and G. Batista, “Influence of graph construction on semisupervised learning,” in European Conferene on Machine Learning, pp. 160–175, 2013. [34] X. Zhu, Z. Ghahramani, and J. Lafferty, “Semi-supervised learning using gaussian fields and harmonic functions,” in International Conference on Machine Learning, 2003. [35] D. Zhou, J. Huang, and B. Scholkopf, “Learning from labeled and unlabeled data on a directed graph,” in International Conference on Machine Learning, 2005. [36] M. Belkin, P. Niyogi, and V. Sindhwani, “Manifold regularization: a geometric framework for learning from labeled and unlabeled examples,” Journal of Machine Learning Research, vol. 7, pp. 2399–2434, 2006. [37] W. Liu and S. Chang, “Robust multi-class transductive learning with graphs,” in Computer Vision and Pattern Recognition, pp. 381–388, 2009. [38] F. Nie, D. Xu, I. Tsang, and C. Zhang, “Flexible manifold embedding: A framework for semi-supervised and unsupervised dimension reduction,” IEEE Transactions on Image Processing, vol. 19, no. 7, pp. 1921–1932, 2010. [39] D. Cai, X. He, and J. Han, “Semi-supervised discriminant analysis,” in IEEE Int. Conf. Comput.Vision, pp. 1–7, 2007.
32
[40] H. Chen, H. Chang, and T. Liu, “Local discriminant embedding and its variants,” in IEEE International Conference on Computer Vision and Pattern Recognition, pp. 846–853, 2005. [41] H. Huang, J. Liu, and Y. Pan, “Semi-supervised marginal fisher analysis for hyperspectral image classification,” ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, vol. I-3, pp. 377–382, 2012. [42] G. Yu, G. Zhang, C. Domeniconi, Z. Yu, and Y. J, “Semi-supervised classification based on random subspace dimensionality reduction,” Pattern Recognition, vol. 45, pp. 1119–1135, 2012. [43] D. Cai, X. He, K. Zhou, J. Han, and H. Bao, “Locality sensitive discriminant analysis,” in International Joint Conference on Artificial Intelligence, pp. 708–713, 2007. [44] Z. Lai, Z. Jin, J. Yang, and W. Wong, “Sparse local discriminant projections for face feature extraction,” in IEEE International Conference on Pattern Recognition, pp. 926–929, 2010. [45] W. Zheng, Z. Lin, and H. Wang, “L1-norm kernel discriminant analysis via bayes error bound optimization for robust feature extraction,” IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 4, pp. 793–803, 2014. [46] J. Taylor and N. Cristianini, Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge, England, 2006. [47] W. Liu and S. Chang, “Similarity scores based on background samples,” in Computer Vision and Pattern Recognition, 2009. [48] B. Raducanu and F. Dornaika, “A supervised non-linear dimensionality reduction approach for manifold learning,” Pattern Recognition, vol. 45, pp. 2432–2444, 2012. [49] B. Raducanu and F. Dornaika, “Embedding new observations via sparse-coding for non-linear manifold learning,” Pattern Recognition, vol. 47, pp. 480–492, 2014. 33
[50] Y. Xu, D. Zhang, J. Yang, and J.-Y. Yang, “A two-phase test sample sparse representation method for use with face recognition,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 21, no. 9, pp. 1255–1262, 2011.
34