Pattern Recognition 45 (2012) 4466–4493
Contents lists available at SciVerse ScienceDirect
Pattern Recognition journal homepage: www.elsevier.com/locate/pr
Constrained large Margin Local Projection algorithms and extensions for multimodal dimensionality reduction Zhao Zhang n, Mingbo Zhao, Tommy W.S. Chow Department of Electronic Engineering, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong
a r t i c l e i n f o
a b s t r a c t
Article history: Received 17 August 2011 Received in revised form 11 April 2012 Accepted 19 May 2012 Available online 5 June 2012
A Constrained large Margin Local Projection (CMLP) technique for multimodal dimensionality reduction is proposed. We elaborate the criterion of CMLP from a pairwise constrained marginal perspective. Four effective CMLP solution schemes are presented and the corresponding comparative analyses are given. An equivalent weighted least squares formulation for CMLP is also detailed. CMLP is originated from the criterion of Locality Preserving Projections (LPP), but CMLP offers a number of attractive advantages over LPP. To keep the intrinsic proximity relations of inter-class and intra-class similarity pairs, the localized pairwise Cannot-Link and Must-Link constraints are applied to specify the types of those neighboring pairs. By utilizing the CMLP criterion, margins between inter- and intra-class clusters are significantly enlarged. As a result, multimodal distributions are effectively preserved. To further optimize the CMLP criterion, one feasible improvement strategy is described. With kernel methods, we present the kernelized extensions of our approaches. Mathematical comparisons and analyses between this work and the related works are also detailed. Extensive simulations including multivariate manifold visualization and classification on the benchmark UCL, ORL, YALE, UMIST, MIT CBCL and USPS datasets are conducted to verify the efficiency of our techniques. The presented results reveal that our methods are highly competitive with and even outperform some widely used state-of-the-art algorithms. & 2012 Elsevier Ltd. All rights reserved.
Keywords: Dimensionality reduction Large margin projection Manifold visualization Pairwise constraints Locality preservation Multimodality preservation Kernel method Pattern classification
1. Introduction Tackling high-dimensional data has become increasingly important, as most of real data in the emerging applications consists of many attributes and redundant information, such as face data and gene expression data, etc. Researchers working on these applications usually require confronting with the problem about how to represent high-dimensional datasets efficiently. Thus extracting the most informative attributes holding all the required information for subsequent visualization and classification is important. This process can be benefited from feature reduction or dimensionality reduction (DR). DR aims to find a set of projection axes to transform high-dimensional dataset into low-dimensional representations with the intrinsic structure characteristics being effectively preserved [4,5]. Principal Component Analysis (PCA) [6,8] and Fisher Linear Discriminant Analysis (LDA) [8,21] are the two most widely used DR techniques. PCA and LDA can reveal only the linear relations between the features. LDA as a unimodal method tends to deliver unsatisfactory results when facing multimodal cases [9].
n
Corresponding author. Tel.: þ852 34422784. E-mail addresses:
[email protected],
[email protected] (Z. Zhang).
0031-3203/$ - see front matter & 2012 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.patcog.2012.05.015
But intrinsic multimodality is often encountered in many real life applications, such as facial and gait-based gender classification. The intra-class multimodal structure appears when genders are simply classified into males and females. Thus, preserving multimodal structures for dimensionality reduction is a major problem that needs to be addressed. To represent given data efficiently, it is vital to incorporate the local information of data into feature representation, because points in a dense region deliver similar manifolds [4,5]. This assumption leads to the appearance of many neighborhood preserving techniques [13], e.g., Locally Linear Embedding (LLE) [4] and Laplacian Eigenmaps (LE) [5]. Nonlinear LE and LLE are developed for DR and visualization. A common shortcoming of LE and LLE is that they cannot embed new points, because they produce the embeddings directly without exhibiting the projection axes. To improve LLE and LE, Locality Preserving Projections (LPP) [1] and Neighborhood Preserving Embedding (NPE) [15] provide the linear approximations to LE and LLE. A common behavior of LE, LLE, LPP and NPE is that the projections of all similarity pairs are preserved in close vicinity of the reduced output space, but note that inter-class neighbors also exhibit similar embeddings. Thus these DR methods may suffer from the problem of congregating the embeddings of those inter-class similarity pairs. The major reason is they are naturally
Z. Zhang et al. / Pattern Recognition 45 (2012) 4466–4493
unsupervised and do not apply any form of supervised information, such as class labels or pairwise constraints (PC). Note that NPE shares the similar objective function to LLE. But the unclear interpretation of the weights based metrics for LLE and NPE may limit their certain applications [20]. In contrast, the weighted metric of LPP is definitely defined and the projection axes of LPP can be easily obtained by eigen-decomposition method. Through enabling the inclusion of class label information directly into the LPP criterion, some discriminant locality preserving LPP extensions, e.g., discriminant LPP (DLPP) [49], maximum margin criterion (MMC) based DLPP (DLPP/MMC) [50] and supervised optimal LPP (SOLPP) [53] were recently proposed for DR. But it is worth noting that these class labels driven LPP extensions are rigid in regulating the supervised information for the fixed label numbers. It is also noticed that graph constructions affect the performance of graph algorithms directly [47]. To improve the graph construction of LPP, a graph-optimized LPP (GoLPP) [11] method was recently presented. But its graph is iteratively updated, thus GoLPP is time-consuming for large-scale datasets and a proper iteration stop threshold is difficult to be guaranteed. Some other graph constructions, for instance adaptive graph construction (e.g., sparse representation [27,32,43,48] and sample-dependent graph construction [31]), and b-matching graph [12] were also proposed. Note that all of these graph constructions mainly focus on determining the neighborhood size automatically or gradually updating the weight matrix. Compared with the class labels, pairwise Cannot-Link (CL) and Must-Link (ML) constraints have been widely used in many areas, e.g., feature selection [17,18] and semi-supervised learning [2,3,19]. PC can sometimes be obtained with minimal effort and can provide more supervised information compared with the class labels [2,17] for the fixed label numbers. Different from virtually all previous discriminant LPP extensions, this work applies the neighborhood graph induced PC to guide the discriminant learning. Two new graphs called ML-graph and CL-graph are constructed from the constrained neighborhood graph. We then elaborate a new LPP criterion based pairwise constrained marginal weighted metric and propose a Constrained large Margin Local Projection (CMLP) algorithm for multimodal DR. CMLP clearly considers the local information of data as well as the discriminant structures embedded in the PC. It is worthwhile to highlight the major contributions of this work from a number of perspectives: 1. A constrained marginal local projection criterion is proposed. In this setting, PC induced from the neighborhood graph are used to specify whether similarity pairs are in a class or different classes, which makes CMLP flexible in regulating the supervised information. With this criterion, margins between inter- and intra-class clusters are significantly expanded. Meanwhile, margins between points within each cluster can be significantly shrunk. 2. Four effective solution schemes are presented to compute the CMLP criterion. We call the solution schemes as CMLP-1, CMLP-2, CMLP-3 and CMLP-4. We also compare the presented schemes. Similar to the LPP problem, the projection axes of CMLP-2, CMLP3 and CMLP-4 are solved by the generalized eigen-decomposition [21] and thus not orthogonal. To address this issue, the Trace Ratio (TR) optimization [23,24,25] based improvement strategy is also proposed. 
Based on the TR criterion, the projection matrix is orthogonal, thus the similarity based on the Euclidean distance can be effectively preserved according to the stronger orthogonal constraints [10,24,25]. Then, three TR criterion based orthogonal technologies are detailed. 3. A least squares view of CMLP-3 is elaborated. CMLP-3 is theoretically formulated as a weighted least squares (WLS)
4467
problem. We also detail the equivalence between CMLP-3 and the WLS framework. 4. CMLP is linear which makes it fast and suitable for real applications. But nonlinear structures are common in real life data [14], so we also extend our criteria to nonlinear scenarios for mining the nonlinear structures of data. 5. The mathematical comparisons between this work and the related work are discussed. By comparing our methods with the related discriminant local or global techniques proposed by enabling the inclusion of class labels directly, we conclude that our CMLP criterion is more general, delivering strong generalization capabilities. The outline of the paper is described as follows. In Section 2, we briefly review LPP. In Section 3, we model the CMLP criteria mathematically and detail the four solution schemes. The kernelized extensions of our methods are also described. In Section 4, we discuss the connections between our work and the related work. We in Section 5 describe the simulation settings and evaluate our techniques using the Glass Identification dataset from UCL and real ORL, YALE, UMIST, MIT CBCL, USPS databases. Finally, we offer the concluding remarks in Section 6.
2. Locality preserving projections Let xi A Rn ði ¼ 1,2,. . .,NÞ be an n-dimensional vector and zi A Rd ð1r d r nÞ be the low-dimensional representation of xi. Denote the n d projection matrix by C, so the embedding of xi can be given by zi ¼ CTxi, where T denotes the transpose of a matrix or a vector. This process is the so-called linear DR or feature reduction. For a given data matrix X ¼ ½x1 ,x2 ,. . .,xN A RnN , LPP finds a locality preserving projection matrix C to transform X into a set of points in Rd ðd rnÞ. The first step of LPP is to identify the neighbors of each sample xi and construct a neighborhood graph. Note that e-neighborhood [5] or the k-neighborhood [5] can be applied for this purpose. Then a sparse adjacency matrix W is defined to reflect the proximity relations of all vertices in the neighborhood graph, where Wi,j is the affinity value between xi and xj. In this step, local scaling heuristic method [42], simpleminded method [5] and heat kernel [1,5] can be used to construct W. For the simple-minded method, Wi,j ¼Wj,i ¼1 if xi and xj are mutually neighbors. According to [1], LPP computes the matrix C ¼(c19c29y9cd) by solving Min CAR
nd
N N X 1 X 2 2 :CT xi CT xj : W i,j , s:t: Dii :CT xi : ¼ 1, 2 i,j ¼ 1 i¼1
ð1Þ
where notation : : denotes the l2-norm. From Eq. (1), the transforming basis vectors of LPP can then be obtained as the eigenvectors of the following generalized eigen-problem: XLXTcj ¼ ljXDXTcj, where graph Laplacian L¼D W and D is an P N-dimensional diagonal matrix with the entries being Dii ¼ jWi,j. Note that the unsupervised LPP is one of the most representative locality preserving linear DR algorithms.
3. Constrained large Margin Local Projection algorithm (CMLP) 3.1. Motivation and objective Next we describe the LPP criterion from a marginal perspective. When we substitute the squared Euclidean distance
4468
Z. Zhang et al. / Pattern Recognition 45 (2012) 4466–4493
Class 1 Class 2 Intra-class similarity pairs Inter-class similarity pairs
Class 1 Class 2 Intra-class similarity pairs Inter-class similarity pairs Cluster 2 in Class 2
Cluster 1 in Class 1
Cluster 2 in Class 2
Cluster 1 in Class 1
Cluster 1 in Class 2
Cluster 1 in Class 2
After LPP criterion
Cluster 2 in Class 1 Cluster 2 in Class 1 Dimension-reduced Feature Space
Original Input Space
Fig. 1. Geometrical interpretations of the neighborhood preserving LPP criterion.
2
2
d ðzi , zj Þ ¼ :zi zj : into Eq. (1), we obtain J LPP ðCÞ ¼ Min CAR
nd
N X N 1X 2 d ðzi , zj ÞW i,j , 2i¼1j¼1
ð2Þ
where d2(zi,zj) denotes the squared Euclidean distance between low-dimensional representations zi and zj. We consider a multimodal binary-class case in Fig. 1, in which an incomplete undirected neighborhood graph is shown. That is, only partial solid and dotted lines are illustrated to connect the intra- and inter-class neighbors, respectively. In LPP, weight value Wi,j reflects the proximity relations between vertices. As LPP preserves the local information of all neighbors, a nonzero weight Wi,j will be set in all solid and dotted lines. Obviously, when Wi,j becomes heavier, the distance d2(zi,zj) must decrease to minimize the summation for balancing the functional value. When k in nearest-neighbor-search (NNS) is bigger, most of vertex pairs may be mutual neighbors. Thus most of the entries of W will be nonzero. Then LPP pushes all neighbors, including inter-class similarity pairs, close to preserve the local relations of vertices. This operation may cause the embeddings to deteriorate and is disadvantageous for visualization and classification. As a result, all margins between similarity pairs in the reduced space of LPP will be smaller than that in the original space. For Fig. 1, after performing LPP, embeddings of the clusters can be geometrically illustrated in the right of Fig. 1. But for DR, achieving high separation of inter-class similarity points in addition to preserving the local information is important. But the LPP criterion is unable to achieve this objective. The problem of existing locality preserving DR methods, including LPP, stem from its inability to take inter-class separation into account, because they only focus on keeping the local relations of all neighbors. So it is vital to define more efficient criterion to improve the tightness of intra-class similarity pairs, separate inter-class neighbors, and then deliver large margins for feature representation. To solve the shortcomings of LPP, this work considers new criterion from the marginal perspective and proposes a solution to address these issues with the geometrical interpretations.
We describe the marginal criterion from the perspective of pairwise constraints. The neighborhood graph-induced ML and CL constraints are applied for identifying the types of the neighbors. The definitions of ML and CL constraint sets are detailed in the next section. ML and CL reflect the supervised information of points and are more practical way than obtaining the class labels [17–19]. We in Fig. 2 show some examples of the ML and CL constraints in arrows. Intuitively, the proximity relations of the neighboring pairs in ML should be enhanced, while the proximity relations between the neighboring pairs in CL should be weakened as much as possible, because we aim at separating them. It is noted that, to balance the objective functional value, Wi,j and d2(zi,zj) in LPP must have the opposite meanings. Motivated by these analyses, if Wi,j increases, we can shorten the distance P d2(zi,zj) to minimize the summation over i,jd2(zi,zj)Wi,j if vertices xi and xj are constrained by ML. On the contrary, if vertices xi and xj are constrained by CL, a heavy penalty Wi,j will be imposed to P maximize the summation over i,jd2(zi,zj)Wi,j, implying that the 2 distance d (zi,zj) will be significantly expanded. Based on the above analyses, the objective function of the CMLP criterion can be addressed as follows by minimizing: þ CL 1 ML CL J1CMLP ðCÞ ¼ J ML LPP ðCÞ‘ þ J LPP ðCÞ or J CMLP ðCÞ ¼ J LPP ðCÞ=J LPP ðCÞ
ð3Þ
or equivalently by maximizing þ ML 2 CL ML ðCÞ ¼ J CL J2CMLP LPP ðCÞ‘ þ J LPP ðCÞ or J CMLP ðCÞ ¼ J LPP ðCÞ=J LPP ðCÞ,
ð4Þ
þ J 1CMLP ðCÞ
respectively, where ‘ þ is a control parameter. Note that þ and J 2CMLP ðCÞ are based on the generalized MMC [44]. Note that the optimization problems defined in Eqs. (3) and (4) all meet with the above analysis and are all feasible. The terms J ML LPP ðCÞ and JCL LPP ðCÞ share similar objective to LPP and are respectively formulated as JML LPP ðCÞ ¼
1 X 2 ~ ðMLÞ , d ðzi , zj ÞW i,j 2 ðx ,x Þ A ML i
j
1 X 2 ~ ðCLÞ JCL d ðzi , zj ÞW i,j LPP ðCÞ ¼ 2 ðx ,x Þ A CL i
j
ð5Þ
Z. Zhang et al. / Pattern Recognition 45 (2012) 4466–4493
Class 1 Class 2 Intra-class similarity pairs Inter-class similarity pairs Example of CL constraints Example of ML constraints
4469
Class 1 Class 2 Intra-class similarity pairs Inter-class similarity pairs Cluster 2 in Class 2 Cluster 2 in Class 2
Cluster 1 in Class 1
Cluster 1 in Class 1
Cluster 1 in Class 2
After our criterion Cluster 1 in Class 2
Cluster 2 in Class 1
Cluster 2 in Class 1
Dimension-reduced Feature Space
Original Input Space
Fig. 2. Geometrical interpretation of our proposed constrained marginal CMLP criterion.
~ ðMLÞ and W ~ ðCLÞ reflect the local where adjacency matrices W neighborhood and class information of samples. Based on the above defined criteria, margins between inter-class clusters can be significantly broadened and margins between points of each intra-class cluster can be significantly shrunk in the reduced output space as shown in Fig. 2. And most importantly, the intrinsic multimodal structures are efficiently preserved, because the margins between clusters of a single class are enlarged correspondingly. In this work, we will focus on addressing these issues to achieve maximum margin discrimination for multimodal local projections and embeddings. 3.2. Graph-induced pairwise must-link and cannot-link constraints Based on the definition of local neighborhood, we compute the ML and CL constraint sets in a graph-induced approach. For a given data matrix X¼[x1,x2,y,xN] with C classes, a data graph G ¼(V,E) with N vertices can be constructed. Let xj A Nðxþi Þ denote the k nearest neighbor set of vertex xi, we put nonzero weights on ðx Þ the edge e(xi,xj)AE between xi and xj if xj A N ðxþi Þ or xi A N þj . Specifically, we define 8 > 1 > > < eðxi ,xj Þ ¼ 1 > > > : 0
ðx Þ
~ i Þ A V, vðx ~ j Þ A V,Laðxj Þ ¼ Laðxi Þ if xj A N ðxþi Þ or xi A N þj , vðx ðx Þ ~ i Þ A V, vðx ~ j Þ A V,Laðxj Þ aLaðxi Þ , if xj A N ðxþi Þ or xi A N þj , vðx
else if
~ j Þ A V,xj2 ~ i Þ A V, vðx = N ðxþi Þ vðx
and
ðx Þ xi2 = N þj
ð6Þ ~ i Þ is vertex xi in G. where La(xi) is the class label of xi and vðx Then, a neighborhood graph G~ N ¼ ðV~ N , E~ N Þ is formed by cutting the edges with zero weights from G, so G~ N and G shall have the same number of vertices and nonzero edges. The purpose of constructing G~ N is to represent the similarities between each vertex pair, where the similarity is measured by eN ðxi ,xj Þ A E~ N . Then the pairwise ML and CL constraint sets can be addressed as ML ¼ ðxi ,xj Þ9eN ðxi ,xj Þ ¼ 1,vðxi Þ A V~ N ,vðxj Þ A V~ N ,Laðxj Þ ¼ Laðxi Þ,
ð7aÞ
CL ¼ ðxi ,xj Þ9eN ðxi ,xj Þ ¼ 1,vðxi Þ A V~ N ,vðxj Þ A V~ N ,Laðxj Þ a Laðxi Þ,
ð7bÞ
where v(xi) is the vertex xi in G~ N . Based on the pairwise constraints, two intra- and inter-class neighborhood sub-graphs can be achieved from graph G~ N . We refer to the neighborhood sub-graphs constrained ~ Þ and CL-graph by ML and CL as ML-graph GML ¼ ðX ðMLÞ , EML ~ Þ, where X(ML) and X(CL) are consisted of the vertices GCL ¼ ðX ðCLÞ , ECL of ML-graph and CL-graph, respectively. Considering the neighborhood graph in Fig. 1, the constructed ML-graph and CL-graph are shown in Fig. 3. Denoted by NML and NCL the numbers of points in X(ML) and X(CL), respectively. Note that NML rN and NCL rN, which hold especially when only small proportion of constraints is used. Note that ML-graph and CL-graph can be generalized to have N vertices as G~ N when each vertex has at least one intra-class neighbor and one inter-class neighbor and all constraints are used. It is also worth noting that, all edge lengths in the ML-graph should be ‘‘minimized’’, while all edge lengths in CL-graph should be ‘‘max~ ðMLÞ and W ~ ðCLÞ measure the proximity relations of imized’’. Matrices W vertices in ML-graph and CL-graph respectively. In this work, the ~ ðMLÞ and NCL NCL dimensional entries of NML NML dimensional W ðCLÞ ~ are defined as W 8 > ~ ðMLÞ ¼ W ~ ðMLÞ ¼ 1,
if data pair ðxi ,xj Þ A ML
> ~ ðMLÞ ¼ 0, ~ ðMLÞ ¼ W :W i,j j,i
otherwise
8 > ~ ðCLÞ ¼ 1, ~ ðCLÞ ¼ W
if data pair ðxi ,xj Þ A CL
> ~ ðCLÞ ¼ 0, ~ ðCLÞ ¼ W :W i,j j,i
otherwise
:
,
ð8Þ
The idea behind Eq. (8) is that more separated embeddings of inter-class neighbors and enhanced intra-class neighbor compactness shall be organized. Most importantly, natural clusters within a class can be effectively preserved because those intra-class clusters will not be projected into a single cluster.
4470
Z. Zhang et al. / Pattern Recognition 45 (2012) 4466–4493
Class 1
Class 1
Class 2 Inter-class similarity pairs
Class 2 Intra-class similarity pairs
Cluster 2 in Class 2
Cluster 2 in Class 2
Cluster 1 in Class 1
Cluster 1 in Class 1
Cluster 1 in Class 2
Cluster 1 in Class 2
Cluster 2 in Class 1
Cluster 2 in Class 1 Fig. 3. Geometrical interpretations of the constructed ML-graph (left) and CL-graph (right).
3.3. Learning linear large margin projections for local embeddings
2
2
where d ðxi ,xj Þ ¼ :xi xj : is the squared Euclidean distance between ðMLÞ
We in this section elaborate the CMLP criterion and compute the transforming matrix C ¼ ðc1 9c2 9. . .cd Þ A RdN for embedding. Let _i ð A 1,2,. . .,CÞ be the class label of xi A Rn in X, CMLP can be addressed as follows.
~ ~ ðCLÞ , W ~ ðMLÞ and W ~ ðCLÞ are symmetric and xi and xj. Clearly, D ,D positive semidefinite (S.P.S.D). By introducing the Lagrangian function Z of Eq. (9), we get
Zðci , li Þ ¼ cTi ðð1=nCL ÞX ðCLÞ L~
ðCLÞ
X ðCLÞT ð‘ þ =nML ÞX ðMLÞ L~
ðMLÞ
X ðMLÞT Þci
2
li ð:ci : 1Þ
3.3.1. CMLP solution scheme 1 We
first
compute
the
optimization
problem
þ ðCÞ ¼ J2CMLP
ML J CL LPP ðCÞ‘ þ J LPP ðCÞ. Based on the matrix interpretations, we ~ ðMLÞ W ~ ðMLÞ ÞX ðMLÞT CÞ and JCL ðCÞ ¼ have J ML ðCÞ ¼ TrðCT X ðMLÞ ðD LPP
T
LPP
ðCLÞ
ðCLÞ
~ ~ TrðC X ðD W ÞX CÞ, where n NML dimensional matrix X(ML) and n NCL dimensional matrix X(CL) denote the ML and CL ðCLÞ
ðCLÞT
constrained data matrices, respectively. By substituting JCL LPP ðCÞ and 2þ J ML LPP ðCÞ into the problem J CMLP ðCÞ, we can obtain þ J 2CMLP ðCÞ ¼
i
2 ~ ðMLÞ d ðzi , zj ÞW i,j 2nML ðx ,x Þ A ML i j
C A Rnd
ð‘ þ =nML ÞX ðMLÞ L~
ðMLÞ
ðCLÞ
X ðCLÞT
X ðMLÞT ÞCÞ
ML
CL
operator, I is an identity matrix, and parameter ‘ þ is estimated by 3 2 3,2 X X ðMLÞ ðCLÞ 2 2 ~ ~ 5 ‘ þ ¼ 4ð1=nCL Þ d ðxi ,xj ÞW i,j 5 4ð1=nML Þ d ðxi ,xj ÞW i,j
ðxi ,xj Þ A CL
ðxi ,xj Þ A ML
X ðCLÞT Þ=nCL TrðX ðMLÞ L~
n
ðMLÞ
X ðMLÞT Þci ¼ li ci , n
n
n
ð12Þ
n
where li and ci denote the eigenvalues and eigenvectors of ðCLÞ ðMLÞ ðMLÞT matrixð1=nCL ÞX ðCLÞ L~ X ðCLÞT ð‘ þ =nML ÞX ðMLÞ L~ X , respectively. That is, the optimal projection matrix Cn can be obtained by solving arg max nd
TrðCT ðð1=nCL ÞX ðCLÞ L~
ðCLÞ
X ðCLÞT
T
,C C ¼ I ðMLÞ
X ðMLÞT ÞCÞ:
ð13Þ
ð9Þ
~ ~ are diagonal pairwise ML and CL constraints respectively, D and D P ~ ðMLÞ P ~ ðCLÞ ðMLÞ ðCLÞ ~ ~ matrices with D ii ¼ j W i,j and D ii ¼ j W i,j , TrðUÞ is trace
ðCLÞ
X ðCLÞT ð‘ þ =nML ÞX ðMLÞ L~
Note that the obtained projection matrix C is orthogonal, so the similarity based on the Euclidean distance keeps unchanged [23–25]. We refer to this scheme of solving the orthogonal projection axes as CMLP-1.
ðMLÞ ~ ðMLÞ W ~ ðMLÞ subject to CTC ¼I, with Laplacian matrices L~ ¼D ðCLÞ ~ ðCLÞ W ~ ðCLÞ , where nML and nCL are numbers of the andL~ ¼D
¼ nML TrðX ðCLÞ L~
ðCLÞ
ð‘ þ =nML ÞX ðMLÞ L~
j
¼ Max TrðCT ðð1=nCL ÞX ðCLÞ L~
ðð1=nCL ÞX ðCLÞ L~
CAR
X
‘þ
with the multiplier li. By taking the derivatives with respect to ci and li and zeroing it, we can obtain
Cn ¼
X 1 2 ~ ðCLÞ Max d ðzi , zj ÞW i,j nd 2n CL ðx ,x Þ A CL CAR
ð11Þ
ðMLÞ
X ðMLÞT Þ
ð10Þ
3.3.2. CMLP solution scheme 2 In this subsection we compute the transforming axes from the CL ML J2 CMLP ðCÞ criterion. By substituting J LPP ðCÞ and J LPP ðCÞ into 2 JCMLP ðCÞ, we get the following TR problem [24,25]: ðCLÞ
TrðCT X ðCLÞ L~ X ðCLÞT CÞ : ~ ðMLÞ X ðMLÞT CÞ C A Rnd TrðCT X ðMLÞ L
J2 CMLP ðCÞ ¼ Max
ð14Þ
To solve the transforming axes from J 2 CMLP ðCÞ, we usually T ðCLÞ ~ ðCLÞ ðCLÞT optimize JðCÞ ¼ MaxC TrðC X L X CÞ with respect to conðMLÞ ðMLÞT straint TrðCT X ðMLÞ L~ X CÞ ¼ c instead of optimizing the problem in Eq. (14) directly, where c is a constant. Then by
Z. Zhang et al. / Pattern Recognition 45 (2012) 4466–4493
constructing the Lagrangian function Z of J 2 CMLP ðCÞ with the multiplier li, we get
Zðci , li Þ ¼ cTi X ðCLÞ L~
ðCLÞ
T
X ðCLÞT ci li ðci X ðMLÞ L~
ðMLÞ
X ðMLÞT ci 1Þ,
Let cr r ¼ 1 denote the generalized eigenvectors, ordered according to the generalized eigenvalues l1 Z l2 Z Z ld , then we can take as transforming axes of CMLP the eigenvectors corresponding to maximum eigenvalues of the following generalðMLÞ ðMLÞT 1 ðCLÞ ðCLÞ ðCLÞT n X Þ X L~ X c ¼ ln cn , ized eigen-problem:ðX ðMLÞ L~ i
i
ðMLÞ
X ðMLÞT CÞ1 ðCT X ðCLÞ L~
ðCLÞ
ðCLÞ
J4 CMLP ðCÞ ¼ Max CAR
nd
ðMLÞ
~ ~ Trðð1‘ ÞCT X ðCLÞ D X ðCLÞT C þ ‘ CT X ðMLÞ W X ðMLÞ CÞ , T ðMLÞ ~ ðMLÞ ðMLÞT T ðCLÞ ~ ðCLÞ ðCLÞT D X C þ ‘ C X W X CÞ Trðð1‘ ÞC X
ð21Þ where ‘ is a trade-off parameter. Analogously, by constructing the Lagrangian function of Eq. (21), we can obtain
i
that is, the projection matrix Cn can be approximately solved by the following a ratio trace (RT) problem [24]:
Cn ¼ argmax Tr½ðCT X ðMLÞ L~
3.3.4. CMLP solution scheme 4 ðMLÞ With similar computational trick to handle L~ , we can further get the following optimization problem from Eq. (18):
ð15Þ
d
4471
Zðci , li Þ ¼ cTi ðð1‘ ÞX ðCLÞ D~ T
ðCLÞ
~ ðMLÞ X ðMLÞT Þc X ðCLÞT þ ‘ X ðMLÞ W i
~ li ðci ðð1‘ ÞX ðMLÞ D
X ðCLÞT CÞ:
ðMLÞ
~ X ðMLÞT þ ‘ X ðCLÞ W
ðCLÞ
X ðCLÞT Þci 1Þ ð22Þ
C A Rnd
ð16Þ Then, this RT problem can be effectively solved by generalized ðMLÞ ðMLÞT X is S.P.S.D, so eigen-decomposition [21]. Note that X ðMLÞ L~ ðMLÞ ðMLÞT X may be singular. Thus we cannot solve inverse of X ðMLÞ L~ the above problem directly. To ensure the computational stability, ðMLÞ ðMLÞT ðMLÞ ðMLÞT we generalize X ðMLÞ L~ X to X ðMLÞ L~ X þ mIn by using a
regularization term m In with a small positive scalar m. We refer to this solution scheme as CMLP-2. 3.3.3. CMLP solution scheme 3 For the Laplacian matrix L¼D W in LPP, the LPP criterion can be reformulated as the following problem [22]: Max TrðCT XWX T CÞ, s:t :TrðCT XDX T CÞ ¼ c:
ð17Þ
C A Rnd
According to [1], matrix D provides a natural measure on the data points. The bigger the value Dii is, the more important the ðMLÞ corresponding point is. Recall that the Laplacian matrix L~ is ðMLÞ ~ ML W ~ ML . Motivated by role of D played, it is given by L~ ¼D natural to address the following variant problem from the CMLP-2 criterion:
J3 CMLP ðCÞ ¼ Max CAR
~ ðMLÞ X ðMLÞT CÞ X ðCLÞT C þ ‘ CT X ðMLÞ W , ðMLÞ ~ ðMLÞ ðMLÞT D X CÞ TrðC X
Trðð1‘ ÞCT X ðCLÞ L~
ðCLÞ
T
nd
ð18Þ where ‘ ð A ½0,1Þ is a control parameter for balancing the tradeðCLÞ ~ ðMLÞ X ðMLÞT C. off between terms CT X ðCLÞ L~ X ðCLÞT C and CT X ðMLÞ W ðMLÞ ~ Then matrix D plays a similar role to D and provides a natural measure on the vertices over ML-graph. By introducing the Lagrangian function of Eq. (18), we can obtain the following problem:
Zðci , li Þ ¼ ci T ðð1‘ ÞX ðCLÞ L~ T
~ li ðci X ðMLÞ D with
‘ X
ðMLÞ
ðCLÞ
~ ðMLÞ X ðMLÞT Þc X ðCLÞT þ ‘ X ðMLÞ W i
X ðMLÞT ci 1Þ
ð19Þ ðCLÞ ~ ðCLÞ
multiplier li. Clearly matrices ð1‘ ÞX L X ðCLÞT þ ~ ðMLÞ X ðMLÞT and X ðMLÞ D ~ ðMLÞ X ðMLÞT are S.P.S.D. Then the transW
ðMLÞ
forming bases ci of problem J3 CMLP ðCÞ can be obtained as eigenn vectors corresponding to the maximum eigenvalues li of the following generalized eigen-problem: n
ðð1‘ ÞX ðCLÞ L~
ðCLÞ
~ ¼ li ðX ðMLÞ D n
~ X ðCLÞT þ ‘ X ðMLÞ W
ðMLÞ
ðMLÞ
X ðMLÞT Þci
n
X ðMLÞT þ mIn Þci : n
ð20Þ n
Note that the projection axes ci can also be solved by the generalized eigen-decomposition method [21]. We refer to this solution scheme as CMLP-3. Next we show the last solution scheme, which we call CMLP-4.
~ ðCLÞ X ðCLÞT þ ‘ X ðMLÞ W ~ ðMLÞ with multiplier li. Similarly,ð1‘ ÞX ðCLÞ D ðMLÞT ðMLÞ ~ ðMLÞ ðMLÞT ðCLÞ ~ ðCLÞ ðCLÞT D X and ð1‘ Þ X X þ ‘ X W X are S.P.S.D, n then the transformation axes ci that maximizes the optimization n problem in Eq. (21) are given as first d leading eigenvalues li of the following generalized eigen-equation: ~ ðð1‘ ÞX ðCLÞ D
ðCLÞ
~ ðMLÞ X ðMLÞT Þcn X ðCLÞT þ ‘ X ðMLÞ W i
~ ¼ li ðð1‘ ÞX ðMLÞ D n
ðMLÞ
~ ðCLÞ X ðCLÞT þ mIn Þcn : X ðMLÞT þ ‘ X ðCLÞ W i
ð23Þ
3.3.5. Remarks Compared with CMLP-2, CMLP-3 and CMLP-4, CMLP-1 in the subtraction form offers some advantages. (1) The transforming axes of CMLP-1 are orthogonal together, but the represented features by the nonorthogonal basis vectors of CMLP-2, CMLP-3 and CMLP-4 can contain high information redundancy. (2) CMLP1 is more computationally efficient than other solution schemes. Note that a similar strategy was applied to improve the LDA [45] and LPP [22]. (3) The matrix inverse operations do not exist in CMLP-1, so the matrix singularity is avoided. (4) CMLP-1 relies on the MMC that can contribute to maximizing the margins between different classes. An efficient implementation of the CMLP-1 algorithm is summarized in Table 1, where zeros(n,n) denotes a n n matrix with all zeros, 1m is an m-dimensional vector with all ones and diag(A1m) is a diagonal matrix with entries A1m. Note that the algorithmic flows of the other CMLP solution schemes can be similarly implemented. 3.4. Kernelized CMLP for nonlinear embedding This section shows the nonlinear extensions of our criteria. We refer to the kernelized CMLP-1, CMLP-2, CMLP-3 and CMLP-4 as KCMLP-1, KCMLP-2, KCMLP-3 and KCMLP-4, respectively. Let f be the mapping from Rn to a high-dimensional kernel feature space Zp ðpcnÞ. This mapping can be implicitly defined by a kernel function. More specifically, the (i,j)th entry of a kernel matrix K is given by Kij ¼K(xi,xj)¼ f(xi)Tf(xj). The Gaussian RBF kernel [35], a typical choice of kernel function, is defined as 2
Kðxi ,xj Þ ¼ expð99xi xj 99 =2s2 Þ:
ð24Þ
Rewriting each solution in kernel space as an expansion in terms of the mapped data, each basis vector c in C can be P ML CL formulated as c ¼ N i ¼ 1 ti fðxi Þ ¼ fðXÞt. Then J LPP ðCÞ and J LPP ðCÞ in kernel space can be expressed as X 1 2 f ~ ðCLÞ , ðJ CL d ðIi ,Ij ÞW LPP ðCÞÞ ¼ i,j 2 ðfðx Þ, fðx ÞÞ A CL i
f ðJ ML LPP ðCÞÞ
j
X 1 2 ~ ðMLÞ , ¼ d ðIi ,Ij ÞW i,j 2 ðfðx Þ, fðx ÞÞ A ML i
j
ð25Þ
4472
Z. Zhang et al. / Pattern Recognition 45 (2012) 4466–4493
Table 1 Constrained large Margin Local Projection algorithm (CMLP-1). N
Input
Labeled data points ðxi ,_i Þ9xi A Rn ,_i ðA 1,2,. . .,CÞi ¼ 1 Dimensionality of embedding space d (1r d r n) n d Transformation matrix T CMLP1 ¼ ½j~1 9j~2 9. . .9j~d X’(x19x29y9xN); % Data matrix
Output Step 1 Step 2
Define the k nearest neighbor set N ðxþi Þ of each vertex xiover the data graph G and then set the weight values for the edges or links according to Eq. (6) Define the pairwise ML and CL constraint sets according to Eqs. (7) and (8) and construct the ML-graph and CL-graph from the pairwise constrained G~ N
Step 3 Step 4
~ ðCLÞ ¼ zerosðN CL N CL Þ; ~ ðMLÞ ¼ zerosðN ML N ML Þ and W Initialize matrices W (ML) (CL) X ’ vertices of the ML-graph, X ’ vertices of the CL-graph ~ ðMLÞ and W ~ ðCLÞ based on the ML-graph and CL-graph Setting the weights for W P ~ ðCLÞ T P ~ ðCLÞ T P ~ ðMLÞ T P ~ ðMLÞ T xi xi i,j W Calculate‘ þ ¼ nML ð i D ii xi xi i,j xi xj Þ i,j W i,j xij Þ=nCL ð i D ii
Step 5 Step 6 Step 7 Step 8
T ðCLÞ ~ ðCLÞ 1N ÞW ~ ðCLÞ ÞX ðCLÞT C ðdiagðW J CL LPP ðCÞ ¼ ð1=nCL ÞC X CL T ðMLÞ ~ ðMLÞ 1N ÞW ~ ðMLÞ ÞX ðMLÞT C; J ML ðdiagðW LPP ðCÞ ¼ ð‘ þ =nML ÞC X ML
Step 9
d
ML l½1 Z l½2 Z Z l½d and j~ T j~ ¼ 1, j~ T j ~ ~ ~ ~ jr , l½k r ¼ 1 ’standard eigenvectors and eigenvalues of ðJCL LPP ðCÞJ LPP ðCÞÞj ¼ l j , where r r r r1 ¼ 0, for 8 r A 1,2,. . .,d;
2
2
where d ðIi ,Ij Þ ¼ :CfT fðxi ÞCfT fðxj Þ: denotes the square of Euclidean distance between f(xi) and f(xj) in Zp . Let f G ¼[t19t29...9td], f(X)¼(f(x1),f(x2),y,f(xN)), then ðJCL LPP ðCÞÞ and ML f ðJLPP ðCÞÞ can be written as f ðJCL LPP ðCÞÞ
T
T
¼ TrðG fðXÞ fðX ¼ TrðGT K ðCLÞ L~
ðCLÞ
ðCLÞ
~ ÞðD
ðCLÞ
~ W
ðCLÞ
ÞfðX
Þ fðXÞGÞ
K ðCLÞT GÞ:
ðMLÞ
ð27Þ
J KCMLP4 ðGÞ ðCLÞ
G
ðMLÞ
~ ~ Trðð1‘ ÞGT K ðCLÞ D K ðCLÞT G þ ‘ GT K ðMLÞ W K ðMLÞT GÞ : ðMLÞ ðCLÞ ~ ~ K ðMLÞT G þ ‘ GT K ðCLÞ W K ðCLÞT GÞ Trðð1‘ ÞGT K ðMLÞ D ð28Þ
Then we can achieve the KCMLP-4 transforming axes from solving the following generalized eigen-problem: ~ ðð1‘ ÞK ðCLÞ D fn
ðCLÞ
~ ðMLÞ K ðMLÞT ÞGn K ðCLÞT þ ‘ K ðMLÞ W i
~ ¼ li ðð1‘ ÞK ðMLÞ D
ðMLÞ
~ K ðMLÞT þ ‘ CT K ðCLÞ W
ðCLÞ
K ðCLÞT þ mIÞGni
K ðCLÞT GÞ=ðGT K ðMLÞ L~
ðMLÞ
K ðMLÞT GÞÞ
KCMLP3 : J KCMLP3 ðGÞ ~ ðMLÞ K ðMLÞT GÞ K ðCLÞT G þ ‘ GT K ðMLÞ W : ~ ðMLÞ K ðMLÞT GÞ TrðGT K ðMLÞ D ð33Þ
~ Trðð1‘ ÞGT K ðCLÞ D
G
ð26Þ
K ðMLÞT GÞ:
ðCLÞ
ð32Þ
¼ Max
where K(CL) ¼ f(X)Tf(X(CL)) and K(ML) ¼ f(X)Tf(X(ML)) are kernel Gram matrices. Here we take CMLP-4 criterion for example, that is to solve the nonlinear transforming axes of KCMLP-4 from the following problem:
¼ Max
G
ðCLÞ T
T ðMLÞ ~ ðMLÞ f T ~ ðMLÞ ÞfðX ðMLÞ ÞT fðXÞGÞ ðJML ÞðD W LPP ðCÞÞ ¼ TrðG fðXÞ fðX
¼ TrðGT K ðMLÞ L~
KCMLP2 : J KCMLP2 ðGÞ ¼ Max TrððGT K ðCLÞ L~
ðCLÞ
The nonlinear transforming bases for KCMLP-1, KCMLP-2 and KCMLP-3 can be similarly obtained as KCMLP-4. For KCMLP-1, the control parameter ‘ þ in the kernel space is estimated by
‘ þ ¼ ð1=nCL ÞTrðK ðCLÞ L~
ðCLÞ
K ðCLÞT Þ=ð1=nML ÞTrðK ðMLÞ L~
ðMLÞ
K ðMLÞT Þ
ð34Þ
Another way to kernelize our methods is to apply the KPCA-trick framework [16]. With KPCA-trick, CMLP can be kernelized directly by performing KPCA plus CMLP. KPCA is firstly used to transform original 0
space from Rn to an n -dimensional space Rn , where n0 is rank of centered kernel matrix K over X. After performing KPCA, each x A Rn 0
0
0
is converted to the KPCA-based feature vector x0 A Rn . We then in Rn construct the ML-graph, CL-graph and constrained data matrices over X0 , where X 0 ¼ ½x01 ,x02 ,. . .,x0N . Then linear CMLP can be similarly implemented. As the size of matrices eigen-decomposed in kernelized extensions relies on the number of samples, kernelized methods can improve the computational efficiency when the sample size of the dataset is smaller than its dimensionality.
ð29Þ
Gr dr ¼ 1
be the generalized where m is the regularization factor. Let eigenvectors associated with the first d largest eigenvalues lfr ,r ¼ 1,2,. . .,d of the above generalized eigen-problem. Thus, KCMLP-4 projection matrix is given by G ¼(G19G29...9Gd). Then the embedding result, F(xr), of f(xr), represented by KCMLP-4 is given as 0 1 Kðx1 ,xr Þ C qffiffiffiffiffiffi qffiffiffiffiffiffi qffiffiffiffiffiffiffiffiffiffi qffiffiffiffiffiffi T B B Kðx2 ,xr Þ C C: Fðxr Þ ¼ lf1 G1 9 lf2 G2 9. . .9 lfd1 Gd1 9 lfd Gd B B ::: C @ A KðxN ,xr Þ ð30Þ Note that CMLP-1, CMLP-2 and CMLP-3 can be similarly kernelized. Details will not be provided due to the page limitation. The kernelized problems of KCMLP-1, KCMLP-2 and KCMLP-3 are respectively formulated as KCMLP1 : J KCMLP1 ðGÞ ¼ Max Trðð1=nCL ÞGT K ðCLÞ L~ G
ð‘ þ =nML ÞGT K ðMLÞ L~
ðMLÞ
ðCLÞ
K ðCLÞT G
K ðMLÞT GÞ
ð31Þ
4. Related work: connections, discussions and improvements 4.1. Trace ratio optimization for improving CMLP Because the projection matrices of CMLP-2, CMLP-3 and CMLP4 criteria are not orthogonal and the computational process may be singular, we propose an improvement strategy aided by the TR optimization [10,24,25] to improve them. We first consider the following TR problem: Max TrðCT P~
CT C ¼ I
ðCLÞ
CÞ=TrðCT P~
ðMLÞ
CÞ,
ð35Þ
ðCLÞ ðMLÞ where P~ and P~ are S.P.S.D matrices. This problem is usually converted into the simplified RT problem to find the projection ðCLÞ ðMLÞ matrix Cn ¼ argmaxC A Rnd TrððCT P~ CÞ1 ðCT P~ CÞÞ. But the obtained solution is not orthogonal and does not necessarily best optimize the corresponding TR problem [23]. Guo et al. [23] have proved that the global optimum of TR problem can be equivalently solved by using a trace difference (TD) problem. To find the best TR value ln and optimal matrix Cn, it is equivalent to solve a
Z. Zhang et al. / Pattern Recognition 45 (2012) 4466–4493
TD problem, that is to find the zero point of the TD function: ðCLÞ ðMLÞ FðlÞ ¼ argmax T TrðCT ðP~ lP~ ÞCÞ [23]. Then the optimal C C¼I
matrix
Cn
can
be
calculated
by
Cn ¼ argmaxCT C ¼ I
ðCLÞ n ðMLÞ TrðCT ðP~ l P~ ÞCÞ. Another iterative approach called ITR [24,25] was recently proposed to solve TR problem. ITR directly ðCLÞ ðMLÞ CÞ=TrðCT P~ CÞ by assuming optimizes the objective TrðCT P~
that the column vectors of C are orthogonal together. For given lv at each iteration v, Cv is obtained from the following problem: ðCLÞ v ðMLÞ Cv ¼ argmax T TrðCT ðP~ l P~ ÞCÞ, and then renew lv þ 1 C C¼I
ðCLÞ v ðMLÞ v vþ1 by l ¼ TrððCv ÞT P~ C Þ=TrððCv ÞT P~ C Þ until convergence. Proofs show ITR converges to the global optimum [25]. Also under TR criterion, the projection matrix is orthogonal and the similarity between points can be efficiently kept if it is based on Euclidean distance [24]. This work applies ITR to solve our problems. We name the TR criterion based CMLP-2, CMLP-3 and CMLP-4 as TR-CMLP-2, TRCMLP-3 and TR-CMLP-4, respectively. Note that the columnly orthogonal initialized C0 in ITR is usually difficult to be constructed and a bad initialization may greatly increase the number of iterations. This paper initializes l0 instead of initializing C0 to be orthogonal as [25]. The procedures of applying ITR to optimize our problems are summarized as follows. For TR-CMLP-2, TRðCLÞ ðCLÞ CMLP-3 and TR-CMLP-4, we set P~ ¼ X ðCLÞ L~ X ðCLÞT , ð1‘ ÞX ðCLÞ ðCLÞ ~ ðMLÞ X ðMLÞT , ~ ðCLÞ X ðCLÞT þ ‘ X ðMLÞ L~ X ðCLÞT þ ‘ X ðMLÞ W ð1‘ ÞX ðCLÞ D ðMLÞ ðMLÞT ðMLÞ ðMLÞ ðMLÞ ~ ~ X , P~ ¼ X ðMLÞ L~ X ðMLÞT , X ðMLÞ D X ðMLÞT , ð1‘ ÞX ðMLÞ W
~ ðMLÞ X ðMLÞT þ ‘ X ðCLÞ W ~ ðCLÞ X ðCLÞT , respectively. D Step 1: initialize v ¼0 and lv ¼0. ðCLÞ v ðCLÞ v v Step 2: solve eigen-problem ðP~ l P~ Þc ¼ tv c and ðCLÞ vd v ~ ðCLÞ ~ obtain the eigenvectors cd d ¼ 1 ofðP l P Þ. Step 3: the matrix Cv at iteration v is then formed by the vd eigenvectors cd d ¼ 1 according to the first d leading eigenva d ðCLÞ v ðCLÞ v v lues tvd d ¼ 1 of the eigen-problem: ðP~ l P~ Þc ¼ tv c . vþ1 vT ~ ðCLÞ v vþ1 Step 4: renew the TR value l byl ¼ TrðC P C Þ= ðCLÞ v TrðCvT P~ C Þ. Step 5: if 9lv þ 1 lv9 o e, go to step 6; else v¼ vþ1, steps 2–4 repeat. Step 6: output the optimal ln ¼ lv, and projection matrix Cn ¼ ðCLÞ n ðCLÞ argmaxCT C ¼ I TrðCT ðP~ l P~ ÞCÞ. Similarly we refer to the TR criterion based KCMLP-2, KCMLP-3 and KCMLP-4 as TR-KCMLP-2, TR-KCMLP-3 and TR-KCMLP-4, respecðCLÞ ðCLÞ ðCLÞ ¼ K ðCLÞ L~ K ðCLÞT , ð1‘ ÞK ðCLÞ L~ tively. Then we can let P~ ~ ðMLÞ K ðMLÞT , ~ ðCLÞ K ðCLÞT þ ‘ K ðMLÞ W ~ ðMLÞ K ðCLÞT þ ‘ K ðMLÞ W ð1‘ ÞK ðCLÞ D ðMLÞ ðMLÞ ðMLÞ ~ ~ ðMLÞ K ðMLÞT , P~ ¼ K ðMLÞ L~ K ðMLÞT ,K ðMLÞ D K ðMLÞT , ð1‘ ÞK ðMLÞ D ~ ðCLÞ K ðCLÞT respectively to compute the projection K ðMLÞT þ ‘ K ðCLÞ W axes of TR-KCMLP-2, TR-KCMLP-3 and TR-KCMLP-4.
4473
where nonzero e(xi,xj) denotes the edges connecting vertices xi and xj over G if they are neighbors. Initialize two N N dimen~ S and W ~ B with sional intrinsic similarity and penalty matrices W all zeros. We can then update the entries of the adjacency ~ S and W ~ B by matrices W 8 > ~ S ¼W ~ S ¼ 1, if pair ðxi ,xj Þ A set P~1
~ ~ : W i,j ¼ W j,i ¼ 0, if pair ðxi ,xj Þ= 2set P~1 8 > ~ B ¼W ~ B ¼ 1, if pair ðxi ,xj Þ A set P~2 ~ ~ : W i,j ¼ W j,i ¼ 0, if pair xi ,xj Þ= 2set P~2 Based on these definitions, the problem of the Marginal Fisher Analysis (MFA) algorithm [36] can be defined as P 2 B ~ B TrðCT X L~ X T CÞ ðxi ,xj Þ A P~2 d ðzi , zj ÞW i,j : ð38Þ ¼ Max Max P S 2 ~ ~ S X T CÞ C A Rnd C A Rnd TrðCT X L ðxi ,xj Þ A P~1 d ðzi , zj ÞW i,j B ~ B and D ~ S. ~ B W ~ B , L~ S ¼ D ~ S W ~ S, D ~B¼PW ~S ¼PW where L~ ¼ D ii i,j ii i,j j j Then the projection matrix of MFA is obtained from solving the following ratio trace problem [24,25]: S
B
Cn ¼ argmax Tr½ðCT X L~ X T CÞ1 ðCT X L~ X T CÞ: CAR
ð39Þ
nd
Another discriminant locality preserving DR algorithm is Local Discriminant Embedding (LDE) [37], which has a closely related ~ S and W ~ B , the criterion objective to MFA. Based on the matrices W of LDE is defined as X 1 X 2 2 ~ B , s:t: 1 ~ S ¼ 1, d ðzi , zj ÞW d ðzi , zj ÞW ð40Þ Max i,j i,j nd 2 2 CAR ~ ~ ðxi ,xj Þ A P2
ðxi ,xj Þ A P1
To solve the transforming bases from LDE for embeddings, optimizing the above criterion is equivalent to solving the trace ratio problem in Eq. (38) and the solution can be similarly achieved from Eq. (39). Another popular discriminant feature extraction approach to MFA is called Locality Sensitive Discriminant Analysis (LSDA) [38]. The criterion for choosing a ‘‘good’’ map is to optimize the following two objectives: 1 X 2 ~ B ¼ Max TrðCT X L~ B X T CÞ, Max d ðzi , zj ÞW ð41aÞ i,j C A Rnd 2 C A Rnd ~ ðxi ,xj Þ A P2
Min CAR
nd
1 2
X
S
2
S
~ ¼ Min TrðCT X L~ X T CÞ: d ðzi , zj ÞW i,j
ð41bÞ
C A Rnd
ðxi ,xj Þ A P~1
Note that the authors of LSDA solve the transforming axes by ~ S X T CÞ ¼ c and aim at optimizing the setting a constraint TrðCT X D following maximization problem: B
S
S
~ X T ÞCÞ, s:t: TrðCT X D ~ X T CÞ ¼ c Max TrðCT ð‘ X L~ X T þ ð1‘ ÞX W
C A Rnd
ð42Þ 4.2. Connections to MFA [36], LDE [37], LSDA [38], LFDA [9] and NPDA [7] 4.2.1. Mathematical formulations and comparisons Similar to the pairwise definition methods in Eqs. (7) and (8), we define the following two sets over graph G as ðx Þ P~ 1 ¼ ðxi ,xj Þ9eðxi ,xj Þ ¼ 1,xj A Nðxþi Þ or xi A N þj ,
~ i Þ A V, vðx ~ j Þ A V,Laðxj Þ ¼ Laðxi Þ ¼ l, vðx
ð36aÞ
ðx Þ ~ i Þ A V, P~2 ¼ ðxi ,xj Þ9eðxi ,xj Þ ¼ 1,xj A Nðxþi Þ or xi A N þj , vðx
~ j Þ A V,Laðxj Þ aLaðxi Þ vðx
ð36bÞ
with a control parameter ‘ involved. To solve the projective axes of LSDA, we can optimize Eq. (38) as well. In this way, LSDA in TR form is equivalent to LDE and MFA except that LDE and MFA determine the neighbor numbers, k1 and k2, of inter- and intraclass points independently. Note that the constraint sets are flexible, thus we can select either partial or all available constraints. And ML- and CL-graph can be generalized to have N vertices as G~ N . Actually, this condition can be easily met as long as the datasets are not very sparse and a proper k number is used. As a result, when all available constraints are used, and all constrained data pairs are equally weighted by constant one, MFA, LDE and LSDA in TR form are equivalent to our CMLP-2 criterion. Hence MFA, LDE and LSDA are regarded as special cases of CMLP-2
4474
Z. Zhang et al. / Pattern Recognition 45 (2012) 4466–4493
problem. It is also noted that when the problem in Eq. (42) is used to solve the projection axes of LSDA, the solution of LSDA is also equivalent to that of our CMLP-3 criterion when eigen-decomposition is applied. Local Fisher Discriminant Analysis (LFDA) [9] and Neighborhood Preserving Discriminant Analysis (NPDA) [7] are another two similar localized DR techniques. With pairwise expressions, the local inter-class scatter S(lbc) and local intra-class scatter S(lwc) of LFDA and NPDA can be described as XN 1 XN 2 2 ~ B , SðlwcÞ ¼ 1 ~ S: SðlbcÞ ¼ d ðxi ,xj ÞW d ðxi ,xj ÞW i,j i,j i,j ¼ 1 i,j ¼ 1 2 2
ð43Þ
Then LFDA and NPDA methods can achieve the maximization discrimination by optimizing the objective function B S MaxC A Rnd TrðCT X L~ X T CÞ=TrðCT X L~ X T CÞ in Eq. (38) with weight ~ S and W ~ B defined as matrices W 8 > ~ S ¼ W i,j =Nl , if Laðxj Þ ¼ Laðxi Þ ¼ l ~ S ¼W ~ S ¼W ~ S ¼ 0, else if Laðxj Þ a Laðxi Þ :W i,j j,i 8 > ~ B ¼W ~ B ¼ ð1=N1=Nl ÞW i,j , if Laðxj Þ ¼ Laðxi Þ ¼ l ~ B ¼W ~ B ¼ 1=N, else if Laðxj Þ aLaðxi Þ :W i,j j,i where Nl is the sample size of class l, the similarity matrix W is similarly defined as LPP. Note that when ML- and CL-graph are generalized to have N vertices (i.e., all data pairs are neighbors) and all constraints are used, LFDA and NPDA can be mathematically explained with our CMLP-2 criterion and can be considered as the generalized variant version of CMLP-2. It is also noted that when all point pairs are equally treated and the simple-minded method is used to define W, the scatter matrices of LFDA can be reduced to those of LDA and MMC. In other words, under this case LDA and MMC can also be induced from CMLP-2 and CMLP-1 criterions when we use the similar weighting method as Eq. (44) to set the weights, constants 1/2nCL and 1/2nML are removed and ‘ þ ¼ 1. 4.2.2. Some remarks For NNS type methods, the construction of graphs and weight matrices for measuring the similarities between points is most important. Different from previous methods that construct graphs using all points and set nonzero weights for edges connecting neighbors, our criteria reply on the ML-graph and CL-graph ~ ðMLÞ , W ~ ðCLÞ are constructed only induced from G~ N and matrices W ðMLÞ ~ ~ ðCLÞ almost always by the constrained points. Thus W and W S B ~ ~ have smaller dimensions than W and W , especially when only small proportion of constraints is applied. The computational burden of MFA, LDE, LSDA, LFDA and NPDA mainly depends on S B scatters X L~ X T and X L~ X T , while the computational cost of our ðCLÞ ðMLÞ ðMLÞT criteria rely on scatters X ðCLÞ L~ X ðCLÞT and X ðMLÞ L~ X . Thus the matrix computational cost can be significantly reduced by our criteria. Another major difference is the above reviewed methods utilize the class information directly and are relatively rigid in a sense that they cannot handle supervised information flexibly. When solving the projection axes of MFA, LDE, LSDA, LFDA and NPDA from Eq. (39), matrix singularity may appear. To address this issue, the subtraction form of CMLP-1 can be used for modeling the objectives. Then they shall share the similar properties and advantages to CMLP-1. As a result, MFA, LDE, LSDA and LFDA can be interpreted with our CMLP-1 criterion and embedded into the CMLP-1 framework to become flexible in regulating the supervised information. To solve the projection axes from our criteria, we can also apply all vertices over G to define the constrained data matrices and weight matrices as the above reviewed methods. That is
X¼X(ML) ¼X(CL). But note that we only weight the edges connected by constrained data pairs when constructing the adjacency ~ ðMLÞ and W ~ ðCLÞ . With this definition, we detail a matrices W weighted least squares formulation for the CMLP-3 criterion in Section 4.3. 4.3. Weighted least squares view of CMLP-3 Least squares method [30,33] is widely used in regression and classification. Note that the cost function of CMLP-3 can be viewed as a generalized eigen-problem. This section will further analyze it using eigen-decomposition and then formulate the CMLP-3 problem as a weighted least square (WLS) problem [28]. 4.3.1. Computing CMLP-3 via eigen-decomposition ðCLÞ ~ ðMLÞ , the Let L~b denote a matrix being L~b ¼ ð1‘ ÞL~ þ ‘ W transforming axes of CMLP-3 consist of the eigenvectors accord~ ðMLÞ X T Þy X L~ b X T , ing to the leading d eigenvalues of matrix ðX D y where notation is pseudo-inverse. To find the relationships between CMLP-3 and least squares, we first give Theorem 1 with proof. Theorem 1. To obtain the transforming axes for CMLP-3 by conducting eigen-decomposition, we can solve the following eigen-problem: P P ~ ðMLÞ X T Þy ðX L~b X T ÞXn ðX D ¼ 2 Xn , where b denotes a diagb
CMLP3
CMLP3
onal matrix and XnCMLP3 include the optimal projective vectors of CMLP-3. See the proof of Theorem 1 in Appendix A. 4.3.2. Equivalence relationship to the weighted least squares We first review the definitions of the WLS and then build the equivalence to our CMLP-3. For a given class indicator matrix Y, the objective function and solution of WLS can be described as ~ ðMLÞ JY T X T XJ2 , Min D F X
XnWLS ¼ ðX D~
ðMLÞ
ðMLÞ
~ ðMLÞ Y T , X T Þy X D
ð45Þ
ðMLÞ
~ ~ where D is a N N diagonal matrix with entryD A R þ , and ii : :F is matrix Frobenius norm. Note that the class indicator matrix introduced in [30] can be used to define Y. Also, if we ~ D ~ ðMLÞ Þ1 , we can then choose a class indicator matrix as Y ¼ Gð ~ obtain X D
ðMLÞ
Y T ¼ H~b . Therefore, the optimal solution in Eq. (45) ~ ðMLÞ X T Þy H~b , which can also be can be rewritten as XnWLS ¼ ðX D equivalently formulated as ! P1 P1 1 X 1 X ðMLÞ T y 0 t t ~ ~ X Þ Hb ¼ U ð U T1 H~b Þ ðX D U T H~b ¼ U 1 0 0 t t ¼ U1
1 X t
I ¼ U1
1 X X X T P Q T ¼ XnCMLP3 Q t
b
ð46Þ
b
From Eq. (46), since Q is an orthogonal matrix, it can be neglected if the similarity of two samples is evaluated based on Euclidean distance. Thus, the main difference between XnCMLP3 P P and XnWLS is the diagonal matrix b. If b is an identity matrix, we have XnWLS ¼ XnCMLP3 . This can only be hold when satis~ ðMLÞ X T ÞrankðX L~b X T Þ ¼ fying the following condition: rankðX D ~ ðMLÞ X T X L~b X T Þ [30]. Otherwise, the WLS problem can be rankðX D solved by using the two-stage (TS) approach [32]. We detail the TS optimization approach in Appendix B.
5. Simulation results and analysis We in this section conduct simulations, including multivariate visualization and classification, to verify the validity of our methods. The performance of our CMLP family (i.e., CMLP-1, CMLP-2, CMLP-3
Z. Zhang et al. / Pattern Recognition 45 (2012) 4466–4493
and CMLP-4) and TR criterion based CMLP members (i.e., TR-CMLP-2, TR-CMLP-3 and TR-CMLP-4) are mainly compared with PCA, LDA, MMC, Canonical Correlation Analysis (CCA) [26] and ITR based LDA (TR-LDA) [24]. Note that CCA with the c label coding [26] used here is a multi-class extension of CCA [40]. The performance of our KCMLP family (i.e., KCMLP-1, KCMLP-2, KCMLP-3 and KCMLP-4) and TR criterion based KCMLP members (i.e., TR-KCMLP-1, TR-KCMLP-2, TRKCMLP-3 and TR-KCMLP-4) are compared with the KPCA, KLDA, kernelized MMC (KMMC), kernelized CCA (KCCA) and ITR based KLDA (TR-KLDA) methods. 5.1. Simulation setting and data preparation To avoid parameter selection, the simple-minded method [5] is used to define the weights for NNS type methods. For classification, the one-nearest-neighbor (1NN) classifier with Euclidean metric is applied. A regularization term m I with m ¼0.001 is added to ensure the stability of all generalized eigen-decomposition type approaches. The trade-off parameter ‘ is set to 0.5 in CMLP2, CMLP-3 and CMLP-4. We use the Gaussian RBF function to construct the kernel Gram matrix for nonlinear methods and determine the kernel width s by applying the estimation method of [34]. We perform all simulations on a PC with Intel (R) Core (TM) i5 CPU 650 @ 3.20 GHz 3.19 GHz 4G. In this work, 6 benchmark datasets are evaluated. The first one is ORL database (Available from: http://www.uk.research.att.com/ facedatabase.html); the second one is UMIST database [39]; the third one is USPS handwritten digits database [41]; the fourth one is YALE database (Available from http://cvc.yale.edu/projects/yalefaces/yale faces.html); the fifth one is MIT CBCL face recognition database [29]; the rest is selected from the UCI ML repository (Available from http:// archive.ics.uci.edu/ml/). The images in all real databases are resized due to the computational consideration. Original images of ORL, YALE, UMIST and MIT CBCL databases are resized to 32 32 pixels. The images of USPS database are resized to 16 16 pixels. Table 2 shows the specifications of partial datasets and the rest UCI datasets are described in Table 9. In the simulations, we randomly split each dataset into training and test set. For classification, different settings are prepared to examine the performance of methods. For each method and dataset, the training set processed with PCA operator to eliminate the null space before DR is used to train a 1NN classifier (or learner). The test set is then projected in the reduced space using the DR matrix learned from the training data. Then, the learner is
Table 2 List of the used visualization and pattern classification datasets. Data name
# Input dimensionality
# Of input samples
# Of classes
ORL FACE UMIST FACE YALE FACE MIT CBCL FACE USPS HANDWRITTEN DIGITS GLASS IDENTIFICATION
n¼32 32 ¼1024 n¼32 32 ¼1024 n¼32 32 ¼1024 n¼32 32 ¼1024 n¼16 16 ¼256 n¼10
N ¼400 N ¼564 N ¼165 N ¼3240 N ¼9298 N ¼214
C¼ 40 C¼ 20 C¼ 15 C¼ 10 C¼ 10 C¼ 6
4475
used for evaluating the accuracies of the test set in the reduced feature space. 5.2. Image representation and manifold visualization This section aims at testing our family methods by addressing three multivariate visualization simulations. 5.2.1. Visualization of projection matrix We first examine the visual properties of the delivered projection matrices. The visual results are compared with PCA, LDA, LPP, MMC, CCA and TR-LDA. Note that LDA family (LDA and TR-LDA), MMC and CCA use the ground- truth class labels of all training data, thus we apply all available constraints from the training set for fair competitions in the simulations. This setting is maintained unless there are special remarks. The YALE database is employed for this simulation. In this database, the face images demonstrate variations in lighting conditions, facial expressions and with/without glasses. Some typical sample images are shown in Fig. 4. We select two faces of each person for learning the face image subspace. The number of k in NNS is set to 15. We illustrate the first 10 eigenvectors of the projection matrix of each method. The vectors are then reshaped into a matrix according to the original face size. The results are given in Fig. 5. For our CMLP family, we also show the first 10 eigenvectors. So face images can be projected onto the marginal discriminant subspaces spanned by the eigenvectors of our methods. Clearly, results of LDA family, MMC, CCA, LPP and our CMLP family are more noisy than that of PCA, which demonstrates that they can capture more information of face details, such as lighting conditions and facial expressions. It is also noticed from the visual results delivered by our methods are different to some extent, indicating that our methods focus on reflecting different face details information of the persons. 5.2.2. Performance analysis on multimodality preservation This subsection mainly evaluates the embeddings of our family methods in terms of inter-class separability, intra-class compactness and multimodality preservation. The performance is compared with those of PCA, LDA family, LPP, MMC and CCA. In this simulation, the glass identification dataset is used. Out of the 214 instances of the dataset, there are 163 window glasses and 51 non-window glasses. Each sample has 10 attributes. In our simulation, this six-class dataset is converted into a multimodal one by merging classes 1, 3 and 5 to a single class, and the rest to another class. Number k in NNS is set to 3 for LPP and our methods. Fig. 6 shows the visualization result of each method. The distributions of the original and re-sampled multimodal datasets are also shown. It is obvious that each class of the multimodal dataset has three separate clusters. Note that LDA family can only extract one meaningful feature in binary-class case due to the rank limitation [1,8]. Here, we randomly select the second feature. From the results, we see that LDA, MMC and CCA are not capable of separating the two classes effectively. Most importantly, the intrinsic multimodal structures are lost in their obtained feature spaces. PCA, LPP and TR-LDA seem to be able to keep the intrinsic multimodal structures to some extent, but they
Fig. 4. Typical sample images from the real YALE face database.
4476
Z. Zhang et al. / Pattern Recognition 45 (2012) 4466–4493
Fig. 5. Visualization of the projection matrix of PCA, LDA family (i.e., LDA and TR-LDA), LPP, MMC, CCA and our family methods on the YALE face database.
Original dataset
Data with multimodal distribution PCA
Class 1 Class 2 Class 3 Class 4 Class 5 Class 6
Class 1 Class 2
CCA
LDA
MMC
TR-LDA
LPP
CMLP-1
CMLP-2
CMLP-3
CMLP-4
TR-CMLP-2
TR-CMLP-3
TR-CMLP-4
Fig. 6. The multimodal embedding results of PCA, LDA family, LPP, MMC, CCA and our family methods on the glass identification dataset.
However, they cannot organize more separate embeddings of the clusters. In contrast, our methods effectively pull the points of each cluster close together and perform well in terms of between-class separation and multimodality preservation. It is also interesting that there appears to be a clear boundary between each pair of clusters in our results.
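The re-sampling step described above can be scripted directly. Below is a minimal sketch, assuming the original glass labels are available as an integer array; the function name and the toy labels are only illustrative, not taken from the paper:

```python
import numpy as np

def make_multimodal_glass(y):
    """Merge the six glass classes into two, as described above:
    classes 1, 3 and 5 form one class and the remaining classes form
    the other, so each new class contains several separate clusters."""
    y = np.asarray(y)
    return np.where(np.isin(y, [1, 3, 5]), 0, 1)

# Toy labels standing in for the real glass labels (loading is not shown):
print(make_multimodal_glass([1, 2, 3, 5, 6, 7]))   # -> [0 1 0 0 1 1]
```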
Fig. 7. Typical sample images from the real MIT-CBCL face recognition database.
Fig. 8. The two-dimensional embedding results of PCA, LPP, LE, LLE, CMLP-1, CMLP-2 and TR criterion based CMLP family on the MIT-CBCL face database (four individuals).
5.2.3. Performance analysis on local manifold preservation

We next address a face manifold visualization experiment using the MIT CBCL face database. In this simulation, we mainly evaluate the embeddings in terms of inter-class separability, intra-class compactness and locality preservation. MIT-CBCL provides two sets: (1) high-resolution pictures, including frontal, half-profile and profile views; and (2) synthetic face images (324 images per person). In our study, the second set is employed. Some typical face images are shown in Fig. 7. To make the results clear, we choose four individuals from the set for visualization and test PCA, LPP, LE, LLE, CMLP-1, CMLP-2 and the TR criterion based CMLP family. We show the two-dimensional embeddings in Fig. 8. The numbers k in NNS for CMLP-1, CMLP-2, TR-CMLP-2, TR-CMLP-3, TR-CMLP-4, LE, LLE and LPP are set to 95, 95, 45, 45, 45, 35, 35 and 35, respectively. We clearly find that CMLP-1, CMLP-2 and the TR criterion based CMLP family deliver more separated manifolds of the faces of different persons and are capable of organizing natural clusters of faces. PCA, LE, LLE and LPP can preserve the global or local manifold information of the faces to some extent, but they are unable to exhibit clear organizations of the face clusters (Fig. 9). To investigate how our methods behave in locality preservation, we take CMLP-1 and CMLP-2 as examples and illustrate a zoomed-in view of the embedded faces of the second person. We observe clearly from Fig. 10 that CMLP-1 and CMLP-2 are able to capture the intrinsic local manifolds of the faces efficiently. It is noted that the poses of the faces change continuously and smoothly from frontal faces to side faces, from left to right; at the same time, the illumination of the faces changes continuously and smoothly from dark to light, from left to right. Another face manifold visualization task is also prepared to evaluate our CMLP-1 and TR-CMLP-2 methods. In this simulation, all faces of the 10 individuals are used. The number k in NNS is set to 75 for TR-CMLP-2 and 125 for CMLP-1. The results are shown in Fig. 11. We find that almost all faces are appropriately organized and more separated embeddings are delivered. Based on the above visualization simulations, we demonstrate that our techniques are promising for visualizing real datasets, which is clearly an essential property in recognizing images or objects.

5.3. Face recognition on ORL database

In this section, we use the ORL database for face recognition. The images of the ORL database were taken at different time instances, with varying lighting conditions, facial expressions and facial details. Some typical images are shown in Fig. 12. In this study, settings under different numbers of training data, k values and proportions of constraints are tested. In our simulations, the image sets are partitioned into different galleries and probe sets, where Gp/Pq means that p images per subject with class labels are randomly selected for training the learner and the remaining q images are used for testing. To ensure that our results are not biased by a specific random realization of the training/test set, for each Gp/Pq we compute the averaged accuracy over different realizations of the training/test sets.
Fig. 9. Zooming in of the embedded faces of the second person by CMLP-1 and CMLP-2.
Fig. 10. The two-dimensional embedding results of CMLP-1 (right) and TR-CMLP-2 (left) on the real MIT-CBCL face recognition database (ten individuals).
Fig. 11. Typical sample images from the real ORL face database.
The ML and CL constraints are created according to whether the class labels of neighboring samples in the training set are the same or different.
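The constraint-generation rule just described can be sketched as follows. This is a minimal illustration assuming a plain Euclidean k-nearest-neighbor search over a training matrix whose rows are samples; the function and variable names are placeholders rather than notation from the paper:

```python
import numpy as np

def build_ml_cl_constraints(X_train, y_train, k):
    """For every training sample, take its k nearest neighbors (Euclidean).
    Neighboring pairs sharing a label form the Must-Link (ML) set, and
    pairs with different labels form the Cannot-Link (CL) set."""
    X = np.asarray(X_train, dtype=float)
    y = np.asarray(y_train)
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)  # pairwise squared distances
    np.fill_diagonal(d2, np.inf)                              # exclude self-pairs
    ml, cl = set(), set()
    for i in range(n):
        for j in np.argsort(d2[i])[:k]:                       # k nearest neighbors of sample i
            pair = (min(i, j), max(i, j))                     # store each pair only once
            (ml if y[i] == y[j] else cl).add(pair)
    return sorted(ml), sorted(cl)
```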
5.3.1. Performance analysis on the proportion of constraints

In this simulation, we analyze the performance of our CMLP and TR criterion based CMLP families over different proportions of constraints. Three settings over different Gp/Pq are evaluated. In our simulations, we fix the k number, the Gp/Pq setting and the d value, and then regulate the proportions of constraints randomly selected from the ML and CL constraint sets. We first report the results of the linear methods in Fig. 13, where the horizontal axis represents the proportion of constraints and the vertical axis is the corresponding accuracy. Note that in all simulations testing the proportions of constraints, u% constraints means u% of the ML constraints plus u% of the CL constraints. In all such simulations, we average the accuracies over 35 random selections of constraints.
Fig. 12. Recognition accuracies of linear methods vs. proportions of constraints on the ORL face database. First row: k = 7; second row: k = 21.
Fig. 13. Recognition accuracies of nonlinear methods vs. proportions of constraints on the ORL face database.
From Fig. 13, we see that the accuracy of each method varies with the increasing proportion of constraints. The results show that the number of constraints affects the performance of our methods significantly. When the proportion of constraints increases, the accuracies of CMLP-1, CMLP-2 and the TR criterion based CMLP family go up steadily until they reach their highest values. We also see from Fig. 13 that CMLP-3 and CMLP-4 are little affected by the constraints. The results reveal that CMLP-1, CMLP-2 and the TR criterion based CMLP family outperform CMLP-3 and CMLP-4 in most cases. We also observe that each TR criterion based CMLP family method is superior to the corresponding eigen-decomposition based CMLP family method. We also analyze the performance of our kernelized algorithms over varying constraints; the results are illustrated in Fig. 14.
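The protocol of varying the proportion of constraints can be sketched as below. It assumes the ML/CL sets built from the training data and a hypothetical fit_and_score routine that learns a projection from the selected constraints and returns the test accuracy; these names are illustrative only:

```python
import numpy as np

def accuracy_vs_constraints(ml_all, cl_all, fit_and_score, proportions,
                            repeats=35, seed=0):
    """For each proportion u, randomly draw u of the ML and u of the CL
    constraints, train/evaluate with them, and average the accuracy over
    repeated random selections (35 here, as in the text)."""
    rng = np.random.default_rng(seed)

    def subsample(pairs, u):
        pairs = list(pairs)
        m = int(round(u * len(pairs)))
        idx = rng.choice(len(pairs), size=m, replace=False)
        return [pairs[i] for i in idx]

    curve = []
    for u in proportions:
        scores = [fit_and_score(subsample(ml_all, u), subsample(cl_all, u))
                  for _ in range(repeats)]
        curve.append(float(np.mean(scores)))
    return curve
```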
Fig. 14. Recognition accuracies of linear methods vs. number of k in NNS on the ORL face database.
Fig. 15. Recognition accuracies of nonlinear methods vs. number of k in NNS on the ORL face database.
Compared with Fig. 13, the following observations are found. (1) For fixed Gp/Pq, k and d, the accuracies of all methods vary with the constraints. In all cases, TR-KCMLP-2 works the best. KCMLP-3 and KCMLP-4 are still insensitive to the constraints, and their accuracies are close in each case. (2) The ranking of the nonlinear methods is similar to that of the linear methods. The major difference is that KCMLP-3 and KCMLP-4 exceed KCMLP-2 in these two cases; KCMLP-3 and KCMLP-4 even outperform CMLP-1 in some cases. (3) CMLP-2 is little affected by the varying constraints for the latter two cases in the second row of Fig. 14.
5.3.2. Performance analysis on the neighbor number k in NNS

The neighbor number k is an adjustable parameter, and the selection of k has a significant effect on the performance of NNS-type DR methods [46]. It is therefore important to investigate how the k number affects the accuracies of our methods. We prepare three simulations to test LPP and our methods by varying the k number while fixing the Gp/Pq setting and the d value. The results of the linear methods are illustrated in Fig. 15. For setting G3/P7, the accuracies of our family of methods go up as k increases. LPP exhibits an unusual behavior: its accuracy initially increases with the k value, then starts going down after some point until reaching its lowest value, after which the accuracy increases again. For settings G4/P6 and G5/P5, all methods share similar trends compared with themselves. In these two settings, the overall performance of CMLP-1, CMLP-2 and CMLP-4 rises as k increases. TR-CMLP-2 is better than the other methods across all k values, but it starts descending to some extent after some point in each case. CMLP-3 exhibits a trend similar to TR-CMLP-4: the accuracies of TR-CMLP-4 and CMLP-3 initially increase, start decreasing at some point, and then continue growing. With the increasing k value, the results of LPP initially increase and then start going down. Overall, LPP is comparable to CMLP-3 in some cases. From these results, we find that too large or too small a k value may incur degeneracy in the embeddings. We then report the results of the nonlinear methods in Fig. 16, from which we see clearly that: (1) KLPP shares a similar trend with LPP in each setting and delivers results close to KCMLP-3 and KCMLP-4 in each case, while TR-KCMLP-2 remains ranked first. (2) KCMLP-2, KCMLP-3 and KCMLP-4 are insensitive to the k values. (3) The KCMLP-1, TR-KCMLP-3 and TR-KCMLP-4 embeddings tend to deteriorate as k increases, but the accuracies of TR-KCMLP-3 and TR-KCMLP-4 start going up again after some point for case G3/P7. (4) With increasing training sample size, the performance of TR-KCMLP-3 and TR-KCMLP-4 improves rapidly compared with the other methods. For case G5/P5, the accuracies of TR-KCMLP-3 and TR-KCMLP-4 exceed that of KCMLP-2 and rank second.
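Since all of the NNS-type methods above rely on the same neighborhood search, the effect of k can be probed by rebuilding the neighborhood graph for each candidate value. The following is a generic sketch of a symmetric k-nearest-neighbor adjacency, given here only as an assumed standard construction rather than the paper's exact graph definition:

```python
import numpy as np

def knn_adjacency(X, k):
    """Symmetric 0/1 adjacency matrix connecting every sample to its k
    nearest neighbors under the Euclidean distance."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)
    W = np.zeros((n, n))
    for i in range(n):
        W[i, np.argsort(d2[i])[:k]] = 1.0
    return np.maximum(W, W.T)   # i and j are linked if either is a neighbor of the other
```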
5.3.3. Performance analysis on the number of dimensions

To investigate how our family of methods performs over different numbers of reduced dimensions, we first compare the accuracies of the linear methods with PCA, the LDA family, MMC and CCA by varying the dimensions with the k values fixed. Three settings over different numbers of training samples are tested. Note that for LDA, TR-LDA and their kernelized extensions, we only report the results over the first C−1 dimensions, because the results remain unchanged beyond the bound C−1 [8]. The results of the linear and nonlinear methods are averaged over 15 runs. Fig. 17 plots the accuracies of the linear methods. We first show the accuracies of our methods in the first row of Fig. 17, from which we find that: (1) TR-CMLP-2 outperforms the remaining methods across all d values. (2) CMLP-2 is competitive with CMLP-1, and both are slightly worse than TR-CMLP-2. (3) In most cases, CMLP-4 is comparable to TR-CMLP-3, and both are slightly better than CMLP-3.
Fig. 16. Recognition accuracies of linear methods vs. number of reduced dimensions on the ORL database.
Fig. 17. Recognition accuracies of nonlinear methods vs. number of reduced dimensions on the ORL database.
(4) The performance of TR-CMLP-4 lies between those of CMLP-1 (or CMLP-2) and TR-CMLP-3 (or CMLP-4). (5) CMLP-3 performs worse than the other methods on this dataset. The second row of Fig. 17 reports the comparison of CMLP-1 and TR-CMLP-2 with PCA, the LDA family, MMC and CCA. The following observations are found. (1) TR-CMLP-2 is consistently the best, and its accuracy goes up faster than those of the remaining methods with increasing d. In all cases, CCA is comparable to LDA; this is because, for a given class indicator matrix, CCA can be equivalent to LDA [26].
(2) For G3/P7, MMC delivers results similar to CMLP-1 as d increases beyond 20, and both are competitive with TR-LDA. TR-LDA outperforms LDA, CCA, PCA and LPP, and LPP is the worst. (3) For G4/P6, the overall ranking of the methods is similar to the case of G3/P7, except for the rankings of LDA and TR-LDA. (4) For G5/P5, TR-LDA obtains results competitive with TR-CMLP-2, and the two rank second together; the performance of MMC and CMLP-1 is close, and they rank third together. LPP is still the worst. We then evaluate the performance of our nonlinear methods by comparing them with KPCA, the KLDA family (i.e., KLDA and TR-KLDA), KMMC and KCCA over varying d values. The results of our methods are plotted in the first row of Fig. 18. The results show that KCMLP-3 and KCMLP-4 exhibit similar trends in each case and both rank last, while TR-KCMLP-2 still ranks first. The overall ranking of the other methods is KCMLP-1 > TR-KCMLP-4 > TR-KCMLP-3 > KCMLP-2. Note that TR-KCMLP-3 delivers results comparable to KCMLP-2 for case G4/P6. The results of TR-KCMLP-2 and KCMLP-1 are compared with KPCA, the KLDA family, KMMC and KCCA in the second row of Fig. 18. We find that: (1) the curves can be divided into two major groups; the first group includes KLDA, KCCA, TR-KLDA, KCMLP-1 and TR-KCMLP-2, the second group contains KMMC, KPCA and KLPP, and the first group is superior to the second. (2) KLDA obtains results competitive with TR-KCMLP-2 in each case, and the two rank first together; KCMLP-1 is comparable to TR-KLDA, and both rank second. (3) The trends of KCCA and KLDA are similar in most cases. (4) Within the second group, the performance ranking is KMMC > KPCA > KLPP. The mean and highest accuracies according to Figs. 17 and 18 are recorded in Tables 3 and 4, respectively, together with the mean running time in seconds. The following observations are found. (1) We experimentally show that employing supervised information, including class labels or pairwise constraints, can boost the recognition performance of DR methods. (2) The ranking of the methods is consistent with Figs. 17 and 18. (3) By observing the accuracies, we find that some linear methods even outperform their kernelized extensions. This observation is consistent with the conclusion of [14], which states that nonlinear DR techniques are not yet capable of outperforming the linear techniques on benchmark problems. (4) In terms of running time, our linear methods are comparable to or even faster than PCA, LPP, the LDA family, CCA and MMC in some cases, and our nonlinear methods are also competitive with KPCA, KLPP, the KLDA family, KCCA and KMMC in most cases. Note that the nonlinear methods are faster than the corresponding linear methods on this dataset, because the kernelized methods scale with the number of points and the dimensionality of the data space is much higher than the sample size here.

5.4. Face recognition on UMIST database

This section evaluates the performance of our methods using the real UMIST database, which consists of 564 images of 20 individuals. Some typical images are shown in Fig. 19. In this experiment, three simulation settings under different numbers of training data, nearest neighbors (k) and constraints are prepared. In our study, 3, 5 and 7 images per individual with class labels are randomly selected for training the learner. For each case testing k values and d values, accuracies are averaged over 15 splits of training and testing samples. The ML and CL constraints are created according to whether the class labels of neighboring pairs in the training set are the same or not.

5.4.1. Performance analysis on the proportion of constraints

We first test the effects of varying proportions of constraints on the performance of our methods. Different constraints are randomly selected from the constraint sets. To investigate the relationships between accuracy rates and constraints, for each setting we fix the values of k and d and vary the proportion of constraints. We first plot the accuracies of the linear methods with k = 21 in Fig. 20. We see clearly that the results can be partitioned into two major parts. The first part includes CMLP-1, CMLP-2 and TR-CMLP-2. TR-CMLP-2 keeps the first place in each case. CMLP-1 and CMLP-2 are comparable in most cases, and CMLP-1 works slightly better than CMLP-2 in case G7.
Fig. 18. Typical sample images from the real UMIST face database.
Table 3
Performance comparisons of linear methods on the ORL face database. The values in each row are listed in the method order: PCA, LDA, LPP, MMC, CCA, TR-LDA, CMLP-1, CMLP-2, CMLP-3, CMLP-4, TR-CMLP-2, TR-CMLP-3, TR-CMLP-4.

G3/P7, mean accuracy: 0.6928 0.7800 0.6217 0.8359 0.7816 0.8247 0.8472 0.8360 0.6956 0.7436 0.8756 0.7481 0.7843
G3/P7, best accuracy: 0.7720 0.9045 0.7331 0.8660 0.8241 0.9203 0.9320 0.9260 0.8400 0.8526 0.9425 0.8740 0.8824
G3/P7, mean time (s): 1.5413 1.2559 2.5377 1.9200 2.5967 1.6375 1.1584 1.2476 1.6569 1.5711 2.1163 1.2797 4.6579
G4/P6, mean accuracy: 0.7842 0.8405 0.6900 0.8535 0.8407 0.8491 0.8599 0.8466 0.7189 0.7625 0.8891 0.7526 0.7965
G4/P6, best accuracy: 0.8466 0.9147 0.7685 0.9155 0.8957 0.9257 0.9414 0.9352 0.8474 0.8947 0.9539 0.8623 0.9026
G4/P6, mean time (s): 1.5695 1.3323 2.6602 2.0499 2.7754 1.7977 1.2735 1.3213 1.8784 1.6767 2.2320 1.3515 4.2306
G5/P5, mean accuracy: 0.8131 0.8562 0.7290 0.8874 0.8719 0.9076 0.8924 0.8914 0.7657 0.8023 0.9177 0.8269 0.8502
G5/P5, best accuracy: 0.8952 0.9226 0.8568 0.9240 0.9200 0.9450 0.9558 0.9485 0.8850 0.9200 0.9607 0.9250 0.9136
G5/P5, mean time (s): 2.1216 1.4540 3.0957 2.3414 2.9419 2.1094 1.0028 1.0880 1.1132 1.1715 2.9603 1.1469 3.2235
Table 4
Performance comparisons of nonlinear methods on the ORL face database. The values in each row are listed in the method order: KPCA, KLDA, KLPP, KMMC, KCCA, TR-KLDA, KCMLP-1, KCMLP-2, KCMLP-3, KCMLP-4, TR-KCMLP-2, TR-KCMLP-3, TR-KCMLP-4.

G3/P7, mean accuracy: 0.6221 0.7666 0.5681 0.6701 0.7338 0.7371 0.7556 0.6826 0.6165 0.6256 0.7929 0.6907 0.7002
G3/P7, best accuracy: 0.7153 0.8952 0.7227 0.7189 0.8317 0.8648 0.8256 0.7794 0.7580 0.7260 0.8932 0.7915 0.8057
G3/P7, mean time (s): 0.0569 0.0364 0.0629 0.0498 0.0428 0.0535 0.1578 0.2178 0.2404 0.2599 0.4118 0.2544 0.2724
G4/P6, mean accuracy: 0.6826 0.8433 0.6119 0.7381 0.7973 0.8026 0.8421 0.7426 0.6704 0.6831 0.8696 0.7312 0.7697
G4/P6, best accuracy: 0.7561 0.9112 0.8008 0.8330 0.9012 0.9106 0.9160 0.8293 0.7769 0.7724 0.9390 0.8455 0.8780
G4/P6, mean time (s): 0.0979 0.0613 0.1143 0.0908 0.0736 0.0902 0.0950 0.1015 0.1127 0.1230 0.1961 0.1500 0.1596
G5/P5, mean accuracy: 0.7470 0.8816 0.6686 0.8030 0.8875 0.8495 0.8653 0.7990 0.7339 0.7493 0.9049 0.8132 0.8230
G5/P5, best accuracy: 0.8350 0.9315 0.8600 0.8750 0.9257 0.9450 0.9226 0.8750 0.8354 0.8257 0.9585 0.9159 0.9150
G5/P5, mean time (s): 0.1956 0.1036 0.2299 0.1803 0.1422 0.1574 0.1712 0.1781 0.2087 0.2282 0.3902 0.2895 0.3122
Fig. 19. Recognition accuracies of linear methods vs. proportions of constraints on the UMIST face database.
Fig. 20. Recognition accuracies of nonlinear methods vs. proportions of constraints on the UMIST database.
The second part consists of CMLP-3, CMLP-4, TR-CMLP-3 and TR-CMLP-4. TR-CMLP-3 and TR-CMLP-4 have similar results in each setting. In most cases, CMLP-4 is the best among the members of the second part. CMLP-3 delivers results close to TR-CMLP-3 and TR-CMLP-4 for the cases of G3 and G7, but it works the worst in case G5. Also, the results show that the overall performance of all methods ascends with the increasing proportion of constraints. We then report the results of the nonlinear methods in Fig. 21. The same settings are tested. For each Gp, results are reported for k equal to 3 or 21. For the case k = 3, the overall performance of KCMLP-1, KCMLP-2, KCMLP-4, TR-KCMLP-2, TR-KCMLP-3 and TR-KCMLP-4 increases monotonically as the proportion of constraints increases. The results once again show that increasing the proportion of constraints is unable to significantly improve the performance of KCMLP-3. When the k value increases to 21, the trends of all methods become flatter with the increasing proportion of constraints than in the case of k = 3, and the ranking of these nonlinear methods becomes clearer. The results obtained by TR-KCMLP-3 and TR-KCMLP-4 are comparable in each setting, and both rank last. TR-KCMLP-2 is still the best, followed in order by KCMLP-1, KCMLP-2, KCMLP-4 and KCMLP-3 for each simulation setting.
5.4.2. Performance analysis on the neighbor number k in NNS

This simulation tests the performance of LPP and our methods over different neighborhood sizes. To investigate the relations between accuracy rates and k values, the number of dimensions d is fixed to C−1. We first report the results obtained by LPP and our linear methods in Fig. 22. The following observations are found. (1) For each setting, the performance of all the methods varies with the increasing k values. The accuracies of LPP, CMLP-3 and CMLP-4 monotonically increase as the k value increases. TR-CMLP-3 and TR-CMLP-4 deliver similarly unusual trends in each setting; in other words, their recognition accuracies initially decrease and then monotonically go up after some point. (2) CMLP-1 and CMLP-2 achieve results close to TR-CMLP-2, and the three rank first together. LPP is worse than the other methods. The results of CMLP-3, CMLP-4, TR-CMLP-3 and TR-CMLP-4 are comparable in most cases. It is noted that CMLP-4 delivers results close to CMLP-1 and CMLP-2 in the case of G7. The results of KLPP and our nonlinear methods are shown in Fig. 23. We observe the following. (1) TR-KCMLP-3 and TR-KCMLP-4 deliver similarly unusual trends to TR-CMLP-3 and TR-CMLP-4. (2) KCMLP-3 and KCMLP-4 are insensitive to the k value when k is relatively small, but their accuracies increase quickly after some point. (3) KLPP produces an unusual trend that differs from that of LPP: its performance initially increases and then begins to decrease after some point. Once again, this shows that too large or too small a k may cause the LPP embedding to deteriorate. (4) Relatively speaking, the value of k does not cause large fluctuations in the results of KCMLP-1, KCMLP-2 and TR-KCMLP-2, but TR-KCMLP-2 seems to degenerate to some extent in the case of G7 when the k value increases.
5.4.3. Performance analysis on the number of dimensions

This study tests our methods with varying d values. The accuracies of the linear methods are plotted in Fig. 24. We find that the results can be divided into two major parts. The first part includes CMLP-1, CMLP-2 and TR-CMLP-2; the accuracy rates of these methods climb rapidly as d increases. The second part contains CMLP-3, CMLP-4, TR-CMLP-3 and TR-CMLP-4; compared with the first part, the accuracies of these methods rise steadily but slowly with increasing d.
Fig. 21. Recognition accuracies of linear methods vs. number of k in NNS on the UMIST face database.
Fig. 22. Recognition accuracies of nonlinear methods vs. number of k in NNS on the UMIST face database.
Fig. 23. Recognition accuracies of linear methods vs. number of reduced dimensions on the UMIST database.
Fig. 24. Recognition accuracies of nonlinear methods vs. number of reduced dimensions on the UMIST database.
Thus the first part is better than the second part. CMLP-4 achieves results comparable to CMLP-1 and CMLP-2 when d becomes large in case G7. We then compare CMLP-1, CMLP-2 and TR-CMLP-2 with PCA, the LDA family, MMC and CCA in the second row of Fig. 24. We have the following observations. (1) TR-CMLP-2 works the best and LPP is the worst. (2) For case G3, our methods work better than the other methods. CCA delivers results close to LDA, and both are superior to MMC and PCA; LDA is slightly better than TR-LDA in this case. (3) For the latter two cases, TR-LDA works better than LDA and obtains accuracies competitive with our methods. PCA is comparable to CCA, and MMC works slightly better than CCA and LDA.
We also test our nonlinear methods under different d. The results are recorded in Fig. 25. We see from the first row of Fig. 25 that KCMLP-1, KCMLP-2 and TR-KCMLP-2 are comparable and rank first together. KCMLP-3 and KCMLP-4 deliver comparable results and both rank second. TR-KCMLP-3 and TR-KCMLP-4 exhibit close results and both rank third. LPP exhibits results competitive with TR-KCMLP-3 and TR-KCMLP-4 when large d values are involved. From the second row of Fig. 25, we find that the accuracies of the KLDA family are comparable to those of our KCMLP-1 and TR-KCMLP-2 in all cases. KCCA and KMMC deliver accuracies close to KCMLP-2 in each setting. KPCA works better than KLPP in each case, but both are inferior to the other methods in most cases. Note that KPCA is better for case G7 and even achieves results comparable to KMMC and KCMLP-2 when d increases to a high level. We report the mean and highest accuracies according to Figs. 24 and 25 in Tables 5 and 6, respectively; the mean running times are also recorded. We obtain the following observations. (1) The performance superiorities of the methods are consistent with the results in Figs. 24 and 25. (2) We once again experimentally show that utilizing supervised class labels or pairwise constraint information can improve the learning ability of DR methods. (3) On the one hand, the results show that virtually all nonlinear methods deliver results comparable to the corresponding linear methods on this dataset, and even outperform the linear ones. On the other hand, the running time performance of the nonlinear methods is also superior to that of the corresponding linear methods, because the number of samples in this dataset is smaller than its dimensionality; recall that the kernelized methods depend only on the number of points. These analyses suggest employing nonlinear methods for mining or extracting features from this dataset in applications, because the nonlinear methods offer an advantage over the linear ones here.

5.5. Handwritten digits recognition on USPS database

In this section, the USPS database [41] is used to test our methods. This database consists of 9298 handwritten digits ('0'–'9'). In this study, the publicly available set from http://cs.nyu.edu/~roweis/data.html is used.
Fig. 25. Typical sample images from the USPS handwritten digits database.
Table 5
Performance comparisons of linear methods on the UMIST face database. The values in each row are listed in the method order: PCA, LDA, LPP, MMC, CCA, TR-LDA, CMLP-1, CMLP-2, CMLP-3, CMLP-4, TR-CMLP-2, TR-CMLP-3, TR-CMLP-4.

G3, mean accuracy: 0.7484 0.7939 0.6464 0.7829 0.8036 0.7824 0.8510 0.8432 0.7597 0.7362 0.8714 0.7333 0.7290
G3, best accuracy: 0.8388 0.9292 0.7989 0.9340 0.9212 0.9170 0.9204 0.9301 0.8485 0.8585 0.9431 0.8447 0.8194
G3, mean time (s): 0.1552 0.1184 0.2244 0.1670 0.1842 0.1312 0.1351 0.1666 0.1784 0.1883 0.2618 0.1819 0.2241
G5, mean accuracy: 0.8175 0.8365 0.7547 0.8683 0.8479 0.8733 0.9001 0.9031 0.8240 0.8440 0.9220 0.8369 0.8486
G5, best accuracy: 0.9242 0.9469 0.8449 0.9606 0.9389 0.9653 0.9686 0.9690 0.9265 0.9328 0.9632 0.9200 0.9263
G5, mean time (s): 0.2079 0.1700 0.2769 0.2190 0.2378 0.1839 0.1846 0.2167 0.2293 0.2393 0.3156 0.2339 0.3043
G7, mean accuracy: 0.8712 0.8801 0.8029 0.9083 0.8980 0.8903 0.9286 0.9216 0.8598 0.8887 0.9425 0.8639 0.8794
G7, best accuracy: 0.9543 0.9618 0.8662 0.9703 0.9495 0.9726 0.9751 0.9683 0.9452 0.9590 0.9772 0.9203 0.9339
G7, mean time (s): 0.2457 0.2235 0.3140 0.2564 0.2863 0.2258 0.2300 0.2450 0.2757 0.2880 0.3641 0.2752 0.3468
Table 6
Performance comparisons of nonlinear methods on the UMIST face database. The values in each row are listed in the method order: KPCA, KLDA, KLPP, KMMC, KCCA, TR-KLDA, KCMLP-1, KCMLP-2, KCMLP-3, KCMLP-4, TR-KCMLP-2, TR-KCMLP-3, TR-KCMLP-4.

G3, mean accuracy: 0.7691 0.8430 0.6569 0.8149 0.8410 0.8446 0.8542 0.8362 0.7573 0.7661 0.8824 0.7212 0.7205
G3, best accuracy: 0.8427 0.9231 0.8311 0.8971 0.9215 0.9456 0.9126 0.8990 0.8641 0.8388 0.9456 0.7903 0.7981
G3, mean time (s): 0.0823 0.0698 0.0718 0.0608 0.0645 0.0869 0.1017 0.1501 0.1510 0.1421 0.1794 0.1240 0.1748
G5, mean accuracy: 0.8671 0.9287 0.7527 0.9034 0.9512 0.8913 0.9151 0.9054 0.8687 0.8686 0.9324 0.8356 0.8376
G5, best accuracy: 0.9150 0.9495 0.9099 0.9360 0.9374 0.9548 0.9560 0.9581 0.9455 0.9392 0.9632 0.9140 0.9119
G5, mean time (s): 0.0864 0.0646 0.0868 0.0723 0.0703 0.0738 0.0992 0.1355 0.1553 0.1581 0.1814 0.1663 0.1699
G7, mean accuracy: 0.8852 0.9244 0.7877 0.9096 0.9321 0.9065 0.9336 0.9151 0.8860 0.8913 0.9449 0.8686 0.8651
G7, best accuracy: 0.9455 0.9619 0.8916 0.9523 0.9541 0.9638 0.9683 0.9637 0.9481 0.9523 0.9728 0.9368 0.9302
G7, mean time (s): 0.0813 0.0790 0.0825 0.0746 0.0604 0.0783 0.0896 0.1246 0.1044 0.1107 0.1664 0.1276 0.1387
Fig. 26. Recognition accuracies of linear methods vs. proportions of constraints on the USPS database (panels: D = 256, N = 2000, d = 10, with k = 5 and k = 21 under the G30/P170 and G60/P140 settings; curves for CMLP-1, CMLP-2, CMLP-3, CMLP-4, TR-CMLP-2, TR-CMLP-3 and TR-CMLP-4).
This set contains 16 × 16-pixel, 8-bit grayscale images of '0' through '9'. Each digit has 1100 images, and each image is represented by a 256-dimensional vector. We show some samples in Fig. 26. We select 200 samples of each digit character for the experiments. Two settings over Gp/Pq are evaluated. The ML and CL constraint sets are similarly created based on whether the underlying class labels of similar digits are the same or not. In this simulation, we average the recognition accuracy over 15 runs in each case when testing the k values and the numbers of reduced dimensions.
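The data preparation for this experiment can be sketched as follows, assuming the USPS images are available as an array of shape (n, 256) with integer digit labels; the 200-per-digit subsampling and the Gp/Pq split follow the description above, while the function and variable names are only illustrative:

```python
import numpy as np

def subsample_per_class(images, labels, per_class=200, seed=0):
    """Randomly keep a fixed number of samples from each digit class."""
    images, labels = np.asarray(images), np.asarray(labels)
    rng = np.random.default_rng(seed)
    keep = np.concatenate([rng.choice(np.flatnonzero(labels == c),
                                      size=per_class, replace=False)
                           for c in np.unique(labels)])
    return images[keep], labels[keep]

def gp_pq_split(images, labels, p, seed=0):
    """Gp/Pq split: p labeled samples per class for training, the rest for testing."""
    images, labels = np.asarray(images), np.asarray(labels)
    rng = np.random.default_rng(seed)
    train, test = [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        train.extend(idx[:p])
        test.extend(idx[p:])
    train, test = np.array(train), np.array(test)
    return images[train], labels[train], images[test], labels[test]
```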
5.5.1. Performance analysis on the proportion of constraints

To see how our methods behave in recognizing the handwritten digits over different proportions of constraints, we test our methods under two settings, G30/P170 and G60/P140, where the k value is set to 5 and 21, respectively. The results of the linear methods are plotted in Fig. 27, from which we have the following findings. (1) The performance superiorities of the methods are exactly the same for the case of k = 5 over the two settings. (2) Compared with the case of G30/P170, the linear methods are more sensitive to the constraints for case G60/P140. (3) When k = 21, the results are correspondingly more stable than those of case k = 5, which indicates that increasing the k value improves the stability of the linear methods. (4) All methods deliver satisfactory results using only small proportions of constraints for case G30/P170, k = 21. This property is appreciated in all pairwise constrained methods, because only a small proportion of constraints needs to be selected in real applications. As a result, N_ML and N_CL are far smaller than N in size, implying that the matrix computational burden of our approaches can be greatly reduced. It is also noted that randomly chosen constraints produce larger deviations over repetitions; thus it is important to investigate how to select optimal constraints in our future work. In Fig. 27, the recognition accuracies of our nonlinear methods vs. the proportion of constraints are illustrated. From Fig. 27, we obtain the following observations: (1) The performance of each method varies with the proportion of constraints. Compared with the case k = 21, the varied number of constraints has a significant effect on the performance of each method when the k value is 5. (2) The ranking of these methods is not consistent across different settings. (3) Large deviations are produced by the randomly selected constraints over repetitions as well.
5.5.2. Performance analysis on the neighbor number k in NNS

This experiment investigates the performance of LPP and our methods by varying the k values, with the d value fixed to 10. The same settings are tested. We show the averaged accuracies of the linear and nonlinear methods in Figs. 28 and 29, respectively. From Fig. 28, we observe the following. (1) The performance of CMLP-3 and CMLP-4 increases monotonically as k increases. (2) For case G30/P170, the accuracies of TR-CMLP-3 and TR-CMLP-4 initially increase, start descending after some point, and then continue going up as the k value increases.
Fig. 27. Recognition accuracies of nonlinear methods vs. proportions of constraints on the USPS database (panels: D = 256, N = 2000, d = 10, with k = 5 and k = 21 under the G30/P170 and G60/P140 settings; curves for KCMLP-1, KCMLP-2, KCMLP-3, KCMLP-4, TR-KCMLP-2, TR-KCMLP-3 and TR-KCMLP-4).
Fig. 28. Recognition accuracies of linear methods vs. number of k in NNS on the USPS database (panels: D = 256, N = 2000, d = 10, under the G30/P170 and G60/P140 settings; curves for LPP, CMLP-1, CMLP-2, CMLP-3, CMLP-4, TR-CMLP-2, TR-CMLP-3 and TR-CMLP-4).
For case G60/P140, the performance of TR-CMLP-3 and TR-CMLP-4 goes down monotonically with increasing k values. (3) LPP delivers the opposite trend to TR-CMLP-3 and TR-CMLP-4 for case G30/P170: its performance increases monotonically with respect to the k value. (4) The performance of CMLP-1, CMLP-2 and TR-CMLP-2 initially increases with the k values, but starts degenerating after some point.
Fig. 29. Recognition accuracies of nonlinear methods vs. number of k in NNS on the USPS database (panels: D = 256, N = 2000, d = 10, under the G30/P170 and G60/P140 settings; curves for KLPP, KCMLP-1, KCMLP-2, KCMLP-3, KCMLP-4, TR-KCMLP-2, TR-KCMLP-3 and TR-KCMLP-4).
Fig. 30. Recognition accuracies of linear methods vs. number of reduced dimensions on the USPS database (panels: D = 256, N = 2000, k = 21, under the G30/P170 and G60/P140 settings; curves for PCA, LDA, LPP, MMC, CCA, TR-LDA, CMLP-1, CMLP-2, CMLP-3, CMLP-4, TR-CMLP-2, TR-CMLP-3 and TR-CMLP-4).
The following observations are found from Fig. 29. (1) Compared with the linear methods, the nonlinear ones are insensitive to the k values. (2) Interestingly, an obvious fluctuation appears in the result of each method; the point at which the trend changes may correspond to special properties, and after that point the methods exhibit increasing or decreasing trends, respectively. (3) Similarly, our nonlinear methods are able to outperform KLPP in these cases.

5.5.3. Performance analysis on the number of dimensions

To test how our methods behave in recognizing the digits on this database, two settings over Gp/Pq are tested. For the linear and nonlinear methods, the neighbor number k is always set to 21. The results of the linear methods with different reduced dimensions d = 1, 2, ..., 60 are plotted in Fig. 30; to make the figures clear, we only show the accuracies over the first 30 dimensions. The accuracies of the LDA and KLDA families are averaged over the first C−1 dimensions. We obtain the following observations from Fig. 30. (1) For case G30/P170, TR-LDA delivers results comparable to our CMLP-1, CMLP-2 and TR-CMLP-2, and the four rank first together. LDA obtains accuracies close to our TR-CMLP-3 and TR-CMLP-4. PCA, CCA and MMC outperform our CMLP-3 and CMLP-4 before around d = 10 but become inferior to them as d increases. LPP is the worst. (2) For G60/P140, the performance of TR-LDA increases faster than that of the other methods with increasing d and delivers the best results before around d = 5. Our CMLP-1, CMLP-2, TR-CMLP-3 and TR-CMLP-4 outperform the other methods in most cases, and the accuracies delivered by CMLP-1, CMLP-2 and TR-CMLP-4 are very close after around d = 10. LDA and CCA exhibit results comparable to TR-CMLP-2. As d increases to around 20, the results of CMLP-3 and CMLP-4 become close to those of CMLP-1, CMLP-2 and TR-CMLP-4. We also observe that PCA is comparable to MMC, and both outperform LPP. (3) Overall, the supervised methods offer an advantage over the unsupervised methods in most cases. The recognition accuracies of the nonlinear methods under different d values are illustrated in Fig. 31. We find that the KLDA family and KCCA achieve results close to KCMLP-1, TR-KCMLP-2, TR-KCMLP-3, TR-KCMLP-4 and KCMLP-2. It is also noted that the performance of the KLDA family rises faster than that of the other nonlinear methods with increasing d values. KMMC delivers results comparable to KCMLP-3 and KCMLP-4 in most cases. KPCA outperforms KLPP across all values of d.
Fig. 31. Recognition accuracies of nonlinear methods vs. number of reduced dimensions on the USPS database (panels: D = 256, N = 2000, k = 21, under the G30/P170 and G60/P140 settings; curves for KPCA, KLDA, KLPP, KMMC, KCCA, TR-KLDA and the KCMLP family).

Table 7
Performance comparisons of linear methods on the USPS handwritten digits database. The values in each row are listed in the method order: PCA, LDA, LPP, MMC, CCA, TR-LDA, CMLP-1, CMLP-2, CMLP-3, CMLP-4, TR-CMLP-2, TR-CMLP-3, TR-CMLP-4.

G30/P170, mean accuracy: 0.8337 0.7755 0.7793 0.8397 0.8377 0.8121 0.8960 0.8927 0.8455 0.8542 0.9043 0.8693 0.8655
G30/P170, std: 0.0120 0.0197 0.0133 0.0197 0.0141 0.0222 0.0116 0.0131 0.0222 0.0165 0.0131 0.0222 0.0165
G30/P170, best accuracy: 0.8844 0.9102 0.8328 0.8750 0.8961 0.9228 0.9308 0.9310 0.9306 0.9316 0.9389 0.9283 0.9278
G30/P170, mean time (s): 0.3411 0.2238 0.3908 0.3378 0.2748 0.2594 0.2915 0.3306 0.3442 0.3636 0.5669 0.4007 0.4294
G60/P140, mean accuracy: 0.8883 0.8015 0.8259 0.8712 0.8938 0.8183 0.9228 0.9222 0.8927 0.8931 0.9130 0.9224 0.9225
G60/P140, std: 0.0067 0.0163 0.0096 0.0132 0.0076 0.0158 0.0086 0.0082 0.0108 0.0111 0.0082 0.0108 0.0109
G60/P140, best accuracy: 0.9259 0.9421 0.8869 0.9138 0.9354 0.9325 0.9550 0.9556 0.9575 0.9520 0.9519 0.9598 0.9625
G60/P140, mean time (s): 0.9644 0.5658 0.8342 0.7496 0.4482 0.4323 0.6918 0.7266 0.8235 0.9090 1.2268 0.9809 1.0808
Table 8
Performance comparisons of nonlinear methods on the USPS handwritten digits database. The values in each row are listed in the method order: KPCA, KLDA, KLPP, KMMC, KCCA, TR-KLDA, KCMLP-1, KCMLP-2, KCMLP-3, KCMLP-4, TR-KCMLP-2, TR-KCMLP-3, TR-KCMLP-4.

G30/P170, mean accuracy: 0.8434 0.8056 0.7890 0.8696 0.8900 0.7931 0.8966 0.8874 0.8656 0.8623 0.9046 0.8840 0.8892
G30/P170, std: 0.0116 0.0079 0.0087 0.0263 0.0105 0.0102 0.0112 0.0125 0.0129 0.0126 0.0125 0.0129 0.0126
G30/P170, best accuracy: 0.8830 0.9141 0.8682 0.9120 0.9162 0.9145 0.9289 0.9227 0.9171 0.9153 0.9312 0.9221 0.9295
G30/P170, mean time (s): 0.8383 0.6461 0.7034 0.6770 0.4930 0.4708 0.5481 0.6108 0.6817 0.7403 1.2110 0.8424 0.9080
G60/P140, mean accuracy: 0.8742 0.8014 0.8214 0.8969 0.8926 0.8059 0.9219 0.9188 0.8999 0.8882 0.9212 0.9212 0.9139
G60/P140, std: 0.0080 0.0039 0.0088 0.0102 0.0079 0.0064 0.0076 0.0099 0.0128 0.0087 0.0099 0.0128 0.0087
G60/P140, best accuracy: 0.9214 0.9306 0.9050 0.9294 0.9286 0.9325 0.9524 0.9513 0.9489 0.9425 0.9479 0.9505 0.9480
G60/P140, mean time (s): 1.3239 1.0075 1.1484 1.0570 0.7321 0.6839 0.7946 0.8909 1.0065 1.1054 1.8542 1.2877 1.3820
From the results, we also see clearly that too large a number of reduced dimensions d incurs degeneracy in the embeddings of the linear and nonlinear CCA methods to some extent.
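The dimension sweep behind these curves can be sketched as follows. It assumes a learned projection matrix V whose columns are ordered projection directions and, as an assumption, a simple one-nearest-neighbor classifier for evaluating the test set in the reduced feature space; all names are placeholders:

```python
import numpy as np

def one_nn_accuracy(Z_train, y_train, Z_test, y_test):
    """Nearest-neighbor classification in the projected (reduced) space."""
    d2 = ((Z_test[:, None, :] - Z_train[None, :, :]) ** 2).sum(axis=2)
    pred = np.asarray(y_train)[np.argmin(d2, axis=1)]
    return float(np.mean(pred == np.asarray(y_test)))

def accuracy_vs_dimension(V, X_train, y_train, X_test, y_test, dims):
    """Sweep the reduced dimensionality d: project onto the first d columns
    of the projection matrix V and record the accuracy for each d."""
    accs = []
    for d in dims:
        Vd = V[:, :d]                                   # first d projection directions
        accs.append(one_nn_accuracy(X_train @ Vd, y_train,
                                    X_test @ Vd, y_test))
    return accs

# e.g. accuracy_vs_dimension(V, X_tr, y_tr, X_te, y_te, dims=range(1, 31))
```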
The mean and highest accuracies according to the test results of Figs. 30 and 31 are recorded in Tables 7 and 8, respectively. The mean running time and the standard deviation (Std) over repetitions are also reported. From the results, we find the following.
Table 9
Performance comparison (classification accuracy) of each method on nine standard UCI datasets. The values in each row are listed in the method order: PCA, LDA, LPP, MMC, CCA, TR-LDA, DCT-1, DCT-2, DLPP, DLPP/MMC, GoLPP, SPP, SOLPP, CMLP-1, CMLP-2, CMLP-3, CMLP-4, TR-CMLP-2, TR-CMLP-3, TR-CMLP-4. Each dataset is specified by (# dimensions, # samples, # classes, # training samples per class).

Vote (16, 435, 2, 20): 0.5501 0.5389 0.5340 0.5493 0.5374 0.5508 0.5509 0.5460 0.5423 0.5502 0.5468 0.5493 0.5511 0.5568 0.5498 0.5502 0.5517 0.5602 0.5551 0.5547
WDBC (31, 569, 2, 20): 0.6235 0.6297 0.6335 0.6385 0.6271 0.6431 0.6350 0.6432 0.6396 0.6503 0.6475 0.6438 0.6425 0.6512 0.6508 0.6489 0.6478 0.6531 0.6437 0.6459
WOBC (9, 699, 2, 10): 0.5782 0.5692 0.5616 0.5620 0.5702 0.5824 0.5825 0.5846 0.5769 0.5828 0.5819 0.5795 0.5874 0.5891 0.5794 0.5826 0.5817 0.5905 0.5855 0.5917
Heart-statlog (13, 270, 2, 10): 0.4902 0.4892 0.4876 0.4985 0.4894 0.5108 0.4969 0.4937 0.4963 0.5106 0.4968 0.5061 0.5035 0.5207 0.5160 0.5064 0.5098 0.5176 0.5065 0.5046
Control-chart (60, 600, 6, 20): 0.9364 0.8682 0.8359 0.9414 0.8566 0.8897 0.8393 0.8291 0.9020 0.9445 0.9167 0.9313 0.9504 0.9514 0.9521 0.9412 0.9409 0.9479 0.9504 0.9506
Lung-cancer (56, 32, 3, 6): 0.4644 0.4765 0.4732 0.4847 0.4759 0.5007 0.4927 0.4849 0.4860 0.5006 0.4958 0.4972 0.5068 0.5196 0.4852 0.4960 0.4874 0.5164 0.5136 0.5123
Sonar (60, 208, 2, 20): 0.6977 0.6706 0.6824 0.6938 0.6435 0.6856 0.6784 0.6846 0.6995 0.7063 0.7012 0.7045 0.7104 0.7135 0.7095 0.7044 0.7071 0.7263 0.7111 0.7097
Tic-Tac-Toe (9, 958, 2, 40): 0.3961 0.3948 0.3895 0.3993 0.3910 0.4018 0.3877 0.3983 0.3951 0.4022 0.3998 0.3986 0.4017 0.4050 0.4029 0.4031 0.3994 0.3977 0.4024 0.4015
Soybean (35, 47, 4, 6): 0.9380 0.9807 0.9453 0.9830 0.9760 0.9820 0.9682 0.9426 0.9616 0.9865 0.9789 0.9756 0.9851 0.9854 0.9789 0.9829 0.9824 0.9920 0.9858 0.9874
(1) The numerical mean and best results shown in the tables are consistent with those in the above figures in terms of performance superiority. (2) The standard deviations over repetitions produced by the methods are comparable, implying that the stability of these methods on this dataset is similar. (3) Considering running time, our linear CMLP family is capable of delivering results competitive with the other methods, and our nonlinear methods are also capable of achieving results comparable to, or even better than, the other methods in some cases. (4) In most cases, the overall performance of the nonlinear algorithms exceeds that of the corresponding linear methods.

5.6. Classification on UCI datasets

This section further evaluates our linear CMLP family algorithms by classifying nine standard UCI datasets: Vote, WDBC, WOBC, Heart-statlog, Control-chart, Lung-cancer, Sonar, Tic-Tac-Toe and Soybean. In this study, the classification performance is compared with DLPP [49], DLPP/MMC [50], GoLPP, Sparsity Preserving Projections (SPP) [27], SOLPP [53], the Discrete Cosine Transform (DCT), including DCT-1 and DCT-2 [51], LPP, PCA, LDA, MMC, CCA and TR-LDA. For each method, the k number is set to 9 for each dataset. For our methods, 80% of the constraints are used. The control parameter in the DLPP/MMC objective is set to 0.5. The specifications of the UCI datasets are described in Table 9. We report the averaged results over 20 random splits of training and test samples in Table 9. We observe from Table 9 that our techniques deliver comparable or even better results than the other methods, including the class-label or sparsity-representation guided LPP extensions, i.e., DLPP, DLPP/MMC, SPP and SOLPP, in most cases. DCT-1 and DCT-2 perform better than PCA in most cases, which is consistent with [52]. We also notice that these LPP extensions outperform the original LPP algorithm in most cases.

6. Concluding remarks

This article discusses the constrained locality preserving marginal dimensionality reduction problem. In this setting, neighborhood relations are identified as inter- or intra-class according to the graph-induced pairwise constraints. Two adjacency matrices are then constructed over two new graphs, the ML- and CL-graphs, to reflect the proximity relations of the vertices. Finally, a constrained marginal projection criterion named CMLP is derived. Four solution schemes and an improvement strategy are proposed to obtain the projection axes. CMLP has four major benefits. First, CMLP works directly on the ML- and CL-graphs and only constrained points are used for learning the projections, so CMLP is computationally efficient. Second, with the marginal criteria, enhanced inter-cluster separation and intra-cluster compactness are delivered; specifically, large margins between both inter- and intra-class clusters can be obtained without losing the multimodal structures. Third, mathematical comparisons between our work and related works in terms of objective functions and solutions indicate that our CMLP criteria are more general and offer certain advantages over some widely used discriminant locality preserving techniques. We examine the CMLP criteria and their nonlinear versions through extensive simulations over 15 standard datasets. Visualization of the projection matrices shows that our methods tend to capture detailed information about the faces. In visualizing the glass identification dataset, the CMLP criteria organize more separated embeddings of inter-class similarity points with the multimodal structures clearly kept. For face visualization, the local information of the face data is effectively kept by our techniques. For image recognition on the UMIST, MIT CBCL and USPS datasets, four settings over the parameters are systematically evaluated, and in all investigated cases the overall performance of our methods is comparable to or even better than that of some widely used linear and nonlinear techniques; this is the fourth benefit of our criteria. We must note, however, that the selection of the optimal reduced dimension remains an open problem. Investigating the effects of constraints on our approaches shows that our methods rely on the selected constraints to some extent, so how to choose informative constraints still needs to be addressed. We also observe that the performance of our methods and LPP is closely associated with the k values, while the determination of the optimal k is also an open problem. Therefore, designing new effective graph constructions, or employing existing adaptive ones, for our algorithms is also worth studying in future work.
Acknowledgments

The authors would like to express their sincere thanks to the anonymous reviewers for their comments and suggestions, which have helped raise the paper to a higher standard. The work described in this paper was partially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region (SAR), China (CityU8/CRF/09).
Appendix A. Proof of Theorem 1

By performing singular value decomposition (SVD) on the matrix $X\tilde{D}^{(ML)}X^{T}$, we have
$$X\tilde{D}^{(ML)}X^{T} = U\begin{pmatrix} S_t^{2} & 0\\ 0 & 0\end{pmatrix}U^{T}, \qquad (47)$$
where $U$ is an orthogonal matrix and $S_t^{2}$ is a diagonal matrix. Let $U=[U_1,U_2]$ denote a partition of $U$ such that $U_1\in\mathbb{R}^{n\times t}$ and $U_2\in\mathbb{R}^{n\times(n-t)}$, where $t$ is the rank of $X\tilde{D}^{(ML)}X^{T}$ and $U_2$ lies in the null space of $X\tilde{D}^{(ML)}X^{T}$, satisfying $U_2^{T}X\tilde{D}^{(ML)}X^{T}U_2=0$. Since $\tilde{L}_b$ is a symmetric matrix, it can be decomposed by the Cholesky decomposition as $\tilde{L}_b=\tilde{G}\tilde{G}^{T}$, where $\tilde{G}$ is a lower triangular matrix. If we let $\tilde{H}_b=X\tilde{G}$, then $X\tilde{L}_bX^{T}=\tilde{H}_b\tilde{H}_b^{T}$. Denote $H=S_t^{-1}U_1^{T}\tilde{H}_b$ and let $H=P\Sigma_bQ^{T}$ be the SVD of $H$, where $P$ and $Q$ are orthogonal matrices and $\Sigma_b$ is a diagonal matrix of rank $q$. We then obtain
$$S_t^{-1}U_1^{T}\tilde{H}_b\tilde{H}_b^{T}U_1S_t^{-1} = HH^{T} = P\Sigma_b^{2}P^{T}. \qquad (48)$$
Therefore, according to the formulations in Eqs. (47) and (48), we obtain
$$\big(X\tilde{D}^{(ML)}X^{T}\big)^{\dagger}\big(X\tilde{L}_bX^{T}\big) = U\begin{pmatrix} S_t^{-2} & 0\\ 0 & 0\end{pmatrix}U^{T}\,\tilde{H}_b\tilde{H}_b^{T} = U\begin{pmatrix} S_t^{-1}P\Sigma_b^{2}P^{T}S_t & 0\\ 0 & 0\end{pmatrix}U^{T}. \qquad (49)$$
Then, if we let $X^{*}_{CMLP\text{-}3}=U_1S_t^{-1}P_q$, where $P_q$ consists of the first $q$ columns of $P$ (only the first $q$ diagonal elements of $\Sigma_b$ are nonzero), we have $\big(X\tilde{D}^{(ML)}X^{T}\big)^{\dagger}\big(X\tilde{L}_bX^{T}\big)X^{*}_{CMLP\text{-}3}=X^{*}_{CMLP\text{-}3}\Sigma_b^{2}$, which indicates that $X^{*}_{CMLP\text{-}3}$ contains the optimal projection bases of CMLP-3. This completes the proof of Theorem 1.

Appendix B. Two-stage approach for optimization

In the TS approach, we first solve a weighted least squares (WLS) problem by regressing $X$ on $Y^{T}$, i.e., projecting the original high-dimensional dataset into the low-dimensional subspace. We can then calculate an auxiliary matrix $M\in\mathbb{R}^{d\times d}$ and its SVD. Finally, the optimal projection matrix can be obtained from the SVD of $M$. It is noted that the size of the matrix $M$ is small; as a result, the computational cost of calculating the SVD of $M$ is relatively low. The basic steps of the two-stage approach can be summarized as follows. Solve the weighted least squares problem $\min_{\Theta}\big\|(\tilde{D}^{(ML)})^{1/2}(Y^{T}-X^{T}\Theta)\big\|_F^{2}$. Let $\hat{X}=\Theta^{T}X$ and calculate the auxiliary matrix as $M=\hat{X}\tilde{D}^{(ML)}Y^{T}$. We can then calculate the SVD of $M$ as $M=U_MS_MU_M^{T}$ and set $V^{*}_M=U_MS_M^{-1/2}$. The optimal projection matrix is given by $V^{*}_{TS}=\Theta V^{*}_M$.

Next we elaborate that the optimal projection matrix $V^{*}_{TS}$ obtained by the two-stage approach is equivalent to that in Eq. (49). By solving the WLS problem $\min_{\Theta}\big\|(\tilde{D}^{(ML)})^{1/2}(Y^{T}-X^{T}\Theta)\big\|_F^{2}$, we have $\Theta=(X\tilde{D}^{(ML)}X^{T})^{-1}X\tilde{D}^{(ML)}Y^{T}$. We thus have $\hat{X}=\Theta^{T}X=Y\tilde{D}^{(ML)}X^{T}(X\tilde{D}^{(ML)}X^{T})^{-1}X$, and the auxiliary matrix $M$ can be given by
$$M=\hat{X}\tilde{D}^{(ML)}Y^{T}=Y\tilde{D}^{(ML)}X^{T}\big(X\tilde{D}^{(ML)}X^{T}\big)^{-1}X\tilde{D}^{(ML)}Y^{T}=\tilde{H}_b^{T}U_1S_t^{-1}S_t^{-1}U_1^{T}\tilde{H}_b, \qquad (50)$$
where the last equality holds because $\tilde{H}_b=X\tilde{D}^{(ML)}Y^{T}$ and $(X\tilde{D}^{(ML)}X^{T})^{-1}=U_1S_t^{-2}U_1^{T}$. Because $H=S_t^{-1}U_1^{T}\tilde{H}_b$ and its SVD is $H=P\Sigma_bQ^{T}$, we have $M=H^{T}H=Q\Sigma_b^{2}Q^{T}$. This indicates that $Q\Sigma_b^{2}Q^{T}$ is the SVD of $M$; we thus have $V^{*}_M=Q\Sigma_b^{-1}$, and the optimal projection matrix obtained by the two-stage approach is
$$V^{*}_{TS}=\Theta V^{*}_M=\big(X\tilde{D}^{(ML)}X^{T}\big)^{-1}X\tilde{D}^{(ML)}Y^{T}Q\Sigma_b^{-1}=U_1S_t^{-1}\big(S_t^{-1}U_1^{T}\tilde{H}_b\big)Q\Sigma_b^{-1}=U_1S_t^{-1}P\Sigma_bQ^{T}Q\Sigma_b^{-1}=U_1S_t^{-1}P, \qquad (51)$$
which is exactly equivalent to the optimal solution obtained by the direct eigen-decomposition approach in Eq. (49).

References
[1] X. He, S. Yan, Y. Hu, P. Niyogi, H. Zhang, Face recognition using Laplacianfaces, IEEE Transactions on Patten Analysis and Machine Intelligence 27 (3) (2005) 228–340. [2] D.Q. Zhang, Z.H. Zhou, S.C. Chen, Semi-supervised dimensionality reduction, in: Proceedings of the 7th SIAM International Conference on Data Mining (SDM’07), Minneapolis, MN, 2007, pp. 629–634. [3] M.S. Baghshah, S.B. Shouraki, Non-linear metric learning using pairwise similarity and dissimilarity constraints and the geometrical structure of data, Pattern Recognition 43 (2010) 2982–2992. [4] S. Roweis, L. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science 290 (5500) (2000) 2323–2326. [5] M. Belkin, P. Niyogi, Laplacian eigenmaps for dimensionality reduction and data representation, Neural Computation 15 (6) (2003) 1373–1396. [6] M.A. Turk, A.P. Pentland, Face recognition using eigenfaces, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1991. [7] H.F. Hu, Orthogonal neighborhood preserving discriminant analysis for face recognition, Pattern Recognition 41 (2008) 2045–2054. [8] A.M. Martinez, A.C. Kak, PCA versus LDA, IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (2) (2001) 228–233. [9] M. Sugiyama, Dimensionality reduction of multimodal labeled data by local Fisher discriminant analysis, Journal of Machine Learning Research 8 (2007) 1027–1061. [10] M.B. Zhao, Z. Zhang, W.S. Chow, Trace ratio criterion based generalized discriminative learning for semi-supervised dimensionality reduction, Pattern Recognition 45 (4) (2012) 1482–1499. [11] L.M. Zhang, L.S. Qiao, S.C. Chen, Graph-optimized locality preserving projections, Pattern Recognition 43 (6) (2010) 1993–2002. [12] T. Jebara, J. Wang, S.-F.Chang, Graph construction and b-matching for semisupervised learning, in: Proceedings of the 26th Annual International Conference on Machine Learning, New York, NY, USA, 2009. [13] Y. Liu, J. Rong, Distance Metric Learning: A Comprehensive Survey, 2006. [14] L. Maaten, E. Postma, J. Herik, Dimensionality Reduction: A Comparative Review, Tilburg University Technical Report, TiCC-TR 2009-005, 2009. [15] X. He, D. Cai, S. Yan, H. Zhang, Neighborhood preserving embedding, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), Beijing, China, 2005, pp. 1208–1213. [16] J. Yang, A.F. Frangi, J.Y. Yang, D. Zhang, Z. Jin, KPCA plus LDA: a complete kernel fisher discriminant framework for feature extraction and recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (2) (2005) 234–244. [17] D. Sun, D.Q. Zhang, Bagging constraint score for feature selection with pairwise constraints, Pattern Recognition 43 (6) (2010) 2106–2118.
[18] D.Q. Zhang, S.C. Chen, Z.H. Zhou, Constraint score: a new filter method for feature selection with pairwise constraints, Pattern Recognition 41 (5) (2008) 1440–1451. [19] M.S. Baghshah, S.B. Shouraki, Semi-supervised metric learning using pairwise constraints, in: Proceedings of the International Joint Conferences on Artificial Intelligence, 2009, pp. 1217–1222. [20] Y. Bengio, J.F. Paiement, P. Vincent, O. Delalleau, N.L. Roux, M. Ouimet, Outof-sample extensions for LLE, ISOMAP, MDS, Laplacian Eigenmaps, and spectral clustering, in: Proceedings of the Advances in Neural Information Processing Systems, vol. 16, 2004, pp.177–184. [21] K. Fukunaga, Introduction to statistical pattern recognition, second edition, Academic Press, New York, 1990. [22] Y. Xu, A.N. Zhong, J. Yang, D. Zhang, LPP solution schemes for use with face recognition, Pattern Recognition 43 (12) (2010) 4165–4176. [23] Y. Guo, S. Li, J. Yang, T. Shu, L. Wu, A generalized Foley-Sammon transform based on generalized fisher discriminant criterion and its application to face recognition, Pattern Recognition Letter 24 (1–3) (2003) 147–158. [24] H. Wang, S. Yan, D. Xu, X. Tang, T. Huang, Trace ratio vs. ratio trace for dimensionality reduction, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, 2007, pp. 1–8. [25] Y. Jia, F. Nie, C.S. Zhang, Trace ratio problem revisited, IEEE Transactions on Neural Network 20 (4) (2009) 729–735. [26] T. Sun, S. Chen, Class label versus sample label-based CCA, Applied Mathematics and Computation 185 (1) (2007) 272–283. [27] L.S. Qiao, S.C. Chen, X.Y. Tan, Sparsity preserving projections with applications to face recognition, Pattern Recognition 43 (1) (2010) 331–341. [28] T. Strutz, Data fitting and uncertainty: a practical introduction to weighted least squares and beyond, Viewegþ Teubner Verlag, 2010. [29] B. Weyrauch, J. Huang, B. Heisele, V. Blanz, Component-based face recognition with 3D morphable models, in: Proceedings of the 1st IEEE Workshop on Face Processing in Video, Washington, D.C., 2004. [30] L. Sun, S.W. Ji, J.P. Ye, A least squares formulation for canonical correlation analysis, in: Proceedings of the 25th International Conference on Machine learning, Helsinki, Finland, 2008, pp. 1024–1031. [31] B. Yang, S.C. Chen, Sample-dependent graph construction with application to dimensionality reduction, Neurocomputing 74 (1–3) (2010) 301–314. [32] L. Sun, B. Ceran, J.P. Ye, A scalable two-stage approach for a class of dimensionality reduction techniques, in: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2010, pp. 313–322. [33] L. Sun, S.W. Ji, J.P. Ye, Hypergraph spectral learning for multi-label classification, in: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008, pp. 668–676. [34] E. Kokiopoulou, Y. Saad, Orthogonal neighborhood preserving projections: a projection-based dimensionality reduction technique, IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (12) (2007) 2143–2156. ¨ [35] B. Scholkopf, A. Smola, Learning with kernels, MIT Press, Cambridge, MA, 2002, pp. 25–55. [36] S. Yan, D. Xu, B. Zhang, H. Zhang, Q. Yang, S. Lin, Graph embedding and extensions: a general framework for dimensionality reduction, IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (1) (2007) 40–51.
[37] H. Chen, H. Chang, T. Liu, Local discriminant embedding and its variants, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2005, pp. 846–853.
[38] D. Cai, X. He, K. Zhou, J. Han, H. Bao, Locality sensitive discriminant analysis, in: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), Hyderabad, India, 2007, pp. 708–713.
[39] D.B. Graham, N.M. Allinson, Characterizing virtual eigensignatures for general purpose face recognition, in: Face Recognition: From Theory to Applications, vol. 163 of NATO ASI Series F, Computer and Systems Sciences, 1998, pp. 446–456.
[40] D.R. Hardoon, S. Szedmak, J. Shawe-Taylor, Canonical correlation analysis: an overview with application to learning methods, Neural Computation 16 (12) (2004) 2639–2664.
[41] J. Hull, A database for handwritten text recognition research, IEEE Transactions on Pattern Analysis and Machine Intelligence 16 (5) (1994) 550–554.
[42] L. Zelnik-Manor, P. Perona, Self-tuning spectral clustering, in: L.K. Saul, Y. Weiss, L. Bottou (Eds.), Advances in Neural Information Processing Systems, vol. 17, MIT Press, Cambridge, MA, 2005, pp. 1601–1608.
[43] L.S. Qiao, S.C. Chen, X.Y. Tan, Sparsity preserving discriminant analysis for single training image face recognition, Pattern Recognition Letters 31 (5) (2010) 422–429.
[44] H. Li, T. Jiang, K. Zhang, Efficient and robust feature extraction by maximum margin criterion, IEEE Transactions on Neural Networks 17 (1) (2006) 157–165.
[45] F. Song, D. Zhang, D. Mei, Z. Guo, A multiple maximum scatter difference discriminant criterion for facial feature extraction, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 37 (6) (2007) 1599–1606.
[46] Y. Guo, J. Gao, P. Kwan, Twin kernel embedding, IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (8) (2008) 1490–1495.
[47] S.I. Daitch, J.A. Kelner, D.A. Spielman, Fitting a graph to vector data, in: Proceedings of the 26th Annual International Conference on Machine Learning, New York, NY, USA, 2009.
[48] L.M. Zhang, S.C. Chen, L.S. Qiao, Graph optimization for dimensionality reduction with sparsity constraints, Pattern Recognition 45 (2012) 1205–1210.
[49] W.W. Yu, X.L. Teng, C.Q. Liu, Face recognition using discriminant locality preserving projections, Image and Vision Computing 24 (2006) 239–248.
[50] G.F. Lu, Z. Lin, Z. Jin, Face recognition using discriminant locality preserving projections based on maximum margin criterion, Pattern Recognition 43 (2010) 3572–3579.
[51] N. Ahmed, T. Natarajan, K.R. Rao, Discrete cosine transform, IEEE Transactions on Computers (1974) 90–93.
[52] M. Faundez-Zanuy, J. Roure, V. Espinosa-Duro, J.A. Ortega, An efficient face verification method in a transformed domain, Pattern Recognition Letters 28 (2007) 854–858.
[53] W.K. Wong, H.T. Zhao, Supervised optimal locality preserving projection, Pattern Recognition 45 (2012) 186–197.
Zhao Zhang is currently working toward the Ph.D. degree at the Department of Electronic Engineering, City University of Hong Kong, Kowloon, Hong Kong. His current interests include machine learning, pattern recognition and their applications.
Mingbo Zhao is currently pursuing the Ph.D. degree at the Department of Electronic Engineering, City University of Hong Kong, Hong Kong. His current interests include pattern recognition and its applications.
Tommy W.S. Chow (IEEE M'93–SM'03) received the B.Sc. (First Class Hons.) and Ph.D. degrees from the University of Sunderland, Sunderland, UK. He is currently a Professor in the Department of Electronic Engineering, City University of Hong Kong. His research interests are in the area of machine learning, including supervised and unsupervised learning, data mining, pattern recognition and fault diagnostics. He serves as an Associate Editor of Pattern Analysis and Applications and the International Journal of Information Technology. In professional activities, he has been an active Committee Member of the HKIE (Hong Kong Institution of Engineers) Control, Automation and Instrumentation (CAI) Division since 1992, and was the Division Chairman (1997–98) of the HKIE CAI Division. He has authored or coauthored over 120 technical papers in international journals, 5 book chapters, and over 60 technical papers in international conference proceedings.