Stable local dimensionality reduction approaches


Pattern Recognition 42 (2009) 2054 -- 2066


Chenping Hou a,b,∗, Changshui Zhang b, Yi Wu a, Yuanyuan Jiao a

a Department of Mathematics and Systems Science, National University of Defense Technology, Changsha 410073, China
b State Key Laboratory of Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology (TNList), Department of Automation, Tsinghua University, Beijing 100084, China

∗ Corresponding author at: Department of Mathematics and Systems Science, National University of Defense Technology, Changsha 410073, China. E-mail address: [email protected] (C. Hou).

ARTICLE INFO

Article history: Received 9 July 2008; received in revised form 9 December 2008; accepted 12 December 2008.

Keywords: Dimensionality reduction; Manifold learning; Locally linear embedding; Laplacian eigenmaps; Local tangent space alignment

ABSTRACT

Dimensionality reduction is a central challenge in many areas. A large number of local approaches, stemming from statistics or geometry, have been developed. In practice, however, these local approaches often lack robustness: unlike maximum variance unfolding (MVU), which explicitly unfolds the manifold, they merely characterize local geometric structure. Moreover, the eigenproblems that they encounter are hard to solve accurately. We propose a unified framework that explicitly unfolds the manifold and reformulates local approaches as semi-definite programs instead of the above-mentioned eigenproblems. Three well-known algorithms, locally linear embedding (LLE), Laplacian eigenmaps (LE) and local tangent space alignment (LTSA), are reinterpreted and improved within this framework. Several experiments are presented to demonstrate the potential of our framework and the improvements of these local algorithms.

1. Introduction

The problem of dimensionality reduction arises in many fields, such as multivariate analysis, machine learning, pattern recognition, image processing and scientific visualization. It assumes that the high dimensional data lie on a low dimensional manifold. For each high dimensional point, we aim to discover a meaningful or expected representation on the low dimensional manifold [1]. Mathematically, given N points $x_1, \ldots, x_N$ in a high dimensional subspace of $\mathbb{R}^D$, the goal of dimensionality reduction is to find a mapping $f: \mathbb{R}^D \to \mathbb{R}^d$,

$y_i = f(x_i), \quad i = 1, \ldots, N,$  (1)

where $y_i \in \mathbb{R}^d$, $i = 1, \ldots, N$, and d is the dimensionality of the embedding space. However, it is hard to express the mapping explicitly in real applications, and we only need to find the low dimensional embeddings $\{y_i \mid i = 1, \ldots, N\}$. Methodologically, there are mainly two kinds of dimensionality reduction approaches [1]: global and local. Global methods mainly include principal component analysis (PCA) [2], multidimensional scaling (MDS) [3] and its variant Isomap [4], and maximum variance unfolding (MVU) [5,6].


Local approaches try to preserve local structures, e.g., the local approximation coefficients of locally linear embedding (LLE) [7,8]. This class mainly includes LLE, Laplacian eigenmaps (LE) [9,10] and local tangent space alignment (LTSA) [11]. Despite their successful empirical results and rigorous theoretical analysis, these local techniques also have some intrinsic deficits. One of the most serious problems is that they are not stable with respect to either the parameters or noise [12,13]: even a small perturbation of the parameters or of the data points can cause large changes in the embeddings. In contrast, it has been shown that MVU, a global method that explicitly unfolds the manifold, is relatively insensitive to the parameters and to small noise [1,5]. Several works have addressed this problem. Chang et al. proposed a robust modification of LLE, RLLE [15]. They provided an efficient algorithm to detect and remove the large noises, namely the outliers. However, RLLE still fails when the data contain small noise. Wang et al. developed a multiple-weights LLE, MLLE [16], which uses the (k−d) linearly independent combination weights to represent the local structure. In real applications, the points are often reduced to 2D or 3D; MLLE may fail completely if the data points do not lie on or close to a 2D or 3D nonlinear manifold. Chen et al. proposed an effective preprocessing procedure for current manifold learning algorithms [12]. They analyzed the input data statistically and then detected the noises. Although this procedure is efficient in some real applications, it pays no attention to the learning methods themselves, and thus the intrinsic deficits of local approaches are not eliminated. De Ridder et al. also addressed the robustness problem of LLE [13] by introducing a weighted reformulation in the embedding step. However, their work mainly focuses on probabilistic subspace mixture models; LLE is only used as a supplementary technique.


Considering these approaches, we find that there are mainly two reasons for the deficiencies of local approaches. (1) Local approaches are only concerned with local geometric characteristics. Unlike MVU, they neglect the expected distance configurations and cannot unfold the manifold explicitly. (2) The inferior performance of local techniques also arises from the eigenproblems that they encounter. Typically, the smallest eigenvalues in these problems are very small (around $10^{-7}$ or smaller), whereas the largest eigenvalues are fairly large (around $10^{2}$ or larger). It is hard to accurately derive the eigenvectors corresponding to the several smallest nonzero eigenvalues, even for state-of-the-art eigensolvers [14].
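As a small numerical illustration of this sensitivity (on a synthetic matrix of our own choosing, not data from the paper), perturbing a symmetric matrix whose spectrum spans roughly $10^{-7}$ to $10^{2}$ disturbs the eigenvectors of the smallest eigenvalues far more than those of the largest:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
vals = np.concatenate([np.logspace(-7, -6, 5), np.logspace(0, 2, n - 5)])
M = (Q * vals) @ Q.T                       # symmetric, eigenvalues from ~1e-7 to ~1e2

E = 1e-6 * rng.standard_normal((n, n))
E = (E + E.T) / 2                          # tiny symmetric perturbation
w1, V1 = np.linalg.eigh(M)
w2, V2 = np.linalg.eigh(M + E)

# |cosine| between matched eigenvectors: values near 1 mean the direction is preserved
small = np.abs(np.sum(V1[:, :5] * V2[:, :5], axis=0))    # eigenvectors of the 5 smallest eigenvalues
large = np.abs(np.sum(V1[:, -5:] * V2[:, -5:], axis=0))  # eigenvectors of the 5 largest eigenvalues
print(small.round(3), large.round(3))      # the small-eigenvalue eigenvectors move, the large ones barely change
```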

The motivation of this paper is the central idea of MVU, i.e., explicitly unfolding the manifold and employing a kernel transformation to avoid solving the above-mentioned eigenproblems. We propose a new framework to improve the performance of local approaches. An item that explicitly unfolds the manifold is intentionally added. Moreover, we reformulate the original method as a semi-definite program through a kernel transformation. The embeddings, which are derived from the eigenvectors corresponding to the d most significant eigenvalues, are robust to both parameters and noise. There are mainly two contributions that should be highlighted.

• The first contribution is a unified framework in which we can analyze and improve previous local approaches. The original and improved approaches are all formulated as semi-definite problems.
• The second contribution is to show that the proposed framework can be used as a general platform to improve the performance of existing algorithms. Three well-known algorithms, LLE, LE and LTSA, are analyzed and improved, and the intrinsic drawbacks of these methods are compensated to some extent. For convenience, we call the improved LLE, LE and LTSA stable LLE (SLLE), stable LE (SLE) and stable LTSA (SLTSA), respectively.

The rest of this paper is organized as follows. In Section 2, the proposed framework for unfolding the manifold is introduced. In Section 3, LLE, LE and LTSA are reinterpreted and improved within this framework, and some preliminary discussions are also provided. We present several promising experiments in Section 4. The conclusions and further research directions are discussed in Section 5.

2. Framework for unfolding the manifold

There are mainly two intuitions behind this framework. First, all dimensionality reduction methods share a common spirit: they all aim to reconstruct the manifold while preserving some properties of the original data. However, local approaches merely try to characterize the local geometric structures; they do not consider how to explicitly unfold the manifold. Additionally, it is commonly recognized [17] that the solution to this problem is closely related to searching for a symmetric semi-positive definite kernel; the Euclidean coordinates can then be obtained from the spectral factorization of the optimal kernel matrix. Second, another intuition is evoked by Tikhonov regularization [18]. Given l pairs of training points $(x_i, y_i)$, $i = 1, 2, \ldots, l$, a problem that uses the Tikhonov regularization technique tries to minimize

$\arg\min_{f \in H} \; \frac{1}{l} \sum_{i=1}^{l} V(f(x_i), y_i) + \lambda \|f\|_K^2,$  (2)

where f represents a function that satisfies some constraints, H is the collection of such functions, and V is called the loss function. The last item in Eq. (2) is the regularization term; K is the kernel matrix derived from $\{x_i, i = 1, 2, \ldots, l\}$, and $\lambda$ is a tuning parameter that balances the effects of the two items. Adding $\lambda \|f\|_K^2$ makes the optimization procedure stable and, at the same time, avoids over-fitting [17].

Recalling the mathematical expression of dimensionality reduction in Eq. (1) and the formulation of Tikhonov regularization in Eq. (2), we introduce a common framework. Evoked by the basic idea of the stable dimensionality reduction method MVU, the framework, which explicitly unfolds the manifold, has the following form:

$Y^* = \arg\min_{Y \in H_Y} \; \sum_{i=1}^{N} V(\phi(x_i), y_i) - \lambda D(Y).$  (3)

Here, $Y = [y_1, y_2, \ldots, y_N]$ is the d×N matrix formed by the embeddings of all points. $\phi(x_i)$ represents the local properties derived from the original data, e.g., the local linear reconstruction behavior in LLE; it is just a formal function that represents the local properties, and we will show its concrete form for different methods in the next section. V is the loss function, e.g., the squared loss in LLE. D(Y) is a function of Y that is intentionally added to explicitly unfold the manifold. $H_Y$ consists of the matrices Y which satisfy some common constraints, such as centering. $\lambda$ is a positive trade-off constant which balances the weights of the two items. Similarly to MVU, one can explicitly unfold the manifold by maximizing the sum of distances. In the following derivations, we define D(Y) in one of two ways:

$D_l(Y) = \sum_{i=1}^{N} \sum_{j \sim i} \|y_i - y_j\|^2,$  (4)

$D_g(Y) = \sum_{i=1}^{N} \sum_{j=1}^{N} \|y_i - y_j\|^2.$  (5)

Here j ∼ i in Eq. (4) indicates that j is a neighboring point of i. Thus, $D_l(Y)$ mainly concerns the distances between nearby points. In contrast, $D_g(Y)$ in Eq. (5) enlarges the distances between any two points; it is the same as the objective function of MVU. As mentioned above, the instabilities of these local approaches can also be attributed to the factorization of an abnormal matrix, i.e., a matrix with both very large and very small eigenvalues. Here, we employ the simple transformation $K = Y^T Y$ and reformulate Eq. (3) in terms of the elements of K. Through this transformation, the results are eventually derived by the spectral decomposition of an optimal kernel matrix. Compared with the decomposition of the abnormal matrices in the original methods, the optimal kernel matrices can be accurately factorized [5,14]. We formally give the kernel expression of Eq. (3) as follows:

$K^* = \arg\min_{K \in H_K} \; \sum_{i=1}^{N} V_K(\phi(x_i), K) - \lambda D_K(K).$  (6)

Here $V_K$ and $D_K$ are the functions derived from V and D through $K = Y^T Y$. Recalling the expressions of $D_l(Y)$ and $D_g(Y)$ in Eqs. (4) and (5), their corresponding kernel formulations are

$D_{K_l}(K) = \sum_{i=1}^{N} \sum_{j \sim i} (K_{ii} - K_{ij} - K_{ji} + K_{jj}),$  (7)

$D_{K_g}(K) = \sum_{i=1}^{N} \sum_{j=1}^{N} (K_{ii} - K_{ij} - K_{ji} + K_{jj}).$  (8)
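As a concrete reading of Eqs. (4), (5) and (8), here is a minimal NumPy sketch; the d×N array layout and the per-point neighbor index lists are our assumptions rather than notation from the paper.

```python
import numpy as np

def D_l(Y, neighbors):
    """Eq. (4): sum of squared distances between neighboring embeddings.
    Y is d x N; neighbors[i] holds the indices j with j ~ i."""
    return sum(float(np.sum((Y[:, [i]] - Y[:, neighbors[i]]) ** 2))
               for i in range(Y.shape[1]))

def D_g(Y):
    """Eq. (5): sum of squared distances over all pairs of embeddings."""
    N = Y.shape[1]
    sq = np.sum(Y ** 2, axis=0)              # ||y_i||^2
    return float(2 * N * sq.sum() - 2 * np.sum(Y.T @ Y))

def D_Kg(K):
    """Eq. (8): the same quantity written in terms of K = Y^T Y."""
    N = K.shape[0]
    return float(2 * N * np.trace(K) - 2 * K.sum())

# quick consistency check: D_g(Y) equals D_Kg(Y^T Y)
Y = np.random.default_rng(1).standard_normal((2, 30))
assert np.isclose(D_g(Y), D_Kg(Y.T @ Y))
```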


We now explain why local approaches can be improved within this framework.

• Without loss of generality, we can assume that $\sum_{i=1}^{N} y_i = 0$; thus $\sum_{i=1}^{N}\sum_{j=1}^{N} K_{ij} = 0$ and

$\mathrm{tr}(K) = \frac{1}{2N} \sum_{i=1}^{N}\sum_{j=1}^{N} (K_{ii} - K_{ij} - K_{ji} + K_{jj}) = \frac{1}{2N} D_{K_g}(K).$  (9)

Considering the expression of tr(K) in Eq. (9), $D_{K_g}(K)$ in Eq. (8) (and approximately $D_{K_l}(K)$ in Eq. (7)) can be regarded as a special regularizer in the ambient space [19]. Recalling the central idea of Belkin's graph regularization framework in semi-supervised learning, it can play the same role as regularization. Moreover, it imposes smoothness constraints on the mapping f in Eq. (1), and thus the influences of noise and of the parameters are reduced (see [19] for more discussion).
• With the same purpose as Tikhonov regularization [18], we can also regard $D_K(K)$ as a Tikhonov regularizer. It plays the same role as regularization and avoids over-fitting, so the influence of inaccurate local information is reduced [19].
• Through the kernel transformation and factorization, most of the valuable information tends to concentrate in the first d largest eigenvalues of K, because $\mathrm{rank}(K) \le d$ (since $K = Y^T Y$ and $Y \in \mathbb{R}^{d\times N}$). We only pick the eigenvectors corresponding to these eigenvalues. Even when the local information is slightly inaccurate (mainly because of noise or unsuitable parameters), this has little effect on these large eigenvalues and the corresponding eigenvectors. Thus, in these situations, our improved methods perform better than the original local approaches.
• Through the kernel transformation, we need not solve the previous eigenproblems. Instead, we factorize the optimal semi-definite kernel matrix and pick the eigenvectors corresponding to the d most significant eigenvalues. Generally, current eigensolvers derive the "large" eigenvectors more accurately than the small ones [17], so it is more suitable to derive embeddings from the eigenvectors corresponding to the large eigenvalues. Therefore, the inaccuracy induced by the previous spectral decompositions is also reduced.

Intuitively, if we try to unfold a manifold, the distortions caused by the inaccurate determination of local configurations become small. In other words, we do not characterize the local configurations merely by the inaccurate local descriptions; combining them with unfolding the manifold reduces the influence of noise and of badly determined parameters. There is also another intuition: since the addition of $D_{K_g}(K)$ or $D_{K_l}(K)$ tends to unfold the manifold, the coefficients which characterize the local properties of distant points (commonly, large noises or unnecessary neighbors) shrink sharply. Hence, the influences of noise and neighborhood size are also reduced. Moreover, compared with representations of local geometry, distance characterizations seem to be relatively robust to noise. This is supported by the comparative stability of the distance-based methods, e.g., PCA [2], Isomap [4] and MVU [5].

3. Stable LLE, LE and LTSA

In this section, within the proposed framework, we first analyze and then improve three local approaches: LLE, LE and LTSA. The common algorithm procedure of SLLE, SLE and SLTSA is then summarized.

Finally, we provide some preliminary discussions about the determination of the parameters and the computational complexity.

3.1. Stable LLE (SLLE)

We first analyze LLE and provide the concrete form of $\phi(x_i)$; based on these interpretations, SLLE is then proposed. The main idea of LLE is that any point on the manifold can be linearly reconstructed from its nearby neighbors, and that these reconstruction weights are preserved during dimensionality reduction. We introduce LLE only briefly; see [7,8] for more details. There are mainly three steps in LLE. (1) Construct a connected graph whose nodes are the input points $x_i$ and whose edges connect each point to its k nearest neighbors. (2) Reconstruct $x_i$ from its neighbors through linear approximation: if $x_j$ is connected with $x_i$, the element $W_{ij}$ is the contribution of $x_j$ in reconstructing $x_i$; otherwise, $W_{ij} = 0$ (a minimal sketch of this weight computation is given after Eq. (10)). (3) Compute the low dimensional representations by seeking points in the low dimensional space that preserve these neighborhood relationships. Since the local affine coefficient matrix W is preserved in dimensionality reduction, one can take $\phi(x_i)$ as the representation of this property. It is natural to take $\sum_{i=1}^{N} \| y_i - \sum_{j\sim i} W_{ij} y_j \|^2$ as the first item in Eq. (3); here j ∼ i indicates that $x_j$ is a neighbor of $x_i$. We also add the centering constraint. Moreover, the dimensions of all points are orthonormal to each other in the spectral decomposition. The embeddings of LLE can thus be considered as solutions to the following optimization problem:

$\arg\min_{Y} \; \sum_{i=1}^{N} \Big\| y_i - \sum_{j\sim i} W_{ij} y_j \Big\|^2$
s.t. $\sum_{i=1}^{N} y_i = 0, \quad YY^T = I_{d\times d}.$  (10)
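As a side illustration of step (2) above, the following is a minimal sketch of the standard LLE weight computation [7,8]; the k-nearest-neighbor lists, the regularization constant and the array layout are our assumptions, and this is not code from the paper.

```python
import numpy as np

def lle_weights(X, neighbors, reg=1e-3):
    """X: (N, D) data matrix; neighbors[i]: indices of the k nearest neighbors of x_i.
    Returns the sparse reconstruction-weight matrix W with rows summing to one."""
    N = X.shape[0]
    W = np.zeros((N, N))
    for i in range(N):
        idx = np.asarray(neighbors[i])
        Z = X[idx] - X[i]                            # neighbors shifted so that x_i is the origin
        C = Z @ Z.T                                  # local Gram matrix, k x k
        C += reg * np.trace(C) * np.eye(len(idx))    # regularize for numerical stability
        w = np.linalg.solve(C, np.ones(len(idx)))
        W[i, idx] = w / w.sum()                      # enforce the sum-to-one constraint
    return W
```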

Comparing the components in Eq. (10) with those in Eq. (3), it is clear that LLE cannot explicitly unfold the manifold, since it does not have the corresponding D(Y) term. If we add $D_l(Y)$ from Eq. (4), we can formulate the following problem:

$\arg\min_{Y} \; \sum_{i=1}^{N} \Big\| y_i - \sum_{j\sim i} W_{ij} y_j \Big\|^2 - \lambda \sum_{i=1}^{N}\sum_{j\sim i} \|y_i - y_j\|^2$
s.t. $\sum_{i=1}^{N} y_i = 0, \quad YY^T = I_{d\times d}.$

With some algebraic operations, this problem can be solved directly by spectral decomposition. However, we may then face the same abnormal eigen-decomposition problem as in the original LLE. To avoid this, we would like to represent the problem in terms of the kernel matrix K, as in MVU. Unfortunately, it is difficult to express the second constraint in terms of K, so we apply a substitutive strategy. Considering that Y can be derived by spectral decomposition of the kernel matrix, it is natural that $YY^T = I_{d\times d}$. We employ $\sum_{i=1}^{N} \|y_i\|^2 = d$ as a substitution for the second


constraint. The optimal embeddings derived by SLLE can be defined as solutions to the following problem:

$\arg\min_{Y} \; \sum_{i=1}^{N} \Big\| y_i - \sum_{j\sim i} W_{ij} y_j \Big\|^2 - \lambda \sum_{i=1}^{N}\sum_{j\sim i} \|y_i - y_j\|^2$
s.t. $\sum_{i=1}^{N} y_i = 0, \quad \sum_{i=1}^{N} \|y_i\|^2 = d.$  (11)

The optimization problem in Eq. (11) can be reformulated in terms of the kernel matrix, as shown in Eq. (12).

$\arg\min_{K} \; \sum_{i=1}^{N} \Big( K_{ii} - \sum_{j\sim i} W_{ij} K_{ij} - \sum_{j\sim i} W_{ij} K_{ji} + \sum_{j\sim i}\sum_{p\sim i} W_{ij} W_{ip} K_{jp} \Big) - \lambda \sum_{i=1}^{N}\sum_{j\sim i} (K_{ii} - K_{ij} - K_{ji} + K_{jj})$
s.t. $\sum_{i=1}^{N}\sum_{j=1}^{N} K_{ij} = 0, \quad \mathrm{tr}(K) = d, \quad K \succeq 0.$  (12)

This is a semi-definite problem, which can be solved effectively by advanced optimization techniques; in all our examples, we apply the CSDP [20] program.

We would like to show the relationship between the feasible regions F(Y), i.e., the set formed by the Y that satisfy the constraints in Eq. (10), and F(K), i.e., the set formed by the K that satisfy the constraints in Eq. (12). More concretely, $Y = [y_1, y_2, \ldots, y_N] \in F(Y)$ if and only if $YY^T = I_{d\times d}$ and $\sum_{i=1}^{N} y_i = 0$, and $K \in F(K)$ if and only if $\sum_{i=1}^{N}\sum_{j=1}^{N} K_{ij} = 0$, $\mathrm{tr}(K) = d$ and $K \succeq 0$. The following theorem shows their relationship.

Theorem 1. Assume $Y = [y_1, y_2, \ldots, y_N] \in F(Y)$. If $K = Y^T Y$, then $K \in F(K)$. Conversely, if $K \in F(K)$ and Y is derived by picking the eigenvectors corresponding to the d largest eigenvalues of K, then $Y \in F(Y)$.

Proof. On one hand, since $Y = [y_1, y_2, \ldots, y_N] \in F(Y)$, we have $YY^T = I_{d\times d}$ and $\sum_{i=1}^{N} y_i = 0$. If $K = Y^T Y$, then

$\sum_{i=1}^{N}\sum_{j=1}^{N} K_{ij} = \sum_{i=1}^{N}\sum_{j=1}^{N} y_i^T y_j = \sum_{i=1}^{N} y_i^T \Big( \sum_{j=1}^{N} y_j \Big) = 0,$
$\mathrm{tr}(K) = \mathrm{tr}(Y^T Y) = \mathrm{tr}(YY^T) = d, \quad K = Y^T Y \succeq 0.$

Therefore, if $Y \in F(Y)$ and $K = Y^T Y$, then $K \in F(K)$.

On the other hand, if $K \in F(K)$ and Y is derived by picking the eigenvectors corresponding to the d largest eigenvalues, then $YY^T = I_{d\times d}$. Assume $K = Y^T \Sigma Y$, where $\Sigma$ is the diagonal matrix formed by the d largest eigenvalues of K, i.e., $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_d \ge 0$. Denote $y_i = [y_{i1}, y_{i2}, \ldots, y_{id}]^T$. With some algebraic computation, the (i, j) element of K is $K_{ij} = \sum_{p=1}^{d} \sigma_p y_{ip} y_{jp}$. Thus,

$\sum_{i=1}^{N}\sum_{j=1}^{N} K_{ij} = \sum_{i=1}^{N}\sum_{j=1}^{N}\sum_{p=1}^{d} \sigma_p y_{ip} y_{jp} = \sum_{p=1}^{d} \sigma_p \Big( \sum_{i=1}^{N} y_{ip} \Big)^2 = 0.$

Since $\sigma_p \ge 0$ for $p = 1, 2, \ldots, d$, it follows that $\sum_{i=1}^{N} y_{ip} = 0$ for $p = 1, 2, \ldots, d$, which implies that $\sum_{i=1}^{N} y_i = 0$. This is equivalent to the first constraint in Eq. (10). Thus, if $K \in F(K)$ and Y is derived by picking the eigenvectors corresponding to the d largest eigenvalues of K, then $Y \in F(Y)$. □

The objective functions of the problems in Eqs. (11) and (12) also have a close relationship. Denote

$\mathrm{fobj}(Y) = \sum_{i=1}^{N} \Big\| y_i - \sum_{j\sim i} W_{ij} y_j \Big\|^2 - \lambda \sum_{i=1}^{N}\sum_{j\sim i} \|y_i - y_j\|^2$

and

$\mathrm{gobj}(K) = \sum_{i=1}^{N} \Big( K_{ii} - \sum_{j\sim i} W_{ij} K_{ij} - \sum_{j\sim i} W_{ij} K_{ji} + \sum_{j\sim i}\sum_{p\sim i} W_{ij} W_{ip} K_{jp} \Big) - \lambda \sum_{i=1}^{N}\sum_{j\sim i} (K_{ii} - K_{ij} - K_{ji} + K_{jj}).$

We will show that if $K = Y^T Y$, i.e., $K_{ij} = \langle y_i, y_j \rangle$, then $\mathrm{fobj}(Y) = \mathrm{gobj}(K)$. Since

$\Big\| y_i - \sum_{j\sim i} W_{ij} y_j \Big\|^2 = \Big( y_i - \sum_{j\sim i} W_{ij} y_j \Big)^T \Big( y_i - \sum_{j\sim i} W_{ij} y_j \Big) = y_i^T y_i - \sum_{j\sim i} W_{ij} (y_i^T y_j) - \sum_{j\sim i} W_{ij} (y_j^T y_i) + \sum_{j\sim i}\sum_{p\sim i} W_{ij} W_{ip} (y_j^T y_p) = K_{ii} - \sum_{j\sim i} W_{ij} K_{ij} - \sum_{j\sim i} W_{ij} K_{ji} + \sum_{j\sim i}\sum_{p\sim i} W_{ij} W_{ip} K_{jp}$

and

$\sum_{i=1}^{N}\sum_{j\sim i} \|y_i - y_j\|^2 = \sum_{i=1}^{N}\sum_{j\sim i} (K_{ii} - K_{ij} - K_{ji} + K_{jj}),$

it follows that $\mathrm{fobj}(Y) = \mathrm{gobj}(K)$.

In summary, the optimal solutions to the problems in Eqs. (11) and (12) are closely related. As in MVU [5] and KPCA [21], the optimal Y can be approximated by the eigenvectors corresponding to the d largest eigenvalues of the optimal kernel matrix K. In general, SLLE has several advantages. (1) Since we replace the spectral decomposition of an abnormal matrix by that of a well-posed kernel matrix, SLLE can derive the embedding more accurately. Moreover, through this transformation, the most valuable information tends to concentrate in the d largest eigenvalues of K; thus, the influence of an inaccurate determination of the local approximation coefficients, caused by the contamination of noise, is also alleviated. (2) Since SLLE tends to explicitly unfold the manifold, the reconstruction weights for distant points shrink sharply; thus, the influence of the neighborhood size is also reduced. (3) When $\lambda = 0$, SLLE is similar to the original LLE; it contains LLE as a special case.
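The paper solves Eq. (12) with the CSDP solver [20]. Purely as an illustration, the following sketch formulates the same program with cvxpy and extracts embeddings from the optimal kernel. The use of cvxpy/SCS, the loop-based construction and the scaling of the eigenvectors by the square roots of the eigenvalues (as in MVU/kernel PCA) are our assumptions, not details taken from the paper.

```python
import cvxpy as cp
import numpy as np

def slle_embedding(W, neighbors, d, lam):
    """W: (N, N) LLE reconstruction weights; neighbors[i]: indices j with j ~ i;
    d: target dimensionality; lam: the trade-off constant of Eq. (12)."""
    N = W.shape[0]
    K = cp.Variable((N, N), PSD=True)              # symmetric positive semi-definite kernel
    loss, unfold = 0, 0
    for i in range(N):
        loss += K[i, i]
        for j in neighbors[i]:
            loss += -W[i, j] * (K[i, j] + K[j, i])
            unfold += K[i, i] - K[i, j] - K[j, i] + K[j, j]
            for p in neighbors[i]:
                loss += W[i, j] * W[i, p] * K[j, p]
    prob = cp.Problem(cp.Minimize(loss - lam * unfold),
                      [cp.sum(K) == 0, cp.trace(K) == d])
    prob.solve(solver=cp.SCS)                      # any SDP-capable solver stands in for CSDP
    # embeddings from the d largest eigenvalues/eigenvectors of the optimal kernel
    vals, vecs = np.linalg.eigh(K.value)
    top = np.argsort(vals)[::-1][:d]
    return (vecs[:, top] * np.sqrt(np.clip(vals[top], 0, None))).T   # d x N
```

The scalar loops are for clarity only; for larger N the matrix factorization of Weinberger et al. [23] mentioned in Section 3.5 is the kind of device needed to keep the program tractable.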


3.2. Stable LE (SLE)

Similarly to the previous section, we first analyze LE and then propose the improved LE, SLE. LE applies the graph Laplacian to characterize the local geometric structure of a manifold. It has the same procedure as LLE, except for a few differences in the second step. As in LLE, a neighborhood graph is constructed in the first step. The entry of the similarity matrix W is defined as $\exp\{-\|x_i - x_j\|^2 / t\}$ if one element of the vertex pair $(x_i, x_j)$ is among the k nearest neighbors of the other; otherwise, the entry is set to zero. Low dimensional embeddings are derived by factorizing $(D - W)$, where D is the diagonal matrix with $D_{ii} = \sum_j W_{ij}$. LE can be interpreted as the solution to the following problem:

$\arg\min_{Y} \; \sum_{i=1}^{N}\sum_{j\sim i} \|y_i - y_j\|^2 W_{ij}$
s.t. $\sum_{i=1}^{N} y_i = 0, \quad YDY^T = I.$  (13)

Compared with the proposed framework in Eq. (3), LE neglects to explicitly unfold the manifold. It is only concerned with the local similarity and does not refer to the expected global distances; thus, LE is very sensitive to the size of the neighborhood. We add $D_g(Y)$, shown in Eq. (5), to the original LE. For the same reason as mentioned for LLE, if we change the constraint $YDY^T = I$ to $\sum_{i=1}^{N} D_{ii} y_i^T y_i = d$, the problem in Eq. (13) can be reformulated in terms of the elements of the kernel matrix K. Consequently, SLE can be defined as the solution to the following problem:

$\arg\min_{Y} \; \sum_{i=1}^{N}\sum_{j\sim i} \|y_i - y_j\|^2 W_{ij} - \lambda \sum_{i=1}^{N}\sum_{j=1}^{N} \|y_i - y_j\|^2$
s.t. $\sum_{i=1}^{N} y_i = 0, \quad \sum_{i=1}^{N} D_{ii} y_i^T y_i = d.$  (14)

As stated at the end of Section 2, if we replace Y by K, we can not only avoid solving the abnormal eigen-decomposition problem of the original LE, but also concentrate the information so as to alleviate the influence of noise and parameters. The problem in Eq. (14) is converted to the following semi-definite problem:

$\arg\min_{K} \; \sum_{i=1}^{N}\sum_{j\sim i} W_{ij} (K_{ii} - K_{ij} - K_{ji} + K_{jj}) - 2N\lambda\, \mathrm{tr}(K)$
s.t. $\sum_{i=1}^{N}\sum_{j=1}^{N} K_{ij} = 0, \quad \sum_{i=1}^{N} D_{ii} K_{ii} = d, \quad K \succeq 0.$  (15)

Based on the results in Theorem 1, it is straightforward to draw similar conclusions here: the solutions to the problems in Eqs. (14) and (15) are closely related to each other.

SLE has several advantages.

(1) When $\lambda = 0$, SLE is almost the same as LE, except for a few differences in the second constraint; SLE contains LE as a special case.
(2) SLE expects to unfold the manifold globally. The weights for distant points, which are computed by $\exp\{-\|x_i - x_j\|^2 / t\}$, decrease drastically. Thus, the influence of the neighborhood size is alleviated and SLE is stable with respect to the size of the neighborhood.
(3) Through the kernel transformation, we can not only derive the solution of SLE by solving a well-posed problem, but also improve the stability of the original LE.
(4) SLE has high application potential. These properties will be verified by the experiments in Section 4.

3.3. Stable LTSA (SLTSA)

LTSA characterizes the local geometry of a manifold by constructing an approximate tangent space at each point. These tangent spaces are then aligned to give the global coordinates. There are also mainly three steps in LTSA. The first is to extract the local information. For each $i = 1, \ldots, N$, one determines the k nearest neighbors $x_{i_j}$ of $x_i$, computes the d largest unit eigenvectors $g_1, g_2, \ldots, g_d$ of the correlation matrix $(X_i - \bar{x}_i e^T)^T (X_i - \bar{x}_i e^T)$ and sets $G_i = [e/\sqrt{k}, g_1, \ldots, g_d]$. Here $X_i$ is a D×(k+1) matrix formed by $x_i$ and its k nearest neighbors $\{x_{i_1}, x_{i_2}, \ldots, x_{i_k}\}$, $\bar{x}_i$ represents the mean of the $x_{i_j}$, and e is a column vector of all ones. The second step is to construct the alignment matrix: form a matrix B from the local sums, i.e., $B(I_i, I_i) \leftarrow B(I_i, I_i) + I - G_i G_i^T$, where $I_i = \{i_1, i_2, \ldots, i_k\}$ is the neighborhood index set of the point $x_i$. The final step is to align the global coordinates. One computes the (d+1) smallest eigenvectors of B and picks the eigenvectors corresponding to the second to the (d+1)st smallest eigenvalues, since the smallest eigenvalue is zero and the corresponding eigenvector is constant (see [11] for more details).

Considering these procedures, we can see that the basic idea of LTSA is to minimize the alignment error $\sum_{i=1}^{N} \|Y_i W_i\|^2$, where $Y_i = [y_{i_1}, y_{i_2}, \ldots, y_{i_k}]$ and $W_i = I - G_i G_i^T$. The embeddings derived by LTSA can be interpreted as the solution to the following problem:

$\arg\min_{Y} \; \sum_{i=1}^{N} \|Y_i W_i\|^2$
s.t. $YY^T = I, \quad \sum_{i=1}^{N} y_i = 0.$  (16)

LTSA is only concerned with the local characteristics. It employs local linear projections and then aligns these projections by linear transformations; it does not explicitly unfold the manifold either. Similarly to LLE, LTSA is sensitive to the parameters and to noise. We define the embeddings derived by SLTSA as the solution to the following problem:

$\arg\min_{Y} \; \sum_{i=1}^{N} \|Y_i W_i\|^2 - \lambda \sum_{i=1}^{N}\sum_{j\sim i} \|y_i - y_j\|^2$
s.t. $\sum_{i=1}^{N} y_i = 0, \quad \sum_{i=1}^{N} \|y_i\|^2 = d.$  (17)
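To make the first two LTSA steps above concrete, here is a minimal NumPy sketch of the construction of $G_i$, $W_i = I - G_i G_i^T$ and the alignment matrix B. It follows the standard LTSA recipe [11]; the exact neighborhood convention (whether $x_i$ itself is included in the index set) and the array layout are our assumptions.

```python
import numpy as np

def ltsa_alignment(X, neighbors, d):
    """X: (N, D) data; neighbors[i]: index set I_i of the patch around x_i
    (here assumed to include x_i itself). Returns the alignment matrix B and
    the per-patch matrices W_i = I - G_i G_i^T."""
    N = X.shape[0]
    B = np.zeros((N, N))
    W_list = []
    for i in range(N):
        Ii = np.asarray(neighbors[i])
        k = len(Ii)
        M = (X[Ii] - X[Ii].mean(axis=0)).T           # D x k, centred local patch
        # the d largest right singular vectors of M are the top eigenvectors of M^T M
        _, _, Vt = np.linalg.svd(M, full_matrices=False)
        G = np.hstack([np.ones((k, 1)) / np.sqrt(k), Vt[:d].T])   # k x (d+1)
        Wi = np.eye(k) - G @ G.T
        W_list.append(Wi)
        B[np.ix_(Ii, Ii)] += Wi                      # B(I_i, I_i) <- B(I_i, I_i) + I - G_i G_i^T
    return B, W_list
```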


For the same reasons given in the analysis of LLE and LE, we can substitute $YY^T = I$ by $\sum_{i=1}^{N} \|y_i\|^2 = d$ and express the optimization problem in terms of the kernel matrix K. Through the kernel transformation, one can reformulate Eq. (17) as the following semi-definite problem:

$\arg\min_{K} \; \sum_{i=1}^{N} \sum_{j=1}^{k}\sum_{p=1}^{k}\sum_{q=1}^{k} W^{(i)}_{qp} W^{(i)}_{qj} K_{i_j i_p} - \lambda \sum_{i=1}^{N}\sum_{j\sim i} (K_{ii} - K_{ij} - K_{ji} + K_{jj})$
s.t. $\sum_{i=1}^{N}\sum_{j=1}^{N} K_{ij} = 0, \quad \mathrm{tr}(K) = d, \quad K \succeq 0.$  (18)

Here $W^{(i)}_{qp}$ represents the (q, p) element of $W_i$, and $K_{i_j i_p}$ is the $(i_j, i_p)$ element of the kernel matrix K. In general, SLTSA has several advantages.

(1) When $\lambda = 0$, it is almost the same as LTSA; thus, LTSA can be regarded as a special case of SLTSA.
(2) The local distances are enlarged and the influence of distant points is reduced. This makes SLTSA stable with respect to the neighborhood size.
(3) It is more suitable to use SLTSA than LTSA in real applications, as will be verified in the following experiments.

3.4. Common algorithm procedure

Considering the analysis in the previous three sections, we can unify the three stable local dimensionality reduction approaches into one common procedure.

Step 1: Construct a neighborhood graph. One can construct a neighborhood graph by connecting every point to its k nearest neighbors, or to the points within an ε-ball.

Step 2: Characterize the local configurations. Different approaches apply different descriptions to characterize the local configurations of the manifold. For example,
(1) SLLE: compute the local reconstruction weights W by local linear approximation and set the constraint matrix C = I.
(2) SLE: compute the local similarity matrix W by local similarity and set the constraint matrix C = D.
(3) SLTSA: compute the local affine matrices W by LTSA and set the constraint matrix C = I.

Step 3: Formulate the optimization problem. Formulate every algorithm within the proposed framework in terms of the kernel matrix K. Different methods have different loss functions. For example,

(1) SLLE:
$V_K(\phi(x_i), K) = K_{ii} - \sum_{j\sim i} W_{ij} K_{ij} - \sum_{j\sim i} W_{ij} K_{ji} + \sum_{j\sim i}\sum_{p\sim i} W_{ij} W_{ip} K_{jp}, \quad D_K(K) = \sum_{i=1}^{N}\sum_{j\sim i} (K_{ii} - K_{ij} - K_{ji} + K_{jj}).$  (19)

(2) SLE:
$V_K(\phi(x_i), K) = \sum_{j\sim i} W_{ij} (K_{ii} - K_{ij} - K_{ji} + K_{jj}), \quad D_K(K) = \sum_{i=1}^{N}\sum_{j=1}^{N} (K_{ii} - K_{ij} - K_{ji} + K_{jj}).$  (20)

(3) SLTSA:
$V_K(\phi(x_i), K) = \sum_{j=1}^{k}\sum_{p=1}^{k}\sum_{q=1}^{k} W^{(i)}_{qp} W^{(i)}_{qj} K_{i_j i_p}, \quad D_K(K) = \sum_{i=1}^{N}\sum_{j\sim i} (K_{ii} - K_{ij} - K_{ji} + K_{jj}).$  (21)

Step 4: Solve the optimization problem. Solve the following semi-definite program:

$\arg\min_{K} \; \sum_{i=1}^{N} V_K(\phi(x_i), K) - \lambda D_K(K)$
s.t. $\sum_{i=1}^{N}\sum_{j=1}^{N} K_{ij} = 0, \quad \mathrm{tr}(CK) = d, \quad K \succeq 0.$  (22)

Step 5: Derive the low dimensional representations. Factorize K by spectral decomposition and take the d most significant eigenvectors (corresponding to the d largest eigenvalues) of the optimal kernel matrix K as the embeddings.

3.5. Related discussions

We now provide some important discussions about the proposed approaches. How to determine the balance parameter $\lambda$ is the first problem that we meet. It cannot be too large or too small. If $\lambda$ is too large, the minimization of Eq. (22) tends to enlarge the distances between any two related points, which ignores the local information. On the contrary, if $\lambda$ is too small, the effect of unfolding the manifold decreases. In fact, it is difficult to determine this balance parameter directly, since different data sets have different numerical scales. In order to eliminate the influence of the numerical scale, we turn to another parameter $\eta$, which is the proportion of the two items in Eq. (3). Since it is the ratio between the two items, we can constrain it between zero and one, and any $\lambda$ can be uniquely determined by a fixed $\eta$ and vice versa. To guarantee that Y is consistent with X, we employ the original data X instead of Y, and thus $\lambda$ is given by

$\lambda = \eta \, \frac{\sum_{i=1}^{N} V(\phi(x_i), x_i)}{D(X)}.$  (23)
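The following is a rough sketch of how Eq. (23) combines with the grid search over $\eta$ described in the next paragraph. The function `solve_sdp` and the scalars `local_loss_on_X` and `D_of_X` (the two items of Eq. (3) evaluated on the original data X) are hypothetical stand-ins, not names or code from the paper.

```python
import numpy as np

def select_parameters(local_loss_on_X, D_of_X, solve_sdp, d):
    """Grid-search eta in {0.1, ..., 1.0}; lam follows from Eq. (23); keep the eta
    whose optimal objective value of Eq. (22) is smallest."""
    best_eta, best_obj, best_K = None, np.inf, None
    for eta in np.arange(0.1, 1.01, 0.1):
        lam = eta * local_loss_on_X / D_of_X     # Eq. (23)
        K_opt, obj = solve_sdp(lam, d)           # hypothetical SDP solver (e.g. the cvxpy sketch above)
        if obj < best_obj:
            best_eta, best_obj, best_K = eta, obj, K_opt
    return best_eta, best_K
```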

We now only need to determine the parameter $\eta$. Since it is constrained between zero and one, we apply the grid search technique [22]: more concretely, we assume that $\eta \in \{0.1, 0.2, \ldots, 1\}$ and compute the corresponding optimal values of the objective function in Eq. (22). The $\eta$ which achieves the smallest optimal value is selected. The next concern is the computational complexity. Since we have reformulated the original local methods in their kernel expressions, the computational cost increases. The original methods have $O(n^2)$ computational complexity (sparse matrices) for spectral decomposition.


However, it needs more time to solve the corresponding semi-definite program. One possible way to address this problem is to employ the kernel matrix factorization method proposed by Weinberger et al. [23], in which the variables and constraints are all decreased to form an approximate, small-size optimization problem.

4. Experiments

Seven experiments are described in this section. To compare the improved approaches with the original methods, the first experiments are performed on toy data, with different noise variances and parameters k. The experimental results show that the improved methods are robust not only to noise, but also to the parameter k. The second and third experiments show that our methods are stable to noise in real applications. The fourth and fifth illustrate that our methods are stable to the parameter k on real data. The last two experiments demonstrate that the improved approaches have high application potential. The parameter $\eta$ in these experiments is determined by the strategy discussed in Section 3.5. One point should be highlighted here: in the following experiments, we aim to compare the improved algorithms with their corresponding original methods. We do not compare the performances of the different improved methods against each other, since they have different application scopes and it is unwise to compare them on a particular data set. As shown in the following experiments, we cannot directly specify which method is always the best.

4.1. Standard diversity

In order to measure the diversity of two embeddings, we introduce a specially designed measurement metric, the standard diversity (SD). It mainly focuses on the shape of a manifold and is invariant to translation, scale change and symmetric deformation. Assume $Y^{(1)} = [y_1^{(1)}, y_2^{(1)}, \ldots, y_N^{(1)}]$ are the expected embeddings and the derived representations are $Y^{(2)} = [y_1^{(2)}, y_2^{(2)}, \ldots, y_N^{(2)}]$, where $y_i^{(j)} \in \mathbb{R}^d$ for $i = 1, 2, \ldots, N$ and $j = 1, 2$. We would like to find a linear transformation from $Y^{(2)}$ to $Y^{(1)}$ through which $Y^{(1)}$ is approximated best. In other words, we expect to find $c \in \mathbb{R}^d$ and $T \in \mathbb{R}^{d\times d}$ that minimize

$\sum_{i=1}^{N} \| y_i^{(1)} - c - T y_i^{(2)} \|^2.$  (24)

The optimal solution to this problem is

$c^* = \sum_{i=1}^{N} y_i^{(1)} / N, \quad T^* = (Y^{(1)} - c^* 1_N)(Y^{(2)})^{\dagger}.$  (25)

Here $(\cdot)^{\dagger}$ represents the Moore–Penrose inverse. The SD from $Y^{(2)}$ to $Y^{(1)}$ is defined as the mean of the fitting errors, that is,

$\mathrm{SD}(Y^{(2)} \to Y^{(1)}) = \frac{1}{N} \sum_{i=1}^{N} \| y_i^{(1)} - c^* - T^* y_i^{(2)} \|^2.$  (26)
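A direct NumPy transcription of Eqs. (24)–(26); the d×N array layout is our assumption.

```python
import numpy as np

def standard_diversity(Y1, Y2):
    """SD(Y2 -> Y1) as in Eqs. (24)-(26). Y1, Y2: d x N arrays (Y1 expected, Y2 derived)."""
    N = Y1.shape[1]
    c = Y1.mean(axis=1, keepdims=True)          # c* of Eq. (25)
    T = (Y1 - c) @ np.linalg.pinv(Y2)           # T* = (Y1 - c* 1_N)(Y2)^dagger
    resid = Y1 - c - T @ Y2
    return float(np.sum(resid ** 2) / N)        # mean fitting error, Eq. (26)
```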

4.2. Experiments on the Toy data set

In this section, we present several experiments on a toy data set. It consists of 300 samples selected from a 2D helical line. The generating function is $[t\sin(t), 0.6\,t\cos(t)]$, where the parameter t is uniformly sampled in the interval $[0, 4\pi]$ for the N = 300 points. The first experiment is performed on a noisy data set (see Fig. 1(a)). We add noise, randomly sampled from a normal distribution with zero mean and standard deviation σ = 0.5, to the original helical line data. If one takes the curve length as a parameter, the expected embeddings are shown in Fig. 1(e): for the ith point, its ideal vertical value is the curve length between the first and the ith points. The low dimensional embeddings derived by LLE, LE, LTSA, SLLE, SLE and SLTSA are shown in Figs. 1(b), (c), (d), (f), (g) and (h), respectively. The parameter k is eight for these experiments. As seen from Fig. 1, owing to its intrinsic deficits, LLE fails to discover the low dimensional embeddings; there are folds in its embeddings. LE and LTSA cannot unfold the noisy helical line either: the embeddings of points sampled from the big circle overlap with each other. On the contrary, the embeddings derived by SLLE, SLE and SLTSA are very similar to the expected embeddings and are better than those of the original approaches. Thus, the improved methods are more stable to noise.

We also provide some numerical comparisons for illustration. Since MLLE performs better than the original LLE, and since the intuition behind the proposed framework comes from MVU, we also compare the proposed methods with MLLE and MVU. With different standard deviations of the noise, i.e., different σ, we compute the SDs defined above between the derived embeddings and the expected embeddings. After repeating every run (every method with every noise level) 50 times, we list the means and standard deviations (Std) in Table 1. As seen from Table 1, the improved methods all achieve smaller SDs than the corresponding original approaches, which implies that our methods perform better. MLLE provides satisfactory embeddings when the noise level is small, but it fails when σ is large. MVU is very stable to the noise and also performs well. As the noise level increases, all methods obtain worse embeddings, which is consistent with intuition. Moreover, based on these comparisons, SLLE and SLE may perform better than SLTSA on this data set.

In order to show that the improved methods are also stable to the parameter k, we provide some experiments on the helical line data without noise (this guarantees that all comparisons are performed on the same data). We vary the parameter k from 5 to 9 and compute the embeddings derived by LLE, SLLE, LE, SLE, LTSA and SLTSA, respectively. The SDs for each method with each parameter k are listed in Table 2. As seen from Table 2, LLE is very sensitive to k: there are great differences among the results for different k. When k = 5 and 7, LE and LTSA both achieve similar accuracies, but when k = 9 their SDs change greatly. On the contrary, the accuracies of our improved approaches, i.e., SLLE, SLE and SLTSA, do not change much for different k. Thus, the proposed methods are not only stable to noise (see the previous experiment), but also insensitive to the parameter k. Moreover, there are no distinct differences among the SDs of our methods.

4.3. Experiments on the Translated Lenna and the Rotated Earth data sets

In order to show that our methods are also stable to noise on real data, we provide some experiments on two real data sets with different kinds of noise. The first data set is the Translated Lenna data set. Examples are generated by translating the Lenna image across a larger background of random noise, and the noise is independent from one example to another. This often occurs when a robot captures pictures of an object: the background is often contaminated by noise. We constrain the gray level of the noisy background to lie between 100 and 120. The input consists of N = 900 grayscale images, each containing a 30 × 30 Lenna image superimposed on a 61 × 61 background of noise. Some typical images are shown in Fig. 2.

C. Hou et al. / Pattern Recognition 42 (2009) 2054 -- 2066

Original

LLE

10

LE

LTSA 0.1

0.5

2

0.05

1 0

2061

0

0

0 −0.5

−1

−0.05

−10 −10

−5

0

5

100

Expected

200

300

100

SLLE

200

200

0

0.05

0

−0.2

0

−1

−0.4

300

100

200

300

200

300

0.1

−0.05

0 100

100

SLTSA

0.2

1

400

300

SLE

2

600

200

100

200

300

100

200

300

Fig. 1. In (b)–(e), x-axis: index of points; y-axis: length of the curve. (a) Noisy helical line; (b) embeddings derived by LLE; (c) embeddings derived by LE; (d) embeddings derived by LTSA; (e) expected embeddings; (f) embeddings derived by SLLE; (g) embeddings derived by SLE; and (h) embeddings derived by SLTSA.

Table 1
Mean (Std) of the SDs between the expected embedding and those derived by LLE, SLLE, LE, SLE, LTSA, SLTSA, MLLE and MVU on the helical line data with different noise levels.

Method | σ = 0.3 | σ = 0.5 | σ = 0.7
LLE | 2.9723 (0.5054) | 3.5317 (0.3090) | 3.8984 (0.3529)
SLLE | 0.9219 (0.0892) | 1.0925 (0.2792) | 1.4534 (0.2783)
LE | 1.8634 (0.0334) | 2.0415 (0.0823) | 2.5265 (0.3295)
SLE | 0.8328 (0.0442) | 1.3124 (0.1029) | 1.7406 (0.3905)
LTSA | 2.2207 (0.0765) | 2.3467 (0.2957) | 2.6998 (0.4793)
SLTSA | 1.5872 (0.0840) | 1.7521 (0.2875) | 2.1662 (0.2265)
MLLE | 1.9436 (0.5590) | 3.2355 (0.6303) | 3.7760 (0.3074)
MVU | 1.3018 (0.0189) | 1.4355 (0.1208) | 1.5425 (0.2095)

Table 2
SDs between the expected embedding and those derived by LLE, SLLE, LE, SLE, LTSA and SLTSA on the helical line data with different parameters k.

Method | k = 5 | k = 7 | k = 9
LLE | 3.2924 | 4.2445 | 3.7479
SLLE | 0.7384 | 0.7394 | 0.7847
LE | 0.7340 | 0.7315 | 2.3244
SLE | 0.7383 | 0.7394 | 0.7834
LTSA | 1.2196 | 1.2177 | 2.3745
SLTSA | 0.7372 | 0.7363 | 0.7554

The intrinsic dimensionality of this data set, i.e., the dimensionality of the embedding space, is two. With different parameters k, we apply LLE, SLLE, LE, SLE, LTSA and SLTSA to derive the embeddings. The computed coordinates of LLE, SLLE, LE, SLE, LTSA and SLTSA are shown in Figs. 3(a), (d), (b), (e), (c) and (f), respectively. Four particular points, whose corresponding real images are shown in Fig. 2, are marked by different symbols. As seen from Figs. 2 and 3, SLLE and SLE clearly represent the intrinsic structure of the Translated Lenna data set in a reasonable way: the translation behavior of the Lenna image is revealed. LLE, however, cannot properly represent this character. Although this behavior can probably be inferred from the embeddings of LE, they are not as good as the embeddings derived by SLE. Though the embeddings of SLTSA are not as good as those of SLLE and SLE, they are still better than the embeddings of LTSA, which fails to discover the intrinsic structure of this data set at all. Comparing the results of our three methods, it seems that SLLE and SLE perform better than SLTSA on this data set.

Another experiment is performed on a data set of 900 rotated Earth images. This data set was proposed by Ham and has been applied in [24]. It consists of 100 × 100 rendered images of the globe obtained by rotating its azimuthal and elevation angles. Some marginal pictures are drawn in Fig. 4. Different from the above data set, which is generated by translating an image over a noisy background, the noise in this set is induced during the process of capturing the images; it may be caused by defects of the optical elements, blurring during the capture, or slight changes of the background. The intrinsic dimensionality of this data set is two. We have also employed LLE, SLLE, LE, SLE, LTSA and SLTSA on this data set. The embeddings derived by these methods, together with the representations of the four marginal images shown in Fig. 4, are presented in Fig. 5. Although the images are contaminated by noise, the improved algorithms, i.e., SLLE, SLE and SLTSA, perform better than their corresponding original methods. Although LE and LTSA can approximately reveal the intrinsic structure, their improved versions represent this character in a more practical way. Moreover, it appears that SLE and SLTSA derive more faithful embeddings than SLLE on this data set.

4.4. Experiments on the Breast Cancer and the Teapot data sets

In this section, we present two experiments on two real data sets to illustrate that our methods are not sensitive to the parameter k. The first experiment shows that SLLE and SLE are not very sensitive to the neighborhood size. It is performed on the `Wisconsin Diagnostic Breast Cancer' data set [25], which contains 569 instances, each with 32 attributes, divided into two classes: malignant and benign.


Fig. 2. Typical images of the Translated Lenna data set. The titles indicate their corresponding embeddings in Fig. 3. (a) Lenna picture identified by the circle; (b) Lenna picture identified by the triangle; (c) Lenna picture identified by the diamond; and (d) Lenna picture identified by the square.

Fig. 3. Embeddings of the Translated Lenna data set. (a) LLE with k = 12; (b) LE with k = 11; (c) LTSA with k = 5; (d) SLLE with k = 12; (e) SLE with k = 11; and (f) SLTSA with k = 5.

Fig. 4. Typical images of the Rotated Earth data set. The titles indicate their corresponding embeddings in Fig. 5. (a) Earth picture identified by the circle; (b) Earth picture identified by the triangle; (c) Earth picture identified by the diamond; and (d) Earth picture identified by the square.

Excluding the first attribute (ID) and the second attribute (diagnosis), we employ the first 98 instances, which contain 64 malignant cases and 34 benign cases. The 2D embeddings derived by LLE, LE, SLLE and SLE with different parameters k are shown in Fig. 6. A red star represents the embedding of a benign instance and a black dot the embedding of a malignant case. As seen from Fig. 6, SLLE and SLE are not sensitive to the size of the neighborhood: for both k = 10 and k = 20, they represent the 30-dimensional data samples by 2D points that can be easily separated. On the contrary, LLE does not show this property; when k = 20, a large proportion of the embeddings of the two classes are mixed together. Although the embeddings of LE do not change much for different k, neither of them represents the two kinds of samples in a separable way, and it is therefore difficult to interpret these representations intuitively.

Fig. 5. Embeddings of the Rotated Earth data set. (a) LLE with k = 12; (b) LE with k = 11; (c) LTSA with k = 7; (d) SLLE with k = 12; (e) SLE with k = 11; and (f) SLTSA with k = 7.

Fig. 6. 2D representations, derived by LLE, SLLE, LE and SLE with different k, of the `Breast Cancer' data set. For interpretation of the references to color in Figs. 6 and 8, the reader is referred to the web version of this article. (a) LLE with k = 10; (b) SLLE with k = 10; (c) LE with k = 10; (d) SLE with k = 10; (e) LLE with k = 20; (f) SLLE with k = 20; (g) LE with k = 20; (h) SLE with k = 20.

The other experiment shows that SLTSA is also stable to the size of the neighborhood. It is performed on N = 400 color images of a teapot [26], captured from different angles in a plane. Each vectorized image has dimensionality 23028, resulting from 3 bytes of color information for each of 76 × 101 pixels. The 2D representations derived by LTSA and SLTSA with different neighborhood sizes k are shown in Fig. 7, together with four particularly selected points and their corresponding images. Intuitively, LTSA and SLTSA both perform well when k = 5. When k = 4, however, LTSA fails completely, whereas SLTSA, apart from a slight irregularity, can still reveal the intrinsic structure of the data set: the position of a point on the circle approximately corresponds to the viewing angle of the teapot. On the contrary, the embeddings of LTSA reveal no useful information when k = 4. SLTSA is therefore more stable to the size of the neighborhood.


Fig. 7. Low dimensional representations of the Teapot data set, derived by LTSA and SLTSA with different neighborhood sizes k. (a) LTSA with k = 5; (b) SLTSA k = 5; (c) LTSA with k = 4; and (d) SLTSA with k = 4.

Fig. 8. 2D representations discovered by SLLE, SLE and SLTSA of `9' in the Mnist database with k = 20. (a) SLLE embedding; (b) SLE embedding; and (c) SLTSA embedding.

4.5. Experiments on the Handwritten Digit and the Face data sets

In this section, we provide two experiments on another two real data sets, presented to show that SLLE, SLE and SLTSA have high application potential. The first experiment is performed on the data set of handwritten digit images from Mnist [27]. 500 images are randomly chosen from the training set of the digit `9'. Each image has resolution 28 × 28; hence, D = 784. Fig. 8 shows the first two components of these images discovered by SLLE, SLE and SLTSA with k = 20. Six intentionally selected points are marked by red circles, and their corresponding true images are listed from left to right. The two components appear to capture the nonlinear degrees of freedom associated with the shape of the digits: as the vertical coordinate decreases, the circle of the `9' becomes smaller, and as the horizontal coordinate increases, the stroke direction of the `9' turns gradually from right to left. The proposed approaches reveal these useful characteristics and are thus of high application potential. Moreover, the embeddings of the three methods are similar; they all perform well on this data set.

The final example is performed on 841 face images of one person. This data set was proposed by Ham and has been applied in [24]. It consists of 120 × 100 images captured by varying the pose of the person's head in front of a fixed camera. Fig. 9 shows the dimensionality reduction results of SLLE, SLE and SLTSA, together with several typical images: images on the left correspond to the points on the left margin, and images on the top correspond to the marked samples on the top margin. As seen from Fig. 9, all methods can approximately discover the intrinsic structure of this data set. As the horizontal coordinate increases, the person gradually turns his face to his right.


Fig. 9. Low dimensional embeddings, derived by SLLE, SLE and SLTSA, on the Face image data. (a) SLLE embedding; (b) SLE embedding; and (c) SLTSA embedding.

Meanwhile, as the vertical coordinate decreases, the person gradually raises his head. Our methods represent this character in a realistic way. Moreover, the first two embeddings (SLLE and SLE) are very similar to each other, and it appears that the embeddings derived by SLLE and SLE are more realistic than those computed by SLTSA on this data set.

5. Conclusions

In this paper, a unified framework that explicitly unfolds the manifold is proposed. It aims to make local dimensionality reduction approaches stable to both noise and parameters. To avoid solving the abnormal eigenproblems, we reformulate the original optimization problems as semi-definite programs through a kernel transformation. Three well-known local approaches are reinterpreted and improved within this framework. We have presented several experiments on both synthetic and real data sets. The experimental results show that the performances of these methods are all improved: the proposed methods are more stable to both noise and the parameter k, and the improved methods are more suitable for real applications.

The next step is to continue the research on how to determine the balance parameter $\lambda$ in a more efficient way, since it is very important for the proposed framework. Moreover, since we have reformulated the original spectral decomposition problems as semi-definite problems, how to solve these semi-definite problems effectively is also future work; as shown in [23], we may accelerate the computation by kernel matrix factorization. Finally, as shown in [17], we will also study the "newbie" problem, i.e., how to effectively compute the embedding of a new point.

Acknowledgments

We thank the two anonymous reviewers for their constructive suggestions, and Feiping Nie and Yangqing Jia for useful discussions. We would like to thank Kilian Q. Weinberger, Ji Hun Ham, Yann LeCun and Corinna Cortes for providing the data, and we are grateful to the creators of the `Wisconsin Diagnostic Breast Cancer' database. This work was supported by the National Basic Research Program of China under Grant no. 2005CB321800, the National Natural Science Foundation of China under Grant nos. 60673090 and 60721003, and the Graduate School of the National University of Defense Technology (B070201).

References

[1] V. de Silva, J.B. Tenenbaum, Global versus local methods in nonlinear dimensionality reduction, in: S. Becker, S. Thrun, K. Obermayer (Eds.), Proceedings of NIPS, vol. 15, 2003, pp. 721–728.
[2] M. Collins, S. Dasgupta, R. Schapire, A generalization of principal component analysis to the exponential family, in: NIPS 13, 2001.
[3] T.F. Cox, M.A.A. Cox, Multidimensional Scaling, Chapman & Hall, London, 1994.
[4] J.B. Tenenbaum, V. de Silva, J.C. Langford, A global geometric framework for nonlinear dimensionality reduction, Science 290 (2000) 2319–2323.
[5] K.Q. Weinberger, L.K. Saul, Unsupervised learning of image manifolds by semidefinite programming, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR-04), vol. 2, 2004, pp. 988–995.
[6] K.Q. Weinberger, B.D. Packer, L.K. Saul, Nonlinear dimensionality reduction by semidefinite programming and kernel matrix factorization, in: Z. Ghahramani, R. Cowell (Eds.), Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics (AISTATS-05), 2005, pp. 381–388.
[7] S.T. Roweis, L.K. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science 290 (2000) 2323–2326.
[8] L.K. Saul, S.T. Roweis, Think globally, fit locally: unsupervised learning of low dimensional manifolds, Journal of Machine Learning Research 4 (2003) 119–155.
[9] M. Belkin, P. Niyogi, Laplacian eigenmaps and spectral techniques for embedding and clustering, Advances in Neural Information Processing Systems 14 (2001) 585–591.
[10] M. Belkin, P. Niyogi, Laplacian eigenmaps for dimensionality reduction and data representation, Neural Computation 15 (2003) 1373–1396.
[11] Z.Y. Zhang, H.Y. Zha, Principal manifolds and nonlinear dimensionality reduction via tangent space alignment, SIAM Journal of Scientific Computing 26 (1) (2004) 313–338.
[12] H.F. Chen, G.F. Jiang, K. Yoshihira, Robust nonlinear dimensionality reduction for manifold learning, in: Proceedings of the 18th International Conference on Pattern Recognition (ICPR), 2006, pp. 447–450.
[13] D. de Ridder, V. Franc, Robust manifold learning, Technical Report CTU-CMP-2003-08, Center for Machine Perception, Department of Cybernetics, Faculty of Electrical Engineering, Czech Technical University, Prague, 2003, pp. 1–36.
[14] L.J.P. Maaten, E.O. Postma, H.J. Herik, Dimensionality reduction: a comparative review. http://www.cs.unimaas.nl/l.vandermaaten/dr/DR_draft.pdf
[15] H. Chang, D.Y. Yeung, Robust locally linear embedding, Pattern Recognition 39 (6) (2006) 1053–1065.
[16] J. Wang, Z.Y. Zhang, H.Y. Zha, Adaptive manifold learning, in: Advances in Neural Information Processing Systems 17, MIT Press, Cambridge, MA, 2005, pp. 1473–1480.
[17] F. Lu, S. Keles, et al., Kernel regularization and dimensionality reduction, Technical Report No. 1119, Department of Statistics, University of Wisconsin, 2006.


[18] A.N. Tikhonov, Regularization of incorrectly posed problems, Soviet Mathematics 4 (1963) 1624–1627.
[19] M. Belkin, P. Niyogi, V. Sindhwani, Manifold regularization: a geometric framework for learning from examples, Technical Report TR-2004-06, Department of Computer Science, University of Chicago. Available at: http://www.cs.uchicago.edu/research/publications/techreports/TR-2004-06
[20] B. Borchers, CSDP, a C library for semidefinite programming, Optimization Methods and Software 11 (1999) 613–623.
[21] S. Mika, B. Scholkopf, et al., Kernel PCA and de-noising in feature space, in: Advances in Neural Information Processing Systems 11, MIT Press, Cambridge, MA, 1999.
[22] S.M. LaValle, M.S. Branicky, On the relationship between classical grid search and probabilistic roadmaps, Algorithmic Foundations of Robotics V, STAR 7 (2004) 59–75.
[23] K.Q. Weinberger, F. Sha, Q. Zhu, L.K. Saul, Graph Laplacian regularization for large-scale semidefinite programming, in: B. Schoelkopf, J. Platt, T. Hofmann (Eds.), Advances in Neural Information Processing Systems 19, MIT Press, Cambridge, MA, 2007. Available at: http://www.cs.ucsd.edu/~saul/papers/graph_nips06.pdf
[24] J.H. Ham, D.D. Lee, L.K. Saul, Semisupervised alignment of manifolds, in: Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics (AISTATS-05), 2005, pp. 120–127.
[25] The Wisconsin Diagnostic Breast Cancer dataset, http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
[26] The teapot database, http://www.weinbergerweb.net/Downloads/Data.html
[27] The USPS dataset, http://www.cs.toronto.edu/roweis/data.html

About the Author—CHENPING HOU received his B.S. degree in Applied Mathematics from the National University of Defense Technology, Changsha, China, in 2004. He is now a Ph.D. candidate in the Department of Mathematics and Systems Science. He has been a visiting student at Tsinghua University since May 2008. His current research interests include dimensionality reduction, manifold learning, and semi-supervised learning.

About the Author—CHANGSHUI ZHANG received his B.S. degree in Mathematics from Peking University, China, in 1986, and his Ph.D. degree from the Department of Automation, Tsinghua University, in 1992. He is currently a professor in the Department of Automation, Tsinghua University, and an Associate Editor of the journal Pattern Recognition. His interests include artificial intelligence, image processing, pattern recognition, machine learning, evolutionary computation and complex system analysis.

About the Author—YI WU is a professor in the Department of Mathematics and Systems Science at the National University of Defense Technology in Changsha, China. He earned his bachelor's and master's degrees in applied mathematics at the National University of Defense Technology in 1981 and 1988. He worked as a visiting researcher at New York State University in 1999. His research interests include applied mathematics, statistics, and data processing.

About the Author—YUANYUAN JIAO is a Ph.D. candidate in the Department of Mathematics and Systems Science, National University of Defense Technology. Her research interests include estimation theory and its applications.