Incremental locally linear embedding


Pattern Recognition 38 (2005) 1764 – 1767 www.elsevier.com/locate/patcog

Rapid and brief communication

Olga Kouropteva∗, Oleg Okun, Matti Pietikäinen

Machine Vision Group, Infotech Oulu and Department of Electrical and Information Engineering, University of Oulu, P.O. Box 4500, FI-90014 Oulu, Finland

∗ Corresponding author. Tel.: +358 8 553 2527; fax: +358 8 553 2612.

Received 3 March 2005; accepted 14 April 2005

Abstract

The locally linear embedding (LLE) algorithm belongs to a group of manifold learning methods that not merely reduce data dimensionality but also attempt to discover the true low-dimensional structure of the data. In this paper, we propose an incremental version of LLE and experimentally demonstrate its advantages in terms of topology preservation. Moreover, compared to the original (batch) LLE, the incremental LLE needs to solve a much smaller optimization problem.

© 2005 Pattern Recognition Society. Published by Elsevier Ltd. All rights reserved.

Keywords: Dimensionality reduction; LLE; Online mapping; Topology preservation

1. Introduction

Manifold learning is a powerful tool for data mining: it discovers the structure of high-dimensional datasets and provides a better understanding of the data. A recently proposed manifold learning algorithm called locally linear embedding (LLE) [1] performs nonlinear dimensionality reduction in an unsupervised manner and obtains representative low-dimensional projections of high-dimensional data. Nevertheless, LLE is unsuitable for sequentially arriving data, since it operates in a batch mode: when new data arrive, one needs to rerun the entire algorithm on the original data augmented by the new samples. As described in Ref. [2], making LLE incremental is much more challenging than for other manifold learning algorithms.

Moreover, LLE searches for the bottom eigenvectors and eigenvalues, which are frequently ill-conditioned. Ill-conditioning means that the eigenvalues and/or eigenvectors are susceptible to small changes of the matrix for which they are computed. As a result, the problems one faces when making LLE incremental are more formidable than those for other manifold learning algorithms. In this paper we propose an incremental version of LLE (ILLE) and compare it with two known nonparametric LLE generalization procedures. Promising results are demonstrated in the experimental part.

2. Batch and incremental LLE

To project data, batch LLE performs the following steps:

(1) For each point, find its K nearest neighbors.

(2) Minimize

$$\varepsilon(W) = \sum_{i=1}^{N} \Big\| x_i - \sum_{j=1}^{N} w_{ij} x_j \Big\|^2, \qquad (1)$$

subject to $w_{ij} = 0$ if $x_j \notin \{x_{i_1}, x_{i_2}, \ldots, x_{i_K}\}$, and $\sum_{j=1}^{N} w_{ij} = 1$.


(3) Set d, the number of dimensions of the embedding space, and minimize

$$\Phi(Y) = \sum_{i=1}^{N} \Big\| y_i - \sum_{j=1}^{N} w_{ij} y_j \Big\|^2, \qquad (2)$$

subject to $\frac{1}{N} \sum_{i=1}^{N} y_i y_i^T = I$ and $\sum_{i=1}^{N} y_i = 0$.

Step 3 is equivalent to computing the bottom d + 1 eigenvectors associated with the d + 1 smallest eigenvalues of the sparse and symmetric matrix $M = (I - W)^T (I - W)$. The first eigenvector (composed of 1's), whose eigenvalue is smallest, is excluded; the remaining d eigenvectors form the embedding Y.
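For concreteness, here is a minimal NumPy/SciPy sketch of the three batch steps. It is not the authors' implementation: rows of X are data points, K and d are chosen by the caller, the function names are ours, and the small regularization of the local Gram matrices is a standard practical safeguard that the paper does not mention.

```python
import numpy as np
from scipy.linalg import eigh


def reconstruction_weights(X, neighbors, reg=1e-3):
    """Step 2: solve Eq. (1) for the weights w_ij with the sum-to-one constraint."""
    N = X.shape[0]
    W = np.zeros((N, N))
    for i in range(N):
        nb = neighbors[i]
        Z = X[nb] - X[i]                      # neighbors centered on x_i
        G = Z @ Z.T                           # K x K local Gram matrix
        G += reg * np.trace(G) * np.eye(len(nb))   # regularization (assumed, for stability)
        w = np.linalg.solve(G, np.ones(len(nb)))
        W[i, nb] = w / w.sum()                # enforce sum_j w_ij = 1
    return W


def batch_lle(X, K=12, d=2):
    """Steps 1-3 of batch LLE; rows of X are data points, rows of Y are embeddings."""
    N = X.shape[0]
    # Step 1: K nearest neighbors under the Euclidean distance.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = [np.argsort(dists[i])[1:K + 1] for i in range(N)]
    # Step 2: reconstruction weights.
    W = reconstruction_weights(X, neighbors)
    # Step 3: bottom d+1 eigenvectors of M = (I - W)^T (I - W); drop the constant one.
    M = (np.eye(N) - W).T @ (np.eye(N) - W)
    eigvals, eigvecs = eigh(M)                # eigenvalues in ascending order
    Y = eigvecs[:, 1:d + 1]                   # embedding (satisfies the constraints up to a sqrt(N) scale)
    lam = eigvals[1:d + 1]                    # kept for the incremental update sketched later
    return Y, W, M, lam
```

With these conventions $Y^T M Y = \mathrm{diag}\{\lambda_1, \ldots, \lambda_d\}$ exactly, which is the form used below when Eq. (3) is discussed.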

LLE lacks generalization to new data. This makes LLE less attractive in a dynamic environment, where a complete rerun of the algorithm becomes prohibitively expensive. There are two ways to generalize LLE: (1) derive and use a transformation between the original and projected data, or (2) solve an incremental eigenvalue problem. Below we describe two known generalization algorithms belonging to the former case and propose ILLE, which uses the latter approach.

Suppose we are given the already processed data $X = \{x_1, x_2, \ldots, x_N\}$, the corresponding projected points $Y = \{y_1, y_2, \ldots, y_N\}$, and a new point $x_{N+1} \in R^D$ sampled from the same data manifold as X. We are asked to find a new embedded point $y_{N+1} \in R^d$ corresponding to $x_{N+1}$.

Linear generalization: To obtain the new embedded coordinates, the LLE intuition is used, i.e., any nonlinear manifold can be considered locally linear. There are two variants of linear generalization (LG):

LG1: As in our earlier work, let us put the K nearest neighbors of $x_{N+1}$ and the corresponding embedded points into the matrices $X^{N+1} = \{x_{N+1}^1, x_{N+1}^2, \ldots, x_{N+1}^K\}$ and $Y^{N+1} = \{y_{N+1}^1, y_{N+1}^2, \ldots, y_{N+1}^K\}$. Under the assumption that the manifold is locally linear, the following equation is approximately true: $Y^{N+1} = Z X^{N+1}$, where Z is an unknown linear transformation matrix of size $d \times D$, which can be determined as $Z = Y^{N+1} (X^{N+1})^{-1}$. Because $X^{N+1}$ is the neighborhood of $x_{N+1}$ and LLE preserves local structures, the new projection can be found as $y_{N+1} = Z x_{N+1}$.

LG2 [1]: To find $y_{N+1}$, first the K nearest neighbors of $x_{N+1}$ are found. Then the linear weights $w_{N+1}$ that best reconstruct $x_{N+1}$ from its neighbors are computed by Eq. (1) with the sum-to-one constraint $\sum_j w_{N+1,j} = 1$. Finally, the new output is found as $y_{N+1} = \sum_j w_{N+1,j} y_j$, where the sum runs over the $y_j$'s corresponding to the K nearest neighbors of $x_{N+1}$.
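The following sketch illustrates LG1 and LG2 under the same conventions as the batch sketch above (rows of X and Y are points, x_new is a length-D vector). The function names are ours, and the pseudo-inverse stands in for the paper's $(X^{N+1})^{-1}$, since the $D \times K$ neighborhood matrix is generally not square.

```python
import numpy as np


def lg1_project(x_new, X, Y, K=12):
    """LG1: fit a local linear map Z with Y_nb ~= Z @ X_nb and apply it to x_new."""
    nb = np.argsort(np.linalg.norm(X - x_new, axis=1))[:K]
    X_nb = X[nb].T                          # D x K, columns are the neighbors
    Y_nb = Y[nb].T                          # d x K, columns are their embeddings
    Z = Y_nb @ np.linalg.pinv(X_nb)         # least-squares estimate of the d x D map
    return Z @ x_new


def lg2_project(x_new, X, Y, K=12, reg=1e-3):
    """LG2: reconstruction weights of x_new from its K neighbors (Eq. (1)),
    then the same weights applied to the embedded neighbors."""
    nb = np.argsort(np.linalg.norm(X - x_new, axis=1))[:K]
    Z = X[nb] - x_new                       # neighbors centered on the new point
    G = Z @ Z.T
    G += reg * np.trace(G) * np.eye(K)      # regularization (assumed, as in the batch sketch)
    w = np.linalg.solve(G, np.ones(K))
    w /= w.sum()                            # sum-to-one constraint
    return w @ Y[nb]                        # y_new = sum_j w_j y_j
```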

Incremental LLE: To construct the embedding, LLE searches for the bottom eigenvectors of an $N \times N$ Hermitian matrix, so one has to deal with an ill-conditioned eigenproblem. Ill-conditioning means that the eigenvalues and/or eigenvectors of a matrix are very sensitive to small perturbations of that matrix. For example, changing the matrix in norm by at most $\epsilon$ can change any eigenvalue $\lambda_i$ by at most $\epsilon$; hence, computing $\lambda_i = 10^{-5}$ to within $\pm\epsilon = 10^{-4}$ means that no leading digits of the computed $\lambda_i$ may be correct [3].

When a new data point arrives, ILLE computes the new cost matrix $M_{\mathrm{new}}$, which is exactly the same as the matrix LLE would compute for the old data augmented by the new point. This is done as follows: first, the distances between points that either belong to the K nearest neighbors of the new point or contain the new point among their own K nearest neighbors are recalculated. Then the weights of the points whose distances have changed are updated by solving Eq. (1), and the new matrix $M_{\mathrm{new}}$ of size $(N + 1) \times (N + 1)$ is assembled from these weights.

The classical eigenproblem is defined by $M y_i^T = \lambda_i y_i^T$, or in matrix form $M Y^T = Y^T \mathrm{diag}\{\lambda_1, \lambda_2, \ldots, \lambda_d\}$. Since the eigenvectors are orthonormal, the eigenproblem can be rewritten as $Y M Y^T = \mathrm{diag}\{\lambda_1, \lambda_2, \ldots, \lambda_d\}$. Without loss of generality, we assume that the eigenvalues of the new cost matrix $M_{\mathrm{new}}$ are the same as those of the cost matrix computed for N points. This assumption is justified because the eigenvalues we deal with are very close to zero, typically of order $10^{-p}$ with p around 10 in practice. Therefore, we can write $Y_{\mathrm{new}} M_{\mathrm{new}} Y_{\mathrm{new}}^T = \mathrm{diag}\{\lambda_1, \lambda_2, \ldots, \lambda_d\}$, where $\lambda_i$, $i = 1, \ldots, d$, are the smallest eigenvalues of the cost matrix computed for N points. The new coordinates are obtained by solving a $d \times d$ minimization problem:

$$\min_{Y_{\mathrm{new}}} \big( Y_{\mathrm{new}} M_{\mathrm{new}} Y_{\mathrm{new}}^T - \mathrm{diag}\{\lambda_1, \lambda_2, \ldots, \lambda_d\} \big), \qquad (3)$$

subject to the LLE constraints imposed on the embedding coordinates. Thus, the $N \times N$ problem of the third LLE step is reduced to a $d \times d$ problem with $d \ll N$. Since d is usually very small, say 10 or so, the minimization is not time consuming and can be performed for every arriving point.
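The paper specifies how $M_{\mathrm{new}}$ is assembled but not which optimizer solves Eq. (3), so the sketch below is only one plausible realization, not the authors' recipe: it keeps the previously embedded points fixed, treats only the new point's coordinates as free, and minimizes the Frobenius norm of the residual in Eq. (3) with a general-purpose optimizer. All function and variable names are ours; lam is the vector of eigenvalues returned by the batch sketch above.

```python
import numpy as np
from scipy.optimize import minimize


def ille_add_point(x_new, X, Y, W, lam, K=12, reg=1e-3):
    """One ILLE update for an arriving point x_new, given the previous data X,
    embedding Y, weight matrix W and the d smallest non-trivial eigenvalues lam."""
    X1 = np.vstack([X, x_new])
    N1 = X1.shape[0]
    # Recompute neighborhoods; only rows involving the new point actually change.
    dists = np.linalg.norm(X1[:, None, :] - X1[None, :, :], axis=-1)
    neighbors = [np.argsort(dists[i])[1:K + 1] for i in range(N1)]
    # Points whose K-neighborhood now contains x_new, plus x_new itself.
    affected = {i for i in range(N1) if (N1 - 1) in neighbors[i]} | {N1 - 1}
    W1 = np.zeros((N1, N1))
    W1[:-1, :-1] = W
    for i in affected:                        # update only the affected weights (Eq. (1))
        Z = X1[neighbors[i]] - X1[i]
        G = Z @ Z.T
        G += reg * np.trace(G) * np.eye(K)
        w = np.linalg.solve(G, np.ones(K))
        W1[i] = 0.0
        W1[i, neighbors[i]] = w / w.sum()
    M_new = (np.eye(N1) - W1).T @ (np.eye(N1) - W1)

    d = Y.shape[1]

    def residual(y_flat):
        # Frobenius norm of Y_new M_new Y_new^T - diag(lam), with the old embedding
        # held fixed and only y_{N+1} free (an assumption, not the paper's recipe).
        Y1 = np.vstack([Y, y_flat.reshape(1, d)])
        R = Y1.T @ M_new @ Y1 - np.diag(lam)
        return np.linalg.norm(R, 'fro') ** 2

    # Warm start: mean embedding of the new point's neighbors (an LG2-like guess).
    y0 = Y[np.argsort(np.linalg.norm(X - x_new, axis=1))[:K]].mean(axis=0)
    y_new = minimize(residual, y0).x
    return np.vstack([Y, y_new]), W1, M_new
```

Since the optimization variable has only d components (here d = 2), each call to `minimize` is cheap, which is the point of reducing the problem to size $d \times d$.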

3. Experiments

In the experiments we applied the two linear generalization algorithms and ILLE to the datasets listed in Table 1. All datasets were divided into training (70%) and test (30%) sets.

Table 1. Datasets used in the experiments

Data                   # of points    Dimensionality
S-curve                2000           3
Wine                   178            13
Frey faces             1965           560
MNIST digits (3&8)     1984           784
Coil-20                1440           4096
Oleg's faces           1130           6300
Olga's faces           1200           6300

[Figure 1 here: two line plots versus the number of iterations, with curves for Linear generalization 1, Linear generalization 2, and Incremental LLE; panel (a) shows Spearman's rho, panel (b) the procrustes measure.]

Fig. 1. Spearman's rho (a) and procrustes measure (b) for the wine dataset.

Table 2. Spearman's rho (Sp) and procrustes measure (Procr) for the datasets

Data            # of iterations   # of points per iteration   Measure   LG1   LG2   ILLE
S-curve         60                10                          Sp        47    12    1
                                                              Procr     43    16    1
Wine            17                3                           Sp        0     8     9
                                                              Procr     0     10    7
Frey faces      28                21                          Sp        1     8     19
                                                              Procr     0     14    14
MNIST digits    22                27                          Sp        0     0     22
                                                              Procr     0     22    0
Coil-20         27                16                          Sp        0     27    0
                                                              Procr     1     5     21
Oleg's faces    33                10                          Sp        2     31    0
                                                              Procr     16    17    0
Olga's faces    36                10                          Sp        0     6     30
                                                              Procr     14    5     17

The training sets were projected by the batch LLE onto two-dimensional spaces, and the test sets were mapped into the corresponding spaces by the generalization/incremental algorithms. To compare the quality of the projections quantitatively, we calculate two characteristics: Spearman's rho and the procrustes measure. Spearman's rho estimates the correlation of rank-order data, i.e., how well the low-dimensional projection preserves the order of the pairwise distances between the high-dimensional data points once those distances are converted to ranks; its best value is one. The procrustes measure, in turn, determines how well a linear transformation (translation, reflection, orthogonal rotation, and scaling) of the points in the projected space conforms to the points in the corresponding high-dimensional space; the smaller its value, the better the fit. Both measures are commonly used to estimate topology preservation in dimensionality reduction.

In the experiment with the wine data, 119 training points are projected by the batch LLE (K = 15) and the other 51 test points are added over 17 iterations (n = 3 points at a time) by the three methods: LG1, LG2, and ILLE.
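A sketch of the two measures is given below, under stated assumptions: the global (all-pairs) versions are computed, and since scipy.spatial.procrustes requires both matrices to have the same number of columns, the embedding is padded with zero columns (the convention MATLAB's procrustes uses when the second input has fewer dimensions). The paper does not state these implementation details; the function names are ours.

```python
import numpy as np
from scipy.stats import spearmanr
from scipy.spatial import procrustes
from scipy.spatial.distance import pdist


def spearman_rho(X_high, Y_low):
    """Rank correlation between the pairwise distances in the original space
    and in the embedding (best value: 1)."""
    rho, _ = spearmanr(pdist(X_high), pdist(Y_low))
    return rho


def procrustes_measure(X_high, Y_low):
    """Residual after the best translation/reflection/rotation/scaling of the
    embedded points onto the original ones (smaller is better)."""
    pad = np.zeros((Y_low.shape[0], X_high.shape[1] - Y_low.shape[1]))
    _, _, disparity = procrustes(X_high, np.hstack([Y_low, pad]))
    return disparity
```

In this setup, both functions would be evaluated on the accumulated test points and their projections produced by LG1, LG2, or ILLE after each iteration.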

Fig. 1 shows Spearman's rho and the procrustes measure estimated after each iteration. One can see that ILLE performs better than LG1 and LG2 as the number of new samples increases. To draw an overall conclusion, we estimate Spearman's rho and the procrustes measure after each iteration for all datasets and count the number of iterations for which a particular method outperforms the others on each measure. The number of iterations, the number of points per iteration, and the results for all datasets are listed in Table 2; the largest value in each row marks the winning method. Looking at Table 2, a number of interesting observations can be made.

• When the manifold is well and evenly sampled, and the relationships between the original and embedded datasets are close to locally linear, as in the case of the S-curve and Oleg's faces, LG1 and especially LG2 are sufficient for successful generalization of test points; this is confirmed by both Spearman's rho and the procrustes measure.


• However, if the data manifold is nonuniformly sampled and the relationships are locally nonlinear, as in the case of Olga's and Frey's faces, ILLE emerges as a clear winner, since it does not rely on linear relationships. This is again supported by both Spearman's rho and the procrustes measure.

4. Conclusion

In this paper we proposed a new method, called incremental LLE, and compared it with two linear generalization methods. The results demonstrate that ILLE is a powerful method for data whose distribution is not uniform and for which the local linearity constraints do not hold. In contrast, the linear generalization methods handle well-sampled manifolds of artificially generated data very well but may have problems with real-world datasets. One direction of future work is to compare the classification performance of the linear generalization algorithms and ILLE on different datasets.


Acknowledgements

This work was supported by the Infotech Oulu Graduate School and the Nokia Foundation.

References

[1] L.K. Saul, S.T. Roweis, Think globally, fit locally: unsupervised learning of nonlinear manifolds, J. Mach. Learning Res. 4 (2003) 119–155.
[2] Y. Bengio, J.-F. Paiement, P. Vincent, O. Delalleau, N. Le Roux, M. Ouimet, Out-of-sample extensions for LLE, Isomap, MDS, Eigenmaps, and spectral clustering, in: S. Thrun, L. Saul, B. Schölkopf (Eds.), Advances in Neural Information Processing Systems 16, MIT Press, Cambridge, MA, 2004.
[3] Z. Bai, J. Demmel, J. Dongarra, A. Ruhe, H. van der Vorst, Templates for the Solution of Algebraic Eigenvalue Problems, SIAM, Philadelphia, 2000.

About the Author—OLGA KOUROPTEVA received the Master's degree in computer science from the University of Joensuu, Finland, in 2001, and received the award for the best Finnish Master's thesis in pattern recognition in 2001 (Pattern Recognition Society of Finland). She received the Diploma degree in applied mathematics from Saint Petersburg State University, Russia, in 2002. Since June 2001 she has been working as a researcher in the Information Processing Laboratory of the University of Oulu. Her research interests include machine learning, data mining and neural networks.

About the Author—OLEG G. OKUN received his Candidate of Technical Sciences (Ph.D.) degree in 1996 from the Institute of Engineering Cybernetics, Belarusian Academy of Sciences. In 1998 he joined the Machine Vision Group of the University of Oulu, Finland, where he is currently a senior lecturer. His research interests include machine learning and data mining as well as their applications in bioinformatics and finance. He has about 50 scientific publications.

About the Author—MATTI PIETIKÄINEN received his Doctor of Technology degree in electrical engineering from the University of Oulu, Finland, in 1982. In 1981 he established the Machine Vision Group at the University of Oulu; the research results of his group have been widely exploited in industry. Currently he is Professor of Information Engineering, Scientific Director of the Infotech Oulu research center, and Leader of the Machine Vision Group at the University of Oulu. From 1980 to 1981 and from 1984 to 1985 he visited the Computer Vision Laboratory at the University of Maryland, USA. His research interests are in machine vision and image analysis; his current research focuses on texture analysis, face image analysis, learning in machine vision, and machine vision for sensing and understanding human actions. He has authored about 165 papers in international journals, books, and conference proceedings, and nearly 100 other publications or reports. He is Associate Editor of the Pattern Recognition journal and was Associate Editor of IEEE Transactions on Pattern Analysis and Machine Intelligence from 2000 to 2005. He was Chairman of the Pattern Recognition Society of Finland from 1989 to 1992. Since 1989 he has served as a member of the governing board of the International Association for Pattern Recognition (IAPR), and he became one of the founding fellows of the IAPR in 1994. He has also served on committees of several international conferences. He is a senior member of the IEEE and Vice-Chair of the IEEE Finland Section.