Relevance learning in generative topographic mapping


Neurocomputing 74 (2011) 1351–1358


Andrej Gisbrecht, Barbara Hammer
University of Bielefeld, CITEC Cluster of Excellence, Germany

Available online 22 February 2011

Abstract

Generative topographic mapping (GTM) provides a flexible statistical model for unsupervised data inspection and topographic mapping. Since it yields an explicit mapping of a low-dimensional latent space to the observation space and an explicit formula for a constrained Gaussian mixture model induced thereby, it offers diverse functionalities including clustering, dimensionality reduction, topographic mapping, and the like. However, it shares the property of most unsupervised tools that noise in the data cannot be recognized as such and, in consequence, is visualized in the map. The framework of visualization based on auxiliary information and, more specifically, the framework of learning metrics as introduced in [14,21] constitutes an elegant way to shape the metric according to the auxiliary information at hand, such that only those aspects are displayed in distance-based approaches which are relevant for a given classification task. Here we introduce the concept of relevance learning into GTM such that the metric is shaped according to auxiliary class labels. Relying on the prototype-based nature of GTM, efficient realizations of this paradigm are developed and compared on a couple of benchmarks to state-of-the-art supervised dimensionality reduction techniques.

Keywords: GTM, topographic mapping, relevance learning, supervised visualization

1. Introduction

Generative topographic mapping (GTM) has been introduced as a generative statistical model corresponding to the classical self-organizing map for unsupervised data inspection and topographic mapping [1]. An explicit statistical model has the benefit of great flexibility and easy adaptability to complex situations by means of appropriate statistical assumptions. Further, by offering an explicit mapping of the latent space to the observation space and a constrained Gaussian mixture model based thereon, GTM offers diverse functionality including visualization, clustering, topographic mapping, and various forms of data inspection. Like standard unsupervised machine learning and data inspection methods, however, GTM shares the garbage in, garbage out problem: the information inherent in the data is displayed independent of the specific user intention. Hence, if ‘garbage’ is present in the data, this noise is presented to the user, since the statistical model has no way to identify the noise as such. The domain of data visualization by means of dimensionality reduction techniques constitutes a mature field of research, with many powerful nonlinear reduction techniques as well as Matlab implementations readily available, see e.g. [27,28,16,35,12,4,22,36,26]. However, researchers in the community have started to appreciate that the inherently ill-posed problem of unsupervised data visualization and

dimensionality reduction has to be shaped according to the user's needs to arrive at optimal results. This is particularly pronounced for real-life data sets, which frequently do not allow a largely loss-free embedding into low dimensionality. Therefore, it has to be specified which parts of the available information should be preserved while embedding. On the one hand, formal evaluation measures have been developed which allow an explicit formulation and evaluation based on the desired result, see e.g. [31–33,17]. On the other hand, researchers have started to develop methods which can take auxiliary information into account. This way, the user can specify which information in the data is interesting for the current situation at hand, by means of e.g. labeled data. There exist a few classical mechanisms which take class labeling into account to reduce the data dimensionality: Feature selection constitutes one specific type of dimensionality reduction and a well-investigated research topic, with numerous proposals based on general principles such as information theory or dedicated approaches developed for specific classifiers; see e.g. [13] for an overview. However, this way, the dimensionality reduction is restricted to very simple projections onto coordinate axes. More variable albeit still linear projection methods are in the focus of several classical discriminative dimensionality reduction tools: Fisher's linear discriminant analysis (LDA) projects data such that within-class distances are minimized while between-class distances are maximized. One important restriction of LDA is given by the fact that, this way, a meaningful projection to


dimensionality at most c-1, c being the number of classes, can be obtained. Hence, for two-class problems, only a linear visualization is found. Partial least squares regression (PLS) constitutes another classical method, whose objective is to maximize the covariance of the projected data and the given auxiliary information. It is also suited for situations where the data dimensionality is larger than the number of data points; in such cases a linear projection is often sufficient and the problem is to find good regularizations to adjust the parameters accordingly. Informed projection [9] extends principal component analysis (PCA) to also minimize the sum of squared errors between data projections and the mean values of the given classes, this way achieving a compromise between dimensionality reduction and clustering in the projection space. Another technique relies on metric learning according to auxiliary class information. For a metric which corresponds to a global linear matrix transform to low dimensionality, this results in a linear discriminative projection of the data, as proposed e.g. in [11,7].

Modern techniques extend these settings to general nonlinear projections of data into low dimensionality such that the given auxiliary information is taken into account. One way to extend linear approaches to nonlinear settings is offered by kernelization. This incorporates an implicit nonlinear mapping to a high-dimensional feature space together with the linear low-dimensional mapping. It can be used for every linear approach which relies on dot products in the feature space only, such that an efficient computation is possible, as in several variants of kernel LDA [18,3]. However, it is not clear how to choose the kernel, since its form severely influences the final shape of the visualization. In addition, the method has quadratic complexity with respect to the number of data points due to its dependency on the full Gram matrix. Another principled way to extend dimensionality reduction to auxiliary information is offered by an adaptation of the underlying metric which measures similarity in the original data space, see e.g. [6,34]. The principle of learning metrics has been introduced in [20,21]: the standard Riemannian metric of the given data manifold is substituted by a form which measures the information of the data for the given classification task. The Fisher information matrix induces the local structure of this metric, and it can be expanded globally in terms of path integrals. This metric is integrated into self-organizing maps (SOM), multidimensional scaling (MDS), and a recent information theoretic model for data visualization which directly relies on the metric in the data space [20,21,33]. A drawback of the proposed method is its high computational complexity due to the dependency of the metric on path integrals or approximations thereof. A slightly different approach is taken in [10]: instead of learning the metric, an ad hoc adaptation is used which also takes the given class labeling into account. The corresponding metric induces a k-nearest neighbor graph which is shaped according to the given auxiliary information. This can directly be integrated into a supervised version of Isomap. The principle of discriminative visualization by means of a change of the metric is considered in more generality in [8]. Here, a metric induced by prototype-based matrix adaptation as introduced e.g.
in [24,25] is integrated into several popular visualization schemes including Isomap, manifold charting, locally linear embedding, etc. Alternative approaches to incorporate auxiliary information modify the cost function of dimensionality reduction tools to include the given class information. The approaches introduced in [15,19] can both be understood as extensions of stochastic neighbor embedding (SNE). SNE tries to minimize the deviation between the distributions induced by pairwise distances in the original data space and the projection space, respectively. Parametric embedding (PE) substitutes these distributions by conditional probabilities of classes given a data point, this way mapping

both data points and class centers at the same time. For this procedure, however, an assignment of data to unimodal class centers needs to be known in advance. Multiple relational embedding (MRE) incorporates several dissimilarity structures in the data space, induced e.g. by labeling, into one latent space representation. For this purpose, the differences between the distribution of each dissimilarity matrix and the distribution of an appropriate transform of the latent space are accumulated, whereby the transform is adapted during training according to the given task. The weighting of the single components is taken according to the task at hand, whereby the authors report only a mild influence of the weighting on the final outcome. It is not clear, however, how to pick the form of the transformation to take into account multimodal classes. Colored maximum variance unfolding (MVU) incorporates auxiliary information into MVU by substituting the raw data which is unfolded in MVU by the combination of the data and the covariance matrix induced by the given auxiliary information. This way, differences which should be emphasized in the visualization are weighted by the differences given by the prior labeling. Like MVU, however, the method depends on the full Gram matrix and is computationally demanding, such that approximations have to be used.

These approaches constitute promising candidates which emphasize the relevance of discriminative nonlinear dimensionality reduction. Only a few of these methods allow an easy extension to new data points or approximate inverse mappings. Further, most methods suffer from high computational costs which make them infeasible for large data sets. In this contribution, we extend GTM to the principle of learning metrics by combining the technique of relevance learning as introduced in supervised prototype-based classification schemes and the prototype-based unsupervised representation of data as provided by GTM. We propose two different ways to adapt the relevance terms, which rely on different cost functions connected to prototype-based classification of data. Unlike [2], where a separate supervised model is trained to arrive at appropriate metrics for unsupervised data visualization, we can directly integrate the metric adaptation step into GTM due to the prototype-based nature of GTM. We test the ability of the model to visualize and cluster given data sets on a couple of benchmarks. It turns out that, this way, an efficient and flexible discriminative data mining and visualization technique arises.

2. The generative topographic mapping

The GTM as introduced in [1] models data $x \in \mathbb{R}^D$ by means of a mixture of Gaussians which is induced by a lattice of points $w$ in a low-dimensional latent space which can be used for visualization. The lattice points are mapped via $w \mapsto t = y(w,W)$ to the data space, where the function is parameterized by $W$; one can, for example, pick a generalized linear regression model based on Gaussian base functions

$$ y : w \mapsto \Phi(w) \cdot W \qquad (1) $$

where the base functions $\Phi$ are equally spaced Gaussians with fixed variance $\sigma$. Every latent point induces a Gaussian

$$ p(x \mid w, W, \beta) = \left(\frac{\beta}{2\pi}\right)^{D/2} \exp\left(-\frac{\beta}{2}\,\|x - y(w,W)\|^2\right) \qquad (2) $$

with variance $\beta^{-1}$, which gives the data distribution as a mixture of $K$ modes

$$ p(x \mid W, \beta) = \sum_{k=1}^{K} p(w^k)\, p(x \mid w^k, W, \beta) \qquad (3) $$

where, usually, $p(w^k)$ is taken as a uniform distribution over the prototypes. Training of GTM optimizes the data log-likelihood

$$ \ln \prod_{n=1}^{N} \left( \sum_{k=1}^{K} p(w^k)\, p(x^n \mid w^k, W, \beta) \right) \qquad (4) $$

by means of an expectation maximization (EM) approach with respect to the parameters $W$ and $\beta$. In the E step, the responsibility of mixture component $k$ for point $n$ is determined as

$$ r_{kn} = p(w^k \mid x^n, W, \beta) = \frac{p(x^n \mid w^k, W, \beta)\, p(w^k)}{\sum_{k'} p(x^n \mid w^{k'}, W, \beta)\, p(w^{k'})} \qquad (5) $$

In the M step, the weights $W$ are determined by solving the equality

$$ \Phi^T G_{\mathrm{old}} \Phi\, W_{\mathrm{new}}^T = \Phi^T R_{\mathrm{old}} X \qquad (6) $$

where $\Phi$ refers to the matrix of base functions evaluated at the points $w^k$, $X$ to the data points, $R$ to the responsibilities, and $G$ is a diagonal matrix with accumulated responsibilities $G_{kk} = \sum_n r_{kn}(W,\beta)$. The variance can be computed by solving

$$ \frac{1}{\beta_{\mathrm{new}}} = \frac{1}{ND} \sum_{k,n} r_{kn}(W_{\mathrm{old}}, \beta_{\mathrm{old}})\, \|\Phi(w^k) W_{\mathrm{new}} - x^n\|^2 \qquad (7) $$

where $D$ is the data dimensionality and $N$ is the number of data points.
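For concreteness, one EM cycle of Eqs. (5)-(7) can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation; all names (gtm_em_step, Phi, X, W, beta) are our own.

```python
import numpy as np

def gtm_em_step(Phi, X, W, beta, prior=None):
    """One EM cycle of GTM: responsibilities (Eq. (5)), then W and beta (Eqs. (6)-(7)).

    Phi : (K, M) base functions evaluated at the K latent points w^k
    X   : (N, D) data, W : (M, D) weights (prototypes are Phi @ W), beta : inverse variance
    """
    K, (N, D) = Phi.shape[0], X.shape
    T = Phi @ W                                                   # prototypes t^k, (K, D)
    dist2 = ((T[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)    # squared distances, (K, N)
    log_p = -0.5 * beta * dist2
    prior = np.full(K, 1.0 / K) if prior is None else prior       # uniform p(w^k) by default
    log_p += np.log(prior)[:, None]
    log_p -= log_p.max(axis=0, keepdims=True)                     # numerical stabilization
    R = np.exp(log_p)
    R /= R.sum(axis=0, keepdims=True)                             # responsibilities r_kn, Eq. (5)
    # M step, Eq. (6): G is diagonal with G_kk = sum_n r_kn
    G = np.diag(R.sum(axis=1))
    W_new = np.linalg.solve(Phi.T @ G @ Phi, Phi.T @ R @ X)       # plays the role of W^T_new
    # update of the inverse variance, Eq. (7)
    dist2_new = (((Phi @ W_new)[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    beta_new = N * D / float((R * dist2_new).sum())
    return R, W_new, beta_new
```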

3. Relevance learning

The principle of relevance learning has been introduced in [14] as a particularly simple and efficient method to adapt the metric of prototype-based classifiers according to the given situation at hand. It takes into account the relevance of the data dimensions by substituting the squared Euclidean metric by the weighted form

$$ d_{\lambda}(x,t) = \sum_{d=1}^{D} \lambda_d^2\, (x_d - t_d)^2 \qquad (8) $$

In [14], the Euclidean metric is substituted by the more general form (8) and, parallel to the prototype updates, the metric parameters $\lambda$ are adapted according to the given classification task. The principle is extended in [24,25] to the more general metric form

$$ d_{\Omega}(x,t) = (x - t)^T \Omega^T \Omega\, (x - t) \qquad (9) $$
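Both parameterized distances are straightforward to implement; the following minimal sketch (names are our own, not taken from the paper) makes the difference between (8) and (9) explicit.

```python
import numpy as np

def dist_relevance(x, t, lam):
    """Weighted squared distance of Eq. (8): sum_d lambda_d^2 (x_d - t_d)^2."""
    return float(np.sum((lam ** 2) * (x - t) ** 2))

def dist_matrix(x, t, Omega):
    """Matrix distance of Eq. (9): (x - t)^T Omega^T Omega (x - t)."""
    diff = Omega @ (x - t)          # project the difference with Omega
    return float(diff @ diff)       # non-negative, since Omega^T Omega is positive semi-definite
```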

Using a square matrix $\Omega$, the matrix $\Omega^T \Omega$ is positive semi-definite, so that (9) gives rise to a valid pseudo-metric. In [24,25], these metrics are considered in local and global form, i.e. the adaptive metric parameters can be identical for the full model, or they can be attached to every prototype present in the model. Here we introduce the same principle into GTM.

3.1. Labeling of GTM

Assume that data point $x^n$ is equipped with label information $l^n$ which is an element of a finite set of different labels. By posterior labeling, GTM gives rise to a probabilistic classification of data points, assigning the label of prototype $t^k$ to data point $x^n$ with probability $r_{kn}$. Thereby, posterior labeling of GTM can be done in such a way that the classification error $\sum_{n=1}^{N}\sum_{k=1}^{K} r_{kn}\, s_k(x^n)$ is minimized, with $s_k(x^n)$ equal to zero if the prototype $t^k$ has the same label as $x^n$ and equal to one otherwise. Thus the prototype $t^k = y(w^k, W)$ is labeled

$$ c(t^k) = \operatorname{argmax}_c \left( \sum_{n \mid l^n = c} r_{kn} \right) \qquad (10) $$
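The posterior labeling of Eq. (10) amounts to accumulating responsibilities per class; a possible sketch (function and variable names are our own):

```python
import numpy as np

def label_prototypes(R, labels, n_classes):
    """Posterior labeling of Eq. (10): c(t^k) = argmax_c sum_{n : l^n = c} r_kn.

    R : (K, N) responsibilities, labels : (N,) integer labels in {0, ..., n_classes-1}.
    """
    votes = np.zeros((R.shape[0], n_classes))
    for c in range(n_classes):
        votes[:, c] = R[:, labels == c].sum(axis=1)   # responsibility mass of class c per prototype
    return votes.argmax(axis=1)                       # label assigned to every prototype t^k
```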


3.2. Metric adaptation in GTM

We can introduce relevance learning into GTM by substituting the Euclidean metric in the Gaussian functions (2) by the more general diagonal metric (8), which includes relevance terms, or by the metric induced by a full matrix (9), which can also take correlations of the dimensions into account. Thereby, we can introduce one global metric for the full model, or, alternatively, local metric parameters $\lambda^k$ or $\Omega^k$, respectively, for every prototype $t^k$. We refer to the latter version as the local method. Using this posterior labeling of the prototypes, the parameters of the GTM model should be adapted such that the data log-likelihood is optimal. Analogous to [1], it can be seen that optimization of the parameters $W$ and $\beta$ of GTM can be done in the same way as before, whereby the new metric structure (8) or (9) has to be used when computing the responsibilities (5). We assume that the metric is changed during this optimization process on a slower time scale such that the auxiliary information is mirrored in the metric parameters. Thereby, we assume quasi-stationarity of the metric parameters when performing the original EM training. A similar procedure has been used in [24] for simultaneous metric and prototype learning, and [30] provides an explanation of the extent to which this procedure is reasonable in the context of self-organizing maps. Essentially, the adaptation can be understood as an adiabatic process [5], overlaying fast parameter adaptation by EM optimization of the log-likelihood and slow metric adaptation according to the objectives as will be detailed below. This assumption is substantiated if the data log-likelihood is evaluated in a concrete learning process. Fig. 1 displays the value of the data log-likelihood before and after metric adaptation for every epoch in a typical learning process as detailed below (the adaptation concerns local full matrices adapted using the soft robust learning vector quantization cost function for the letter data set, using the parameters detailed in the experiments). Obviously, the data log-likelihood increases using metric adaptation for all but the first few epochs, in which the size of the decrease is almost negligible compared to the size of the log-likelihood.

Fig. 1. Log-likelihood (scale on the left) using the adapted metric before and after metric learning for every epoch of the adaptation of prototypes of GTM by means of EM, and the difference (scale on the right). Obviously, the difference is positive in all but the first few epochs, in which the decrease of the log-likelihood is very small as compared to its size. Hence this way of adapting the metric parameters seems reasonable.


Now the question is how to design an efficient scheme for metric learning based on the structure as provided by GTM and the given auxiliary labeling. Unlike approaches such as fuzzy-labeled SOM [23], we use a fully supervised scheme to learn metric parameters. Unlike the original framework of learning metrics [20,21], however, we make use of the prototype-based structure of the induced GTM classifier, which allows us to efficiently update local metric parameters without the necessity of a computationally costly approximation of the Riemannian metric induced by the general Fisher information. For this purpose, we introduce two different cost functions E (see below) motivated from prototype-based learning which are used to optimize the metric parameters. For metric adaptation, we simply use a stochastic gradient descent of the cost functions. Naturally, more advanced schemes would be possible, but a simple gradient descent already leads to satisfactory results, as we will demonstrate in experiments. To avoid convergence to trivial optima such as zero, we pose constraints on the metric parameters of the form $\|\lambda\| = 1$ or $\mathrm{trace}(\Omega^T \Omega) = 1$, respectively. This is achieved by normalization of the values, i.e. after every gradient step, $\lambda$ is divided by its length, and $\Omega$ is divided by the square root of $\mathrm{trace}(\Omega^T \Omega)$. Thus, a high-level description of the algorithm is possible as depicted in Table 1. Usually, we alternate between one EM step, one epoch of gradient descent, and normalization in our experiments. Since EM optimization is much faster than gradient descent, this way, we can enforce that the metric parameters are adapted on a slower time scale. Hence we can assume an approximately constant metric for the EM optimization, i.e. the EM scheme optimizes the likelihood as before. Metric adaptation takes place considering quasi-stationary states of the GTM solution due to the slower time scale.

Note that metric adaptation introduces a large number of additional parameters into the model, depending on the input dimensionality. One can raise the question whether this leads to strong overfitting of the model. We will see in experiments that this is not the case: when evaluating the clustering performance of the resulting GTM, the training error is representative of the generalization error. One can substantiate this experimental finding with a theoretical counterpart: using posterior labeling, GTM offers a prototype-based classification scheme with local adaptive metrics. This function class has a supervised counterpart: generalized matrix learning vector quantization as introduced in [24]. The worst case generalization ability of the latter class can be investigated based on classical computational learning theory. It turns out that its generalization ability does not depend on the number of parameters adapted during training; rather, large margin generalization bounds can be derived. In consequence, very good generalization ability can be proved (and experimentally observed) as detailed in [24]. Since the formal argumentation in [24] depends on the considered function class only and not on the way in which training takes place, the same generalization bounds apply to GTM with adaptive metrics as introduced here.

Now, we discuss concrete cost functions E for the metric adaptation.
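The alternation just described (and summarized in Table 1 below) can be written as a short training loop. The sketch below is illustrative only: em_step and metric_gradient are placeholders for an EM cycle that uses the general metric (8) and for the gradient of one of the cost functions E introduced in the following subsections, and all names are our own.

```python
import numpy as np

def train_relevance_gtm(X, labels, Phi, W, beta, lam, em_step, metric_gradient,
                        n_epochs=100, eta=1e-4):
    """High-level loop of Table 1 for a single global relevance vector lam.

    em_step(Phi, X, W, beta, lam) must return (R, W, beta) computed with the weighted
    metric (8); metric_gradient(x, y, prototypes, proto_labels, lam) must return
    dE/d(lam) for the chosen cost function E.  Both are placeholders.
    """
    n_classes = int(labels.max()) + 1
    for _ in range(n_epochs):
        R, W, beta = em_step(Phi, X, W, beta, lam)        # E and M step with the current metric
        # posterior labeling of the prototypes, Eq. (10)
        votes = np.zeros((Phi.shape[0], n_classes))
        for c in range(n_classes):
            votes[:, c] = R[:, labels == c].sum(axis=1)
        proto_labels = votes.argmax(axis=1)
        # one epoch of stochastic gradient descent on the metric parameters
        prototypes = Phi @ W
        for n in np.random.permutation(X.shape[0]):
            lam = lam - eta * metric_gradient(X[n], labels[n], prototypes, proto_labels, lam)
            lam = lam / np.linalg.norm(lam)               # enforce the constraint ||lam|| = 1
    return W, beta, lam
```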

3.3. Generalized relevance GTM (GRGTM)

Metric parameters have the form $\lambda$ or $\lambda^k$ for a diagonal metric (8) and $\Omega$ or $\Omega^k$ for a full matrix (9), depending on whether a local or global scheme is considered. In the following, we denote by $\Theta^k$ the general parameter which can be chosen as one of these four possibilities, depending on the given setting. Thereby, we can assume that $\Theta^k$ is realized by a matrix which has diagonal form (for relevance learning) or full matrix form (for matrix updates). The cost function of generalized relevance GTM is taken from generalized relevance learning vector quantization (GRLVQ), which can be interpreted as maximizing the hypothesis margin of a prototype-based classification scheme such as LVQ [14,24]. The cost function has the form

$$ E(\Theta) = \sum_n E^n(\Theta) = \sum_n \mathrm{sgd}\left( \frac{d_{\Theta^+}(x^n, t^+) - d_{\Theta^-}(x^n, t^-)}{d_{\Theta^+}(x^n, t^+) + d_{\Theta^-}(x^n, t^-)} \right) \qquad (11) $$

where $\mathrm{sgd}(x) = (1 + \exp(-x))^{-1}$, $t^+$ is the closest prototype in the data space with the same label as $x^n$, and $t^-$ is the closest prototype with a different label. The adaptation formulas can be derived thereof by taking the derivatives. Depending on the form of the metric, the derivative of the metric is

$$ \frac{\partial d_{\lambda}(x,t)}{\partial \lambda_i} = 2 \lambda_i (x_i - t_i)^2 \qquad (12) $$

for a diagonal metric and

$$ \frac{\partial d_{\Omega}(x,t)}{\partial \Omega_{ij}} = 2 (x_j - t_j) \sum_d \Omega_{id} (x_d - t_d) \qquad (13) $$

for a full matrix. For simplicity, we denote the respective squared distances to the closest correct and wrong prototype by $d^+ = d_{\Theta^+}(x^n, t^+)$ and $d^- = d_{\Theta^-}(x^n, t^-)$, and $\mathrm{sgd}'$ is a shorthand notation for $\mathrm{sgd}'((d^+ - d^-)/(d^+ + d^-))$. Given a data point $x^n$, the derivative of the corresponding summand of the cost function E with respect to the metric parameters yields

$$ \frac{\partial E^n}{\partial \Theta^+} = 2\, \mathrm{sgd}' \cdot \frac{d^-}{(d^+ + d^-)^2} \cdot \frac{\partial d^+}{\partial \Theta^+} \qquad (14) $$

for the parameters of the closest correct prototype and

$$ \frac{\partial E^n}{\partial \Theta^-} = -2\, \mathrm{sgd}' \cdot \frac{d^+}{(d^+ + d^-)^2} \cdot \frac{\partial d^-}{\partial \Theta^-} \qquad (15) $$

for the parameters attached to the closest wrong prototype. All other parameters are not affected. These updates take place for the local modeling of parameters, which we refer to as local generalized relevance GTM (LGRGTM) or local generalized matrix GTM (LGMGTM), respectively. If the metric parameters are global, the update corresponds to the sum of these two derivatives, referred to as generalized relevance GTM (GRGTM) or generalized matrix GTM (GMGTM), respectively.
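For the global relevance-vector case, Eqs. (12), (14) and (15) combine into a single stochastic update. The following minimal sketch (names are our own) could serve as the metric_gradient callback of the training loop above.

```python
import numpy as np

def grgtm_lambda_gradient(x, y, prototypes, proto_labels, lam):
    """Gradient of one summand E^n of Eq. (11) with respect to a global relevance vector lam."""
    d = np.array([np.sum((lam ** 2) * (x - t) ** 2) for t in prototypes])   # Eq. (8)
    correct = proto_labels == y
    kp = np.argmin(np.where(correct, d, np.inf))       # closest correct prototype t^+
    km = np.argmin(np.where(~correct, d, np.inf))      # closest wrong prototype t^-
    dp, dm = d[kp], d[km]
    mu = (dp - dm) / (dp + dm)
    sgd_prime = np.exp(-mu) / (1.0 + np.exp(-mu)) ** 2          # sgd'(mu)
    ddp = 2.0 * lam * (x - prototypes[kp]) ** 2                 # Eq. (12) at t^+
    ddm = 2.0 * lam * (x - prototypes[km]) ** 2                 # Eq. (12) at t^-
    # sum of Eqs. (14) and (15), since the metric is shared by all prototypes
    return 2.0 * sgd_prime * (dm * ddp - dp * ddm) / (dp + dm) ** 2
```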

Table 1
Integration of relevance learning into GTM.

INIT
REPEAT
  E-STEP: DETERMINE r_kn BASED ON THE GENERAL METRIC
  M-STEP: DETERMINE W AND beta AS IN GTM
  LABEL PROTOTYPES
  ADAPT METRIC PARAMETERS BY STOCHASTIC GRADIENT DESCENT OF E
  NORMALIZE THE METRIC PARAMETERS

3.4. Robust soft GTM (RSGTM)

Unlike GRLVQ, robust soft LVQ (RSLVQ) [29] has the goal to optimize a statistical model which defines the data distribution. It is assumed that data are given by a Gaussian mixture of labeled prototypes. The objective is to maximize the logarithm of the probability of a data point being generated by a prototype of the correct class versus the overall probability. In the limit of small variance of the Gaussians, a learning rule results which is similar to the standard LVQ rule. The objective for a general variance $\beta^{-1}$ of the Gaussian modes corresponds to the following cost function:

$$ E(\Theta) = \sum_n E^n(\Theta) = \sum_n \log \left( \frac{\sum_{k \mid c(t^k) = l^n} p(w^k)\, p(x^n \mid w^k, W, \beta)}{p(x^n \mid W, \beta)} \right) \qquad (16) $$

Here, we can choose the Gaussian modes as provided by GTM, i.e. the modes and corresponding mixture are given in analogy to formulas (2) and (3), where the new parameterized metric (8) or (9) as well as the labeling (10) of GTM is used. We obtain the update rules by taking the derivatives, as beforehand:

$$ \frac{\partial E^n}{\partial \Theta^k} = \left( s_k(x^n)\,(q_{kn} - r_{kn}) - (1 - s_k(x^n))\, r_{kn} \right) \cdot \left( \frac{1}{S_k} \frac{\partial S_k}{\partial \Theta^k} - \frac{\beta}{2} \frac{\partial d_{\Theta^k}(x^n, t^k)}{\partial \Theta^k} \right) \qquad (17) $$

where $s_k(x^n)$ indicates whether prototype and data label coincide,

$$ q_{kn} = \frac{p(x^n \mid w^k, W, \beta)\, p(w^k)}{\sum_{k' \mid c(t^{k'}) = l^n} p(x^n \mid w^{k'}, W, \beta)\, p(w^{k'})} \qquad (18) $$

refers to the probability of mode $k$ among the correct modes, and

$$ S_k = \left( \frac{\beta}{2\pi} \right)^{D/2} \det(\Theta^k) \qquad (19) $$

normalizes the Gaussian modes to arrive at valid probabilities. The derivative is

$$ \frac{1}{S} \cdot \frac{\partial S}{\partial \lambda_i} = \frac{1}{\lambda_i} \qquad (20) $$

for a relevance vector and

$$ \frac{1}{S} \cdot \frac{\partial S}{\partial \Omega_{ij}} = \left( \Omega^{-1} \right)_{ji} \qquad (21) $$

for full matrices.

Table 2
Parameters used for training.

  Data      Number of prototypes   Number of base functions
  Landsat   10 × 10                4 × 4
  Phoneme   10 × 10                4 × 4
  Letter    30 × 30                30 × 30

We refer to this version as local relevance robust soft GTM (LRSGTM) and local matrix robust soft GTM (LMRSGTM), respectively. The global versions can be obtained by adding the derivatives; we refer to these algorithms as relevance robust soft GTM (RSGTM) and matrix robust soft GTM (MRSGTM), respectively.
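For the local relevance-vector variant (LRSGTM), a per-sample gradient according to Eqs. (17), (18) and (20) can be sketched as follows. This is an illustrative sketch under the assumption of diagonal local metrics, with names of our own choosing; since Eq. (16) is maximized, the returned gradient would be followed uphill.

```python
import numpy as np

def rsgtm_lambda_gradient(x, y, prototypes, proto_labels, lams, beta):
    """Per-sample gradient of Eq. (17) for local relevance vectors lams[k] (metric of Eq. (8)).

    prototypes : (K, D) prototypes t^k, proto_labels : (K,) labels c(t^k),
    lams : (K, D) local relevance vectors lambda^k.
    """
    K, D = prototypes.shape
    d = np.array([np.sum((lams[k] ** 2) * (x - prototypes[k]) ** 2) for k in range(K)])
    # unnormalized mode densities with local metrics; the common (beta/2pi)^(D/2) factor cancels
    dens = np.array([np.prod(np.abs(lams[k])) for k in range(K)]) * np.exp(-0.5 * beta * d)
    r = dens / dens.sum()                               # responsibilities r_kn
    correct = proto_labels == y
    q = np.where(correct, dens, 0.0)
    q = q / q.sum()                                     # q_kn of Eq. (18)
    grads = np.zeros_like(lams, dtype=float)
    for k in range(K):
        factor = (q[k] - r[k]) if correct[k] else -r[k]            # bracket of Eq. (17)
        dS_over_S = 1.0 / lams[k]                                  # Eq. (20)
        dd = 2.0 * lams[k] * (x - prototypes[k]) ** 2              # Eq. (12)
        grads[k] = factor * (dS_over_S - 0.5 * beta * dd)          # Eq. (17)
    return grads
```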

4. Experiments

4.1. Classification

We test the efficiency of relevance learning in GTM on three benchmark data sets as described in [21,33]: Landsat Satellite data with 36 dimensions, 6 classes, and 6435 samples; Letter Recognition data with 16 dimensions, 26 classes, and 20,000 samples; and Phoneme data with 20 dimensions, 13 classes, and 3656 samples. Prior to training, all data sets were normalized by subtracting the mean and dividing by the standard deviation. GTM is initialized using the first two principal components. The mapping $y(w,W)$ is induced by generalized linear regression based on Gaussian base functions. The learning rate of the gradient descent for the metric parameters has been optimized for the data and is chosen in the range of $10^{-6}$ to $10^{-2}$. More precisely, an exhaustive search of the parameter range is done, and the value is picked for the learning rate which leads to the best convergence of the relevance profile. Thereby, the number of epochs is chosen as 100, which is sufficient to allow convergence of the matrix parameters; typically, convergence of the EM scheme can be observed on a faster scale. The number of prototypes and base functions has been chosen to suit the size of the data, as shown in Table 2. Due to the complexity of the training, an exhaustive search of these parameters has been avoided, but reasonable numbers have been chosen. Typically, the results are only mildly influenced by small changes of these numbers. The variance of the Gaussian base functions has been chosen such that it coincides with the distance between neighboring base functions. We report the results of a repeated stratified 10-fold cross-validation with one repeat (letter) and 10 repeats (phoneme, landsat), respectively, reporting also the variance over the repeats.
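The preprocessing and evaluation protocol described above can be reproduced roughly along the following lines; the sketch normalizes per training fold, uses scikit-learn only for the stratified splits, and the train_and_score callback stands in for fitting and evaluating a (relevance) GTM. It is a sketch of the protocol, not the authors' code.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def zscore(train, test):
    """Normalize by the mean and standard deviation (computed on the training fold here)."""
    mu, sd = train.mean(axis=0), train.std(axis=0)
    return (train - mu) / sd, (test - mu) / sd

def cross_validate(X, y, train_and_score, n_splits=10, seed=0):
    """Stratified k-fold protocol; train_and_score(X_tr, y_tr, X_te, y_te) returns accuracy."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for tr, te in skf.split(X, y):
        X_tr, X_te = zscore(X[tr], X[te])
        scores.append(train_and_score(X_tr, y[tr], X_te, y[te]))
    return float(np.mean(scores)), float(np.std(scores))
```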

Fig. 2. Mean accuracy of the classification obtained by diverse supervised GTM schemes as introduced in this article and alternative state-of-the-art approaches.


Fig. 3. Visualization of the result of GTM (top) and robust soft GTM with local matrix learning (bottom) on the MNIST data set. Pie charts give the responsibility of the prototypes for the given classes. Supervision achieves a better separation of the classes within receptive fields of prototypes, introducing dead units if necessary.

Fig. 4. Visualization of the result of GTM (top) and robust soft GTM with local matrix learning (bottom) on the Phoneme data set. Pie charts give the responsibility of the prototypes for the given classes. Supervision achieves a better separation of the classes within receptive fields as can be seen by the pie charts.

We evaluate the models in comparison to several recent alternative supervised visualization tools by means of the test error obtained in the cross-validation. These alternatives are taken from [33].¹ The alternative methods include parametric embedding (PE), supervised Isomap (S-Isomap), colored maximum variance unfolding (MUHSIC), multiple relational embedding (MRE), neighborhood component analysis (NCA), and the supervised neighborhood retrieval visualizer (SNeRV), based on different weightings of the retrieval objectives in the cost function of the model ($\lambda$) and a Riemannian metric based on the Fisher information matrix.

The results are shown in Fig. 2. In two of the three cases, metric adaptation improves the classification accuracy compared to simple GTM. Thereby, matrix adaptation yields superior results compared to the adaptation of a simple relevance vector. Further, results based on robust soft learning vector quantization seem slightly better for all data sets. For all three data sets, we obtain state-of-the-art results which are comparable to the best alternative supervised visualization tools which are currently available in the literature.

¹ We would like to thank the authors of [34] for providing the results.


We also report the results obtained by a five-nearest neighbor classifier for the original data in the original (high-dimensional) data space. Interestingly, in all cases, the supervised results using the full information are only slightly better than the results obtained by GTM with local matrix adaptation in two dimensions. This demonstrates the high quality of the supervised visualization. Note that, unlike the experiments from [33], which are restricted to a subset of 1500 samples in all cases due to complexity issues, we can train on the full data set due to the efficiency of relevance GTM, and, in two of the three cases, we can even perform a 10-fold repeat of the experiments in reasonable time.

4.2. Visualization

The result of a visualization of the Phoneme data set and the MNIST data set (this data set consists of 60,000 points with 784 dimensions representing the 10 digits; a subsample of 6000 images was used in this case) using robust soft GTM with local matrix adaptation is shown in Figs. 3 and 4, whereby the full data set is used for training. A comparison to simple GTM shows the ability of matrix learning to arrive at a topographic mapping which better mirrors the underlying class structures: the pie charts display the percentage of points of the different classes assigned to the respective prototype based on the receptive fields. Interestingly, in both cases, the pie charts obtained with metric learning display fewer classes per prototype, corresponding to better separated receptive fields, whereas the classes are spread among the prototypes if metric adaptation does not take place. This is also mirrored in the better classification accuracy of GTM with matrix learning. The arrangement of the classes on the map differs for the different visualizations. For metric learning, multiple modes of the classes can be observed. For standard GTM, the distribution is less clear, since the single prototypes combine different classes in their receptive fields.
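The per-prototype class distributions shown as pie charts can be computed from the receptive fields as in the following sketch (plain Euclidean receptive fields for simplicity; the adapted-metric case is analogous, and all names are our own).

```python
import numpy as np

def receptive_field_class_fractions(X, labels, prototypes, n_classes):
    """Class fractions within every prototype's receptive field, i.e. one pie chart per prototype."""
    d = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)   # (N, K) squared distances
    winner = d.argmin(axis=1)                                          # receptive field assignment
    fractions = np.zeros((prototypes.shape[0], n_classes))
    for k in range(prototypes.shape[0]):
        members = labels[winner == k]
        if members.size:                                  # empty fields ("dead units") stay zero
            fractions[k] = np.bincount(members, minlength=n_classes) / members.size
    return fractions
```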

5. Discussion

In this contribution, a method has been proposed to integrate auxiliary information in terms of relevance updates into GTM; the benefit of this approach has been demonstrated on several benchmarks. Unlike approaches such as fuzzy-labeled SOM [24], metric parameters are adapted in a supervised fashion based on the classification ability of the model. Like [22], the work is based on adaptive metrics to incorporate auxiliary information into the model. Unlike this latter work [22], however, the proposed method relies on the prototype-based nature of GTM and transfers the relevance update scheme of supervised learning schemes such as [15,25] to this setting, resulting in an efficient topographic mapping. As demonstrated on several benchmarks, the classification accuracy is competitive with state-of-the-art methods for supervised visualization, whereby GTM provides additional functionality due to the explicit topographic mapping of the latent space into the observation space, accompanied by an explicit generative statistical model. As demonstrated by means of visualization, the class separation is much more accurate for supervised GTM compared to the original method, thus clearly focusing on the aspects relevant for the given classification.

However, the evaluation as proposed in this contribution can only serve as an indicator of whether useful mappings are obtained by relevance GTM. Since data visualization is an inherently ill-posed problem, a clear evaluation by means of a single (or few) quantitative measures seems hardly satisfactory, the respective goal often being situation dependent. For a proper evaluation of the model for concrete tasks, an empirical user study would be interesting.


References

[1] C. Bishop, M. Svensen, C. Williams, The generative topographic map, Neural Computation 10 (1) (1998) 215–234.
[2] K. Bunte, B. Hammer, P. Schneider, M. Biehl, Nonlinear discriminative data visualization, in: M. Verleysen (Ed.), ESANN 2009, d-side Publishing, 2009, pp. 65–70.
[3] G. Baudat, F. Anouar, Generalized discriminant analysis using a kernel approach, Neural Computation 12 (2000) 2385–2404.
[4] M. Belkin, P. Niyogi, Laplacian eigenmaps for dimensionality reduction and data representation, Neural Computation 15 (2003) 1373–1396.
[5] M. Born, V.A. Fock, Beweis des Adiabatensatzes, Zeitschrift für Physik A Hadrons and Nuclei 51 (3–4) (1928) 165–180.
[6] K. Bunte, B. Hammer, T. Villmann, M. Biehl, A. Wismüller, Exploratory observation machine (XOM) with Kullback–Leibler divergence for dimensionality reduction and visualization, in: M. Verleysen (Ed.), ESANN 2010, d-side Publishing, 2010, pp. 87–92.
[7] K. Bunte, B. Hammer, M. Biehl, Nonlinear dimension reduction and visualization of labeled data, in: X. Jiang, N. Petkov (Eds.), International Conference on Computer Analysis of Images and Patterns, Springer, 2009, pp. 1162–1170.
[8] K. Bunte, B. Hammer, A. Wismüller, M. Biehl, Adaptive local dissimilarity measures for discriminative dimension reduction of labeled data, Neurocomputing 73 (7–9) (2010) 1074–1092.
[9] D. Cohn, Informed projections, in: S. Becker, S. Thrun, K. Obermayer (Eds.), NIPS, MIT Press, 2003, pp. 849–856.
[10] X. Geng, D.-C. Zhan, Z.-H. Zhou, Supervised nonlinear dimensionality reduction for visualization and classification, IEEE Transactions on Systems, Man, and Cybernetics, Part B 35 (6) (2005) 1098–1107.
[11] J. Goldberger, S. Roweis, G. Hinton, R. Salakhutdinov, Neighbourhood components analysis, in: Advances in Neural Information Processing Systems, vol. 17, MIT Press, 2004, pp. 513–520.
[12] A. Gorban, B. Kegl, D. Wunsch, A. Zinovyev (Eds.), Principal Manifolds for Data Visualization and Dimensionality Reduction, Springer, 2008.
[13] I. Guyon, A. Elisseeff, An introduction to variable and feature selection, Journal of Machine Learning Research 3 (2003) 1157–1182.
[14] B. Hammer, T. Villmann, Generalized relevance learning vector quantization, Neural Networks 15 (8–9) (2002) 1059–1068.
[15] T. Iwata, K. Saito, N. Ueda, S. Stromsten, T.L. Griffiths, J.B. Tenenbaum, Parametric embedding for class visualization, Neural Computation 19 (9) (2007) 2536–2556.
[16] J.A. Lee, M. Verleysen, Nonlinear Dimensionality Reduction, Springer, 2007.
[17] J.A. Lee, M. Verleysen, Quality assessment of dimensionality reduction: rank-based criteria, Neurocomputing 72 (7–9) (2009) 1431–1443.
[18] B. Ma, H. Qu, H. Wong, Kernel clustering-based discriminant analysis, Pattern Recognition 40 (1) (2007) 324–327.
[19] R. Memisevic, G. Hinton, Multiple relational embedding, in: L.K. Saul, Y. Weiss, L. Bottou (Eds.), Advances in Neural Information Processing Systems, vol. 17, MIT Press, Cambridge, MA, 2005, pp. 913–920.
[20] J. Peltonen, Data exploration with learning metrics, D.Sc. Thesis, Dissertations in Computer and Information Science, Report D7, Espoo, Finland, 2004.
[21] J. Peltonen, A. Klami, S. Kaski, Improved learning of Riemannian metrics for exploratory analysis, Neural Networks 17 (2004) 1087–1100.
[22] S.T. Roweis, L.K. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science 290 (2000) 2323–2326.
[23] F.-M. Schleif, B. Hammer, M. Kostrzewa, T. Villmann, Exploration of mass-spectrometric data in clinical proteomics using learning vector quantization methods, Briefings in Bioinformatics 9 (2) (2008) 129–143.
[24] P. Schneider, M. Biehl, B. Hammer, Adaptive relevance matrices in learning vector quantization, Neural Computation 21 (2009) 3532–3561.
[25] P. Schneider, M. Biehl, B. Hammer, Distance learning in discriminative vector quantization, Neural Computation 21 (2009) 2942–2969.
[26] J. Tenenbaum, V. de Silva, J. Langford, A global geometric framework for nonlinear dimensionality reduction, Science 290 (2000) 2319–2323.
[27] L. van der Maaten, G. Hinton, Visualizing high-dimensional data using t-SNE, Journal of Machine Learning Research 9 (2008) 2579–2605.
[28] L. van der Maaten, E. Postma, H. van den Herik, Dimensionality reduction: a comparative review, Technical Report TiCC-TR 2009-005, Tilburg University, 2009.
[29] S. Seo, K. Obermayer, Soft learning vector quantization, Neural Computation 15 (7) (2003) 1589–1604.
[30] A. Spitzner, D. Polani, Order parameters for self-organizing maps, in: L. Niklasson, M. Boden, T. Ziemke (Eds.), Proceedings of the 8th International Conference on Artificial Neural Networks (ICANN 98), vol. 2, Springer, 1998, pp. 517–522.
[31] J. Venna, Dimensionality reduction for visual exploration of similarity structures, Ph.D. Thesis, Helsinki University of Technology, Espoo, Finland, 2007.
[32] J. Venna, S. Kaski, Local multidimensional scaling, Neural Networks 19 (2006) 89–99.
[33] J. Venna, J. Peltonen, K. Nybo, H. Aidos, S. Kaski, Information retrieval perspective to nonlinear dimensionality reduction for data visualization, Journal of Machine Learning Research 11 (2010) 451–490.
[34] T. Villmann, B. Hammer, F.-M. Schleif, T. Geweniger, W. Herrmann, Fuzzy classification by fuzzy labeled neural gas, Neural Networks 19 (2006) 772–779.
[35] M. Ward, G. Grinstein, D.A. Keim, Interactive Data Visualization: Foundations, Techniques, and Applications, A.K. Peters, Ltd., 2010.
[36] K.Q. Weinberger, L.K. Saul, An introduction to nonlinear dimensionality reduction by maximum variance unfolding, in: Proceedings of the 21st National Conference on Artificial Intelligence, AAAI, 2006.

Andrej Gisbrecht received his Diploma in Computer Science in 2009 from the Clausthal University of Technology, Germany, and continued there as a Ph.D. student. Since early 2010 he has been a Ph.D. student at the Cognitive Interaction Technology Center of Excellence at Bielefeld University, Germany.

Barbara Hammer received her Ph.D. in Computer Science in 1995 and her venia legendi in Computer Science in 2003, both from the University of Osnabrueck, Germany. From 2000 to 2004, she led the junior research group 'Learning with Neural Methods on Structured Data' at the University of Osnabrueck before accepting an offer as Professor for Theoretical Computer Science at Clausthal University of Technology, Germany, in 2004. Since 2010, she has held a professorship for Theoretical Computer Science for Cognitive Systems at the CITEC Cluster of Excellence at Bielefeld University, Germany. Several research stays have taken her to Italy, the UK, India, France, the Netherlands, and the USA. Her areas of expertise include hybrid systems, self-organizing maps, clustering, and recurrent networks, as well as applications in bioinformatics, industrial process monitoring, and cognitive science.