Extending metric multidimensional scaling with Bregman divergences

Pattern Recognition 44 (2011) 1137–1154. doi:10.1016/j.patcog.2010.11.013

Jigang Sun, Malcolm Crowe, Colin Fyfe. University of the West of Scotland, Paisley, UK


Abstract

Article history: received 10 May 2010; received in revised form 15 November 2010; accepted 16 November 2010.

The sum of weighted squared distance errors has been a popular way of defining the stress function for metric multidimensional scaling (MMDS), as in the Sammon mapping. In this paper we generalise this popular form of MMDS with Bregman divergences. As an example, we show that the Sammon mapping can be thought of as a truncated Bregman MMDS (BMMDS); we show that the full BMMDS improves upon the Sammon mapping on some standard data sets, and we investigate the reasons underlying this improvement. We then extend a well known family of MMDS, which deploys a strategy of focusing on small distances, with BMMDS, and investigate the limitations of that strategy empirically. An opposite strategy is then introduced to create another family of BMMDS that gives increasing mapping quality. A data preprocessing method and a distance matrix preprocessing are also introduced.

Keywords: multidimensional scaling; Sammon mapping; Bregman divergence; distance matrix preprocessing; strategy focusing on small distances; stress function definition

1. Introduction

The quantity and dimensionality of data captured automatically has been growing exponentially over the last decades. We are now in a world in which the extraction of information from data is of paramount importance. One means of doing this is to project the data onto a low dimensional manifold and allow a human investigator to search for patterns in this projection by eye. This is obviously based on the assumption that the low dimensional projection can capture relevant features of the high dimensional data set. The earliest methods for this type of analysis were principal component analysis (PCA) [8] and factor analysis [5]; the former gives the linear projection which is closest to the original data, while the latter tries to capture specific relevant features of the data set, in which 'relevant' often has to accord with some human intuition. However, such projections are linear and there is no a priori reason that a high dimensional data set should lie on a linear subspace. Fortunately, high dimensional data sets often lie on a (non-linear) embedded manifold, where a manifold is generally thought of as a topological space which is locally Euclidean. Finding such embedded manifolds is generally not a simple closed-form process, unlike finding linear subspaces. The alternatives are to find the manifold and then project the data onto it, or to attempt to ascertain the projections of the data directly.


Metric multidimensional scaling (MMDS) [2] is a group of visualisation and dimensionality reduction methods that projects data points from a high dimensional data space to a low dimensional latent space in which the structure of the original data set can be identified by eye. It creates the low dimensional representation by moving the latent points around so that the distances between latent points stay as close as possible to the distances between the original data points, which makes it good at maintaining the global structure of the manifold. A well known example is the Sammon mapping [11], which has been one of the most successful nonlinear MMDS methods since its advent in 1969. It improves the quality of scaling by focusing on small distances in data space: explicitly, the Sammon mapping tries to capture the local structure of the data while paying less attention to the global structure. There exist newer methods such as locally linear embedding (LLE) [16], which projects data points from high dimensional space to low dimensional space by assuming that a data point is a linear combination of its near neighbours in data space and that this linearity still holds in the low dimensional space, and the self-organising map (SOM) [12], an artificial neural network that uses unsupervised competitive learning to produce a low dimensional discretised representation of the input space. However, traditional methods are still of research interest and their performance continues to be improved. For example, inspired by Isomap's [19] change of metric from Euclidean distance to graph distance, metric replacement has been tried on the Sammon mapping by GeoNLM [20]. But many articles concentrate on improving the optimisation algorithm, such as SAMANN [10], a neural network implementation of the Sammon mapping, triangulation [14], and an algorithm speed-up method [15]. By contrast, modification of the stress function is rarely studied.



This paper greatly extends the work discussed in [18] in the following respects. The underlying reasons why BMMDS outperforms the corresponding MMDS methods are discussed in detail via the example of extending the Sammon mapping with Bregman divergences. Visualisation quality is assessed in terms of neighbourhood preservation by an objective quantitative measure taken from the literature, together with an analysis of distance change. The limitations of the first group of methods are noted and the advantages of a second group are explained. More examples of the performance of the proposed methods on further standard data sets are demonstrated. Specifically, in the remainder of this section, MMDS and Bregman divergences are briefly reviewed. In Section 2, MMDS is generalised to BMMDS: as an example, the classical Sammon mapping is extended to a Bregmanised version, and a comparative study is given in Section 3, following which in Section 4 a criterion for base convex functions for Bregman divergences is proposed. In Section 5, a group of classical MMDS methods that preserve local distances is generalised with Bregman divergences. In Section 6 a novel method that preserves local distances by an opposite strategy, overcoming the limitations of the first group, is proposed. Finally Section 7 concludes, with future work being suggested.

Fig. 1. The divergence is the difference between F(p) and the value of F(q) + (p - q)∇F(q).

1.1. Stress function of MMDS


MMDS keeps inter-point Euclidean distances in latent space as close as possible to the original distances in data space via optimisation of a stress function. We will call the space into which the data are projected the latent space. MMDS is a group of methods that creates one latent point for every data sample. If we use D_ij to represent the distance between points X_i and X_j in data space, L_ij to represent the distance in latent space between the corresponding mapped points Y_i and Y_j, and N is the size of the data set, the stress functions of MMDS can be generalised [2, p. 171] as

E_{MMDS}(Y) = \frac{1}{C} \sum_{i=1}^{N} \sum_{j=i+1}^{N} (L_{ij} - D_{ij})^2 w(D_{ij}),    (1)

where C is a normalisation scalar which does not affect the embedding, and the weight w(D_ij) ≥ 0. Usually w(D_ij) = 0 for a missing sample (D_ij unknown). Examples are:

Example 1. If we set C = 1 and w(D_ij) = 1, Eq. (1) becomes linear MMDS,

E_{LMMDS}(Y) = \sum_{i=1}^{N} \sum_{j=i+1}^{N} (L_{ij} - D_{ij})^2.    (2)

Example 2. If we set C = \sum_{i=1}^{N} \sum_{j=i+1}^{N} D_{ij} and w(D_ij) = 1/D_ij, Eq. (1) becomes the nonlinear Sammon mapping [11],

E'_{Sammon}(Y) = \frac{1}{\sum_{i=1}^{N}\sum_{j=i+1}^{N} D_{ij}} \sum_{i=1}^{N} \sum_{j=i+1}^{N} \frac{(L_{ij} - D_{ij})^2}{D_{ij}}.    (3)
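As a concrete illustration of (1)–(3), the following minimal Python sketch (our own, not code from the paper) computes the Sammon stress (3) from a data-space distance matrix D and a latent-space distance matrix L:

```python
import numpy as np

def sammon_stress(D, L):
    """Sammon stress (3): squared distance errors weighted by 1/D_ij,
    normalised by the sum of the data-space distances."""
    iu = np.triu_indices_from(D, k=1)        # use each pair (i, j), j > i, once
    d, l = D[iu], L[iu]
    return np.sum((l - d) ** 2 / d) / np.sum(d)

# toy example: three points on a line and a slightly perturbed latent copy
X = np.array([[0.0], [1.0], [3.0]])
Y = np.array([[0.0], [1.1], [2.8]])
D = np.abs(X - X.T)
L = np.abs(Y - Y.T)
print(sammon_stress(D, L))
```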


1.2. Bregman divergences

Consider a strictly convex function F: S → R defined on a convex set S ⊂ R^d (R^d denotes d-dimensional real vector space). A Bregman divergence [1] between two points p and q ∈ S is defined to be

d_F(p, q) = F(p) - F(q) - \langle (p - q), \nabla F(q) \rangle,    (4)

where the angled brackets indicate an inner product and ∇F(q) is the derivative of F evaluated at q. Thus it can be used to 'measure' the convexity of F. Fig. 1 illustrates how the Bregman divergence is the difference between F(p) and the value which would be reached at p by a linear extrapolation from q with slope ∇F(q).

Example 1. The squared Euclidean distance is a special case of the Bregman divergence in which F(·) = ||·||², the squared Euclidean norm:

d_F(p, q) = \|p\|^2 - \|q\|^2 - \langle p - q, \nabla F(q) \rangle = \|p\|^2 - \|q\|^2 - \langle p - q, 2q \rangle = \|p - q\|^2.

Example 2. The Itakura–Saito divergence [1], based on the convex function F(x) = -log(x), x > 0,

IS(p, q) = \frac{p}{q} - \log\frac{p}{q} - 1,    (5)

will be used to improve MMDS in Section 5.

A Bregman divergence can be viewed as the difference between F(p) and its truncated Taylor series expansion around q. In fact, when F is a function of one variable, the divergence (4) is expressed as d_F(p, q) = F(p) - F(q) - (dF/dq)(p - q). This d_F(p, q) is equal to the higher order terms

\frac{d^2 F}{dq^2}\frac{(p-q)^2}{2!} + \frac{d^3 F}{dq^3}\frac{(p-q)^3}{3!} + \frac{d^4 F}{dq^4}\frac{(p-q)^4}{4!} + \cdots.

Bregman divergences have the following useful properties: nonnegativity, d_F(p, q) ≥ 0 with d_F(p, q) = 0 if and only if p = q, and nonsymmetry, d_F(p, q) ≠ d_F(q, p), except in special cases such as F(·) = ||·||², the Euclidean norm.

2. Linking MMDS and Bregman divergences

We could consider using Bregman divergences in either data space or latent space, as shown in [7]; however, in this paper we investigate using Bregman divergences between L_ij and D_ij instead of the squared distance between them which we used above. We will call this a Bregmanised MMDS (BMMDS), whose stress function is a sum of divergences:

E_{BMMDS}(Y) = \sum_{i=1}^{N} \sum_{j=i+1}^{N} d_F(L_{ij}, D_{ij}) = \sum_{i=1}^{N} \sum_{j=i+1}^{N} ( F(L_{ij}) - F(D_{ij}) - (L_{ij} - D_{ij}) \nabla F(D_{ij}) ).    (6)
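A minimal sketch (our own illustration, not code from the paper) of the one-variable divergence used in (6), together with the squared-Euclidean special case of (4):

```python
import numpy as np

def bregman(F, dF, p, q):
    """One-variable Bregman divergence (4): F(p) - F(q) - (p - q) * F'(q)."""
    return F(p) - F(q) - (p - q) * dF(q)

# F(x) = x**2 recovers the squared Euclidean distance: here (3 - 1)**2 = 4
print(bregman(lambda x: x ** 2, lambda x: 2 * x, 3.0, 1.0))

def bmmds_stress(F, dF, D, L):
    """BMMDS stress (6): sum of divergences between latent and data distances."""
    iu = np.triu_indices_from(D, k=1)
    return np.sum(bregman(F, dF, L[iu], D[iu]))
```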

This is the standard representation of Bregman divergences, but in this paper we note that we may alternatively express this as the tail of the Taylor series expansion of F(L_ij) about F(D_ij), and so (6) is equivalent to

E_{BMMDS}(Y) = \sum_{i=1}^{N} \sum_{j=i+1}^{N} \left( \frac{d^2 F(D_{ij})}{dD_{ij}^2} \frac{(L_{ij} - D_{ij})^2}{2!} + \frac{d^3 F(D_{ij})}{dD_{ij}^3} \frac{(L_{ij} - D_{ij})^3}{3!} + \cdots + \frac{d^n F(D_{ij})}{dD_{ij}^n} \frac{(L_{ij} - D_{ij})^n}{n!} + \cdots \right).    (7)

We now show that linear MMDS is an MDS method using a Bregman divergence, and that the Sammon mapping can be seen as an approximation to MMDS using a Bregman divergence but using only the first term of (7).

2.1. Linear MMDS

When F(x) = x², x > 0, Eq. (7) is identical to Eq. (2) with C = 2, i.e. trivially, linear MMDS can be viewed as using a Bregman divergence with the underlying convex function F(x) = x². Of course, it is well known that the solution of the linear MMDS objective is to locate the latent points at the projections of the data points onto their first principal components, i.e. minimising the divergence (in this case Euclidean distance) between the data points and their projections, so what we are really doing here is to restate an accepted equivalence in terms of Bregman divergences.

2.2. The Sammon mapping

We define the base convex function

F(x) = x \log x,  x ∈ R_{++}.    (8)

Given dF(x)/dx = log x + 1, the ExtendedSammon mapping (the naming is explained shortly) is defined as

E_{ExtSammon}(Y) = \sum_{i=1}^{N} \sum_{j=i+1}^{N} ( L_{ij}\log L_{ij} - D_{ij}\log D_{ij} - (L_{ij} - D_{ij})(\log D_{ij} + 1) ) = \sum_{i=1}^{N} \sum_{j=i+1}^{N} \left( L_{ij} \log\frac{L_{ij}}{D_{ij}} - L_{ij} + D_{ij} \right).    (9)

Given the higher order derivatives of the base function, which are shown in Table 1, by (7), (9) is equivalent to

E_{ExtSammon}(Y) = \sum_{i=1}^{N} \sum_{j=i+1}^{N} \left( \frac{(L_{ij} - D_{ij})^2}{2! D_{ij}} - \frac{(L_{ij} - D_{ij})^3}{3! D_{ij}^2} + \cdots + \frac{(-1)^n (n-2)! (L_{ij} - D_{ij})^n}{n! D_{ij}^{n-1}} + \cdots \right).

In order to compare the ExtendedSammon mapping with the Sammon mapping, we re-express Eq. (3) as E'_{Sammon}(Y) = (2 / \sum_{i=1}^{N}\sum_{j=i+1}^{N} D_{ij}) \sum_{i=1}^{N}\sum_{j=i+1}^{N} (L_{ij} - D_{ij})^2 / (2 D_{ij}) = (2 / \sum_{i=1}^{N}\sum_{j=i+1}^{N} D_{ij}) \sum_{i=1}^{N}\sum_{j=i+1}^{N} T_{ij}^{Sammon}, where T_{ij}^{Sammon} = (L_{ij} - D_{ij})^2 / (2! D_{ij}). The constant scalar is only for normalisation purposes, so we therefore concentrate on

E_{Sammon}(Y) = \sum_{i=1}^{N} \sum_{j=i+1}^{N} T_{ij}^{Sammon}.    (10)

The ExtendedSammon mapping, by contrast, takes into account all of the terms,

E_{ExtSammon}(Y) = \sum_{i=1}^{N} \sum_{j=i+1}^{N} \left( T_{ij}^{Sammon} - \frac{(L_{ij} - D_{ij})^3}{3! D_{ij}^2} + \cdots + \frac{(-1)^n (n-2)! (L_{ij} - D_{ij})^n}{n! D_{ij}^{n-1}} + \cdots \right),    (11)

whose first term is the Sammon mapping; i.e. the Sammon mapping can be viewed as an approximation to the ExtendedSammon mapping. However, since the higher order terms do not vanish, the question arises as to whether, by utilising (9) and thereby implicitly incorporating the higher order terms, we gain anything.
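A small numerical check (our own sketch, not from the paper) of the relationship between the Sammon term (10) and the full per-pair divergence of (9)/(11): for L_ij < D_ij the ExtendedSammon term is the larger of the two, for L_ij > D_ij it is the smaller, which is the asymmetry analysed in Section 3.3.2.

```python
import numpy as np

def t_sammon(L, D):
    """First term of (11): (L - D)**2 / (2! * D)."""
    return (L - D) ** 2 / (2.0 * D)

def t_extsammon(L, D):
    """Full per-pair divergence of (9): L*log(L/D) - L + D."""
    return L * np.log(L / D) - L + D

for L, D in [(0.4, 0.5), (0.6, 0.5)]:   # latent distance below / above the data distance
    print(L, D, t_sammon(L, D), t_extsammon(L, D))
```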

Table 1. The higher order derivatives of the base function (8), F(x) = x log x:
d²F(x)/dx² = 1/x > 0;  d³F(x)/dx³ = −1/x² < 0;  d⁴F(x)/dx⁴ = 2!/x³ > 0;  d⁵F(x)/dx⁵ = −3!/x⁴ < 0;  d⁶F(x)/dx⁶ = 4!/x⁵ > 0;  …;  dⁿF(x)/dxⁿ = (−1)ⁿ(n−2)!/x^{n−1}.

3. The ExtendedSammon mapping

We will show the advantages of the ExtendedSammon mapping by experiments on some standard data sets. The expression of the stress function of the ExtendedSammon mapping used in the implementation is (9); however, (11) is useful for analytical purposes.

3.1. Implementation of stress optimisation

The first standard data set on which the method is tested is the Swiss roll data set, shown in Fig. 2(a). We use this data set because [13] has extensive comparisons of the mappings of this data set given by e.g. Isomap [19], locally linear embedding [16], the self-organising map [12], curvilinear component analysis [9], etc. The method is implemented using the standard Newton gradient descent method. A high dimensional data set X is projected onto a P-dimensional output Y, usually with P = 2. Y_ki stands for the k-th dimension of the i-th output point and is updated as

Y_{ki} ← Y_{ki} − α ∂E_{ExtSammon}(Y)/∂Y_{ki},   k = 1, 2, …, P,

where α is a factor that reduces the stress in each iteration. The implementation is based on Cawley's implementation from http://theoval.cmp.uea.ac.uk/matlab/sammon/sammon.m. The first order derivative used for stress minimisation is given in Appendix A. If the size of the input data set is n, the runtime complexity is O(n²). If the data set is large, preprocessing has to be performed before the experiment to reduce the size of the input. Later we will see that in (11) T_ij^Sammon is the dominating term. It focuses on short distances; if the short distances are too small, not only will the optimisation speed be slowed down, but the projection quality is also affected. Our method is to remove data points in the densest regions so that the shortest distances are not too small. The algorithm is as follows:

1. choose a small radius ε;
2. traverse all data points to find the point i with the greatest number of neighbours within its ε-ball;
3. remove point i from the data set;
4. go to step 2 until every data point has zero neighbours within its ε-ball.
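A sketch of the density preprocessing above (our own Python illustration; the paper does not give code), assuming a NumPy array X of data points:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def density_thin(X, eps):
    """Repeatedly delete the point with the most neighbours inside its
    eps-ball until no point has a neighbour closer than eps (steps 1-4)."""
    D = squareform(pdist(X))
    keep = list(range(len(X)))
    while True:
        counts = [np.sum(D[i, keep] < eps) - 1 for i in keep]   # -1 removes the point itself
        if max(counts) == 0:
            return X[keep]
        keep.pop(int(np.argmax(counts)))

# e.g. X = np.random.rand(500, 3); X_thinned = density_thin(X, 0.03)
```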

The size of the original Swiss roll is 500 points. The smallest distance is 0.0047. We set ε = 0.03; 416 data points are left after the density preprocessing. Then the distance matrix is divided by the mean distance. Dividing the distance matrix by a constant does not change the neighbourhood relationships of the data points in data space, but it does affect the convergence speed of the optimisation and, when the typical Newton optimisation method is used, the scaling results. The average time used is listed in Table 4. Fig. 2 displays the two dimensional projections produced by the Sammon and ExtendedSammon mappings. We see that both mappings capture the curve.

Fig. 2. Swiss roll and its projections by the Sammon and ExtendedSammon mappings. (a) Random Swiss roll. (b) Visualised by the Sammon mapping. (c) Visualised by the ExtendedSammon mapping. (d) Visualisation quality measured by LCMC.

The ExtendedSammon mapping, however, does rather better in differentiating points which lie at the same X, Y coordinates but different Z (vertical) coordinates in Fig. 2(a). This subjective comparison needs to be augmented quantitatively and with more analysis, which hopefully will reveal why one projection is preferable to another.

3.2. Quantitative visualisation quality measure

The question arises as to how to quantitatively assess the quality of the mapping which each method finds: clearly, if we used (10), we would expect the Sammon mapping to do best since it is designed to optimise this criterion, and similarly for the other criteria. Therefore we select a criterion which appeals intuitively but is not part of the criterion which each method is designed to optimise. In words, we wish the mapping to have the quality that local points in data space should remain local points in latent space, while more distant points should not become neighbours in latent space. Fig. 2(d) assesses mapping quality by the local continuity meta-criterion (LCMC) [4], which is defined to be

LCMC(K) = \frac{1}{NK} \sum_{i=1}^{N} \left( C(N_{data}^{K}(i) \cap N_{latent}^{K}(i)) - \frac{K^2}{N-1} \right),

where C(·) denotes set cardinality, and N_{data}^{K}(i) and N_{latent}^{K}(i) stand for the K nearest neighbours of point i in data space and latent space respectively.
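A minimal sketch of how LCMC(K) can be computed from the two distance matrices (our own illustration; it assumes the neighbour sets exclude the point itself):

```python
import numpy as np

def lcmc(D, L, K):
    """LCMC(K): mean overlap of K-nearest-neighbour sets in data space (D)
    and latent space (L), with the K**2/(N-1) baseline subtracted."""
    N = len(D)
    total = 0.0
    for i in range(N):
        nd = set(np.argsort(D[i])[1:K + 1])      # skip index 0: the point itself
        nl = set(np.argsort(L[i])[1:K + 1])
        total += len(nd & nl) - K ** 2 / (N - 1.0)
    return total / (N * K)
```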

The maximum of C(N_{data}^{K}(i) ∩ N_{latent}^{K}(i)) is K, so the coefficient 1/N averages over all points and 1/K normalises the measure to lie between 0 and 1. Since (1/NK) Σ_{i=1}^{N} C(N_{data}^{K}(i) ∩ N_{latent}^{K}(i)) is 1 when K = N − 1, the term K²/(N−1) is included as a baseline. The higher the LCMC, the larger the intersection of the neighbourhoods of point i in data space and latent space. The LCMC in Fig. 2(d) shows quantitatively that the ExtendedSammon mapping has significantly enhanced the scaling over the Sammon mapping, particularly when we consider a small number of very near neighbours.

3.3. How the projection works

We first look at the empirical changes to distances brought by the ExtendedSammon mapping, then investigate the underlying causes contributed by the higher order terms in (12).

3.3.1. Changes to projected distances

Fig. 3 shows graphs giving intuitions about the projections made by the Sammon and ExtendedSammon mappings. In Fig. 3(a) the distances D_ij in data space are quantised into 40 sets D_h, h = 1, 2, …, 40. D̄_h, the mean of D_h, is represented on the horizontal axis.


Fig. 3. Analysis of the projection of the Swiss roll by the Sammon and ExtendedSammon mappings. (a) Distance projection. (b) Relative standard deviations of distance projection. (c) Stress comparison between the Sammon and ExtendedSammon mappings. (d) Distance distribution of the Swiss roll data set. (e) Stress contributions from terms of the ExtendedSammon mapping. (f) The ExtendedSammon mapping reassesses the output of the Sammon mapping and reconfigures data points to reduce stress.

The vertical axis shows the mean of L_ij/D_ij, which indicates relative distance change; it is more sensitive than the absolute change L_ij − D_ij. It can be seen that distances with D_ij > 0.57 are changed less in both mappings, while those with D_ij < 0.57 are reduced the most.


Although it is well known that the Sammon mapping puts more stress on local distances and less stress on long distances, small distances are not projected close to their original values as expected. They are actually over-compressed, whereas long distances are preserved better. The reason for this phenomenon is that, during optimisation, short distances and long distances can be considered to be competing for the opportunity to get as close to their original values as possible. Although each individual short distance has a stronger effect than a longer one, the number of distances longer than 0.57 greatly exceeds the number of short ones, as shown in Fig. 3(d), the histogram of the distances in the original data space. So the net effect is that the long distances are constrained as shown. Nevertheless, short distances mapped by the ExtendedSammon mapping are closer to their counterpart distances in data space than those mapped by the Sammon mapping, which means that local distance preservation is improved. The long distances mapped by the ExtendedSammon mapping are slightly longer than those mapped by the Sammon mapping. In Fig. 3(b) we can see that the relative standard deviation achieved by the ExtendedSammon mapping on small distances (D_ij < 0.57) is smaller than that of the Sammon mapping, which means the control of small distances by the ExtendedSammon mapping is better than by the Sammon mapping; as a trade-off, more freedom is given to large distances (D_ij > 0.57). We now develop a further analysis of the stress function of the ExtendedSammon mapping in contrast to the Sammon mapping in terms of stress formation.

3.3.2. Changes to stress

The stress function of the ExtendedSammon mapping, Eq. (11), can be further expressed as

E_{ExtSammon}(Y) = \sum_{i=1}^{N} \sum_{j=i+1}^{N} \left( T_{ij}^{Sammon} - \frac{(L_{ij}-D_{ij})^3}{3! D_{ij}^2} + \frac{2!(L_{ij}-D_{ij})^4}{4! D_{ij}^3} - \frac{3!(L_{ij}-D_{ij})^5}{5! D_{ij}^4} + \frac{4!(L_{ij}-D_{ij})^6}{6! D_{ij}^5} - \cdots \right) = \sum_{i=1}^{N} \sum_{j=i+1}^{N} ( T_{ij}^{Sammon} + T_{ij}^{3} + T_{ij}^{4} + T_{ij}^{5} + T_{ij}^{6} + \cdots ) = \sum_{i=1}^{N} \sum_{j=i+1}^{N} T_{ij}^{ExtSammon},    (12)

where T_ij^3 = −(L_ij−D_ij)³/3!D_ij², T_ij^4 = 2!(L_ij−D_ij)⁴/4!D_ij³, T_ij^5 = −3!(L_ij−D_ij)⁵/5!D_ij⁴, T_ij^6 = 4!(L_ij−D_ij)⁶/6!D_ij⁵, and we denote T_ij^ExtSammon = T_ij^Sammon + T_ij^3 + T_ij^4 + T_ij^5 + T_ij^6 + ⋯. The higher order terms, in addition to T_ij^Sammon, add extra unfolding power to the mapping, so the ExtendedSammon mapping should always outperform the Sammon mapping.

As shown empirically above with the Sammon mapping, small distances are projected on average shorter than the originals, i.e. L̄_ij < D̄_ij, which means that in (12) T_ij^3, T_ij^4, T_ij^5, T_ij^6, … are all positive, so T_ij^ExtSammon > T_ij^Sammon. Thus the Sammon mapping is discarding a sum of non-negative terms which would tend to increase the stress function. However, long distances projected by the Sammon mapping are on average longer, i.e. L̄_ij > D̄_ij, which means that T_ij^3, T_ij^5, …, T_ij^{2n+1}, … are negative. Since (12) is a sum of divergences of limited value, |T_ij^3| > |T_ij^4| and |T_ij^5| > |T_ij^6|, thus T_ij^3 + T_ij^4 < 0 and T_ij^5 + T_ij^6 < 0, …, so T_ij^ExtSammon < T_ij^Sammon; thus the Sammon mapping is discarding a series of terms which would tend to decrease the stress function. The overall effect is shown in Fig. 3(c): the Sammon stress is symmetric with respect to L_ij = D_ij, whereas the ExtendedSammon stress is not. For L_ij < D_ij the stress of the ExtendedSammon mapping is higher than that of the Sammon mapping, while the opposite holds for L_ij > D_ij. If we imagine the graph of the stress as a bowl into which distances fall like balls then, on average, for the Sammon mapping the balls bounce equally on either side, but the ExtendedSammon mapping tends to bounce balls rightwards, i.e. distances are likely to be projected longer.

We now show, using the example of the Swiss roll, how the terms in (12) other than T_ij^Sammon improve projection quality by further tuning local and longer distances. In Fig. 3(e) we can see the contribution to the stress among the terms in (12). T_ij^Sammon is the strongest term. The terms T_ij^3, T_ij^4 and T_ij^5 make more contribution to small distances (D̄_h ≤ 0.57) but are less important for longer distances (D̄_h > 0.57). We initialise the output of the ExtendedSammon mapping as the actual configuration found by the Sammon mapping. At the start of the stress optimisation performed by the ExtendedSammon mapping, in Fig. 3(f), the ExtendedSammon mapping calculates the stress using its own stress function rather than the Sammon mapping's. For D̄_h ≤ 0.57 the mean stresses evaluated using the stress function of the ExtendedSammon mapping are much higher than the stresses evaluated by the Sammon mapping, especially for short distances D̄_h around 0.4. In other words, the ExtendedSammon mapping reassesses the projection of the Sammon mapping using its own stress criterion and then adjusts the latent points of the output projected by the Sammon mapping. The result is that the short distances found by the Sammon mapping become less short, to reduce the stresses, and correspondingly long distances are stretched. The actual mapping result (our empirical finding) of the ExtendedSammon mapping turns out to be that the stresses for short distances, D̄_h around 0.4, fall greatly but are still much higher than the result of the Sammon mapping, while the stresses for long distances D̄_h > 0.57 increase slightly.

Table 2. Two groups of base convex functions and their derivatives, x > 0.

1stGroup:
  t = 1: F(x) = x log(x);  d²F/dx² = 1/x > 0;  d³F/dx³ = −1/x² < 0;  dⁿF/dxⁿ = (−1)ⁿ(n−2)!/x^{n−1}.
  t = 2: F(x) = −log(x);  d²F/dx² = 1/x² > 0;  d³F/dx³ = −2/x³ < 0;  dⁿF/dxⁿ = (−1)ⁿ(n−1)!/xⁿ.
  t = 3: F(x) = 1/x;  d²F/dx² = 2/x³ > 0;  d³F/dx³ = −6/x⁴ < 0;  dⁿF/dxⁿ = (−1)ⁿ n!/x^{n+1}.
  t = 4: F(x) = 1/x²;  d²F/dx² = 6/x⁴ > 0;  d³F/dx³ = −24/x⁵ < 0;  dⁿF/dxⁿ = (−1)ⁿ(n+1)!/x^{n+2}.
  Generic, t ≥ 3: F(x) = 1/x^{t−2};  d²F/dx² = (t−2)(t−1)/x^t > 0;  d³F/dx³ = −(t−2)(t−1)t/x^{t+1} < 0;  dⁿF/dxⁿ = (−1)ⁿ(t−2)(t−1)t⋯(t+n−3)/x^{t+n−2}.

2ndGroup:
  parameter λ: F(x) = exp(−x/λ);  d²F/dx² = (1/λ²)exp(−x/λ) > 0;  d³F/dx³ = −(1/λ³)exp(−x/λ) < 0;  dⁿF/dxⁿ = (−1/λ)ⁿ exp(−x/λ).


With the above discussion in mind, namely that BMMDS brings more unfolding power to the mapping than traditional MMDS, we wish to go beyond the ExtendedSammon mapping and find other mappings which give similar advantages to those above. We first explicitly state the criteria we will use for base function selection and then develop and investigate two groups of Bregman mappings.

4. Criterion for base convex function selection

Based on the stress formation of the ExtendedSammon mapping, a convex function F(x) for use in MMDS applications defined on R_{++} should in general satisfy

d²F(x)/dx² > 0,  d⁴F(x)/dx⁴ > 0,  …,  d^(2n)F(x)/dx^(2n) > 0,
d³F(x)/dx³ < 0,  d⁵F(x)/dx⁵ < 0,  …,  d^(2n+1)F(x)/dx^(2n+1) < 0,   n = 1, 2, 3, …    (13)

If a convex function such as F(x) = e^x, x > 0, is used as the base function, the overall stress is minimised and L_ij approaches D_ij, but local distance preservation is not improved; thus it is not considered. Having met (13), the second order derivative is then considered the most important, because it is the major influence in (7), as shown above. It needs to be big at small distances and small at long distances. The bigger it is at local distances, the greater the force concentrated on small distances; the smaller it is at long distances, the more freedom the long distances have. This represents the focusing power on local distances.
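The sign pattern (13) can be checked symbolically for a candidate base function; the following sketch (our own, using SymPy) evaluates the first few derivatives of the base functions of Table 2 at a sample point x = 0.5:

```python
import sympy as sp

x, lam = sp.symbols('x lambda', positive=True)
candidates = [x * sp.log(x), -sp.log(x), 1 / x, 1 / x**2, sp.exp(-x / lam)]

for F in candidates:
    signs = []
    for n in range(2, 7):                     # d^2F/dx^2 ... d^6F/dx^6
        d_n = sp.diff(F, x, n).subs({x: 0.5, lam: 1.0})
        signs.append(float(d_n) > 0)
    print(F, signs)                           # criterion (13): True, False, True, False, True
```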

5. The first group of mappings using Bregman divergences

Ref. [2, pp. 255–257] refers to [3], where the weight in (1) is generalised as w(D_ij) = 1/D_ij^t in the XGvis MDS software. (XGvis is an interactive visualisation system for proximity data; it is available free from http://www2.research.att.com/areas/stat/xgobi/.) When t is positive, the scaling emphasises the representation of small dissimilarities. We have already extended the case t = 1, the Sammon mapping, with the ExtendedSammon mapping; we now give extensions for other values of t. Table 2 lists base functions corresponding to t = 2, 3, 4 in addition to (8), the base convex function of the ExtendedSammon mapping. They meet the criterion (13) discussed above. As depicted in Fig. 4, we can see the strategy of focusing on small distances: from t = 1 to t = 4 the second order derivatives become higher and higher for small distances and lower and lower for long distances, and the rate of change for small distances is much higher than the rate of change for long distances. Based on the proposed base functions, and in addition to the ExtendedSammon mapping, the following BMMDS stress functions are created.

t = 2, F(x) = −log(x): this gives the Itakura–Saito divergence (5),

E_{IS}(Y) = \sum_{i=1}^{N} \sum_{j=i+1}^{N} IS(L_{ij}, D_{ij}) = \sum_{i=1}^{N} \sum_{j=i+1}^{N} \left( \frac{L_{ij}}{D_{ij}} - \log\frac{L_{ij}}{D_{ij}} - 1 \right),    (14)

which is an extension of the MMDS whose stress function is \sum_{i=1}^{N}\sum_{j=i+1}^{N} (L_{ij}-D_{ij})^2 / D_{ij}^2, the elastic scaling [2, p. 255].

t = 3, F(x) = 1/x:

E_{Reciprocal}(Y) = \sum_{i=1}^{N} \sum_{j=i+1}^{N} \left( \frac{1}{L_{ij}} - \frac{1}{D_{ij}} + \frac{L_{ij}-D_{ij}}{D_{ij}^2} \right) = \sum_{i=1}^{N} \sum_{j=i+1}^{N} \left( \frac{L_{ij}}{D_{ij}^2} - \frac{2}{D_{ij}} + \frac{1}{L_{ij}} \right),    (15)

which is an extension of the MMDS whose stress function is \sum_{i=1}^{N}\sum_{j=i+1}^{N} (L_{ij}-D_{ij})^2 / D_{ij}^3.

t = 4, F(x) = 1/x²:

E_{InverseQuadratic}(Y) = \sum_{i=1}^{N} \sum_{j=i+1}^{N} \left( \frac{1}{L_{ij}^2} - \frac{1}{D_{ij}^2} + \frac{2(L_{ij}-D_{ij})}{D_{ij}^3} \right) = \sum_{i=1}^{N} \sum_{j=i+1}^{N} \left( \frac{2L_{ij}}{D_{ij}^3} - \frac{3}{D_{ij}^2} + \frac{1}{L_{ij}^2} \right),    (16)

which is an extension of the MMDS whose stress function is \sum_{i=1}^{N}\sum_{j=i+1}^{N} (L_{ij}-D_{ij})^2 / D_{ij}^4. The 1stGroup stress functions defined using Bregman divergences and their corresponding MMDS terms are summarised in Table 3.
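For reference, the per-pair divergence terms of the 1stGroup (collected in Table 3) can be written as a single helper; this is our own sketch and follows (9), (14), (15) and (16):

```python
import numpy as np

def first_group_term(L, D, t):
    """Per-pair 1stGroup divergence d_F(L, D) for t = 1, 2, 3, 4."""
    if t == 1:            # ExtendedSammon, F(x) = x log x, Eq. (9)
        return L * np.log(L / D) - L + D
    if t == 2:            # Itakura-Saito, F(x) = -log x, Eq. (14)
        return L / D - np.log(L / D) - 1.0
    if t == 3:            # Reciprocal, F(x) = 1/x, Eq. (15)
        return L / D**2 - 2.0 / D + 1.0 / L
    if t == 4:            # Inverse quadratic, F(x) = 1/x**2, Eq. (16)
        return 2.0 * L / D**3 - 3.0 / D**2 + 1.0 / L**2
    raise ValueError("t must be 1, 2, 3 or 4")
```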


Fig. 4. Second order derivatives of selected base functions.

Table 3. Summary of divergences in contrast with the corresponding MMDS.

1stGroup:
  ExtendedSammon (9): parameter t = 1, F(x) = x log(x); term of BMMDS (6): L_ij log(L_ij/D_ij) − L_ij + D_ij; term of MMDS (1): (L_ij − D_ij)²/D_ij, the Sammon mapping (3).
  Itakura–Saito (14): t = 2, F(x) = −log(x); BMMDS term: L_ij/D_ij − log(L_ij/D_ij) − 1; MMDS term: (L_ij − D_ij)²/D_ij², elastic scaling, p. 255 of [2].
  Reciprocal (15): t = 3, F(x) = 1/x; BMMDS term: L_ij/D_ij² − 2/D_ij + 1/L_ij; MMDS term: (L_ij − D_ij)²/D_ij³.
  Inverse quadratic (16): t = 4, F(x) = 1/x²; BMMDS term: 2L_ij/D_ij³ − 3/D_ij² + 1/L_ij²; MMDS term: (L_ij − D_ij)²/D_ij⁴.

2ndGroup:
  Exponential family (18): parameter λ, F(x) = exp(−x/λ); BMMDS term: exp(−L_ij/λ) − exp(−D_ij/λ) + ((L_ij − D_ij)/λ) exp(−D_ij/λ); MMDS term: exp(−D_ij/λ)(L_ij − D_ij)².


5.1. Experiment on the Swiss roll data

The mapping of the Swiss roll by the 1stGroup is shown in Figs. 2(c) and 5(a)–(c). It can be seen that, with the increase of the parameter t, the roll becomes fatter and fatter (the result of the increasing power of focusing on small distances) and has an increasing tendency to unroll (the unfolding power increases). The distance projection is seen in Fig. 5(d), which shows that while short distances in latent space approach their values in data space, long distances are mapped longer than their corresponding data distances. The improvement for short distances is greater than for long distances when t increases. The improvement in distance preservation from t = 3 to t = 4 is less notable.


Fig. 5. Mappings of the Swiss roll data set by the 1stGroup. (a) Mapped by Itakura–Saito, t = 2. (b) Mapped data, t = 3. (c) Mapped data, t = 4. (d) Distance projection. (e) Relative standard deviations. (f) LCMC.


This is also reflected in Fig. 5(e), where a low standard deviation means that the mapped distances are uniform in length and less dispersed. We can see that when t increases, the standard deviation for short distances (D_ij < 0.57) reduces, which means the power controlling short distances is increased; meanwhile the standard deviation for long distances (D_ij > 0.57) increases, which means the attention paid to long distances reduces. A clear comparison of quality for all methods is given in the single measure LCMC, as shown in Fig. 5(f). The effect is easily seen for small neighbourhoods. The improvement from the Sammon to the ExtendedSammon mapping is more evident in LCMC than in distance. When t ≥ 2, in return for a good mapping, the time spent on optimisation increases significantly; the time cost of each method is shown in Table 4.

5.2. Experiment on the open box

Although the open box is described as a 'benchmark' manifold in [13], we could not find its source online, so we created the data set (Fig. 6(a)) according to Lee and Verleysen's description. The open box data set consists of 316 data points. On page 15 of [13] it is stated that it 'is connected but neither compact (in contrast with a cube or closed box) nor smooth (there are sharp edges and corners)'. It is interesting to see in Fig. 7 that not all of the distances have the same opportunity to become longer. The result of the 1stGroup of divergences is displayed in Fig. 7(b)–(e). The box opens wider at the mouth; in other words, the manifold has been found such that distances between points on opposite sides at the top are longer than at the bottom. Distances between points on the lid remain almost unchanged whereas at the bottom they have contracted. The mouth becomes rounder and rounder, and sharp points and edges are smoothed, except on the lid. In a square consisting of four neighbouring data points on the four vertical sides, the horizontal side becomes longer than the vertical side. This is a smooth deformation, in contrast to methods such as curvilinear component analysis [6] which tear the box in order to create a two dimensional representation of the data. As an overall effect, neighbourhood preservation is continuously improved: from Fig. 7(a) to (b) the overlap between the vertical side and the bottom reduces, and the overlap disappears in Fig. 7(c).

Table 4. Time cost in seconds of each method (the Sammon mapping; the 1stGroup with t = 1, 2, 3, 4; the 2ndGroup with λ = 1, 1/2, 1/4, 1/8, 1/16) on the Swiss roll, open box, iris, digits and artificial faces data sets. For example, on the Swiss roll the Sammon mapping takes 9.4558 s, while the 1stGroup takes 20.3957, 52.3481, 916.1055 and 7.4962e+003 s for t = 1 to 4 respectively.

Fig. 6. Open box data set and its distance distribution. (a) Open box data set. (b) The distance distribution of the open box in data space.


Fig. 7. The experiment on the open box by the 1stGroup. (a) Mapped by the Sammon mapping. (b) Mapped by the ExtendedSammon mapping, t = 1. (c) Mapped by the Itakura–Saito divergence, t = 2. (d) Mapped data, t = 3. (e) Mapped data, t = 4. (f) LCMC. (g) Distance projection. (h) Relative standard deviation of relative distance change.

In Fig. 7(g) we can see that the longest distances are the least stretched; the Itakura–Saito divergence has even compressed some of the longest distances. On this data set, when t is too big, mapping quality is not improved. When t = 3 short distances are shorter than their original distances, but when t = 4 all short distances are mapped longer than their original distances, and little improvement is indicated by LCMC (Fig. 7(f)). In Fig. 7(h) we can see that the differences in control of small distances between the methods are smaller than for the long distances.

5.3. Experiment on the artificial faces, MNIST handwritten digits and iris data sets

Artificial faces is a set of 698 images created for Isomap [19]. The images vary along three dimensions: the face can rotate left and right, rotate up or down, and the lighting direction can change. Each image is 64 × 64 pixels, so we have 4096 dimensional data. Unlike [13], we used this raw data without compression by principal component analysis. The MNIST handwritten digits are available from http://cs.nyu.edu/~roweis/data.html. Samples are in 8-bit grayscale; each sample is 28 × 28 pixels. Thirty-six samples are randomly drawn from the test subset for each of the digits 0–9, so the subset we choose has 360 samples in total. The iris flower data set, or Fisher's iris data set, is a multivariate data set introduced by Sir Ronald Aylmer Fisher (1936) as an example of discriminant analysis. It consists of 50 samples from each of three species of iris flowers (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepal and petal, in centimetres. In order to compare with the 2ndGroup methods, these three data sets are not density processed. Fig. 8 shows the experimental results by LCMC and distance projection.
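As an illustration of the kind of preprocessing described in Section 3.1 applied to one of these data sets, the following sketch (our own; the paper does not state which tools were used) builds the normalised distance matrix for the iris data:

```python
import numpy as np
from sklearn.datasets import load_iris
from scipy.spatial.distance import pdist, squareform

X = load_iris().data                                   # 150 samples x 4 features
D = squareform(pdist(X))                               # pairwise Euclidean distances
D /= D[np.triu_indices_from(D, k=1)].mean()            # divide by the mean distance (Section 3.1)
```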


Fig. 8. Experiment on the artificial faces, MNIST handwritten digits and iris data by the 1stGroup. (a) Iris, LCMC. (b) Iris, distance projection. (c) Artificial faces, LCMC. (d) Artificial faces, distance projection. (e) Digits, LCMC. (f) Digits, distance projection.



Fig. 9. Mappings of the artificial faces by the 1stGroup; a maximum of 23 data points are sampled vertically. (a) By the Sammon mapping. (b) By the ExtendedSammon mapping, t = 1. (c) By Itakura–Saito, t = 2. (d) t = 4.

We can see that, on the three data sets, the ExtendedSammon mapping outperforms the Sammon mapping, and with the increase of t, small distance preservation is improved as expected. For the iris data set, the mapping quality for small neighbourhoods improves with the increase of t, as shown in Fig. 8(a); but this is not the case on the digits and artificial faces data sets, as shown in Fig. 8(e) and (c). The quality for t = 4 is the worst, as shown in Fig. 8(e), although its preservation of short distances is the best, as shown in Fig. 8(f). The strategy of improving mapping quality by giving small distances ever increasing weight does not always work. Not surprisingly, for all three data sets, when t increases from 3 to 4 the weight on small distances becomes too high and the optimisation time soars exponentially, as illustrated in Table 4. Examples of the mappings of the artificial faces data set are shown in Fig. 9.

6. The second group of mappings using Bregman divergences

In fact, giving small distances increasingly high stress is not the only way of focusing on small distances. Let us examine the base function

F(x) = \exp(-x/\lambda),   \lambda > 0,    (17)

whose second order derivative is d²F/dx² = (1/λ²) exp(−x/λ). As shown in Fig. 4, it uses a strategy that is different from the 1stGroup to focus on small distances: d²F/dx² does not give unboundedly high weight to small distances; instead it is bounded above by a finite maximum value (1/λ², attained as x → 0) for very small distances, while it is low at long distances. In this way small distances are given relatively high stress. For the same long distance, the smaller λ is, the smaller the second derivative is, and thus the degree of focusing power on small distances is increased. The Bregman divergence based on the convex function (17) is

E_{Exp}(Y) = \sum_{i=1}^{N} \sum_{j=i+1}^{N} \left( \exp\left(-\frac{L_{ij}}{\lambda}\right) - \exp\left(-\frac{D_{ij}}{\lambda}\right) + \frac{L_{ij} - D_{ij}}{\lambda} \exp\left(-\frac{D_{ij}}{\lambda}\right) \right),    (18)

which is an extension of the MMDS whose stress function is \sum_{i=1}^{N}\sum_{j=i+1}^{N} \exp(-D_{ij}/\lambda)(L_{ij}-D_{ij})^2. An obvious advantage of this method is that, unlike the 1stGroup of methods, in which considerable delay is caused by very small distances, even identical points are allowed: for the corresponding MMDS, a zero distance is given a weight of 1 instead of the infinity of the 1stGroup methods.

6.1. Experiment on the Swiss roll data

Because small distances do not delay convergence, the original 500 point Swiss roll data set is used. The experiments with the 2ndGroup of divergences on the Swiss roll data set are shown in Fig. 10.
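A sketch (ours) of the per-pair 2ndGroup divergence (18); note that it stays finite even when a data-space distance is zero, which is the practical advantage mentioned above:

```python
import numpy as np

def exp_term(L, D, lam):
    """Per-pair divergence of (18), base function F(x) = exp(-x/lam)."""
    return (np.exp(-L / lam) - np.exp(-D / lam)
            + (L - D) / lam * np.exp(-D / lam))

print(exp_term(0.0, 0.0, 0.5))     # identical points contribute zero stress
print(exp_term(0.3, 0.1, 0.5))     # a mismatched short distance is still penalised
```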


Fig. 10. Mappings of the Swiss roll data set by the 2ndGroup divergences. (a) Mapped data, λ = 1. (b) Mapped data, λ = 1/2. (c) Mapped data, λ = 1/4. (d) Mapped data, λ = 1/8. (e) Mapped data, λ = 1/16. (f) LCMC. (g) Distance projection. (h) Relative standard deviations.


In Fig. 10(a)–(e), with the decrease of λ, the unrolling power increases considerably. Unlike the 1stGroup, which is good at mapping short distances, the 2ndGroup of divergences is good at enlarging long distances. From Fig. 10(g), when λ decreases from 1 to 1/2, short distance preservation improves greatly, which is also reflected in Fig. 10(f), indicating a great leap in projection quality in spite of the fact that long distances are only slightly prolonged. When λ decreases further, from 1/2 to 1/16, short distances are slowly drawn closer to their original values and long distances are prolonged more and more, while the corresponding LCMC increases more and more slowly. In Fig. 10(h), while more and more freedom is given to long distances, the attention paid to short distances increases; this is the same scenario as with the 1stGroup in Fig. 5(e). When λ decreases from 1/8 to 1/16, as demonstrated by Fig. 10(d) and (e), the unfolding is improved noticeably, and so is the release of long distances, as seen in Fig. 10(g) and (h), but at the sacrifice of running time, as shown in Table 4, in return for the slightly raised LCMC in Fig. 10(f) (in this context, it seems that LCMC is not a good predictor of unfolding).

6.2. Experiment on the open box

Fig. 11(a)–(e) shows the mapping of the open box when λ decreases from 1 to 1/16. We can see by eye that the neighbourhood preservation is clearly improved with the decrease of λ: in Fig. 11(a) points on the left side of the open box are mixed with points on the bottom; in Fig. 11(b) there are only a few samples in the bottom corner, which stick to the vertical sides; in Fig. 11(c) the mixing no longer exists. There is no obvious change between Fig. 11(d) and (e).


Fig. 11. The mapping of the open box by the 2ndGroup divergences. (a) Mapped data, λ = 1. (b) Mapped data, λ = 1/2. (c) Mapped data, λ = 1/4. (d) Mapped data, λ = 1/8. (e) Mapped data, λ = 1/16. (f) LCMC. (g) Distance projection. (h) Relative standard deviation.



Fig. 12. Experiment on the artificial faces, MNIST handwritten digits and iris data by the 2ndGroup. (a) Iris data set, LCMC. (b) Iris data set, distance projection. (c) Artificial faces data set, LCMC. (d) Artificial faces data set, distance projection. (e) Digits data set, LCMC. (f) Digits data set, distance projection.



Fig. 13. Mappings of the artificial faces by the 2ndGroup; a maximum of 23 data points are sampled vertically. (a) λ = 1/2. (b) λ = 1/4. (c) λ = 1/8. (d) λ = 1/16.

Quantitatively, in Fig. 11(f), from λ = 1/2 onwards LCMC improves only a little, although the relative mean of distances in Fig. 11(g) increases steadily. The longest distances, those greater than 2.0, are the least stretched by all methods. This is almost certainly because, as seen in Fig. 6(b), the proportion of long distances is much smaller than in the Swiss roll data set, as seen in Fig. 3(d). Compared to the experiment on the Swiss roll data in Fig. 10(g), in Fig. 7(g) the lines for each method are roughly parallel; these methods stretch distances evenly. Fig. 11(h) shows that the control of long distances is more spread out than for small distances. We note that, as far as the LCMC criterion is concerned, there is little advantage in having smaller values of the λ parameter in the base function. When λ decreases below 1/8, small distances start to be over-stretched and the LCMC increases a little, but at the cost of much more running time, as shown in Table 4.

6.3. Experiment on the artificial faces, MNIST handwritten digits and iris data sets

The experiments on the artificial faces, MNIST handwritten digits and iris data sets with the 2ndGroup methods are more encouraging than with the 1stGroup, as shown in Fig. 12. For all three data sets, with the decrease of λ, short distances are preserved closer and closer to their original values in data space, and for small neighbourhoods the quality measured by LCMC increases steadily.


Nevertheless, it is worth noting that when the parameter is too small, λ = 1/16, the quality improvement is not significant, but the convergence time is up to 10 times that of λ = 1/8, as shown in Table 4. The mapping results on the three data sets by the two groups of methods show that better preservation of short distances does not mean better visualisation quality. For the iris data set, the best method in the 1stGroup is that with the highest parameter, t = 4, in Fig. 8(a), but it only competes with λ = 1/16 for k < 5, after which its quality drops quickly and is worse than that of λ = 1/16 until about k = 35, as shown in Fig. 12(a), although short distances are better preserved by t = 4 than by λ = 1/16, as shown in Fig. 12(b). For the artificial faces, the best method in the 1stGroup is again that with the highest parameter, t = 4, in Fig. 8(c), but its quality is far below that of the medium parameter λ = 1/4 of the 2ndGroup throughout k < 100 in Fig. 12(c); and again the short distances are much better preserved by t = 4 than by λ = 1/4, as shown in Fig. 12(d). For the handwritten digits, the ExtendedSammon mapping is the best in the 1stGroup, as shown in Fig. 8(e), but its quality is much worse than that of λ = 1/2 beyond k = 35, as shown in Fig. 12(e), with equivalent short distance preservation. Examples of the mappings of the artificial faces data set are shown in Fig. 13.

7. Conclusion and future work

We have shown that linear MMDS can be thought of as a BMMDS in the special case when the underlying convex function is F(x) = x², and that the Sammon mapping is a truncated version of a BMMDS with the convex function F(x) = x log x. We have shown that the full Bregman mapping of some standard data sets improves upon the mapping provided by the Sammon mapping, using a criterion which is not related to the stress functions used but is designed to appeal to an intuitively plausible rationale for why one mapping may be better than another. We have used the intuition gained from this improvement to develop two groups of mappings based on different underlying convex functions: one group concentrates on shorter distances while the other gives more attention to the longer distances in data space. Perhaps the most noteworthy contribution is the smooth deformation of the projection, so that even with the extremely discontinuous open box data set, a mapping is created in which near neighbours in data space continue to be near neighbours in the two dimensional latent projection.

Future work will investigate the creation of divergences which merge the properties of our two groups of divergences, i.e. which simultaneously pay attention to small and large distances in data space. We have also begun an investigation into right Bregman divergences [17], i.e. where our stress function is

E_{RBMMDS}(Y) = \sum_{i=1}^{N} \sum_{j=i+1}^{N} d_F(D_{ij}, L_{ij}) = \sum_{i=1}^{N} \sum_{j=i+1}^{N} ( F(D_{ij}) - F(L_{ij}) - (D_{ij} - L_{ij}) \nabla F(L_{ij}) ),

and connections which it has to mappings in the literature such as curvilinear component analysis [9]. This investigation is ongoing. Further work will also investigate the reasons why increasing local distance preservation does not increase mapping quality for the 1stGroup.

Appendix A. Derivatives used for stress optimisation

The distance between points i and j in latent space is L_{ij} = \sqrt{\sum_{k=1}^{P} (Y_{ki} - Y_{kj})^2}. Given \partial L_{ij}/\partial Y_{ki} = (Y_{ki} - Y_{kj})/L_{ij}, the derivatives for the 1stGroup methods are

\partial E_{ExtSammon}(Y)/\partial Y_{ki} = (\partial E_{ExtSammon}(Y)/\partial L_{ij})(\partial L_{ij}/\partial Y_{ki}) = \log\frac{L_{ij}}{D_{ij}} \cdot \frac{Y_{ki} - Y_{kj}}{L_{ij}},

\partial E_{IS}(Y)/\partial Y_{ki} = \left( \frac{1}{D_{ij}} - \frac{1}{L_{ij}} \right) \frac{Y_{ki} - Y_{kj}}{L_{ij}},

\partial E_{Reciprocal}(Y)/\partial Y_{ki} = \left( \frac{1}{D_{ij}^2} - \frac{1}{L_{ij}^2} \right) \frac{Y_{ki} - Y_{kj}}{L_{ij}},

\partial E_{InverseQuadratic}(Y)/\partial Y_{ki} = 2 \left( \frac{1}{D_{ij}^3} - \frac{1}{L_{ij}^3} \right) \frac{Y_{ki} - Y_{kj}}{L_{ij}};

the derivative for the 2ndGroup method is

\partial E_{Exp}(Y)/\partial Y_{ki} = \frac{1}{\lambda} \left( \exp\left(-\frac{D_{ij}}{\lambda}\right) - \exp\left(-\frac{L_{ij}}{\lambda}\right) \right) \frac{Y_{ki} - Y_{kj}}{L_{ij}}.
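A compact sketch (our own; the paper's implementation is in MATLAB, based on Cawley's sammon.m) of the ExtendedSammon gradient above, vectorised with NumPy, which can be dropped into the update rule of Section 3.1:

```python
import numpy as np

def extsammon_gradient(Y, D):
    """dE_ExtSammon/dY_ki = sum_j log(L_ij / D_ij) * (Y_ki - Y_kj) / L_ij
    for a latent configuration Y (N x P) and data-space distances D (N x N)."""
    diff = Y[:, None, :] - Y[None, :, :]          # pairwise coordinate differences
    L = np.sqrt((diff ** 2).sum(axis=2))          # latent distances L_ij
    np.fill_diagonal(L, 1.0)                      # dummy value, masked out below
    W = np.log(L / np.where(D > 0, D, 1.0))       # log(L_ij / D_ij)
    np.fill_diagonal(W, 0.0)                      # a point exerts no force on itself
    return (W[:, :, None] * diff / L[:, :, None]).sum(axis=1)

# one gradient-descent step with step size alpha:
# Y_new = Y - alpha * extsammon_gradient(Y, D)
```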

References

[1] A. Banerjee, S. Merugu, I.S. Dhillon, J. Ghosh, Clustering with Bregman divergences, Journal of Machine Learning Research 6 (2005) 1705–1749.
[2] I. Borg, P.J.F. Groenen, Modern Multidimensional Scaling, second ed., Springer, New York, 2005.
[3] A. Buja, D.F. Swayne, Visualization methodology for multidimensional scaling, Journal of Classification 19 (2002).
[4] L. Chen, A. Buja, Local multidimensional scaling for nonlinear dimension reduction, graph drawing and proximity analysis, Ph.D. Thesis, University of Pennsylvania, 2006.
[5] D. Child, The Essentials of Factor Analysis, Holt, Rinehart and Winston, London, 1973.
[6] P. Demartines, J. Hérault, Curvilinear component analysis: a self-organizing neural network for nonlinear mapping of data sets, IEEE Transactions on Neural Networks 8 (1).
[7] C. Fyfe, Data visualization using Bregman divergences, Computing and Information Systems Technical Reports, November 2008.
[8] H. Hotelling, Analysis of a complex of statistical variables into principal components, Journal of Educational Psychology 24 (1933) 417–441.
[9] J. Hérault, C. Jausions-Picaud, A. Guérin-Dugué, Curvilinear component analysis for high-dimensional data representation: I. Theoretical aspects and practical use in the presence of noise, in: IWANN, vol. 2, 1999, pp. 625–634.
[10] J. Mao, A.K. Jain, Artificial neural networks for feature extraction and multivariate data projection, IEEE Transactions on Neural Networks 6 (2) (1995) 296–317.
[11] J. Sammon, A nonlinear mapping for data structure analysis, IEEE Transactions on Computers 18 (1969).
[12] T. Kohonen, M.R. Schroeder, T.S. Huang, Self-Organizing Maps, Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2001.
[13] J.A. Lee, M. Verleysen, Nonlinear Dimensionality Reduction, Springer, Berlin, 2007.
[14] R.C.T. Lee, J.R. Slagle, H. Blum, A triangulation method for the sequential mapping of points from n-space to two-space, IEEE Transactions on Computers 6 (1977) 288–292.
[15] E. Pekalska, D. de Ridder, R.P.W. Duin, M.A. Kraaijveld, A new method of generalizing Sammon mapping with application to algorithm speed-up, in: 5th Annual Conference of the Advanced School for Computing and Imaging, ASCI, 1999, pp. 221–228.
[16] S.T. Roweis, L.K. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science 290 (2000) 2323–2326.
[17] J. Sun, M. Crowe, C. Fyfe, Curvilinear component analysis and Bregman divergences, in: European Symposium on Artificial Neural Networks, d-side Publications, 2010, pp. 81–86.
[18] J. Sun, M. Crowe, C. Fyfe, Extending metric multidimensional scaling with Bregman divergences, in: N. García-Pedrajas, F. Herrera, C. Fyfe, J.M. Benítez, M. Ali (Eds.), Trends in Applied Intelligent Systems, Lecture Notes in Computer Science, vol. 6097, Springer, 2010, pp. 615–626.
[19] J.B. Tenenbaum, V. de Silva, J.C. Langford, A global geometric framework for nonlinear dimensionality reduction, Science 290 (5500) (2000) 2319–2323.
[20] L. Yang, Sammon's nonlinear mapping using geodesic distances, in: Proceedings of the 17th International Conference on Pattern Recognition, ICPR'04, vol. 2, IEEE Computer Society, Washington, DC, USA, 2004, pp. 303–306.

Jigang Sun is a Ph.D. student at the University of the West of Scotland, having completed his master's thesis there in 2008.

Malcolm Crowe is a professor in the School of Computing and was the Head of the School and Vice-Dean of the faculty for many years.

Colin Fyfe is a professor in computational intelligence with over 300 refereed publications. He has supervised 20 Ph.D. students and is on the editorial boards of five international journals.