Pattern Recognition 36 (2003) 217 – 227

www.elsevier.com/locate/patcog

The use of kernel principal component analysis to model data distributions

C.J. Twining*, C.J. Taylor

Imaging Science and Biomedical Engineering, Stopford Building, University of Manchester, Oxford Road, Manchester M13 9PT, UK

Received 3 May 2001; accepted 15 January 2002

*Corresponding author. E-mail address: [email protected] (C.J. Twining). URL: http://www.isbe.man.ac.uk/~ct/personal.html.

Abstract

We describe the use of kernel principal component analysis (KPCA) to model data distributions in high-dimensional spaces. We show that a previous approach to representing non-linear data constraints using KPCA is not generally valid, and introduce a new 'proximity to data' measure that behaves correctly. We investigate the relation between this measure and the actual density for various low-dimensional data distributions. We demonstrate the effectiveness of the method by applying it to the higher-dimensional case of modelling an ensemble of images of handwritten digits, showing how it can be used to extract the digit information from noisy input images. © 2002 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Feature extraction; Principal components; Kernel functions; Density estimation; Support vector machines; Kernel PCA

1. Introduction

Many computer vision problems involve modelling the distribution of some training data set, which can be represented as a set of points lying in some high-dimensional input space. There are several properties which we might require such a model to possess. Firstly, that it generalises from the data set to produce a continuous, compact model of the variation seen. It should also illuminate any structure inherent in the data. Finally, given a set of test points, which may have been produced by some noisy process, we would like to be able to determine whether or not the test points lie within the range of variation seen across the training set; that is, partition the test data into valid and invalid examples. Furthermore, if a test point has been classified as an invalid example, we would like to be able to generate the closest valid example to that test point. In many situations, one can assume that the distribution can be modelled as a multivariate Gaussian, and linear principal component analysis (PCA) [1,2] performed in the input space can be used to give the parameters of the Gaussian. However, there are also situations where such a linear model, or combinations of linear models [3], are not sufficient.

Kernel principal component analysis (KPCA) [4,5] is a technique for non-linear feature extraction, closely related to methods applied in Support Vector Machines [6,7]. It has proved useful for various applications, such as de-noising [8] and as a pre-processing step in regression problems [9]. It has also been applied, by Romdhani et al. [10], to the construction of non-linear statistical shape models of faces, but we will show that their approach to defining and constraining allowed variability is not generally valid.

In this paper, we show how KPCA components defined for the case of Gaussian kernel functions can be used to construct functional measures of 'proximity to data'. For one such measure, the pseudo-density, we investigate its relation to the actual density, both theoretically and experimentally. We demonstrate its validity and effectiveness by modelling the data distribution of images of handwritten digits, showing how it can be used to extract digit information from noisy unseen images of digits.



2. Kernel principal component analysis

KPCA is a method of non-linear feature extraction. The non-linearity is introduced by mapping the data from the input space $I$ to a feature space $F$. Linear PCA is then performed in the feature space. The key to the method is that linear PCA can be expressed solely in terms of dot products in the feature space. Hence, the non-linear mapping need not be explicitly constructed, but can be specified by defining the functional form of the dot products in terms of a Mercer kernel function $K$, where $K: I \times I \to \mathbb{R}$. In what follows, we concentrate on the case of a Gaussian kernel function.

2.1. The basic KPCA algorithm

Consider data points $\tilde{x}$ and $\tilde{y}$ in the input space $I = \mathbb{R}^n$ (the notation $\tilde{x}$ is used to denote vectors in a finite-dimensional space, whilst bold vectors, e.g. $\mathbf{\Phi}(\tilde{x})$, are used to distinguish those in infinite-dimensional spaces). The non-linear mapping $\Phi: \mathbb{R}^n \to F$ is defined such that

$$\Phi(\tilde{x}) \cdot \Phi(\tilde{y}) \equiv K(\tilde{x},\tilde{y}) = \exp\left(-\frac{(\tilde{x}-\tilde{y})^2}{2\sigma^2}\right) \quad \forall\,\tilde{x},\tilde{y} \in \mathbb{R}^n, \tag{1}$$

where $\cdot$ is the vector dot product in the infinite-dimensional feature space $F$, and the parameter $\sigma$ is the width of the Gaussian kernel. It is important to note that $\Phi(\mathbb{R}^n)$ is an embedded submanifold of $F$. Properties of this embedding [11] will become important later.

For a data set $\{\tilde{x}_i: i = 1\text{--}N\}$, we have the corresponding set of mapped data points $\{\Phi_i = \Phi(\tilde{x}_i): i = 1\text{--}N\}$ in the feature space $F$. Centred data points in $F$ are defined thus:

$$\tilde{\Phi}(\tilde{y}) \equiv \Phi(\tilde{y}) - \frac{1}{N}\sum_{i=1}^{N}\Phi(\tilde{x}_i) \quad \forall\,\tilde{y} \in \mathbb{R}^n. \tag{2}$$

Following from the definition of the dot product (1), we define the unnormalised and normalised kernel matrices over the data set, with elements:

$$K_{ij} \equiv \Phi(\tilde{x}_i)\cdot\Phi(\tilde{x}_j) = \exp\left(-\frac{(\tilde{x}_i-\tilde{x}_j)^2}{2\sigma^2}\right), \tag{3}$$

$$\tilde{K}_{ij} \equiv \tilde{\Phi}(\tilde{x}_i)\cdot\tilde{\Phi}(\tilde{x}_j) = K_{ij} - \frac{1}{N}\sum_{p=1}^{N}K_{ip} - \frac{1}{N}\sum_{q=1}^{N}K_{qj} + \frac{1}{N^2}\sum_{p,q=1}^{N}K_{pq}. \tag{4}$$

Linear PCA is then performed by finding a set of vectors $\{\mathbf{b}^\alpha\}$ in the span of $\{\tilde{\Phi}(\tilde{x}_i)\}$ which represent the principal axes [1] of the mapped data set. For $N$ data points, there can be at most $N-1$ such axes, where each corresponding $\mathbf{b}^\alpha$ is an extremum of the Lagrangian:

$$L = \frac{1}{2N}\sum_{i=1}^{N}[\mathbf{b}^\alpha\cdot\tilde{\Phi}(\tilde{x}_i)]^2 - \frac{\lambda^\alpha}{2}[\mathbf{b}^\alpha\cdot\mathbf{b}^\alpha - a^\alpha], \tag{5}$$

where $a^\alpha$ determines the value of $\|\mathbf{b}^\alpha\|^2 \equiv \mathbf{b}^\alpha\cdot\mathbf{b}^\alpha$, and the Lagrange multiplier $\lambda^\alpha \ge 0$. Setting the derivatives of $L$ with respect to $\mathbf{b}^\alpha$ to zero, and taking the dot product with $\tilde{\Phi}(\tilde{x}_j)$, gives the eigenvector equation:

$$\lambda^\alpha b_j^\alpha = \frac{1}{N}\sum_i \tilde{K}_{ij} b_i^\alpha \quad \text{where } b_k^\alpha \equiv \mathbf{b}^\alpha\cdot\tilde{\Phi}(\tilde{x}_k). \tag{6}$$

In contrast to Mika et al. [8] and Schölkopf et al. [4], we choose to normalise the eigenvectors with respect to the data thus:

$$\sum_{i=1}^{N}(b_i^\alpha)^2 = 1 \quad \forall\,\alpha. \tag{7}$$

It follows from the definition of $\tilde{K}_{ij}$ (4) that

$$\sum_{j=1}^{N}\tilde{K}_{ij} \equiv 0, \quad \text{hence} \quad \sum_{j=1}^{N}b_j^\alpha \equiv 0 \quad \forall\,\alpha. \tag{8}$$

For the corresponding vectors $\mathbf{b}^\alpha$ in the feature space:

$$\mathbf{b}^\alpha = \frac{1}{N\lambda^\alpha}\sum_{i=1}^{N}b_i^\alpha\tilde{\Phi}(\tilde{x}_i) \equiv \frac{1}{N\lambda^\alpha}\sum_{i=1}^{N}b_i^\alpha\Phi(\tilde{x}_i), \qquad \|\mathbf{b}^\alpha\|^2 = \frac{1}{N\lambda^\alpha}. \tag{9}$$

Hence the set of $\mathbf{b}^\alpha$ thus defined form an orthogonal, but not an orthonormal, basis for the space spanned by $\{\tilde{\Phi}(\tilde{x}_i)\}$.

Given a set of solutions $\{b_i^\alpha: i = 1\text{--}N,\ \alpha = 1\text{--}M\}$, ordered in terms of non-increasing eigenvalue, we define an $M$-dimensional KPCA space as follows. A test point $\tilde{x} \in \mathbb{R}^n$ is mapped to a point $\tilde{p}(\tilde{x})$ in this space, with unnormalised KPCA components

$$p^\alpha(\tilde{x}) \equiv N\lambda^\alpha\,\mathbf{b}^\alpha\cdot\Phi(\tilde{x}) = \sum_i b_i^\alpha K(\tilde{x},\tilde{x}_i) = \sum_i b_i^\alpha \exp\left(-\frac{(\tilde{x}-\tilde{x}_i)^2}{2\sigma^2}\right). \tag{10}$$

Note that normalised components $\tilde{p}^\alpha(\tilde{x})$ can also be defined by substituting $\tilde{\Phi}$ for $\Phi$ in the above equation. This then corresponds to a translation of the origin in KPCA space, so that the mean of the data lies at the origin.
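For concreteness, the procedure of Eqs. (3)-(10) can be sketched in a few lines of NumPy. This is not the authors' implementation; the function and variable names (gaussian_kpca, components, and so on) are ours, and the eigenvalue cut-off is an arbitrary choice:

```python
import numpy as np

def gaussian_kpca(X, sigma, M=None):
    """Gaussian-kernel KPCA as described in Section 2.1 (a sketch; names are ours).

    X     : (N, n) array of training points.
    sigma : kernel width.
    M     : number of components to keep (default: all with non-negligible eigenvalue).
    Returns the data-normalised eigenvectors b (M, N), the eigenvalues lam (M,),
    and a function mapping test points to unnormalised KPCA components, Eq. (10).
    """
    N = X.shape[0]
    # Unnormalised kernel matrix, Eq. (3).
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (2.0 * sigma ** 2))
    # Centred (normalised) kernel matrix, Eq. (4).
    one = np.ones((N, N)) / N
    K_tilde = K - one @ K - K @ one + one @ K @ one
    # Eigenvector equation (6): lam * b = (1/N) K_tilde b.
    evals, evecs = np.linalg.eigh(K_tilde / N)
    order = np.argsort(evals)[::-1]
    evals, evecs = evals[order], evecs[:, order]
    keep = evals > 1e-12 if M is None else slice(0, M)
    lam, b = evals[keep], evecs[:, keep].T
    # eigh returns unit-norm eigenvectors, i.e. sum_i (b_i^alpha)^2 = 1,
    # which is exactly the data normalisation of Eq. (7).

    def components(Y):
        """Unnormalised KPCA components p^alpha(y) of test points Y, Eq. (10)."""
        d = np.sum((Y[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        Ky = np.exp(-d / (2.0 * sigma ** 2))        # K(y, x_i)
        return Ky @ b.T                             # shape (len(Y), number of components)

    return b, lam, components
```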

3. KPCA space

In this section, we investigate the properties of the mapped data set and of test points in the space of KPCA components. As defined in the previous section, KPCA space is the $M$-dimensional space of the first $M$ unnormalised non-linear components, where Eq. (10) gives us the values of the components of an arbitrary point $\tilde{x}$ in the input space.


However, it should be noted that not all points in KPCA space correspond to points in input space. This is because, as noted previously, $\Phi(\mathbb{R}^n)$ forms an embedded submanifold in the feature space $F$. This can be easily demonstrated. The form of the mapping $\Phi$ is determined by the definition of the kernel function $K$ (1). From this, it follows directly that

$$\|\Phi(\tilde{x})\|^2 \equiv \Phi(\tilde{x})\cdot\Phi(\tilde{x}) = 1 \tag{11}$$

for all points $\tilde{x}$. Hence, $\Phi(\mathbb{R}^n)$ must lie on some embedded submanifold in $F$. However, it should be noted that this property of constancy of the modulus will not be preserved for points $\tilde{p}(\tilde{x})$ when the submanifold is projected into KPCA space.

3.1. The embedded submanifold

We first consider points in input space which are far from any data points. It can be seen directly from Eq. (10) that all the non-linear components for these points will tend to zero as the points recede from the data. Hence, all points far from the data in input space will map to the vicinity of the origin in KPCA space.

Let us now consider points which have non-zero values of the components. The importance of the kernel width can be seen directly from Eq. (10); in the limit where $\sigma$ is infinitesimal, the components will only be non-zero in the neighbourhood of some of the data points (those for which the $b_i^\alpha$ are non-zero), where the size of the neighbourhood will be related to the kernel width. Also, the values of a component over the input space will take both negative and positive values, since, from Eq. (8), at least one of the $b_i^\alpha$ must be of the opposite sign to the rest. These properties will persist up to some finite size of the kernel width, although the first property will eventually fail (see Appendix A for details). Indeed, if the first property did not persist for some finite range of $\sigma$, then the method would be useless as a feature extractor. Hence, for $\sigma$ within this range, if we consider any single component $p^\alpha(\tilde{x})$, the extrema will tend to lie in the neighbourhood of some of the data points, and these extreme values will bracket the value obtained for points at infinity in input space. Since this is an ordering property, it will be preserved if we make the transition to normalised components, since this just corresponds to a translation of the origin. Furthermore, as noted by Schölkopf et al. [4], it follows from Eqs. (9) and (11) that the unnormalised components $p^\alpha(\tilde{x})$ are bounded thus:

$$|p^\alpha(\tilde{x})| \le N\lambda^\alpha\,\|\mathbf{b}^\alpha\|\,\|\Phi(\tilde{x})\| = \sqrt{N\lambda^\alpha}. \tag{12}$$

Hence, the embedded submanifold itself lies within a strictly bounded region of KPCA space.

Fig. 1. Two views of the embedded submanifold (surface) and the data (thick black line) in the first 3 dimensions of KPCA space for the exactly solved case of a uniform data density on the unit circle in $\mathbb{R}^2$. The white line shows the image of a straight line in input space (see the inset).

The behaviour of the KPCA components should be contrasted with the behaviour of linear PCA components. For linear PCA, all components are zero at the mean of the data, and the absolute value of the components increases without bound as we move away from the data. So, we can see that a valid data region for linear data can be constructed by placing an appropriate upper bound on the modulus of each of the linear components. In their use of KPCA to model the non-linear distribution of shapes for faces, Romdhani et al. [10] defined their valid shape region by placing an upper bound on the modulus of each of the normalised KPCA components, by analogy with the linear case. However, the arguments above as regards the behaviour of the KPCA components across input space show clearly that this will be invalid, except in the large-$\sigma$ limit where KPCA approaches linear PCA (see Appendix A).

To illustrate these points, consider a path in input space which starts from infinity, approaches the vicinity of the data (which may lie in several disconnected regions), and finally recedes to infinity again. The corresponding path in KPCA space starts at the origin, and moves towards the periphery of the submanifold whenever the path in input space approaches a data region. The path may loop back towards the origin several times, as the path in input space moves between different data regions, finally returning to the origin as the path recedes to infinity in input space.

For example, consider the case of data evenly distributed on the unit circle in $\mathbb{R}^2$. In Appendix B, we give the exact solution for all the $p^\alpha(\tilde{x})$ in the limit of an infinite number of data points. In Fig. 1, we show two views of the calculated submanifold projected into the first 3 dimensions of KPCA space, where we have used a kernel width $\sigma$ of 0.1. It can be seen that the image of the unit circle lies precisely at the periphery of the submanifold. The image of a straight line in input space, which crosses the unit circle, is also shown. The image of this line begins and ends at the origin. It approaches the periphery of the submanifold twice, corresponding to its points of intersection with the unit circle, whilst between these points, it approaches, but does not reach, the origin.

So, we can now see that the property which distinguishes points in the vicinity of the data from all other points in input space is that they lie near the periphery of the submanifold.
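These two properties (components vanishing far from the data, and the bound of Eq. (12)) are easy to confirm numerically; the following short sketch (ours, not the paper's) reuses the hypothetical gaussian_kpca function from Section 2:

```python
import numpy as np

# Far points map close to the origin in KPCA space, and each unnormalised
# component obeys the bound of Eq. (12), |p^alpha(x)| <= sqrt(N * lambda^alpha).
rng = np.random.default_rng(2)
X = rng.uniform(-1.0, 1.0, size=(100, 2))          # toy training data
N = X.shape[0]
b, lam, components = gaussian_kpca(X, sigma=0.3)

far = rng.uniform(50.0, 60.0, size=(5, 2))         # points far from the data
print(np.max(np.abs(components(far))))             # ~0: far points map near the origin

probe = rng.uniform(-2.0, 2.0, size=(1000, 2))     # arbitrary test points
p = components(probe)
print(np.all(np.abs(p) <= np.sqrt(N * lam) + 1e-8))  # True: the bound of Eq. (12)
```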


Since the submanifold is bounded, and since points at the periphery bracket the origin whichever direction we consider, this suggests that an appropriate measure for proximity to the data set would be some polynomial function of the sum-of-squares over the values of the components. We define a set of such functions thus:

$$f^{(m)}(\tilde{x}) \propto \left[\sum_{\alpha=1}^{M} p^\alpha(\tilde{x})\,p^\alpha(\tilde{x})\right]^{m/2} \quad \text{where } m = 1,2,\ldots,\infty$$
$$= \left[\sum_{i,j=1}^{N}\sum_{\alpha=1}^{M} b_i^\alpha b_j^\alpha K(\tilde{x},\tilde{x}_i)K(\tilde{x},\tilde{x}_j)\right]^{m/2}. \tag{13}$$

We choose to focus on the simplest case where $m = 2$. We will call the normalised form of the function $f^{(2)}(\tilde{x})$ the pseudo-density $\hat{\rho}(\tilde{x})$, where the normalisation factor can be computed in closed form as follows:

$$\hat{\rho}(\tilde{x}) = \frac{1}{A}\sum_{i,j=1}^{N}\sum_{\alpha=1}^{M} b_i^\alpha b_j^\alpha K(\tilde{x},\tilde{x}_i)K(\tilde{x},\tilde{x}_j),$$
$$A = \sigma^n\pi^{n/2}\,\mathrm{Tr}(K^{1/2}B) \quad \text{where } K^{1/2}_{ij} = \sqrt{K_{ij}} \quad \text{and} \quad B_{ij} = \sum_{\alpha=1}^{M} b_i^\alpha b_j^\alpha. \tag{14}$$
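A corresponding sketch of the pseudo-density, reusing the hypothetical gaussian_kpca output above; the closed-form normalisation $A$ is coded under our reading of Eq. (14), with $(K^{1/2})_{ij}$ taken to be the element-wise square root of $K_{ij}$:

```python
import numpy as np

def pseudo_density(X, sigma, b, test_points):
    """Pseudo-density of Eq. (14) (a sketch building on gaussian_kpca above;
    X: (N, n) training data, b: (M, N) data-normalised eigenvectors,
    test_points: (m, n) points at which to evaluate rho_hat)."""
    N, n = X.shape
    B = b.T @ b                                    # B_ij = sum_alpha b_i^alpha b_j^alpha
    # Normalisation A = sigma^n * pi^(n/2) * Tr(K^(1/2) B), reading (K^(1/2))_ij
    # as the element-wise square root of K_ij.
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K_half = np.exp(-sq / (4.0 * sigma ** 2))
    A = sigma ** n * np.pi ** (n / 2) * np.trace(K_half @ B)
    # Quadratic combination of kernels: sum_ij B_ij K(x, x_i) K(x, x_j).
    d = np.sum((test_points[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    Kx = np.exp(-d / (2.0 * sigma ** 2))
    return np.einsum('mi,ij,mj->m', Kx, B, Kx) / A
```

Note that the quadratic form in the last line is, by Eq. (13) with $m = 2$, simply the sum of squared KPCA components of each test point, divided by $A$.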

4. The pseudo-density

The pseudo-density as defined above is a normalised, positive definite function on input space. It has the property that $\hat{\rho}(\tilde{x}) \to 0$ as the point $\tilde{x}$ recedes from the data. An obvious question to ask is what is the relation between the pseudo-density and the actual density. For the case of evenly spaced data on the unit circle in $\mathbb{R}^2$, which can be solved exactly (see Appendix B), the pseudo-density gives us an appropriate, smoothed estimate of the actual density in the limit of small $\sigma$. This result also illustrates the importance of our choice of normalisation for the feature space basis vectors (Eq. (7)). However, this result does not enable us to distinguish between, say, the actual density and a positive definite, monotonic function of the density.

As regards the relationship between this and other kernel-based methods of density estimation (e.g. naïve or adaptive kernel density estimation [12], or Support Vector Machine methods of density estimation [6,7]), we note that these all involve linear combinations of kernel functions. The pseudo-density function defined above involves a quadratic combination of kernels; hence, there will not be a direct equivalence between this and the other methods. Also, the pseudo-density has the further degree of freedom as regards the number of KPCA components used in the calculation (that is, the value of $M$). The results of an experimental comparison are given below.

Let us consider a data set which is generated from a specified probability density function $\rho(\tilde{x})$. We can compute the values of the Bhattacharyya measure [13,14] between the actual data distribution, the generating distribution $\rho(\tilde{x})$, and the calculated pseudo-density. The Bhattacharyya measure between two normalised distributions $A(\tilde{x})$ and $B(\tilde{x})$ on $\mathbb{R}^n$ is defined as

$$L[A,B] \equiv \int_{\mathbb{R}^n} d\tilde{x}\,\sqrt{A(\tilde{x})B(\tilde{x})}. \tag{15}$$

Hence, $L$ is positive definite, with a maximum value of 1, where larger values of the measure correspond to greater similarities between the two distributions.

In Fig. 2, we show the effect of varying the kernel width on the Bhattacharyya measure for two examples of one-dimensional distributions. For the first example, which consists of a sharp Gaussian spike superimposed on a broader Gaussian, we can see that the maximum value of the measure between estimate and generating distribution corresponds to a value of the kernel width such that the pseudo-density tracks the histogram. This occurs because if the pseudo-density is to follow the spike at all, we must use a small value of the kernel width, hence will tend to overfit on the broader Gaussian. In the second example, where the difference in width between the two component Gaussians is not so large, we see that the maximum measure value corresponds to a smoother estimate, with a larger measure value than that between the generating distribution and the histogram.

In Fig. 3, we consider a two-dimensional distribution. In the first case, a sample of 300 points was taken, and the pseudo-density calculated using all the components corresponding to computationally non-zero eigenvalues. We give, for comparison, the Bhattacharyya measures between the generating distribution and both naïve and adaptive kernel density estimates with Gaussian kernels. The naïve kernel density estimate is calculated thus:

$$f_{\mathrm{KDE}}(\tilde{x};\sigma) \propto \sum_{i=1}^{N}\exp\left(-\frac{(\tilde{x}-\tilde{x}_i)^2}{2\sigma^2}\right), \tag{16}$$

where the value of $\sigma$ is fixed, whereas in adaptive kernel density estimates, the value of $\sigma$ varies according to the local density of the data points [12]. In this case, we see that the maxima of the Bhattacharyya measure for the pseudo-density and for the other kernel estimates have nearly identical values. Note, however, that the peak for the pseudo-density is broader than that for the other estimates.

Consider now, however, the case where the data are noisy. We take as our noisy data a set of 250 points from the original distribution, and add a further 50 points taken from a flat distribution, as shown in the histogram. To calculate the pseudo-density, we now use only the first 10 KPCA components. As can be seen from the figure, the higher KPCA components extract the salient features of the generating distribution from the noise, and enable us to obtain a reasonable estimate of that distribution, with higher maximum values of the Bhattacharyya measure than either the naïve or adaptive kernel estimates.
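For reference, the Bhattacharyya measure of Eq. (15) and the naïve kernel density estimate of Eq. (16) can both be evaluated on a grid with a few lines of code; this is an illustrative sketch (the grid-based discretisation and the function names are ours, not the paper's):

```python
import numpy as np

def bhattacharyya(p, q, cell_volume):
    """Bhattacharyya measure of Eq. (15) between two densities sampled on the
    same grid of cells with volume cell_volume (discretised approximation)."""
    return np.sum(np.sqrt(p * q)) * cell_volume

def naive_kde(X, sigma, grid, cell_volume):
    """Naive fixed-width Gaussian KDE of Eq. (16), renormalised over the grid.
    X: (N, n) data; grid: (m, n) evaluation points."""
    d = np.sum((grid[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    f = np.exp(-d / (2.0 * sigma ** 2)).sum(axis=1)
    return f / (f.sum() * cell_volume)

# Example (1-D): compare a KDE of a sample against its generating density.
x = np.linspace(-5.0, 5.0, 1001)[:, None]
dx = x[1, 0] - x[0, 0]
rng = np.random.default_rng(1)
sample = rng.normal(0.0, 1.0, size=(300, 1))
true_pdf = np.exp(-x[:, 0] ** 2 / 2.0) / np.sqrt(2.0 * np.pi)
est = naive_kde(sample, sigma=0.3, grid=x, cell_volume=dx)
print(bhattacharyya(true_pdf, est, dx))   # close to 1 for a good estimate
```

The same bhattacharyya function can of course be applied with the pseudo_density sketch above in place of the KDE, which is the style of comparison reported in Figs. 2 and 3.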


Fig. 2. Experimental results for two different distributions. Left: The generating distributions (grey line), the pseudo-density (black line) and the data histograms at a kernel width of 0.15. Right: The Bhattacharyya measure between the pseudo-density and the data (grey line), and between the pseudo-density and the generating distribution (black line), as a function of the kernel width. The horizontal line gives the Bhattacharyya measure between the generating distribution and the data.

Fig. 3. Top left: A two-dimensional generating distribution. Bottom left: The Bhattacharyya measures between the pseudo-density (black line), the naïve and adaptive KDE estimates (dashed and solid grey lines, respectively) and the generating distribution as a function of kernel/window width. Top right: The generating distribution and the histogram of the noisy data. Bottom right: The corresponding measures for the case with added noise.


Fig. 4. The digit training set in the first 3 dimensions of KPCA space. White circles: The digit ‘0’. Light grey circles: The digit ‘6’. Dark grey circles: The digit ‘1’. Crosses: Other digits.

5. Handwritten digits

In this section, we demonstrate the effectiveness of our technique by applying it to the case of real-world data: images of handwritten digits. As a test data set, these present inherent structure, in terms of the different digits, and also significant variation, in terms of the differences between examples of the same digit. The data are also high-dimensional, since we consider the images themselves as our input. We aim to extract the salient information from noisy examples of unseen digit images; that is, we show that the input noisy images can be mapped to noise-free examples of the correct digit using gradient ascent of the pseudo-density function.

The input data consists of greyscale images of handwritten digits, taken from the UCI [15] database. The original 32x32 binary images were reduced to 16x16 images by averaging over four-pixel neighbourhoods to produce quantised pixel values, $I_{ij} \in \{0, \frac{1}{4}, \frac{1}{2}, \frac{3}{4}, 1\}$. A 500-image training set and a 200-image test set were then constructed by randomly selecting 50 and 20 images, respectively, of each digit. Each image was then directly encoded as a vector with quantised components, lying in a 256-dimensional image input space.

KPCA was applied to the training set. From the allowed pixel values, we know that all the data are constrained to lie in the unit cube in 256 dimensions, which then gives a minimum data separation length of $\frac{1}{4}$ and a maximum of 16. A kernel width was then chosen within this range ($\sigma = 2$), such that it gave a reasonable separation of digits in the first few dimensions of KPCA space. Fig. 4 shows the training set in the first 3 dimensions of KPCA space. It can clearly be seen that the first 3 components separate the digits $\{0, 1, 6\}$ from the other digits. The higher components separate the other digits in an analogous manner. For the construction of the pseudo-density, we retained all components which corresponded to computationally non-zero eigenvalues, which corresponds to a value of $M = 443$.

The test set was taken, and noise applied: either salt and pepper, speckle or additive Gaussian. Note that the resultant image values $I_{ij}$ were then truncated to lie within the range 0-1. The algorithm could deal equally well with values lying outside this range; however, it would then be problematic visualising the resultant noisy images. Gradient ascent was performed on the pseudo-density function to find the closest pattern consistent with the training set. (The form of the derivatives of $\hat{\rho}(\tilde{x})$ can be calculated directly from Eqs. (4) and (14).) The halting value for the gradient ascent was chosen such that the corresponding isosurface of the pseudo-density enclosed all the points from the training set. It would obviously be an advantage if we could choose the isosurface so that it enclosed a given proportion of the density estimate. However, the form of the pseudo-density does not allow us to explicitly perform the integration. So, we select the value on the isosurface to be a proportion of the minimum value of the pseudo-density across the data.

Note that the aim here is not to reconstruct the original image without noise (as is the aim of Mika et al. [8]), but to iterate to a clean image of the correct digit. So rather than recovering the explicit image information (i.e. the pixel values), we instead aim to accurately reconstruct the implicit information, that is, the identity of the digit. Our experiment is thus closer to solving a classification problem than the de-noising problems that have been studied previously.

Examples of the evolution and results of this process for the case of additive Gaussian noise of variance 0.9 are given in Fig. 5. For these examples, it can be seen that, apart from the case of digit 9, the process produces the correct digit as output. Note, however, that the resultant digit does not necessarily belong to the same subset as the input digit (e.g. digits 4 and 5 in the figure). This can be explained if the pseudo-density is such that within the set containing all instances of a particular digit, there are several peaks of the pseudo-density function. We could try to vary the kernel width such that the pseudo-density only possessed one peak within each digit set; however, we would then run the risk of over-smoothing in the transition regions (e.g. between 5's and 9's, see Fig. 5).

The algorithm was then applied to the full test set of 200 images with noise added. The resultant images were classified by hand, either as a particular digit, or as indeterminate. Fig. 6 shows the percentage of correctly identified digits as a function of noise density/Gaussian variance. Firstly, note that the percentage of correctly identified digits, even for the case of zero noise, never exceeds 94.5%. For the case of speckle noise, the correctly identified percentage falls only slowly, and does not fall below 85% over the range considered.
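The gradient ascent step can be sketched as follows. The analytic gradient comes from differentiating the quadratic form of Eq. (14) with respect to the test point, while the fixed normalised step size, the iteration count and the function name are our own arbitrary choices rather than the paper's settings:

```python
import numpy as np

def denoise_by_gradient_ascent(x0, X, sigma, B, A, step=0.05, n_iter=500):
    """Gradient ascent of the pseudo-density, starting from a noisy image x0
    (flattened to a vector). B and A are the matrix and normalisation constant
    of Eq. (14), computed as in the pseudo_density sketch above."""
    x = x0.copy()
    for _ in range(n_iter):
        w = np.exp(-np.sum((X - x) ** 2, axis=1) / (2.0 * sigma ** 2))  # K(x, x_i), Eq. (3)
        v = B @ w
        rho = (w @ v) / A                                               # pseudo-density, Eq. (14)
        grad = (2.0 / (A * sigma ** 2)) * ((v * w) @ (X - x))           # analytic gradient of rho
        x = x + step * grad / (np.linalg.norm(grad) + 1e-12)            # normalised ascent step
        # The paper halts once rho exceeds a threshold chosen so that the
        # corresponding isosurface encloses all of the training points.
    return x, rho
```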


Fig. 5. Example evolution of the gradient ascent algorithm. From the left: original digit, digit with Gaussian noise of variance 0.9, samples from the evolution, and final output image.

For the case of additive Gaussian noise, values are above 72.5% over the range. The only type of noise which significantly affects the performance is the salt and pepper noise. In Fig. 7, we give examples from the test set for the case of salt and pepper noise of density 0.6. It can be seen that the input images are extremely noisy. Despite this, the algorithm converges to the correct digit in 69.5% of these cases.

6. Conclusions

We have shown how Gaussian KPCA components in the data input space can be used to define a new normalised 'proximity to data' measure, which we have called the pseudo-density. Theoretical considerations show that, in the appropriate limit of small kernel width, the extrema of the pseudo-density will lie in regions close to the data, whilst the pseudo-density will tend to zero in regions far from the data. We have demonstrated this explicitly for the continuum case of points on the unit circle in $\mathbb{R}^2$, which we have solved exactly. Furthermore, we have investigated the relationship between the pseudo-density and the actual generating density, for some examples of one- and two-dimensional distributions.

We have shown that the pseudo-density gives as reliable an estimate as the naïve or adaptive kernel density estimates for clean data, but with the added feature that, by restricting the number of KPCA components used, the pseudo-density gives a quantitatively better estimate than the other kernel approaches in the case where there is noise present in the data.

We have also applied our technique to the high-dimensional data case of images of handwritten digits. As we would have expected from the previous work on KPCA [8], the structure inherent in the data is extracted, explicitly mapping different input digits to different regions of the KPCA space. Construction of the pseudo-density, and the application of a gradient ascent algorithm, allows us to extract the correct digit from unseen noisy input images, even at high levels of noise. We have shown elsewhere that this approach can also be applied to the problem of shape approximation in non-linear shape spaces [16], representing an advance over the approach described by Romdhani et al. [10].

The results we have presented show that our technique for constructing the pseudo-density produces a useful representation of high-dimensional, non-linear data sets, and fulfils two of the requirements given in the Introduction: that of illuminating structure inherent in the data, and of producing the nearest valid example to some unseen, noisy test data.


Fig. 6. The percentage of correctly identified noisy digits for speckle noise (dashed black line), additive Gaussian noise (solid black line) and salt and pepper noise (grey line) as a function of noise density/Gaussian variance.

Fig. 7. Top: Original digits. Middle: Digits with salt and pepper noise of density 0.6 added. Bottom: The output from the algorithm.

However, as regards the first requirement given in the Introduction, we note that although the pseudo-density model is continuous in input space, the technique as applied above is not compact; we have to calculate the kernel function between the test point and all the data points in the training set. We note that the question of optimising the KPCA algorithm as applied to large data sets (e.g. $N \ge 3000$) has already been addressed by several authors [17,18], as has the question of constructing reduced-set approximations to the exact KPCA results [19]. What is still unresolved is how to optimise the KPCA parameters, namely the kernel width and the number of non-linear components retained, and how to put the choice of a valid data region (i.e. the choice of isosurface of the pseudo-density) into a proper probabilistic framework.

Appendix A

In this appendix, we will show how to obtain a series expansion for the normalised kernel matrix in the limit of large kernel width.

This then allows us to calculate the limiting form of some of the eigenfunctions, which illustrates the exact way in which the non-linear KPCA components are related to the linear PCA components in this limit.

Let us consider the continuum limit, with centred data density $\rho(\tilde{x})$, where $\tilde{x} \in \mathbb{R}^n$. The unnormalised kernel matrix (1) is replaced by the kernel function

$$K(\tilde{x},\tilde{y}) = \exp(-a(\tilde{x}-\tilde{y})^2), \quad \text{where } a \equiv \frac{1}{2\sigma^2}. \tag{A.1}$$

The normalised kernel function is given by the continuum analogue of Eq. (4) thus:

$$\tilde{K}(\tilde{x},\tilde{y}) = K(\tilde{x},\tilde{y}) - \int d\tilde{z}\,\rho(\tilde{z})K(\tilde{x},\tilde{z}) - \int d\tilde{z}\,\rho(\tilde{z})K(\tilde{z},\tilde{y}) + \int\!\!\int d\tilde{u}\,d\tilde{v}\,\rho(\tilde{u})K(\tilde{u},\tilde{v})\rho(\tilde{v}). \tag{A.2}$$

Consider the limit where the kernel width $\sigma$ is large with respect to the span of the data; that is, $\rho(\tilde{x})$ is only non-zero inside some region of characteristic width $l$, where $l \ll \sigma$. It follows that in integrals of the form

$$\int d\tilde{x}\,\rho(\tilde{x})K(\tilde{x},\tilde{y})\,g(\tilde{x},\tilde{y},\tilde{z},\tilde{u},\ldots), \tag{A.3}$$

we can replace $K(\tilde{x},\tilde{y})$ by its Taylor expansion about the point $\tilde{x} = 0$. Applying this to the definition above gives the following expansion for the normalised kernel function:

$$\tilde{K}(\tilde{x},\tilde{y}) = K(\tilde{x},\tilde{y}) - [e^{-a\tilde{x}^2} + e^{-a\tilde{y}^2} - 1] + a\,\mathrm{Tr}(C)[e^{-a\tilde{x}^2} + e^{-a\tilde{y}^2} - 2] + O(a^2),$$
$$\text{where } C_{\mu\nu} \equiv \int d\tilde{x}\,x_\mu x_\nu\,\rho(\tilde{x}) \equiv \langle x_\mu x_\nu\rangle, \quad \mu,\nu = 1,2,\ldots,n. \tag{A.4}$$

We now apply this to the continuum version of the eigenvector equation (6):

$$\int d\tilde{x}\,\rho(\tilde{x})\tilde{K}(\tilde{x},\tilde{y})\,b^\alpha(\tilde{x}) = \lambda^\alpha b^\alpha(\tilde{y}). \tag{A.5}$$

To leading order, there exists the following set of solutions:

$$b^\alpha(\tilde{x}) = A^\alpha_\mu x_\mu \exp(-a\tilde{x}^2), \quad \text{where } A^\alpha_\mu C_{\mu\nu} A^\alpha_\nu = \lambda^\alpha_{\mathrm{lin}} \quad \text{and} \quad \lambda^\alpha = 2a\lambda^\alpha_{\mathrm{lin}} = \frac{\lambda^\alpha_{\mathrm{lin}}}{\sigma^2}. \tag{A.6}$$

$\tilde{A}^\alpha$ are the eigenvectors of the covariance matrix $C_{\mu\nu}$ of the data in input space; that is, they represent the directions of the set of principal axes defined by linear PCA acting on the input data. (Note that here the Greek subscripts $\mu,\nu$ indicate the components of $\tilde{x}$ in the input space $\mathbb{R}^n$.) The associated unnormalised KPCA components are given by

$$p^\alpha(\tilde{y}) \propto \int d\tilde{x}\,\rho(\tilde{x})K(\tilde{x},\tilde{y})\,b^\alpha(\tilde{x})$$
$$\approx a\,e^{-a\tilde{y}^2}\left[\lambda^\alpha_{\mathrm{lin}} A^\alpha_\nu y_\nu - 2A^\alpha_\nu\langle x_\nu x_\mu x_\mu\rangle\right] + O(a^2)$$
$$\approx a\lambda^\alpha_{\mathrm{lin}} A^\alpha_\nu y_\nu\,e^{-a\tilde{y}^2} \quad \text{to leading order, since } \langle x_\nu x_\mu x_\mu\rangle = O(l^3). \tag{A.7}$$

Hence, we can clearly see that these components will not give a contribution to the calculation of the pseudo-density which reflects the actual density. It also illustrates the exact way in which KPCA in the large-$\sigma$ limit approaches linear PCA. In the region $\tilde{y} \ll \sigma$:

$$p^\alpha(\tilde{y}) \propto \lambda^\alpha_{\mathrm{lin}} A^\alpha_\nu y_\nu, \tag{A.8}$$

which is just the linear principal component along the direction $\tilde{A}^\alpha$.
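The large-$\sigma$ limit derived above is easy to check numerically. The following sketch (ours, not the authors'; it reuses the hypothetical gaussian_kpca function from Section 2) verifies that, for a kernel width much larger than the data span, the leading KPCA component is essentially the leading linear principal component, as in Eq. (A.8):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) * np.array([3.0, 1.0])   # anisotropic cloud, span of a few units
X = X - X.mean(axis=0)                                 # centred data, as assumed above

b, lam, components = gaussian_kpca(X, sigma=50.0)      # kernel width >> data span
p1 = components(X)[:, 0]                               # leading KPCA component on the data

evals, evecs = np.linalg.eigh(np.cov(X.T))             # linear PCA in input space
pc1 = X @ evecs[:, -1]                                 # leading linear principal component

print(abs(np.corrcoef(p1, pc1)[0, 1]))                 # expected to be very close to 1
```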


Appendix B

In this appendix, we consider the continuum limit of data evenly distributed on the unit circle in $\mathbb{R}^2$. In polar coordinates, the unnormalised Gaussian kernel matrix has the form

$$K([1,\theta_p],[1,\theta_q]) \equiv K(\theta_p,\theta_q) \equiv K_{pq} = \exp\left(-\frac{1}{\sigma^2}\right)\exp\left(\frac{1}{\sigma^2}\cos(\theta_p-\theta_q)\right). \tag{B.1}$$

Centred feature space vectors are defined in a way analogous to the discrete case (2):

$$\tilde{\Phi}(r,\theta) = \Phi(r,\theta) - \frac{1}{2\pi}\int_0^{2\pi}d\phi\,\Phi(1,\phi), \tag{B.2}$$

which leads to the following expression for the normalised kernel matrix:

$$\tilde{K}_{pq} \equiv \tilde{\Phi}(1,\theta_p)\cdot\tilde{\Phi}(1,\theta_q) = K_{pq} - \frac{1}{2\pi}\int_0^{2\pi}d\theta\,K(\theta_p,\theta) - \frac{1}{2\pi}\int_0^{2\pi}d\theta\,K(\theta,\theta_q) + \frac{1}{4\pi^2}\int_0^{2\pi}\!\!\int_0^{2\pi}d\theta\,d\psi\,K(\theta,\psi)$$
$$= K_{pq} - \exp\left(-\frac{1}{\sigma^2}\right)I_0\left(\frac{1}{\sigma^2}\right), \tag{B.3}$$

where $I_0(z)$ is a modified Bessel function of the first kind:

$$I_n(z) = \frac{1}{2\pi}\int_0^{2\pi}d\theta\,e^{z\cos\theta}\cos(n\theta) \quad \text{for integral } n. \tag{B.4}$$

The eigenvectors of $\tilde{K}_{pq}$ become eigenfunctions in the continuum limit thus:

$$\lambda^\alpha b^\alpha(\theta) = \frac{1}{2\pi}\int_0^{2\pi}d\psi\,\tilde{K}(\theta,\psi)\,b^\alpha(\psi)$$
$$= \frac{1}{2\pi}\int_0^{2\pi}d\psi\left[\exp\left(\frac{1}{\sigma^2}\cos(\psi-\theta)\right)\exp\left(-\frac{1}{\sigma^2}\right) - \exp\left(-\frac{1}{\sigma^2}\right)I_0\left(\frac{1}{\sigma^2}\right)\right]b^\alpha(\psi). \tag{B.5}$$

This has the set of paired orthogonal solutions:

$$b^\alpha(\theta)^{[1]} = \cos(\alpha\theta), \quad b^\alpha(\theta)^{[2]} = \sin(\alpha\theta), \quad \lambda^\alpha = \exp\left(-\frac{1}{\sigma^2}\right)I_\alpha\left(\frac{1}{\sigma^2}\right) \quad \text{for } \alpha = 1,2,\ldots,\infty. \tag{B.6}$$

As in the discrete case (7), we choose eigenfunctions which are uniformly scaled with respect to the data:

$$\int_0^{2\pi}d\theta\,\big(b^\alpha(\theta)^{[1,2]}\big)^2 = \text{constant} \quad \forall\,\alpha. \tag{B.7}$$

The normalised and unnormalised KPCA components of a test point, with polar coordinates $(r,\theta)$, are identical and are given by

$$p^\alpha(r,\theta)^{[1,2]} = \frac{1}{2\pi}\int_0^{2\pi}d\psi\,b^\alpha(\psi)^{[1,2]}\,K([1,\psi],[r,\theta])$$
$$= \frac{1}{2\pi}\int_0^{2\pi}d\psi\,b^\alpha(\psi)^{[1,2]}\exp\left(-\frac{1}{2\sigma^2}(1+r^2)\right)\exp\left(\frac{r}{\sigma^2}\cos(\psi-\theta)\right)$$
$$= b^\alpha(\theta)^{[1,2]}\exp\left(-\frac{1}{2\sigma^2}(1+r^2)\right)I_\alpha\left(\frac{r}{\sigma^2}\right) \quad \text{for } \alpha = 1,2,\ldots,\infty. \tag{B.8}$$

The normalised sum-of-squares of the components at a point, which is the pseudo-density, is given by

$$\hat{\rho}(r,\theta) = \lim_{M\to\infty}\left[\frac{1}{\pi\sigma^2}\exp\left(-\left(\frac{1}{2\sigma^2}+\frac{r^2}{\sigma^2}\right)\right)\frac{\sum_{\alpha=1}^{M}I_\alpha^2(r/\sigma^2)}{\sum_{\beta=1}^{M}I_\beta(1/2\sigma^2)}\right]. \tag{B.9}$$

Using the following expansion ([20, Eq. (8.451.5)]), valid for $x \gg 1$, and keeping only the leading term:

$$I_n(x) \approx \frac{e^x}{\sqrt{2\pi x}}\sum_{k=0}^{\infty}\frac{(-1)^k\,\Gamma(n+k+1/2)}{(2x)^k\,k!\,\Gamma(n-k+1/2)} = \frac{e^x}{\sqrt{2\pi x}}\left[1 + O\left(\frac{1}{x}\right)\right], \tag{B.10}$$

leads to the following expressions for the components and the pseudo-density in the region $r \gg \sigma^2$ with $\sigma^2 \ll 1$:

$$p^\alpha(r,\theta)^{[1,2]} \approx \frac{\sigma\,b^\alpha(\theta)^{[1,2]}}{\sqrt{2\pi r}}\exp\left(-\frac{1}{2\sigma^2}[r-1]^2\right), \tag{B.11}$$

$$\hat{\rho}(r,\theta) \approx \frac{1}{2\pi^{3/2}\sigma r}\exp\left(-\frac{1}{\sigma^2}[r-1]^2\right). \tag{B.12}$$

Integrating over $\theta$, this gives the radial distribution:

$$P(r) \approx \frac{1}{\sqrt{\pi}\,\sigma}\exp\left(-\frac{1}{\sigma^2}[r-1]^2\right). \tag{B.13}$$

Hence, we can see that in the limit $\sigma^2 \ll 1$, the pseudo-density gives a normalised, smoothed estimate of the radial delta-function distribution of the input data, with the width of the Gaussian kernel acting as a smoothing parameter. It should be noted that this result depends on the uniform scaling of the eigenfunctions, which leads to a finite $M \to \infty$ limit in Eq. (B.9) when the leading term in the expansion (B.10) is substituted for $I_\alpha$.

Consider now the mapping of the input space $\mathbb{R}^2$ to the KPCA space with coordinates $p^\alpha(r,\theta)^{[1,2]}$. The origin and the point at infinity in $\mathbb{R}^2$ both map to the origin in KPCA space. In KPCA space, the distance from the origin of the image of a point $(r,\theta)$ is given by the square root of $\hat{\rho}(r,\theta)$. Hence, we can see that the submanifold formed by the mapping of the plane is folded, with the point at infinity and the origin both lying in the region of the origin in KPCA space, and the data points occupying the portion of the submanifold furthest from the origin. An example illustrating this is shown in Fig. 1, where the projection of the submanifold into the first 3 dimensions of KPCA space has been calculated from Eq. (B.8) for the case $\sigma = 0.1$.

References

[1] H. Hotelling, Analysis of a complex of statistical variables into principal components, J. Educ. Psychol. 24 (1933) 417–441.
[2] I.T. Jolliffe, Principal Component Analysis, Springer Series in Statistics, Springer, New York, 1986.
[3] T.F. Cootes, C.J. Taylor, A mixture model for representing shape variation, Image Vision Comput. 17 (8) (1999) 567–573.
[4] B. Schölkopf, A. Smola, K.-R. Müller, Nonlinear component analysis as a kernel eigenvalue problem, Neural Comput. 10 (1998) 1299–1319.
[5] B. Schölkopf, A. Smola, K.-R. Müller, Kernel principal component analysis, in: B. Schölkopf, C.J.C. Burges, A.J. Smola (Eds.), Advances in Kernel Methods—Support Vector Learning, MIT Press, Cambridge, MA, 1999, pp. 327–352.
[6] J. Weston, A. Gammerman, M. Stitson, V. Vapnik, V. Vovk, C. Watkins, Support vector density estimation, in: B. Schölkopf, C.J.C. Burges, A.J. Smola (Eds.), Advances in Kernel Methods—Support Vector Learning, MIT Press, Cambridge, MA, 1999, pp. 293–306.
[7] V. Vapnik, S. Mukherjee, Support vector method for multivariate density estimation, in: S.A. Solla, T.K. Leen, K.-R. Müller (Eds.), Adv. Neural Information Processing Systems, Vol. 12, MIT Press, Cambridge, MA, 2000.
[8] S. Mika, B. Schölkopf, A. Smola, K.-R. Müller, M. Scholz, G. Rätsch, Kernel PCA and de-noising in feature spaces, in: M.S. Kearns, S.A. Solla, D.A. Cohn (Eds.), Adv. Neural Information Processing Systems, Vol. 11, MIT Press, Cambridge, MA, 1999, pp. 536–542.
[9] R. Rosipal, M. Girolami, L.J. Trejo, Kernel PCA for feature extraction and de-noising in non-linear regression, Technical Report No. 4, Department of Computing and Information Systems, University of Paisley, 2000.
[10] S. Romdhani, S. Gong, A. Psarrou, A multi-view nonlinear active shape model using kernel PCA, in: T. Pridmore, D. Elliman (Eds.), Proceedings of the 10th British Machine Vision Conference (BMVC99), BMVA Press, 1999, pp. 483–492.
[11] C.J.C. Burges, Geometry and invariance in kernel based methods, in: B. Schölkopf, C.J.C. Burges, A.J. Smola (Eds.), Advances in Kernel Methods—Support Vector Learning, MIT Press, Cambridge, MA, 1999, pp. 89–116.
[12] B.W. Silverman, Density Estimation for Statistics and Data Analysis, Chapman & Hall, London, 1986.
[13] A. Bhattacharyya, On a measure of divergence between two multinomial populations, Sankhya: The Indian Journal of Statistics 7 (1946) 401–406.

[14] W.J. Krzanowski, F.H.C. Marriott, Multivariate Analysis: Part 2, Classification, Covariance Structures and Repeated Measurements, Kendall's Library of Statistics, Arnold, London, 1995.
[15] P.M. Murphy, D.W. Aha, UCI repository of machine learning databases, available from: http://www.ics.uci.edu/~mlearn/MLRepository.html, 1994.
[16] C.J. Twining, C.J. Taylor, Kernel principal component analysis and the construction of non-linear active shape models, in: T.F. Cootes, C.J. Taylor (Eds.), Proceedings of the 12th British Machine Vision Conference (BMVC2001), Vol. 1, BMVA Press, 2001, pp. 23–32.
[17] P. Moerland, Mixture models for unsupervised and supervised learning, Ph.D. Thesis, Swiss Federal Institute of Technology, Lausanne, available as IDIAP Research Report IDIAP-RR 00-18, 2000.
[18] A. Smola, B. Schölkopf, Sparse greedy matrix approximation for machine learning, in: P. Langley (Ed.), Proceedings of the 17th International Conference on Machine Learning (ICML '00), Morgan Kaufmann, San Francisco, CA, 2000, pp. 911–918.
[19] B. Schölkopf, S. Mika, C.J.C. Burges, P. Knirsch, K.-R. Müller, G. Rätsch, A. Smola, Input space vs. feature space in kernel-based methods, IEEE Trans. Neural Networks 10 (5) (1999) 1000–1017.
[20] I.S. Gradshteyn, I.M. Ryzhik, Table of Integrals, Series and Products, Academic Press, London.

About the Author—CAROLE J. TWINING obtained her B.A. degree in Physics from the University of Oxford (1985), and her Ph.D. in Theoretical Particle Physics from the Department of Applied Maths and Theoretical Physics, University of Liverpool (1990). From 1990 to 1992 she held an S.E.R.C. postdoctoral fellowship at the Department of Theoretical Physics, University of Oxford. Since 1999, she has been a Research Associate in the Division of Imaging Science and Biomedical Engineering, University of Manchester. Her current research interests include the theoretical aspects of shape, appearance and behaviour modelling, and information theory-based approaches to image registration and machine learning. About the Author—CHRIS TAYLOR received his B.Sc. degree in Physics and Ph.D. in Computer Image Analysis from the University of Manchester in 1967 and 1972, respectively. He currently holds a Chair in the Division of Imaging Science and Biomedical Engineering, University of Manchester. His long-term interest is in practical applications of computer vision in both medicine and industry. His current research interests are in the areas of model-based vision (with particular emphasis on modelling variable objects), medical image interpretation, and vision-based human–computer interaction, including face and gesture recognition.