Pattern Recognition 44 (2011) 284–294

A kernel-based parametric method for conditional density estimation

Gang Fu (a), Frank Y. Shih (b), Haimin Wang (c)

(a) Perot Systems Government Services, Fairfax, VA 22031, USA
(b) Computer Vision Laboratory, Department of Computer Science, New Jersey Institute of Technology, Newark, NJ 07102, USA
(c) Space Weather Research Laboratory, Department of Physics, New Jersey Institute of Technology, Newark, NJ 07102, USA

Article history: Received 16 November 2009; received in revised form 9 June 2010; accepted 23 August 2010.

Abstract

A conditional density function, which describes the relationship between response and explanatory variables, plays an important role in many analysis problems. In this paper, we propose a new kernel-based parametric method to estimate conditional density. An exponential function is employed to approximate the unknown density, and its parameters are computed from the given explanatory variable via a nonlinear mapping using kernel principal component analysis (KPCA). We develop a new kernel function, a variant of the polynomial kernel, to be used in KPCA. The proposed method is compared with the Nadaraya–Watson estimator through numerical simulation and practical data. Experimental results show that the proposed method outperforms the Nadaraya–Watson estimator in terms of the revised mean integrated squared error (RMISE). Therefore, the proposed method is an effective method for estimating conditional densities.

Keywords: Conditional density estimation; Kernel principal component analysis; Kernel function; Nadaraya–Watson estimator

1. Introduction

In this paper we develop a parametric method to estimate conditional densities from finite sets of samples. Let Y be a scalar response variable and X a d-dimensional explanatory variable, and let (X_1, Y_1), ..., (X_n, Y_n) be independent and identically distributed (i.i.d.) samples from an unknown density function f_{X,Y}(x,y). Our interest is the estimation of f_{Y|X}(y|x) at any given X = x, which is known as conditional density estimation (CDE). The conditional density function plays an important role in many fields because it describes the complete relationship between the response and the explanatory variables. Estimating a conditional density can be viewed as a generalization of regression: regression aims at estimating the conditional mean E(Y|X = x), while conditional density estimation seeks the full probability density function f_{Y|X}(y|x). The conditional density is highly informative because from it we can extract various statistical quantities of interest, such as the mean, prediction intervals, moments, the distribution function, and quantiles.

Nonparametric methods, such as the Nadaraya–Watson estimator [1,2], the local linear estimator [3], and finite mixture models [4], provide a consistent approach for estimating a large class of unknown densities. In practice, the greatest challenge is often the lack of specific knowledge about the parametric form of the underlying density, which makes it difficult to find an


appropriate parametric model. In this case nonparametric methods are preferred, since they are flexible and require no assumption on the parametric form of the underlying densities, which allows them to approximate densities of almost any form. The performance of these nonparametric methods usually depends heavily on the bandwidths, which are their most critical parameters and serve as smoothing control factors, since the fundamental approach of these methods is to smooth the data. Quite a few methods have been proposed in the literature for choosing bandwidths [3,5–10], but optimal bandwidth selection from finite samples is still a difficult task, and it becomes even more difficult for multivariate data. One reason is that these bandwidth selection methods aim at obtaining a single optimal bandwidth that is then applied to all the data samples. More smoothing may be needed where data are scarce, while less smoothing is needed to prevent subtle features from being washed out. Invariant bandwidths may therefore not be suitable over the entire domain, and adaptive bandwidths would be preferable, which makes the selection problem even harder. Due to this intrinsic difficulty, the applications of nonparametric methods are limited and, in many cases, confined to bivariate data.

In this paper we propose a new parametric method to estimate conditional densities, which overcomes the above difficulties of nonparametric estimators. We use an exponential function to approximate the unknown density function, and compute the parameters of the density approximation function from the given explanatory variable via a nonlinear mapping using kernel principal component analysis (KPCA) [11]. A new


kernel function is developed for use in KPCA, which leads to a feature space of potentially unlimited dimension. Note that the kernel function used in KPCA plays a completely different role from the one used in nonparametric methods: in the latter, the kernel serves as a smoothing function, while in KPCA the kernel is used to compute the dot product of two mapped vectors.

The remainder of this paper is organized as follows. Nonparametric methods are briefly reviewed in Section 2. In Section 3, the new kernel function and the parametric method for conditional density estimation are presented. Section 4 compares experimental results of the proposed method with those of the Nadaraya–Watson estimator. Finally, discussion and conclusions are given in Section 5.

2. Nonparametric estimation of conditional density

The conditional density f_{Y|X}(y|x) can be estimated from the joint density f_{X,Y}(x,y) and the marginal density f_X(x), since

  f_{Y|X}(y|x) = \frac{f_{X,Y}(x,y)}{f_X(x)}, \qquad f_X(x) \neq 0.    (1)

Nadaraya–Watson kernel regression [1,2] can be applied to estimate f_{X,Y}(x,y) and f_X(x), respectively:

  \hat{f}_{X,Y}(x,y) = \frac{1}{n} \sum_{i=1}^{n} W_h(Y_i - y)\, W_S(X_i - x),    (2)

  \hat{f}_X(x) = \frac{1}{n} \sum_{i=1}^{n} W_S(X_i - x),    (3)

where h is a scalar bandwidth, S is a d \times d nonsingular symmetric bandwidth matrix, and

  W_h(Y_i - y) = \frac{1}{\sqrt{2\pi}\, h} \exp\!\left( -\frac{1}{2} \left( \frac{Y_i - y}{h} \right)^2 \right),    (4)

  W_S(X_i - x) = \frac{1}{\sqrt{(2\pi)^d\, |S|}} \exp\!\left( -\frac{1}{2} (X_i - x)^T S^{-1} (X_i - x) \right),    (5)


in which Gaussian densities are chosen as the kernel functions. Based on \hat{f}_{X,Y}(x,y) and \hat{f}_X(x), we obtain the conditional density f_{Y|X}(y|x) as

  \hat{f}_{Y|X}(y|x) = \frac{\hat{f}_{X,Y}(x,y)}{\hat{f}_X(x)},    (6)

which is known as the Nadaraya–Watson estimator [12,13].

The conditional density function f_{Y|X}(y|x) can also be estimated using the local linear estimator proposed by Fan et al. [3]. Note that as h \to 0,

  E\{ W_h(Y - y) \mid X = x \} \to f_{Y|X}(y|x).    (7)

The left-hand side is the regression of W_h(Y - y) on X. We construct a new data set (X_1, W_h(Y_1 - y)), ..., (X_n, W_h(Y_n - y)) and define the following minimization problem:

  (\hat{a}_0, \hat{a}_1) = \arg\min_{a_0, a_1} \sum_{i=1}^{n} \{ W_h(Y_i - y) - a_0 - a_1^T (X_i - x) \}^2\, W_S(X_i - x),    (8)

and then \hat{f}_{Y|X}(y|x) = \hat{a}_0 and \hat{f}^{(1)}_{Y|X}(y|x) = \hat{a}_1, where \hat{f}^{(1)}_{Y|X}(y|x) is the first derivative of f_{Y|X}(y|x) at X = x. In fact, if we set a_1 = 0, Eq. (8) becomes

  \hat{a}_0 = \arg\min_{a_0} \sum_{i=1}^{n} \{ W_h(Y_i - y) - a_0 \}^2\, W_S(X_i - x),    (9)

which reduces to the Nadaraya–Watson estimator [14]. The local linear estimator can therefore be considered a generalization of the Nadaraya–Watson estimator.

Alternatively, finite mixture models can be used to solve the CDE problem. The idea is to model the conditional density by a mixture of a number of density components. When Gaussian densities are used as the components, the conditional density f_{Y|X}(y|x) can be written as

  \hat{f}_{Y|X}(y|x) = \sum_{i=1}^{L} p_i(x)\, N(y; \theta_i(x)),    (10)

where L is the number of density components, p_i(x) is the weight of the i-th density component, and \theta_i(x) is the vector of parameters of the i-th Gaussian density component N(\cdot\,;\cdot). The mixture parameters, \{L, p_1(x), ..., p_L(x), \theta_1(x), ..., \theta_L(x)\}, can be estimated by a number of algorithms, such as the Expectation–Maximization algorithm [15], Figueiredo and Jain's algorithm [16], and the Iterative Pairwise Replacement algorithm [17].

The Nadaraya–Watson density estimator provides a solid performance baseline. Compared with the Nadaraya–Watson estimator, the local linear estimator is slightly better in some respects, such as independence of the regression design, efficiency in the asymptotic minimax sense, and boundary behavior [18–20]. However, according to Jácome et al. [14], the local linear estimator does not necessarily yield more efficient estimates of conditional density than the Nadaraya–Watson estimator. The performance of finite mixture models is competitive with that of the Nadaraya–Watson estimator, but the computational cost of estimating their parameters is higher. Therefore, in the remainder of this paper, the Nadaraya–Watson estimator is selected for comparison with the proposed method.

The performance of these nonparametric methods depends critically on the selection of bandwidths. Various methods have been proposed to select bandwidths, such as ad hoc, bootstrap, cross-validation, plug-in, and dual-tree-based algorithms [3,5–9]. According to Fan and Yim [7] and Hall et al. [8], the cross-validation method yields the best performance. The cross-validation method minimizes the mean integrated squared error (MISE) of the estimated conditional density \hat{f}_{Y|X}(y|x) using the "leave-one-out" strategy. The MISE is defined as

  \mathrm{MISE} = \iint \bigl( \hat{f}_{Y|X}(y|x) - f_{Y|X}(y|x) \bigr)^2 f(x)\, dy\, dx = I_1 - 2 I_2 + I_3,    (11)

where

  I_1 = \iint \hat{f}_{Y|X}(y|x)^2\, f(x)\, dy\, dx,
  I_2 = \iint \hat{f}_{Y|X}(y|x)\, f_{Y|X}(y|x)\, f(x)\, dy\, dx,
  I_3 = \iint f_{Y|X}(y|x)^2\, f(x)\, dy\, dx.    (12)

Since I_3 does not depend on \hat{f}_{Y|X}(y|x), it is irrelevant to bandwidth selection and can safely be ignored. Let I denote the revised mean integrated squared error (RMISE); we have I = I_1 - 2 I_2. Minimizing the RMISE is thus equivalent to minimizing the MISE, and when the RMISE is minimized, ideally the estimate \hat{f}_{Y|X}(y|x) is identical to the true density f_{Y|X}(y|x). In the cross-validation method [8], I_1 and I_2 are estimated as

  \hat{I}_1 = \frac{1}{n} \sum_{i=1}^{n} \int \hat{f}^{-i}_{Y|X}(y|X_i)^2\, dy, \qquad
  \hat{I}_2 = \frac{1}{n} \sum_{i=1}^{n} \hat{f}^{-i}_{Y|X}(Y_i|X_i),    (13)


where \hat{f}^{-i}_{Y|X}(y|X_i) is estimated using the "leave-one-out" strategy. To compute \hat{I}_1, we need the integral \int \hat{f}^{-i}_{Y|X}(y|X_i)^2\, dy, which is

  \int \hat{f}^{-i}_{Y|X}(y|X_i)^2\, dy = \frac{ \int \bigl( \sum_{j \ne i} W_h(Y_j - y)\, W_S(X_j - X_i) \bigr)^2 dy }{ \bigl( \sum_{j \ne i} W_S(X_j - X_i) \bigr)^2 }.    (14)

The denominator on the right-hand side of Eq. (14) is analytically computable, and so is the numerator, since

  \int \Bigl( \sum_{j \ne i} W_h(Y_j - y)\, W_S(X_j - X_i) \Bigr)^2 dy = \frac{1}{2\sqrt{\pi}\, h} \sum_{j \ne i} \sum_{t \ne i} \exp\!\left( -\frac{(Y_j - Y_t)^2}{4 h^2} \right) W_S(X_j - X_i)\, W_S(X_t - X_i).    (15)

Since \hat{I}_2 can also be computed analytically, \hat{I} = \hat{I}_1 - 2 \hat{I}_2 can be determined analytically as a function of h and S. We then find the optimal h and S by minimizing \hat{I}, which is an optimization problem that can be solved by many methods, such as Newton's method. Finally, the h and S minimizing \hat{I} are chosen as the optimal bandwidths.
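To make the estimator concrete, the following is a minimal NumPy sketch of the Nadaraya–Watson conditional density estimate of Eqs. (2)–(6) at a single query point x. It is an illustrative sketch rather than the authors' code: the function name is invented, the bandwidths h and S are assumed to have been selected beforehand (e.g., by the cross-validated RMISE of Eqs. (11)–(15)), and Gaussian kernels are used as in Eqs. (4) and (5).

    import numpy as np

    def nw_conditional_density(x, y_grid, X, Y, h, S):
        """Nadaraya-Watson estimate of f(y | X = x) on a grid of y values.
        X: (n, d) explanatory samples, Y: (n,) responses, x: (d,) query point."""
        d = X.shape[1]
        Sinv = np.linalg.inv(S)
        diff = X - x                                           # X_i - x, shape (n, d)
        # W_S(X_i - x), Eq. (5)
        wx = np.exp(-0.5 * np.einsum('ij,jk,ik->i', diff, Sinv, diff))
        wx /= np.sqrt((2 * np.pi) ** d * np.linalg.det(S))
        # W_h(Y_i - y), Eq. (4), evaluated at every grid point y
        wy = np.exp(-0.5 * ((Y[:, None] - y_grid[None, :]) / h) ** 2)
        wy /= np.sqrt(2 * np.pi) * h
        f_xy = (wy * wx[:, None]).mean(axis=0)                 # joint density, Eq. (2)
        f_x = wx.mean()                                        # marginal density, Eq. (3)
        return f_xy / f_x                                      # conditional density, Eq. (6)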

3. Parametric method of conditional density estimation

In this section, we develop a kernel-based parametric conditional density estimator (KPCDE), an alternative approach to the CDE problem that does not require estimating bandwidths. We use an exponential function to construct the conditional density function, and the parameters of the approximate density function are mapped nonlinearly from the explanatory variable. A new kernel function is developed for the KPCA that implements the nonlinear mapping.

3.1. Feature extraction using kernel PCA

KPCA is a kernel-based variant of principal component analysis (PCA) [21]. Other kernel-based techniques include the support vector machine (SVM) and kernel Fisher discriminant (KFD); these techniques have achieved tremendous success in various fields, and a survey can be found in [22]. In KPCA, we first map the original variables X_1, ..., X_n into a higher-dimensional feature space F using a nonlinear mapping function \phi:

  \phi: \mathbb{R}^d \to F, \qquad X \mapsto \phi(X).    (16)

The dimension of F can be much higher than that of the original input space. We extract principal components of the feature space F. Let C denote the covariance matrix of F:

  C = \frac{1}{n} \sum_{i=1}^{n} \phi(X_i)\, \phi(X_i)^T = \frac{1}{n} \Phi \Phi^T,    (17)

where \Phi = [\phi(X_1), \phi(X_2), ..., \phi(X_n)]. Note that the covariance matrix C should be centered; the technical details are discussed in [11]. For simplicity, we assume that the mapped vectors \phi(X_1), ..., \phi(X_n) are already centered in the feature space F. Let V and \lambda be an eigenvector and the corresponding eigenvalue of C, respectively, such that

  \lambda V = C V,    (18)

where \lambda > 0 and V \neq 0. The eigenvector V lies in the feature space F, which is spanned by \phi(X_1), ..., \phi(X_n), and thus V can be represented as a linear combination of them. Hence there exists a weighting vector \beta such that

  V = \Phi \beta.    (19)

Substituting Eqs. (17) and (19) into Eq. (18), we obtain

  \lambda \Phi \beta = \frac{1}{n} \Phi \Phi^T \Phi \beta.    (20)

Multiplying both sides by \Phi^T, we have

  \lambda \Phi^T \Phi \beta = \frac{1}{n} \Phi^T \Phi \Phi^T \Phi \beta.    (21)

We define a kernel matrix K as

  K = \Phi^T \Phi.    (22)

Plugging Eq. (22) into Eq. (21) and cancelling K on both sides, we have

  n \lambda \beta = K \beta.    (23)

This indicates that \beta is an eigenvector of the matrix K. We apply an eigenvalue decomposition to K:

  K = U \Sigma U^T,    (24)

where \Sigma is a diagonal matrix whose diagonal elements are the nonzero eigenvalues, and U is the matrix consisting of all the eigenvectors. Supposing the dimension of the feature space F is l, we keep only the first l significant eigenvectors of K, denoted U_{1:l}, which are associated with the largest eigenvalues. Given a probe sample X = x, its feature vector O(x) is computed as

  O(x) = U_{1:l}^T \Phi^T \phi(x).    (25)

Let us now focus on the computation of K. Each element of K is the dot product of two mapped vectors, K_{i,j} = \phi(X_i)^T \phi(X_j). Since the dimension of \phi(X) can be extremely high, explicitly mapping the original X to the feature space and computing the dot products could be quite expensive and can become intractable in practice. Instead, we use a kernel function [23], denoted k(\cdot,\cdot), to compute the dot product, such that k(X_i, X_j) = \phi(X_i)^T \phi(X_j). According to Mercer's theorem [24], any continuous symmetric positive semi-definite kernel function can be expressed as a dot product of two vectors in some space; conversely, such a dot product can be reformulated in terms of a kernel function. Thus, choosing a kernel function k(\cdot,\cdot) implies a corresponding mapping function \phi(\cdot), and the nonlinear mapping can be implemented efficiently by choosing an appropriate kernel. Various kernel functions are used in machine learning, such as the Gaussian RBF, polynomial, sigmoidal, and inverse multiquadric kernels [22], summarized in Table 1. Considering the polynomial kernel (\langle u, v \rangle + \theta)^d, one typically chooses \theta = 0 or \theta = 1. When \theta = 0, the kernel computes a dot product over all products of exactly d entries of the two vectors [11,25]. When \theta = 1, the dot product involves products of 1 up to d entries, which may be preferable. When d is large, however, the value of the kernel function can become intractable.


To solve this problem, we develop a new kernel function:

  k(u, v) = \begin{cases} \dfrac{1 - (u^T v)^{d+1}}{1 - (u^T v)}, & \text{if } u^T v \neq 1, \\ d + 1, & \text{otherwise}, \end{cases}    (26)

where d > 0. It can be shown that the kernel function (26) is positive semi-definite and that it implements the following mapping function \phi:

  \phi(x) = \bigl( 1, \{x_i\}_{1 \le i \le m}, \{x_i x_j\}_{1 \le i, j \le m}, \ldots, \{x_{i_1} \cdots x_{i_d}\}_{1 \le i_1, \ldots, i_d \le m} \bigr)^T,    (27)

where x = (x_1, ..., x_m)^T. In the pre-processing step, we normalize the data samples \{X_i\}_{1 \le i \le n} to ensure \max_{1 \le i \le n} \|X_i\| < 1 by dividing each sample X_i by a constant. Thus, the kernel function k(\cdot,\cdot) is bounded no matter how large d is.

The kernel function (26) is useful because the corresponding mapping function \phi(x) consists of all the polynomial terms of x of degree up to d. Supposing that l is the dimension of the feature space F induced by the kernel function in Eq. (26), we have the upper bound

  l \le \min\left\{ \sum_{i=0}^{d} \binom{m + i - 1}{m - 1},\; n \right\},    (28)

where n is the number of samples and m is the dimension of X. When d \to \infty, the kernel function becomes

  k(u, v) = \frac{1}{1 - u^T v}, \qquad u^T v < 1.    (29)

In this case the dimension of F is bounded only by the number of samples.

Table 1
Common kernel functions.

  Gaussian RBF        k(u, v) = \exp(-\|u - v\|^2 / c)
  Polynomial          k(u, v) = (\langle u, v \rangle + \theta)^d
  Sigmoidal           k(u, v) = \tanh(\kappa \langle u, v \rangle + \theta)
  Inv. multiquadric   k(u, v) = 1 / \sqrt{\|u - v\|^2 + c^2}

3.2. Conditional density function reconstruction

After obtaining the feature vector O(x), we construct an approximate density function \hat{f}_{Y|X}(y|x) as follows:

  \hat{f}_{Y|X}(y|X = x) = e^{O(x)^T H \xi(y)},    (30)

where O(x) is the l-dimensional feature vector of x, H is an l \times r transformation matrix, and \xi(y) = (1, y, ..., y^r)^T. We use an exponential function in Eq. (30) because it is inspired by the Gibbs distribution and covers a number of probability densities.
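To illustrate the feature extraction, the sketch below implements the kernel of Eq. (26) and an uncentered KPCA projection following Eqs. (22)–(25). This is a simplified illustration under stated assumptions, not the authors' implementation: centering of the kernel matrix is omitted, as in the simplified derivation above, and the class and function names are invented for this example.

    import numpy as np

    def proposed_kernel(U, V, d=3):
        """Kernel of Eq. (26): k(u, v) = (1 - (u'v)^(d+1)) / (1 - u'v),
        with the limiting value d + 1 when u'v = 1.
        U: (n, m) array, V: (p, m) array; returns the (n, p) Gram matrix."""
        G = U @ V.T
        K = np.full_like(G, float(d + 1))
        ok = ~np.isclose(G, 1.0)
        K[ok] = (1.0 - G[ok] ** (d + 1)) / (1.0 - G[ok])
        return K

    class KernelPCAFeatures:
        """Minimal, uncentered KPCA feature extractor (Eqs. (22)-(25))."""
        def __init__(self, n_components, d=3):
            self.l, self.d = n_components, d

        def fit(self, X):
            # pre-processing: scale so that max_i ||X_i|| < 1, as required by Eq. (26)
            self.scale = 1.01 * np.max(np.linalg.norm(X, axis=1))
            self.X = X / self.scale
            K = proposed_kernel(self.X, self.X, self.d)       # kernel matrix K, Eq. (22)
            eigvals, eigvecs = np.linalg.eigh(K)              # eigenvectors of K, Eq. (23)
            top = np.argsort(eigvals)[::-1][:self.l]          # l most significant eigenvectors
            self.U = eigvecs[:, top]                          # U_{1:l} of Eq. (24)
            return self

        def transform(self, x):
            # O(x) = U_{1:l}^T Phi^T phi(x), with Phi^T phi(x) obtained through the kernel
            k = proposed_kernel(np.atleast_2d(x) / self.scale, self.X, self.d)
            return self.U.T @ k.ravel()                       # feature vector O(x), Eq. (25)

The feature vector O(x) obtained in this way is what enters the density reconstruction of Eq. (30).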

Fig. 1. (a) Sampling process of the Cox–Ingersoll–Ross model; (b)–(d) true and estimated conditional densities, by the two methods, at x = 0.05, 0.10, and 0.15. The solid line shows the true conditional density, the dashed line shows the conditional density estimated by the proposed method, and the dotted line shows the conditional density estimated by the Nadaraya–Watson estimator.


Table 2
Summary of RMISE for the Cox–Ingersoll–Ross model, showing mean, standard deviation, and median for the proposed method (KPCDE) and the Nadaraya–Watson estimator (N–W).

                        KPCDE      N–W
  Mean                 -96.814   -91.138
  Standard deviation     5.995     5.443
  Median               -96.250   -90.405

Table 3
Summary of RMISE for the multivariate model, showing mean, standard deviation, and median for the proposed method (KPCDE) and the Nadaraya–Watson estimator (N–W).

                        KPCDE      N–W
  Mean                  -2.318    -2.224
  Standard deviation     0.041     0.035
  Median                -2.313    -2.232

In this method, we need to estimate the parameters d, l, r, and H. Given particular values of d, l, r and a training sample set D, we can estimate H by minimizing the RMISE. Following Eqs. (11) and (12), we define the RMISE as g(H) = I_1 - 2 I_2, and g(H) can be estimated as

  \hat{g}(H) = \frac{1}{|D|} \sum_{X_i \in D} \int e^{2 O(X_i)^T H \xi(y)}\, dy \; - \; \frac{2}{|D|} \sum_{X_i \in D} e^{O(X_i)^T H \xi(Y_i)},    (31)

where O(X_i) is the feature vector of X_i, \xi(Y_i) = (1, Y_i, ..., Y_i^r)^T, and |D| denotes the number of samples in the set D. Eq. (31) cannot be computed directly in closed form, because the integral of Eq. (30) is not analytically computable. We therefore define an array that approximates the function in Eq. (30), whose domain can be estimated from the training data; using this approximation array, we compute the integral of Eq. (30), and hence Eq. (31), numerically. We intend to find the optimal matrix H by minimizing \hat{g}(H). Let \nabla g and Q be the first-order and second-order partial derivatives of g(\cdot), respectively. We have

  \nabla g = \left( \frac{\partial g}{\partial H_{1,1}}, \frac{\partial g}{\partial H_{1,2}}, \ldots, \frac{\partial g}{\partial H_{l,r}} \right)^T    (32)

Fig. 2. (a)–(i) True and estimated conditional densities at nine points. The solid line shows the true conditional density, the dashed line shows the conditional density estimated by the proposed method, and the dotted line shows the conditional density estimated by the Nadaraya–Watson estimator.


Fig. 3. (a) 3D view of the model defined in Eq. (39), (b) random samples drawn from the model, and (c)–(f) true and estimated conditional densities at various points. The solid line shows the true conditional density, the dashed line shows the conditional density estimated by the proposed method, and the dotted line shows the conditional density estimated by the Nadaraya–Watson method.


and

  Q = \begin{pmatrix}
    \frac{\partial^2 g}{\partial H_{1,1} \partial H_{1,1}} & \frac{\partial^2 g}{\partial H_{1,1} \partial H_{1,2}} & \cdots & \frac{\partial^2 g}{\partial H_{1,1} \partial H_{l,r}} \\
    \frac{\partial^2 g}{\partial H_{1,2} \partial H_{1,1}} & \cdots & \cdots & \cdots \\
    \cdots & \cdots & \cdots & \cdots \\
    \frac{\partial^2 g}{\partial H_{l,r} \partial H_{1,1}} & \frac{\partial^2 g}{\partial H_{l,r} \partial H_{1,2}} & \cdots & \frac{\partial^2 g}{\partial H_{l,r} \partial H_{l,r}}
  \end{pmatrix}.    (33)

We start with a randomly generated matrix H_0 and then update H iteratively as

  H_i \leftarrow H_{i-1} - \gamma\, Q^{-1} \nabla g,    (34)

where \gamma > 0 is a small number. After a number of iterations, H_i converges to a stable point, which is taken as the solution for H. Note that \hat{g}(H) may not always converge to the global minimum; sometimes it gets stuck in a local minimum, because \hat{g}(H) is not a convex function. Therefore, we have to try several different starting points. In practice, after a certain number of trials, such as five, we usually obtain a good estimate of H, and the obtained minimum of \hat{g}(H) can be considered the global minimum. In this way, we obtain the optimal H with respect to the given parameters (d, l, r).

We apply a cross-validation method to determine (d, l, r). Given an arbitrary (d, l, r), we estimate its performance as follows: (i) compute the feature vectors of all samples in the training data set D, and divide them equally into two parts, denoted D_1 and D_2; (ii) compute the optimal H_1 using the samples of D_1; (iii) compute the RMISE of the samples of D_2 using H_1, and use it as the performance measure of the given (d, l, r). Based on a grid search over (d, l, r), we find the optimal combination, denoted (d_o, l_o, r_o), with the best performance. We then compute H_o using all the samples of both D_1 and D_2, which yields the optimal solution (d_o, l_o, r_o, H_o).

When we estimate the conditional density function f_{Y|X}(y|X = x) for a given x, in order to ensure that the estimate \hat{f}_{Y|X}(y|X = x) is a valid density function, we normalize it as

  \hat{f}_{Y|X}(y|X = x) \leftarrow \frac{\hat{f}_{Y|X}(y|X = x)}{\int \hat{f}_{Y|X}(y|X = x)\, dy}.    (35)

Since Eq. (35) is not analytically computable, we apply the same technique used for Eq. (31) to compute the numerical result of Eq. (35).
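For illustration, the sketch below evaluates the density of Eq. (30), the empirical RMISE objective of Eq. (31) with the integral approximated on a grid of y values, and the normalization of Eq. (35). It is a simplified, hypothetical implementation rather than the authors' code: the Newton update with the Hessian (Eqs. (32)–(34)) is replaced by a crude finite-difference gradient step, H is given r + 1 columns here so that it matches \xi(y) = (1, y, ..., y^r)^T, and the function names and step sizes are invented for this example.

    import numpy as np

    def g_hat(H, O_feats, Y, y_grid, r):
        """Empirical RMISE of Eq. (31), with the per-sample integral taken on y_grid."""
        dy = y_grid[1] - y_grid[0]
        xi_grid = np.vander(y_grid, r + 1, increasing=True)     # xi(y) = (1, y, ..., y^r)
        xi_obs = np.vander(Y, r + 1, increasing=True)           # xi(Y_i)
        A = O_feats @ H                                          # O(X_i)' H, shape (n, r + 1)
        term1 = np.exp(2.0 * A @ xi_grid.T).sum(axis=1) * dy     # integral of exp(2 O' H xi(y))
        term2 = np.exp(np.sum(A * xi_obs, axis=1))               # exp(O(X_i)' H xi(Y_i))
        return term1.mean() - 2.0 * term2.mean()

    def fit_H(O_feats, Y, y_grid, r, iters=200, lr=1e-2, eps=1e-4, seed=0):
        """Crude minimizer of Eq. (31) by finite-difference gradient descent."""
        rng = np.random.default_rng(seed)
        H = 0.01 * rng.standard_normal((O_feats.shape[1], r + 1))   # random start, cf. H_0
        for _ in range(iters):
            base = g_hat(H, O_feats, Y, y_grid, r)
            grad = np.zeros_like(H)
            for idx in np.ndindex(H.shape):                      # numerical partial derivatives
                Hp = H.copy()
                Hp[idx] += eps
                grad[idx] = (g_hat(Hp, O_feats, Y, y_grid, r) - base) / eps
            H -= lr * grad
        return H

    def conditional_density(O_x, H, y_grid, r):
        """Eq. (30) followed by the normalization of Eq. (35)."""
        xi_grid = np.vander(y_grid, r + 1, increasing=True)
        f = np.exp(xi_grid @ (H.T @ O_x))                        # exp(O(x)' H xi(y)) on the grid
        return f / (f.sum() * (y_grid[1] - y_grid[0]))           # normalize to a valid density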

4. Experimental results

We compare the proposed method against the Nadaraya–Watson estimator using numerical simulation and practical data. The simulation study covers univariate, multivariate, Beta distribution, and bimodal data, while the real data set consists of Standard & Poor's 500 Index closing values from 1926 to 1956.

4.1. Cox–Ingersoll–Ross model

The well-known Cox–Ingersoll–Ross model [26] is of particular interest in finance. It models the evolution of a short-term interest rate X_t:

  dX_t = \mu(\theta - X_t)\, dt + \sigma \sqrt{X_t}\, dW_t,    (36)

where (\mu, \theta, \sigma) are constant parameters and W_t is a Wiener process modeling the random market risk factor. We use the same setup as in [7], where \mu = 0.21459, \theta = 0.08571, and \sigma = 0.0783. The process is simulated at a weekly frequency, so the time interval dt is set to 1/52. In the experiment, the initial interest rate X_0 at time t_0 is first generated from the invariant distribution of the process [7], which is a Gamma distribution:

  X \sim G\!\left( \frac{2\mu\theta}{\sigma^2},\; \frac{\sigma^2}{2\mu} \right).    (37)

Given the interest rate X_i at time t_i, we generate the interest rate X_{i+1} at time t_{i+1} by Eq. (36). We repeatedly generate a sample set of 4000 weekly observations; Fig. 1(a) shows the sampling process in one run. The data set {(X_0, X_1), ..., (X_i, X_{i+1}), ...} is created, and we randomly select half of the samples (i.e., 2000 samples) for training and the other half for testing. The two methods, the proposed method and the Nadaraya–Watson estimator, are applied to estimate conditional densities at each sample point. The experiment is repeated 100 times. Simulation results are summarized in Table 2. Fig. 1(b)–(d) shows typical estimates of the conditional densities at x = 0.05, 0.10, and 0.15, respectively. We observe that the proposed method performs better than the Nadaraya–Watson estimator.
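A possible way to generate the data for this experiment is sketched below. The paper does not state which discretization is used to draw X_{i+1} from Eq. (36), so this sketch uses a simple Euler step with the state clipped at zero; it is an assumption made for illustration (the exact CIR transition is noncentral chi-square), and the function name is invented.

    import numpy as np

    def simulate_cir(n_obs=4000, mu=0.21459, theta=0.08571, sigma=0.0783,
                     dt=1.0 / 52.0, seed=0):
        """Simulate the CIR process of Eq. (36), dX = mu (theta - X) dt + sigma sqrt(X) dW,
        started from the Gamma invariant distribution of Eq. (37)."""
        rng = np.random.default_rng(seed)
        x = np.empty(n_obs + 1)
        # invariant distribution: Gamma(shape = 2 mu theta / sigma^2, scale = sigma^2 / (2 mu))
        x[0] = rng.gamma(2 * mu * theta / sigma ** 2, sigma ** 2 / (2 * mu))
        for t in range(n_obs):
            dw = rng.normal(0.0, np.sqrt(dt))
            step = mu * (theta - x[t]) * dt + sigma * np.sqrt(max(x[t], 0.0)) * dw
            x[t + 1] = max(x[t] + step, 0.0)
        return x

    rates = simulate_cir()
    samples = np.column_stack([rates[:-1], rates[1:]])   # (X_i, X_{i+1}) pairs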

4.2. Multivariate model

To compare the performance of the two methods on multivariate data, we consider the following multivariate model:

  Y|X \sim N(x_1 + x_2,\; 0.015 + 0.05\, x_1 x_2),    (38)

where x_1 and x_2 are uniformly distributed over [-0.3, 0.3] and X = (x_1, x_2). In this experiment, 4000 samples of (X, Y) are generated, half for training and the other half for testing. The two methods, the proposed method and the Nadaraya–Watson estimator, are applied to estimate conditional densities. The experiment is repeated 100 times. Simulation results are summarized in Table 3. Fig. 2(a)–(i) shows typical estimates of the conditional densities at various points. It is observed that the proposed method yields better performance than the Nadaraya–Watson estimator.

4.3. Beta distribution model

We are interested in comparing the performance of the two methods when the normality assumption does not hold, since non-normal data are common in practice. We consider the following Beta distribution model:

  f_{Y|X}(Y|X) = \frac{Y^{a-1}(1 - Y)^{b-1}}{\int_0^1 u^{a-1}(1 - u)^{b-1}\, du},    (39)

where a = 2(1 - X) + 5X, b = 5(1 - X) + 2X, and X is uniformly distributed over [0, 1]. Fig. 3(a) shows a 3D view of this model, and Fig. 3(b) shows random samples generated from the model in Eq. (39).

Table 4
Summary of RMISE for the Beta distribution model, showing mean, standard deviation, and median for the proposed method (KPCDE) and the Nadaraya–Watson method (N–W).

                        KPCDE      N–W
  Mean                  -0.804    -0.793
  Standard deviation     0.021     0.020
  Median                -0.804    -0.791


In the experiment, we generate 4000 samples; half are used for training and the other half for testing. The experiment is repeated 100 times. Simulation results are summarized in Table 4. Fig. 3(c)–(f) shows typical estimates of the conditional densities at four points. In this example, the underlying conditional density is a Beta distribution, which is non-Gaussian, and the proposed method still outperforms the Nadaraya–Watson estimator.
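Sampling from the Beta model of Eq. (39) is straightforward, since for a given X the conditional law of Y is Beta(a, b). A minimal sketch follows; the helper name is hypothetical.

    import numpy as np

    def sample_beta_model(n=4000, seed=0):
        """Draw (X, Y) pairs from Eq. (39): X ~ Uniform(0, 1),
        Y | X ~ Beta(a, b) with a = 2(1 - X) + 5X and b = 5(1 - X) + 2X."""
        rng = np.random.default_rng(seed)
        X = rng.uniform(0.0, 1.0, n)
        a = 2.0 * (1.0 - X) + 5.0 * X
        b = 5.0 * (1.0 - X) + 2.0 * X
        Y = rng.beta(a, b)
        return X, Y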

Fig. 4. (a) 3D view of the model in Eq. (40), (b) random samples drawn from the model, and (c)–(f) true and estimated conditional densities at various points. The solid line shows the true conditional density, the dashed line shows the conditional density estimated by the proposed method, and the dotted line shows the conditional density estimated by the Nadaraya–Watson estimator.


4.4. Bimodal model

To compare the performance of the two methods further when the normality assumption does not hold, we consider the following bimodal model:

  f_{Y|X}(Y|X) = \frac{1}{2} \left( \frac{1}{\sqrt{2\pi}\, \sigma_1} \exp\!\left( -\frac{1}{2} \left( \frac{Y - u_1}{\sigma_1} \right)^2 \right) + \frac{1}{\sqrt{2\pi}\, \sigma_2} \exp\!\left( -\frac{1}{2} \left( \frac{Y - u_2}{\sigma_2} \right)^2 \right) \right),    (40)

where u_1 = 0.4, \sigma_1 = 0.25 + 0.05X, u_2 = -0.4, \sigma_2 = 0.25 - 0.05X, and X is uniformly distributed over [-1, 1]. Fig. 4(a) shows a three-dimensional view of this model and Fig. 4(b) shows random samples generated from the model. In the experiment, we generate 4000 samples; half are used for training and the other half for testing. The experiment is repeated 100 times, and the simulation results are summarized in Table 5. Fig. 4(c)–(f) shows typical estimates of the conditional densities at four points. In this example, the proposed method works well and outperforms the Nadaraya–Watson estimator, even though the underlying density has a more complicated form.

Table 5
Summary of RMISE for the bimodal model, showing mean, standard deviation, and median for the proposed method (KPCDE) and the Nadaraya–Watson estimator (N–W).

                        KPCDE      N–W
  Mean                  -1.170    -1.162
  Standard deviation     0.022     0.0195
  Median                -1.173    -1.162

Table 6
Summary of RMISE for the S&P Index data, showing mean, standard deviation, and median for the proposed method (KPCDE) and the Nadaraya–Watson estimator (N–W).

                        KPCDE      N–W
  Mean                  -0.826    -0.794
  Standard deviation     0.0163    0.0150
  Median                -0.825    -0.795

Fig. 5. (a) Daily closing values of the Standard & Poor's 500 Index from 1926 to 1956; (b)–(d) conditional densities estimated by the two methods at (b) x = 10.0, (c) x = 20.0, and (d) x = 30.0, using the proposed method (solid curve) and the Nadaraya–Watson estimator (dashed curve).


4.5. Standard & Poor's 500 Index data

We are interested in the performance of the two methods on real-world data. A real data set consisting of about 8000 daily closing values of the Standard & Poor's (S&P) 500 Index from 1926 to 1956 is used in this experiment; it is plotted in Fig. 5(a). We sample the process at a weekly frequency: a data set is created by adding samples (X_t, X_{t+5}), since each week contains five observations. 2000 samples are randomly selected for training, and another 2000 samples are used for testing. The two methods are applied to estimate conditional densities at each sample point. The experiment is repeated 100 times. The results are summarized in Table 6. Fig. 5(b)–(d) shows typical estimates of the conditional densities at three points. We observe that the proposed method performs better than the Nadaraya–Watson estimator.
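A minimal sketch of the weekly pairing step described above is given below; the helper name is hypothetical, and the random split into 2000 training and 2000 testing samples would follow.

    import numpy as np

    def weekly_pairs(closing_values, lag=5):
        """Build (X_t, X_{t+5}) samples from daily closing values,
        pairing each value with the value five trading days later."""
        v = np.asarray(closing_values, dtype=float)
        return np.column_stack([v[:-lag], v[lag:]])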

5. Discussion and conclusion

In this paper, we present a new parametric method for conditional density estimation based on KPCA and an approximate density function, together with a new kernel function to be used in KPCA. The proposed method and the Nadaraya–Watson estimator are compared through numerical simulation and on practical data. In our experiments, the proposed method outperforms the Nadaraya–Watson estimator, which shows that it is an effective method for solving the CDE problem.

We apply two polynomials, of X and of Y, respectively, in Eq. (30). The degree of the polynomial of X depends on the variation of the conditional densities along X, while the degree of the polynomial of Y reflects the complexity of the conditional densities of Y. Generally speaking, polynomials of higher degree are more expressive and can be used to estimate more complicated models. However, higher degrees mean more coefficients, and estimating more coefficients from a finite number of samples is less reliable. Therefore, the complexity of the proposed method, indicated by the degrees of the polynomials in Eq. (30), is limited by the number of training samples. To improve the performance further, we need to reduce the number of coefficients to be estimated. One possibility is to use more effective functions in Eq. (30); we will investigate alternative functions, such as Legendre polynomials and trigonometric polynomials, with the aim of using functions of lower degree to reproduce more information about the underlying densities. In addition, we will also consider extracting basis functions from the training data, which seems to hold much promise.

We are exploring the applications of this method in various research areas. A particularly important application is the forecasting of solar flares, the violent eruptions from the sun that can affect telecommunications, power grids, and space missions. In this application, the input parameters come from measurements of solar surface magnetic fields, and the output is the probability density of solar flare occurrence. Automated real-time prediction can be implemented based on this development.


References

[1] E.A. Nadaraya, On estimating regression, Theory of Probability and its Applications 10 (1964) 186–190.
[2] G.S. Watson, Smooth regression analysis, Sankhya, Series A 26 (1964) 359–372.
[3] J. Fan, Q. Yao, H. Tong, Estimation of conditional densities and sensitivity measures in nonlinear dynamical systems, Biometrika 83 (1996) 189–206.
[4] M.A.T. Figueiredo, A.K. Jain, Unsupervised learning of finite mixture models, IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (2002) 381–396.
[5] D.M. Bashtannyk, R.J. Hyndman, Bandwidth selection for kernel conditional density estimation, Computational Statistics and Data Analysis 36 (2001) 279–298.
[6] R.J. Hyndman, Q. Yao, Nonparametric estimation and symmetry tests for conditional density functions, Journal of Nonparametric Statistics 14 (2002) 259–278.
[7] J. Fan, T.H. Yim, A cross-validation method for estimating conditional probability densities, Biometrika 91 (2004) 819–834.
[8] P. Hall, J. Racine, Q. Li, Cross-validation and the estimation of conditional probability densities, Journal of the American Statistical Association 99 (2004) 1015–1026.
[9] M.P. Holmes, A.G. Gray, C.L. Isbell Jr., Fast nonparametric conditional density estimation, in: Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence, Vancouver, BC, Canada, 2007.
[10] P. Hall, R.C.L. Wolff, Q. Yao, Methods for estimating a conditional distribution function, Journal of the American Statistical Association 94 (1999) 154–163.
[11] B. Schölkopf, A. Smola, K.-R. Müller, Nonlinear component analysis as a kernel eigenvalue problem, Neural Computation 10 (1998) 1299–1319.
[12] M. Rosenblatt, Conditional probability density and regression estimates, Multivariate Analysis II (1969) 25–31.
[13] R.J. Hyndman, D.M. Bashtannyk, G.K. Grunwald, Estimating and visualizing conditional densities, Journal of Computational and Graphical Statistics 5 (1996) 315–336.
[14] M.A. Jácome, I. Gijbels, R. Cao, Comparison of presmoothing methods in kernel density estimation under censoring, Computational Statistics 23 (2008) 381–406.
[15] A.P. Dempster, N.M. Laird, D.B. Rubin, Maximum likelihood from incomplete data via the EM algorithm (with discussion), Journal of the Royal Statistical Society, Series B 39 (1977) 1–38.
[16] M.A.T. Figueiredo, A.K. Jain, Unsupervised learning of finite mixture models, IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (2002) 381–396.
[17] D.W. Scott, W.F. Szewczyk, From kernels to mixtures, Technometrics 43 (2001) 323–335.
[18] J. Fan, Design-adaptive nonparametric regression, Journal of the American Statistical Association 87 (1992) 998–1004.
[19] J. Fan, Local linear regression smoothers and their minimax efficiency, The Annals of Statistics 21 (1993) 196–216.
[20] J. Fan, I. Gijbels, Variable bandwidth and local linear regression smoothers, The Annals of Statistics 20 (1992) 2008–2036.
[21] K.I. Diamantaras, S.Y. Kung, Principal Component Neural Networks: Theory and Applications, Wiley, New York, 1996.
[22] K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, B. Schölkopf, An introduction to kernel-based learning algorithms, IEEE Transactions on Neural Networks 12 (2001) 181–201.
[23] B.E. Boser, I.M. Guyon, V.N. Vapnik, A training algorithm for optimal margin classifiers, in: COLT '92: Proceedings of the Fifth Annual Workshop on Computational Learning Theory, Pittsburgh, PA, 1992, pp. 144–152.
[24] J. Mercer, Functions of positive and negative type and their connection with the theory of integral equations, Philosophical Transactions of the Royal Society of London A 209 (1909) 415–446.
[25] V.N. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, New York, 1995.
[26] J.C. Cox, J.E. Ingersoll Jr., S.A. Ross, A theory of the term structure of interest rates, Econometrica 53 (1985) 385–407.

Acknowledgement

This work is supported by the National Science Foundation (NSF) under Grants IIS-0324816 and ATM-0716950.

Gang Fu received his B.S. degree in Computer Engineering from Soochow University, P.R. China, in 1998, and his M.S. degree in Computer Engineering from Soochow University, P.R. China, in 1998. He received his Ph.D. degree in Computer Science from the New Jersey Institute of Technology, NJ, in 2007. He previously worked for Microsoft Corporation, Seattle, WA, and currently works for Perot Systems Government Services, Fairfax, VA. His research interests include image processing, pattern recognition, and computer vision.


Frank Y. Shih received his Ph.D. from Purdue University. He is presently a full professor in the CS Department, New Jersey Institute of Technology, USA, and the Director of the Computer Vision Laboratory. Dr. Shih has held visiting professor positions at Princeton University, Columbia University, National Taiwan University, the National Institute of Informatics, Tokyo, and the Conservatoire National des Arts et Métiers, Paris. He is an internationally well-known scholar and serves as the Editor-in-Chief of the International Journal of Multimedia Intelligence and Security. In addition, he is on the editorial boards of the International Journal of Pattern Recognition, the International Journal of Pattern Recognition Letters, the International Journal of Pattern Recognition and Artificial Intelligence, the International Journal of Recent Patents on Engineering, the International Journal of Recent Patents on Computer Science, the International Journal of Internet Protocol Technology, and the Journal of Internet Technology. He has served as a steering member, committee member, and session chair for numerous professional conferences and workshops. He has received numerous grants from the National Science Foundation, the Navy and Air Force, and industry. He is the recipient of the Research Initiation Award of the NSF in 1991 and the Board of Overseers Excellence in Research Award of NJIT in 2009. Dr. Shih has authored three books: "Digital Watermarking and Steganography", "Image Processing and Mathematical Morphology", and "Image Processing and Pattern Recognition". He has published 103 journal papers, 89 conference papers, and eight book chapters. His current research interests include image processing, computer vision, watermarking and steganography, digital forensics, sensor networks, pattern recognition, bioinformatics, information security, robotics, fuzzy logic, and neural networks.

Haimin Wang received his B.S. degree in Astronomy from Nanjing University, China, in 1982, and his Ph.D. in Astrophysics from the California Institute of Technology in 1988. He is presently a Distinguished Professor in Physics at the New Jersey Institute of Technology (NJIT), and the Associate Director of the Big Bear Solar Observatory and of the Center for Solar-Terrestrial Research of NJIT. Dr. Wang received a CAREER award from the National Science Foundation in 1996. He is a current member of the Science Advisory Board for the 4 m Advanced Technology Solar Telescope (ATST). He has served on a number of funding review panels, including a NASA Senior Review in 2006. He is the recipient of the Harlan J. Perlis Research Award of NJIT in 2003. He has published over 150 papers in refereed journals. His current research interests include solar instrumentation; solar magnetic fields, solar flares, and coronal mass ejections; and the application of information technology to solar feature detection.