Gaussian process regression with skewed errors


M. T. Alodat¹,* and Mohammed K. Shakhatreh²

¹ Department of Statistics, Yarmouk University, Jordan
² Department of Mathematics and Statistics, Jordan University of Science and Technology, Jordan

Abstract


In the statistical literature, the Gaussian process is used as a prior process to treat non-linear regression from a Bayesian viewpoint with normally distributed error terms. In this paper, we modify the Gaussian process regression (GPR) model by assuming that the error terms of the GPR model follow a skew normal distribution instead of the normal distribution. We refer to this new modification as Gaussian process regression with skewed errors (GPRSE). A key advantage of our model is that the well-known GPR model is a subclass of the GPRSE model. Most importantly, we derive its predictive distribution for a new input location in closed form. Moreover, the marginal likelihood of the GPRSE model is derived and is used to train the model hyper-parameters. We also conduct simulation experiments to demonstrate the efficiency of our proposed approach compared to the GPR model.

Keywords: Gaussian process regression; Closed skew normal distribution; Bayesian statistics; Predictive distribution; Prior distribution.

1 Introduction

Over the last decade, Gaussian processes (GPs) have received considerable attention from many authors due to their frequent appearance in many areas of statistical theory and practice, particularly in machine learning. Moreover, statistical inferences, including predictions, based on the GPR model are analytically tractable. Additionally, GPs have been used as a powerful tool to fit non-parametric regression models from a Bayesian viewpoint. O'Hagan [1] was the first to use the Gaussian process as a prior process to treat a non-linear regression model. Neal [2] demonstrated O'Hagan's approach to Bayesian learning in neural networks. Rasmussen and Williams [3]

* Corresponding author: [email protected]


used the GPR model to analyze noisy data and to handle many interesting classification problems that occur in machine learning. Brahim-Belhouari and Bermak [4] used the Gaussian process regression (GPR) model to predict future values of a non-stationary time series. Vanhatalo et al. [5] introduced a GPR model with a Student-t likelihood that is robust against outliers. Macke et al. [6] applied the GPR model to estimate cortical maps of the human brain. Jylänki et al. [7] considered a robust and efficient implementation of the GPR model with a Student-t model for the error terms.

The GPR model usually assumes that the observation model follows a Gaussian distribution. Unfortunately, this assumption may not be satisfied for different reasons. One of these reasons is that the GPR model is not resistant to outliers, which leads to unstable estimates of the observation model parameters and therefore to inaccurate predictions. Such outliers may occur in the data because of failures in the measurement process or because a relevant explanatory variable is dropped from the model. Additionally, the GPR model is inappropriate when the observation model exhibits a skewness pattern in its data. To overcome this deficiency, one needs to modify the GP regression so that the resulting modification can cope with outliers effectively. For example, Vanhatalo et al. [5] introduced the GPR model with a Student-t likelihood that is robust against outliers, whereas Jylänki et al. [7] considered a robust and efficient implementation of the GPR model. However, the GPR model with a Student-t likelihood is analytically intractable, although it gives satisfactory predictive performance in comparison with the GPR model. It is worth mentioning that the presence of outliers in the observation model can lead to skewness in the observation model; see, for example, Alodat and Al-rawwash [8], who showed by means of an empirical application to a data set that the normality assumption is violated.

Alternatively, a flexible modification of the GPR model can be introduced in order to address some of its deficiencies. This modification can be achieved by incorporating the skew normal distribution, introduced in Azzalini [9, 10], Azzalini and Dalla Valle [11] and Azzalini and Capitanio [12], into the observation model. In this case, we have a new GPR model that is based on skew Gaussian white noise. Recently, random processes that exhibit skewness patterns have been defined by several researchers. Alodat and Aludaat [13] and Alodat and Al-rawwash [14] defined a skew Gaussian random process based on skew normal theory, and their findings were applied to Sea Level Pressure (SLP) data. Discrete skew versions of Gaussian random processes were defined by González-Farías et al. [15] and Allard and Naveau [16]. For more examples of skew random processes or fields, the reader is referred to Zhang and El-Shaarawi [17], Alodat and Al-rawwash [8] and Alodat et al. [18]. The starting point for constructing such random processes lies in using the multivariate skew normal distribution, which appeared in the pioneering works of Azzalini [9, 10], Azzalini and


Dalla Valle [11] and Azzalini and Capitanio [12]. The skew normal, or skew Gaussian, distribution is defined as follows. A random vector Z of dimension n×1 is said to have an n-dimensional multivariate skew normal distribution if it has the probability density function (PDF)
$$P_Z(z) = 2\,\phi_n(z; 0, \Sigma)\,\Phi(\alpha^T z), \qquad z \in \mathbb{R}^n, \tag{1}$$
where φn(·; 0, Σ) is the PDF of Nn(0, Σ), Φ(·) is the cumulative distribution function (CDF) of N(0, 1), and α (n×1) is a vector referred to as the skewness parameter. A more general family than (1) is obtained by using the transformation X = µ + Z, µ ∈ Rn; it can readily be seen that the PDF of X is PX(x) = PZ(x − µ). We use the notation X ∼ SNn(µ, Σ, α) to denote an n-dimensional skew normal distribution with parameters µ, Σ and α.

An alternative generalization of (1) is given by González-Farías et al. [15] as follows. Let µ ∈ Rp, let D be an arbitrary q × p matrix, and let Σ and ∆ be positive definite matrices of dimensions p × p and q × q, respectively. A random vector Y is said to have a p-dimensional closed skew normal (CSN) distribution with parameters µ, Σ, D, v and ∆ if its PDF takes the form
$$P_Y(y; \mu, \Sigma, D, v, \Delta) = c\,\phi_p(y; \mu, \Sigma)\,\Phi_q\big(D(y - \mu); v, \Delta\big), \qquad y \in \mathbb{R}^p, \tag{2}$$
where φp(·; η, ψ) and Φp(·; η, ψ) are the PDF and the CDF of a p-dimensional normal distribution with mean vector η and covariance matrix ψ, and c is given via
$$c^{-1} = \Phi_q\big(0; v, \Delta + D\Sigma D^T\big).$$

From now on, we adopt the notation Y ∼ CSN p,q(µ, Σ, D, v, ∆) for a random vector Y with PDF (2). Throughout this paper, several lemmas and results about the multivariate CSN distribution will be used extensively; we present them in Appendix A.

The rest of this paper is organized as follows. In Section 2, we introduce the GPR model. The proposed new GPR model, along with the derivation of its corresponding predictive distribution, is given in Section 3. In Section 4, we conduct a simulation study to compare its predictive performance for the observation model with that of the GPR model. Some conclusions and future work are given in Section 5.

2 Gaussian Process for Regression

Consider a set of training data Y = (Y1, Y2, . . . , Yn)T governed by the non-linear regression model Yi = f(ti) + ε(ti) at the input vectors t1, t2, . . . , tn ∈ C ⊆ Rn, where ε(t1), ε(t2), . . . , ε(tn) are independent and identically distributed Gaussian noises on C with mean 0 and variance σ², and f(·) is a latent (unknown) function. The main object of inference is the latent function f. In particular,


based on the given set of training data Yi's measured at the inputs ti's, the prediction of f(t) at a new input t∗ is of central interest in a wide range of applications. To achieve this goal, a prior distribution for f(t) is assumed. This prior distribution should be defined on the class of all functions defined on the space of t. The Gaussian process on C, through its sample paths, defines a rich class of such functions, so it is natural to assume that the latent function follows a Gaussian process. Therefore, if f(t), t ∈ C, is a Gaussian process with covariance function c(·, ·), then for every n and t1, t2, . . . , tn ∈ C we have f = (f(t1), . . . , f(tn))T ∼ Nn(0, Σ), where
$$\Sigma = \begin{pmatrix} c(t_1, t_1) & \cdots & c(t_1, t_n) \\ \vdots & \ddots & \vdots \\ c(t_n, t_1) & \cdots & c(t_n, t_n) \end{pmatrix}.$$
A suitable choice for c(·, ·) is the covariance function
$$c(t_i, t_j) = \beta_0 + \beta_1 t_i^T t_j + \psi^2 \exp\left(-\tfrac{1}{2}(t_i - t_j)^T \Lambda^{-1} (t_i - t_j)\right) + \beta_3 \delta_{ij}, \tag{3}$$
where the βi's, ψ, and Λ are unknown parameters and δij is the Kronecker delta. For simplicity, we may take Λ = diag(w1², . . . , wn²). The covariance function (3) is designed to model linear and exponential radial random errors. A covariance function c(·, ·) is said to be isotropic if c(ti, tj) depends only on the distance ‖ti − tj‖. For more information about other types of covariance functions, see Girard et al. [19].

The joint PDF of f(t) and f∗ = f(t∗), where t∗ is the new input, is also an (n + 1)-dimensional multivariate Gaussian distribution, i.e.,
$$\big(f(t_1), \ldots, f(t_n), f(t^*)\big)^T \sim N_{n+1}\big(0, \bar{\Sigma}\big), \tag{4}$$
with
$$\bar{\Sigma} = \begin{pmatrix} c(t_1, t_1) & \cdots & c(t_1, t_n) & c(t_1, t^*) \\ \vdots & \ddots & \vdots & \vdots \\ c(t_n, t_1) & \cdots & c(t_n, t_n) & c(t_n, t^*) \\ c(t^*, t_1) & \cdots & c(t^*, t_n) & c(t^*, t^*) \end{pmatrix} = \begin{pmatrix} K & k \\ k^T & k^* \end{pmatrix},$$
where k = (c(t1, t∗), . . . , c(tn, t∗))T and k∗ = c(t∗, t∗). Rasmussen and Williams [3] showed that the predictive distribution of f∗ given Y and t∗ remains Gaussian and is given by
$$P(f^* \mid Y, t^*) \sim N\big(\mu(t^*), \sigma^2(t^*)\big), \tag{5}$$


where µ(t∗) and σ²(t∗) are the mean and the variance of the predictive distribution in (5). They are given by
$$\mu^* = \mu(t^*) = k^T \big(K + \sigma^2 I_n\big)^{-1} Y \quad \text{and} \quad \sigma^2(t^*) = k^* - k^T \big(K + \sigma^2 I_n\big)^{-1} k.$$
The distribution in (5) can be used to draw several inferential statements about f(t∗). For instance, when p = 1, a 100(1 − α)% prediction interval for f(t∗) is given by [L, U], where L and U are the solutions of the following two equations:
$$\int_{-\infty}^{L} P(f^* \mid Y, t^*)\, df^* = \frac{\alpha}{2} \quad \text{and} \quad \int_{U}^{\infty} P(f^* \mid Y, t^*)\, df^* = \frac{\alpha}{2}.$$
Similarly, a 100(1 − α)% prediction interval for f∗ is
$$\mu(t^*) \pm z_{\alpha/2}\, \sigma(t^*),$$
where z_{α/2} is the 100(1 − α/2)% quantile of N(0, 1). Moreover, the mean µ(t∗) serves as a predictor for f(t∗) given the data Y and t∗, whereas the variance σ²(t∗) serves as a measure of the uncertainty of µ(t∗).
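For readers who wish to reproduce these quantities, the following R sketch computes µ(t∗), σ²(t∗) and the corresponding 95% prediction interval under a squared exponential covariance (the special case of (3) used later in Section 4.2.1). It is a minimal illustration of ours, not the authors' code; the hyper-parameters ψ², ω₁ and σ² are assumed known or already estimated, and all function names are hypothetical.

```r
# Minimal sketch, assuming 1-d inputs and c(t_i, t_j) = psi2 * exp(-(t_i - t_j)^2 / (2 * w1^2)).
gpr_predict <- function(t, y, tstar, psi2, w1, sigma2) {
  se   <- function(a, b) psi2 * exp(-outer(a, b, "-")^2 / (2 * w1^2))
  K    <- se(t, t)
  k    <- se(t, tstar)                                  # n x m matrix with entries c(t_i, t*_j)
  Cinv <- solve(K + sigma2 * diag(length(t)))
  mu   <- as.vector(t(k) %*% Cinv %*% y)                # mu(t*) = k' (K + sigma^2 I)^{-1} y
  s2   <- psi2 - colSums(k * (Cinv %*% k))              # sigma^2(t*), with k* = c(t*, t*) = psi2 here
  list(mean = mu, var = s2,
       lower = mu - qnorm(0.975) * sqrt(s2),            # 95% interval mu(t*) -/+ z_{alpha/2} sigma(t*)
       upper = mu + qnorm(0.975) * sqrt(s2))
}
```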

The GPR model can be of considerable importance in predicting f(t) at a random input. That is, if t∗ is a random variable such that t∗ ∼ Np(µ∗, σ∗²), then the predictive PDF of f∗ given µ∗ and σ∗² (see, for example, Girard et al. [19]) is given by
$$P\big(f^* \mid \mu_*, \sigma_*^2, Y\big) = \int P(f^* \mid Y, t^*)\, P(t^*)\, dt^*, \tag{6}$$
where P(t∗) is the pdf of Np(µ∗, σ∗²). The integral in (6) does not have a closed form. Therefore, an approximation to this integral is needed in order to obtain inferential statements about f∗. Additionally, the main computational problems in (6) are the inversion of the matrix Σ + σ²In and obtaining the mean and the variance of the predictive distribution of f∗. Alternatively, we propose to use the following simple Monte Carlo approximation to (6):
$$P\big(f^* \mid \mu_*, \sigma_*^2, Y\big) = \int P(f^* \mid Y, t^*)\, P(t^*)\, dt^* \simeq \frac{1}{N}\sum_{r=1}^{N} P\big(f^* \mid Y, t^{*(r)}\big),$$
where t∗(1), . . . , t∗(N) are independent samples from P(t∗). For other analytical approximation techniques for the predictive density in (6), the reader is referred to Rasmussen and Williams [3].
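As an illustration of this Monte Carlo step, the sketch below (our own code, under the same squared exponential covariance assumption as the earlier sketch) approximates the predictive density (6) on a grid of f-values when t∗ ∼ N(µ∗, σ∗²).

```r
# Sketch, assuming a 1-d random input t* ~ N(mu_star, s_star^2); fgrid is a grid of f* values.
mc_pred_density <- function(fgrid, t, y, mu_star, s_star, psi2, w1, sigma2, N = 1000) {
  se   <- function(a, b) psi2 * exp(-outer(a, b, "-")^2 / (2 * w1^2))
  Cinv <- solve(se(t, t) + sigma2 * diag(length(t)))
  ts   <- rnorm(N, mu_star, s_star)                     # draws t*(1), ..., t*(N) from P(t*)
  dens <- sapply(ts, function(tr) {
    k <- se(t, tr)
    m <- as.numeric(t(k) %*% Cinv %*% y)                 # mu(t*(r))
    v <- as.numeric(psi2 - t(k) %*% Cinv %*% k)          # sigma^2(t*(r))
    dnorm(fgrid, m, sqrt(v))                             # P(f* | Y, t*(r)) on the grid
  })
  rowMeans(dens)                                         # (1/N) * sum_r P(f* | Y, t*(r))
}
```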

It should be noted that the GPR model is a full probabilistic model. Hence, a major advantage of the GPR model over other methods (such as cross-validation) is that it enables researchers to estimate the hyper-parameters of the covariance function from the training data. Under this approach, prior distributions for the hyper-parameters can be defined and the corresponding posterior distribution of the hyper-parameters is then used to make inferences about them. However, this approach is analytically intractable. Instead, since the marginal likelihood of the data is a multivariate normal distribution with zero mean vector and covariance matrix K + σ²In, i.e., Y ∼ Nn(0, K + σ²In), the hyper-parameters θ = {β0, β1, β2, ψ, σ², w1, . . . , wn} can be estimated by type II maximum likelihood based on the log-likelihood of Y, L(θ; y), which is given by
$$L(\theta; y) = \log P_Y(y; \theta) = -\frac{1}{2}\log\det\big(K + \sigma^2 I_n\big) - \frac{1}{2}\, y^T \big(K + \sigma^2 I_n\big)^{-1} y - \frac{n}{2}\log(2\pi).$$
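As a concrete illustration of this type II maximum likelihood step, the following R sketch (ours, restricted to the squared exponential part of (3) with the hyper-parameters optimized on the log scale) minimises the negative of L(θ; y) with optim; the starting values are arbitrary.

```r
# Sketch: negative log marginal likelihood of the GPR model; logpar = (log psi2, log w1, log sigma2).
neg_log_marglik <- function(logpar, t, y) {
  psi2 <- exp(logpar[1]); w1 <- exp(logpar[2]); sigma2 <- exp(logpar[3])
  n <- length(y)
  C <- psi2 * exp(-outer(t, t, "-")^2 / (2 * w1^2)) + sigma2 * diag(n)
  U <- chol(C)                                           # C = t(U) %*% U
  alpha <- backsolve(U, forwardsolve(t(U), y))           # alpha = C^{-1} y
  0.5 * sum(y * alpha) + sum(log(diag(U))) + 0.5 * n * log(2 * pi)
}

# Type II maximum likelihood: minimise the NLML over the hyper-parameters.
# fit <- optim(c(0, 0, log(0.01)), neg_log_marglik, t = t, y = y, method = "BFGS")
# exp(fit$par)   # fitted (psi2, w1, sigma2)
```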

3 The Skewed Gaussian Process Regression

3.1 The proposed GPRSE


In this section, we introduce a generalization of the GPR model, referred to as the Gaussian process regression with skewed errors (GPRSE) model. Our motivations for proposing the GPRSE are: (i) to extend the GPR model into a flexible class of GPR models; and (ii) to address some of the deficiencies of the GPR model due to skewness. Most importantly, we derive the predictive PDF of f∗ = f(t∗) at a new input t∗. Additionally, we provide the marginal PDF of Y, which is of considerable importance for learning the hyper-parameters. Interestingly, both PDFs are obtained in closed form.


The GPRSE model can be defined as follows. The classical assumption that, given the latent function f, the error terms ε(t1), ε(t2), . . . , ε(tn) in the model Yi = f(ti) + ε(ti) are i.i.d. Gaussian is replaced by skew Gaussian noise. That is, we assume that ε(t1), ε(t2), . . . , ε(tn) are independent SN(µ, σ², λ), where SN denotes the one-dimensional skew normal distribution. In this case, the error terms have the common PDF
$$P(\epsilon; \mu, \lambda, \sigma) = 2\,\varphi(\epsilon; \mu, \sigma)\,\Phi\!\left(\frac{\lambda(\epsilon - \mu)}{\sigma}\right), \qquad -\infty < \epsilon < \infty;\ -\infty < \lambda, \mu < \infty,\ \sigma > 0, \tag{7}$$


where φ(·; µ, σ) is the PDF of N(µ, σ²) and Φ(·) is the CDF of N(0, 1). Observe that the GPRSE model contains the GPR model as a special case, obtained by letting λ → 0: when λ = 0, the error distribution in (7) reduces to the Gaussian distribution with mean 0 and variance σ². It is worth mentioning that the error distribution is free of the inputs t1, t2, . . . , tn; therefore, we drop them from the epsilon notation. According to Genton [20], the mean and the variance of εi are given by
$$E(\epsilon_i) = \mu + \frac{\lambda\sigma}{\sqrt{1+\lambda^2}}\sqrt{\frac{2}{\pi}} \quad \text{and} \quad Var(\epsilon_i) = \sigma^2\left(1 - \frac{2\lambda^2}{\pi(1+\lambda^2)}\right).$$
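As a quick sanity check of these formulas, the short R snippet below (our own illustration, with arbitrary parameter values) evaluates the density (7) directly and recovers the stated mean and variance by numerical integration.

```r
# Sketch: the skew normal density (7) and a numerical check of the moment formulas.
dsn <- function(e, mu, sigma, lambda)
  2 / sigma * dnorm((e - mu) / sigma) * pnorm(lambda * (e - mu) / sigma)

mu <- 0; sigma <- 0.1; lambda <- 5
m1 <- integrate(function(e) e   * dsn(e, mu, sigma, lambda), -Inf, Inf)$value
m2 <- integrate(function(e) e^2 * dsn(e, mu, sigma, lambda), -Inf, Inf)$value
c(mean_numeric = m1,
  mean_formula = mu + lambda * sigma / sqrt(1 + lambda^2) * sqrt(2 / pi),
  var_numeric  = m2 - m1^2,
  var_formula  = sigma^2 * (1 - 2 * lambda^2 / (pi * (1 + lambda^2))))
```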


In order to have a zero-mean GPRSE model, we shall assume that
$$\mu = -\frac{\lambda\sigma}{\sqrt{1+\lambda^2}}\sqrt{\frac{2}{\pi}}.$$
The joint PDF of the error terms ε = (ε1, ε2, . . . , εn)T is a closed skew multivariate normal distribution, i.e., ε ∼ CSN n,n(µ1n, σ²In, λIn, 0, σ²In). This follows from the fact that the joint PDF of these error terms can be written as
$$P(\epsilon; \mu, \lambda, \sigma) = 2^n \prod_{i=1}^{n} \varphi(\epsilon_i; \mu, \sigma)\,\Phi\!\left(\frac{\lambda(\epsilon_i - \mu)}{\sigma}\right) = 2^n\, \varphi_n\big(\epsilon; \mu 1_n, \sigma^2 I_n\big)\, \Phi_n\big(\lambda I_n(\epsilon - \mu 1_n); 0, \sigma^2 I_n\big),$$
where the last expression is the PDF of CSN n,n(µ1n, σ²In, λIn, 0, σ²In). In preparation for deriving the predictive distribution of f(t) at a new input t∗ under the GPRSE model, we write an expression equivalent to (4) as
$$\big(f(t_1), \ldots, f(t_n), f(t^*)\big)^T \sim CSN_{n+1,1}\big(0_{(n+1)\times 1}, \bar{\Sigma}, 0_{1\times(n+1)}, 0, 1\big).$$
We are now in a position to prove the following theorem, which gives the predictive distribution of f(t) at a new input and the marginal distribution of Y.


Theorem 3.1. Consider the non-linear regression model Yi = f(ti) + εi, where ε1, ε2, . . . , εn are i.i.d. with PDF (7) and t1, t2, . . . , tn ∈ C ⊆ Rn. If Y = (Y1, Y2, . . . , Yn)T, t∗ ∈ C and f(·) is a Gaussian process with mean function zero and covariance function c(·, ·), then we have the following:

(i) The predictive distribution of f∗ given Y and t∗ is
$$f^* \mid Y, t^* \sim CSN_{1,n+1}\big(\tilde{\mu}, \tilde{\sigma}, \tilde{D}, \tilde{v}, \tilde{\Delta}\big), \tag{8}$$
where
$$\tilde{\mu} = k^T\big(K + \sigma^2 I_n\big)^{-1}(y - \mu 1_n), \qquad \tilde{\sigma} = k^* - k^T\big(K + \sigma^2 I_n\big)^{-1} k,$$
$$\tilde{D} = \begin{pmatrix} 0 \\ \lambda\sigma^2\, T_{n\times 1} \end{pmatrix}, \qquad \tilde{v} = \begin{pmatrix} 0 \\ -\lambda\sigma^2\big(K + \sigma^2 I_n\big)^{-1}(y - \mu 1_n) \end{pmatrix}, \qquad \tilde{\Delta} = \begin{pmatrix} 1 & 0_{1\times n} \\ 0_{n\times 1} & \sigma^2(1+\lambda^2) I_n - \lambda^2\sigma^4 W \end{pmatrix},$$
with
$$W = \big(K + \sigma^2 I_n\big)^{-1} + \frac{\big(K + \sigma^2 I_n\big)^{-1} k\, k^T \big(K + \sigma^2 I_n\big)^{-1}}{k^* - k^T \big(K + \sigma^2 I_n\big)^{-1} k} \quad \text{and} \quad T = -\frac{\big(K + \sigma^2 I_n\big)^{-1} k}{k^* - k^T \big(K + \sigma^2 I_n\big)^{-1} k}.$$

(ii) The marginal distribution of Y is
$$Y \sim CSN_{n,n+1}\big(\mu^+, \Sigma^+, D^+, v^+, \Delta^+\big),$$
where
$$\mu^+ = \mu 1_n, \qquad \Sigma^+ = K + \sigma^2 I_n, \qquad v^+ = 0_{(n+1)\times 1},$$
$$D^+ = \begin{pmatrix} 0_{1\times n} \\ \lambda\sigma^2\big(K + \sigma^2 I_n\big)^{-1} \end{pmatrix} \qquad \text{and} \qquad \Delta^+ = \begin{pmatrix} 1 & 0_{1\times n} \\ 0_{n\times 1} & \sigma^2(1+\lambda^2) I_n - \lambda^2\sigma^4\big(K + \sigma^2 I_n\big)^{-1} \end{pmatrix}.$$

Proof. See Appendix B.

Based on Theorem 3.1(ii), the hyper-parameters can be estimated from the likelihood implied by the marginal distribution of Y.
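A direct way to do this is to evaluate the CSN log-likelihood of Theorem 3.1(ii) with the multivariate normal CDF routine of the mvtnorm package. The sketch below is our own illustration, not the authors' implementation; it assumes the covariance matrix K has already been built from (3) for the current hyper-parameter values, and the (n+1)-dimensional CDF keeps it practical only for small n.

```r
# Sketch: GPRSE marginal log-likelihood log P_Y(y) based on Theorem 3.1(ii).
library(mvtnorm)

gprse_loglik <- function(y, K, lambda, sigma) {
  n    <- length(y)
  mu0  <- -lambda * sigma / sqrt(1 + lambda^2) * sqrt(2 / pi)     # zero-mean noise location
  SigY <- K + sigma^2 * diag(n)                                    # Sigma+
  Cinv <- solve(SigY)
  Dp   <- rbind(rep(0, n), lambda * sigma^2 * Cinv)                # D+, (n+1) x n
  Delp <- rbind(c(1, rep(0, n)),
                cbind(rep(0, n),
                      sigma^2 * (1 + lambda^2) * diag(n) -
                        lambda^2 * sigma^4 * Cinv))                # Delta+, (n+1) x (n+1)
  # CSN density (2): c * phi_n(y; mu+, Sigma+) * Phi_{n+1}(D+ (y - mu+); v+, Delta+)
  logc   <- -log(pmvnorm(upper = rep(0, n + 1), mean = rep(0, n + 1),
                         sigma = Delp + Dp %*% SigY %*% t(Dp))[1])
  logphi <- dmvnorm(y, mean = rep(mu0, n), sigma = SigY, log = TRUE)
  logPhi <- log(pmvnorm(upper = as.vector(Dp %*% (y - mu0)),
                        mean = rep(0, n + 1), sigma = Delp)[1])
  logc + logphi + logPhi
}
```

Maximizing this function over the hyper-parameters (for example with optim, as in the GPR sketch of Section 2) then gives the type II maximum likelihood estimates used in Section 4.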

3.2 Making predictions

The posterior distribution of the latent variable f∗ at a new input t∗, given in (8), is a closed skew normal distribution; it is therefore summarized by its mean and variance. The mean is
$$E(f^* \mid Y, t^*) = \tilde{\mu} + \tilde{\sigma}\, \tilde{D}^T \zeta,$$
and the variance is
$$Var(f^* \mid Y, t^*) = \tilde{\sigma} + \tilde{\sigma}^2\, \tilde{D}^T \Psi \tilde{D} - \tilde{\sigma}^2\, \tilde{D}^T \zeta \zeta^T \tilde{D},$$
where
$$\zeta = \frac{\dfrac{\partial}{\partial s}\,\Phi_{n+1}\big(s; \tilde{v}, \tilde{\Delta} + \tilde{\sigma}\tilde{D}\tilde{D}^T\big)\Big|_{s=0}}{\Phi_{n+1}\big(0; \tilde{v}, \tilde{\Delta} + \tilde{\sigma}\tilde{D}\tilde{D}^T\big)} \qquad \text{and} \qquad \Psi = \frac{\dfrac{\partial^2}{\partial s\,\partial s^T}\,\Phi_{n+1}\big(s; \tilde{v}, \tilde{\Delta} + \tilde{\sigma}\tilde{D}\tilde{D}^T\big)\Big|_{s=0}}{\Phi_{n+1}\big(0; \tilde{v}, \tilde{\Delta} + \tilde{\sigma}\tilde{D}\tilde{D}^T\big)}.$$
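In practice, ζ (and likewise Ψ) can be obtained from the multivariate normal CDF through its standard derivative identity. The following R sketch is our own illustration rather than the authors' code; it computes E(f∗ | Y, t∗) from the quantities of Theorem 3.1(i), assuming K, k and k∗ have already been evaluated from the chosen covariance function and that mvtnorm is available.

```r
# Sketch: predictive mean of the GPRSE model at a new input, using
# d/ds_i Phi_q(s; v, S) = phi(s_i; v_i, S_ii) * Phi_{q-1}(conditional), evaluated at s = 0.
library(mvtnorm)

gprse_pred_mean <- function(K, k, kstar, y, lambda, sigma) {
  n     <- length(y)
  mu0   <- -lambda * sigma / sqrt(1 + lambda^2) * sqrt(2 / pi)      # zero-mean noise location
  Cinv  <- solve(K + sigma^2 * diag(n))
  sig.t <- as.numeric(kstar - t(k) %*% Cinv %*% k)                  # sigma-tilde
  mu.t  <- as.numeric(t(k) %*% Cinv %*% (y - mu0))                  # mu-tilde
  W     <- Cinv + (Cinv %*% k %*% t(k) %*% Cinv) / sig.t
  Tvec  <- -(Cinv %*% k) / sig.t
  D.t   <- rbind(0, lambda * sigma^2 * Tvec)                        # D-tilde, (n+1) x 1
  v.t   <- c(0, -lambda * sigma^2 * (Cinv %*% (y - mu0)))           # v-tilde
  Del   <- rbind(c(1, rep(0, n)),
                 cbind(rep(0, n), sigma^2 * (1 + lambda^2) * diag(n) -
                                    lambda^2 * sigma^4 * W))         # Delta-tilde
  S     <- Del + sig.t * (D.t %*% t(D.t))                            # Delta-tilde + sigma-tilde D D'
  q     <- n + 1
  Phi0  <- pmvnorm(upper = rep(0, q), mean = v.t, sigma = S)[1]
  grad  <- numeric(q)
  for (i in 1:q) {                                                   # gradient of Phi_{n+1} at s = 0
    mu.c <- v.t[-i] + S[-i, i] / S[i, i] * (0 - v.t[i])
    S.c  <- S[-i, -i] - S[-i, i] %*% t(S[i, -i]) / S[i, i]
    S.c  <- (S.c + t(S.c)) / 2
    grad[i] <- dnorm(0, v.t[i], sqrt(S[i, i])) *
      pmvnorm(upper = rep(0, q - 1), mean = mu.c, sigma = S.c)[1]
  }
  zeta <- grad / Phi0
  mu.t + sig.t * as.numeric(t(D.t) %*% zeta)                         # E(f* | Y, t*)
}
```

The variance Var(f∗ | Y, t∗) can be obtained in the same way by also forming the matrix Ψ of second derivatives, for example by applying the same identity twice or by numerical differentiation.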

4 Simulation study


In this section, we conduct simulation experiments to assess the efficiency of the proposed GPRSE model. First, we demonstrate that the GPRSE model can be successfully applied to learning smooth functions in a one-dimensional input space and that it yields superior predictive results compared to the GPR model. Second, we assess the performance of the GPRSE model on a function with multiple input locations.


4.1 Simulation from CSN p,q

Simulation of a sample path from the skew Gaussian process can be obtained by sampling from a multivariate skew normal distribution on a smooth grid. To simulate a random vector from the PDF in (1), we may use the accept-reject method. The accept-reject method, as given in Christian and Casella [21], assumes that the PDF P(x) can be written as
$$P(x) = c\, g(x)\, h(x),$$
where c ≥ 1, 0 < g(x) ≤ 1 for all x, and h(x) is a PDF. In this case, a random observation from P(x) can be generated as follows:

1. Generate U from U(0, 1).
2. Generate Y from h(x).
3. If U ≤ g(Y), deliver Y as a realization of P(x).
4. Otherwise, go to step 1.

For simulation from the PDF (2), we use the above algorithm with c = [Φq(0; v, ∆ + DΣDT)]⁻¹, h(x) = φp(x; µ, Σ) and g(x) = Φq(D(x − µ); v, ∆).
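A direct implementation of this accept-reject scheme for the multivariate skew normal (1), shifted by a location µ, is sketched below. It is our own illustration, using mvtnorm for the Gaussian proposal, with c = 2, h(x) = φn(x; µ, Σ) and g(x) = Φ(αᵀ(x − µ)).

```r
# Sketch: accept-reject sampling from the multivariate skew normal (1) with location mu.
library(mvtnorm)

rsn_ar <- function(nsim, mu, Sigma, alpha) {
  out <- matrix(NA_real_, nsim, length(mu))
  i <- 1
  while (i <= nsim) {
    u <- runif(1)                                        # step 1: U ~ U(0, 1)
    x <- as.vector(rmvnorm(1, mean = mu, sigma = Sigma)) # step 2: draw from h(x)
    if (u <= pnorm(sum(alpha * (x - mu)))) {             # step 3: accept if U <= g(x)
      out[i, ] <- x
      i <- i + 1
    }                                                    # step 4: otherwise try again
  }
  out
}

# Example: 500 draws from a 2-dimensional skew normal; the acceptance rate is 1/c = 1/2 on average.
# z <- rsn_ar(500, mu = c(0, 0), Sigma = diag(2), alpha = c(4, 1))
```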

4.2 Simulation Results

In what follows, we illustrate the potential of the GPRSE model by means of simulated data sets generated with different values of the skew normal distribution parameters. The statistical software R [22] is used to perform all the computations in this paper.

4.2.1 The one-dimensional case


In this scenario, we choose for the latent function f the smooth function f(t) = sin(t)/(1 + t²). We simulate training sets of size 50 each via the statistical model y = f(t) + ε, where ε follows an SN distribution with the different parameter values given in Table 1, and the values of t are randomly chosen from a uniform distribution on the interval [−5, 5]. We use the squared exponential covariance function, which is a special case of the covariance function given in (3). Specifically, we use
$$c(t_i, t_j) = \psi^2 \exp\left(-\frac{(t_i - t_j)^2}{2\omega_1^2}\right).$$
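The data-generating step for one training set can be written as in the sketch below (our own code). It draws the skew normal noise through its standard stochastic representation rather than the accept-reject sampler of Section 4.1, and it assumes the noise location is set to the zero-mean value of Section 3.1.

```r
# Sketch: one simulated training set of Section 4.2.1 with (lambda, sigma) = (5, 0.1).
set.seed(1)
n <- 50; lambda <- 5; sigma <- 0.1
t <- runif(n, -5, 5)                                     # inputs drawn uniformly on [-5, 5]
f <- sin(t) / (1 + t^2)                                  # latent function
delta <- lambda / sqrt(1 + lambda^2)
mu0   <- -delta * sigma * sqrt(2 / pi)                   # zero-mean location of the noise
eps <- mu0 + sigma * (delta * abs(rnorm(n)) + sqrt(1 - delta^2) * rnorm(n))  # SN(mu0, sigma^2, lambda)
y <- f + eps                                             # training responses
```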


For each training set, the hyper-parameters of the two models, namely GPR and GPRSE, are estimated by maximum likelihood. The goodness of fit of the two models on the training set is evaluated using the negative log marginal likelihood (NLML) and the AIC. Additionally, we use the root mean squared error, defined by
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\big(y_i^{\star} - y_i\big)^2},$$
where N is the sample size, as a measure of prediction accuracy. The numerical values of the ML estimates, NLML, AIC, and RMSE obtained from fitting the two models to the training data sets are listed in Table 1. These results reveal interesting information. Note that the ML estimates of the hyper-parameters of the two models are quite good, exhibiting small bias in all cases considered. It is evident that the trained GPRSE model has the smallest values of the NLML and AIC statistics in comparison with the GPR model. In order to further assess the generalization capabilities of the proposed model, we generate independent test data sets, in the same way as the training data, to be used for prediction with the trained GPRSE and GP models. Specifically, test data sets of size 50 are generated for the cases given in Table 1. Then, we use the trained GPRSE and GP models to predict the test data sets, and we again use the RMSE as a measure of prediction accuracy. Clearly, the GPRSE model exhibits smaller RMSE, in both the training and test sets, than the corresponding values for the GP model, and this is true for all the considered cases. Furthermore, Figures 1 and 2 demonstrate that the predicted values of the training points and the test points obtained with the trained GPRSE model are close to their corresponding actual values.

4.2.2 The Multi-dimensional case


Here, we show that the GPRSE model can be used to model functions with multiple input locations. For this purpose, we choose the function
$$f(t_1, t_2) = \exp\big(-t_1^2 - t_2^2\big) + \frac{\sin(t_1 + t_2)}{1 + t_1^2 + t_2^2}.$$


Similarly, we simulate training sets of size 25 each from the model y = f(t1, t2) + ε, where ε again follows an SN distribution with the different parameter values given in Table 2, and the values of t1 and t2 are randomly chosen from uniform distributions on [−5, 5] × [−5, 5]. The performances of the two models are compared using the covariance function
$$c(t_i, t_j) = \psi^2 \exp\left(-\frac{(t_{i1} - t_{j1})^2}{2\omega_1^2} - \frac{(t_{i2} - t_{j2})^2}{2\omega_2^2}\right),$$
where ti = (ti1, ti2)T.
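The corresponding covariance matrix can be assembled as in the short sketch below (ours), with the input locations stored row-wise in a matrix.

```r
# Sketch: anisotropic squared exponential covariance of Section 4.2.2 for inputs in R^2.
cov2d <- function(Ta, Tb, psi2, w1, w2) {
  d1 <- outer(Ta[, 1], Tb[, 1], "-")
  d2 <- outer(Ta[, 2], Tb[, 2], "-")
  psi2 * exp(-d1^2 / (2 * w1^2) - d2^2 / (2 * w2^2))
}

# Example: 25 random input locations in [-5, 5]^2 and their covariance matrix.
# Tmat <- cbind(runif(25, -5, 5), runif(25, -5, 5))
# K <- cov2d(Tmat, Tmat, psi2 = 1, w1 = 2, w2 = 2)
```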


Table 1: ML estimates, NLML, AIC, and RMSE statistics on the simulations

Model | Parameter        | ML estimates                                            | NLML     | AIC (training) | RMSE (training) | RMSE (test)
GPRSE | λ = 5, σ = 0.05  | λ̂ = 5.8136, σ̂ = 0.0511, ψ̂ = 0.2110, ω̂1 = 0.8681       | 74.2992  | −140.5985      | 0.0249          | 0.04443
GP    | σ = 0.05         | σ̂ = 0.0294, ψ̂ = 0.5300, ω̂1 = 1.0369                    | 69.8196  | −133.6392      | 0.0478          | 0.05928
GPRSE | λ = 0.5, σ = 0.1 | λ̂ = 0.6077, σ̂ = 0.0874, ψ̂ = 0.2390, ω̂1 = 0.9028       | −35.8233 | −67.6467       | 0.0699          | 0.07355
GP    | σ = 0.1          | σ̂ = 0.0796, ψ̂ = 0.1934, ω̂1 = 0.8284                    | 36.4272  | −66.8544       | 0.0790          | 0.09929
GPRSE | λ = 5, σ = 0.1   | λ̂ = 5.1228, σ̂ = 0.0769, ψ̂ = 0.1919, ω̂1 = 0.7038       | 55.7801  | −103.5601      | 0.0430          | 0.0505
GP    | σ = 0.1          | σ̂ = 0.0505, ψ̂ = 0.2135, ω̂1 = 0.7807                    | 53.4577  | −100.9155      | 0.0737          | 0.0795
GPRSE | λ = 15, σ = 0.01 | λ̂ = 13.1488, σ̂ = 0.0120, ψ̂ = 0.9944, ω̂1 = 0.9315      | 110.7829 | −213.5659      | 0.0056          | 0.0045
GP    | σ = 0.01         | σ̂ = 0.0103, ψ̂ = 1.0258, ω̂1 = 1.0170                    | 104.6595 | −203.3191      | 0.0106          | 0.0103

Figure 1: Plots of f(t) on the grid [−5, 0.1, 5]. (a) training points (•) with (λ, σ) = (0.5, 0.1), and prediction of the training points using the trained GPRSE (4) and GP (+) models. (b) test points (+), and prediction of the test points using the trained GPRSE (?) and GP () models in (a). (c) training points (•) with (λ, σ) = (5, 0.1), and prediction of the training points using the trained GPRSE (4) and GP (+) models. (d) test points (+), and prediction of the test points using the trained GPRSE (?) and GP () models in (c).

Figure 2: Plots of f(t) on the grid [−5, 0.1, 5]. (a) training points (•) with (λ, σ) = (5, 0.05), and prediction of the training points using the trained GPRSE (4) and GP (+) models. (b) test points (+), and prediction of the test points using the trained GPRSE (?) and GP () models in (a). (c) training points (•) with (λ, σ) = (15, 0.01), and prediction of the training points using the trained GPRSE (4) and GP (+) models. (d) test points (+), and prediction of the test points using the trained GPRSE (?) and GP () models in (c).


The NLML, AIC and RMSE for each training data set are given in Table 2. Clearly, the GPRSE model outperforms the GPR model in all the considered cases in terms of smaller values of the NLML, AIC and RMSE. Moreover, we evaluate the performance of the proposed model by making predictions on test data sets. These test data sets are generated in the same way as the training data sets; we generate test data sets of size 25 for the cases given in Table 2. Prediction of the test data points is carried out using the trained GPRSE and GP models. Once again, the GPRSE model shows better performance than the GPR model in terms of smaller RMSEs when predicting both the training and test data sets. Additionally, Figures 3 and 4 illustrate that the predictions of the training and test data sets using the GPRSE model outperform those of the GP model; the difference is particularly clear for the cases displayed in Figure 4 (c) and (d). It is worth stressing that much more numerical work is needed to reach general conclusions about the performance of the two models, and possibly of other models, in terms of estimation and prediction. Indeed, there is a limitation of the GPRSE model when it is used for making predictions from moderate and large samples, due to the computational cost associated with the matrix manipulations involved. This kind of study is beyond the scope of this paper, and an in-depth study of these issues can be considered in the future.

5 Concluding remarks


In this paper, we introduce a new version of the GPR model, referred to as Gaussian process regression with skewed errors (GPRSE). The GPRSE model generalizes the GPR model and, at the same time, has the potential to handle skewed data. We derive an explicit closed-form expression for the predictive distribution of the functional form in the GPR model at new input locations. Additionally, we derive the marginal likelihood function of the training data, which is of considerable importance for learning the hyper-parameters of the observation model. Simulation studies are carried out to compare the predictive performance of the GPRSE and GPR models, with the hyper-parameters estimated from small samples. Apparently, there is a limitation of the GPRSE model when it is used for making predictions from moderate and large samples, due to the computational cost associated with matrix manipulations. This is an interesting point that can be considered in future work. In this paper, we confine our computations to small training and test data sets; however, moderate and large training data sets are of great importance in many practical applications. Therefore, one can consider approximate inference methods, such as Markov chain Monte Carlo, the Laplace approximation and variational methods, for the GPRSE model in order to deal with large training samples.


Table 2: ML estimates, NLML, AIC, and RMSE statistics on the simulations

Model | Parameter        | ML estimates                                                            | NLML     | AIC (training) | RMSE (training) | RMSE (test)
GPRSE | λ = 5, σ = 0.05  | λ̂ = 4.9681, σ̂ = 0.0511, ψ̂ = 0.2161, ω̂1 = 2.6593, ω̂2 = 4.2427         | −2.8721  | −1.7442        | 0.0249          | 0.0325
GP    | σ = 0.05         | σ̂ = 0.0414, ψ̂ = 0.2721, ω̂1 = 3.3887, ω̂2 = 7.5247                      | 0.9267   | 3.8534         | 0.1579          | 0.1928
GPRSE | λ = 5, σ = 1     | λ̂ = 4.4001, σ̂ = 0.9016, ψ̂ = 0.0994, ω̂1 = 2.75151, ω̂2 = 5.84922       | 20.8114  | 45.6228        | 0.9375          | 0.9837
GP    | σ = 1            | σ̂ = 0.900, ψ̂ = 1.7634 × 10⁻⁶, ω̂1 = 14.7262, ω̂2 = 13.3696              | 25.1192  | 52.2384        | 1.000           | 1.2101
GPRSE | λ = 5, σ = 0.1   | λ̂ = 5.1996, σ̂ = 0.199999, ψ̂ = 0.003816, ω̂1 = 25.5463, ω̂2 = 38.1594   | −11.960  | −19.92         | 0.93605         | 0.9563
GP    | σ = 0.1          | σ̂ = 0.1942, ψ̂ = 0.0530, ω̂1 = 7.44012, ω̂2 = 7.44203                    | −4.7584  | −7.5168        | 1.47245         | 1.1131
GPRSE | λ = 10, σ = 0.01 | λ̂ = 9.88582, σ̂ = 0.2091, ψ̂ = 0.0117, ω̂1 = 2.8171, ω̂2 = 2.4298        | −3.86084 | −3.72168       | 0.00162         | 0.00148
GP    | σ = 0.01         | σ̂ = 0.0174, ψ̂ = 0.3336, ω̂1 = 5.93344, ω̂2 = 8.38007                    | 3.9018   | 9.8036         | 0.0106          | 0.0261

Figure 3: Plots of f(t1, t2) on the grid [−5, 0.3, 5]. (a) training points (•) with (λ, σ) = (5, 0.05), and prediction of the training points using the trained GPRSE (4) and GP (+) models. (b) test points (+), and prediction of the test points using the trained GPRSE (?) and GP () models in (a). (c) training points (•) with (λ, σ) = (5, 1), and prediction of these training points using the trained GPRSE (4) and GP (+) models. (d) test points (+), and prediction of these test points using the trained GPRSE (?) and GP () models in (c).

Figure 4: Plots of f(t1, t2) on the grid [−5, 0.3, 5]. (a) training points (•) with (λ, σ) = (5, 0.1), and prediction of the training points using the trained GPRSE (4) and GP (+) models. (b) test points (+), and prediction of the test points using the trained GPRSE (?) and GP () models in (a). (c) training points (•) with (λ, σ) = (10, 0.01), and prediction of these training points using the trained GPRSE (4) and GP (+) models. (d) test points (+), and prediction of these test points using the trained GPRSE (?) and GP () models in (c).


Acknowledgments

The authors gratefully thank the editor and the anonymous referees for their useful comments and suggestions on a previous version of this paper.


Appendix A


The proofs of the following propositions can be found in Genton [20].

Proposition 1. If Y1, . . . , Yn are independent random vectors with Yi ∼ CSN pi,qi(µi, Σi, Di, vi, ∆i), then the joint distribution of Y1, . . . , Yn is
$$Y = \big(Y_1^T, \ldots, Y_n^T\big)^T \sim CSN_{p^+, q^+}\big(\mu^+, \Sigma^+, D^+, v^+, \Delta^+\big),$$
where
$$p^+ = \sum_{i=1}^{n} p_i, \quad q^+ = \sum_{i=1}^{n} q_i, \quad \mu^+ = \big(\mu_1^T, \ldots, \mu_n^T\big)^T, \quad \Sigma^+ = \oplus_{i=1}^{n}\Sigma_i, \quad D^+ = \oplus_{i=1}^{n} D_i, \quad v^+ = \big(v_1^T, \ldots, v_n^T\big)^T, \quad \Delta^+ = \oplus_{i=1}^{n}\Delta_i,$$
and
$$A \oplus B = \begin{pmatrix} A & 0 \\ 0 & B \end{pmatrix}.$$

Proposition 2. Let Y ∼ CSN p,q(µ, Σ, D, v, ∆) and let A be an n × p (n ≤ p) matrix of rank n. Then AY ∼ CSN n,q(µA, ΣA, DA, v, ∆A), where
$$\mu_A = A\mu, \qquad \Sigma_A = A\Sigma A^T, \qquad D_A = D\Sigma A^T \Sigma_A^{-1}, \qquad \Delta_A = \Delta + D\Sigma D^T - D\Sigma A^T \Sigma_A^{-1} A\Sigma D^T.$$

and

Jo

µ=

µ1 µ2

!

k , Σ= p−k

k Σ11 Σ21

p−k! Σ12 Σ22

k p−k D = ( D 1 D2 ) q.

18

k p−k

Journal Pre-proof

Then the conditional distribution of Y2 given Y1 is CSN p−k,

 ∗ µ2 + Σ21 Σ−1 11 (y10 − µ1 ) , Σ22.1 , D2 , v − D (y10 − µ1 ) , ∆ ,

q

where

and

Σ22.1 = Σ22 − Σ21 Σ−1 11 Σ12 .

of

D ∗ = D 1 + D 2 Σ21 Σ−1 11

Appendix B

p ro

Proof of Theorem 3.1 (i)

T


f + f∗

!

=

In 0n×1 In T 0n×1 1 0Tn×1

!

   f f    ∗   f  = A  f∗  ,   

Pr e-

Y f∗

!

where

A(n+1)×(2n+1) =

In 0n×1 In 1 0Tn×1 0Tn×1

!

.

T T Therefore, in order to find the joint PDF of (Y T , f ∗ ) , we have to find the joint PDF of f T , f ∗ , T . T Since f (t) is a Gaussian process prior, it then follows that the joint PDF of f T , f ∗ and the joint PDF of  can be written in terms of the CSN distributions respectively as follows: T

 ∼ CSN n+1,1 0(n+1)×1 , Σ, 01×(n+1) , 0, 1 and  ∼ CSN n,n (µ1n , σ 2 In , λIn , 0n×1 , σ 2 In ),

al

fT , f∗

urn

where 1n denotes the column of one’s of size n, In is the identity matrix of size n × n and ! K k Σ= , kT k ∗

Jo

where Σ along with its elements is defined via (2). Since f T , f ∗ follows with the aid of using Proposition 1 that f T , f ∗ , T

where

T

is independent of , it then

∼ CSN 2n+1,n+1 (µ∗ , Σ∗ , D ∗ , v ∗ , ∆∗ ) ,

 T T

µ∗ = 0T(n+1)×1 , µ1n

T

, v ∗ = 0, 0Tn×1

19

T

,

∆∗ =

1 0T1×n

01×n σ 2 In

!

,

Journal Pre-proof

and 0n×1 is the zero vector of size n × 1 , and 01×(n+1) 0(n+1)×n λIn 0T(n+1)×n

D∗ =

!

,





K k  T ∗ Σ =  k k∗ 0T(n+1)×n

0(n+1)×n   . 2 σ In

where

In 0n×1 In 0Tn×1 1 0Tn×1

µA = Aµ∗ = and



!

0(n+1)×1 µ1n

K k  T  k k∗ 0T(n+1)×n

ΣA = AΣ∗ AT =

=

!

µ1n 0

=

0(n+1)×n

Pr e-

In 0n×1 In 0Tn×1 1 0Tn×1

!

p ro

of

On using the relation in (5) along with applying Proposition 2, it follows that   ! f Y   = A  f ∗  ∼ CSN n+1,n+1 (µA , ΣA , D A , v ∗ , ∆A ) , ∗ f 

σ 2 In

K + σ 2 In k kT k∗

!

 In 0n×1   T 1 ,   0n×1 In 0n×1 ! 

.

al

Now the matrix DA can be determined from the matrix equation given in Proposition 2 as DA = −1 D ∗ Σ ∗ AT Σ A . In this case, we have that    !−1 ! K k In 0n×1 2 0 01×(n+1) 0(n+1)×n  T K + τ I k    (n+1)×n n . DA = 1   k k∗   0Tn×1 0T(n+1)×n λIn kT k∗ T 2 0(n+1)×n σ In In 0n×1

urn

Equivalently, it reduces to

DA =

01×n 0 2 2 λσ W n×n λσ Tn×1

−1

!

, −1

−1

Jo

(K+σ2 In ) k (K+σ2 In ) kkT (K+σ2 In ) −1 and T = − k∗ −kT (K+σ2 I )−1 k . where W = (K + σ 2 In ) + k∗ −kT (K+σ 2 In )−1 k n Similarly, the parameter ∆A can be computed as follows: ∗ ∗T ∆A = ∆∗ +D∗ Σ∗ D∗ T − D∗ Σ∗ AT Σ−1 A AΣ D ,

where ∗

∆ =

1

0T1×n

01×n σ 2 In

!





∗T

, DΣD

=

0 0T1×n

01×n 2 λ2 σ In 20

!





T

and D Σ A =

0 01×n 2 λσ In 0T1×n

!

.

Journal Pre-proof

So 1

∆A =

0T1×n

01×n 4 2 2 σ (1 + λ ) In − λ2 σ W

!

.

D1 =

!

!

0 λσ 2 Tn×1

e = and D

p ro

01×n λσ 2 W n×n

of

The predictive density of f ∗ given Y and t∗ can be obtained by direct application of Proposition 3 with p = n + 1 and q = n + 1, followed by applying Proposition 2.3.2 of Genton [20]. To see this,    .. ∗ T T ∗ T T as Y along with the corresponding partition partition the random vector Y , f . f   e where of the parameters, DA = D1 D ,

it then follows that the conditional distribution of f ∗ given Y and t∗ is then given by   ∗ ∗ e e e, σ e, De v, ∆ , f |Y , t ∼ CSN 1,n+1 µ

where

and

which completes the proof.

e = ∆

Proof of Theorem 3.1 (ii)

σ e = k ∗ − kT ! e= , v 1

0T1×n

(y − µ1n ) , −1 K + σ 2 In k,

0 −1 2 2 −λσ (K + σ In ) (y − µ1n )

01×n 4 2 2 σ (1 + λ ) In − λ2 σ W

!

!

,

al

e = D

0 2 λσ Tn×1

−1

Pr e-

µ e = kT K + σ 2 In

(9)

urn

Proof. On applying Lemma 2.3.1 of Genton [20], it then follows that the marginal distribution of Y  Y ∼ CSN n,n+1 µ+ , Σ+ , D + , v + , ∆+ , (10)

Jo

where µ+ = µ1n , Σ+ = K + σ 2! In , v + = 0(n+1)×1 , ! 0 1 0 1×n 1×n D+ = and ∆+ = −1 4 −1 2 2 T 2 2 λσ (K + σ In ) 01×n σ (1 + λ ) In − λ2 σ (K + σ 2 In ) −1

4

Observe that the matrix σ 2 (1 + λ2 ) In − λ2 σ (K + σ 2 In ) any nonzero n-dimensional vector v, we have that v

T



σ

2

2

1+λ



2 4

2

In − λ σ K + σ In

−1 

2 T

v=σ v v 21

is positive definite. This is because for

(

2

2v

1+λ −λ

T

1 K σ2

+ In vT v

−1 ) v

.

Journal Pre-proof

Since maxn

v∈R

(

vT

1 K σ2

+ In T v v

−1 ) v

=

σ2 , σ 2 + ζ0

λ2 ζ 0 , σ 2 +ζ0

which is positive. Since this is true for every vector v, then it follows that the matrix  −1 4 σ 2 1 + λ2 In − λ2 σ K + σ 2 In

p ro

is 1 +

of

where ζ0 is the largest Eigenvalue value of K, it follows that a lower bound of −1 T 1 v K + I v n 2 σ 1 + λ2 − λ2 T v v

is positive definite.

References

Pr e-

[1] A. O’Hagan. On curve fitting and optimal design for prediction. Journal of Royal Statistical Society B. 40 (1978) 1–42. [2] R.M. Neal Bayesian learning for Neural Networks. Ph.D., thesis, Dept. of Computer Science, University of Toronto. [3] C. E. Rasmussen and C. Williams. Gaussian processes for machine learning. Cambridge, Massachusetts, MIT press, 2006.

al

[4] S. Brahim–Belhouari and A. Bermak. Gaussian process for non-stationary time series prediction. In Computational Statistics and Data analysis. 47 (2004) 705-712

urn

[5] J. Vanhatalo, P. Jyl¨ anki, and A. Vehtari. Gaussian process regression with Student-t likelihood. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems. 22 (2009) 910-1918. [6] J.H. Macke, S. Gerwinn, L.E. White, M. Kaschube, and M. Bethge. Gaussian process methods for estimating Cortical maps. Cambridge, Massachusetts, MIT press, 2010.

Jo

[7] P. Jyl¨ anki, J. Vanhatalo, and A. Vehtari. Robust Gaussian Process Regression with a Student-t Likelihood. Journal of Machine Learning Research., 12 (2011) 3227–3257 [8] M. T. Alodat and M. Y. Al-rawwash. Skew Gaussian random field. Journal of Computational and Applied Mathematics. 232 (2009) 496–504. [9] A. Azzalini. A class of distributions which includes the normal ones. Scandinavian Journal of Statistics. 12 (1985) 171–178.

22


[10] A. Azzalini. Further results on a class of distributions which includes the normal ones. Statistica. 46 (1986)199-208. [11] A. Azzalini and A. Dalla Valle. The multivariate skew-normal distribution. Statistica. 83 (1996)715– 726.

of

[12] A. Azzalini and A. Capitanio. Statistical application of the multivariate skew. Journal of Royal Statistical Society B. 61 (1999) 579–602.

p ro

[13] M. T. Alodat and K. M. Aludaat. A skew Gaussian process. Pakistan Journal of Statistics. 23 (2006) 89–97. [14] M. T. Alodat and M. Y. Al-rawwash. The extended skew-Gaussian process for Regression. Metron. 72 (2014) 317–330.

Pr e-

[15] G. Gonz´ ales–Fari´ as, J. Domingusez–Molina, and A. Gupta. Additive properties of skew normal random vectors. Journal of Statistical Planning and Inference. 126 (2004) 521–534. [16] D. Allard and P. Naveau. A New spatial skew-normal random field model. Communication in Statistics-Theory and Methods. 36 (2007) 1821–1834. [17] H. Zhang and A. El-Shaarawi. On spatial skew Gaussian process applications. Environmetrics., 10 (2009) 982–37. [18] T. T. Alodat, M.T. Alodat, and Z.R. Al-Rawi. A Generalized multivariate–skew normal distribution with applications to spatial and regression predictions. Statistica and Applicazioni. XIII (2015) 3-38

al

[19] A. Girard, J. Kocijan, R. Murray–Smith, and C.E. Rasmussen. Gaussian Process Model Based Predictive Control. proceeding of the American Control Conference Boston.

urn

[20] A. Genton. Skew-Elliptical Distributions and their applications: A Journey Beyond Normality. Chapman and Hall/CRC, Boca Raton, Florida, 2004. [21] P. R.. Christian and G. Casella Monte Carlo Statistical Methods. Springer, New York., 2004.

Jo

[22] R Core Team. R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2016.

23