Image super-resolution using non-local Gaussian process regression


www.elsevier.com/locate/neucom

PII: S0925-2312(16)00233-2
DOI: http://dx.doi.org/10.1016/j.neucom.2016.01.073
Reference: NEUCOM16755

To appear in: Neurocomputing
Received date: 16 August 2015
Revised date: 3 December 2015
Accepted date: 28 January 2016

Cite this article as: Haijun Wang, Xinbo Gao, Kaibing Zhang and Jie Li, Image super-resolution using non-local Gaussian process regression, Neurocomputing, http://dx.doi.org/10.1016/j.neucom.2016.01.073

Image Super-Resolution Using Non-local Gaussian Process Regression

Haijun Wang^a, Xinbo Gao^a,*, Kaibing Zhang^a,b, Jie Li^c

a State Key Laboratory of Integrated Services Networks, School of Electronic Engineering, Xidian University, Xi'an 710071, China
b School of Computer and Information Science, Hubei Engineering University, Xiaogan 432000, China
c Video and Image Processing System Laboratory, School of Electronic Engineering, Xidian University, Xi'an 710071, China

Abstract

Gaussian process regression (GPR) is an elegant statistical learning method for nonlinear mappings. Despite its effectiveness, GPR suffers from intensive computational complexity. To improve efficiency, most existing GPR-based super-resolution (SR) methods are local patch-based and therefore cannot make full use of the self-similarities present in natural images. To address this problem, we propose a non-local self-learning SR framework based on GPR, which learns a single non-local GPR model instead of multiple local patch-based models for SR reconstruction. In the proposed framework, an anisotropic linear kernel is introduced to construct a new kernel function that captures more structural similarity. Furthermore, we show that simple grid patch sampling with a moderate sampling interval can speed up SR processing significantly without compromising reconstruction quality. In addition, we give a theoretical discussion of two important factors that affect GPR performance: the predictive variance and the condition number of the kernel matrix. Experimental results on benchmark test images show that the proposed method is superior to state-of-the-art competitors in terms of both quantitative and qualitative measurements.

Keywords: Gaussian process regression, grid sampling, image super-resolution, kernel, non-local self-learning.

* Corresponding author.

Preprint submitted to Neurocomputing, February 18, 2016

1. Introduction

Vision is the major means by which human beings acquire information from the outside world, and images act as a main carrier of visual information. However, due to the limitations of physical imaging technology and application cost, it is not always easy to obtain high-quality images that meet practical demands. To overcome or alleviate the drawbacks of existing imaging devices, an effective signal processing technique called super-resolution (SR) has been developed to produce a high-resolution (HR) image from one or more low-resolution (LR) images of the same scene.

According to their theoretical foundations, existing SR methods can be roughly divided into three categories: interpolation-based methods, reconstruction-based methods, and example learning-based methods.

Interpolation-based SR approaches typically utilize different interpolation kernels to predict unknown pixels on the HR grid, under the assumption that images are locally smooth. Typical methods include bilinear, bicubic, and nearest-neighbor interpolation [1]. Although this group of methods is simple and efficient to implement, its weakness is a tendency to produce blurred details at large magnifications.

Reconstruction-based methods utilize a maximum a posteriori (MAP) framework that integrates a reconstruction term with priors to regularize the ill-posed SR problem. Various priors have been applied for different purposes [2]. For example, the gradient profile prior [3], edge-specific prior [4], and bilateral total variation prior [5] are devised to preserve edge sharpness. Priors such as the Zernike-moment [6] and total variation priors [7] are used to remove noise. Other priors such as non-local similarity [8], Markov random fields [9], and the steering kernel prior [10] are exploited to suppress artifacts.
Owing to the prior constraints imposed on the up-sampled images, reconstruction-based methods perform well in generating sharp edges and suppressing aliasing artifacts, but they struggle to synthesize novel details when a large magnification factor is applied.

Example learning-based methods learn the mapping from an LR image to an HR one using a training dataset containing a large number of LR-HR co-occurrence pairs, or a learned over-complete dictionary pair. According to how the mapping is established, example learning-based methods can be further classified into two major sub-categories: coding-based methods and regression-based methods. Coding-based methods assume that the coding coefficients in the LR and the corresponding HR feature spaces are approximately isometric.

Typical coding-based algorithms include neighbor embedding (NE) [11–14] and sparse coding (SC) based on over-complete dictionaries [15–17] or orthogonal dictionaries [18, 19]. By contrast, regression-based methods directly learn mappings from the LR to the HR feature space using models such as Markov random fields [20], kernel regression [21], neural networks [22], neighborhood regression [23], linear regression [24, 25], and support vector regression [26]. Example learning-based methods perform well in recovering fine details but cannot suppress ringing and noise artifacts caused by incompatible feature matching.

Recently, as a special branch of learning-based methods, exploiting the self-similarity structures inside the input LR image itself has attracted much attention in the SR literature [13]. Unlike methods that build an external training dataset for detail synthesis, many self-learning methods have proven particularly effective at hallucinating details by exploiting cross-scale repetitive patterns. Examples include the combination of linear mappings proposed by Bevilacqua et al. [24], the neighbor embedding based method of [14], and the first-order Taylor expansion of a multivariate function in [25].

As an elegant statistical learning model, Gaussian process regression (GPR) has been successfully applied to SR. In particular, He et al. [27] proposed a GPR-based SR model called SRGPR, in which up-sampling and deblurring stages are carried out sequentially for SR reconstruction. As a local sampling based method, it partitions the image into fixed-size big patches (e.g., 45 × 45) with a small percentage of overlap. For each big patch, it utilizes only the self-similarities of small patches (e.g., 3 × 3) inside that big patch to learn an individual GPR model, and does not exploit the self-similarities between big patches, which limits SR performance. Moreover, its intensive computational cost forbids practical applicability. Therefore, there is still room to improve this traditional GPR-based SR method.

To overcome the above limitations of local GPR-based methods while keeping the computational complexity at a moderate level, we propose a novel self-learning framework called non-local GPR (NGPR) for SR reconstruction. In order to produce an image coupled to the interpolation of the input image, we first decimate the input LR image and then interpolate the result back to its original size, obtaining an auxiliary interpolation image. The proposed framework uses GPR to learn the mapping from the high-frequency (HF) details of the auxiliary interpolation image to those of the input LR image; for a given image, its high-frequency details are the difference between the image and its mean image. Next, we apply the learned mapping to predict the missing high-frequency details of the latent HR image. Finally, an iterative back projection (IBP) algorithm [2] is used to further improve the quality of the reconstructed HR image so that it satisfies the assumed degradation model.

The major contributions of the proposed method are three-fold: 1) A non-local GPR-based SR framework is proposed to fully exploit self-similarities, which is superior to existing local methods. 2) The proposed framework allows different grid sampling intervals, which leads to a tradeoff between quality and efficiency.
3) A theoretical discussion is given of the factors that fundamentally affect SR performance.

Compared with SRGPR, our proposed method is unique in the following aspects: (a) Different framework: in order to fully exploit non-local self-similarities and better match the GPR model, the framework of the proposed NGPR-based SR method differs completely from [27] in three aspects, namely feature extraction, training pair construction, and GPR modeling. (b) Different kernel: besides the radial basis function (RBF) kernel and the noise kernel, we introduce an additional linear kernel to capture more structural similarity for SR.

The remainder of this paper is organized as follows: Section II provides a brief overview of the NGPR method. Section III presents the feature extraction used in our work. In Section IV, we detail the proposed NGPR modeling. Section V discusses the relationship between NGPR and non-local means and gives some theoretical propositions about NGPR. Section VI evaluates the effectiveness of NGPR in both subjective and objective quality assessment and discusses the effect of each step and of the key parameters that impact SR performance. Finally, we conclude in Section VII.

2. System Overview

For notational convenience, column vectors are denoted by boldface lowercase letters, e.g., x; matrices by capital letters, e.g., X; and scalars by lowercase letters, e.g., x. Table 1 lists the important notation used in this paper, where images with the same resolution as I belong to the LR scale, and images with the same resolution as S_I (the interpolation of I) belong to the HR scale.

Because a Gaussian process (GP) can describe a distribution over functions, the co-occurrence of intra-scale and inter-scale self-similarities inside natural images [13, 14, 24, 25] provides a basis for applying a GPR model learned at the LR scale to the HR scale for SR estimation: similar image patches x_i and x_j share the same or a similar distribution over functions and are thus mapped to similar center pixels y_i and y_j, regardless of which scale the samples come from. The flow chart of the proposed NGPR is illustrated in Fig. 1. It contains two major phases, namely learning the NGPR model at the LR scale for detail synthesis and predicting HR details at the desired HR scale for SR estimation.


Table 1: Notation used in this paper

Notation | Description | Scale
I_D  | down-sampled version of I with scale s                | LR
I_I  | interpolation of I_D, with the same size as I         | LR
I_M  | mean image of I_I                                     | LR
I_I′ | I_I − I_M, the high-frequency details of I_I          | LR
I′   | I − I_M, the high-frequency details of I              | LR
S_I  | interpolation image of I with upscale factor s        | HR
S_M  | mean image of S_I                                     | HR
S_I′ | S_I − S_M, the high-frequency details of S_I          | HR
S′   | the high-frequency details of the latent HR image S   | HR
S    | the initial SR result predicted by NGPR               | HR
S_F  | the final reconstructed HR image                      | HR

[Figure 1 near here: flow chart showing the learning phase (a GPR f: I_I′ → I′ learned from the input LR image I and its auxiliary interpolation I_I) and the synthesizing phase (predicting the HF details S′ of S from S_I′ and outputting the HR image S_F).]

Figure 1: Flow chart of the proposed NGPR method. In this figure, I_I is the degraded version of the input LR image (first down-sampled and then up-sampled), I_M is the mean image shared by I and I_I, and S_M is the mean image shared by S and S_I.

To be more specific, the learning phase contains two major parts: generating the training image pair and learning the NGPR model. We first obtain the high-frequency components of I_I and I by subtracting the mean image I_M from each of them. Note that I_M is the result of filtering I_I with a 3 × 3 averaging mask, i.e.,

M = (1/9) [1 1 1; 1 1 1; 1 1 1].

Next, we extract the normalized image patches {x_i}_{i=1}^n from I_I′ and the normalized center pixels {y_i}_{i=1}^n of the corresponding patches from I′ as the training set for learning the NGPR mapping model used in SR estimation.

In the testing phase, we first upscale the input LR image I to the desired size by bicubic interpolation and then obtain its high-frequency component by subtracting its mean image S_M (the result of filtering S_I with the same mask M). Finally, the missing high-frequency details {f(x_{j*})}_{j=1}^m of the latent HR image, corresponding to the normalized input image patches {x_{j*}}_{j=1}^m in S_I′, can be predicted using the mapping function learned by the NGPR model.
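As a concrete sketch of this step, the mean image and high-frequency residual can be computed with a box filter; the helper name and the use of SciPy's uniform_filter are our own choices, not prescribed by the paper:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def high_frequency_details(img):
    """Split an image into a 3x3 box-filtered mean image and an HF residual,
    mirroring the paper's 3x3 averaging mask M and the subtraction I - I_M."""
    mean_img = uniform_filter(img.astype(np.float64), size=3, mode="nearest")
    return mean_img, img - mean_img
```

In the paper, the mean image of I_I is shared by both I and I_I, so in practice the filter would be applied to I_I once and the same I_M subtracted from both images.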

3. Feature Extraction

In this section, we introduce three aspects of feature extraction in our proposed SR method: patch sampling, patch pruning, and patch normalization.

3.1. Non-local Patch Sampling

To reduce the training complexity while still making full use of self-similarities, we adopt a simple non-local grid patch sampling scheme to prepare the training dataset {⟨x_i, y_i⟩}_{i=1}^n. Specifically, we set the sampling interval to 4 when extracting image patches and corresponding center pixels from the whole HF images I_I′ and I′, which reduces the training set to 1/16 of all possible sampled patches. Experiments show that this strategy significantly improves SR efficiency while nearly preserving the quality of the reconstructed image.

[Figure 2 near here.]

Figure 2: Example patches that have common trends.

3.2. Patch Pruning

Natural images generally contain a large number of flat patches. To prevent these uninformative patterns from affecting learning performance, we exclude from the training samples those patches whose standard deviations (STDEV) are close to zero. Experiments show that this pruning scheme saves a large amount of computational time without compromising quality.

3.3. Patch Normalization

To take full advantage of patches that share common trends, such as the examples in Fig. 2, we normalize each sample pair in the training dataset {⟨x_i, y_i⟩}_{i=1}^n to unit norm before training, i.e.,

⟨x_i, y_i⟩ ← ⟨x_i/‖x_i‖, y_i/‖x_i‖⟩, ∀i = 1, 2, ..., n.

Correspondingly, we also normalize the testing samples {x_{j*}}_{j=1}^m to unit norm before testing, and restore the scales of the predictions {f(x_{j*})}_{j=1}^m by

f(x_{j*}) ← f(x_{j*}) × ‖x_{j*}‖, ∀j = 1, 2, ..., m.
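The three steps above (grid sampling with interval ξ, STDEV-based pruning, and unit-norm normalization) can be sketched together as follows; the function name and looping details are our own illustration, with default parameter values taken from the paper's experimental settings:

```python
import numpy as np

def build_training_set(hf_lr_interp, hf_lr, patch=11, interval=4, prune_std=0.5):
    """Grid-sample patches from I_I' (inputs) and center pixels from I' (targets).

    Strides over the whole image with step `interval`, drops near-flat patches
    (STDEV <= prune_std), and normalizes each pair by the patch norm ||x||.
    """
    r = patch // 2
    X, y = [], []
    H, W = hf_lr_interp.shape
    for i in range(r, H - r, interval):
        for j in range(r, W - r, interval):
            x = hf_lr_interp[i - r:i + r + 1, j - r:j + r + 1].ravel()
            if x.std() <= prune_std:
                continue  # patch pruning: skip uninformative flat patches
            n = np.linalg.norm(x)
            if n == 0:
                continue
            X.append(x / n)
            y.append(hf_lr[i, j] / n)  # target shares the input's scale factor
    return np.array(X), np.array(y)
```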

4. NGPR Model for SR

In this section, we follow the previously defined notation and briefly introduce the theory of GPR. Next, we construct the kernel used in the NGPR model. Finally, we detail how the hyper-parameters are determined.

4.1. Theory of GPR

A Gaussian process can be used to describe a distribution over function mappings [28–31]. Fundamentally, a Gaussian process f(x) is completely specified by its mean and covariance functions, i.e., f(x) ~ GP(m(x), k(x, x′)), where m(x) is the mean function of f(x) and k(x, x′) is the kernel function that evaluates the covariance. If the observation at x is the sum of f(x) and additive white Gaussian noise, the model can be written as y = f(x) + ε, ε ~ N(0, σ_n²).

Suppose the training samples are {⟨x_i, y_i⟩}_{i=1}^n, and denote X ≜ (x_1, x_2, ..., x_n)ᵀ and y ≜ (y_1, y_2, ..., y_n)ᵀ. Let X_* ≜ (x_{1*}, x_{2*}, ..., x_{m*})ᵀ denote the m test inputs. The objective of GPR is to predict the corresponding noise-free function values f_* ≜ (f(x_{1*}), f(x_{2*}), ..., f(x_{m*}))ᵀ. Since (y, f_*) is jointly Gaussian, after subtracting the respective mean values the joint prior distribution of y and f_* can be written as

[y; f_*] ~ N(0, [K(X, X) + σ_n² I, K(X, X_*); K(X_*, X), K(X_*, X_*)]),   (1)

where K(X, X_*) denotes the n × m covariance matrix evaluated at all pairs of training and test samples, and similarly for K_f ≜ K(X, X), K(X_*, X), and K(X_*, X_*); we write K_y ≜ K(X, X) + σ_n² I. The conditional distribution of f_* can then be inferred as

f_* | X, y, X_* ~ N(f̄_*, cov(f_*)),   (2)

f̄_* ≜ E[f_* | X, y, X_*] = K(X_*, X) K_y⁻¹ y,   (3)

cov(f_*) = K(X_*, X_*) − K(X_*, X) K_y⁻¹ K(X, X_*),   (4)

where f̄_* is the unbiased prediction of f_*, and the variance cov(f_*) provides the confidence of the prediction.
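Equations (3) and (4) translate directly into a few lines of linear algebra. The sketch below is our own illustration, using an RBF kernel and a linear solve in place of an explicit matrix inverse:

```python
import numpy as np

def rbf_kernel(A, B, sigma_f=1.0, length=1.0):
    # k(x, x') = sigma_f^2 * exp(-||x - x'||^2 / (2 l^2))
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sigma_f**2 * np.exp(-d2 / (2 * length**2))

def gpr_predict(X, y, Xs, sigma_n=0.1):
    """Predictive mean (Eq. 3) and covariance (Eq. 4) of a zero-mean GPR."""
    Ky = rbf_kernel(X, X) + sigma_n**2 * np.eye(len(X))  # K_y = K_f + noise
    Ks = rbf_kernel(Xs, X)                               # K(X_*, X)
    mean = Ks @ np.linalg.solve(Ky, y)                   # K_y^{-1} y via solve
    cov = rbf_kernel(Xs, Xs) - Ks @ np.linalg.solve(Ky, Ks.T)
    return mean, cov
```

With a small noise variance, predicting at the training inputs returns values close to the training targets, as expected from Eq. (3).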


4.2. Kernel Selection

The kernel function measures the covariance between samples and is the key factor in GPR performance. We employ three basic kernels in the NGPR model.

1) Isotropic squared exponential kernel (RBF kernel). The RBF kernel is popular in machine learning; its value depends on the distance between the two samples:

k_SEiso(x, x′) = σ_f² exp(−(1/(2l²)) (x − x′)ᵀ(x − x′)),   (5)

where σ_f² is the signal variance and l is the length-scale. This kernel is isotropic and describes the similarity of samples with small Euclidean distance. Isotropy ignores the structural information of image patches, which tends to cause linear dependence in the matrix K_f because of patch similarity.

2) Additive measurement noise kernel. In practical settings there is always some noise in the observations, so it is reasonable to introduce a white noise kernel. Assuming the noise of each sample is independent and identically distributed, the noise kernel is

k_Noise(x, x′) = σ_n² δ(x − x′),   (6)

where δ is the Kronecker delta function and σ_n² is the noise variance. Beyond modeling noise, this kernel also acts as a regularizer that keeps K_y positive definite.

3) Linear kernel. The linear kernel is a dot-product covariance function:

k_LIN(x, x′) = xᵀx′.   (7)

This kernel is symmetric but anisotropic, so it can be used to measure structural information. Because the samples are normalized to unit norm, computing the linear kernel between two samples is equivalent to computing the cosine of the angle between them:

cos(x, x′) = xᵀx′ / (‖x‖ ‖x′‖) = xᵀx′,  s.t. ‖x‖ = 1 and ‖x′‖ = 1.

[Figure 3 near here.]

Figure 3: Illustration of kernel selection.
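A quick numerical check of this identity (our own illustration, with illustrative parameter values): on unit-norm patches the linear kernel is exactly the cosine similarity, so it vanishes for orthogonal patches and turns negative for inversely related ones, whereas the RBF kernel stays positive everywhere.

```python
import numpy as np

def k_seiso(x, xp, sigma_f=1.0, length=1.0):
    # isotropic RBF kernel of Eq. (5); always strictly positive
    d = x - xp
    return sigma_f**2 * np.exp(-float(d @ d) / (2 * length**2))

def k_lin(x, xp):
    # linear kernel of Eq. (7); equals the cosine on unit-norm vectors
    return float(x @ xp)

x = np.array([1.0, 0.0])
print(k_seiso(x, -x))   # exp(-2), small but positive
print(k_lin(x, -x))     # -1.0: the inverse relation is captured
```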

The benefit of the linear kernel k_LIN is that it can discover more intrinsic structure than the RBF kernel k_SEiso alone. This can be illustrated by Fig. 3 for a two-dimensional case. As shown in Fig. 3(a), although k_SEiso can represent the angle α ≜ ∠(x, x″) indirectly through the chord between x and x″, k_LIN is still favorable in the orthogonal setting because k_LIN(x, x′) = 0 while k_SEiso(x, x′) > 0. Moreover, since the covariance computed by k_SEiso is always larger than zero, it models every relation between samples as a positive one, ignoring the fact that negative covariances are also available; inversely related samples, represented by negative covariance, are informative enough for prediction. In this respect, k_LIN is complementary to k_SEiso because it can take negative values and thereby capture inverse relations, which helps GPR prediction make better use of the training samples.

To illustrate this point, consider the simple example in Fig. 3(b), with three training samples {⟨x_i, y_i⟩}_{i=1}^3, s.t. x_3 = −x_*, and a test input x_*. Suppose k_SEiso(x_*, x_i) = s_i σ_f² and k_LIN(x_*, x_i) = l_i for i = 1, 2, 3, so that 0 < s_3 < s_2 < s_1 < 1 and 0 < l_1 < (−l_2) < (−l_3) = 1. If the training set contains only one sample, we reach the following conclusions:

1) If k_SEiso is the only kernel used to compute the covariance, the prediction for x_*, denoted f̄_*, based on the single training sample ⟨x_1, y_1⟩ or ⟨x_2, y_2⟩ is f̄_* = (s_1 σ_f²)(σ_f²)⁻¹ y_1 = s_1 y_1, or f̄_* = (s_2 σ_f²)(σ_f²)⁻¹ y_2 = s_2 y_2. Since 0 < s_2 < s_1, the influence of y_2 is lower than that of y_1.

2) If k_LIN is the only kernel, the prediction based on ⟨x_1, y_1⟩ or ⟨x_2, y_2⟩ is f̄_* = (l_1)(1)⁻¹ y_1 = l_1 y_1, or f̄_* = (l_2)(1)⁻¹ y_2 = l_2 y_2. Since |l_2| > |l_1|, the influence of y_2 is higher than that of y_1. That is to say, k_LIN benefits inversely related training samples more than k_SEiso does.

Furthermore, in the extreme case where ⟨x_3, y_3⟩ is the only training sample, the prediction f̄_* should be very close to −y_3 according to the structural similarity of image patches. With k_SEiso the prediction is f̄_* = (s_3 σ_f²)(σ_f²)⁻¹ y_3 = s_3 y_3 ≈ 0, whereas with k_LIN it is f̄_* = (l_3)(1)⁻¹ y_3 = −y_3, which is more reasonable in most cases.

4) Composite kernel. Since the basic kernels above are too simple to capture complicated image structure on their own, SRGPR [27] constructs a composite kernel k_1 = k_SEiso + k_Noise for similarity measurement. However, as discussed above, k_1 is still insufficient to capture patch structure information. To address this issue, we linearly combine an additional linear kernel into k_1 and use the new kernel k_2 = k_SEiso + k_Noise + c × k_LIN in our framework, where c is a constant coefficient that adjusts the influence of the linear kernel.

4.3. Hyper-parameter Determination

In this subsection, we seek the hyper-parameters of the above kernel by maximizing the marginal likelihood. To this end, we initialize the hyper-parameters according to their physical meanings and iteratively update them by conjugate gradient. Based on the definitions of k_SEiso, k_Noise, and k_LIN, we initialize the hyper-parameters as follows. (1) Since σ_n is the noise STDEV of I′, we initialize it with the sample STDEV of I′ minus I_I′.
Here I_I′ can be seen as an estimate of I′, so I′ − I_I′ can be regarded as an estimate of the noise. (2) l is the length-scale of k_SEiso, which can be initialized by the sample STDEV of the distance matrix between training inputs. (3) σ_f is the signal STDEV, which can be initialized by the STDEV of the samples extracted from I′. (4) The hyper-parameter c is initialized to τσ_f², where τ is a coefficient. It is reasonable for τ to be small, so that the universal kernel k_SEiso remains dominant while structural similarity is introduced; in the following, the initial value of τ is set to 0.2 experimentally. Mathematically, the initial estimates of the above hyper-parameters can be written as

σ_n² = (1 / (R_LR C_LR − 1)) Σ_{i=1}^{R_LR C_LR} [(I′(i) − I_I′(i)) − μ_e]²,   (8)

l² = (1 / ((R_LR C_LR)² − 1)) Σ_{i,j=1}^{R_LR C_LR} (‖P(i) − P(j)‖ − m_d)²,   (9)

σ_f² = (1 / (R_LR C_LR − 1)) Σ_{i=1}^{R_LR C_LR} (I′(i) − μ_I)²,   (10)

m_d = (1 / (R_LR C_LR)²) Σ_{i,j=1}^{R_LR C_LR} ‖P(i) − P(j)‖,   (11)

c = τ σ_f²,   (12)

where R_LR and C_LR are the numbers of rows and columns of I′, I′(i) and I_I′(i) are the i-th pixels of I′ and I_I′, μ_e is the mean of I′ − I_I′ over all pixels, μ_I is the average pixel intensity of I′, P(i) is the i-th patch in I_I′, and m_d is the mean pairwise patch distance.

Based on these initializations, conjugate gradient is used to iteratively seek the optimal hyper-parameters θ* such that θ* = arg max_θ log p(y | X, θ), where θ = {σ_n, l, σ_f, c} and y | f, X ~ N(f, σ_n² I). Once the hyper-parameters are determined, the obtained NGPR model can be used for the regression task.

4.4. Pseudo-code of NGPR

The details of the NGPR-based SR algorithm are summarized in Algorithm 1.


Algorithm 1: The NGPR-based SR Algorithm

Input: (I, s, ξ, ρ) — I is the LR input test image, s is the scale factor, ξ is the sampling interval, ρ is the patch-pruning threshold
Output: final SR result S_F

// Learning Phase
1. Obtain the high-frequency images I_I′, I′, and S_I′ according to Table 1.
2. Extract the patch vectors {x_i}_{i=1}^n from I_I′ with sampling interval ξ, keeping only patches whose standard deviation is greater than ρ.
3. Extract the patch center pixels {y_i}_{i=1}^n from I′ corresponding to {x_i}_{i=1}^n.
4. for each training sample pair ⟨x_i, y_i⟩ do
5.     ⟨x_i, y_i⟩ ← ⟨x_i/‖x_i‖, y_i/‖x_i‖⟩
6. end
7. Train the GPR model f on {⟨x_i, y_i⟩}_{i=1}^n.

// Synthesizing Phase
8. Extract all patch vectors {x_{j*}}_{j=1}^m from S_I′.
9. Initialize S′ ← S_I′.
10. for each patch vector x_{j*} do
11.     n_{j*} ← ‖x_{j*}‖
12.     normalize x_{j*} to unit norm: x_{j*} ← x_{j*} / ‖x_{j*}‖
13.     f_{j*} ← f(x_{j*}) × n_{j*}
14.     replace the patch center pixel in S′ (corresponding to x_{j*} from S_I′) with f_{j*}
15. end
16. S ← S′ + S_M
17. S_F ← iterative back projection (IBP) applied to S
18. return S_F

5. Theoretical Discussion

In this section, we first interpret the NGPR model as an extension of non-local means. We then give a theoretical discussion of the predictive variance of the posterior conditional distribution and of the ill-conditioning of the kernel matrix (measured by its condition number), both of which influence the performance of the proposed NGPR model.

5.1. Relation with Non-local Means

In GPR we use the posterior mean to predict pixels, which can be viewed as a generalization of non-local means. Given a test sample x_*, the mean of the posterior conditional distribution is estimated as

f̄_* ≜ E[f_* | X, y, x_*] = K(x_*, X) K_y⁻¹ y.   (13)

Let wᵀ = K(x_*, X) K_y⁻¹; then

f̄_* = wᵀ y = Σ_{i=1}^n w_i y_i,   (14)

where wᵀ is a row vector, w_i is the i-th element of w, and y_i is the i-th training target in y. NGPR extracts the y_i non-locally from the whole image, and the estimate is a weighted average of all training samples. Consequently, NGPR can be seen as an extension of non-local means.

5.2. Propositions about NGPR

The predictive variance represents the confidence level of the regression: the smaller the predictive variance, the more reliable the estimate. To illustrate this, we give a proposition relating the predictive variance to the sample size.

Proposition 1: Let var_n(f(x_*)) be the predictive variance of a test point x_* with respect to n training points, and var_{n−1}(f(x_*)) that with respect to the first n − 1 training points. Under the assumptions of a Gaussian process prior and Gaussian noise, var_n(f(x_*)) ≤ var_{n−1}(f(x_*)). (See the proof in Appendix A.1.)

Proposition 1 indicates that more samples benefit NGPR by raising its confidence level. However, more samples also mean a greater probability of linear dependence, which may aggravate the ill-conditioning of the kernel matrix. To solve for the hyper-parameters and make predictions, the kernel matrix K_y must be inverted; for example, computing K_y⁻¹ y in GPR prediction is equivalent to solving the linear system K_y t = y for the unknown t. The condition number of K_y therefore measures how sensitive the solution is to changes or errors in the input y. It is defined as

cond(K_y) = ‖K_y‖₂ ‖K_y⁻¹‖₂ = ζ_max / ζ_min,   (15)

where ζ_max and ζ_min are the maximum and minimum singular values of K_y, respectively. A low condition number means K_y is well-conditioned, while a high condition number means it is ill-conditioned. If K_y is ill-conditioned, the solution t is sensitive to tiny changes in y induced by observation noise or truncation error.

For natural images, patches are complex and irregular, so it is difficult to estimate a lower bound on the condition number in general. However, because similar image patches commonly exist, linear dependence is very likely to occur in K_f = K(X, X), i.e., K_f is singular. In this scenario, we obtain the following propositions.

Proposition 2: For NGPR, if linear dependence exists in K_f, then K_y has a singular value equal to σ_n. (See the proof in Appendix A.2.)

Proposition 3: For NGPR, if linear dependence exists in K_f, the condition number of the kernel matrix K_y is greater than or equal to ((1 + τ) × σ_f² + σ_n²)^{1/2} / σ_n. (See the proof in Appendix A.3.)

Proposition 3 gives a lower bound on the condition number. In practice, most input LR images have very low noise, so if linear dependence exists, the condition number will be large. Even with double-precision computation, the cumulative error caused by ill-conditioning remains an unavoidable problem.
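A small numerical illustration of this effect (our own example, not from the paper): duplicated training patches make K_f singular, and the noise kernel restores a finite condition number.

```python
import numpy as np

# Two identical unit-norm "patches" make the linear-kernel Gram matrix K_f
# rank-deficient; adding the noise term sigma_n^2 * I bounds cond(K_y).
x = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # rows 1 and 2 coincide
Kf = x @ x.T                        # singular: eigenvalues 2, 1, 0
Ky = Kf + 1e-2 * np.eye(3)          # eigenvalues shift to 2.01, 1.01, 0.01

print(np.linalg.cond(Kf))           # astronomically large (numerically singular)
print(np.linalg.cond(Ky))           # finite: 2.01 / 0.01 ≈ 201
```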

6. Experiments and Result Analysis

To validate the effectiveness of the proposed NGPR-based SR approach, we conduct 3× SR experiments on ten benchmark test images and compare it with six state-of-the-art methods: SCSR [15], NARM [19], SpReg [21], SRGPR [27], BPJDL [17], and UDF [1]. All codes of the compared methods were downloaded from the authors' websites, and we use their default settings in all experiments. For a fair comparison, the IBP-based quality enhancement is also applied to the SRGPR results.

6.1. Experiment Settings

To simulate the image degradation model, all LR test images are produced by first blurring with a 7 × 7 Gaussian kernel with standard deviation 1.1 and then down-sampling by a factor of 3. The patch size is set to 11 × 11, the grid sampling interval to 4, the patch-pruning threshold to 0.5, and the number of IBP iterations to 10. Considering that the human eye is more sensitive to the luminance component than to the chrominance components, we perform NGPR-based SR reconstruction only on the luminance (Y) channel of color images; the Cb and Cr channels are magnified directly by bicubic interpolation. All experiments are executed on a PC with an Intel Core i3-3220 3.3 GHz CPU and 8 GB of memory.

6.2. Quality Assessment

We first evaluate the performance of the different algorithms using three objective quality assessment measures: peak signal-to-noise ratio (PSNR), the structural similarity (SSIM) index [32], and the feature similarity (FSIM) index [33]. Table 2 reports the PSNR, SSIM, and FSIM results for each image. From Table 2 we find that NGPR achieves better results than the other competitors. Although a few individual results are slightly lower than those of SCSR and NARM, the average results of NGPR are superior to those of all other methods.
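The degradation pipeline described above can be sketched as follows (an illustrative sketch; the SciPy truncation choice that yields a 7-tap Gaussian is our assumption about the exact discretization):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def degrade(hr, scale=3, sigma=1.1):
    """Simulate the test-image degradation: Gaussian blur, then decimation.

    With sigma = 1.1 and truncate = 3.0, SciPy uses a kernel radius of
    int(3.0 * 1.1 + 0.5) = 3, i.e., a 7 x 7 support as in the paper.
    """
    blurred = gaussian_filter(hr.astype(np.float64), sigma=sigma, truncate=3.0)
    return blurred[::scale, ::scale]  # down-sample by the scale factor
```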

Table 2: Objective Quality Assessment of Different Methods
(for each image, the three rows give PSNR (dB), SSIM, and FSIM from top to bottom)

Image       | SCSR [15] | NARM [19] | SpReg [21] | SRGPR [27] | BPJDL [17] | UDF [1] | NGPR
------------|-----------|-----------|------------|------------|------------|---------|------
barbara     |   25.44   |   25.91   |   23.89    |   24.92    |   24.28    |  24.36  | 26.20
            |   0.864   |   0.729   |   0.801    |   0.786    |   0.812    |  0.806  | 0.890
            |   0.932   |   0.932   |   0.900    |   0.883    |   0.902    |  0.890  | 0.957
bb          |   31.51   |   32.00   |   29.47    |   28.72    |   29.74    |  29.14  | 32.93
            |   0.950   |   0.858   |   0.910    |   0.905    |   0.914    |  0.915  | 0.961
            |   0.963   |   0.949   |   0.940    |   0.936    |   0.940    |  0.926  | 0.981
bird        |   30.27   |   30.63   |   27.46    |   26.00    |   27.88    |  27.37  | 31.28
            |   0.895   |   0.880   |   0.850    |   0.822    |   0.859    |  0.852  | 0.922
            |   0.916   |   0.903   |   0.878    |   0.864    |   0.892    |  0.860  | 0.930
girl        |   31.84   |   27.06   |   30.12    |   29.75    |   30.30    |  30.55  | 31.78
            |   0.794   |   0.606   |   0.751    |   0.740    |   0.756    |  0.757  | 0.796
            |   0.895   |   0.824   |   0.867    |   0.868    |   0.880    |  0.847  | 0.898
Lena        |   30.64   |   30.65   |   28.29    |   27.47    |   28.67    |  28.60  | 31.58
            |   0.926   |   0.797   |   0.875    |   0.866    |   0.880    |  0.885  | 0.940
            |   0.960   |   0.947   |   0.926    |   0.920    |   0.927    |  0.922  | 0.977
man         |   26.44   |   26.11   |   24.53    |   23.91    |   24.85    |  24.68  | 27.17
            |   0.901   |   0.679   |   0.827    |   0.821    |   0.837    |  0.835  | 0.921
            |   0.942   |   0.909   |   0.896    |   0.892    |   0.899    |  0.887  | 0.960
Observation |   29.74   |   28.85   |   27.64    |   26.43    |   27.92    |  27.43  | 30.10
            |   0.864   |   0.827   |   0.830    |   0.806    |   0.835    |  0.825  | 0.882
            |   0.907   |   0.880   |   0.885    |   0.873    |   0.893    |  0.859  | 0.915
red flower  |   29.11   |   28.45   |   27.50    |   26.65    |   27.69    |  27.64  | 29.11
            |   0.856   |   0.815   |   0.822    |   0.802    |   0.827    |  0.826  | 0.861
            |   0.911   |   0.880   |   0.894    |   0.883    |   0.900    |  0.875  | 0.914
Seeds       |   26.18   |   25.69   |   23.51    |   22.24    |   23.97    |  23.64  | 26.51
            |   0.787   |   0.721   |   0.698    |   0.667    |   0.718    |  0.691  | 0.805
            |   0.867   |   0.805   |   0.811    |   0.799    |   0.837    |  0.773  | 0.875
Starfish    |   26.07   |   26.27   |   23.82    |   22.69    |   24.08    |  23.65  | 25.90
            |   0.790   |   0.763   |   0.732    |   0.705    |   0.741    |  0.718  | 0.791
            |   0.869   |   0.850   |   0.836    |   0.828    |   0.847    |  0.806  | 0.858
Average     |   28.72   |   28.16   |   26.63    |   25.88    |   26.94    |  26.71  | 29.26
            |   0.863   |   0.767   |   0.810    |   0.792    |   0.818    |  0.811  | 0.877
            |   0.916   |   0.888   |   0.883    |   0.875    |   0.892    |  0.864  | 0.926

To further illustrate the effectiveness of the proposed method, we visually display the super-resolved results in Figs. 4 to 6. The NGPR-based method recovers more pleasing details than the compared methods, as can be seen from the tablecloth in the barbara image, the stripe on the hat in the Lena image, and the beak in the bird image.
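For reference, the RMSE and PSNR figures quoted throughout can be computed as follows. This is the standard definition for 8-bit images, not the authors' exact evaluation code, which may differ in details such as border cropping:

```python
import numpy as np

def rmse(ref, est):
    # Root mean squared error between reference and estimated images.
    return np.sqrt(np.mean((ref.astype(float) - est.astype(float)) ** 2)).item()

def psnr(ref, est, peak=255.0):
    # Peak signal-to-noise ratio in dB for 8-bit images.
    return 20.0 * np.log10(peak / rmse(ref, est))

ref = np.full((8, 8), 100.0)
est = ref + 10.0                  # constant error of 10 gray levels
print(rmse(ref, est))             # 10.0
print(round(psnr(ref, est), 2))   # 28.13
```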

Figure 4: Comparison of SR results (×3) on barbara image: (a) UDF [1] (PSNR: 24.36, SSIM: 0.806, FSIM: 0.890). (b) SCSR [15] (PSNR: 25.44, SSIM: 0.864, FSIM: 0.932). (c) NARM [19] (PSNR: 25.91, SSIM: 0.729, FSIM: 0.932). (d) SpReg [21] (PSNR: 23.89, SSIM: 0.801, FSIM: 0.900). (e) SRGPR [27] (PSNR: 24.92, SSIM: 0.786, FSIM: 0.883). (f) BPJDL [17] (PSNR: 24.28, SSIM: 0.812, FSIM: 0.902). (g) Proposed NGPR (PSNR: 26.20, SSIM: 0.890, FSIM: 0.957). (h) Original.

6.3. More Validation Experiments

To validate the effectiveness of our proposed optimization scheme, we perform a series of validation experiments in three aspects, i.e., kernel selection, grid patch sampling, and patch pruning.

6.3.1. Kernel Selection

The kernel plays a key role in the performance of GPR-based SR methods. In this subsection, we compare the reconstructed results using two different kernels, i.e., k1 = kSEiso + kNoise and k2 = kSEiso + kNoise + c × kLIN, both under the same NGPR framework. The evaluation results are summarized in Table 3.
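A minimal sketch of the two composite kernels compared here, written with hypothetical parameter names and defaults (the paper's actual implementation builds on the GPML toolbox [31]):

```python
import numpy as np

def k_se_iso(X1, X2, sigma_f=1.0, ell=1.0):
    # Isotropic squared-exponential kernel.
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return sigma_f**2 * np.exp(-0.5 * d2 / ell**2)

def k_noise(X1, X2, sigma_n=0.1):
    # White-noise kernel: contributes only on coincident inputs.
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return sigma_n**2 * (d2 == 0)

def k_lin(X1, X2):
    # Linear (dot-product) kernel, adding structure similarity to the mix.
    return X1 @ X2.T

def k1(X1, X2):
    return k_se_iso(X1, X2) + k_noise(X1, X2)

def k2(X1, X2, c=0.1):
    # The composite kernel: k1 plus a weighted linear term.
    return k1(X1, X2) + c * k_lin(X1, X2)

X = np.random.default_rng(2).normal(size=(4, 121))   # 11x11 patches, flattened
print(k2(X, X).shape)   # (4, 4)
```

Both kernels yield symmetric positive semi-definite matrices, since each summand is itself a valid covariance function.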


Figure 5: Comparison of SR results (×3) on Lena image: (a) UDF [1] (PSNR: 28.60, SSIM: 0.885, FSIM: 0.922). (b) SCSR [15] (PSNR: 30.64, SSIM: 0.926, FSIM: 0.960). (c) NARM [19] (PSNR: 30.65, SSIM: 0.797, FSIM: 0.947). (d) SpReg [21] (PSNR: 28.29, SSIM: 0.875, FSIM: 0.926). (e) SRGPR [27] (PSNR: 27.47, SSIM: 0.866, FSIM: 0.920). (f) BPJDL [17] (PSNR: 28.67, SSIM: 0.880, FSIM: 0.927). (g) Proposed NGPR (PSNR: 31.58, SSIM: 0.940, FSIM: 0.977). (h) Original.

Comparing Table 3 with Table 2, we can see that the results obtained by NGPR with the same kernel k1 are superior to those of SRGPR. Moreover, the results obtained with our newly proposed kernel k2 are superior to or on par with those of k1 in all cases. Although the improvements are not very significant, the composite kernel k2 is the better choice for SR reconstruction.

6.3.2. Grid Patch Sampling


The sampling interval is tightly correlated with the computational efficiency of the proposed NGPR-based SR algorithm. To probe a reasonable sampling scheme, we empirically study nine sampling intervals within the range [2, 10]. The variations of the average performance and the condition number over the ten test images are displayed in Fig. 7. From Fig. 7 we can find that the ill-conditioning is alleviated as the sampling interval increases. Although most training patches are flat and uninformative,



Figure 6: Comparison of SR results (×3) on bird image: (a) UDF [1] (PSNR: 27.37, SSIM: 0.852, FSIM: 0.860). (b) SCSR [15] (PSNR: 30.27, SSIM: 0.895, FSIM: 0.916). (c) NARM [19] (PSNR: 30.63, SSIM: 0.880, FSIM: 0.903). (d) SpReg [21] (PSNR: 27.46, SSIM: 0.850, FSIM: 0.878). (e) SRGPR [27] (PSNR: 26.00, SSIM: 0.822, FSIM: 0.864). (f) BPJDL [17] (PSNR: 27.88, SSIM: 0.859, FSIM: 0.892). (g) Proposed NGPR (PSNR: 31.28, SSIM: 0.922, FSIM: 0.930). (h) Original.

Table 3: Performance Comparison between Two Kernels

Image       |   PSNR (dB)   |     SSIM      |     FSIM
            |  k1     k2    |  k1     k2    |  k1     k2
------------|---------------|---------------|---------------
barbara     | 26.00  26.20  | 0.884  0.890  | 0.954  0.957
bb          | 32.72  32.93  | 0.959  0.961  | 0.980  0.981
bird        | 31.24  31.28  | 0.923  0.922  | 0.929  0.930
girl        | 31.77  31.78  | 0.796  0.796  | 0.898  0.898
Lena        | 31.36  31.58  | 0.936  0.940  | 0.974  0.977
man         | 27.00  27.17  | 0.918  0.921  | 0.957  0.960
Observation | 30.07  30.10  | 0.881  0.882  | 0.914  0.915
red flower  | 29.10  29.11  | 0.861  0.861  | 0.914  0.914
Seeds       | 26.42  26.51  | 0.802  0.805  | 0.872  0.875
Starfish    | 25.69  25.90  | 0.780  0.791  | 0.852  0.858
average     | 29.14  29.26  | 0.874  0.877  | 0.924  0.926

it is unavoidable that some informative patches may be excluded if a larger sampling interval is used.

Figure 7: Variation of the average PSNR, SSIM, FSIM, and condition number with different sampling intervals.

As a consequence, it is necessary to make a tradeoff between the sampling interval and the super-resolved quality. From Fig. 7 we can see that the interval of 2 achieves top-level performance; nevertheless, the interval of 4 is a more reasonable tradeoff between performance and efficiency. If one emphasizes computational efficiency over reconstruction quality, a larger interval such as 5 or 7 can be adopted to reduce the runtime. Notice that the three metrics do not change obviously except at the interval of 3. For the interval of 3, the average condition number is far greater than at the other intervals, and the number of training samples is smaller than at the interval of 2, resulting in the worst results. The performance at the interval of 3 is not even comparable to that at greater intervals, which also shows that more samples are not always propitious to better performance.

6.3.3. Patch Pruning

Table 4 shows the average quality assessments using different STDEV thresholds. From the results we can find that the threshold of 0.5 gains the top-level results. It is noticeable that different thresholds affect the number of training samples and thus the processing performance of the NGPR-based SR algorithm.

Table 4: Average Performance Variation with Different STDEV Thresholds

Threshold | PSNR  | SSIM  | FSIM  | RMSE*
----------|-------|-------|-------|-------
0         | 29.12 | 0.875 | 0.925 | 9.329
0.25      | 29.24 | 0.877 | 0.926 | 9.173
0.5       | 29.26 | 0.877 | 0.926 | 9.156
0.75      | 29.24 | 0.877 | 0.926 | 9.175
1         | 29.22 | 0.876 | 0.925 | 9.189
1.25      | 29.21 | 0.875 | 0.924 | 9.206
1.5       | 29.18 | 0.874 | 0.923 | 9.221
1.75      | 29.15 | 0.873 | 0.923 | 9.243
2         | 29.11 | 0.872 | 0.922 | 9.273
2.25      | 29.02 | 0.870 | 0.920 | 9.372
2.5       | 28.87 | 0.867 | 0.917 | 9.518

* RMSE is the root mean squared error.
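Grid patch sampling and STDEV-based patch pruning, as studied above, can be sketched as follows (an illustration under assumed array conventions, not the authors' code):

```python
import numpy as np

def sample_patches(img, patch=11, interval=4, std_thresh=0.5):
    # Slide an 11x11 window over the image on a grid with the given
    # interval; discard near-flat patches whose STDEV is below the threshold.
    h, w = img.shape
    kept = []
    for i in range(0, h - patch + 1, interval):
        for j in range(0, w - patch + 1, interval):
            p = img[i:i + patch, j:j + patch]
            if p.std() >= std_thresh:
                kept.append(p.ravel())
    return np.array(kept)

img = np.random.default_rng(3).integers(0, 256, size=(64, 64)).astype(float)
patches = sample_patches(img)
print(patches.shape[1])   # 121 features per 11x11 patch
```

Raising either the interval or the threshold shrinks the training set, which is exactly the quality/efficiency tradeoff examined in Fig. 7 and Table 4.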

6.3.4. Effects of the Patch Size

In NGPR-based SR, the similarity of two pixels is measured by calculating the kernel between the two corresponding image patches; therefore, the size of the image patch may influence the SR performance. To verify the effects of different patch sizes, we conduct a set of experiments with sizes from 3 × 3 to 21 × 21 at a sampling interval of 4. The compared results are listed in Table 5, where the four assessment indices obtained with the 11 × 11 patch size are the best overall.

Table 5: Average Performance Variation with Different Patch Sizes

Patch Size | PSNR  | SSIM  | FSIM  | RMSE
-----------|-------|-------|-------|--------
3 × 3      | 27.75 | 0.844 | 0.906 | 10.862
5 × 5      | 28.89 | 0.874 | 0.926 | 9.520
7 × 7      | 29.16 | 0.878 | 0.926 | 9.264
9 × 9      | 29.15 | 0.876 | 0.926 | 9.287
11 × 11    | 29.26 | 0.877 | 0.926 | 9.156
13 × 13    | 29.24 | 0.876 | 0.925 | 9.204
15 × 15    | 29.22 | 0.872 | 0.923 | 9.223
17 × 17    | 29.17 | 0.870 | 0.922 | 9.301
19 × 19    | 29.17 | 0.869 | 0.921 | 9.283
21 × 21    | 29.01 | 0.864 | 0.919 | 9.485

6.4. Complexity Analysis

In this subsection, we further evaluate the computational complexity, which is important for SR in practical applications. Suppose the size of an LR input image is r × c and the sampling interval is ξ; the number of training samples is then (r × c)/ξ². According to the time and memory cost of the matrix inversion, the time complexity of NGPR is about O((r × c)³/ξ⁶), and the memory consumption is about O(8 × (r × c)²/ξ⁴). For an LR image of size 171 × 171 (a 512 × 512 HR image for 3× SR), if the sampling interval is 1, the total memory allocated by NGPR exceeds 8 × (171 × 171)² = 6.8403 × 10⁹ bytes, i.e., 6.3705 GB, which is unacceptable for many practical applications. By contrast, if the sampling interval is changed to 4, the memory requirement decreases to 8 × (171 × 171)²/4⁴ = 2.6720 × 10⁷ bytes, i.e., 0.0249 GB, which is acceptable for most portable devices. So, in order to utilize the non-local similarity of the whole image, a larger sampling interval is recommended to keep the memory allocation acceptable.

As a local patch-based method, SRGPR is a representative GPR-based work. To evaluate the complexity of SRGPR, suppose the size of a big patch is br × bc, the size of a small patch is sr × sc, the sampling interval is δ, and the big-patch overlap factor is β. The number of big patches, denoted bn, can be computed as

bn = (⌊(r − br)/(br(1 − β))⌋ + 1) × (⌊(c − bc)/(bc(1 − β))⌋ + 1),    (16)

and the number of training samples for each big patch is about sn = (br × bc)/δ². Then the time complexity of SRGPR is about O(2bn × sn³), and the memory consumption is about O(8 × sn²).

Under the default experimental settings, i.e., ξ = 4, br = bc = 20, sr = sc = 3, δ = 1, β = 2/3, and an LR image of size r × r, the time complexity and the memory consumption of SRGPR and NGPR are compared in Table 6.

Table 6: Complexity Comparison between NGPR and SRGPR

                   | NGPR     | SRGPR
-------------------|----------|----------------------------------
Time Complexity    | r^6/4096 | 128 × (3(r − 20)/20 + 1)² × 10^6
Memory Consumption | r^4/32   | 128 × 10^4
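The closed forms in Table 6 can be evaluated directly; the following sketch (assuming square r × r LR images and the default settings above) locates the crossover points between the two methods:

```python
def ngpr_time(r):   # O(r^6 / xi^6) with xi = 4
    return r**6 / 4096

def srgpr_time(r):  # 2 * b_n * s_n^3 with 20x20 big patches, beta = 2/3
    return 128e6 * (3 * (r - 20) / 20 + 1) ** 2

def ngpr_mem(r):    # 8 * r^4 / xi^4
    return r**4 / 32

def srgpr_mem(r):   # 8 * s_n^2, independent of the image size
    return 128e4

# Find where NGPR stops being cheaper than SRGPR.
time_cross = next(r for r in range(25, 501) if ngpr_time(r) > srgpr_time(r))
mem_cross = next(r for r in range(25, 501) if ngpr_mem(r) > srgpr_mem(r))
print(time_cross, mem_cross)   # roughly 320 and 80, as discussed below
```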

In order to make a further comparison, both the time complexity and the memory consumption for square images with sizes from 25 × 25 to 500 × 500 are illustrated in Fig. 8.

Figure 8: Comparison of (a) the time complexity and (b) the memory consumption of NGPR and SRGPR with different square image sizes.

It is noticeable that there is an intersection point p (at about r = 320) between the two time-complexity curves. It means that NGPR is faster than SRGPR when the LR image size is less than 320 × 320; if the size is larger than 320 × 320, however, the runtime of NGPR grows sharply. In Fig. 8(b), the intersection is at 80, meaning that when the LR image is larger than 80 × 80, NGPR requires more memory than SRGPR. Based on Fig. 8, although the proposed NGPR outperforms SRGPR in quality, its complexity becomes intensive for test images of large size. Therefore, to trade off between reconstruction quality and computational efficiency, two tactics can be adopted in practical applications: 1) perform non-local sampling with a larger grid-sampling interval throughout the whole image to decrease the training complexity; or 2) keep the sampling interval fixed and divide the whole image into several big tiles, reducing the non-locality but improving the computational efficiency.

7. Conclusion

In this paper we presented a self-learning-based single-image SR method that employs NGPR to establish a nonlinear mapping from patches to their corresponding center pixels. Similar patterns across different scales of the input LR image are exploited to estimate the missing details using the learned NGPR model. Fundamentally, the proposed SR model is an extension of the traditional GPR-based SR method. In the proposed method, we integrated the RBF kernel, the noise kernel, and the linear kernel to improve the accuracy of the covariance evaluation and thereby enhance the SR performance. Moreover, an adjustable sampling interval allows a tradeoff between reconstruction quality and efficiency. Experimental results showed that the NGPR-based SR method outperforms the competitors in terms of both objective and subjective assessments. However, there is still room to improve the NGPR-based SR model. On the one hand, a more elaborate kernel could capture image structure similarity more effectively in the construction of the kernel matrix. On the other hand, sparse GPR models (similar to [34, 35]) or active learning for feature selection (similar to [36]) are promising ways to reduce the computational complexity and boost the SR quality in future work.

Acknowledgement This work was supported in part by the National Natural Science Foundation of China under Grant 61432014, and Grant 61471161, in part by the Fundamental Research Funds for the Central Universities under Grant BDZ021403 and


Grant JB149901, in part by Microsoft Research Asia Project based Funding under Grant FY13-RES-OPP-034, in part by the Program for Changjiang Scholars and Innovative Research Team in University of China under Grant IRT13088, in part by the Shaanxi Innovative Research Team for Key Science and Technology under Grant 2012KCT-02, and in part by the China Post-Doctoral Science Foundation under Grant 2013M540734 and Grant 2014T70905.

References

[1] L. Wang, H. Wu, C. Pan, Fast image upsampling via the displacement field, IEEE Trans. Image Process. 23 (12) (Dec. 2014) 5123–5135.
[2] M. Irani, S. Peleg, Motion analysis for image enhancement: resolution, occlusion, and transparency, J. Vis. Commun. Image R. 4 (4) (Apr. 1993) 324–335.
[3] J. Sun, J. Sun, Z. B. Xu, H. Y. Shum, Image super-resolution using gradient profile prior, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2008, pp. 2471–2478.
[4] R. Fattal, Image upsampling via imposed edge statistics, ACM Trans. Graphic 26 (3) (Jul. 2007) 95:1–95:8.
[5] X. Li, Y. Hu, X. Gao, D. Tao, B. Ning, A multi-frame image super-resolution method, Signal Process. 90 (2) (Feb. 2010) 405–414.
[6] X. Gao, Q. Wang, X. Li, D. Tao, K. Zhang, Zernike-moment-based image super resolution, IEEE Trans. Image Process. 20 (10) (Oct. 2011) 2738–2747.
[7] A. Marquina, S. J. Osher, Image super-resolution by TV-regularization and Bregman iteration, J. Scientific Computing 37 (3) (Mar. 2008) 367–382.
[8] M. Protter, M. Elad, H. Takeda, P. Milanfar, Generalizing the nonlocal-means to super-resolution reconstruction, IEEE Trans. Image Process. 18 (1) (Jan. 2009) 36–51.
[9] S. Geman, D. Geman, Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images, IEEE Trans. Pattern Anal. Mach. Intell. 6 (6) (Jun. 1984) 721–741.
[10] K. Zhang, X. Gao, D. Tao, X. Li, Single image super-resolution with non-local means and steering kernel regression, IEEE Trans. Image Process. 21 (11) (Nov. 2012) 4544–4556.
[11] H. Chang, D. Y. Yeung, Y. Xiong, Super-resolution through neighbor embedding, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2004, pp. 275–282.
[12] X. Gao, K. Zhang, D. Tao, X. Li, Image super-resolution with sparse neighbor embedding, IEEE Trans. Image Process. 21 (7) (Jul. 2012) 3194–3205.
[13] D. Glasner, S. Bagon, M. Irani, Super-resolution from a single image, in: Proc. IEEE Int. Conf. Comput. Vis., 2009, pp. 349–356.
[14] K. Zhang, X. Gao, D. Tao, X. Li, Single image super-resolution with multiscale similarity learning, IEEE Trans. Neural Netw. Learn. Syst. 24 (10) (Oct. 2013) 1648–1659.
[15] J. Yang, J. Wright, T. Huang, Y. Ma, Image super-resolution via sparse representation, IEEE Trans. Image Process. 19 (11) (Nov. 2010) 2861–2873.
[16] J. Yang, Z. Wang, Z. Lin, S. Cohen, T. Huang, Coupled dictionary training for image super-resolution, IEEE Trans. Image Process. 21 (8) (Aug. 2012) 3467–3478.
[17] L. He, H. Qi, R. Zaretzki, Beta process joint dictionary learning for coupled feature spaces with application to single image super-resolution, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2013, pp. 345–352.
[18] R. Zeyde, M. Elad, M. Protter, On single image scale-up using sparse-representations, in: Proc. Int. Conf. Curves and Surfaces, 2010, pp. 711–730.
[19] W. Dong, L. Zhang, R. Lukac, G. Shi, Sparse representation based image interpolation with nonlocal autoregressive modeling, IEEE Trans. Image Process. 22 (4) (Apr. 2013) 1382–1394.
[20] W. Freeman, E. Pasztor, O. Carmichael, Learning low-level vision, Int. J. Comput. Vision 40 (1) (Jan. 2000) 25–47.
[21] K. I. Kim, Y. Kwon, Single-image super-resolution using sparse regression and natural image prior, IEEE Trans. Pattern Anal. Mach. Intell. 32 (6) (Jun. 2010) 1127–1133.
[22] C. Dong, C. C. Loy, K. He, X. Tang, Learning a deep convolutional network for image super-resolution, in: Proc. Eur. Conf. Comput. Vis., 2014, pp. 184–199.
[23] R. Timofte, V. De, L. V. Gool, Anchored neighborhood regression for fast example-based super-resolution, in: Proc. IEEE Int. Conf. Comput. Vis., 2013, pp. 1920–1927.
[24] M. Bevilacqua, A. Roumy, C. Guillemot, M.-L. A. Morel, Single-image super-resolution via linear mapping of interpolated self-examples, IEEE Trans. Image Process. 23 (12) (Dec. 2014) 5334–5347.
[25] J. Yang, Z. Lin, S. Cohen, Fast image super-resolution based on in-place example regression, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2013, pp. 1059–1066.
[26] K. S. Ni, T. Q. Nguyen, Image superresolution using support vector regression, IEEE Trans. Image Process. 16 (6) (Jun. 2007) 1596–1610.
[27] H. He, W.-C. Siu, Single image super-resolution using Gaussian process regression, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2011, pp. 449–456.
[28] C. E. Rasmussen, C. K. I. Williams, Gaussian Processes for Machine Learning, Cambridge, MA: MIT Press, 2006, pp. 7–30.
[29] M. Seeger, Gaussian processes for machine learning, Int. J. Neural Sys. 14 (2) (Feb. 2004) 69–106.
[30] A. McHutchon, C. E. Rasmussen, Gaussian process training with input noise, in: Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1341–1349.
[31] C. E. Rasmussen, H. Nickisch, Gaussian processes for machine learning (GPML) toolbox, J. Mach. Learn. Res. 11 (Nov. 2010) 3011–3015.
[32] Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, Image quality assessment: from error visibility to structural similarity, IEEE Trans. Image Process. 13 (4) (Apr. 2004) 600–612.
[33] L. Zhang, L. Zhang, X. Mou, D. Zhang, FSIM: a feature similarity index for image quality assessment, IEEE Trans. Image Process. 20 (8) (Aug. 2011) 2378–2386.
[34] F. Yan, Y. Qi, Sparse Gaussian process regression via L1 penalization, in: Proc. Int. Conf. Mach. Learn., 2010, pp. 1183–1190.
[35] T. Bui, R. Turner, Tree-structured Gaussian process approximations, in: Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 2213–2221.
[36] X. Li, Y. Guo, Adaptive active learning for image classification, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2013, pp. 859–866.

Appendix A. The Proofs of Propositions 1-3

Appendix A.1. The proof of Proposition 1

Proof: Suppose the training set of size n − 1 is X^(n−1) = {x1, x2, ..., x_{n−1}}, and the predictive variance is

var_{n−1}(f*) = k** − K*^(n−1) (Ky^(n−1))^(−1) (K*^(n−1))^T,

where Ky^(n−1) ≜ K(X^(n−1), X^(n−1)) + σn² I and K*^(n−1) ≜ K(x*, X^(n−1)). Appending a new sample xn to X^(n−1) leads to X^(n) = X^(n−1) ∪ {xn}. Then the corresponding predictive variance over n samples is

var_n(f*) = k** − K*^(n) (Ky^(n))^(−1) (K*^(n))^T,

where

K*^(n) ≜ [K*^(n−1), k(x*, xn)],
Ky^(n) ≜ [ Ky^(n−1)  R^T ;  R  s ],   R ≜ K(xn, X^(n−1)),   s ≜ k(xn, xn) + σn²,

s being a scalar and R a 1 × (n − 1) row vector. Ky^(n−1) and Ky^(n) are both positive semi-definite, hence det(Ky^(n−1)) ≥ 0, det(Ky^(n)) ≥ 0, and

det(Ky^(n)) = det(Ky^(n−1)) × (s − R (Ky^(n−1))^(−1) R^T) ≥ 0.

Therefore, s − R (Ky^(n−1))^(−1) R^T ≥ 0. Based on the Banachiewicz inversion formula, we can get

(Ky^(n))^(−1) = [ (Ky^(n−1))^(−1) + N   −(Ky^(n−1))^(−1) R^T M ;  −M R (Ky^(n−1))^(−1)   M ],

where M ≜ (s − R (Ky^(n−1))^(−1) R^T)^(−1) and N ≜ (Ky^(n−1))^(−1) R^T M R (Ky^(n−1))^(−1), and hence

K*^(n) (Ky^(n))^(−1) (K*^(n))^T
  = K*^(n−1) (Ky^(n−1))^(−1) (K*^(n−1))^T + M (K*^(n−1) (Ky^(n−1))^(−1) R^T − k(x*, xn))²
  ≥ K*^(n−1) (Ky^(n−1))^(−1) (K*^(n−1))^T.

So var_n(f(x*)) ≤ var_{n−1}(f(x*)). □

Appendix A.2. The proof of Proposition 2

Proof: If linear dependence exists in Kf, then det(Kf) = 0. The characteristic equation of Ky, namely det(Ky − ρI) = 0, is formulated as

det(Ky − ρI) = 0 ⟺ det(Kf + σn² I − ρI) = 0.

Since det(Kf) = 0 is known, ρ = σn² is an eigenvalue of Ky. So there exists a singular value equal to σn. □

Appendix A.3. The proof of Proposition 3

Proof: The trace of the kernel matrix Ky can be written as

Tr(Kf + σn² I) = Σ_{i=1}^{n} (σf² + σn² + τ × σf²).

The maximum singular value ζmax satisfies

ζmax ≥ (Tr(Kf + σn² I)/n)^(1/2) = ((1 + τ) × σf² + σn²)^(1/2).

Based on Proposition 2, the minimum singular value ζmin satisfies ζmin ≤ σn. So

cond(Ky) ≥ ((1 + τ) × σf² + σn²)^(1/2)/σn. □
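Proposition 1 (adding a training sample never increases the predictive variance) is easy to check numerically; a sketch with an arbitrary RBF kernel and noise level, not tied to the paper's settings:

```python
import numpy as np

def rbf(A, B, sigma_f=1.0, ell=1.0):
    # Isotropic squared-exponential kernel between two point sets.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sigma_f**2 * np.exp(-0.5 * d2 / ell**2)

def pred_var(X, x_star, sigma_n=0.1):
    # GPR predictive variance at x_star given training inputs X.
    Ky = rbf(X, X) + sigma_n**2 * np.eye(len(X))
    Ks = rbf(x_star[None, :], X)
    k_ss = rbf(x_star[None, :], x_star[None, :])
    return (k_ss - Ks @ np.linalg.solve(Ky, Ks.T)).item()

rng = np.random.default_rng(4)
X = rng.normal(size=(10, 3))
x_star = rng.normal(size=3)
variances = [pred_var(X[:n], x_star) for n in range(1, 11)]
# The sequence is non-increasing, as Proposition 1 states.
print(all(v2 <= v1 + 1e-12 for v1, v2 in zip(variances, variances[1:])))   # True
```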
