Resolution-invariant coding for continuous image super-resolution


Neurocomputing 82 (2012) 21–28

Contents lists available at SciVerse ScienceDirect

Neurocomputing journal homepage: www.elsevier.com/locate/neucom

Jinjun Wang a,*, Shenghuo Zhu b

a Epson Research and Development, Inc., 2580 Orchard Parkway, San Jose, CA 95131, United States
b NEC Laboratories America, Inc., 10080 N. Wolfe Road, Cupertino, CA 95014, United States

Article info

Article history: Received 21 June 2011; Received in revised form 30 August 2011; Accepted 21 September 2011; Communicated by Qingshan Liu; Available online 24 December 2011

Keywords: Image representation; Sparse coding; Image super-resolution

Abstract

The paper presents the resolution-invariant image representation (RIIR) framework. It applies sparse coding with a multi-resolution codebook to learn resolution-invariant sparse representations of local patches. An input image can be reconstructed to higher resolution not only at discrete integer scales, as in many existing super-resolution works, but also at continuous scales, which functions similarly to 2-D image interpolation. The RIIR framework includes methods for building a multi-resolution bases set from training images, learning the optimal sparse resolution-invariant representation of an image, and reconstructing the missing high-frequency information at a continuous resolution level. Both theoretical and experimental validations of the resolution-invariance property are presented in the paper. Objective comparison and subjective evaluation show that the RIIR framework based image resolution enhancement method outperforms existing methods in various aspects.

© 2011 Elsevier B.V. All rights reserved.

1. Introduction

Most digital imaging devices produce a rectangular grid of pixels to represent photographic visual data. This is called a raster image. The perceptual clarity of a raster image is determined by its spatial resolution, which measures how finely the grid resolves the scene. Raster images with higher pixel density are desirable in many applications, such as high resolution (HR) medical images for cancer diagnosis, high quality video conferencing, HD television, Blu-ray movies, etc. There is an increasing demand to acquire HR raster images from low resolution (LR) inputs, such as images taken by cell phone cameras, or to convert existing standard definition footage into high definition image/video material. However, raster images are resolution dependent, and thus cannot be scaled to arbitrary resolution without loss of apparent quality.

Another widely used image representation is the vector image. It represents the visual data using geometric primitives such as points, lines, curves, and shapes or polygons. The vector image is fully scalable, which stands in sharp contrast to the resolution dependence of the raster representation. Hence the idea of vectorizing raster images for resolution enhancement has long been studied. Recently, Ramanarayanan et al. [1] added vectorized region boundaries to the original raster images to improve sharpness in scaled results; Dai et al. [2] represented local image patches using background/foreground descriptors and reconstructed the sharp discontinuity between the two; and to allow efficient vector representation

* Corresponding author. E-mail address: [email protected] (J. Wang).

0925-2312/$ - see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.neucom.2011.09.027

for multi-colored regions with smooth transitions, the gradient mesh technique has also been attempted [3]. In addition, commercial software such as [4] is already available. However, vector-based techniques are limited in visual complexity and robustness. For real photographic images with fine texture or smooth shading, these approaches tend to produce over-segmented vector representations using a large number of irregular regions with flat colors. To illustrate, Fig. 1(a) and (b) are vectorized and scaled up to ×3 using the methods in [2,4]. The discontinuity artifacts at region boundaries can be easily observed, and the over-smoothed texture regions make the scaled image watercolor-like.

Alternatively, researchers have proposed to vectorize raster images with the aid of a bases set to achieve higher modeling capacity than simple geometric primitives. For example, in the image/video compression domain, pre-fixed bases, such as the DCT/DWT bases adopted in the JPEG/JPEG-2000 standards, and anisotropic bases such as contourlets [5], have been explicitly proposed to capture different 2-D edge/texture patterns, because they lead to sparse representations, which are very preferable for compression [6]. In addition to pre-fixed bases, adaptive mixture model representations have also been reported. For example, the Bandelets model [7] partitions an image into squared regions according to local geometric flows, and represents each region by warped wavelet bases; the primal sketch model [8] detects the high entropy regions in the image through a sketching pursuit process, and encodes them with multiple Markov random fields. These adaptive representations capture the stochastic image generating process, and are therefore suited for image parsing, recognition and synthesis. In the large body of example-based image resolution enhancement literature, also called ‘‘Single Frame Super-Resolution


Fig. 1. Image SR quality by our RIIR framework. Top: comparison to image vectorization. Bottom: comparison to different example-based methods.

(SR in short)’’, researchers utilize the co-occurrence prior between LR and HR representations in an over-complete bases set to ‘‘infer’’ the HR image. For example, Freeman et al. [9] represented each local region in the LR image using one example LR patch, and applied the co-occurrence prior and a global smoothness dependence through a parametric Markov Network to estimate the HR image representation. Qiang et al. [10] adopted a Conditional Random Field to infer both the missing HR patches and the point-spread-function parameters. Chang et al. [11] utilized locally linear embedding (LLE) to learn the optimal combination weights of multiple LR base elements to estimate the optimal HR representations. In our previous work [12] and in Yang et al.'s work [13], the sparse-coding model is applied to obtain the optimal reconstruction weights using the whole bases set. In addition to example patches, representing images in transformed domains, such as edge profiles [14], wavelet coefficients [15], image contourlets [16], etc., has also been examined. However, although the example-based SR methods significantly improve image quality over 2-D image interpolation, the bases used by existing approaches have only single-scale capacity, e.g., the base used for ×2 up-sizing cannot be used for ×3 up-sizing. Hence these existing methods are not capable of multi-scale image SR. To cope with these limitations, this paper presents a novel method that uses an example bases set yet is capable of multi-scale and even continuous-scale image resolution enhancement. The contributions include:

• The paper introduces a novel resolution-invariant image representation (RIIR) framework that models the inter-dependency between example base sets of different scales. The paper shows that an image can be encoded into a resolution-invariant representation, such that by applying different bases sets, the LR input can be enhanced to multiple HRs. This capability is clearly important in many novel resolution enhancement applications that existing SR methods cannot handle well.

• The key components of the RIIR framework include constructing an RIIR bases set and coding the image into the RIIR. In addition to our previous work [12,17], this paper introduces several coding schemes that all possess the resolution-invariant property, as illustrated in Fig. 1(f)–(h). A comprehensive evaluation was conducted to compare the advantages of the different coding schemes over different aspects.

• The paper further extends the proposed RIIR framework to support continuous scale image SR. A new base for any arbitrary resolution level can be synthesized from the existing RIIR set on the fly. In this way the input image can be enhanced to continuous scales using only matrix–vector multiplication, which can be implemented very efficiently on modern computers.

The rest of the paper is organized as follows: Section 2 introduces the image decomposition model and generalizes the invariance property between different image frequency layers. Section 3 introduces our key RIIR framework based on the invariance property between base sets of different scales. Section 4 applies the RIIR framework to continuous image SR. Section 5 lists the experimental results, and Section 6 summarizes the proposed methods and discusses future work.

2. Resolution invariant property between frequency layers

2.1. Image model

Example-based SR approaches assume that [9] an HR image $I = I^h + I^m + I^l$ consists of a high frequency layer (denoted $I^h$), a middle frequency layer ($I^m$), and a low frequency layer ($I^l$). The degraded LR image $\underline{I} = I^m + I^l$ results from discarding the high frequency components of the original HR version. Hence the image super-resolving process strives to estimate the missing high frequency layer $I^h$ by maximizing $\Pr(I^h \mid I^m, I^l)$ for any LR input. In addition, since the high frequency layer $I^h$ is independent of $I^l$ [9], it is only required to maximize $\Pr(I^h \mid I^m)$, which greatly reduces the variability to be stored in the example set.

A typical example-based SR process works as follows. Given an HR image $I$ and the corresponding LR image $I'$, $I'$ is interpolated to the same size as $I$ and denoted $\underline{I}$. The missing high frequency layer is obtained by $I^h = I - \underline{I}$. A Gaussian filter $G^l$ is properly defined to obtain the middle frequency layer by $I^m = \underline{I} - \underline{I} * G^l$. Now from $I^h$ and $I^m$, a patch pair set $S = \{S^m, S^h\}$ can be extracted as the example bases set. $S^m = \{p^m_i\}_{i=1}^N$ and $S^h = \{p^h_i\}_{i=1}^N$ represent the middle frequency and the high frequency bases respectively. Each element pair $\{p^m_i, p^h_i\}$ is the column expansion of a square image patch from the middle frequency layer $I^m$ and the corresponding high frequency layer $I^h$. The dimensions of $p^m_i$ and $p^h_i$ are $D^m \times 1$ and $D^h \times 1$ respectively, and often $D^m \neq D^h$. Now from a given LR input, the middle frequency patches can be extracted accordingly and denoted $\{y^m_j\}$. The missing high frequency components $\{y^h_j\}$ are estimated based on the co-occurrence patterns stored in $S$. The following subsections review three different models for the estimation process.
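The layer decomposition and patch-pair extraction above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the paper's implementation: the small binomial blur kernel stands in for the unspecified Gaussian $G^l$, nearest-neighbor resampling stands in for the interpolation a real pipeline would use, and the patch size and step are assumed values.

```python
import numpy as np

def blur(img, kernel=(0.25, 0.5, 0.25)):
    """Separable low-pass filter; a binomial kernel stands in for G_l
    (assumed -- the paper does not specify the Gaussian's parameters)."""
    k = np.asarray(kernel)
    tmp = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, tmp)

def decompose_layers(hr, scale=3):
    """I = I_h + I_m + I_l: build the interpolated LR image, then the high
    and middle frequency layers.  Nearest-neighbor resampling is a stand-in
    for bicubic interpolation."""
    lr = hr[::scale, ::scale]                       # downsample to 1/u scale
    up = np.repeat(np.repeat(lr, scale, 0), scale, 1)[:hr.shape[0], :hr.shape[1]]
    i_high = hr - up                                # missing high-frequency layer I_h
    i_mid = up - blur(up)                           # middle-frequency layer I_m
    return i_mid, i_high

def extract_patch_pairs(i_mid, i_high, size=5, step=4):
    """Column-expand co-located square patches into D x N matrices P^m, P^h."""
    pm, ph = [], []
    for r in range(0, i_mid.shape[0] - size + 1, step):
        for c in range(0, i_mid.shape[1] - size + 1, step):
            pm.append(i_mid[r:r+size, c:c+size].ravel())
            ph.append(i_high[r:r+size, c:c+size].ravel())
    return np.array(pm).T, np.array(ph).T
```

Each column of the returned pair of matrices is one $\{p^m_i, p^h_i\}$ pair taken from the same spatial location, as the co-occurrence prior requires.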

2.2. Nearest neighbor

Assuming that image patches follow a Gaussian distribution, i.e., $\Pr(y^m) \sim \mathcal{N}(\mu^m, \Sigma^2)$ and $\Pr(y^h \mid y^m) \sim \mathcal{N}(\mu^h, \Sigma^2)$, it can be easily verified that, for any observed patch $y^m_j$ from the LR input, the maximum likelihood (ML) estimate of $\mu^m_j$ minimizes the following objective function:

$$\{\mu_j^{m*}\} = \mathop{\mathrm{argmin}}_{\{\mu_j^m\} \subset \{p_i^m\}_{i=1}^N} \|y_j^m - \mu_j^m\|^2, \qquad (1)$$

which yields a 1-nearest-neighbor (1-NN) solution. With the co-occurrence prior, the corresponding $\mu_j^{h*}$ is the ML estimate of $\mu_j^h$, which is then used as the missing $y_j^h$ for reconstruction.

The 1-NN estimate considers only the local observation, hence the performance of the 1-NN method depends heavily on the example images. Freeman et al. proposed a parametric Markov Network (MN) to incorporate a neighboring smoothness constraint [9]. The method strives to find $\mu_j^m \in \{p_i^m\}_{i=1}^N$ such that $\mu_j^m$ is similar to $y_j^m$, while the corresponding $\mu_j^h$ follows a smoothness constraint over the 4-connection neighborhood. This equals minimizing the following objective function for the whole network:

$$\{\mu_j^{m*}\} = \mathop{\mathrm{argmin}}_{\{\mu_j^m\} \subset \{p_i^m\}_{i=1}^N} \sum_j \left(\|y_j^m - \mu_j^m\|^2 + \lambda\,\|\mu_j^h - O(\mu_j^h)\|^2\right), \qquad (2)$$

where $O(\mu_j^h)$ represents the region of $\mu_j^h$ overlapped by neighboring patches. Basically, the second term in Eq. (2) penalizes the pixel difference in overlapped regions. Compared to Eq. (1), Eq. (2) obtains a maximum a posteriori (MAP) estimate of $\mu^m$, hence the result is more stable and robust. However, due to the cyclic dependencies of the Markov Network, exact inference of Eq. (2) is a #P-complete problem, and thus computationally intractable. One feasible solution is to break the inference process into two individual steps [9,10,18]: first, for each input patch $y_j^m$, the K nearest neighbors $\{p_k^m\}_{k=1}^K$ are selected from the training data, minimizing $\|y_j^m - \mu_j^m\|^2$ in Eq. (2); second, the K corresponding high frequency patches $\{p_k^h\}_{k=1}^K$ are used as candidates to search for a winner that minimizes $\|\mu_j^h - O(\mu_j^h)\|^2$, using approximation techniques such as Bayesian belief propagation [9], Gibbs sampling [10], graph-cut [18], etc. The winner is the estimated $\mu_j^{h*}$, which is then used as the final $y_j^h$ for reconstruction.
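The 1-NN coding of Eq. (1) and the candidate-selection step of the two-step approximation can be sketched as follows. This is a minimal illustration; the function names and the default K are our own, and step 2 of the approximation (the compatibility search over overlapping regions) is only described in the comment.

```python
import numpy as np

def nn_code(Pm, Ph, y_m):
    """1-NN coding, Eq. (1): pick the middle-frequency basis column closest
    to y_m and return the co-located high-frequency patch as the estimate
    of the missing detail."""
    d = np.sum((Pm - y_m[:, None]) ** 2, axis=0)  # ||y^m - p^m_i||^2 for every i
    i = int(np.argmin(d))
    return i, Ph[:, i]

def knn_candidates(Pm, y_m, k=5):
    """Step 1 of the two-step approximation of Eq. (2): indices of the K
    nearest middle-frequency columns.  Step 2 (not shown) would pick, among
    the matching high-frequency candidates, the one most compatible with
    pixels already reconstructed in the overlapping regions."""
    d = np.sum((Pm - y_m[:, None]) ** 2, axis=0)
    return np.argsort(d)[:k]
```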

2.3. Local linear embedding

The two-step strategy in solving Eq. (2) is computationally expensive. Besides, the improvement in SR image quality over 1-NN is limited. Chang et al. [11] introduced an alternative approach where the problem of reconstructing the optimal $y^h_j$ is regarded as discovering the local linear embedding (LLE) in the original $\mathbb{R}^{D^m}$ space and reconstructing in another $\mathbb{R}^{D^h}$ space.

The LLE method works in the following manner. First, for each input patch $y^m_j$, K nearest neighboring patches $\{p^m_k\}_{k=1}^K$ are selected as in Section 2.2. Next, optimal embedding weights $x^*_j$ are obtained by

$$x^*_j = \mathop{\mathrm{argmin}}_{x_j} \|y^m_j - P^m_j x_j\|^2 + \lambda\|x_j\|^2, \qquad (3)$$

where $P^m_j$ is the $D^m \times K$ matrix representation of $\{p^m_k\}_{k=1}^K$ and $x_j = [x_1, \ldots, x_K]^\top$. The regularization term $\lambda\|x_j\|^2$ is added to improve the conditioning of the least squares fitting problem. Then $x^*_j$ is used to estimate $\mu^{h*}_j$ by

$$\{\mu^{h*}_j\} = \{P^h_j x^*_j\}, \qquad (4)$$

where $P^h_j$ is the $D^h \times K$ matrix representation of $\{p^h_k\}_{k=1}^K$ that corresponds to $\{p^m_k\}_{k=1}^K$. $\mu^{h*}_j$ is used as the computed $y^h_j$ for final reconstruction, and pixels in neighboring overlapped regions simply take their average values.

2.4. Sparse coding

The performance of the LLE method is limited by the quality of the K candidates $\{p^m_k\}_{k=1}^K$, hence the solution by LLE is sub-optimal. In fact, the searching and embedding steps in the LLE method can be addressed simultaneously, i.e., by searching for a set of base elements whose combination is a good estimate of the input. This equals learning the optimal $\{x^*_j\}$ that minimizes the following objective function:

$$x^*_j = \mathop{\mathrm{argmin}}_{x_j} \|y^m_j - P^m x_j\|^2 + \gamma\,\phi(x_j), \qquad (5)$$

where $P^m$ is a $D^m \times N$ matrix representing the middle frequency patch set $S^m = \{p^m_i\}_{i=1}^N$ in the training data. It can easily be seen that Eq. (5) is very similar to Eq. (3), except that $P^m$ is used instead of $P^m_j$. Since $S^m$ is usually over-complete, the regularization term $\phi(x_j)$ is very important. In our previous work [12], an $L_1$ regularization is suggested, and the optimization problem becomes learning the sparse coding (SC) [19] for each $y^m_j$ individually. More details can be found in [12]. Similar to Sections 2.2 and 2.3, the obtained $\{x^*_j\}$ can be applied to estimate $\{\mu^{h*}_j\}$ using Eq. (4), and then $\{y^h_j\}$ accordingly.

2.5. Invariant property between different frequency layers

In the above subsections, each image patch $y^m_j$ is converted into a local representation $x^*_j$, using either the NN, LLE or SC model. Each representation $x^*_j$ is a sparse $N \times 1$ vector with only one non-zero element (in the NN model), K non-zero elements (in the LLE model), or a small number of non-zero elements (in the SC model). For simplicity, we call this process the coding process. When $\{x^*_j\}$ is obtained, the reconstruction process calculates Eq. (4) for all three models. The difference among the three models is the objective function used during the coding process:

• In the NN model, $x^*_j$ is obtained by writing Eq. (1) as

$$x^*_j = \mathop{\mathrm{argmin}}_{x_j} \|y^m_j - P^m x_j\|^2 \quad \text{s.t. } x_j \in \{0,1\}^N \text{ and } \textstyle\sum x_j = 1, \qquad (6)$$

where $P^m$ is the same as that used in Eq. (5).

• In the LLE model, according to Eq. (3), $x^*_j$ is obtained by

$$x^*_j = \mathop{\mathrm{argmin}}_{x_j} \|y^m_j - (P^m \circ A_j)\,x_j\|^2 + \lambda\|x_j\|^2, \qquad (7)$$

where the term $P^m \circ A_j$ denotes the neighboring relation using matrix manipulation. $A_j$ is a $D^m \times N$ matrix and can be factorized as $A_j = I^m a_j$, where $I^m$ is a $D^m \times 1$ unit vector, $a_j \in \{0,1\}^N$, and for each element $a_k$ in $a_j$,

$$a_k = \begin{cases} 1 & \text{if } \|y^m_j - p^m_k\|^2 \le d_{thr},\; p^m_k \in S^m, \\ 0 & \text{otherwise,} \end{cases} \qquad (8)$$

where $d_{thr}$ is a pre-defined threshold that controls the number of NNs to be selected. It can be regarded as a constant.

• In the SC model, $x^*_j$ is obtained by Eq. (5) with $\phi(x_j) = |x_j|_1$.
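Under this unified view, the three coding processes differ only in how $x^*_j$ is computed from $y^m_j$ and the bases, while reconstruction is always Eq. (4). The sketch below illustrates one plausible implementation of each: Eq. (6) as a one-hot nearest-neighbor indicator, Eqs. (3)/(7) as a ridge solve over the K nearest columns, and Eq. (5) with $\phi = L_1$ via plain ISTA. The ISTA solver and all parameter values are our own assumptions; the paper relies on the sparse-coding solver of [19].

```python
import numpy as np

def code_nn(Pm, y):
    """NN model, Eq. (6): x is a one-hot indicator of the nearest basis column."""
    x = np.zeros(Pm.shape[1])
    x[int(np.argmin(np.sum((Pm - y[:, None]) ** 2, axis=0)))] = 1.0
    return x

def code_lle(Pm, y, k=5, lam=1e-3):
    """LLE model, Eqs. (3)/(7): ridge-regularized least squares over the
    K nearest basis columns; all other entries of x stay zero."""
    idx = np.argsort(np.sum((Pm - y[:, None]) ** 2, axis=0))[:k]
    Pk = Pm[:, idx]
    w = np.linalg.solve(Pk.T @ Pk + lam * np.eye(k), Pk.T @ y)
    x = np.zeros(Pm.shape[1])
    x[idx] = w
    return x

def code_sc(Pm, y, gamma=0.01, n_iter=100):
    """SC model, Eq. (5) with phi = L1, solved by plain ISTA (an assumed
    optimizer, not the paper's)."""
    L = 2.0 * np.linalg.norm(Pm, 2) ** 2      # Lipschitz constant of the gradient
    x = np.zeros(Pm.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * Pm.T @ (Pm @ x - y)      # gradient of ||y - Pm x||^2
        z = x - grad / L
        x = np.sign(z) * np.maximum(np.abs(z) - gamma / L, 0.0)  # soft-threshold
    return x

def reconstruct(Ph, x):
    """Eq. (4): the same code x applied to the co-located high-frequency bases."""
    return Ph @ x
```

The sparsity pattern of the returned `x` (one non-zero entry for NN, K for LLE, a small data-dependent number for SC) matches the description above.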

Our intention in discussing these different coding models is that, although $x^*_j$ is learned from the middle frequency layer by Eqs. (1), (3) and (5), it can be directly applied to compute the missing components in the high frequency layer by Eq. (4). This invariant property is generalized in Theorem 2.1 below:

Theorem 2.1. The optimal representation $x^*$ is invariant across different frequency layers given respective bases of the corresponding frequency components.

Theorem 2.1 is a direct result of the image co-occurrence prior, and has been validated by numerous example-based SR works. However, this invariant property depends on the example patch


pair set $S$, where the optimal representation $x^*$ is invariant across the middle and high frequency layers only under a pre-defined up-scaling factor, such as ×2, ×3, etc. In this paper, the correlation between base sets of different scales is of interest. The following section introduces another invariant property between different base sets. We call it the resolution-invariant image representation (RIIR).

3. Resolution-invariant image representation

3.1. Generating multi-resolution base set

To examine the relation among different resolution versions of the same image, a multi-resolution image patch pair set $S$ is generated. First, each image $I$ in a given dataset is processed to obtain its LR version $I_u$ by downsampling $I$ to $1/u$ scale and then upsampling it back to the original size. As explained in Section 2.1, N image patch pairs can be extracted from the $I^m$ and $I^h$ layers respectively, and we denote the obtained set as $S_u = \{S^m_u, S^h_u\} = \{p^m_{i,u}, p^h_{i,u}\}_{i=1}^N$. Next, this is repeated for multiple $u = 1, \ldots, U$ to obtain a multi-resolution bases set. In particular, the order of the elements in each set is specially arranged such that the ith pair in $\{S^m_u, S^h_u\}$ and the ith pair in $\{S^m_U, S^h_U\}$ are from patches at the same location, as highlighted in Fig. 2. With the obtained $S = \{S_u\}$, $u = 1, \ldots, U$, the next subsection examines the relation among these multiple base sets to reveal another invariance property.

Fig. 2. Sampling the base set (resolution levels 1 through U: the original HR image $I$; at each level u, the LR image $I'_u$ and the interpolated image $I_u$).

3.2. Invariant property between different base sets

Ideally, obtaining $I_u$ requires first a downsampling process and then an upsampling process. The downsampling process consists of applying an anti-aliasing filter to reduce the bandwidth of the signal and then a decimator to reduce the sample rate; the upsampling process, on the other side, increases the sample rate and then applies an interpolation filter to remove the aliasing effects [20]. Both the interpolation and the anti-aliasing filters are low-pass filters, and they can be combined into a single filter. In practice, the filter with the smaller bandwidth is more restrictive, and thus can be used in place of both. Now assuming both the HR image $I$ and several downgraded LR images $I'_u$ are available for training (the notations $I$ and $I_1$ are interchangeable hereafter), each $I_u$ can be modeled by

$$I_u = ((I_1 * G_{1/u}) \downarrow_{1/u} \uparrow_{u/1}) * G_{u/1} = I_1 * G^m_u, \qquad (9)$$

where $G_{1/u}$ is the anti-aliasing filter, $G_{u/1}$ is the interpolation filter, and $\downarrow_{1/u}$ / $\uparrow_{u/1}$ is the downsampler/upsampler. The combined filter is the one with the smaller bandwidth between $G_{1/u}$ and $G_{u/1}$. For simplicity, we denote the true combined filter as $G^m_u$ for later discussion.

The downsampling/upsampling steps are generally not reversible. The difference between the obtained $I_u$ and the original $I_1$ is the missing high frequency layer $I^h_u$ that needs to be estimated (Section 2.1). Similarly, the middle frequency layer $I^m_u$ can be obtained by

$$I^m_u = I_u - I_u * G^l_u = I_1 * G^m_u - I_1 * G^m_u * G^l_u = I_1 * G_u, \qquad (10)$$

where $G_u = G^m_u - G^m_u * G^l_u$, and $G^l_u$ denotes the combined filter that further discards the middle frequency layer from $I_u$.

Let $P^m_u$ be a $D^m_u \times N$ matrix representing all the elements in $S^m_u$, where $D^m_u$ is the dimension of patch $p^m_u$; let $y^m_u$ be the middle frequency component of an input patch $y_u$, and let $g_u$ be the column expansion of $G_u$. With Eq. (10), we have

$$P^m_u = P^m_1 * g_u, \qquad (11)$$

where the convolution applies to each row of $P$, and

$$y^m_u = y^m_1 * g_u. \qquad (12)$$

To see whether the representation learned by SC is independent of u, substituting Eqs. (11) and (12) into Eq. (5), the optimal representation under resolution u is obtained by

$$x^*_u = \mathop{\mathrm{argmin}}_{x_u} \|y^m_u - P^m_u x_u\|^2 + \gamma|x_u|_1 = \mathop{\mathrm{argmin}}_{x_u} \|y^m_1 * g_u - (P^m_1 * g_u)\,x_u\|^2 + \gamma|x_u|_1 = \mathop{\mathrm{argmin}}_{x_u} \|C_u(y^m_1 - P^m_1 x_u)\|^2 + \gamma|x_u|_1, \qquad (13)$$

where $C_u$ is the convolution matrix formed by $g_u$. The solution is independent of u, i.e., $x^*_u = x^*_1$, when $C_u^\top C_u$ is a unitary matrix, which requires $g_u$ to be the Dirac delta function. The proofs for the NN and LLE models are given in Appendices A and B respectively.

3.3. Validation for realistic imaging systems

In a realistic imaging process, $g_u$ is a low-pass filter with sufficiently small bandwidth at small scale factors, such that $g_u$ usually approximates the Dirac delta function well. Although the exact parameters of $g_u$ are unknown, the resulting invariance property can be examined by directly measuring the similarity of the representations $x$ learned from different resolution versions. To depict, from a training HR image we first generated a multi-resolution patch pair set $S$ with around 8000 patch pairs at each resolution level. Next, we extracted around 2000 patches from each of the five resolution versions of the same testing image. Then we solved Eqs. (2), (3) and (5) to get the optimal representations $x = \{x_{j,u}\}_{j=1}^{2000}$, $u = 1, \ldots, 5$ at each resolution level for the NN, LLE and SC models respectively. If Theorem 3.1 holds, $x_{j,u}$ should be very similar to $x_{j,v}$, $\forall u \neq v$. Hence we computed the overall similarity between every two elements in $x_j = \{x_{j,1}, x_{j,2}, x_{j,3}, x_{j,4}, x_{j,5}\}$ by

$$\mathrm{sim}(x_j) = \frac{1}{C^2_5} \sum_{u=1}^{4} \sum_{v=u+1}^{5} \mathrm{cor}^j_{u,v},$$

where $\mathrm{cor}^j_{u,v}$ is the correlation between $x_{j,u}$ and $x_{j,v}$. Finally, the overall similarity is averaged over the 2000 patches to get a score. To make the experiment more comprehensive, we tested different redundancy levels in the base set by either randomly removing elements or using K-Means clustering to reduce the base cardinality from 8000 down to 50. The experiments were repeated 5 times with different training/testing images, and the results are shown in Fig. 3.

As can be seen from Fig. 3, the lower bound of the similarity score is greater than 0.44, and the maximal score reaches almost 0.8. The results validate the high similarity between $x_u$ and $x_v$ from different resolutions. The reason why the scores decrease as the cardinality increases is the over-complete nature of the base, where the coding process may not select exactly the same basis elements for reconstruction. When such redundancy is removed, the similarity between representations becomes significantly higher. Based on both the theoretical proof in Eq. (13) and the experimental validation in Fig. 3, we can generalize the second invariant property for a multi-resolution base set:
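The $\mathrm{sim}(x_j)$ score can be computed directly from the per-level codes. A small sketch (the helper name and the U × N array layout are our own):

```python
import numpy as np

def riir_similarity(codes):
    """Average pairwise correlation of one patch's codes x_{j,u} across the
    U resolution levels, i.e. the sim(x_j) score of Section 3.3.
    `codes` is a U x N array, one representation per resolution level."""
    U = codes.shape[0]
    pairs = [(u, v) for u in range(U) for v in range(u + 1, U)]
    cors = [np.corrcoef(codes[u], codes[v])[0, 1] for u, v in pairs]
    return float(np.mean(cors))   # equals (1 / C(U,2)) * sum over all pairs
```

Identical codes at every level give a score of 1; uncorrelated codes pull the score toward 0, which is the behavior the cardinality experiment above probes.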


Theorem 3.1. The optimal representation $x^*$ is invariant across different resolution versions given respective bases of the corresponding resolutions.

Theorem 3.1 reveals that if the different resolution versions of the same image are related by Eq. (9), then the optimal representation learned at one resolution can also be applied at another. For simplicity, we call Theorem 3.1 the resolution-invariant image representation (RIIR) property, and the multi-resolution set $S$ an RIIR set. With the RIIR property, the computationally expensive coding process can be saved in multi-scale resolution enhancement tasks, as discussed in the next section.

4. Applying RIIR for continuous image SR

There are many scenarios where users need different resolution versions of the same image/video input, which requires multi-scale image SR capacity. An RIIR set $S = \{S_u\}$, $u = 1, \ldots, U$ comes with multi-scale reconstruction ability at discrete scales, because each $S_u$ can be used for ×u image SR by existing example-based SR methods [9]. One advantage of the RIIR framework is that, instead of solving the optimal representation $\{x^*\}$ under each scale independently, $\{x^*\}$ only needs to be learned once; by applying different $S_u$, the same $\{x^*\}$ can be used to reconstruct the image at multiple scales. Finer scale factors are achievable by simply increasing the number of resolution levels in the RIIR set. During reconstruction, only the matrix–vector multiplication of Eq. (4) is required, which can be implemented very efficiently. In addition, the RIIR set can be stored locally, while the computed RIIR can be transmitted together with the image/video document.

To further extend RIIR to support continuous scale SR, a new base can be synthesized at the required scale on the fly. To elaborate, let v be the target scale factor, between u and u+1; the ith element in $S_v$ can be synthesized by

$$p_{i,v} = w_{u,v}\,\tilde{p}_{i,u} + (1 - w_{u,v})\,\tilde{p}_{i,u+1}, \qquad (14)$$

where $\tilde{p}_{i,u}$ is the patch interpolated from scale u, and $\tilde{p}_{i,u+1}$ is interpolated from scale u+1. The weight is $w_{u,v} = (1 + \exp((v - u - 0.5) \cdot t))^{-1}$, where in our implementation t = 10 was set empirically.
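Eq. (14) reduces to a logistic blend of the two neighboring-scale bases once both have been interpolated to the target patch size. A minimal sketch, assuming that interpolation has already been done upstream so both inputs share one D × N shape (that assumption, and the function names, are ours):

```python
import numpy as np

def logistic_weight(u, v, t=10.0):
    """w_{u,v} = (1 + exp((v - u - 0.5) * t))^{-1}, with t = 10 as in the paper."""
    return 1.0 / (1.0 + np.exp((v - u - 0.5) * t))

def synthesize_base(P_tilde_u, P_tilde_u1, u, v):
    """Eq. (14): blend the bases of the two neighboring trained scales u and
    u+1 to obtain a base for a fractional target scale v in (u, u+1).
    Both inputs are assumed already interpolated to the target patch size."""
    w = logistic_weight(u, v)
    return w * P_tilde_u + (1.0 - w) * P_tilde_u1
```

With t = 10 the weight stays close to 1 for v near u and close to 0 for v near u+1, so the synthesized base smoothly hands over from one trained scale to the next as v sweeps the interval.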

5. Experimental results

5.1. Multi-scale image SR

This subsection compares the quality of super-resolved images produced by the RIIR framework with existing SR methods. Since most of these benchmark methods do not support continuous SR, we compared the image quality under multiple discrete scales. To begin with, an RIIR set $S$ was trained: around 20 000 patch pair examples were extracted from some widely used images such as the ‘‘peppers’’ image. First, 25 testing images were processed to compare with existing example-based SR methods that use the same coding model but without the RIIR technique, including ‘‘KNN’’ [9],

Fig. 3. Correlation between different resolution versions (NN, LLE and SC models, with K-Means and random cardinality reduction).

Table 1
Comparison of average SR processing time (seconds) with/without RIIR.

Scale  NN     RIIR(NN)  LLE    RIIR(LLE)  SC      RIIR(SC)
×2     3.89   0.11      7.03   0.13       19.45   0.16
×3     11.15  11.15     13.97  13.97      54.22   54.22
×4     14.39  0.19      22.73  0.23       98.59   0.28
×5     14.69  0.28      26.86  0.34       159.92  0.45
×6     15.11  0.42      28.94  0.52       249.68  0.66


Fig. 4. Average SR quality under different scales (PSNR gain over BiCubic interpolation, scales ×2 to ×6; methods: RIIR(NN), RIIR(LLE), RIIR(SC), NN, LLE, SC, Enhance, Soft Edge).

‘‘LLE’’ [11] and ‘‘SC’’ [13]. The RIIR was learned at the ×3 scale, and multiple up-scale factors from ×2 to ×6 were specified for reconstruction. The processing time was logged on a DELL PRECISION 490 PC (3.2 GHz CPU, 2 GB RAM), and the results are listed in Table 1. As can be seen, while the same amount of computation is required to calculate the coding at the ×3 scale, for the remaining scales the computation becomes negligible.

Next, the quality of the generated SR images was evaluated. In addition to the benchmark methods used in Table 1, two functional interpolation SR methods, ‘‘Enhance’’ [21] and ‘‘Soft edge’’ [2], were also implemented for comparison. The PSNR score over BiCubic interpolation is presented in Fig. 4. As can be seen, at most scales the best image quality is achieved by the SC method, while the proposed RIIR method using the SC model achieves the second best image quality, losing only by a very small margin. This promising result shows that the RIIR method saves a considerable amount of computation (Table 1) while sacrificing only a negligible amount of image quality. In fact, comparing the three

Fig. 5. Illustration of continuous image scaling (top-left: the original image; first row: BiCubic interpolation; second row: RIIR with NN model; third row: RIIR with LLE model; last row: RIIR with SC model).


Fig. 6. Illustration of continuous image scaling at scales ×2.65, ×3.15, ×3.85 and ×4.35 (top-left: the original image; first row: BiCubic interpolation; second row: RIIR with NN model; third row: RIIR with LLE model; last row: RIIR with SC model).

coding models with/without the RIIR framework, the achieved image quality is always comparable. In addition, with the NN coding model, the achieved image PSNR score is even higher than that without the RIIR framework. This is reasonable because, as explained in Section 2.2, the NN method tends to over-fit the example patch pair set, and hence the representation computed by Eq. (1) may not generalize well to the high-frequency layer. On the other hand, solving Eq. (1) under the RIIR framework incorporates stronger regularization, such that the learned representation is more reliable and robust.

5.2. Continuous image SR

The second experiment demonstrated continuous image SR using the RIIR framework (Section 4). We first generated the RIIR base set $S$ from scale ×1 to ×5 with step size 0.5, i.e., a base is trained at every u and u+0.5 scale, u = 1, ..., 5. This takes up 15 MB of storage space when the cardinality is set to 2000. For each testing image, the RIIR is learned at scale ×3. Next we conducted continuous scaling between ×1 and ×5 with step size 0.05. A DELL PRECISION 490 PC (3.2 GHz CPU, 2 GB RAM) was used to conduct a subjective user study in which 12 people were asked to compare the image quality with BiCubic interpolation. All of them rated significantly higher scores for our results, and most of them were not aware of the processing delay in generating the SR images. These results validate the good performance of continuous SR using the RIIR method as well as the low computational cost of generating the up-scaled images. Some example images can be found in Figs. 5 and 6. A video demonstration of the reconstruction process is attached, where readers can examine the processing speed and quality of image reconstruction by our RIIR framework.

5.3. Parameter tuning

To gain more insight into the RIIR framework, we also evaluated different parameter settings. Of the three coding schemes, NN has no parameter in the coding step, while in LLE the number of local neighbors K in Eq. (3), and in SC the weight of the L1 regularization γ in Eq. (5), need to be specified. In addition, all three schemes require the codebook size N.

We first tested the effect of different codebook sizes. We used the method in Section 3.3 to build several RIIR base sets with increasing cardinality, then evaluated the image quality under multiple scales as in Section 5.1. It is observed that the average SR image quality saturates after the number of basis vectors reaches a sufficient number, around 2000.

The second experiment tested the effect of K for LLE and γ for SC. According to Eqs. (3) and (5), both control the number of activated basis vectors for reconstruction, hence we put the results on the same scale as the ratio of non-zero elements to the size of the codebook. Similar to that reported in [12], the best SR image quality is achieved when on average 5–10 basis vectors are used to code each input patch, i.e., 0.25–0.5% of the total basis vectors are selected, while further reducing or increasing the sparsity decreases the performance.

6. Conclusion and future work

The paper presents the resolution-invariant image representation (RIIR) framework, motivated by the idea that the same image should have identical representations at different resolution levels. In the framework, a multi-scale RIIR base set is constructed, and three coding models, NN, LLE and SC, are all


validated to exhibit the resolution-invariance property. In this way the computational cost of multi-scale image SR can be significantly reduced. In addition, the extension of RIIR to support continuous image scaling is discussed. With this capability, the RIIR framework supports applications that existing image SR methods cannot handle well; for instance, in [22] the RIIR framework is applied to content-based zooming for mobile users. Experimental results show that our RIIR-based method outperforms existing methods in various aspects.

Future work includes the following: first, in addition to image magnification, applying the RIIR framework to improve image shrinking quality; second, additional optimization strategies to speed up the coding step; third, parallelizing the coding and reconstruction processes with modern CPU and/or GPU support; and fourth, exploring other application domains such as image compression, streaming, and personalization.

Appendix A. Resolution invariance in the NN model

According to Eq. (6), at level $u$, for each input patch $y_u$ (the subscript $j$ is omitted), the optimal representation $x_u^*$ can be obtained by

$$x_u^* = \arg\min_{x_u} \left\| y_u^m - P_u^m x_u \right\|^2 \quad \text{s.t. } x_u \in \{0,1\}^N \text{ and } \textstyle\sum x_u = 1, \tag{15}$$

where $P_u^m$ is a $D_u^m \times N$ matrix representing all the elements in $S_u^m$. Substituting Eq. (10) into Eq. (15),

$$\begin{aligned}
x_u^* &= \arg\min_{x_u} \left\| (y_1^m * g_u) - (P_1^m * g_u)\, x_u \right\|^2 \quad \text{s.t. } x_u \in \{0,1\}^N \text{ and } \textstyle\sum x_u = 1 \\
      &= \arg\min_{x_u} \left\| C_u \left( y_1^m - P_1^m x_u \right) \right\|^2 \quad \text{s.t. } x_u \in \{0,1\}^N \text{ and } \textstyle\sum x_u = 1,
\end{aligned}$$

which is identical to Eq. (13) except for the constraint. Hence, when $C_u^\top C_u$ is the unitary matrix, the solution $x_u^*$ becomes independent of $u$, and the resolution-invariance property holds.
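The claim can be checked numerically: under an orthogonal $C_u$ (so $C_u^\top C_u = I$), both the one-hot code of the NN model and the ridge-regularized weights of the LLE model (Appendix B) are unchanged. A minimal sketch with random data, where `nn_code` and `ridge_code` are our own illustrative names and a random orthogonal matrix stands in for the ideal $C_u$:

```python
import numpy as np

def nn_code(y, P):
    """One-hot least squares as in Eq. (15): the minimizing x_u
    simply selects the column of P closest to y."""
    return int(np.argmin(((P - y[:, None]) ** 2).sum(axis=0)))

def ridge_code(y, M, lam=0.1):
    """L2-regularized weights as in Eq. (16), with M standing in
    for the neighborhood matrix P_u^m A_u."""
    return np.linalg.solve(M.T @ M + lam * np.eye(M.shape[1]), M.T @ y)

rng = np.random.default_rng(0)
d, N = 16, 32
P = rng.standard_normal((d, N))   # codebook at one resolution
y = rng.standard_normal(d)        # an input patch
M = P[:, :8]                      # stand-in neighborhood matrix
# Random orthogonal C (so C^T C = I), playing the role of C_u in
# the ideal case the appendix analyses.
C, _ = np.linalg.qr(rng.standard_normal((d, d)))

assert nn_code(y, P) == nn_code(C @ y, C @ P)                  # NN
assert np.allclose(ridge_code(y, M), ridge_code(C @ y, C @ M))  # LLE
```

Both assertions hold because an orthogonal transform preserves the norms and inner products that the two objectives depend on, which is exactly the independence from $u$ argued above.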

Appendix B. Resolution invariance in the LLE model

According to Eq. (7), for each input patch $y_{j,u}$ (the subscript $j$ is omitted below), the optimal representation weight $x_u^*$ minimizes

$$x_u^* = \arg\min_{x_u} \left\| y_u^m - (P_u^m A_u)\, x_u \right\|^2 + \lambda \left\| x_u \right\|^2. \tag{16}$$

Substituting Eqs. (7) and (10) into Eq. (16), we have

$$\begin{aligned}
x_u^* &= \arg\min_{x_u} \left\| (y_1^m * g_u) - \left( (P_1^m A_1) * g_u \right) x_u \right\|^2 + \lambda \left\| x_u \right\|^2 \\
      &= \arg\min_{x_u} \left\| C_u \left( y_1^m - (P_1^m A_1)\, x_u \right) \right\|^2 + \lambda \left\| x_u \right\|^2,
\end{aligned}$$

which has the same form as Eq. (13). Hence, when $C_u^\top C_u$ is unitary, the solution $x_u^*$ becomes independent of $u$, and the resolution-invariance property holds.

References

[1] G. Ramanarayanan, K. Bala, B. Walter, Feature-based textures, in: Proceedings of Eurographics Symposium on Rendering, 2004, pp. 186–196.
[2] S. Dai, M. Han, W. Xu, Y. Wu, Y. Gong, Soft edge smoothness prior for alpha channel super resolution, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2007, pp. 1–8.
[3] J. Sun, H. Tao, H. Shum, Image hallucination with primal sketch priors, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2003, pp. 729–736.
[4] http://www.vectormagic.com
[5] M.N. Do, M. Vetterli, The contourlet transform: an efficient directional multiresolution image representation, IEEE Trans. Image Process. (2005) 2091–2106.
[6] W. Hong, J. Wright, K. Huang, Y. Ma, Multiscale hybrid linear models for lossy image representation, IEEE Trans. Image Process. (2006) 3655–3671.
[7] E. Le Pennec, S. Mallat, Sparse geometric image representations with bandelets, IEEE Trans. Image Process. (2005) 423–438.
[8] C. Guo, S. Zhu, Y. Wu, Towards a mathematical theory of primal sketch and sketchability, in: Proceedings of International Conference on Computer Vision, 2003, pp. 1228–1235.
[9] W. Freeman, E. Pasztor, O. Carmichael, Learning low-level vision, Int. J. Comput. Vision 40 (1) (2000) 25–47.
[10] Q. Wang, X. Tang, H. Shum, Patch based blind image super resolution, in: Proceedings of International Conference on Computer Vision, 2005, pp. 709–716.
[11] H. Chang, D. Yeung, Y. Xiong, Super-resolution through neighbor embedding, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2004, pp. 275–282.
[12] J. Wang, S. Zhu, Y. Gong, Resolution enhancement based on learning the sparse association of image patches, Pattern Recognition Letters 31 (1) (2010) 1–10.
[13] J. Yang, J. Wright, T. Huang, Y. Ma, Image super-resolution as sparse representation of raw image patches, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2008.
[14] J. Sun, Z. Xu, H. Shum, Image super-resolution using gradient profile prior, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2008.
[15] C. Jiji, M. Joshi, S. Chaudhuri, Single-frame image super-resolution using learned wavelet coefficients, Int. J. Imaging Syst. Technol. (3) (2004) 105–112.
[16] C. Jiji, S. Chaudhuri, Single-frame image super-resolution through contourlet learning, EURASIP J. Appl. Signal Process. (2006), Article ID 73767.
[17] J. Wang, S. Zhu, Y. Gong, Resolution-invariant image representation and its applications, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 2512–2519.
[18] U. Mudenagudi, R. Singla, P.K. Kalra, S. Banerjee, Super resolution using graph-cut, in: Proceedings of Asian Conference on Computer Vision, 2006, pp. 385–394.
[19] H. Lee, A. Battle, R. Raina, A. Ng, Efficient sparse coding algorithms, in: Advances in Neural Information Processing Systems, MIT Press, 2007, pp. 801–808.
[20] A.V. Oppenheim, R.W. Schafer, J.R. Buck, Discrete-Time Signal Processing, 2nd ed., Prentice Hall, 1999.
[21] J. Wang, Y. Gong, Fast image super-resolution using connected component enhancement, in: Proceedings of International Conference on Multimedia & Expo, 2008.
[22] J. Wang, S. Zhu, Y. Gong, Resolution-invariant image representation for content-based zooming, in: Proceedings of International Conference on Multimedia & Expo, 2010.

Jinjun Wang received the B.E. and M.E. degrees from Huazhong University of Science and Technology, China, in 2000 and 2003, and the Ph.D. degree from Nanyang Technological University, Singapore, in 2006. From 2006 to 2009 he was with NEC Laboratories America, Inc. as a postdoctoral research scientist, and in 2010 he joined Epson Research and Development, Inc. as a senior research scientist. His research interests include pattern classification, image/video enhancement and editing, content-based image/video annotation and retrieval, and semantic event detection.

Shenghuo Zhu received the Ph.D. degree in computer science from the University of Rochester, Rochester, NY, in 2003. He is a Research Staff Member with NEC Laboratories America, Inc., Cupertino, CA. His primary research interests include information retrieval, machine learning, and data mining. In addition, he is interested in customer behavior research, game theory, robotics, machine translation, natural language processing, computer vision, pattern recognition, bioinformatics, etc.