Double sparsity for multi-frame super resolution


Communicated by Jun Yu

Accepted Manuscript

Double sparsity for multi-frame super resolution
Toshiyuki Kato, Hideitsu Hino, Noboru Murata

PII: S0925-2312(17)30334-X
DOI: 10.1016/j.neucom.2017.02.043
Reference: NEUCOM 18117

To appear in: Neurocomputing

Received date: 7 December 2016
Revised date: 15 February 2017
Accepted date: 17 February 2017

Please cite this article as: Toshiyuki Kato, Hideitsu Hino, Noboru Murata, Double sparsity for multi-frame super resolution, Neurocomputing (2017), doi: 10.1016/j.neucom.2017.02.043

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Double sparsity for multi-frame super resolution

Toshiyuki Kato (a), Hideitsu Hino (∗, b), Noboru Murata (a)

(a) School of Science and Engineering, Waseda University, 3-4-1 Ohkubo, Shinjuku, Tokyo, Japan, 169-8555
(b) Department of Computer Science, University of Tsukuba, 1-1-1 Tennoudai, Tsukuba, Ibaraki, Japan, 305-8573

Abstract

A number of image super resolution algorithms based on sparse coding have successfully implemented multi-frame super resolution in recent years. In order to utilize multiple low-resolution observations, both accurate image registration and sparse coding are required. Previous studies on multi-frame super resolution based on sparse coding first apply block matching for image registration, followed by sparse coding to enhance the image resolution. In this paper, these two problems are solved by optimizing a single objective function. The proposed formulation not only has a mathematically interesting structure, called the double sparsity, but also yields numerical performance comparable or superior to conventional methods.

Keywords: Image Super Resolution, Sparse Coding, Double Sparsity, Dictionary Learning

1. Introduction

Image super resolution (SR) is the problem of enhancing low-resolution (LR) images. Its importance is increasing over time because of a growing need to remaster old videos or investigate low-resolution surveillance videos,

∗Corresponding author. Tel.: +81 298 53 5538; Fax: +81 298 53 5538. Email address: [email protected] (Hideitsu Hino)

Preprint submitted to Neurocomputing

February 22, 2017

for example. A large number of methods have been proposed for SR, of which a group of actively studied methods is based on sparse coding (Olshausen and Field, 1996, 2005; Natarajan, 1995; Elad, 2010; Bruce et al., 2016), which is the focus of this paper. Sparse coding is a methodology in signal processing where the observed signals are approximated by a linear combination of simple, common components called atoms. A distinctive feature of sparse coding is that the number of atoms prepared for signal reconstruction is large, whereas the number of atoms actually used to represent a signal is small. This nature of sparse coding enables us to represent signals compactly and to remove noise effectively. Sparse coding has broad applications, and one of the most actively studied areas is image representation. For example, combined with graph-based regularization techniques (Belkin and Niyogi, 2003), manifold regularized sparse coding methods have been proposed (Zheng et al., 2011; Jin et al., 2015; Lou et al., 2016; Shu et al., 2016). Other interesting applications of sparse coding are human pose estimation problems (Wang et al., 2014; Hong et al., 2015b,a) and image ranking problems (Yu et al., 2014, 2016), to name but a few.

Sparse coding is also used for super resolution with a single LR image (Yang et al., 2008, 2010; Zeyde et al., 2010) and, recently, with multiple LR images (Kato et al., 2015b). Multi-frame image SR is expected to generate clearer high-resolution (HR) images than single-frame SR, if the relative positions of the observed LR images are accurately estimated. Indeed, the estimation of the relative displacement or shift of multiple images, referred to as image registration (Zitova and Flusser, 2003), plays a critical role in multi-frame SR as well as sparse coding.

In our previous work on image SR (Kato et al., 2015b), we treated the problems of image registration and sparse coding separately. That is, we first


estimated shifts in the observed LR images, and then applied the sparse coding algorithm to the aligned LR images. Since the purpose of registering LR images is high-quality HR image restoration by sparse representation, it is natural to perform image registration so that the error in the sparse image representation is minimized. The contribution of this paper is to treat image registration at the sub-pixel level and the sparse coding problem in a unified framework. More concretely, we simultaneously estimate both the shifts of the LR images and the coefficients of sparse coding with a single objective. Theoretically, we pose the multi-frame SR problem in a framework called double sparsity, which is an interesting approach to sparse modeling (Rubinstein et al., 2010; Zhan et al., 2013).

The rest of this paper is organized as follows. Section 2 describes the problem setting and the underlying model of multi-frame super resolution. The sparse coding approach to super resolution is also described in this section. Section 3 briefly explains how fine shifts (relative displacements) between observations are expressed by combinations of pixel-level shifts. Section 4 describes our proposed approach for simultaneously estimating shifts and sparse coding coefficients in a unified framework; then some theoretical analyses of the proposed method are presented. Section 5 details the experimental results, and the final section is devoted to our concluding remarks. A short and preliminary version of this paper appeared in MLSP 2016 (Kato et al., 2016). In this paper, in addition to the preliminary results that appeared in MLSP, we show:

1. Theoretical error analysis of estimating sparse coding coefficients under image registration errors.
2. Theoretical evaluation of the computational cost of the proposed method.
3. Experimental results on the application to movie datasets and evaluation of computational costs for both still image and movie datasets.

2. Notation and Formulation

We first explain the notion of single-frame super resolution by sparse coding, and then extend it to multi-frame super resolution.

2.1. Single-frame Super Resolution

Let Y ∈ R^Q be the observed LR image. The aim of super resolution is to construct an HR image X ∈ R^P from Y, where Q < P. To reduce the computational cost, image super resolution is often performed for small image regions called patches, which are then combined to construct the entire image. Following this method, we consider reconstructing an HR image patch represented by a vector x ∈ R^p by using a single LR image patch y ∈ R^q, where q < p. Having obtained all the HR patches, post-processing to construct the full-sized HR image is performed, which is explained in detail in Sections 5.2 and 5.3 of our previous paper (Kato et al., 2015a). For each patch, we assume the following degradation process: the HR

and LR patches x and y are connected to each other by the observation model
$$y = Gx + \varepsilon, \qquad (1)$$
where G is a degradation operator composed of blur and down-sampling operations, and ε ∈ R^q is the additive observation noise.

In the framework of sparse coding (Olshausen and Field, 1996; Elad, 2010), the observed signals are represented by combinations of only a small number of basis vectors chosen from a large number of candidates. These basis vectors are called atoms henceforth.
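Returning to the observation model (1): the degradation operator G can be sketched as a small blur followed by down-sampling. The moving-average kernel and factor-2 decimation below are illustrative assumptions for a one-dimensional patch, not the settings used in the paper.

```python
import numpy as np

def degradation_matrix(p, k=2, blur=3):
    """Build G mapping R^p -> R^(p//k): a moving-average blur
    followed by down-sampling by a factor of k."""
    # Blur: each row averages `blur` neighbouring pixels (clipped at borders).
    B = np.zeros((p, p))
    for i in range(p):
        lo, hi = max(0, i - blur // 2), min(p, i + blur // 2 + 1)
        B[i, lo:hi] = 1.0 / (hi - lo)
    # Down-sampling: keep every k-th pixel.
    S = np.zeros((p // k, p))
    for j in range(p // k):
        S[j, j * k] = 1.0
    return S @ B

G = degradation_matrix(p=8, k=2)
x = np.arange(8, dtype=float)   # a "high-resolution" patch
y = G @ x                       # noiseless instance of model (1)
print(G.shape, y.shape)         # (4, 8) (4,)
```

Any linear blur-plus-decimation operator fits the same slot; only its matrix form matters for what follows.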


Let $D = [d_1, d_2, \dots, d_K] \in \mathbb{R}^{q \times K}$ be a dictionary consisting of K atoms $d_k$, $k = 1, \dots, K$, and let $\alpha \in \mathbb{R}^K$ be a coefficient vector for the sparse representation of the patch $y \in \mathbb{R}^q$. Typically, K > q. The problem of sparse coding is then formulated as follows:
$$\min_{\alpha} \; \|y - D\alpha\|_2^2 + \eta \|\alpha\|_1, \quad \eta > 0, \qquad (2)$$
where $\|\alpha\|_1 = \sum_{j=1}^{K} |\alpha_j|$ is the ℓ1-norm of a vector, and η is called the regularization parameter. This problem (2) adopts the ℓ1-norm of the coefficients as a measure of sparsity, and is referred to as the ℓ1-norm sparse coding. By minimizing both the approximation error and the ℓ1-norm of the coefficients, the resultant α has only a few nonzero elements, and the observed patch y is well-approximated by using a small number of atoms. Specifically, we call the problem of obtaining the coefficients α with a fixed dictionary D sparse coding.
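The ℓ1-regularized problem (2) can be solved by simple iterative shrinkage (ISTA). The following NumPy sketch, with a random dictionary, is only an illustration; the paper itself uses the more efficient feature-sign-search solver of Lee et al. (2006).

```python
import numpy as np

def sparse_code(y, D, eta, n_iter=500):
    """Minimise ||y - D a||_2^2 + eta * ||a||_1 by ISTA."""
    L = np.linalg.norm(D, 2) ** 2               # Lipschitz-related constant
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = 2 * D.T @ (D @ a - y)            # gradient of the smooth part
        z = a - grad / (2 * L)
        a = np.sign(z) * np.maximum(np.abs(z) - eta / (2 * L), 0.0)  # soft threshold
    return a

rng = np.random.default_rng(0)
D = rng.normal(size=(16, 64))
D /= np.linalg.norm(D, axis=0)                  # unit-norm atoms
y = 1.5 * D[:, 3] - 0.8 * D[:, 40]              # a signal made of two atoms
a = sparse_code(y, D, eta=0.1)
print(np.count_nonzero(np.abs(a) > 1e-3))       # only a few active atoms
```

Despite K = 64 candidate atoms, the recovered code concentrates on a handful of them, which is exactly the sparsity the regularizer is meant to enforce.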

On the other hand, the problem of optimizing the dictionary D with a set of observations $\{y_i\}_{i=1}^n$ is called dictionary learning:
$$\min_{D} \; \sum_{i=1}^{n} \|y_i - D\alpha_i\|_2^2, \qquad (3)$$
where $\alpha_i$ is the solution of the problem (2) for each $y_i$. A number of methods are available for dictionary learning (Engan et al., 1999; Aharon et al., 2006) and sparse coding (Rezaiifar and Krishnaprasad, 1993; Tibshirani, 1996). In this study, we use the ones proposed in (Lee et al., 2006) for their computational efficiency.

We assume the HR patch x is represented by a sparse combination of HR atoms as $x = D^h \alpha$. Because of relation (1), the LR patch y is connected to the HR patch x as
$$y \simeq Gx = GD^h\alpha = D^l\alpha, \qquad (4)$$

[Figure 1 here: an LR patch is approximated as a weighted sum of LR atoms (e.g., 0.63·d1 + 0.41·d2 + 0.31·d3 + 0.24·d4), and the HR patch is reconstructed by applying the same coefficients to the corresponding HR atoms. The LR atoms are obtained by degenerating HR atoms constructed in advance.]

Figure 1: Super resolution by sparse coding. Step 0: construct HR atoms using HR training images (dictionary learning). Step 1: degenerate the HR atoms by blur and downsampling. Step 2: approximate an observed LR image by using the LR atoms (sparse coding). Step 3: use the coefficients obtained in step 2 to construct an HR image.

where D^l is the LR dictionary. Atoms in D^l are generated from D^h by the image degeneration process explained above, and have a one-to-one correspondence with the HR atoms. The correspondence (4) naturally leads to the following two-step procedure for single-frame super resolution: represent the LR patch by a sparse combination of LR atoms, and reconstruct the corresponding HR patch by applying the combination coefficients of the LR atoms to the HR atoms, as shown in Fig. 1. Before performing this single-frame SR, an HR dictionary needs to be prepared by an appropriate dictionary learning algorithm with HR training images. The single-frame SR algorithm with sparse coding is summarized in Algorithm 1.

2.2. Multi-frame Super Resolution by Sparse Coding

Suppose N low-resolution images Y_i ∈ R^Q, i = 1, ..., N, are observed, each differently degenerated from a single high-resolution image X ∈ R^P. Without loss of generality, we assume that the LR image Y_1 is the target for super resolution; the other N − 1 LR images are called auxiliary images. We assume that each patch is generated by the following image degradation process: for the target patch y_1, image degradation is modeled by Eq. (1), and,


Algorithm 1 Single-frame Super Resolution with Sparse Coding
Input: LR image Y; arbitrarily chosen HR images Z_1, Z_2, ..., Z_N ∈ R^P for training the dictionary; regularization parameter η > 0 for the ℓ1-norm sparse coding
Initialize:
• Randomly extract patches z_j ∈ R^p from the input HR images.
• Learn an HR dictionary by iteratively solving the problems
  minimize_{D^h} Σ_j ‖z_j − D^h α_j‖₂²  and  minimize_{α_j} ‖z_j − D^h α_j‖₂² + η‖α_j‖₁
  until convergence.
• Generate an LR dictionary by applying the degradation operator as D^l = GD^h.
Enhancing Resolution by Sparse Coding:
while there remain patches in Y to be extracted do
  extract patch y_i from the input LR image Y
  α_i ← arg min_α ‖y_i − D^l α‖₂² + η‖α‖₁
  x_i ← D^h α_i
end while
post-processing to construct the entire image from {x_i} (see, e.g., (Kato et al., 2015b))
return HR image X
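Algorithm 1 compresses into a few lines once a degradation operator and a sparse-coding routine are available. Everything below — the random HR dictionary, the pair-averaging G, the ISTA solver — is an illustrative stand-in for the trained components the paper actually uses.

```python
import numpy as np

def ista(y, D, eta, n_iter=300):
    """Solve min_a ||y - D a||_2^2 + eta * ||a||_1 by iterative shrinkage."""
    L = np.linalg.norm(D, 2) ** 2
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = a - D.T @ (D @ a - y) / L
        a = np.sign(z) * np.maximum(np.abs(z) - eta / (2 * L), 0.0)
    return a

rng = np.random.default_rng(2)
p, q, K = 16, 8, 48
Dh = rng.normal(size=(p, K))
Dh /= np.linalg.norm(Dh, axis=0)                  # step 0: HR dictionary
G = np.kron(np.eye(q), np.array([[0.5, 0.5]]))    # blur + down-sampling
Dl = G @ Dh                                       # step 1: LR dictionary
alpha_true = np.zeros(K)
alpha_true[[2, 17]] = [1.0, -0.6]
x_true = Dh @ alpha_true                          # unknown HR patch
y = G @ x_true                                    # observed LR patch
alpha = ista(y, Dl, eta=0.05)                     # step 2: sparse coding in LR space
x_hat = Dh @ alpha                                # step 3: HR reconstruction
print(x_hat.shape)
```

The LR reconstruction error is driven small by the optimization; how faithfully x_hat matches x_true depends on the quality of the dictionary, which is the point of the training stage.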


for auxiliary patch pairs (x, y_i), i = 2, ..., N, the observation model
$$y_i = GW_i x + \varepsilon \qquad (5)$$
is assumed, where W_i is the parallel-shift-and-clipping operator corresponding to the i-th LR patch, which is explained later and shown with an explicit operation in Appendix A. We note that G need not be identical for different images Y_i, but the operators are assumed identical for the sake of simplicity, and we concentrate on estimating W_i, i = 2, ..., N, for the N − 1 observations.

We explain the concept of the shift and clipping operators W_i with a schematic diagram in Fig. 2. A detailed explanation is deferred until Section 3. For ease of explanation, Fig. 2 depicts the case where y_1 ∈ R^{5×5} and the magnification factor is three, i.e., x ∈ R^{15×15}. The target patch y_1 is first extracted from the target image Y_1, as shown in the red square in the "target LR image" of Fig. 2. We then consider estimating the shift in the i-th image Y_i. We can roughly estimate the shift of the target patch y_1 in the auxiliary image Y_i by sub-pixel-level accuracy matching in LR space, as shown in the "i-th LR image" of Fig. 2. For a detailed explanation, a small region around the placed target patch in the i-th LR image is zoomed into in the bottom-right part of Fig. 2. The grid corresponds to pixels in the i-th LR image. The target patch, shown in the red square, does not lie on the grid in general. To avoid a negative effect from the non-overlapping region and use only the informative area, the pixels completely included in the target patch are extracted as the i-th LR patch y_i ∈ R^{(5−1)×(5−1)}, which is shown by the blue square. Since the target patch y_1 is represented by a vector of length q = √q × √q, the size of the patch in two-dimensional expression is √q × √q. On the other hand, the boundary-clipped patch has size q′ = (√q − 1) × (√q − 1). Once W_i, i = 2, ..., N, are estimated, the image SR based on sparse coding is straightforward. With the shift and the clipping {W_2, ..., W_N} of auxiliary

[Figure 2 here: the target LR image Y_1 with target patch y_1 ∈ R^{5×5}; the HR image magnified by a factor of three, with HR patch x ∈ R^{15×15}; and the i-th LR image Y_i, in which the clipped patch y_i ∈ R^{4×4} is shown together with the origin (0, 0) of the LR patch, the grid point (k, k), and the shifted origin (a, b) to be estimated.]

Figure 2: Shift in the i-th observation relative to the target image patch. The target image patch (red square in the upper-left part of the figure) is to be magnified as shown in the upper-right part of the figure, where the HR patch x is also shown in the red square. The target LR patch is placed in the i-th observation (lower-left) by sub-pixel-level matching in LR space, and the image around the target patch in Y_i is zoomed into for display. The blue dashed square is the patch y_i clipped from Y_i. The upper-left point of the upper-left pixel of y_i is regarded as the origin (0, 0), and the shift from the target patch y_1 is denoted by (a, b) ∈ [0, k]².


patches $y_2, \dots, y_N \in \mathbb{R}^{q'}$, the LR dictionaries corresponding to the observations $\{y_i\}_{i=1}^N$ are stacked to construct a stacked LR dictionary:
$$\tilde{D}^l = \begin{bmatrix} D^l_1 \\ D^l_2 \\ D^l_3 \\ \vdots \\ D^l_N \end{bmatrix} = \begin{bmatrix} GD^h \\ GW_2 D^h \\ GW_3 D^h \\ \vdots \\ GW_N D^h \end{bmatrix}. \qquad (6)$$
This dictionary is used to approximate the stacked LR patch
$$\tilde{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} \qquad (7)$$
by sparse coding: namely, the HR estimate $\hat{x}$ is given by
$$\hat{\alpha} = \arg\min_{\alpha} \|\tilde{y} - \tilde{D}^l \alpha\|_2^2 + \eta\|\alpha\|_1, \qquad (8)$$
$$\hat{x} = D^h \hat{\alpha}. \qquad (9)$$

Each atom dh in the HR dictionary Dh is shifted and clipped to an atom of

size q′ by the action of W_i, i = 2, ..., N, and then blurred and down-sampled by the action of G to form the corresponding LR atoms GW_i d^h, i = 2, ..., N. Each block of the stacked dictionary in Eq. (6) is composed of LR atoms obtained in this manner.

3. Approximation of Displacements by Pixel-level Shifts

As discussed in the previous section, the major problem of multi-frame SR

can be reduced to the problem of estimating the shift and clipping operators W_i, i = 2, ..., N. In the following, we consider enhancing the resolution by a magnification factor k. Since the LR target patch y_1 ∈ R^q is represented by a square with √q pixels on a side, the corresponding HR patch is a square with √p = k√q pixels on a side, namely, a vector of size p = k²q.

To estimate W_i, i = 2, ..., N, we consider the upper-left-most point of the i-th auxiliary patch y_i as the origin (0, 0) of the shift, which is on the


grid of the i-th LR image Yi as shown in Fig. 2. We consider the shift of the

target LR patch with HR accuracy to achieve a satisfactory result, which means placing the target LR patch on the LR image Y_i by using sub-pixel-level matching. Let the nearest upper-left grid point to the origin in the i-th LR image Y_i be (k, k), and let that of the target patch be (a, b) ∈ [0, k]², as shown in Fig. 2.

There are two cases of displacement: pixel-level parallel shifts in the HR image space, and all others. In the former case, where a and b are both integers, the LR dictionary for the auxiliary patch can simply be constructed by clipping the corresponding areas from the HR dictionary and degrading them with G. These LR dictionaries of pixel-level parallel shifts a, b = 0, 1, ..., k are denoted by D^{l(a,b)} and referred to as the LR base dictionaries henceforth. In the latter case, we assume that the shift in the auxiliary patch is well-approximated by a linear combination of pixel-level parallel shifts in HR space: namely, the estimation of the shift and clipping operators is reduced to that of the combination coefficients for the (k + 1)² possible parallel shifts. The i-th block of the LR dictionary D^l_i, i = 2, 3, ..., in Eq. (6) is then given by
$$D^l_i = GW_i D^h = \theta_{i,(0,0)} D^{l(0,0)} + \theta_{i,(0,1)} D^{l(0,1)} + \cdots + \theta_{i,(k,k)} D^{l(k,k)}, \quad \sum_{a,b=0}^{k} \theta_{i,(a,b)} = 1, \quad \theta_{i,(a,b)} \ge 0, \qquad (10)$$
where the LR base dictionaries are common to all LR observations, and different observations are represented by different coefficients θ. We call θ


the registration coefficient. In our previous work (Kato et al., 2015b), we used a two-step procedure for multi-frame SR. That is, we first estimated the registration coefficient by using the 2D simultaneous block matching method proposed in (Shimizu and Okutomi, 2006), chosen for its computational efficiency. To further reduce the computational cost, we then simplified the registration procedure by considering only the four pixel-level shift operators around the origin (0, 0), obtained by pixel-level matching. The sub-pixel-level shift operator was estimated by estimating the coefficients (1 − v_1)(1 − v_2), (1 − v_1)v_2, v_1(1 − v_2), and v_1 v_2, with v_1, v_2 ≥ 0, for the four shift operators corresponding to the coordinates (0, 0), (1, 0), (0, 1), and (1, 1). We then constructed the stacked LR dictionary in Eq. (6) and the stacked LR patches in Eq. (7). The HR patch was obtained by sparse coding. In this work, instead of this two-step procedure, we propose a novel approach that estimates the sub-pixel-level shifts through a common optimization objective for sparse coding.

4. Double Sparsity for Image Super Resolution


In this section, we formalize the proposed method to estimate both the shifts in LR images and the coefficients for sparse coding.

4.1. Problem Formulation

We start by showing that the stacked LR patch ỹ in Eq. (7) is approximated by a bi-linear form. Let B be a matrix of the base dictionaries D^l and D^{l(a,b)}, a, b = 0, ..., k, defined as



$$B = \begin{bmatrix} D^l_1 & & & \\ & D^{l(0,0)} \cdots D^{l(k,k)} & & \\ & & \ddots & \\ & & & D^{l(0,0)} \cdots D^{l(k,k)} \end{bmatrix}, \qquad (11)$$


and the registration coefficients for the base dictionaries are collectively denoted by
$$\theta = \left[1, \theta_{2,(0,0)}, \theta_{2,(0,1)}, \dots, \theta_{2,(k,k)}, \dots, \theta_{N,(0,0)}, \dots, \theta_{N,(k,k)}\right]^{\top}. \qquad (12)$$


Then, the stacked dictionary in Eq. (6) is expressed as
$$\tilde{D}^l = B(\theta \otimes I), \qquad (13)$$

where ⊗ is the Kronecker product and I is the unit matrix of appropriate size.
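The Kronecker-product identity (13) and the equivalent bilinear forms stated below are easy to verify numerically; B here is a random stand-in for the matrix of base dictionaries, and the sizes are arbitrary toy values.

```python
import numpy as np

rng = np.random.default_rng(3)
q, K, M = 5, 4, 3                    # rows of B, atoms, number of base blocks
B = rng.normal(size=(q, M * K))      # stand-in for the stacked base dictionaries
alpha = rng.normal(size=K)           # sparse-coding coefficients
theta = rng.normal(size=M)           # registration coefficients

v1 = B @ (np.kron(theta.reshape(-1, 1), np.eye(K)) @ alpha)   # B (theta x I) alpha
v2 = B @ (np.kron(np.eye(M), alpha.reshape(-1, 1)) @ theta)   # B (I x alpha) theta
v3 = B @ np.outer(alpha, theta).flatten(order="F")            # B vec(alpha theta^T)
print(np.allclose(v1, v2), np.allclose(v2, v3))               # True True
```

All three expressions reduce to B(θ ⊗ α), which is why the objective is linear in α for fixed θ and linear in θ for fixed α — the property the alternating algorithm of Section 4.2 exploits.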


Then, the approximation of an LR patch can be written in three equivalent forms:
$$\tilde{y} \simeq \tilde{D}^l \alpha = B(\theta \otimes I)\alpha \qquad (14)$$
$$= B(I \otimes \alpha)\theta \qquad (15)$$
$$= B\,\mathrm{vec}(\alpha \theta^{\top}), \qquad (16)$$
where vec is the column-stacking vectorization operator. In the above expression, Eqs. (14) and (15) are linear with respect to α and θ, respectively. We use this fact to derive a computationally efficient iterative algorithm. Then, the


estimated coefficient α is used to reconstruct the HR image as x = D^h α. The optimization problem to be solved is
$$\min_{\{\alpha, \theta\}} \; \|\tilde{y} - B\,\mathrm{vec}(\alpha\theta^{\top})\|_2^2 + \eta\|\alpha\|_1 \quad \text{subject to} \quad E\theta \le \mathbf{1}, \;\; \theta \ge 0, \qquad (17)$$
where $\mathbf{1}$ denotes the vector of all ones, and E is defined, with zeros elsewhere, as
$$E = \begin{bmatrix} 1 & & & \\ & 1\; 1\; \cdots\; 1 & & \\ & & \ddots & \\ & & & 1\; 1\; \cdots\; 1 \end{bmatrix} \in \mathbb{R}^{N \times \{1 + (k+1)^2 \times (N-1)\}}. \qquad (18)$$


The regularization term η‖α‖₁ imposes sparsity in the representation of the observed signals by linear combinations of atoms. The inequality Eθ ≤ 1 encodes the constraint that the sum of the interpolation coefficients is at most one for each image.
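The constraint matrix E in (18) simply sums each image's registration coefficients; a quick construction, with illustrative values of N and k, makes its block structure explicit.

```python
import numpy as np

N, k = 3, 2                       # number of images and magnification (toy values)
m = (k + 1) ** 2                  # shift coefficients per auxiliary image
E = np.zeros((N, 1 + m * (N - 1)))
E[0, 0] = 1.0                     # the target image's fixed leading coefficient
for i in range(1, N):
    E[i, 1 + (i - 1) * m : 1 + i * m] = 1.0

theta = np.zeros(1 + m * (N - 1))
theta[0] = 1.0
theta[1:1 + m] = 1.0 / m          # a convex combination for image 2 only
print(E @ theta)                  # per-image coefficient sums: [1. 1. 0.]
```

Each row of Eθ is the ℓ1-mass of one image's block of θ, which is exactly the quantity capped at one in (17).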


Under the non-negativity constraint θ ≥ 0, each component of Eθ equals the ℓ1-norm of the corresponding block of θ; hence Eθ ≤ 1 works as a sparsity-inducing constraint for θ, like ℓ1-norm regularization. The sparsity-inducing mechanism of ℓ1-norm-like constraints combined with non-negativity has been studied in the literature on non-negative matrix factorization (Lee and Seung, 1999; Cichocki et al., 2009).

4.2. Optimization method

We solve the optimization problem (17). Since finding a closed-form solution of problem (17) jointly in α and θ is intractable, we alternately solve the problem with respect to α with θ fixed, and with respect to θ with α fixed. As explained in the next subsection, the optimization problem (17) has the structure of double sparsity: the dictionary used for sparse coding is itself a sparse combination of the base dictionaries. Using only the target LR patch y_1, we first initialize the sparse coding coefficient α^{(1)} by solving

the following optimization problem:
$$\alpha^{(1)} = \arg\min_{\alpha} \|y_1 - D^l_1 \alpha\|_2^2 + \eta\|\alpha\|_1, \qquad (19)$$

which is efficiently solved using sparse coding algorithms, such as the methods proposed by Tibshirani (1996) or Lee et al. (2006).

Sparse optimization for the shift operator coefficients: By fixing the coefficient α^{(t)}, we obtain the registration coefficient by
$$\theta^{(t)} = \arg\min_{\theta} \|\tilde{y} - B(I \otimes \alpha^{(t)})\theta\|_2^2 \quad \text{subject to} \quad E\theta \le \mathbf{1}, \;\; \theta \ge 0. \qquad (20)$$


This is a quadratic programming problem that is efficiently solved by any off-the-shelf solver. We used the active-set method for solving the quadratic problem, as implemented in the MATLAB quadprog function. We note that the optimization cost of Eq. (20) is very small, since the dimension of the parameter θ to be optimized is (N − 1) × (k + 1)², which is 64 in the case of, for example, N = 5 observed images and a magnification factor of k = 3.

Sparse optimization for the atom coefficients:

To optimize α with a fixed θ^{(t)}, we solve the following problem:
$$\alpha^{(t+1)} = \arg\min_{\alpha} \|\tilde{y} - B(\theta^{(t)} \otimes I)\alpha\|_2^2 + \eta\|\alpha\|_1. \qquad (21)$$

This problem is also an instance of the ℓ1-norm regularized least-squares optimization problem, which is efficiently solved by sparse coding algorithms. We iteratively solve the optimization problems (20) and (21) until convergence.
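The alternation between (20) and (21) can be sketched with two generic sub-solvers: ISTA for the ℓ1 step, and a projected-gradient step for the constrained registration step (standing in for the quadprog active-set solver used in the paper). All sizes are toy values, and the feasibility projection is a crude rescaling rather than an exact Euclidean projection.

```python
import numpy as np

def soft(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def project_feasible(v):
    """Keep v >= 0 with sum(v) <= 1 (the per-image constraint in (17));
    a crude rescaling, not the exact Euclidean projection."""
    v = np.maximum(v, 0.0)
    s = v.sum()
    return v / s if s > 1.0 else v

rng = np.random.default_rng(4)
q, K, M = 12, 24, 4                   # stacked-patch size, atoms, base blocks
B = rng.normal(size=(q, M * K)) / np.sqrt(q)
y_tilde = rng.normal(size=q)
alpha = np.zeros(K)
theta = np.full(M, 1.0 / M)
eta = 0.05

for _ in range(50):
    # Eq. (21): sparse-code alpha with theta fixed (a few ISTA steps).
    Dt = B @ np.kron(theta.reshape(-1, 1), np.eye(K))
    L = np.linalg.norm(Dt, 2) ** 2 + 1e-9
    for _ in range(5):
        z = alpha - Dt.T @ (Dt @ alpha - y_tilde) / L
        alpha = soft(z, eta / (2 * L))
    # Eq. (20): update theta with alpha fixed (one projected-gradient step).
    A = B @ np.kron(np.eye(M), alpha.reshape(-1, 1))
    step = 1.0 / (np.linalg.norm(A, 2) ** 2 + 1e-9)
    theta = project_feasible(theta - step * A.T @ (A @ theta - y_tilde))

print(np.all(theta >= 0.0), theta.sum() <= 1.0 + 1e-9)   # True True
```

Both sub-problems reuse the same matrix B; only the Kronecker factor switches between θ ⊗ I and I ⊗ α, mirroring the forms (14) and (15).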

Algorithm 2 summarizes the proposed method. By the operation ClipByMatching, we estimate the position of the target patch in the auxiliary image Y_i and extract the auxiliary patch y_i of size q′. The operations that solve the registration problem Eq. (20) and the sparse coding problem Eq. (21) are denoted by SolveReg and SolveSC, respectively.


4.3. Computational Complexity We briefly show the computational complexity of the proposed super res-

AC

olution algorithm. The main part of the proposed algorithm is composed of two modules, SolveReg and SolveSC. We use the feature-sign-search method for SolveReg, namely, solving sparse coding problem (21). This algorithm searches for the sign of the coefficients α. Then, considering only nonzero

elements, the problem is reduced to a standard unconstrained quadratic optimization problem, which can be solved analytically with cubic computational 15

CR IP T

ACCEPTED MANUSCRIPT

Algorithm 2 Proposed Algorithm
Input: LR images Y_1, Y_2, ..., Y_N; HR dictionary D^h; LR base dictionaries D^l_1, B
while there remain patches to be extracted do
  extract patch y_1 from the target image Y_1
  for i = 2 to N do
    y_i ← ClipByMatching(y_1, Y_i)
  end for
  ỹ ← [y_1, y_2, ..., y_N]^⊤
  α^{(1)} ← arg min_α ‖y_1 − D^l_1 α‖₂² + η‖α‖₁   ▷ Eq. (19)
  for t = 1 to T do
    θ^{(t)} ← SolveReg(ỹ, B, α^{(t)})   ▷ Eq. (20)
    α^{(t+1)} ← SolveSC(ỹ, B, θ^{(t)})   ▷ Eq. (21)
  end for
  x ← D^h α^{(T+1)}
end while
post-processing to construct the entire image (see, e.g., (Kato et al., 2015b))
return HR image X


complexity with respect to the number of non-zero elements of the coefficient vector α. The computational cost of solving equation (21) is of order O(L · n³_α), where L is the number of iterations of the feature-sign-search algorithm and n_α is the number of non-zero elements in the optimal vector α. The length of the coefficient vector is the number of atoms K in the dictionary. In the framework of sparse coding, the number of atoms in the dictionary is usually several times the patch dimension q_l, whereas in typical solutions the proportion of non-zero elements in the coefficients is less than one-tenth, namely, n_α ≲ q_l.

The problem (20) is a quadratic program with linear constraints, solved by the active-set method. The computational cost of quadratic programming is cubic in the number of variables. Since the length of the registration coefficient θ is (N − 1) × (k + 1)², the computational cost of solving (20) is of order O(N³k⁶); typically, the number of observations is N = 5 and the magnification factor is k = 3. The active-set method significantly reduces the effective computational cost by updating only the elements of θ at the active constraints¹, and the cost of the QP becomes of order O(L′ · n³_θ), where L′ is the number of iterations of the active-set algorithm and n_θ is the number of non-zero elements in the estimated coefficient vector θ. The sparsity of θ depends on the actual shifts in the observed images. In our experiments, we observed that n_θ ≃ 10 in many cases.


The procedures SolveReg and SolveSC are repeated at most T times; hence, the overall computational cost of the proposed algorithm is of order O(T · (L · n³_α + L′ · n³_θ)).

¹We note that the constraint in the problem (20) leads to a sparse solution.


4.4. Double sparsity structure It is worth noting that the optimization problem (17) shares in form with the formulation of double sparsity dictionary learning proposed by (Rubinstein et al., 2010). Double sparsity dictionary learning was proposed to

CR IP T

bridge the gap between analytic dictionaries such as wavelets (Daubechies, 1992), and learning-based dictionaries such as MOD (Engan et al., 1999) and

K-SVD (Aharon et al., 2006). Double sparsity dictionary learning assumes that dictionary atoms have some sparse underlying structure over a set of

AN US

more fundamental base dictionaries. In our formulation of multi-frame super

resolution, the atoms are generated from a sparse combination of fundamental atoms derived by pixel-level shifting and degeneration of the HR atoms. A schematic diagram of the double sparsity structure in our formulation is shown in Fig. 3. By modeling the shift operation by a set of base shift op[registration]

M

[sparse coding] solve eq. (21)

Dl

ED

solve eq. (20)

Dl2 (✓

.. .

Dl3

.. .

..

.

(✓3,(0,0) , . . . , ✓3,(k,k) )

Dl(0,0) , . . . , Dl(k,k) 2 DlN (✓N,(0,0) , . . . , ✓N,(k,k) ) (k + 1) base dictionaries

composed of pixel-level parallel shifts

AC

CE



PT

(↵1 , . . . , ↵K )

2,(0,0) , . . . , ✓2,(k,k) )

Figure 3: Double sparsity for our formulation of the multi-frame super resolution problem. The LR dictionary used for the sparse coding of the stacked LR patches was itself a sparse combination of base dictionaries generated by the pixel-level parallel shift in the HR dictionary followed by degradation.

18

ACCEPTED MANUSCRIPT

erators, we obtained a natural and simple formulation of multi-frame super resolution based on sparse coding. 4.5. Sparse Coding Stability in Double Sparse Dictionary Learning

CR IP T

Sparse coding coefficients in equations (2), (19), and (21) are affected by the dictionary used for representing the observed signal. Quite a few theoretical analyses of sparse coding coefficients have been proposed (Fan and

Li, 2001; Donoho et al., 2006). Stability and sample complexity of dictionary learning are also analyzed (Vainsencher et al., 2011). In this section, follow-


ing the analysis presented in Mehta and Gray (2013), we consider the effect of the image registration estimation error on the sparse coding coefficient estimation.

First, we define the space of dictionaries parametrized by the registration

M

coefficient θ as D = {B(θ ⊗ I)|Eθ ≤ 1, θ ≥ 0}. Since the estimated

sparse coding coefficient vector α ∈ RK in equation (2) depends on both the


dictionary D and the observed signal y, we write the estimated sparse coding coefficient as α(D, y) or, in our double sparse dictionary formulation, as α(θ, y), because the dictionary is specified by a registration coefficient θ. For the sake


of simplicity, each atom of dictionaries is assumed to be normalized to have unit `2 norm.


Let [n] = {1, 2, . . . , n} be a set of natural numbers from 1 to n ∈ N. To


guarantee the stability of sparse codes, we define the notion of s-incoherence.

Definition 1 (s-incoherence). For s ∈ [K] and D ∈ D, the s-incoherence µ_s(D) is defined as the square of the minimum singular value among sub-dictionaries of D composed of s atoms:
$$\mu_s(D) = \left(\min\{\zeta_s(D_\Lambda) \mid \Lambda \subseteq [K], |\Lambda| = s\}\right)^2, \qquad (22)$$


where DΛ denotes a sub-dictionary with columns indicated by the index set Λ, |Λ| is the cardinality of Λ, and ζs (A) is the s-th largest singular value of a matrix A.
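For small dictionaries, Definition 1 can be evaluated directly by enumerating the s-atom sub-dictionaries; the sizes below are toy values (exhaustive enumeration is exponential in general, so this is for illustration only).

```python
import itertools
import numpy as np

def s_incoherence(D, s):
    """mu_s(D): square of the minimum s-th largest singular value over
    all sub-dictionaries of s atoms (Definition 1)."""
    K = D.shape[1]
    worst = np.inf
    for idx in itertools.combinations(range(K), s):
        sv = np.linalg.svd(D[:, idx], compute_uv=False)
        worst = min(worst, sv[s - 1])   # smallest singular value of the s columns
    return worst ** 2

rng = np.random.default_rng(5)
D = rng.normal(size=(8, 10))
D /= np.linalg.norm(D, axis=0)          # unit-norm atoms, as assumed in the text
print(0.0 < s_incoherence(D, 2) <= 1.0)  # True for unit-norm atoms
```

A larger µ_s means every small group of atoms is well-conditioned, which is what the stability results below require through the lower bound µ.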


In Mehta and Gray (2013, Theorem 4), the following result was presented:

Theorem 1 (Sparse Coding Stability). Let D, D′ satisfy µ_s(D), µ_s(D′) ≥ µ for some positive constant µ and ‖D − D′‖₂ ≤ ε, and let y ∈ B_{R^q}, where B_{R^q} is the unit ball of R^q. Suppose that there exists an index set I ⊆ [K] of K − s indices such that, for all i ∈ I,
$$|d_i^{\top}(y - D\alpha(D, y))| < \eta - \tau \quad \text{with} \quad \varepsilon \le \frac{\tau^2 \eta}{27}. \qquad (23)$$
Then, the difference between the two sparse coding coefficient vectors α(D, y) and α(D′, y) is bounded as
$$\|\alpha(D, y) - \alpha(D', y)\|_2 \le \frac{3\varepsilon\sqrt{s}}{2\eta\mu}. \qquad (24)$$


0

In this theorem, the condition (23) implies that at least K − s inactive atoms


in α(D, y) do not have too high absolute correlation with the residual y − Dα(D, y)), and the degree of permissible correlation is bounded by η − τ ,

CE

where η is the sparse coding regularization coefficient and τ (< η) controls the strictness on the permissible correlation. By applying this theorem to our

AC

double sparse dictionary learning problem, we obtain the following corollary: Corollary 1. Let θ, θ 0 ∈ {θ|Eθ ≤ 1, θ ≥ 0}, and µs (B(θ ⊗ I)), µs (B(θ 0 ⊗ I)) ≥ µ. We also let, for some ε ≥ 0, v u N k uX X t (θi,(a,b) − θ0

2 i,(a,b) )

i=2 a,b=0

20

ε ≤ √ = ε˜, K

(25)

ACCEPTED MANUSCRIPT

and y˜ ∈ BRq+(q−1)(N −1) . Suppose that there exists an index set I ⊆ [K] of

˜ − B(θ ⊗ I)α(θ, y))| ˜ < η−τ K − s indices such that for all i ∈ I, |b> i (y √ 2 with K ε˜ ≤ τ27η , and bi is the i-th atom in B(θ ⊗ I). Then, the difference

˜ and α(θ 0 , y) ˜ is bounded between two sparse coding coefficient vectors α(θ, y) as

CR IP T

√ 3 ε˜ Ks ˜ − α(θ , y)k ˜ 2≤ kα(θ, y) . 2 ηµ 0

(26)
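The stability phenomenon behind these bounds can be checked numerically. The following self-contained sketch (our own illustration; the ISTA solver, step size, and toy dictionary are not from the paper) solves the ℓ1-regularized sparse coding problem for a dictionary and for a slightly perturbed copy, and confirms that the two coefficient vectors stay close:

```python
import math

def soft(x, t):
    """Soft-thresholding operator."""
    return math.copysign(max(abs(x) - t, 0.0), x)

def sparse_code(atoms, y, eta, iters=500):
    """ISTA for min_a 0.5*||y - D a||^2 + eta*||a||_1.
    atoms: list of K unit-norm atoms; the step 1/K is a safe bound on
    the inverse Lipschitz constant of the gradient for unit-norm atoms."""
    K, q = len(atoms), len(y)
    step = 1.0 / K
    a = [0.0] * K
    for _ in range(iters):
        r = [y[i] - sum(a[k] * atoms[k][i] for k in range(K))
             for i in range(q)]
        a = [soft(a[k] + step * sum(atoms[k][i] * r[i] for i in range(q)),
                  step * eta) for k in range(K)]
    return a

delta = 0.05                                    # small dictionary perturbation
D  = [[1.0, 0.0], [0.0, 1.0]]                   # two orthonormal atoms
Dp = [[math.cos(delta), math.sin(delta)], [0.0, 1.0]]
y = [0.6, 0.0]
a, ap = sparse_code(D, y, eta=0.1), sparse_code(Dp, y, eta=0.1)
dist = math.dist(a, ap)  # remains small, as the stability bound predicts
```

Here the perturbation rotates one atom by 0.05 radians; the resulting coefficient difference is orders of magnitude smaller than the coefficients themselves, consistent with the O(ε) behavior of (24).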

Suppose θ′ is the true shift parameter and θ is the estimate obtained by solving the problem (20). This corollary then states that the error of the sparse coding coefficient α is bounded as in equation (26). The left-hand side of equation (25) is the counterpart of ‖D − D′‖₂ in Theorem 1, and is based on the fact that the dictionary D is written as B(θ ⊗ I); hence

    ‖B(θ ⊗ I) − B(θ′ ⊗ I)‖₂ = √( Σ_{i=2}^{N} Σ_{a,b=0}^{k} (θ_{i,(a,b)} − θ′_{i,(a,b)})² ‖D_i^{l(a,b)}‖₂² )   (27)
                             = √( K Σ_{i=2}^{N} Σ_{a,b=0}^{k} (θ_{i,(a,b)} − θ′_{i,(a,b)})² ),   (28)

where we used the assumption that each atom is normalized to have unit norm. The bound (26) explicitly contains the dimension K of the sparse coefficient: when we increase K, we need to increase the regularization coefficient η of the sparse coding problem to ensure the stability of α with respect to the registration. Finally, we show the worst-case bound for the error of the sparse coding coefficient vector α. Noting that the registration coefficient satisfies Σ_{a,b=0}^{k} θ_{i,(a,b)} = 1 and θ_{i,(a,b)} ≥ 0 for i = 2, . . . , N, the left-hand side of equation (25) is bounded as √( Σ_{i=2}^{N} Σ_{a,b=0}^{k} (θ_{i,(a,b)} − θ′_{i,(a,b)})² ) ≤ √(2(N − 1)); hence we obtain the inequality

    ‖α(θ, ỹ) − α(θ′, ỹ)‖₂ ≤ (3/√2) · √(K(N − 1)s) / (ηµ),   N ≥ 2,   (29)

as the worst-case bound.

5. Experimental Results

In this section, we report the results of super resolution applied to sets of still images and to images captured from movies².

5.1. Application to still images

We assumed that the observed LR images were generated from an HR image through parallel shifts, blurring, down-sampling, and the addition of noise. The degrees of the vertical and horizontal shifts were randomly sampled from a uniform distribution on [−5, 5]. The blurring operation was implemented by convolution with a 9 × 9-pixel Gaussian filter with standard deviation σh = 1. The blurred and shifted images were then down-sampled by a factor of three. Finally, noise sampled from N(0, σn = 1) was added to generate the LR observation images. In our experiments, both the intensity of the blur and that of the noise were assumed to be given. Our algorithm iteratively solved the quadratic programming problem (20) and the sparse coding problem (21). We observed that the algorithm converged in fewer than three iterations, and fixed the number of iterations T to three for all experiments. We magnified the input LR images by a factor of three in all cases. We used five LR images to estimate an HR image, i.e., N = 5.
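The generation of one LR observation described above can be sketched as follows (a simplified illustration under our own assumptions: integer shifts, zero padding at the borders, and no kernel renormalization at the image boundary; the actual experiments draw real-valued shifts from [−5, 5]):

```python
import math
import random

def gaussian_kernel(size=9, sigma=1.0):
    """size x size Gaussian filter, normalized to sum to one."""
    c = size // 2
    k = [[math.exp(-((i - c) ** 2 + (j - c) ** 2) / (2.0 * sigma ** 2))
          for j in range(size)] for i in range(size)]
    s = sum(sum(row) for row in k)
    return [[v / s for v in row] for row in k]

def degrade(hr, shift, factor=3, sigma_n=1.0):
    """Shift -> Gaussian blur -> downsample -> additive noise,
    mirroring the assumed observation model."""
    h, w = len(hr), len(hr[0])
    dy, dx = shift
    shifted = [[hr[i - dy][j - dx]
                if 0 <= i - dy < h and 0 <= j - dx < w else 0.0
                for j in range(w)] for i in range(h)]
    ker = gaussian_kernel()
    c = len(ker) // 2
    blurred = [[sum(ker[u][v] * shifted[i + u - c][j + v - c]
                    for u in range(len(ker)) for v in range(len(ker))
                    if 0 <= i + u - c < h and 0 <= j + v - c < w)
                for j in range(w)] for i in range(h)]
    return [[blurred[i][j] + random.gauss(0.0, sigma_n)
             for j in range(0, w, factor)] for i in range(0, h, factor)]

random.seed(0)
hr = [[float((i + j) % 7) for j in range(27)] for i in range(27)]
lr = degrade(hr, shift=(2, 1))  # one 9 x 9 LR observation
```

Repeating this with N = 5 independently drawn shifts yields the set of LR observations used as input to the algorithm.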

We compared the proposed method with seven conventional methods. The first was bi-cubic interpolation, a simple method regarded as a baseline for SR. The second and third methods were Single-Frame Joint Dictionary Learning (SF-JDL; Yang et al., 2010) and Adaptive Sparse Domain Selection (ASDS; Dong et al., 2011). These methods are considered state-of-the-art single-frame SR methods with publicly available software implementations. The fourth method was that proposed by Wang et al. (2011), a multi-frame SR method based on joint dictionary learning; we refer to it as MF-JDL (Multi-Frame super resolution based on Joint Dictionary Learning). The remaining two methods are representative reconstruction-based SR methods in the literature. In Farsiu et al. (2004), a multi-frame SR method based on regularization in the form of Bilateral Total Variation (BTV) was proposed; owing to its impressive performance and simplicity, it has become one of the most commonly used multi-frame SR methods. BTV was further improved in Li et al. (2010), where a method based on regularization by Locally Adaptive Bilateral Total Variation (LABTV) was proposed. In this paper, these reconstruction-based methods are referred to as BTV and LABTV, respectively. Finally, we also used the multi-frame SR method based on sparse coding proposed in our previous paper (Kato et al., 2015b), which is referred to as MF-SC.

² A simple MATLAB implementation will be made available on the authors' website following the acceptance of the paper.

There were several tuning parameters for each SR method. To make a fair comparison, we first optimized the parameters of each method to maximize the PSNR for the image “Lena”, which is one of the most commonly used benchmark images in image processing. For all other images, we continued to use the same parameters optimized for Lena.

We used two grayscale images (Lena and Cameraman) and three color images (Flower, Girl, and Parthenon) to assess the performance of the SR methods. When dealing with color images, we first converted them to the YCbCr format and then applied the SR methods only to the luminance channel (Y). The values of the Cb and Cr channels were simply expanded by bi-cubic interpolation. The experimental results are shown in Fig. 4–Fig. 8. To focus on the difference between our previous method and the newly proposed method, we only show the original images, the degraded images, the images obtained by MF-SC, and those obtained by the proposed method. These figures indicate that the proposed method was able to generate images comparable to or better than those of MF-SC.
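Only the luminance channel is super-resolved in the color experiments. A minimal sketch of the luma extraction (our own illustration, using the standard ITU-R BT.601 full-range weights that underlie the YCbCr conversion):

```python
def luminance(rgb_image):
    """Extract the Y (luma) channel with ITU-R BT.601 weights; SR is
    applied to this channel, while Cb/Cr are simply upscaled by
    bi-cubic interpolation."""
    return [[0.299 * r + 0.587 * g + 0.114 * b
             for (r, g, b) in row] for row in rgb_image]

img = [[(255.0, 0.0, 0.0), (0.0, 255.0, 0.0)]]  # pure red, pure green
y = luminance(img)  # [[76.245, 149.685]]
```

This split exploits the fact that the human visual system is far more sensitive to luminance detail than to chroma detail, so high-frequency reconstruction effort is concentrated on Y.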

Figure 4: Reconstructed images estimated from LR observations for Lena. (a) Observed LR image. (b) Original HR image. Results of (c) MF-SC (our previous work), and (d) the proposed method.
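The quantitative comparison below is based on the PSNR of equation (30); as a minimal self-contained helper (assuming 8-bit intensity range; not the authors' implementation), it can be computed as:

```python
import math

def psnr(original, estimate):
    """PSNR in dB for 8-bit images: 10 * log10(255^2 / MSE),
    where MSE is the mean squared error between the two images."""
    h, w = len(original), len(original[0])
    mse = sum((original[i][j] - estimate[i][j]) ** 2
              for i in range(h) for j in range(w)) / (h * w)
    return 10.0 * math.log10(255.0 ** 2 / mse)

hr  = [[100.0, 120.0], [140.0, 160.0]]
est = [[101.0, 119.0], [141.0, 159.0]]
print(round(psnr(hr, est), 2))  # MSE = 1 -> 10*log10(255**2) ≈ 48.13
```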

For a quantitative comparison of SR methods, we used the Peak Signal-to-Noise Ratio (PSNR), defined as

    PSNR [dB] = 10 log₁₀ (255² / MSE),   (30)

where MSE is the mean squared error between the original HR image and the estimated HR image; a higher PSNR indicates better SR performance.

Figure 5: Reconstructed images estimated from LR observations for Cameraman. (a) Observed LR image. (b) Original HR image. Results of (c) MF-SC, and (d) the proposed method.

Figure 6: Reconstructed images estimated from LR observations for Flower. (a) Observed LR image. (b) Original HR image. Results of (c) MF-SC, and (d) the proposed method.

Figure 7: Reconstructed images estimated from LR observations for Girl. (a) Observed LR image. (b) Original HR image. Results of (c) MF-SC, and (d) the proposed method.

Figure 8: Reconstructed images estimated from LR observations for Parthenon. (a) Observed LR image. (b) Original HR image. Results of (c) MF-SC, and (d) the proposed method.

We show the PSNR values obtained by the various methods in Table 1. To evaluate the PSNR values, we randomly generated 100 sets of 5 shift operators and generated degraded images by adding random observation noise. From

each set of observed images, we randomly chose one target image. Only the target image was used for single-frame SR, whereas in multi-frame SR the remaining 4 images were used as auxiliary LR images. The means and standard deviations of the PSNR values were calculated over the 100 SR results of each method. The best and the second-best results are shown in bold and are underlined, respectively. The averaged computational times over the 100 trials are also shown in parentheses. As shown in Table 1, the proposed

Table 1: PSNRs and computational times (in parentheses) of SR methods. The best and second-best results are shown in bold and are underlined, respectively.

Image      Bicubic               SF-JDL        ASDS                   MF-JDL                 BTV                    LABTV                  MF-SC                 Proposed
Lena       27.91 ± 0.00 (1.90)   28.73 ± 0.01  30.08 ± 0.02 (149.07)  29.25 ± 0.05 (99.79)   29.01 ± 0.22 (44.03)   29.33 ± 0.20 (71.15)   29.69 ± 0.15 (24.78)  29.82 ± 0.14 (59.54)
Cameraman  27.03 ± 0.00 (1.81)   28.25 ± 0.01  29.88 ± 0.02 (157.71)  28.29 ± 0.03 (98.45)   29.43 ± 0.37 (44.29)   29.83 ± 0.37 (71.06)   30.19 ± 0.38 (24.23)  30.44 ± 0.35 (42.78)
Flower     35.50 ± 0.01 (2.24)   35.81 ± 0.01  36.22 ± 0.02 (124.88)  36.32 ± 0.04 (94.14)   36.26 ± 0.24 (43.86)   36.46 ± 0.21 (67.96)   36.61 ± 0.10 (21.68)  36.61 ± 0.12 (30.80)
Girl       31.12 ± 0.00 (1.88)   31.49 ± 0.01  31.72 ± 0.01 (157.10)  31.73 ± 0.02 (118.64)  31.84 ± 0.17 (53.53)   32.09 ± 0.16 (83.21)   31.98 ± 0.06 (26.66)  31.92 ± 0.07 (56.52)
Parthenon  24.40 ± 0.00 (4.28)   24.59 ± 0.00  25.07 ± 0.01 (442.35)  24.70 ± 0.02 (257.19)  25.45 ± 0.19 (108.94)  25.33 ± 0.13 (173.38)  25.41 ± 0.09 (57.80)  25.37 ± 0.08 (141.46)

method outperformed the other conventional methods for two out of five images (Cameraman and Flower), and was the second-best for Lena. It showed improvement over our previous method in two images, was the same in one image, and was slightly worse in the other two images. The proposed method requires more computational time than our previous method, but is comparable to or faster than the other methods. We can see that the proposed method, which has a theoretically interesting double sparse structure, has performance comparable to the other methods in both image reconstruction quality and computational cost.

Illustration of the Double Sparsity Structure

For a deeper understanding of the behavior of our algorithm, we illustrate the registration coefficients estimated by our proposed double sparse dictionary learning, using the “Lena” image as an example. For the original HR Lena image, we applied the following four shift operations:

    S2 = (2, 3), S3 = (3, 2), S4 = (0, 1), S5 = (1, 0),

where Si = (a, b) means that the i-th observed image is displaced relative to the target image by a pixels in the horizontal and b pixels in the vertical direction. We could consider sub-pixel-level shifts in the HR space, but to simplify the explanation, we consider pixel-level shifts in the HR space. We note that we consider five LR observations and treat the first one as the target image; hence we only consider four shift operations. The target and shifted images were then degraded by applying the blur, downsampling, and additive noise as in equation (5). We extracted a patch of size q = 5 × 5 from the target LR observed image, and corresponding 4 × 4 patches from the four auxiliary LR images. According to the possible (N − 1) × (k + 1)² = 4 × 16 = 64 different pixel-level shifts in the HR space, we generated 64 LR dictionaries, which are the element dictionaries in terms of the double sparse dictionary learning framework. The stacked observed patch ỹ was generated from the target and auxiliary patches. Then, we obtained the registration coefficients θ by solving the problem (20). The length of the coefficient vector θ was 64, composed of N − 1 blocks of length 16, the j-th block corresponding to the j-th observed image (j = 2, . . . , 5). For an arbitrarily chosen patch, we show the four blocks of coefficients as barplots in Fig. 9. In the

Figure 9: Examples of the estimated registration coefficients for four shift patterns.

barplots, the indices corresponding to the true shifts are marked with ∗. From these plots, we can see that the estimated coefficients are sparse and that the elements corresponding to the true shifts have large values, as expected. Figure 9 shows the estimated registration coefficients for one patch, but we observed the same tendency in the estimated coefficients for the other patches. We show the estimated registration coefficients averaged over all patches in Fig. 10. It is seen that the elements of the coefficients corresponding to the true shifts have larger values than the other elements, and we can see that by solving the problem (20), we obtain a reasonable estimate of the registration of multiple observations.

Figure 10: Average of the estimated registration coefficients for four shift patterns.
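The registration step behind Figs. 9 and 10 solves a constrained quadratic program. The following sketch (our own minimal illustration, not the paper's solver; problem (20) additionally couples the registration with the sparse coding) fits nonnegative coefficients summing to one by projected gradient descent with a Euclidean simplex projection:

```python
def project_simplex(v):
    """Euclidean projection of v onto {x : x >= 0, sum(x) = 1},
    via the standard sort-based algorithm."""
    u = sorted(v, reverse=True)
    css, tau = 0.0, 0.0
    for i, ui in enumerate(u, start=1):
        css += ui
        t = (css - 1.0) / i
        if ui - t > 0:          # keep the threshold of the largest valid i
            tau = t
    return [max(x - tau, 0.0) for x in v]

def fit_registration(dicts, y, iters=300, step=0.4):
    """Projected gradient for min ||y - sum_j theta_j d_j||^2 subject to
    theta on the simplex; this mirrors only the constraint structure of
    the registration problem (20)."""
    K, q = len(dicts), len(y)
    theta = [1.0 / K] * K
    for _ in range(iters):
        r = [y[i] - sum(theta[k] * dicts[k][i] for k in range(K))
             for i in range(q)]
        grad = [-2.0 * sum(dicts[k][i] * r[i] for i in range(q))
                for k in range(K)]
        theta = project_simplex([theta[k] - step * grad[k]
                                 for k in range(K)])
    return theta

# Toy example: two candidate shifted "dictionaries" and an observation
# that is a 0.9/0.1 mixture of them.
theta = fit_registration([[1.0, 0.0], [0.0, 1.0]], [0.9, 0.1])
```

The simplex constraint is what produces the sparse, peaked coefficient blocks seen in the barplots: mass concentrates on the few candidate shifts that actually explain the observation.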

5.2. Application to motion pictures

We show the experimental results on LR images sequentially captured

from movies. From five consecutive LR images, the middle (third in the temporal sequence) image was selected as the target image, and the other four were considered auxiliary images. The HR images obtained using MF-SC and the proposed method are


shown in Fig. 11 and Fig. 12. We also show the PSNRs of all methods in Table 2.

Figure 11: Images estimated from LR observations for MacArthur. (a) Observed LR image. (b) Original HR image. Results of (c) MF-SC, and (d) the proposed method.

Figure 12: Images estimated from LR observations for Samurai. (a) Observed LR image. (b) Original HR image. Results of (c) MF-SC, and (d) the proposed method.

Table 2: PSNRs and computational times (in parentheses) of SR methods applied to movie data.

Image      Bicubic       SF-JDL  ASDS            MF-JDL          BTV            LABTV          MF-SC          Proposed
MacArthur  34.11 (2.69)  34.33   35.63 (178.08)  35.18 (133.78)  34.39 (61.72)  34.40 (96.17)  34.79 (27.70)  35.63 (61.74)
Samurai    25.36 (2.50)  25.97   26.66 (211.65)  26.12 (138.38)  26.16 (62.13)  26.07 (96.24)  25.90 (30.75)  26.49 (59.86)

From Fig. 11 and Fig. 12, we see that the HR images obtained by the proposed method were clearer and had more distinct edges than the images obtained by MF-SC. From Table 2, the PSNRs of the proposed method were the same as or lower than those of ASDS. However, the computational cost of the proposed method was lower than that of ASDS and the other multi-frame SR methods. Although the proposed method incurs approximately double the computational cost of our previous method (MF-SC), it significantly improves the quality of the reconstructed image in terms of PSNR.

6. Conclusion

In this paper, we discussed multi-frame image super resolution as a combination of two distinct problems, i.e., image registration and sparse coding. The main contribution of this work was formulating these problems within a framework of double sparsity dictionary learning. Image registration and sparse coding were unified in a single objective function, following which the registration coefficients and the sparse coding coefficients were alternately optimized by quadratic programming and ℓ1-norm constrained least squares, respectively, both of which lead to sparse estimates of those coefficients. The proposed method improved our previous formulation of multi-frame super resolution for some images, particularly images from movies. We explained


the proposed super resolution method through an application to image resolution enhancement; however, we think that our double sparsity formulation is applicable to the enhancement or refinement of multiple observations from a number of inaccurate sensors with appropriate base dictionaries, for example, the integration of information from observations by autonomous mobile robots.

Our future work includes the application of the proposed method to other signal resolution enhancement problems, as well as further improvement of computational efficiency by using or developing optimization methods for sparse coding and shift estimation. It would also be interesting to optimize the learning of the HR dictionary for the super resolution task. For example, in the framework of super resolution based on sparse coding, we generate the LR atoms from the HR atoms. In reality, many different degradations of an HR atom can result in the same LR atom. Therefore, when optimizing the HR dictionary, making the atoms in the HR space as different as possible should improve the overall super resolution results. How to impose this constraint and how to formulate the corresponding optimization problem will be investigated in our future work. It is also of great interest to investigate the relationship between the double sparse structure and the intermediate representations obtained by deep neural network architectures (Dong et al., 2016).

Acknowledgment

The authors would like to express their special thanks to the editor, the associate editor, and the anonymous reviewers, whose comments led to valuable improvements of the manuscript. Part of this work was supported by JSPS KAKENHI Nos. 25120009, 25120011, 16H02842, and 16K16108.


Appendix A. Shift and clipping operator

We explain how the parallel shift and clipping operator Wi in Eq. (5) is realized as an explicit operator. Suppose Wi shifts the original LR patch by x pixels along the rows and by y pixels along the columns; we then denote Wi by W^{(x,y)}. We consider an LR patch of size q = √q × √q, i.e., the patch has √q pixels in both the rows and the columns. Let O_{a×b} ∈ R^{a×b} be the zero matrix and I_{a×a} ∈ R^{a×a} be the unit matrix. We define the matrix S_z, z ∈ {0, ±1, ±2, . . . , ±(√q − 1)}, as follows:

    S_z = [ O_{(√q−z)×z}   I_{(√q−z)×(√q−z)} ;
            O_{z×z}        O_{z×(√q−z)}      ] ∈ R^{√q×√q}   (A.1)

when z ≥ 1,

    S_z = [ O_{|z|×(√q−|z|)}       O_{|z|×|z|}       ;
            I_{(√q−|z|)×(√q−|z|)}  O_{(√q−|z|)×|z|}  ] ∈ R^{√q×√q}   (A.2)

when z ≤ −1, and

    S_z = I_{√q×√q}   (A.3)

when z = 0. Then, S_y ⊗ S_x gives a matrix acting on the vectorized LR patch and realizes a pixel-level parallel translation by (x, y), where ⊗ is the matrix Kronecker product. We note that in the shifted patch, pixel values outside the original patch are replaced with zeros (zero padding). The shift and clipping operator W^{(x,y)} acting on y ∈ R^q is defined by

    W^{(x,y)} y = C^⊤ mat((S_y ⊗ S_x) y) C ∈ R^{(√q−2)×(√q−2)},   (A.4)

where mat(·) rearranges a vector of length q into a √q × √q matrix, and

    C = [ O_{(√q−2)×1}   I_{(√q−2)×(√q−2)}   O_{(√q−2)×1} ]^⊤ ∈ R^{√q×(√q−2)}.   (A.5)
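The three cases (A.1)–(A.3) can be transcribed with a single indexing rule (a small sketch of ours, not the authors' code): the entry (S_z)[i][i+z] is one whenever the index i+z stays inside the patch, which yields exactly the zero-padded shift described above.

```python
def shift_matrix(n, z):
    """S_z of (A.1)-(A.3): the n x n matrix with a one at (i, i+z)
    wherever 0 <= i+z < n; applying it shifts a length-n vector by z
    positions with zero padding."""
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        if 0 <= i + z < n:
            S[i][i + z] = 1.0
    return S

def mat_vec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

v = [1.0, 2.0, 3.0, 4.0]
print(mat_vec(shift_matrix(4, 1), v))   # [2.0, 3.0, 4.0, 0.0]
print(mat_vec(shift_matrix(4, -1), v))  # [0.0, 1.0, 2.0, 3.0]
```

A two-dimensional shift of a patch then follows by applying the Kronecker product S_y ⊗ S_x to the vectorized patch, as in (A.4).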


References

Aharon, M., Elad, M., Bruckstein, A., 2006. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing 54 (11), 4311–4322.

Belkin, M., Niyogi, P., 2003. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation 15 (6), 1373–1396.

Bruce, N. D., Rahman, S., Carrier, D., 2016. Sparse coding in early visual representation: From specific properties to general principles. Neurocomputing 171, 1085–1098. URL //www.sciencedirect.com/science/article/pii/S0925231215010772

Cichocki, A., Zdunek, R., Phan, A. H., Amari, S.-i., 2009. Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis and Blind Source Separation. Wiley Publishing.

Daubechies, I., 1992. Ten Lectures on Wavelets. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA.

Dong, C., Loy, C. C., He, K., Tang, X., February 2016. Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (2), 295–307. URL http://dx.doi.org/10.1109/TPAMI.2015.2439281

Dong, W., Zhang, D., Shi, G., Wu, X., 2011. Image deblurring and super-resolution by adaptive sparse domain selection and adaptive regularization. IEEE Transactions on Image Processing 20 (7), 1838–1857.

Donoho, D. L., Elad, M., Temlyakov, V. N., Jan 2006. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Transactions on Information Theory 52 (1), 6–18.

Elad, M., 2010. Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing. Springer. URL http://books.google.co.jp/books?id=d5b6lJI9BvAC

Engan, K., Aase, S. O., Hakon Husoy, J., 1999. Method of optimal directions for frame design. In: Proceedings of the 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '99), Vol. 05. IEEE Computer Society, Washington, DC, USA, pp. 2443–2446. URL http://dx.doi.org/10.1109/ICASSP.1999.760624

Fan, J., Li, R., 2001. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96 (456), 1348–1360. URL http://www.jstor.org/stable/3085904

Farsiu, S., Robinson, M., Elad, M., Milanfar, P., Oct 2004. Fast and robust multiframe super resolution. IEEE Transactions on Image Processing 13 (10), 1327–1344.

Hong, C., Yu, J., Tao, D., Wang, M., June 2015a. Image-based three-dimensional human pose recovery by multiview locality-sensitive sparse retrieval. IEEE Transactions on Industrial Electronics 62 (6), 3742–3751.

Hong, C., Yu, J., Wan, J., Tao, D., Wang, M., Dec 2015b. Multimodal deep autoencoder for human pose recovery. IEEE Transactions on Image Processing 24 (12), 5659–5670.

Jin, T., Yu, Z., Li, L., Li, C., 2015. Multiple graph regularized sparse coding and multiple hypergraph regularized sparse coding for image representation. Neurocomputing 154, 245–256. URL //www.sciencedirect.com/science/article/pii/S0925231214016610

Kato, T., Hino, H., Murata, N., 2015a. Double sparse multi-frame image super resolution. CoRR abs/1512.00607. URL http://arxiv.org/abs/1512.00607

Kato, T., Hino, H., Murata, N., 2015b. Multi-frame image super resolution based on sparse coding. Neural Networks 66, 64–78. URL http://dx.doi.org/10.1016/j.neunet.2015.02.009

Kato, T., Hino, H., Murata, N., 2016. Doubly sparse structure in image super resolution. In: 26th IEEE International Workshop on Machine Learning for Signal Processing (MLSP 2016), Vietri sul Mare, Salerno, Italy, September 13-16, 2016, pp. 1–6. URL http://dx.doi.org/10.1109/MLSP.2016.7738902

Lee, D. D., Seung, H. S., October 1999. Learning the parts of objects by non-negative matrix factorization. Nature 401 (6755), 788–791.

Lee, H., Battle, A., Raina, R., Ng, A. Y., 2006. Efficient sparse coding algorithms. In: NIPS, pp. 801–808.

Li, X., Hu, Y., Gao, X., Tao, D., Ning, B., February 2010. A multi-frame image super-resolution method. Signal Processing 90 (2), 405–414.

Lou, S., Zhao, X., Chuang, Y., Yu, H., Zhang, S., 2016. Graph regularized sparsity discriminant analysis for face recognition. Neurocomputing 173, Part 2, 290–297. URL //www.sciencedirect.com/science/article/pii/S0925231215012862

Mehta, N., Gray, A. G., 2013. Sparsity-based generalization bounds for predictive sparse coding. In: Dasgupta, S., McAllester, D. (Eds.), Proceedings of the 30th International Conference on Machine Learning (ICML-13), Vol. 28. JMLR Workshop and Conference Proceedings, pp. 36–44. URL http://jmlr.csail.mit.edu/proceedings/papers/v28/mehta13.pdf

Natarajan, B. K., 1995. Sparse approximate solutions to linear systems. SIAM Journal on Computing 24 (2), 227–234.

Olshausen, B. A., Field, D. J., 1996. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381, 607–609.

Olshausen, B. A., Field, D. J., 2005. How close are we to understanding V1? Neural Computation 17, 1665–1699.

Pati, Y. C., Rezaiifar, R., Krishnaprasad, P. S., 1993. Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition. In: Proceedings of the 27th Annual Asilomar Conference on Signals, Systems, and Computers, pp. 40–44.

Rubinstein, R., Zibulevsky, M., Elad, M., 2010. Double sparsity: Learning sparse dictionaries for sparse signal approximation. IEEE Transactions on Signal Processing 58 (3), 1553–1564.

Shimizu, M., Okutomi, M., 2006. Multi-parameter simultaneous estimation on area-based matching. International Journal of Computer Vision 67 (3), 327–342.

Shu, Z., Zhou, J., Huang, P., Yu, X., Yang, Z., Zhao, C., 2016. Local and global regularized sparse coding for data representation. Neurocomputing 175, Part A, 188–197. URL //www.sciencedirect.com/science/article/pii/S0925231215015076

Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, 267–288.

Vainsencher, D., Mannor, S., Bruckstein, A. M., November 2011. The sample complexity of dictionary learning. Journal of Machine Learning Research 12, 3259–3281. URL http://dl.acm.org/citation.cfm?id=1953048.2078210

Wang, C., Wang, Y., Lin, Z., Yuille, A. L., Gao, W., June 2014. Robust estimation of 3D human poses from a single image. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2369–2376.

Wang, P., Hu, X., Xuan, B., Mu, J., Peng, S., 2011. Super resolution reconstruction via multiple frames joint learning. In: International Conference on Multimedia and Signal Processing (CMSP), Vol. 1, pp. 357–361.

Yang, J., Wright, J., Huang, T., Ma, Y., 2008. Image super-resolution as sparse representation of raw image patches. In: Proceedings of the 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'08), pp. 1–8.

Yang, J., Wright, J., Huang, T. S., Ma, Y., November 2010. Image super-resolution via sparse representation. IEEE Transactions on Image Processing 19 (11), 2861–2873.

Yu, J., Rui, Y., Tao, D., 2014. Click prediction for web image reranking using multimodal sparse coding. IEEE Transactions on Image Processing 23 (5), 2019–2032. URL http://dx.doi.org/10.1109/TIP.2014.2311377

Yu, J., Yang, X., Gao, F., Tao, D., 2016. Deep multimodal distance metric learning using click constraints for image ranking. IEEE Transactions on Cybernetics PP (99), 1–11.

Zeyde, R., Elad, M., Protter, M., 2010. On single image scale-up using sparse-representations. In: Proceedings of the 7th International Conference on Curves and Surfaces, pp. 711–730.

Zhan, X., Zhang, R., Yin, D., Hu, A., Hu, W., 2013. Remote sensing image compression based on double-sparsity dictionary learning and universal trellis coded quantization. In: IEEE International Conference on Image Processing (ICIP 2013), Melbourne, Australia, September 15-18, 2013. IEEE, pp. 1665–1669. URL http://dx.doi.org/10.1109/ICIP.2013.6738343

Zheng, M., Bu, J., Chen, C., Wang, C., Zhang, L., Qiu, G., Cai, D., 2011. Graph regularized sparse coding for image representation. IEEE Transactions on Image Processing 20 (5), 1327–1336.

Zitova, B., Flusser, J., 2003. Image registration methods: a survey. Image and Vision Computing 21 (11), 977–1000. URL http://www.sciencedirect.com/science/article/pii/S0262885603001379

Toshiyuki Kato received the Bachelor's degree in engineering in 2013 and the Master's degree in engineering in 2015 from Waseda University. Since April 2015, he has been working at Toshiba Solutions Corporation. His research interests include image processing and sparse representation.

Hideitsu Hino received his B. Eng. in 2003 and M. Informatics in Applied Mathematics and Physics in 2005 from Kyoto University, Japan. He was with Hitachi's Systems Development Laboratory from April 2005 to August 2007. He earned a D. Eng. in 2010 from Waseda University. Since April 2017, he has been an Associate Professor at the University of Tsukuba. His research interests include the analysis of learning algorithms. He is a member of IEEE.

Noboru Murata received the B. Eng., M. Eng., and Dr. Eng. in Mathematical Engineering and Information Physics from the University of Tokyo in 1987, 1989, and 1992, respectively. After working at the University of Tokyo, GMD FIRST in Germany, and RIKEN in Japan, in April 2000 he joined Waseda University in Japan, where he is presently a professor. His research interests include the theoretical aspects of learning machines such as neural networks, focusing on the dynamics and statistical properties of learning.