Journal Pre-proofs

Video compressed sensing reconstruction based on structural group sparsity and successive approximation estimation model

Jian Chen, Zhifeng Chen, Kaixiong Su, Zheng Peng, Nam Ling

PII: S1047-3203(19)30355-4
DOI: https://doi.org/10.1016/j.jvcir.2019.102734
Reference: YJVCI 102734
To appear in: J. Vis. Commun. Image R.
Received Date: 5 June 2019
Revised Date: 26 September 2019
Accepted Date: 3 December 2019

Please cite this article as: J. Chen, Z. Chen, K. Su, Z. Peng, N. Ling, Video compressed sensing reconstruction based on structural group sparsity and successive approximation estimation model, J. Vis. Commun. Image R. (2019), doi: https://doi.org/10.1016/j.jvcir.2019.102734
© 2019 Published by Elsevier Inc.
Journal of Visual Communication and Image Representation (2019)
Video compressed sensing reconstruction based on structural group sparsity and successive approximation estimation model

Jian Chen (a), Zhifeng Chen (a,∗), Kaixiong Su (a), Zheng Peng (b), Nam Ling (c)

(a) College of Physics and Information Engineering, Fuzhou University, Fuzhou, Fujian 350108, China
(b) College of Mathematics and Computer Science, Fuzhou University, Fuzhou, Fujian 350108, China
(c) Department of Computer Engineering, Santa Clara University, Santa Clara, CA 95053-0566, USA
ARTICLE INFO

Article history: Received 5 June 2019; Received in final form 15 June 2019; Accepted 25 June 2019; Available online 30 June 2019
Communicated by S. Sarkar
Keywords: Compressed sensing, Group sparsity, Interframe estimation, Reconstruction algorithms

ABSTRACT

The existing video compressed sensing (CS) algorithms for inconsistent sampling ignore the joint correlations of video signals in space and time, and their reconstruction quality and speed need further improvement. To balance reconstruction quality with computational complexity, we introduce a structural group sparsity model for use in the initial reconstruction phase and propose a weight-based group sparse optimization algorithm acting in joint domains. Then, a coarse-to-fine optical flow estimation model with successive approximation is introduced for use in the interframe prediction stage to recover non-key frames through alternating optical flow estimation and residual sparse reconstruction. Experimental results show that, compared with the existing algorithms, the proposed algorithm achieves a peak signal-to-noise ratio gain of 1–3 dB and a multi-scale structural similarity gain of 0.01–0.03 at a low time complexity, and the reconstructed frames not only have good edge contours but also retain textural details.

© 2019 Elsevier B. V. All rights reserved.

∗ Corresponding author: Tel.: +0-000-000-0000; fax: +0-000-000-0000; e-mail: [email protected] (Zhifeng Chen)
Jian Chen et al. / Journal of Visual Communication and Image Representation (2019)
1. Introduction

In traditional source coding, transformation, quantization, predictive coding, entropy coding and other techniques must be combined on the coding side to remove most redundant data [1], so that only a small amount of the most important data is retained and storage and transmission costs are reduced. Meeting this requirement not only requires the encoder to satisfy the extremely high sampling rate described by the Nyquist sampling theorem but also substantially increases the encoding complexity and system cost. The problem is even more serious for broadband video signals. To reduce the encoding complexity, new information processing theories such as distributed source coding [2], compressed sensing (CS) [3, 4], and distributed CS [5] have emerged. Because CS breaks through the bottleneck of the Nyquist–Shannon theorem, sparse or compressible signals can be recovered from only a small number of projection measurements [3, 4], making it possible to sample broadband signals at a low rate. In particular, for video acquisition equipment with low power consumption, CS can be used as an alternative to existing video compression systems to avoid the high complexity of full sampling and video coding.

The structural models and reconstruction algorithms used in video CS are the main research topics in this field [6]. Currently, popular video CS models and algorithms can be divided into three categories: (1) inconsistent measurement models and interframe prediction algorithms [7, 8, 9, 10, 11, 12, 13, 14, 15, 16], (2) high-order tensor measurement models and corresponding reconstruction algorithms [17, 18, 19, 20, 21], and (3) dynamic CS [22, 23, 24].
Because models and algorithms of the third type are mainly used in medical magnetic resonance imaging (MRI) applications and because those of the second type have high storage and computation costs, we focus primarily on the first type of video CS, which is applicable to general dynamic scenes. In a measurement model of the first type, a video sequence is divided into key frames (reference frames) with a high sampling rate and non-key frames (nonreference frames) with a low sampling rate in a process called inconsistent sampling. The corresponding reconstruction algorithm is usually divided into two phases: initial reconstruction (independent reconstruction) and interframe prediction (non-key frame reconstruction). To improve the measurement efficiency and reduce the complexity, some scholars have adopted block-based compressed sampling (BCS) and smooth projected Landweber reconstruction (SPL) [25, 26]. However, BCS-SPL [26] ignores the high correlation between adjacent frames due to its independent reconstruction strategy, and thus, it is applicable only in the initial estimation stage when applied directly for video CS [9, 10]. To capture video interframe correlations, interframe prediction technology originating from the fields of traditional video coding and computer vision was introduced into the reconstruction stage for non-key frames in [7, 8, 9, 10, 11, 12, 13, 14]. These techniques for predicting nonreference frames include dictionary learning from image blocks in reference frames [7, 8], multihypothesis prediction (MH) [9, 13], motion estimation (ME)/motion compensation (MC) [10, 14], optical flow (OF) estimation [11, 12] and others. However, the complexity of dictionary learning is very high [8], making it unsuitable for real-time video reconstruction. In contrast, the complexity of block-based ME/MC
and MH [9, 10] is relatively low, but their estimation accuracies are not sufficiently high because of the block effect. Although the pixel-based OF method achieves high accuracy, it is computationally expensive [27]; at present, the algorithm estimates only the OF of low-resolution preview frames [11, 12], which limits its ability to improve the prediction accuracy. To improve the reconstruction quality of non-key frames, residual estimation and compensation were introduced in [10, 15] to modify the predicted value, but this method ignores the sparsity inherent to non-key frames; consequently, the reconstruction quality requires further improvement. To address video correlation in the spatial and temporal domains, the sparsity of intraframe and interframe prediction residuals was utilized in the two reconstruction phases, respectively, in [16], thereby further improving the video reconstruction quality. However, that approach still fails to fully use the joint correlations in space and time; therefore, it functions poorly in some cases while still having a high reconstruction complexity. In summary, the reconstruction algorithms based on inconsistent measurement models consider the intraframe sparsity and interframe correlations in isolation in the initial and non-key frame reconstruction phases and do not fully use the joint correlations of video signals in the time and space dimensions; consequently, the reconstruction quality has room for improvement. Moreover, the computational complexity needs to be reduced. The main innovations of our current study are as follows: (1) To address intraframe sparsity and interframe correlations, we introduce a structural group sparsity (SGS) model for use in the initial reconstruction stage and propose a weighted group sparse optimization method acting in joint domains for inconsistent measurements to improve the reconstruction quality for key frames. 
(2) In the non-key frame reconstruction stage, a coarse-to-fine OF estimation model with successive approximation is introduced for use in interframe prediction. OF estimation and residual sparse reconstruction are then alternately executed to recover and improve the reconstruction quality of non-key frames. The remainder of this paper is organized as follows: Section 2 reviews the related theory, the motivation for the present work and the basic framework of this paper. Section 3 describes the proposed structure model and reconstruction algorithm for video CS. Section 4 presents experimental results, and Section 5 concludes the paper.
2. Related theory and basic framework

2.1. Theoretical basis

2.1.1. Compressed sensing

According to CS theory [3, 4], when an unknown signal $x \in \mathbb{R}^{N}$ is $K$-sparse in the transformation domain $\Psi \in \mathbb{R}^{N \times N}$, it can be projected into a low-dimensional space by means of a measurement matrix $\Phi \in \mathbb{R}^{M \times N}$ that is incoherent with respect to $\Psi$:

$$ y = \Phi x = \Theta S \tag{1} $$

where $K < M < N$, $S$ is the coefficient vector of $x$ in $\Psi$, and $\Theta = \Phi\Psi$. The coefficients $S$ can be estimated by solving

$$ \min_{S} \|S\|_{p} \quad \text{s.t.} \quad y = \Phi x = \Theta S \tag{2} $$

where $\|\cdot\|_{p}$ represents the $l_p$ norm. Common algorithms for solving this problem include convex optimization algorithms [28, 29, 30], greedy algorithms [31, 32], and thresholding-based methods [33]. The original signal can then be recovered as $x = \Psi S$.
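As a rough illustration of Eqs. (1) and (2), the sketch below measures a synthetic $K$-sparse signal and recovers it with iterative soft thresholding, one member of the thresholding-based family mentioned above. It is a minimal example under stated assumptions, not the method proposed in this paper: the random orthonormal basis `psi`, the Gaussian matrix `phi`, the regularization weight `lam`, and the problem sizes are all illustrative choices, and the constrained problem (2) with $p = 1$ is replaced by its penalized form.

```python
import numpy as np

def soft_threshold(v, t):
    """Element-wise soft thresholding: shrink magnitudes by t, clip at zero."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(theta, y, lam=0.01, n_iter=2000):
    """Iterative soft thresholding for min_S 0.5*||y - theta S||_2^2 + lam*||S||_1."""
    step = 1.0 / np.linalg.norm(theta, 2) ** 2   # 1 / Lipschitz constant of the gradient
    s = np.zeros(theta.shape[1])
    for _ in range(n_iter):
        grad = theta.T @ (theta @ s - y)
        s = soft_threshold(s - step * grad, step * lam)
    return s

rng = np.random.default_rng(0)
N, M, K = 100, 40, 5                                  # illustrative sizes (assumption)
psi, _ = np.linalg.qr(rng.standard_normal((N, N)))    # assumed orthonormal basis Psi
phi = rng.standard_normal((M, N)) / np.sqrt(M)        # Gaussian measurement matrix Phi

s_true = np.zeros(N)
support = rng.choice(N, size=K, replace=False)
s_true[support] = rng.choice([-1.0, 1.0], size=K) * rng.uniform(1.0, 2.0, size=K)
x_true = psi @ s_true                                 # x = Psi S

y = phi @ x_true                                      # Eq. (1): y = Phi x = Theta S
s_hat = ista(phi @ psi, y)                            # Theta = Phi Psi
x_hat = psi @ s_hat                                   # recover x = Psi S
rel_err = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)
print(f"relative reconstruction error: {rel_err:.4f}")
```

The penalized form introduces a small bias that shrinks with `lam`; a constrained solver would be needed to satisfy $y = \Theta S$ exactly.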
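2.1.2. Structural optimization

When $p \ge 1$, Problem (2) can be regarded as a classical convex optimization problem with the linear constraint $Ax = b$ and the optimization objective $\theta(x)$ [29]. The standard form of this problem is

$$ \min_{x} \theta(x) \quad \text{s.t.} \quad Ax = b \tag{3} $$

where $A \in \mathbb{R}^{M \times N}$, $x \in \mathbb{R}^{N}$, and $b \in \mathbb{R}^{M}$ ($M < N$). When the objective function and the variable are separable into two parts, the problem takes the form

$$ \min_{x_1, x_2} \{\theta_1(x_1) + \theta_2(x_2)\} \quad \text{s.t.} \quad A_1 x_1 + A_2 x_2 = b \tag{4} $$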
The augmented Lagrange function of Problem (4) is as follows:

$$ L_A(x_1, x_2, \lambda) = \theta_1(x_1) + \theta_2(x_2) - \lambda^{T}(A_1 x_1 + A_2 x_2 - b) + \frac{\beta}{2}\|A_1 x_1 + A_2 x_2 - b\|_2^2 \tag{5} $$

where $\lambda \in \mathbb{R}^{M}$ is the Lagrange multiplier and $\beta > 0$ is a penalty parameter. The alternating direction method of multipliers (ADMM) based on augmented Lagrange multipliers (ALM), or ALM-ADMM, is an efficient algorithm for solving this type of problem. Convergence of the algorithm follows from its formulation as a splitting method for separable convex programming [29, 30, 34].
2.1.3. Group sparse optimization

The sparse optimization algorithm for solving Problem (2) allows us to reconstruct high-dimensional signals from only a small number of samples. To further enhance the reliability of the solution, current research has reached beyond the sparsity of a single signal to mine additional information from the underlying structure. In fact, a broad class of highly correlated signal groups have a fixed group sparse (GS) structure [35, 36]. Encoding this GS structure can reduce the number of degrees of freedom of the solution, thereby leading to better recovery performance.

The GS reconstruction problem [35, 36, 37] has been studied extensively. A good approach is to use a mixture of $l_1$ and $l_2$ regularization terms. Suppose that $x \in \mathbb{R}^{N}$ is an unknown GS solution, and let $\{x_{g_i} \in \mathbb{R}^{n_i} : i = 1, \ldots, q\}$ be the grouping of $x$, where $g_i \subseteq \{1, 2, \ldots, N\}$ is the index set corresponding to group $i$ and $x_{g_i}$ represents the subvector of $x$ indexed by $g_i$. According to Reference [36], $\|x\|_{w,2,1} = \sum_{i=1}^{q} w_i \|x_{g_i}\|_2$ denotes the $l_{2,1}$ norm with grouping weights $w_i \ge 0$ ($i = 1, \ldots, q$). Then, the basis pursuit (BP) model based on this norm is as follows:

$$ \min_{x} \|x\|_{w,2,1} = \sum_{i=1}^{q} w_i \|x_{g_i}\|_2 \quad \text{s.t.} \quad Ax = b \tag{6} $$

where $A \in \mathbb{R}^{M \times N}$, $x \in \mathbb{R}^{N}$, and $b \in \mathbb{R}^{M}$ ($M < N$). By introducing an auxiliary variable $z$, Problem (6) can be rewritten as

$$ \min_{x, z} \|z\|_{w,2,1} = \sum_{i=1}^{q} w_i \|z_{g_i}\|_2 \quad \text{s.t.} \quad z = x, \; Ax = b \tag{7} $$
Problem (7) contains two variables, $x$ and $z$; its objective function is separable, and the constraint conditions are linear. Therefore, it can be solved by minimizing the augmented Lagrange function of (7) as follows:

$$ \min_{x, z} L_A(x, z, \lambda, \mu) = \min_{x, z} \; \|z\|_{w,2,1} - \lambda^{T}(Ax - b) + \frac{\alpha}{2}\|Ax - b\|_2^2 - \mu^{T}(z - x) + \frac{\beta}{2}\|z - x\|_2^2 \tag{8} $$
where λ∈ RM and µ∈ RN are Lagrange multipliers and α and β (>0) are penalty parameters. The convergence of this minimization problem is guaranteed by the existing ADMM theory [38]. The multiple measurement vector (MMV) problem that arises in a variety of areas of applied signal processing can be seen as a special case of the nonoverlapping GS problem [36]. In recent years, the MMV problem has attracted considerable attention as an extension of single sparse signal reconstruction in CS, and it has been applied to CS for color and hyperspectral images [39, 40, 41].
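To make the weighted $l_{2,1}$ norm of Eq. (6) concrete, the following minimal sketch evaluates it for a small vector with three nonoverlapping groups. The vector, the groups, and the weights are illustrative values chosen for this example only.

```python
import numpy as np

def weighted_l21_norm(x, groups, weights):
    """Sum over groups of w_i * ||x_{g_i}||_2 for nonoverlapping index groups."""
    return sum(w * np.linalg.norm(x[g]) for g, w in zip(groups, weights))

# Illustrative data: six entries split into three groups of two.
x = np.array([3.0, 4.0, 0.0, 0.0, 1.0, 0.0])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
weights = [1.0, 1.0, 2.0]

value = weighted_l21_norm(x, groups, weights)   # 1*5 + 1*0 + 2*1
print(value)
```

Because each group contributes through its $l_2$ norm, a group is penalized as a whole: the norm favors solutions in which entire groups vanish rather than isolated entries.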
2.1.4. Optical flow estimation

As one of the cornerstones of computer vision, OF estimation is widely used in video processing to compress videos and to enhance video quality [27]. The underlying idea is that two images can be aligned by means of the OF field. The objective function of the OF vector $v$ is [42]

$$ E(v) = \sum_{p} \theta_1\left(|I_1(p) - I_2(p + v_p)|^2\right) + \omega \sum_{p} \theta_2\left(|\nabla v_x|^2 + |\nabla v_y|^2\right) \tag{9} $$

where $I_1$ and $I_2$ represent the two images to be matched and $p = (p_x, p_y)$ is a point in the image lattice. $v_p = (v_x, v_y)$ denotes the flow vector at pixel $p$, where $v_x$ and $v_y$ are the horizontal and vertical components, respectively, of the flow vector $v_p$, and $\nabla(\cdot)$ denotes the gradient operator. $\theta_1(\cdot)$ and $\theta_2(\cdot)$ are both robust functions. The first term of (9) is often called the data term, the second term is called the smoothness term, and $\omega$ is a coefficient that balances the two terms. For convenience in calculation, Eq. (9) can be rewritten by means of an incrementation and linearization strategy [27] as follows:

$$ E(v, dv) = \sum_{p} \theta_1\left(|I_t(p) + I_x(p)\,dv_x + I_y(p)\,dv_y|^2\right) + \omega \sum_{p} \theta_2\left(|\nabla(v_x + dv_x)|^2 + |\nabla(v_y + dv_y)|^2\right) \tag{10} $$

where $dv$ is the increment of $v$, which at pixel $p$ is $dv_p = (dv_x, dv_y)$; $I_t(p) = I_2(p + v_p) - I_1(p)$; $I_x(p) = \frac{\partial}{\partial x} I_2(p + v_p)$; and $I_y(p) = \frac{\partial}{\partial y} I_2(p + v_p)$.
The objective function given in (10) can be optimized using the iteratively reweighted least squares (IRLS) method [27, 43], which is equivalent to the conventional Euler–Lagrange variational approach [42] but more succinct to derive. Therefore, the IRLS method is guaranteed to converge to a local minimum because the variational method is guaranteed to do so [43].
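The linearized data term in Eq. (10) can be checked numerically on a toy pair of images. The sketch below is an assumption-laden illustration rather than the estimator used in this paper: it fixes the current flow at $v = 0$, uses a synthetic Gaussian blob displaced by half a pixel, approximates $I_x$ and $I_y$ with central differences (`np.gradient`), and takes $\theta_1$ to be the identity so that the data term is a plain sum of squares.

```python
import numpy as np

def gaussian_blob(cx, cy, size=64, sigma=8.0):
    """Smooth synthetic image: a Gaussian bump centred at (cx, cy)."""
    yy, xx = np.mgrid[0:size, 0:size].astype(float)
    return np.exp(-((xx - cx) ** 2 + (yy - cy) ** 2) / (2.0 * sigma ** 2))

def linearized_residual(i1, i2, dvx, dvy):
    """I_t + I_x*dv_x + I_y*dv_y with the current flow v fixed at zero."""
    i_t = i2 - i1                    # temporal difference I_t(p) = I_2(p) - I_1(p)
    i_y, i_x = np.gradient(i2)       # axis 0 is vertical (y), axis 1 is horizontal (x)
    return i_t + i_x * dvx + i_y * dvy

i1 = gaussian_blob(32.0, 32.0)
i2 = gaussian_blob(32.5, 32.0)       # same blob moved 0.5 px to the right

# Data term (theta_1 = identity) without any increment and with the true increment.
e_zero = np.sum(linearized_residual(i1, i2, 0.0, 0.0) ** 2)
e_true = np.sum(linearized_residual(i1, i2, 0.5, 0.0) ** 2)
print(e_zero, e_true)
```

The increment that matches the true displacement removes the first-order part of the residual, which is why the data term drops sharply; what remains is the second-order linearization error that motivates iterating and, for larger motion, working coarse to fine.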
Furthermore, the coarse-to-fine OF algorithm [27, 44], in which the flow estimation is initialized at the top level of a Gaussian pyramid and is then progressively refined from the top to the bottom level of the pyramid, is introduced for motion estimation. Consequently, the estimation process not only adapts to slowly and smoothly moving scenes but is also able to capture large-displacement motion to some degree.
2.2. Motivation and basic framework

Inspired by the GS or MMV problem [35, 36, 37] and its application for reconstructing related images [39, 40, 41], as mentioned in Section 2.1, we believe that the correlations between frames of a video sequence are no weaker than the correlations between the color and spectral components of an image. Therefore, we present an SGS model that combines GS optimization [36] with structural optimization [29, 30, 34] and apply it in the initial stage of video CS reconstruction to improve the initial recovery performance. Of course, this structure is suitable only for slow-motion video sequences. For video sequences with obvious motion characteristics, certain interframe prediction techniques, such as motion estimation [9, 10, 11, 12, 13, 14, 15] and residual estimation and compensation [10, 15], as mentioned in Section 1, should be considered.

In OF estimation, the pixel is treated as the unit of estimation [42, 43]; thus, the accuracy of OF estimation is higher than that of block-based motion estimation [9, 10, 13, 14]. In addition, the traditional OF algorithm, which is suited to slow and smooth motion, has been extended in recent years to coarse-to-fine OF estimation [27, 44], which is more suitable for large-displacement motion. Therefore, we attempt to combine the coarse-to-fine OF method with a residual sparse model to achieve high-precision interframe prediction.

To facilitate comparisons with similar algorithms, it is generally assumed that a video sequence consists of multiple groups of pictures (GOPs), each of which contains one key frame and multiple non-key frames. To reduce the computational complexity and storage space, we adopt BCS [25, 26] in this paper and use a block-based measurement matrix to measure each video frame. The task of video sequence measurement is divided into key and non-key frame measurements, as shown by the light red elements in the block diagram presented in Fig. 1.
The difference between these two measurements is that a high sampling rate is often adopted in the former, while the latter uses a lower sampling rate. When the sampling rates of the key and non-key frames are equal, their statuses tend to be the same, and the inconsistent measurements described above become consistent measurements.
Fig. 1. Basic architecture of video CS.
In this article, the video recovery process is divided into two stages. The first stage is SGS reconstruction. Based on the prior characteristics of the common sparse support from nearby video frames in the transformation domain and the amount of measured information obtained from key and non-key frames, to achieve the initial recovery of the key frames, we present an SGS optimization model in joint domains, represented by the light green element in the block diagram in Fig. 1. The second stage is prediction and compensation (as shown by the light blue elements in the block diagram in Fig. 1). The SGS optimization result is utilized as the initial value for intraframe prediction for the key frames, and the intraframe residuals are estimated and compensated for by combining them with the measured values of the key frames during key frame reconstruction. We conduct the OF estimation by means of successive approximation for non-key frames by using the reconstructed values of the key frames as a reference and then iteratively estimating and compensating for the interframe residuals by using the OF-based predicted results and the measured values of the non-key frames as a reference to reconstruct the non-key frames.
3. Structure optimization model and reconstruction algorithm

This section describes the mathematical model of the video CS architecture, the reconstruction algorithm based on the SGS model and the interframe prediction method based on OF estimation, residual estimation and compensation, as shown in Fig. 1.

3.1. Mathematical representation of video CS

To facilitate the description, we assume that the size of one GOP (GOPsize) in a video sequence is fixed at 8 frames and that the 8 frames of the current GOP plus the key frame of the next GOP constitute one processing unit. Thus, there are 7 non-key frames in the interval between the two key frames, as shown in Fig. 2.
Fig. 2. Video CS processing units.
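The processing-unit layout described above (one GOP of 8 frames plus the key frame of the next GOP) can be sketched as a simple index computation. The function name `processing_units` and the zero-based frame indexing are assumptions made for this illustration.

```python
def processing_units(n_frames, gop_size=8):
    """Return the frame indices of each unit: one GOP plus the next GOP's key frame."""
    units = []
    for start in range(0, n_frames - gop_size, gop_size):
        units.append(list(range(start, start + gop_size + 1)))
    return units

units = processing_units(17)          # 17 frames -> two overlapping 9-frame units
for unit in units:
    key_frames = (unit[0], unit[-1])  # first and last frame of the unit
    non_key_frames = unit[1:-1]       # the 7 frames in between
    print(key_frames, non_key_frames)
```

Consecutive units share one key frame, so each key frame serves as a reference on both sides of a GOP boundary.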
For convenience of mathematical expression, the 9 frames of data in one processing unit, $x_1$ to $x_9$, are stacked vertically as $X = [x_1, x_2, \ldots, x_9]^{T}$. Provided that the measurement matrix for key frames (with a high sampling rate) is $\phi_k$ and that for non-key frames (with a low sampling rate) is $\phi_{nk}$, the measurement matrix $\Phi$ and the measured values $Y$ for this processing unit are, respectively, expressed as follows:

$$ \Phi = \begin{bmatrix} \phi_k & & & & \\ & \phi_{nk} & & & \\ & & \ddots & & \\ & & & \phi_{nk} & \\ & & & & \phi_k \end{bmatrix}, \qquad Y = \Phi \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_9 \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_9 \end{bmatrix} $$
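Because $\Phi$ is block diagonal, $Y = \Phi X$ is equivalent to measuring each frame with its own matrix. The sketch below checks this on random stand-in data; the frame length `n`, the two sampling rates, and the Gaussian matrices are illustrative assumptions and not the settings used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 64                                   # length of one vectorized frame/block (assumption)
rate_key, rate_nonkey = 0.5, 0.1         # illustrative sampling rates
phi_k = rng.standard_normal((int(rate_key * n), n))
phi_nk = rng.standard_normal((int(rate_nonkey * n), n))

frames = [rng.random(n) for _ in range(9)]        # stand-ins for x_1 .. x_9
mats = [phi_k] + [phi_nk] * 7 + [phi_k]           # diagonal blocks of Phi
measurements = [phi @ x for phi, x in zip(mats, frames)]   # y_j = phi_j x_j

# Explicit block-diagonal Phi, built only to confirm it matches the per-frame products.
rows = sum(m.shape[0] for m in mats)
big_phi = np.zeros((rows, 9 * n))
r = 0
for j, m in enumerate(mats):
    big_phi[r:r + m.shape[0], j * n:(j + 1) * n] = m
    r += m.shape[0]
Y = big_phi @ np.concatenate(frames)
print(Y.shape)
```

In practice the per-frame form is preferable, since the explicit block-diagonal matrix is mostly zeros and grows with the square of the unit size.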
When $\phi_k = \phi_{nk}$, the nonuniform measurements are transformed into uniform measurements. Without loss of generality, in the case of inconsistent measurements, the measurement matrices for the video frames are not all the same and their sampling rates are not uniform; consequently, the accuracy of independent frame-based reconstruction also differs from frame to frame. Therefore, the measured values of each frame in each processing unit contribute unequally to the overall video quality. Because a universal video sequence can be regarded as an image set with strong spatiotemporal correlations, each frame has the compressibility of a natural image (sparsity in the transformation domain), and similar pixels in adjacent frames will exhibit an approximate motion trend (enabling interframe prediction). In this paper, the video recovery problem for a processing unit is expressed in the form of the following optimization model:

$$ \min_{X} \sum_{i=1}^{Q} \tau_i \|S_i\|_{r_i} + \sum_{j=1}^{9} \gamma_j \|x_j - \tilde{x}_j\|_2^2 \quad \text{s.t.} \quad Y = \Phi X, \; X = \Psi_i S_i, \; \tilde{X} = \mathrm{Pred}(X_{\mathrm{ref}}) \tag{11} $$
where $X$ represents the combination of video frame vectors to be solved, $\Phi$ represents the measurement matrix, and $Y$ represents the measured value matrix. $\Psi_i$ and $S_i$ denote the transformation matrix and corresponding transformation coefficient, respectively, in the $i$-th sparse domain, and $\tau_i$ denotes the weight of that sparse domain. Here,

$$ \Psi_i = \begin{bmatrix} \psi_{i1} & & & \\ & \psi_{i2} & & \\ & & \ddots & \\ & & & \psi_{i9} \end{bmatrix}, \qquad S_i = \begin{bmatrix} s_{i1} \\ s_{i2} \\ \vdots \\ s_{i9} \end{bmatrix}, $$

where $\psi_{ij}$ and $s_{ij}$ ($i = 1, \ldots, Q$, $j = 1, \ldots, 9$) denote the transformation matrix and coefficient in the $i$-th sparse domain for the $j$-th frame. $Q$ represents the number of transformation bases, and $r_i$ is the norm subscript. $X_{\mathrm{ref}}$ represents the reference frame for the video group. $\tilde{X}$ serves as the predicted value of $X$; as the components of $X$ and $\tilde{X}$, $x_j$ and $\tilde{x}_j$ represent the $j$-th frame and its predicted value, respectively, and $\gamma_j$ represents the weight of the predicted residual of the $j$-th frame. $\mathrm{Pred}(\cdot)$ denotes the interframe prediction operation.

The three imposed constraints are the measurements, the sparse transformation, and the interframe prediction. The first and second terms of the objective function reflect the sparsity of the intraframe transformation coefficients and the interframe prediction residuals, respectively. It is difficult to provide an explicit function expression because interframe prediction produces different results for different video contents and estimation methods. To solve Eq. (11), we decompose the solution process into two phases: an initial reconstruction phase based on the SGS optimization model and an interframe prediction phase based on the OF smoothness and residual sparsity.
3.1.1. Structural group sparsity optimization model

When the video content changes slowly within a single GOP, adjacent frames are highly similar; thus, the sparse transformation coefficient also remains approximately the same. Consequently, we can simplify the interframe prediction process and the sparsity of the prediction residuals by considering the video frames to have a common sparse support; this is called the GS approach. The existing GS algorithms [39, 40, 41] consider relevant image components to have a common sparse support only in a specific domain, such as the discrete wavelet transform (DWT) domain, where the transformation coefficients in the same position from different image channels can be seen as a group. To reflect the contributions of the measured values of different video frames and their common sparsity in multiple transformation domains, the BP model with an $l_{w,2,1}$ norm that is typically used for this problem is extended to an SGS model based on the weighted $l_{p,r}$ norm minimization below:

$$ \min_{X} \sum_{i=1}^{Q} \tau_i \|S_i\|_{W,p,r_i} \quad \text{s.t.} \quad Y = \Phi X, \; X = \Psi_i S_i \tag{12} $$
Here, the transformation coefficients in the same position from adjacent frames can be seen as a group.

$$ \|S_i\|_{W,p,r_i} := \left( \sum_{h=1}^{n} \left\| W (S_i)_{g_h} \right\|_{r_i}^{p} \right)^{\frac{1}{p}}, $$

where $n$ is the number of groups, $h$ represents the index of the group, and $p$ is set to 2. $g_h \subseteq \{1, 2, \ldots, 9\}$ is the index set corresponding to group $h$, $(S_i)_{g_h}$ represents the subvector whose index is $g_h$, and

$$ W = \begin{bmatrix} w_1 & & & \\ & w_2 & & \\ & & \ddots & \\ & & & w_9 \end{bmatrix}, $$

in which $w_j$ represents the weight of the $j$-th frame. The other symbols are defined as described earlier. More specifically, assuming that the transformation coefficients of video frames in the TV (i.e., gradient) domain and the DWT (i.e., wavelet) domain are sparse and that the transformation coefficients of adjacent frames have a common sparse support, Problem (12) can be rewritten as a structural optimization problem based on the SGS model:
$$ \min_{X, Z_1, Z_2} \left\{ \tau_1 \|Z_1\|_{W,2,\mathrm{TV}} + \tau_2 \|Z_2\|_{W,2,\mathrm{DWT}} \right\} \quad \text{s.t.} \quad Y = \Phi X, \; Z_1 = D_1 X, \; Z_2 = D_2 X \tag{13} $$

where $D_1$ and $D_2$ represent the TV and DWT transformations, respectively; $Z_1$ and $Z_2$ represent the transformation coefficients of $X$ in the gradient and wavelet domains, respectively; and $\tau_1$ and $\tau_2$ represent the weights of the corresponding $l_{W,2,\mathrm{TV}}$ and $l_{W,2,\mathrm{DWT}}$ norms. See Section 3.2 for the details of the solution process for Eq. (13).

However, when the video shows sufficient variability within a single GOP, using the SGS model based on a common sparse support actually reduces the recovery quality. Therefore, the SGS approach should not be used in this case, and Problem (13) can be simplified to a general structural optimization model [29, 30] for independent reconstruction, as in BCS based on mixed variational inequality (MVI) [45]:

$$ \min_{X, Z_1, Z_2} \left\{ \tau_1 \|Z_1\|_{\mathrm{TV}} + \tau_2 \|Z_2\|_{\mathrm{DWT}} \right\} \quad \text{s.t.} \quad Y = \Phi X, \; Z_1 = D_1 X, \; Z_2 = D_2 X \tag{14} $$

where $\|\cdot\|_{\mathrm{TV}}$ and $\|\cdot\|_{\mathrm{DWT}}$ represent the $l_{\mathrm{TV}}$ and $l_{\mathrm{DWT}}$ norms, respectively, and the other symbols are defined as explained previously.
3.1.2. Interframe prediction model

In the case of inconsistent sampling, the sampling rate of the key frames ($x_1$ and $x_9$) is higher than that of the non-key frames ($x_2, \ldots, x_8$). Therefore, the quality of the key frames recovered through either the SGS approach or independent reconstruction will necessarily be better than that of the non-key frames. Thus, in the initial stage, the reconstructed key frames $\tilde{x}_{01}$ and $\tilde{x}_{09}$ can be regarded as the basic reference frames for interframe prediction, and interframe motion estimation is not required for the key frames. Considering that the residual frames are sparser than the original frames in the transformation domain [10, 15], in the second stage, the key frames ($j = 1, 9$) can be reconstructed using the following model:

$$ \min_{x_j} \left\{ \tau_1 \|\Delta z_{1j}\|_{\mathrm{TV}} + \tau_2 \|\Delta z_{2j}\|_{\mathrm{DWT}} \right\} \quad \text{s.t.} \quad y_j = \phi_j x_j, \; \Delta z_{ij} = D_i (x_j - \tilde{x}_{0j}) \tag{15} $$

where $x_j$, $\phi_j$ and $y_j$ are the $j$-th frame to be solved, its measurement matrix and its measured value vector, respectively. Let $\Delta x_j = x_j - \tilde{x}_{0j}$ denote the residual between the values of the original frame and the initial reconstruction for the $j$-th frame; then, $\Delta z_{ij}$ represents the coefficient of the $j$-th residual frame in the $i$-th ($i = 1, 2$) transformation domain. The above problem can be solved using a residual estimation and compensation algorithm.

For the non-key frames ($j = 2, \ldots, 8$), we select the OF method, which offers pixel-level accuracy, to perform the estimation in this paper. The closer the frames are to each other on the timeline, the better the estimation effect is; therefore, the prediction step in Problem (11) can be simplified to residual sparse optimization and OF estimation based on pairs of adjacent frames:

$$ \min_{x_j} \left\{ \tau_1 \|\Delta z_{1j}\|_{\mathrm{TV}} + \tau_2 \|\Delta z_{2j}\|_{\mathrm{DWT}} \right\} \quad \text{s.t.} \quad y_j = \phi_j x_j, \; \Delta z_{ij} = D_i (x_j - \tilde{x}_{0j}), \; \tilde{x}_{0j} = \mathrm{OF}(\tilde{x}_{j-1}, \tilde{x}_{j+1}) \tag{16} $$

where $\mathrm{OF}(\cdot)$ denotes the OF estimation operation. Compared with Problem (15), Problem (16) includes an additional OF estimation step, and its solution process requires alternation between OF estimation, residual estimation and compensation. See Section 3.3 for the details of this process.
3.2. Solving the SGS reconstruction model

Previous studies [29, 30, 34] have shown that the ALM-ADMM is an effective way to solve structural optimization problems. This section extends this algorithm to solve the SGS optimization problem defined in (13).

The augmented Lagrange function of Problem (13) can be written as

$$ \begin{aligned} L_A(X, Z_1, Z_2, \lambda) = {} & \tau_1 \|Z_1\|_{W,2,\mathrm{TV}} + \tau_2 \|Z_2\|_{W,2,\mathrm{DWT}} - \lambda_0^{T}(\Phi X - Y) + \frac{\beta_0}{2}\|\Phi X - Y\|_2^2 \\ & - \lambda_1^{T}(D_1 X - Z_1) + \frac{\beta_1}{2}\|D_1 X - Z_1\|_2^2 - \lambda_2^{T}(D_2 X - Z_2) + \frac{\beta_2}{2}\|D_2 X - Z_2\|_2^2 \end{aligned} \tag{17} $$

where $\lambda_0$, $\lambda_1$ and $\lambda_2$ are Lagrange multipliers and $\beta_0$, $\beta_1$ and $\beta_2$ are penalty parameters. During the reconstruction process, the ADMM is used to solve the $X$ subproblem and the $Z_1$ and $Z_2$ subproblems (i.e., the $Z$ subproblem) and then to update the Lagrange multipliers. In the $(k + 1)$-th iteration, the $X$ subproblem is

$$ \begin{aligned} X^{k+1} = \arg\min_{X} \Big\{ & -(\lambda_0^{k})^{T}(\Phi X - Y) + \frac{\beta_0}{2}\|\Phi X - Y\|_2^2 - (\lambda_1^{k})^{T}(D_1 X - Z_1^{k}) + \frac{\beta_1}{2}\|D_1 X - Z_1^{k}\|_2^2 \\ & - (\lambda_2^{k})^{T}(D_2 X - Z_2^{k}) + \frac{\beta_2}{2}\|D_2 X - Z_2^{k}\|_2^2 \Big\} \end{aligned} \tag{18} $$
It is efficient to utilize the one-step steepest descent (OSD) method, as in BCS based on mixed variational inequality (BCS-MVI) [45] and the TV-augmented Lagrangian alternating direction algorithm (TVAL3) [28], for the $X$ subproblem as follows. First, the subgradient of $L_A$ with respect to $X$, denoted by $g_d$, is calculated:

$$ g_d^{k} = \frac{\partial L_A}{\partial X} \tag{19} $$

Second, the Barzilai–Borwein (BB) step $\rho_k$, a gradient descent step [46], is calculated using the nonmonotone line search technique [47]:

$$ \rho_k = \frac{(X^{k} - X^{k-1})^{T}(X^{k} - X^{k-1})}{(X^{k} - X^{k-1})^{T}(g_d^{k} - g_d^{k-1})} \tag{20} $$

Then, $X$ is updated as follows:

$$ X^{k+1} = X^{k} - \rho_k g_d^{k} \tag{21} $$
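The update in Eqs. (19) to (21) can be tried on a small quadratic to see how the BB step behaves. This is a minimal sketch under simplifying assumptions: the objective is $\frac{1}{2}\|AX - b\|_2^2$ rather than the full augmented Lagrangian, the nonmonotone line search safeguard [47] is omitted, and the initial step `rho0` and the helper name `bb_descent` are choices made for this example.

```python
import numpy as np

def bb_descent(grad, x0, rho0, n_iter=200, tol=1e-10):
    """Gradient descent with the Barzilai-Borwein step of Eq. (20), no line search."""
    x_prev, g_prev = x0, grad(x0)
    x = x_prev - rho0 * g_prev             # first step uses a fixed step size
    for _ in range(n_iter):
        g = grad(x)                        # Eq. (19): gradient at the current iterate
        if np.linalg.norm(g) < tol:
            break
        s, d = x - x_prev, g - g_prev
        rho = (s @ s) / (s @ d)            # Eq. (20): BB step size
        x_prev, g_prev = x, g
        x = x - rho * g                    # Eq. (21): descent update
    return x

rng = np.random.default_rng(2)
A = rng.standard_normal((30, 10))
b = rng.standard_normal(30)
grad = lambda x: A.T @ (A @ x - b)         # gradient of 0.5 * ||A x - b||^2

x_bb = bb_descent(grad, np.zeros(10), rho0=1.0 / np.linalg.norm(A, 2) ** 2)
x_ls = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.linalg.norm(x_bb - x_ls))
```

The BB step needs only the two most recent iterates and gradients, so it adds almost no cost per iteration; the objective may increase on individual steps, which is why a nonmonotone safeguard is normally paired with it.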
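In the $(k + 1)$-th iteration, the $Z$ subproblem is

$$ Z_1^{k+1} = \arg\min_{Z_1} \; \tau_1 \|Z_1\|_{W,2,\mathrm{TV}} - (\lambda_1^{k})^{T}(D_1 X^{k+1} - Z_1) + \frac{\beta_1}{2}\|D_1 X^{k+1} - Z_1\|_2^2 \tag{22} $$

$$ Z_2^{k+1} = \arg\min_{Z_2} \; \tau_2 \|Z_2\|_{W,2,\mathrm{DWT}} - (\lambda_2^{k})^{T}(D_2 X^{k+1} - Z_2) + \frac{\beta_2}{2}\|D_2 X^{k+1} - Z_2\|_2^2 \tag{23} $$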
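A threshold shrinkage method with a group weight $W$ is proposed to solve the above $Z$ ($Z_1$ and $Z_2$) subproblem. For group direction 1, in the TV domain,

$$ Z_1^{k+1} = \max\left\{ \left\| W F_1(X^{k+1}) \right\|_{2,\mathrm{TV}} - \frac{\tau_1}{\beta_1}, \, 0 \right\} \frac{W F_1(X^{k+1})}{\left\| W F_1(X^{k+1}) \right\|_{2,\mathrm{TV}}} \tag{24} $$

and for group direction 2, in the DWT domain,

$$ Z_2^{k+1} = \max\left\{ \left\| W F_2(X^{k+1}) \right\|_{2,\mathrm{DWT}} - \frac{\tau_2}{\beta_2}, \, 0 \right\} \frac{W F_2(X^{k+1})}{\left\| W F_2(X^{k+1}) \right\|_{2,\mathrm{DWT}}} \tag{25} $$

where $W$ is the weight of the frames in this processing unit and the $F_i := D_i X - \frac{1}{\beta_i}\lambda_i$ represent the shrinking transformation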
coefficients in the TV domain ($i = 1$) and the DWT domain ($i = 2$). Note that when inconsistent measurements are adopted, the accuracies of the transformation coefficients obtained through convex optimization reconstruction vary because the sampling rates are different for different video frames. Generally, the transformation coefficients of the key frames, which have the higher sampling rate, are the most accurate. If the shrinkage process is carried out uniformly on each frame in the group direction, the reconstruction quality for key frames may be reduced. Therefore, to reduce the influence on the key frames caused by the inaccuracy of the transformation coefficients for the non-key frames, the weight of the transformation coefficients of the key frames should be increased during the group shrinkage step, and the weight for the non-key frames should be reduced. For this reason, the weight $W$ appears in the update formulas for the two group directions in the $Z$ subproblem. For a single video frame in the current group, the shrinkage formulas for the coefficient components for $Z_1$ and $Z_2$ are as follows:

$$ z_1^{j} = \max\left\{ \left\| w_j f_1^{j} \right\|_{2,\mathrm{TV}} - \frac{\tau_1}{\beta_1}, \, 0 \right\} \frac{w_j f_1^{j}}{\left\| w_j f_1^{j} \right\|_{2,\mathrm{TV}}} \tag{26} $$

$$ z_2^{j} = \max\left\{ \left\| w_j f_2^{j} \right\|_{2,\mathrm{DWT}} - \frac{\tau_2}{\beta_2}, \, 0 \right\} \frac{w_j f_2^{j}}{\left\| w_j f_2^{j} \right\|_{2,\mathrm{DWT}}} \tag{27} $$

where $w_j$ is the $j$-th component of $W$ and represents the weight of frame $j$. The $f_i^{j} := D_i x_j - \frac{1}{\beta_i}\lambda_i^{j}$ represent the components of the shrinking transformation coefficients in the TV domain ($i = 1$) and the DWT domain ($i = 2$) for the $j$-th frame during the iterative process. The Lagrange multipliers are updated as follows:
λk+1 = λk0 − β0 (ΦXk+1 − Y) 0
(28)
λk+1 = λk1 − β1 (D1 Xk+1 − Z1 ) 1
(29)
λk+1 = λk2 − β2 (D2 Xk+1 − Z2 ) 2
(30)
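The per-frame shrinkage of Eqs. (26) and (27) and the multiplier updates of Eqs. (29) and (30) can be illustrated with a minimal sketch. It assumes that each group is a flat list of coefficients and that the group norm is the plain Euclidean norm; the function names `group_shrink` and `update_multiplier` and all numeric values are hypothetical and are not taken from the authors' implementation.

```python
import math

def group_shrink(f, w, tau, beta):
    """Weighted group soft-thresholding of one coefficient group f (cf. Eqs. (26)-(27)).

    Assumption: the group norm is the Euclidean norm of the weighted group w * f.
    """
    wf = [w * v for v in f]
    norm = math.sqrt(sum(v * v for v in wf))
    if norm == 0.0:
        return [0.0] * len(f)
    # Shrink the norm by tau / beta, clipping at zero, and keep the direction.
    scale = max(norm - tau / beta, 0.0) / norm
    return [scale * v for v in wf]

def update_multiplier(lam, beta, dx, z):
    """Element-wise lambda <- lambda - beta * (D X - Z) (cf. Eqs. (29)-(30))."""
    return [l - beta * (d - zz) for l, d, zz in zip(lam, dx, z)]

# Identical coefficients, once weighted as a key frame and once as a non-key frame.
f = [3.0, 4.0]  # Euclidean norm 5
z_key = group_shrink(f, w=1.0, tau=0.7, beta=1.0)
z_non_key = group_shrink(f, w=0.1, tau=0.7, beta=1.0)
lam_next = update_multiplier([1.0], beta=2.0, dx=[3.0], z=[2.5])
```

With these illustrative values, the key-frame group survives the threshold with a slightly reduced norm, whereas the down-weighted non-key group falls below the threshold and is set to zero, which is the intended effect of the weight W.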
In addition, to ensure accurate recovery, the penalty parameters in the external loop are updated from small to large values using

$$\beta' = \min(\eta\beta, \beta_{\max}) \tag{31}$$
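As a small illustration of the continuation rule in Eq. (31), the sketch below applies the update component-wise to a list of three penalty parameters. The starting values, the factor η = 2 and the caps are hypothetical and chosen only to make the saturation visible.

```python
def update_penalties(beta, eta, beta_max):
    """Component-wise penalty update: each beta grows by eta until it reaches its cap."""
    return [min(eta * b, b_max) for b, b_max in zip(beta, beta_max)]

beta = [1.0, 1.0, 1.0]          # hypothetical initial penalties
beta_max = [8.0, 4.0, 4.0]      # hypothetical caps
history = [beta]
for _ in range(4):              # four external-loop iterations
    beta = update_penalties(beta, eta=2.0, beta_max=beta_max)
    history.append(beta)
```

Each entry of `history` holds the penalty vector after one more external iteration, so the growth and the eventual saturation at `beta_max` can be inspected directly.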
where η > 1 is a proportionality coefficient, the old penalty parameters are represented by $\beta := [\beta_0, \beta_1, \beta_2]^T$, and $\beta_{\max}$ defines the maximum values of β. Note that the ALM-ADMM algorithm discussed above can be viewed as solving a structural optimization problem with three separable variables (X, Z1, and Z2) or can be simplified to a two-variable problem (X and Z) simply by using the matrix $Z = \begin{bmatrix} Z_1 \\ Z_2 \end{bmatrix}$ and representing the weight in the objective function as $\tau_Z = \begin{bmatrix} \tau_1 \\ \tau_2 \end{bmatrix}$.
Algorithm 1 summarizes the SGS optimization algorithm described above to solve Problem (13).

Algorithm 1. Algorithm for solving the SGS optimization problem
Input: Y, Φ
Output: X
Initialization: k = 0, $X^0 = \Phi^T Y$
Iterative process:
While (external loop cutoff condition is not met) Do
    While (internal loop cutoff condition is not met) Do the (k + 1)-th iteration:
        (1) solve the X subproblem:
            1) calculate the subgradient of $L_A$ via Eq. (19)
            2) calculate the BB step $\rho_k$ via Eq. (20)
            3) update X via Eq. (21)
        (2) solve the Z subproblem:
            1) threshold shrinkage along group direction 1 via Eq. (24)
            2) threshold shrinkage along group direction 2 via Eq. (25)
        (3) update the Lagrange multipliers via Eqs. (28), (29) and (30)
    End Do (of internal loop)
    Update the penalty parameters via Eq. (31)
End Do (of external loop)
Notably, the threshold shrinkage method along the two group directions reflects the SGS scheme in the joint TV and DWT domains, in contrast to the GS scheme in a single domain [39, 40, 41]. In the case of dependent reconstruction for Problem (14), the solution for the Z subproblem in Algorithm 1 can be simplified to threshold shrinkage [33] without group direction in two domains.
3.3. Interframe prediction

In view of the suitability of coarse-to-fine OF estimation for large-displacement motion and its guaranteed convergence [27, 43], we adopt it in the interframe prediction of non-key frames. Then, the residual estimation and compensation technology [10, 15] is introduced to improve the accuracy of interframe prediction. For visualization purposes, this section mainly presents the interframe prediction process to solve Problem (16) in the form of a flow chart. One-way arrows are used to connect each step in order. The quantities whose symbols appear on either side of an arrow serve as the input to the next step; those on the left side of the arrow come from the input end, whereas those on the right come from the output of the previous step.

3.3.1. Initial prediction

According to the relative positions between the predicted frame (the j-th frame) and the reference frames, the OF distortion coefficient a is defined as

$$a := \frac{d_{1j}}{d_{12}} \tag{32}$$
where d1 j represents the distance between the j-th frame and the first reference frame and d12 represents the distance between the two reference frames. As shown in Fig. 3, the interframe prediction process for non-key frames consists of two stages: bidirectional OF estimation and residual estimation and compensation. The detailed steps are described as follows.
Fig. 3. Initial prediction for non-key frame reconstruction.
(1) OF distortion coefficient calculation: Suppose that the OF vectors of all video frames in the same GOP are evenly distributed. With the two reconstructed key frames, $\tilde{x}_{k1}$ and $\tilde{x}_{k2}$ (frames 1 and GOPsize+1), as references, the OF distortion coefficient a of frame j can be estimated from their relative distance:

$$a = \frac{j - 1}{\mathrm{GOPsize}} \tag{33}$$

(2) Forward and backward OF estimation: Based on $\tilde{x}_{k1}$, $\tilde{x}_{k2}$ and a, OF estimation is performed separately in the forward and backward directions to predict two groups of non-key frames, $x_{of1}$ and $x_{of2}$.

(3) Image fusion: The predicted frames $x_{of1}$ and $x_{of2}$ are fused in the wavelet domain to yield the updated image estimates $x_{fus}$.

(4) Residual measured value calculation: The residual measured values $y_r$ of the non-key frames are obtained by subtracting the estimated measured values $\Phi_{nk} x_{fus}$ from the corresponding measured values $y_{nk}$.

(5) Residual reconstruction and compensation: The residuals are estimated by means of the operation BCS-SPL(·) [26] and are compensated by adding each residual value to the corresponding previously estimated value $x_{fus}$, thereby producing the initial non-key frame prediction $\tilde{x}_{nk0}$.

The detailed process of forward or backward OF estimation in the second step is shown in Fig. 4.
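Steps (1) and (4) above are simple enough to sketch directly. The following minimal example computes the coefficient of Eq. (33) for the non-key frames of a GOP and forms residual measurements $y_r = y_{nk} - \Phi_{nk} x_{fus}$. The function names are hypothetical, the measurement matrix is represented as a plain list of rows, and the small numeric values are illustrative only.

```python
from fractions import Fraction

def distortion_coefficient(j, gop_size):
    """Eq. (33): a = (j - 1) / GOPsize for the 1-based frame index j."""
    return Fraction(j - 1, gop_size)

def residual_measurements(y_nk, phi_nk, x_fus):
    """Step (4): y_r = y_nk - Phi_nk x_fus, with Phi_nk given as a list of rows."""
    return [y - sum(p * x for p, x in zip(row, x_fus))
            for y, row in zip(y_nk, phi_nk)]

# Coefficients for the non-key frames 2..8 of a GOP with GOPsize = 8.
a_values = {j: distortion_coefficient(j, 8) for j in range(2, 9)}

# Toy 2 x 3 measurement matrix and fused estimate (illustrative values).
phi = [[1.0, 0.0, 1.0],
       [0.0, 2.0, 0.0]]
y_r = residual_measurements([5.0, 4.0], phi, [1.0, 1.5, 3.0])
```

The closer the fused prediction is to the true frame, the smaller the entries of `y_r` become, which is what makes the subsequent residual reconstruction in step (5) easier.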
Fig. 4. OF estimation process.
First, nlevel-layer Gaussian pyramids are built for the two reference frames ($\tilde{x}_{k1}$ and $\tilde{x}_{k2}$), and reduced images ($\tilde{x}_{dk1}$ and $\tilde{x}_{dk2}$) are obtained by downsampling from top to bottom. Second, the corresponding feature images ($f_{dk1}$ and $f_{dk2}$) are generated by means of gradient or other operators. Then, the OF vector $(v_x, v_y)$ between the two frames is calculated by means of flow field smoothing, which may be performed using the IRLS method [27, 43] with the objective function defined in Eq. (10). Finally, the predicted non-key frame, $x_{of}$, is constituted via bicubic interpolation, referring to ($\tilde{x}_{k1}$, $\tilde{x}_{k2}$) and their OF vector $(v_x, v_y)$, as well as the OF distortion coefficient a of the current frame. For example, for a GOPsize of 8, the initial reference frames are frames 1 and 9. As shown in Fig. 5, the 1st frame is warped to the 9th frame to obtain the OF vector $(v_x, v_y)$ between the two key frames and to predict the position of a given point $(P_x, P_y)$ in frame 1 when projected to frame 9. Clearly, the closer to each other the reference frames are, the better the OF estimation effect will be. Considering that not all pixels in the same GOP may move at a uniform speed, the successive approximation of adjacent frames is adopted when estimating the non-key frames. Thus, all non-key frames between frames 1 and 9 are estimated in turn based on the OF vector $(a v_x, a v_y)$. As the relative positions of the reference and non-key frames change, the OF vector between reference frames and the value of a are adjusted accordingly.
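The role of the scaled OF vector $(a v_x, a v_y)$ can be seen on a single point. The sketch below assumes, as a simplification, that a point moves linearly along the flow vector between the two reference frames; the function name `project_point` and the numbers are hypothetical, and actual frame prediction additionally requires interpolation over all pixels.

```python
def project_point(p, v, a):
    """Displace point p = (P_x, P_y) by the scaled flow a * (v_x, v_y)."""
    return (p[0] + a * v[0], p[1] + a * v[1])

p = (10.0, 20.0)   # point in frame 1 (illustrative)
v = (8.0, -4.0)    # OF vector from frame 1 to frame 9 (illustrative)
in_frame_2 = project_point(p, v, a=1 / 8)
in_frame_9 = project_point(p, v, a=1.0)
```

With a = 1 the point lands at its position in the second reference frame, and intermediate values of a place it proportionally along the motion path.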
Fig. 5. Warping frames 2 and 8 based on the OF vectors between frames 1 and 9.
Let us further elaborate on the update process for the forward or backward OF distortion coefficients in a GOP. As shown in Fig. 5, in the OF estimation process for frame 2, reconstructed frames 1 and 9 are taken as references. Forward warping is performed from frame 1 to frame 2; in accordance with Eqs. (32) and (33), the forward-estimated a for frame 2 is set to 1/8, which means that the forward-estimated OF vector is $(\tfrac{1}{8} v_x, \tfrac{1}{8} v_y)$. In addition, backward warping is performed from frame 9 to frame 2; the backward-estimated a for frame 2 is set to 7/8, which means that the backward-estimated OF vector is $(\tfrac{7}{8} v_x, \tfrac{7}{8} v_y)$. Similarly, the forward- and backward-estimated a values for frame 8 are 7/8 and 1/8; consequently, the forward- and backward-estimated OF vectors for frame 8 are $(\tfrac{7}{8} v_x, \tfrac{7}{8} v_y)$ and $(\tfrac{1}{8} v_x, \tfrac{1}{8} v_y)$, respectively.
Subsequently, reconstructed frames 2 and 8 are taken as references for further forward estimation; a is set to 1/6 when estimating frame 3 and to 5/6 when estimating frame 7. Then, with reconstructed frames 3 and 7 as references, a is set to 1/4 and 3/4 when estimating frames 4 and 6, respectively. Finally, a = 1/2 is used when estimating frame 5 from previously reconstructed frames 4 and 6. For backward OF estimation or GOPs of other sizes, a is evaluated similarly. The OF estimation method used in this paper, in which approximation is sequentially performed from the two ends to the middle, is called the successive approximation estimation method. When the OF estimation results are more similar to the current frame, the residuals for the frame will be much sparser, which will result in a better optimization solution. The subsequent residual estimation and compensation steps are performed to improve the accuracy of the initial interframe predictions.
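The ends-to-middle ordering described above can be enumerated with a short sketch. The function name `successive_schedule` and the tuple layout are illustrative. The forward coefficients reproduce the values quoted in the text for a GOPsize of 8; the backward coefficients after the first stage are assumed to be the complement 1 − a of the forward ones, which the text only implies.

```python
from fractions import Fraction

def successive_schedule(first, last):
    """List (frame, left_ref, right_ref, forward_a, backward_a), from the ends inward."""
    schedule = []
    left, right = first, last
    while right - left > 1:
        span = right - left
        lo = left + 1   # frame adjacent to the left reference
        hi = right - 1  # frame adjacent to the right reference
        schedule.append((lo, left, right, Fraction(1, span), Fraction(span - 1, span)))
        if hi > lo:
            schedule.append((hi, left, right, Fraction(span - 1, span), Fraction(1, span)))
        # The frames just estimated become the references of the next stage.
        left, right = lo, max(hi, lo)
    return schedule

schedule = successive_schedule(1, 9)  # key frames 1 and 9, GOPsize = 8
order = [entry[0] for entry in schedule]
forward_a = {entry[0]: entry[3] for entry in schedule}
```

For key frames 1 and 9, `order` lists the non-key frames as 2, 8, 3, 7, 4, 6, 5, so every frame is predicted from references that are at most one frame closer to the ends than itself.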
3.3.2. Iterative reprediction

To further improve the predictions, unidirectional OF reprediction is performed based on the initial estimates for each non-key frame, $\tilde{x}_{nk0}$, and the nearest reconstructed key frame, $\tilde{x}_k$. The values for residual estimation and compensation are iteratively updated, and the corresponding reconstructed non-key frame $\tilde{x}_{nk}$ is output, until a specified cutoff condition is met (e.g., a maximum number of iterations or a certain relative error), as shown in Fig. 6. Note that to improve the prediction efficiency relative to the bidirectional OF estimation performed in the initial prediction step, the reprediction operation relies only on unidirectional OF estimation from a nearby frame. For example, the OF estimation for frame 2 refers only to frame 1, and that for frame 8 refers only to frame 9. The specific OF estimation process is similar to that depicted in Fig. 4 with a = 1, and therefore, its description is not repeated here.
Fig. 6. Iterative reprediction based on OF and residual estimation.
4. Experimental results and analysis

For convenience of expression, the proposed algorithm based on structural group sparsity and optical-flow estimation is abbreviated to SGS-OF. To verify the performance of the proposed recovery algorithm, this section compares experimental results obtained with this algorithm and with other current video CS reconstruction algorithms. Since the previous experiments reported in [9, 10, 16, 17, 18, 21, 45] focused only on the brightness component, the comparisons presented in this paper are likewise restricted to the brightness component. In addition, the complexity and convergence of the proposed algorithm are discussed.
4.1. Experimental deployment and parameter selection

To validate the performance of the proposed SGS-OF for inconsistent measurements, the MVI [45], ME/MC¹ [10], MH² [9] and RRS³ [16] algorithms were chosen as reference algorithms. For compatibility with the C++ compilation environment used for the interframe estimation process in ME/MC and SGS-OF, we adopted a mixed compilation environment consisting of MATLAB 2010 and Visual Studio 2008. The experimental platform was a desktop computer with an Intel® Xeon® E3-1230 v3 CPU (@ 3.30 GHz), 16.00 GB of memory, and the Windows 7 operating system. The original data were taken from 6 classic CIF (288 × 352) video sequences: Paris, Foreman, Coastguard, Hall, Mobile and News. The brightness components of the first 17 frames of each original video sequence were evaluated. The GOPsize was set to 8, with the 1st, 9th and 17th frames being key frames and the remaining frames being non-key frames. A 32 × 32 block-based Gaussian random matrix (abbreviated as Gauss in the following figures and tables) and a partial discrete cosine transform (PDCT) matrix [48] were separately adopted as measurement matrices.

Before presenting the experiments, it is worth discussing the parameter selection for SGS-OF. The initial penalty parameters and weights in Problem (17) are set by experience as follows: $\beta_0 = 2^7$, $\beta_1 = \beta_2 = 2^4$, $\tau_1 = 0.7$, $\tau_2 = 0.3$, and $W = (1, 0.1, \ldots, 0.1, 1)$,
where the penalty parameters ($\beta_0$, $\beta_1$ and $\beta_2$) and weight factors ($\tau_1$ and $\tau_2$) reflect the importance of the data accuracy and sparsity in the gradient and wavelet domains. Moreover, the larger the parameter $\tau_1$ is, the smoother the recovered frame may become. The larger the parameter $\tau_2$ is, the better the texture is preserved. In addition, W reflects the importance of the key and non-key frames. Clearly, the weight factors of the key frames on both sides of a processing unit are higher than those of the non-key frames.

¹ http://www.ece.msstate.edu/~fowler/BCSSPL/mc-bcs-spl-1.1-1.tar.gz
² http://www.ece.msstate.edu/~fowler/BCSSPL/mh-bcs-spl-1.0-1.tar.gz
³ https://github.com/jianzhangcs/CS-video-reconstruction-RRS
4.2. Experimental results

4.2.1. Overall test

In the overall test section, we focus on the experimental results of inconsistent measurement with a high sampling rate for key frames and a low sampling rate for non-key frames. Without loss of generality, we set the sampling rates of the key and non-key frames to 0.70 and 0.10, respectively. Figs. 7 and 8 show the peak signal-to-noise ratios (PSNRs) and multi-scale structural similarities (MS-SSIMs) of the first two reconstructed GOPs obtained from the above testing sequences using the five algorithms. Tables 1 and 2 list the average PSNRs and MS-SSIMs for different sequences.
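For reference, PSNR for 8-bit brightness data is conventionally computed as $10 \log_{10}(255^2 / \mathrm{MSE})$. The paper does not spell out its implementation, so the sketch below is the standard definition applied to flat lists of pixel values rather than the authors' code; the function name `psnr` and the sample values are illustrative.

```python
import math

def psnr(reference, reconstructed, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstructed)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(peak * peak / mse)

value = psnr([100.0, 100.0], [101.0, 99.0])  # mean squared error of 1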
Fig. 7. Comparison of objective quality based on Gaussian measurements. Panels: (a) Paris, (b) Foreman, (c) Coastguard, (d) Hall, (e) Mobile, (f) News.
Fig. 8. Comparison of objective quality based on PDCT measurements. Panels: (a) Paris, (b) Foreman, (c) Coastguard, (d) Hall, (e) Mobile, (f) News.
As shown in Fig. 7, when a random Gaussian matrix is used as the block measurement matrix, the average PSNR of MVI is the lowest, the PSNR of MH is slightly higher than that of ME/MC, and that of SGS-OF is superior to those of the first three reference algorithms in most cases. Although RRS can reconstruct the best key frames, the objective quality of its non-key frames is not stable. For Foreman and Coastguard, the average PSNR of RRS is the highest; for the other sequences, however, its PSNR is reduced because of the low quality of the reconstructed non-key frames. Similarly, SGS-OF achieves the highest MS-SSIM in most cases, although it is slightly lower than that of RRS for the Foreman sequence. As shown in Fig. 8, when a PDCT matrix is used instead as the block measurement matrix, SGS-OF has more substantial advantages. The PSNRs and MS-SSIMs of the key frames reconstructed by SGS-OF become close to the best performance achieved by the RRS algorithm, and those of the non-key frames are also close to the highest. Because the frame-by-frame MVI reconstruction does not adopt an interframe prediction strategy, its overall performance
cannot match that of the other algorithms. Tables 1 and 2 show that SGS-OF achieves an average PSNR gain of 1–3 dB and an average MS-SSIM gain of 0.01–0.03 over the reference algorithms with interframe prediction [9, 10, 16]. Fig. 9 shows the original image corresponding to a certain non-key frame from each of the first two sequences (e.g., Frame 2 from Paris and Foreman). Figs. 10 and 11 show the subjective recovery effect for the second frames from Paris and Foreman as achieved using the above algorithms.

Table 1. Comparison of average reconstructed quality based on Gauss measurements.

(a) PSNR (unit: dB)

Sequence name   ME/MC    MH       RRS      SGS-OF
Paris           26.54    27.44    25.46    29.37
Foreman         32.38    34.47    35.36    34.43
Coastguard      27.90    29.69    29.85    29.40
Hall            34.09    34.89    31.33    35.70
Mobile          21.79    22.57    20.99    24.87
News            34.02    34.22    30.37    35.00
Average         29.45    30.55    28.89    31.46

(b) MS-SSIM

Sequence name   ME/MC    MH       RRS      SGS-OF
Paris           0.9528   0.9555   0.9485   0.9774
Foreman         0.9692   0.9793   0.9849   0.9827
Coastguard      0.8992   0.9332   0.9195   0.9351
Hall            0.9867   0.9868   0.9809   0.9890
Mobile          0.8988   0.9014   0.8384   0.9619
News            0.9870   0.9863   0.9856   0.9915
Average         0.9489   0.9571   0.9430   0.9729
Table 2. Comparison of average reconstructed quality based on PDCT measurements.

(a) PSNR (unit: dB)

Sequence name   ME/MC    MH       RRS      SGS-OF
Paris           28.98    30.25    27.03    32.00
Foreman         34.85    35.54    36.43    37.20
Coastguard      31.06    31.94    31.20    32.40
Hall            37.95    38.14    35.63    40.30
Mobile          23.43    25.41    24.00    27.13
News            38.53    38.93    34.03    39.88
Average         32.47    33.37    31.39    34.82

(b) MS-SSIM

Sequence name   ME/MC    MH       RRS      SGS-OF
Paris           0.9880   0.9894   0.9852   0.9946
Foreman         0.9903   0.9926   0.9943   0.9952
Coastguard      0.9565   0.9630   0.9535   0.9692
Hall            0.9950   0.9941   0.9947   0.9953
Mobile          0.9494   0.9701   0.9608   0.9897
News            0.9970   0.9964   0.9965   0.9978
Average         0.9794   0.9843   0.9808   0.9903
Fig. 9. The original frames in the first two test sequences. Panels: (a) Paris, (b) Foreman.
Fig. 10. Comparison of subjective quality for a non-key frame at a sampling rate of 0.10 (Paris). Panels: (a) Gauss-ME/MC, (b) Gauss-MH, (c) Gauss-RRS, (d) Gauss-SGS-OF, (e) PDCT-ME/MC, (f) PDCT-MH, (g) PDCT-RRS, (h) PDCT-SGS-OF.
Fig. 11. Comparison of subjective quality for a non-key frame at a sampling rate of 0.10 (Foreman). Panels: (a) Gauss-ME/MC, (b) Gauss-MH, (c) Gauss-RRS, (d) Gauss-SGS-OF, (e) PDCT-ME/MC, (f) PDCT-MH, (g) PDCT-RRS, (h) PDCT-SGS-OF.
Figs. 10 and 11 show the recovery effect for a non-key frame (e.g., Frame 2) from each of the first two sequences as achieved using the above algorithms. When a random Gaussian matrix is used as the measurement matrix, the ME/MC and MH reconstructions of the face edges and eye areas result in block effects or ring phenomena to varying degrees. When reconstructed by RRS, the Paris frame exhibits obvious pasty shadows on the face and mouth, whereas the Foreman frame shows clearer and softer character lines; however, the words “SIEMENS” in the upper-left corner of the Foreman frame are blurred. In the SGS-OF reconstruction, the subjective performance for Paris is optimal because of its delicate display, and the reconstructed Foreman frame has a stronger
sense of hierarchy with clearer “SIEMENS” fonts than that of RRS, although the mouth area is relatively sharp. When the PDCT measurement matrix is used, the overall performance of the various algorithms improves; however, the left side of the face and the left eye in the Foreman frame recovered by ME/MC still show ring effects. The Foreman frame as reconstructed by MH shows horizontal stripes on the neck, and the contours of “SIEMENS” of the RRS-reconstructed frame are still fuzzy. By contrast, SGS-OF achieves a clear facial structure and images that retain a sense of depth, and it retains outline and textural information better than the other recovery algorithms. The high performance of the proposed algorithm is attributed not only to its use of the proposed SGS model but also to its pixel-level OF estimation and residual compensation.

The CPU time can be used as a reasonably direct approximation of the computational complexity. Table 3 shows the CPU times (unit: s) required for the four algorithms to reconstruct the first 17 frames of the Paris and Foreman video sequences when the sampling rates of the key and non-key frames are 0.70 and 0.10, respectively. As shown in Table 3, MH has the lowest time complexity, the time cost of ME/MC is 3–5 times that of MH, RRS has the longest CPU times, and SGS-OF has a time complexity slightly higher than that of MH.

Table 3. Complexity comparison in terms of reconstruction times (unit: s).

Measurement matrix   Sequence   ME/MC      MH        RRS         SGS-OF
Gauss                Paris      1713.46    265.04    19866.11    376.05
Gauss                Foreman    1149.45    291.39    19355.37    295.45
PDCT                 Paris       995.33    253.19    30348.04    294.47
PDCT                 Foreman     773.69    211.66    30495.96    227.36
Furthermore, Table 4 lists the average PSNRs of the reconstructed GOPs obtained from the 6 test sequences using different algorithms when the sampling rate of the non-key frames is set to 0.05, 0.10, 0.15 and 0.20. As shown in Table 4, SGS-OF is superior to the three reference algorithms in most cases, although RRS achieves the highest PSNR at the sampling rate of 0.20 for non-key frames. By combining the comparison results of objective and subjective quality, as well as the time complexity, we can conclude that compared with the existing algorithms, SGS-OF obtains both PSNR and MS-SSIM gains, and an optimal visual effect at a low time complexity.
Table 4. Comparison of average PSNRs at different sampling rates for non-key frames (unit: dB).

Measurement matrix   Sampling rate   ME/MC    MH       RRS      SGS-OF
Gauss                0.05            27.59    29.09    24.72    30.48
Gauss                0.10            29.45    30.55    28.89    31.46
Gauss                0.15            30.73    31.55    32.22    32.30
Gauss                0.20            31.67    31.48    34.42    32.99
PDCT                 0.05            30.26    32.02    28.09    33.46
PDCT                 0.10            32.47    33.37    31.39    34.82
PDCT                 0.15            34.51    34.66    34.59    35.71
PDCT                 0.20            35.82    35.48    36.55    36.56
4.2.2. Ablation study

From the results of the key frame reconstruction in the previous section, we can see the effect of SGS acting alone. To test the independent effect of the proposed OF prediction, we unify the reconstruction algorithm for key frames as MVI, and the four interframe prediction algorithms, ME/MC, MH, RRS and OF, are used for non-key frames. This section still takes the Paris and Foreman sequences as the research objects and tests the reconstructed quality of each algorithm in terms of PSNR and MS-SSIM. Due to the randomness of the Gauss matrix, PDCT is selected as the only measurement matrix. Fig. 12 shows the PSNRs and MS-SSIMs of the first 2 GOPs reconstructed by the above interframe prediction algorithms when the sampling rates of key frames and non-key frames are set to 0.50 and 0.10, respectively. Table 5 summarizes the average reconstruction quality of these two sequences.
Fig. 12. Comparison of objective quality for interframe prediction. Panels: (a) Paris, (b) Foreman.
Table 5. Comparison of average reconstructed quality for interframe prediction.

Metric      Sequence   ME/MC    MH       RRS      SGS-OF
PSNR (dB)   Paris      28.15    28.96    26.26    29.77
PSNR (dB)   Foreman    33.84    34.23    34.25    35.82
MS-SSIM     Paris      0.9882   0.9894   0.9857   0.9935
MS-SSIM     Foreman    0.9908   0.9929   0.9935   0.9948
As shown in Fig. 12 and Table 5, the PSNR and MS-SSIM of MH are higher than those of ME/MC and RRS for the Paris sequence, while the average MS-SSIM of RRS is superior to that of the first two reference algorithms for the Foreman sequence. Compared with these three reference algorithms, the proposed OF achieves an average PSNR gain of 0.8–3.5 dB and an average MS-SSIM gain of 0.001–0.008.

4.3. Algorithm complexity analysis

The complexity of the proposed algorithm is analyzed as follows. In the SGS model, if we neglect the costs of addition, subtraction, scalar multiplication and division, the computation in each processing unit (note that if every frame contains n pixels, then a processing unit contains N = (GOPsize+1)×n pixels in total, where, e.g., GOPsize = 8) is dominated by the calculation of the subgradient of $L_A$ and the BB step in the X subproblem as well as the norm in the Z subproblem. Specifically, the subgradient calculation in Eq. (19) requires N multiplications, the inner products in both the numerator and denominator of the BB step in Eq. (20) amount to 2N multiplications, and the norm calculations in Eqs. (24) and (25) add up to 2N multiplications. Hence, in the initial stage, the total complexity of reconstructing a processing unit corresponds to 5N multiplications, and the average complexity of reconstructing a frame corresponds to 5n multiplications. That is, the complexity of reconstructing one frame in the initial phase can be estimated as O(n).

In the interframe prediction stage, the computation for every non-key frame (containing n pixels) is dominated by the flow field smoothing in the OF estimation phase and the residual reconstruction in the residual estimation and compensation phase. As described in [43], the IRLS method used in flow field smoothing involves several derivatives and inner products of n-dimensional vectors. Consequently, it requires an equivalent of c×n multiplications, where c is a small positive integer.
Moreover, as illustrated in Fig. 4, the flow field smoothing calculation is included in a pyramid structure with a number of layers denoted by nlevels; hence, the total number of multiplications can be approximated as $\sum_{l=0}^{nlevels-1} sr^{l} \times c \times n = \frac{1 - sr^{nlevels}}{1 - sr} \times c \times n$, where 0 < sr < 1 is the downsampling ratio and nlevels is a small positive constant, e.g., nlevels = 5. The fast recovery algorithm used in the residual reconstruction process, BCS-SPL [26], consists of Wiener filtering, Landweber projection and hard thresholding. Its main
computational complexity lies in the last two steps, every iteration of which involves several multiplications of n-dimensional vectors. Therefore, the complexity of reconstructing one non-key frame in the interframe prediction stage can also be estimated to be O(n).
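The pyramid cost estimate above is a finite geometric series, and its closed form can be checked numerically. In the sketch below, the per-pixel constant c = 4 and the downsampling ratio sr = 0.5 are hypothetical values chosen for illustration; n is the pixel count of a CIF frame and nlevels = 5 follows the example in the text.

```python
def pyramid_cost(n, c, sr, nlevels):
    """Direct sum of sr**l * c * n over the pyramid levels l = 0 .. nlevels - 1."""
    return sum(sr ** l * c * n for l in range(nlevels))

def pyramid_cost_closed(n, c, sr, nlevels):
    """Closed form (1 - sr**nlevels) / (1 - sr) * c * n of the same series."""
    return (1 - sr ** nlevels) / (1 - sr) * c * n

n = 288 * 352  # pixels in one CIF frame
direct = pyramid_cost(n, c=4, sr=0.5, nlevels=5)
closed = pyramid_cost_closed(n, c=4, sr=0.5, nlevels=5)
bound = 4 * n / (1 - 0.5)  # limit of the series as nlevels grows
```

Because the series is bounded by c × n / (1 − sr) regardless of nlevels, the pyramid multiplies the single-level cost only by a constant factor, which is consistent with the O(n) estimate.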
4.4. Convergence explanation

The convergence of the proposed SGS model and OF estimation algorithm can be analyzed based on existing optimization theories, as mentioned in Section 2. Because the ALM-ADMM is utilized to solve the SGS problem, convergence is guaranteed by the theories of the ADMM [38] and separable convex programming [29, 30, 34]. Because the IRLS method, which is used in OF estimation, is equivalent to the conventional Euler–Lagrange variational approach [42], its convergence is guaranteed by that of the variational method [43].
5. Conclusions

In this paper, we introduce a mathematical representation based on intraframe sparsity and interframe correlations into video CS and propose a reconstruction algorithm based on an SGS model and a coarse-to-fine OF estimation method with successive approximation to improve reconstruction quality. This algorithm not only takes full advantage of the spatial and temporal correlations in videos to produce quality initial recoveries but also considers the video motion characteristics to further balance the reconstruction quality and computational complexity. The experimental results demonstrate that the proposed algorithm outperforms the state-of-the-art methods in terms of both objective and subjective performance.

For convenience in processing, a fixed GOPsize is assumed in the SGS algorithm as presented in this paper. In future work, automatic adjustment of the group positions and GOPsize during the reconstruction process, based on the degree of similarity of the video frames, can be studied further, and an interframe prediction technology with better estimation performance could be introduced to make the proposed approach applicable to a wider range of videos, such as those with sudden scene changes or fast mutations. In addition, the penalty parameters and weighting factors in the target problem are currently taken as empirical values, and the optimal selection of these parameters is also worth further study.
Acknowledgements

This work was supported by the National Natural Science Foundation of China (NSFC 61671153 and 11571074), the Natural Science Foundation of Fujian Province (2017J01757), and the Fuzhou University Fund (GXRC-17034).
References

[1] V. Sze, M. Budagavi, G.J. Sullivan, High Efficiency Video Coding (HEVC): Algorithms and Architectures, Springer, New York, 2014.
[2] P.L. Dragotti, M. Gastpar, Distributed Source Coding: Theory, Algorithms and Applications, Elsevier, Amsterdam, 2009.
[3] D.L. Donoho, Compressed sensing, IEEE Trans. Inform. Theory 52 (4) (2006) 1289–1306. https://doi.org/10.1109/TIT.2006.871582.
[4] M. Rani, S.B. Dhok, R.B. Deshmukh, A systematic review of compressive sensing: concepts, implementations and applications, IEEE Access 6 (2018) 4875–4894. https://doi.org/10.1109/ACCESS.2018.2793851.
[5] G. Coluccia, C. Ravazzi, E. Magli, Compressed Sensing for Distributed Systems, Springer, New York, 2015.
[6] R.G. Baraniuk, T. Goldstein, A.C. Sankaranarayanan, C. Studer, A. Veeraraghavan, M.B. Wakin, Compressive video sensing: algorithms, architectures, and applications, IEEE Signal Process. Mag. 34 (1) (2017) 52–66. https://doi.org/10.1109/MSP.2016.2602099.
[7] J. Prades-Nebot, M. Yi, T. Huang, Distributed video coding using compressive sampling, in: 2009 Picture Coding Symposium, IEEE, Chicago, 2009, pp. 1–4.
[8] H. Chen, L. Kang, C. Lu, Dictionary learning-based distributed compressive video sensing, in: 28th Picture Coding Symposium, IEEE, Nagoya, 2010, pp. 210–213.
[9] E.W. Tramel, J.E. Fowler, Video compressed sensing with multihypothesis, in: 2011 Data Compression Conference, IEEE, Snowbird, 2011, pp. 193–202.
[10] S. Mun, J.E. Fowler, Residual reconstruction for block-based compressed sensing of video, in: Proceedings of 2011 Data Compression Conference, IEEE, Snowbird, 2011, pp. 183–192.
[11] A.C. Sankaranarayanan, C. Studer, R.G. Baraniuk, CS-MUVI: video compressive sensing for spatial-multiplexing cameras, in: Proceedings of 2012 IEEE International Conference on Computational Photography (ICCP), IEEE, Seattle, 2012, pp. 1–10.
[12] A.C. Sankaranarayanan, L. Xu, C. Studer, Y. Li, K.F. Kelly, R.G. Baraniuk, Video compressive sensing for spatial multiplexing cameras using motion-flow models, SIAM J. Imaging Sci. 8 (3) (2015) 1489–1518. https://doi.org/10.1137/140983124.
[13] M. Azghani, M. Karimi, F. Marvasti, Multihypothesis compressed video sensing technique, IEEE Trans. Circuits Syst. Video Technol. 26 (4) (2016) 627–635. https://doi.org/10.1109/TCSVT.2015.2418586.
[14] X. Ding, W. Chen, I.J. Wassell, Compressive sensing reconstruction for video: an adaptive approach based on motion estimation, IEEE Trans. Circuits Syst. Video Technol. 27 (7) (2017) 1406–1420. https://doi.org/10.1109/TCSVT.2016.2540073.
[15] W. Li, C. Yang, L. Ma, A multihypothesis-based residual reconstruction scheme in compressed video sensing, in: 2017 IEEE International Conference on Image Processing (ICIP), IEEE, Beijing, 2017, pp. 2766–2770.
[16] C. Zhao, S. Ma, J. Zhang, R. Xiong, W. Gao, Video compressive sensing reconstruction via reweighted residual sparsity, IEEE Trans. Circuits Syst. Video Technol. 27 (6) (2017) 1182–1195. https://doi.org/10.1109/TCSVT.2016.2527181.
[17] M.F. Duarte, R.G. Baraniuk, Kronecker compressive sensing, IEEE Trans. Image Process. 21 (2) (2012) 494–504. https://doi.org/10.1109/TIP.2011.2165289.
[18] S. Friedland, Q. Li, D. Schonfeld, Compressive sensing of sparse tensors, IEEE Trans. Image Process. 23 (10) (2014) 4438–4447. https://doi.org/10.1109/TIP.2014.2348796.
[19] C.F. Caiafa, A. Cichocki, Stable, robust, and super fast reconstruction of tensors using multi-way projections, IEEE Trans. Signal Process. 63 (3) (2015) 780–793. https://doi.org/10.1109/TSP.2014.2385040.
[20] M.S. Hosseini, K.N. Plataniotis, High-accuracy total variation with application to compressed video sensing, IEEE Trans. Image Process. 23 (9) (2014) 3869–3884. https://doi.org/10.1109/TIP.2014.2332755.
[21] G. Chen, G. Li, J. Zhang, Tensor compressed video sensing reconstruction by combination of fractional-order total variation and sparsifying transform, Signal Process. Image Commun. 55 (2017) 146–156. https://doi.org/10.1016/j.image.2017.03.021.
[22] N. Vaswani, W. Lu, Modified-CS: modifying compressive sensing for problems with partially known support, IEEE Trans. Signal Process. 58 (9) (2010) 4595–4607. https://doi.org/10.1109/TSP.2010.2051150.
[23] A. Majumdar, R.K. Ward, T. Aboulnasr, Compressed sensing based real-time dynamic MRI reconstruction, IEEE Trans. Med. Imaging 31 (12) (2012) 2253–2266. https://doi.org/10.1109/TMI.2012.2215921.
[24] S.G. Lingala, E. DiBella, M. Jacob, Deformation corrected compressed sensing (DC-CS): a novel framework for accelerated dynamic MRI, IEEE Trans. Med. Imaging 34 (1) (2015) 72–85. https://doi.org/10.1109/TMI.2014.2343953.
[25] G. Lu, Block compressed sensing of natural images, in: 2007 15th International Conference on Digital Signal Processing, IEEE, Cardiff, 2007, pp. 403–406.
[26] S. Mun, J.E. Fowler, Block compressed sensing of images using directional transforms, in: 2009 16th IEEE International Conference on Image Processing (ICIP), IEEE, Cairo, 2009, pp. 3021–3024.
[27] C. Liu, Introduction to dense optical flow, in: T. Hassner, C. Liu (Eds.), Dense Image Correspondences for Computer Vision, Springer International Publishing, Cham, 2016, pp. 3–14.
[28] C. Li, W. Yin, H. Jiang, Y. Zhang, An efficient augmented Lagrangian method with applications to total variation minimization, Comput. Optim. Appl. 56 (3) (2013) 507–530. https://doi.org/10.1007/s10589-013-9576-1.
[29] G. Gu, B. He, X. Yuan, Customized proximal point algorithms for linearly constrained convex minimization and saddle-point problems: a unified approach, Comput. Optim. Appl. 59 (1–2) (2014) 135–161. https://doi.org/10.1007/s10589-013-9616-x.
[30] B. He, M. Tao, X. Yuan, A splitting method for separable convex programming, IMA J. Numer. Anal. 35 (1) (2015) 394–426. https://doi.org/10.1093/imanum/drt060.
[31] J. Wen, Z. Zhou, J. Wang, X. Tang, Q. Mo, A sharp condition for exact support recovery with orthogonal matching pursuit, IEEE Trans. Signal Process. 65 (6) (2017) 1370–1382. https://doi.org/10.1109/TSP.2016.2634550.
[32] J. Wen, H. Chen, Z. Zhou, An optimal condition for the block orthogonal matching pursuit algorithm, IEEE Access 6 (2018) 38179–38185. https://doi.org/10.1109/ACCESS.2018.2853158.
[33] H. Liu, R.F. Barber, Between hard and soft thresholding: optimal iterative thresholding algorithms, Statistics 2 (2018) 1–29.
[34] B. He, X. Yuan, On the O(1/n) convergence rate of the Douglas–Rachford alternating direction method, SIAM J. Numer. Anal. 50 (2) (2012) 700–709. https://doi.org/10.1137/110836936.
[35] A. Bolstad, B.D. Van Veen, R. Nowak, Causal network inference via group sparse regularization, IEEE Trans. Signal Process. 59 (6) (2011) 2628–2641. https://doi.org/10.1109/TSP.2011.2129515.
[36] W. Deng, W. Yin, Y. Zhang, Group sparse optimization by alternating direction method, in: Proceedings of SPIE - The International Society for Optical Engineering, San Diego, vol. 8858, 2013, p. 88580R. https://doi.org/10.1117/12.2024410.
[37] Z. Zha, X. Zhang, Q. Wang, L. Tang, X. Liu, Group-based sparse representation for image compressive sensing reconstruction with non-convex regularization, Neurocomputing 296 (2018) 55–63. https://doi.org/10.1016/j.neucom.2018.03.027.
[38] R. Glowinski, P. Le Tallec, Augmented Lagrangian and Operator-splitting Methods in Nonlinear Mechanics, Society for Industrial and Applied Mathematics (SIAM), Philadelphia, 1989.
[39] A. Majumdar, R.K. Ward, T. Aboulnasr, Algorithms to approximately solve NP hard row-sparse MMV recovery problem: application to compressive color imaging, IEEE J. Emerg. Sel. Top. Circuits Syst. 2 (3) (2012) 362–369. https://doi.org/10.1109/JETCAS.2012.2212774.
[40] E.A. Bernal, Q. Li, Tensorial compressive sensing of jointly sparse matrices with applications to color imaging, in: 2017 IEEE International Conference on
Jian Chen et al. / Journal of Visual Communication and Image Representation (2019)
33
Image Processing (ICIP), IEEE, Beijing, 2017, pp. 2781–2785. [41] Q. Li, E.A. Bernal, An algorithm for parallel reconstruction of jointly sparse tensors with applications to hyperspectral imaging, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), IEEE, Honolulu, 2017, pp. 218–225. [42] B.K.P. Horn, B.G. Schunck, Determining optical flow, Artif. Intell. 17 (1-3) (1981) 185–203. https://doi.org/10.1016/0004-3702(81)90024-2. [43] C. Liu, Beyond pixels: exploring new representations and applications for motion analysis, Doctoral thesis, Massachusetts Institute of Technology, Cambridge, 2009. [44] C. Bailer, B. Taetz, D. Stricker, Coarse-to-fine PatchMatch for dense correspondence, IEEE Trans. Circ. Syst. Video Technol. 28 (9) (2018) 2233–2244. https://doi.org/10.1109/TCSVT.2017.2720175. [45] K. Su, J. Chen, W. Wang, L. Su, Reconstruction algorithm for block-based compressed sensing based on mixed variational inequality, Multimedia Tools Appl. 75 (23) (2016) 16417–16438. https://doi.org/10.1007/s11042-015-2975-9. [46] J. Barzilai, J.M. Borwein, Two-point step size gradient methods, IMA J. Numer. Anal. 8 (1) (1988) 141–148. https://doi.org/10.1093/imanum/8.1. 141. [47] H. Zhang, W.W. Hager, A nonmonotone line search technique and its application to unconstrained optimization, SIAM J. Optim. 14 (4) (2004) 1043–1056. https://doi.org/10.1137/S1052623403428208. [48] J. Chen, K. Su, Z. Peng, L. Su, Measurement method of block compressed sensing based on partial trigonometric function transform matrices, Oper. Res. Trans. 19 (4) (2015) 59–71.
To balance reconstruction quality against computational complexity, we introduce a structural group sparsity model for the initial reconstruction phase and propose a weight-based group sparse optimization algorithm that operates in joint domains. For the interframe prediction stage, we then introduce a coarse-to-fine optical flow estimation model with successive approximation, which recovers non-key frames by alternating between optical flow estimation and residual sparse reconstruction. Experimental results show that the proposed algorithm outperforms state-of-the-art methods in both objective and subjective reconstruction quality.
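As an illustration of the kind of operation a weight-based group sparse optimization typically relies on, the following is a minimal sketch of weighted group soft-thresholding, i.e. the proximal operator of a weighted sum of group-wise Euclidean norms. It is not the authors' algorithm: the function name `group_soft_threshold`, the per-group weights, the threshold `tau`, and the way coefficients are grouped are all assumptions made for this example, since this passage does not specify them.

```python
import math
from typing import List, Sequence


def group_soft_threshold(
    groups: Sequence[Sequence[float]],
    weights: Sequence[float],
    tau: float,
) -> List[List[float]]:
    """Shrink each coefficient group toward zero by tau * weight (hypothetical helper).

    Implements the proximal operator of tau * sum_g w_g * ||x_g||_2,
    assuming tau and all weights are non-negative.
    """
    if len(groups) != len(weights):
        raise ValueError("one weight per group is required")

    shrunk: List[List[float]] = []
    for coeffs, weight in zip(groups, weights):
        norm = math.sqrt(sum(c * c for c in coeffs))
        threshold = tau * weight
        if norm <= threshold:
            # The whole group is set to zero, which yields group-level sparsity.
            shrunk.append([0.0 for _ in coeffs])
        else:
            # The group keeps its direction; its norm is reduced by the threshold.
            scale = 1.0 - threshold / norm
            shrunk.append([scale * c for c in coeffs])
    return shrunk


if __name__ == "__main__":
    # Two toy groups: one with large energy, one with small energy.
    print(group_soft_threshold([[3.0, 4.0], [0.3, 0.4]], [1.0, 1.0], 1.0))
```

In this toy call the first group (norm 5) is scaled down to roughly `[2.4, 3.2]`, while the second group (norm 0.5, below the threshold) is zeroed entirely. Larger weights would suppress a group more aggressively, which is one plausible way a weighting scheme could favor some groups over others.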