Pattern Recognition Letters xxx (2013) xxx–xxx
Pattern Recognition Letters journal homepage: www.elsevier.com/locate/patrec
Accurate spatio-temporal reconstruction of missing data in dynamic scenes
Margarita Favorskaya ⇑, Mikhail Damov, Alexander Zotin
Siberian State Aerospace University, Av. Krasnoayrsky Rabochy, 31, Krasnoyarsk 660014, Russian Federation
Article info
Article history: Available online xxxx
Communicated by Manuel Graña
Keywords: Image inpainting; Missing data; Texture analysis; Neural network; Video reconstruction
Abstract
This paper proposes an accurate method for texture reconstruction in dynamic scenes that contain non-desirable moving objects. The task belongs to off-line editing functions, so the main criteria are the accuracy and the visual quality of the reconstructed results. The method is based on spatio-temporal analysis and includes two stages. The first stage uses feature point tracking to locate rigid objects accurately under the assumption of an affine motion model. The second stage reconstructs the video sequence using texture maps of smoothness, structural properties, and isotropy. These parameters are estimated by three separate back-propagation neural networks. The background is reconstructed by a tile method using a single texton, a line of textons, or a field of textons. The proposed technique was tested on reconstructed regions occupying up to 8–20% of the frame area. The experimental results demonstrate more accurate inpainting owing to the improved motion estimates and the modified texture parameters. © 2013 Elsevier B.V. All rights reserved.
⇑ Corresponding author. Address: Department of Informatics and Telecommunications, Av. Krasnoayrsky Rabochy, 31, Krasnoyarsk 660014, Russian Federation. Tel.: +7 3912919240; fax: +7 3912919147.
0167-8655/$ - see front matter © 2013 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.patrec.2013.06.003

1. Introduction
Texture reconstruction is needed in many situations, from the editing of professional video materials by film editors to the editing of home videos by end users. The complexity of the task stems from the great variability of the reconstructed scenes and of the parameters of non-desirable objects. Many authors design their algorithms according to the type and behavior of the non-desirable objects. One of the main classifying parameters is whether the object boundary is rigid or non-rigid. Historically, the first methods were patch-based ones, such as a Navier–Stokes video inpainting algorithm derived from computational fluid dynamics (Bertalmio et al., 2001) and a video tracking and fragment merging algorithm (Jia et al., 2005). Object-based methods provide higher-quality visual results owing to spatio-temporal analysis. Known examples include an algorithm that works under illumination changes (Jia et al., 2006) and a virtual contour-guided object inpainting using posture mapping and retrieval (Ling et al., 2011). At present, patch-based methods are also applied to rigid object inpainting in videos (Vidhya and Valarmathy, 2011). In this paper, a rigid object-based technique is considered.
Existing methods and algorithms of video reconstruction cover many types of objects: a single large object (Mairal et al., 2008; Köppel et al., 2009; Li et al., 2010), small overlapping objects, moving objects in thermal video sequences (Haik et al., 2006), line scratches in degraded motion pictures (Joyeux et al., 2002), rain drops (Krishnan and Venkataraman, 2012), etc. Such algorithms ought to consider not only the type of motion but also its speed and velocity, and the possibility of an object appearing in or disappearing from a scene. For example, a blurred car image is a usual case in urban scenes. Moreover, most publications are dedicated to the automatic restoration of static scenes.
One popular research direction is connected with kernels, special functions, and fields. Some theoretical models for image restoration and decomposition into cartoon and texture, based on the Rudin–Osher–Fatemi model, can be found in (Osher et al., 2003). Fast object motion and the restricted aperture time of the video camera are the two main causes that produce motion-blurred sequential frames. An accurate motion estimation based on a piecewise constant Mumford–Shah model was considered for restoration purposes (Bar et al., 2007). The non-local range Markov Random Field includes a gradient-based discriminative learning method of potential functions and a non-local range filter bank (Sun and Tappen, 2011). A Non-Local Kernel Regression (NL-KR) method for image and video restoration based on reliable and robust estimations was proposed by Zhang et al. (2010). This method combines the advantages of local structural regularity and non-local similarity in a unified framework and uses similar patterns for more accurate estimation of the structural regression. Such a technology cannot be applied directly to the present problem, but it contains a good idea of local and global data analysis for better visibility of regularized structures in reconstructed frames.
Please cite this article in press as: Favorskaya, M., et al. Accurate spatio-temporal reconstruction of missing data in dynamic scenes. Pattern Recognition Lett. (2013), http://dx.doi.org/10.1016/j.patrec.2013.06.003
M. Favorskaya et al. / Pattern Recognition Letters xxx (2013) xxx–xxx
The use of samples, or exemplar-based images, is a promising approach in video restoration. A comparative analysis of exemplar-based image inpainting algorithms can be found in (Sangeetha et al., 2011). Its authors pay special attention to three approaches: an isophote-driven image-sampling process (Criminisi et al., 2004), a hybrid model using the total variation equation to decompose the image into structured and textured parts (Wu and Ruan, 2006), and an exemplar-based completion model which combines the characteristics of inpainting and texture synthesis (Wu et al., 2010). The last approach provides a more accurate solution in four stages: (1) the size of the image patch is determined from the gradient domain of the image; (2) the filling priority is defined by the geometrical structure of the image, especially the curvatures and directions of the isophotes; (3) a patch-matching scheme incorporates the curvature and color of the image; (4) the source template is copied into the destination template, and then the destination template is updated.
The fast matching method and block-based sampling were proposed in (Huan et al., 2010). The first part of the algorithm determines the filling order of the pixels in the target regions based on the high accuracy of the fast matching method. The second part of the algorithm implicitly assumes a Markov random field model for textured image regions and computes blocks of texture using an efficient search process and the SSD (Sum of Squared Differences) measure. However, this approach was tested only on static scenes.
The organization of this paper is as follows. Section 2 presents the accurate localization of non-desirable objects based on feature point tracking and the affine motion model. Texture estimates obtained by the neural network approach are discussed in Section 3. The accurate spatio-temporal reconstruction of missing data is explained in Section 4.
Section 5 reports the experimental results and evaluations. Conclusions and future work are presented in Section 6.

2. The accurate localization of non-desirable objects
The contribution of this research is an accurate spatio-temporal reconstruction in dynamic scenes. The higher accuracy is achieved by an accurate localization of non-desirable objects using feature point tracking and by an accurate spatio-temporal reconstruction based on maps of texture parameters such as smoothness, structural performance, and isotropy.
Let us suppose that the user has chosen a set of sequential frames of a dynamic scene with a non-desirable object (no fewer than 10–12 frames) and has outlined its contour with a lasso in the first of the selected frames. The proposed localization algorithm includes two stages. The pre-determined stage is based on a fast but inaccurate frame subtraction that builds the binary mask of the foreground non-desirable moving object (Section 2.1). The stage of accurate object localization uses SURF (Speeded-Up Robust Features) tracking (Section 2.2) in conjunction with an accurate contour methodology (Section 2.3).

2.1. The pre-determined stage
Frame subtraction is a fast, inaccurate motion estimation technique that operates on two adjacent frames or on two frames separated by an interval of 2 to 6 frames. The choice of interval is determined by the total number of available frames and by the motion type (fast/slow) in the scene. For each current frame, the intensity values of the grey-scale frame are compared pixel by pixel with the corresponding values in a previous frame. As a result, binary masks of moving objects are obtained. This method is noise-dependent; therefore a median filter or mathematical morphological operations are applied to the binary masks. The filter parameters improve the sensitivity of the method and reduce the error rate. The algorithm is simple and fast, but its disadvantages restrict its use to a pre-determined stage: it is affected by shadows, background motion in the scene, luminance changes, and camera inaccuracy (Favorskaya, 2012). The proposed approach is close to on-the-fly background modeling (Khan et al., 2003; Elhabian et al., 2008). The difference lies in the analysis of an extended frame interval that includes not only the following frames but also the previous ones. Frame subtraction makes it possible to build not only a per-frame binary mask of a non-desirable object but also an inter-frame object binary mask, which indicates the differences and indirectly estimates the motion parameters.

2.2. The motion parameters estimation
Fast methods of motion estimation are based on feature point detection and tracking. Accurate inter-frame keypoint matching usually relies on one of two detectors: SIFT (Scale-Invariant Feature Transform) or SURF (Speeded-Up Robust Features). The accuracy of both detectors is approximately identical, but the latter is faster. The SURF algorithm is based on convolutions and uses a Hessian matrix-based measure for a distribution-based detector (Bay et al., 2008). The Hessian matrix H(P, σ) at a point P is determined by Eq. (1), where Lxx(P, σ), Lxy(P, σ), Lyx(P, σ), and Lyy(P, σ) are convolutions of the second derivatives of the Gaussian G(P) with the function describing an image I_P at the point P along axis OX, the diagonal in the first quadrant, axis OY, and the diagonal in the second quadrant respectively.
$$ H(P, \sigma) = \begin{bmatrix} L_{xx}(P, \sigma) & L_{xy}(P, \sigma) \\ L_{yx}(P, \sigma) & L_{yy}(P, \sigma) \end{bmatrix} \tag{1} $$
The convolution of the second derivative of the Gaussian G(P) with the function describing the image I_P at the point P along axis OX is calculated by Eq. (2).
$$ L_{xx}(P, \sigma) = \frac{\partial^2 G(P)}{\partial x^2} \ast I_P \tag{2} $$
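As an illustration of Eqs. (1) and (2), the following minimal sketch evaluates the Hessian entries at one pixel of a small grey-scale image. It rests on several assumptions that are not part of the paper: it samples the exact Gaussian second derivatives instead of the box-filter approximations that SURF uses for speed, it takes L_yx = L_xy as the ordinary mixed derivative, and all function names as well as the kernel radius of 3σ are hypothetical.

```python
import math


def gaussian_second_derivative_kernels(sigma, radius):
    """Sample Gxx, Gyy and Gxy of a 2-D Gaussian on a (2*radius+1)^2 grid."""
    s2 = sigma * sigma
    gxx, gyy, gxy = [], [], []
    for y in range(-radius, radius + 1):
        row_xx, row_yy, row_xy = [], [], []
        for x in range(-radius, radius + 1):
            g = math.exp(-(x * x + y * y) / (2.0 * s2)) / (2.0 * math.pi * s2)
            row_xx.append((x * x / s2 - 1.0) / s2 * g)   # d2G/dx2
            row_yy.append((y * y / s2 - 1.0) / s2 * g)   # d2G/dy2
            row_xy.append((x * y) / (s2 * s2) * g)       # d2G/dxdy
        gxx.append(row_xx)
        gyy.append(row_yy)
        gxy.append(row_xy)
    return gxx, gyy, gxy


def convolve_at(image, kernel, px, py):
    """Value of (kernel * image) at pixel (px, py); image is a list of rows."""
    radius = len(kernel) // 2
    height, width = len(image), len(image[0])
    if not (radius <= px < width - radius and radius <= py < height - radius):
        raise ValueError("kernel window must lie inside the image")
    total = 0.0
    for ky in range(-radius, radius + 1):
        for kx in range(-radius, radius + 1):
            total += kernel[ky + radius][kx + radius] * image[py - ky][px - kx]
    return total


def hessian_at(image, px, py, sigma=2.0):
    """2x2 Hessian of Eq. (1) at (px, py); kernel radius 3*sigma is an assumption."""
    radius = int(math.ceil(3 * sigma))
    gxx, gyy, gxy = gaussian_second_derivative_kernels(sigma, radius)
    lxx = convolve_at(image, gxx, px, py)
    lyy = convolve_at(image, gyy, px, py)
    lxy = convolve_at(image, gxy, px, py)
    return [[lxx, lxy], [lxy, lyy]]


def hessian_determinant(h):
    """Blob response used to rank candidate feature points."""
    return h[0][0] * h[1][1] - h[0][1] * h[1][0]
```

For a bright blob, both diagonal entries are negative and the determinant is positive, which is the kind of response a Hessian-based detector looks for.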
The SURF algorithm uses the Haar wavelet response along the selected direction, constructs a square region aligned to it, and extracts the SURF descriptor, which is invariant to rotations. The Haar wavelets are easily computed with integral images when the window centered on a point of interest is split up into 4 × 4 sub-regions. The underlying intensity pattern (first derivatives) of each sub-region is described by a vector V_H = (Σdx, Σdy, Σ|dx|, Σ|dy|), where dx and dy are the Haar wavelet responses in the horizontal and vertical directions, and |dx| and |dy| are the absolute values of the corresponding responses. Thereby the overall vector contains 64 elements (16 sub-regions × 4 values), which provide invariance to rotations, scaling, and intensity changes. To match all detected feature points, a similarity matrix of size N × M is built, where N is the number of feature points in the current frame and M is the number of feature points in the previous frame. The elements of this matrix are the results of normalized cross-correlations. The minimal values of the elements indicate the best matches of feature points between the two frames. Multiple correspondences are not allowed. In this manner, the position changes of non-desirable objects are estimated in the sequential frames of a dynamic scene.
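The matching step can be sketched as follows. This is a hedged illustration, not the authors' implementation. Since the text states that minimal matrix elements indicate the best match, the sketch assumes the matrix stores the distance 1 − NCC. The greedy one-to-one assignment, the acceptance threshold `max_distance`, and the function names are likewise hypothetical.

```python
import math


def ncc(a, b):
    """Normalized cross-correlation of two equal-length descriptors, in [-1, 1]."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(sum((x - mean_a) ** 2 for x in a) *
                    sum((y - mean_b) ** 2 for y in b))
    return num / den if den > 0 else 0.0


def match_points(current, previous, max_distance=0.3):
    """Greedy one-to-one matching on the N x M matrix of 1 - NCC distances.

    current  -- N descriptors from the current frame
    previous -- M descriptors from the previous frame
    Returns a sorted list of (index_in_current, index_in_previous) pairs.
    """
    matrix = [[1.0 - ncc(c, p) for p in previous] for c in current]
    candidates = sorted((matrix[i][j], i, j)
                        for i in range(len(current))
                        for j in range(len(previous)))
    used_current, used_previous, matches = set(), set(), []
    for distance, i, j in candidates:
        if distance > max_distance:
            break                      # remaining pairs are even worse
        if i in used_current or j in used_previous:
            continue                   # multiple correspondences are not allowed
        used_current.add(i)
        used_previous.add(j)
        matches.append((i, j))
    return sorted(matches)
```

Points that find no partner below the threshold are simply dropped, which mirrors the gradual loss of tracked points over a sequence.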
2.3. The accurate contour localization
Active contours (snakes) are widely used to extract an accurate deformable contour or boundary. A good review of active contour models is presented in (Ge et al., 2012). The basic snake model is a curve associated with the image energy and with external and internal energies. It is supposed that the target boundary is a smooth closed curve. The initial contour is taken as a simple geometrical primitive (for example, a circle). Then the contour is deformed, and the contour points move toward the object boundary by minimization of the contour energy. The position of a snake can be represented parametrically by a set of points p_i(s) (Eq. (3)).
$$ p_i(s) = [x(s), y(s)], \quad s \in [0, 1] \tag{3} $$
For each point p_i(s) in the neighborhood, the energy functional E_sh is determined in the continuous domain by Eq. (4), where E_int(·) and E_ext(·) are the internal and external energies, and the parameters α and β are constants that control the energy values.

$$ E_{sh} = \int_0^1 \bigl( \alpha E_{int}(p(s)) + \beta E_{ext}(p(s)) \bigr)\, ds \tag{4} $$

The internal energy is caused by the bending of the contour, and the external energy represents the image energy acting on an object contour point. Each point p_i(s) is moved to the position p'_i(s) that corresponds to the minimal value of E(p_i(s)). Usually the active contour method includes three main stages: creation of the initial contour, detection of anchor points, and tracking of anchor points. For a blurred contour, its position may be forcibly expanded by a value sufficient to compensate for the blur. Thereby a larger area than the determined region is restored.

3. Texture estimations based on neural network approach
The extraction of spatio-temporal texture features in regions located near the region with missing data is the stage which determines the success or failure of the whole subsequent process. The spatial and temporal texture feature extractions are discussed in Sections 3.1 and 3.2 respectively. The building of texture parameter maps is described in Section 3.3.

3.1. The spatial texture features extraction
For spatial texture analysis, the known statistical features based on second-order moments, such as average m, dispersion σ, homogeneity U, smoothness R, and entropy E, may be successfully used (Favorskaya and Petukhov, 2011). Additionally, two modified texture features were proposed: the relative smoothness Rm and the normalized entropy En. The texture smoothness R is determined by Eq. (5), where L is the number of brightness levels, L > 1 (L_min = 2 for a binary frame).

$$ R = 1 - \frac{1}{1 + \sigma^2/(L-1)^2} \tag{5} $$

The relative smoothness Rm is calculated according to Eq. (6).

$$ R_m = \begin{cases} -\log R & \text{if } R > 0 \\ 10 & \text{if } R = 0 \end{cases} \tag{6} $$

The normalized entropy En is evaluated by Eq. (7).

$$ E_n = E / \log_2 L \tag{7} $$

If the parameter R = 0, its value is forcibly set to a small empirical non-zero value (Rm = 10). The normalized entropy En improves the intensity function in dark and bright areas of a frame (through a weak equalization effect).

The novelty of the spatial texture analysis lies in the choice of three texture parameters: smoothness SM, structural performance ST, and isotropy IS. Three fully connected two-level back-propagation neural networks (NN), one for each parameter, were applied. This decision was made because the parameters do not influence each other, and a common NN would be large, complex, and tangled. The smoothness SM is estimated by an NN that works according to Eq. (8), where f_ac is the activation function (sigmoid); k and l are the numbers of neurons in the first and the second hidden layers; i and j are the current indexes; w_{U1j}, w_{Rm1j}, and w_{en1j} are the weights of synapses from the input parameters homogeneity U, relative smoothness Rm, and normalized entropy En to neuron j of the first hidden layer respectively; w_{1j2i} is the weight of the synapse connecting neuron j of the first hidden layer to neuron i of the second hidden layer; and w_{2iY} is the weight of the synapse connecting neuron i of the second hidden layer to the output Y.

$$ SM = f_{ac}\left( \sum_{i=0}^{l-1} f_{ac}\left( \sum_{j=0}^{k-1} f_{ac}\left( U\, w_{U1j} - \log R_m\, w_{Rm1j} + (E_n/\log_2 L)\, w_{en1j} \right) w_{1j2i} \right) w_{2iY} \right) \tag{8} $$

The output values Y may fall into the ranges 0–199, 200–399, and 400–599, which mean a smooth texture, a texture with non-defined smoothness, and a rough texture respectively. The structural performance ST is estimated by Eq. (9), where m is the order of the central moment; μ_m is the central moment of order m; and w_{μm1j} is the weight of the synapse from the input parameter μ_m to neuron j of the first hidden layer.

$$ ST = f_{ac}\left( \sum_{i=0}^{l-1} f_{ac}\left( \sum_{j=0}^{k-1} f_{ac}\left( \sum_{m=3}^{6} \frac{\mu_m}{(L-1)^m}\, w_{\mu_m 1j} \right) w_{1j2i} \right) w_{2iY} \right) \tag{9} $$
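The statistics of Eqs. (5)–(7) and the nested sums of the networks in Eqs. (8)–(10) can be sketched in a few lines. The sketch makes several assumptions that the text does not fix: the homogeneity U is taken as the sum of squared histogram probabilities, the logarithm in Eq. (6) is taken to base 10, the network returns the raw sigmoid output (the mapping to ranges such as 0–599 is not specified), and all names are hypothetical. Any input transforms that appear inside Eq. (8) are left to the caller, who passes a ready input vector.

```python
import math


def texture_inputs(pixels, levels=256):
    """Return (U, R, Rm, En) for a flat list of grey values in [0, levels - 1]."""
    n = len(pixels)
    counts = {}
    for value in pixels:
        counts[value] = counts.get(value, 0) + 1
    probs = [c / n for c in counts.values()]
    mean = sum(pixels) / n
    variance = sum((v - mean) ** 2 for v in pixels) / n
    uniformity = sum(p * p for p in probs)                 # assumed definition of U
    entropy = sum(-p * math.log2(p) for p in probs)        # E, in bits
    r = 1.0 - 1.0 / (1.0 + variance / (levels - 1) ** 2)   # Eq. (5)
    rm = -math.log10(r) if r > 0 else 10.0                 # Eq. (6), base 10 assumed
    en = entropy / math.log2(levels)                       # Eq. (7)
    return uniformity, r, rm, en


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(inputs, w_in, w_hidden, w_out):
    """Forward pass with two hidden layers, following the nesting of Eqs. (8)-(10).

    w_in[n][j]     -- weight from input n to neuron j of the first hidden layer
    w_hidden[j][i] -- weight from neuron j (first layer) to neuron i (second layer)
    w_out[i]       -- weight from neuron i of the second layer to the output Y
    """
    k, l = len(w_hidden), len(w_out)
    first = [sigmoid(sum(x * w_in[n][j] for n, x in enumerate(inputs)))
             for j in range(k)]
    second = [sigmoid(sum(first[j] * w_hidden[j][i] for j in range(k)))
              for i in range(l)]
    return sigmoid(sum(second[i] * w_out[i] for i in range(l)))
```

A perfectly flat patch gives R = 0 and therefore Rm = 10, the upper end of the smoothness scale, while a two-level black-and-white patch gives R = 0.2 and Rm of about 0.7.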
Normalized central moments of order m are the inputs of the NN for structural performance, and its output values Y lie in the ranges 0–19, 20–39, and 40–59, which correspond to low, middle, and high structural performance of the analyzed texture region. The isotropy IS is determined by Eq. (10), where M is the maximum of probability; μ_2 is the second-order moment of the element difference; and w_{M1j} and w_{μ2 1j} are the weights of synapses from the input parameters M and μ_2 to neuron j of the first hidden layer respectively.

$$ IS = f_{ac}\left( \sum_{i=0}^{l-1} f_{ac}\left( \sum_{j=0}^{k-1} f_{ac}\left( U\, w_{U1j} - M\, w_{M1j} + E_n\, w_{En1j} + \mu_2\, w_{\mu_2 1j} \right) w_{1j2i} \right) w_{2iY} \right) \tag{10} $$
Values Y ∈ 0…19 mean an anisotropic texture, and values Y ∈ 20…39 indicate an isotropic texture. The NNs functioning according to Eqs. (8)–(10) were trained by a teacher using images from the well-known texture databases ‘‘Brodatz textures’’, ‘‘Walls textures’’, ‘‘Fabric textures’’, ‘‘Nature textures’’, etc., with a total of 1,200 images. These texture databases were used because suitable estimates are needed for various texture types. All images were previously classified by the teacher as smooth/rough/non-defined smoothness, low/middle/high structural performance, and anisotropic/isotropic textures. Then the designed software tool calculated the texture parameters of the grey-scale images and wrote them into a two-dimensional array (7 × 1200 elements) in the Pre-processing Module. In the learning mode, the Configuration File of NN loaded the required parameters onto the inputs of the three NNs and automatically corrected false output values according to the back-propagation algorithm by increasing or decreasing the corresponding synapse weights. Approximately 300 images were used for test
results after the learning process. The number of learning cycles was 1,000,000; the average error on the training sample was less than 0.004; the maximum error on the training sample was less than 0.03; the average error on the test sample was less than 0.003; and the maximum error on the test sample was 0.01.

3.2. The temporal texture features extraction
The temporal texture analysis is based on motion estimation in the neighborhood of the missing data in the selected frames. The extraction of temporal texture features is not a simple task. However, any complex motion may be interpolated by simple motion models over sequential frames. Three motion approximation models were introduced according to the affine model parameters: the linear, the rotational, and the scalable approximation models. The linear model is a single affine transformation (translation) with motion vector v = (x_i − x_{i−1}, y_i − y_{i−1}), where (x_i, y_i) and (x_{i−1}, y_{i−1}) are the point coordinates in two sequential frames. It is determined by Eq. (11), where dx and dy are the displacement values along axes OX and OY respectively.

$$ \begin{bmatrix} x_i \\ y_i \end{bmatrix} = \begin{bmatrix} x_{i-1} + dx \\ y_{i-1} + dy \end{bmatrix} \tag{11} $$
Three sequential frames with point coordinates (x_i, y_i), (x_{i−1}, y_{i−1}), and (x_{i−2}, y_{i−2}) are used in the rotational model. It is necessary to determine the rotation angle α, the center coordinates (x_c, y_c), and the rotation radius R. It is possible to solve the system of Eq. (12)
$$ \begin{cases} (x_i - x_c)^2 + (y_i - y_c)^2 = R^2 \\ (x_{i-1} - x_c)^2 + (y_{i-1} - y_c)^2 = R^2 \\ (x_{i-2} - x_c)^2 + (y_{i-2} - y_c)^2 = R^2 \end{cases} \tag{12} $$
and to find the rotation angle α between two sequential frames from the scalar product of vectors (Eq. (13)).
$$ \alpha = \arccos \frac{(x_i - x_c)(x_{i-1} - x_c) + (y_i - y_c)(y_{i-1} - y_c)}{\sqrt{(x_i - x_c)^2 + (y_i - y_c)^2}\, \sqrt{(x_{i-1} - x_c)^2 + (y_{i-1} - y_c)^2}} \tag{13} $$
In the scalable model, the scale coefficient k may be estimated from the change of vector lengths in sequential frames according to Eq. (14), where (x_{i,j}, y_{i,j}), (x_{i−1,j}, y_{i−1,j}), (x_{i−1,j−1}, y_{i−1,j−1}), and (x_{i,j−1}, y_{i,j−1}) are the feature point coordinates; i is the index of a frame within a series; j is the index of a series. The term ‘‘series’’ denotes a sequential pair of frames.
$$ k = \frac{\sqrt{(x_{i-1,j-1} - x_{i,j-1})^2 + (y_{i-1,j-1} - y_{i,j-1})^2}}{\sqrt{(x_{i-1,j} - x_{i,j})^2 + (y_{i-1,j} - y_{i,j})^2}} \tag{14} $$
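The parameters of the three motion models in Eqs. (11)–(14) can be estimated with a few helper functions, sketched below. The function names are hypothetical, and the closed-form circumcenter formula is one standard way to solve the system of Eq. (12); the paper does not say which solver was used.

```python
import math


def translation(p_prev, p_cur):
    """Displacement (dx, dy) of Eq. (11) between two sequential frames."""
    return (p_cur[0] - p_prev[0], p_cur[1] - p_prev[1])


def rotation_center(p0, p1, p2):
    """Solve Eq. (12): circle through three positions of the same point.

    Returns (xc, yc, radius). Raises ValueError for collinear points,
    for which no finite rotation center exists.
    """
    (ax, ay), (bx, by), (cx, cy) = p0, p1, p2
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < 1e-12:
        raise ValueError("points are collinear; use the linear model instead")
    a2, b2, c2 = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    xc = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    yc = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    return xc, yc, math.hypot(ax - xc, ay - yc)


def rotation_angle(p_prev, p_cur, center):
    """Angle of Eq. (13) between the radius vectors of two sequential frames."""
    ux, uy = p_cur[0] - center[0], p_cur[1] - center[1]
    vx, vy = p_prev[0] - center[0], p_prev[1] - center[1]
    cosine = (ux * vx + uy * vy) / (math.hypot(ux, uy) * math.hypot(vx, vy))
    return math.acos(max(-1.0, min(1.0, cosine)))   # clamp against rounding


def scale_coefficient(previous_series, current_series):
    """Eq. (14): segment length in series j-1 divided by the length in series j.

    Each series is a pair of points ((x_{i-1}, y_{i-1}), (x_i, y_i)).
    """
    (p0, p1), (q0, q1) = previous_series, current_series
    return math.hypot(p0[0] - p1[0], p0[1] - p1[1]) / \
        math.hypot(q0[0] - q1[0], q0[1] - q1[1])
```

With three positions sampled 30° apart on a circle of radius 5, the sketch recovers the center, the radius, and an angle of π/6 between consecutive frames.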
The counters accumulate the number of motion vectors that fit each motion model. The type of motion is chosen by the counter which stores the maximum value. The linear approximation model implies that the angle between motion vectors close to the region with missing data tends to 0. In the rotational approximation model, the angle between motion vectors close to the region with missing data is greater than 0. Intersections of motion vectors at the center of the coordinate system indicate the scalable approximation model.
After calculation of the spatial and temporal texture estimates, the algorithm of boundary interpolation into the missing data region is started, because a missing region with a single texture type is very rare. For this purpose, the wave propagation algorithm applied to gradient frames may be used (Favorskaya et al., 2012). The accuracy of this method is not high. However, this step is only the first approximation of the spatio-temporal reconstruction in a dynamic scene (Section 4).

3.3. Maps building of texture parameters
For better interpretation and visualization of the obtained results, pseudo-colored texture maps were built. To discriminate textured and homogeneous regions, the smoothness maps were built using a sliding window of 5 × 5 pixels; the relative smoothness was calculated by Eq. (6). Reducing the window size makes the values of the texture descriptors inaccurate. Enlarging the window makes the detection of boundaries between different texture types inaccurate. For visualization of the relative smoothness texture map, 2 unequal thresholds and 3 colors were applied to indicate regions with high (green), medium (yellow), and low (red) smoothness.
According to the empirical texture analysis of 150 textures from texture databases, the threshold values are equal to 2 and 4: the interval 0–2 means a low smoothness region, the interval 2–4 is a medium smoothness region, and values of 4 and more denote a high smoothness region (homogeneous region) (Fig. 1b).
Fig. 1. Example of frame processing: (a) original frame from ‘‘Lie to Me’’, Season 1, episode 2, frame 10472, (b) a smoothness texture map, (c) a structural texture map, (d) an isotropy texture map.
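The construction of the smoothness map in Fig. 1b can be sketched as follows. The sketch assumes a base-10 logarithm in Eq. (6), treats border pixels where the 5 × 5 window does not fit as outside the map, and uses hypothetical function names. Only the thresholds 2 and 4 and the window size come from the text.

```python
import math


def relative_smoothness(values, levels=256):
    """Rm of Eq. (6) for the grey values of one window (base-10 log assumed)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    r = 1.0 - 1.0 / (1.0 + variance / (levels - 1) ** 2)   # Eq. (5)
    return -math.log10(r) if r > 0 else 10.0


def smoothness_class(rm, low=2.0, high=4.0):
    """Map Rm to the three map classes (red / yellow / green in Fig. 1b)."""
    if rm < low:
        return "low"
    if rm < high:
        return "medium"
    return "high"


def smoothness_map(image, size=5, levels=256):
    """Classify every pixel whose size x size window lies inside the image.

    image is a list of rows of grey values; the result has
    (height - size + 1) rows and (width - size + 1) columns.
    """
    half = size // 2
    height, width = len(image), len(image[0])
    labels = []
    for y in range(half, height - half):
        row = []
        for x in range(half, width - half):
            window = [image[yy][xx]
                      for yy in range(y - half, y + half + 1)
                      for xx in range(x - half, x + half + 1)]
            row.append(smoothness_class(relative_smoothness(window, levels)))
        labels.append(row)
    return labels
```

A constant area is labeled as high smoothness, while a black-and-white checkerboard falls into the low smoothness class.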
In a similar manner, the texture maps of structural performance were built. They indicate the levels of structural properties (low, middle, and high) (Fig. 1c). The isotropy texture maps (anisotropic, non-determined, and isotropic) are shown in Fig. 1d. The maps may be built separately based on Haralick texture descriptors (smoothness, homogeneity, entropy, central moments of various orders), stochastic topological descriptors (maximum of probability, central moments, back central moments), or spectral descriptors and their combinations. In this research, the first variant was chosen. Additionally, maps of other texture descriptors may also be built using the neural network approach. However, investigations show that in this case the accuracy of texture classification increases by only up to 3%, with a significant growth of computational cost. According to the obtained results, the most appropriate rule of texture inpainting in a dynamic scene is determined. Such rules are stored in the designed software tool and may be added, removed, or updated if necessary.
4. The accurate texture inpainting into a dynamic scene
Texture inpainting is the final stage of the reconstruction process. Once the motion parameters of the non-desirable object and the spatio-temporal texture parameters in its neighborhood have been calculated, the reconstruction method must be chosen. Three main texture reconstruction methods are known: blurring (anisotropic/isotropic), texture tiling (anisotropic/isotropic), and texture synthesis (statistical/superposition). For a dynamic scene, the most suitable method is texture tile inpainting, in which a non-strictly defined texton (pseudo-texton) is selected from the background region. Such a pseudo-texton has changeable geometric sizes: a single pseudo-texton, a line, or a field of pseudo-textons. The choice of the pseudo-texton type is determined by the motion estimates for the whole scene. Any scene may be classified as a scene without motion, a scene with a simple type of motion, or a scene with a complex type of motion in which various artifacts occur (overlapping objects, periodic motions, step scaling in a frame, high-speed motion, etc.).

Fig. 2. Frames reconstruction from ‘‘CSI (season 01, episode 04)’’: (a) original frames 7076, 7078, 7080, 7082, 7084, (b) object masks with tracked feature points, (c) smoothness texture maps, (d) structural texture maps, (e) isotropy texture maps, (f) the reconstructed video sequence with tracked feature points.

The three motion approximation models make it possible to find the corresponding pseudo-textons in adjacent frames for reconstruction of the missing data region. In the case of the linear motion model, the point coordinates (x_n, y_n) in the reconstructed frame n are calculated by Eq. (15).
$$ \begin{bmatrix} x_n \\ y_n \end{bmatrix} = (n - 1) \begin{bmatrix} x_i - x_{i-1} \\ y_i - y_{i-1} \end{bmatrix} \tag{15} $$
The transfer of a known point from the adjacent frame to the reconstructed frame in the case of the rotational model is determined by Eq. (16), where X_n is the matrix of homogeneous coordinates of the reconstructed point; T is the translation matrix of the coordinate system to the rotation center; R′ is the rotation matrix of a point; S is the translation matrix of the coordinate system to the initial center; X is the matrix of homogeneous coordinates of the point in the adjacent frame; and n is the displacement in frames.
$$ X_n = T \left( \prod_{i=1}^{n-1} R' \right) S X \tag{16} $$
Let the scale coefficient k be constant for several sequential frames. Then, in the case of the scalable model, the transfer of a point (x_{i−1}, y_{i−1}) in an adjacent frame to the point (x_n, y_n) in the reconstructed frame is described by Eq. (17), where K is the scale matrix.
$$ X_n = T \left( \prod_{i=1}^{n-1} K \right) S X \tag{17} $$
To approximate motion of any type, it is better to use a long series of frames.
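Eqs. (15)–(17) can be exercised with plain 3 × 3 homogeneous matrices, as in the sketch below. Several points are assumptions rather than statements of the paper: column vectors are used, so the rightmost translation is applied first and moves the center to the origin while the leftmost one moves it back; Eq. (15) is coded literally and its result is returned unchanged, since the text does not say whether it is an absolute position or an offset from a reference point; all function names are hypothetical.

```python
import math

IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]


def mat_mul(a, b):
    """Product of two 3x3 matrices given as lists of rows."""
    return [[sum(a[r][t] * b[t][c] for t in range(3)) for c in range(3)]
            for r in range(3)]


def mat_pow(m, power):
    """m multiplied by itself `power` times (identity for power 0)."""
    result = IDENTITY
    for _ in range(power):
        result = mat_mul(result, m)
    return result


def translate(tx, ty):
    return [[1.0, 0.0, tx], [0.0, 1.0, ty], [0.0, 0.0, 1.0]]


def rotate(alpha):
    c, s = math.cos(alpha), math.sin(alpha)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def scale(k):
    return [[k, 0.0, 0.0], [0.0, k, 0.0], [0.0, 0.0, 1.0]]


def linear_offset(p_prev, p_cur, n):
    """Literal Eq. (15): (n - 1) times the inter-frame displacement."""
    return ((n - 1) * (p_cur[0] - p_prev[0]), (n - 1) * (p_cur[1] - p_prev[1]))


def propagate(point, center, step, n):
    """Eqs. (16)/(17): apply `step` (rotation R' or scale K) n - 1 times about `center`."""
    to_origin = translate(-center[0], -center[1])   # applied first (assumed order)
    back = translate(center[0], center[1])          # applied last
    m = mat_mul(back, mat_mul(mat_pow(step, n - 1), to_origin))
    x, y = point
    return (m[0][0] * x + m[0][1] * y + m[0][2],
            m[1][0] * x + m[1][1] * y + m[1][2])
```

For example, `propagate((2, 1), (1, 1), rotate(math.pi / 2), 2)` turns the point a quarter of a circle around the center, and `propagate((2, 1), (1, 1), scale(2.0), 3)` applies the scale factor twice.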
5. Experimental results and evaluation
The designed software tool ‘‘Video Editing’’, v. 1.27, includes seven modules: Interface Module, Pre-processing Module, Localization Module, Reconstruction Module, Module of Media Flows Editing, Database Module, and Configuration File of NN. The program code was written in C#, and the DBMS ‘‘MS Access 2000’’ was used to create a database which stores the original and processed video sequences. The efficiency of the proposed algorithms was determined by processing several dynamic scenes from movies. For the chosen frames, a manual reconstruction was done, which made it possible to evaluate the results of the program reconstruction. The videos ‘‘CSI (season 01, episode 04)’’ and ‘‘xXx’’ were chosen because the first one includes a large non-desirable object (Fig. 2) and the second one contains a small non-desirable object (Fig. 3). In both cases, the objects were cars in urban scenes. The number of feature points successfully tracked relative to the first analyzed frame is given in Table 1. For each scene, 100 feature points were initially detected. The overall experimental results of object localization using frame subtraction and active contour detection (calculated over the whole scene) are given in Table 2. As can be seen, the results of the active contour method are more accurate and suitable. The reconstruction accuracy was estimated in the same way (with the help of frames without a non-desirable object and the reconstructed frames). In the area with missing data, the smoothness, the structural performance, and the isotropy in regions with different texture types were recalculated. Then the relative values of these three parameters were found as percentages (the parameters of the real textures calculated in a frame without the non-desirable object were taken as 100%). Table 3 shows the average relative results of these estimations.
Fig. 3. Frames reconstruction from ‘‘xXx’’: (a) original frames 153306, 153308, 153310, 153312, 153314, (b) object masks with tracked feature points, (c) smoothness texture maps, (d) structural texture maps, (e) isotropy texture maps, (f) the reconstructed video sequence with tracked feature points.
Table 1
The number of successful feature points tracking.

| Video sequence | Frame 1 | Frame 2 | Frame 3 | Frame 4 | Frame 5 |
|---|---|---|---|---|---|
| CSI (season 01, episode 04) | 100 | 76 | 54 | 48 | 42 |
| xXx | 100 | 93 | 89 | 84 | 81 |
Table 2
Average experimental results of object localization.

| Video sequence | Frame subtraction: true detected pixels, % | Frame subtraction: undetected pixels, % | Frame subtraction: false detected pixels, % | Active contour: true detected pixels, % | Active contour: undetected pixels, % | Active contour: false detected pixels, % |
|---|---|---|---|---|---|---|
| CSI (season 01, episode 04) | 86.43 | 13.57 | 10.16 | 92.67 | 7.33 | 7.26 |
| xXx | 93.86 | 6.14 | 4.72 | 97.80 | 2.20 | 3.64 |
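The three localization rates reported for each method can be computed from a ground-truth mask and a detected mask, as sketched below. The exact definitions are not given in the text. The sketch assumes that all three percentages are normalized by the ground-truth object area, which agrees with the true detected and undetected columns summing to 100%. The function name and the representation of masks as sets of pixel coordinates are hypothetical.

```python
def localization_rates(truth, detected):
    """Return (true detected %, undetected %, false detected %).

    truth    -- set of (x, y) pixels of the manually marked object
    detected -- set of (x, y) pixels produced by the localization method
    All rates are relative to the ground-truth area (assumed normalization).
    """
    if not truth:
        raise ValueError("ground-truth mask must not be empty")
    to_percent = 100.0 / len(truth)
    true_detected = len(truth & detected) * to_percent
    undetected = len(truth - detected) * to_percent
    false_detected = len(detected - truth) * to_percent
    return true_detected, undetected, false_detected
```

With a 10-pixel object of which 8 pixels are found and 1 extra pixel is wrongly marked, the sketch returns 80%, 20%, and 10%.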
Table 3 Average experimental results of relative texture estimations.

| Video sequence | Smoothness, % | Structural performance, % | Isotropy, % |
|---|---|---|---|
| CSI (season 01, episode 04) | 92.46 | 62.23 | 78.84 |
| xXx | 97.11 | 96.67 | 98.03 |
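The relative estimation behind Table 3 can be sketched as follows, under the assumption that each relative value is the ratio of the parameter measured in the reconstructed region to the same parameter measured in the frame without the non-desirable object, averaged over regions. The function names and the numeric values are hypothetical; the texture parameters themselves are taken as given inputs.

```python
def relative_estimate(reconstructed, reference):
    """Reconstructed parameter as a percentage of the reference value.

    The reference value (real texture, no non-desirable object) is
    taken as 100%.
    """
    if reference == 0:
        raise ValueError("reference parameter must be non-zero")
    return 100.0 * reconstructed / reference


def average_relative(pairs):
    """Average relative estimate over (reconstructed, reference) pairs."""
    values = [relative_estimate(rec, ref) for rec, ref in pairs]
    return sum(values) / len(values)


# Hypothetical smoothness values for three regions (not experimental data):
# each pair is (value in reconstructed frame, value in reference frame).
regions = [(0.45, 0.50), (0.76, 0.80), (0.30, 0.30)]
print(average_relative(regions))
```

A value close to 100% then indicates that the reconstructed texture reproduces the corresponding parameter of the real texture.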
One can see that better reconstruction is achieved in videos with a small non-desirable object (not less than 5–14%). Frames from the video “xXx” have a smoother background; therefore, the smoothness and isotropy results are very high. The low structural performance estimate for frames from the video “CSI (season 01, episode 04)” is explained by the appearance of a man in the background, part of whose image is completely absent.

6. Conclusion

A spatio-temporal method for texture reconstruction of missing data in dynamic scenes was developed. The two-stage procedure first localizes the contour of a rigid object whose motion is described by the affine motion model (using the SURF detector) and then reconstructs the dynamic scene based on maps of the texture smoothness, the structural properties, and the isotropy in the surrounding regions, estimated by back-propagation NNs. Three motion models of an object and a background were introduced: linear, rotational, and scalable. The proposed technique was tested for visual reconstruction of regions with an area of 8–20% of the frame area in several video sequences. Application of the active contour method demonstrates better results than the fast but inaccurate frame subtraction; the improvement reaches 5–12%. Videos with a small non-desirable object have better reconstruction results (with an area of 5–14%). In future research, the tile inpainting algorithm will be improved, and the existing set of spatial and temporal texture parameters will be extended with luminance invariants.