A hybrid algorithm for automatic segmentation of slowly moving objects


Int. J. Electron. Commun. (AEÜ) 66 (2012) 249–254


Zhongjie Zhu ∗, Yuer Wang

Ningbo Key Lab. of DSP, Zhejiang Wanli University, Ningbo 315100, China

Article history: Received 12 February 2010; accepted 26 July 2011.

Keywords: Moving object segmentation; Spatio-temporal information; GMM; Frame difference; Fusing operation

Abstract

Segmentation of moving objects in video sequences is a basic task in many applications. However, it is still challenging due to the semantic gap between low-level visual features and the high-level human interpretation of video semantics. Compared with segmentation of fast moving objects, accurate and perceptually consistent segmentation of slowly moving objects is more difficult. In this paper, a novel hybrid algorithm is proposed for segmentation of slowly moving objects in video sequences, aiming to acquire perceptually consistent results. Firstly, the temporal information of the differences among multiple frames is employed to detect initial moving regions. Then, the Gaussian mixture model (GMM) is employed and an improved expectation-maximization (EM) algorithm is introduced to segment a spatial image into homogeneous regions. Finally, the results of motion detection and spatial segmentation are fused to extract the final moving objects. Experiments are conducted and provide convincing results.

© 2011 Elsevier GmbH. All rights reserved.

1. Introduction

The task of moving object segmentation is to extract meaningful moving objects from video sequences, which enables object-based representation and manipulation of video content. It is an essential step for many computer vision applications such as video compression, video retrieval, video surveillance, and pattern recognition. Conventional approaches to moving object segmentation include frame-difference methods, background-subtraction methods, and optical-flow methods. Many techniques have been proposed for moving object segmentation in the literature [1–8]. Huang and Hsieh have proposed a wavelet-based technique for moving object segmentation [4,5]. Zeng et al. have proposed an approach to extract moving objects directly from an H.264/AVC compressed bit-stream based on a block-based Markov random field (MRF) model [6]. Zheng et al. have proposed an automatic moving object detection algorithm based on frame difference for video surveillance applications [7]. Wan et al. have proposed a moving object segmentation algorithm for static cameras via active contours and GMM [8]. Although moving object segmentation has been studied extensively and many algorithms have been proposed, perceptually consistent moving object segmentation is still challenging due to the semantic gap between low-level visual features, such as colors, textures, and edges, and the high-level human interpretation of video semantics.

∗ Corresponding author. Tel.: +86 13777003378. E-mail addresses: [email protected] (Z. Zhu), [email protected] (Y. Wang). 1434-8411/$ – see front matter © 2011 Elsevier GmbH. All rights reserved. doi:10.1016/j.aeue.2011.07.009

Most of the algorithms proposed in the literature deal with segmentation of fast-moving objects, as in video surveillance and traffic-control applications. Because motion information is the most important feature for moving object segmentation, slowly moving objects, such as people in video conferencing and video telephony, are more difficult to segment than fast-moving ones. Hence, in this paper, a novel hybrid algorithm is proposed for segmentation of slowly moving objects in video sequences, combining frame-difference and statistical-modeling techniques to acquire accurate and perceptually consistent segmentation results. Both the temporal information of frame differences and the spatial correlations among pixels are employed. The whole algorithm consists of three steps: motion analysis, spatial segmentation, and a fusing operation. Motion analysis detects initial motion regions. Spatial segmentation divides an image into homogeneous regions: first, multi-dimensional visual features are extracted for each pixel, converting the raw image pixels into a collection of feature vectors in a multi-dimensional feature space; compared with raw pixel values, such pixel-based visual features characterize the visual properties of images more effectively. Then the Gaussian mixture model (GMM) is employed to approximate the class distribution of image pixels, and good spatial segmentation results are obtained by grouping pixels based on the mixture components. During the fusing operation, the results of motion detection and spatial segmentation are fused to extract perceptually consistent moving objects. The main contributions of our work can be summarized as follows: (1) a high-order statistical method is introduced to perform motion detection, which can effectively remove noise; (2) the Gaussian finite mixture model is employed and a novel improved EM


algorithm is introduced to perform spatial segmentation, which provides more compact representations of image contents and more perceptually consistent spatial segmentation results; (3) to overcome the incomplete motion information caused by slow motion, spatial structure information is also incorporated, which finally yields rather complete and perceptually consistent segmentation results.

The remainder of the paper is organized as follows: Section 2 describes the details of the proposed algorithm; Section 3 presents partial experimental results to evaluate its performance; Section 4 concludes the paper.

2. Proposed algorithm

The proposed algorithm consists of three steps: motion analysis, spatial segmentation, and a fusing operation. Motion analysis detects initial motion regions. Spatial segmentation divides an image into homogeneous regions. During the fusing operation, the results of motion detection and spatial segmentation are fused to extract the final moving objects. The block diagram of the proposed algorithm is shown in Fig. 1. Firstly, motion analysis is performed based on multi-frame differences. Secondly, comprehensive visual features, including gray value, gray deviation, and gradients, are extracted for each pixel. Thirdly, the Gaussian mixture model is employed to approximate the class distribution of image pixels, and an improved EM algorithm is proposed to estimate the model parameters. Then, spatial segmentation results are obtained by grouping pixels based on the mixture components. Finally, the results of motion detection and spatial segmentation are fused to extract the final moving objects.

2.1. Motion analysis

The goal of this process is to detect initial moving regions. In most applications the camera has fixed position and parameters, so the background can be considered still. Let $d_{k,k+1}(x,y)$ denote the frame difference between two consecutive frames, $f_k(x,y)$ and $f_{k+1}(x,y)$. Theoretically, pixel $(x,y)$ belongs to the still background if $d_{k,k+1}(x,y) = 0$ and to a moving region otherwise. In practice, however, noise is inevitable and must be removed. Noise can generally be assumed to follow a zero-mean Gaussian distribution, while the motion information of video objects follows a non-Gaussian distribution, so motion detection is in essence the detection of a non-Gaussian signal among Gaussian signals [9]. Here a high-order statistical method, widely used for noise removal, is adopted. The aim of motion detection is to extract a binary motion mask $O_B(x,y) \in \{0,1\}$, where $O_B(x,y) = 1$ indicates that pixel $(x,y)$ belongs to a motion region and $O_B(x,y) = 0$ that it belongs to the still background. The process is briefly introduced as follows:

(1) First, for each pixel $(x,y)$, compute its fourth moment:

$$F_{k,k+1}^{(4)}(x,y) = \frac{1}{N_w^2}\sum_{(p,q)\in w(x,y)}\left[d_{k,k+1}(p,q)-\bar{d}_{k,k+1}(x,y)\right]^4 \qquad (1)$$

where $w(x,y)$ is an $N_w \times N_w$ window centered at $(x,y)$ and $\bar{d}_{k,k+1}(x,y)$ is the mean of the frame differences within $w(x,y)$:

$$\bar{d}_{k,k+1}(x,y) = \frac{1}{N_w^2}\sum_{(p,q)\in w(x,y)} d_{k,k+1}(p,q) \qquad (2)$$

(2) Then, let $\bar{\sigma}_{od}^2$ denote the noise variance, calculated by

$$\bar{\sigma}_{od}^2 = \frac{1}{N_s^2}\sum_{(p,q)\in S}\left[d_{k,k+1}(p,q)-\bar{d}_{k,k+1}\right]^2 \qquad (3)$$

where $S$ is an area of size $N_s \times N_s$ selected from the still background. In this paper, $N_s$ is set to 5.

(3) Finally, the detected binary motion mask $O_{B,k,k+1}(x,y)$ based on $d_{k,k+1}(x,y)$ is derived:

$$O_{B,k,k+1}(x,y) = \begin{cases} 1, & F_{k,k+1}^{(4)}(x,y) \ge u\,(\bar{\sigma}_{od}^2)^2 \\ 0, & F_{k,k+1}^{(4)}(x,y) < u\,(\bar{\sigma}_{od}^2)^2 \end{cases} \qquad (4)$$

where $u$ is a weighting coefficient. When the object's motion velocity is small, the mask extracted from a single frame difference is usually very incomplete. Hence, multiple frames are used to acquire a more complete mask. Let $M$ be the number of frames; then

$$O_B(x,y) = \max\{O_{B,k,k+1}(x,y) \mid k = 0, \ldots, M-1\} \qquad (5)$$
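The motion-analysis steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the weighting coefficient `u` and the assumption that the top-left corner of each frame is still background are illustrative choices.

```python
import numpy as np

def box_mean(a, n):
    """Mean filter over an n x n window (edge-padded)."""
    pad = n // 2
    ap = np.pad(a, pad, mode="edge")
    out = np.empty_like(a, dtype=np.float64)
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            out[i, j] = ap[i:i + n, j:j + n].mean()
    return out

def motion_mask(frames, Nw=5, Ns=5, u=20.0):
    """Binary motion mask per Eqs. (1)-(5). The top-left Ns x Ns corner is
    assumed to be still background (an illustrative choice, not from the
    paper), and u is a hypothetical weighting coefficient."""
    frames = np.asarray(frames, dtype=np.float64)
    masks = []
    for k in range(len(frames) - 1):
        d = frames[k + 1] - frames[k]             # frame difference d_{k,k+1}
        d_bar = box_mean(d, Nw)                   # Eq. (2): local mean
        f4 = box_mean((d - d_bar) ** 4, Nw)       # Eq. (1): local fourth moment
        s = d[:Ns, :Ns]                           # still-background area S
        var = np.mean((s - s.mean()) ** 2)        # Eq. (3): noise variance
        masks.append(f4 >= u * var ** 2)          # Eq. (4): threshold test
    return np.logical_or.reduce(masks)            # Eq. (5): multi-frame max

# Toy sequence: a bright block appears between frames 2 and 3.
rng = np.random.default_rng(0)
seq = rng.normal(0.0, 0.01, size=(6, 32, 32))    # still, slightly noisy background
seq[3:, 12:20, 12:20] += 10.0
mask = motion_mask(seq)
```

Note that the centered fourth moment responds mainly where the difference image varies within the window (object boundaries and texture); flat interiors of large moving regions may be missed, which is one reason the paper fuses this mask with spatial segmentation.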

Fig. 1. Block diagram of the proposed algorithm.

2.2. Spatial segmentation based on statistical modeling

Spatial segmentation aims to segment an image into homogeneous regions. To acquire perceptually consistent results, a new statistical-modeling-based method is introduced. To characterize the visual properties of images effectively, multiple features are extracted for each pixel: the gray value $f$, the gray deviation $f_d$, the two-dimensional gradients $g_x$ and $g_y$ along the horizontal and vertical axes, and the two-dimensional location descriptors $x$ and $y$. The gray deviation is calculated within a local window:

$$f_d = f - \frac{1}{N_l \times N_l}\sum_{i=1}^{N_l \times N_l} f_i \qquad (6)$$

where $f_d$ denotes the gray deviation, $f$ the gray value, and $N_l$ the size of the local window; in this paper, $N_l$ is set to 5.

For a given image $I$, after feature extraction a finite mixture model is used to approximate the class distribution of image pixels in the multi-dimensional feature space:

$$P(X|I,\Theta) = \sum_{i=1}^{K} \omega_i\,P(X|S_i,\theta_i) \qquad (7)$$

where $\Theta = \{K, \omega, \theta\}$ is the parameter set of model structure, weights, and model parameters; $P(X|S_i,\theta_i)$ is the $i$th mixture component, approximating the class distribution of connected image pixels with similar visual properties; $K$ is the model structure (i.e., the number of mixture components); $\theta = \{\theta_i \mid i = 1, \ldots, K\}$ is the set of model parameters of the $K$ mixture components; and $\omega = \{\omega_i \mid i = 1, \ldots, K\}$ is the set of their relative weights. $P(X|S_i,\theta_i)$ is generally assumed to follow a Gaussian distribution.

To learn the finite mixture model for statistical image modeling, the maximum-likelihood criterion is used to determine the underlying model parameters. The parameter set $\hat{\Theta} = (\hat{K}, \hat{\omega}, \hat{\theta})$ for the given image $I$ is determined by:

$$\hat{\Theta} = \arg\max_{\Theta}\{L(I,\Theta)\} \qquad (8)$$

where $L(I,\Theta) = \log P(X|I,\Theta)$ is the log-likelihood function. The maximization in (8) can be achieved with the EM algorithm under a predefined model structure $K$ [10]. However, a fixed model structure may mismatch the real class distribution of image pixels, because different images may consist of various image compounds with diverse visual properties. Hence, in this paper, the traditional EM algorithm is improved. Firstly, the model structure is set to a reasonably large

value $K_{max}$ to estimate the model parameters with the traditional EM algorithm. Then a merging operation is performed on the classified mixture components to re-organize the distribution of image pixels. To select appropriate components for merging, the Kullback divergence between two mixture components and the local Kullback divergence between one mixture component and the local pixel density are considered.
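When the components are Gaussian, the Kullback divergence between two of them can be evaluated in closed form, with no numerical integration. Below is a minimal sketch of that building block of the merging test; the function name and the example parameters are illustrative, not from the paper, and the local-density term would additionally require the empirical pixel density.

```python
import numpy as np

def gauss_kl(mu0, cov0, mu1, cov1):
    """Closed-form KL divergence KL(N0 || N1) between two Gaussian mixture
    components, the kind of quantity compared in the component-merging test.
    (Sketch only; the paper's criterion also involves a local-density term.)"""
    mu0, mu1 = np.asarray(mu0, float), np.asarray(mu1, float)
    cov0 = np.atleast_2d(cov0).astype(float)
    cov1 = np.atleast_2d(cov1).astype(float)
    d = mu0.size
    inv1 = np.linalg.inv(cov1)
    diff = mu1 - mu0
    _, ld0 = np.linalg.slogdet(cov0)
    _, ld1 = np.linalg.slogdet(cov1)
    return 0.5 * (np.trace(inv1 @ cov0) + diff @ inv1 @ diff - d + ld1 - ld0)

# Identical components give zero divergence; well-separated ones a large value.
same = gauss_kl([0, 0], np.eye(2), [0, 0], np.eye(2))
far = gauss_kl([0, 0], np.eye(2), [5, 5], np.eye(2))
```

Components with a small mutual divergence describe nearly the same density and are therefore natural merge candidates, while a large value argues for keeping them separate.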

The Kullback divergence $D_{kl}(S_i, S_j)$ between the $i$th mixture component $P(X|S_i,\theta_i)$ and the $j$th mixture component $P(X|S_j,\theta_j)$ is defined as:

$$D_{kl}(S_i,S_j) = \int P(X|S_i,\theta_i)\log\frac{P(X|S_i,\theta_i)}{P(X|S_j,\theta_j)}\,dX \qquad (9)$$

The local Kullback divergence $D_{kl}(I, S_i)$ between the $i$th mixture component $P(X|S_i,\theta_i)$ and the corresponding local pixel density $P(X|I,\theta_i)$ is defined as:

$$D_{kl}(I,S_i) = \int P(X|S_i,\theta_i)\log\frac{P(X|S_i,\theta_i)}{P(X|I,\theta_i)}\,dX \qquad (10)$$

where the local pixel density $P(X|I,\theta_i)$ is calculated by

$$P(X|I,\theta_i) = \frac{\sum_{j=1}^{N}\delta(X-X^j)\,P(S_i,\theta_i|X^j)}{\sum_{j=1}^{N} P(S_i,\theta_i|X^j)} \qquad (11)$$

and $P(S_i,\theta_i|X)$ is the posterior probability, defined as

$$P(S_i,\theta_i|X) = \frac{\omega_i\,P(X|S_i,\theta_i)}{\sum_{i=1}^{K}\omega_i\,P(X|S_i,\theta_i)} \qquad (12)$$

If two mixture components, $P(X|S_i,\theta_i)$ and $P(X|S_j,\theta_j)$, are strongly overlapped, they provide similar densities and can potentially be merged into a single mixture component $P(X|S_{ij},\theta_{ij})$. The merging criterion is:

$$J(i,j,\theta_{ij}) = \lambda\,KL(I,S_{ij}) + (1-\lambda)\,KL(S_i,S_j) \qquad (13)$$

where $\lambda$ is a weighting coefficient; in this paper, $\lambda$ is set to 0.5.

The steps of our improved EM algorithm are as follows:

(1) Initialize the model structure with $K_{max}$.
(2) Estimate the model parameters with the traditional EM algorithm.
(3) When the EM algorithm converges, apply the merging operation to the results: for two components $i$ and $j$, if $J(i,j,\theta_{ij}) > \delta_m$, merge them and go to step (2). Here, $\delta_m$ is the threshold, given by

$$\delta_m = \alpha\,\frac{1}{K(K-1)}\sum_{i=1}^{K-1}\sum_{j=i+1}^{K} J(i,j,\theta_{ij}) \qquad (14)$$

where $K$ is the number of components and $\alpha$ is a constant determined by experiments; in this paper, $\alpha$ is set to 0.5 and $K_{max}$ to 30.

Once the statistical image modeling is finished, the image pixels can be classified into different clusters according to the posterior probability; that is, the $i$th pixel is classified into the $j$th cluster if

$$j = \arg\max_{l}\,\omega_l\,P(S_l,\theta_l|X^i) \qquad (15)$$

Based on the clustering results, the whole image can then be segmented into different areas by spatial pixel grouping and connection.

2.3. Fusing operation and moving object extraction

After motion analysis and spatial segmentation, the results of both are fused to extract the final moving objects. Let $O_B(x,y)$ denote the binary motion mask, $O_s$ $(s = 0, 1, \ldots, Q)$ the segmented regions of spatial segmentation, and $N_s$ the size of each $O_s$. Define the matching rate $\alpha_{sB}$ between $O_s$ and $O_B(x,y)$ as

$$\alpha_{sB} = \frac{1}{N_s}\sum_{(x,y)\in O_s} O_B(x,y), \qquad s = 0, 1, \ldots, Q \qquad (16)$$

Given a threshold $\alpha_T$, the moving object $O(x,y)$ is determined by:

$$O(x,y) = \bigcup_{\alpha_{sB} > \alpha_T} O_s, \qquad s = 0, 1, \ldots, Q \qquad (17)$$

In this paper, $\alpha_T$ is set to 0.5. Finally, morphological post-processing is performed to acquire good final moving objects.
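The fusing rule reduces to computing, for each spatial region, the fraction of its pixels flagged by the motion mask, and keeping the regions whose fraction exceeds the threshold. A minimal sketch on a toy 4 x 4 label map (the label layout and values are illustrative):

```python
import numpy as np

def fuse(motion_mask, region_labels, alpha_T=0.5):
    """Fuse the binary motion mask O_B with the spatial segmentation:
    keep every region whose matching rate exceeds alpha_T.
    (Sketch; morphological post-processing is omitted.)"""
    obj = np.zeros_like(motion_mask, dtype=bool)
    for s in np.unique(region_labels):
        region = region_labels == s
        alpha_sB = motion_mask[region].mean()   # matching rate of region s
        if alpha_sB > alpha_T:                  # region belongs to the object
            obj |= region
    return obj

labels = np.zeros((4, 4), dtype=int)
labels[:, 2:] = 1                  # two spatial regions: left (0) and right (1)
ob = np.zeros((4, 4), dtype=bool)
ob[:, 2:] = True                   # motion detected throughout region 1
ob[0, 0] = True                    # a stray motion pixel inside region 0
out = fuse(ob, labels)
```

Here region 1 has matching rate 1.0 and is kept, while region 0's single stray motion pixel yields a rate of 0.125, below the threshold, so the stray detection is discarded; this is how spatial structure completes and cleans the motion mask.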

3. Experimental results

To evaluate the performance of the proposed algorithm, experiments were conducted. Firstly, motion analysis is performed based on multi-frame differences to separate initial moving regions from the still background. Then, spatial segmentation is performed to segment a frame into homogeneous regions based on the Gaussian mixture model. Finally, the results of motion detection and spatial segmentation are fused to extract the final moving objects. Partial experimental results are given in Figs. 2–7. To demonstrate the advantages of our method, some comparison


Fig. 2. Experimental results of Claire sequence, where (a) is the binary mask extracted from two frames, (b) is the binary mask extracted from five frames, (c) is the fused result, and (d) is the final extracted moving object.

Fig. 3. Experimental results of Diskus sequence, where (a) is the binary mask extracted from two frames, (b) is the binary mask extracted from five frames, (c) is the fused result, and (d) is the final extracted moving object.

Fig. 4. Experimental results of Alex sequence, where (a) is the binary mask extracted from two frames, (b) is the binary mask extracted from five frames, (c) is the fused result, and (d) is the final extracted moving object.

Fig. 5. Experimental results of Miss sequence, where (a) is the binary mask extracted from two frames, (b) is the binary mask extracted from five frames, (c) is the fused result, and (d) is the final extracted moving object.

results with the classic JSEG technique [11] are also given in Fig. 8. From these results, one can see that the extracted mask is very incomplete when only a few frame differences are used, owing to the slow motion of the objects. As the number of frame differences increases, the extracted motion masks become more complete. After the fusing operation on the results of motion analysis and spatial segmentation, rather complete object

masks can be acquired, and the final perceptually consistent moving objects can be extracted after further post-processing. The results in Fig. 8 show that, compared with the classic JSEG image and video segmentation technique, our method acquires more accurate and perceptually consistent segmentation results. It is worth mentioning that, though our experiments are performed on gray-scale videos, the proposed algorithm can be easily

Fig. 6. Experimental results of Grandma sequence, where (a) is the binary mask extracted from two frames, (b) is the binary mask extracted from five frames, (c) is the fused result, and (d) is the final extracted moving object.


Fig. 7. Experimental results of silent, Suzie and mother–daughter sequences: (a) silent, (b) Suzie, and (c) mother–daughter.

extended to segmentation of moving objects in color videos by substituting color features for gray features. In our experiments, several limitations were also found. The major one is how to set an appropriate $K_{max}$ for different applications. An accurate $K_{max}$ can

reduce the number of merging iterations and hence the computational load. However, it is not easy to set an accurate $K_{max}$ automatically for different videos. In addition, in this paper we only consider the scenario in which the camera has fixed position and

Fig. 8. Partial comparison results between the proposed method and the JSEG technique: (a) the proposed method and (b) the JSEG method.


parameters and the background is still, which does not hold in some applications. All these limitations will be addressed in our future work.

4. Conclusion

Due to the semantic gap between low-level visual features and the high-level human interpretation of video semantics, accurate and perceptually consistent moving object segmentation is still challenging. In this paper, a novel hybrid algorithm based on frame difference and GMM is proposed. The differences among multiple frames are first employed to detect initial moving regions, and spatial segmentation is performed based on GMM. Then, by fusing the results of motion detection and spatial segmentation, the final moving objects can be extracted. Experimental results reveal that the segmentation results are rather perceptually consistent.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (Nos. 60902066, 60872094, 60832003), the Natural Science Foundation of Zhejiang Province (No. Y107740), and the Scientific Research Foundation of Zhejiang Provincial Education Department (No. Z200909361).

References

[1] Wu ZP, Chen C. A moving object segmentation technique using dynamic programming. In: Proceedings of the ACCV 2002. 2002. p. 1–6.
[2] Zeng W, Gao W. Accurate moving object segmentation by a hierarchical region labeling approach. In: Proceedings of the ICASSP 2004. 2004. p. 637–40.
[3] Sun C, Talbot H, Ourselin S, Adriaansen T. Automatic adaptive segmentation of moving objects based on spatio-temporal information. In: Proceedings of the VII digital image computing: techniques and applications. 2003. p. 1007–16.
[4] Huang JC, Hsieh WS. Wavelet-based moving object segmentation. Electronics Letters 2003;39:1380–2.

[5] Huang JC, Su TS, Wang LJ, Hsieh WS. Double change detection method for wavelet-based moving object segmentation. Electronics Letters 2004;40:798–9.
[6] Zeng W, Du J, Gao W, Huang QH. Robust moving object segmentation on H.264/AVC compressed video using the block-based MRF model. Real-Time Imaging 2005;11:290–9.
[7] Zheng XS, Zhao YL, Li N, Wu HM. An automatic moving object detection algorithm for video surveillance applications. In: Proceedings of the international conference on embedded software and systems. 2009. p. 541–3.
[8] Wan CK, Yuan BZ, Miao ZJ. A moving object segmentation algorithm for static camera via active contours and GMM. Science in China Series F: Information Sciences 2009;52:322–8.
[9] Neri A, Colonnese S, Russo G. Automatic moving object and background separation. Signal Processing 1998;66:219–32.
[10] McLachlan G, Krishnan T. The EM algorithm and extensions. New York: John Wiley & Sons; 1997.
[11] Deng Y, Manjunath BS. Unsupervised segmentation of color-texture regions in images and video. IEEE Transactions on PAMI 2001:800–10.

Zhongjie Zhu received the PhD degree in electronics science and technology from Zhejiang University, China, in 2004. He is currently a professor with the faculty of Electronics and Information Engineering, Zhejiang Wanli University, China. His research interests mainly include digital video compression and communication, watermarking and information hiding, 3D image processing, and image understanding.

Yuer Wang received her MS degree from Shanghai Fisheries University, China, in 2007. She is currently a research associate in Zhejiang Wanli University, China and her research interests include digital video compression and signal processing, watermarking and information hiding.