Pattern Recognition Letters 26 (2005) 653–662 www.elsevier.com/locate/patrec
Semi-automatic video object segmentation using seeded region merging and bidirectional projection

Zhi Liu *, Jie Yang, Ning Song Peng

Institute of Image Processing and Pattern Recognition, Shanghai Jiaotong University, Shanghai 200030, People's Republic of China

Received 9 March 2004; received in revised form 4 August 2004

* Corresponding author. Tel.: +86 21 62934627; fax: +86 21 62932035. E-mail address: [email protected] (Z. Liu).
Abstract

In this paper, we propose a novel approach to semi-automatic video object segmentation. First, an interactive video object segmentation tool is presented for the user to easily define the desired video objects in the first frame. The tool is user-friendly, flexible and efficient due to the proposed fast seeded region merging approach and the combination of two different ways of user interaction, i.e., marker drawing and region selection. Then, a bidirectional projection approach is proposed to automatically track the video objects in the subsequent frames; it combines forward projection and backward projection to improve the segmentation efficiency, and incorporates pixel classification with region classification in backward projection to guarantee a more reliable tracking performance. Experimental results for various types of MPEG-4 test sequences demonstrate the efficient and faithful segmentation performance of the proposed approach.
© 2004 Elsevier B.V. All rights reserved.

Keywords: Video object segmentation; Video object tracking; Seeded region merging; Bidirectional projection
1. Introduction

As an important issue in the implementation of many content-based multimedia applications supported by MPEG-4, video object segmentation remains a challenging research topic. Although human beings can easily identify different video objects in a video sequence, it is hard for a computer to automatically segment the desired video objects in generic video sequences. At present, efficient algorithms for automatic video object segmentation apply only to moving objects or to certain kinds of objects for which prior knowledge is available (Fan and Elmagarmid, 2002; Fan et al., 2001; Kim and Hwang, 2002; Kim et al., 1999; Meier and Ngan, 1998; Tsaig and Averbuch, 2002). In the near future, it seems hardly possible to develop a generic automatic algorithm
applicable to a variety of video sequences. Therefore, a more practical solution, so-called semi-automatic video object segmentation (Cooray et al., 2001; Gatica-Perez et al., 1999; Gu and Lee, 1998a,b; Guo et al., 1999; Kim et al., 2001, 2003; Lim et al., 2000; Luo and Eleftheriadis, 2002; Sun et al., 2003), has drawn more and more attention in recent years. A typical paradigm of semi-automatic video object segmentation consists of two steps: segmenting the first frame with user interaction to define the video objects, and automatically tracking them in the subsequent frames.

The first step is extremely important in any semi-automatic video object segmentation algorithm, because the accuracy of the segmented video objects directly determines the success or failure of the following tracking process. A user-friendly segmentation tool should be provided for the user to conveniently define the video objects, and user interaction should be minimized to improve the segmentation efficiency. However, in most existing approaches the flexibility and efficiency of user interaction are rarely considered as important as the algorithm itself. The most common way of user interaction is to delineate an approximate contour clinging to the video object (Guo et al., 1999; Kim et al., 2001, 2003). However, moving the mouse along the true object contour is a burdensome job, especially when the shape of the object is complex. For approaches based on the snake model, a considerable number of control points around the object contour need to be selected one by one (Luo and Eleftheriadis, 2002; Sun et al., 2003). Region selection is a more natural way to define a video object, but an excessive number of regions still need to be selected at different partition levels (Cooray et al., 2001). In this paper, we propose an interactive video object segmentation tool that is user-friendly, flexible and efficient due to the proposed fast seeded region merging approach and the combination of two different ways of user interaction.

The second step is a process of video object tracking. Many approaches adopt a two-step configuration to track the video objects (Gu and Lee, 1998a; Guo et al., 1999; Lim et al., 2000; Kim et al., 2001, 2003): first, the previous objects are projected onto the current frame using some kind of parametric motion model, and then the object boundaries are refined. The underlying tracking mechanism is forward projection, which works well for rigid objects with translational motion. For non-rigid objects with multiple motions, irregular boundaries and spurious holes may appear on the video objects, and post-processing for boundary refinement is inevitable. In contrast with forward projection, backward projection (Gatica-Perez et al., 1999; Gu and Lee, 1998b) is suitable for non-rigid objects and needs no further refinement. Each segmented region in the current frame is projected onto the previous frame, and it is assigned to the current video object if the majority of the projected region overlaps the previous video object. In essence, this is a region classification approach rather than a tracking approach. However, backward projecting all segmented regions for classification is not efficient. Another problem arises when a segmented region overlaps both the video object and the background, which causes peninsulas or gaps to appear on the video object no matter which classification it is assigned. In this paper, we propose a bidirectional projection approach, mainly as an extension of backward projection (Gu and Lee, 1998b), which is more efficient due to the combination with forward projection, and which ensures the visual quality of the tracked video objects by incorporating pixel classification with region classification.

This paper is organized as follows. In Section 2, an interactive video object segmentation tool is presented. Section 3 proposes our bidirectional projection approach. Experimental results for different types of MPEG-4 test sequences are shown in Section 4. Conclusions are given in Section 5.
2. Interactive video object segmentation

To enable the user to easily extract the desired video object, we combine two ways of user interaction, i.e., marker drawing and region selection, in the flexible scheme shown in Fig. 1. The whole procedure of interactive video object segmentation consists of three steps: marker drawing, automatic video object extraction, and user correction.
Fig. 1. A flexible scheme of interactive video object segmentation.
Fig. 2. A screen shot of our GUI.
A screen shot of our graphical user interface (GUI) is shown in Fig. 2 and is used below to illustrate each step.

(1) Marker drawing: The user draws scribbles of different colors to roughly mark the video object and the background. As shown in Fig. 2, a red scribble marks the video object of interest, and a blue scribble marks the background in the left window of the GUI. (For interpretation of color in Fig. 2, the reader is referred to the web version of this article.) Scribble drawing is convenient and flexible for the user, and usually takes only a few seconds, which is faster than contour drawing or control-point selection.

(2) Automatic video object extraction: This step needs no user intervention. The computer performs two tasks to automatically extract the video object, i.e., spatial segmentation and fast seeded region merging, which are described in the following two subsections. The outcome of each task is shown in the middle and right windows, respectively.
(3) User correction: This step is optional. If not satisfied with the automatically extracted video object, the user can make corrections. In our GUI, the user left-clicks a region to add it to the video object and right-clicks a region to remove it. The number of mouse clicks depends on the image content and the marker drawing; it was usually fewer than two in our experiments. For example, the left image in Fig. 2 shows low contrast between the helmet of "foreman" and the background, and the region selected by the mouse in the middle image is merged into the background in the process of seeded region merging. The user just needs one left click to obtain the desired video object shown in the right image.
2.1. Spatial segmentation The watershed segmentation algorithm (Vincent and Soille, 1991) is exploited to partition the image into a set of regions because it preserves accurate boundaries between different objects.
However, the main drawback is over-segmentation due to noise in the gradient image. In order to obtain a moderate segmentation result, we propose a simplification step that removes insignificant local minima in the gradient image before applying the watershed algorithm. First, the gradient image g of the color image f in YUV space is estimated by the method proposed by Di Zenzo, which considers the relationship between the different image components and is more reasonable than simpler combinations such as the RMS, the sum, or the absolute maximum of the three component gradient images; the detailed calculation procedure can be found in (Di Zenzo, 1986). Then, g is dilated with a structuring element E, and the dilated image is elevated by a height h to get the marker image g_m:

$g_m = (g \oplus E) + h$    (1)
Finally, the reconstruction of g from g_m by geodesic erosion (Vincent, 1993) is performed to obtain the simplified gradient image g_s:

$g_s = \varphi^{(\mathrm{rec})}(g_m, g)$    (2)
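As a concrete illustration, the following is a minimal sketch of this simplification-plus-watershed pipeline, assuming scikit-image as the implementation library (the paper does not name one). The Di Zenzo gradient is approximated here by the per-channel maximum of Sobel gradients, which is one of the simpler combinations mentioned above rather than the full Di Zenzo formulation; the settings of E and h follow the discussion below.

```python
import numpy as np
from skimage.color import rgb2yuv
from skimage.filters import sobel
from skimage.morphology import dilation, reconstruction
from skimage.segmentation import watershed

def simplified_watershed(image_rgb, h=2):
    """Simplify the color gradient image (Eqs. (1) and (2)) and apply
    the watershed transform to obtain the label image f_l."""
    yuv = rgb2yuv(image_rgb)
    # Approximate multi-channel gradient: per-channel Sobel, combined by max,
    # rescaled to an 8-bit-like range so that h = 2 matches the paper's scale.
    g = np.max([sobel(yuv[..., c]) for c in range(3)], axis=0)
    g = 255.0 * g / (g.max() + 1e-12)

    # Eq. (1): dilate g with the 3x3 cross E, then elevate by h.
    cross = np.array([[0, 1, 0],
                      [1, 1, 1],
                      [0, 1, 0]], dtype=bool)
    g_m = dilation(g, cross) + h

    # Eq. (2): morphological reconstruction of g from g_m by geodesic
    # erosion; this removes local minima of depth smaller than h.
    g_s = reconstruction(g_m, g, method='erosion')

    # Watershed on the simplified gradient; with no explicit markers,
    # the local minima of g_s serve as catchment-basin seeds.
    return watershed(g_s)
```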
In our implementation, E is set to a 3 × 3 cross, the smallest symmetrical 4-connectivity structuring element, so that only the most insignificant local minima are removed. The parameter h is set to 2, the second smallest value, which removes those local minima with depth smaller than 2. Both parameters are set to rather small values, which leads to a reasonably fine partition. We then apply the watershed segmentation algorithm to the image g_s to obtain the label image f_l, which shows a reasonable partition of the original image f (see the middle image of Fig. 2).

2.2. Fast seeded region merging

A seeded region growing (SRG) algorithm was proposed in (Adams and Bischof, 1994) for grayscale image segmentation from a set of seed pixels. It is in essence a sequential labeling technique, in which each loop of the algorithm labels only one pixel, namely the pixel neighboring the already labeled pixels with the lowest dissimilarity measure. Moreover, it is a very efficient algorithm due to the use of a sorted list structure.
Here, we extend the idea of SRG from the pixel level to the region level and propose a fast seeded region merging algorithm for color images, which is especially applicable to interactive video object segmentation. The proposed algorithm involves the following steps (a code sketch is given after the list):

(1) For each region R_i of the label image f_l, we calculate the area A_i and the mean color MC_i, defined by

$MC_i = \frac{\sum_{(x,y) \in R_i} f(x,y)}{A_i}$    (3)

Then we generate the weighted region adjacency graph (WRAG) of the label image f_l, in which each non-infinity element WRAG(i, j) denotes the dissimilarity measure of the two adjacent regions R_i and R_j. Considering the difference in the mean color and the area of two adjacent regions, the following criterion is proposed to merge small regions preferentially; it proved to be a robust dissimilarity measure in our experiments:

$\mathrm{WRAG}(i,j) = \min(A_i, A_j)\,\|MC_i - MC_j\|^2$    (4)

(2) We divide the regions of the label image f_l into three sets denoted by vo, bg and un, which stand for video object, background and unknown regions, respectively. If a region R_i of f_l is traversed only by the red scribble or only by the blue scribble in the GUI, it is classified into vo or bg, respectively; otherwise it is classified into un. Now we have seed regions in vo for the video object, seed regions in bg for the background, and unassigned regions in un. Then we initialize a priority queue PQ, implemented as a minimum heap, with those regions in un that are adjacent to at least one region in vo or bg. The element associated with region R_i in PQ consists of the following data:

• the dissimilarity measure with vo, $dm\_vo = \min_{R_j \in vo} \mathrm{WRAG}(i,j)$,
• the dissimilarity measure with bg, $dm\_bg = \min_{R_j \in bg} \mathrm{WRAG}(i,j)$,
• the smaller dissimilarity measure, dm = min(dm_vo, dm_bg), by which PQ is sorted.
(3) The first element of PQ is deleted, and the associated region R_i is removed from un. R_i is appended to vo if dm_vo < dm_bg; otherwise it is appended to bg. Then we check each region R_j that is adjacent to R_i and still in un. If R_j is already in PQ, its data is updated as follows:

$dm\_vo = \mathrm{WRAG}(j,i)$, if $R_i \in vo$ and $\mathrm{WRAG}(j,i) < dm\_vo$;
$dm\_bg = \mathrm{WRAG}(j,i)$, if $R_i \in bg$ and $\mathrm{WRAG}(j,i) < dm\_bg$.    (5)

If R_j is not in PQ, it is inserted into PQ, and its data is initialized as follows:

$dm\_vo = \mathrm{WRAG}(j,i)$, $dm\_bg = \infty$, if $R_i \in vo$;
$dm\_bg = \mathrm{WRAG}(j,i)$, $dm\_vo = \infty$, if $R_i \in bg$.    (6)

In both cases above, dm = min(dm_vo, dm_bg).

(4) Repeat step 3 until PQ is empty.

(5) Since no region is left in un, all the regions in vo are grouped into the video object, and the regions in bg form the background.
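The following Python sketch makes steps (1)-(5) concrete. It assumes the label image and a precomputed region adjacency structure; the function and variable names are ours, and the paper's actual implementation may differ in detail (for instance, we use lazy deletion instead of updating heap entries in place).

```python
import heapq
import numpy as np

INF = float("inf")

def seeded_region_merging(labels, image, vo_seeds, bg_seeds, adjacency):
    """Classify every watershed region into video object (vo) or background (bg).

    labels    : 2-D int array of region labels (the label image f_l)
    image     : H x W x 3 float array (the YUV color image f)
    vo_seeds, bg_seeds : sets of region labels traversed by the red/blue scribbles
    adjacency : dict mapping each region label to the set of its neighbors
    """
    # Per-region area A_i and mean color MC_i (Eq. (3)).
    region_ids = np.unique(labels)
    area = {i: int(np.sum(labels == i)) for i in region_ids}
    mean_color = {i: image[labels == i].mean(axis=0) for i in region_ids}

    # Dissimilarity measure WRAG(i, j) of Eq. (4): small regions merge first.
    def wrag(i, j):
        d = mean_color[i] - mean_color[j]
        return min(area[i], area[j]) * float(d @ d)

    vo, bg = set(vo_seeds), set(bg_seeds)
    un = set(region_ids) - vo - bg

    # dm_vo / dm_bg per unknown region; the heap is ordered by min of the two.
    dm_vo = {i: INF for i in un}
    dm_bg = {i: INF for i in un}
    heap = []
    for i in un:
        for j in adjacency[i]:
            if j in vo:
                dm_vo[i] = min(dm_vo[i], wrag(i, j))
            elif j in bg:
                dm_bg[i] = min(dm_bg[i], wrag(i, j))
        if dm_vo[i] < INF or dm_bg[i] < INF:
            heapq.heappush(heap, (min(dm_vo[i], dm_bg[i]), i))

    # Step (3): repeatedly classify the region with the smallest dm.
    while heap:
        dm, i = heapq.heappop(heap)
        if i not in un or dm > min(dm_vo[i], dm_bg[i]):
            continue  # stale heap entry (lazy deletion)
        un.discard(i)
        target = vo if dm_vo[i] < dm_bg[i] else bg
        target.add(i)
        # Update or initialize the still-unknown neighbors (Eqs. (5), (6)).
        for j in adjacency[i]:
            if j not in un:
                continue
            w = wrag(j, i)
            if target is vo and w < dm_vo[j]:
                dm_vo[j] = w
            elif target is bg and w < dm_bg[j]:
                dm_bg[j] = w
            else:
                continue
            heapq.heappush(heap, (min(dm_vo[j], dm_bg[j]), j))
    return vo, bg
```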
3. Automatic video object tracking

In this section, we propose a bidirectional projection approach to automatically track the extracted video objects through the subsequent frames of the video sequence. The tracking task is to obtain the video object vo_n of the current frame, based on the motion information related to the previous video object vo_{n-1} and the spatial segmentation of the current frame. The flowchart of the proposed tracking approach is depicted in Fig. 3; it consists of three steps: forward projection, backward projection and post-processing.

3.1. Forward projection

The objective of forward projection is to locate the video object with rough boundary information derived from motion estimation. For each contour pixel I_{n-1}(x, y) of the previous video object vo_{n-1} (see Fig. 4(a)), the motion vector (u(x, y), v(x, y)) is estimated using the 3SS (three-step search) method (Tekalp, 1998) to minimize the following prediction error:
Fig. 3. Flowchart of the proposed bidirectional projection approach.
$e(x, y) = \min_{u, v} \sum_{i=-N}^{N} \sum_{j=-N}^{N} \left\| I_{n-1}(x+i,\, y+j) - I_n(x + u(x,y) + i,\, y + v(x,y) + j) \right\|$    (7)
To fully exploit the efficiency of the 3SS method with m steps, the search range should be set to [−2^m + 1, 2^m − 1]. Although a large search range of [−31, 31] (m = 5) is usually adopted in video coding applications, we can limit the search range in our application because the purpose is only to roughly locate the current video object. In our implementation, the search range for the motion vector (u(x, y), v(x, y)) is set to [−7, 7] (m = 3), which is enough to predict the apparent translational motion of the video object between consecutive frames. Since we use block matching to predict the translation of each contour pixel, a smaller matching block is more suitable than a bigger one; in our implementation, N is set to 2, giving (2N + 1) × (2N + 1) matching blocks.

Forward projection is performed on all contour pixels of vo_{n-1}, denoted by the pixel set ct_{n-1}. The projection of ct_{n-1} onto the current frame I_n is another pixel set p_n (see the black pixels in Fig. 4(b)):

$p_n = \{ (x + u(x,y),\, y + v(x,y)) \mid (x, y) \in ct_{n-1} \}$    (8)
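A minimal sketch of the per-pixel 3SS block matching of Eq. (7) follows, assuming grayscale frames and the sum of absolute differences as the block norm (the paper does not state which norm is used); the function name and border handling are ours.

```python
import numpy as np

def three_step_search(prev, cur, x, y, m=3, N=2):
    """Estimate (u, v) for the (2N+1)x(2N+1) block centered at contour pixel
    (x, y) of `prev`, minimizing the prediction error of Eq. (7) with the
    three-step search (m steps, search range [-(2**m - 1), 2**m - 1]).
    Assumes (x, y) lies at least N pixels from the frame border."""
    h, w = prev.shape[:2]
    block = prev[y - N:y + N + 1, x - N:x + N + 1].astype(np.float64)

    def cost(u, v):
        ys, xs = y + v, x + u
        if ys - N < 0 or xs - N < 0 or ys + N + 1 > h or xs + N + 1 > w:
            return np.inf  # candidate block falls outside the frame
        cand = cur[ys - N:ys + N + 1, xs - N:xs + N + 1]
        return float(np.abs(block - cand).sum())  # SAD block norm

    u = v = 0
    step = 2 ** (m - 1)                  # e.g. 4, 2, 1 for m = 3
    while step >= 1:
        # Evaluate the 3x3 grid of candidates around the current best.
        candidates = [(u + du * step, v + dv * step)
                      for du in (-1, 0, 1) for dv in (-1, 0, 1)]
        u, v = min(candidates, key=lambda c: cost(*c))
        step //= 2
    return u, v
```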
Fig. 4. A pictorial description of the proposed bidirectional projection approach.
These projected pixels in p_n may not fall exactly on the true contour ct_n of the video object vo_n in the current frame, and in general they do not form a closed contour. All pixels in p_n are therefore dilated with a disk-shaped structuring element E_d to obtain a band area B_n (see Fig. 4(b)) that accommodates the rotation, scale change and deformation of the video object. The radius of E_d depends on the non-translational motion exhibited in the video sequence, and it should ensure that the true contour ct_n lies inside the band area B_n. We found through many experiments that E_d = 15 is enough for different types of video sequences. The approximate translation vector $(T^u_{n-1}, T^v_{n-1})$ of the video object is estimated as the average of the motion vectors of all pixels in ct_{n-1}:

$T^u_{n-1} = \frac{\sum_{(x,y) \in ct_{n-1}} u(x,y)}{|ct_{n-1}|}, \qquad T^v_{n-1} = \frac{\sum_{(x,y) \in ct_{n-1}} v(x,y)}{|ct_{n-1}|}$    (9)

This vector reflects the global translational movement of the video object if an apparent translation exists, and it will be used in the backward projection described in the next subsection.
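Putting Eqs. (8) and (9) together, the following sketch builds the projected pixel set p_n, the band area B_n, and the approximate translation vector; scikit-image is assumed for the disk dilation, and all names are ours.

```python
import numpy as np
from skimage.morphology import disk, binary_dilation

def forward_projection(contour_pixels, motion, shape, radius=15):
    """Project the previous object contour and build the band area B_n.

    contour_pixels : list of (x, y) contour pixels ct_{n-1}
    motion         : dict mapping (x, y) -> (u, v), e.g. from three_step_search
    shape          : (H, W) of the current frame
    Returns the band mask B_n and the mean translation (T_u, T_v) of Eq. (9).
    """
    h, w = shape
    proj = np.zeros((h, w), dtype=bool)
    for (x, y) in contour_pixels:
        u, v = motion[(x, y)]
        px, py = x + u, y + v                # Eq. (8): projected pixel of p_n
        if 0 <= px < w and 0 <= py < h:
            proj[py, px] = True
    band = binary_dilation(proj, disk(radius))   # band area B_n
    t_u = np.mean([motion[p][0] for p in contour_pixels])
    t_v = np.mean([motion[p][1] for p in contour_pixels])
    return band, (t_u, t_v)
```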
3.2. Backward projection

The objective of backward projection is to find the true contour ct_n of the current video object vo_n inside the band area B_n. In fact, only the band area B_n needs to be partitioned into regions, and only those regions need to be backward projected to determine whether they belong to the current video object. The area R_in inside B_n definitely belongs to the current video object, while the area R_out outside B_n belongs to the background (see Fig. 4(b)). The spatial segmentation algorithm described in Section 2.1 is used to partition the current frame I_n (see Fig. 4(c)). Only the pixels in B_n require gradient calculation; the gradient of all other pixels is simply set to zero. This also saves computation time in the watershed segmentation, because the areas not covered by B_n can simply be flooded as the lowest flat catchment basins. The segmented regions, excluding R_in and R_out (see Fig. 4(d)), are backward projected to determine their classification. For each region R_i, the backward motion vector (u_i, v_i) is estimated to minimize the following prediction error:

$e_i = \min_{u_i, v_i} \sum_{(x,y) \in R_i} \| I_n(x, y) - I_{n-1}(x + u_i,\, y + v_i) \|$    (10)

Compared with the search range [−7, 7] used in forward projection, a halved search range [−3, 3] with offset $(T^u_{n-1}, T^v_{n-1})$ is used to predict (u_i, v_i). It is reasonable to reduce the number of search levels from m = 3 to m = 2, because the vector $(T^u_{n-1}, T^v_{n-1})$ already accounts for any apparent translation of the video object.
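For illustration, here is a sketch of this region-level backward motion estimation (Eq. (10)); for simplicity it searches the ±3 window around the offset exhaustively rather than with the reduced three-step pattern, and the function name and the error normalization are our assumptions.

```python
import numpy as np

def region_motion(cur, prev, region_pixels, offset, m=2):
    """Backward motion vector (u_i, v_i) of Eq. (10): search the window
    [-(2**m - 1), 2**m - 1] around the global translation offset."""
    h, w = prev.shape[:2]
    ys, xs = region_pixels                       # row/col coordinates of R_i
    vals = cur[ys, xs].astype(np.float64)
    ou, ov = int(round(offset[0])), int(round(offset[1]))
    r = 2 ** m - 1                               # +/-3 for m = 2
    best, best_uv = np.inf, (ou, ov)
    for v in range(ov - r, ov + r + 1):
        for u in range(ou - r, ou + r + 1):
            pys, pxs = ys + v, xs + u
            ok = (pys >= 0) & (pys < h) & (pxs >= 0) & (pxs < w)
            if not ok.any():
                continue
            # Mean absolute error over in-frame pixels (normalization is ours).
            e = np.abs(vals[ok] - prev[pys[ok], pxs[ok]]).sum() / ok.sum()
            if e < best:
                best, best_uv = e, (u, v)
    return best_uv
```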
The backward-projected region $R'_i$ in the previous frame I_{n-1} is given by

$R'_i = \bigcup_{(x,y) \in R_i} (x + u_i,\, y + v_i)$    (11)
The classification of R_i can be determined from the overlap of $R'_i$ with vo_{n-1}. A natural method is to classify R_i into vo_n if the majority of $R'_i$ intersects vo_{n-1}, and into the background otherwise (Gu and Lee, 1998b). However, this method is not robust enough to guarantee the visual quality of the segmented video objects throughout the tracking process. Specifically, binary classification is not suitable for a segmented region that overlaps the video object and the background at the same time: if such a region (see the region R_u in Fig. 4(d)) is classified into the video object, a peninsula appears on the video object; otherwise a gap appears (see Fig. 4(e)). To deal with this problem, we propose a robust approach that improves on the method in (Gu and Lee, 1998b). The ratio of the intersecting area of $R'_i$ and vo_{n-1} to the area of $R'_i$ is defined by

$h_i = \frac{A[R'_i \cap vo_{n-1}]}{A[R'_i]}$    (12)

where A[·] denotes the area operation. The value of h_i distinguishes three types of region: a fairly high value indicates that R_i belongs to the video object, a fairly low value indicates that R_i is part of the background, and a moderate value indicates that R_i may overlap the video object and the background at the same time. For the first and second cases, the whole region is assigned to the video object or the background based on the following criterion:
$R_i \in vo_n$, if $h_i > T_h$; $\qquad R_i \notin vo_n$, if $h_i < T_l$    (13)

For the third case, $T_l \leq h_i \leq T_h$, pixel classification in the region R_i is performed using the following criterion:
$(x, y) \in vo_n$, if $(x + u_i,\, y + v_i) \in vo_{n-1}$; $\qquad (x, y) \notin vo_n$, if $(x + u_i,\, y + v_i) \notin vo_{n-1}$    (14)
Since the value of h_i lies in the range [0, 1], T_h should be greater than 0.5, i.e., T_h = 0.5 + Δ (Δ > 0), and the other parameter T_l is set to the margin value Δ. Pixel classification and region classification therefore each cover half of the whole range. In our experiments, the margin value Δ is set to 0.15 for all test sequences, and these two criteria lead to a reliable tracking performance (see Fig. 4(f)); a code sketch combining this classification with the post-processing described next is given at the end of Section 3.3.

3.3. Post-processing

In the previous subsection, only the regions are considered in the process of backward projection; the boundaries (watershed lines) between different regions are not classified. Therefore, a morphological closing is first performed to fill the watershed lines in the video object, and it is this closed video object that is propagated in the tracking process. Since a closing is applied, an opening is also performed subsequently. The cascade of closing and opening also smooths the boundary of the video object, which sometimes enhances the visual quality of the segmented video object. The structuring element for both morphological operations is a 5 × 5 square, which achieves a good trade-off between the accuracy and the smoothness of the video object boundaries.
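The following sketch combines the hybrid region/pixel classification of Eqs. (12)-(14) with the closing/opening post-processing just described; scikit-image is assumed for the morphology, and the data layout is our assumption.

```python
import numpy as np
from skimage.morphology import binary_closing, binary_opening

def classify_regions(regions, vo_prev, delta=0.15):
    """Hybrid region/pixel classification (Eqs. (12)-(14)), followed by
    the closing/opening post-processing of Section 3.3.

    regions : list of (mask, (u_i, v_i)) pairs -- each segmented region of
              the band area with its backward motion vector
    vo_prev : boolean mask of the previous video object vo_{n-1}
    Returns the boolean mask of the current video object vo_n.
    """
    t_h, t_l = 0.5 + delta, delta
    h, w = vo_prev.shape
    vo_cur = np.zeros_like(vo_prev)

    for mask, (u, v) in regions:
        ys, xs = np.nonzero(mask)
        # Backward-project the region into the previous frame (Eq. (11));
        # clipping at the borders is a simplification of ours.
        pys = np.clip(ys + v, 0, h - 1)
        pxs = np.clip(xs + u, 0, w - 1)
        overlap = vo_prev[pys, pxs]
        hi = overlap.mean()              # Eq. (12): |R'_i ∩ vo_{n-1}| / |R'_i|
        if hi > t_h:                     # Eq. (13): whole region is object
            vo_cur[ys, xs] = True
        elif hi >= t_l:                  # Eq. (14): per-pixel decision
            vo_cur[ys[overlap], xs[overlap]] = True
        # hi < t_l: whole region stays background

    # Post-processing: fill watershed lines and smooth the boundary.
    se = np.ones((5, 5), dtype=bool)
    return binary_opening(binary_closing(vo_cur, se), se)
```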
4. Experimental results

We tested the proposed semi-automatic video object segmentation approach on several MPEG-4 test sequences. The experimental results for three test sequences are shown in Figs. 5–7. These sequences represent different levels of spatial detail and movement in real situations. The first sequence, Mother and Daughter, is an MPEG-4 class A sequence with low spatial detail and a low amount of movement; the background is uniform and static, and the motion of the human bodies is relatively small. The second sequence, Foreman, is an MPEG-4 class B sequence with medium spatial detail and a low amount of movement; the background is complex and shows low contrast with the talking person, and the camera motion is apparent in addition to the non-rigid motion of the person.
Fig. 5. Experimental results for the sequence Mother and Daughter (Frame: 1, 20, 40, 60, 80, 100).
Fig. 6. Experimental results for the sequence Foreman (Frame: 1, 20, 40, 60, 80, 100).
Fig. 7. Experimental results for the sequence Table Tennis (Frame: 1, 5, 15, 25, 35, 40).
Table 1
Average processing time for a frame of the three sequences (msec)

                                    Proposed bidirectional projection approach            Backward projection approach in (Gu and Lee, 1998b)
Test sequence                       Forward projection   Seg.   Backward projection   Total        Seg.   Backward projection   Total
Mother and Daughter (176 × 144)     47                   37     81                    165 (118)    43     281                   324
Foreman (176 × 144)                 43                   42     68                    153 (110)    58     275                   333
Table Tennis (352 × 240)            95                   131    181                   407          154    981                   1135

Numbers in brackets are the total times when forward projection is skipped (see Section 4).
The third sequence, Table Tennis, is an MPEG-4 class C sequence with high spatial detail and a medium amount of movement. Several moving objects appear on the cluttered background. The video object of interest is the arm holding the racket, which mixes the different rigid motions of the arm, the hand and the racket.

For all sequences, the initial video objects can be easily obtained using our interactive video object segmentation tool; an example for the sequence Foreman is described in Section 2. In our experiments, it takes about 5 s to obtain the desired video object, and there is no perceptible delay during the run of seeded region merging. The amount of user correction is minimal, because the proposed seeded region merging algorithm already provides a fairly good semantic video object. The video objects extracted from the first frame of the three sequences are shown in the first image of Figs. 5–7, respectively.

The proposed bidirectional projection approach is then used to automatically track the video object in the subsequent frames. The desired video objects with good visual quality are obtained throughout the tracking process (see the latter five images in Figs. 5–7). The experiments were performed on a low-end AMD Athlon XP1800 (1.53 GHz) PC. The average processing time per frame of our bidirectional projection approach and of the backward projection approach in (Gu and Lee, 1998b) is shown in Table 1; the same values were set for the related parameters in both approaches. Compared with Gu and Lee's approach, our approach spends some time on forward projection, but sharply reduces the time spent on backward projection (including post-processing) and, to some extent, on spatial segmentation. For the three sequences, the total processing time of our approach is 51%, 46%, and 36% of that of Gu and Lee's approach, which demonstrates the improved segmentation efficiency of our approach. For head-and-shoulder sequences with relatively small motion, such as Mother and Daughter and Foreman, nearly the same experimental results with good visual quality are obtained if forward projection is skipped; in this case, the band area B_n is dilated directly from the previous object contour ct_{n-1}, and the approximate translation vector $(T^u_{n-1}, T^v_{n-1})$ is a zero vector. The processing time is then further reduced (see the numbers in brackets in Table 1), corresponding to about 9 frames per second. More efficiency could be gained through code optimization or a faster processor.
5. Conclusions

Video object segmentation is indispensable for MPEG-4 related multimedia applications. A novel approach to semi-automatic video object segmentation is proposed in this paper, which combines interactive segmentation with automatic tracking. An interactive video object segmentation tool is presented to allow the user to easily define the video objects; the user interaction is convenient due to the flexible combination of marker drawing and region selection, and is also minimized because the proposed fast seeded region merging approach can extract a fairly good video object on its own. A bidirectional projection approach is proposed for automatic video object tracking, which extends backward projection through its combination with forward projection.
The proposed tracking approach produces more reliable video objects for different types of video sequences, and improves the segmentation efficiency by a factor of about two. The current system for semi-automatic video segmentation is implemented as independent modules; we will consider high-level features of the video object in the tracking module to further improve the tracking reliability for a wider range of video objects.

In conclusion, we believe that, with further improvement, the proposed semi-automatic video object segmentation approach could be useful in many applications. Typical applications are video-telephony and videoconferencing, where the user may interact with the segmentation process, for example to achieve better coding quality for the most relevant objects such as human beings. Video production is another potential application, where different types of objects are segmented for database storage and reuse in other contexts. It is also possible to associate appropriate metadata with these video objects, so that they can be used in interactive broadcasting applications that allow the user to request additional information about each object in the scene.

References

Adams, R., Bischof, L., 1994. Seeded region growing. IEEE Trans. Pattern Anal. Machine Intell. 16 (6), 641–647.

Cooray, S., O'Connor, N., Marlow, S., Murphy, N., Curran, T., 2001. Hierarchical semi-automatic video object segmentation for multimedia applications. Proc. SPIE Internet Multimedia Manage. Syst. II 4519, 10–19.

Di Zenzo, S., 1986. A note on the gradient of a multi-image. Comput. Vis. Graphics Image Process. 33 (1), 116–125.

Fan, J., Elmagarmid, A.K., 2002. An automatic algorithm for semantic object generation and temporal tracking. Signal Process. Image Commun. 17 (2), 145–164.

Fan, J., Zhu, X., Wu, L., 2001. Automatic model-based semantic object extraction algorithm. IEEE Trans. Circ. Syst. Video Technol. 11 (10), 1073–1084.
Gatica-Perez, D., Sun, M.T., Gu, C., 1999. Semantic video object extraction based on backward tracking of multivalued watershed. Proc. IEEE ICIP 2, 145–149.

Gu, C., Lee, M.C., 1998a. Semiautomatic segmentation and tracking of semantic video objects. IEEE Trans. Circ. Syst. Video Technol. 8 (5), 572–584.

Gu, C., Lee, M.C., 1998b. Semantic video object tracking using region-based classification. Proc. IEEE ICIP 3, 643–647.

Guo, J., Kim, J.W., Kuo, C.-C.J., 1999. An interactive object segmentation system for MPEG video. Proc. IEEE ICIP 2, 140–144.

Kim, C., Hwang, J.N., 2002. Fast and automatic video object segmentation and tracking for content-based applications. IEEE Trans. Circ. Syst. Video Technol. 12 (2), 122–129.

Kim, M., Choi, J.G., Kim, D., Lee, H., Lee, M.H., Ahn, C., Ho, Y.S., 1999. VOP generation tool: automatic segmentation of moving objects in image sequences based on spatio-temporal information. IEEE Trans. Circ. Syst. Video Technol. 9 (8), 1216–1226.

Kim, M., Jeon, J.G., Kwak, J.S., Lee, M.H., Ahn, C., 2001. Moving object segmentation in video sequence by user interaction and automatic object tracking. Image Vis. Comput. 19 (5), 245–260.

Kim, Y.R., Kim, J.H., Kim, Y., Ko, S.J., 2003. Semiautomatic segmentation using spatio-temporal gradual region merging for MPEG-4. IEICE Trans. Fundamentals of Electronics, Communications and Computer Sciences E86-A (10), 2526–2534.

Lim, J., Cho, H.K., Ra, J.B., 2000. An improved video object tracking algorithm based on motion re-estimation. Proc. IEEE ICIP 1, 339–342.

Luo, H.T., Eleftheriadis, A., 2002. An interactive authoring system for video object segmentation and annotation. Signal Process.: Image Commun. 17 (7), 559–572.

Meier, T., Ngan, K.N., 1998. Automatic segmentation of moving objects for video object plane generation. IEEE Trans. Circ. Syst. Video Technol. 8 (5), 525–538.

Sun, S.J., Haynor, D.R., Kim, Y.M., 2003. Semiautomatic video object segmentation using Vsnakes. IEEE Trans. Circ. Syst. Video Technol. 13 (1), 75–82.

Tekalp, A.M., 1998. Digital Video Processing. Tsinghua University Press, Beijing.

Tsaig, Y., Averbuch, A., 2002. Automatic segmentation of moving objects in video sequences: a region labeling approach. IEEE Trans. Circ. Syst. Video Technol. 12 (7), 597–612.

Vincent, L., Soille, P., 1991. Watersheds in digital spaces: an efficient algorithm based on immersion simulations. IEEE Trans. Pattern Anal. Mach. Intell. 13 (6), 583–598.

Vincent, L., 1993. Morphological grayscale reconstruction in image analysis: applications and efficient algorithms. IEEE Trans. Image Process. 2 (2), 176–201.