Semi-automatic video object segmentation using seeded region merging and bidirectional projection




Pattern Recognition Letters 26 (2005) 653–662 www.elsevier.com/locate/patrec

Semi-automatic video object segmentation using seeded region merging and bidirectional projection

Zhi Liu *, Jie Yang, Ning Song Peng

Institute of Image Processing and Pattern Recognition, Shanghai Jiaotong University, Shanghai 200030, People's Republic of China

Received 9 March 2004; received in revised form 4 August 2004
doi:10.1016/j.patrec.2004.09.017

* Corresponding author. Tel.: +86 21 62934627; fax: +86 21 62932035. E-mail address: [email protected] (Z. Liu).

Abstract

In this paper, we propose a novel approach to semi-automatic video object segmentation. First, an interactive video object segmentation tool is presented for the user to easily define the desired video objects in the first frame; it is user-friendly, flexible and efficient due to the proposed fast seeded region merging approach and the combination of two different ways of user interaction, i.e., marker drawing and region selection. Then, a bidirectional projection approach is proposed to automatically track the video objects in the subsequent frames, which combines forward projection and backward projection to improve the segmentation efficiency, and incorporates pixel classification with region classification in backward projection to guarantee a more reliable tracking performance. Experimental results for various types of MPEG-4 test sequences demonstrate an efficient and faithful segmentation performance of the proposed approach.

© 2004 Elsevier B.V. All rights reserved.

Keywords: Video object segmentation; Video object tracking; Seeded region merging; Bidirectional projection

1. Introduction

As an important issue in the implementation of many content-based multimedia applications supported by MPEG-4, video object segmentation remains a challenging research topic. Although human beings can easily identify different video objects in a video sequence, it is hard for a computer to automatically segment the desired video objects in generic video sequences. At present, efficient algorithms for automatic video object segmentation apply only to moving objects or to certain kinds of objects for which prior knowledge is available (Fan and Elmagarmid, 2002; Fan et al., 2001; Kim and Hwang, 2002; Kim et al., 1999; Meier and Ngan, 1998; Tsaig and Averbuch, 2002). In the near future, it seems hardly possible to develop a generic automatic algorithm applicable to a variety of video sequences.




Therefore, a more practical solution, so-called semi-automatic video object segmentation (Cooray et al., 2001; Gatica-Perez et al., 1999; Gu and Lee, 1998a,b; Guo et al., 1999; Kim et al., 2001, 2003; Lim et al., 2000; Luo and Eleftheriadis, 2002; Sun et al., 2003), has drawn more and more attention in recent years. A typical paradigm of semi-automatic video object segmentation consists of two steps: segmenting the first frame with user interaction to define the video objects, and automatically tracking them in the subsequent frames.

The first step is extremely important in any semi-automatic video object segmentation algorithm, because the accuracy of the segmented video objects directly determines the success or failure of the subsequent tracking process. A user-friendly segmentation tool should be provided for the user to conveniently define the video objects, and user interaction should be minimized to improve the segmentation efficiency. However, in most existing approaches the flexibility and efficiency of user interaction are rarely considered as important as the algorithm itself. The most common way of user interaction is to delineate an approximate contour clinging to the video object (Guo et al., 1999; Kim et al., 2001, 2003). However, moving the mouse along the true object contour is a burdensome job, especially when the shape of the object is complex. For approaches based on the snake model, a considerable number of control points around the object contour need to be selected one by one (Luo and Eleftheriadis, 2002; Sun et al., 2003). Region selection is a more natural way to define a video object, but an excessive number of regions still need to be selected at different partition levels (Cooray et al., 2001). In this paper, we propose an interactive video object segmentation tool which is user-friendly, flexible and efficient due to the proposed fast seeded region merging approach and the combination of two different ways of user interaction.

The second step is a process of video object tracking. Many approaches adopt a two-step configuration to track the video objects (Gu and Lee, 1998a; Guo et al., 1999; Kim et al., 2001, 2003; Lim et al., 2000): first project the previous objects onto the current frame using some kind of parametric motion model, and then refine the object boundaries.

The underlying tracking mechanism is forward projection, which works well for rigid objects with translational motion. For non-rigid objects with multiple motions, irregular boundaries and spurious holes may appear on the video objects, and post-processing is inevitably needed for boundary refinement. In contrast with forward projection, backward projection (Gatica-Perez et al., 1999; Gu and Lee, 1998b) is suitable for non-rigid objects and needs no further refinement. Each segmented region in the current frame is projected onto the previous frame, and it is assigned to the current video object if the majority of the projected region overlaps the previous video object. In nature, this is a region classification approach rather than a tracking approach. However, it is not efficient to backward project all segmented regions for classification. Another problem arises when a segmented region overlaps both the video object and the background: a peninsula or a gap appears on the video object no matter which classification it is assigned. Having discussed the main features and limitations of forward projection and backward projection, we propose in this paper a bidirectional projection approach, mainly as an extension of backward projection (Gu and Lee, 1998b), which is more efficient due to the combination with forward projection, and which ensures the visual quality of the tracked video objects by incorporating pixel classification with region classification.

This paper is organized as follows. In Section 2, an interactive video object segmentation tool is presented. Section 3 proposes our bidirectional projection approach. Experimental results for different types of MPEG-4 test sequences are shown in Section 4. Conclusions are given in Section 5.

2. Interactive video object segmentation

In order to allow the user to easily extract the desired video object, we combine two ways of user interaction, i.e., marker drawing and region selection, in the flexible scheme shown in Fig. 1. The whole procedure of interactive video object segmentation consists of three steps: marker drawing, automatic video object extraction, and user correction.



Fig. 1. A flexible scheme of interactive video object segmentation.

Fig. 2. A screen shot of our GUI.

A screen shot of our graphical user interface (GUI) is shown in Fig. 2 and is used to illustrate each step in the following.

(1) Marker drawing: The user draws scribbles of different colors to roughly mark the video object and the background. As shown in Fig. 2, a red scribble marks the video object of interest, and a blue scribble marks the background in the left window of the GUI. Scribble drawing is convenient and flexible for the user. It usually takes a few seconds, which is faster than contour drawing or control-point selection.

(2) Automatic video object extraction: This step needs no user interference. The computer performs two tasks to automatically extract the video object, i.e., spatial segmentation and fast seeded region merging, which are described in the following two subsections. The outcome of each task is shown in the middle and the right window, respectively.

(3) User correction: This step is optional. If not satisfied with the automatically extracted video object, the user can make corrections. In our GUI, the user clicks with the left button on a region to add it to the video object, and with the right button to remove it. The number of mouse clicks depends on the image content and the marker drawing; it is usually fewer than two in our experiments. For example, the left image in Fig. 2 shows low contrast between the helmet of "foreman" and the background, and the region selected by the mouse in the middle image is merged into the background in the process of seeded region merging. The user just needs to click once with the left button to obtain the desired video object shown in the right image.

2.1. Spatial segmentation The watershed segmentation algorithm (Vincent and Soille, 1991) is exploited to partition the image into a set of regions because it preserves accurate boundaries between different objects.



However, its main drawback is over-segmentation due to noise in the gradient image. In order to obtain a moderate segmentation result, we propose a simplification step that removes insignificant local minima in the gradient image before applying the watershed algorithm. First, the gradient image g of the color image f in YUV space is estimated by the method proposed by Di Zenzo, which considers the relationship between different image components and is more reasonable than simple combinations such as the RMS, the sum, or the absolute maximum of the three component gradient images. The detailed calculation procedure can be found in (Di Zenzo, 1986). Then, g is dilated with a structuring element E, and the dilated image is elevated by a height h to get the marker image g_m:

$$g_m = (g \oplus E) + h \qquad (1)$$

Finally, the reconstruction of g from g_m by geodesic erosion (Vincent, 1993) is performed to obtain the simplified gradient image g_s:

$$g_s = \varphi^{(\mathrm{rec})}(g_m, g) \qquad (2)$$
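As an illustration, here is a minimal sketch of this simplification step, assuming SciPy and scikit-image, whose `reconstruction` function performs geodesic reconstruction (here by erosion); the function name `simplify_gradient` and its interface are ours, not from the paper.

```python
import numpy as np
from scipy import ndimage
from skimage.morphology import reconstruction

def simplify_gradient(g: np.ndarray, h: float = 2.0) -> np.ndarray:
    """Remove insignificant local minima from a gradient image (Eqs. (1)-(2))."""
    # E: 3 x 3 cross, the smallest symmetric 4-connectivity structuring element
    cross = ndimage.generate_binary_structure(2, 1)
    # Eq. (1): dilate g and elevate by h to obtain the marker image g_m
    g_m = ndimage.grey_dilation(g, footprint=cross) + h
    # Eq. (2): geodesic erosion of g_m constrained by g, which fills in
    # local minima shallower than h before the watershed is applied
    g_s = reconstruction(g_m, g, method='erosion')
    return g_s
```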

In our implementation, E is set to a 3 × 3 cross, the smallest symmetric 4-connectivity structuring element, so that only the most insignificant local minima are removed. The other parameter, h, is set to 2, the second smallest value, which removes those local minima whose depth is smaller than 2. Both parameters are set to rather small values that lead to a reasonably fine partition. We then apply the watershed segmentation algorithm to the image g_s to obtain the label image f_l, which gives a reasonable partition of the original image f (see the middle image of Fig. 2).

2.2. Fast seeded region merging

A seeded region growing (SRG) algorithm was proposed in (Adams and Bischof, 1994) for grayscale image segmentation from a set of seeded pixels. It is in nature a sequential labeling technique, in which each loop of the algorithm labels only the one pixel that neighbors the already labeled pixels with the lowest dissimilarity measure. Moreover, it is a very efficient algorithm due to the use of a sorted list structure.

Here, we extend the idea of SRG from the pixel level to the region level and propose a fast seeded region merging algorithm for color images, which is especially suitable for interactive video object segmentation. The proposed algorithm involves the following steps:

(1) For each region R_i of the label image f_l, we calculate the area A_i and the mean color MC_i, defined as

$$MC_i = \frac{\sum_{(x,y) \in R_i} f(x,y)}{A_i} \qquad (3)$$

Then we generate the weighted region adjacency graph (WRAG) of the label image f_l, in which each finite element WRAG(i, j) denotes the dissimilarity measure of the two adjacent regions R_i and R_j. Considering the difference in the mean color and the area of two adjacent regions, the following criterion is proposed to merge small regions in preference; it proved a robust dissimilarity measure in our experiments:

$$\mathrm{WRAG}(i,j) = \min(A_i, A_j) \cdot \left\| MC_i - MC_j \right\|^2 \qquad (4)$$

(2) We divide the regions of the label image f_l into three sets denoted vo, bg and un, which stand for video object, background and unknown regions, respectively. If a region R_i of f_l is traversed only by the red scribble or only by the blue scribble in the GUI, it is classified into vo or bg, respectively; otherwise it is classified into un. Now we have seeded regions in vo for the video object, seeded regions in bg for the background, and some unassigned regions in un. Then we initialize a priority queue PQ, implemented as a minimum heap, with those regions in un that are adjacent to at least one region in vo or bg. The element associated with region R_i in PQ consists of the following data:

• the dissimilarity measure with vo: dm_vo = min_{R_j ∈ vo} WRAG(i, j),
• the dissimilarity measure with bg: dm_bg = min_{R_j ∈ bg} WRAG(i, j),
• the smaller dissimilarity measure, dm = min(dm_vo, dm_bg), by which PQ is sorted.



(3) The first element of PQ is deleted, and the associated region R_i is removed from un. R_i is appended to vo if dm_vo < dm_bg; otherwise it is appended to bg. Then we check each region R_j that is adjacent to R_i and still in un. If R_j is already in PQ, its data is updated as follows:

$$dm\_vo = \mathrm{WRAG}(j,i) \quad \text{if } R_i \in vo \text{ and } \mathrm{WRAG}(j,i) < dm\_vo$$
$$dm\_bg = \mathrm{WRAG}(j,i) \quad \text{if } R_i \in bg \text{ and } \mathrm{WRAG}(j,i) < dm\_bg \qquad (5)$$

If R_j is not in PQ, it is inserted into PQ, and its data is initialized as follows:

$$dm\_vo = \mathrm{WRAG}(j,i), \quad dm\_bg = \infty, \quad \text{if } R_i \in vo$$
$$dm\_bg = \mathrm{WRAG}(j,i), \quad dm\_vo = \infty, \quad \text{if } R_i \in bg \qquad (6)$$

In both cases above, dm = min(dm_vo, dm_bg).

(4) Repeat step 3 until PQ is empty.

(5) Since no region is left in un, all the regions in vo are grouped into the video object, and the regions in bg form the background.
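To make the data flow concrete, here is a compact sketch of the merging loop above. It assumes the WRAG is given as a dictionary of adjacent-pair dissimilarities, and uses Python's `heapq` with lazy deletion in place of the minimum heap; all names are illustrative, not from the paper.

```python
import heapq
from collections import defaultdict

INF = float('inf')

def seeded_region_merging(wrag, vo, bg, un):
    """Assign every region in `un` to `vo` or `bg` (steps 2-5 above).

    wrag: dict mapping frozenset({i, j}) -> dissimilarity of adjacent regions
    vo, bg, un: sets of region labels; vo and bg hold the seeded regions
    """
    adj = defaultdict(set)
    for pair in wrag:
        i, j = tuple(pair)
        adj[i].add(j)
        adj[j].add(i)

    # dm[i] = [dm_vo, dm_bg]; PQ is keyed by dm = min(dm_vo, dm_bg)
    dm = {i: [INF, INF] for i in un}
    pq = []
    for i in un:
        for j in adj[i]:
            w = wrag[frozenset((i, j))]
            if j in vo:
                dm[i][0] = min(dm[i][0], w)
            elif j in bg:
                dm[i][1] = min(dm[i][1], w)
        if min(dm[i]) < INF:          # adjacent to at least one seeded region
            heapq.heappush(pq, (min(dm[i]), i))

    while pq:
        d, i = heapq.heappop(pq)
        if i not in un or d > min(dm[i]):
            continue                  # stale heap entry (lazy deletion)
        un.discard(i)
        target, slot = (vo, 0) if dm[i][0] < dm[i][1] else (bg, 1)
        target.add(i)
        for j in adj[i]:              # update or insert unassigned neighbors
            if j not in un:
                continue
            w = wrag[frozenset((i, j))]
            if w < dm[j][slot]:       # Eqs. (5)-(6)
                dm[j][slot] = w
                heapq.heappush(pq, (min(dm[j]), j))
    return vo, bg
```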

3. Automatic video object tracking

In this section, we propose a bidirectional projection approach to automatically track the extracted video objects through the subsequent frames of the video sequence. Our tracking task can be defined as obtaining the video object vo_n of the current frame from the motion information related to the previous video object vo_{n-1} and the spatial segmentation of the current frame. The flowchart of the proposed tracking approach is depicted in Fig. 3; it consists of three steps: forward projection, backward projection and post-processing.

3.1. Forward projection

The objective of forward projection is to locate the video object with rough boundary information derived from motion estimation.

Fig. 3. Flowchart of the proposed bidirectional projection approach.

For each contour pixel I_{n-1}(x, y) of the previous video object vo_{n-1} (see Fig. 4(a)), the motion vector (u(x, y), v(x, y)) is estimated using the three-step search (3SS) method (Tekalp, 1998) to minimize the following prediction error:

$$e(x,y) = \min_{u,v} \sum_{i=-N}^{N} \sum_{j=-N}^{N} \left\| I_{n-1}(x+i,\, y+j) - I_n(x+u(x,y)+i,\, y+v(x,y)+j) \right\| \qquad (7)$$

In order to fully exploit the efficiency of the 3SS method, the search range should be set to [−2^m + 1, 2^m − 1]. Although a large search range of [−31, 31] (the case m = 5) is usually adopted in video coding applications, we can limit the search range in our application, because the purpose is only to roughly locate the current video object. In our implementation, the search range for the motion vector (u(x, y), v(x, y)) is set to [−7, 7] (the case m = 3), which is enough to predict the apparent translational motion of the video object between consecutive frames. Since we use block matching to predict the translation of each contour pixel, a smaller matching block is more suitable than a bigger one; N is set to 2 for the (2N + 1) × (2N + 1) matching blocks. Forward projection is performed on all contour pixels of vo_{n-1}, denoted by the pixel set ct_{n-1}. The projection of ct_{n-1} onto the current frame I_n is another pixel set p_n (see the black pixels in Fig. 4(b)):

$$p_n = \{ (x + u(x,y),\, y + v(x,y)) \mid (x,y) \in ct_{n-1} \} \qquad (8)$$
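A sketch of the 3SS search for a single contour pixel, under the settings used above (m = 3, N = 2), might look as follows; the array layout and the sum-of-absolute-differences matching cost are our assumptions.

```python
import numpy as np

def three_step_search(prev, curr, x, y, m=3, N=2):
    """Estimate the motion vector (u, v) of contour pixel (x, y) by 3SS
    block matching, minimizing the prediction error of Eq. (7).
    Assumes (x, y) lies at least N pixels inside the frame borders."""
    def block_error(u, v):
        # SAD between (2N+1) x (2N+1) blocks; candidates whose block
        # falls outside the current frame get infinite cost
        if not (N <= x + u < curr.shape[0] - N and
                N <= y + v < curr.shape[1] - N):
            return np.inf
        ref = prev[x - N:x + N + 1, y - N:y + N + 1].astype(np.float64)
        cand = curr[x + u - N:x + u + N + 1,
                    y + v - N:y + v + N + 1].astype(np.float64)
        return np.abs(ref - cand).sum()

    u = v = 0
    step = 2 ** (m - 1)              # steps 4, 2, 1 for m = 3: range [-7, 7]
    while step >= 1:
        best = (block_error(u, v), u, v)
        for du in (-step, 0, step):  # probe the center and its 8 neighbors
            for dv in (-step, 0, step):
                cand = (block_error(u + du, v + dv), u + du, v + dv)
                if cand < best:
                    best = cand
        _, u, v = best
        step //= 2
    return u, v
```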



Fig. 4. A pictorial description of the proposed bidirectional projection approach.

These projected pixels in p_n may not fall exactly onto the true contour ct_n of the video object vo_n in the current frame, and they generally do not form a closed contour. All pixels in p_n are therefore dilated with a disk-shaped structuring element E_d to obtain a band area B_n (see Fig. 4(b)) that accommodates the rotation, scale change and deformation of the video object. The radius of E_d depends on the non-translational motion exhibited in the video sequence, and it should ensure that the true contour ct_n lies inside the band area B_n. We found in many experiments that a radius of E_d = 15 is enough for different types of video sequences. The approximate translation vector (T^u_{n-1}, T^v_{n-1}) of the video object is estimated as the average of the motion vectors of all pixels in ct_{n-1}:

$$T^u_{n-1} = \frac{\sum_{(x,y) \in ct_{n-1}} u(x,y)}{|ct_{n-1}|}, \qquad T^v_{n-1} = \frac{\sum_{(x,y) \in ct_{n-1}} v(x,y)}{|ct_{n-1}|} \qquad (9)$$

This vector reflects the global translational movement of the video object if an apparent translation exists, and it will be used in the backward projection described in the next subsection.
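The band-area construction and the translation vector of Eq. (9) could be computed along the following lines; the data structures (a list of contour pixels and a per-pixel motion map) are our assumptions.

```python
import numpy as np
from scipy import ndimage

def band_and_translation(contour_pixels, motion, frame_shape, radius=15):
    """contour_pixels: list of (x, y) in ct_{n-1}; motion: dict (x, y) -> (u, v)."""
    proj = np.zeros(frame_shape, dtype=bool)
    us, vs = [], []
    for (x, y) in contour_pixels:
        u, v = motion[(x, y)]
        us.append(u); vs.append(v)
        px, py = x + u, y + v               # Eq. (8): projected contour pixel
        if 0 <= px < frame_shape[0] and 0 <= py < frame_shape[1]:
            proj[px, py] = True
    # disk-shaped structuring element E_d of the given radius
    yy, xx = np.ogrid[-radius:radius + 1, -radius:radius + 1]
    disk = xx**2 + yy**2 <= radius**2
    band = ndimage.binary_dilation(proj, structure=disk)   # band area B_n
    # Eq. (9): average translation of the object contour
    return band, (np.mean(us), np.mean(vs))
```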

3.2. Backward projection

The objective of backward projection is to find the true contour ct_n of the current video object vo_n inside the band area B_n. In fact, only the band area B_n needs to be partitioned into regions, which are then backward projected to determine whether they belong to the current video object. The area R_in inside B_n definitely belongs to the current video object, while the area R_out outside B_n belongs to the background (see Fig. 4(b)). The spatial segmentation algorithm described in Section 2.1 is used to partition the current frame I_n (see Fig. 4(c)). Only the pixels in B_n require gradient calculation; the gradient of all other pixels is simply set to zero. This also saves computation time in the watershed segmentation, because the areas not covered by B_n can simply be flooded as the lowest flat catchment basins. The segmented regions, excluding R_in and R_out (see Fig. 4(d)), are backward projected to determine their classification. For each region R_i, the backward motion vector (u_i, v_i) is estimated to minimize the following prediction error:

$$e_i = \min_{u_i, v_i} \sum_{(x,y) \in R_i} \left\| I_n(x,y) - I_{n-1}(x+u_i,\, y+v_i) \right\| \qquad (10)$$

Compared with the search range [−7, 7] used in forward projection, a half search range of [−3, 3] with the offset (T^u_{n-1}, T^v_{n-1}) is used to predict (u_i, v_i). It is reasonable to reduce the search range by one level, from m = 3 to m = 2, because the vector (T^u_{n-1}, T^v_{n-1}) already reflects a possible apparent translation of the video object.


The backward projected region R'_i in the previous frame I_{n-1} is given by

$$R'_i = \bigcup_{(x,y) \in R_i} (x + u_i,\, y + v_i) \qquad (11)$$
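For a single segmented region, the offset, half-range search of Eq. (10) might look as follows; for clarity this sketch scans the [−3, 3] window exhaustively rather than running a reduced 3SS, and all names are illustrative.

```python
import numpy as np

def backward_region_motion(curr, prev, region_pixels, offset, half_range=3):
    """Estimate (u_i, v_i) minimizing Eq. (10) over an offset search window.

    region_pixels: list of (x, y) in R_i; offset: (T_u, T_v) from Eq. (9).
    """
    xs = np.array([p[0] for p in region_pixels])
    ys = np.array([p[1] for p in region_pixels])
    base = curr[xs, ys].astype(np.float64)
    t_u, t_v = int(round(offset[0])), int(round(offset[1]))
    best, best_uv = np.inf, (t_u, t_v)
    for du in range(-half_range, half_range + 1):
        for dv in range(-half_range, half_range + 1):
            u, v = t_u + du, t_v + dv
            px, py = xs + u, ys + v
            if (px.min() < 0 or py.min() < 0 or
                    px.max() >= prev.shape[0] or py.max() >= prev.shape[1]):
                continue            # candidate falls outside the previous frame
            err = np.abs(base - prev[px, py]).sum()
            if err < best:
                best, best_uv = err, (u, v)
    return best_uv
```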

The classification of R_i can be determined from the intersection of R'_i and vo_{n-1}. A natural method is to classify R_i into vo_n if the majority of R'_i intersects vo_{n-1}, and into the background otherwise (Gu and Lee, 1998b). However, this method is not robust enough to always guarantee the visual quality of the segmented video objects during tracking. Specifically, binary classification is not suitable for a segmented region that overlaps the video object and the background at the same time. If such a region (see the region R_u in Fig. 4(d)) is classified into the video object, a peninsula appears on the video object; otherwise a gap appears (see Fig. 4(e)). To deal with this problem, we propose a robust approach that improves the method in (Gu and Lee, 1998b). The ratio of the area of the intersection of R'_i and vo_{n-1} to the area of R'_i is defined as

$$h_i = \frac{A\left[ R'_i \cap vo_{n-1} \right]}{A\left[ R'_i \right]} \qquad (12)$$

where A[·] denotes the area operator. The value of h_i indicates three different types of region: a fairly high value shows that R_i belongs to the video object, a fairly low value shows that R_i is part of the background, and a moderate value shows that R_i may overlap the video object and the background at the same time. In the first and second cases, the whole region is assigned to the video object or the background according to the criterion

$$R_i \in vo_n \ \text{if } h_i > T_h; \qquad R_i \notin vo_n \ \text{if } h_i < T_l \qquad (13)$$

In the third case, T_l ≤ h_i ≤ T_h, pixel classification within the region R_i is performed using the criterion

$$(x,y) \in vo_n \ \text{if } (x+u_i,\, y+v_i) \in vo_{n-1}; \qquad (x,y) \notin vo_n \ \text{if } (x+u_i,\, y+v_i) \notin vo_{n-1} \qquad (14)$$


Since the value of h_i lies in the range [0, 1], T_h should be greater than 0.5, i.e., T_h = 0.5 + Δ (Δ > 0). The other parameter, T_l, is set to the margin value Δ. Therefore, pixel classification and region classification each cover half of the whole range. In our experiments, the margin Δ is set to 0.15 for all test sequences, and these two criteria lead to a reliable tracking performance (see Fig. 4(f)).

3.3. Post-processing

In the previous subsection, only the regions are considered in the process of backward projection, while the boundaries (watershed lines) between different regions are left unclassified. Therefore, a morphological closing operation is first performed to fill the watershed lines inside the video object; it is this closed video object that is propagated in the tracking process. Since a closing operation is applied, an opening operation is also performed subsequently. The cascade of closing and opening also smooths the boundary of the video object, which often enhances the visual quality of the segmented video object. The structuring element for both morphological operations is a 5 × 5 square, which achieves a good trade-off between the accuracy and the smoothness of the video object boundaries.
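The following sketch ties together the classification criterion of Eqs. (12)–(14) and the morphological post-processing, with boolean masks standing in for regions and video objects; the helper names and the mask-based interface are our assumptions.

```python
import numpy as np
from scipy import ndimage

def classify_and_clean(regions, motions, prev_vo, frame_shape, delta=0.15):
    """Build vo_n from backward-projected regions, then close/open it.

    regions: list of boolean masks R_i over the current frame
    motions: list of (u_i, v_i) from Eq. (10); prev_vo: boolean mask of vo_{n-1}
    """
    T_h, T_l = 0.5 + delta, delta
    vo = np.zeros(frame_shape, dtype=bool)
    for mask, (u, v) in zip(regions, motions):
        xs, ys = np.nonzero(mask)
        # backward project the region (Eq. (11)); clip keeps indices in frame
        px = np.clip(xs + u, 0, frame_shape[0] - 1)
        py = np.clip(ys + v, 0, frame_shape[1] - 1)
        inside = prev_vo[px, py]          # projected pixels landing on vo_{n-1}
        h = inside.mean()                 # Eq. (12): overlap ratio h_i
        if h > T_h:                       # Eq. (13): whole region is object
            vo[xs, ys] = True
        elif h >= T_l:                    # Eq. (14): per-pixel classification
            vo[xs[inside], ys[inside]] = True
        # h < T_l: whole region is background, nothing to add
    square = np.ones((5, 5), dtype=bool)  # 5 x 5 structuring element
    vo = ndimage.binary_closing(vo, structure=square)  # fill watershed lines
    vo = ndimage.binary_opening(vo, structure=square)  # smooth the boundary
    return vo
```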

4. Experimental results

We use several MPEG-4 test sequences to evaluate the proposed approach to semi-automatic video object segmentation. The experimental results for three test sequences are shown in Figs. 5–7. These sequences represent different levels of spatial detail and movement in real situations. The first sequence, Mother and Daughter, is an MPEG-4 class A sequence with low spatial detail and a low amount of movement: the background is uniform and static, and the motion of the human bodies is relatively small. The second sequence, Foreman, is an MPEG-4 class B sequence with medium spatial detail and a low amount of movement: the background is complex and shows low contrast with the talking person, and the camera motion is apparent in addition to the non-rigid motion of the person.



Fig. 5. Experimental results for the sequence Mother and Daughter (Frame: 1, 20, 40, 60, 80, 100).

Fig. 6. Experimental results for the sequence Foreman (Frame: 1, 20, 40, 60, 80, 100).

Fig. 7. Experimental results for the sequence Table Tennis (Frame: 1, 5, 15, 25, 35, 40).



Table 1
Average processing time per frame for the three sequences (ms). Numbers in brackets are obtained when forward projection is skipped (see text).

                                  | Proposed bidirectional projection approach          | Backward projection approach in (Gu and Lee, 1998b)
Test sequence                     | Forward proj.  Spatial seg.  Backward proj.  Total  | Spatial seg.  Backward proj.  Total
Mother and Daughter (176 × 144)   | 47             37            81          165 (118)  | 43            281             324
Foreman (176 × 144)               | 43             42            68          153 (110)  | 58            275             333
Table Tennis (352 × 240)          | 95             131           181         407        | 154           981             1135

The third sequence, Table Tennis, is an MPEG-4 class C sequence with high spatial detail and a medium amount of movement: several moving objects appear against a cluttered background. The video object of interest is the arm holding the racket, which mixes the different rigid motions of the arm, the hand and the racket.

For all sequences, the initial video objects are easily obtained using our interactive video object segmentation tool; an example for the sequence Foreman is described in Section 2. In our experiments, it takes about 5 s to obtain the desired video object, with no perceptible delay during the run of seeded region merging. The amount of user correction is minimal, because the proposed seeded region merging algorithm already provides a fairly good semantic video object. The video objects extracted from the first frame of the three sequences are shown in the first image of Figs. 5–7, respectively. The proposed bidirectional projection approach is then used to automatically track the video object in the subsequent frames. The desired video objects are obtained with good visual quality during the tracking process (see the latter five images in Figs. 5–7).

These experiments were performed on a low-end AMD Athlon XP 1800+ (1.53 GHz) PC. The average processing time per frame using our bidirectional projection approach and the backward projection approach in (Gu and Lee, 1998b) is shown in Table 1; the same values are set for the related parameters in both approaches. Compared with Gu and Lee's approach, our approach spends some time on forward projection, but sharply reduces the time spent on backward projection (including post-processing), and on spatial segmentation to some extent.

For the three sequences, the total processing time of our approach is 51%, 46%, and 36% of that of Gu and Lee's approach, which demonstrates the improved segmentation efficiency of our approach. For head-and-shoulder sequences with relatively small motion, like Mother and Daughter and Foreman, nearly the same experimental results with good visual quality are obtained even if the forward projection is skipped. In this case, the band area B_n is dilated directly from the previous object contour ct_{n-1}, and the approximate translation vector (T^u_{n-1}, T^v_{n-1}) is the zero vector. The processing time can then be further reduced (see the numbers in brackets in Table 1), corresponding to about 9 frames per second. It is promising that more efficiency can be gained after code optimization or on a faster processor.

5. Conclusions

Video object segmentation is a necessity for MPEG-4 related multimedia applications. A novel approach to semi-automatic video object segmentation is proposed in this paper, which combines interactive segmentation and automatic tracking. An interactive video object segmentation tool is presented that allows the user to easily define the video objects. The user interaction is convenient due to the flexible combination of marker drawing and region selection, and is also minimized because the proposed fast seeded region merging approach can extract a fairly good video object on its own. A bidirectional projection approach is proposed for automatic video object tracking, which extends backward projection with the combination of forward projection.



The proposed tracking approach produces more reliable video objects for different types of video sequences, and improves the segmentation efficiency by a factor of two. The current system for semi-automatic video segmentation is implemented as independent modules, and we will consider some high-level features of the video object in the tracking module to further improve the tracking reliability for a wider range of video objects.

In conclusion, we believe that with further improvement, the proposed semi-automatic video object segmentation approach could be useful in many applications. Typical applications are video-telephony and videoconferencing, where the user may interact with the segmentation process, for example to achieve better coding quality for the most relevant objects such as human beings. Video production is another potential application, where different types of objects are segmented for database storage and reused in other contexts. It is also possible to associate appropriate metadata with these video objects, so that they can be used in interactive broadcasting applications that allow the user to request the additional information available about each object in the scene.

References

Adams, R., Bischof, L., 1994. Seeded region growing. IEEE Trans. Pattern Anal. Machine Intell. 16 (6), 641–647.
Cooray, S., O'Connor, N., Marlow, S., Murphy, N., Curran, T., 2001. Hierarchical semi-automatic video object segmentation for multimedia applications. Proc. SPIE Internet Multimedia Manage. Syst. II 4519, 10–19.
Di Zenzo, S., 1986. A note on the gradient of a multi-image. Comput. Vis. Graphics Image Process. 33 (1), 116–125.
Fan, J., Elmagarmid, A.K., 2002. An automatic algorithm for semantic object generation and temporal tracking. Signal Process. Image Commun. 17 (2), 145–164.
Fan, J., Zhu, X., Wu, L., 2001. Automatic model-based semantic object extraction algorithm. IEEE Trans. Circ. Syst. Video Technol. 11 (10), 1073–1084.
Gatica-Perez, D., Sun, M.T., Gu, C., 1999. Semantic video object extraction based on backward tracking of multivalued watershed. Proc. IEEE ICIP 2, 145–149.
Gu, C., Lee, M.C., 1998a. Semiautomatic segmentation and tracking of semantic video objects. IEEE Trans. Circ. Syst. Video Technol. 8 (5), 572–584.
Gu, C., Lee, M.C., 1998b. Semantic video object tracking using region-based classification. Proc. IEEE ICIP 3, 643–647.
Guo, J., Kim, J.W., Kuo, C.-C.J., 1999. An interactive object segmentation system for MPEG video. Proc. IEEE ICIP 2, 140–144.
Kim, C., Hwang, J.N., 2002. Fast and automatic video object segmentation and tracking for content-based applications. IEEE Trans. Circ. Syst. Video Technol. 12 (2), 122–129.
Kim, M., Choi, J.G., Kim, D., Lee, H., Lee, M.H., Ahn, C., Ho, Y.S., 1999. VOP generation tool: automatic segmentation of moving objects in image sequences based on spatio-temporal information. IEEE Trans. Circ. Syst. Video Technol. 9 (8), 1216–1226.
Kim, M., Jeon, J.G., Kwak, J.S., Lee, M.H., Ahn, C., 2001. Moving object segmentation in video sequence by user interaction and automatic object tracking. Image Vis. Comput. 19 (5), 245–260.
Kim, Y.R., Kim, J.H., Kim, Y., Ko, S.J., 2003. Semiautomatic segmentation using spatio-temporal gradual region merging for MPEG-4. IEICE Trans. Fundamentals E86-A (10), 2526–2534.
Lim, J., Cho, H.K., Beom Ra, J., 2000. An improved video object tracking algorithm based on motion re-estimation. Proc. IEEE ICIP 1, 339–342.
Luo, H.T., Eleftheriadis, A., 2002. An interactive authoring system for video object segmentation and annotation. Signal Process. Image Commun. 17 (7), 559–572.
Meier, T., Ngan, K.N., 1998. Automatic segmentation of moving objects for video object plane generation. IEEE Trans. Circ. Syst. Video Technol. 8 (5), 525–538.
Sun, S.J., Haynor, D.R., Kim, Y.M., 2003. Semiautomatic video object segmentation using Vsnakes. IEEE Trans. Circ. Syst. Video Technol. 13 (1), 75–82.
Tekalp, A.M., 1998. Digital Video Processing. Tsinghua University Press, Beijing.
Tsaig, Y., Averbuch, A., 2002. Automatic segmentation of moving objects in video sequences: a region labeling approach. IEEE Trans. Circ. Syst. Video Technol. 12 (7), 597–612.
Vincent, L., Soille, P., 1991. Watersheds in digital spaces: an efficient algorithm based on immersion simulations. IEEE Trans. Pattern Anal. Mach. Intell. 13 (6), 583–598.
Vincent, L., 1993. Morphological grayscale reconstruction in image analysis: applications and efficient algorithms. IEEE Trans. Image Process. 2 (2), 176–201.