Expert Systems with Applications 36 (2009) 5976–5986
Automatic scene detection for advanced story retrieval
Songhao Zhu *, Yuncai Liu
Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai 200240, China
Keywords: Content-based retrieval; Video scene detection; Feature selection; Time constraint
Abstract: Browsing video scenes is the process of unfolding the story scenario of a long video archive, and it helps users locate their desired video segments quickly and efficiently. Automatic scene detection in a long video stream is therefore the first and crucial step toward a concise and comprehensive content-based representation for indexing, browsing and retrieval. In this paper, we present a novel scene detection scheme for various video types. We first detect video shots using a coarse-to-fine algorithm. Key frames without useful information are then detected and removed by template matching. Finally, spatio-temporally coherent shots are grouped into the same scene based on a temporal constraint on video content and the visual similarity of shot activity. The proposed algorithm has been evaluated on various types of video, including movies and TV programs. Promising experimental results show that the proposed method supports efficient retrieval of video contents of interest. © 2008 Elsevier Ltd. All rights reserved.
1. Introduction

In recent years, rapid advances in multimedia information processing and the World Wide Web have made vast amounts of digital information available at a reasonable price. Meanwhile, the prices of digital devices and storage media, such as digital video cameras, hard disks and memory cards, have dropped at a surprising speed. Combined with people's natural wish to keep souvenirs and the fact that almost no special skill is needed to capture video, this has produced an ever growing amount of digital data in both professional and amateur environments, with applications such as visual commerce, distance learning over the Internet, professionally produced movies and home-made video. The exponential growth of visual data in digital format has created a corresponding search problem. For example, if a user wants to locate one segment of interest in a large collection of video archives, he has to watch each video from the beginning using fast-forward and fast-backward operations. Given the huge number of video archives, such a manual search is inherently inefficient and time-consuming. Accordingly, how to manage large amounts of video archives efficiently, and what kind of index to provide so that users can browse and retrieve quickly, remain open problems. To date, the most common approach for browsing and retrieving contents of interest is still keyword search, which
* Corresponding author. E-mail addresses: [email protected] (S. Zhu), [email protected] (Y. Liu).
doi:10.1016/j.eswa.2008.07.009
fits text-driven databases well. For visual information with rich inherent semantics, however, text-based browsing and retrieval is clearly not an effective solution. It is therefore necessary to organize, index and retrieve video archives at the semantic level. In the generally accepted view, a video file has a hierarchical structure; from bottom to top, the levels are frame, shot, scene and video. The frame is the lowest level in the hierarchy. A shot is the basic unit of a video and consists of a series of adjacent frames sharing an unchanged background. Compared with browsing and retrieving over the complete visual content of a video track, browsing and retrieving based on shots operates at a higher and more meaningful level. In some cases, for instance when there is little camera motion, shot-based methods are sufficient for browsing and retrieval. For a typically produced video, however, frequent camera motion makes the number of shots large, so a more concise and compact semantic segmentation is needed to improve browsing and retrieval performance. A scene groups semantically related shots and reflects a certain topic or theme; accordingly, the efficiency of browsing and retrieving based on scenes is naturally higher than that based on shots. At the top level of the hierarchy is the video stream, which is composed of all the scenes. From our observation and intuition, such a four-level hierarchical structure helps users locate their desired segments effectively and quickly from a semantic perspective. Because manually segmenting a long video stream is time-consuming and inefficient, automatic segmentation has become an important problem to be addressed.
1.1. Related work

The problem of segmenting a video stream into more meaningful scenes for browsing and retrieval has become a research topic of increasing importance, and recent years have seen an explosive increase of research on automatic video segmentation techniques. In line with the discussion above, automatic segmentation of a video stream involves three important steps, described below. The first step is shot boundary detection. Depending on whether the transition is abrupt or gradual, shot boundaries fall into two types: cuts and gradual transitions. Based on the editing effects used (Kobla, DeMenthon, & Doermann, 1999), gradual transitions can be further categorized into more detailed types such as dissolve, wipe and fade-out/fade-in. Shot boundary detection, sometimes also called temporal video segmentation, identifies the transitions between contiguous shots and thereby separates a video stream into different shots. Much work has been reported in this area and highly accurate results have been obtained, for example in Albanese and Chianese (2004), Boccignone and Chianese (2005), Cernekova, Kotropoulos, and Pitas (2003), Dimitrova and Zhang (2002), Hanjalic (2002), and Yuan, Li, Lin, and Zhang (2005). The second step is key frame selection. This step extracts a characteristic set of either individual frames or short sequences from each detected shot, according to both the complexity of its activity content and its duration. However many frames are finally selected from each shot, the extracted key frames should best depict the content of that shot. In some cases, the key frames selected from each detected shot can be used directly as a condensed presentation of the entire video stream and hence for indexing, comparison and categorization. Contributions in recent years include Girgensohn and Boreczky (2000), Hasebe, Satoshi, Nagumo, and Makoto (2004), Ren and Singh (2003), Thormaehlen, Thorsten, Broszio, Hellward, and Weissenfeld (2004), Togawa, Hideyuki, and Okuda (2005), and Xiong and Zhou (2006). The last step is scene segmentation, also called video-content organization at the semantic level: related shots are clustered into meaningful segments, and the resulting scenes provide a concise and compact index for browsing and retrieval. How to define a scene in terms of human understanding is still an open question, and scene segmentation is accordingly the crucial step in the whole video segmentation process. Recently, a large number of techniques have been reported in this area, including Boreczky et al. (1998), Rasheed and Shah (2003), Rui, Huang, and Mehrotra (1999), Sundaram, Hari, and Chang (2000), Tavanapong and Zhou (2004), Yeung, Minerva, Yeo, Boon Lock, and Liu (1998), and Zhai and Shah (2005). Generally speaking, video streams can be classified into news video, sports video, feature films, television video, home video, and so on, and much work has addressed these different types. Jiang, Hao, Lin, Tong, and Zhang (2000) used audio and visual features to perform scene segmentation of news and sports video: audio scene boundaries and visual scene boundaries are first detected separately using the respective features.
Then, the final scene boundaries are taken where the two types of boundary coincide. Sundaram et al. (2000) described a strategy for segmenting a film. Similar to Jiang et al. (2000), the audio and video data are first segmented into scenes separately; unlike the integration strategy of Jiang et al. (2000), however, the final scene boundaries are determined using a nearest-neighbor algorithm and a time-constrained refinement analysis. Boreczky et al. (1998) used hidden Markov models to find boundaries in
different types of video, including television shows, news broadcasts, movies, commercials, and cartoons. The features used for segmentation are audio, visual and motion features; they are first computed and then combined into a vector for training the hidden Markov model. The hidden Markov models contain seven states, and the trained parameters include seven transition probabilities as well as the means and variances of various Gaussian distributions. Yeung et al. (1998) presented an approach for segmenting video into story units by applying time-constrained clustering and a scene transition graph (STG). Each node of the STG is a story unit, and each edge represents the transition from one story unit to the next. Based on the complete-link technique for hierarchical clustering, the STG is then split into several subgraphs, each of which represents a scene. Rasheed and Shah (2003) proposed a graphical representation of feature films by constructing a shot similarity graph (SSG); the meaning of nodes and edges is the same as in Yeung et al. (1998). Shot similarities are first computed using the audio and motion information of the video, and the normalized-cut method is then used to split the SSG into smaller graphs representing story units. Tavanapong and Zhou (2004) exploited film-making techniques and observed that certain regions of video frames capture the essential information needed to cluster visually similar shots into the same scene. Each shot is represented by its first and last frames. Based on the average of all DC coefficients of the Y color component in certain regions of the key frames within each shot, shots are clustered using forward comparison, backward comparison and temporal limitations. However, several of the above methods for scene segmentation of news video rely on a distinctive characteristic: the location of the anchor persons and the corresponding background are unchanged. Furthermore, news video has a comparatively fixed structure, in which anchor-person shots appear at certain intervals and help to delimit different news scenes. Such a characteristic structure does not apply to feature films. On the other hand, the other methods discussed above do not fully take into account the characteristics of film editing, such as which factors determine how shots and scenes are linked. Zhai and Shah (2005) used the Markov chain Monte Carlo technique to determine the boundaries between video scenes. The Markov chain has two types of updates, diffusion and jumps, and the model parameters are the number of scenes and their corresponding boundary locations. The temporal scene boundaries are detected by maximizing the posterior probability of the model parameters given the model priors and the data likelihood. However, the Markov chain Monte Carlo technique is very time-consuming.

1.2. Proposed approach

In this paper, a novel scene segmentation and semantic representation scheme is proposed for the efficient retrieval of video data using a large number of low-level and high-level features. To describe the properties of different shots, the gray-level variance histogram and the wavelet texture variance histogram are chosen as features for shot detection and key frame selection. To accurately segment video data into scenes and represent scene contents semantically, redundant frames in the set of key frames are recognized and removed by template matching.
Shots with visually similar activity that fall within a temporal constraint are then grouped into the same scene. To retrieve video contents of interest in a way users are accustomed to, it is also necessary to represent scene content at the semantic level, e.g. as a conversation scene, a suspense scene or an action scene. The remainder of the paper is organized as follows. Section 2 presents the segmentation of scene boundaries. Section 3 deals with
the application of the general framework to the segmentation of feature films. Finally, Section 4 provides conclusions and a discussion of the proposed framework.

2. Procedure of segmentation

According to the Webster dictionary, a shot consists of a continuous sequence of video frames recorded by an uninterrupted camera operation with consistent background settings, and a scene can be considered as a group of semantically related shots that represents different views of the same theme and contains the same objects of interest. These definitions of shot and scene are the basis on which editors assemble video data, and they are also the basis of our proposed algorithm. Fig. 1 shows the flowchart of the complete framework. For a raw video, the system first detects shot transition points and extracts key frames from each shot using color and texture features. Spatio-temporally associated shots are then grouped into the same scene by time-locality and correlation clustering. Finally, scene contents are represented from the perspective of human understanding.
Fig. 1. A brief diagram of the proposed scheme: the input video is segmented into shots (Shot 1, Shot 2, ..., Shot N), key frames are extracted from each shot, time-locality clustering groups the shots, and the resulting scenes (Scene 1, Scene 2, ..., Scene L) form the final scene index.

2.1. Detecting shot boundaries

Selecting the locations of shot breaks is the first step in the temporal partitioning of a video into shots. The shot boundary detection scheme is shown in the diagram of Fig. 2. The reason for using two steps to detect shot boundary points is to reduce the effect of noise and motion.

Fig. 2. Diagram of shot boundary detection: the coarse transition point is first found from the difference of inter-frame dissimilarity in the sampling space, and the precise transition point is then located as a local minimum in the original sequence space.
2.1.1. Detecting rough boundary points

Let I = (I_1, I_2, ..., I_N) denote the images of the video clip V. The rough positions of all possible shot boundaries are obtained by the following steps.

Step 1: The original clip is first sampled every M frames (M = 5 in the experiments), and each image is divided into 8 × 8 sub-images as shown in Fig. 3.

Step 2: For each sub-image i, the variance of the intensity VI_n^i and the variance of the Haar wavelet texture VW_n^i are computed and used as histogram bin values for that image. Let VI_n^a and VW_n^a be the means of the intensity variances and of the Haar wavelet texture variances over the whole n-th image, respectively. The inter-image correlation between two consecutive sampled images n and (n + 1) is then defined as the histogram intersection

FC(n, n + 1) = Σ_{i ∈ all bins} min(IC_n^i, IC_{n+1}^i),   (1)

where IC_n^i is the integrated coefficient of the i-th sub-image of the n-th image:

IC_n^i = (|VI_n^i − VI_n^a|, |VW_n^i − VW_n^a|).   (2)

Fig. 3. Definition of the sub-image.

Step 3: A threshold T_shot is set on the inter-image correlation. Writing FC(n) for FC(n, n + 1), image n is considered a coarse shot transition point if the following inequalities are simultaneously satisfied:

FC(n) − FC(n − 1) < −T_shot  and  FC(n + 1) − FC(n) > T_shot,   (3)

where T_shot captures a significant difference between consecutive images in the sampling space. Fig. 4 shows an example of the curve of the difference of inter-image variance correlation for an interview scene in the sampling space, where the circles mark the appearance of rough boundary points.
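To make the rough-boundary test concrete, the following minimal sketch (in Python with NumPy) computes the per-sub-image integrated coefficients of Eq. (2), the histogram-intersection correlation of Eq. (1) and the two-sided test of Eq. (3). Frames are assumed to be grayscale arrays; the 8 × 8 grid follows Fig. 3, while the component-wise summation over the two parts of the integrated coefficient, the simple one-level Haar detail variance used as the texture measure, and the value of T_shot are illustrative assumptions rather than the paper's exact implementation.

import numpy as np

GRID = 8  # 8 x 8 sub-images, as in Fig. 3

def haar_texture_variance(block):
    # Variance of one-level Haar detail coefficients, used as the texture measure VW.
    b = block[:block.shape[0] // 2 * 2, :block.shape[1] // 2 * 2].astype(np.float64)
    a, c = b[0::2, 0::2], b[0::2, 1::2]
    d, e = b[1::2, 0::2], b[1::2, 1::2]
    details = np.concatenate([((a + c - d - e) / 2).ravel(),
                              ((a - c + d - e) / 2).ravel(),
                              ((a - c - d + e) / 2).ravel()])
    return details.var()

def integrated_coefficients(frame):
    # IC_n^i = (|VI_n^i - VI_n^a|, |VW_n^i - VW_n^a|) for every sub-image i, Eq. (2).
    h, w = frame.shape
    vi, vw = [], []
    for r in range(GRID):
        for c in range(GRID):
            block = frame[r * h // GRID:(r + 1) * h // GRID,
                          c * w // GRID:(c + 1) * w // GRID].astype(np.float64)
            vi.append(block.var())
            vw.append(haar_texture_variance(block))
    vi, vw = np.array(vi), np.array(vw)
    return np.abs(vi - vi.mean()), np.abs(vw - vw.mean())

def frame_correlation(f1, f2):
    # FC(n, n + 1): histogram intersection over all bins, Eq. (1), summed over
    # the intensity and texture components of the integrated coefficient.
    vi1, vw1 = integrated_coefficients(f1)
    vi2, vw2 = integrated_coefficients(f2)
    return np.minimum(vi1, vi2).sum() + np.minimum(vw1, vw2).sum()

def rough_boundaries(sampled_frames, t_shot):
    # Indices n with FC(n) - FC(n-1) < -T_shot and FC(n+1) - FC(n) > T_shot, Eq. (3).
    fc = [frame_correlation(a, b) for a, b in zip(sampled_frames, sampled_frames[1:])]
    return [n for n in range(1, len(fc) - 1)
            if fc[n] - fc[n - 1] < -t_shot and fc[n + 1] - fc[n] > t_shot]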
2.1.2. Achieving precise transition boundaries

The precise location of each shot transition is obtained in the fine searching phase by the following steps.

Step 1: A sub-sequence of length 2M, centered at each coarsely detected transition position n, is extracted from the original clip V.

Step 2: The frame F in the selected sub-sequence that minimizes the inter-frame correlation PC is taken as the candidate exact position of the shot boundary:

F = arg min {PC(nM − M), ..., PC(nM), ..., PC(nM + M)}.   (4)

In contrast with Fig. 4, Fig. 5 shows the curve of the inter-image variance correlation for the same interview scene in the original clip space, where the red circles mark the appearance of precise boundary points.
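The refinement of Eq. (4) can be sketched as follows: around each coarse hit found on the sequence sampled every M frames, the original frames are re-examined and the local minimum of a frame-to-frame correlation PC is retained. The pixel_correlation helper is an illustrative stand-in for PC, rough_boundaries refers to the sketch above, and the handling of clip borders and the default thresholds are assumptions of this sketch.

import numpy as np

def pixel_correlation(f1, f2):
    # A simple inter-frame correlation in the original (un-sampled) space:
    # higher means more similar, so a boundary shows up as a local minimum.
    return -np.abs(f1.astype(np.float64) - f2.astype(np.float64)).mean()

def refine_boundary(frames, coarse_n, m):
    # F = argmin{PC(nM - M), ..., PC(nM), ..., PC(nM + M)}, Eq. (4).
    center = coarse_n * m
    lo, hi = max(1, center - m), min(len(frames) - 1, center + m)
    return min(range(lo, hi + 1),
               key=lambda k: pixel_correlation(frames[k - 1], frames[k]))

def detect_shot_boundaries(frames, m=5, t_shot=0.5):
    # Coarse-to-fine detection combining Sections 2.1.1 and 2.1.2.
    sampled = frames[::m]
    coarse = rough_boundaries(sampled, t_shot)
    return sorted(refine_boundary(frames, n, m) for n in coarse)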
Fig. 4. The curve of the difference of inter-image correlation in the sampling space, where the blue and green circles mark the appearance of rough boundary points. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 5. The curve of the inter-image variance correlation for an interview scene in the original clip space. The red circle indicates the candidate precise location of a shot transition. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
Fig. 6 gives an overall view of the process of locating the first precise shot transition position shown in Fig. 5, where the left panel determines the precise position of the shot transition and the right panel finds the coarse transition point.

2.1.3. Advantage of the coarse-to-fine approach

It is worth explaining why we use a coarse-to-fine approach rather than a single direct pass to detect candidate shot transition points. On the one hand, compared with the inter-frame dissimilarity of a cut transition in the original space, shown in Fig. 7a, the direct approach of finding local minima to determine shot transition points can produce many false positives for gradual transitions, as shown in Fig. 7b. On the other hand, the comparison between Fig. 8b and c clearly shows the advantage of using the difference of inter-frame dissimilarity in the sampling space rather than in the original video sequence to identify boundaries of the gradual-transition type. That is, when the difference of inter-frame dissimilarity is computed in the original sequence space, a delicate shot transition, such as the fade-out/
fade-in transition of the movie 'Ice Age II' shown in Fig. 7b, may be missed. Furthermore, this treatment reduces the number of false alarms and largely saves total processing time, which is confirmed by the later detection results. Fig. 9 shows some consecutive shots detected by the algorithm for the Hollywood cartoon movie 'Ice Age II'; each shot is depicted by one representative frame selected according to the dynamic content of that shot.

2.2. Selection of key frames

After shot boundaries have been detected, each shot can be represented by a characteristic set of either individual frames or short sequences instead of all the frames within the shot. In general, a shot can be depicted by a single key frame, namely its first, middle or last frame. For some shots in feature films, however, the duration may be longer and the camera motion faster, so multiple key frames are needed to represent the shot content sufficiently. In other words, the number of key frames for each shot should match the activity content of that shot.
Fig. 6. Overall view of the process of locating the first precise shot transition position shown in Fig. 5: (a) determining the precise position of the shot transition and (b) finding the coarse transition point. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
Fig. 7. Examples of precise transition locations, indicated by red circles, for: (a) a cut transition and (b) a fade-out/fade-in transition of the movie 'Ice Age II'. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
Fig. 8. (a) The pixel dissimilarity in the original video space, (b) the first difference of the pixel dissimilarity in the original video sequence, and (c) the first difference of the pixel dissimilarity in the video sampling space.
Fig. 9. Some consecutive example shots detected using our algorithm for the testing Hollywood cartoon movie 'Ice Age II'. Each shot is represented by one of its key frames.
For example, for a talk-show shot, a single key frame can convey the typical meaning, whereas a shot with high activity content needs a set of key frames to express the visual content sufficiently. A large number of key-frame selection techniques have been reported in Boreczky et al. (1998), Girgensohn and Boreczky (2000), Hasebe et al. (2004), Jiang et al. (2000), Rasheed and Shah (2003), Ren and Singh (2003), Sundaram et al. (2000), Tavanapong and Zhou (2004), Thormaehlen et al. (2004), Togawa et al. (2005), Xiong and Zhou (2006) and Yeung et al. (1998).
In this paper, we propose a key frame selection approach based on the dynamics of the content of a given shot. Specifically, the current frame I_j is added as another key frame only if the inter-frame correlation between I_j and the existing set of key frames is sufficiently weak. The key frames are obtained by the following steps.

Step 1: The inter-frame difference within the same shot is formulated as

IFC(I_j, I_l) = ( Σ_{i ∈ all bins} |IC_j^i − IC_l^i| ) / ( Σ_{i ∈ all bins} min(|IC_j^i|, |IC_l^i|) ),   (5)

where IC_l^i is the integrated coefficient of the i-th sub-image of frame I_l, computed as in Eq. (2).

Step 2: Let S = (f_1, f_2, ..., f_L) denote the frames of video shot S. Frame f_1 is always chosen as the first key frame:

K^s = {K^s_1} = {f_1}.   (6)

Step 3: For the current frame I_j and the set of key frames K^s = {K^s_1, K^s_2, ..., K^s_k}, if min_k IFC(I_j, K^s_k) > T_keyframe, then I_j is absorbed into the key frame set, i.e. K^s is updated as K^s = K^s ∪ {I_j}. The threshold T_keyframe is a predefined value describing a considerable variation between frames.
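The selection loop of Eqs. (5) and (6) can be sketched as below, reusing the integrated_coefficients helper from the shot-detection sketch in Section 2.1.1. The concatenation of the intensity and texture components into one bin vector and the default value of T_keyframe are illustrative assumptions.

import numpy as np

def inter_frame_difference(f1, f2):
    # IFC(I_j, I_l), Eq. (5): accumulated bin differences normalized by the
    # overlapping bin mass; large values indicate weak correlation.
    vi1, vw1 = integrated_coefficients(f1)
    vi2, vw2 = integrated_coefficients(f2)
    c1, c2 = np.concatenate([vi1, vw1]), np.concatenate([vi2, vw2])
    return np.abs(c1 - c2).sum() / (np.minimum(np.abs(c1), np.abs(c2)).sum() + 1e-9)

def select_key_frames(shot_frames, t_keyframe=0.5):
    # Greedy selection: f_1 is always kept, Eq. (6); a frame joins the key frame
    # set when it differs sufficiently from every key frame chosen so far (Step 3).
    keys = [shot_frames[0]]
    for frame in shot_frames[1:]:
        if min(inter_frame_difference(frame, k) for k in keys) > t_keyframe:
            keys.append(frame)
    return keys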
Fig. 10 shows some examples of key frames detected in the testing Hollywood cartoon movie 'Ice Age II'. Specifically, key frames are
selected from the example shots shown in Fig. 9. The resulting representative frames show that the number of key frames is proportional to the visual activity of the corresponding shot.

2.3. Removing redundant frames

To segment video data accurately into different scenes for high-level retrieval, frames without useful information (such as rainbow frames and frames with a black or gray background) should be recognized in the set of key frames and removed. We use template matching to detect and remove them. Specifically, the template for a black or gray background, such as images (a) and (b) in Fig. 11, consists of the mean and standard deviation of the pixels of such frames. Under ideal conditions, the mean μ and standard deviation σ of the pixels of a frame with black background can be taken as 0 and 0, respectively, and 223 and 1 are the corresponding values for a frame with gray background. If the differences in both mean and standard deviation between a key frame and a template are less than 1, the key frame is considered redundant and is removed. Rainbow frames, such as image (c) in Fig. 11, are detected and removed based on a normalized RGB histogram. Specifically, a 24-bin normalized RGB histogram is first computed for each key frame, with 8 bins for each color component. If the histogram intersection between a key frame and the standard rainbow frame is less than a well-defined threshold such as 0.1, the key frame is considered a rainbow frame and is removed.
Fig. 10. Key frames of the example shots shown in Fig. 9, illustrating that the number of key frames is proportional to the visual activity of the corresponding shot. Image (a) is the only key frame selected from the shot in Fig. 9(a); images (b) and (c) are the key frames representing the visual content of the shot in Fig. 9(b); the key frames of the shot in Fig. 9(c) are the last five images of Fig. 10.
Fig. 11. Examples of redundant frames that should be detected and removed using template matching.
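A hedged sketch of the redundancy filter follows. The template statistics (mean about 0 for a black background, about 223 for a gray one) and the thresholds of 1 and 0.1 come from the text above; the standard rainbow histogram is assumed to be supplied, and the direction of the intersection test follows the rule as stated.

import numpy as np

def is_blank_frame(gray, templates=((0.0, 0.0), (223.0, 1.0)), tol=1.0):
    # A key frame matches a black/gray template when both its mean and its
    # standard deviation differ from the template values by less than tol.
    mu, sigma = float(gray.mean()), float(gray.std())
    return any(abs(mu - tm) < tol and abs(sigma - ts) < tol for tm, ts in templates)

def rgb_histogram_24(rgb):
    # 24-bin normalized RGB histogram: 8 bins per color component.
    bins = [np.histogram(rgb[..., c], bins=8, range=(0, 256))[0] for c in range(3)]
    hist = np.concatenate(bins).astype(np.float64)
    return hist / hist.sum()

def is_rainbow_frame(rgb, rainbow_hist, threshold=0.1):
    # Histogram intersection with the standard rainbow frame, compared against
    # the threshold as stated in the text above.
    return np.minimum(rgb_histogram_24(rgb), rainbow_hist).sum() < threshold

def remove_redundant(key_frames_gray, key_frames_rgb, rainbow_hist):
    # Keep only key frames matching neither the blank templates nor the rainbow template.
    return [g for g, c in zip(key_frames_gray, key_frames_rgb)
            if not is_blank_frame(g) and not is_rainbow_frame(c, rainbow_hist)]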
2.4. Similarity of video shots

By the encyclopedia definition, a scene can be understood as part of a story, i.e. a high-level temporal segment. A scene can be characterized either by continuity in the visual contents of shots whose setting is fixed, or by continuity of the ongoing actions in the same locale. In other words, a scene is a set of shots with visually similar content, and the duration of a scene obeys a certain temporal constraint. Similarity between two video shots should reflect the correlation between visual content elements such as locales, persons and events. To match information between shots in a way similar to how humans handle the matching problem, it is more suitable to measure similarity between shots in terms of the intersection of their flows of imagery. Such a similarity measure helps to group visually similar shots into a scene in a reasonable way, whereas an unsuitable similarity measure will scatter shots that belong to the same scene into several different scenes. A variety of criteria have been proposed to measure the dissimilarity between shots based only on their key frames. Yeung et al. (1998) used color and luminance information to measure dissimilarity indices of video shots. Rasheed and Shah (2003) used the Backward Shot Coherence (BSC) to measure the similarity of a given shot with respect to the previous shots. Niblack et al. (1993) proposed a quadratic distance between color histograms to measure similarity between images. Stricker and Orengo (1995) presented a similarity measure based on low-order moments of the histogram to cluster similar shots into scenes. Zhai and Shah (2005) computed the similarity in terms of the Bhattacharyya distance. In this paper, we propose an inter-shot similarity measure based on the reciprocal of the Bhattacharyya distance. The definition of the similarity between two video shots consists of two steps, described as follows.

Step 1: Given two shots S_1 with key frame set K^{s1} = {K^{s1}_{j1}}, j1 = 1, ..., n1, and S_2 with key frame set K^{s2} = {K^{s2}_{j2}}, j2 = 1, ..., n2, where n1 and n2 are the total numbers of key frames in the corresponding shots, S_1 and S_2 are similar if some pair of key frames K^{s1}_{j1}, K^{s2}_{j2} is similar.

Step 2: The similarity between two key frames is measured by the reciprocal of the Bhattacharyya distance:
RBD(j1, j2) = 1 / ( −ln Σ_{p=0}^{L−1} Σ_{q=0}^{L−1} √( E^{j1}_{pq} · E^{j2}_{pq} ) ),   (7)

where j1 and j2 are key frames from the different shots S_1 and S_2, respectively. The similarity between shots S_1 and S_2 is then given by Eq. (8):

ShotSim(S_1, S_2) = max( RBD(K^{S1}, K^{S2}) ) = max( RBD({K^{s1}_{j1}}_{j1=1}^{n1}, {K^{s2}_{j2}}_{j2=1}^{n2}) ).   (8)
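A minimal sketch of Eqs. (7) and (8) is given below. Here each key frame is assumed to be summarized by a normalized L × L two-dimensional histogram E_pq (the exact feature behind E_pq is not spelled out above, so this is an assumption of the sketch); the similarity of two shots is the maximum reciprocal Bhattacharyya distance over all pairs of their key-frame histograms.

import numpy as np

def reciprocal_bhattacharyya(e1, e2, eps=1e-12):
    # RBD(j1, j2) = 1 / (-ln sum_pq sqrt(E^j1_pq * E^j2_pq)), Eq. (7).
    bc = np.sqrt(e1 * e2).sum()                  # Bhattacharyya coefficient
    return 1.0 / max(-np.log(bc + eps), eps)     # guard against ln(0) and a zero distance

def shot_similarity(key_hists_1, key_hists_2):
    # ShotSim(S1, S2): maximum RBD over all key-frame pairs of the two shots, Eq. (8).
    return max(reciprocal_bhattacharyya(h1, h2)
               for h1 in key_hists_1 for h2 in key_hists_2)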
The testing Hollywood cartoon movie 'Ice Age II', which lasts one hour and twenty minutes, contains 1204 shots in total, and the total number of ground-truth scenes is 35. Fig. 12 shows the visual similarity map of the first four pairs of shots. The similarity between shots is depicted by pixel intensity, and two facts can easily be observed. On the one hand, a brighter block means higher similarity, so shots from the same scene form a bright block along the diagonal. On the other hand, there is an obvious border between different blocks, indicated by solid rectangles around the corresponding scene regions.
Fig. 12. A visual shot similarity map of the first four pairs of shots for 'Ice Age II'. Shots from the same scene form a bright block along the diagonal; a brighter block represents higher similarity. The ground-truth scene boundaries are indicated by solid rectangles.
The shots shown in Fig. 9 are taken from the same scene, 'Looking for the acorn'; hence they fall into the same bright block, annotated with the number 1.

2.5. Analysis of the time constraint

To segment video scene boundaries in a semantic way, it is necessary to introduce a temporal locality constraint. The reason for imposing a time constraint while grouping visually similar shots into the same scene is to avoid under-segmentation. Specifically, if the temporal interval between two shots exceeds a certain limit, they cannot be grouped into the same scene even if their visual content is similar: either the similarity is purely visual while the actual content, in terms of human understanding, is quite different, or the two shots were recorded in different scenes. Fig. 13 shows some example shots extracted from different scenes of the movie 'Ice Age II' to illustrate these two cases of visual similarity. The visual appearance of the top three shots is similar, yet their actual contents are distinct: they are taken from the annotated scenes 'talking about ice dissolving', 'fighting with the big fish' and 'recalling her past', respectively. The bottom two shots, although recorded in similar locations, belong to episodes in different scenes: the former is from the annotated scene 'talking about ice dissolving', while the latter is from the scene 'fighting with the flood'. For the temporal-constraint clustering, a parameter T_window is introduced to denote the time-window length, i.e. the maximum number of shots a scene can contain. Only shots falling within this time window and satisfying the clustering criterion discussed in Section 2.6 can be grouped into the same scene. The value of T_window should reflect a reasonable scene duration and should segment the video stream into approximately the actual scenes from the perspective of human perception. Accordingly, the choice of T_window greatly affects the resulting scene boundaries. If the value is chosen too large, the window will cover more than one scene, i.e. it may contain more shots than a real scene, and hence cause under-segmentation. If the value is too small, the window will contain fewer shots than a real scene and cause over-segmentation. We take the testing movie 'Ice Age II' to illustrate the importance of the choice of the time-window length.
Fig. 13. Example shots, each represented by one representative frame, extracted from different scenes of the Hollywood cartoon movie 'Ice Age II' to illustrate the importance of the time constraint. The top three shots have similar visual content but distinct actual content. The bottom two shots are recorded in similar locations, but the corresponding episodes take place in different scenes.
Fig. 14 shows some detected example shots, each depicted by one of the key frames of the respective shot. The top shots are selected from the scene 'telling a story' and the middle shots from the scene 'talking about doomsday'. These two scenes are both interacting events in which shots alternate between Diego and the other animals. Since a shot of Diego appears in both scenes, a time-window length with a very large value would lead to under-segmentation. The bottom shots, extracted from the scene 'Looking for the acorn', describe a serial event without any interaction shots; in this case, if the time-window length is chosen too small, the resulting scene segmentation will be over-segmented. In our experiments, the time-window length T_window is set to fifty; the reason for this choice is discussed in detail in Section 3.

2.6. Clustering of video shots

In this section, clustering is considered from the semantic level: shots with visually similar content should be clustered into the same scene based on the similarity measure and the temporal constraint, while shots from different scenes, which have sufficiently different visual content, should not be grouped into the same cluster by mistake. The grouping process consists of the following steps.
Step 1: The similarity between two key frames j1 and j2 from different shots S_1 and S_2 is measured by the reciprocal of the Bhattacharyya distance:

RBD(j1, j2) = 1 / ( −ln Σ_{p=0}^{L−1} Σ_{q=0}^{L−1} √( E^{j1}_{pq} · E^{j2}_{pq} ) ).   (9)

Step 2: The similarity between shot S_1 with n1 key frames and shot S_2 with n2 key frames is the maximum over all pairs of key frames of the two shots:

ShotSim(S_1, S_2) = max( RBD({K^{s1}_{j1}}_{j1=1}^{n1}, {K^{s2}_{j2}}_{j2=1}^{n2}) ).   (10)

Step 3: Within one time window of length T_window, the shot similarity is computed between all pairs of shots S_i and S_j:

SceneSim(T_window) = ShotSim(S_i, S_j),  i, j ∈ (1, T_window).   (11)

Step 4: For the current shot S_c within T_window, all subsequent shots sharing similar visual content are found, and the two furthest such shots S_f2 and S_f1 are picked out for subsequent processing:

S_f = { S_i : ShotSim(S_c, S_i) > T_scene },  i ∈ (c, T_window),   (12)

where T_scene is the fixed threshold used to group similar shots into the same scene. Furthermore, S_f1 is chosen as the current shot S_c for subsequent processing.
Fig. 14. Illustration of the importance of the choice of the time-constraint value.
Step 5: For each shot S_c between S_f2 and S_f1, we check whether there exist similar shots between S_f1 and the end of T_window, and pick out the two furthest such shots S_b2 and S_b1:

S_b = { S_i : ShotSim(S_c, S_i) > T_scene },  c ∈ (f2, f1), i ∈ (f1, T_window).   (13)
Furthermore, S_b1 is then chosen as the current shot S_c for subsequent processing.

Step 6: Steps 4 and/or 5 are repeated until the end of T_window is reached. If the iterative process stops at Step 4, shot S_f1 is the boundary of the current scene and S_f1+1 is the first shot of the next scene; if the iterative process stops at Step 5, shot S_b1 is the boundary of the current scene and S_b1+1 is the first shot of the next scene. Fig. 15 shows the process of detecting one scene boundary of 'Ice Age II' within one time window. For shot S_c1, the shots S_f3, S_f2 and S_f1 are the similar shots found during the forward search. For shot S_c2, which lies between S_f2 and S_f1, S_b1 is the similar shot. There are no shots similar to S_b1; hence the current scene is composed of the shots from S_c1 to S_b1 (see also Fig. 16).
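The grouping loop of Steps 1-6 can be sketched as follows. The sketch simplifies the S_f2/S_f1/S_b1 bookkeeping into a single rule: keep extending the current scene to the furthest shot, within T_window shots of the scene start, that is sufficiently similar to some shot already inside the scene, and close the scene when no further extension is possible. shot_sim(i, j) is assumed to return ShotSim(S_i, S_j) as in Eq. (10); the default thresholds match the values chosen in Section 3.3.

def segment_scenes(num_shots, shot_sim, t_window=50, t_scene=0.35):
    # Returns the index of the last shot of each detected scene.
    boundaries = []
    start = 0
    while start < num_shots:
        window_end = min(start + t_window, num_shots)   # temporal constraint (Eq. (11))
        end = start                                     # current scene boundary
        extended = True
        while extended:                                 # repeat Steps 4/5 until no extension
            extended = False
            for inside in range(start, end + 1):
                for cand in range(end + 1, window_end):
                    # forward/backward coherence test of Eqs. (12)-(13)
                    if cand > end and shot_sim(inside, cand) > t_scene:
                        end = cand
                        extended = True
        boundaries.append(end)
        start = end + 1
    return boundaries

For example, with per-shot key-frame histograms as in Section 2.4, one could call segment_scenes(len(shots), lambda i, j: shot_similarity(hists[i], hists[j])).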
3. Experimental results

This section describes the experimental results of applying the proposed analysis to three test feature films and one TV interview. The segmentation of the original video sequences into semantic scene units is based on the temporal locality constraint and the shot coherence comparison discussed in detail in Sections 2.5 and 2.6. All experiments are run on an Intel Pentium 3.0 GHz machine running Windows. Performance measures and the scene matching rule are presented in Section 3.1. Section 3.2 describes the characteristics of the test videos. The two important parameters used in grouping similar shots into the same scene, T_window and T_scene, are discussed in Section 3.3; their best values are determined experimentally by fixing one parameter and varying the other on the feature film 'Ice Age II'.
3.1. Performance measures and matching rule

The performance of the proposed method is measured by the following two metrics, precision and recall:
Precision = N_c / N_d,  Recall = N_c / N_g,   (14)
where N_c is the total number of correctly detected scenes, N_d is the total number of detected scenes, and N_g is the total number of ground-truth scenes. From the perspective of system performance, high precision is desirable because it shows that the detected scene boundaries are largely correct; similarly, high recall is desirable because it indicates that most of the scene boundaries identified by human knowledge are recovered. For a given feature film, let N_g be the total number of ground-truth scenes and S_g = {S_g1b, S_g1e, ..., S_gNgb, S_gNge} denote the beginning and ending shot numbers of each ground-truth scene. Correspondingly, let N_d be the total number of scenes detected by the proposed shot clustering method and S_d = {S_d1b, S_d1e, ..., S_dNdb, S_dNde} denote the beginning and ending shot numbers of each detected scene. In theory, a detected scene S_di is declared a correct scene if its beginning and ending shots S_dib, S_die coincide with those of some ground-truth scene S_gjb, S_gje. In practice, however, to take editing effects into account, an alternative rule is used during matching: if more than 80% of the shots in a detected scene S_di and a ground-truth scene S_gj overlap each other, S_di and S_gj are said to match.
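As a small illustration of the matching rule and of Eq. (14), the sketch below treats each scene as a (first shot, last shot) pair, counts a detected scene as matched when more than 80% of the shots of the shorter of the two scenes are shared with some ground-truth scene (one possible reading of the overlap rule above), and then computes precision and recall.

def overlap_ratio(a, b):
    # Fraction of shared shots relative to the shorter of the two scenes,
    # with scenes given as (first_shot, last_shot) inclusive ranges.
    shared = max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)
    shorter = min(a[1] - a[0], b[1] - b[0]) + 1
    return shared / shorter

def precision_recall(detected, ground_truth, min_overlap=0.8):
    # Precision = N_c / N_d and Recall = N_c / N_g, Eq. (14).
    matched = sum(1 for d in detected
                  if any(overlap_ratio(d, g) > min_overlap for g in ground_truth))
    return matched / len(detected), matched / len(ground_truth)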
3.2. Characteristics of the test videos

In our experiments, three entire MPEG-1 feature films, namely Ice Age II (I. A. II), Mission Impossible III (M. I. III) and X Man III (X M. III), and one TV interview (TV Inter.) are used. The frame rate of each video stream is 25 frames/s and the spatial resolution is 320 × 240 pixels. Table 1 lists the detailed characteristics of the test videos. The ground-truth scenes in this experiment are obtained by manually segmenting the original video files.
Fig. 15. Example of the process of determining a scene boundary using shot correlation under the temporal constraint.
3.3. Variation of the important clustering parameters

There are three important parameters in our experiments: the shot detection parameter T_change, the time-window length T_window, and the shot clustering parameter T_scene. The process of choosing the best values of these three parameters is described below.
Fig. 16. Ground-truth scenes and detected scenes of the movie 'X Man III', where consecutive scenes are represented by alternating black/white stripes. The bottom row shows the scenes based on the consistent observations of ten human observers, and the scenes detected using the proposed scheme are shown in the upper row.
Table 1
Detailed characteristics of the test videos

                   TV Inter.    I. A. II    M. I. III    X M. III
Time (hh:mm:ss)    00:39:23     1:20:02     1:02:09      1:44:03
# Frames           59097        121557      150808       156084
# Shots            422          1240        1809         1219
# Scenes           15           35          45           25
The parameter T_change is determined from repeated experiments and is set to 0.1, which gives good performance in shot detection and the corresponding key-frame selection. The best values of the parameters T_window and T_scene are determined experimentally so as to offer a good trade-off between precision and recall. Since there is no simple, theoretically correct way to choose the time-window length and the shot clustering threshold, several different values are tried. Table 2 shows the results on the test video 'Ice Age II' with T_scene fixed at 0.35 and T_window varied from 20 to 60. Table 3 gives the results on 'Ice Age II' with the shot clustering threshold varied from 0.25 to 0.45 and the time-window length fixed at 50. From this detailed comparison, the best performance for scene boundary segmentation is achieved when the time-window length T_window and the shot clustering threshold T_scene are 50 and 0.35, respectively; thus (T_window, T_scene) = (50, 0.35) is selected as the best parameter setting for scene boundary detection. Table 4 gives the final detection results with this best setting of a time-window length of 50 and a shot clustering threshold of 0.35.
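The parameter search described above can be summarized by the short sketch below: two one-dimensional sweeps, as in Tables 2 and 3, where evaluate(t_window, t_scene) is assumed to run the whole pipeline on the tuning film and return (precision, recall). Ranking the settings by F1 score is an illustrative tie-breaker; the paper inspects the tables directly.

def choose_parameters(evaluate, windows=(20, 30, 40, 50, 60),
                      scenes=(0.25, 0.30, 0.35, 0.40, 0.45), fixed_scene=0.35):
    # First vary T_window at a fixed T_scene (Table 2), then vary T_scene at the
    # chosen T_window (Table 3); evaluate() returns (precision, recall).
    def f1(pr):
        p, r = pr
        return 2 * p * r / (p + r + 1e-9)
    best_window = max(windows, key=lambda tw: f1(evaluate(tw, fixed_scene)))
    best_scene = max(scenes, key=lambda ts: f1(evaluate(best_window, ts)))
    return best_window, best_scene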
Table 2
Detection results on the test video 'Ice Age II' for different T_window values and a fixed T_scene value of 0.35

Measure              Time-window length value (T_window)
                     20      30      40      50      60
# Detected scenes    70      45      40      37      35
# Matched            22      24      25      31      27
# Missed             13      11      10      4       8
Precision            0.31    0.53    0.63    0.84    0.77
Recall               0.63    0.69    0.71    0.89    0.77
Table 3
Detection results on the test video 'Ice Age II' for various T_scene values and a fixed T_window value of 50

Measure              Video shot clustering value (T_scene)
                     0.25    0.30    0.35    0.40    0.45
# Detected scenes    48      43      37      35      32
# Matched            25      26      31      25      24
# Missed             7       6       4       7       9
Precision            0.52    0.60    0.84    0.71    0.75
Recall               0.71    0.74    0.89    0.71    0.69
Table 4
Detailed detection results for scene boundary segmentation with a time-window length of 50 and a shot clustering threshold of 0.35

Measure                  TV Inter.    I. A. II    M. I. III    X M. III
# Ground truth scenes    15           35          45           25
# Detected scenes        16           37          52           27
# Matched                13           31          40           20
Precision                0.813        0.838       0.769        0.741
Recall                   0.867        0.886       0.889        0.800
Table 5
Comparison with the table-of-contents method of Rui et al. (1999)

Method                            TV Inter.    I. A. II    M. I. III    X M. III
Proposed method     Precision     0.813        0.838       0.769        0.741
                    Recall        0.867        0.886       0.889        0.800
Rui et al. (1999)   Precision     0.728        0.742       0.712        0.676
                    Recall        0.806        0.645       0.763        0.638
From Table 4, we can see that the overall precision and recall are 79% and 86.1%, respectively, which indicates good system performance and a good balance between precision and recall. Fig. 16 shows the detailed scene detection results for the movie 'X Man III'; consecutive scenes are represented by alternating black/white stripes, with the ground-truth scenes in one row and the results detected by the proposed scheme in the other. We also compare the scene segmentation performance of the proposed algorithm with the method of Rui et al. (1999), as shown in Table 5. From Table 5, it can be observed that the average precision and recall of our approach are 0.790 and 0.861, respectively, while those of Rui et al. (1999) are 0.715 and 0.713.

4. Discussion and conclusion

In this paper, a new method is presented for the temporal scene segmentation of a video track, and the detected scenes closely approximate the actual video episodes. Our segmentation is based on investigating the visual information between adjacent scenes, as well as on the assumption that the visual content within a scene is similar. Consequently, a two-dimensional entropy model is used to characterize the representative content of different scenes. To describe the content of a shot more faithfully, an appropriate number of key frames is selected based on the activity content of the relevant shot. Using temporally constrained clustering and forward shot coherence comparison, shots with similar visual content can be grouped into the same scene efficiently while under-segmentation is avoided. The experimental results reveal a noteworthy phenomenon: although the three test feature films belong to quite different categories, the final detection results do not differ much, which shows that the proposed system suits different types of feature films. In other words, with a common set of parameters, we obtain satisfactory and meaningful scenes presenting distinct events and locales from different categories of video programs. As Table 4 shows, high accuracy can be achieved. In addition, the resulting scenes can be used for the movie retrieval task by developing an efficient scheme for event-driven video retrieval.

Acknowledgements

This work is supported by the National High-Tech Research and Development Plan of China (973) under Grant No. 2006CB303103, and by the National Natural Science Foundation of China under Grant No. 60675017. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not reflect the views of the P.R.C. Government.

References

Albanese, M., Chianese, A., et al. (2004). A formal model for video shot segmentation and its application via animate vision. Multimedia Tools and Applications, 24(3), 253-272.
Boccignone, G., Chianese, A., et al. (2005). Foveated shot detection for video segmentation. IEEE Transactions on Circuits and Systems for Video Technology, 15(3), 365-377.
Boreczky, J. S., & Wilcox, L. D. (1998). A hidden Markov model framework for video segmentation using audio and image features. In Proceedings of the international conference on acoustics, speech, and signal processing.
Cernekova, Z., Kotropoulos, C., & Pitas, I. (2003). Video shot segmentation using singular value decomposition. In Proceedings of the IEEE international conference on multimedia and expo (pp. 301-302).
Dimitrova, N., Zhang, H. J., et al. (2002). Applications of video content analysis and retrieval. IEEE Multimedia, 9(3), 42-55.
Girgensohn, A., & Boreczky, J. (2000). Time-constrained key frame selection technique. Multimedia Tools and Applications, 11(3), 347-358.
Hanjalic, A. (2002). Shot-boundary detection: Unraveled and resolved? IEEE Transactions on Circuits and Systems for Video Technology, 12(2), 90-105.
Hasebe, S., Nagumo, M., et al. (2004). Video key frame selection by clustering wavelet coefficients. In Proceedings of the European signal processing conference (pp. 2303-2306). Vienna, Austria.
Jiang, H., Lin, T., & Zhang, H. (2000). Video segmentation with the support of audio segmentation and classification. In Proceedings of the IEEE international conference on multimedia and expo.
Kobla, V., DeMenthon, D., & Doermann, D. (1999). Special effect edit detection using video trails: A comparison with existing techniques. In Proceedings of the conference on storage and retrieval for image and video databases VII (pp. 302-313).
Niblack, W., Barber, R., et al. (1993). The QBIC project: Querying images by content using color, texture and shape. In Proceedings of storage and retrieval for image and video databases (pp. 13-25).
Rasheed, Z., & Shah, M. (2003). A graph theoretic approach for scene detection in produced videos. In Proceedings of the ACM workshop on multimedia information retrieval.
Ren, W., & Singh, S. (2003). A novel approach to key-frame detection in video. In Proceedings of visualization, imaging, and image processing.
Rui, Y., Huang, T. S., & Mehrotra, S. (1999). Constructing table-of-content for videos. ACM Multimedia Systems Journal, Special Issue on Multimedia Systems on Video Libraries, 7(5), 359-368.
Stricker, M., & Orengo, M. (1995). Similarity of color images. In Proceedings of storage and retrieval for image and video databases (pp. 381-392).
Sundaram, H., & Chang, S.-F. (2000). Video scene segmentation using video and audio features. In Proceedings of the IEEE international conference on multimedia and expo (pp. 1145-1148).
Tavanapong, W., & Zhou, J. (2004). Shot clustering technique for story browsing. IEEE Transactions on Multimedia, 6(4), 517-526.
Thormaehlen, T., Broszio, H., & Weissenfeld, A. (2004). Keyframe selection for camera motion and structure estimation from multiple views. In Proceedings of the European conference on computer vision (pp. 523-535).
Togawa, H., & Okuda, M. (2005). Position-based keyframe selection for human motion animation. In Proceedings of parallel and distributed systems.
Xiong, Z., Zhou, X., et al. (2006). Semantic retrieval in video. IEEE Transactions on Signal Processing, 23(2), 18-27.
Yeung, M., Yeo, B.-L., & Liu, B. (1998). Segmentation of video by clustering and graph analysis. Computer Vision and Image Understanding.
Yuan, J., Li, J., Lin, F., & Zhang, B. (2005). A unified shot boundary detection framework based on graph partition model. In Proceedings of ACM multimedia (pp. 539-542).
Zhai, Y., & Shah, M. (2005). A general framework for temporal video scene segmentation. In Proceedings of the IEEE international conference on computer vision.