
ARTICLE IN PRESS Signal Processing: Image Communication 23 (2008) 473– 489

Contents lists available at ScienceDirect

Signal Processing: Image Communication journal homepage: www.elsevier.com/locate/image

A compressed-domain approach for shot boundary detection on H.264/AVC bit streams Sarah De Bruyne , Davy Van Deursen, Jan De Cock, Wesley De Neve, Peter Lambert, Rik Van de Walle Department of Electronics and Information Systems, Multimedia Lab, Ghent University, IBBT, Gaston Crommenlaan 8 bus 201, B-9050 Ledeberg-Ghent, Belgium

Article info

abstract

Article history: Received 21 April 2008 Accepted 29 April 2008

The amount of digital video content has grown extensively during recent years, resulting in a rising need for the development of systems for automatic indexing, summarization, and semantic analysis. A prerequisite for video content analysis is the ability to discover the temporal structure of a video sequence. In this paper, a novel shot boundary detection technique is introduced that operates completely in the compressed domain using the H.264/AVC video standard. As this specification contains a number of new coding tools, the characteristics of a compressed bit stream are different from prior video specifications. Furthermore, the H.264/AVC specification introduces new coding structures such as hierarchical coding patterns, which can have a major influence on video analysis algorithms. First, a shot boundary detection algorithm is proposed which can be used to segment H.264/AVC bit streams based on temporal dependencies and spatial dissimilarities. This algorithm is further enhanced to exploit hierarchical coding patterns. As these sequences are characterized by a pyramidal structure, only a subset of frames needs to be considered during analysis, allowing a reduction in computational complexity. Besides the increased efficiency, experimental results also show that the proposed shot boundary detection algorithm achieves a high accuracy. © 2008 Elsevier B.V. All rights reserved.

Keywords: H.264/AVC Shot boundary detection Temporal segmentation Video analysis

1. Introduction Advances in multimedia coding technology, combined with the growth of the Internet and the advent of digital television, have resulted in the widespread use and availability of digital video. This rise will only be strengthened by the increasing popularity of user-generated video content. Unfortunately, these video collections are often not catalogued and only accessible by sequential scanning. As a consequence, technologies and tools for efficient browsing and retrieval of video content are gaining importance. Semantic key frame extraction and

 Corresponding author. Tel.: +32 9 33 14957; fax: +32 9 33 14896.

E-mail address: [email protected] (S. De Bruyne). URL: http://multimedialab.elis.ugent.be 0923-5965/$ - see front matter © 2008 Elsevier B.V. All rights reserved. doi:10.1016/j.image.2008.04.012

summarization are generally accepted as the key to quickly browsing through a video sequence and to enabling easy navigation between relevant segments [33]. To extract the semantics from the video content, a system or agent is needed to analyze the content. Manual annotation is time consuming, expensive, and unfeasible for most applications. Therefore, automatic analysis of video content is of significant importance. Since the identification of the temporal structure of video is an essential task for video indexing and retrieval, the first step commonly taken in video analysis is shot boundary detection. A shot is usually conceived in the literature as a series of interrelated consecutive pictures taken contiguously by a single camera, representing a continuous action in time and space [2]. According to whether the transition between consecutive shots is


abrupt or not, boundaries are classified as cuts or gradual transitions. After identifying the shot boundaries, most of the existing work focuses on the summarization of the content. One common approach for visualizing video content is a storyboard in which key frames are tiled together to obtain a good overview of the video. Each key frame generally represents the content of one shot and is selected according to an analysis method that optimizes the semantic coverage of the video content [4,13]. A second approach is summarizing the video content by selecting short, highlighted segments, resulting in a fast preview of the content. A number of approaches have been used to generate these so-called video skims, including rate-distortion and motion analysis [10,11,24]. These techniques often rely on MPEG standards such as MPEG-7 [28] and MPEG-21 [3] for the description and analysis of the multimedia content and the multimedia customization [5,10,11,27]. In order to preserve storage space and to reduce bandwidth constraints, most video data in multimedia databases are stored in compressed form. To avoid unnecessary and time-consuming decompression during video analysis, features available in the compressed domain can be used. In the past, most video content was stored using MPEG-1 Video, MPEG-2 Video, or MPEG-4 Visual, which resulted in the development of video analysis algorithms only working on these specific coding formats. As the H.264/AVC video coding standard [32] performs significantly better than any prior standard in terms of coding efficiency, it can be expected that a significant amount of future video content will be encoded in this format. The compression efficiency of H.264/AVC can be attributed to a number of new coding tools, which results in compressed video data having specific characteristics. As a consequence, these characteristics influence prior algorithms used in compressed video analysis to a great extent. 
In this paper, we propose a novel shot boundary detection algorithm working directly on H.264/AVC bit streams by relying on macroblock and sub-macroblock types, and their related motion vectors and reference picture indices. By only using compressed-domain information, this approach is characterized by a low computational complexity. Furthermore, this algorithm is enhanced to exploit hierarchical coding patterns, a new coding structure introduced in the H.264/AVC specification. By exploiting the pyramidal coding structure, only a subset of frames needs to be considered during analysis. Consequently, the shot boundary detection step becomes computationally less demanding, which is in line with the idea of executing shot boundary detection in the compressed domain. The outline of the paper is as follows. Section 2 addresses related work in the area of shot boundary detection. A discussion of the influence of H.264/AVC on shot boundary detection algorithms for prior video coding standards is also provided. Section 3 introduces our newly developed shot boundary detection algorithm for H.264/AVC bit streams, whereas Section 4 presents an enhancement to this algorithm that takes into account

hierarchical coding patterns during shot detection. Section 5 provides performance results regarding the accuracy of the proposed algorithm in terms of recall and precision. This section also analyzes the decrease in complexity originating from the enhanced approach. Finally, conclusions are drawn in Section 6.

2. Related work 2.1. Existing algorithms Algorithms for shot boundary detection can broadly be classified into two major groups, depending on whether the operations are performed in the pixel domain, or whether they rely directly on compressed-domain features. The most common algorithms that work in the pixel domain are based on colour histogram differences, changes in edge characteristics, and pixel differences between successive frames [12,25,36]. Of these different methods, colour histogram-based algorithms are considered the most reliable for the detection of abrupt transitions. Full decompression of the video and the accompanying computational overhead can be avoided by only using compressed-domain features such as transform coefficients, macroblock types, or motion vectors. Hence, a significant number of shot boundary detection algorithms operate in the compressed domain. In the literature, most of these algorithms work on MPEG-1 Video and MPEG-2 Video. The concept of DC images is defined by Yeo and Liu [34], where every 8×8 block is represented by the average intensity of this block in the original image. For intra-coded macroblocks, the DC coefficient of the block is extracted as this represents the average energy. For inter-coded macroblocks, these coefficients are predicted based on the motion vector and the average intensity of the referred blocks in the previous DC image. Based on these DC images, shot detection algorithms based on colour histogram differences can be modified to operate in the compressed domain. Furthermore, the distribution of the different macroblock types and motion information [29] can also be used to detect shot boundaries. For example, when an abrupt cut occurs at a P picture, it is expected that most or all macroblocks are intra-coded since they cannot be predicted well from prior reference frames.
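The DC-image comparison described above can be sketched as follows. This is an illustrative sketch rather than the cited authors' implementation: it assumes the DC images (per-8×8-block average intensities) have already been extracted into 2D arrays, and compares two of them with a normalised bin-wise histogram difference.

```python
import numpy as np

def dc_histogram_difference(dc_prev, dc_curr, bins=64):
    """Bin-wise absolute histogram difference between two DC images.

    dc_prev, dc_curr: 2D arrays of per-8x8-block average intensities
    (the "DC images" of Yeo and Liu), assumed already extracted from
    the bit stream. Returns a value in [0, 2]: 0 for identical
    histograms, 2 for completely disjoint ones.
    """
    h_prev, _ = np.histogram(dc_prev, bins=bins, range=(0, 256))
    h_curr, _ = np.histogram(dc_curr, bins=bins, range=(0, 256))
    # Normalise by the block count so the metric is resolution-independent.
    return np.abs(h_prev - h_curr).sum() / dc_prev.size
```

A shot boundary candidate is then declared where this difference spikes between consecutive DC images.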
Similarly, for an abrupt cut located at a B frame, it is assumed that macroblocks are mainly intra-coded or backward-predicted from future reference pictures. Compared to abrupt changes, gradual transitions are more difficult to detect; they take place over a variable number of frames and can be generated using a great variety of special effects. Consequently, the difference between successive frames in a transition is substantially reduced. Several techniques have been proposed in the literature for the detection of these gradual changes in the pixel domain. In [36], Zhang et al. presented a twin-comparison method. This approach takes into account the cumulative differences between frames and requires two thresholds: a higher threshold for detecting cuts and
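The twin-comparison idea can be sketched as follows; this is a minimal illustration assuming a precomputed sequence of frame-to-frame differences (the original method includes further verification steps not shown here).

```python
def twin_comparison(diffs, t_low, t_high):
    """Classify transitions from a sequence of frame-to-frame differences.

    Returns (cuts, graduals): frame indices of abrupt cuts, and
    (start, end) index pairs for gradual transitions. A sketch of the
    twin-comparison idea; both thresholds are assumed tuned elsewhere.
    """
    cuts, graduals = [], []
    start, accumulated = None, 0.0
    for i, d in enumerate(diffs):
        if d >= t_high:                            # single large jump: cut
            cuts.append(i)
            start, accumulated = None, 0.0
        elif d >= t_low:                           # candidate gradual change
            if start is None:
                start, accumulated = i, 0.0
            accumulated += d
            if accumulated >= t_high:              # cumulative difference test
                graduals.append((start, i))
                start, accumulated = None, 0.0
        else:                                      # difference dropped: abandon
            start, accumulated = None, 0.0
    return cuts, graduals
```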


a lower threshold for detecting gradual changes. As frames are compared on a frame-to-frame basis by most shot boundary detection algorithms, limited differences between some frames may result in undetected transitions. An alternative is to compare every frame with the frame k positions ahead. When the transition is spread over fewer than k frames, a plateau can be observed in the dissimilarity measures [14,34]. However, selecting an appropriate value for the parameter k is not trivial, as each gradual change can have a different duration. Although most of the proposed techniques for the detection of gradual changes work on uncompressed video, compressed-domain algorithms are gaining importance. Zhang et al. presented a compressed-domain approach in [37] based on DC coefficients and motion vectors. These motion vectors are further examined to distinguish between different types of camera motion and gradual changes. In [1], Bescos presented a comparative study of most of the metrics used for the detection of gradual changes, both in the compressed and uncompressed domain. He also proposed an algorithm for the detection of abrupt as well as all types of gradual changes based on DC images. A drawback of most approaches is that they focus only on the detection of fades and dissolves, whereas special editing effects such as wipes are hardly ever considered.
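The k-frames-apart comparison can be sketched as a plateau search over the dissimilarity sequence d(i, i+k); the plateau criteria used here (minimum height, run length, and flatness tolerance) are illustrative assumptions, not values from the cited work.

```python
def plateau_regions(dists, height, min_len, tol):
    """Find plateaus in k-frame-apart dissimilarities d(i, i+k).

    dists: sequence where dists[i] is the dissimilarity between frame i
    and frame i+k. A plateau is a run of at least min_len values that
    all exceed `height` and stay within `tol` of the run's first value.
    Returns (start, end) index pairs, one per detected plateau.
    """
    regions, i, n = [], 0, len(dists)
    while i < n:
        if dists[i] > height:
            j = i
            while (j + 1 < n and dists[j + 1] > height
                   and abs(dists[j + 1] - dists[i]) <= tol):
                j += 1
            if j - i + 1 >= min_len:
                regions.append((i, j))
            i = j + 1
        else:
            i += 1
    return regions
```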

2.2. Shot boundary detection on H.264/AVC bit streams As mentioned in the introduction, H.264/AVC contains a number of new coding tools compared to prior standards for digital video coding [32]. Three of these features have a major impact on existing shot boundary detection algorithms: intra-prediction in the spatial domain, multi-picture motion-compensated prediction, and decoupling of display and referencing order. Intra-prediction in H.264/AVC is conducted in the spatial domain, by referring to neighbouring samples of previously decoded blocks. Two primary types of intra-coding are supported: Intra_4×4 and Intra_16×16. In the Fidelity Range Extensions of H.264/AVC, Intra_8×8 is also introduced. Due to intra-prediction in the spatial domain, DC coefficients in intra-coded pictures no longer represent the average energy, but only an energy difference. Algorithms working on DC coefficients can therefore no longer be applied to H.264/AVC bit streams. The second feature, multiple reference picture motion compensation, allows an encoder to use a larger number of pictures as reference compared to previous standards. Instead of only using the previous and following reference picture for prediction, the encoder can choose from multiple pictures that have been decoded and marked as reference. In particular, the encoder is allowed to refer further back than the previous and following reference picture, which complicates random access in the bit stream. Therefore, instantaneous decoding refresh (IDR) pictures were introduced, which indicate that no subsequent pictures in the bit stream will require references to pictures prior to the IDR picture in decoding order. As the prediction chain is broken, it is insufficient to rely only on


reference directions to detect shot boundaries. As a result, algorithms developed for previous coding standards cannot cope with the gaps created in the prediction chain of H.264/AVC. Furthermore, extracting reference directions requires a more complex approach, as the display numbers of the reference pictures first need to be calculated in order to determine the reference direction. Referencing order and display order are decoupled as well, allowing the encoder to choose the ordering of pictures with a higher degree of flexibility [32]. As a result, new coding structures such as hierarchical coding patterns are introduced [30]. In addition, there are no restrictions on the coding type of reference frames, and a single coded picture may consist of a mixture of different slice types. As such, the traditional concept of I, P, and B frames is replaced by a highly flexible and general concept that can be exploited by an encoder for different purposes.1 Up to now, few publications report on shot boundary detection algorithms working on H.264/AVC compressed bit streams. In [35] and [26], an intra-mode histogram distance based on the different intra-prediction mode directions is presented to describe content changes in I frames, whereas reference directions are used for P and B frames. In [22], the dissimilarity between I frames is measured by relying on macroblock subdivisions (i.e., Intra_4×4 and Intra_16×16 prediction); P and B frames are not taken into consideration. A combination of both approaches is proposed by the authors in [7], where subdivisions of intra-coded macroblocks and reference directions of inter-coded macroblocks are used during analysis. Whereas the previously described algorithms for H.264/AVC principally focus on the detection of abrupt changes, [6] presents an algorithm for fade detection based on the use of explicit weighting factors.
This new coding tool in H.264/AVC improves the coding efficiency of gradual changes such as fades and dissolves [32]. However, a drawback of algorithms relying on luminance weighted prediction factors is that not all encoders make use of this feature; it is only advantageous for a limited number of frames, while it increases coding complexity. Furthermore, special effects during gradual changes, such as wipes, do not benefit from this coding technique. In theory, the set of motion vectors should also show very typical behaviour during a gradual transition [37]. However, an increase in the number of intra-coded macroblocks as a result of content changes makes the characteristics of the remaining motion vectors less reliable. This increase is further amplified in H.264/AVC by the introduction of spatial prediction of intra-coded macroblocks. The approach presented in this paper is partially based on previous work by the authors [7], where the analysis relies on macroblock and sub-macroblock types, and their related motion vectors and reference indices. In this paper,

1 In the context of H.264/AVC, we define an I frame as a frame that consists entirely of I slices. P and B frames are defined in a similar way.


particular attention is paid to the analysis of I and IDR frames, as these frames break the prediction chain. Next, this algorithm is enhanced in order to exploit the characteristics of hierarchical coding patterns. As elaborated upon in Section 4, the coarse-to-fine structure implies that a subset of lower layers can be seen as a reduced version of the original video. Therefore, in the enhanced algorithm, temporal dependencies between pictures belonging to the base layer are first investigated to decide whether they belong to the same shot or not. Only when this outcome is insufficient to accurately detect shot boundaries are pictures belonging to higher levels processed further. As a consequence, only a subset of pictures is taken into consideration during analysis, resulting in a lower computational cost. Note that the proposed algorithm can also be applied to conventional coding structures such as IBP and IPP. 3. Shot boundary detection for traditional coding patterns In this section, algorithms for the detection of abrupt and gradual transitions in traditional coding patterns are examined, together with the accompanying selection of thresholds. A flow chart of the algorithm is given at the end in order to provide a complete overview. 3.1. Detection of abrupt transitions Within a video sequence, a strong correlation exists between successive frames. As an encoder typically tries to exploit temporal and spatial correlation to a maximum extent, this results in specific characteristics of motion prediction information and macroblock subdivisions. By analyzing these patterns, shot boundary detection can be performed in the compressed domain. For the detection of abrupt transitions, we propose to examine the temporal dependencies between successive frames. Thereafter, when the prediction chain is broken as a result of certain I frames or IDR frames, spatial dissimilarities are calculated as well. 3.1.1.
Relying on temporal dependencies P and B frames use temporal prediction to exploit similarities with previously coded frames. However, when the current frame is the starting frame of a new shot, this frame will have hardly any resemblance to previously displayed frames. As a consequence, the encoder will prefer to use intra-coded macroblocks or macroblocks referring to following frames in display order which are already decoded (called backward prediction). An example is given in Fig. 1, where frame B_2 is the first frame of a shot in display order but not in decoding order. The macroblocks in this frame will therefore mainly refer backward to frame P_3. Frame P_7, which is the first encoded and displayed frame of a new shot, will be mainly intra-coded. On the other hand, the last frame of a shot in display order will have hardly any correlation with following frames. Therefore, this frame will mainly contain macroblocks referring to previously displayed

[Figure] Fig. 1. Example of a video sequence consisting of three shots (frames 0–8, shown both in display order and in decoding order). The arrows represent the available reference frames; full arrows indicate frequently used reference frames while the dashed arrows indicate the opposite.

frames (called forward prediction) or intra-coded macroblocks. This is the case for frames B_1 and B_6. In prior standards such as MPEG-1 Video and MPEG-2 Video, macroblocks could only be predicted from a single previous (and a single following) reference frame. As a result, the reference direction could directly be derived from the macroblock types. As H.264/AVC allows multiple reference picture motion compensation and decoupling of referencing order and display order, macroblock types only indicate which partitioning is applied and which reference lists are utilized for each macroblock partition. Based on the index in these reference lists, information about the reference frame can be extracted. In particular, information about the picture order count (POC) can be utilized to determine the display number of a reference frame. The reference direction of the macroblock partition can then be determined by comparing the display numbers of the current frame and the reference frame. When the display numbers of all reference frames of a partition are prior (resp. subsequent) to the current frame, forward (resp. backward) prediction is used. When reference frames are located before as well as after the current frame, a partition is coded using bi-directional prediction. As the smallest partition for which a reference index can change is 8×8 pixels in size, this block size will be used as the basic unit to calculate the reference directions present in a frame. These observations can be used to formulate two conditions for the detection of shot boundaries. Let i(f_i), j(f_i), b(f_i), and d(f_i), respectively, be the number of blocks coded using intra-coding, forward prediction, backward prediction, and bi-directional prediction in the current frame f_i; let f_{i-1} be the previous frame in display order and let B denote the number of 8×8 blocks in a frame.
By relying on temporal dependencies, a shot boundary can be declared at frame f_i if the following condition is met:

(1/B) (i(f_{i-1}) + j(f_{i-1})) > T_inter  ∧  (1/B) (i(f_i) + b(f_i)) > T_inter.   (1)

Note that the percentage of bi-directionally predicted blocks d(f_i) is not used in the inequalities, as a high value for d(f_i) typically corresponds to a low probability of a shot boundary. When the percentages calculated in the inequalities exceed a predefined threshold T_inter, one can conclude that an abrupt shot boundary is detected.
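The two-step check can be sketched as follows: classify each 8×8 block from the picture order counts of its reference frames, then evaluate Condition (1). Parsing the reference indices and POCs out of the bit stream is assumed to happen elsewhere; this is an illustrative sketch, not the authors' implementation.

```python
from collections import Counter

def block_direction(ref_pocs, current_poc):
    """Reference direction of one 8x8 block.

    ref_pocs: picture order counts (display numbers) of all frames the
    block refers to; empty for an intra-coded block.
    """
    if not ref_pocs:
        return "intra"
    if all(p < current_poc for p in ref_pocs):
        return "forward"
    if all(p > current_poc for p in ref_pocs):
        return "backward"
    return "bidirectional"

def is_abrupt_boundary(dirs_prev, dirs_curr, t_inter=0.8):
    """Condition (1): the previous frame ends the old shot (mostly intra
    or forward-predicted) and the current frame starts the new one
    (mostly intra or backward-predicted)."""
    def fraction(dirs, kinds):
        counts = Counter(dirs)
        return sum(counts[k] for k in kinds) / len(dirs)
    return (fraction(dirs_prev, ("intra", "forward")) > t_inter
            and fraction(dirs_curr, ("intra", "backward")) > t_inter)
```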


In Section 3.1.3, we will further elaborate on the determination of optimal thresholds.

3.1.2. Relying on spatial similarities in case of I frames As I frames are coded independently of other frames, the second inequality in Condition 1 is always fulfilled. When the previous frame in display order is only allowed to use forward references, gaps in the temporal prediction chain are introduced. In this case, both inequalities in Condition 1 are met, resulting in a majority of falsely detected shot boundaries. Indeed, every I frame that is not used by previously displayed frames as reference is then seen as the start of a new shot. For classic I frames, this phenomenon will typically occur when IPP patterns or variable GOP sizes are used. Furthermore, as a result of the properties of an IDR frame, gaps in the prediction chain will always occur for this particular type of I frames. To overcome this problem, additional conditions are required in case the prediction chain is broken as a result of IDR frames and certain I frames. As stated in Section 2.2, H.264/AVC supports several intra-macroblock partitions, of which Intra_4×4 and Intra_16×16 are the most commonly used. The first mode is generally preferred by an encoder in case of significant detail, whereas the latter mode is more suitable for coding smooth areas. As a result, the subdivision of an I frame into different macroblock partitions leads to a good representation of the detail of the content, as can be seen in Fig. 2. By comparing the distribution of the intra-macroblock types in the current frame with previously displayed frames, dissimilarities between frames can be calculated and therefore, changes in content can be detected. Comparing the current I frame with the previous I frame is not recommended as the content can change significantly between these two frames. For example, a shot boundary can be located in between at a P or B frame, or new objects can appear (Fig. 2). Therefore, an intra-prediction map M_{i-1} is constructed containing the last intra-coded macroblock partitioning of each macroblock. Every time an intra-coded macroblock is encountered, M_{i-1} is updated with the new spatial information. This map can then be used to represent the spatial distribution of the content of the previous frame, in spite of the fact that this frame contains inter-coded macroblocks. By comparing the current I frame with the map M_{i-1}, the dissimilarity between the current and previous frame can be calculated. Instead of comparing macroblock types at corresponding positions, a window of 3×3 macroblocks is selected for each macroblock. This window is located centrally around the current macroblock, as can be seen in Fig. 2. This way, movement of objects or the camera will lead to fewer false alarms. Let f_i denote the current I frame, M_{i-1} the corresponding intra-prediction map of the previous frame, and MB the number of macroblocks m in a frame. Define T as the set of possible macroblock partitions (T = {Intra_4×4, Intra_8×8, Intra_16×16}). Furthermore, let F be a placeholder for f_i and M_{i-1}, and n a macroblock in the associated frame or intra-prediction map. The dissimilarity metric Ω is defined as follows:

w^F_{m,t} = { n | n ∈ F ∧ n ∈ window associated with macroblock m ∧ n is coded using partitioning mode t },   (2a)

W(f_i, m) = ( Σ_{t∈T} | |w^{f_i}_{m,t}| − |w^{M_{i-1}}_{m,t}| | ) / ( 2 · |w^{f_i}_m| ),   (2b)

Ω(f_i) = (1/MB) Σ_{m∈f_i} W(f_i, m).   (2c)
For I frames, when both inequalities in Condition 1 are met and the dissimilarity Ω(f_i) also exceeds a predefined threshold T_intra, one can conclude that the current I frame is the starting frame of a new shot. The next section further elaborates on the determination of the threshold T_intra. Note that for IDR frames, the gap in the temporal prediction chain is not necessarily located at the IDR frame, but can occur earlier in display order. When the decoded picture buffer is cleared as a result of the IDR frame, frames located further in the bit stream but prior in display order can still refer to the IDR frame, but not to preceding frames in decoding order. As the number of frames influenced by an IDR frame is typically much larger for hierarchical coding patterns than for traditional structures, this will be discussed further in Section 4.2. 3.1.3. Threshold selection for abrupt transitions A less extensively studied problem is the selection of optimal thresholds for evaluating the computed frame-to-frame differences. Most authors work with global thresholds, which remain the same over the entire sequence.

[Figure] Fig. 2. Distribution of Intra_4×4 and Intra_16×16 macroblocks (frames f10, f27, and f51; I, P, and I frames, respectively; a window of 3×3 macroblocks is indicated). Although the second frame is a P frame, it is mainly intra-coded as it is the first frame of a new shot.


Optimal values for global thresholds are often determined by compromising between recall and precision ratios [14,36]. An alternative is to work with adaptive thresholds, which vary during the analysis and match the local activity. The threshold for each frame is then based on the frame-to-frame differences of surrounding frames located in a corresponding search window. In general, maximum values or the mean and standard deviation of frame-to-frame differences in a search window are used to determine the local thresholds [15,34,36]. Selecting appropriate threshold values is a key issue in successfully applying our proposed shot boundary detection algorithm. The detection of abrupt changes relies on two thresholds, T_inter and T_intra. The first threshold, which is applied to both inequalities in Condition 1, is responsible for detecting gaps in the temporal prediction chain. For this purpose, the percentages of intra-coded, forward-referring, and backward-referring blocks of the previous and current frame are compared to T_inter. As these percentages represent the probability of a shot boundary (from 0% to 100%), global thresholds can be applied [29]. The optimal value for T_inter is chosen heuristically by compromising between recall and precision ratios for different threshold values. Experimental results shown in Fig. 3 indicate that a wide range of threshold values (in particular, the interval [0.68, 0.88]) is suitable for minimizing the number of missed detections and false alarms. In order to select the optimal value from this interval, the sum of the recall and precision curves is calculated, after which the maximum of this curve is localized. As this maximum corresponds to a threshold of 80%, this value is used for T_inter in the remainder of this paper. Note that for determining the optimal values for other variables and thresholds in the paper, the same methodology based on recall and precision curves is applied.
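The threshold-selection procedure (maximising the sum of recall and precision over a set of candidate thresholds) can be sketched generically; score_fn is a hypothetical callback that runs the detector against ground truth for a given threshold and returns the resulting (recall, precision) pair.

```python
def pick_threshold(candidates, score_fn):
    """Select the threshold maximising recall + precision.

    candidates: iterable of threshold values to try.
    score_fn(t): returns (recall, precision) for threshold t, e.g. by
    running the detector over an annotated training set (assumed
    available elsewhere).
    """
    best_t, best_sum = None, -1.0
    for t in candidates:
        recall, precision = score_fn(t)
        if recall + precision > best_sum:
            best_sum, best_t = recall + precision, t
    return best_t
```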
To evaluate the spatial dissimilarity, T_intra is used to compare the distribution of the different macroblock partitioning types in I frames. As the obtained results do not represent probabilities, but rather indicate the difference with previous frames, a more sophisticated technique is required. A technique often used in the literature is based on the statistical distribution of the frame-to-frame differences [15,36]. The obtained distribution is modelled by a Gaussian function with parameters μ and σ, which

represent the mean and standard deviation of the frame-to-frame differences. The corresponding threshold value can be computed as T = μ + ασ,

(3)

where α is a parameter related to the tolerated probability of missed detections and false alarms. When difference values fall outside the range 0 to μ + ασ, they can be considered indicators of shot boundaries. Although this threshold is based on the properties of the sequence, its value remains the same for the entire sequence. Therefore, it is not capable of adapting to the properties of the local content. A second method, using adaptive thresholds, is based on sliding windows [15,34]. For each frame k, a window containing N elements is created in which k is located in the middle. When the frame-to-frame difference between frames k−1 and k is the window maximum and β times larger than the second largest difference value, a shot boundary is detected. A major weakness of this approach is the high sensitivity of β, originating from motion. In this paper, we have utilized a combination of both approaches discussed above and applied several changes to determine T_intra. In particular, these techniques need to be adjusted as they are mainly used in uncompressed-domain algorithms, whereas we want to apply them to compressed-domain information. First, as the spatial dissimilarity can only be calculated for I frames, T_intra is not based on frame-to-frame differences; instead, only difference values for I frames are considered. To adapt T_intra to the local properties of the content, a window consisting of M elements is constructed. In contrast to the above technique, all M elements are located before the examined frame. By making this adjustment, I frames that are located in the future and are part of a high-motion shot cannot cause missed detections. To control the resulting false alarms, the mean and standard deviation are used. Let μ_Ω denote the mean of the dissimilarity values of the M previous I frames and σ_Ω the corresponding standard deviation. T_intra can then be defined as T_intra = μ_Ω + ασ_Ω.

(4)

The values for M and a are computed heuristically in the same way as for T inter, resulting in typical values lying
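The adaptive threshold of Eq. (4) can be sketched in a few lines of Python. This is a minimal illustration: the window size M, the weight α, and the clamping bounds take the typical values reported in this section, and all function and variable names are our own.

```python
from statistics import mean, pstdev

def adaptive_threshold(history, alpha=3.0, lo=0.2, hi=0.6):
    """T_intra = mu_Omega + alpha * sigma_Omega, computed over the
    dissimilarity values of the M previous I frames, clamped to
    [lo, hi] to avoid extreme thresholds for static or
    high-motion shots."""
    t = mean(history) + alpha * pstdev(history)
    return min(max(t, lo), hi)

def detect_abrupt(dissimilarities, m=5, alpha=3.0):
    """Flag I frames whose spatial dissimilarity exceeds the
    threshold derived from the m preceding I frames."""
    boundaries = []
    for i in range(m, len(dissimilarities)):
        if dissimilarities[i] > adaptive_threshold(dissimilarities[i - m:i], alpha):
            boundaries.append(i)
    return boundaries
```

Because all M window elements precede the examined frame, a future high-motion I frame cannot inflate the threshold and cause a missed detection.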

Fig. 3. Influence of T inter on recall and precision ratios.


Fig. 4. Extraction of foreground and background using the mathematical morphology operation opening (panels: three intra-coded objects; structure element; opening of intra-coded macroblocks; opening of inter-coded macroblocks).
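The opening operation of Fig. 4, and the motion-intensity measure used in the surrounding discussion, can be sketched as follows. This is a dependency-free illustration with a 3x3 structuring element, operating on a hypothetical binary macroblock mask (1 = intra-coded); the helper names are ours.

```python
from statistics import pstdev
from math import hypot

def _erode(mask):
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = int(all(mask[y + dy][x + dx]
                                for dy in (-1, 0, 1) for dx in (-1, 0, 1)))
    return out

def _dilate(mask):
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        if 0 <= y + dy < h and 0 <= x + dx < w:
                            out[y + dy][x + dx] = 1
    return out

def opening(mask):
    """Morphological opening (erosion followed by dilation) with a
    3x3 structuring element: isolated intra-coded macroblocks are
    removed, while compact regions survive."""
    return _dilate(_erode(mask))

def motion_intensity(motion_vectors):
    """MPEG-7-style motion intensity: the standard deviation of the
    motion-vector magnitudes of a region."""
    return pstdev(hypot(mx, my) for mx, my in motion_vectors)
```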

around 5 and 3, respectively. Furthermore, an upper and lower boundary is set on this threshold to avoid extreme values. For example, when the start of a shot is static, the corresponding dissimilarity values are close to zero; a small amount of motion later in the shot would therefore result in a FA. For high-motion scenes, the standard deviation could become too large, making T_intra larger than its maximum. When observing typical values of Ω(i) corresponding to abrupt transitions, appropriate boundaries are [0.2, 0.6].

3.2. Detection of gradual transitions

In this paper, the percentage of intra-coded macroblocks combined with motion vector information is used as the criterion for identifying gradual changes. First, a change in the percentage of intra-coded macroblocks is used as an indication of a change in content. When an increase in this percentage is noticed, this can typically be attributed to different events: abrupt changes, gradual changes, fast local motion resulting from objects, or fast global motion resulting from camera movement. To separate the gradual changes from the abrupt changes and the motion, a distinction between these different types of events needs to be made. Abrupt changes can be identified by locating gaps in the temporal prediction chain; otherwise, motion analysis is performed to distinguish gradual changes from motion. The different steps of the algorithm are explained below; a flow chart summarizing the proposed algorithm is given in Fig. 6 in Section 3.3. For content containing fast local motion, the moving objects will typically be coded using a large number of intra-coded macroblocks, whereas the background regions will mainly be coded using temporal prediction. For gradual changes and fast global motion, the intra-coded macroblocks are typically scattered over the entire frame. By examining the motion intensity corresponding to the intra-coded and inter-coded regions, a distinction between local motion, global motion, and gradual changes can be made.

- Estimation of foreground and background: First, the regions corresponding to the foreground and background are estimated. By calculating the mathematical morphology operation opening [16] on the intra-coded (resp. inter-coded) macroblocks, a rough estimation of the foreground (resp. background) can be made, as depicted in Figs. 4 and 5. By applying this operator, the influence of isolated blocks can be ignored.
- Determination of motion intensity: As the estimated foreground is intra-coded, the corresponding motion of this region cannot be calculated from the same frame. Therefore, all frames are divided into two groups: one group used for the estimation of both regions, and one used to calculate the corresponding motion. As frames marked as reference typically contain a higher amount of intra-coded macroblocks than non-reference frames, the reference frames can be used for the estimation of the foreground and background, whereas the latter type of frames can be used to measure the motion. This is, for example, the case in Fig. 5, where the foreground and background are determined from P7 and the motion intensity is computed using B5 and B6. The corresponding motion intensity, a concept originating from the MPEG-7 specification [21], is then calculated for both the foreground (MI_F(f_i)) and the background (MI_B(f_i)) by relying on the standard deviation of the magnitude of the motion vectors.² The two obtained values therefore represent the amount of motion present in both regions. Note that, as I and IDR frames cannot be used to make an estimation of the foreground and background, the previous reference frame is used as estimation instead.
- Distinction between motion and gradual changes: When the calculated motion intensity of both foreground and background is high, the change in content can most likely be attributed to fast, global movement. High motion intensity in only the foreground is typically connected to local motion. Otherwise, when the motion intensity is low although the amount of intra-coded macroblocks indicates a change in content, the origin is typically a gradual change.

² Note that each motion vector is first normalized in relation to the difference between the display numbers, since the distance to the different reference frames can vary.

Note that relying only on the motion intensity of an entire picture is insufficient, as the origin of the intra-coded macroblocks is then unclear. More precisely, small moving objects can result in an increase in intra-coded macroblocks, while the percentage of corresponding motion vectors is too low to influence the motion intensity of the complete picture. Therefore, without separating foreground and background, it is difficult to distinguish gradual transitions from local motion.

Fig. 5. Example of a sequence with local motion. The estimated foreground and background of P7 are computed based on the opening of the intra-coded and inter-coded macroblocks in P7. Using these estimations, the corresponding motion intensity of both regions in B6 is calculated.

For the detection of gradual changes, two thresholds are used. First, the amount of intra-coded macroblocks is examined, as an increase in the number of intra-coded macroblocks can typically be attributed to a content change. In order to anticipate the local properties of the content, an adaptive threshold is appropriate, as explained in Section 3.1.3. Similar to the selection of T_intra, only a certain number of frames located before the current frame is considered. When a content change starts (as a result of a gradual transition or motion), the amount of intra-coded macroblocks first increases, after which it remains high for a certain time. When using future frames as well, in combination with the original sliding window technique, no real peak could be detected. However, by comparing the percentage of intra-coded macroblocks in the current frame with previous frames, the sudden increase can be identified. As a result, the same technique as for T_intra is applied to calculate T_grad. In particular, the percentage of intra-coded macroblocks in the previous M frames belonging to the base layer is used to compute the mean μ_grad and standard deviation σ_grad. T_grad can then be defined as

T_grad = μ_grad + α·σ_grad.   (5)

Optimal values for M and α are computed heuristically in the same way as for T_inter, by compromising between the recall and precision curves obtained for different values of M and α, resulting in typical values lying around 5 and 3, respectively. Again, in order to avoid extreme values, this threshold is bounded to [0.15, 0.60], for the same reason as explained in Section 3.1.3. Next, a conclusion must be drawn as to whether the change in content is caused by motion or by a gradual transition. Therefore, the motion intensity of foreground and background is compared to the following threshold:

T_motion = ξ·l / F.   (6)

The diagonal length l of a frame in pixels and the frame rate F are used to normalize the threshold. This equation is defined in [18] to distinguish different degrees of motion intensity; the value of ξ is selected equal to the threshold dividing medium and high activity (i.e., 0.4267) in [18]. Lower degrees of motion can typically be compensated more easily by the encoder using temporal prediction, which means that these GOPs are not considered during the detection of gradual changes.

3.3. Summary of the proposed shot boundary detection algorithm

To provide the reader with a complete overview of our proposed shot boundary detection algorithm, a flow chart is depicted in Fig. 6. First, the temporal correlation between successive frames is measured to detect gaps in the temporal prediction chain (see Section 3.1.1). When such a gap is the result of certain I or IDR frames, the spatial dissimilarity is calculated as well (as previously explained in Section 3.1.2). This way, abrupt transitions can be identified. Next, to locate the gradual changes, the percentage of intra-coded macroblocks is compared to that of prior frames. An increase can typically be attributed to gradual changes as well as to local or global motion. By analyzing the motion intensity of the estimated foreground and background, these different types of content changes can be distinguished.

4. Enhanced shot detection for hierarchical coding patterns

In H.264/AVC, features like multiple reference picture motion compensation and the decoupling of display and referencing order, among others, allow the creation of arbitrary coding structures, which are not supported by prior video coding standards. This flexibility in terms of possible coding structures makes it possible to organize the pictures in a bit stream in multiple ways. Often, this flexibility is used for the creation of hierarchical coding patterns. Although a hierarchical coding structure usually


Fig. 6. Flow chart for the detection of the different types of content changes.

Fig. 7. Hierarchical coding pattern with four temporal layers (decoding order: I0, P8, B4, B2, B6, B1, B3, B5, B7; subscripts denote display order).

introduces a higher end-to-end delay, it can improve coding efficiency and offers multi-layered temporal scalability in a straightforward way [30,23]. Hierarchical coding patterns consist of multiple layers, which result in a coarse-to-fine structure, as shown in Fig. 7. The base layer typically consists of I and P frames at a very low frame rate, whereas the higher layers typically contain B frames that are inserted between the frames of lower layers in display order. Frames belonging to higher layers can thus use frames of lower layers as references for the decoding process (and possibly some preceding frames of the current layer as well). To identify the hierarchical coding structure of a bit stream, it is possible to rely on three supplemental enhancement information (SEI) messages [8,19], which are defined within the scope of sub-sequences and sub-sequence layers [31]. When applied to hierarchical coding patterns, a sub-sequence layer corresponds to a temporal layer, whereas a sub-sequence is a set of coded pictures within a sub-sequence layer. The sub-sequence information SEI message maps a picture to a sub-sequence and a sub-sequence layer. The sub-sequence layer characteristics SEI message and the sub-sequence characteristics SEI message provide statistical information (e.g., average bit rate) on the indicated sub-sequence layer and sub-sequence. When these messages are not inserted into the bit stream, it is also feasible to rely on the decoding order and display order of the pictures to detect the hierarchical coding structure used. However, this solution is more complex.
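When the SEI messages are absent, the temporal layer of a frame in a dyadic hierarchical pattern such as the one in Fig. 7 can be recovered from its display offset within the sub-GOP. The following sketch assumes a dyadic structure (sub-GOP size a power of two) and is an illustration of ours, not part of the original algorithm:

```python
def temporal_layer(display_offset, subgop_size):
    """Temporal layer of a frame in a dyadic hierarchical pattern
    (cf. Fig. 7), given its display offset within the sub-GOP.
    Offset 0 (the I/P anchor) lies in the base layer; odd offsets
    lie in the highest layer."""
    assert subgop_size & (subgop_size - 1) == 0, "dyadic sub-GOP expected"
    if display_offset % subgop_size == 0:
        return 0
    depth = subgop_size.bit_length() - 1                    # log2 of sub-GOP size
    trailing = (display_offset & -display_offset).bit_length() - 1
    return depth - trailing
```

For the sub-GOP of eight frames in Fig. 7, this yields layer 0 for I0 and P8, layer 1 for B4, layer 2 for B2 and B6, and layer 3 for the odd-numbered B frames.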

4.1. Exploitation of the pyramidal structure

Conventional algorithms for shot boundary detection can be optimized by taking into account layered coding structures. As a subset of lower layers can be seen as a reduced version of the original video, pictures belonging to higher layers only need to be considered when the amount of information from the lower layers is insufficient to draw conclusions. First, temporal dependencies between pictures belonging to the first layer are investigated to decide whether they belong to the same shot or not. Based on the outcome, pictures belonging to higher layers are further processed in a recursive way or are immediately discarded.

4.1.1. Abrupt transitions

Our algorithm for the detection of abrupt transitions is described in more detail using the pseudo-code provided below; it is executed for every sub-GOP in a bit stream. Such a sub-GOP corresponds to two successive reference frames in the base layer and the intermediate frames located in higher levels.

Algorithm 1. RecursiveShotBoundaryDetection(startFrame, endFrame)

  if (1/B)(ι(startFrame) + φ(startFrame)) > T_inter and (1/B)(ι(endFrame) + β(endFrame)) > T_inter then
    // Temporal prediction chain is broken
    if startFrame + 1 = endFrame then
      // Highest level reached: shot boundary detected
      NewShotBoundary(endFrame)
    else
      // Go one level higher in the pyramidal structure to find the shot boundary
      middleFrame <- intermediate frame in next level
      RecursiveShotBoundaryDetection(startFrame, middleFrame)
      RecursiveShotBoundaryDetection(middleFrame, endFrame)
    end if
  else
    // No shot boundary detected
  end if

The use of our recursive technique is shown in more detail in Fig. 8, where a hierarchically coded sub-GOP containing frames from two different shots is displayed. As explained below, only a subset of frames needs to be examined in order to detect the abrupt transition. As I0 and P8 belong to different shots, P8 will typically not use I0 as a reference. Furthermore, the other frames located after the shot boundary (i.e., B3 to B7) cannot be used as a reference either, since P8 is coded before these frames, as illustrated in Fig. 7. As a result, the percentage of intra-coded macroblocks in P8 is high, fulfilling the first if-statement in Algorithm 1. As a consequence, the reference directions of the intermediate frame B4 in the next layer need to be considered. For this purpose, the same algorithm is executed for I0 and B4, and for B4 and P8. As B4 and P8 belong to the same shot and P8 is present in the reference picture buffer of B4, B4 is mainly backward-predicted using P8. Therefore, the first part of the if-statement is not fulfilled, making further investigation of the frames located in between B4 and P8 superfluous. The same technique is applied to I0 and B4. As illustrated in Fig. 8, the shot boundary is located in between these two frames, making further examination necessary. It is clear that we need to rise to the highest layer of the pyramidal structure using the proposed technique in order to detect the exact location of the shot boundary. Even then, a significant number of frames can be discarded, resulting in increased computational efficiency. This is for example the case for B1, B5, B6, and B7. When the content of a sub-GOP does not change drastically, only the first level needs to be examined. On the other hand, significant motion or gradual changes will typically lead to the investigation of pictures residing at a higher level. When the difference between pictures residing at the lowest layer is relatively high, the prediction can fail. However, when rising in the pyramidal structure, the correlation between the frames becomes clearer. Therefore, the recursive algorithm for the detection of abrupt transitions is stopped once this similarity is noticed.

Fig. 8. Recursive algorithm for detecting shot boundaries in a hierarchically coded sub-GOP.
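Algorithm 1 can be transcribed into Python as follows. The per-frame quantities ι, φ, and β and the macroblock count B are assumed to be supplied by a bit-stream parser; the dictionary representation and the helper names are ours, and the computation of the middle frame assumes a dyadic sub-GOP.

```python
def chain_broken(frame, direction, t_inter):
    """Evaluate one side of the condition in Algorithm 1: the fraction
    (iota + phi)/B for the start anchor, or (iota + beta)/B for the end
    anchor, exceeds T_inter. `frame` is a dict of hypothetical parser
    statistics; `direction` is 'phi' or 'beta'."""
    return (frame["iota"] + frame[direction]) / frame["num_mb"] > t_inter

def recursive_shot_boundary(frames, start, end, t_inter, boundaries):
    """Recursive detection within one sub-GOP whose base-layer anchors
    are frames[start] and frames[end] (indices in display order)."""
    if chain_broken(frames[start], "phi", t_inter) and \
       chain_broken(frames[end], "beta", t_inter):
        if start + 1 == end:                 # highest level reached
            boundaries.append(end)           # shot boundary detected
        else:                                # climb one level in the pyramid
            middle = (start + end) // 2      # intermediate frame, next level
            recursive_shot_boundary(frames, start, middle, t_inter, boundaries)
            recursive_shot_boundary(frames, middle, end, t_inter, boundaries)
    # else: the anchors predict well across the interval -> no boundary here
```

As in Fig. 8, when only one half of a sub-GOP contains the boundary, the recursion never descends into the other half, so those frames are never examined.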


4.1.2. Gradual transitions

For the detection of gradual transitions in H.264/AVC bit streams with hierarchical coding patterns, this bottom-up approach can also be applied to enhance the algorithm proposed in Section 3.2. First, the percentage of intra-coded macroblocks in the base layer is compared to the threshold T_grad, which is calculated from previous frames residing in the base layer. Only when the threshold is exceeded is the next layer examined as a result of the content change. Based on the estimation of the foreground and background derived from the base layer (i.e., f24 in Fig. 9), the motion intensity is calculated for the intermediate frame in the second layer (i.e., f22). As previously explained in Section 3.2, this procedure allows a distinction to be made between fast global motion, fast local motion, and gradual changes. However, as the distance between the compared frames is larger than in traditional coding patterns, the amount of intra-coded macroblocks in higher layers can still be relatively large. When a significant amount of co-located macroblocks is still intra-coded, this procedure is repeated for higher layers to make a reliable estimation of the motion intensity. The estimation of foreground and background is then executed for the current layer, whereas the motion intensity is derived from the next layer. This is in line with the idea used for the detection of abrupt transitions, where only those intervals are further investigated for which the information to draw conclusions about the temporal decomposition is insufficient. To exploit the hierarchical coding structures for the detection of gradual changes, only one threshold needs to be added. In particular, a new threshold T_nextLayer is introduced, which decides whether higher layers need to be considered to make a good estimation of the motion intensity, or whether information from the current frame is sufficient to draw a conclusion. Experimental results show that when more than 50% of the macroblocks of the considered regions are inter-coded, a good estimation can be made.

4.2. Spatial dissimilarity for IDR frames

As stated in Section 3.1.2, IDR frames break the temporal prediction chain of a video sequence, as they indicate that no subsequent frame in the bit stream in decoding order is allowed to use frames prior to the IDR picture as a reference. As a result of the pyramidal structure, this gap is located multiple frames prior to the IDR frame in display order. An example is shown in Fig. 10, where frames B33 to B39 are located before the IDR frame in display order, but after the IDR frame in decoding order. When only taking into account temporal dependencies during shot boundary detection, an abrupt transition between P32 and B33 will typically be falsely detected, as B33 is not allowed to use P32 as a reference. In order to calculate the similarity between P32 and B33, spatial distributions need to be examined instead of temporal dependencies. For the previous frame, the intra-prediction map M32 introduced in Section 3.1.2 is utilized. However, as B33 can be predicted from following frames, the corresponding spatial distribution needs to be derived as well. Therefore, a second intra-prediction map M_{IDR,i} is constructed, containing a spatial representation of B33. This map is composed of the intra-prediction modes of following frames in display order. Initially, this map contains the intra-prediction modes of IDR40, which are then updated with the intra-prediction modes of macroblocks located closer to B33. As this process takes place during the analysis of the different layers in a bottom-up approach, IDR40, B36, B34, and B33 are used during the update step. When a shot boundary takes place within this sub-GOP, one of these frames will be mainly intra-coded, which results in overwriting the prediction modes of IDR40. Consequently, M_{IDR,33} will represent the spatial distribution of the content of B33. To calculate the correlation between P32 and B33, a new dissimilarity metric Ω_IDR(i) is introduced. This metric is based on Eqs. (2a)-(2c) for the intra-prediction maps M_{i-1} and M_{IDR,i}.
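The bottom-up construction of M_{IDR,i} can be sketched as a simple overwrite, assuming each frame is represented by a dict mapping the indices of its intra-coded macroblocks to their intra-prediction modes (a hypothetical representation of ours):

```python
def build_idr_map(frames_bottom_up, num_mb):
    """Construct the second intra-prediction map M_IDR,i: start from
    the intra-prediction modes of the IDR frame and overwrite entries
    with the intra-coded macroblocks of frames lying closer to the
    examined frame in display order (e.g. IDR40, B36, B34, B33)."""
    m_idr = [None] * num_mb
    for intra_mbs in frames_bottom_up:
        for mb_index, mode in intra_mbs.items():
            m_idr[mb_index] = mode          # closer frames overwrite earlier ones
    return m_idr
```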

Fig. 9. Example of a gradual transition in a hierarchical coding structure. Intra-coded macroblocks are represented by their original colour, whereas inter-coded macroblocks are blanched.


Fig. 10. The use of IDR frames results in a temporal prediction chain that is broken, as no subsequent frame in decoding order is allowed to use frames prior to the IDR frame as reference.

During the construction of both maps, only the subset of frames needed for the analysis of the temporal dependencies is used. The other frames, which are not considered, will typically contain very little spatial information, as they are located between highly similar frames. Therefore, discarding these frames does not influence the accuracy of the dissimilarity measurement. Note that M_i is kept up to date during the analysis of the complete bit stream, whereas M_{IDR,i} is only constructed for sub-GOPs containing IDR frames. As already stated in Section 3.1.2, this problem also arises for traditional coding patterns. However, the explanation is provided in this section because the impact of IDR frames is larger for hierarchical coding patterns, since more frames are involved.

4.3. Summary of the enhanced shot boundary detection algorithm working on hierarchical coding patterns

To give a coherent overview of the algorithm working on hierarchical coding patterns, a summarizing flow chart is depicted in Fig. 11. This algorithm is executed for every sub-GOP in a video sequence, which corresponds to two successive frames belonging to the base layer (further referred to as the start frame and end frame) and the intermediate frames located at higher layers. Note that the end frame of the current sub-GOP will therefore correspond to the start frame of the next sub-GOP. First, the algorithm responsible for detecting abrupt transitions is applied. As elaborated upon in Section 4.1, the temporal correlation between the start and end frame is investigated to decide whether they belong to the same shot or not. Based on the outcome, this algorithm is recursively executed for higher layers or immediately stopped. When the highest layer is reached and the temporal correlation is still very low, the probability of a shot boundary is high. Only in case this gap in the temporal prediction chain is caused by an IDR frame or certain I frames, the spatial dissimilarity is calculated as well, as explained in Section 4.2. As a consequence, the number of FAs can be reduced. Next, the algorithm responsible for detecting gradual transitions is performed. In case the percentage of intra-coded macroblocks in the end frame increases significantly, T_grad is exceeded, which indicates a change in content. To discover the origin of this change (i.e., gradual transition, global motion, or local motion), the foreground and background of this frame are estimated, as discussed in Section 4.1. For both regions, the corresponding motion intensity in the intermediate frame belonging to the next layer is calculated. Using T_motion, the source of the content change can be determined. In particular, when there is an increase in intra-coded macroblocks although the motion intensity is low, a gradual transition is detected. In case the content change is very drastic, it is possible that the motion information from the current frame is insufficient, since the high amount of intra-coded macroblocks gives a distorted view of the motion intensity. Therefore, when this percentage exceeds T_nextLayer, the technique is performed recursively on the frames located in the next layer.
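The decision logic distinguishing gradual transitions from motion, together with the threshold of Eq. (6), can be summarized as follows. This is a simplified sketch of the corresponding branch of the flow chart; the function names are ours, and ξ is the MPEG-7 boundary between medium and high activity.

```python
def motion_threshold(diagonal, frame_rate, xi=0.4267):
    """T_motion = xi * l / F (Eq. (6)), normalized by the frame
    diagonal l in pixels and the frame rate F."""
    return xi * diagonal / frame_rate

def classify_change(mi_fg, mi_bg, t_motion):
    """Classify a content change once T_grad has been exceeded, from
    the motion intensity of the estimated foreground and background."""
    if mi_fg < t_motion and mi_bg < t_motion:
        return "gradual transition"   # content change without significant motion
    if mi_fg > t_motion and mi_bg > t_motion:
        return "fast global motion"   # e.g. camera movement
    if mi_fg > t_motion:
        return "fast local motion"    # moving object in the foreground
    return "no transition"
```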

5. Performance results

To evaluate the performance of our shot boundary detection algorithm, experiments have been carried out on several video sequences with various characteristics in terms of resolution, length, quality, and content. An overview of the characteristics of the test sequences used is given in Table 1. The first two sequences are obtained from the publicly available MPEG-7 Content Set [17] and represent part of a news sequence and a basketball sequence (i.e., V3 and V17 from [17]). Since the quality and the resolution of these sequences are low, three recent, proprietary sequences were added


Fig. 11. Flow chart of the enhanced algorithm for the detection of the different types of content changes on hierarchical coding patterns.

to the test set as well. The first sequence originates from a news broadcast from Belgian public television, the second sequence is part of an international television soap, and the last sequence is the movie trailer of ''Little Miss Sunshine''.³ These test sequences were coded a number of times with different hierarchical coding patterns (i.e., two, three, and four temporal layers). These configurations correspond to a pyramidal structure containing two (hier_2), four (hier_4), and eight frames (hier_8), respectively. Furthermore, these different coding patterns were generated twice: once with I frames and once with IDR frames, inserted every 32 frames. A distinction between these two types is made because IDR frames lead to gaps in the temporal prediction chain, which is not always the case for classic I frames, as discussed in Section 3.1.2. First, the accuracy of the proposed algorithm is evaluated for different coding parameters, based on the manually created ground truth. Next, these results are compared to a publicly available uncompressed-domain algorithm. Thereafter, a complexity analysis of the enhanced algorithm for hierarchical patterns is given.

³ This sequence can be found on the website of Apple.

Table 1
Characteristics of the test sequences

Name      Resolution   # Frames   Frame rate (Hz)   Length (s)   Abrupt transitions   Gradual transitions
News 1    352 x 288     26 000          25            1040.0            154                  18
Basket    352 x 288     18 053          25             722.1             62                  13
News 2    384 x 208     23 802          25             952.0            138                  19
Soap      720 x 576     15 040          25             601.6            160                   7
Trailer   848 x 352      3553           25             142.1             81                  24

5.1. Accuracy of the proposed algorithm for shot boundary detection

The accuracy of the proposed algorithm is evaluated by comparing the obtained results against the ground truth.


Table 2
Accuracy in precision (P) and recall (R) (%) of the proposed algorithm on sequences coded with different hierarchical structures and I or IDR frames inserted every 32 frames. Additionally, the accuracy of the ''TZI Shotdetection TrecVID 2004'' algorithm is included for comparison.

This comparison is based on the number of correct detections (D), missed detections (MD), and false alarms (FA), expressed as recall and precision:

Precision = D / (D + FA),   Recall = D / (D + MD).   (7)
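Eq. (7) in code form, as a small sanity check (the function name is ours):

```python
def precision_recall(d, fa, md):
    """Eq. (7): precision = D / (D + FA), recall = D / (D + MD)."""
    return d / (d + fa), d / (d + md)
```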

Table 2 presents the accuracy of our proposed algorithm. As can be deduced from the results, the accuracy for the detection of abrupt transitions is very high. Only in the case of IDR frames is a decrease in precision observed, which can be attributed to the gaps in the temporal prediction chain and the intra-prediction map used for calculating the spatial dissimilarity. These FAs occur when the content changes slightly as a result of slow motion. Since this motion can generally be compensated using inter-prediction, the intra-prediction map used for calculating the spatial dissimilarity will not be updated accordingly. Consequently, the intra-prediction map no longer gives a good representation of the spatial distribution. This phenomenon occurs less often for larger hierarchical structures, as the base layer then typically contains more intra-coded macroblocks, which results in a more accurate intra-prediction map. Note that the intra-prediction map is not used in the test sequences coded with classic I frames, since the intermediate B frames prevent gaps in the temporal prediction chain. Therefore, the accuracy of using intra-prediction maps can be deduced from Table 2 by comparing the precision and recall values for sequences containing classic I frames with those for sequences containing IDR frames. For the gradual transitions, all false alarms are caused by sudden changes in light intensity, as the amount of intra-coded macroblocks then increases drastically while no significant motion is present. The missed gradual transitions can for the major part be attributed to transitions taking place over a long period of time or to high motion present during the transition. Our proposed algorithm falsely classifies these content changes as global motion instead of gradual changes. To compare the results of our algorithm with publicly available algorithms, we employed the ''TZI Shotdetection TrecVID 2004'' algorithm and added its accuracy results to Table 2.
This algorithm was developed within the scope of the DELOS Network of Excellence [9], relies on uncompressed-domain features for the detection of shot boundaries [20], and is also used in [14] for comparison purposes. When comparing the results of TZI with the results of our algorithm in Table 2, it can be seen that our proposed algorithm achieves a higher accuracy for the detection of abrupt transitions. Although uncompressed-domain algorithms can rely on more features, our algorithm exploits the decisions made by the encoder during its search for similarity with preceding frames, which is a computationally intensive process. For gradual transitions, both algorithms struggle with the same problems. Our proposed algorithm has the advantage that it uses features available in the compressed domain instead of pixel-domain information, making full decompression unnecessary. Note that a pyramidal structure containing two frames corresponds to a traditional I(BP) pattern. Compared to

Name

Coding pattern

Abrupt P

News 1

Basket

News 2

Trailer

R

P

R

I frames hier_8 hier_4 hier_2

96 93 91

100 100 100

48 47 75

61 50 67

IDR frames hier_8 hier_4 hier_2

91 88 87

99 99 100

53 40 75

77 44 67

TZI

93

97

35

50

I frames hier_8 hier_4 hier_2

95 91 94

100 100 98

60 50 60

23 15 23

IDR frames hier_8 hier_4 hier_2

90 90 83

100 98 84

75 50 63

23 15 38

TZI

97

93

31

76

I frames hier_8 hier_4 hier_2

99 100 98

100 100 100

76 90 100

84 95 95

IDR frames hier_8 hier_4 hier_2

100 93 80

100 98 99

76 82 95

84 95 95

91

99

76

84

99 100 99

100 100 100

36 58 71

57 100 71

IDR frames hier_8 hier_4 hier_2

99 93 83

100 100 99

43 64 71

86 100 71

TZI

99

91

50

57

I frames hier_8 hier_4 hier_2

100 99 100

99 100 100

92 96 88

96 96 96

IDR frames hier_8 hier_4 hier_2

99 100 100

98 98 100

96 96 92

96 96 96

95

99

96

96

TZI Soap

Gradual

I frames hier_8 hier_4 hier_2

TZI

ARTICLE IN PRESS S. De Bruyne et al. / Signal Processing: Image Communication 23 (2008) 473–489

the algorithm for traditional coding patterns proposed in Section 3, the enhanced algorithm proposed in Section 4 only investigates the second layer when the amount of information from the base layer is insufficient to draw conclusions. As examining the second layer corresponds to investigating all layers in a traditional coding pattern, the accuracy of both algorithms is the same for traditional coding patterns. Therefore, the results obtained for the proposed algorithm for traditional coding patterns are reflected in the results for pyramidal coding structures containing two frames.
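The precision and recall figures in Table 2 follow the usual scoring convention for shot boundary detection: a detection counts as a true positive when it falls close to an unmatched ground-truth boundary, otherwise it is a false alarm. As a minimal sketch (the function name and the tolerance window are our own illustrative assumptions, not taken from the paper), the scoring can be written as:

```python
def score_boundaries(detected, ground_truth, tolerance=2):
    """Match detected boundary frame numbers against ground truth.

    A detection is a true positive if it lies within `tolerance` frames
    of a still-unmatched ground-truth boundary; otherwise it is a false
    alarm (FA). Unmatched ground-truth boundaries are missed detections.
    """
    unmatched = sorted(ground_truth)
    tp = 0
    for d in sorted(detected):
        hit = next((g for g in unmatched if abs(g - d) <= tolerance), None)
        if hit is not None:
            unmatched.remove(hit)
            tp += 1
    fa = len(detected) - tp
    missed = len(unmatched)
    precision = tp / (tp + fa) if detected else 1.0
    recall = tp / (tp + missed) if ground_truth else 1.0
    return precision, recall

# Toy run: two matches within tolerance, one FA (140), one miss (200).
p, r = score_boundaries(detected=[10, 55, 140], ground_truth=[10, 56, 200])
```

Here both precision and recall come out at 2/3; the per-sequence percentages in Table 2 are obtained in the same way, separately for abrupt and gradual transitions.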

5.2. Complexity analysis of the proposed shot boundary detection algorithm

The main advantage of the enhanced algorithm for shot boundary detection is its reduced computational complexity. As already stated in Section 4, only a subset of the frames present in the video is needed to detect shot boundaries in the compressed domain. In this section, the influence of the different coding parameters on the complexity of the enhanced algorithm is discussed. The type of content is taken into account as well, as this also plays an important role. Finally, the difference in complexity between compressed- and uncompressed-domain shot boundary detection algorithms is discussed. As can be seen from Table 3, the percentage of frames that needs to be analyzed during shot boundary detection decreases when more temporal layers are included. This results from the fact that the number of frames present in the lowest layer decreases. Despite this strong decrease in the number of processed frames, the execution times decrease less rapidly. In some cases, one can even notice an increase in execution time for sequences consisting of multiple layers. This can mainly be attributed to the fact that more frames from higher layers need to be parsed when using more temporal layers. In particular, frames located in the base layer need less time to be parsed and processed, as they typically use only one reference instead of the two references used by frames located in higher levels. Sequences containing IDR frames will generally need more processing power than sequences with I frames, as the spatial dissimilarity needs to be calculated as well. This process becomes more expensive when the number of temporal layers increases: more intermediate frames are necessary to create the second intra-prediction map M_IDR,i.
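The decreasing fraction of analyzed frames follows directly from the dyadic layer structure: for a GOP of size 2, 4, or 8, the base layer holds 1/2, 1/4, or 1/8 of the frames, which is the best case for the enhanced algorithm. The helper below is an illustrative sketch of this bookkeeping (our own code, not from the paper):

```python
def layer_shares(gop_size):
    """Per-layer frame shares for one dyadic hierarchical GOP.

    Each GOP contributes 1 key picture to the base layer, then
    1, 2, 4, ... B frames to successively higher temporal layers.
    """
    counts = [1]  # base layer: the key picture
    n = 1
    while sum(counts) < gop_size:
        counts.append(n)  # next temporal layer doubles in size
        n *= 2
    return [c / gop_size for c in counts]

shares = layer_shares(8)  # -> [0.125, 0.125, 0.25, 0.5]
```

Comparing these ideal shares with the "Analyzed frames" column of Table 3 (e.g., about 26% for hier_8 on News 1 versus the 12.5% base-layer minimum) shows how often higher layers actually had to be consulted.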
To illustrate the gain in efficiency, the execution speed of the enhanced algorithm (where only the necessary frames are parsed and analyzed) is compared to an adapted version of this algorithm in which all frames are parsed. This is illustrated in Fig. 12 for the test sequences containing I frames, which depicts the relative execution speed of the enhanced algorithm, i.e., the time needed to execute the enhanced algorithm relative to the time needed by the version in which all frames are parsed and processed. From these measurements, it can be concluded that a significant gain in execution time, and therefore in efficiency, can be achieved by only considering a well-defined subset of the frames present in the bit stream.

Furthermore, as can be seen from the percentage of analyzed frames in Table 3 and the relative execution speed in Fig. 12, the type of content has a major impact on the efficiency. For video sequences with long shots and little motion, such as news sequences, the gain in efficiency is very high. In this case, frames in the lowest layer can often use each other as reference, since the content remains very similar over a long period of time. As a result, higher layers rarely need to be examined. On the other hand, high-motion video streams such as sport sequences demand more processing power. Since high motion drastically changes the content of the video, the similarity between pictures in the lowest layer is small and the amount of intra-coded macroblocks increases significantly. As this latter characteristic is also present when gradual changes occur, higher layers need to be examined in order to compute the motion intensity and to distinguish motion from gradual changes. Although the "soap" and "trailer" sequences do not contain a significant amount of motion, the results show that their complexity is also higher than that of the news sequences. This can be attributed to the much smaller average shot length of these sequences. As shot boundaries occur more frequently, higher layers need to be examined more often, and to determine the exact location of a boundary, it is necessary to climb to the highest layer of the pyramidal structure. As a result, the number of frames that needs to be processed increases considerably. The same trends can be observed for sequences containing IDR frames. Note that the resolution of the video sequences also plays an important role, since the execution speed is inversely proportional to the size of the content.
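The layer-escalation behaviour described above can be sketched as follows. All names and the two predicate callbacks are hypothetical placeholders for the paper's actual temporal-dependency and spatial-dissimilarity tests; only the control flow (examine higher layers solely when the evidence so far is inconclusive) reflects the enhanced algorithm:

```python
def detect_in_gop(gop_layers, is_conclusive, has_boundary):
    """gop_layers[0] is the base layer; later lists are higher temporal layers.

    Frames from a higher layer are parsed only while the evidence gathered
    so far is inconclusive. Returns (boundary_found, frames_examined).
    """
    examined = []
    for layer in gop_layers:
        examined.extend(layer)
        if is_conclusive(examined):
            break  # lower layers suffice; skip the remaining layers
    return has_boundary(examined), examined

# Toy usage for a GOP of 8 in display order: stop once three frames suffice.
found, seen = detect_in_gop(
    [[0, 8], [4], [2, 6], [1, 3, 5, 7]],
    is_conclusive=lambda frames: len(frames) >= 3,
    has_boundary=lambda frames: False,
)
# seen == [0, 8, 4]: the two highest layers were never parsed
```

For static news content the loop would typically stop at the base layer; for high-motion or boundary-rich content it climbs further, which is exactly the content dependence visible in Table 3 and Fig. 12.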
It would be interesting to compare the complexity results of the proposed algorithm with those of uncompressed-domain algorithms, and in particular the TZI algorithm. However, making a sound comparison in terms of complexity is non-trivial. To obtain a fair comparison between the execution speeds of both algorithms, the same code base should be used, since the underlying implementation determines the results to a great extent. At best, this means that the parser used in our proposed algorithm would be part of the complete decoder used by the uncompressed-domain algorithm before its analysis phase starts. Even then, the implementation of the remaining part of the decoder influences the complexity. As a consequence, comparing the complexity in terms of execution speed is questionable and hence not incorporated here. However, one can clearly see that algorithms working on compressed data are intrinsically less complex. In particular, several time-consuming steps of the decoding process, such as entropy decoding, motion compensation, inverse quantization, and inverse transformation, do not need to be executed in our algorithm. Moreover, only a well-defined subset of the frames needs to be parsed at all. Of course, the complexity of the analysis algorithms themselves also plays an important role. Uncompressed-domain algorithms often compute information about edges and


Table 3
Complexity results for the enhanced shot detection algorithm

                          I frames                             IDR frames
Name     Coding pattern   Size (MB)  Time (s)  Analyzed (%)    Size (MB)  Time (s)  Analyzed (%)
News 1   hier_8           141.0      83.2      26.3            144.9      114.2     23.8
         hier_4           136.1      89.9      35.9            139.3      100.6     35.9
         hier_2           129.5      125.8     58.7            131.3      125.3     58.6
Basket   hier_8           118.4      57.8      32.7            120.7      72.3      29.7
         hier_4           114.6      61.2      40.4            116.4      67.6      40.5
         hier_2           109.7      84.4      60.9            110.7      84.9      60.8
News 2   hier_8           43.3       48.7      20.0            45.5       73.6      20.0
         hier_4           43.7       55.9      31.4            45.0       65.4      31.5
         hier_2           46.8       82.3      55.0            47.4       82.8      55.0
Soap     hier_8           95.6       302.4     30.3            99.0       377.1     30.1
         hier_4           91.1       280.9     38.7            93.5       295.9     38.9
         hier_2           71.9       341.9     59.5            86.3       340.3     60.5
Trailer  hier_8           14.1       56.0      29.4            14.9       67.0      30.9
         hier_4           13.7       45.5      42.4            14.2       47.9      40.0
         hier_2           13.5       58.0      60.5            13.7       58.0      60.7

[Figure: grouped bar chart, y-axis "Relative execution time (%)" from 0 to 50, with bars for Hier_8, Hier_4, and Hier_2 per test sequence (News 1, Basket, News 2, Soap, Trailer).]

Fig. 12. Relative execution speed of the enhanced algorithm. These values represent the time needed to execute the enhanced algorithm compared to the adapted version.

motion intensity to increase their accuracy. Compressed-domain algorithms, on the other hand, can directly extract this information from the bit stream, as these characteristics are already exploited during the motion estimation step and the search for the optimal partitioning of macroblocks in the encoding phase. As a result, the actual analysis step of our compressed-domain algorithm is less complex as well.

6. Conclusions

Temporal segmentation is a prerequisite for semantic video analysis. Therefore, in this paper, we have proposed

a novel algorithm for shot boundary detection on H.264/AVC bit streams. To identify abrupt transitions, the algorithm examines the temporal dependencies between successive frames. Since IDR frames and certain I frames lead to gaps in this prediction chain, spatial dissimilarities are considered as well. Gradual changes are detected by relying on the amount of intra-coded macroblocks and the motion intensity. Furthermore, an enhanced algorithm is introduced for shot boundary detection on H.264/AVC bit streams with hierarchical coding patterns. As these coding structures consist of multiple layers, a subset of lower layers can be seen as a reduced version of the original video. Therefore, pictures residing in higher layers are only considered when information from lower layers is insufficient to accurately detect shot boundaries. Experimental results show that this algorithm achieves a high accuracy in terms of recall and precision. Furthermore, by exploiting the layered structure, the computational complexity is drastically reduced, resulting in an increased efficiency.

Acknowledgements

The authors would like to thank Davy De Schrijver for his constructive and extensive feedback. The research activities that have been described in this paper were funded by Ghent University, the Interdisciplinary Institute for Broadband Technology (IBBT), the Institute for the Promotion of Innovation by Science and Technology in Flanders (IWT-Flanders), the Fund for Scientific Research-Flanders (FWO-Flanders), and the European Union.

References

[1] J. Bescós, Real-time shot change detection over online MPEG-2 video, IEEE Trans. Circuits Systems Video Technol. 14 (4) (2004) 475–484.
[2] G. Boccignone, A. Chianese, V. Moscato, A. Picariello, Foveated shot detection for video segmentation, IEEE Trans. Circuits Systems Video Technol. 15 (3) (2005) 365–377.
[3] I.S. Burnett, F. Pereira, R. Van de Walle, R. Koenen, The MPEG-21 Book, Wiley, New York, 2006.
[4] J. Ćalić, D.P. Gibson, N.W. Campbell, Efficient layout of comic-like video summaries, IEEE Trans. Circuits Systems Video Technol. 17 (7) (2007) 931–936.
[5] S.-F. Chang, A. Vetro, Video adaptation: concepts, technologies and open issues, Proc. IEEE 93 (1) (2005) 148–158.
[6] B. Damghanian, M.R. Hashemi, M.K. Akbari, A novel fade detection algorithm on H.264/AVC compressed domain, in: Lecture Notes in Computer Science, vol. 4319, Springer, Berlin, 2006, pp. 1159–1167.
[7] S. De Bruyne, W. De Neve, K. De Wolf, D. De Schrijver, P. Verhoeve, R. Van de Walle, Temporal video segmentation on H.264/AVC compressed bitstreams, in: Lecture Notes in Computer Science, vol. 4351, Springer, Berlin, 2007, pp. 1–12.
[8] W. De Neve, D. Van Deursen, D. De Schrijver, K. De Wolf, R. Van de Walle, Using bitstream structure descriptions for the exploitation of multi-layered temporal scalability in H.264/AVC's base specification, in: Lecture Notes in Computer Science, vol. 3767, Springer, Berlin, 2005, pp. 641–652.
[9] DELOS Network of Excellence on Digital Libraries <http://www.delos.info/>.
[10] A. Divakaran, K.A. Peker, R. Radhakrishnan, Z. Xiong, R. Cabasson, Video summarization using MPEG-7 motion activity and audio descriptors, Technical Report TR-2003-34, Mitsubishi Electric Research Laboratories, 2003.
[11] P.M. Fonseca, F. Pereira, Automatic video summarization based on MPEG-7 descriptions, Signal Processing: Image Communication 19 (8) (2004) 685–699.
[12] U. Gargi, R. Kasturi, S.H. Strayer, Performance characterization of video-shot-change detection methods, IEEE Trans. Circuits Systems Video Technol. 10 (1) (2000) 1–13.
[13] A. Girgensohn, J. Boreczky, Time-constrained keyframe selection technique, Multimedia Tools Appl. 11 (3) (2000) 347–358.
[14] C. Grana, R. Cucchiara, Linear transition detection as a unified shot detection approach, IEEE Trans. Circuits Systems Video Technol. 17 (4) (2007) 483–489.
[15] A. Hanjalic, Shot-boundary detection: unraveled and resolved?, IEEE Trans. Circuits Systems Video Technol. 12 (2) (2002) 90–105.
[16] H.J.A.M. Heijmans, Composing morphological filters, IEEE Trans. Image Process. 6 (5) (1997) 713–723.
[17] ISO/IEC JTC1/SC29/WG11/N2467, Description of MPEG-7 content set, 1998.
[18] ISO/IEC 15938-3, Information technology—multimedia content description interface—Part 3: visual, 2002.
[19] ITU-T and ISO/IEC JTC 1, Advanced video coding for generic audiovisual services, ITU-T Rec. H.264 and ISO/IEC 14496-10 AVC, 2003.
[20] A. Jacobs, A. Miene, G.T. Ioannidis, O. Herzog, Automatic shot boundary detection combining color, edge, and motion features of adjacent frames, in: TRECVID 2004 Workshop Notebook Papers, 2004, pp. 197–206.
[21] S. Jeannin, A. Divakaran, MPEG-7 visual motion descriptors, IEEE Trans. Circuits Systems Video Technol. 11 (6) (2001) 720–724.
[22] S.M. Kim, J. Byun, C.S. Won, A scene change detection in H.264/AVC compression domain, in: Lecture Notes in Computer Science, vol. 3768, Springer, Berlin, 2005, pp. 1072–1082.
[23] A. Leontaris, P.C. Cosman, Compression efficiency and delay tradeoffs for hierarchical B-pictures and pulsed-quality frames, IEEE Trans. Image Process. 16 (7) (2007) 1726–1740.
[24] Z. Li, G.M. Schuster, A.K. Katsaggelos, MINMAX optimal video summarization, IEEE Trans. Circuits Systems Video Technol. 15 (10) (2005) 1245–1256.
[25] R. Lienhart, Comparison of automatic shot boundary detection algorithms, in: Proceedings of the SPIE Storage and Retrieval for Image and Video Databases VII, vol. 3656, 1998, pp. 290–301.
[26] Y. Liu, W. Wang, W. Gao, W. Zeng, A novel compressed domain shot segmentation algorithm on H.264/AVC, in: Proceedings of the IEEE International Conference on Image Processing, vol. 4, 2004, pp. 2235–2238.
[27] J. Magalhães, F. Pereira, Using MPEG standards for multimedia customization, Signal Processing: Image Communication 19 (5) (2004) 437–456.
[28] B.S. Manjunath, P. Salembier, T. Sikora, Introduction to MPEG-7: Multimedia Content Description Interface, Wiley, New York, 2002.
[29] S.-C. Pei, Y.-Z. Chou, Efficient MPEG compressed video analysis using macroblock type information, IEEE Trans. Multimedia 1 (4) (1999) 321–333.
[30] H. Schwarz, D. Marpe, T. Wiegand, Analysis of hierarchical B pictures and MCTF, in: Proceedings of the IEEE International Conference on Multimedia and Expo, 2006, pp. 1929–1932.
[31] D. Tian, M.M. Hannuksela, M. Gabbouj, Sub-sequence video coding for improved temporal scalability, in: Proceedings of the IEEE International Symposium on Circuits and Systems, 2005, pp. 6074–6077.
[32] T. Wiegand, G.J. Sullivan, G. Bjøntegaard, A. Luthra, Overview of the H.264/AVC video coding standard, IEEE Trans. Circuits Systems Video Technol. 13 (7) (2003) 560–576.
[33] Z. Xiong, R. Radhakrishnan, A. Divakaran, Y. Rui, T. Huang, A Unified Framework for Video Summarization, Browsing & Retrieval: With Applications to Consumer and Surveillance Video, Academic Press, New York, 2005.
[34] B.-L. Yeo, B. Liu, Rapid scene analysis on compressed video, IEEE Trans. Circuits Systems Video Technol. 5 (6) (1995) 533–544.
[35] W. Zeng, W. Gao, Shot change detection on H.264/AVC compressed video, in: Proceedings of the IEEE International Symposium on Circuits and Systems, vol. 4, 2005, pp. 3459–3462.
[36] H.J. Zhang, A. Kankanhalli, S.W. Smoliar, Automatic partitioning of full-motion video, Multimedia Systems 1 (1) (1993) 10–28.
[37] H.J. Zhang, C.Y. Low, S.W. Smoliar, Video parsing and browsing using compressed data, Multimedia Tools Appl. 1 (1) (1995) 89–111.