High accuracy flashlight scene determination for shot boundary detection


Signal Processing: Image Communication 18 (2003) 203–219

Wei Jyh Heng (a), King N. Ngan (b,c,*)

(a) Institute for Infocomm Research, Singapore, Singapore
(b) School of Computer Engineering, Nanyang Technological University, Singapore
(c) Department of Electrical and Electronic Engineering, University of Western Australia, Australia

*Corresponding author: School of Computer Engineering, Nanyang Technological University, Singapore, Singapore. E-mail addresses: [email protected] (W.J. Heng), [email protected] (K.N. Ngan).

Received 4 July 2002; received in revised form 1 November 2002; accepted 30 November 2002

Abstract

Shot boundary detection, or scene change detection, is a technique used in the initial phase of video indexing. One of the problems in the detection is the discrimination of abrupt scene change from flashlight scenes. The usual discrimination method tests the similarity of the frames before and after a suspected flashlight effect. However, the performance of such a technique in discriminating flashlight scene from abrupt scene change can be affected by the scene content. To overcome this, we present a novel method that utilises the edge direction, thereby reducing erroneous matching with increasing dilation radius. This improves the accuracy of similarity testing and reduces the amount of erroneously matched edges by four times. Our experiment in discriminating flashlight effect from abrupt scene change frame pairs shows that our technique produces a perfect detection, which cannot be achieved by normal edge-based detection. Such a contribution is important as it improves the indexing of real-life video. © 2003 Elsevier Science B.V. All rights reserved.

Keywords: Shot boundary detection; Flashlight detection; Abrupt scene change detection; Feature-based detection; Directed edges; Matched edge elimination

1. Introduction

Shot boundary detection, or scene change detection, is a technique used in the initial phase of video indexing, where video sequences are first segmented temporally into shots, before objects within each shot are segmented and indexed. This technique has been widely reported in the literature [1,8,4]. In the past, detection was dominated by

forming indicators from the feature differences between two adjacent frames, where the features include intensity, chrominance or histograms of these parameters. One of the problems encountered by these techniques was the existence of sudden flashlight, or lightning, scenes, where the luminance or chrominance changes abruptly due to the sudden flashlight effect across the sequence. This effect usually leads to false detection of scene change, since all indicators that utilise either luminance or chrominance are intolerant of sudden changes in intensity. There is thus a need to differentiate abrupt scene change from the flashlight effect.

0923-5965/03/$ - see front matter © 2003 Elsevier Science B.V. All rights reserved. doi:10.1016/S0923-5965(02)00139-X


The discrimination of flashlight effect is an important step, as flashlight is one of the prominent effects in many movies, usually used to create suspense within scenes. In many thriller movies, flashlight effects occur more often than other special effects such as wipes or dissolves. For instance, in the movie "Event Horizon", over 500 sudden luminance changes are found within 1 min of the sequence. When a flashlight effect occurs, there are correspondences between objects across the frames. This correspondence cannot be ignored in systems such as video indexing, where the flashlight effect has to be separated from abrupt scene change. In addition, knowing the position where the flashlight effect occurs allows the object tracker to work properly, as most traditional object trackers, which use colour or luminance for tracking, fail under such effects. Object tracking for flashlight effects can thus be treated as a different problem from tracking under normal conditions.

The paper is divided into several sections. The next section covers the traditional techniques along with their problems. Section 3 outlines the nature of the flashlight determination problem. Section 4 discusses the fundamental theory behind feature-based detection (FBD) and outlines its limitations. In Section 5, we discuss two alternative techniques that can reduce such limitations and increase accuracy, and give the algorithm that incorporates these concepts. Following that, in Section 6, a timing analysis of the algorithm is presented. The results of our algorithm are compared in Section 7, and a solution that works for object-based shot boundary detection is discussed. Section 8 gives some limitations of our algorithm and discusses improvements that can be made. Finally, Section 9 gives the conclusion.

2. Traditional techniques

Flashlight determination is one of the least researched areas in transition detection, although the problem was recognised at an early stage by researchers. In this section, we give a brief overview of the traditional techniques, and discuss their deficiencies in order to justify why the new technique is essential.

Early research done by Yeo et al. [12] and Nakajima et al. [9] involved studying the behaviour of flashlight scenes. Both assumed that most flashlight effects last only a few frames, and that the statistical characteristics of the frame features return to their original states after a flashlight effect. In this way, a single flashlight causes two transitions, one from the normal scene to the flashlight scene, and the other from the flashlight scene back to the normal scene. Two maxima in the indicator are produced during a single flashlight effect. Hence, the effect was detected by studying the characteristics of consecutive maxima. Nakajima used a direct method of studying the correlation between the frames before and after a flashlight effect, i.e. before the first maximum and after the second maximum, respectively. Since the statistical characteristics of the sequence return to their original state after the flashlight, the correlation between the two frames should be high if the two maxima are caused by a single flashlight. Hence, the flashlight effects were identified from the correlation of frames using a chrominance histogram. Yeo et al. [12] introduced an even simpler method of studying the magnitude of the two maxima. Since the feature distributions of the frames before and after a flashlight effect are similar, and the frames within the flashlight do not change much, the two frames chosen from the non-flashlight scene and those from the flashlight scene are of similar characteristics. This produces two consecutive maxima of similar magnitude for a single flashlight scene.

The traditional methods will work well if the flashlight scenes behave properly. However, these methods can fail under any one of the following situations:

[Fig. 1. Traditional versus new flashlight detection technique for four situations: (a) gradual flashlight scene — traditional: flashlight detected only after multiple tests, middle stages remain undetected; new: flashlight detected regardless of gradual changes. (b) Flashlight scene occurring in several stages — new: flashlight detected regardless of multiple stages. (c) Object moving over a long flashlight scene — traditional: flashlight not detected due to moving objects within the long flashlight scene; new: flashlight detected regardless of disturbances within the flashlight scene. (d) Scene change occurring during a flashlight scene — traditional: flashlight not detected due to the scene change within the flashlight scene; new: flashlight detected regardless of the scene change.]

* A flashlight effect may last for a long period of time. The luminance change can be gradual and span over a few frames, creating intensity changes between frames that are large enough to create consecutive maxima. An example of such a video sequence is shown in Fig. 1(a).

* A flashlight effect may occur in stages, with light of different intensities or from different directions projected in different time frames. In this case, the sequence does not return to its original state even after the subsequent changes. This is frequently used to create suspense effects in thriller movies. Such a video sequence is shown in Fig. 1(b).
* A flashlight effect may never return to its original state. This can happen when the lighting of the scene changes, the position of the camera alters, or the objects move during the flashlight scene, as shown in Fig. 1(c), or when a scene change occurs before the flashlight effect ends, as shown in Fig. 1(d).

To detect the flashlight effects using traditional techniques, the first situation requires the system to analyse every two consecutive maxima, regardless of the time span between them. The next two situations require the system to check every combination of maxima found, which is not feasible. Moreover, the last situation causes the traditional techniques to fail. Examples of using a traditional technique that tests for frame similarity before and after the flashlight scene are illustrated in Fig. 1 above each video sequence.

The above problems are eliminated if we detect the flashlight scenes by analysing the adjacent frames in which the effect occurs, instead of analysing the frames before and after the effects, which span a number of frames. The results of using such a technique are shown below each video sequence in Fig. 1. This technique works by only testing the adjacent frames when a flashlight occurs, and is not influenced by what comes before or after the effect. Thus, it successfully determines the flashlight effect regardless of disturbances during the flashlight scene or the occurrence of multiple flashlight stages.

Various features can be used to test the similarity between frames for flashlight detection. One example is the shape of the objects [5]. A more prominent method is to use the edges of objects, as edges are unaffected by the changes in luminance and chrominance within the same scene. One such method is the feature-based shot boundary detection method introduced by Zabih et al. [13], who extracted the edges of objects from the frames, and later used them to test for similarity between the frames. This idea was further extended by Shen et al. [11], who made use of a Hausdorff distance histogram that worked well under flashlight conditions. However, as we will show later in this paper, FBD is still susceptible to erroneous detection. To provide a solution, we use additional edge direction information to eliminate those edge pixels that have been matched, and show that the technique is effective in the discrimination of flashlight effect from abrupt scene change.

3. Nature of the flashlight determination problem

The flashlight detection technique is part of the overall shot boundary detection system, as illustrated in Fig. 2. A similar set-up has been used by Yeo et al. [12] and Nakajima et al. [9], in which the location of a suspected abrupt scene change is first detected before it is further classified as either abrupt scene change or flashlight scene.

[Fig. 2. The overall shot boundary detection system: raw video is passed to an abrupt change detection unit; suspected abrupt change frame pairs are then passed to a flashlight detection unit, which classifies each pair as flashlight or abrupt scene change; special effects (fade/wipe) are handled by a separate path.]


The purpose of the abrupt change detection unit is to pick up the adjacent frames with distinct luminance or colour difference, while filtering out both static and high-motion scenes with no apparent shot change. Some popular techniques used in abrupt scene change detection include simple luminance histogram difference or even template difference. Alternatively, a more sophisticated shot boundary detection unit capable of handling special effects such as fades or wipes [6] can also be used. As shot boundary detection is a well-researched topic, its discussion is beyond the scope of this paper. Here, we focus solely on flashlight detection.

The main goal of flashlight determination is to discriminate flashlight scene from abrupt scene change, once a suspected abrupt change frame pair has been detected. To achieve this, we need to study the nature of flashlight scenes and abrupt scene changes, and extract the right feature from adjacent frames to test for similarity. A flashlight scene occurs when there is a drastic change of luminance and chrominance across frames of the same scene within a short period. This is caused by a sudden change in the intensity, direction or source of light, while the objects in the scene remain the same during the recording. Thus, unlike an abrupt scene change, there are correspondences between some objects across the adjacent frames during a flashlight scene.

Since colour or intensity information between two frames can differ significantly regardless of whether a flashlight or an abrupt scene change occurs, the most obvious approach is to utilise the edges of the objects for comparison. In this case, the edge pixels remain in the same position for frame pairs with flashlight effect, and occur in random fashion between two dissimilar frames. There are many edge detection algorithms in the literature [10], but a method of edge map construction using Canny's edge detection [3] is chosen, as the algorithm can be incorporated into our technique described in Section 5. The main task of flashlight determination is then simplified to maximising the number of edge pixels matched in the flashlight scene while minimising the amount of erroneous matching for abrupt scene change.

For similarity testing between frames, a few issues have to be taken into account when


employing edge information during a flashlight scene:

1. The position of an object boundary may shift even if the object remains in approximately the same position, due to the change of projection in light source direction.
2. An edge boundary line can be split into multiple lines and vice versa, due to the appearance and disappearance of light patches at the boundary of the object where the intensity and direction of light sources have been altered.
3. The number of objects disappearing and appearing during the occurrence of flashlight can be very high, depending on the focus area of the light source.
4. The details of the object may change even though the object remains in the same position.

All the above issues deal with the random appearance and disappearance of edge pixels, thus reducing the similarity of the frames for flashlight scene, which makes discrimination difficult.

A simple method to reduce the amount of mismatch is to use the frame with fewer edge pixels for testing. This can be explained as follows. Consider the number of matched edge pixels for flashlight scenes and abrupt scene changes to be $M_L$ and $M_A$, respectively. For any algorithm that successfully discriminates flashlight scenes, we expect $[m(M_L) - m(M_A)] > 0$, where $m(a)$ is the mean of a random variable $a$. Moreover, for an adjacent frame pair, let the number of edge pixels in the frame that has fewer edges be $N_{less}$, and that in the frame with more edges be $N_{more}$. We will expect

$$\frac{M}{N_{less}} > \frac{M}{N_{more}} > \frac{M}{N_{less} + N_{more}} \quad \text{for any } M > 0. \tag{1}$$

Assuming that both the flashlight and abrupt scene changes have approximately the same $m(N_{less})$ and $m(N_{more})$, we then obtain

$$\frac{m(M_L)}{m(N_{less})} - \frac{m(M_A)}{m(N_{less})} > \frac{m(M_L)}{m(N_{more})} - \frac{m(M_A)}{m(N_{more})} > \frac{m(M_L)}{m(N_{less} + N_{more})} - \frac{m(M_A)}{m(N_{less} + N_{more})}. \tag{2}$$

This means that the mean of the indicator formed using $N_{less}$ will produce a wider thresholdable area between flashlight scene and abrupt scene change than using either $N_{more}$ or $(N_{less} + N_{more})$, resulting in a better discrimination system. In this case, the similarity function can be constructed by dividing the number of matched edge pixels by the total number of edge pixels in the frame with fewer edge pixels.
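To make the indicator concrete, the following is a minimal sketch (our illustration, not the authors' code) of this normalisation: the matched-edge count is divided by the edge count of the frame with fewer edge pixels, $N_{less}$.

```python
import numpy as np

def similarity_indicator(edges_a: np.ndarray, edges_b: np.ndarray,
                         matched_count: int) -> float:
    """Normalise the matched-edge count by the edge count of the frame
    with fewer edge pixels (N_less), following Eqs. (1)-(2).

    edges_a, edges_b: boolean edge maps of the two adjacent frames.
    matched_count:    number of edge pixels matched between them.
    """
    n_less = min(int(edges_a.sum()), int(edges_b.sum()))
    if n_less == 0:
        return 0.0  # no edges to match; treat the pair as dissimilar
    return matched_count / n_less
```

Dividing by $N_{less}$ rather than $N_{more}$ or $N_{less} + N_{more}$ widens the thresholdable gap between the two classes, as Eq. (2) indicates.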

4. Feature-based technique and its limitations

The feature-based technique introduced by Zabih et al. [13] is perhaps the most obvious way of detecting flashlight scenes using edge pixels. The basic concept of the technique is to test frames for similarity by first extracting the edges of objects from two adjacent frames and then counting the number of mismatched pixels when one edge map is superimposed onto the dilated edge map of the other frame. The technique works on the assumption that when no transition occurs, the majority of the objects are subjected to limited motion. Thus, within the same scene, the edge pixels of the objects in one frame should be within close proximity of those in the adjacent frame. The discrepancy in motion is compensated by the dilation of edge pixels in one of the frames. The same concept can be applied to scenes with an abrupt change of luminance. In this case, the alteration of the direction and intensity of the light source will cause a slight change in the reflected light boundary of the objects due to the curvature of rigid surfaces.

The idea of similarity testing based on dilation works well when the number of edge pixels is small or the edges are highly structured. Problems arise when their numbers are large and they are spatially distributed. In this case, the proportion of the frame covered by the dilated edge pixels increases when the radius of the disc used for dilation increases, thus increasing the number of pixels matched regardless of the frame content. This is illustrated in Fig. 3, where the proportion of frame area covered by dilated pixels is plotted against the dilation radius for 40 frames collected from various movies.

[Fig. 3. Proportion of frame area covered by dilated edge pixels ("area portion occupied by pixels") versus the dilated disc radius (0–15 pixels) for 40 frames collected from various movies.]

It can be seen that the dilated pixels can easily cover at least 50% of the frame with a dilation radius of only 5 pixels. At this radius, a random pixel from one scene has a probability of 0.5 of landing within the dilated edge map. This reduces the accuracy of abrupt scene change detection. Our experiment in Section 7 will show that the discrimination of abrupt scene change and flashlight effect using this technique is difficult, as the proportions of matched pixels for the two effects overlap each other.
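For contrast with the directed technique of the next section, here is a minimal sketch of the undirected dilation matching that FBD relies on. The disc-shaped structuring element and the scipy-based dilation are our own illustrative choices, not the original implementation.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def fbd_matched_proportion(edges_a: np.ndarray, edges_b: np.ndarray,
                           radius: int) -> float:
    """Proportion of frame-A edge pixels that land inside the dilated
    edge map of frame B (undirected matching, as in FBD)."""
    # Disc-shaped structuring element with the given dilation radius.
    yy, xx = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    disc = (xx ** 2 + yy ** 2) <= radius ** 2
    dilated_b = binary_dilation(edges_b, structure=disc)
    n_a = int(edges_a.sum())
    return float((edges_a & dilated_b).sum()) / max(n_a, 1)
```

As the dilation radius grows, `dilated_b` covers most of the frame, which is exactly the loss of discriminative power described above.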

5. Directed edge matching with matched edge elimination

In order to achieve a successful discrimination, we require a mechanism that reduces the number of pixels matched between frames of different scenes, while not affecting the matching of pixels between frames with flashlight effect. Two methods are used here to solve the problem. The first reduces the amount of random edges being matched, and the second further prevents erroneous matching when the dilation disc radius increases.

To reduce the amount of erroneous matching for random pixels, we have to extract more information from the frames so as to identify edge pixel types, rather than only edge pixels, for use in discrimination. One such piece of information that can be easily obtained is the edge direction, which is available during edge detection. This is done by first taking the Euclidean distance of the RGB values, otherwise known as the chromatic barycenter, in four different directions:

$$D^{\mathrm{Horizontal}}_{x,y} = \frac{1}{2}\sqrt{(R_{x+1,y} - R_{x-1,y})^2 + (G_{x+1,y} - G_{x-1,y})^2 + (B_{x+1,y} - B_{x-1,y})^2},$$

$$D^{\mathrm{Vertical}}_{x,y} = \frac{1}{2}\sqrt{(R_{x,y+1} - R_{x,y-1})^2 + (G_{x,y+1} - G_{x,y-1})^2 + (B_{x,y+1} - B_{x,y-1})^2},$$

$$D^{\mathrm{Ascending}}_{x,y} = \frac{1}{2\sqrt{2}}\sqrt{(R_{x+1,y-1} - R_{x-1,y+1})^2 + (G_{x+1,y-1} - G_{x-1,y+1})^2 + (B_{x+1,y-1} - B_{x-1,y+1})^2},$$

$$D^{\mathrm{Descending}}_{x,y} = \frac{1}{2\sqrt{2}}\sqrt{(R_{x+1,y+1} - R_{x-1,y-1})^2 + (G_{x+1,y+1} - G_{x-1,y-1})^2 + (B_{x+1,y+1} - B_{x-1,y-1})^2}, \tag{3}$$

where $R_{x,y}$, $G_{x,y}$ and $B_{x,y}$ are the RGB components at position $(x, y)$, and can be replaced by YUV components with no conversion [7]. Colour instead of luminance is used because it increases accuracy and does not take up much computation time, as shown later in Section 6. The edge direction is then determined by the maximum of the four. Edge pixels in the adjacent frames are matched only when they have the same direction.

The use of edge direction works because, during a flashlight effect, most objects remain approximately in the same position, and two objects of the same shape will have approximately the same number of directed edge pixels distributed in the same geometrical positions. Fig. 4 shows the result of directed edge matching under different circumstances, using a search radius of 10 pixels.

[Fig. 4. Directed edge matching: (i) original frames, (ii) directed edge maps, (iii) matched pixels, for (a) a normal flashlight effect (proportion of matched pixels 80.9% and 72.4% for the 1st and 2nd frame, respectively); (b) a flashlight where objects appeared and disappeared (68.8%, 67.7%); (c) a normal abrupt scene change (29.6%); (d) an abrupt scene change where edges are uniformly distributed (39.2%, 42.5%).]

It can be seen in Figs. 4(a)(ii) and (b)(ii) that the majority of the edge pixels with the same direction are distributed in a similar fashion. Thus, a large proportion of the edges can be matched even though some objects have appeared and disappeared in different spatial locations across the frames shown in Fig. 4(b). On the contrary, during an abrupt scene change, as shown in Figs. 4(c) and (d), most edge pixels with the same direction cannot be matched across the frames even though they are spatially distributed in the frames. Fig. 4(d) shows one of the worst-case scenarios for abrupt scene change, where the edges are uniformly distributed across both frames. However, the proportion of matched pixels is still less than half in this case. Theoretically, using edge direction in dilation matching reduces the number of mismatched pixels by 75% for adjacent frames with abrupt changes, while having no significant effect on flashlight scenes.
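The following is a minimal sketch of Eq. (3): it computes the four directional chromatic differences and takes the per-pixel maximum as the direction label. The denominators (2 for axial and 2√2 for diagonal neighbour spacing) follow our reading of the equation; the paper folds this computation into Canny's edge detection rather than running it as a separate pass.

```python
import numpy as np

def edge_directions(rgb: np.ndarray) -> np.ndarray:
    """Per-pixel direction label (0 = Horizontal, 1 = Vertical,
    2 = Ascending, 3 = Descending) from Eq. (3); rgb is an H x W x 3 array."""
    def dist(a, b, norm):
        # Euclidean distance of the RGB (or YUV) triples, per Eq. (3).
        return np.sqrt(((a - b) ** 2).sum(axis=-1)) / norm

    p = np.pad(rgb.astype(np.float64), ((1, 1), (1, 1), (0, 0)), mode='edge')
    h, w = rgb.shape[:2]
    c = lambda dy, dx: p[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
    d = np.stack([
        dist(c(0, 1), c(0, -1), 2.0),                # horizontal
        dist(c(1, 0), c(-1, 0), 2.0),                # vertical
        dist(c(-1, 1), c(1, -1), 2.0 * np.sqrt(2)),  # ascending diagonal
        dist(c(1, 1), c(-1, -1), 2.0 * np.sqrt(2)),  # descending diagonal
    ])
    return d.argmax(axis=0)  # strongest of the four responses per pixel
```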

Although more than the four directions specified in Eq. (3) could be calculated, they are not used, since they increase the computation complexity and do not significantly increase the number of matches.


This is because the mismatch caused by the disappearance of edges across frames with flashlight effects is more significant than the slight incompatibility between edge directions caused by the quantisation. In addition, it is not possible to determine the exact direction of an edge accurately, since the direction of an edge pixel depends on the pixel values of its neighbours, and any slight deviation causes a huge difference in the edge direction for a low-contrast frame. In some cases, this deviation is not avoidable due to the quantisation of the DCT coefficients. This is especially so when fixed quantisation is used for frames with flashlight effects, which results in low contrast. Moreover, the four directions are easily obtained using Canny's edge detection, so edge detection can be performed simultaneously with the extraction of the edge direction, thus saving computational time. The use of four directions in Eq. (3) thus gives a reasonable balance between the complexity and accuracy of the algorithm.

Even though the amount of erroneous matching can be reduced by matching edges with the same edge direction, the number of matches can still be large if the dilation radius is large. With increasing dilation radius, an edge is more likely to match with another edge of the same direction in the adjacent frame, due to the occurrence of random edge pixels. To limit this type of erroneous matching, we systematically reduce the stray pixels by restricting those pixels to only one-to-one matching. This technique, known as matched edge elimination (MEE), examines each of the pixels in one frame, and tries to match it to the closest pixel of its adjacent frame in its proximity. Once found, the matched pixel is eliminated from the adjacent frame so as to disqualify it from further matching. This prevents multiple pixels from being matched to a single pixel. For accurate pixel matching, priority in matching is important, and the closest adjacent pixel is required to be matched first. This technique will not have much effect on flashlight scenes, since objects of similar shapes and sizes should have approximately the same number of pixels in the same geometrical position, and a one-to-one match is achieved in this case.

An efficient detection algorithm that incorporates the two techniques is constructed below. It is used on frames that have failed the intensity or chrominance similarity test. We call this technique matched directed edge elimination (MDEE). The results of the matched pixels are shown in

Figs. 4(a)–(d)(iii). The algorithm is outlined as follows:

1. For each frame, extract the edge pixels using Canny's edge detection algorithm; the edge direction for each edge pixel is obtained during the detection. Each edge position found is inserted into one of four lists, $L_E^a$, according to its direction $a \in$ {Horizontal, Vertical, Ascending, Descending}. In addition, each edge position, together with its direction, is also recorded in an edge map $M$.
2. Choose the frame with the smaller number of edge pixels as the key frame A, and the other as frame B. All pixels from frame A are to be matched with edge pixels extracted from frame B.
3. Prepare a list $L_S$ of search positions $(d_X, d_Y)$ with Euclidean distance from 0 to $r$, where $r$ is the search radius. Sort these search positions according to their Euclidean distances in ascending order.
4. For each edge direction $a \in$ {Horizontal, Vertical, Ascending, Descending}, run through the list $L_S$. For each position $(d_X, d_Y)$ extracted from this list, examine all the edge positions $(x, y)$ from the edge list $L_E^{A,a}$ belonging to frame A. For each of these edge positions, determine if there is an edge pixel of direction $a$ at position $(x + d_X, y + d_Y)$ in frame B using its edge map $M_B$. If found, the two pixels are considered matched, and the edge positions $(x, y)$ and $(x + d_X, y + d_Y)$ are eliminated from $L_E^{A,a}$ and $M_B$, respectively, to prevent further matching.
5. From the number of matched pixels, the proportion of pixels matched is calculated by dividing by the total number of edges in frame A.
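The sketch below illustrates steps 2–5 under the assumption that the step-1 edge extraction has already produced per-direction coordinate sets (the lists $L_E^a$); it is a simplified illustration, not the authors' implementation.

```python
def mdee_proportion(edges_a: dict, edges_b: dict, r: int) -> float:
    """edges_a, edges_b: {direction: set of (x, y) edge positions}, with
    edges_a taken from the frame with fewer edge pixels (step 2).
    Returns the proportion of frame-A edges matched one-to-one (step 5)."""
    # Step 3: search offsets within radius r, sorted closest first.
    offsets = sorted(
        ((dx, dy) for dx in range(-r, r + 1) for dy in range(-r, r + 1)
         if dx * dx + dy * dy <= r * r),
        key=lambda o: o[0] ** 2 + o[1] ** 2)

    matched = 0
    total = sum(len(pts) for pts in edges_a.values())
    for direction, pts_a in edges_a.items():   # step 4: same direction only
        remaining_a = set(pts_a)
        remaining_b = set(edges_b.get(direction, ()))
        for dx, dy in offsets:
            for (x, y) in list(remaining_a):
                if (x + dx, y + dy) in remaining_b:
                    # Matched: eliminate both pixels from further matching.
                    remaining_a.discard((x, y))
                    remaining_b.discard((x + dx, y + dy))
                    matched += 1
    return matched / max(total, 1)
```

Because matched pixels are eliminated as the offset distance grows, every match is one-to-one and the closest candidates are consumed first.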

6. Timing analysis

In this section, we analyse the complexity of each process. The entire algorithm consists of three major steps: pixel difference, non-maximal suppression and MDEE.


The first two steps are used to detect the positions of the edges. In the first step, pixel difference calculates the differences between each individual pixel and its immediate neighbours, and records the maximum of these differences and the direction corresponding to this value. The non-maximal suppression step then examines each pixel to determine if its pixel difference value is the maximum compared to its two neighbours in the direction of that pixel. If so, the edge position is recorded in one of the four lists corresponding to its direction. For both of these steps, the execution times are of the order of $O(w \cdot h)$, where $w$ and $h$ are the width and height of the frame, respectively.

The third step, MDEE, consists of three nested loops. The outer loop examines each of the four edge lists containing edge positions, and thus executes four times in total. The second loop examines each search position $(d_X, d_Y)$ with Euclidean distance from 0 to the search radius $r$, giving approximately $r^2$ executions. The innermost loop examines each edge position $(x, y)$ in the first frame, extracted from the edge position list, to determine whether the position $(x + d_X, y + d_Y)$ in the next frame contains an edge pixel. This gives an upper bound of $O(E_1)$, where $E_1$ is the number of edge pixels in the first frame. As matched pixels are removed from further examination, the number of executions is subsequently reduced. In total, the upper bound of the execution time is of the order of $O(r^2 \cdot E_1)$.

Fig. 5 shows the execution times of the algorithm in processing test samples containing 90 frame pairs with abrupt scene changes and 90 frame pairs with flashlight effects. The same samples are used in analysing the performance of the algorithm in the next section. The experiment is run on a Pentium III 500 MHz machine, with a frame size of 352 × 240 for the samples. It can be seen that the execution time of the algorithm increases with the search radius. In addition, the execution time for examining frame pairs with abrupt scene change increases faster than that for flashlight scenes. This is because more edges are matched in the flashlight scene, thus reducing the number of edges that have to be analysed in subsequent iterations.


7. Experiment and analysis

Experiments are carried out on samples from various famous thriller movies such as "Event Horizon", "Alien 3", "Dark City", "The Mask", "Independence Day", "Starship Troopers", "Terminator 2", "Godzilla", "Deep Rising", "The Ring 2" and "Soldier". The size of each frame used is 352 × 240 pixels. To be fair in our examination, flashlight scenes are chosen such that similar scenes are not repeated, and prominent objects within the frames are more or less visible in both frames before and after the effect. The abrupt scene changes are then selected consecutively from transitions in the middle of each movie. 90 frame pairs with flashlight effects and 90 frame pairs with abrupt scene changes are chosen for the experiment. The samples are chosen such that the colour differences between the frames of each pair are large compared to those in non-transition regions.

In order to show that the chromatic differences between the frame pairs in the test samples are large, we use a standard measurement indicator known as the mean chromatic barycenter [7] to compare the test sets. This is constructed by first finding the means of the RGB or YUV vectors in the two respective frames, and then calculating the chromatic barycenter between these two vectors. The chromatic barycenter is the Euclidean distance of the RGB or YUV vectors rescaled to human visual sensitivity,

$$s(d) = 100 \left( \frac{T_2 - d}{T_1 - T_2} \right), \tag{4}$$

where $T_1$ and $T_2$ are defined as 1/128 and 1/8 of the colour space, and $d$ is the norm distance between the chromatic barycenters of two pixels in the RGB or YUV space. The distance $d$ between pixels A and B is defined as

$$d(A, B) = |C_A - C_B| = \sqrt{(R_A - R_B)^2 + (G_A - G_B)^2 + (B_A - B_B)^2} \approx \sqrt{4\left[(Y_A - Y_B)^2 + (U_A - U_B)^2 + (V_A - V_B)^2\right]}. \tag{5}$$

Here, a value of more than 100 indicates a large difference between two pixels.
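As an illustration, here is a minimal sketch of the mean chromatic barycenter test of Eqs. (4) and (5), assuming 8-bit RGB frames so that $T_1$ and $T_2$ evaluate to 2 and 32 (1/128 and 1/8 of the 256-level colour space); the exact form of $s(d)$ follows our reconstruction of Eq. (4).

```python
import numpy as np

def mean_chromatic_barycenter(frame_a: np.ndarray, frame_b: np.ndarray) -> float:
    """Rescaled distance s(d) between the mean RGB vectors of two frames
    (Eqs. (4)-(5)); values above 100 indicate a large difference."""
    t1, t2 = 256.0 / 128.0, 256.0 / 8.0  # 1/128 and 1/8 of the colour space
    c_a = frame_a.reshape(-1, 3).mean(axis=0)  # mean RGB vector, frame A
    c_b = frame_b.reshape(-1, 3).mean(axis=0)  # mean RGB vector, frame B
    d = float(np.linalg.norm(c_a - c_b))       # Eq. (5), RGB form
    return 100.0 * (t2 - d) / (t1 - t2)        # Eq. (4)
```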

[Fig. 5. Execution time distribution of MDEE for (a) abrupt scene change and (b) flashlight scene: execution time (msec, 0–160) of the pixel differences, non-maximal suppression and matched edge elimination steps versus search radius (2–8 pixels).]

There are many ways to utilise this indicator, depending on the spatial resolution of the frame used for measurement. The distributions of the mean chromatic barycenters of the test samples with flashlight effect and abrupt scene change are plotted in Fig. 6, and compared with those of 60 frame pairs of normal scenes selected from non-transition regions of various movies. The graphs show that both the frame pairs with flashlight scene and those with abrupt scene change fail the similarity test, as their values are high (greater than 100) compared to those of frame pairs from non-transition regions. They also show that the distribution of the frame pairs with flashlight scene is almost identical to that with abrupt scene change, making them inseparable using this technique.

[Fig. 6. Histogram of the chromatic barycenter distribution for the different types of samples used in the experiment: number of frame pairs versus mean chromatic barycenter (0–800) for flashlight effect, abrupt scene change and non-transition scenes.]

For analysis, we compare MDEE with FBD and the Hausdorff distance histogram (HDH) detection introduced by Shen et al. [11]. For HDH, each frame is divided into blocks of 16 × 16 pixels, and the indicator is constructed using the method described in [11], with a sample size of 3 pixels and a search radius equivalent to the dilation distance used for both MDEE and FBD.

To be fair in the comparison, the proportion of matched edge pixels to the number of edge pixels in the frame with fewer edges is used as the indicator for all three methods, rather than the maximum number of unmatched pixels as given in [11].

We first use two standard indicators, Precision and Recall [2], to measure the performance of our algorithm. These indicators are defined as follows:

$$\mathrm{Recall} = \frac{\mathrm{Correct}}{\mathrm{Correct} + \mathrm{Missed}}, \qquad \mathrm{Precision} = \frac{\mathrm{Correct}}{\mathrm{Correct} + \mathrm{False\ Positive}}. \tag{6}$$

In Eq. (6), Correct is the number of samples with abrupt scene change correctly detected using a fixed threshold, Missed is the number of samples with abrupt scene change mistakenly classified as flashlight effect, and False Positive is the number of samples with flashlight effect erroneously classified as abrupt scene change.
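A small sketch of how the precision–recall curves of Fig. 7 can be traced from the labelled samples, assuming a frame pair is declared an abrupt scene change when its matched-edge proportion falls below the threshold (variable names are ours):

```python
def precision_recall_curve(abrupt_vals, flash_vals, thresholds):
    """Sweep the decision threshold over the indicator values and apply
    Eq. (6) at each setting; returns a list of (recall, precision) points."""
    points = []
    for t in thresholds:
        correct = sum(v < t for v in abrupt_vals)    # abrupt changes detected
        missed = len(abrupt_vals) - correct          # abrupt read as flashlight
        false_pos = sum(v < t for v in flash_vals)   # flashlight read as abrupt
        recall = correct / max(correct + missed, 1)
        precision = correct / max(correct + false_pos, 1)
        points.append((recall, precision))
    return points
```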

By varying the threshold used for the detection at a given search radius, the recall versus precision graph for the three techniques can be plotted, as shown in Fig. 7. The precision of the detection is indicated by the distance of the lines from the top right-hand corner. For each technique, three lines with search radii of 3, 5 and 7 pixels are used. The graph shows that an almost perfect discrimination is achieved by MDEE, for which unity precision and recall can be achieved simultaneously. This technique is superior to both FBD and HDH in flashlight effect discrimination, where high recall is compensated by lower precision, and vice versa.

To analyse the discrimination process further, we plot the indicator values for the three techniques under different search radii using the same samples. These plots are shown in Fig. 8. The dashed lines indicate the samples with flashlight effect, and the solid lines the samples with abrupt scene change. The black solid and dashed lines are the means of these traces and the bounds within which 90% of the traces lie, respectively. Note that all the indicator values for MDEE are lower than those for FBD. This is logical for abrupt scene change since, as explained earlier, their values should theoretically decrease by a factor of four. For flashlight scenes, this is due to the reduction of matched pixels introduced by MEE and the disappearance of some objects when the brightness of the frame is too high or too low. However, the decrease is not as large as that for

[Fig. 7. Precision versus recall graph for flashlight and abrupt scene change discrimination using matched directed edge elimination, feature-based detection and the Hausdorff distance histogram, with search radii R = 3, 5 and 7. MDEE reaches the top right-hand corner (unity precision and recall); FBD and HDH lie below, with precision dropping as recall increases.]

the abrupt scene change. Three advantages of using MDEE can be noted from the two figures:

1. Thresholding is highly accurate. As seen in Fig. 8(c), the thresholdable area is clearly visible for MDEE for dilation radii greater than 4 pixels, in which the abrupt scene change is distinctly separated from the flashlight scenes. This accounts for the perfect discrimination shown in Fig. 7. On the other hand, the indicators for abrupt scene change and flashlight effect produced by both FBD and HDH overlap each other, and a perfect discrimination between abrupt scene change and flashlight effect is impossible regardless of the value of the threshold. Furthermore, the overlapping region increases when the dilation radius is high.

2. The uncertainty in erroneous matching for a fixed dilation radius is low. The variation or spread of the abrupt scene change indicator, and hence the uncertainty of detection, is much smaller in MDEE than in FBD, making MDEE a preferable method for detection. This is indicated by the length of the vertical dashed lines in Fig. 8 for all detections. It is important because a missed detection poses a more serious problem than a false detection of abrupt scene change: a missed detection can result in erroneous tracking, or sometimes even erroneous segmentation of objects across the sequences, while a false detection simply results in a discontinuity of objects in temporal sequences. The uncertainty of the abrupt scene change indicator in MDEE is reduced by at least half as compared to the feature-based and HDH detection techniques.

3. Accuracy in matching is less sensitive to variation in the dilation radius. This can be seen from the mean of the matched edge pixels for abrupt scene change, as the slope of this mean does not vary as much with increasing dilation radius, indicating less variation in the probability of mismatch. On the other hand, the slope of the abrupt scene change mean in FBD is steeper; from Fig. 8, it is about 70% greater than that in MDEE. This is attributed to the use of the MEE technique, which greatly reduces the mismatched edge pixels. Note that the combination of the larger spread and steeper slope for abrupt scene change results in the overlapping of abrupt scene change and flashlight effect in FBD.

The second point leads to the fact that abrupt scene change detection is more important than flashlight scene detection. For accurate detection, priority is given to setting the threshold for the indicator high enough so that it covers all abrupt scene change effects. In addition, the dilation radius has to be large enough so that the distances between similar objects in a flashlight scene overlap. From Fig. 8, the threshold is chosen to be at least 0.4 for a 5-pixel dilation radius and 0.5 for a 7-pixel radius.

Our experiment above demonstrates the superiority of MDEE in discriminating flashlight effect from abrupt scene change. This technique achieved a perfect detection for our samples that eluded both the FBD and HDH detection techniques. Since these samples are collected from a wide range of real-life movies of different characteristics, we expect the algorithm to work well for other movies. For a good detection, the dilation radius (or maximum search radius) should be set to at least 5 pixels in order to capture the shift in edge boundary during a flashlight scene, while the threshold of matched edge proportion used for discrimination should be set to at least 0.4.

[Fig. 8. Proportion of matched edge pixels versus dilated disc radius (1–8 pixels) for (a) FBD, (b) HDH detection and (c) MDEE, with plots of means and 5–95% bounds for the two effects. Labelled mean slopes for the flashlight effect and abrupt scene change traces: (a) 0.0392 and 0.0542; (b) 0.0393 and 0.0722; (c) 0.0554 and 0.0347, respectively.]


8. Limitations and improvements

In some real-life scenes, flashlight effect detection can be difficult. The major problem comes from the fact that the change in luminance is not uniform over the entire frame during the effect. In certain circumstances, some frame pairs that a human judges as belonging to the same scene can fail at all levels of edge pixel testing, regardless of the technique used. This is due to changes in the direction of light shining on irregular surfaces, especially if the surface is large and close to the camera. Another limitation can be seen in scenes where the direction and position of the light source move, which leads to the disappearance of some objects within the scene, and this effect contributes to the differences between frames. An example of this can be found in scenes where moving searchlight or torchlight sources are used.

From an algorithmic implementation point of view, we have found that any complex algorithm that tries to reduce the number of edge matches for abrupt scene change will also reduce that for flashlight scenes. This is due to the object disappearance problem inherent in flashlight scenes, which reduces the thresholdable gap between the flashlight and abrupt scene change effects. We have tested an algorithm that reduces the matching of pixels by dividing the frame into regions and restricting the direction in which the pixels in each region can be matched. Such a method can reduce the similarity between abrupt scene change frames by half, but it also reduces the indicator of flashlight scene frames to a level that is unacceptable for thresholding.

Since our algorithm works on low-level pixel-based testing, the improvement we propose can be made at the object level rather than merely at the pixel level. This involves segmenting all objects from the scene and testing them for similarity between adjacent frames. However, such a method can be processing-intensive compared to pixel-level examination, and successful segmentation depends very much on the clarity of the objects within the frame, which is lacking in frames with flashlight effect. In such circumstances, detection might be better off using pixel-level examination.

9. Conclusion

In this paper, a novel algorithm is proposed which significantly improves the accuracy of discriminating flashlight effect from abrupt scene change in shot boundary detection, using pixel-level examination of frame pairs that fail the similarity test. This paper recognises the possibility of using FBD for flashlight effect determination and outlines its limitations. Two methods, which aim to reduce erroneous matching, are then utilised to overcome these limitations. The first method is based upon the edge direction extracted during the edge detection stage; this method can reduce the erroneous matching by 75%. The second method eliminates edge pixels from further matching after they have been used, which reduces the chance of erroneous matching when the dilation radius increases. The combination of these two techniques, i.e. MDEE, was tested against the feature-based and HDH detection techniques on real-life frames with flashlight and abrupt scene change effects. MDEE was found to be highly accurate in thresholding. Furthermore, this technique also reduces the uncertainty of the indicator and its sensitivity to the dilation radius used for matching. In addition, we suggested how this method can be incorporated into object-based detection, which has proven to be superior to traditional detection schemes. The use of flashlight detection combined with object-based shot boundary detection greatly improves the accuracy of scene change detection for real-life sequences, and this can be used in video indexing. Such detection will be useful in object segmentation for action and horror movies, in which scenes are filled with flashlight effects.

References

[1] G. Ahanger, T.D.C. Little, A survey of technologies for parsing and indexing digital video, J. Visual Commun. Image Representation 7 (1) (March 1996) 28–43.


[2] J.S. Boreczky, L.A. Rowe, Comparison of video shot boundary detection techniques, J. Electron. Imaging 5 (2) (April 1996) 122–128.
[3] J. Canny, A computational approach to edge detection, IEEE Trans. Pattern Anal. Mach. Intell. PAMI-8 (6) (November 1986) 679–698.
[4] B. Furht, S.W. Smoliar, H.J. Zhang, Content-based video indexing and retrieval, in: Video and Image Processing in Multimedia Systems, Kluwer Academic Publishers, Dordrecht, 1995, pp. 271–321 (Chapter 12).
[5] W.J. Heng, K.N. Ngan, Validity of scene cut detection using bit rate information of VBR video, in: International Symposium on Signal Processing and Robotics 98, Vol. 2, Hong Kong, September 1998, pp. 133–138.
[6] W.J. Heng, K.N. Ngan, The implementation of object-based shot boundary detection using edge tracing and tracking, in: IEEE International Symposium on Circuits and Systems 99, Florida, USA, June 1999, pp. IV-439–442.
[7] W.J. Heng, K.N. Ngan, M.H. Lee, Performance of chromatic barycenter with MPEG elements for low-level shot boundary detection and its improvements, in: Proceedings, Symposium on Image, Speech, Signal Processing, and Robotics, Hong Kong, September 1998, pp. 272–276.
[8] F. Idris, S. Panchanathan, Review of image and video indexing techniques, J. Visual Commun. Image Representation (Special Issue on Indexing, Storage and Retrieval of Images and Video) 8 (2) (June 1997) 146–166.
[9] Y. Nakajima, K. Ujihara, A. Yoneyama, Universal scene change detection on MPEG-coded data domain, Proc. SPIE Visual Commun. Image Process. 3024 (2) (1997) 992–1003.
[10] J.R. Parker, Algorithms for Image Processing and Computer Vision, Wiley Computer Publishing, New York, 1997, Chapter 1.
[11] B. Shen, D. Li, I.K. Sethi, HDH based compressed video cut detection, in: Visual 97, December 1997, pp. 149–156.
[12] B.L. Yeo, B. Liu, Rapid scene analysis on compressed video, IEEE Trans. Circuits Systems Video Technol. 5 (6) (December 1995) 533–543.
[13] R. Zabih, J. Miller, K. Mai, A feature-based algorithm for detecting and classifying scene breaks, in: Proceedings, ACM Conference on Multimedia, San Francisco, November 1995.