Spatial Similarity Retrieval in Video Databases

Journal of Visual Communication and Image Representation 12, 107–122 (2001), doi:10.1006/jvci.2000.0460, available online at http://www.idealibrary.com

Spatial Similarity Retrieval in Video Databases Yung-Kuan Chan and Chin-Chen Chang Department of Computer Science and Information Engineering, National Chung Cheng University, Chiayi, Taiwan, 621, Republic of China E-mail:[email protected], [email protected] Received November 20, 1998; accepted October 10, 2000

A nine-direction lower-triangular (9DLT) matrix describes the relative spatial relationships among the objects in a symbolic image. In this paper, the 9DLT matrix is transformed into a linear string, called the 9DLT string. Based on the 9DLT string, two similarity metrics for image matching, simpler yet more precise, are provided to solve the subimage and similar image retrieval problems. Moreover, a common component binary tree (CCBT) structure is refined to store a set of 9DLT strings. The revised CCBT structure not only eliminates the redundant information among those 9DLT strings, but also diminishes the processing time for determining the image matching distances between query frames and video frames. Experiments indicate that the storage space and the processing time are greatly reduced through the revised CCBT structure. A fast dynamic programming approach is also proposed to handle the problem of sequence matching between a query frame sequence and a video frame sequence. © 2001 Academic Press

Key Words: symbolic image; spatial similar image retrieval; spatial similar video retrieval; 2D C-trees; 9DLT matrix; CCBT structure.

1. INTRODUCTION

The processing of video data plays an important role in many application areas, such as digital libraries, interactive video analysis, multimedia publishing, and geographic information systems. Video is an effective medium with a large amount of spatial and temporal information [8, 14, 18]; therefore, it is helpful to find the video frame sequence that satisfies the spatial and the temporal relationships specified in a query frame sequence. Owing to the huge memory space required for storing video data, economical storage and fast retrieval are two pivotal problems for video applications. Similarity retrieval works effectively when the users fail to express queries in a precise way. Spatial similar image retrieval seeks the images that satisfy the query image's requirements in terms of the spatial relationships among the objects in their related symbolic images. An object, symbolizing an entity in a physical image, is enclosed by a minimal bounding rectangle whose sides are parallel to the horizontal and the vertical axes. A name and the coordinates of the object relative to the image frame are attached to the object.


A symbolic image is composed of a set of objects. If these objects correspond to all the entities in an image I, the symbolic image can depict the spatial relationships among the entities in the image I. Each frame in video data can be transformed into a symbolic image, too. Consequently, the techniques of iconic image indexing can be extended to video data indexing [1, 11, 19]. Research on video retrieval based on spatial and temporal relationships has been presented by several authors [1, 8, 11, 19]. Almost all of them conclude that video can be indexed and accessed on the groundwork of its spatial content. Hsu et al. [11] used 2D C-trees to characterize the spatial information of video contents. They solved the video frame sequence matching (VFSM) problem by computing the minimum editing distance of the 2D C-trees. A specific tree-matching algorithm is engaged in deciding the editing distance. Unfortunately, the distance cannot effectively measure the similarity of images with spatial constraints. To deal with the problem of video sequence matching, this paper defines a nine-direction lower-triangular (9DLT) string to represent a symbolic image. The 9DLT string not only makes the metric of similarity in image matching more precise, but also simplifies the VFSM problem. It is known that sharing the same parts among similar images can reduce the storage space. The concept of sharing the unchanged portions between two sequential images is named overlapping [16]. The CCBT structure, proposed by Chan and Chang [5], realizes this concept. This paper refines it so as to store a set of 9DLT strings. The refined structure noticeably reduces both the storage space required to save the information of a video frame sequence and the processing time to calculate the image matching distances between query frames and video frames. Moving from image retrieval to video retrieval increases the complexity; the temporal dimension is the distinguishing feature. Hence, this paper also presents a fast dynamic programming technique to solve the problem of sequence matching between a query frame sequence and a video frame sequence.

The rest of this paper is organized as follows. Section 2 briefly outlines related works. Section 3 presents two simple metrics of similarity in image matching based on 9DLT strings. A revised CCBT structure is described in Section 4; in the same section, a dynamic programming method is given to obtain the minimal VFSM distance between a query frame sequence and a video frame sequence through the revised CCBT structure. Section 5 explores the properties of the techniques discussed in this paper and also offers two strategies to determine the order of inserting the frames of a video into a CCBT. Section 6 shows the experimental results. Conclusions are drawn in Section 7.

2. RELATED WORKS

Hsu et al. [9, 10] and Lee and Hsu [13] defined 2D C-trees to characterize the spatial information of image contents. Each image is indexed by two 2D C-trees along the X coordinate and the Y coordinate. Three kinds of editing operations (replace, delete, and insert) are employed to transform one tree T1 into another tree T2. The minimal number of editing operations required to transmute T1 into T2, written as δ(T1, T2), is used to measure the similarity of the corresponding images of T1 and T2 with spatial constraints. The complexity of this algorithm is O(|T1| × |T2| × min(depth(T1), leaves(T1)) × min(depth(T2), leaves(T2))). Here, depth(T1) and leaves(T1) denote the depth and the number of leaf nodes of tree T1, respectively. In the worst case, it runs in O(n^4) time, if the numbers of nodes in T1 and in


FIG. 1. Three different symbolic images.

T2 are both n. Let Tix and Tiy be the 2D C-trees of image Ii along the X coordinate and the Y coordinate, respectively, and let |T1| and |T2| be the numbers of nodes in T1 and in T2. The editing distance δ(I1, I2) between two images I1 and I2 is given by

δ(I1, I2) = δ(T1x, T2x),                      if δ(T1y, T2y) = 0,
δ(I1, I2) = δ(T1y, T2y),                      if δ(T1x, T2x) = 0, and
δ(I1, I2) = δ(T1x, T2x) × δ(T1y, T2y),        otherwise.

Unfortunately, this editing distance metric may misjudge two dissimilar images as similar ones. Take Fig. 1 as an example, where f1, f2, and f3 are three different symbolic images. Figure 2 shows their corresponding 2D C-trees T1x, T1y, T2x, T2y, T3x, and T3y with reference to the X coordinate and the Y coordinate, respectively. Hence,

δ(T1x, T2x) = 0 and δ(T1y, T2y) = 2,
δ(T1x, T3x) = 2 and δ(T1y, T3y) = 2,

so δ(f1, f2) = 2 and δ(f1, f3) = 4. The editing distance thus indicates that image f1 is more similar to f2 than to f3; nevertheless, f1 actually resembles f3 more closely. Hsu et al. [11] also extended this image indexing technique to characterize the spatial contents of individual video frames by an ordered set of 2D C-trees. The similarity between two frames is measured by the editing distance between their corresponding 2D C-trees. Thus, the summation of the minimal editing distances of mapping the query frames to their corresponding video frames is defined as the similarity metric of the video sequence

FIG. 2. The corresponding 2D C-trees of f1, f2, and f3 in Fig. 1.


matching. Let U = u1u2 . . . um be the query frame sequence and V = v1v2 . . . vn be the video frame sequence. The symbolic string sequence comparison algorithm using the dynamic programming technique [17] can handle the problem of video frame sequence matching. For a transformation from v1v2 . . . vi into u1u2 . . . uj, the minimal sequence matching distance, denoted by D[i, j], is defined initially as

D[0, 0] = 0,
D[i, 0] = D[i − 1, 0] + |vi|,  for i = 1 to n, and
D[0, j] = D[0, j − 1] + |uj|,  for j = 1 to m,

and then repeatedly computed as

D[i, j] = min{D[i − 1, j] + |vi|, D[i, j − 1] + |uj|, D[i − 1, j − 1] + δ(vi, uj)},  for i = 1 to n and j = 1 to m.

Here, |ui| and |vi| are the numbers of objects in frames ui and vi, and δ(vi, uj) is the editing distance between vi and uj. Finally, the minimal VFSM distance is stored in D[n, m]. It is clear that the computing time of the above algorithm is O(m × n).

In 1991, Chang [6] defined a 9DLT matrix to represent a symbolic image. The relative spatial relationships among objects are described by nine direction codes; each relationship maps onto one of the integers from 0 to 8, as shown in Fig. 3. R denotes the reference object, 0 indicates "at the same spatial location as R," 1 means "at the east of R," and so forth. Consider a symbolic image f containing a set of p ordered distinct objects O = {o1, o2, . . . , op}. A 9DLT matrix T representing the spatial relationships among the objects in f is a p × p matrix over the set of the nine direction codes. The element Ti,j of T is the direction code of the object oi referring to the object oj if i > j; otherwise Ti,j is undefined. Figure 4 shows the 9DLT matrix of f3 in Fig. 1.

Chan and Chang [5] developed a common-component binary tree (CCBT) structure to keep a linear quadtree family corresponding to a set of similar binary images by sharing the same parts among them. A CCBT is a binary tree in which each internal node has exactly two child nodes. The information on the nodes of a path P from the root node to a leaf node u can be used to represent an image I; the path P is called a component path of I. A leaf node contains the information only in image I but not in the

FIG. 3. The nine direction codes.


FIG. 4. The 9DLT matrix of f3 shown in Fig. 1.

other images. An internal node v holds the information shared by the images whose corresponding component paths pass through v. Let Q1 and Q2 be two linear quadtrees. The overlapping operation extracts the common part between Q1 and Q2, written as Q1 ∩ Q2, and partitions Q1 and Q2 into three parts: O12, D1, and D2. O12 is the common part between Q1 and Q2, D1 = Q1 − Q2 is the information in Q1 but not in Q2, and D2 = Q2 − Q1 is that in Q2 but not in Q1. The unoverlapped operation, written as O12 ∪ D1, reconstructs Q1 by substituting the ith subquadtree in D1 for the ith x in O12. Furthermore, insertion, retrieval, and deletion can be implemented on top of the overlapping and unoverlapped operations.

3. IMAGE RETRIEVAL

There are two categories of spatial similar image retrieval queries. The first category searches only for the database images that satisfy all of the spatial relationships specified in a query image [4, 6]; in other words, it looks for the superimages of the query image in the database. The second one seeks all the images in the database whose spatial relationships among objects satisfy as many of the spatial relationships indicated in a query image as possible [12, 13]. This paper defines two simple metrics, based on the 9DLT string, to answer both categories of queries. In this section, we introduce the definition of a 9DLT string before discussing both metrics.

3.1. 9DLT String

A 9DLT matrix is a general data structure for representing a symbolic image [2–4, 6, 7, 15]. Each object is tagged with a distinct name and the centroid coordinates of the object with reference to the image frame. Nine integers denote the spatial relationships among objects. A triple (Oi, Oj, rij) depicts the pairwise spatial relationship rij between two objects Oi and Oj, where 0 ≤ rij ≤ 8 [7]. (Oi, Oj, rij) is an ordered triple, where Oi < Oj in lexical ordering. In this paper, the three attributes Oi, Oj, and rij in the triple (Oi, Oj, rij) are concatenated into a 9DLT pattern OiOjrij. Thus, only m × (m − 1)/2 patterns are needed to describe a symbolic image with m objects. The first two attributes of a 9DLT pattern are defined as the name of the pattern. The 9DLT string of a symbolic image f is obtained by concatenating all the 9DLT patterns of f in increasing order of the names of the patterns. Since each image can be treated as a symbolic image, a 9DLT string can hence characterize an image. The strings S1, S2, S3, and S4 in Fig. 6 illustrate the 9DLT strings of the symbolic images f1, f2, f3, and f4 in Fig. 5.


FIG. 5. Four symbolic images.

Assume that f is a symbolic image consisting of an ordered object list {A1, A2, . . . , Am}, sorted according to their symbol names. Then the 9DLT string A1A2r12 A1A3r13 . . . A1Amr1m A2A3r23 A2A4r24 . . . A2Amr2m . . . Am−1Amr(m−1)m of f can be generated easily from {A1, A2, . . . , Am} by a two-level loop. The processing time to generate the 9DLT string of f is O(m log m) + O(m^2) = O(m^2).
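To make the two-level loop concrete, the following Python sketch (our illustration, not code from the paper) generates the 9DLT patterns of a symbolic image given as a list of (name, x, y) object records. The exact assignment of direction codes 2 through 8 and the choice of reference object within a pair are assumptions, since the text specifies only the general scheme of Fig. 3.

# Sketch only: build the 9DLT string, kept as a name-ordered list of
# (Oi, Oj, rij) triples, of a symbolic image with m objects.

def direction_code(ref, obj):
    # Nine direction codes in the spirit of Fig. 3: 0 = same location as
    # the reference object R, 1 = east of R; codes 2-8 are an assumed
    # counterclockwise ordering NE, N, NW, W, SW, S, SE.
    dx, dy = obj[1] - ref[1], obj[2] - ref[2]
    if dx == 0 and dy == 0:
        return 0
    sign = lambda z: (z > 0) - (z < 0)
    sectors = {(1, 0): 1, (1, 1): 2, (0, 1): 3, (-1, 1): 4,
               (-1, 0): 5, (-1, -1): 6, (0, -1): 7, (1, -1): 8}
    return sectors[(sign(dx), sign(dy))]

def ninedlt_string(objects):
    # Sort the objects by name, then emit the m(m-1)/2 patterns OiOjrij
    # in increasing order of pattern name: O(m log m) + O(m^2) = O(m^2).
    objs = sorted(objects, key=lambda o: o[0])
    patterns = []
    for i in range(len(objs)):
        for j in range(i + 1, len(objs)):
            rij = direction_code(objs[j], objs[i])  # oi relative to oj (assumed)
            patterns.append((objs[i][0], objs[j][0], rij))
    return patterns

Keeping the string as a name-ordered list of triples also turns the pattern comparisons of the next subsection into single merge-style scans.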

3.2. Image Matching

With the 9DLT string representation, the two categories of queries mentioned above can be easily solved. The first category of queries retrieves from the database all of the images that are superimages of the query image. Consider the database image fA and the query image fB, whose ordered object lists are lA and lB, respectively. fA is a superimage of fB only if all the 9DLT patterns in the 9DLT string SB of fB are in the 9DLT string SA of fA. Function Superimage below checks whether fA is a superimage of fB. In the algorithms, the text embedded in a pair of braces is a comment.

FUNCTION SUPERIMAGE (lA, lB).
generate the 9DLT string SA of fA from lA
generate the 9DLT string SB of fB from lB
i = 1
for each 9DLT pattern pB in SB
  while (p_i^A < pB) {p_i^A is the ith 9DLT pattern in SA}
    i = i + 1
  if p_i^A ≠ pB then return FALSE {pB is not in SA}
return TRUE

The second category of queries responds with the database images having the largest number of spatial relationships that also exist in the query image. Function Similar Matching counts the number of 9DLT pattern pairs (pA, pB) such that pA equals pB; hence, Similar Matching can answer the second category of queries. Count indicates the similarity between fA and fB, where Count is the number of spatial relationships appearing in both fA and fB.

FIG. 6. The corresponding 9DLT strings of f1, f2, f3, and f4 in Fig. 5.
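As a concrete reading of Function Superimage above and Function Similar Matching below, here is a Python sketch of our own (an assumption, not the paper's code) over 9DLT strings stored as name-ordered lists of (Oi, Oj, rij) triples; generating the two strings dominates the O(m^2 + n^2) cost, while each scan below is linear in the string lengths.

# Sketch (assumed representation): a 9DLT string is a list of patterns
# (Oi, Oj, rij), sorted by the pattern name (Oi, Oj).

def superimage(sa, sb):
    # True if every pattern of the query string sb also occurs in sa,
    # i.e., fA is a superimage of fB.
    i = 0
    for pb in sb:
        while i < len(sa) and sa[i][:2] < pb[:2]:
            i += 1
        if i == len(sa) or sa[i] != pb:
            return False
    return True

def similar_matching(sa, sb):
    # Count of 9DLT patterns shared by fA and fB (the Count measure).
    i, count = 0, 0
    for pb in sb:
        while i < len(sa) and sa[i][:2] < pb[:2]:
            i += 1
        if i < len(sa) and sa[i] == pb:
            count += 1
    return count

For the first category of query, a database image fA is returned whenever superimage(SA, SB) is TRUE; for the second, the database images are ranked by similar_matching(SA, SB).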


FUNCTION SIMILAR MATCHING (lA, lB).
generate the 9DLT string SA of fA from lA
generate the 9DLT string SB of fB from lB
Count = 0; i = 1
for each 9DLT pattern pB in SB
  while (p_i^A < pB) {p_i^A is the ith 9DLT pattern in SA}
    i = i + 1
  if p_i^A = pB then Count = Count + 1 {pB is in SA}
return Count

Both of the above functions generate SA and SB and then scan SA and SB exactly once. Thus, they take O(m^2 + n^2) time, where m and n are the numbers of objects in fA and fB, respectively.

4. VIDEO RETRIEVAL

The difference between any two consecutive frames within a video frame sequence is usually small. Sharing the common information among a set of consecutive video frames can significantly diminish the storage space required to store these video frames. Chan and Chang [5] defined a CCBT structure to save a set of similar images represented by a linear quadtree family. The CCBT structure keeps only one copy of the common information: it places the parts shared by a cluster of images on a node N and its ancestor nodes, such that the component paths of those images pass through N and its ancestor nodes. This section modifies the CCBT to hold 9DLT strings. On the basis of the CCBT structure, this paper also provides a simple video frame sequence matching method.

4.1. The CCBT Structure

Overlapping and unoverlapped operations are the two primary functions of the CCBT structure. The algorithms for retrieving, deleting, and inserting a datum from or into the CCBT can be easily designed through these two operations. Except for the overlapping and unoverlapped operations, the other operations proposed by Chan and Chang [5] still work here without change. Hence, this paper only discusses the overlapping and unoverlapped operations. The overlapping operation separates two 9DLT strings S1 and S2, written as S1 ∩ S2, into three parts: O12, D1, and D2. O12 is the overlap between S1 and S2, D1 = S1 − S2 carries the information in S1 but not in S2, and D2 = S2 − S1 records that in S2 but not in S1. For the convenience of saving the 9DLT strings in the CCBT structure, the symbols "x" and "|" may be inserted into a 9DLT string. Both symbols are regarded as two extra 9DLT patterns whose names are smaller than those of the other patterns. Here, "x" indicates some patterns not saved in the current node but saved in its child nodes; "|" separates the different substrings that are mapped onto distinct "x" symbols in the parent node. O12, D1, and D2 can be obtained by comparing the 9DLT patterns in S1 with those in S2. If there are two identical substrings s1 in S1 and s2 in S2, s1 is given to O12. Otherwise, "x" is emitted to O12; "|" is added to D1 unless D1 is empty, and similarly to D2; then s1 is appended to D1 and s2 to D2. Since the data processed in the overlapping operation are 9DLT strings rather than linear quadtrees, the function Overlapping provided by Chan and Chang [5] must be properly modified. A revised version is described as follows. In the algorithms, A‖B denotes the concatenation of strings A and B.


FUNCTION OVERLAPPING (S1, S2).
for each 9DLT pattern in S1
  switch (the leftmost pattern of S1)
  case "x":
    remove the "x" from S1 and append it to s1.
  case "|":
    repeatedly remove the leftmost pattern of S2, and append it to s2, until the leftmost pattern of S2 is "|".
    remove the leftmost patterns "|" of S1 and of S2.
    Append Substring(s1, s2)
    O12 = O12 ‖ "|"
  otherwise:
    if the leftmost pattern of S2 is "|" then
      repeatedly remove the leftmost pattern of S1, and append it to s1, until the leftmost pattern of S1 is "|".
      remove the leftmost patterns "|" of S1 and of S2.
      Append Substring(s1, s2)
      O12 = O12 ‖ "|"
    else
      switch (compare the leftmost pattern of S1 with that of S2)
      case <: remove the leftmost pattern of S1, and append it to s1.
      case >: remove the leftmost pattern of S2, and append it to s2.
      case =:
        Append Substring(s1, s2)
        remove the leftmost patterns γ of S1 and of S2, and append one copy of γ to O12.
return (O12, D1, D2)

PROCEDURE APPEND SUBSTRING (s1, s2). {append the two substrings s1 and s2 to D1 and to D2}
if either s1 or s2 is not empty then
  O12 = O12 ‖ "x"
  if D1 is empty then D1 = D1 ‖ s1 else D1 = D1 ‖ "|" ‖ s1
  if D2 is empty then D2 = D2 ‖ s2 else D2 = D2 ‖ "|" ‖ s2
  set s1 and s2 to be empty

If a 9DLT string S is partitioned into O and D, S can be reconstructed by replacing the ith "x" in O with the ith substring separated by the symbol "|" in D. The unoverlapped operation, denoted by O ∪ D, undertakes the reconstruction of S by combining O with D. The function Unoverlapped presented by Chan and Chang [5] is rewritten as follows. In this algorithm, Function Get Substring(D) finds the substring in D corresponding to an "x" in O.

FUNCTION UNOVERLAPPED (O, D).
for each 9DLT pattern in O
  switch (the leftmost pattern of O)
  case "x": S = S ‖ Get Substring(D)
  case "|": S = S ‖ "|"
  otherwise: S = S ‖ the leftmost pattern of O
  remove the leftmost pattern of O
return S


FIG. 7. The CCBT holding S1 , S2 , and S3 from Fig. 6.

FUNCTION GET-SUBSTRING (D).
while the leftmost pattern γ of D is not "|" and D is not empty
  remove γ from D and append it to s.
if γ = "|" then remove γ from D.
return s

A CCBT can be constructed by inserting the video frames into the tree one by one. The insertion algorithm is the same as that provided by Chan and Chang [5]. This paper takes one example to illustrate the procedure of generating a CCBT. Figure 7 shows the evolution of a CCBT as the 9DLT strings S1, S2, and S3 in Fig. 6 are inserted one after another. Initially, the CCBT is a nil tree. Figure 7a illustrates the CCBT containing only S1. When S2 is inserted, O12 = S1 ∩ S2 = AB4AC6xAE5xBD1xCD2xCE6, D1 = AD3|BC7|BE8|CE2, and D2 = AD2|BC6|BE6|CE3. The CCBT is changed into that in Fig. 7b. The steps to insert S3 into the CCBT are:

step 1: O123 = O12 ∩ S3 = AB4AC6xAE5xCD2xCE6, D12 = x|xBD1x|x, D3 = AD2|BC6BD8BE6|CE3
step 2: D1′ = D1 ∪ D12 = AD3|BC7BD1BE8|CE2, D2′ = D2 ∪ D12 = AD2|BC6BD1BE6|CE3
step 3: O13′ = D1′ ∩ D3 = x|x|x, D13′ = AD3|BC7BD1BE8|CE2, D3′ = AD2|BC6BD8BE6|CE3; O23′ = D2′ ∩ D3 = AD2|BC6xBE6|CE3, D23′ = BD1, D3″ = BD8
step 4: Since the size of O13′ is smaller than the size of O23′, D3 is stored in the right subtree.

Figure 7c illustrates the CCBT containing S1, S2, and S3.
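The following Python sketch (ours, not the paper's code) mirrors the Overlapping and Unoverlapped functions above on strings represented as lists whose items are either (Oi, Oj, rij) triples or the marker strings "x" and "|"; it assumes, as in the paper's usage, that "x" markers occur only in the first operand.

# Sketch only: overlapping and unoverlapped on list-form 9DLT strings.

def _append_substring(s1, s2, o, d1, d2):
    # Append Substring: flush the pending unmatched runs into D1 and D2.
    if s1 or s2:
        o.append("x")
        if d1: d1.append("|")
        if d2: d2.append("|")
        d1.extend(s1); d2.extend(s2)
        s1.clear(); s2.clear()

def overlapping(a, b):
    a, b = list(a), list(b)              # working copies of S1 and S2
    o, d1, d2, s1, s2 = [], [], [], [], []
    while a:
        head = a[0]
        if head == "x":                  # patterns kept in a child node
            s1.append(a.pop(0))
        elif head == "|" or (b and b[0] == "|"):
            # group boundary: drain the other operand's current group
            if head == "|":
                while b and b[0] != "|": s2.append(b.pop(0))
            else:
                while a and a[0] != "|": s1.append(a.pop(0))
            if a and a[0] == "|": a.pop(0)
            if b and b[0] == "|": b.pop(0)
            _append_substring(s1, s2, o, d1, d2)
            o.append("|")
        elif not b or head < b[0]:
            s1.append(a.pop(0))
        elif head > b[0]:
            s2.append(b.pop(0))
        else:                            # identical patterns
            _append_substring(s1, s2, o, d1, d2)
            o.append(a.pop(0)); b.pop(0)
    s2.extend(b)                         # whatever remains of S2
    _append_substring(s1, s2, o, d1, d2)
    return o, d1, d2

def unoverlapped(o, d):
    # Rebuild S from O and D: the ith "x" in O is replaced by the ith
    # "|"-separated substring of D; "|" markers in O are copied through.
    d, s = list(d), []
    for item in o:
        if item == "x":
            while d and d[0] != "|": s.append(d.pop(0))
            if d and d[0] == "|": d.pop(0)
        else:
            s.append(item)
    return s

Tracing this sketch on D2′ ∩ D3 from step 3 above reproduces O23′, D23′, and D3″; a node of the revised CCBT then stores only the O part, while the D parts migrate toward the leaves.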


The time complexities of the above functions are equivalent to those mentioned by Chan and Chang [5]. The overlapping and unoverlapped operations run in O(|S1| + |S2|) and O(|O| + |D|) time, respectively. Here, |S| is the number of 9DLT patterns in the 9DLT string S. The processing time of the insertion operation is only about four times the average number of 9DLT patterns in a 9DLT string.

4.2. Video Frame Sequence Matching

Similar video retrieval searches the database for the videos similar to a query video. The database system computes the minimal VFSM distance between the query frame sequence and each database video frame sequence; it then displays the database videos whose minimal VFSM distances are less than a given threshold. Consider the query frame sequence U = u1u2 . . . um and the video frame sequence V = v1v2 . . . vn, where m ≤ n. The minimal VFSM distance problem is to compute the summation of the image matching distances of the frame pairs (u1, vi1), (u2, vi2), . . . , (um, vim), such that 1 ≤ i1 < i2 < · · · < im ≤ n and the total distance is minimal. Hsu et al. [11] proposed a dynamic programming method to deal with the VFSM problem. This paper provides a much simpler dynamic programming technique to solve the matching problem. In this method, first, all the image matching distances between each frame in U and each frame in V are computed; the image matching schemes mentioned in Section 3 can be used to obtain the image matching distance between two frames. The entry D[i, j] of a two-dimensional array D records the image matching distance between the ith query frame and the jth video frame. Then the minimal VFSM distance between ui ui+1 . . . um and vj vj+1 . . . vn can be calculated recursively by the formulas

D[i, j] = min(D[i, j], D[i, j + 1]),                     for i = m and j = (n − 1)..1,                    (1)
D[i, j] = D[i, j] + D[i + 1, j + 1],                     for i = (m − 1)..1 and j = n − m + i,            (2)
D[i, j] = min(D[i, j + 1], D[i, j] + D[i + 1, j + 1]),   for i = (m − 1)..1 and j = (n − m + i − 1)..1.   (3)

Finally, D[1, 1] gives the minimal VFSM distance between U and V. The computing time of the above algorithm is clearly O(m × n).

THEOREM. The VFSM distance D[1, 1] computed by the above recursive formulas is minimal.

Proof. (a) Formula (1) computes the minimal VFSM distance between um and vj vj+1 . . . vn, for j = 1 to n, and formula (2) computes that between um−h um−h+1 . . . um and vn−h vn−h+1 . . . vn, for h = (m − 1)..1. Thus, formulas (1) and (2) are trivially true. (b) Assume that D[i, j] is the minimal VFSM distance between ui ui+1 . . . um and vj vj+1 . . . vn, and D[i − 1, j] is that between ui−1 ui . . . um and vj vj+1 . . . vn. (c) We prove the truth of formula (3) for computing the minimal VFSM distance between ui−1 ui . . . um and vj−1 vj . . . vn. If ui−1 maps to vj−1 in the sequence matching between ui−1 ui . . . um and vj−1 vj . . . vn, then the minimal VFSM distance is the image matching distance (= D[i − 1, j − 1]) between ui−1 and vj−1 plus the minimal VFSM distance (= D[i, j]) between ui ui+1 . . . um and vj vj+1 . . . vn. Otherwise, the distance (= D[i − 1, j]) is the minimal VFSM distance between ui−1 ui . . . um and vj vj+1 . . . vn. Q.E.D.
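Read as an in-place, right-to-left sweep over the distance table, the recursion is straightforward to realize. The Python sketch below is our own illustration, assuming 0-based indexing and that dist[i][j] already holds the image matching distance between the (i+1)th query frame and the (j+1)th video frame.

# Sketch of the VFSM dynamic program of this section (0-based indices):
# dist is an m x n table of image matching distances, with m <= n.

def vfsm_distance(dist):
    m, n = len(dist), len(dist[0])
    d = [row[:] for row in dist]            # work on a copy, in place
    for j in range(n - 2, -1, -1):           # formula (1): last query frame
        d[m - 1][j] = min(d[m - 1][j], d[m - 1][j + 1])
    for i in range(m - 2, -1, -1):
        jd = n - m + i                       # formula (2): diagonal boundary
        d[i][jd] = d[i][jd] + d[i + 1][jd + 1]
        for j in range(jd - 1, -1, -1):      # formula (3)
            d[i][j] = min(d[i][j + 1], d[i][j] + d[i + 1][j + 1])
    return d[0][0]                           # minimal VFSM distance D[1, 1]

With a single query frame (m = 1), only formula (1) applies and the function simply returns the smallest entry of the row, i.e., the distance to the best-matching video frame.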


In summary, the procedure to retrieve the similar videos is: (a) compute the image matching distances of all the possible frame pairs (ui, vj), (b) compute the minimal VFSM distance D[1, 1] between V and U, and (c) return the videos whose minimal VFSM distances are less than a given threshold. Stage (a), which supplies the image matching distances of all the possible frame pairs, consumes most of the execution time of the whole system. Therefore, we are motivated to accelerate this stage. When the CCBT structure is applied to store a video frame sequence, a significant amount of the redundant information is eliminated, and it becomes unnecessary to match the common information repeatedly. To accelerate the image matching, this paper presents a fast concurrency matching method, based on the CCBT structure, that obtains the image matching distances between a query frame and all the video frames simultaneously. The algorithm matches the shared information only once. Consequently, the CCBT not only reduces the storage space considerably but also shortens dramatically the processing time of computing the distances between query frames and video frames.

Let the CCBT T store the information required to portray a video frame sequence V = v1v2 · · · vn. In the revised CCBT structure, an internal node N consists of three fields N.left, N.right, and N.O, which denote the left child, the right child, and the 9DLT string saved on this node, respectively. If N is a leaf node, then besides the above three fields, N contains an extra field N.No that records the serial numbers i of the frames vi whose component paths are the paths from the root node to N. Matching two 9DLT strings S1 (corresponding to the query frame) and S2 (on one node of T) may yield three kinds of 9DLT patterns in S1: (1) a pattern also appearing in S2, (2) a pattern different from some pattern in S2 but with the same name, and (3) the others. Obviously, the first kind of pattern is a matched pattern, and the second one is an unmatched pattern. The third one is an uncertain pattern because, at this point, we do not know whether any pattern in the 9DLT strings on the descendant nodes matches it. Function Concurrency-Match(Si, T, 0) matches Si with the 9DLT string on the root node first, and then matches the uncertain patterns with the 9DLT strings on its child nodes recursively. Si is the 9DLT string of the ith frame ui of the query frame sequence u1u2 · · · um, and T is the CCBT. The function Match is devised to count the number of matched patterns and to collect the uncertain patterns between two 9DLT strings. Num is the number of matched patterns, Uncertain holds the uncertain patterns, and d keeps the image matching distance.

PROCEDURE CONCURRENCY-MATCH (Si, T, d).
(Num, Uncertain) = Match(Si, T.O)
d = d + Num
if T is a leaf node then
  for each image index j recorded in T.No
    D[i, j] = (the number of 9DLT patterns in the original Si) − d
else
  Concurrency-Match(Uncertain, T.left, d)
  Concurrency-Match(Uncertain, T.right, d)
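A compact Python rendering of this scheme (our sketch, not the paper's code; the Match function it relies on is given in pseudocode below) might look as follows, assuming a node object with fields O, left, right, and No, and query strings represented as name-ordered lists of (Oi, Oj, rij) triples; the markers "x" and "|" in a node's O field are skipped, since their names are smaller than any pattern name.

# Sketch only: assumed node layout and concurrency matching over a CCBT.

from dataclasses import dataclass, field

@dataclass
class Node:
    O: list                                   # 9DLT patterns plus "x"/"|" markers
    left: "Node" = None
    right: "Node" = None
    No: list = field(default_factory=list)    # frame serial numbers (leaf nodes)

def match(query, node_o):
    # Count query patterns found in node_o; collect the still-uncertain ones.
    num, uncertain, i = 0, [], 0
    node_pats = [p for p in node_o if p not in ("x", "|")]
    for q in query:
        while i < len(node_pats) and node_pats[i][:2] < q[:2]:
            i += 1
        if i < len(node_pats) and node_pats[i][:2] == q[:2]:
            if node_pats[i] == q:
                num += 1                       # matched pattern
            i += 1                             # same name, different code: unmatched
        else:
            uncertain.append(q)                # may still match in a descendant node
    return num, uncertain

def concurrency_match(query, node, d, total, D, i):
    # Fill row D[i][*] with the distances between query frame i and every
    # video frame stored in the subtree rooted at node; total = |original Si|.
    num, uncertain = match(query, node.O)
    d += num
    if node.left is None and node.right is None:   # leaf node
        for j in node.No:
            D[i][j] = total - d
    else:
        concurrency_match(uncertain, node.left, d, total, D, i)
        concurrency_match(uncertain, node.right, d, total, D, i)

A call such as concurrency_match(Si, root, 0, len(Si), D, i) fills the ith row of D in one traversal of the tree, so the patterns stored near the root are matched once for all the video frames that share them.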


FUNCTION MATCH (S1, S2).
for each 9DLT pattern of S1
  switch (compare the name of the leftmost pattern of S1 with that of S2)
  case >: remove the leftmost pattern of S2.
  case <: remove the leftmost pattern of S1 and append it to Uncertain.
  case =:
    if the leftmost patterns of S1 and of S2 are the same then Num = Num + 1.
    remove the leftmost patterns of S1 and of S2.
return (Num, Uncertain)

Procedure Concurrency-Match scans all of the data in the CCBT T just once. Hence, the information shared by some frames in V is matched only once. Since a video is composed of many consecutive images that form a gradually changing data set, there exists much redundant information among them. Therefore, a great part of the processing time to get the image matching distance between each query frame and each video frame is eliminated. Figure 8 illustrates the process of computing the distances between image f4 and images

FIG. 8. The actions simultaneously computing the image matching distances.


f1, f2, and f3 in Fig. 5. Each stage that matches the 9DLT string held on one node of T generates three sets of data α, β, and γ, which list the matched, unmatched, and uncertain patterns, respectively. There are six 9DLT patterns in S4; therefore, the image matching distances between f4 and f1, f2, and f3 are 3, 1, and 0, respectively.

5. SUMMARIES

This section scrutinizes the properties of the techniques suggested in this paper and compares them with those presented by Hsu et al. [11]. Hsu et al. used an ordered labeled 2D C-tree as the spatial representation of an image; O(n^2) processing time is required to construct a 2D C-tree, where n is the number of objects in the symbolic image. This paper defines a 9DLT string to describe the spatial information of an image. It also takes O(n^2) time to constitute a 9DLT string from a symbolic image, yet it is simpler to form a 9DLT string than to create a 2D C-tree. The editing distance between two 2D C-trees is utilized by Hsu et al. to measure the similarity of their related images. The computing time of the editing distance between two trees T1 and T2 is O(|T1| × |T2| × min(depth(T1), leaves(T1)) × min(depth(T2), leaves(T2))). However, the execution time for computing the image matching distance between two 9DLT strings S1 and S2 is merely O(p^2 + q^2). Here, p (≤ |T1|) and q (≤ |T2|) are the numbers of objects in the corresponding symbolic images of S1 and S2, respectively. Moreover, the distance between two 9DLT strings offers a more effective measure of image similarity than the editing distance between two 2D C-trees.

Both methods make use of a dynamic programming technique to solve the VFSM problem, and both take O(m × n) steps to acquire the minimal VFSM distance, where m and n are the numbers of query frames and database video frames. The recursive formula offered by Hsu et al. needs to compute |vi| and |uj|; moreover, each time it selects the minimal item from three items, whereas our method picks the minimal one from only two. Since Hsu et al.'s method is more complicated, its cost per step is higher than that of our method. Hsu et al. used four tables to maintain the sizes of the query frames, the sizes of the video frames, the image matching distances between query frames and video frames, and the minimal VFSM distances between the prefix subsequences of the query frame sequence and the video frame sequence. Nevertheless, one table is sufficient in our proposed method.

This paper adopts the revised CCBT structure to hold the spatial information among objects in a video frame sequence. Based on the CCBT, the concurrency match operation calculates the distances between one query frame and all of the frames in a video simultaneously; it matches the common information among these video frames only once. This data structure not only reduces significantly the storage space required to keep a video frame sequence but also diminishes considerably the computing time of the distances between query frames and video frames. Let m and n be the numbers of query frames and video frames, respectively, and assume that each frame contains p objects. The execution time to procure the distances between the query frames and the video frames is far less than O(m × n × p^2), since the shared information is matched only once. On the other hand, the processing time of Hsu et al.'s method is O(Σ_{i=1}^{m} Σ_{j=1}^{n} (|Ti| × |Tj| × min(depth(Ti), leaves(Ti)) × min(depth(Tj), leaves(Tj)))). In the worst case, O(m × n × q^4) time is required to execute their algorithm, where q (≥ p) is the number of nodes in Ti and in Tj.
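For a rough, purely illustrative sense of scale (the numbers are chosen by us and are not reported in the paper): with m = 4 query frames, n = 30 video frames, and p = q = 10 objects per frame, the bound O(m × n × p^2) corresponds to on the order of 4 × 30 × 10^2 = 12,000 pattern comparisons, whereas the worst case O(m × n × q^4) of the tree matching corresponds to on the order of 4 × 30 × 10^4 = 1,200,000 node operations; the CCBT-based concurrency matching lowers the former figure further by matching shared patterns only once.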


In addition, the CCBT structure achieves better efficiency if the tree is balanced, since fewer extra 9DLT patterns "x" and "|" are used. A balanced tree can be obtained when the frames are inserted in random order [5]. This paper offers two randomized selection strategies to determine the order of inserting the frames of a video into a CCBT. The first one, called middle selection, recursively obtains the insertion order by uniformly partitioning the video frame sequence. This strategy appoints the insertion order by executing the following steps: (a) Divide the video frame sequence into a left subsequence, the middle frame, and a right subsequence. (b) Insert the middle frame into the CCBT. (c) Use (a) and (b) recursively to find the middle frames of the left and of the right subsequences, and then insert them into the CCBT, separately. The other strategy, named randomized selection, specifies the insertion order by a random number generator. Since a video frame sequence is composed of a set of gradually changing frames, both strategies keep the CCBT roughly balanced.

6. EXPERIMENTS

One hundred video frame sequences are generated by this system to be used as the test data. Each video frame sequence includes 34 frames. Four frames are picked randomly from these 34 frames as a query frame sequence; the remaining frames are considered to be a database video frame sequence. Every frame contains 5 to 12 objects, decided by a random number generator. An object is tagged with a distinct name o and an ordered pair of coordinates. The coordinates (x, y) of the object o in the ⌊i × 34/t⌋-th frame of a video frame sequence v are assigned at random, where 0 ≤ i ≤ t, and (x, y) is regarded as the ith turning point of

FIG. 9. A portion of test video frames.


FIG. 10. Experiment results.

o in v. The values of x and y, specified at random, are bounded between 0 and 120. t is the number of turning points of o in v; the value of t is given randomly from 2 to 5. The coordinates of o in the frames between the ⌊i × 34/t⌋-th and the ⌊(i + 1) × 34/t⌋-th frames of v are obtained by interpolating between the ith and the (i + 1)-th turning points of o. In addition, an object with coordinates (x, y) is eliminated from the frame where it is located if either x or y is greater than 100 or less than 20. Figure 9 shows a set of video frames extracted from a test database video; to make it more readable, each object is represented by a small picture.

The goal of these experiments is to investigate the storage space required by the revised CCBT structure and the processing time to compute the image matching distance between a query frame and a video frame based on the revised CCBT structure. In these experiments, the video frames (including database video frames and query frames) are first transformed into a series of 9DLT strings. The average storage space required to hold the 9DLT strings corresponding to all of the frames in a database video is 1413 bytes. The first experiment then sequentially computes the distance between each query frame and each database video frame; the average execution time to obtain the image matching distance between a query frame and a database video frame is 0.00311 s. The next three experiments put the 9DLT strings corresponding to all of the frames in a database video v into a CCBT and then calculate the distances between each query frame and all of the frames in v simultaneously. The first of these three experiments determines the insertion order of a database video frame by its serial number in v; it takes on average 541 bytes of storage space to hold a CCBT and 0.00158 s to obtain the image matching distance between a query frame and a database video frame. The second experiment uses the middle selection strategy to choose the frame to insert; on average, it employs 438 bytes of storage space to keep a CCBT and 0.00121 s to compute the image matching distance between a query frame and a database video frame. The third experiment chooses the frame to insert by a random number generator; here, the storage space required for storing a CCBT and the running time for measuring the image matching distance between a query frame and a database video frame are on average 452 bytes and 0.00125 s, respectively. Figure 10 illustrates these experiment results.

7. CONCLUSIONS

Video is a medium with high complexity that contains a large amount of spatial and temporal information. The spatial and the temporal relations play an important role in indexing the video information. This paper makes use of a 9DLT string to depict the spatial relationships among the objects of a symbolic image. Two simple metrics of similarity


in image matching based on the 9DLT strings are presented to solve the subimage and similar image retrieval problems as well. Both metrics measure image similarity more effectively than the editing distance based on the 2D C-tree representation. This paper modifies the CCBT structure and adopts the revised CCBT structure to store a set of 9DLT strings. With this structure, a great portion of the storage space required by a CCBT and of the processing time to obtain the distances between query frames and video frames is saved. The CCBT structure achieves better efficiency when the CCBT is a balanced tree; two insertion order selection strategies, middle selection and randomized selection, are therefore provided, and they keep the CCBT close to balanced. Experimental results show that, by using both strategies, a high reduction in storage space and processing time can be achieved. In addition, a fast dynamic programming approach to calculate the minimal VFSM distance between a query frame sequence and a video frame sequence is proposed, and the computed VFSM distance has been proven to be minimal.

REFERENCES

1. T. Arndt and S. K. Chang, Image sequence compression by iconic indexing, in IEEE Workshop on Visual Languages, Los Alamitos, CA, October 1989, pp. 177–182.
2. S. K. Bhatia and C. L. Sabharwal, A fast implement of a perfect hash function for picture objects, Pattern Recognit. 27, 1994, 365–376.
3. S. K. Bhatia and C. L. Sabharwal, Image databases and near perfect hash table, Pattern Recognit. 30, 1997, 1867–1876.
4. Y. K. Chan and C. C. Chang, Image retrieval by string-mapping, in The International Symposium on Combinatorics and Applications, Tianjin, P. R. China, June 1996, pp. 65–76.
5. Y. K. Chan and C. C. Chang, An efficient data structure for storing similar binary images, in Proceedings of the 5th International Conference on Foundations of Data Organization (FODO'98), Kobe, Japan, Nov. 1998, pp. 268–275.
6. C. C. Chang, Spatial match retrieval of symbolic pictures, J. Inform. Sci. Eng. 1991, 405–422.
7. C. C. Chang and S. Y. Lee, A retrieval of similar pictures on pictorial database, Pattern Recognit. 24, 1991, 675–680.
8. T. S. Chua and L. Q. Ruan, A video retrieval and sequencing system, ACM Trans. Inform. Syst. 13, 1995, 373–407.
9. F. J. Hsu, S. Y. Lee, and P. S. Lin, 2D C-tree spatial representation for iconic image, in Proceedings of the 2nd International Conference on Visual Information Systems (VISUAL '97), San Diego, CA, Dec. 1997, pp. 287–296.
10. F. J. Hsu, S. Y. Lee, and P. S. Lin, Similarity retrieval by 2D C-trees matching in image databases, J. Visual Commun. Image Rep. 9, 1998, 87–100.
11. F. J. Hsu, S. Y. Lee, and P. S. Lin, Video data indexing by 2D C-trees, J. Visual Language Comput. 9, 1998, 375–397.
12. V. N. Gudivada and V. V. Raghavan, Design and evaluation of algorithms for image retrieval by spatial similarity, ACM Trans. Inform. Syst. 13, 1995, 115–144.
13. S. Y. Lee and F. J. Hsu, 2D C-string: A new spatial knowledge representation for image database system, Pattern Recognit. 23, 1995, 1077–1087.
14. S. Y. Lee, M. K. Shan, and W. P. Yang, Similarity retrieval of iconic image database, Pattern Recognit. 22, 1989, 675–682.
15. S. Y. Lee, M. C. Yang, and J. W. Chen, Signature file as spatial filter for iconic image database, J. Visual Language Comput. 3, 1993, 373–397.
16. T. W. Lin, Compressed quadtree representations for storing similar images, Image Vision Comput. 15, 1997, 883–843.
17. U. Manber, Introduction To Algorithms: A Creative Approach, Addison-Wesley, Reading, MA, 1989.
18. A. D. Narasimhalu, Special section on content-based retrieval, ACM Multimedia Syst. 3, 1995, 226–249.
19. K. Shearer, S. Venkatesh, and D. Kieronska, Spatial indexing for video databases, J. Visual Commun. Image Rep. 7, 1996, 325–335.