Efficient video segment matching for detecting temporal-based video copies

Neurocomputing 105 (2013) 70–80

Chih-Yi Chiu*, Tsung-Han Tsai, Cheng-Yu Hsieh
Department of Computer Science and Information Engineering, National Chiayi University, No. 300 Syuefu Road, Chiayi City 60004, Taiwan

Article history: Available online 8 October 2012

Abstract

Content-based video copy detection has attracted increasing attention in the video search community due to the rapid proliferation of video copies over the Internet. Most existing techniques of video copy detection focus on spatial-based video transformations such as brightness enhancement and caption superimposition. These can be handled efficiently by the clip-level matching technique, which summarizes the full content of a video clip as a single signature. However, temporal-based transformations involving random insertion and deletion operations pose a great challenge to clip-level matching. Although some studies employ the frame-level matching technique to deal with temporal-based transformations, the high computation complexity might make them impractical in real applications. In this paper, we present a novel search method to address the above-mentioned problems. A given query video clip is partitioned into short segments, each of which linearly scans over the video clips in a dataset. Rather than performing exhaustive search, we derive the similarity upper bounds of these query segments as a filter to skip unnecessary matching. In addition, we present a min-hash-based inverted indexing mechanism to find candidate clips from the dataset. Our experimental results demonstrate that the proposed method is robust and efficient in dealing with temporal-based video copies. © 2012 Elsevier B.V. All rights reserved.

Keywords: Content-based retrieval; Near-duplicate detection; Spatial and temporal video transformation

1. Introduction

The rapid development of multimedia technologies in recent years has spurred enormous growth in the amount of digital video content available for public access. Since such content can be easily duplicated, edited, and disseminated via the Internet, the proliferation of video copies has become a serious problem. Video copy detection techniques have therefore generated a great deal of interest in the research community and the multimedia industry. With such techniques, content owners, e.g., Disney, can track particular videos with respect to royalty payments and copyright infringements, and platform providers, e.g., YouTube, can remove identical copies uploaded by users. There are two main techniques for detecting video copies: digital watermarking and content-based video copy detection (CBVCD). Digital watermarking [17,20] embeds identification codes of the content owner in source videos. Then, during detection, the watermarks are extracted to verify the integrity of the image/video content. The CBVCD technique, on the other hand, takes perceptual features of the video content as a unique signature that distinguishes it from other video content. Since CBVCD does not embed any information in

* Corresponding author. Tel.: +886 5 2717228; fax: +886 5 2717741. E-mail addresses: [email protected], [email protected] (C.-Y. Chiu), [email protected] (T.-H. Tsai), [email protected] (C.-Y. Hsieh). http://dx.doi.org/10.1016/j.neucom.2012.04.036

videos, the quality of the video content is not affected. Moreover, the two techniques can complement each other in pursuit of more robust detection. In this paper, we focus on the CBVCD technique. For CBVCD researchers, the major concerns are the feature representation and the matching technique for videos. The design issues of the feature representation are compactness and robustness against various video transformations. Generally, video transformations can be categorized as spatial-based or temporal-based. Spatial-based transformations, such as brightness enhancement and caption superimposition, modify the frame images of the source video. On the other hand, temporal-based transformations, such as frame rate change and segment insertion/deletion, alter the temporal context of the source video. In contrast to spatial-based transformations, temporal-based transformations have received relatively little attention in CBVCD research. However, temporal-based transformations might cause a serious desynchronization problem between the source and the copy. For example, in Fig. 1, we chop a snippet from a source video and insert it into an unrelated video clip to generate a new video. Since only a small part of the new video is copied from the source video, it poses a great challenge to unveil their relation in an effective and efficient manner [18,26,32]. Two main video matching techniques, namely, clip-level and frame-level, are widely used in CBVCD. The clip-level matching technique summarizes the full content of a video clip into a compact signature for fast matching. However, the fine details between videos, such as their temporal relations, are usually ignored, so the


clip-level matching is unsuitable for dealing with temporal-based transformations. On the other hand, the frame-level matching technique, which analyzes the similarity between frame pairs from the spatial and temporal aspects, can overcome temporal-based transformations to some extent. However, the computation overhead makes frame-level matching impractical for large-scale retrieval. Fig. 2 gives examples of the two matching techniques. In this study, we present a novel search method to address the above-mentioned problems in detecting temporal-based video copies. Suppose that a query clip is submitted to search a dataset of target clips. The query clip is first partitioned into short query segments, each of which performs linear scanning over a target clip. By identifying the interrelations among the partitioned query segments, we derive their similarity upper bounds, which are used to skip unnecessary matching. In addition, we propose a coarse-to-fine CBVCD framework that leverages the characteristics of two different feature representations, namely, the ordinal-based signature [4] and the scale invariant feature transform (SIFT)-based signature [19]. The ordinal-based signature is used to represent a video clip. We apply the min-hash theory [1] to hash the


ordinal-based signature. Based on the min-hash signature, we construct an inverted index structure to index the target clips. If the min-hash signatures of the query clip and a target clip are hit, we then perform linear scanning for the two clips. Here the SIFT-based signature is employed instead of the ordinal-based signature because the SIFT-based signature has stronger discriminative power. The flow of the proposed framework is illustrated in Fig. 3. Compared with clip-level matching, matching with short video segments increases the opportunity of finding temporal-based video copies, but at the price of increased computation overhead. The main contribution of the proposed search method is the video partitioning and pruning strategy, which makes the linear scanning process more efficient. As shown in Fig. 4, the query clip and the target clip are treated differently: the

Fig. 1. An example of the temporal-based video copy generated by random insertion/deletion operations. The crosses indicate that the corresponding video content has been deleted.

Fig. 2. Video matching techniques: (a) clip-level matching; (b) frame-level matching.

Fig. 3. The flow of the proposed CBVCD framework.

Fig. 4. The proposed search method for partitioned video segments.


query clip is partitioned into overlapped query segments framed by a fixed-length sliding window, while the target clip is partitioned into non-overlapped target segments of the same length as the query segment. This design increases the accuracy of video sequence matching and the robustness to temporal-based transformations [18]. Since adjacent query segments are highly overlapped, they share a great portion of video content. We exploit the overlapping information to derive the similarity upper bound for each matching pair of a query segment and a target segment. The computation of the similarity upper bound is relatively lightweight compared with that of the actual similarity; it can be used as a fast filter to determine whether a matching pair has to be further examined with the actual similarity. Since most of the similarity upper bounds are below the detection threshold, a substantial number of matching pairs can be pruned to expedite the linear scanning process. Consequently, the computation overhead of segment-level matching is limited. The proposed method is thus more efficient than frame-level matching and more accurate than clip-level matching. The remainder of this paper is organized as follows. Section 2 contains a review of related work on CBVCD. Section 3 describes the proposed methods. We discuss the experimental results in Section 4. Section 5 summarizes our conclusions.

2. Related work

2.1. Feature representation

Feature representations can be categorized into three types: global, local, and spatiotemporal features. The global features, such as the ordinal measure [4] and the color-shift/centroid-based signature [11], characterize the entire spatial region of a video frame. The merit of global features is their compactness. For example, only a 9-dimensional ordinal measure is needed for a 3×3-block frame. Such a compact feature is very effective at resisting spatial transformations [7,11,12]. The local features, such as the SIFT descriptor [19] and speeded up robust features (SURF) [3], capture local region properties of the keypoints in a frame. In general, dozens of keypoints may be extracted from a frame. Some studies aggregate keypoints in a bag-of-words form [24] and recast the enormous computation among keypoints as vector space calculation. These have demonstrated stronger discrimination and better robustness compared with the global features [9,15]. Moreover, because different features might complement each other, their combination can improve the search accuracy. For example, Chiu et al. [8] showed that the combination of the ordinal and SIFT-based features is more robust than either alone; Wu et al. [29] employed the color histogram for fast rejection and the PCA–SIFT descriptor for detailed matching. Song et al. [25] and Yang et al. [30] presented an optimization approach that learns the local and global structures of different features to find the best video representation scheme. The spatiotemporal features consider both spatial and temporal information in continuous video frames. For example, Hoad and Zobel [11] proposed the color-shift/centroid-based signature, which computes the changes in color histograms and pixel locations of neighboring frames. Law-To et al. [16] employed keypoint trajectories that are generated by connecting the continuous keypoints of adjacent frames. Basharat et al. [2] extracted SIFT trajectories and analyzed their motion and spatial properties to form spatiotemporal volumes. Shang et al. [22] proposed a w-shingling method that constructs a frame's shingle by combining the ordinal relation features of the frame and its next w−1 frames. Zhou and Chen [33] introduced the video cuboid

signature, which characterizes a block of pixels over adjacent keyframes. Esmaeili et al. [10] extracted temporally informative representative images from a video sequence and applied a 2D-DCT to generate a spatiotemporal fingerprint. Although embedding temporal information in the feature representation can enhance the discriminative power, the spatiotemporal features have difficulty in handling temporal transformations; the source and copy videos might produce different features if the temporal context is modified.

2.2. Matching technique

We review two matching techniques in CBVCD: clip-level matching and frame-level matching. In clip-level matching, each video clip is treated as a basic unit, and the complete clip is summarized in a compact signature. For example, Cheung and Zakhor [6] presented a video signature generated by randomly sampling m frames in a video clip. The similarity between two video clips was computed by the proposed Voronoi video similarity function of two m-tuple video signatures. Huang et al. [13] expressed a video clip as a coordinate system's origin and axes by applying principal component analysis to exploit the distribution of the frame content. A modified B+-tree was used to index the coordinate system with a two-dimensional distance for the video clip. Shang et al. [22] aggregated the visual shingles of all the frames in a video as a bag-of-shingles. A histogram intersection kernel was integrated into an inverted index structure for fast computation of the similarity between two video clips. The concept of clip-level matching can be easily extended to shot-level matching, i.e., a clip is partitioned into shots and each shot is represented as a signature [9,29]. With a compact representation and an efficient index structure, clip-level matching achieves an excellent execution speed in large-scale retrieval. However, the compact representation is a trade-off for robustness. If a large part of the video content of two video clips is near-duplicate, a clip-level search can detect the phenomenon effectively. By contrast, in some temporal-based transformation cases, where only a small fraction of the video content is near-duplicate, the similarity between the two video clips would be very low, and might therefore induce a false negative, i.e., the true copy is not detected. Frame-level matching seeks to identify the frame pair relation between video segments. The temporal information is usually integrated in the similarity measurement to enhance the detection robustness. Law-To et al. [16] proposed a voting algorithm to find the best spatiotemporal offset of the trajectories of two keypoints. Joly et al. [15] applied a random sample consensus (RANSAC) algorithm to estimate the temporal transformation parameters between two frame sequences iteratively. Since the above methods are designed to estimate only one temporal model for a copy video, they might fail when two or more temporal models exist in the copy video. Some dynamic programming-based methods have been proposed to handle minor temporal variations when matching two frame sequences [7,12]; they also have difficulty in detecting temporal-based transformations induced by random insertion and deletion. Hence, more sophisticated methods have been presented to address the problem. For example, Chiu et al. [8] employed the Hough transform to detect a continuous pattern with high similarity scores between frame pairs. Shen et al. [23] transformed the frame matching task into a bipartite graph problem and solved it by the maximum size matching algorithm. Yeh and Cheng [31] built a directed weighted graph of frame pairs based on temporal order consistency to find the best alignment. Although these methods can resist temporal-based transformations to some extent, the computational overhead is quite significant when matching large numbers of distorted frame sequences.


3. The proposed method

In this study, the CBVCD task is formulated as follows. Let Q = {q_i | i = 1, 2, ..., n_Q} be a query clip with n_Q frames, where q_i is the ith query frame, and let T = {t_j | j = 1, 2, ..., n_T} be a target clip with n_T frames, where t_j is the jth target frame. Assume that Q is a source video and T is a suspect video. Suppose that a similarity function sim(Q_r, T_s) is defined to measure the similarity between two video segments Q_r ⊆ Q and T_s ⊆ T. If sim(Q_r, T_s) is greater than a pre-defined threshold θ, we consider T_s to be a copy of Q_r. The goal of the CBVCD task is to devise a robust and efficient search process to determine whether there is a segment pair (Q_r, T_s) between Q and T such that sim(Q_r, T_s) > θ.

3.1. Video segment representation and similarity comparison

We employ SIFT descriptors and adopt the bag-of-words model to characterize each video frame. The bag-of-words model has been widely used in object recognition and video retrieval recently [14,28]. Its compact representation and strong discriminative power are suitable for the CBVCD task [8,24,29]. In this study, we apply the model to construct the video feature representation as follows. A training dataset collected from another video collection is used to generate a codebook of D codewords by applying the Linde–Buzo–Gray (LBG) algorithm [21]. Then, for each query and target frame, denoted as f, we extract its SIFT descriptors and quantize them according to the codebook. The frame f is represented by a signature in the form of a D-dimensional binary vector \vec{f} ∈ {0,1}^D. That is, we regard \vec{f} as a bag of SIFT words and the SIFT-based signature \vec{f} as an indicator of the occurrence of codewords. If the dth codeword occurs in f, we denote the signature's dth element as \vec{f}(d) = 1; otherwise \vec{f}(d) = 0.

Let V = {f_1, f_2, ..., f_{n_V}} be a video segment with n_V frames. Its SIFT-based signature \vec{V} is defined by

\vec{V}(d) = \max(\vec{f}_1(d), \vec{f}_2(d), \ldots, \vec{f}_{n_V}(d)),   (1)

for d = 1, 2, ..., D. Clearly \vec{V}(d) = 1 indicates that the dth codeword occurs in V. The signature is represented in a binary form rather than a frequency form (i.e., the number of codeword occurrences); our experiment shows that the binary form can be calculated more efficiently without sacrificing accuracy. The similarity between two video segments V_1 and V_2 is defined by the Dice coefficient

sim(V_1, V_2) = \frac{2 \cdot |\vec{V}_1 \cap \vec{V}_2|}{|\vec{V}_1| + |\vec{V}_2|},   (2)

where |\vec{A}| returns the cardinality of the vector \vec{A}, i.e., the number of nonzero elements in \vec{A}. The computation time of (2) is O(D), contributed by the intersection operation. In the experimental section, we will study the impact of the number of SIFT-based signature dimensions D.
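To make the representation concrete, the following Python sketch builds the binary bag-of-SIFT-words signature of Eq. (1) and the Dice similarity of Eq. (2). It is a minimal sketch under our own assumptions — the codebook is a D×128 array of codeword centers, descriptors are supplied per frame, and all function names are illustrative rather than the paper's:

import numpy as np

def frame_signature(descriptors, codebook):
    """Quantize a frame's SIFT descriptors (m x 128) against a D x 128
    codebook and mark which codewords occur (a binary D-dimensional vector)."""
    D = codebook.shape[0]
    sig = np.zeros(D, dtype=bool)
    if len(descriptors) == 0:
        return sig
    # nearest codeword for each descriptor (brute force, for clarity only)
    dists = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    sig[np.unique(dists.argmin(axis=1))] = True
    return sig

def segment_signature(frame_sigs):
    """Eq. (1): element-wise max (logical OR) over the frames of a segment."""
    return np.logical_or.reduce(frame_sigs, axis=0)

def dice_similarity(v1, v2):
    """Eq. (2): 2*|V1 ∩ V2| / (|V1| + |V2|) on binary signatures."""
    denom = np.count_nonzero(v1) + np.count_nonzero(v2)
    return 0.0 if denom == 0 else 2.0 * np.count_nonzero(v1 & v2) / denom

Because the signatures are binary, the intersection in (2) reduces to a bitwise AND, which is why the binary form is faster than the histogram intersection that frequency counts would require (cf. Section 4.2).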

3.2. Query segment pruning

Suppose that a query clip Q is partitioned into a set of R query segments, denoted as {Q_r | r = 1, 2, ..., R}. The idea of query segment pruning is to skip the computation of sim(Q_r, T_s) for some query segments Q_r. We employ a lightweight test to determine whether Q_r can be pruned during the matching phase. Let Q_{r*}, r* ∈ {1, 2, ..., R}, be the pivot query segment selected from {Q_r}. Suppose that we have calculated the similarity between Q_{r*} and T_s, i.e., sim(Q_{r*}, T_s). Then the similarity between Q_r and T_s, r ∈ {1, 2, ..., R}, r ≠ r*, satisfies the following inequality:

sim(Q_r, T_s) = \frac{2 \cdot |\vec{Q}_r \cap \vec{T}_s|}{|\vec{Q}_r| + |\vec{T}_s|} \le \frac{2 \cdot (|\vec{Q}_{r*} \cap \vec{T}_s| + |\vec{Q}_r \setminus \vec{Q}_{r*}|)}{|\vec{Q}_r| + |\vec{T}_s|} \equiv sim_{upbnd}(Q_r, T_s),   (3)

where |\vec{A} \setminus \vec{B}| = |\vec{A}| - |\vec{A} \cap \vec{B}| returns the set difference cardinality of the vectors \vec{A} and \vec{B}. Eq. (3) specifies the similarity upper bound between Q_r and T_s. We leverage the derived similarity upper bounds to prune some query segments without matching. Suppose the cardinality of each query segment |\vec{Q}_r| and its set difference to the pivot query segment |\vec{Q}_r \setminus \vec{Q}_{r*}| are pre-calculated when the query clip is given. As sim(Q_{r*}, T_s) is computed using (2), we obtain the values of |\vec{Q}_{r*} \cap \vec{T}_s| and |\vec{T}_s|. Therefore, sim_upbnd(Q_r, T_s) can be calculated immediately for each Q_r according to (3). If sim_upbnd(Q_r, T_s) is not greater than the predefined threshold θ, we do not need to calculate the similarity sim(Q_r, T_s) between Q_r and T_s because sim(Q_r, T_s) ≤ θ is guaranteed; otherwise we have to calculate sim(Q_r, T_s) using (2). Compared with (2), which takes O(D) time, the computation cost of (3) is relatively lightweight, as it takes only O(1) time. Thus, the computation cost for the segment pairs sim(Q_r, T_s), r = 1, 2, ..., R, can be reduced by pruning query segments with low similarity upper bounds.

Fig. 5 shows an example of the similarity vs. the similarity upper bound. Two query segments, denoted as Q1 and Q2, are partitioned from a query clip Q; they are used to match against a target clip T. We select Q1 to be the pivot query segment and plot sim(Q1, Ts) (solid line), sim(Q2, Ts) (broken line), sim_upbnd(Q2, Ts) (vertical bars), and the threshold θ (dashed horizontal line) over the time interval s ∈ [100, 200]. The similarity scores are generally very low except in the neighborhood of peaks, which indicate a possible copy between Q1 and Ts. Since most values of sim_upbnd(Q2, Ts) do not exceed the threshold, a substantial number of computations of sim(Q2, Ts) can be skipped.

Fig. 5. An example of the similarity vs. the similarity upper bound. The X axis indicates the time interval s, and the Y axis indicates the similarity or similarity upper bound score.
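The pruning test of Eq. (3) then amounts to a constant-time comparison per query segment once the pivot has been matched against T_s. A sketch under the same assumptions as above; in a real implementation, the per-segment cardinalities and set differences to the pivot would be precomputed once per query, as noted in the text:

import numpy as np

def match_with_pruning(query_segs, target_seg, pivot_idx, theta):
    """Compute sim(Q_r, T_s) for all query segments, skipping any segment
    whose upper bound of Eq. (3) does not exceed the threshold theta."""
    pivot = query_segs[pivot_idx]
    inter_pivot = np.count_nonzero(pivot & target_seg)        # |Q_r* ∩ T_s|
    card_t = np.count_nonzero(target_seg)                     # |T_s|
    sims = {pivot_idx: 2.0 * inter_pivot / (np.count_nonzero(pivot) + card_t)}
    for r, q in enumerate(query_segs):
        if r == pivot_idx:
            continue
        card_q = np.count_nonzero(q)                          # precomputable offline
        diff_to_pivot = card_q - np.count_nonzero(q & pivot)  # |Q_r \ Q_r*|, precomputable
        upper = 2.0 * (inter_pivot + diff_to_pivot) / (card_q + card_t)
        if upper <= theta:
            continue  # pruned: sim(Q_r, T_s) <= upper <= theta
        sims[r] = 2.0 * np.count_nonzero(q & target_seg) / (card_q + card_t)
    return sims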


Fig. 6. Two strategies of segment-level matching: (a) NQS and (b) OQS.

The selection of the pivot query segment Q_{r*} is an interesting issue. Intuitively, the lower the similarity upper bound of a query segment, the higher the probability that its similarity computation can be skipped. Consider the set difference between two query segments |\vec{Q}_r \setminus \vec{Q}_{r*}| in (3). It is clear that |\vec{Q}_r \setminus \vec{Q}_{r*}| should be small to lower the similarity upper bound. In other words, Q_{r*} should have the smallest set difference with the other Q_r for r ≠ r*. We use the following expression to select the pivot query segment Q_{r*}:

r^* = \arg\min_{r'} \sum_{r=1}^{R} |\vec{Q}_r \setminus \vec{Q}_{r'}|, \quad r' \in \{1, 2, \ldots, R\}.   (4)

For each candidate query segment Q_{r'}, the above equation sums up all its set difference cardinalities with the other query segments, and selects the one with the minimum summation as the pivot query segment Q_{r*}; it is expected to yield the highest probability of skipping the similarity computation.
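The pivot choice of Eq. (4) can be computed by brute force over the R query segments; a small sketch with our own helper names:

import numpy as np

def select_pivot(query_segs):
    """Eq. (4): pick the query segment whose total set-difference
    cardinality against all query segments is smallest."""
    best_r, best_cost = 0, float("inf")
    for r_prime, cand in enumerate(query_segs):
        # sum_r |Q_r \ Q_r'|; the r == r' term is zero and therefore harmless
        cost = sum(np.count_nonzero(q) - np.count_nonzero(q & cand)
                   for q in query_segs)
        if cost < best_cost:
            best_r, best_cost = r_prime, cost
    return best_r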

3.3. Two strategies for video clip partition

We design two video clip partition strategies to accomplish segment-level matching, namely, non-overlapped query segment (NQS) and overlapped query segment (OQS). The NQS strategy partitions Q into a set of non-overlapped equal-length query segments {Q_r}. Then, a sliding window whose size is equal to the length of Q_r is used to scan over T and extract target segments {T_s}. Note that each of {T_s} overlaps its neighboring target segments. Denote the length of the partitioned video segment as l, the rth query segment as Q_r = {q_{(r-1)·l+1}, q_{(r-1)·l+2}, ..., q_{r·l}}, r ∈ {1, 2, ..., R = ⌈n_Q/l⌉}, and the sth target segment extracted by the sliding window as T_s = {t_s, t_{s+1}, ..., t_{s+l-1}}, s ∈ {1, 2, ..., n_T − l + 1}. Then we compute the similarity sim(Q_r, T_s) for each segment pair; the total time required to match the segments of Q and T is (n_Q/l)(n_T − l + 1) = n_Q n_T/l − n_Q + n_Q/l. An illustration of the NQS strategy is given in Fig. 6(a).

On the contrary, the OQS strategy partitions T into a set of non-overlapped l-frame target segments {T_s} and employs the sliding window to scan over Q and extract query segments {Q_r}, where each of {Q_r} overlaps its neighboring query segments. In this strategy, we denote the sth target segment as T_s = {t_{(s-1)·l+1}, t_{(s-1)·l+2}, ..., t_{s·l}}, s ∈ {1, 2, ..., ⌈n_T/l⌉}, and the rth query segment extracted by the sliding window as Q_r = {q_r, q_{r+1}, ..., q_{r+l-1}}, r ∈ {1, 2, ..., R = n_Q − l + 1}. We have to spend (n_Q − l + 1)(n_T/l) = n_Q n_T/l − n_T + n_T/l time to complete the segment matching process for Q and T. Fig. 6(b) gives an illustration of the OQS strategy.

According to the time complexity analysis, the NQS strategy is more efficient than the OQS strategy if n_Q > n_T, and less efficient if n_Q < n_T. Hereinafter, we utilize the OQS strategy in the segment-level matching technique. The decision is based on the following reasons. First, in a database search system, the target data size is generally much greater than the query data size, i.e., n_Q < n_T; the OQS strategy thus obtains a lower computation time according to the above complexity analysis. Second, in the experimental section, we show that the OQS strategy prunes more unnecessary matching between segment pairs while keeping an accuracy comparable to the NQS strategy. In addition, the OQS strategy is more storage-efficient than the NQS strategy, since the number of partitioned target segments in the OQS strategy is much smaller than that in the NQS strategy.
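For clarity, the two partition strategies can be sketched as follows (the per-frame signature lists and function names are our assumptions): OQS slides a stride-1 window of length l over the query and cuts the target into non-overlapped l-frame blocks, while NQS swaps the roles.

def partition_oqs(query_frames, target_frames, l):
    """OQS: overlapped query segments (stride-1 sliding window of length l)
    against non-overlapped l-frame target segments."""
    query_segs = [query_frames[r:r + l]
                  for r in range(len(query_frames) - l + 1)]
    target_segs = [target_frames[s:s + l]
                   for s in range(0, len(target_frames), l)]
    return query_segs, target_segs

def partition_nqs(query_frames, target_frames, l):
    """NQS: non-overlapped query segments against sliding-window target segments."""
    query_segs = [query_frames[r:r + l]
                  for r in range(0, len(query_frames), l)]
    target_segs = [target_frames[s:s + l]
                   for s in range(len(target_frames) - l + 1)]
    return query_segs, target_segs

Consistent with the analysis above, partition_oqs leads to roughly (n_Q − l + 1)·⌈n_T/l⌉ segment comparisons per target clip, which is the smaller count whenever n_Q < n_T.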

3.4. Video clip indexing

We apply an inverted indexing mechanism to expedite the whole search process in CBVCD. The idea is to quickly filter out target clips that are not similar to the query clip, so that only a fraction of the target clips in the dataset have to be verified by linear scanning in detail. To achieve efficient indexing, we represent a video clip by the ordinal measure [4] rather than the SIFT descriptor, taking advantage of the compactness of the ordinal measure. The ordinal measure is extracted by dividing a video frame into 3×3 non-overlapping blocks and computing their intensity ranks. Then, g LBP-based ordinal relation functions [22] are used to generate a hash value for a video frame f. For example, using 10 ordinal relation functions (g = 10), the frame's hash vector H is expressed by

H = (R([2,2],[1,2]), R([2,2],[2,3]), R([2,2],[3,2]), R([2,2],[2,1]), R([1,1],[1,3]), R([1,3],[3,3]), R([3,3],[3,1]), R([3,1],[1,1]), R([1,1],[3,3]), R([1,3],[3,1])),   (5)

where R(s, t) is an ordinal relation function that returns 1 if the rank of the sth block is greater than that of the tth block, and 0 otherwise. The ordinal relation functions used in (5) are selected according to the spatial topology and information entropy proposed by Shang et al. [22]. Fig. 7 shows an example of the ordinal-based feature.

Fig. 7. An example of the 3×3 ordinal relation of a video frame and its hash vector H.

Let G = 2^g. We transform H into a nonnegative integer in the interval [0, G−1] by the following polynomial evaluated at x = 2:

h = H(1)x^{g-1} + H(2)x^{g-2} + \cdots + H(g-1)x + H(g),   (6)

where H(g) is the gth element of the vector H and h is the hash value of the video frame. We then generate an ordinal-based signature \vec{f}^O ∈ {0,1}^G, which is a G-dimensional binary vector in which all bins are set to zero except for the hth bin, which is set to one. Similar to (1), the ordinal-based signature of a video segment V is represented as \vec{V}^O(d) = \max(\vec{f}_1^O(d), \vec{f}_2^O(d), \ldots, \vec{f}_{n_V}^O(d)) for d = 1, 2, ..., G.

An inverted index table is built as follows. We denote the inverted index table as X, which contains G cells corresponding to the G hash values. Each cell stores a linked list of target video clips. A target video clip T is indexed according to its signature \vec{T}^O; that is, if \vec{T}^O(d) = 1, T's file ID is inserted into the dth cell X(d). When a query clip Q is submitted, it is processed in a similar manner. If there is a d⁺ ∈ {1, 2, ..., G} such that \vec{Q}^O(d⁺) = 1 and X(d⁺) contains T, it means that Q and T are hit in X and they will be verified through linear scanning.

If the hit ratio of inverted indexing is high, many target clips have to be matched with the query clip, which increases the processing time. To decrease the hit ratio of inverted indexing, we apply the min-hash theory to compact the ordinal-based signature. The min-hash theory, which was proposed to solve the nearest neighbor search problem efficiently [1], has been successfully applied in similarity measurement for CBVCD [8,9]. The basic idea is to take the k minimum hash values from the original signature as its approximation. We apply the min-hash theory to generate a video segment's min-hash signature based on its ordinal-based signature. The min-hash signature of a video segment V is expressed as

\vec{V}^H(d) = \begin{cases} 1, & d \in \min_k\{d' \mid \vec{V}^O(d') > 0\} \\ 0, & \text{otherwise} \end{cases}   (7)

for d = 1, 2, ..., G, where min_k(A) returns the k smallest values of a set A in ascending order. If the size of A is not larger than k, min_k(A) returns A in ascending order. The min-hash signature is also a binary vector. It is used to replace the ordinal-based signature for inverted indexing in the same manner: if \vec{V}^H(d) = 1, V is inserted into (or hits) the dth cell X(d). Note that the cardinality |\vec{V}^H| is at most k, where k does not exceed |\vec{V}^O|; the hit ratio of inverted indexing can thus be decreased. To investigate the best configuration of G and k, we design a series of evaluations in the experimental section and discuss their performance.

Fig. 8. A comparison of the QSPR of the two partitioning strategies.
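Before moving to the experiments, the indexing path of Eqs. (5)–(7) can be summarized in a short sketch. The block-pair list follows Eq. (5); everything else — the function names, the dictionary-of-sets index, and the default G = 2^10 — is an illustrative assumption (the experiments below also use larger G, i.e., more ordinal relation functions):

from collections import defaultdict
import numpy as np

# the 10 block pairs of Eq. (5), given as (row, column) indices of the 3x3 rank grid
PAIRS = [((2, 2), (1, 2)), ((2, 2), (2, 3)), ((2, 2), (3, 2)), ((2, 2), (2, 1)),
         ((1, 1), (1, 3)), ((1, 3), (3, 3)), ((3, 3), (3, 1)), ((3, 1), (1, 1)),
         ((1, 1), (3, 3)), ((1, 3), (3, 1))]

def frame_hash(ranks):
    """Eqs. (5)-(6): ranks is a 3x3 array of block intensity ranks; the g = 10
    ordinal relations form a bit string read as an integer in [0, G-1]."""
    h = 0
    for a, b in PAIRS:  # H(1) becomes the most significant bit
        h = (h << 1) | int(ranks[a[0] - 1, a[1] - 1] > ranks[b[0] - 1, b[1] - 1])
    return h

def minhash_signature(frame_hashes, k, G=1024):
    """Eq. (7): keep only the k smallest hash values occurring in the segment."""
    sig = np.zeros(G, dtype=bool)
    for h in sorted(set(frame_hashes))[:k]:
        sig[h] = True
    return sig

def build_index(target_clips):
    """Inverted index X: cell d stores the IDs of clips whose signature has bit d set."""
    index = defaultdict(set)
    for clip_id, sig in target_clips.items():
        for d in np.flatnonzero(sig):
            index[d].add(clip_id)
    return index

def candidate_clips(index, query_sig):
    """A query and a target clip are 'hit' if they share any set bit in X."""
    hits = set()
    for d in np.flatnonzero(query_sig):
        hits |= index[d]
    return hits

Only the clips returned by candidate_clips are then verified by the segment-level linear scanning of Sections 3.2–3.3.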

4. Experiments

We evaluated the proposed framework on the CC_WEB_VIDEO [5] and TRECVID [27] collections. The CC_WEB_VIDEO collection contains 24 folders with a total of 12,890 video clips downloaded from video sharing websites. The video length is approximately 732 h. In each folder, there is a major group of near-duplicate video clips. These near-duplicate video clips mainly differ in their visual quality, compression codec, frame resolution, cropping, subtitling, frame rate, etc. For the TRECVID collection, we downloaded the IACC.1.A video data used in the content-based copy detection task, with a total of 8175 video clips. The video length is approximately 225 h. All these video clips were converted into a uniform format of 320×240 pixels frame resolution and 1 frame per second (fps). They served as the target dataset, with a total of 21,065 files and 3,448,191 frames (about 957 h). We compiled two categories of query datasets: a spatial query dataset and a temporal query dataset. The spatial query dataset was created as follows. From the CC_WEB_VIDEO collection, we selected the first video clip of each folder as the query clip. Four undergraduate students annotated a ground truth for every target video clip by tagging it "copy" (with segment timestamps) or "not copy". To create the temporal query dataset, we chopped a 10–20 s segment from a spatial query clip, inserted the segment into an unrelated video clip at random, and applied a quality degradation transformation to the entire clip, as illustrated in Fig. 1. The quality degradation transformation includes brightness enhancement 20%, compression ratio 50% (by Indeo 5.10), and random noise 10%. In total we compiled 24 spatial and 24 temporal query clips, each of which was truncated to a 60 s duration.

4.1. Video partition strategy comparison

We first investigate the partition strategies presented in Section 3.3, namely, NQS and OQS. An efficiency metric called the query segment pruning ratio (QSPR) is used for evaluation:


QSPR = \frac{\text{the number of pruned query segments}}{R - 1},   (8)

where R is the number of partitioned query segments. QSPR reflects the proportion of query segments that are pruned without computing the similarity with a target segment. A small dataset based on the first folder of the CC_WEB_VIDEO collection was compiled. The dataset contained 809 video files, each of which was truncated to a 60 s duration. The first video file was used as the query clip, and its first query segment was selected as the pivot query segment. Note that n_Q = n_T = 60 for every query and target clip, so that the evaluation did not favor either partitioning strategy.

Fig. 8 shows a comparison of the QSPR of the two partitioning strategies. We set the number of signature dimensions D = 1024, the similarity threshold θ = {0.3, 0.4}, and the length of the partitioned video segment l = {5, 10, 15, 20, 30}. It is apparent that setting a higher θ increases the QSPR. The OQS strategy yields a significantly better QSPR than the NQS strategy. The reason is that in the OQS strategy, the partitioned query segment and its neighboring query segments are highly overlapped, so their set differences are very small. According to (3), the similarity upper bounds calculated by OQS are lowered; thus, a query segment has a higher probability of being pruned.

We also list the average F1-measure rates of the two strategies in Table 1, where the F1-measure rate is defined as

F1\text{-}measure = \frac{2 \cdot recall \cdot precision}{recall + precision}, \quad recall = \frac{TP}{TP + FN}, \quad precision = \frac{TP}{TP + FP},   (9)

where true positives (TP) refer to the number of positive examples correctly labeled as positives; false negatives (FN) refer to the number of positive examples incorrectly labeled as negatives; and false positives (FP) refer to the number of negative examples incorrectly labeled as positives. Generally, when the similarity threshold increases (decreases), the recall rate declines (grows) and the precision rate grows (declines). The F1-measure is the harmonic mean of the recall and precision rates; it can be treated as a trade-off between the two. Table 1 shows that the OQS strategy yields a comparable (even slightly better) F1-measure than the NQS strategy, even though a larger portion of the similarity computation is skipped in OQS. Based on this observation, we employ the OQS strategy in the subsequent experiments.

Table 1. The F1-measure of the two partitioning strategies.

                   NQS (non-overlapped query segment)   OQS (overlapped query segment)
θ = 0.3, l = 20    0.9485                                0.9486
θ = 0.4, l = 20    0.9395                                0.9418
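For reference, the evaluation metrics of Eqs. (8) and (9) reduce to a few lines; a sketch that assumes the pruning and detection counts have been gathered elsewhere:

def qspr(num_pruned, R):
    """Eq. (8): fraction of the R - 1 non-pivot query segments that were pruned."""
    return num_pruned / (R - 1)

def f1_measure(tp, fp, fn):
    """Eq. (9): harmonic mean of recall and precision."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return 2 * recall * precision / (recall + precision)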

4.2. Binary signature vs. frequency signature

Fig. 9. The PR graph of the binary signature and the frequency signature.

The SIFT-based signature is represented in a binary form; that is, the signature records only the codeword occurrence rather than the codeword frequency. In this subsection, we assess the performance of the binary and frequency signatures. The spatial query dataset was used for evaluation. We set the number of signature dimensions D = 1024, the similarity threshold θ = 0.4, and the length of the partitioned video segment l = 20. Fig. 9 shows the precision-recall (PR) graph of the two signatures. Although the binary signature omits the frequency information, its accuracy is comparable to that of the frequency signature. The execution time is 0.282 s for the binary signature and 0.343 s for the frequency signature. This is because the binary signature can apply the faster bitwise AND operation to calculate the bit stream intersection, while the frequency signature has to use the slower minimum operation to calculate the histogram intersection. In addition, the binary signature is more storage-efficient than the frequency signature. In the following experiments, we employ the binary signature to represent videos.

Fig. 10. The performance with/without the pivot segment selection mechanism in terms of (a) F1-measure and (b) QSPR.

Fig. 11. The number of signature dimensions D vs. the similarity threshold θ in terms of (a) the F1-measure and (b) the QSPR.

Fig. 12. The performance of various video segment lengths l in terms of (a) F1-measure and (b) QSPR.

4.3. Pivot selection

This experiment evaluates the proposed pivot selection mechanism of Section 3.2, comparing two settings: with pivot selection and without pivot selection. Recall that the pivot selection mechanism applies (4) to select a partitioned query segment as the pivot. For the other setting, we simply selected the first query segment as the pivot without loss of generality. The spatial query dataset was used with the following configuration: D = 1024, l = 20, and θ = {0.1, 0.2, 0.3, 0.4, 0.5, 0.6}. Fig. 10(a) shows the F1-measure, where the two mechanisms exhibit the same performance. However, in Fig. 10(b), the QSPR of the pivot selection mechanism is much higher than that of the other mechanism. Clearly, the proposed pivot selection mechanism is very effective in expediting the matching process; it is applied in the subsequent experiments.

4.4. SIFT-based signature dimension

This subsection discusses the effect of the dimensionality of the SIFT-based signature. The spatial query dataset was used. Each video clip was partitioned into several 20-frame video segments, i.e., l = 20. We assessed the performance under various parameters θ = {0.1, 0.2, 0.3, 0.4, 0.5, 0.6} and D = {1024, 2048, 4096, 8192}. Fig. 11(a) shows the F1-measure. A larger D obtains a higher F1-measure when θ = 0.1 and 0.2, but the phenomenon reverses when θ ≥ 0.3. We consider that, with a suitable choice of θ, even a lower D can yield an accuracy comparable to a higher D. Fig. 11(b) shows the QSPR. Similarly, a larger D obtains a slightly higher QSPR at a lower θ, and the trend reverses when θ ≥ 0.4. Basically, the QSPR is strictly increasing as θ grows because the gap between the similarity upper bound and the threshold is enlarged. However, a higher θ harms the F1-measure. To strike a balance between accuracy and efficiency, we select D = 1024 in the subsequent experiments; it yields a satisfactory F1-measure while obtaining an acceptable QSPR at θ = 0.4.


Another merit is that using a small D requires less computation time and storage space.

4.5. Video segment length

We then investigate the impact of the length of the partitioned video segment. Various segment lengths l = {1, 5, 10, 15, 20, 30, 60} were evaluated. The spatial query dataset and the following configuration were used in the experiment: D = 1024, G = 8192, k = 60, and θ = {0.3, 0.4}. In Fig. 12(a), the F1-measure is very stable across different l, while in Fig. 12(b), a larger l yields a higher QSPR. The reason is that a long segment usually exhibits a high overlap with the pivot segment, making the set difference in (3) small. Note that there is only one query segment when l = 60. Since no query segment can be pruned, the QSPR is always zero at l = 60.

Fig. 13. The F1-measure rates under different video segment lengths and thresholds.

4.6. Temporal transformation

In this subsection, we use the temporal query dataset for evaluation. The following configuration was used in the experiment: D = 1024, G = 8192, k = 60, θ = {0.3, 0.4}, and l = {1, 5, 10, 15, 20, 30, 60}. We find some interesting phenomena in Fig. 13. For long segments (l = 30 and 60), the F1-measure rates are quite low at θ = 0.4. This is because the signature of a long segment might not match that of the temporal-based copy, so the recall rate declines substantially.

Fig. 14. The number of ordinal-based signature dimensions G vs. the number of min-hash signature dimensions k in terms of (a) F1-measure on the spatial query dataset; (b) TCPR on the spatial query dataset; and (c) F1-measure on the temporal query dataset.


Table 2. The accuracy of the baseline methods and our method.

                      Spatial query dataset                 Temporal query dataset
Method                Recall    Precision   F1-measure      Recall    Precision   F1-measure
Shang et al. [22]     0.9512    0.9042      0.9271          0.8232    0.9024      0.8610
Tan et al. [26]       0.9512    0.9095      0.9299          0.9116    0.9084      0.9100
Ours                  0.9299    0.9162      0.9230          0.8933    0.9090      0.9011

Although they improve greatly at θ = 0.3, they drop again at θ = 0.2, because their precision rates decrease sharply as θ decreases. On the other hand, the F1-measure rates of short segments (l = 1 and 5) do not change greatly. Since short segments contain less information, their precision rates are easily damaged, especially at a low θ (see l = 1 and θ = 0.2). For l = 10, 15, and 20, we obtain relatively stable F1-measure rates. Overall, against temporal-based transformations, the F1-measure is degraded because of the decline in the recall rate. To improve the accuracy, the threshold should be lowered; however, using a very low threshold might seriously harm the precision rate and the computation efficiency.

4.7. Inverted indexing

We discuss the inverted indexing mechanism presented in Section 3.4. Two main parameters, namely, the number of ordinal-based signature dimensions G and the number of min-hash signature dimensions k, are investigated. The following configuration was set in the experiment: D = 1024, θ = 0.4, l = 20, G = {1024, 2048, 4096, 8192}, and k = {1, 3, 5, 10, 15, 20, 60}. Both the spatial and temporal query datasets are used in this experiment. We first illustrate the results on the spatial query dataset in Fig. 14(a) and (b), which show the F1-measure and the target clip pruning ratio (TCPR), respectively. TCPR is defined by

TCPR = \frac{\text{the number of pruned target clips}}{\text{the number of target clips}}.   (10)

TCPR is used to assess the efficiency of the proposed inverted indexing. It reflects the proportion of target dataset clips that are pruned without linear scanning. At different G, the variation in F1-measure is minor while the variation in TCPR is noticeable. Using a large G decreases the hit ratio in inverted indexing and thus increases the TCPR. Different k values yield similar F1-measure scores except for k = 1. As k grows, the TCPR declines until k ≥ 20. The experimental results show that it is unnecessary to preserve the whole information of the ordinal-based signature; its min-hash values yield more efficient performance without sacrificing much accuracy. The results on the temporal query dataset are shown in Fig. 14(c). The F1-measure degrades only slightly compared with Fig. 14(a), which demonstrates that the proposed inverted indexing also works well against temporal-based transformations.

4.8. Baseline comparison

We implement Shang et al.'s method [22] and Tan et al.'s method [26] as the baselines. Shang et al.'s method follows the clip-level matching technique. Given a video clip, the ordinal relation functions were used to represent each frame, as described in Section 3.4. To model the spatiotemporal property of a video clip, the w-shingle feature, which characterizes the ordinal relations of w frames, was applied. The w-shingle feature was regarded as a visual word, and the video clip was represented as an aggregation of all frames in a bag-of-words signature. The similarity between two clips was computed by the Jaccard coefficient of their signatures. An inverted indexing structure with a fast intersection kernel was used for speed-up. In this experiment, the LBP-based ordinal relation feature was used in our implementation and the parameter w = 3. In contrast, Tan et al.'s method follows the frame-level matching technique. Given a query video and a target video, their frames were aligned by constructing a temporal network. For each query frame, the top-k similar frames were retrieved from the target video. Edges were established between the retrieved frames based on heuristic temporal constraints, including the temporal distortion level wnd and the minimal length ratio of the near-duplicate segment Lmin. A network flow algorithm was employed to find the best path in the graph. In our implementation, Tan et al.'s method used our proposed SIFT-based signature with the parameters wnd = 5, k = 1, and Lmin = 0.1. We also integrated our inverted indexing mechanism in their method for fast filtering. The parameters of our method were set to G = 8192, k = 3, D = 1024, θ = 0.4, and l = 20.

Table 2 lists the accuracy of the compared methods. The spatial and temporal query datasets are evaluated separately. On the spatial query dataset, the F1-measure scores of the three methods are comparable. However, on the temporal query dataset, Shang et al.'s method incurs a greater reduction in the recall rate (about 13%) than the others, whereas Tan et al.'s method and our method suffer relatively minor degradation. This reflects the major problem of clip-level matching, which uses a single signature to represent the whole video clip. If the clip is modified by temporal-based transformations, the signature of the modified clip might be very different from that of the source clip, which increases the number of false negatives.

We finally assess the execution time of these methods. The programs were implemented in C++ and ran on a PC with a 2.8 GHz CPU and 4 GB RAM. Our method spent 0.282 s on average to search a 60-frame query clip in the target dataset. Without the proposed query segment pruning, our method took 0.387 s to complete the same search task. Shang et al.'s method and Tan et al.'s method spent 0.012 s and 144.773 s, respectively. Note that these methods have different data volumes to process: Shang et al.'s method searches 21,065 LBP-based ordinal features (i.e., the number of target clips); Tan et al.'s method searches 3,448,191 SIFT-based signatures (i.e., the number of frames); and our method searches 160,721 SIFT-based signatures (i.e., the number of segments).

5. Conclusion

In this paper, we present a novel search method based on segment-level matching to address the accuracy and efficiency issues in detecting temporal-based video copies. By exploiting the similarity upper bound derived from the interrelation between video segments, the proposed method can prune unnecessary matches to expedite the search process. In addition, a coarse-to-fine framework leverages the ordinal-based and SIFT-based feature representations to achieve effective indexing and robust matching. Compared with clip-level and frame-level matching, our method strikes a balance between accuracy and efficiency, in particular for dealing with temporal-based transformations.

References

[1] A. Andoni, P. Indyk, Near-optimal hashing algorithms for approximate nearest neighbor in high dimension, Commun. ACM 51 (2008) 117–122.
[2] A. Basharat, Y. Zhai, M. Shah, Content based video matching using spatiotemporal volumes, Comput. Vision Image Understand. 110 (2008) 360–377.
[3] H. Bay, A. Ess, T. Tuytelaars, L.V. Gool, SURF: speeded up robust features, Comput. Vision Image Understand. 110 (2008) 346–359.
[4] D.N. Bhat, S.K. Nayar, Ordinal measures for image correspondence, IEEE Trans. Pattern Anal. Mach. Intell. 20 (1998) 415–423.
[5] CC_WEB_VIDEO: near-duplicate web video dataset. http://vireo.cs.cityu.edu.hk/webvideo/
[6] S.C. Cheung, A. Zakhor, Efficient video similarity measurement with video signature, IEEE Trans. Circ. Syst. Video Technol. 13 (2003) 59–74.
[7] C.Y. Chiu, C.S. Chen, L.F. Chien, A framework for handling spatiotemporal variations in video copy detection, IEEE Trans. Circ. Syst. Video Technol. 18 (2008) 412–417.
[8] C.Y. Chiu, H.M. Wang, C.S. Chen, Fast min-hashing indexing and robust spatio-temporal matching for detecting video copies, ACM Trans. Multimedia Comput., Commun., Appl. 6 (10) (2010) 1–23.
[9] O. Chum, J. Philbin, M. Isard, A. Zisserman, Scalable near identical image and shot detection, in: Proceedings of ACM International Conference on Image and Video Retrieval (CIVR), Amsterdam, The Netherlands, July 9–11 2007.
[10] M.M. Esmaeili, M. Fatourechi, R.K. Ward, A robust and fast video copy detection system using content-based fingerprinting, IEEE Trans. Inform. Forensic Secur. 6 (2011) 213–226.
[11] T.C. Hoad, J. Zobel, Detection of video sequence using compact signatures, ACM Trans. Inform. Syst. 24 (2006) 1–50.
[12] X.S. Hua, X. Chen, H.J. Zhang, Robust video signature based on ordinal measure, in: Proceedings of IEEE International Conference on Image Processing (ICIP), Singapore, October 24–27 2004.
[13] Z. Huang, H.T. Shen, J. Shao, X. Zhou, Bounded coordinate system indexing for real-time video clip search, ACM Trans. Inform. Syst. 27 (17) (2009) 1–33.
[14] H. Jégou, M. Douze, C. Schmid, Improving bag-of-features for large-scale image search, Int. J. Comput. Vision 87 (2010) 316–336.
[15] A. Joly, O. Buisson, C. Frelicot, Content-based copy retrieval using distortion-based probabilistic similarity search, IEEE Trans. Multimedia 9 (2007) 293–306.
[16] J. Law-To, O. Buisson, V. Gouet-Brunet, N. Boujemaa, Robust voting algorithm based on labels of behavior for video copy detection, in: Proceedings of ACM International Conference on Multimedia (ACM-MM), Santa Barbara, USA, October 23–27 2006.
[17] T.Y. Lee, S.D. Lin, Dual watermark for image tamper detection and recovery, Pattern Recognition 41 (2008) 3497–3506.
[18] W. Li, Y. Liu, X. Xue, Robust audio identification for MP3 popular music, in: Proceedings of ACM International Conference on Information Retrieval (SIGIR), Geneva, Switzerland, July 19–23 2010.
[19] D.G. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vision 60 (2004) 91–110.
[20] J.J. Murillo-Fuentes, Independent component analysis in the blind watermarking of digital images, Neurocomputing 70 (2007) 2881–2890.
[21] K. Sayood, Introduction to Data Compression, Morgan Kaufmann, San Francisco, California, 1996.
[22] L. Shang, L. Yang, F. Wang, K.P. Chan, X.S. Hua, Real time large scale near-duplicate web video retrieval, in: Proceedings of ACM International Conference on Multimedia (ACM-MM), Firenze, Italy, October 25–29 2010.
[23] H.T. Shen, J. Shao, Z. Huang, X. Zhou, Effective and efficient query processing for video subsequence identification, IEEE Trans. Knowl. Data Eng. 21 (2009) 321–334.
[24] J. Sivic, A. Zisserman, Video Google: a text retrieval approach to object matching in videos, in: Proceedings of IEEE International Conference on Computer Vision (ICCV), Nice, France, October 14–17 2003.
[25] J. Song, Y. Yang, Z. Huang, H.T. Shen, R. Hong, Multiple feature hashing for real-time large scale near-duplicate video retrieval, in: Proceedings of ACM International Conference on Multimedia (ACM-MM), Scottsdale, USA, November 28–December 1 2011.
[26] H.K. Tan, C.W. Ngo, R. Hong, T.S. Chua, Scalable detection of partial near-duplicate videos by visual temporal consistency, in: Proceedings of ACM International Conference on Multimedia (ACM-MM), Augsburg, Germany, September 23–28 2009.
[27] TRECVID: TREC Video Retrieval Evaluation. http://www-nlpir.nist.gov/projects/tv2011/
[28] J. Wang, Y. Li, Y. Zhang, C. Wang, H. Xie, G. Chen, X. Gao, Bag-of-features based medical image retrieval via multiple assignment and visual words weighting, IEEE Trans. Med. Imaging 30 (2011) 1996–2011.
[29] X. Wu, A.G. Hauptmann, C.W. Ngo, Practical elimination of near-duplicates from web video search, in: Proceedings of ACM International Conference on Multimedia (ACM-MM), Augsburg, Germany, September 23–28 2007.
[30] Y. Yang, Y.T. Zhuang, F. Wu, Y.H. Pan, Harmonizing hierarchical manifolds for multimedia document semantics understanding and cross-media retrieval, IEEE Trans. Multimedia 10 (2008) 437–446.
[31] M.C. Yeh, K.T. Cheng, Fast visual retrieval using accelerated sequence matching, IEEE Trans. Multimedia 13 (2011) 320–329.
[32] Q. Zhang, Y. Zhang, H. Yu, X. Huang, Efficient partial-duplicate detection based on sequence matching, in: Proceedings of ACM International Conference on Information Retrieval (SIGIR), Geneva, Switzerland, July 19–23 2010.
[33] X. Zhou, L. Chen, Monitoring near duplicates over video streams, in: Proceedings of ACM International Conference on Multimedia (ACM-MM), Firenze, Italy, October 25–29 2010.
Chih-Yi Chiu received the B.S. Degree in Information Management from National Taiwan University, Taiwan in 1997, and the M.S. Degree in Computer Science from National Taiwan University, Taiwan in 1999, and the Ph.D. Degree in Computer Science from National Tsing Hua University, Taiwan in 2004. From January 2005 to July 2009, he was with Academia Sinica as a Postdoctoral Fellow. In August 2009, he joined National Chiayi University, Taiwan as an Assistant Professor in the Department of Computer Science and Information Engineering. His current research interests include multimedia retrieval, human–computer interaction, and digital archiving.

Tsung-Han Tsai received the B.S. Degree in Computer Science and Information Engineering from National Chiayi University, Taiwan in 2011. In August 2011, he joined National Chiayi University, Taiwan as a graduate student in the Department of Computer Science and Information Engineering. His current research interest includes multimedia content analysis and retrieval.

Chen-Yu Hsieh received the B.S. and M.S. Degrees in Computer Science and Information Engineering from National Chiayi University, Taiwan in 2010 and 2012. He is currently with Chung-Shan Institute of Science & Technology, Taoyuan, Taiwan. His current research interest is multimedia retrieval and computer graphics.