Pattern Recognition Letters 33 (2012) 2285–2291
An FFT-based fast melody comparison method for query-by-singing/humming systems
Wei-Ho Tsai, Yu-Ming Tu, Cin-Hao Ma
Department of Electronic Engineering & Graduate Institute of Computer and Communication Engineering, National Taipei University of Technology, No. 1, Sec. 3, Chunghsiao E. Rd, Taipei City 10608, Taiwan
Article history: Received 4 April 2012; available online 11 September 2012. Communicated by G. Borgefors.
Keywords: Dynamic time warping; Fast Fourier transform; Query-by-humming; Query-by-singing
Abstract

Query-by-singing/humming (QBSH) is a promising way to retrieve music recordings based on the similarity of their main melodies. This paper presents an efficient QBSH method that enables fast melody comparison. In contrast to the most prevalent QBSH method, which measures the distances between note sequences in the time domain, the proposed method performs the distance computation in the frequency domain. This is done with the fast Fourier transform, which converts different-length note sequences into equal-dimension vectors via zero padding. The equal dimensionality allows us to compare the vectors using the Euclidean distance directly, which avoids performing time-consuming alignment between sequences. To take both efficiency and effectiveness into account, the proposed fast melody comparison method is combined with the dynamic time warping technique into a two-stage sequence matching system. Our experiments show that the proposed system outperforms several existing speed-up DTW-based systems in terms of both efficiency and effectiveness.
1. Introduction

Music retrieval has become an important research topic, driven by the need to help people locate desired items in a burgeoning amount of music data. As concrete descriptions such as title, lyrics, or singer cannot fully represent the abstract content of music, such as melody or emotion, conventional keyword-based retrieval methods show their limitations in handling music data. For example, it is often the case that people know what the song they want sounds like but cannot recall its title or lyrics, which makes it difficult for them to retrieve music using keyword-based queries. To overcome this problem, a promising solution is the so-called query-by-singing/humming (QBSH) approach (Jang and Lee, 2001; Jang et al., 2001; Pauws, 2002; Shifrin and Burmingham, 2003; Shih et al., 2003; Doraisamy and Ruger, 2003; Salvador and Chan, 2007; Unal et al., 2007; Casey et al., 2008; Guo et al., 2008; Jang and Lee, 2008; Yu et al., 2008; Cao et al., 2009; Hou et al., 2009; Kim et al., 2011; Khan and Mushtaq, 2011), which allows users to retrieve a song by simply singing or humming a fragment of it. However, although QBSH techniques have been studied for more than a decade, they are still far from widespread in real applications. Clearly, more work is needed to further improve the effectiveness and efficiency of existing QBSH techniques.
A QBSH system relies on an effective mechanism for melody similarity comparison. Since most users are not professional
singers, a sung/hummed query inevitably contains tempo errors, key errors, note insertion errors, note dropout errors, etc. To handle these errors, various approximate matching methods, such as dynamic time warping (DTW) (Jang and Lee, 2001; Salvador and Chan, 2007; Yu et al., 2008; Kim et al., 2011), the hidden Markov model (Shifrin and Burmingham, 2003; Shih et al., 2003), and the N-gram model (Doraisamy and Ruger, 2003; Hou et al., 2009), have been studied, with DTW being the most popular. Table 1 summarizes the pros and cons of the above-mentioned approaches. It is worth emphasizing that the advantage of DTW mainly lies in its flexibility to align two time-dependent patterns that are similar but locally different in a non-linear way; hence, it is insensitive to the errors in a sung/hummed query. However, because the alignment between two patterns is not known a priori, DTW exhaustively examines various possible alignments, which takes considerable time.
In this study, we propose a fast method for melody similarity comparison. Instead of comparing melody patterns in the time domain, the proposed method operates in the frequency domain. It uses the fast Fourier transform (FFT) to convert variable-length melody patterns into equal-dimension vectors. The equal dimensionality allows us to compare the vectors using the Euclidean distance directly, and thus enables fast melody matching.
The remainder of this paper is organized as follows. Section 2 reviews a baseline QBSH system based on dynamic time warping. Section 3 introduces the fast melody comparison method and the proposed QBSH system. Section 4 compares the proposed system with other speed-up DTW-based systems. Section 5 discusses our
Table 1. Pros and cons of the three major approaches used in QBSH studies.

Dynamic time warping
  Pros: Robust to the errors in the note sequences.
  Cons: High computational complexity; inferior discriminability for different music documents having similar melodies in part.

Hidden Markov model
  Pros: Robust to the errors in the note sequences.
  Cons: Inferior discriminability for different music documents having similar melodies in part.

N-gram model
  Pros: Low computational complexity; superior discriminability for different music documents having similar melodies in part.
  Cons: Sensitive to the errors in the note sequences.
experimental results. Then, in Section 6, we present our conclusions and indicate directions for future work.

2. A baseline QBSH system

A QBSH system is designed to take as input an audio query sung or hummed by a user, and to produce as output a list of song documents ranked by relevance to the query, where relevance refers to the level of melodic similarity. Fig. 1(a) shows a baseline system based on dynamic time warping (DTW). The document set consists of N MIDI song clips, each represented by the note sequence of its main melody. When a user hums to the system, his/her voice is recorded and converted into a sequence of MIDI note numbers. The method of obtaining the note sequence in this study follows the one in Jang et al. (2001). It is based on the average magnitude difference function (AMDF) computed using a sliding window (frame). For every 1/64 s of a voice recording, a pitch frequency f0 is determined and then converted into a MIDI note number:
$b = \left\lfloor 12 \log_2\!\left(\frac{f_0}{440}\right) + 69.5 \right\rfloor$,   (1)
where $\lfloor \cdot \rfloor$ denotes the floor operator. The resulting note sequence is further smoothed using median filtering, which replaces each note with the local median of the notes in its neighboring frames. Next, the query's note sequence is compared with each of the documents' note sequences. As the query's note sequence may contain tempo errors, note dropout errors, note insertion errors, etc., the comparison is done with DTW to absorb the errors. In our work, it is assumed that a user hums a document's whole melody or the initial part of its melody. In addition, considering that a query may be hummed in a different key or register than the document, multiple DTWs are performed to examine whether a query's sequence matches a document's sequence better after being shifted upward or downward by several semitones. Specifically, a query's sequence is expanded into J sequences: one is the original, and the others are shifted upward or downward by up to (J − 1)/2 semitones from the original query's sequence. Each of the J sequences is then compared with each document's sequence via DTW, and the minimum of the resulting J DTW distances is taken as the distance between the query and the document. For ease of discussion, we refer to this strategy as J-DTW sequence matching. The relevance of each document to the query is characterized by the reciprocal of their distance.
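As an illustration of the note conversion in Eq. (1), the median smoothing, and the key-shifted query sequences used in J-DTW matching, the following Python sketch shows one possible implementation; the frame-level pitch values, the smoothing width, and the helper names are illustrative assumptions of ours, and the AMDF pitch tracker itself is omitted.

```python
import numpy as np

def hz_to_midi_note(f0):
    """Eq. (1): quantize a pitch frequency f0 (Hz) to a MIDI note number."""
    return int(np.floor(12.0 * np.log2(f0 / 440.0) + 69.5))

def median_smooth(notes, width=3):
    """Replace each note with the median of its neighboring frames (width is odd)."""
    half = width // 2
    padded = np.pad(notes, half, mode="edge")
    return np.array([int(np.median(padded[i:i + width])) for i in range(len(notes))])

def key_shifted_queries(query, J=3):
    """Span a query into J sequences: itself plus versions transposed up/down
    by 1 .. (J - 1) / 2 semitones, as used in J-DTW sequence matching."""
    shifts = range(-(J - 1) // 2, (J - 1) // 2 + 1)
    return [np.asarray(query) + s for s in shifts]

# Hypothetical frame-level pitch estimates (Hz) -> smoothed MIDI note sequence.
f0_track = [261.6, 262.0, 330.0, 329.0, 392.0, 390.0, 392.5]
notes = median_smooth(np.array([hz_to_midi_note(f) for f in f0_track]))
print(notes)                          # [60 60 64 64 67 67 67]
print(key_shifted_queries(notes, J=3))
```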
3. The proposed system: fast sequence matching based on FFT

Although DTW provides an effective way to compare two sequences, its computation is rather time-consuming. The major computational burden arises because the alignment between two sequences is not known a priori; hence, DTW exhaustively examines various possible alignments. Recognizing this drawback, we propose a fast sequence matching method that sidesteps the alignment process. Our basic idea is to measure the distances between note sequences in the frequency domain rather than in the time domain. Thanks to the fast Fourier transform (FFT), we can convert different-length note sequences into equal-dimension vectors via zero padding. The equal dimensionality allows us to compare the vectors using the Euclidean distance directly, which enables a fast distance computation.
Fig. 1(b) shows the block diagram of the proposed fast note sequence matching. Let T and L be the lengths of a query's note sequence and a document's note sequence, respectively. It is reasonable to assume that the region of a document containing the same melody as the query is no longer than one and a half times the length of the query (similar assumptions have commonly been used, e.g., in Jang and Lee (2008) and Yu et al. (2008)). Thus, for the sake of computational efficiency, we clip the document's note sequence if its length is larger than 1.5T, so that the maximal value of L is 1.5T. Then, each note sequence is zero-meaned and padded with zeros to a length of P, where P is a power of 2. Next, we perform a P-point FFT on each note sequence and normalize the resulting FFT vector to have unit energy. Let Q = [Q1, Q2, ..., QP] and U = [U1, U2, ..., UP] be the FFT vectors derived from the query's sequence q = [q1, q2, ..., qT, qT+1, qT+2, ..., qP] = [q1, q2, ..., qT, 0, 0, ..., 0] and the document's sequence u = [u1, u2, ..., uL, uL+1, uL+2, ..., uP] = [u1, u2, ..., uL, 0, 0, ..., 0], respectively, where
$Q_k = \dfrac{Q'_k}{\sqrt{\sum_{p=1}^{P} |Q'_p|^2}}$,   (2)

$Q'_k = \sum_{p=1}^{P} q_p\, e^{-j \frac{2\pi}{P}(k-1)(p-1)}$,   (3)
and the computation of Uk follows Eqs. (3) and (2) analogously. Then, considering that the high-frequency components often result from the errors in a note sequence, we compute the distance between Q and U using a weighted Euclidean distance:
$D(\mathbf{Q}, \mathbf{U}) = \sum_{k=1}^{P/2} \dfrac{|Q_k - U_k|^2}{k^{0.05}}$,   (4)
where the term $k^{0.05}$ discounts the distance measured at higher frequencies. Note that the purpose of removing the mean before padding a note sequence with zeros is to avoid changing the contour of the sequence significantly after zero padding. Fig. 2 shows an example of two note sequences with the same shape but at different heights; if no mean removal is performed, the two sequences' contours become rather different after zero padding.
Fig. 3 shows some examples of note sequences and their FFT spectra. The note sequences are made artificially for ease of discussion, and the FFT spectra are plotted instead of the FFT vectors actually used in the distance computation, because the FFT vectors are complex-valued and difficult to visualize.
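To make the comparison procedure concrete, the following Python sketch implements the distance of Eqs. (2)–(4): the document sequence is clipped to at most 1.5T notes, both sequences are zero-meaned, zero-padded to P points (a power of 2), transformed with a P-point FFT, normalized to unit energy, and compared with the k^0.05-weighted Euclidean distance over the first P/2 bins. The function names and the choice of P as the smallest power of two covering both sequences are our own illustrative assumptions.

```python
import numpy as np

def fft_vector(notes, P):
    """Zero-mean, zero-pad to P points (a power of 2), take the P-point FFT,
    and normalize to unit energy, as in Eqs. (2) and (3)."""
    x = np.asarray(notes, dtype=float)
    x = x - x.mean()                      # mean removal before zero padding
    x = np.pad(x, (0, P - len(x)))        # zero padding to length P
    X = np.fft.fft(x, n=P)
    return X / np.linalg.norm(X)          # unit-energy normalization

def fft_distance(query, doc):
    """Weighted Euclidean distance of Eq. (4) between two note sequences."""
    T = len(query)
    doc = doc[: int(1.5 * T)]             # keep at most 1.5T document notes
    P = 1 << int(np.ceil(np.log2(max(T, len(doc)))))   # next power of 2
    Q, U = fft_vector(query, P), fft_vector(doc, P)
    k = np.arange(1, P // 2 + 1)
    return float(np.sum(np.abs(Q[:P // 2] - U[:P // 2]) ** 2 / k ** 0.05))

# Two toy sequences with the same contour in different keys give a small distance,
# while a clearly different contour gives a larger one.
a = np.array([60, 60, 64, 64, 67, 67, 67])
print(fft_distance(a, a + 12), fft_distance(a, a[::-1]))
```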
Fig. 1. A baseline and the proposed QBSH systems.
document’s note sequence. Fig. 3(b) and (c) are the stretched and shrunk versions of Fig. 3(a), respectively, which simulate the possible tempo errors in a query. Fig. 3(d) simulates the case of a user singing a little out of tune. Fig. 3(e) simulates the case of a user singing in different keys, so that the query’s notes may be several octaves above or below the main melody. Fig. 3(f) and (g) represent the documents different from the one in Fig. 3(a). In summary, Fig. 3(a)–(e) are in relevant to the same melody, whereas Fig. 3 (f) and (g) represent other different melodies. The FFT spectrums derived from Fig. 3(a)–(g) are shown in Fig. 3(h)–(n), respectively.
We can see that Fig. 3(h)–(l) are similar to each other, compared to Fig. 3(m) and (n). The pairwise distances of Fig. 3(h)–(n) are listed in Table 2. We can see from Table 2 that the distances between the FFT vectors of the note sequences associated with the same melody (marked with bold type) are smaller than those associated with different melodies. However, it is found that the proposed fast melody matching method is not suitable to handle long note sequences, because Fourier transform is built on the assumption that the analyzed sequence is stationary. To take both efficiency and effectiveness
Fig. 2. Two note sequences having the same shape but different heights. After zero padding, the shapes of the two sequences become rather different.

To take both efficiency and effectiveness into account, the proposed fast melody comparison method is combined with the dynamic time warping technique into a two-stage sequence matching system. As shown in Fig. 1(c), instead of performing DTW with respect to all N documents' note sequences, we use the fast sequence matching method to filter out the unlikely candidates. Then, the top M documents ranked by the fast sequence matching are passed to J-DTW matching. Such a two-stage sequence matching framework can significantly reduce the system complexity compared to one based solely on DTW.
Assume that the length of a query's note sequence is T. The complexity of computing a J-DTW distance between the query and a document can be characterized by O(JT²), i.e., complexity on the order of JT². For N documents to be compared, the complexity of DTW-based sequence matching is O(NJT²). On the other hand, the complexity of computing the distance using Eq. (4) can be characterized by O(P log2 P) ≈ O(T log2 T), which mainly stems from the computation of the FFT. If the FFT vectors of all documents' sequences are computed off-line, the complexity of our system's note sequence matching is O(T log2 T + MJT²) ≈ O(MJT²) ≪ O(NJT²) when M ≪ N. That is, the proposed fast sequence matching largely reduces the complexity of a DTW-based QBSH system.
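The two-stage matching can be summarized by the sketch below: all documents are first ranked with the fast FFT-based distance (e.g., the fft_distance sketch above, passed in here as fast_distance), and only the top M candidates are rescored with J-DTW. The plain O(T·L) DTW recursion and the function names are our own illustrative choices, not the exact implementation used in the experiments.

```python
import numpy as np

def dtw_distance(a, b):
    """Plain dynamic time warping with absolute note-number differences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def j_dtw_distance(query, doc, J=3):
    """Minimum DTW distance over the J key-shifted versions of the query."""
    shifts = range(-(J - 1) // 2, (J - 1) // 2 + 1)
    return min(dtw_distance(np.asarray(query) + s, doc) for s in shifts)

def two_stage_match(query, documents, fast_distance, M=100, J=3):
    """Stage 1: rank all documents by the fast FFT-based distance and keep the
    top M candidates.  Stage 2: rescore those candidates with J-DTW."""
    candidates = sorted(range(len(documents)),
                        key=lambda i: fast_distance(query, documents[i]))[:M]
    return sorted(candidates,
                  key=lambda i: j_dtw_distance(query, documents[i], J))
```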
4. Comparisons with other speed-up DTW-based approaches

In related studies, such as Jang and Lee (2001), Salvador and Chan (2007), and Jang and Lee (2008), several two-stage sequence matching methods for speeding up a DTW-based QBSH system have been discussed. One method uses 1-DTW to filter out the unlikely candidates and then performs J-DTW on the top-M ranked documents. Its complexity can be characterized by O(NT² + MJT²), which is much greater than the O(T log2 T + MJT²) complexity of our method. Another group of methods reduces the complexity of sequence matching by cutting down the length of a note sequence using one of the following strategies: (i) using a partial query, e.g., the initial T/2 part of the query; (ii) shortening the length of a note sequence during note extraction; (iii) decimating the note sequence, e.g., deleting one note out of every two notes. Assume that the length of a sequence is thereby reduced from T to T/K. The resulting complexity of the overall sequence matching can be characterized by O(N(T/K)² + MJT²), which is still much greater than the O(T log2 T + MJT²) complexity of our method. Thus, the proposed QBSH system can handle a huge set of music documents more efficiently than the existing speed-up DTW-based systems.

5. Experiments

5.1. Music data
This study used two music databases. The first, denoted DB-1, stems from the corpus of the MIREX 2006 QBSH task (Jang et al., 2006). It contains a music document set and a sung/hummed query set. The document set consists of 48 MIDI songs and 2000 Essen Collection MIDI noise files. The query set consists of 2797 sung/hummed clips from 118 singers. For each query, there is only one relevant document, which is one of the 48 MIDI songs. The second database, denoted DB-2, was collected by us. It contains a document set of 30 MIDI song clips and 20,000 artificial note sequences, and a query set of 300 sung queries from ten female singers. Each query's relevant document is one of the 30 MIDI songs. All queries were recorded at a sampling rate of 8 kHz with 8-bit resolution.

5.2. Experiment results

In this study, we evaluated the performance of the proposed system in terms of its effectiveness and efficiency. The effectiveness was measured using two metrics: Mean Precision (MP) and Mean Reciprocal Rank (MRR). The Mean Precision is the percentage of queries whose relevant documents are ranked first (as there is only one document relevant to each query in our databases, MP is equivalent to recognition accuracy), i.e.,
MP ¼
Number of queries whose relevant documents are ranked first Total number of queries ð5Þ
The MRR is concerned with the inverse of the rank of each query’s relevant document. Specifically,
MRR ¼
C 1X 1 ; C i¼1 Ri
ð6Þ
where C is the total number of queries, and Ri is the rank of the i-th query's relevant document, 1 ≤ Ri ≤ N. The efficiency, on the other hand, was characterized by the amount of realtime required, i.e., the response time divided by the query duration. Our experiments were conducted on a PC with an Intel Core i3 3.2 GHz CPU.
We compared our system with the baseline system and several speed-up systems discussed in Section 4:
- a two-stage system that uses 1-DTW sequence matching to filter out the unlikely candidates and then 3-DTW sequence matching for the top M ranked documents, denoted the 1-DTW + 3-DTW system;
- a two-stage system that uses 1-DTW sequence matching and half-length sequence decimation to filter out the unlikely candidates and then 3-DTW sequence matching for the top M ranked documents, denoted the D2-1-DTW + 3-DTW system.
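For reference, given the rank Ri of each query's relevant document, the effectiveness measures of Eqs. (5) and (6) reduce to a few lines of Python; the example ranks below are made up purely for illustration.

```python
def mean_precision(ranks):
    """Eq. (5): fraction of queries whose relevant document is ranked first."""
    return sum(1 for r in ranks if r == 1) / len(ranks)

def mean_reciprocal_rank(ranks):
    """Eq. (6): average of 1 / R_i over all C queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Hypothetical ranks of each query's relevant document (1 = ranked first).
ranks = [1, 3, 1, 2, 1]
print(mean_precision(ranks), mean_reciprocal_rank(ranks))   # 0.6, ~0.767
```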
Fig. 3. Some examples of artificial note sequences and their FFT spectra.
Table 3 shows the experimental results obtained with DB-1. It can be seen from Table 3(a) that the 5-DTW system yields the highest values of MP and MRR, but takes a tremendous amount of realtime to process a query. It can also be seen from Table 3(b) that the 1-DTW + 3-DTW system achieves better efficiency than the J-DTW system by slightly sacrificing MP and MRR. From Table 3(c), we can observe that the D2-1-DTW + 3-DTW system further reduces the amount of realtime required, but its MP and MRR decrease noticeably. Table 3(d) shows that the proposed system requires the smallest amount of realtime while maintaining values of MP and MRR as high as those of the 3-DTW system. The results also indicate that the proposed system is capable of supporting real-time response in a QBSH system.
Table 2. Pairwise distances of Fig. 3(h)–(n).

      (h)    (i)    (j)    (k)    (l)    (m)    (n)
(h)   0      0.210  0.199  0.208  0.0    0.339  0.437
(i)   –      0      0.249  0.252  0.210  0.390  0.361
(j)   –      –      0      0.273  0.199  0.415  0.452
(k)   –      –      –      0      0.208  0.343  0.359
(l)   –      –      –      –      0      0.339  0.437
(m)   –      –      –      –      –      0      0.338
(n)   –      –      –      –      –      –      0
Table 3. Results of QBSH for database DB-1 (MP, MRR, and amount of realtime).

(a) Baseline J-DTW system
  J = 1:   MP 0.551, MRR 0.839, realtime 2.83
  J = 3:   MP 0.583, MRR 0.890, realtime 7.32
  J = 5:   MP 0.624, MRR 0.900, realtime 12.04

(b) 1-DTW + 3-DTW system
  Top M = 50:   MP 0.561, MRR 0.877, realtime 3.69
  Top M = 100:  MP 0.564, MRR 0.880, realtime 4.13
  Top M = 200:  MP 0.579, MRR 0.887, realtime 4.96

(c) D2-1-DTW + 3-DTW system
  Top M = 50:   MP 0.554, MRR 0.867, realtime 2.59
  Top M = 100:  MP 0.559, MRR 0.871, realtime 3.58
  Top M = 200:  MP 0.565, MRR 0.880, realtime 3.94

(d) Proposed system
  Top M = 50:   MP 0.582, MRR 0.890, realtime 0.51
  Top M = 100:  MP 0.582, MRR 0.890, realtime 0.98
  Top M = 200:  MP 0.582, MRR 0.890, realtime 1.94
Table 4. Results of QBSH for database DB-2 (MP, MRR, and amount of realtime).

(a) Baseline J-DTW system
  J = 1:   MP 0.518, MRR 0.786, realtime 31.69
  J = 3:   MP 0.536, MRR 0.802, realtime 78.23
  J = 5:   MP 0.552, MRR 0.813, realtime 125.82

(b) 1-DTW + 3-DTW system
  Top M = 50:   MP 0.520, MRR 0.789, realtime 31.71
  Top M = 100:  MP 0.525, MRR 0.791, realtime 33.18
  Top M = 200:  MP 0.529, MRR 0.794, realtime 34.16

(c) D2-1-DTW + 3-DTW system
  Top M = 50:   MP 0.472, MRR 0.749, realtime 31.43
  Top M = 100:  MP 0.481, MRR 0.756, realtime 32.48
  Top M = 200:  MP 0.489, MRR 0.764, realtime 32.86

(d) Proposed system
  Top M = 50:   MP 0.521, MRR 0.799, realtime 0.96
  Top M = 100:  MP 0.535, MRR 0.801, realtime 1.48
  Top M = 200:  MP 0.542, MRR 0.804, realtime 2.39
Table 4 shows the experimental results obtained with DB-2. As the number of documents in DB-2 is much larger than that in DB-1, we can see from Table 4(a) that the baseline J-DTW system takes a vast amount of realtime to handle a query. It can also be seen from Table 4(b) and (c) that although both the 1-DTW + 3-DTW system and the D2-1-DTW + 3-DTW system reduce the amount of realtime compared to the baseline 3-DTW system, they are still too inefficient for practical QBSH applications. By contrast, Table 4(d) shows that the proposed system takes only 0.96 to 2.39 times realtime to handle a query while achieving effectiveness comparable to the baseline system. The results indicate the superiority of the proposed system over the other systems.

6. Conclusions

In this work, we have developed an efficient QBSH system that enables fast melody matching of note sequences. By taking the fast Fourier transform of note sequences, the system measures the similarities between note sequences using the Euclidean distance directly, which avoids performing time-consuming alignment between sequences in the time domain. Recognizing that the FFT may ignore the temporal information of a time sequence, the proposed fast melody comparison method is combined with the dynamic time warping technique to take care of both efficiency and effectiveness. Our experiments compared the proposed system with the baseline and several other speed-up DTW-based systems. It was found that the baseline DTW-based system achieves the best MP and MRR among all systems. However, the computational complexity of the baseline DTW-based system is so high that it is not acceptable in practical QBSH applications. In addition, although several existing speed-up DTW-based systems reduce the computational complexity to some degree, they are still far from able to handle a large-scale collection of music documents, such as 10,000 songs. By contrast, the proposed system spends the least time processing a query while achieving MP and MRR comparable to the baseline DTW-based system. Furthermore, the proposed system is the only one of the systems in question capable of supporting real-time response for a document set of 10,000 songs. The results confirm the superiority of our system over the existing systems.
In the future, it is worth extending the QBSH problem from the monophonic note sequences considered in this work to the polyphonic case. To compare a monophonic sequence with a polyphonic sequence, or to compare two polyphonic sequences, we may study the feasibility of modifying Eq. (4) from its current vector form to a matrix form.
Acknowledgements

This research was partially supported by the National Science Council, Taiwan, ROC, under Grant NSC 100-2628-E-027-002. The authors would like to thank the anonymous reviewers for their careful reading of this paper and their constructive suggestions.

References
Cao, W., Jiang, D., Hou, J., Qin, Y., Zheng, T.F., Liu, Y., 2009. A phrase-level piecewise linear scaling algorithm for melody match in query-by-humming systems. In: Proc. IEEE Internat. Conf. on Multimedia and Expo.
Casey, M.A., Veltkamp, R., Goto, M., Leman, M., Rhodes, C., Slaney, M., 2008. Content-based music information retrieval: Current directions and future challenges. In: Proc. IEEE.
Doraisamy, S., Ruger, S.M., 2003. Robust polyphonic music retrieval with N-grams. J. Intell. Inf. Syst. 21 (1), 53–70.
Guo, L., He, X., Zhang, Y., Lu, Y., 2008. Content-based retrieval of polyphonic music objects using pitch contour. In: Proc. IEEE Internat. Conf. on Acoust., Speech, Signal Process.
Hou, J., Jiang, D., Cao, W., Qin, Y., Zheng, T.F., Liu, Y., 2009. Effectiveness of N-gram fast match for query-by-humming systems. In: Proc. IEEE Internat. Conf. on Multimedia and Expo.
Jang, J.S., Lee, H.R., 2001. Hierarchical filtering method for content-based music retrieval via acoustic input. In: Proc. ACM Internat. Conf. on Multimedia.
Jang, J.S., Lee, H.R., Kao, M.Y., 2001. Content-based music retrieval using linear scaling and branch-and-bound tree search. In: Proc. IEEE Internat. Conf. on Multimedia and Expo.
Jang, J.S., Lee, N.J., Hsu, C.L., 2006. Simple but effective methods for QBSH at MIREX 2006. In: Proc. Internat. Conf. on Music Information Retrieval.
Jang, J.S., Lee, H.R., 2008. A general framework of progressive filtering and its application to query by singing/humming. IEEE Trans. Audio Speech Lang. Process. 16 (2), 350–358.
Khan, N.A., Mushtaq, M., 2011. Open issues on query by humming. In: Proc. Fourth Internat. Conf. on Applications of Digital Information and Web Technologies.
Kim, K., Park, K.R., Park, S.J., Lee, S.P., Kim, M.Y., 2011. Robust query-by-singing/humming system against background noise environments. IEEE Trans. Consum. Electron. 57 (2), 720–725.
Pauws, S., 2002. CubyHum: A fully operational query by humming system. In: Proc. Internat. Symp. Music Information Retrieval.
Salvador, S., Chan, P., 2007. Toward accurate dynamic time warping in linear time and space. Intell. Data Anal. 11, 561–580.
Shifrin, J., Burmingham, W., 2003. Effectiveness of HMM-based retrieval on large databases. In: Proc. Internat. Conf. on Music Information Retrieval.
Shih, H.H., Narayanan, S.S., Kuo, C.C.J., 2003. Multidimensional humming transcription using a statistical approach for query by humming systems. In: Proc. IEEE Internat. Conf. on Acoust., Speech, Signal Process.
Unal, E., Chew, E., Georgiou, P., Narayanan, S., 2007. Challenging uncertainty in query-by-humming systems: A fingerprinting approach. IEEE Trans. Audio Speech Lang. Process. 16 (2), 359–371.
Yu, H.M., Tsai, W.H., Wang, H.M., 2008. A query-by-singing system for retrieving karaoke music. IEEE Trans. Multimedia 10 (8), 1626–1637.