An integrated music video browsing system for personalized television


Expert Systems with Applications 38 (2011) 776–784


Hyoung-Gook Kim a, Jin Young Kim b, Jun-Geol Baek c,*

a Department of Wireless Communication Engineering, Kwangwoon University, Wolgye-dong, Nowon-gu, Seoul 139-701, Republic of Korea
b Department of Electronic and Computer Engineering, Chonnam National University, Yongbong-dong, Buk-gu, Gwangju 500-757, Republic of Korea
c Division of Information Management Engineering, Korea University, Anam-dong, Seongbuk-gu, Seoul 136-701, Republic of Korea
* Corresponding author. Tel.: +82 2 3290 3396; fax: +82 2 929 5888. E-mail address: [email protected] (J.-G. Baek).


Keywords: Music video browsing; Music emotion classification; Theme-based music classification; Highlight detection; Shot classification

Abstract

In this paper, we propose an integrated music video browsing system for personalized digital television. The system has the functions of automatic music emotion classification, automatic theme-based music classification, salient region detection, and shot classification. From audio (music) tracks, highlight detection and emotion classification are performed on the basis of information on temporal energy, timbre, and tempo. For video tracks, shot detection is performed to classify shots into face shots and color-based shots. Lastly, automatic grouping of themes is carried out on music titles and their lyrics. Using a database of international music videos, we evaluate the performance of each function implemented in this paper. The experimental results show that the music browsing system achieves remarkable performance. Thus, our system can be adopted in any digital television to provide personalized services.

© 2010 Elsevier Ltd. All rights reserved.

1. Introduction

Personalization is one of the key features of interactive digital television. Personalized digital television (PDTV) aims to satisfy TV viewers who have different and varied interests and who face overwhelming amounts of digital video. For the successful implementation of PDTV, the system has to present the diverse features of video content concisely as well as analyze each viewer's inclination. For the analysis of user needs, a fast browsing system that provides various features of video content is essential. Until now, automatic video browsing systems have been proposed and developed for sports video, news video, home video, and movies, and all of these systems are integrated with PDTVs. However, relatively little research has been conducted on the analysis and browsing of music videos, which are one of the most popular forms of content.

Music videos are available via many cable TV channels and websites. Since its inception, the music video genre has attracted an increasingly large viewership across age groups. With the rapid development of technologies for multimedia content capture, storage, high-bandwidth/high-speed transmission, and compression standards, the production and distribution of music videos have increased rapidly and become more widely available. Nowadays, many music content providers and companies put their music videos on television and the Internet. Currently, however, it is not easy to browse, search, or access specific videos of interest, and users have difficulty finding their favorite music videos in the vast music video databases.

Although music video summaries or thumbnails such as salient regions are available at most music websites and music TV channels, they are generated manually. Therefore, automatically detecting a concise and informative salient region of an original music video for rapid browsing is a challenging task, and automatic integrated browsing systems for digital music videos are needed. Many papers address one of the elemental techniques: music emotion classification (Feng, Zhuang, & Pan, 2003; Liu, Lu, & Zhang, 2003; Zhu, Shi, Kim, & Eom, 2006), automatic theme classification (Bekkerman & Allan, 2004; Logan, Kositsky, & Moreno, 2004; Mahedero, Martínez, & Cano, 2005; Scott & Matwin, 1998), salient region detection (Agnihotri, Dimitrova, & Kender, 2004; Agnihotri, Dimitrova, Kender, & Zimmerman, 2003; Xu, Shao, Maddage, & Kankanhalli, 2005), and face detection (Pham & Cham, 2007; Schapire & Singer, 1999; Viola & Jones, 2001a, 2001b). However, none of them presents an integrated music browsing system. That is, it is necessary to integrate all the elemental techniques into a unified browsing system that gives users a comfortable scenario and diverse music video features for user-friendly browsing.

In this paper, we propose an integrated music video browsing system that provides users with the features of emotion, theme, salient region, and specific shots of faces and color intensity for music videos. Referring to these features, viewers can select their favorite music videos rapidly and conveniently. In the first stage, users select music videos by emotion or theme. In the second stage, they choose music videos by watching salient regions and other specific face or color shots. To implement our music browsing system, we adopted, modified, and enhanced the performance of each elemental technique previously proposed in the literature.


The rest of the paper is organized as follows. Section 2 presents the proposed system framework for music video browsing. Section 3 describes the music track analysis and Section 4 the visual track analysis. Section 5 presents the theme-based music categorization. Experimental results on a music video database are given in Section 6. Finally, Section 7 offers conclusions and future directions.

2. System framework for music video browsing

We designed and implemented the music video browsing system with the objective of rapidly finding favorite music videos among the vast number of music videos offered by IPTV music channels. The proposed music video browsing system is illustrated in Fig. 1. The system consists of two phases: database generation and browsing. In the database generation phase, the video and audio signals of all the music video content of the IPTV music channels are encoded as MPEG-2 video and AC-3 audio. The proposed system analyzes the music and video tracks separately. From the audio streams, the system detects salient regions and classifies music emotions. From the video signals, special shots based on face and color information are extracted. Theme classification is performed on titles and lyrics. All the results are stored on the IPTV servers as metadata.

In the browsing phase, users search for their favorite music videos via a two-stage scenario. First, users select music videos using the information on a music video's emotion or theme. Given a selected query button (emotion or theme), the music videos belonging to the selected emotion or theme appear on the TV screen. In the second stage, three buttons, ⟨search salient regions⟩, ⟨search scene by color⟩, and ⟨search scene by face⟩, appear on the TV screen. When ⟨search salient regions⟩ is selected, as depicted in Fig. 2(a), the salient region of each music video is played. The user can watch the salient region of each music video in order to select his/her favorite music videos quickly. By pressing the ⟨Yes⟩ button on the interface, the selected music videos are sent to the user's favorite music list, ⟨My Music Videos⟩. The browsing process is rendered on a graphical interface as shown in Fig. 2. The music videos listed in ⟨My Music Videos⟩ are the personalized music content and are used to analyze patterns of interaction or user idiosyncrasies.

A block diagram of the implemented music browsing system is shown in Fig. 3. As shown in Fig. 3, audio analysis is performed with the MDCT coefficients readily available from the AC-3 audio codec. For the video track, the video shots are segmented using DC images that are partially decoded from the MPEG-2 compressed video. The specifics of the audio and video processing are explained in the following sections.
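One way to picture the metadata produced in the database generation phase is as a simple per-video record. The sketch below is purely illustrative: the field names and types are assumptions for exposition, not the storage schema actually used on the IPTV servers.

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class MusicVideoMetadata:
    """Hypothetical per-video record stored after analysis (illustrative only)."""
    video_id: str
    emotion: str                                        # e.g. "calm", "pleasant", "excited"
    themes: List[str] = field(default_factory=list)     # theme-based categories
    salient_region: Tuple[float, float] = (0.0, 0.0)    # (start, end) in seconds
    face_shots: List[Tuple[float, float]] = field(default_factory=list)
    color_shots: List[Tuple[float, float, str]] = field(default_factory=list)  # (start, end, colour class)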

3. Music track analysis

The unique characteristic of music video content is that the music plays the dominant role in a music video. Since the chorus is melodically stronger than other sections and is the most repeated section, it is the most memorable portion of the song. In this paper, we use the music highlight, which includes the chorus, as a thumbnail for the music content in order to avoid time-consuming chorus detection based on music structure analysis (Goto, 2003). Allowing the user to browse for music by its emotional effect is a step toward addressing the needs and preferences of the user in music information retrieval: it lets the user easily select a subset of a personal music database to suit the current mood.

Xu et al. (2005) and Goto (2003) proposed music chorus detection algorithms based on octave-band features, correlation, and similarity likelihood. Their methods are computationally intensive, and their detection rates are about 86%. In this paper, we present a simple algorithm that employs only an energy feature and relies on the prior knowledge that the chorus is repeated more than twice in a music file and that the first chorus is likely to be found between 40 and 130 s from the start of the video. In the following subsections, we discuss how to detect the music highlight from the audio streams.

3.1. Music highlight detection

To find the highlight section of a song, we use only the energy information of the audio streams. The music highlight detection works as shown in Fig. 4.

Fig. 1. Basic concept of the proposed music video browsing system.


Fig. 2. Demonstration of the music video browsing system.

By observing music videos, we found that the chorus is repeated more than twice in a music file and that the first chorus can usually be found between 40 and 130 s. Therefore, we define this region as the chorus candidate region and search for the music highlight within it. First, the MDCT coefficients, extracted from the AC-3 audio encoder, are converted to the decibel scale per sub-segment of one second in length. Each decibel-scale spectral vector is normalized with the root mean square (RMS) energy envelope. Finally, the sum of the normalized decibel-scale MDCT, called SNMDCT, is computed as

E_n = \frac{\sum_{n=N_s}^{N_e} \left\{ 10 \log_{10}\big(\mathrm{MDCT}(n,p)\big) \right\}}{\sqrt{\frac{1}{N_s \times N_e} \sum_{p=P_s}^{P_e} \left\{ 10 \log_{10}\big(\mathrm{MDCT}(n,p)\big) \right\}^2}},    (1)

where p, n, Ps, Pe, Ns, and Ne are the index of the frequency range, the frame index, the number of MDCT coefficients corresponding to 65 Hz, the number of MDCT coefficients corresponding to 8,372 Hz, the start sub-segment, and the end sub-segment, respectively.

The SNMDCT En calculated from each sub-segment is used for detecting candidates for the start and end positions of the music highlight and for determining the peak position. The SNMDCT, including its position, is stored in Yn as a highlight start-end candidate value (HSECV) for finding the highlight start candidate position (HSCP) and the highlight end candidate position (HECP). Before detecting the HSCP and HECP, the peak position (PP) is determined by comparing En with En+1. If En is larger than En+1, the peak value (PV) is set to En and the peak position (PP) is set to n. The detected PP is compared with the highlight start-end candidate positions (HSECP) in order to find the start candidate position (HSCP) and the end candidate position (HECP). If an HSECP is smaller than PP, that HSECP is set as an HSCP and its HSECV becomes a highlight start candidate value (HSCV). The HSCVs are then compared to find the smallest value (SHSCV) among them, and the SHSCP (the position of the SHSCV) is taken as the start position (HSP) of the music highlight. On the other hand, if PP is smaller than an HSECP, that HSECP is set as an HECP and its HSECV becomes a highlight end candidate value (HECV). Thereafter, the end position (HEP) of the music highlight is detected. Fig. 5 shows the result of music highlight detection compared with the real chorus boundary in a music file. As shown in Fig. 5, the detected music highlight contains the chorus. This music highlight provides the user with an efficient thumbnail for selecting music videos.
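The boundary logic can be illustrated with a short sketch. This is a minimal, hypothetical Python rendering rather than the authors' implementation: it assumes the SNMDCT values have already been computed per one-second sub-segment, and it uses local energy minima as stand-ins for the HSECV candidates, whose exact definition is not spelled out above.

import numpy as np

def detect_highlight(snmdct, region_start_s=40, region_end_s=130):
    """Return an assumed (start, end) highlight, in seconds, from a
    per-second SNMDCT energy curve of the whole track."""
    # Restrict the search to the chorus candidate region (40-130 s).
    region = snmdct[region_start_s:region_end_s]

    # Peak position (PP): the sub-segment with the largest energy.
    pp = int(np.argmax(region))

    # Candidate boundaries: local energy minima (assumed stand-ins for HSECVs).
    candidates = [n for n in range(1, len(region) - 1)
                  if region[n] < region[n - 1] and region[n] < region[n + 1]]

    # Start (HSP): lowest-energy candidate before the peak (SHSCV);
    # end (HEP): lowest-energy candidate after the peak (assumed, not specified).
    starts = [n for n in candidates if n < pp]
    ends = [n for n in candidates if n > pp]
    hsp = min(starts, key=lambda n: region[n]) if starts else 0
    hep = min(ends, key=lambda n: region[n]) if ends else len(region) - 1
    return region_start_s + hsp, region_start_s + hep

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    energy = rng.normal(size=240).cumsum()   # toy per-second energy curve
    print(detect_highlight(energy))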

3.2. Music emotion classification

For automatic music emotion classification, we extract two kinds of features in the compressed domain: timbre and tempo features. The timbre features include the spectral centroid, spectral bandwidth, roll-off frequency, and spectrum flux of the MDCT amplitude spectrum. In addition, the peaks, valleys, arithmetic mean, flatness, and crest values of seven logarithmic sub-bands of the MDCT amplitude spectrum are adopted. The details of the calculations can be found in Zhu et al. (2006). The timbre feature vector consists of 26 components with a 13 ms time resolution.

For extracting the tempo features, the MDCT coefficients are grouped into eight logarithmic octave-scale sub-bands to obtain reduced-rank spectral features. Frequency channels are logarithmically spaced in non-overlapping 1/4-octave bands spanning from 62.5 Hz, the default "low edge", to 8,372 Hz, the default "high edge". The output of each logarithmic frequency band is the weighted sum of the MDCT coefficients in that sub-band. The amplitude deviation between two successive frames is obtained, and then a third-order low-pass filter with a 10 Hz cut-off frequency is applied to the sub-band amplitudes. Then, a three-second continuous wavelet transform (WT) with eight different dyadic scales is performed on each amplitude deviation signal. Finally, the amplitude of the WT is taken as the tempo spectrum. The tempo feature vector contains 64 components with a one second time resolution.

To effectively combine these two feature sets with different time resolutions for music classification, a classifier with two layers and a support vector machine (SVM)-based AdaBoost learning algorithm is applied. Layer I is responsible for the timbre features and Layer II for the tempo features. Both layers comprise a number of pair-wise classifiers, each of which is responsible for distinguishing between a music class C and its anti-class C̃, which comprises the data not belonging to class C. Each pair-wise classifier includes a PCA transform and an SVM classifier. The dimensionality of the PCA transform matrix and the support vectors in the SVM models are determined by AdaBoost.
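As a rough illustration of the timbre descriptors mentioned above (centroid, bandwidth, roll-off, and spectrum flux), the following sketch operates on MDCT amplitude spectra of two consecutive frames. It is an assumption-laden stand-in: the exact definitions, the 95% roll-off ratio, and the sample rate are illustrative, and the sub-band peak/valley/flatness/crest measures of Zhu et al. (2006) are omitted.

import numpy as np

def timbre_features(mdct_prev, mdct_cur, sample_rate=48000):
    """Per-frame timbre descriptors from two consecutive 1-D MDCT amplitude spectra."""
    n_bins = len(mdct_cur)
    freqs = np.linspace(0, sample_rate / 2, n_bins)
    power = mdct_cur ** 2
    total = power.sum() + 1e-12

    centroid = (freqs * power).sum() / total
    bandwidth = np.sqrt((((freqs - centroid) ** 2) * power).sum() / total)
    # Roll-off: frequency below which 95% of the spectral power lies (ratio assumed).
    idx = min(int(np.searchsorted(np.cumsum(power), 0.95 * total)), n_bins - 1)
    rolloff = freqs[idx]
    # Spectrum flux: change of the normalized spectrum between frames.
    flux = np.sum((mdct_cur / (mdct_cur.sum() + 1e-12)
                   - mdct_prev / (mdct_prev.sum() + 1e-12)) ** 2)
    return np.array([centroid, bandwidth, rolloff, flux])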


Fig. 3. Block diagram of the music browsing system.

The linear SVM model uses geometric properties for exactly calculating the optimal separating hyper-plane, expressed by a linear function of the inputs, as in Eq. (2):

f(X) = W^T X + b,   b: \text{bias},   X = \{x_1, \ldots, x_n\}.    (2)

Each binary classifier labels the input feature vector as positive (C) or negative (C̃). The classification rule determines the music class according to the ratio between the positive frames and the negative frames:

I = \arg\max_j \left\{ \alpha \, \frac{P_{1,C_j}}{P_{1,C_j} + \tilde{P}_{1,C_j}} + (1-\alpha) \, \frac{P_{2,C_j}}{P_{2,C_j} + \tilde{P}_{2,C_j}} \right\}.    (3)

In Eq. (3), P_{i,C_j} is the number of positive frames of class j in the ith layer, and \tilde{P}_{i,C_j} is the number of negative frames of class j in the ith layer. I is the result of the music classification. The smoothing parameter α = 0.7 is adopted in our implementation.
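The decision rule of Eq. (3) can be written compactly as follows. This is a minimal sketch under the assumption that the per-layer positive/negative frame counts have already been produced by the PCA+SVM pair-wise classifiers; the variable names and example counts are illustrative.

import numpy as np

def classify(pos_frames, neg_frames, alpha=0.7):
    """Combine the per-layer frame votes according to Eq. (3).

    pos_frames[i][j]: frames of class j labelled positive by layer i
                      (i = 0: timbre layer, i = 1: tempo layer).
    neg_frames[i][j]: frames labelled as the anti-class.
    Returns the index of the winning music class."""
    p = np.asarray(pos_frames, dtype=float)
    q = np.asarray(neg_frames, dtype=float)
    ratio = p / (p + q + 1e-12)                 # positive-frame ratio per layer and class
    score = alpha * ratio[0] + (1.0 - alpha) * ratio[1]
    return int(np.argmax(score))

# Toy example with three emotion classes (calm, pleasant, excited)
pos = [[120, 300, 40], [80, 200, 60]]
neg = [[280, 100, 360], [120, 60, 140]]
print(classify(pos, neg))   # -> 1 (pleasant)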

4. Video track analysis and generation of salient regions

The video track analysis detects and classifies video shots so as to create continuous and meaningful salient regions of music videos by aligning the shots with the detected music highlights. The video track analysis includes shot boundary detection and semantic shot classification.

4.1. Shot boundary detection

With an MPEG-2 video stream as the input signal, shot boundary detection along with key-frame extraction finds the scene change times (shots) and reduces them to key-frame images. The positions of the shot changes are detected by comparing an RGB color histogram for each video frame. For good computational performance, only I-frames and their DC coefficients from the compressed domain of the MPEG stream are considered for the color histogram computation. As the I-frames appear at least twice per second, the temporal resolution is accurate enough for face detection. For finding the face segments, only hard cuts are considered, as they are the prominent shot transition type in music videos. Furthermore, a spatial partitioning scheme with four horizontal slices is used to detect shot boundaries independently in each region. An adaptive threshold decides the shot boundaries; the threshold is a weighted average over L consecutive color-histogram distances. For every interval between two shot boundaries, a key-frame in the middle of the interval is extracted. The original MPEG-compressed video sequence can then be represented by the shot set {Shot_i}, where i indexes the detected shots.
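A simplified sketch of this boundary detector is given below. It is not the authors' code: it pools all colour channels of each slice into a single histogram rather than a full RGB histogram, uses a plain mean instead of the weighted average described above, and the window length and bias factor are assumed values.

import numpy as np

def shot_boundaries(dc_frames, n_slices=4, window=10, bias=1.5):
    """Hard-cut detection on a sequence of DC images (H x W x 3 arrays).

    A boundary is declared in a horizontal slice when its histogram distance
    to the previous frame exceeds an adaptive threshold derived from the
    last `window` distances."""
    def slice_hists(img):
        return [np.histogram(s, bins=32, range=(0, 255))[0].astype(float)
                for s in np.array_split(img, n_slices, axis=0)]

    boundaries = []
    prev = slice_hists(dc_frames[0])
    history = [[] for _ in range(n_slices)]
    for t in range(1, len(dc_frames)):
        cur = slice_hists(dc_frames[t])
        for s in range(n_slices):
            d = np.abs(cur[s] - prev[s]).sum()
            recent = history[s][-window:]
            # Adaptive threshold: (simple) average of recent distances, scaled by a bias.
            thresh = bias * np.mean(recent) if recent else np.inf
            if d > thresh:
                boundaries.append(t)
            history[s].append(d)
        prev = cur
    return sorted(set(boundaries))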

4.2. Shot classification

For a better representation of a shot's semantic meaning, we classify the detected shots into two independent categories: face shots versus non-face shots, and color-dependent shots.

4.2.1. Face detection

The face is an important characteristic in music videos, as it indicates the singer or actor/actress in the music video. Therefore, the salient region of a music video should contain the face shots. Face and non-face shots appear alternately in a music video as the semantic meaning of the video content changes. Considering these dynamic characteristics, we adopt an asymmetric boosted cascade classifier (Pham & Cham, 2007; Viola & Jones, 2001b) for face detection. The proposed face shot detection algorithm works as shown in Fig. 6. In the process of face shot detection, we utilize not only an off-line learning model but also a local on-line model that adapts to the analyzed video on an ongoing basis.


Fig. 4. Flowchart of the music highlight detection.

This on-line adaptation is necessary because the variety of music video styles cannot be covered by the off-line learning model alone. Thus, we analyze the statistical information of the current stream in order to apply a local on-line adaptive model.
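As a rough illustration of where such a detector sits in the pipeline, a key-frame can be labelled with a pre-trained Viola-Jones cascade as in the sketch below. This is only a stand-in: the system described here uses an asymmetric AdaBoost cascade with on-line local adaptation, not OpenCV's stock detector, and the scale and neighbour parameters are assumptions.

import cv2

# Stand-in detector: OpenCV's pre-trained frontal-face cascade, used here
# in place of the asymmetric boosted cascade described in the text.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def is_face_shot(key_frame_bgr, min_size=(24, 24)):
    """Label a shot as a face shot if its key-frame contains at least one face."""
    gray = cv2.cvtColor(key_frame_bgr, cv2.COLOR_BGR2GRAY)
    gray = cv2.equalizeHist(gray)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1,
                                     minNeighbors=5, minSize=min_size)
    return len(faces) > 0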

First, the input video stream is decompressed after shot boundary detection, since the DC coefficients used for scene change detection are not sufficient for face detection in music videos. A face shot in the MPEG-decompressed video stream is detected using the off-line face shot model. For learning the off-line face model, the asymmetric AdaBoost-based method (Pham & Cham, 2007) is applied, because it avoids the danger of missing positive patterns and achieves high detection rates by cascading the weak classifiers. The off-line model is generated using Haar-like features (Papageorgiou, Oren, & Poggio, 1998; Viola & Jones, 2001c) extracted from a training set consisting of 4,000 frontal face images and 7,000 non-face images. Face and non-face images are scaled to a size of 24 × 24 pixels. The Haar-like features are computed on the integral image, which allows very fast feature evaluation. A cascade of 25 boosted classifiers, from coarse to fine, is set up. The first classifier has one weak sub-classifier, while the final classifier has 200 weak sub-classifiers. Through the asymmetric AdaBoost training procedure, each classifier in the cascade is trained with the Haar-like features extracted from the training faces and non-face sub-windows.

Second, using the results of the face shot detection based on the off-line model, we perform learning on-line local adaptation (LLA). We use only the shots that have a good confidence value (CV); that is, input shots with a CV lower than a given model threshold are discarded. The training of the LLA model is applied to only 30 s of a music video. Once the adaptation model training is complete, only the LLA model is used for face shot detection. The on-line learning asymmetric boosted classifier seeks to balance the skewness of the labels presented to the weak classifiers, allowing them to be trained more equally, and hence resulting in a good performance gain and a fast convergence speed. Fig. 7 shows the output of the face detector on some test music videos.

4.2.2. Color intensity

The colors are extracted from the key-frame of each shot. To extract them, an HSV color histogram is computed from the key-frame of each shot. The most prevailing dominant bins from the 12th bin to the 23rd bin are selected, and the average HSV values of the pixels contained in those bins are taken as the key-frame colors.
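A minimal sketch of this key-frame colour extraction follows, under the assumption that the bins in question are hue bins of a 24-bin histogram; the paper's exact HSV quantization is not specified, so the bin layout here is illustrative.

import cv2
import numpy as np

def key_frame_color(key_frame_bgr, n_bins=24, lo=12, hi=23):
    """Average HSV colour of the dominant hue bin (within bins lo..hi) of a key-frame."""
    hsv = cv2.cvtColor(key_frame_bgr, cv2.COLOR_BGR2HSV)
    hue = hsv[:, :, 0].astype(int)            # OpenCV hue range: 0..179
    bin_idx = hue * n_bins // 180             # quantize hue into n_bins bins
    counts = np.bincount(bin_idx.ravel(), minlength=n_bins)

    if counts[lo:hi + 1].max() == 0:          # nothing falls into bins lo..hi
        return hsv.reshape(-1, 3).mean(axis=0)
    dominant = lo + int(np.argmax(counts[lo:hi + 1]))
    mask = bin_idx == dominant
    return hsv[mask].mean(axis=0)             # average (H, S, V) of those pixels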

Fig. 5. Detection of music highlight compared with real chorus.


Fig. 8 shows the shots detected according to color intensity. The ith shot in the music video can be represented as Shot_i = ⟨Start-B, End-B, Class⟩, where Start-B and End-B denote the start and end boundaries of the shot, respectively, and Class indicates whether the shot is yellow, blue, red, etc.

Fig. 6. Flowchart of on-line face detection.

4.3. Generation of salient regions

The purpose of music-visual alignment is to synchronize the highlight regions detected from the music track with the face shots detected from the visual track, so as to obtain a meaningful and smooth salient region. Fig. 9 illustrates the generation of salient regions. Assume the music highlight in a song is represented as highlight position (HP) = {highlight start position (HSP), highlight end position (HEP)}, and the ith face shot detected within the music highlight region is represented as face shot position (FP) = {FSSP_i (ith face shot start position), FSEP_i (ith face shot end position)}. If the number of face shots in FP is smaller than a defined threshold d, the salient region (SR) is decided as SR = {FSSP, FSEP}. If not, the salient region depends on HP and is represented as SR = {HSP, HEP}. Fig. 10 shows the alignment of the detected music highlight and the detected face shots according to this rule.
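The selection rule of Fig. 9 reduces to a few lines of code. The sketch below is an interpretation of the description above: the threshold value and the choice of which face shot to return when there are fewer than d of them are assumptions, since the text does not pin them down.

def salient_region(highlight, face_shots, d=2):
    """Salient-region rule of Fig. 9 (threshold d is illustrative).

    highlight:  (HSP, HEP) in seconds.
    face_shots: list of (FSSP_i, FSEP_i) detected inside the highlight."""
    if len(face_shots) < d:
        # Fewer face shots than the threshold: use face-shot boundaries
        # (the text does not say which shot; the first one is taken here),
        # falling back to the highlight when no face shot was found.
        return face_shots[0] if face_shots else highlight
    # Otherwise the salient region follows the music highlight itself.
    return highlight

print(salient_region((62.0, 95.0), [(70.0, 74.5)]))   # -> (70.0, 74.5)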

5. Theme-based music categorization

The thematic categorizer decides the thematic categories of songs from the song title and lyrics, which are obtained from the file name or the ID3 tag of the music file. Fig. 11 shows the architecture of theme categorization based on the vector similarity approach (Salton & Buckley, 1988; Viola & Jones, 2001c). For category representation, morpheme unigrams and bigrams are extracted from the titles and lyrics in the training database after morpheme analysis. Features that indicate each theme are selected from these candidates. We use chi-square statistics and a post-processing rule for feature selection. After the features are selected, vectors composed of ⟨feature: term weight⟩ pairs are constructed for each theme. In song titles and lyrics, word frequency is not very informative, because only a small number of words are used, each briefly. Thus, Boolean weights are used as term weights.

For music categorization, as with category indexing, a music title and its lyrics are represented as a vector of ⟨feature: term weight⟩ pairs after morpheme analysis. The term weight of a feature is 1 if the feature exists in the song title; otherwise, it is 0. The thematic categorizer computes the similarity between the category and title vectors and assigns thematic categories to the song. The inner product of the two vectors is used to compute the similarity.
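Because Boolean term weights turn the inner product into a simple overlap count, the scoring step can be sketched as follows. Morpheme analysis and chi-square feature selection are omitted, and the feature sets and song terms are toy examples, not data from the paper.

def theme_scores(song_terms, category_vectors):
    """Score a song against each theme with Boolean weights and an inner product.

    song_terms:       set of unigrams/bigrams extracted from the title and lyrics.
    category_vectors: dict mapping a theme name to its selected feature set."""
    return {theme: len(song_terms & features)
            for theme, features in category_vectors.items()}

# Hypothetical toy example
categories = {
    "love":      {"love", "heart", "kiss"},
    "parting":   {"goodbye", "tears", "leave"},
    "christmas": {"christmas", "snow", "santa"},
}
song = {"my", "heart", "will", "love", "you", "goodbye"}
scores = theme_scores(song, categories)
print(max(scores, key=scores.get))   # -> "love"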

Fig. 7. Some results of face detection based on on-line asymmetric AdaBoost.


Fig. 8. Examples of shots detected according to color intensity. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 9. Block diagram of the salient region determination.

6. Experimental results

We evaluated our integrated music browsing approach on a database of 500 music videos recorded from several music TV programs. The database consists of 200 US music videos and 300 Korean music videos. The videos are all encoded either as MPEG-1 with a 352 × 240 frame resolution or as MPEG-2 with a 720 × 480 frame resolution. The durations of the music videos were 2-4 min. The experiments covered the accuracy of music emotion classification, theme-based categorization, shot classification, and salient region detection. We also present the overall results obtained from subjective MOS tests.

6.1. Music emotion classification

To evaluate the performance of the proposed music emotion classification method, the songs were divided into three emotion categories: (1) calm, (2) pleasant, and (3) excited. Corresponding to each music emotion, there were 500 songs in the dataset. Nine independent trained listeners labeled each song, and only songs that all nine listeners consistently assigned to the same emotion category were accepted into the dataset.

Fig. 10. Alignment process of the detected music highlight and the detected face shots.


Table 1
Face shot classification results.

Shot class       Total    Correct   False alarm   Recall (%)   Precision (%)
Close-up face    19087    17574     1513          95.26        91.47
Non-face         13357    12275     1082          89.35        90.84

Fig. 11. Block diagram of the theme-based music categorization.

In our experiment, the precision of the music emotion classification reached 96.3%.

6.2. Theme-based categorization

In the experiment on theme-based music classification, we categorized 500 songs into 14 categories, namely: love, parting, longing, rain, nature, policy and war (social criticism), travel, spring, summer, fall, winter, city, Christmas, and 'others'. The test set included songs whose theme was not deducible from the title and lyrics and that did not fall under the above-listed 13 categories; these songs fall into the 'others' category. A precision of 85.3% and a recall of 87.6% were obtained for the 500 songs. The algorithm is effective in this text categorization task, in which the content of the lyrics can correspond to several thematic categories. It also has the merit of being easily usable in mobile applications, because it requires little processing time and is computationally inexpensive.

6.3. Shot classification

We investigated the accuracy of the on-line learning asymmetric boosted classifier used for face shot classification. On a 700 MHz Pentium III processor, the face detector processes a 352 × 240 pixel image in about 0.54 s. The classification results are shown in Table 1.

6.4. Salient region detection

Since there is no objective measure available to evaluate the quality of a music video salient region, the salient regions detected from the music videos by our approach were judged in a mean opinion score (MOS) test by 22 individuals. The results are summarized in Table 2. The MOS test results show that the detected salient regions performed much better than simply taking the first 30-s region of the video without audiovisual processing.

Table 2
MOS test results.

Method                                    MOS ± standard deviation
First region with about 30-s length       1.9 ± 0.35
The detected salient regions              3.4 ± 0.25

6.5. Overall test

We also evaluated the proposed browsing system with a subjective test. Twenty-two users participated in the MOS tests. There were two criteria for judging the queried music videos: speed (i.e., response time) and satisfaction. Users were asked to grade the system on a five-point scale by comparing it with a primitive browsing system lacking the features of our system. The overall result was 3.9 ± 0.23. We conclude that a satisfactory outcome was achieved.

7. Conclusions

In this paper, we presented the design and evaluation of an integrated music video browsing system providing the features of salient regions, emotion information, themes, face shots, and color-specific shots. Our approach integrates audio, video, and text analyses with low complexity. Thus, it is easy to incorporate the proposed system into a target platform such as a DVR or DTV. With both objective and subjective tests, we confirmed the effectiveness of the proposed system for music video analysis and browsing. In the future, we plan to extend our music video browsing system to an adaptive system that analyzes users' patterns and responds to users' idiosyncrasies or desires according to their expectations.

Acknowledgement

This work was supported by a Korea Research Foundation Grant funded by the Korean Government (KRF-2008-331-D00421) and by the Research Grant of Kwangwoon University in 2008.

References

Agnihotri, L., Dimitrova, N., Kender, J., & Zimmerman, J. (2003). Music video miner. In Proceedings of the ACM international conference on multimedia, Berkeley, CA, USA.
Agnihotri, L., Dimitrova, N., & Kender, J. (2004). Design and evaluation of a music video summarization system. In Proceedings of the international conference on multimedia and expo, Taipei, Taiwan.
Bekkerman, R., & Allan, J. (2004). Using bigrams in text categorization. CIIR technical report IR-408.
Feng, Y. Z., Zhuang, Y. T., & Pan, Y. H. (2003). Music information retrieval by detecting mood via computational media aesthetics. In Proceedings of the IEEE/WIC international conference on web intelligence, Beijing, China.
Goto, M. (2003). A chorus-section detecting method for musical audio signals. In Proceedings of the IEEE international conference on acoustics, speech and signal processing (ICASSP 2003), Hong Kong.
Liu, D., Lu, L., & Zhang, H. J. (2003). Automatic mood detection from acoustic music data. In Proceedings of the 4th international conference on music information retrieval (ISMIR 2003), Baltimore, Maryland, USA.
Logan, B., Kositsky, A., & Moreno, P. (2004). Semantic analysis of song lyrics. In Proceedings of the international conference on multimedia and expo, Taipei, Taiwan.
Mahedero, J. P. G., Martínez, A., & Cano, P. (2005). Natural language processing of lyrics. In Proceedings of the 13th annual ACM international conference on multimedia, New York, NY, USA.


Papageorgiou, C. P., Oren, M., & Poggio, T. (1998). A general framework for object detection. In Proceedings of the conference on computer vision and pattern recognition, Santa Barbara, CA, USA.
Pham, M.-T., & Cham, T.-J. (2007). Online learning asymmetric boosted classifiers for object detection. In Proceedings of the conference on computer vision and pattern recognition, Minneapolis, Minnesota, USA.
Salton, G., & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5), 513-523.
Schapire, R. E., & Singer, Y. (1999). Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3), 297-336.
Scott, S., & Matwin, S. (1998). Text classification using WordNet hypernyms. In Proceedings of the COLING/ACL workshop on usage of WordNet in natural language processing systems, Montreal, Canada.

Viola, P., & Jones, M. (2001a). Rapid object detection using a boosted cascade of simple features. In Proceedings of the conference on computer vision and pattern recognition, Kauai, Hawaii, USA.
Viola, P., & Jones, M. (2001b). Fast and robust classification using asymmetric AdaBoost and a detector cascade. In Advances in neural information processing systems, Vancouver, BC, Canada.
Viola, P., & Jones, M. (2001c). Robust real-time object detection. In Workshop on statistical and computational theories of vision, Vancouver, BC, Canada.
Xu, C., Shao, X., Maddage, N. C., & Kankanhalli, M. S. (2005). Automatic music video summarization based on audio-visual-text analysis and alignment. In Proceedings of the 28th annual international ACM SIGIR conference on research and development in information retrieval, Salvador, Brazil.
Zhu, X., Shi, Y.-Y., Kim, H.-G., & Eom, K.-W. (2006). An integrated music recommendation system. IEEE Transactions on Consumer Electronics, 52(3), 917-925.