Signal Processing 91 (2011) 2154–2157
Fast communication
Higher-order moments for musical genre classification
Jin S. Seo a, Seungjae Lee b
a Department of EE, Gangneung-Wonju National University, Republic of Korea
b Content Research Division, Electronics and Telecommunications Research Institute, Republic of Korea
E-mail addresses: [email protected] (J.S. Seo, corresponding author), [email protected] (S. Lee)
Article history: Received 15 October 2010; Received in revised form 24 March 2011; Accepted 24 March 2011; Available online 30 March 2011

Keywords: Musical genre classification; Moment; Skewness; Kurtosis; Spectral contrast

Abstract

This paper presents a study on the performance of higher-order moments for musical genre classification. In particular, the skewness and the kurtosis of the octave-scale subbands are considered. Experimental results on two widely used music datasets show that the higher-order moment features can improve classification accuracy when combined with the conventional lower-order moment features.

© 2011 Elsevier B.V. All rights reserved.
1. Introduction

With the development of electronics and information technology, there has been an explosion in the use of digital music through electronic commerce, on-line services, and portable devices. Musical genre is one of the most popular types of metadata for describing music content in music information retrieval. This paper focuses on automatic musical genre classification, which is needed to avoid time-consuming and tedious manual annotation. An overview of musical genre classification is shown in Fig. 1. For a successful musical genre classifier, extracting features that allow direct access to the relevant genre-specific information is crucial.
Most musical genre classification systems utilize low-level spectral features of the short-time audio signal (typically between 20 and 100 ms), such as mel-frequency cepstral coefficients (MFCC), octave-based spectral contrast (OSC), and other spectrum descriptors, which are related to the timbral texture of an audio signal [1–4]. Usually the short-time spectral features of a single frame are not sufficient for genre classification. Thus the short-time low-level spectral features are integrated, over a longer duration (usually several seconds), into a segment-level feature. Although various integration methods have been applied [3], taking the mean and the standard deviation of the short-time features is the most popular. The segment-level features are used for training and testing statistical classifiers [5–7], such as support vector machines (SVMs), nearest-neighbor (NN) classifiers, or Gaussian mixture models (GMMs), for genre classification. Classification is performed for every segment, and the majority voting rule is typically used to combine the classification results of the segments.

This paper studies the performance of the higher-order moments, such as the skewness and the kurtosis, for musical genre classification. Higher-order moments carry supplementary statistical information about the signal beyond the conventional first- or second-order moments. For example, skewness is a measure of the asymmetry of a distribution, which may represent the relative disposition of the tonal and non-tonal components of the audio spectrum. In this paper, the higher-order moments are utilized both in extracting the low-level spectral features and in integrating them into a segment-level feature. Experimental results show that the higher-order moments are conducive to improving the genre classification accuracy.
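As a small illustration of the majority-voting step of the pipeline described above, the Python sketch below combines per-segment classifier decisions into one clip-level genre; the function name and the example labels are hypothetical and not taken from the paper.

```python
from collections import Counter

def clip_genre_by_majority_vote(segment_labels):
    """Combine per-segment classifier decisions into a single clip-level genre.

    segment_labels: list of genre labels, one per classified segment.
    Ties are broken in favor of the label encountered first.
    """
    return Counter(segment_labels).most_common(1)[0][0]

# Hypothetical example: five segment-level decisions for one music clip.
print(clip_genre_by_majority_vote(["rock", "rock", "metal", "rock", "pop"]))  # prints "rock"
```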
Fig. 1. Overview of the feature-based musical genre classification.

2. Proposed musical genre classification method based on the higher-order moments

The short-time spectral features in the proposed method are based on the distributional characteristics of the octave-scale subbands. According to the results in [1], the octave-scale subbands are suitable for distinguishing the genre of an audio signal. As in [1], we use six octave-scale subbands: 0–200, 200–400, 400–800, 800–1600, 1600–3200, and 3200–6400 Hz. We note that the octave-scale bandwidths are wider than the widely used mel-scale bandwidths. The previous work in [1] used the spectral minimum and the contrast (the difference between the maximum and the minimum) of the octave-scale subbands. Let X_b[k] be the short-time power spectrum of an audio signal at frequency bin k of the b-th subband. The spectrum X_b[k] is sorted in descending order, and the sorted spectrum is denoted by X'_b[k]. Then the spectral peak and valley of the b-th subband are given by

\mathrm{Peak}_b = \log\left(\frac{1}{\alpha N_b}\sum_{i=1}^{\alpha N_b} X'_b[i]\right)   (1)

\mathrm{Valley}_b = \log\left(\frac{1}{\alpha N_b}\sum_{i=1}^{\alpha N_b} X'_b[N_b-i+1]\right)   (2)

where N_b is the number of frequency bins in the b-th subband, and \alpha is the neighborhood factor (typically set between 0.02 and 0.2). The difference between \mathrm{Peak}_b and \mathrm{Valley}_b is called the spectral contrast [1].

The spectral minimum and the contrast alone do not completely describe the distributional characteristics of the octave-scale subbands. In an attempt to boost the classification accuracy further, we extract the skewness and the kurtosis of each subband. The skewness of the b-th subband is the third-order standardized moment defined as

\mathrm{OSK}_b = \frac{E[(X_b-\mu_b)^3]}{\sigma_b^3} = \frac{\frac{1}{N_b}\sum_{k=1}^{N_b}(X_b[k]-\mu_b)^3}{\left(\frac{1}{N_b}\sum_{k=1}^{N_b}(X_b[k]-\mu_b)^2\right)^{3/2}}   (3)

where \mu_b and \sigma_b are the mean and the standard deviation of X_b given by

\mu_b = E[X_b] = \frac{1}{N_b}\sum_{k=1}^{N_b} X_b[k]   (4)

\sigma_b^2 = E[(X_b-\mu_b)^2] = \frac{1}{N_b}\sum_{k=1}^{N_b}(X_b[k]-\mu_b)^2   (5)

The kurtosis of the b-th subband is the fourth-order standardized moment defined as

\mathrm{OKU}_b = \frac{E[(X_b-\mu_b)^4]}{\sigma_b^4} - 3 = \frac{\frac{1}{N_b}\sum_{k=1}^{N_b}(X_b[k]-\mu_b)^4}{\left(\frac{1}{N_b}\sum_{k=1}^{N_b}(X_b[k]-\mu_b)^2\right)^{2}} - 3   (6)

The skewness is a measure of the asymmetry of the distribution, which can represent the relative disposition of the tonal and non-tonal components of a subband. For example, if tonal components occur frequently in a subband, the distribution of its spectrum will be left-skewed (the mass of the distribution is on the right) [8]; in the opposite case, it will be skewed to the right. The kurtosis is a measure of the peakedness of the distribution. Although the contribution of the kurtosis to musical genre classification might not be as clear, the kurtosis can depict the effective dynamic range of the spectrum in a subband.
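As an illustrative sketch (not the authors' implementation), the Python function below computes the per-subband statistics of Eqs. (1)–(6) from one short-time power spectrum. The function name, the assumption that the spectrum spans 0 Hz to the Nyquist frequency on a linear grid, the default neighborhood factor value, and the small eps guard are assumptions added here.

```python
import numpy as np

# Six octave-scale subband edges in Hz, as listed above.
SUBBAND_EDGES_HZ = [0, 200, 400, 800, 1600, 3200, 6400]

def octave_subband_features(power_spectrum, sample_rate=22050, alpha=0.1):
    """Peak, valley, mean, std, skewness and excess kurtosis per octave subband.

    power_spectrum : power spectrum of one frame, assumed to cover 0 Hz to Nyquist
                     on a linear frequency grid.
    alpha          : neighborhood factor of Eqs. (1)-(2), typically 0.02-0.2.
    Returns a dict of 6-element arrays (one value per subband).
    """
    n_bins = len(power_spectrum)
    freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    out = {k: [] for k in ("peak", "valley", "mean", "std", "skew", "kurt")}
    eps = 1e-12  # guard against log(0) and division by zero

    for lo, hi in zip(SUBBAND_EDGES_HZ[:-1], SUBBAND_EDGES_HZ[1:]):
        x = power_spectrum[(freqs >= lo) & (freqs < hi)]
        n_b = len(x)
        x_sorted = np.sort(x)[::-1]               # descending order, X'_b[k]
        m = max(1, int(round(alpha * n_b)))       # about alpha * N_b neighboring bins
        out["peak"].append(np.log(np.mean(x_sorted[:m]) + eps))     # Eq. (1)
        out["valley"].append(np.log(np.mean(x_sorted[-m:]) + eps))  # Eq. (2)

        mu = np.mean(x)                            # Eq. (4)
        sigma = np.std(x)                          # square root of Eq. (5)
        out["mean"].append(mu)
        out["std"].append(sigma)
        centered = x - mu
        out["skew"].append(np.mean(centered ** 3) / (sigma ** 3 + eps))        # Eq. (3)
        out["kurt"].append(np.mean(centered ** 4) / (sigma ** 4 + eps) - 3.0)  # Eq. (6)

    return {k: np.asarray(v) for k, v in out.items()}
```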
Over the extracted short-time spectral features, temporal integration is performed to obtain segment-level features, which are the input to the statistical classifiers. Most previous works use the mean and the standard deviation of the short-time features. In this work, we extend them with the higher-order moments, namely the skewness and the kurtosis. Thus, in this paper, the higher-order moments are considered in both the spectral and the temporal directions.

3. Experimental results

The genre-classification accuracy of the higher-order moments was evaluated on two widely used music datasets. The first music dataset (abbreviated as ISMIR2004) is the one from the ISMIR 2004 genre classification contest, in which there are 1458 songs over six different genres: classical, electronic, jazz_blues, metal_punk, rock_pop, and world. The second music dataset (abbreviated as GTZAN) is the one used by George Tzanetakis in his work [5]. It consists of 1000 songs over 10 different genres: blues, classical, country, disco, hiphop, jazz, metal, pop, reggae, and rock. For the ISMIR2004 dataset, one half of the songs was used for training and the other half for testing. For the GTZAN dataset, 10-fold cross-validation was performed to obtain the classification accuracy. Each song in the dataset was converted to mono at a sampling frequency of 22 050 Hz and then divided into frames of 46.4 ms overlapped by 23.2 ms. The octave-scale statistical features were computed for each frame and temporally integrated over 6 s segments. Then a linear SVM classifier was trained and tested on the segment-level features. The genre of each music clip was determined by majority voting over the classification results of the segments in the clip.
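As a rough illustration of the segment-level integration described above (not the authors' code), the sketch below stacks the temporal mean, standard deviation, skewness, and kurtosis of the frame-level feature vectors of one roughly 6 s segment, yielding the [TME, TST, TSK, TKU] vector whose dimension is four times the frame-level dimension; the function name and the eps guard are assumptions.

```python
import numpy as np

def integrate_segment(frame_features):
    """Temporal integration [TME, TST, TSK, TKU] of frame-level features.

    frame_features : array of shape (n_frames, n_dims), one row per short-time frame.
    Returns a segment-level vector of length 4 * n_dims.
    """
    x = np.asarray(frame_features, dtype=float)
    eps = 1e-12                              # guard against constant features
    mu = x.mean(axis=0)                      # TME: temporal mean
    sigma = x.std(axis=0)                    # TST: temporal standard deviation
    centered = x - mu
    skew = (centered ** 3).mean(axis=0) / (sigma ** 3 + eps)         # TSK
    kurt = (centered ** 4).mean(axis=0) / (sigma ** 4 + eps) - 3.0   # TKU
    return np.concatenate([mu, sigma, skew, kurt])
```

The resulting segment-level vectors could then be fed to a linear SVM (for instance scikit-learn's LinearSVC), with the clip-level genre obtained by majority voting over segments as sketched in Section 1.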
The classification results for the ISMIR2004 and GTZAN datasets are given in Tables 1 and 2, respectively. The short-time spectral features OMN, ODF, OME, OST, OSK, and OKU in the tables denote the octave-scale minimum, difference, mean, standard deviation, skewness, and kurtosis, respectively. We note that the valley of the subband in Eq. (2) is denoted by OMN, and the difference (the spectral contrast in [1]) between the peak and the valley is denoted by ODF. The temporal integration methods TME, TST, TSK, and TKU in the tables denote the temporal mean, standard deviation, skewness, and kurtosis. When only one type of spectral feature was used, the location features (mean and minimum), OME and OMN, showed the highest classification accuracy. The performance of the combined feature [OMN, ODF] (which is the OSC proposed in [1]) was similar to that of [OME, OST] or [OME, OSK]. Adding the higher-order moment feature OSK or OKU on top of the lower-order moments improved the classification accuracy by 2% on average (up to 4%). The OSK was more effective than the OKU. In fact, the performance of the combination [OME, OST, OSK, OKU] was similar to that of [OME, OST, OSK], which means that OKU is largely redundant in the presence of the lower-order moments. In the temporal integration, the effectiveness of the higher-order moments was more prominent. Adding TSK or TKU to the standard approach [TME, TST] improved the classification accuracy by more than 2% (up to 5%). In the temporal integration, the skewness (TSK) was also more effective than the kurtosis (TKU). Feature combinations with the same dimensionality resulted in similar classification accuracy, especially for the GTZAN dataset. In general, raising the feature dimensionality by adding non-redundant features improves the classification accuracy while increasing the computational complexity, which is also noticeable in Tables 1 and 2.
Table 1. Classification accuracy (%) of each feature combination and temporal integration method for the ISMIR2004 dataset. The number in parentheses is the dimension of the resulting feature vector.

Spectral feature      TME,TST      TME,TST,TSK   TME,TST,TKU   TME,TST,TSK,TKU
OMN                   75.72 (12)   78.19 (18)    75.72 (18)    78.74 (24)
ODF                   61.04 (12)   64.06 (18)    62.96 (18)    65.71 (24)
OME                   73.94 (12)   77.50 (18)    74.90 (18)    79.29 (24)
OST                   62.28 (12)   64.75 (18)    65.29 (18)    66.94 (24)
OSK                   66.53 (12)   67.22 (18)    67.35 (18)    67.63 (24)
OKU                   63.92 (12)   64.75 (18)    64.61 (18)    65.57 (24)
OMN,ODF               79.56 (24)   82.85 (36)    80.80 (36)    82.58 (48)
OME,OST               77.64 (24)   80.66 (36)    78.88 (36)    81.34 (48)
OME,OSK               79.84 (24)   82.72 (36)    80.25 (36)    82.03 (48)
OME,OKU               77.91 (24)   79.70 (36)    78.60 (36)    80.38 (48)
OMN,ODF,OSK           80.80 (36)   82.85 (54)    81.62 (54)    83.13 (72)
OME,OST,OSK           80.52 (36)   82.85 (54)    79.97 (54)    82.58 (72)
OME,OST,OKU           79.56 (36)   82.99 (54)    80.52 (54)    82.17 (72)
OMN,ODF,OSK,OKU       82.30 (48)   84.64 (72)    83.13 (72)    83.68 (96)
OME,OST,OSK,OKU       82.44 (48)   84.36 (72)    82.58 (72)    84.09 (96)
MFCC (12-order)       75.72 (24)   79.56 (36)    76.54 (36)    80.11 (48)
Table 2. Classification accuracy (%) of each feature combination and temporal integration method for the GTZAN dataset. The number in parentheses is the dimension of the resulting feature vector.

Spectral feature      TME,TST     TME,TST,TSK   TME,TST,TKU   TME,TST,TSK,TKU
OMN                   61.5 (12)   66.8 (18)     64.3 (18)     68.5 (24)
ODF                   57.2 (12)   60.3 (18)     59.2 (18)     62.9 (24)
OME                   64.2 (12)   66.2 (18)     65.9 (18)     68.4 (24)
OST                   55.4 (12)   57.7 (18)     57.6 (18)     60.1 (24)
OSK                   57.5 (12)   58.1 (18)     58.3 (18)     58.7 (24)
OKU                   50.5 (12)   52.4 (18)     51.4 (18)     53.3 (24)
OMN,ODF               73.3 (24)   77.6 (36)     75.0 (36)     76.7 (48)
OME,OST               73.5 (24)   78.1 (36)     75.9 (36)     79.9 (48)
OME,OSK               73.8 (24)   76.3 (36)     75.7 (36)     77.8 (48)
OME,OKU               70.4 (24)   75.4 (36)     72.7 (36)     75.1 (48)
OMN,ODF,OSK           78.3 (36)   80.1 (54)     78.6 (54)     81.6 (72)
OME,OST,OSK           77.6 (36)   82.7 (54)     79.2 (54)     83.1 (72)
OME,OST,OKU           75.3 (36)   80.0 (54)     78.5 (54)     81.9 (72)
OMN,ODF,OSK,OKU       79.8 (48)   83.1 (72)     80.8 (72)     84.9 (96)
OME,OST,OSK,OKU       80.3 (48)   83.7 (72)     80.2 (72)     84.4 (96)
MFCC (12-order)       73.7 (24)   76.1 (36)     75.3 (36)     78.2 (48)
The performance of the proposed method was compared with that of previous approaches [9,10] on the same music datasets. To account for differences in the dimension of the feature vectors and the type of classifiers, the best classification accuracy reported in the previous works [9,10] is presented in Fig. 2. The proposed simple distributional features showed better classification accuracy on both datasets, although the compared previous works are based on more complicated features, such as higher-order tensors [9] and non-negative matrix factorization [10]. By adding the higher-order moment features in both the spectral and the temporal directions, the classification accuracy of the proposed method exceeded 84% on both datasets, which is among the best results reported so far [2,5,9,10]. For a comparison between the octave- and the mel-scale subbands, the 12-order MFCC was also included in the test, and the results are listed in Tables 1 and 2. The results in the tables show that the relatively more sophisticated mel-scale subbands, which are known to be effective for speech recognition, might not be the best choice for the musical genre classification task. The simpler distributional features of the octave-scale subbands showed similar or better classification accuracy than the MFCC.

Fig. 2. Classification accuracies (%) of the previous works and the proposed distributional features on the ISMIR2004 and GTZAN datasets. For the previous works, the best classification accuracy reported in [9,10] is presented. For the proposed features, the temporal integration method [TME, TST, TSK] was used.

4. Conclusion

For musical genre classification, the performance of the higher-order moments was tested. The higher-order moments were utilized both in extracting the low-level spectral features and in integrating them into segment-level features. Experimental results show that the higher-order distributional moments, especially in the temporal integration, are effective in improving classification accuracy when combined with the conventional first- or second-order moments. Among the considered higher-order moments, the combination with the skewness improved the classification accuracy more than that with the kurtosis.
Acknowledgments

This work was supported by the IT R&D program of MSCT/KEIT [10035229, Development of content protection and distribution technology on DRM-free environment].

References

[1] D. Jiang, L. Lu, H. Zhang, J. Tao, L. Cai, Music type classification by spectral contrast feature, in: Proceedings of ICME-2002, pp. 113–116.
[2] E. Pampalk, A. Flexer, G. Widmer, Improvements of audio-based music similarity and genre classification, in: Proceedings of ISMIR-2005, pp. 628–633.
[3] A. Meng, P. Ahrendt, J. Larsen, L. Hansen, Temporal feature integration for music genre classification, IEEE Transactions on Audio, Speech, and Language Processing 15 (2007).
[4] T. Lidy, C. Silla, O. Cornelis, F. Gouyon, A. Rauber, C. Kaestner, A. Koerich, On the suitability of state-of-the-art music information retrieval methods for analyzing, categorizing and accessing non-western and ethnic music collections, Signal Processing 90 (2010) 1032–1048.
[5] G. Tzanetakis, P. Cook, Musical genre classification of audio signals, IEEE Transactions on Speech and Audio Processing 10 (2002) 293–302.
[6] F. Gianfelici, C. Turchetti, P. Crippa, A non-probabilistic recognizer of stochastic signals based on KLT, Signal Processing 89 (2009) 422–437.
[7] C. Lee, M.-G. Jang, Fast training of structured SVM using fixed-threshold sequential minimal optimization, ETRI Journal 31 (2009) 121–128.
[8] R. Groeneveld, G. Meeden, Measuring skewness and kurtosis, The Statistician 33 (1984) 391–399.
[9] I. Panagakis, E. Benetos, C. Kotropoulos, Music genre classification: a multilinear approach, in: Proceedings of ISMIR-2008, pp. 583–588.
[10] A. Holzapfel, Y. Stylianou, Musical genre classification using non-negative matrix factorization-based features, IEEE Transactions on Speech and Audio Processing 16 (2007).