Signal Processing 91 (2011) 2154–2157
Fast communication
Higher-order moments for musical genre classification
Jin S. Seo a, Seungjae Lee b
a Department of EE, Gangneung-Wonju National University, Republic of Korea
b Content Research Division, Electronics and Telecommunications Research Institute, Republic of Korea
E-mail addresses: [email protected] (J.S. Seo, corresponding author), [email protected] (S. Lee)
Article history: Received 15 October 2010; Received in revised form 24 March 2011; Accepted 24 March 2011; Available online 30 March 2011

Keywords: Musical genre classification; Moment; Skewness; Kurtosis; Spectral contrast

Abstract

This paper presents a study on the performance of higher-order moments for musical genre classification. In particular, the skewness and the kurtosis of the octave-scale subbands are considered. Experimental results on two widely used music datasets show that the higher-order moment features can improve classification accuracy when combined with the conventional lower-order moment features.

© 2011 Elsevier B.V. All rights reserved.
1. Introduction

With the development of electronics and information technology, there has been an explosion in the use of digital music through electronic commerce, on-line services, and portable devices. Musical genre is one of the most popular types of metadata for describing music content in music information retrieval. This paper focuses on automatic musical genre classification, which is needed to avoid time-consuming and tedious manual annotation. An overview of musical genre classification is shown in Fig. 1. For a successful musical genre classifier, extracting features that allow direct access to the relevant genre-specific information is crucial.
Most musical genre classification systems utilize low-level spectral features of the short-time audio signal (typically between 20 and 100 ms), such as mel-frequency cepstral coefficients (MFCC), octave-based spectral contrast (OSC), and other spectrum descriptors, which are related to the timbral texture of an audio signal [1–4]. Usually the short-time spectral features of a single frame are not sufficient for genre classification. Thus the short-time low-level spectral features are integrated, over a longer duration (usually several seconds), into a segment-level feature. Although various integration methods have been applied [3], taking the mean and the standard deviation of the short-time features is the most popular. The segment-level features are used for training and testing statistical classifiers [5–7], such as support vector machines (SVMs), nearest-neighbor (NN) classifiers, or Gaussian mixture models (GMMs), for genre classification. Classification is performed for every segment, and the majority voting rule is typically used to combine the classification results of the segments.

This paper studies the performance of the higher-order moments, such as the skewness and the kurtosis, for musical genre classification. Higher-order moments carry supplementary statistical information about the signal beyond the conventional first- or second-order moments. For example, skewness is a measure of the asymmetry of a distribution, which may represent the relative disposition of the tonal and non-tonal components of the audio spectrum. In this paper, the higher-order moments are utilized both in extracting the low-level spectral features and in integrating them into a segment-level feature. Experimental results show that the higher-order moments are conducive to improving the genre classification accuracy.
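As a small illustration of the majority-voting step of the pipeline described above, the Python sketch below combines per-segment classifier decisions into one clip-level genre; the function name and the example labels are hypothetical and not taken from the paper.

```python
from collections import Counter

def clip_genre_by_majority_vote(segment_labels):
    """Combine per-segment classifier decisions into a single clip-level genre.

    segment_labels: list of genre labels, one per classified segment.
    Ties are broken in favor of the label encountered first.
    """
    return Counter(segment_labels).most_common(1)[0][0]

# Hypothetical example: five segment-level decisions for one music clip.
print(clip_genre_by_majority_vote(["rock", "rock", "metal", "rock", "pop"]))  # prints "rock"
```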
Fig. 1. Overview of the feature-based musical genre classification.

2. Proposed musical genre classification method based on the higher-order moments

The short-time spectral features in the proposed method are based on the distributional characteristics of the octave-scale subbands. According to the results in [1], the octave-scale subbands are suitable for distinguishing the genre of an audio signal. As in [1], we use six octave-scale subbands: 0–200, 200–400, 400–800, 800–1600, 1600–3200, and 3200–6400 Hz. We note that the octave-scale bandwidths are wider than the widely used mel-scale bandwidths. The previous work in [1] used the spectral minimum and the contrast (the difference between the maximum and the minimum) of the octave-scale subbands. Let X_b[k] be the short-time power spectrum of an audio signal at frequency bin k of the b-th subband. The spectrum X_b[k] is sorted in descending order, and the sorted spectrum is denoted by X'_b[k]. Then the spectral peak and valley of the b-th subband are given by

\mathrm{Peak}_b = \log\left(\frac{1}{\alpha N_b}\sum_{i=1}^{\alpha N_b} X'_b[i]\right)   (1)

\mathrm{Valley}_b = \log\left(\frac{1}{\alpha N_b}\sum_{i=1}^{\alpha N_b} X'_b[N_b-i+1]\right)   (2)

where N_b is the number of frequency bins in the b-th subband, and \alpha is the neighborhood factor (typically set between 0.02 and 0.2). The difference between \mathrm{Peak}_b and \mathrm{Valley}_b is called the spectral contrast [1].

The spectral minimum and the contrast alone do not completely describe the distributional characteristics of the octave-scale subbands. In an attempt to boost the classification accuracy further, we extract the skewness and the kurtosis of each subband. The skewness of the b-th subband is the third-order standardized moment defined as

\mathrm{OSK}_b = \frac{E[(X_b-\mu_b)^3]}{\sigma_b^3} = \frac{\frac{1}{N_b}\sum_{k=1}^{N_b}(X_b[k]-\mu_b)^3}{\left(\frac{1}{N_b}\sum_{k=1}^{N_b}(X_b[k]-\mu_b)^2\right)^{3/2}}   (3)

where \mu_b and \sigma_b are the mean and the standard deviation of X_b given by

\mu_b = E[X_b] = \frac{1}{N_b}\sum_{k=1}^{N_b} X_b[k]   (4)

\sigma_b^2 = E[(X_b-\mu_b)^2] = \frac{1}{N_b}\sum_{k=1}^{N_b}(X_b[k]-\mu_b)^2   (5)

The kurtosis of the b-th subband is the fourth-order standardized moment defined as

\mathrm{OKU}_b = \frac{E[(X_b-\mu_b)^4]}{\sigma_b^4} - 3 = \frac{\frac{1}{N_b}\sum_{k=1}^{N_b}(X_b[k]-\mu_b)^4}{\left(\frac{1}{N_b}\sum_{k=1}^{N_b}(X_b[k]-\mu_b)^2\right)^{2}} - 3   (6)

The skewness is a measure of the asymmetry of the distribution, which can represent the relative disposition of the tonal and non-tonal components of a subband. For example, if tonal components occur frequently in a subband, the distribution of its spectrum will be left-skewed (the mass of the distribution is on the right) [8]; in the opposite case, it will be skewed to the right. The kurtosis is a measure of the peakedness of the distribution. Although the contribution of the kurtosis to musical genre classification might not be as clear, the kurtosis can depict the effective dynamic range of the spectrum in a subband.
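As an illustrative sketch (not the authors' implementation), the Python function below computes the per-subband statistics of Eqs. (1)–(6) from one short-time power spectrum. The function name, the assumption that the spectrum spans 0 Hz to the Nyquist frequency on a linear grid, the default neighborhood factor value, and the small eps guard are assumptions added here.

```python
import numpy as np

# Six octave-scale subband edges in Hz, as listed above.
SUBBAND_EDGES_HZ = [0, 200, 400, 800, 1600, 3200, 6400]

def octave_subband_features(power_spectrum, sample_rate=22050, alpha=0.1):
    """Peak, valley, mean, std, skewness and excess kurtosis per octave subband.

    power_spectrum : power spectrum of one frame, assumed to cover 0 Hz to Nyquist
                     on a linear frequency grid.
    alpha          : neighborhood factor of Eqs. (1)-(2), typically 0.02-0.2.
    Returns a dict of 6-element arrays (one value per subband).
    """
    n_bins = len(power_spectrum)
    freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    out = {k: [] for k in ("peak", "valley", "mean", "std", "skew", "kurt")}
    eps = 1e-12  # guard against log(0) and division by zero

    for lo, hi in zip(SUBBAND_EDGES_HZ[:-1], SUBBAND_EDGES_HZ[1:]):
        x = power_spectrum[(freqs >= lo) & (freqs < hi)]
        n_b = len(x)
        x_sorted = np.sort(x)[::-1]               # descending order, X'_b[k]
        m = max(1, int(round(alpha * n_b)))       # about alpha * N_b neighboring bins
        out["peak"].append(np.log(np.mean(x_sorted[:m]) + eps))     # Eq. (1)
        out["valley"].append(np.log(np.mean(x_sorted[-m:]) + eps))  # Eq. (2)

        mu = np.mean(x)                            # Eq. (4)
        sigma = np.std(x)                          # square root of Eq. (5)
        out["mean"].append(mu)
        out["std"].append(sigma)
        centered = x - mu
        out["skew"].append(np.mean(centered ** 3) / (sigma ** 3 + eps))        # Eq. (3)
        out["kurt"].append(np.mean(centered ** 4) / (sigma ** 4 + eps) - 3.0)  # Eq. (6)

    return {k: np.asarray(v) for k, v in out.items()}
```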
Over the extracted short-time spectral features, temporal integration is performed to obtain segment-level features, which are the input to the statistical classifiers. Most previous works use the mean and the standard deviation of the short-time features. In this work, we extend them with the higher-order moments, namely the skewness and the kurtosis. Thus, in this paper, the higher-order moments are considered in both the spectral and the temporal directions.

3. Experimental results

The genre-classification accuracy of the higher-order moments was evaluated on two widely used music datasets. The first music dataset (abbreviated as ISMIR2004) is the one from the ISMIR 2004 genre classification contest, in which there are 1458 songs over six different genres: classical, electronic, jazz_blues, metal_punk, rock_pop, and world. The second music dataset (abbreviated as GTZAN) is the one used by George Tzanetakis in his work [5]. It consists of 1000 songs over 10 different genres: blues, classical, country, disco, hiphop, jazz, metal, pop, reggae, and rock. For the ISMIR2004 dataset, one half of the songs was used for training and the other half for testing. For the GTZAN dataset, 10-fold cross-validation was performed to obtain the classification accuracy. Each song in the dataset was converted to mono at a sampling frequency of 22 050 Hz and then divided into frames of 46.4 ms overlapped by 23.2 ms. The octave-scale statistical features were computed for each frame and temporally integrated over 6 s segments. Then a linear SVM classifier was trained and tested on the segment-level features. The genre of each music clip was determined by majority voting over the classification results of the segments in the clip.
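As a rough illustration of the segment-level integration described above (not the authors' code), the sketch below stacks the temporal mean, standard deviation, skewness, and kurtosis of the frame-level feature vectors of one roughly 6 s segment, yielding the [TME, TST, TSK, TKU] vector whose dimension is four times the frame-level dimension; the function name and the eps guard are assumptions.

```python
import numpy as np

def integrate_segment(frame_features):
    """Temporal integration [TME, TST, TSK, TKU] of frame-level features.

    frame_features : array of shape (n_frames, n_dims), one row per short-time frame.
    Returns a segment-level vector of length 4 * n_dims.
    """
    x = np.asarray(frame_features, dtype=float)
    eps = 1e-12                              # guard against constant features
    mu = x.mean(axis=0)                      # TME: temporal mean
    sigma = x.std(axis=0)                    # TST: temporal standard deviation
    centered = x - mu
    skew = (centered ** 3).mean(axis=0) / (sigma ** 3 + eps)         # TSK
    kurt = (centered ** 4).mean(axis=0) / (sigma ** 4 + eps) - 3.0   # TKU
    return np.concatenate([mu, sigma, skew, kurt])
```

The resulting segment-level vectors could then be fed to a linear SVM (for instance scikit-learn's LinearSVC), with the clip-level genre obtained by majority voting over segments as sketched in Section 1.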
The classification results for the ISMIR2004 and GTZAN datasets are given in Tables 1 and 2, respectively. The short-time spectral features OMN, ODF, OME, OST, OSK, and OKU in the tables denote the octave-scale minimum, difference, mean, standard deviation, skewness, and kurtosis, respectively. We note that the valley of the subband in Eq. (2) is denoted by OMN, and the difference (the spectral contrast in [1]) between the peak and the valley is denoted by ODF. The temporal integration methods TME, TST, TSK, and TKU in the tables denote the temporal mean, standard deviation, skewness, and kurtosis. When only one type of spectral feature was used, the location features (mean and minimum), OME and OMN, showed the highest classification accuracy. The performance of the combined feature [OMN, ODF] (which is the OSC proposed in [1]) was similar to that of [OME, OST] or [OME, OSK]. Adding the higher-order moment feature OSK or OKU on top of the lower-order moments improved the classification accuracy by 2% on average (up to 4%). The OSK was more effective than the OKU. In fact, the performance of the combination [OME, OST, OSK, OKU] was similar to that of [OME, OST, OSK], which means that OKU is largely redundant in the presence of the lower-order moments. In the temporal integration, the effectiveness of the higher-order moments was more prominent. Adding TSK or TKU to the standard approach [TME, TST] improved the classification accuracy by more than 2% (up to 5%). In the temporal integration, the skewness (TSK) was also more effective than the kurtosis (TKU). Feature combinations with the same dimensionality resulted in similar classification accuracy, especially for the GTZAN dataset. In general, raising the feature dimensionality by adding non-redundant features improves the classification accuracy while increasing the computational complexity, which is also noticeable in Tables 1 and 2.
Table 1. Classification accuracy (%) of each feature combination and temporal integration method for the ISMIR2004 dataset. The number in parentheses is the dimension of the resulting feature vector.

Spectral feature      TME,TST      TME,TST,TSK   TME,TST,TKU   TME,TST,TSK,TKU
OMN                   75.72 (12)   78.19 (18)    75.72 (18)    78.74 (24)
ODF                   61.04 (12)   64.06 (18)    62.96 (18)    65.71 (24)
OME                   73.94 (12)   77.50 (18)    74.90 (18)    79.29 (24)
OST                   62.28 (12)   64.75 (18)    65.29 (18)    66.94 (24)
OSK                   66.53 (12)   67.22 (18)    67.35 (18)    67.63 (24)
OKU                   63.92 (12)   64.75 (18)    64.61 (18)    65.57 (24)
OMN,ODF               79.56 (24)   82.85 (36)    80.80 (36)    82.58 (48)
OME,OST               77.64 (24)   80.66 (36)    78.88 (36)    81.34 (48)
OME,OSK               79.84 (24)   82.72 (36)    80.25 (36)    82.03 (48)
OME,OKU               77.91 (24)   79.70 (36)    78.60 (36)    80.38 (48)
OMN,ODF,OSK           80.80 (36)   82.85 (54)    81.62 (54)    83.13 (72)
OME,OST,OSK           80.52 (36)   82.85 (54)    79.97 (54)    82.58 (72)
OME,OST,OKU           79.56 (36)   82.99 (54)    80.52 (54)    82.17 (72)
OMN,ODF,OSK,OKU       82.30 (48)   84.64 (72)    83.13 (72)    83.68 (96)
OME,OST,OSK,OKU       82.44 (48)   84.36 (72)    82.58 (72)    84.09 (96)
MFCC (12-order)       75.72 (24)   79.56 (36)    76.54 (36)    80.11 (48)
Table 2. Classification accuracy (%) of each feature combination and temporal integration method for the GTZAN dataset. The number in parentheses is the dimension of the resulting feature vector.

Spectral feature      TME,TST     TME,TST,TSK   TME,TST,TKU   TME,TST,TSK,TKU
OMN                   61.5 (12)   66.8 (18)     64.3 (18)     68.5 (24)
ODF                   57.2 (12)   60.3 (18)     59.2 (18)     62.9 (24)
OME                   64.2 (12)   66.2 (18)     65.9 (18)     68.4 (24)
OST                   55.4 (12)   57.7 (18)     57.6 (18)     60.1 (24)
OSK                   57.5 (12)   58.1 (18)     58.3 (18)     58.7 (24)
OKU                   50.5 (12)   52.4 (18)     51.4 (18)     53.3 (24)
OMN,ODF               73.3 (24)   77.6 (36)     75.0 (36)     76.7 (48)
OME,OST               73.5 (24)   78.1 (36)     75.9 (36)     79.9 (48)
OME,OSK               73.8 (24)   76.3 (36)     75.7 (36)     77.8 (48)
OME,OKU               70.4 (24)   75.4 (36)     72.7 (36)     75.1 (48)
OMN,ODF,OSK           78.3 (36)   80.1 (54)     78.6 (54)     81.6 (72)
OME,OST,OSK           77.6 (36)   82.7 (54)     79.2 (54)     83.1 (72)
OME,OST,OKU           75.3 (36)   80.0 (54)     78.5 (54)     81.9 (72)
OMN,ODF,OSK,OKU       79.8 (48)   83.1 (72)     80.8 (72)     84.9 (96)
OME,OST,OSK,OKU       80.3 (48)   83.7 (72)     80.2 (72)     84.4 (96)
MFCC (12-order)       73.7 (24)   76.1 (36)     75.3 (36)     78.2 (48)
The performance of the proposed method was compared with that of previous approaches [9,10] on the same music datasets. To account for differences in the dimension of the feature vectors and the type of classifiers, the best classification accuracy reported in the previous works [9,10] is presented in Fig. 2. The proposed simple distributional features showed better classification accuracy on both datasets, although the compared previous works are based on more complicated features, such as higher-order tensors [9] and non-negative matrix factorization [10]. By adding the higher-order moment features in both the spectral and the temporal directions, the classification accuracy of the proposed method exceeded 84% on both datasets, which is among the best results reported so far [2,5,9,10]. For a comparison between the octave- and the mel-scale subbands, the 12-order MFCC was also included in the test, and the results are listed in Tables 1 and 2. The results in the tables show that the relatively more sophisticated mel-scale subbands, which are known to be effective for speech recognition, might not be the best choice for the musical genre classification task. The simpler distributional features of the octave-scale subbands showed similar or better classification accuracy than the MFCC.

Fig. 2. Classification accuracies (%) of the previous works and the proposed distributional features on the ISMIR2004 and GTZAN datasets. For the previous works, the best classification accuracy reported in [9,10] is presented. For the proposed features, the temporal integration method [TME, TST, TSK] was used.

4. Conclusion

For musical genre classification, the performance of the higher-order moments was tested. The higher-order moments were utilized both in extracting the low-level spectral features and in integrating them into segment-level features. Experimental results show that the higher-order distributional moments, especially in the temporal integration, are effective in improving classification accuracy when combined with the conventional first- or second-order moments. Among the considered higher-order moments, the combination with the skewness improved the classification accuracy more than that with the kurtosis.
Acknowledgments

This work was supported by the IT R&D program of MSCT/KEIT [10035229, Development of content protection and distribution technology on DRM-free environment].

References

[1] D. Jiang, L. Lu, H. Zhang, J. Tao, L. Cai, Music type classification by spectral contrast feature, in: Proceedings of ICME-2002, pp. 113–116.
[2] E. Pampalk, A. Flexer, G. Widmer, Improvements of audio-based music similarity and genre classification, in: Proceedings of ISMIR-2005, pp. 628–633.
[3] A. Meng, P. Ahrendt, J. Larsen, L. Hansen, Temporal feature integration for music genre classification, IEEE Transactions on Audio, Speech, and Language Processing 15 (2007).
[4] T. Lidy, C. Silla, O. Cornelis, F. Gouyon, A. Rauber, C. Kaestner, A. Koerich, On the suitability of state-of-the-art music information retrieval methods for analyzing, categorizing and accessing non-western and ethnic music collections, Signal Processing 90 (2010) 1032–1048.
[5] G. Tzanetakis, P. Cook, Musical genre classification of audio signals, IEEE Transactions on Speech and Audio Processing 10 (2002) 293–302.
[6] F. Gianfelici, C. Turchetti, P. Crippa, A non-probabilistic recognizer of stochastic signals based on KLT, Signal Processing 89 (2009) 422–437.
[7] C. Lee, M.-G. Jang, Fast training of structured SVM using fixed-threshold sequential minimal optimization, ETRI Journal 31 (2009) 121–128.
[8] R. Groeneveld, G. Meeden, Measuring skewness and kurtosis, The Statistician 33 (1984) 391–399.
[9] I. Panagakis, E. Benetos, C. Kotropoulos, Music genre classification: a multilinear approach, in: Proceedings of ISMIR-2008, pp. 583–588.
[10] A. Holzapfel, Y. Stylianou, Musical genre classification using non-negative matrix factorization-based features, IEEE Transactions on Speech and Audio Processing 16 (2007).