CHAPTER 26
Music Mining
George Tzanetakis
Department of Computer Science, University of Victoria, Victoria, Canada
1.26.1 Introduction

During the first ten years of the 21st century we have witnessed a dramatic shift in how music is produced, distributed, and consumed. Several factors, including advances in digital signal processing, faster computers, and steadily increasing digital storage capacity and network bandwidth, have made digital music distribution a reality. It is now possible to purchase and listen to music, as well as find all sorts of associated information about it, from any computer or smart phone. Currently, portable music players and phones can store thousands of music tracks, and millions of tracks are accessible through streaming over the Internet in digital music stores and personalized radio stations. Enabling anyone with access to a computer and the Internet to listen to essentially most of the recorded music in human history is a remarkable technological achievement that would probably have been considered impossible even twenty years ago.

The research area of Music Information Retrieval (MIR) gradually emerged during this time period in order to address the challenge of effectively accessing and interacting with these vast digital collections of music and associated information such as meta-data, reviews, blogs, rankings, and usage/download patterns. As a research area, data mining emerged from the interaction between the database community, which needed to address the challenge of extracting useful, not explicitly represented information from large collections of data, and the machine learning community, which explored algorithms that can improve their performance over time as they are exposed to more data. Music mining refers to the application of ideas from the field of data mining to the extraction of useful information from large collections of music and is the topic of this chapter. It can be viewed as a subset of music information retrieval research.

Music is pervasive and plays an important role in the daily lives of most people today. It is an extremely complex human creation that somehow strongly affects our intellect and emotions. In order to create effective tools for interacting with large music collections we need to design algorithms that can extract high-level information from music signals and use that information in a variety of mining tasks. Techniques from audio signal processing have been used to extract various low-level and mid-level audio features that can subsequently be used to represent music tracks in various music mining tasks. In addition, it is possible to apply traditional data mining techniques to other sources of music-related information such as reviews, lyrics, blogs, rankings, and usage patterns.

Music has several important characteristics and challenges that make it a particularly interesting research area for data mining. Similarly to image search, it requires sophisticated content analysis
algorithms to extract useful information from the raw signal. At the same time it has very rich structured context information associated with it. For example, a particular song, especially if it is popular, might be mentioned in thousands of web pages and blogs. It also has lyrics, which can be analyzed, and by virtue of being performed by a particular artist it has many associations to other pieces of music. This rich tapestry of relevant information provides many opportunities for data mining, but at the same time its heterogeneity is a challenge for many algorithms that require more homogeneous and structured data. The sheer amount of data required both for storing the audio and for storing calculated audio features poses significant system scalability challenges and requires efficient large scale algorithms.

The focus of this chapter is music mining in large music collections; it does not cover tasks in MIR that deal with individual music tracks, such as transcription, audio-score alignment, and structural analysis, among others. The choice of topics, as well as the amount of detail in which they are described, was to a large extent determined by the corresponding volume of published work on each music mining topic.

The remainder of this chapter is organized as follows. The first section provides an overview of audio feature extraction, which forms the foundation of many music mining algorithms. The following sections describe various music mining tasks such as classification, clustering, tag annotation, advanced data mining, and visualization. The goal is to provide an overview of the music mining problems that researchers have explored and to describe in some detail basic system configurations to solve these tasks. The particular techniques chosen are representative examples that are straightforward to explain rather than a comprehensive list of all possibilities. We also discuss open problems and future trends in music mining. The final section provides pointers to further reading about these topics.
1.26.2 Ground truth acquisition and evaluation

In the data mining literature, frequently the primary concern is the algorithm(s) used for extracting information from the data. This data is assumed to be readily available and in the format required for the particular task. However, in many specific application areas, such as music mining, the data acquisition process is critical and not trivial. In mining algorithms there is typically a clear distinction between data that is somehow automatically extracted and the desired information that the mining algorithm "extracts." In order to assess how well a particular mining algorithm performs, we need to know, in some way, what the "correct" answer should be. There are several general strategies that have been used to acquire this "ground truth" information. In most cases human users are involved in the process. With access to this ground truth information for a particular dataset, different music mining algorithms and systems can be evaluated, compared, and contrasted.

In this section, we discuss the process of ground truth acquisition and evaluation in a generic way that is applicable to most music mining tasks. Many music mining tasks can be viewed as ways of associating music and text. Free-form text (frequently called tags) is the most general type of annotation and subsumes specific annotations such as genre, style, and emotion/mood. In the following sections issues more specific to each task are examined in more detail.

In terms of ground truth acquisition the simplest case is when the desired annotations are readily available from some external authority. For example, online music stores provide genre labels for most of their music that can be directly used to train genre classification algorithms. For more specialized types of information expert annotation can be used. For example, the personalized radio company Pandora utilizes music experts to annotate pieces of music with 400 attributes. Expert data is reliable and of high
quality, but it is costly to acquire and therefore harder to scale to large music collections. An alternative approach is to utilize average users/listeners to perform the annotation. The time-honored approach to acquiring ground truth, especially in academic work, is the survey, typically of undergraduate students. Surveys can be carefully designed and therefore the data obtained tends to be reliable. However, they are time consuming and costly and therefore limited in terms of scalability. More recently there has been a surge in the use of "social" tags, which are simply words entered by users to characterize their photos or music. Last.fm is a music discovery Web site that relies on such social tags. By the beginning of 2007, Last.fm's large base of 40 million monthly users had built an unstructured vocabulary of 960,000 free-text tags and used it to annotate millions of songs. By harvesting the collective effort of millions of users, social tagging can scale to large collections. However, it comes with its own set of issues. As there is no restriction in the vocabulary used, there is a lot of inconsistency and noise in the data. Another important issue is the sparsity (or lack) of tags for new artists/tracks, which has been termed the cold-start problem. This is a specific case of the more general problem of popularity bias, in which popular multimedia items tend to be recommended to most users simply because there is a lot of information about them.

An interesting alternative that combines some of the control of surveys with the scalability and large number of users of social tags is the concept of annotation games. These are games in which a large group of users are presented with a song and a list of tags. The goal of the game is to "guess" the tags that other users apply to that song. When a large group of users agree on a tag then the song has a strong association with it. Such annotation games belong to the larger category of games with a purpose, in which players play the game because it is engaging and entertaining, while at the same time in the process of playing they provide valuable information. The most famous such game is the ESP game, which has been used for image annotation. Even though similar annotation games have been proposed for music they have not yet received large scale usage.

Assuming that ground truth for a particular task is available, it is important to be able to evaluate different algorithms that attempt to solve it. Music mining is an emerging research area with a history of about ten years. Typically, early work in a particular music mining task involves assembling a collection of music and associated ground truth for the task. In many cases only the original researchers have access to the data and the work is hard to replicate. Over time some of these datasets have been shared, helping make more meaningful comparisons between different algorithms and systems. A big challenge in music mining is the difficulty of sharing data given the multiple copyrights associated with music information. The Music Information Retrieval Evaluation eXchange (MIREX) is an annual evaluation campaign for music information retrieval (MIR) algorithms that is coupled to the International Conference of the Society for Music Information Retrieval (ISMIR). It is organized by the graduate school of library and information sciences at the University of Illinois at Urbana-Champaign.
Participating groups submit their systems, which follow specific input/output conventions, and they are evaluated on data that for some tasks is accessible only by the MIREX organizers and not the participants. In addition to dealing this way with copyright issues, it also helps avoid overfitting, which is an important problem for data mining algorithms in general. Overfitting refers to the situation where the performance of a mining algorithm on training data is misleadingly higher than its performance on data it has not encountered before. If the full datasets used for evaluation are available, it is possible to over-optimize learning algorithms to fit their specific characteristics at the expense of generalization performance, i.e., the performance of the algorithm on data that it has not encountered during training. In the remainder of this chapter representative results from MIREX will be provided for the music mining tasks that are described, when they are available.
1.26.3 Audio feature extraction

Audio feature extraction forms the basis of many music mining tasks. Even though a full exposition is beyond the scope of this chapter, a short overview is provided as it is important in understanding what type of information is represented and how it is extracted. The goal of audio feature extraction is to calculate a succinct representation that summarizes musical information about the underlying audio signal. The representation should capture, in a statistical sense, the different types of musical information that humans are aware of when listening to music. The three basic facets of music information that have mostly been explored in the existing literature are timbre, rhythm, and harmony. Rather than providing formal definitions, which are still debated among musicologists, we describe these terms informally without going into details that would require some knowledge of music theory.

Timbre refers to the characteristics of the musical sound that are independent of the actual notes played and are related to the instruments playing and their sound. For example, the exact same music piece played by a rock band with electric guitars and drums will sound very different than the same music piece performed by a jazz big band. Rhythm refers to the periodic, repeating, hierarchical structure that underlies the music independently of what instruments are playing. For example, the famous beginning of the 5th Symphony by Beethoven will have the same rhythm independently of whether it is played by a symphonic orchestra or a cheap toy piano. Harmony refers to the simultaneous sounding of groups of discrete pitches/notes as well as how these groups evolve over time. For example, a dance remix of a tune by the Beatles will have the same harmony, or chord structure, as the original, while the rhythm and timbre will be completely different.

There are many variations in timbral feature extraction but most systems follow a common general template. The audio signal is broken into small slices (typically around 10-40 ms) and some form of frequency analysis, such as the Discrete Fourier Transform, is performed, followed by a summarization step in which a set of numbers (the feature vector) is calculated. This feature vector attempts to summarize/capture the content information of that short slice in time. After this stage the music track can then be represented as a sequence (trajectory) of feature vectors (points) in a high-dimensional feature space. That sequence can then be characterized using a more compact representation for subsequent classification.

Most audio features are extracted in three stages: (1) spectrum calculation, (2) frequency-domain summarization, and (3) time-domain summarization. In spectrum calculation, a short-time slice (typically around 10-40 ms) of waveform samples is transformed to a frequency domain representation. The most common such transformation is the Short Time Fourier Transform (STFT). During each short-time slice the signal is assumed to be approximately stationary and is windowed to reduce the effect of discontinuities at the start and end of the frame. This frequency domain transformation preserves all the information in the signal and therefore the resulting spectrum still has high dimensionality. For analysis purposes, it is necessary to find a more succinct description that has significantly lower dimensionality while still retaining the desired content information.
Frequency domain summarization converts the high dimensional spectrum (typically 512 or 1024 coefficients) to a much smaller set of features (typically 10-30). A common approach is to use various descriptors of the spectrum shape such as the Spectral Centroid and Bandwidth. Another widely used frequency domain summarization of the spectrum are the Mel-Frequency Cepstral Coefficients (MFCCs), a representation which originated from the speech and speaker recognition community. MFCCs summarize spectral information (the energy distribution of
different frequencies) by taking into account, to some extent, the characteristics of the human auditory system. Such features depend on the instrumentation of a piece, how the timbral "texture" changes over time, and how humans perceive this information.

The goal of time domain summarization is to characterize the musical signal at longer time scales than the short-time analysis slices. Typically this summarization is performed across so called "texture" windows of approximately 2-3 s, or it can also be performed over the entire piece of music. Figure 26.1 graphically shows feature extraction with frequency and time summarization. Several variations on time domain summarization have been proposed. A popular approach is to fit Gaussian densities with diagonal or full covariance and then use the resulting parameters as the feature vector. Another approach has been the use of auto-regressive models that capture the evolution of the feature vectors within the time window of interest.
FIGURE 26.1 Feature extraction and texture window.
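As a concrete illustration of these stages, the following minimal sketch (in Python with NumPy; the function names and constants are our own, not drawn from any standard toolkit) computes the spectral centroid of each short-time frame and then summarizes its trajectory with a moving mean and standard deviation over a texture window, in the spirit of what Figure 26.2 shows:

```python
import numpy as np

def spectral_centroid(frames, sr):
    """Spectral centroid of each windowed frame (frames: n_frames x frame_len)."""
    spectrum = np.abs(np.fft.rfft(frames * np.hanning(frames.shape[1]), axis=1))
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / sr)
    return (spectrum * freqs).sum(axis=1) / (spectrum.sum(axis=1) + 1e-10)

def texture_summarize(feature, texture_len):
    """Moving mean and standard deviation over a texture window of
    texture_len analysis frames (roughly 1-3 s of audio)."""
    means, stds = [], []
    for i in range(len(feature) - texture_len + 1):
        win = feature[i:i + texture_len]
        means.append(win.mean())
        stds.append(win.std())
    return np.array(means), np.array(stds)

# Example: 22050 Hz audio cut into ~23 ms frames with 50% overlap,
# summarized over a ~1 s texture window (86 frames at this hop size).
sr = 22050
signal = np.random.randn(sr * 30)          # stand-in for 30 s of audio
frame_len, hop = 512, 256
n_frames = 1 + (len(signal) - frame_len) // hop
frames = np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])
centroid = spectral_centroid(frames, sr)
mean_c, std_c = texture_summarize(centroid, texture_len=86)
```

The texture window length is a design choice: longer windows give smoother, more stable summaries at the cost of temporal resolution.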
FIGURE 26.2 The time evolution of audio features is important in characterizing musical content. The time evolution of the spectral centroid for two different 30-second excerpts of music is shown in (a). The result of applying a moving mean and standard deviation calculation over a texture window of approximately 1 s is shown in (b) and (c).
A more detailed view of summarization (or, as it is sometimes called, aggregation) is shown in Figure 26.2. It shows how the time evolution of audio features, in this case the spectral centroid, can be summarized by applying a moving mean and standard deviation.

Automatic music transcription is the process of converting an audio signal to a musical score (a symbolic representation containing only information about pitch and rhythm). It is a hard problem and existing techniques can only deal with simple "toy" examples. Instead, the most commonly used pitch-based representations are the pitch and pitch-class profiles (other alternative names used in the literature are pitch histograms, and chroma vectors for pitch class profiles). The pitch profile measures the occurrence of specific discrete musical pitches in a music segment, and the pitch class profile considers all octaves equivalent, essentially folding the pitch profile into 12 pitch classes. Due to space limitations we cannot go into details about how pitch profiles are calculated. Broadly speaking, they can either be computed by summing the energy of different frequency bins that correspond to particular pitches, or alternatively multiple pitch estimation can be performed and the results accumulated into a profile. Figure 26.3 shows a chromagram (i.e., a sequence of chroma vectors over time) corresponding to a continuous pitch glide over two octaves. The successive activation of the chromatic scale notes as well as the wrapping across octaves can be observed.

Automatically extracting information related to rhythm is also important. Rhythmic information is hierarchical in nature and involves multiple related periodicities. A typical representation is a beat histogram (sometimes called a beat spectrum) that provides a "salience" value for every possible periodicity. A typical approach applies onset detection to generate an onset strength signal that has high energy values at the time locations where there are significant changes in the audio spectrum, such as the start of new notes. The periodicities of the onset strength signal in the range of human rhythm (approximately 30-180 beats/min) are calculated using autocorrelation, as sketched in the example below. Another more recent approach is to automatically identify rhythmic patterns that are characteristic of a particular genre and characterize each piece as an occurrence histogram over a set of basic rhythmic patterns. Figure 26.4 shows two example beat histograms from 30 s clips of HipHop Jazz (left) and Bossa Nova (right). As can be seen, in both histograms the prominent periodicities or candidate tempos are clearly visible. Once the tempo of the piece is identified, the beat locations can be calculated by locally fitting tempo hypotheses with regularly spaced peaks of the onset strength signal.
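A minimal sketch of the autocorrelation-based beat histogram follows; the onset strength signal is assumed to have been computed already by an onset detector, and the constants are illustrative rather than taken from any particular published system:

```python
import numpy as np

def beat_histogram(onset_strength, frame_rate, bpm_range=(30, 180)):
    """Salience of candidate tempi from the autocorrelation of an onset
    strength signal sampled at frame_rate frames per second."""
    x = onset_strength - onset_strength.mean()
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]   # non-negative lags
    ac /= (ac[0] + 1e-10)                               # normalize by lag 0
    bpms, saliences = [], []
    # A lag of L frames corresponds to a tempo of 60 * frame_rate / L BPM.
    for lag in range(1, len(ac)):
        bpm = 60.0 * frame_rate / lag
        if bpm_range[0] <= bpm <= bpm_range[1]:
            bpms.append(bpm)
            saliences.append(ac[lag])
    return np.array(bpms), np.array(saliences)
```

Peaks of the returned salience curve correspond to the candidate tempos visible in Figure 26.4.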
FIGURE 26.3 Chromagram for pitch glide over two octaves.
FIGURE 26.4 Beat histograms of HipHop/Jazz and Bossa Nova.
1.26.4 Extracting context information about music

In addition to information extracted by analyzing the audio content of music, there is a wealth of information that can be extracted by analyzing information on the web as well as patterns of downloads/listening. We use the general term musical context to describe this type of information. In some cases, such as song lyrics, the desired information is explicitly available somewhere on the web and the challenge is to appropriately filter out irrelevant information from the corresponding web pages. Text-based search engines such as Google and Bing can be leveraged for the initial retrieval, which can then be followed by some post-processing based on heuristics that are specific to the music domain. Other types of information are not as straightforward to obtain and can require more sophisticated mechanisms such as the term weighting used in text retrieval systems, or natural language processing techniques such as entity detection. Such techniques are covered in detail in the literature as they are part of modern day search engines.

As an illustrative example we will consider the problem of detecting the country of origin of a particular artist. As a first attempt one can query a search engine for various pairs of artist name and countries and simply count the number of pages returned. The country with the highest number of pages returned is returned as the country of origin. A more sophisticated approach is to analyze the retrieved web pages using term weighting. More specifically, consider country c as a term. The document frequency DF(c, a) is defined as the total number of web pages retrieved for artist a in which the country term c appears at least once. The term frequency TF(c, a) is defined as the total number of occurrences of the country term c in all pages retrieved for artist a. The basic idea of term frequency-inverse document frequency (TF-IDF) weighting is to "penalize" terms that appear in many documents (in our case the documents retrieved for all artists) and increase the weight of terms that occur frequently in the set of web pages retrieved for a specific artist. There are several TF-IDF weighting schemes. For example, a logarithmic formulation is:

TFIDF(c, a) = ln(1 + TF(c, a)) × ln(1 + n/DF(c)),   (26.1)

where n is the total number of retrieved documents and DF(c) is the document frequency of a particular country c over the documents returned for all artists a. Using the above equation the weight of every country c can be calculated for a particular artist query a. The country with the highest weight is then selected as the predicted country of origin.
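The following toy sketch implements this logarithmic TF-IDF formulation to score candidate countries for one artist. The data layout (lists of lowercase tokens per retrieved page) is hypothetical; a real system would add crawling, tokenization, and term filtering:

```python
import math
from collections import Counter

def country_of_origin(pages_by_artist, artist, countries):
    """Predict an artist's country via the TF-IDF weighting of Eq. (26.1).
    pages_by_artist maps each artist to a list of retrieved pages, each page
    represented as a list of lowercase tokens."""
    all_pages = [p for pages in pages_by_artist.values() for p in pages]
    n = len(all_pages)
    # DF(c): pages (over all artists) containing country term c at least once
    df = Counter(c for p in all_pages for c in set(p) if c in countries)
    # TF(c, a): occurrences of c in the pages retrieved for this artist
    tf = Counter(t for p in pages_by_artist[artist] for t in p if t in countries)
    scores = {c: math.log(1 + tf[c]) * math.log(1 + n / df[c])
              for c in countries if df[c] > 0}
    return max(scores, key=scores.get) if scores else None
```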
1.26.5 Similarity search

Similarity retrieval (or query-by-example) is one of the most fundamental MIR tasks. It is also one of the first tasks that was explored in the literature. It was originally inspired by ideas from text information retrieval, and this early influence is reflected in the naming of the field as Music Information Retrieval (MIR). Today most people with computers use search engines on a daily basis and are familiar with the basic idea of text information retrieval. The user submits a query consisting of some words to the search engine and the search engine returns a ranked list of web pages sorted by how relevant they are to the query. Similarity retrieval can be viewed as an analogous process where, instead of the user querying the system by providing text, the query consists of an actual piece of music. The system then responds
by returning a list of music pieces ranked by their similarity to the query. Typically the input to the system consists of the query music piece (using either a symbolic or audio representation) as well as additional metadata information such as the name of the song, artist, year of release, etc. Each returned item typically also contains the same types of meta-data. In addition to the audio content and meta-data, other types of user generated information can also be considered, such as ratings, purchase history, social relations, and tags.

Similarity retrieval can also be viewed as a basic form of playlist generation, in which the returned results form a playlist that is "seeded" by the query. However, more complex scenarios of playlist generation can be envisioned. For example, a start and end seed might be specified, or additional constraints such as approximate duration or minimum tempo variation can be imposed. Another variation is based on what collection/database is used for retrieval. The term playlisting is more commonly used to describe the scenario where the returned results come from the personal collection of the user, while the term recommendation is more commonly used in the case where the returned results are from a store containing a large universe of music. The purpose of the recommendation process is to entice the user to purchase more music pieces and expand their collection. Although these three terms (similarity retrieval, music recommendation, automatic playlisting) have somewhat different connotations, the underlying methodology for solving them is mostly similar, so for the most part we will use them interchangeably. Another related term that is sometimes used is personalized radio, in which the idea is to play music that is targeted to the preferences of a specific user.

One can distinguish three basic approaches to computing music similarity. Content-based similarity is performed by analyzing the actual content to extract the necessary information. Metadata approaches exploit sources of information that are external to the actual content, such as relationships between artists, styles, and tags, or even richer sources of information such as web reviews and lyrics. Usage-based approaches track how users listen to and purchase music and utilize this information for calculating similarity. Examples include collaborative filtering, in which the commonalities between purchasing histories of different users are exploited, tracking peer-to-peer downloads or radio play of music pieces to evaluate their "hotness," and utilizing user generated rankings and tags. There are trade-offs involved in all three approaches and most likely the ideal system would be one that combines all of them intelligently. Usage-based approaches suffer from what has been termed the "cold-start" problem, in which new music pieces for which there is no usage information cannot be recommended. Metadata approaches suffer from the fact that metadata information is frequently noisy or inaccurate and can sometimes require significant semi-manual effort to clean up. Finally, content-based methods are not yet mature enough to extract high-level information about the music.

From a data mining perspective, similarity retrieval can be considered a ranking problem. Given a query music track q and a collection of music tracks D, the goal of similarity retrieval is to return a ranked list of the music tracks in D sorted by similarity, so that the most similar objects are at the top of the list.
In most approaches, this ranking is calculated by defining some similarity (or distance) metric between pairs of music tracks. The most basic formulation is to represent each music track as a single feature vector of fixed dimensionality x = [x1, x2, ..., xn]^T and use standard distance metrics such as the L1 (Manhattan), L2 (Euclidean), or Mahalanobis distance on the resulting high dimensional space. This feature vector is calculated using audio feature extraction techniques as described in the previous section. Unless the distance metric is specifically designed to handle features with different dynamic ranges, the feature vectors are typically normalized, for example by scaling all of them so that their maximum value over the dataset is 1 and their minimum value is 0, as in the sketch below.
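A minimal sketch of this basic formulation, assuming each track has already been reduced to a fixed-length feature vector, follows; the min-max normalization and the Euclidean metric are exactly as described above:

```python
import numpy as np

def rank_by_similarity(query, collection):
    """Rank tracks by Euclidean distance to the query in feature space.
    query: 1-D feature vector; collection: n_tracks x n_features matrix.
    Features are min-max normalized over the dataset before comparison."""
    lo = collection.min(axis=0)
    hi = collection.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)       # guard constant features
    coll_n = (collection - lo) / span
    q_n = (query - lo) / span
    dists = np.linalg.norm(coll_n - q_n, axis=1)
    return np.argsort(dists)                     # most similar first
```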
A more complex alternative is to treat each music track as a distribution of feature vectors. This is accomplished by assuming that the feature vectors are samples of an unknown underlying probability density function that needs to be estimated. By assuming a particular parametric form for the pdf (for example a Gaussian Mixture Model), the music track is then represented as a parameter vector θ that is estimated from the data. This way the problem of finding the similarity between music tracks is transformed to the problem of somehow finding how similar two probability distributions estimated from the available samples are. Several such measures of probability distance have been proposed, such as histogram intersection, symmetric Kullback-Leibler divergence, and earth mover's distance. In general, computation of such probabilistic distances is more computationally intensive than geometric distances on feature vectors and in many cases requires numerical approximation, as they cannot be obtained analytically.

An alternative to audio feature extraction is to consider similarity based on text such as web pages, user tags, blogs, reviews, and song lyrics. The most common model when dealing with text is the so called "bag of words" representation, in which each document is represented as an unordered set of its words without taking into account any syntax or structure. Each word is assigned a weight that indicates the importance of the word for some particular task. The document can then be represented as a feature vector comprising the weights corresponding to all the words of interest. From a data mining perspective the resulting feature vector is no different than the ones extracted from audio feature extraction and can be handled using similar techniques. In the previous section we described a particular example of text based feature extraction for the purpose of predicting the country of origin of an artist. As an example of how a text-based approach can be used to calculate similarity, consider the problem of finding how similar two artists A and B are. Each artist is characterized by a feature vector consisting of term weights for the terms they have in common. The cosine similarity between the two feature vectors is defined as the cosine of the angle between the vectors and has the property that it is not affected by the magnitude of the vectors (which would correspond to the absolute number of times terms appear and could be influenced by the popularity of the artist):

sim(A, B) = cos θ = Σ_t A(t)B(t) / (√(Σ_t A(t)²) × √(Σ_t B(t)²)).   (26.2)

Another approach to calculating similarity is to assume that the occurrence of two music tracks or artists within the same context indicates some kind of similarity. The context can be web pages (or page counts returned by a search engine), playlists, purchase histories, and usage patterns in peer-to-peer (P2P) networks. Collaborative filtering (CF) refers to a set of techniques that make recommendations to users based on preference information from many users. The most common variant is to assume that the purchase history of a particular user (or, to some extent equivalently, their personal music collection) is characteristic of their taste in music. As an example of how co-occurrence can be used to calculate similarity, a search engine can be queried for documents that contain a particular artist A or artist B individually, as well as documents that contain both A and B. The artist similarity between A and B can then be found by:

sim(A, B) = co(A, B) / min(co(A), co(B)),   (26.3)
where co(X) is the number of pages returned for query X, or more generally the number of occurrences of X in some context, and co(A, B) is the number of co-occurrences of A and B. A similar measure can be defined based on co-occurrences between tracks and artists in playlists
and compilation albums, based on conditional probabilities:

sim(A, B) = (1/2) × (co(A, B)/co(A) + co(A, B)/co(B)).   (26.4)
Co-occurrences can also be defined in the context of peer-to-peer networks by considering the number of users that have both artists A and B in their shared collection. The popularity bias refers to the problem of popular artists appearing more similar than they should be due to their occurring in many contexts. A similarity measure can be designed to down-weight the similarity between artists if one of them is very popular and the other is not (the right-hand factor of the following equation):

sim(A, B) = (C(A, B)/C(B)) × (1 − |C(A) − C(B)|/C(Max)),   (26.5)

where C(Max) is the number of times the most popular artist appears in a context.
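The three co-occurrence measures of Eqs. (26.3)-(26.5) are simple enough to state directly in code; the sketch below assumes the raw counts have already been collected from whatever context (pages, playlists, shared P2P collections) is being used:

```python
def sim_min(co_ab, co_a, co_b):
    """Eq. (26.3): co-occurrence similarity normalized by the rarer item."""
    return co_ab / min(co_a, co_b)

def sim_conditional(co_ab, co_a, co_b):
    """Eq. (26.4): average of the two conditional co-occurrence probabilities."""
    return 0.5 * (co_ab / co_a + co_ab / co_b)

def sim_popularity_weighted(c_ab, c_a, c_b, c_max):
    """Eq. (26.5): down-weights pairs with a large popularity mismatch."""
    return (c_ab / c_b) * (1.0 - abs(c_a - c_b) / c_max)
```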
1.26.5.1 Evaluation of similarity retrieval

One of the challenges in content-based similarity retrieval is evaluation. In evaluating mining systems, ideally one can obtain ground truth information that is identical to the outcome of the mining algorithm. Unfortunately this is not the case in similarity retrieval, as it would require manually sorting large collections of music in order of similarity to a large number of queries. Even for small collections and numbers of queries, collecting such data would be extremely time consuming and practically impossible. Instead, the more common approach is to only consider the top K results for each query, where K is a small number, and have users annotate each result as relevant or not relevant. Sometimes a discrete numerical score is used instead of a binary relevance decision. Another possibility that has been used is to assume that tracks by the same artist or in the same genre should be similar and use such groupings to assign relevance values.

Evaluation metrics from information retrieval can be used to evaluate the retrieved results for a particular query. They assume that each of the returned results has a binary annotation indicating whether or not it is relevant to the query. The most common one is the F-measure, which is a combination of the simpler measures of Precision and Recall. Precision is the fraction of retrieved instances that are relevant. Recall is the fraction of relevant instances that are retrieved. As an example, consider a music similarity search in which relevance is defined by genre. If for a given query of Reggae music the system returns 20 songs and 10 of them are also Reggae, the precision for that query is 10/20 = 0.5. If there are a total of 30 Reggae songs in the collection searched, then the recall for that query is 10/30 ≈ 0.33. The F-measure is defined as the harmonic mean of Precision P and Recall R:

F = 2 × (P × R)/(P + R).   (26.6)

These measures are based on the list of documents returned by the system without taking into account the order in which they are returned. For similarity retrieval, a more accurate measure is the Average Precision, which is calculated by computing the precision and recall at every position in the ranked sequence of documents, creating a precision-recall curve, and computing the average. This is equivalent to the following finite sum:

AP = (Σ_{k=1}^{n} P(k) × rel(k)) / (# relevant documents),   (26.7)
Table 26.1 2010 MIREX Music Similarity and Retrieval Results

           FS      BS     P@5    P@10    P@20    P@50
  RND      17     0.20      8       9       9       9
  TLN3     47     0.97     48      47      45      42
  TLN2     47     0.97     48      47      45      42
  TLN1     46     0.94     47      45      43      40
  BWL1     50     1.08     53      51      50      47
  PS1      55     1.22     59      57      55      51
  PSS1     55     1.21     62      60      58      55
  SSPK2    57     1.24     59      58      56      53
where P(k) is the precision at list position k and rel(k) is an indicator function that is 1 if the item at list position (or rank) k is a relevant document and 0 otherwise. All of the measures described above are defined for a single query. They can easily be extended to multiple queries by taking their average across the queries. The most common way of evaluating similarity systems with binary relevance ground truth is the Mean Average Precision (MAP), which is defined as the mean of the Average Precision across a set of queries.

The Music Information Retrieval Evaluation eXchange (MIREX) is an annual evaluation benchmark in which different groups submit algorithms to solve various MIR tasks and their performance is evaluated using a variety of subjective and objective metrics. Table 26.1 shows representative results of the music similarity and retrieval task from 2010. It is based on a dataset of 7000 30-second audio clips drawn from 10 genres. The objective statistics are the precision at 5, 10, 20, and 50 retrieved items without counting entries by the query artist (artist filtering). The subjective statistics are based on human evaluation of approximately 120 randomly selected queries with 5 results per query. Each result is graded with a fine score (FS, between 0 and 100, with 100 being most similar) and a broad score (BS: 0 not similar, 1 somewhat similar, 2 similar) and the results are averaged. As can be seen, all automatic music similarity systems perform significantly better than the random baseline (RND). They differ in terms of the type of extracted features utilized, the decision fusion strategy (such as simple concatenation of the different feature sets or empirical combinations of distances from the individual feature sets), and whether post-processing is applied to the resulting similarity matrix. There is also a strong correlation between the subjective and objective measures, although it is not perfect (for example, SSPK2 is better than PSS1 in terms of subjective measures but worse in terms of objective measures).
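A small sketch of Average Precision and MAP as defined by Eq. (26.7) follows; the relevance judgments are assumed to be binary and already attached to the ranked result list:

```python
import numpy as np

def average_precision(relevant, n_relevant_total):
    """Eq. (26.7). relevant: binary list over the ranked result list,
    1 if the item at that rank is relevant to the query."""
    hits, precisions = 0, []
    for k, rel in enumerate(relevant, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)   # P(k) contributes only at relevant ranks
    return sum(precisions) / n_relevant_total

def mean_average_precision(runs):
    """MAP over a set of queries; runs is a list of (relevant, n_total) pairs."""
    return np.mean([average_precision(r, n) for r, n in runs])

# Example: relevant items at ranks 1, 3, and 5 out of 3 relevant in total
print(average_precision([1, 0, 1, 0, 1], n_relevant_total=3))  # about 0.756
```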
1.26.5.2 Cover song detection and audio fingerprinting

In addition to content-based similarity there are two related music mining problems. The goal of audio fingerprinting is to identify whether a music track is one of the recordings in a set of reference tracks. The problem is trivial if the two files are byte-identical, but it can be considerably more challenging when various types of distortion need to be taken into account. The most common distortion is perceptual
audio compression (such as the one used for mp3 files), which can result in significant alterations to the signal spectrum. Although these alterations are not directly perceptible by humans, they make the task of computer identification harder. Another common application scenario is music matching/audio fingerprinting for mobile applications. In this scenario the query signal is acquired through a low quality microphone on a mobile phone and contains a significant amount of background noise and interference. At the same time the underlying signal is the same exact music recording, which can help find landmark features and representations that are invariant to these distortions. Cover song detection is the more subtle problem of finding versions of the same song possibly performed by different artists, instruments, and tempi. As the underlying signals are completely different, it requires the use of more high-level representations, such as chroma vectors, that capture information about the chords and the melody of the song without being affected by timbral information. In addition, it requires sophisticated sequence matching approaches such as dynamic time warping (DTW) or Hidden Markov Models (HMM) to deal with the potential variations in tempo.

Although both of these problems can be viewed as content-based similarity retrieval problems with an appropriately defined notion of similarity, they have some unique characteristics. Unlike the more classic similarity retrieval, in which we expect the returned results to gradually become less similar, in audio fingerprinting and cover song detection there is a sharper cutoff defining what is correct or not. In the ideal case, copies or cover versions of the same song should receive a very high similarity score and everything else a very low similarity score. This specificity is the reason why approaches that take into account the temporal evolution of features are more common.

Audio fingerprinting is a mature field with several systems being actively used in industry. As a representative example, we describe a landmark-based audio fingerprinting system based on the ideas used by Shazam, a music matching service for mobile phones. In this scheme, each audio track is represented by the locations in time and frequency of prominent peaks of the spectrogram. Even though the actual amplitudes of these peaks might vary due to noise and audio compression, their locations in the time-frequency plane are preserved quite well in the presence of noise and distortion. The landmarks are combined into pairs, and each pair is characterized by three numbers f1, f2, Δt, which are the frequency of the first peak, the frequency of the second peak, and the time between them. Both the reference tracks and the query track are converted into this landmark representation. The triplets characterizing each pair are quantized, with the basic idea being that if the query and a reference track have common landmarks with consistent timing they are a match. The main challenge in an industrial strength implementation is deciding on the number of landmarks per second and the thresholds used for matching. The lookup of the query landmarks in the large pool of reference landmarks can be performed very efficiently using hashing techniques, effectively creating an inverted index which maps landmarks to the files they originate from.
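The following sketch illustrates the landmark idea in simplified form. Peak picking from the spectrogram is omitted, and the pairing window and fan-out are illustrative constants, not the actual parameters of Shazam or any other deployed system:

```python
from collections import defaultdict

def landmark_hashes(peaks, fan_out=3):
    """Quantized (f1, f2, dt) triplets from time-ordered spectrogram peaks.
    peaks: list of (frame_index, freq_bin) tuples."""
    hashes = []
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1:i + 1 + fan_out]:
            dt = t2 - t1
            if 0 < dt <= 64:                       # pairing window (assumed)
                hashes.append(((f1, f2, dt), t1))
    return hashes

def build_index(reference_tracks):
    """Inverted index from landmark hash to (track_id, offset) postings."""
    index = defaultdict(list)
    for track_id, peaks in reference_tracks.items():
        for h, t in landmark_hashes(peaks):
            index[h].append((track_id, t))
    return index

def match(query_peaks, index):
    """Vote for (track, time offset) pairs; a genuine match accumulates
    many votes at one consistent offset."""
    votes = defaultdict(int)
    for h, t_query in landmark_hashes(query_peaks):
        for track_id, t_ref in index.get(h, []):
            votes[(track_id, t_ref - t_query)] += 1
    if not votes:
        return None
    (track_id, _offset), _count = max(votes.items(), key=lambda kv: kv[1])
    return track_id
```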
To solve the audio cover song detection problem, there are two issues that need to be addressed. The first issue is to compute a representation of the audio signal that is not affected significantly by the timbre of the instruments playing but still captures information about the melody and harmony (the combination of discrete pitches that are simultaneously sounding) of the song. The most common representation used in music mining for this purpose are chroma vectors (or pitch class profiles), which can be thought of as histograms showing the distribution of energy among different discrete pitches. The second issue that needs to be addressed is how to match two sequences of feature vectors (chroma vectors in this case) that have different timing and length, as there is no guarantee that a cover song is played at the same tempo as the original, and there might be multiple sections, each with different timing.
1.26.5.3 Sequence matching

Sequence matching algorithms are used for a variety of tasks in MIR, including polyphonic audio and score alignment, and real-time score following. More formally, the problem is: given two sequences of feature vectors with different lengths and timings, find the optimal way of "elastically" transforming the sequences so that they match each other. A common technique used to solve this problem, and also frequently employed in the literature for cover song detection, is dynamic time warping (DTW), a specific variant of dynamic programming. Given two time series of feature vectors X = (x1, x2, ..., xM) and Y = (y1, y2, ..., yN) with xi, yj ∈ R^d, the DTW algorithm yields an optimal solution in O(MN) time, where M and N are the lengths of the two sequences. It requires a local distance measure that can be used to compare individual feature vectors, which should have small values when the vectors are similar and large values when they are different:

d : R^d × R^d → R≥0.   (26.8)
The algorithm starts by building the distance matrix C ∈ R^{M×N} representing all the pairwise distances between the feature vectors of the two sequences. The goal of the algorithm is to find the alignment, or warping path, which is a correspondence between the elements of the two sequences, with the boundary constraint that the first and last elements of the two sequences are assigned to each other. Intuitively, for matching sequences the alignment path will be roughly diagonal and will run through the low-cost areas of the distance matrix. More formally, the alignment is a sequence of points (pi, pj) ∈ [1 : M] × [1 : N] for which the starting and ending points must be the first and last points of the aligned sequences, the points are time-ordered, and each step is constrained to move horizontally, vertically, or diagonally. The cost of an alignment path is the sum of all the pairwise distances associated with its points, and the path with the minimal cost is called the optimal alignment path; it is the output of the DTW.
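A compact sketch of the basic DTW recursion described above follows; it returns only the optimal alignment cost, and the path itself can be recovered by backtracking through the accumulated cost matrix:

```python
import numpy as np

def dtw(X, Y):
    """Cost of the optimal alignment path between feature sequences
    X (M x d) and Y (N x d), using Euclidean local distances and
    horizontal/vertical/diagonal steps. O(MN) time and space."""
    M, N = len(X), len(Y)
    C = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)  # pairwise costs
    D = np.full((M + 1, N + 1), np.inf)                        # accumulated cost
    D[0, 0] = 0.0
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            D[i, j] = C[i - 1, j - 1] + min(D[i - 1, j],       # vertical step
                                            D[i, j - 1],       # horizontal step
                                            D[i - 1, j - 1])   # diagonal step
    return D[M, N]
```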
FIGURE 26.5 Similarity matrix between energy contours and alignment path using dynamic time warping. (a) Good alignment (b) bad alignment.
Figure 26.5 shows two distance matrices calculated from the energy contours of different orchestral music movements. The left matrix is between two performances by different orchestras of the same piece. Even though the timing and duration of each performance are different, they exhibit a similar overall energy envelope, shown by the energy curves under the two axes. The optimal alignment path computed by DTW is shown superimposed over the distance matrix. In contrast, the matrix on the right shows the distance matrix between two unrelated orchestral movements, where it is clear there is no alignment and the optimal alignment path deviates significantly from the diagonal.

Hidden Markov Models (HMM) are a probabilistic sequence modeling technique. The system is modeled as going through a set of discrete (hidden) states over time following Markov transitions, i.e., each state only depends on the value of the previous state. In regular Markov models the only parameters are the state transition probabilities. In HMMs the states are not directly visible but their probabilistic output is visible. Each state has an associated probability density function, and the goal of HMM training is to estimate the parameters of both the transition matrix and the state-dependent observation distributions. For sequence matching, the estimated sequence of hidden states can be used.
1.26.5.4 Cover song detection

Cover song detection is performed by applying DTW between the query song and all the references and returning as a potential match the one with the minimum total cost for the optimal alignment path. Typically the alignment cost between covers of the same song will be significantly lower than the alignment cost between two random songs. DTW is a relatively costly operation and therefore this approach does not scale to large numbers of songs. A common solution for large scale matching is to apply an audio fingerprinting type of approach with efficient matching to filter out irrelevant candidates and, once a sufficiently small number of candidate reference tracks has been selected, apply pair-wise DTW between the query and all of them.

Table 26.2 shows the results of the audio cover song detection task of MIREX 2009 for the so called "mixed" collection, which consists of 1000 pieces that contain 11 "cover songs," each represented by 11 different versions. As can be seen, the performance is far from perfect but it is still impressive given the difficulty of the problem. An interesting observation is that the objective evaluation measures are not consistent. For example, the RE algorithm performs slightly worse than SZA in terms of mean average precision but has a better mean rank for the first correctly identified cover. Table 26.3 shows the results of the MIREX 2009 audio cover song detection task for the Mazurkas collection, which consists of 11 different performances/versions of 49 Chopin Mazurkas. As can be seen from the results, this is an easier dataset for finding covers, probably due to the smaller size and more uniformity in timbre. The RE algorithm is based on the calculation of different variants of chroma vectors utilizing multiple feature sets. In contrast to the more common approach of scoring the references in a ranked list and setting a threshold for identifying covers, it follows a classification approach in which a pair is either classified as reference/cover or as
Table 26.2 2009 MIREX Audio Cover Song Detection—Mixed Collection

                                       RE      SZA      TA
  Mean # of covers in top 10         6.20     7.35    1.96
  Mean Average Precision             0.66     0.75    0.20
  Mean Rank of first correct cover   2.28     6.15   29.90
Table 26.3 2009 MIREX Audio Cover Song Detection—Mazurkas

                                       RE      SZA      TA
  Mean # of covers in top 10         8.83     9.58    5.27
  Mean Average Precision             0.91     0.96    0.56
  Mean Rank of first correct cover   1.68     1.61    5.49
reference/non-cover. The SZA algorithm is based on harmonic pitch class profiles (HPCP), which are similar to chroma vectors but computed over a sparse harmonic representation of the audio signal. The sequence of feature vectors of one song is transposed to the main tonality of the other song under consideration. A state space representation with embedding dimension m and time delay z is used to represent the time series of HPCPs, with a recurrence quantification measure used for calculating cover song similarity.
1.26.6 Classification

Classification is the task of assigning each object (in our case a music track) to one of several pre-defined categories of interest. In data mining it refers to principled ways of building classification systems using as input a set of objects with associated ground truth classification labels, which is called the training set. It is a subset of the more general concept of supervised learning, in which an algorithm "learns" how to improve its performance over a particular task by analyzing the results on this task provided by humans during a training phase.

One of the simplest classifiers is based on techniques such as the ones described in the similarity retrieval section. A new object, represented by a vector of attributes/features, is classified to the category of its nearest neighbor in the training set. A common variant is to consider the classification label of the majority of the k nearest neighbors, where k is an odd number. In addition, a variety of techniques have been proposed specifically for this task. They include rule-based classifiers, decision trees, neural networks, support vector machines, and various parametric classifiers based on Bayesian decision theory such as Naive Bayes and Gaussian Mixture Models. These techniques differ in terms of the assumptions they make, the time they take to train, and the amount of data they require to work well. Their goal is to fit as well as possible the relationship between the attributes or features that characterize the music tracks and the ground truth class labels. The trained model should not only predict the class labels of the tracks it has encountered in training but also those of new music tracks it has never encountered before.

Classification techniques can be grouped into two large families: generative and discriminative. In generative approaches the classification problem is recast, through Bayesian decision theory, as the problem of estimating a probability density function from a set of samples (the feature vectors of a particular class in the training set). They are called generative because the estimated model can be used to generate new random samples (features). Given a classification task of M classes/categories c1, c2, ..., cM and an unknown object (in our case a music track) represented as a feature vector x, the goal is to calculate the M conditional probabilities (also referred to as a posteriori probabilities) P(ci|x), where i = 1, 2, ..., M. The predicted class label for the unknown vector x will then be the ci corresponding
to the maximum conditional probability. Using the Bayes rule these conditional probabilities can be written as:

P(ci|x) = p(x|ci) P(ci) / p(x).   (26.9)

The prior probabilities P(ci) can be calculated by simply counting the number of instances belonging to each class in the training set, so the main challenge is to estimate the probability density function of the feature vectors given the class, p(x|ci). Many classification algorithms solve this problem by assuming a particular parametric form for the probability density function and then estimating the parameters from the observed samples for the particular class. For example, the simple Naive Bayes classifier assumes that the probability density function of each feature is conditionally independent given the class. Therefore it can be expressed as the product of normal distributions (one for each feature). The parameters that need to be estimated are the means and variances of each feature for the particular class. It can be shown that these correspond to the statistical means and variances of the observed samples in the training set. In music mining the most common generative modeling technique used is Gaussian Mixture Models (GMM), in which each class is modeled as a weighted combination of samples from K Gaussian models.

The main insight behind discriminative approaches is that what is important in a classification problem is not to model the distribution of all the observed feature vectors, but rather to focus on the areas of the feature space in which the decision of which class a vector belongs to is unclear. Essentially, discriminative approaches try to directly solve the classification problem by making some assumptions about a discriminant function (i.e., a function that, given as input an unknown feature vector, returns a predicted class) and then optimizing the parameters of this function according to some criterion. For example, the assumption can be that the discriminant function is a linear combination of the features (geometrically corresponding to a hyperplane in the feature space) and the criterion might be the number of errors in the training set. In music mining the most common discriminative model used is the Support Vector Machine (SVM).

Another way of grouping classification algorithms is into parametric approaches, in which each class is characterized by a small set of parameters (both the GMM and SVM classifiers are parametric methods), and non-parametric methods, in which there is no explicit parameter representation. The canonical example of a non-parametric classifier is the Nearest Neighbor rule, which simply classifies an unknown instance with the label associated with the training instance that is closest to it in the feature space.

In many music mining classification problems the predicted labels form natural hierarchies. For example, Folk Metal is a subgenre of Heavy Metal and Hostility is a subordinate emotion to Anger. A straightforward way of solving hierarchical classification problems is to apply standard classification techniques for the top-level categories and then train classifiers for the subordinate categories separately for each top-level category. One issue with this approach is that by making a hard decision at each level errors can be propagated.
An alternative is to view the hierarchy as a probabilistic graphical model and essentially compute conditional probabilities for the decisions at each level of the hierarchy, using classifiers that output probabilities over all possible classes rather than a single classification decision.
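As an illustration of the generative approach, the sketch below trains one Gaussian Mixture Model per class and classifies by the maximum a posteriori rule of Eq. (26.9). It uses scikit-learn's GaussianMixture; the wrapper class is our own simplified construction, not a standard API:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

class GMMClassifier:
    """One GMM per class; predict by maximum a posteriori probability."""
    def __init__(self, n_components=3):
        self.n_components = n_components

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.priors_, self.models_ = {}, {}
        for c in self.classes_:
            Xc = X[y == c]
            self.priors_[c] = len(Xc) / len(X)           # P(c_i) by counting
            self.models_[c] = GaussianMixture(
                n_components=self.n_components).fit(Xc)  # estimate p(x|c_i)
        return self

    def predict(self, X):
        # log p(x|c_i) + log P(c_i); the shared evidence p(x) can be ignored
        scores = np.stack([self.models_[c].score_samples(X)
                           + np.log(self.priors_[c]) for c in self.classes_])
        return self.classes_[np.argmax(scores, axis=0)]
```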
1.26.6.1 Genre classification

Music can be grouped into categories in many different ways and most of them have been investigated in the literature. Probably the oldest classification task that was investigated was musical genre
classification. Other tasks include artist/singer/performer classification, mood and emotion detection, instrument recognition, and others. The boundaries between categories such as genre labels are fuzzy. Even though there is considerable agreement when listeners are asked to annotate an unknown piece of music with a genre label, there are also individual differences. Top-level genres such as Classical or Reggae are more consistently identified, while more obscure genres such as folk metal or grime are meaningful only to smaller groups of listeners. Given the subjective nature of human genre annotations it is unreasonable to expect perfect computer genre classification performance (in fact the notion of perfect performance is only meaningful in relation to some ground truth, which in this case will be inherently subjective). One fascinating finding is that average listeners are able to classify a music track with a label from a set of pre-defined top-level genres, with accuracy better than random, after exposure to as little as 250 ms (1/4 of a second, or approximately the time it takes to say the word "word"), and require only 3 s to reach their best classification performance. This indicates that low-level audio-related features carry sufficient information for reliable genre classification.

In a well-known user study (details can be found in the Further Reading section) a mean subject agreement of about 70% with the genres assigned by music companies was reported. Further studies have shown that on the same dataset computer classification achieves results comparable to some of the "worst" performing humans (69%) but not as good as the "best" performing humans (95%). In this case the ground truth is defined as the genre labels assigned by the majority of the 27 users that participated in the study rather than by some external authority. Therefore the "worst" and "best" performance refer to subject agreement with the majority rather than any form of music knowledge. It has also been pointed out that listeners can probably be clustered into groups so that the perception of genres within a particular group is more consistent than across groups. In the majority of published work in automatic genre classification these issues are not considered and the ground truth is known and externally provided.

Automatic genre classification is one of the most popular topics in music mining, to a large extent due to the simplicity of obtaining ground truth, the availability of datasets for which there are published results, and the direct mapping to classification techniques. It has also served as a good initial test-bed for experimenting with alternative audio and music feature sets. At the same time, for a lot of music of interest the genre labels are already available (although in some cases, like World music, they are too generic to be useful).
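A typical baseline genre classification experiment can be assembled from the pieces described so far. The sketch below (using the librosa and scikit-learn libraries; the list of audio paths and labels is hypothetical) summarizes each track by its MFCC statistics and cross-validates an SVM classifier:

```python
import numpy as np
import librosa
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def track_features(path, sr=22050):
    """Summarize a track as the means and standard deviations of its MFCCs."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # 13 x n_frames
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# paths_and_genres is a hypothetical list of (audio_path, genre_label) pairs
def evaluate(paths_and_genres):
    X = np.stack([track_features(p) for p, _ in paths_and_genres])
    y = np.array([g for _, g in paths_and_genres])
    # 10-fold cross-validated accuracy of an SVM classifier
    return cross_val_score(SVC(), X, y, cv=10).mean()
```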
1.26.6.2 Emotion and mood classification
Listeners can easily identify emotions and moods that are present in pieces of music. It is also known that music can induce emotions and moods in listeners. Therefore it is desirable to utilize emotion- and mood-related information in music retrieval systems. Unlike other types of audio classification, such as genre, in which the classification labels are to some extent pre-defined and readily available for particular music tracks, one of the challenges in emotion and mood classification is deciding what labels should be used and obtaining them for a particular set of music tracks. Roughly speaking, mood is a more persistent state of mind that can encompass different emotions, whereas emotions are more instinctive. Most of the literature in emotion and mood classification relies on various schemes for describing emotions/moods from psychology. For example, Hevner experimentally found eight adjective groups for describing music, based on a user study of 450 subjects listening to 26 pieces of classical music. These groups are: dignified, sad, dreamy, serene, graceful, happy, exciting, and vigorous, and they were arranged in a circle in the order listed such that the changes between adjacent groups are gradual. Figure 26.6 shows this representation.

FIGURE 26.6 Different ways of organizing emotions: the Hevner adjective groups (dignified, sad, dreamy, serene, graceful, happy, exciting, vigorous) arranged in a circle, and the Thayer emotion map, a two-dimensional space whose energy and tension axes define the quadrants calm energy, tension energy, calm tiredness, and tension tiredness.

Schemes for describing emotion can be roughly divided into two general families: models based on hierarchies of words describing the different emotions, and two-dimensional (or sometimes even three-dimensional) continuous emotion spaces where each axis corresponds to a pair of adjectives with opposite meanings (for example, happy and sad). A particular emotion is characterized as a point in the emotion space. The use of words to represent emotion (or affect) can be problematic, as multiple emotions can be experienced at the same time and there is individual variability in how a particular experience is assessed. Figure 26.6 shows an example of an emotion grouping and an example of an emotion space.

In terms of data mining techniques, standard classification approaches such as the ones described above can be used. The most common sources of training data in the literature are audio features, text features based on analyzing lyrics, or more generally tags, which are described below. Hierarchical classification techniques can also be employed to deal with multiple levels of classification. In order to deal with the fact that multiple emotion/mood words might apply to the same music track, a multi-label classification approach can be employed. In traditional classification there is a single ground truth label (from a predefined set of categories) for each training instance and a single predicted label (from the same set of categories) for each testing instance. In multi-label classification there are multiple labels (still from a predefined set of categories) associated with each instance. Multi-label classification approaches can be categorized into two broad families. The first family consists of problem transformation methods that convert a multi-label classification problem into a set of standard classification problems. For example, each label can be treated separately as a binary classification problem, with instances that are annotated with the label being positive examples and instances that do not contain the label being negative examples. This approach requires training K binary classifiers, where K is the number of distinct labels. The second family consists of classification approaches that can be directly applied to
multi-label classification problems. A simple example is the Nearest Neighbor classifier, in which a new instance is annotated with all the labels that are associated with the training instance that is closest to it in the feature space. When dealing with a continuous emotion space, both the ground truth and the predicted outcome are continuous rather than categorical. In data mining, regression refers to the prediction of a continuous output label from a set of continuous attributes. There is a large variety of regression algorithms which, similarly to classification algorithms, can be categorized as parametric and non-parametric. In many cases they are based on similar ideas to classification algorithms. For example, support vector regression (SVR) is the counterpart of support vector machines (SVM), and ensemble boosting regression (AdaBoost.RT) is the counterpart of the AdaBoost ensemble classification algorithm.
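The two prediction settings discussed above can be sketched as follows (a rough illustration assuming Python with scikit-learn; the features, mood labels, and arousal values are synthetic placeholders): a problem-transformation multi-label classifier that trains one independent binary classifier per mood word, and a support vector regressor for a continuous emotion dimension:

# Sketch: problem transformation for multi-label mood classification,
# plus SVR for a continuous emotion dimension; data is a placeholder.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 20))            # placeholder feature vectors
Y = rng.integers(0, 2, size=(300, 4))     # 4 binary mood labels per track
arousal = rng.normal(size=300)            # continuous emotion dimension

# One binary classifier per label (K = 4 here); a track is annotated
# with every label whose classifier predicts positive.
label_models = [LogisticRegression(max_iter=1000).fit(X, Y[:, k])
                for k in range(Y.shape[1])]
predicted = np.column_stack([m.predict(X[:5]) for m in label_models])

# Regression counterpart for a continuous emotion space.
reg = SVR(kernel="rbf").fit(X, arousal)
print(predicted, reg.predict(X[:5]))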
1.26.6.3 Evaluating classifier performance
The goal of a classifier is to be able to classify objects it has not encountered before. Therefore, in order to get a better estimate of its performance on unknown data, it is necessary to use some of the instances labeled with ground truth for testing purposes and not take them into account when training. The most common such evaluation scheme is called K-fold cross-validation. In this scheme the set of labeled instances is divided into K distinct subsets (folds) of approximately equal size. Each fold is used for testing once, with the K − 1 remaining folds used for training. As an example, if there are 100 feature vectors and K = 10, then each fold will contain 10 feature vectors, with each feature vector being used one time for testing and K − 1 times for training. The most common evaluation metric is classification accuracy, which is defined as the percentage of testing feature vectors that were classified correctly based on the ground truth labels. Additional insight can be provided by examining the confusion matrix, which shows the correct classifications on its diagonal and how the misclassifications are distributed among the other class labels.

When classifying music tracks, a common preprocessing technique is the so-called artist filter, which ensures that the feature vectors corresponding to tracks from the same artist are not split between training and testing and are exclusively allocated to only one of them. The rationale behind artist filtering is that feature vectors from the same artist will tend to be artificially related or correlated due to similarities in the recording process and instrumentation. Such feature vectors will be classified more easily if included in both training and testing and may inflate the classification accuracy. Similar considerations apply to feature vectors from the same music track if each track is represented by more than one feature vector, in which case a similar track filter should be applied.

Table 26.4 shows classification results from MIREX 2010, expressed as percentage accuracy. They are sorted based on the performance on the largest of the datasets considered (the Genre column). This dataset consists of 7000 clips, each 30 s long. The following ten genres are all equally represented (700 clips for each genre): blues, jazz, country, baroque, classical, romantic, electronica, hiphop, rock, metal. The mood dataset consists of 600 30-second clips classified into 5 mood clusters; the labeling was done by human judges. Table 26.5 shows the 5 mood clusters used. The Latin dataset consists of 3227 audio files representing 10 Latin music genres (axe, bachata, bolero, forro, gaucha, merengue, pagode, salsa, sertaneja, tango) sourced from Brazil and labeled by music experts. These genres are
Table 26.4 2010 MIREX Classification Tasks (accuracy, %)

           Genres   Mood    Latin
SSPK1      73.64    63.83   79.86
BRPC1      70.67    58.67   70.75
BRPC2      70.00    59.00   –
GR1        69.80    60.67   60.18
FE1        69.64    60.83   69.32
RRS1       67.89    61.67   62.53
BPME2      67.66    54.67   –
MW1        67.57    54.00   37.93
TN4        66.66    57.50   48.54
WLB1       66.11    55.50   68.11
GP1        64.27    63.17   66.90
RK1        64.03    54.83   50.85
TN1        63.37    55.50   36.83
MBP1       63.29    54.00   61.02
HE1        61.16    54.17   50.76
RJ1        61.07    54.83   61.98
JR2        60.94    51.17   60.74
JR4        60.93    51.17   60.27
RK2        60.54    47.67   42.14
MP2        60.43    36.17   57.30
JR1        60.01    46.33   56.49
RJ2        59.73    50.17   59.75
JR3        59.54    46.83   57.95
WLB2       48.50    57.67   34.99
Table 26.5 2010 MIREX Mood Clusters

Cluster 1: Passionate, Rousing, Confident, Boisterous, Rowdy
Cluster 2: Rollicking, Cheerful, Fun, Sweet, Amiable, Good natured
Cluster 3: Literate, Poignant, Wistful, Bittersweet, Autumnal, Brooding
Cluster 4: Humorous, Silly, Campy, Quirky, Whimsical, Witty
Cluster 5: Aggressive, Fiery, Tense, Intense, Volatile, Visceral
more differentiated by rhythmic characteristics than those in the other datasets considered. For computing these classification accuracies, for all datasets, a three-fold cross-validation with artist filter (i.e., all songs of an artist are part of either the training set or the testing set, but not both) was used. The algorithms differ in terms of the exact details of the feature extraction and the type of supervised learning classifier they utilize.
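The evaluation protocol used above (K-fold cross-validation with an artist filter) can be sketched as follows, assuming Python with scikit-learn, whose GroupKFold splitter keeps all items sharing a group id (here, an artist id) within a single fold; the data is again a synthetic placeholder:

# Sketch: 3-fold cross-validation with an artist filter.
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

rng = np.random.default_rng(2)
X = rng.normal(size=(120, 12))            # placeholder feature vectors
y = rng.integers(0, 3, size=120)          # placeholder genre labels
artists = rng.integers(0, 30, size=120)   # artist id for each track

accuracies = []
for train_idx, test_idx in GroupKFold(n_splits=3).split(X, y, groups=artists):
    clf = KNeighborsClassifier(n_neighbors=1).fit(X[train_idx], y[train_idx])
    y_pred = clf.predict(X[test_idx])
    accuracies.append(accuracy_score(y[test_idx], y_pred))
    print(confusion_matrix(y[test_idx], y_pred))  # misclassification structure
print("mean accuracy:", np.mean(accuracies))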
1.26.6.4 Clustering
Clustering is the problem of partitioning a finite set of objects (music tracks in our case) into groups called clusters such that objects belonging to the same cluster are similar to each other and objects belonging to different clusters are dissimilar. Similarly to classification, this is a well-investigated problem in the area of data mining for which several algorithms have been proposed. In music mining it can be used to automatically organize a music collection into coherent groups, to identify users with different musical interests, or to automatically construct hierarchies of music tags. Similarly to classification, typically some combination of audio features and text such as lyrics is utilized in music clustering. One of the classic algorithms for clustering is K-means, which partitions N instances into K clusters such that each instance belongs to the cluster with the "nearest" mean, and clusters are characterized by the mean of the instances assigned to them. The most common algorithm for K-means clustering uses an iterative refinement technique consisting of two steps. In the first, assignment step, each instance is assigned to the cluster with the closest mean. The initial means can be either random or somehow spread in the feature space. In the second, update step, the means characterizing each cluster are updated to reflect the instances that have been assigned to them. The two steps are repeated with the new set of cluster means until the assignments no longer change. A somewhat similar approach can be used with the so-called EM algorithm for finding clusters based on a Gaussian Mixture Model.
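The two-step iteration can be written down directly; the following is a minimal NumPy sketch (an illustration, not any particular library implementation) of the assignment and update steps on placeholder data:

# Minimal K-means sketch: alternate assignment and update steps until
# the cluster means stop changing.
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)]   # initial means
    for _ in range(iters):
        # Assignment step: each instance joins the cluster with closest mean.
        d = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: each mean becomes the centroid of its assigned points.
        new_means = np.array([X[labels == j].mean(axis=0)
                              if np.any(labels == j) else means[j]
                              for j in range(k)])
        if np.allclose(new_means, means):   # assignments have stabilized
            return labels, means
        means = new_means
    return labels, means

X = np.random.default_rng(3).normal(size=(200, 8))  # placeholder features
labels, means = kmeans(X, k=4)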
1.26.7 Tag annotation
The term "tag" refers to any keyword associated with an article, image, video, or piece of music on the web. In the past few years there has been a gradual shift from manual annotation into fixed hierarchical taxonomies to collaborative social tagging, where any user can annotate multimedia objects with tags (so-called folksonomies) without conforming to a fixed hierarchy and vocabulary. For example, Last.fm is a collaborative social tagging network which collects roughly 2 million tags (such as "saxophone," "mellow," "jazz," "happy") per month and uses that information to recommend music to its users. Another source of tags are "games with a purpose," where people contribute tags as a by-product of doing a task that they are naturally motivated to perform, such as playing casual web games. For example, TagATune is a game in which two players are asked to describe a given music clip to each other using tags, and then guess whether the music clips given to them are the same or different. Tags can help organize, browse, and retrieve items within large multimedia collections. As evidenced by social sharing websites including Flickr, Picasa, Last.fm, and YouTube, tags are an important component of what has been termed "Web 2.0." The goal of automatic tag annotation is to predict tags by analyzing the musical content without requiring any annotation by users. Such systems typically utilize signal processing and supervised machine learning techniques to "train" autotaggers based on analyzing a corpus of manually tagged multimedia objects. Music classification can be viewed as a specialized, restricted form of tag annotation where there is a fixed vocabulary and only one tag applies to each music track. There has been considerable interest in automatic tag annotation in multimedia research. Automatic tags can help provide information about items that have not been tagged yet or are poorly tagged. This avoids the so-called "cold-start problem," in which an item cannot be retrieved until it has been tagged. Addressing this problem is particularly important for the discovery of new items such as recently
released music pieces in a social music recommendation system. Another automatic approach is to use text mining techniques to associate tags with particular music tracks.

The training data for an automatic tag annotation system is typically represented as a tag-track matrix X where each element x[t, s] represents the strength of association between a particular tag t and a particular song s. For example, the strength of association for a particular entry in the matrix can be the number of users that annotated that particular song with a particular tag. From a machine learning perspective automatic tag annotation can be viewed as a variation of multi-label classification. In contrast to traditional classification, in which each item is assigned one of k mutually exclusive class labels, in multi-label classification each item can be assigned multiple labels (tags). Many different approaches to multi-label classification have been proposed in the literature. They leverage feature information computed from a training set of examples annotated with multiple labels to train models that can subsequently be used to predict labels for new examples.

There are some unique characteristics and related challenges when the ground truth tag data is obtained from the web. The ground truth training data is noisy in the sense that the tags can contain synonyms ("calm" and "mellow"), misspellings ("chello"), and hierarchical relations ("symphony" and "classical"). In addition the data is sparse, meaning that there can be few training examples for a given tag. Finally, the absence of a tag cannot always be taken to mean that the tag is not applicable, as it might be the case that the users have simply not yet considered that tag for the particular multimedia item. This phenomenon has been termed weak labeling, in contrast to regular strong labeling.

A common straightforward approach to auto-tagging is to train K binary classifiers that classify each of the K tags independently. Instances that contain a tag are considered positive, and instances that do not contain it are considered negative. A new instance is annotated with all the tags that the binary classifiers predict as positive. Topic models such as Latent Dirichlet Allocation (LDA) make the assumption that tags can be grouped into an unknown number of higher-level groups and build a probabilistic model around this assumption. Pre-processing and post-processing techniques that take into account semantic similarities between tags can also be used to improve the results. For example, misspellings can be merged in a preprocessing step. A data-driven approach that attempts to capture some of the dependencies between tags directly from the data is called stacking. Stacking is a method of combining the outputs of multiple independent classifiers for multi-label classification. The first step of using stacking for multi-label classification is to train |V| individual tag classifiers using a training set (x_i, y_i), where x_i denotes the feature vector for instance i and y_i is the associated set of labels. The outputs of these classifiers (binary or probabilistic), f_1(x), f_2(x), ..., f_|V|(x), where x is the input feature vector, are then used as features to form a new feature set z_1, z_2, ..., z_|V|. This new feature set, together with the original ground truth labels (z_i, y_i), is then used for training a second stage of stacking classifiers.
The goal is to have the stacking classifiers make use of information such as the correlation between tags and the accuracy of the first-stage classifiers to improve the annotation performance. For example, suppose that the stage 1 performance for the tag "opera" is not very good, but that most of the examples with the tag "opera" receive high probabilities for the tags "classical" and "voice" at stage 1. The stacking stage 2 can take this information from other tags into account and improve annotation performance: something not possible during stage 1, in which each tag is treated independently. Figure 26.7 shows this process as a block diagram.
FIGURE 26.7 Stacking for automatic music tag annotation.
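A rough sketch of this two-stage architecture, assuming Python with scikit-learn and synthetic placeholder data (a real system would compute the stage 1 affinities used for stage 2 training on held-out data to avoid overfitting):

# Sketch of stacking for multi-label tag annotation.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(400, 15))            # placeholder audio features
Y = rng.integers(0, 2, size=(400, 6))     # binary tag matrix (|V| = 6 tags)

# Stage 1: one independent classifier per tag, outputting affinities
# f_1(x), ..., f_|V|(x).
stage1 = [LogisticRegression(max_iter=1000).fit(X, Y[:, t])
          for t in range(Y.shape[1])]
Z = np.column_stack([m.predict_proba(X)[:, 1] for m in stage1])

# Stage 2: per-tag classifiers trained on the vector of stage 1 outputs
# z_1, ..., z_|V|, so correlations between tags can be exploited.
stage2 = [LogisticRegression(max_iter=1000).fit(Z, Y[:, t])
          for t in range(Y.shape[1])]

z_new = np.column_stack([m.predict_proba(X[:1])[:, 1] for m in stage1])
print([m.predict(z_new)[0] for m in stage2])  # predicted tags for one track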
Evaluation of automatic tagging systems is not trivial. In general, the evaluation metrics used are generalizations of commonly used evaluation metrics for single-label classification. An annotated "training" set of instances is used to "train" the classifier, which is then used to "predict" the tags for a set of instances in a "test" set. We can also distinguish between evaluation metrics that are based on a predicted set of discrete tags (sometimes referred to as the annotation task) and ones that are based on a predicted set of tag affinities/probabilities (sometimes referred to as the ranking task) for each instance in the testing set. A common approach is to treat every classification decision equally and simply count the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) to derive well-known measures such as precision, recall, and F-measure. In the case of probabilistic output, multiple score thresholds can be considered as possible boundaries for binarization, in which case it is common to use Receiver Operating Characteristic (ROC) curves. An ROC curve is a plot of the true positive rate as a function of the false positive rate. The ROC curve can be summarized by the area under the curve (AUC-ROC), which can be found by integrating the ROC curve and is upper-bounded by 1.0. Random guessing in a retrieval task results in an AUC-ROC of 0.5. Different tag annotation methods can have different operating point characteristics in terms of the trade-off between true positives and false positives. A final complication is that calculating metrics over the entire set of tags can be misleading, as good performance on "popular" tags that appear in many instances will dominate. However, typically a more balanced response where all tags are considered is desired. In order to address this concern, evaluation metrics averaged across tags are used. Finally, it is important to note that in most cases evaluation metrics based on annotated ground truth underestimate what the true performance of the system would be if evaluated by humans. The reason is that, frequently, predicted tags that humans would consider applicable are not present in the ground truth and are therefore evaluated as mistakes.

Table 26.6 shows both binary (F-measure) and affinity (AUC-ROC) evaluation metrics for automatic tag annotation from MIREX 2009 using the MajorMiner dataset. This dataset consists of 2300 clips
Table 26.6 MIREX 2009 Tag Annotation Results (MajorMiner, Mood)

Measure       BP2    CC4    GP     GT2    HBC    LWW2
MajorMiner
F-measure     0.29   0.26   0.01   0.29   0.04   0.31
AUC-ROC       0.76   0.75   –      0.79   0.74   0.80
Mood
F-measure     0.19   0.18   0.08   0.21   0.06   0.70
AUC-ROC       0.63   0.64   –      0.65   0.66   0.80
selected at random from 3900 tracks. Each clip is 10 s long. The clips represent a total of 1400 different tracks on 800 different albums by 500 artists. To give a sense of the diversity of the music collection, the following genre tags have been applied to these artists, albums, and tracks on Last.fm: electronica, rock, indie, alternative, pop, britpop, idm, new wave, hiphop, singer-songwriter, trip-hop, post-punk, ambient, jazz. The MajorMiner game has collected a total of about 73,000 taggings, 12,000 of which have been verified by at least two users. In these verified taggings, there are 43 tags that have been verified at least 35 times, for a total of about 9000 verified uses. These are the tags used in this task. The table also shows the results on the Mood dataset, which consists of 3469 unique songs and 135 mood tags organized into 18 mood tag groups, which were used as tags. The songs are Western pop songs mostly from the USPOP collection. Each song may belong to multiple mood tag groups. The main rationale for song selection is: if more than one tag in a group was applied to a song, or if one tag in a group was applied more than once to a song, the song is marked as belonging to that group. In this task, each participating group submitted several variants of its algorithm; due to space constraints we only show the top-performing variant of each group. The evaluation was performed using 3-fold cross-validation with artist filtering, i.e., the training and test sets contained different artists. The results are averaged across tags.

The BP2 system is based on spectral and chroma features, and for classification it applies both feature selection and model selection. For model selection, instead of using classification accuracy as is common, it uses a custom measure designed to deal better with unbalanced datasets and over-fitting. Each tag is classified separately. The CC4 system is based on MFCC and delta-MFCC features that are transformed into a super-vector using a universal background model approach. In this approach, a Gaussian Mixture Model is trained using a large corpus of music that is separate from the files considered for tag annotation. This universal background model is then adapted to better fit the data of the tag dataset. The parameters of the adapted GMM are then used as features, and an ensemble classifier consisting of a linear support vector machine and an AdaBoost classifier is utilized for classification. Each tag is treated separately. The GP system uses spectral and chroma features followed by GMM modeling at both the frame level and the track level. Each tag is treated as a separate classification problem. GT2 also utilizes spectral and chroma features, followed by a layer of tag-specific linear support vector machines and a stacking layer of, again, linear support vector machines. The HBC system utilizes a bag-of-codewords representation (basically an occurrence histogram of different codewords, which are computed by k-means clustering; this approach is also known as vector quantization) with MFCC and
delta-MFCC as the original feature vectors. For tag annotation a Codeword Bernoulli Average approach is utilized. Unfortunately it is hard to draw any reliable conclusions from these results. The exact details of the features and the tag annotation approach differ across submissions, so it is hard to tell whether the observed differences between algorithms are due to the feature set used or the tag annotation algorithm.
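The tag-averaged metrics reported in Table 26.6 can be computed along the following lines (a sketch assuming Python with scikit-learn; the ground truth and affinities are random placeholders):

# Sketch: per-tag F-measure and AUC-ROC, averaged across tags so that
# popular tags do not dominate the result.
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

rng = np.random.default_rng(5)
Y_true = rng.integers(0, 2, size=(200, 10))   # ground truth tag matrix
affinity = rng.random(size=(200, 10))         # predicted tag probabilities
Y_pred = (affinity > 0.5).astype(int)         # binarized annotation decision

f_per_tag = [f1_score(Y_true[:, t], Y_pred[:, t]) for t in range(10)]
auc_per_tag = [roc_auc_score(Y_true[:, t], affinity[:, t]) for t in range(10)]
print("F-measure (tag-averaged):", np.mean(f_per_tag))
print("AUC-ROC   (tag-averaged):", np.mean(auc_per_tag))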
1.26.8 Visualization
Visualization is the display of information in such a way that relationships among data items and attributes can be analyzed and understood. It takes advantage of the strong pattern recognition properties of the human visual system. Traditional visualizations of music, such as the time-domain waveforms and spectrograms popular in audio editors, convey very little information about the musical content and are focused on single music tracks. In this section we focus on visualizations of audio collections and associated information, with specific emphasis on visualizing large collections of music.

As we have already discussed, a common approach in automatic analysis of music is to represent each track as a feature vector of fixed dimensionality (typical numbers range from 10 to 1000 dimensions). Dimensionality reduction methods try to transform these feature vectors to either 2 or 3 dimensions so that they can be visualized in a natural way by considering the transformed feature vector as a coordinate/point in a visual space. One of the most common techniques for dimensionality reduction that can be used for visualization is Principal Component Analysis (PCA). An alternative is the use of self-organizing maps (SOM), which attempt to perform both dimensionality reduction and clustering. These two methods underlie the majority of proposed music collection visualization systems. In this section we describe these two algorithms and the generic visualization interfaces that can be built using them. Additional details, such as how the user interacts with the generated visualization, the ability to zoom in and out, the display of hierarchies, etc., can be found in the further reading section.

Principal component analysis converts a set of feature vectors with possibly correlated attributes into a set of feature vectors whose attributes are linearly uncorrelated. These new transformed attributes are called the principal components, and when the method is used for dimensionality reduction their number is less than the number of original attributes. Intuitively, this transformation can be understood as a projection of the original feature vectors onto a new set of orthogonal axes (the principal components). The projection is such that each succeeding axis explains the highest possible variance of the original dataset, under the constraint that it is orthogonal to the preceding components. In a typical application scenario, where each song is represented by a 70-dimensional feature vector, the application of PCA can be used to convert it to a 3-dimensional feature vector, which can then be visualized as a point in a 3D space. A common way of calculating PCA is based on the covariance matrix, which is defined as:

C = (1/N) B B^T,    (26.10)
where T denotes the transpose operator and B is the matrix resulting from subtracting the empirical mean of each dimension from the original data matrix consisting of the N feature vectors. The eigenvectors and eigenvalues of this covariance matrix are then computed and sorted in order of decreasing eigenvalue, and the first K, where K is the desired number of reduced dimensions, are selected as the new basis
vectors. The original data can then be projected into the new space spanned by these basis vectors, i.e., the principal components. PCA is a standard operation in statistics and is commonly available in software packages dealing with matrices. One of the potential issues with PCA for music visualization is that, because it tries to preserve as much as possible the distances between points from the original feature space in the transformed feature space, it might leave large areas of the available visualized space empty of points. This is particularly undesirable in touch-based interfaces, such as tablets, where ideally everywhere a user might press should trigger some music.

Self-organizing maps attempt to perform both dimensionality reduction and clustering while mostly preserving topology but not distances. They can result in denser transformed feature spaces that are also discrete in nature, in contrast to PCA, which produces a transformed continuous space that needs to be discretized. The SOM is a type of neural network used to map a high-dimensional input feature space to a lower-dimensional representation. It facilitates both similarity quantization and visualization. It was first documented in 1982 and since then it has been applied to a wide variety of diverse clustering tasks. The SOM maps the original d-dimensional feature vectors x ∈ R^d to two discrete coordinates I ∈ [1..M] and J ∈ [1..N] on a rectangular grid. The traditional SOM consists of a 2D grid of neural nodes, each containing a d-dimensional representative vector. The goal of learning in the SOM is to cause different neighboring parts of the network to respond similarly to certain input patterns. This is partly motivated by how visual, auditory, and other sensory information is handled in separate parts of the cerebral cortex in the human brain. The network must be fed a large number of example vectors that represent, as closely as possible, the kinds of vectors expected during mapping. The vector associated with each node is initialized to small random values before training. During training, a series of d-dimensional sample vectors is presented to the map. The "winning" node of the map, known as the best matching unit (BMU), is found by computing the distance between the presented training vector and each of the nodes in the SOM. This distance is calculated according to some pre-defined distance metric, which in our case is the standard Euclidean distance on the normalized feature vectors. Once the winning node has been found, it and its surrounding nodes reorganize their vectors to more closely resemble the presented training sample. The training utilizes competitive learning: the weights of the BMU and of neurons close to it in the SOM lattice are adjusted towards the input vector, and the magnitude of the change decreases with time and with distance from the BMU. The time-varying learning rate and neighborhood function allow the SOM to gradually converge and form clusters at different granularities. Once a SOM has been trained, data may be added to the map simply by locating the node whose vector is most similar to that of the presented sample, i.e., the winner; the reorganization phase is omitted when the SOM is not in training mode. The update formula for a neuron with representative vector N(t) can be written as follows:

N(t + 1) = N(t) + Θ(v, t) α(t) (x(t) − N(t)),    (26.11)

where α(t) is a monotonically decreasing learning coefficient and x(t) is the input vector. The neighborhood function Θ(v, t) depends on the lattice distance between the BMU and neuron v.
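Both mappings can be sketched compactly in NumPy: PCA via the covariance matrix of Eq. (26.10) followed by projection onto the top two eigenvectors, and a bare-bones SOM training loop implementing Eq. (26.11) with a Gaussian neighborhood (an illustration only; the schedules for the learning rate and neighborhood width are arbitrary choices, and the data is a placeholder for real audio features):

# Sketch: PCA (Eq. 26.10) and SOM (Eq. 26.11) for mapping tracks to 2D.
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(500, 70))                 # e.g., 70-dim feature vectors

# PCA: form B by subtracting the empirical mean, compute C = (1/N) B B^T,
# and project onto the two eigenvectors with largest eigenvalues.
B = (X - X.mean(axis=0)).T                     # dimensions x instances
C = (B @ B.T) / X.shape[0]
eigvals, eigvecs = np.linalg.eigh(C)
top2 = eigvecs[:, np.argsort(eigvals)[::-1][:2]]
coords_pca = X @ top2                          # continuous 2D point per track

# SOM: an M x N grid of representative vectors trained with Eq. (26.11).
M, N, d = 10, 10, X.shape[1]
W = rng.normal(scale=0.1, size=(M, N, d))      # small random initialization
grid = np.stack(np.meshgrid(np.arange(M), np.arange(N), indexing="ij"), -1)
samples = X[rng.permutation(len(X))]
for t, x in enumerate(samples):
    alpha = 0.5 * (1 - t / len(samples))       # decreasing learning rate
    sigma = 3.0 * (1 - t / len(samples)) + 0.5 # shrinking neighborhood
    bmu = np.unravel_index(                    # best matching unit
        np.linalg.norm(W - x, axis=2).argmin(), (M, N))
    h = np.exp(-np.sum((grid - np.array(bmu)) ** 2, axis=2) / (2 * sigma**2))
    W += alpha * h[:, :, None] * (x - W)       # Eq. (26.11) for all neurons
coords_som = [np.unravel_index(np.linalg.norm(W - x, axis=2).argmin(), (M, N))
              for x in X]                      # discrete grid cell per track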
FIGURE 26.8 Topological mapping of musical content by the self-organizing map. (a) Classical, (b) Metal, (c) HipHop, (d) Rock, (e) Bob Marley, (f) Radiohead, (g) Led Zeppelin, (h) Dexter Gordon.
Figure 26.8 illustrates the ability of the automatically extracted audio features and the SOM to represent musical content. The top subfigures (a)–(d) show how different musical genres are mapped to different regions of the SOM grid (the black squares are the ones containing one or more songs from each specific genre). As can be seen, Classical, Heavy Metal, and HipHop are well localized and distinct, whereas Rock is more spread out, reflecting its wide diversity. The SOM is trained on a collection of 1000 songs spanning 10 genres. The bottom subfigures (e)–(h) show how different artists are mapped to different regions of the SOM grid. The SOM in this case is trained on a diverse personal collection of 3000 songs spanning many artists and genres. It is important to note that in all these cases the only information used is the actual audio signal, which is automatically analyzed; no metadata is used. The locations of the genres/artists are emergent properties of the SOM and demonstrate how it semantically organizes the data.
1.26.9 Advanced music mining
In addition to the music mining tasks that have been presented in this chapter, there are several additional topics and tasks that are more briefly described in this section. The decision not to cover them in more detail had more to do with the limited amount of published work related to them and their usage of more recent and complex mining algorithms than with their importance. In many cases, they also lack commonly available datasets and agreed-upon evaluation methodologies.

Multiple instance learning is a classification technique in which ground truth is provided for sets (bags) of instances rather than individual instances. A classic example in music mining is that frequently it is possible to obtain classification or retrieval ground truth for artists, but the desired classification granularity is for songs. As multiple songs correspond to the same artist and the ground truth labeling does not necessarily apply to all of them, this problem is a natural fit for multiple instance learning. Semi-supervised learning is
a type of machine learning that makes use of both labeled and unlabeled data for training. It is useful in scenarios where a limited amount of ground truth data is available together with large amounts of unlabeled data.
1.26.9.1 Symbolic music mining
Music, especially classical and to some extent popular music, can be represented symbolically using representations that are similar to a music score, i.e., they encode which notes are played and at what time. A score is an abstract and structured representation. It can be interpreted in different ways, by varying timing or even instrumentation, depending on the performer. It can be viewed as a platonic ideal of a song that is normalized with respect to many details, such as timbre, pitch, and timing variations, that complicate audio analysis. At the same time, because the information is discrete and structured, it enables types of processing that are very difficult to perform on audio signals. For example, it is possible to search efficiently for melodic patterns in large collections of symbolic data. There are several possible symbolic representations for music. The simplest form consists of simply the start times and durations of a set of discrete notes or pitches; MIDI (Musical Instrument Digital Interface) files mostly contain this information. More expressive formats, such as Music XML, can express additional information
FIGURE 26.9 Piano roll symbolic representation (a) and associated spectrogram (b).
such as graphic characteristics of the musical score. Figure 26.9 relates a spectrogram representation to a corresponding symbolic representation using the so-called piano roll notation. Algorithms from text mining, discrete mathematics, and theoretical computer science are used in this research; it is also frequently driven by musicological considerations. In the same way that the canonical example of music mining for audio signals is automatic genre classification, the canonical example for symbolic music mining is searching for a particular monophonic melodic sequence in a database of melodies. This task has, for example, applications in query-by-humming (QBH) systems, where the user sings a melody and the best match is returned as a music file that can be played. Even though some systems only check for exact matches, in most cases some form of approximate matching is utilized. Melodies are represented as one-dimensional strings of characters, where each character represents either a discrete pitch or the interval between succeeding pitches. The large literature on string matching algorithms, such as computing edit distances, finding the longest common subsequence, or finding occurrences of one string in another, can then be utilized for symbolic music mining. When melody matching needs to be performed in polyphonic audio, or the query itself is polyphonic, more sophisticated techniques based on computational geometry are utilized. Finally, most of the music mining tasks described for audio signals can also be applied to symbolic music signals provided an appropriate feature representation is used. For example, features computed from a MIDI file, such as average interval size, highest note, and lowest note, can be used as a feature representation for classification tasks.
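As an illustration of the string-matching view of melody retrieval (a self-contained sketch in plain Python, not any particular published system), the following represents melodies as interval strings, which makes matching transposition invariant, and ranks database entries by edit distance to the query:

# Sketch: approximate monophonic melody matching with an edit distance
# over interval sequences.

def intervals(pitches):
    # MIDI note numbers -> successive intervals; keeping only differences
    # makes the representation transposition invariant.
    return [b - a for a, b in zip(pitches, pitches[1:])]

def edit_distance(s, t):
    # Classic dynamic-programming (Levenshtein) distance over symbols.
    dp = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        prev, dp[0] = dp[0], i
        for j, b in enumerate(t, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (a != b))  # substitution
    return dp[-1]

query = intervals([60, 62, 64, 65, 67])        # placeholder sung query
database = {"tune_a": [60, 62, 64, 65, 67, 69],
            "tune_b": [60, 60, 67, 67, 69, 69]}
best = min(database, key=lambda k: edit_distance(query, intervals(database[k])))
print(best)   # melody with the smallest edit distance to the query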
1.26.9.2 Bridging symbolic and audio music mining
Automatic transcription refers to the process of essentially converting an audio recording to a symbolic representation. Figure 26.9 shows this relationship graphically. As can be seen, the overlapping harmonic structure of notes makes it hard to separate them in a polyphonic context. Automatic transcription is a hard problem and, although significant progress has been made, it is still not very accurate for real-world recordings. There are various simpler tasks related to bridging audio and symbolic representations that have been investigated with more successful results. Audio chord and key detection compute an abstract symbolic representation, i.e., a list of chord symbols, for a music track. As an example application, using dynamic time warping (DTW) over chroma representations it is possible to find the correspondence between a symbolic representation and an audio recording. This problem is termed polyphonic audio alignment, or score following when it is performed in a real-time fashion. Combinations of symbolic and audio representations can also be used to enable new forms of querying musical data. For example, in query-by-humming systems the user sings or hums a melody, and an audio recording that contains the melody is returned. It is also possible to analyze the internal structure of a music piece in terms of repeated sections such as the chorus or a bridge; this process has been termed structure analysis. By combining symbolic and audio representations we gain more insight into the complexity and structure of music. In turn, this can inform the more general music mining tasks we have described in this chapter. A task that has received some attention in the popular press, but for which there has been no convincing demonstration, is hit song science or popularity prediction: the idea that somehow, by analyzing audio and text features, one might be able to predict whether a song will be a hit or how popular it will be before it is released on the market.
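The DTW computation at the heart of the alignment approach mentioned above can be sketched as follows (NumPy, with random placeholder matrices standing in for real chromagrams of a score rendering and an audio recording):

# Sketch: dynamic time warping between two chroma sequences.
import numpy as np

def dtw(A, B):
    # A: (n, 12) and B: (m, 12) chroma sequences; returns the cumulative
    # cost matrix from which the alignment path can be backtracked.
    n, m = len(A), len(B)
    cost = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = cost[i - 1, j - 1] + min(D[i - 1, j],      # insertion
                                               D[i, j - 1],      # deletion
                                               D[i - 1, j - 1])  # match
    return D[1:, 1:]

rng = np.random.default_rng(7)
score_chroma = rng.random((80, 12))    # chroma from the symbolic score
audio_chroma = rng.random((100, 12))   # chroma from the audio recording
D = dtw(score_chroma, audio_chroma)
print("total alignment cost:", D[-1, -1])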
1.26.10 Software and datasets
Although researchers frequently implement their own audio feature extraction algorithms, there are several freely available software collections that contain many of the methods described in this chapter. They have enabled researchers more interested in the data mining and machine learning aspects of music analysis to build systems more easily. They differ in the programming language/environment they are written in, the computational efficiency of the extraction process, their ability to deal with batch processing of large collections, their facilities for visualizing feature data, and their expressiveness/flexibility in describing complex algorithms. Table 26.7 summarizes information about some of the most commonly used software resources as of the year 2012. The list is by no means exhaustive but does provide reasonable coverage of what is available. Some of these links contain collections of code and others more integrated frameworks. Most of the toolkits support some form of audio feature extraction. Some also include music mining algorithms, but in many cases researchers rely on other general-purpose machine learning and data mining software. Several of the figures in this chapter were created using Marsyas and some using custom MATLAB code. In general, although there are exceptions, software in MATLAB tends to be less efficient and more suited for prototyping than C/C++ software, which is more efficient for large-scale problems. An alternative to using a software toolkit is to utilize audio features provided by a web service such as the Echonest API.

Over time, several audio datasets and their associated ground truth have been collected and used to evaluate different music mining tasks. For the more mature music mining tasks they have enabled, to some degree, comparison of different systems and reproducibility of results. Table 26.8 summarizes information about some of the common datasets and their characteristics as available in 2012. Almost all of the datasets can be used for experiments in mining, especially for tasks that can be performed
Table 26.7 Software for Audio Feature Extraction (in 2012)

Name              URL                          Programming Language
Auditory Toolbox  tinyurl.com/3yomxwl          MATLAB
CLAM              clam-project.org/            C++
D.Ellis Code      tinyurl.com/6cvtdz           MATLAB
HTK               htk.eng.cam.ac.uk/           C++
jAudio            tinyurl.com/3ah8ox9          Java
Marsyas           marsyas.info                 C++/Python
MA Toolbox        pampalk.at/ma/               MATLAB
MIR Toolbox       tinyurl.com/365oojm          MATLAB
Sphinx            cmusphinx.sourceforge.net/   C++
VAMP plugins      www.vamp-plugins.org/        C++
Table 26.8 Datasets for Music Mining (in 2012)

Name       URL                       Clips          Ground Truth
BEATLES    isophonics.net/datasets   179            Structure, Chord, Beat
BILLBOARD  tinyurl.com/9mocd8y       649            Structure, Chord, Beat
CAL500     tinyurl.com/9j98jj6       500            Tags
GTZAN      tinyurl.com/9rff8js       1000           Genre, Key
LASTFM     tinyurl.com/9obzlfn       20K (artists)  Tags, Collaborative Filtering
LATIN      –                         3227           Genres
MGNTUNE    tinyurl.com/8vkcoqx       25863          Tags
MSD        tinyurl.com/9mr4j8x       1000000        Genre, Tags, Ratings
RWC        tinyurl.com/cb9tpfk       315            Genre, Chords, Structure
YAHOO      tinyurl.com/c8pbeb8       717 million    Ratings
without ground truth, such as similarity retrieval. In addition, several of them provide ground truth for one or more tasks. In many cases they also contain pre-computed audio features, which is particularly useful for researchers coming from the more general data mining community. The academic music information retrieval community has been fortunate to have help from industry in creating datasets. For example, the Magnatagatune (MGNTUNE) dataset was created with help from the Magnatune recording label. Another example of a dataset created with help from industry (EchoNest) is the recent and impressive in scope Million Song Dataset (MSD). It is a freely available collection of audio features and metadata for a million popular music tracks. It also contains additional data such as sets of cover songs, lyrics, song-level tags, and similarity and user data. Several datasets are also associated with MIREX tasks. In many cases these are not accessible for experimentation, and participants have to submit their algorithms, which are executed by the MIREX organizers in order to obtain evaluation results.
1.26.11 Open problems and future trends
Music information retrieval, and more specifically music mining, is a relatively new, emerging research area, so there are many unexplored directions with significant challenges that can be investigated. Music information is multi-faceted, multi-disciplinary, multi-cultural, and multi-modal, and all these aspects complicate music mining. In this section some open problems and future trends are briefly described.

Many of the music mining tasks described in this chapter make assumptions that sit at the extremes of more nuanced differences encountered in the real world. For example, music recommendation systems are either general, in the sense that for a particular query they recommend the same set of similar tracks independently of who the user is, or they are highly personalized. Similarly, human genre annotation is not consistent, but at the same time it is not completely personalized. The reality is that listeners probably form social groups/clusters with similar taste. There is a lot of interesting work that can be done to leverage this group membership for more effective recommendations. There is also a lot of context information that can be taken into account. Typical listeners have different listening
preferences depending on the time of day or the activity they are performing. Given the ubiquity of mobile devices that have geolocation capabilities, can we leverage this information to adjust music recommendation to particular activities for a particular user? Recent advances in neuro-imaging have also made possible the monitoring of brain activity while listening to music. The resulting data is high dimensional and difficult to interpret, but maybe in the future it will help lead to a better understanding of how our brain interprets music. Such understanding might radically change how we think about music mining.

A big area of future research is audio feature extraction. Existing features are rather crude, low-level statistical descriptions that clearly do not capture a lot of the structure and complexity of music. A particularly interesting direction that is receiving a lot of attention at the moment in the machine learning community is Deep Belief Networks (DBN), a family of machine learning algorithms that are able to learn higher-level abstract representations from lower-level input signals in the context of classification problems. Another interesting challenge is that music unfolds over time at multiple levels. Many existing approaches to audio feature extraction and music mining ignore this temporal evolution or approximate it using rather crude models of dynamics. The majority of existing systems utilize one source of information, and even in the cases where multiple sources of information are utilized, their fusion is performed simply and directly. The integration of information from multiple facets at many different time scales is a big challenge for existing mining algorithms.

Another big limiting assumption that the majority of existing algorithms for music mining make is that the music is characterized statistically as a mixture, without taking into account that it is composed of individual sound sources. At the same time, it is clear that as humans we pay a lot of attention to these individual sound sources and are able, to a large extent, to characterize and follow them in complex mixtures over time. This is true both for musically trained listeners and for ones that are not. Separating a complex mixture of sounds into its individual components is called sound source separation and is a problem with a large existing literature but still far from being solved. Even if it is solved, it is unclear how all these individual sound sources and their corresponding information can be combined for higher-level music mining tasks.

Historically, MIR originated from work in digital libraries and text retrieval. Therefore it has retained a focus on processing large archives of existing music. However, new music is created and performed every day, and increasingly computers are used for its production, distribution, and consumption. There is a lot of potential in using MIR techniques in the context of music creation and especially live music performance. However, many of the developed techniques are not designed to work in real time and be interactive. The amount of data that needs to be processed constantly poses scalability challenges. For example, if we expand the scope of research from music recorded by labels to all the amateur video and audio recordings uploaded on the web, the number of files to be processed increases by one or maybe even two orders of magnitude.
As audio signal processing and machine learning techniques become more sophisticated, they also become more computationally demanding. Important issues of scalability and parallel implementation will continue to arise, making techniques that are practical for a few thousand tracks obsolete or practically infeasible for collections of a million music tracks. As collections get bigger, an important consideration is music and recording quality. To most existing music mining systems, a piece performed by the same combination of instruments by professional musicians and by a high school cover band has essentially the same representation.
The way listeners react to music is constantly changing and varies among different age groups and cultures. Even though Western popular music has been dominating music listening around the world, there is a lot of interesting work that can be done in applying music mining techniques to analyze other types of music from around the world. Such work falls under the area of computational ethnomusicology. The large diversity of music cultures can lead to specific mining problems that are variations of existing ones or completely new.

To conclude, mining music information is a complex, challenging, and fascinating area of data mining. The challenges of scalability, time evolution, heterogeneous structured and unstructured representations, and personalization, among others, pose significant difficulties for existing mining algorithms. Progress in music mining has the potential to lead to significant advances in data mining in general, and the converse is probably also true. Finally, music mining has already affected, and will probably continue to affect, the way music is produced, distributed, and consumed.
1.26.12 Further reading
In this section a number of pointers to material relevant to this chapter are provided. The list is not comprehensive but is a good representation of activity in the field. Music information retrieval is a relatively new field with a history of a little more than 10 years [1]. Currently there is no comprehensive textbook that covers it fully. However, there are several related books and overview articles that are great starting points for learning more about the field. Orio [2] is an excellent, although somewhat outdated, tutorial and review of the field of MIR. A more recent collection of articles, mostly focusing on music data management and knowledge discovery, can be found in Shen et al. [3]. Many of the music mining tasks described in this chapter are covered in more detail in a recent collection of articles on music mining edited by Li et al. [4]. The chapters on audio feature extraction, mood and emotional classification, web-based and community-based music information extraction, human computation for music classification, and indexing music with tags are particularly relevant.

The basic types of audio feature extraction, i.e., timbre, pitch, and rhythm, appear in a classic early paper on automatic genre classification [5]. Tempo induction refers to the specific problem of estimating a global tempo for a piece of music, but the basic algorithms used are common to any type of rhythm analysis. Gouyon et al. [6] provide an experimental investigation of different tempo induction algorithms. Various types of pitch representations have been investigated, especially in the context of chord and key detection [7]. An early seminal work showing how to combine text and audio features for extracting music information is Whitman and Smaragdis [8], and subsequent work has explored the text analysis of lyrics [9], web-based data [10], and microblogs [11]. The strategies and challenges of ground truth acquisition have been explored for music similarity [12], genre classification [13], and tag annotation [14]. An overview of the Music Information Retrieval Evaluation eXchange (MIREX) for the years 2005–2007 can be found in Downie [15].

An early critical review of content-based music similarity approaches can be found in Aucouturier and Pachet [16]. It has been followed by many publications mostly focusing on how music similarity can be improved with better audio features, integration with text and context features, and the design of better similarity metrics [16,17]. High-level overviews of audio fingerprinting systems have been provided by Wang [18] and Haitsma and Kalker [19], and a good review of different algorithms can be found in Cano et al. [20].
Two well-known cover song detection systems are Ellis and Poliner [21] and Serra et al. [22]. A comparative study of different component choices for cover song retrieval systems can be found in Liem and Hanjalic [23]. Classification, clustering, and regression are standard problems in data mining and pattern recognition and are therefore well covered in many textbooks [24,25]. There is a large literature on automatic genre classification [5,26,27], in many cases showing that feature sets based on different musical aspects, such as harmony and rhythm, are all important and that their combination gives better results than any of them in isolation. Hierarchical classification can be performed by modeling the hierarchy as a Bayesian network [28]. A user study investigating the perception of top-level music genres by average listeners can be found in [29], and a comparison between automatic classification and annotation by users can be found in Lippens et al. [13]. An overview and critical discussion of genre classification can be found in McKay and Fujinaga [30]. Representative examples of emotion spaces that have been used in music mining are the groupings of adjectives used to describe classical music pieces by Hevner [31], as well as the 2D emotion space described by Thayer [32]. Emotion/mood recognition [33] using audio and text features has been covered in [9,34], among others. Regression can also be used for automatic emotion analysis of music [35]. Li et al. propose a clustering algorithm that combines both lyrics and audio features to perform bimodal learning [36].

Work on associating music with text using audio content analysis and machine learning started as early as 2002 [8], not using tags per se, but using keywords extracted from web pages that ranked highly in search engine results for particular artists. Around 2007–2008, as social tag annotation became more common, some of the first papers focusing on automatic tag annotation for music started appearing in the literature, using different classification approaches and audio feature sets. For example, AdaBoost was used for tag prediction in Eck et al. [37]. A Gaussian Mixture Model over the audio feature space is trained for each word in a vocabulary in the seminal paper on semantic annotation by Turnbull et al. [38]. This work also provided the CAL500 dataset, which has since frequently been used to evaluate tag annotation systems. Stacking for music tag annotation was originally proposed in [39]. The user evaluation of an automatic tag annotation system, as well as a machine learning approach that works with an open tag vocabulary, has been described by Law et al. [40].

Various approaches to visualization in audio-based music information retrieval are surveyed in Cooper et al. [41]. The use of self-organizing maps in music visualization was popularized by the Islands of Music system, which rendered the resulting self-organizing map as a set of islands (areas of the grid where many songs were mapped) surrounded by ocean (areas of the grid where fewer songs were mapped) [42]. A number of music mining tasks have been formulated as multiple instance learning by Mandel and Ellis [43]. Symbolic music information retrieval can be done using polyphonic queries and references, using a variety of methods based on string and geometric approaches [44]. Structure analysis is typically performed using self-similarity matrices [45].
The most common approach to polyphonic audio-score alignment is based on calculating the distance (or similarity) matrix between two sequences, typically of chroma vectors, and performing dynamic time warping [46,47]. An excellent overview of signal processing methods for music transcription has been edited by Klapuri and Davy [48]. A critical view on hit song science can be found in Pachet and Roy [49]. The problem of identifying similar artists using both lyrics and acoustic data has been explored using a semi-supervised learning approach, in which a small set of labeled samples is supplied for the seed labeling and then used to build classifiers that improve themselves using unlabeled data [50].
Glossary

AdaBoost: a family of classification algorithms
Artist Filter: a preprocessing technique used in music mining tasks that ensures that tracks of the same artist are either part of the training set or the testing set but not both
Audio Fingerprinting: an algorithm for determining whether two digital audio files correspond to the same underlying recording
Beat Histogram (Spectrum): a representation of the rhythmic content of a piece of music in terms of the amount of energy in the different rhythm-related periodicities that are present
Chroma: a representation of the pitch content of a piece of music
Chromagram: a sequence of chroma vectors that can be visualized as an image
Collaborative Filtering: a technique for recommending items based on the purchase history of users
Cosine Similarity: a method of computing similarity between two vectors based on the cosine of the angle between them
Discriminative Classifiers: a family of classification algorithms in which the goal is to discriminate between the classes directly, without modeling the distribution of all the sample feature vectors but only the ones that affect the decision boundary
EM-algorithm: an algorithm for training Gaussian Mixture Models
Emotion Space: a multi-dimensional representation that maps different human emotions and their relations
F-Measure: a performance measure for retrieval algorithms
Games with a Purpose (GWAP): computer games that, in addition to being engaging and entertaining, provide useful information for training computers
Gaussian Models: a parametric family of probability density functions
Gaussian Mixture Model: a method for modeling a probability density function as a combination of Gaussian components
Generative Classifiers: a family of classification algorithms in which the training data is modeled by estimating probability density functions that can generate new sample feature vectors
Genre: a categorical label that characterizes a particular type of music
Ground Truth: the information needed for training and evaluating a mining algorithm
Harmonic Pitch Class Profiles: a representation of pitch content for audio signals similar in nature to Chroma
Harmony: the structure of a piece of music in terms of combinations of discrete notes or pitches
Hidden Markov Models (HMM): a probabilistic technique for modeling sequences of feature vectors
ISMIR: the annual international conference of the Society for Music Information Retrieval
K-Fold Cross Validation: a methodology for evaluating classification performance
K-Means: a standard clustering algorithm
Last.fm: an internet music recommendation service
Mean Average Precision: a performance measure for retrieval algorithms across multiple queries
Mel-Frequency Cepstral Coefficients (MFCC): a set of audio features frequently used in MIR that originated in speech processing
Musical Instrument Digital Interface (MIDI): a symbolic format for transmitting and storing music-related information
Music Information Retrieval (MIR): the research area dealing with all aspects of retrieving information from music signals that are stored digitally
Music Information Retrieval Evaluation eXchange (MIREX): an annual campaign for evaluating different algorithms on a variety of MIR tasks
Music XML: a symbolic format for representing all aspects of notated music
Naive Bayes: a family of parametric classification algorithms
Nearest Neighbor Classifier: a family of non-parametric classification algorithms
Pandora: an internet personalized radio station
Pitch Class Profiles: a representation of the pitch content of a piece of music
Precision: a performance measure for retrieval algorithms
Principal Component Analysis: a technique for dimensionality reduction
Query-by-Humming (QBH): the task of matching a sung melody to a database of music tracks in either symbolic or audio format
Recall: a performance measure for retrieval algorithms
Receiver Operating Characteristic (ROC): a graphical representation used to characterize retrieval systems
Rhythm: the structure of a piece of music over time
Self-Organizing Maps (SOM): a dimensionality reduction and clustering technique that maps high-dimensional feature vectors into a discrete set of locations
Short Time Fourier Transform (STFT): a transformation of a time-varying signal to a time-frequency representation
Similarity Retrieval: a music mining task where a query music track is provided by the user and a set of tracks that are similar to it are returned by the system
Spectrum: a representation of a signal in terms of the amounts of different frequencies
Stacking: a methodology for combining binary classifiers for multi-label classification
Support Vector Machines: a family of discriminative classification algorithms
Tags: words from an open vocabulary supplied by users to characterize multimedia objects such as music tracks, images, and videos
Term Weighting: a technique from text mining in which different words are assigned different weights when considering the similarity of documents
Timbre: the properties that characterize a particular sound source when compared with other sound sources of the same pitch and loudness
Acknowledgments
The author would like to thank Tiago Fernandes Tavares and Steven Ness for help with preparing the figures for this chapter. Moreover, the author would like to thank the many researchers from a variety of disciplines who have helped music information retrieval and, more specifically, music mining grow and evolve. Their efforts have made the work described in this chapter possible, and the author is proud to have been a part of this amazing community from its inception.

Relevant Theory: Signal Processing Theory and Machine Learning
See this Volume, Chapter 9 Discrete Transforms
See this Volume, Chapter 16 Kernel Methods and SVMs
See this Volume, Chapter 20 Clustering
References

[1] J. Downie, D. Byrd, T. Crawford, Ten years of ISMIR: reflections on challenges and opportunities, in: Proceedings of the 10th International Society for Music Information Retrieval Conference, 2009, pp. 13–18.
[2] N. Orio, Music retrieval: a tutorial and review, vol. 1, Now Publishers, 2006.
[3] J. Shen, J. Shepherd, B. Cui, L. Liu, Intelligent Music Information Systems: Tools and Methodologies, Information Science Reference, 2008.
[4] T. Li, M. Ogihara, G. Tzanetakis (Eds.), Music Data Mining, CRC Press, 2012.
[5] G. Tzanetakis, P. Cook, Musical genre classification of audio signals, IEEE Trans. Audio Speech Lang. Process. 10 (5) (2002) 293–302.
[6] F. Gouyon, A. Klapuri, S. Dixon, M. Alonso, G. Tzanetakis, C. Uhle, P. Cano, An experimental comparison of audio tempo induction algorithms, IEEE Trans. Audio Speech Lang. Process. 14 (5) (2006) 1832–1844.
[7] E. Gómez, Tonal description of polyphonic audio for music content processing, INFORMS J. Comput. 18 (3) (2006) 294–304.
[8] B. Whitman, P. Smaragdis, Combining musical and cultural features for intelligent style detection, in: Proceedings of the International Symposium on Music Information Retrieval (ISMIR), 2002, pp. 47–52.
[9] X. Hu, J. Downie, A. Ehmann, Lyric text mining in music mood classification, in: Proceedings of the 10th International Society for Music Information Retrieval Conference (ISMIR), 2009.
[10] P. Knees, E. Pampalk, G. Widmer, Artist classification with web-based data, in: Proceedings of the International Conference on Music Information Retrieval, 2004, pp. 517–524.
[11] M. Schedl, On the use of microblogging posts for similarity estimation and artist labeling, in: Proceedings of the International Conference on Music Information Retrieval, 2010.
[12] D. Ellis, B. Whitman, A. Berenzweig, S. Lawrence, The quest for ground truth in musical artist similarity, in: Proc. ISMIR, vol. 2, 2002, pp. 170–177.
[13] S. Lippens, J. Martens, T. De Mulder, A comparison of human and automatic musical genre classification, in: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing 2004 (ICASSP'04), vol. 4, IEEE, 2004, pp. iv–233.
[14] D. Turnbull, L. Barrington, G. Lanckriet, Five approaches to collecting tags for music, in: Proceedings of the 9th International Conference on Music Information Retrieval, 2008, pp. 225–230.
[15] J. Downie, The music information retrieval evaluation exchange (2005–2007): a window into music information retrieval research, Acoust. Sci. Technol. 29 (4) (2008) 247–255.
[16] J. Aucouturier, F. Pachet, Music similarity measures: what's the use?, in: Proceedings of the ISMIR, 2002, pp. 157–163.
[17] T. Li, M. Ogihara, Content-based music similarity search and emotion detection, in: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing 2004 (ICASSP'04), vol. 5, IEEE, 2004, pp. V–705.
[18] A. Wang, The Shazam music recognition service, Commun. ACM 49 (8) (2006) 44–48.
[19] J. Haitsma, T. Kalker, A highly robust audio fingerprinting system, in: Proc. ISMIR, 2002, pp. 144–148.
[20] P. Cano, E. Batlle, T. Kalker, J. Haitsma, A review of algorithms for audio fingerprinting, in: IEEE Workshop on Multimedia Signal Processing, IEEE, 2002, pp. 169–173.
[21] D. Ellis, G. Poliner, Identifying cover songs with chroma features and dynamic programming beat tracking, in: IEEE International Conference on Acoustics, Speech and Signal Processing 2007 (ICASSP 2007), vol. 4, IEEE, 2007, pp. IV–1429.
[22] J. Serra, E. Gómez, P. Herrera, X. Serra, Chroma binary similarity and local alignment applied to cover song identification, IEEE Trans. Audio Speech Lang. Process. 16 (6) (2008) 1138–1151.
[23] C. Liem, A. Hanjalic, Cover song retrieval: a comparative study of system component choices, in: Proc. Int. Conf. on Music Information Retrieval (ISMIR), 2009.
[24] S. Theodoridis, K. Koutroumbas, Pattern Recognition, fourth ed., Academic Press, 2008.
[25] P. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining, Pearson Addison Wesley, Boston, 2006.
[26] T. Li, M. Ogihara, Q. Li, A comparative study on content-based music genre classification, in: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2003, pp. 282–289.
[27] A. Meng, P. Ahrendt, J. Larsen, Improving music genre classification by short time feature integration, in: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing 2005 (ICASSP'05), vol. 5, IEEE, 2005, pp. v–497.
[28] C. DeCoro, Z. Barutcuoglu, R. Fiebrink, Bayesian aggregation for hierarchical genre classification, in: Proceedings of the International Conference on Music Information Retrieval, 2007, pp. 77–80.
[29] R. Gjerdingen, D. Perrott, Scanning the dial: the rapid recognition of music genres, J. New Music Res. 37 (2) (2008) 93–100.
[30] C. McKay, I. Fujinaga, Musical genre classification: is it worth pursuing and how can it be improved?, in: Proceedings of the Seventh International Conference on Music Information Retrieval, 2006, pp. 101–106.
[31] K. Hevner, Experimental studies of the elements of expression in music, Am. J. Psychol. 48 (2) (1936) 246–268.
[32] R. Thayer, The Biopsychology of Mood and Arousal, Oxford University Press, New York, 1989.
[33] Y. Kim, E. Schmidt, R. Migneco, B. Morton, P. Richardson, J. Scott, J. Speck, D. Turnbull, Music emotion recognition: a state of the art review, in: Proc. 11th Int. Symp. Music Information Retrieval, 2010, pp. 255–266.
[34] E. Schmidt, Y. Kim, Prediction of time-varying musical mood distributions using Kalman filtering, in: Ninth International Conference on Machine Learning and Applications 2010 (ICMLA), IEEE, 2010, pp. 655–660.
[35] Y. Yang, Y. Lin, Y. Su, H. Chen, A regression approach to music emotion recognition, IEEE Trans. Audio Speech Lang. Process. 16 (2) (2008) 448–457.
[36] T. Li, M. Ogihara, S. Zhu, Integrating features from different sources for music information retrieval, in: Sixth International Conference on Data Mining 2006 (ICDM'06), IEEE, 2006, pp. 372–381.
[37] D. Eck, P. Lamere, T. Bertin-Mahieux, S. Green, Automatic generation of social tags for music recommendation, in: Advances in Neural Information Processing Systems, vol. 20, 2007.
[38] D. Turnbull, L. Barrington, D. Torres, G. Lanckriet, Semantic annotation and retrieval of music and sound effects, IEEE Trans. Audio Speech Lang. Process. 16 (2) (2008) 467–476.
[39] S.R. Ness, A. Theocharis, G. Tzanetakis, L.G. Martins, Improving automatic music tag annotation using stacked generalization of probabilistic SVM outputs, in: Proc. ACM Multimedia, 2009.
[40] E. Law, B. Settles, T. Mitchell, Learning to tag from open vocabulary labels, in: Principles of Data Mining and Knowledge Discovery, 2010, pp. 211–226.
[41] M. Cooper, J. Foote, E. Pampalk, G. Tzanetakis, Visualization in audio-based music information retrieval, Comput. Music J. 30 (2) (2006) 42–62.
[42] E. Pampalk, S. Dixon, G. Widmer, Exploring music collections by browsing different views, Comput. Music J. 28 (2) (2004) 49–62.
[43] M. Mandel, D. Ellis, Multiple-instance learning for music information retrieval, in: Proc. ISMIR, 2008, pp. 577–582.
[44] K. Lemström, A. Pienimäki, On comparing edit distance and geometric frameworks in content-based retrieval of symbolically encoded polyphonic music, Musicae Scientiae 11 (Suppl. 1) (2007) 135.
[45] R. Dannenberg, Listening to NAIMA: an automated structural analysis of music from recorded audio, in: International Computer Music Conference (ICMC), 2002.
[46] N. Hu, R.B. Dannenberg, G. Tzanetakis, Polyphonic audio matching and alignment for music retrieval, in: IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2003.
[47] M. Müller, Information Retrieval for Music and Motion, Springer-Verlag New York, Inc., 2007.
[48] A. Klapuri, M. Davy (Eds.), Signal Processing Methods for Music Transcription, Springer, 2006.
[49] F. Pachet, P. Roy, Hit song science is not yet a science, in: Proc. Int. Conf. on Music Information Retrieval (ISMIR), 2008, pp. 355–360.
[50] T. Li, M. Ogihara, Music artist style identification by semi-supervised learning from both lyrics and content, in: Proc. ACM Int. Conf. on Multimedia, 2004, pp. 364–367.