The Journal of Systems and Software 81 (2008) 1065–1080
www.elsevier.com/locate/jss

MUSEMBLE: A novel music retrieval system with automatic voice query transcription and reformulation

Seungmin Rho a, Byeong-jun Han b, Eenjun Hwang b,*, Minkoo Kim c

a Graduate School of Information and Communication, Artificial Intelligence Laboratory, Ajou University, Suwon 443-749, South Korea
b School of Electrical Engineering, Multimedia Information Laboratory, Korea University, Seoul 135-701, South Korea
c College of Information Technology, Ajou University, Suwon 443-749, South Korea

Received 14 March 2007; received in revised form 25 May 2007; accepted 26 May 2007
Available online 9 June 2007

* Corresponding author. Tel.: +82 2 3290 3256; fax: +82 2 921 0544.
E-mail addresses: [email protected] (S. Rho), [email protected] (B.-j. Han), [email protected] (E. Hwang), [email protected] (M. Kim).
doi:10.1016/j.jss.2007.05.038
Abstract
So far, much research has been done to develop efficient music retrieval systems, and query-by-humming has been considered one of the most intuitive and effective query methods for music retrieval. For voice humming to be a reliable query source, elaborate signal processing and acoustic similarity measurement schemes are necessary. On the other hand, there has recently been increased interest in query reformulation using relevance feedback with evolutionary techniques such as genetic algorithms for multimedia information retrieval. However, these techniques have not been widely exploited in the field of music retrieval. In this paper, we develop a novel music retrieval system called MUSEMBLE (MUSic ensEMBLE) based on two distinct features. (i) A sung or hummed query is automatically transcribed into a sequence of pitch and duration pairs with improved accuracy for music representation. More specifically, we developed two new techniques, called WAE (windowed average energy) and dynamic ADF (amplitude-based difference function) onsets, for more accurate note segmentation and onset/offset detection in the acoustic signal, respectively. The former improves on energy-based approaches such as AE by defining small but coherent windows with local and global threshold values, while the latter improves on the AF (amplitude function), which calculates the summation of the absolute values of signal differences for clustering the energy contour. (ii) A user query is reformulated using user relevance feedback with a genetic algorithm to improve retrieval performance. Even though we focus especially on humming queries in this paper, MUSEMBLE provides versatile query and browsing interfaces for various kinds of users. We have carried out extensive experiments on the prototype system to evaluate the performance of our voice query transcription and genetic algorithm-based relevance feedback schemes. We demonstrate that our proposed method improves the retrieval accuracy by up to 20–40% compared with other popular RF methods. We also show that the WAE and dynamic ADF methods together improve the transcription accuracy up to 95%.
© 2007 Elsevier Inc. All rights reserved.

Keywords: Genetic algorithm; Multimedia database; Music retrieval; Pitch tracking; Relevance feedback; Signal processing
1. Introduction
With the explosive expansion of digital music and audio content, efficient retrieval of such data is receiving more and more attention, especially in large-scale multimedia database applications. In the past, music information retrieval
was based on textual metadata such as title, composer, singer, or lyrics. However, these metadata-based schemes for music retrieval have suffered from many problems, including extensive human labor, incomplete knowledge, and personal bias. Compared with traditional keyword-based music retrieval, content-based music retrieval provides more flexibility and expressiveness. Content-based music retrieval is usually based on a set of extracted music features such as pitch, duration, and rhythm. Query-by-humming (QBH) is one of the popular content-based retrieval methods for
a large-scale music database (Ghias et al., 1995; McNab et al., 1997; Uitdenbogerd and Zobel, 1999; Hwang and Rho, 2004). QBH systems take a user's acoustic input (a short clip of singing, whistling or humming) through a microphone, extract useful features from it and then retrieve matching songs from a music database. This is very useful when the user does not know detailed information about the music, such as its title or singer, but just remembers a small segment of it. However, the quality of QBH depends strictly on the accuracy of the audio transcription, such as the duration and pitch of each note. Thus, an efficient algorithm to transcribe an audio signal into a note-like representation is one of the critical components of a QBH-based music retrieval system.

One common approach for developing a content-based music retrieval system is to represent music as a string of characters using three possible values for the pitch change: U(p), D(own), and S(ame) or R(epeat). In our previous work (Hwang and Rho, 2004, 2006), we described the limitations of a pure UDR notation based on the pitch contour and proposed new notations such as uUdDr and LSR to overcome them. Considering the volume of music data, query response time and storage requirements for indexing should also be taken into account when implementing a music retrieval system. For that purpose, we also proposed a dynamic indexing scheme called FAI, which intelligently collects frequently queried melody tunes for fast query matching (Hwang and Rho, 2006).

Relevance feedback (RF) is a well known technique in the information retrieval (IR) area. It reformulates a query based on the documents selected by the user as relevant. Recently, relevance feedback has been widely adopted to improve the performance of both text and multimedia information retrieval (Hoashi et al., 2002, 2003). Many RF methods have been studied in the CBIR (content-based image retrieval) area (Rui et al., 1998; Stejic et al., 2003), even though they were first used in text retrieval systems. However, few music retrieval systems have used RF techniques to improve their retrieval performance.

The genetic algorithm (GA) is a powerful problem-solving technique in the artificial intelligence area. It is based on Darwin's theory of evolution and the principles of biological inheritance. Very few researchers have tried to use evolutionary algorithms such as genetic algorithms in the field of music information retrieval. Previous attempts (Tokui and Iba, 2000; Unehara and Onisawa, 2003) to use GAs focused only on automatic music composition, not on adaptation of the query melody representation. Lopez-Pujalte et al. (2003) implemented a genetic algorithm for relevance feedback in textual information retrieval and ran it with different order-based fitness functions. Among the fitness functions in the literature, the ones that yield the best results are those that take into account not only which documents are retrieved, but also the order in which they are retrieved.
In this paper, we propose a novel music retrieval system, "MUSEMBLE", based on two prominent features: (i) a humming signal can be transcribed into notes more accurately, mainly due to two new methods, WAE and dynamic ADF, the latter being an improved version of the AF (amplitude function); and (ii) a relevance feedback mechanism based on a GA is provided to improve the quality of query results by reformulating the user query. In addition, the system provides versatile querying and browsing interfaces to improve query usability.

The rest of this paper is organized as follows. In Section 2, we present an overview of ongoing research on analyzing music features and constructing MIR systems. In Sections 3 and 4, we describe our music transcription scheme and our music retrieval system MUSEMBLE, respectively. In Section 5, we report some of the experimental results. Section 6 concludes this paper and describes our future directions.

2. Related work
In this section, we review some typical techniques and systems for music information retrieval. Music can be represented in two different ways. One is based on musical scores, such as MIDI and Humdrum (Kornstadt, 1998). The other is based on acoustic signals, which are sampled at a certain frequency and compressed to save space; Wave (.wav) and MPEG Layer-3 (.mp3) files are examples of this representation.

2.1. Symbolic analysis
Many research efforts to solve the music similarity problem have used symbolic representations such as MIDI, musical scores, and note lists. Based on such a representation, pitch tracking finds a "melody contour" for a piece of music, and a string matching technique can then be used to compare the transcriptions of songs (Ghias et al., 1995; McNab et al., 1997; Uitdenbogerd and Zobel, 1999; Hwang and Rho, 2004, 2006). String matching has been widely used in music retrieval because melodies are represented as string sequences of notes. To accommodate human input errors, dynamic programming can be applied to the string matching; however, this method tends to be rather slow. An inexact model matching approach (Zhuge, 2003) was proposed based on a quantified inexact signature-matching theory to find an approximate model for users' query requirements. It can enhance the reusability of a model repository and makes it possible to use and manage a model repository conveniently and flexibly. Zhuge applied this theory to a problem-oriented model repository system, PROMBS (Zhuge, 2000).

There is also research on symbolic MIR based on ideas from traditional text IR. The use of traditional IR techniques such as probabilistic modeling is described in Pickens (2000), and the use of approximate string matching
in Lemström et al. (2001). Some work has addressed other IR issues such as ranking and relevance. Hoashi et al. (2003) used relevance feedback for music retrieval based on the tree-structured vector quantization method (TreeQ) developed by Foote; the TreeQ method trains a vector quantizer instead of modeling the sound data directly.

2.2. Acoustic signal analysis
There are many techniques for extracting the pitch contour, pitch interval, and duration from a voice humming query. In general, methods for detecting pitches can be divided roughly into two categories: time-domain based and frequency-domain based. In the time domain, ZCR (zero crossing rate) and ACF (autocorrelation function) are two popular methods. The basic idea of ZCR is that it gives information about the spectral content by measuring how often the waveform crosses zero per unit time (Gerhard, 2003). In recent work, ZCR has appeared in variant forms such as VZCR (variance of ZCR) and SZCR (smoothing ZCR) (Huang and Hansen, 2006). ACF, in contrast, is based on the cross-correlation function: while a cross-correlation function measures the similarity between two waveforms over a time interval, the ACF compares a waveform with itself.

In the frequency domain, the FFT (fast Fourier transform) is one of the most popular methods. This method is based on the property that every waveform can be decomposed into simple sine waves. However, a longer window increases the frequency resolution while decreasing the time resolution. Another problem is that the frequency bins of the standard FFT are linearly spaced, while musical pitches are better mapped on a logarithmic scale. Therefore, Forberg (1998) used an alternative frequency transformation, the constant-Q transform, computed from tracked parts. Recent work on automatic transcription has used probabilistic machine learning techniques such as HMMs (hidden Markov models) and NNs (neural networks) to identify salient audio features and reduce the dimensionality of the feature space. Ryynänen and Klapuri (2006) proposed a singing transcription system based on HMM-based note event modeling. The system performs note segmentation and labeling and also applies a multiple-F0 estimation method (Klapuri, 2005) for calculating the fundamental frequency.

2.3. Recent MIR systems
For decades, many researchers have developed content-based MIR (music information retrieval) systems based on both acoustic and symbolic representations (Ghias et al., 1995; McNab et al., 1997; Typke and Prechelt, 2001; Hwang and Rho, 2006). Ghias et al. (1995) developed a QBH system that is capable of processing acoustic input in order to extract appropriate query information. However, this system used
only three types of contour information to represent melodies. The MELDEX system (McNab et al., 1997) was designed to retrieve melodies from a database using a microphone. It first transformed acoustic query melodies into music notation, and then searched the database for tunes containing the hummed (or similar) pattern. This web-based system provided several match modes, including approximate matching for interval, contour, and rhythm. MelodyHound (Typke and Prechelt, 2001), originally known as the "TuneServer", also used only three types of contour information to represent melodies. It recognized the tune based on an error-resistant encoding, using only the direction of the melody and ignoring the interval size and rhythm. The C-BRAHMS project (Ukkonen et al., 2003) developed nine different algorithms, known as P1, P2, P3, MonoPoly, IntervalMatching, PolyCheck, Splitting, ShiftOrAnd, and LCTS, for dealing with polyphonic music. Suzuki et al. (2006) proposed an MIR system that uses both lyrics and melody information in the singing voice. They used a finite state automaton (FSA) as a lyric recognizer to check the grammar and developed an algorithm for verifying the hypotheses output by the lyric recognizer. Melody information is extracted from an input song using several pieces of hypothesis information, such as song names, recognized text, recognition scores, and time alignment information.

Many other researchers have studied quality of service (QoS)-guaranteed multimedia systems over unpredictable-delay networks by monitoring network conditions such as available bandwidth. McCann et al. (2000) developed an audio delivery system called Kendra that used adaptability with a distributed caching mechanism to improve data availability and delivery performance over the Internet. Huang et al. (2001) presented the PARK approach for multimedia presentations over a best-effort network in order to achieve reliable transmission of continuous media such as audio or video.

3. Automatic voice humming transcription
This section describes the overall system architecture and the algorithms we developed for automatic voice humming transcription.

3.1. System architecture
Fig. 1 shows the overall system architecture for transcribing voice queries such as humming. In order to transcribe voice queries, we first preprocess them using the WAE and dynamic ADF. After the preprocessing, we analyze the notes and extract their pitch and duration features. More specifically, we use the WAE method to identify silent and voiced frames after framing, together with a heuristic method for merging dispersedly detected segments. We also use the average magnitude difference function (AMDF) to obtain the fundamental frequency of each frame.
Fig. 1. Process flow for automatic voice transcription.
Using this information, we were able to obtain note onset/offset information from the segmentation of silent or ignorable frames and the fundamental frequency. Furthermore, we applied the ADF to each frame in order to obtain ADF onsets from the WAE information, which makes it easier to recognize the note features.

3.2. Preprocessing
The human voice consists of diverse frequency elements. However, when a person is humming or singing, it is possible to recognize the pitch of a note under the assumption that the voice has a single fundamental period over a very short interval; that is, the voice carries a monophonic melody and has only one fundamental frequency at any short time. Thus, it is necessary to segment the voice signal into several frames. It is known that a framing size in the range of 20–50 ms is efficient for processing. The human voice can also change over intervals of any length; however, if the interval is too short, the analyzable frequency range becomes too narrow. Thus, we used a framing length of 20 ms as the minimum analyzable length. Furthermore,
we used a frame overlapping ratio of 50% for continuity between frames.

3.2.1. Windowed average energy
The WAE is an improved version of the average energy (AE), a traditional energy estimation method. The AE is defined by the following equation:

AE = \frac{1}{N} \sum_{k=0}^{N-1} |x(k)|^2    (1)
where x(k) is the input sequence and N is the sequence size. The AE indicates the average amount of energy over some signal range. It can be used to classify silent and voiced frames using one global threshold: if the energy is greater than the threshold, the frame is considered a voiced frame; otherwise, it is considered a silent frame. However, we observed several limitations of this traditional approach. First, the classification depends solely on one global threshold value, so it is not robust to variations in amplitude when the strength of the human voice changes. Also, each recording device has its own configuration and characteristics. To incorporate
such diversity, it might be more efficient to define local thresholds according to the change of environment. From this observation, we defined local thresholds for the AE instead of one global threshold. For each window, we define a local threshold to classify silent/voiced frames, which improves the accuracy of frame discrimination. Fig. 2 shows the AE with one global threshold and the WAE with multiple local thresholds; the figure shows that the local thresholds reflect the variance of the AE better. Fig. 3 shows the algorithm for calculating the local thresholds of the frames according to the WAE: if the fraction r of the maximum AE in a unit window is greater than the global threshold, that value becomes the local threshold; otherwise, the global threshold becomes the local threshold.

Fig. 2. AE with one global threshold vs. WAE with multiple local thresholds.
Fig. 3. Algorithm for WAE.
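To make the framing and WAE steps concrete, the following sketch (ours, not the authors' implementation) frames a mono signal into 20 ms windows with 50% overlap, computes the AE of each frame following Eq. (1), and assigns one local threshold per unit window as described above. NumPy is assumed, and the unit window size of 16 frames, the ratio r = 0.2 and the global threshold of 0.003 are the parameter values reported later in Section 5.1.

```python
import numpy as np

def frame_signal(x, sr, frame_ms=20, overlap=0.5):
    """Split a mono signal into overlapping frames (20 ms, 50% overlap).

    Assumes len(x) is at least one frame long."""
    size = int(sr * frame_ms / 1000)
    hop = int(size * (1 - overlap))
    n = 1 + (len(x) - size) // hop
    return np.stack([x[i * hop: i * hop + size] for i in range(n)])

def average_energy(frames):
    """Eq. (1): AE = (1/N) * sum |x(k)|^2, computed per frame."""
    return np.mean(np.abs(frames) ** 2, axis=1)

def wae_local_thresholds(ae, window=16, r=0.2, global_thr=0.003):
    """One local threshold per unit window of 16 frames (our reading of Fig. 3):
    use r * max(AE) of the window if it exceeds the global threshold,
    otherwise fall back to the global threshold."""
    thr = np.empty_like(ae)
    for start in range(0, len(ae), window):
        local = r * ae[start:start + window].max()
        thr[start:start + window] = max(local, global_thr)
    return thr

def classify_voiced(ae, thr):
    """A frame is voiced if its AE exceeds the local threshold of its window."""
    return ae > thr
```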
Fig. 4. Effects of WAE and merging tiny silent/voiced segments.
3.2.2. Merging note segments
"Tiny" silent/voiced segments are not very useful in classifying the note pitch, because their length is too short to process. Although the note segments obtained from the WAE are much smaller than those from the AE with a global threshold, the local AE threshold of each frame alone is not sufficient to classify silent and voiced frames clearly. For example, in Fig. 4, tiny silent and voiced segments are dispersed among long voiced and silent segments, respectively. We therefore merged the tiny silent segments with the neighboring voiced segments and the tiny voiced segments with the neighboring silent segments. The maximum length of a tiny segment can be defined relatively or absolutely. We used six frames as the maximum length, because the minimal length of a meaningful note must be guaranteed: if this length is too large, note segments may be merged with each other, resulting in overly long intervals, whereas if it is too small, the scattered segments do not cluster well. In our experiment, we observed that six frames were appropriate for merging the segments. After merging the note segments, we can obtain the note onset/offset information from the note segments.
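As an illustration of the merging rule, the sketch below (ours) absorbs tiny segments into their neighbours; representing a segment as a (start, end, is_voiced) triple and treating segments of at most six frames as tiny are assumptions of this sketch.

```python
MAX_TINY = 6  # maximum length (in frames) of a "tiny" segment, as used in the paper

def merge_tiny_segments(segments):
    """Absorb tiny voiced/silent segments into their neighbours.

    `segments` is a time-ordered list of (start, end, is_voiced) triples.
    A segment spanning at most MAX_TINY frames flips its label so that it
    merges with the surrounding segments of the opposite type."""
    flipped = [(s, e, (not v) if (e - s) <= MAX_TINY else v) for s, e, v in segments]
    merged = []
    for s, e, v in flipped:
        if merged and merged[-1][2] == v:          # same label as previous: extend it
            merged[-1] = (merged[-1][0], e, v)
        else:
            merged.append((s, e, v))
    return merged
```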
3.2.3. Pitch tracking
Pitch is an important parameter in voice signal analysis and can be determined from the fundamental frequency of the unit frame. Recently, frequency-domain methods such as the FFT have been widely used for pitch tracking. However, since we only needed the fundamental frequency information in this work, we used the average magnitude difference function (AMDF), which is basically a delayed autocorrelation function. For each frame, the AMDF value is defined as the sum of point-wise absolute differences between the signal and its delayed version:

AMDF(k) = \frac{1}{N-k} \sum_{n=0}^{N-k-1} |x(n+k) - x(n)|    (2)

where x(n) is the input sequence, N is the frame size, and k is a positive delay. Each frame consists of a number of frequency components. In general, the fundamental frequency of each frame can be obtained by applying an FFT. However, to find the frequency range of the strongest magnitude using the Fourier transform, it is necessary to compute the magnitude information for every bin. Under the assumption that there is only one voice in the input sequence, the result of the AMDF reflects the frequency of the strongest magnitude through its local extremes: the periodic valleys in the AMDF result are defined as local extremes. Thus, the fundamental period of a frame can be defined as the periodic distance between local extremes, and the fundamental frequency can be computed as the reciprocal of the fundamental period.
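The following sketch (ours, assuming NumPy) estimates the fundamental frequency of a single frame from Eq. (2): the AMDF is evaluated over a range of delays and the deepest valley inside the 87–800 Hz pitch range used in Section 5.1 is taken as the fundamental period, whose reciprocal gives the frequency. Picking the global minimum rather than the periodic distance between local extremes is a simplification of this sketch.

```python
import numpy as np

def amdf(frame, k):
    """Eq. (2): mean absolute difference between the frame and its k-sample delay."""
    n = len(frame)
    return np.sum(np.abs(frame[k:] - frame[:n - k])) / (n - k)

def fundamental_frequency(frame, sr, fmin=87.0, fmax=800.0):
    """Pick the delay with the deepest AMDF valley inside the allowed pitch range
    and return the corresponding frequency in Hz."""
    k_min = int(sr / fmax)                       # smallest period of interest
    k_max = min(int(sr / fmin), len(frame) - 1)  # largest period of interest
    values = np.array([amdf(frame, k) for k in range(k_min, k_max + 1)])
    best_k = k_min + int(np.argmin(values))
    return sr / best_k
```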
1070
S. Rho et al. / The Journal of Systems and Software 81 (2008) 1065–1080
3.2.4. Amplitude-based difference function
The amplitude function (AF) computes the summation of the absolute values of the amplitudes within the human voice's frequency range in each frame (Chai, 2001). In order to improve its accuracy, we proposed a new method called the amplitude-based difference function (ADF) for clustering the energy contour (Hwang et al., 2005). In this paper, we improve the method further. An ADF onset occurs when the energy rises rapidly from a low level or a long silence; conversely, an ADF offset occurs when the energy drops abruptly to a very low level. The ADF onset is another important feature for detecting note onsets. In this work, we ignored the offsets, because the end of each syllable was not always clear and it was therefore difficult to extract offset information. The algorithm for computing the ADF is shown in Fig. 5: continuous positive and negative differences are summed over the whole signal, that is, continuously increasing/decreasing intervals are merged into a single increasing/decreasing interval. Fig. 6 shows the algorithm for calculating the onset thresholds. Previous works used them as a global threshold for ADF onsets; in our work, however, we used them as local thresholds for each note segmentation group.
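One plausible reading of the ADF computation is sketched below; this is our interpretation, not the authors' code. Consecutive same-sign energy differences are accumulated into single rises or drops, and a rise whose accumulated difference exceeds the local onset threshold of its segmentation group (here a single scalar for simplicity) is reported as an ADF onset.

```python
import numpy as np

def adf_runs(frame_energy):
    """Merge consecutive same-sign energy differences into single rises/drops.

    Returns a list of (start_frame, end_frame, accumulated_difference)."""
    diffs = np.diff(frame_energy)
    runs, start, acc = [], 0, 0.0
    for i, d in enumerate(diffs):
        if acc != 0.0 and np.sign(d) != np.sign(acc):
            runs.append((start, i, acc))
            start, acc = i, 0.0
        acc += d
    runs.append((start, len(diffs), acc))
    return runs

def adf_onsets(frame_energy, local_threshold):
    """An ADF onset is the end of an accumulated rise larger than the local
    threshold of its note segmentation group."""
    return [end for _, end, acc in adf_runs(frame_energy) if acc > local_threshold]
```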
Fig. 5. Algorithm for computing ADF.
Fig. 6. Algorithm for the onset threshold.

3.3. Note analysis
From the voice query preprocessing step described in Section 3.2, we obtained the note onset/offset, the ADF onset, and the fundamental frequency of each frame. By using this information, it is possible to obtain more accurate note segments than by using the note onset/offset information alone. Specifically, we first integrated the note onset/offset and the ADF onset. After the segmentation process, we recalculated the fundamental frequency within each note segment for better accuracy. In each note segment, we applied the k-means clustering method to obtain the distribution of the calculated fundamental frequency values.

3.3.1. Note representation
After applying the AMDF, pitches were grouped if the differences between them were within the threshold. Continuous groups were merged if there was only one pitch value between the two groups that was beyond the scope of the pitch threshold. From the merged groups, the pitch onset and offset values, as well as the representative pitch
values, were extracted. The note onset/offset information was quite reliable, but it was not perfect because it was difficult to detect repeated notes. Hence, we considered the ADF onset information and the note onset/offset detected by the voice query preprocessing together. A detailed algorithm for doing this is described in Fig. 7.

Fig. 7. The query representation algorithm for integrating note onset/offset and ADF onset.

In the first step, we determined the nearest ADF onset candidate before the note onset. If the nearest onset candidate was beyond the threshold, the ADF onset candidate was ignored; otherwise, the position of the note onset was replaced with the ADF onset. This step helped to detect the note pitch and duration more accurately. In the second step, we found the breaking point in the current duration to divide a continuous note segment into two repeated ones. If a point exists within a note that has a larger interval than the smallest duration previously detected, the point is considered a breaking point and the note is segmented. Fig. 8a and c show the results of note segmentation and AMDF. The AMDF results are marked using gray lines, and note segments are marked using black lines. Fig. 8a shows the note onset/offset result before applying the ADF onset, and Fig. 8b shows the result of applying the ADF onset. The marked ADF onsets are valid ones; they exceeded the average ADF onset threshold, which was calculated as in Fig. 6. Fig. 8c shows the result of integrating the note onset/offset and ADF onset information.

3.3.2. The fundamental frequency recalculation
To integrate the note onset/offset and ADF onset, we used the procedure in (Hwang et al., 2005). Although the note onset/offset information became more accurate after integration, the pitch information might not be correct in
many cases. This was remedied by realigning the pitch information for each note of the segments. In our previous work, we applied a traditional averaging method because of its simplicity. However, the result may not represent the pitch information of each segment correctly even when the result of the AMDF contains only minor noisy values. Thus, we applied the k-means clustering method (MacQueen, 1967) to each note segment; k-means clustering is known to show good performance even when samples contain small noisy values.

Fig. 8. Final representation of each note: (a) AMDF and note segmentation (without ADF); (b) accumulated difference function (ADF, onset only); (c) AMDF and note segmentation (with ADF).

4. Implementation
We have implemented a prototype music retrieval system called "MUSEMBLE", an abbreviation of Music Ensemble and an acronym for MUSic retrieval systEM Based on a Learning Environment. In this section, we describe the overall system architecture, user query refinement, and some of the implementation details.

4.1. System architecture
Fig. 9 shows the overall system architecture of our prototype system for processing a user query and retrieving matched melodies from the database. The system consists of three main components: GUI, Analyzer, and GA Engine. A user can formulate an initial query using one of the following query interfaces: QBE, QBH, QBMN, and QBC. The analyzer module takes the user query as a signal or a sequence of notes and extracts audio features such as pitch and time contour. Based on those extracted features, the query is then transcribed into uUdDr and LSR strings. For the transcribed string, the FAI index is first looked up for a quick match; if the index lookup fails, the music database is searched. A detailed description of the FAI indexing scheme and its operations can be found in (Hwang and Rho, 2006). The matched melodies are then displayed according to their rank on the browsing interface. When the user selects a melody or its segment as the most relevant one, the GA engine generates new music segments and evaluates the fitness of each segment based on our genetic algorithm. A modified query is generated from the user's relevance judgment via the feedback interface. The whole query process is repeated until the user is satisfied.

4.2. Query refinement
As we mentioned above, we implemented GA-based relevance feedback in our prototype system to improve the retrieval performance.

4.2.1. Relevance feedback
Relevance feedback (RF) is one of the most popular query reformulation methods in the field of information retrieval. In an RF cycle, the user is provided with a list of retrieved documents and marks those that he or she considers relevant to the query. In practice, only the top 10–20 ranked documents are typically examined. Therefore, we set up our experiment such that the user listens to the top 20 songs resulting from an initial query and provides relevance feedback for the selected songs to the system. In our work, we implemented three classic RF methods (Baeza-Yates and Ribeiro-Neto, 1999): Standard Rocchio (Eq. (3)), Ide Regular (Eq. (4)) and Ide Dec-Hi (Eq. (5)). They are used to calculate a modified query \vec{q}_{new}:

\vec{q}_{new} = \alpha \vec{q}_{old} + \beta \frac{1}{|M_{rel}|} \sum_{\forall \vec{m}_j \in M_{rel}} \vec{m}_j - \gamma \frac{1}{|M_{nonrel}|} \sum_{\forall \vec{m}_j \in M_{nonrel}} \vec{m}_j    (3)
Fig. 9. System architecture of MUSEMBLE.
\vec{q}_{new} = \alpha \vec{q}_{old} + \beta \sum_{\forall \vec{m}_j \in M_{rel}} \vec{m}_j - \gamma \sum_{\forall \vec{m}_j \in M_{nonrel}} \vec{m}_j    (4)

\vec{q}_{new} = \alpha \vec{q}_{old} + \frac{\beta}{|M_{rel}|} \sum_{\forall \vec{m}_j \in M_{rel}} \vec{m}_j - \gamma \, Max_{nonrel}(\vec{m}_j)    (5)
where α, β and γ are constants, M_rel is the set of relevant melody segments identified by the user among the retrieved music objects, M_nonrel is the set of non-relevant melody segments among the retrieved music objects, and Max_nonrel(\vec{m}_j) is a reference to the highest ranked non-relevant melody segment. In the original formulations (Baeza-Yates and Ribeiro-Neto, 1999), Rocchio fixed α = 1, and Ide fixed α = β = γ = 1. However, we chose the RF constant factors α = 0.5, β = 1 and γ = 2, since they showed the best results in our experiment.
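The sketch below shows how Eqs. (3)–(5) could be applied. It is ours rather than the system's code, and it assumes that the query and the melody segments have already been mapped to fixed-length numeric feature vectors; the prototype actually works on uUdDr/LSR strings, so this vector view is purely illustrative. The default constants are the values chosen above (α = 0.5, β = 1, γ = 2).

```python
import numpy as np

def rocchio(q_old, rel, nonrel, alpha=0.5, beta=1.0, gamma=2.0):
    """Standard Rocchio, Eq. (3): move the query towards the centroid of the
    relevant segments and away from the centroid of the non-relevant ones."""
    q = alpha * q_old
    if len(rel):
        q = q + beta * np.mean(rel, axis=0)
    if len(nonrel):
        q = q - gamma * np.mean(nonrel, axis=0)
    return q

def ide_regular(q_old, rel, nonrel, alpha=0.5, beta=1.0, gamma=2.0):
    """Ide Regular, Eq. (4): unnormalized sums instead of centroids."""
    return alpha * q_old + beta * np.sum(rel, axis=0) - gamma * np.sum(nonrel, axis=0)

def ide_dec_hi(q_old, rel, top_nonrel, alpha=0.5, beta=1.0, gamma=2.0):
    """Ide Dec-Hi, Eq. (5): subtract only the highest-ranked non-relevant segment."""
    return alpha * q_old + (beta / len(rel)) * np.sum(rel, axis=0) - gamma * top_nonrel

# Illustrative use with made-up 4-dimensional feature vectors:
q0 = np.array([0.2, 0.5, 0.1, 0.7])
relevant = np.array([[0.3, 0.6, 0.2, 0.8], [0.1, 0.4, 0.0, 0.6]])
non_relevant = np.array([[0.9, 0.1, 0.8, 0.2]])
q1 = rocchio(q0, relevant, non_relevant)
```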
4.2.2. Genetic algorithm
Genetic algorithms are search and optimization methods that take their inspiration from natural selection and survival of the fittest in the biological world. A genetic algorithm works with a population of chromosomes, which represent possible solutions to a given problem. A chromosome is considered a candidate solution and is evaluated by a fitness function. Our main motivation for the fitness function is to test how well the chromosomes in the population are adapted to the given query. Given the chromosomes, the genetic algorithm requires a fitness function that returns a numeric value representing each chromosome's fitness score. This score is used when selecting the parents from the current population, so that the fittest chromosomes have a greater chance of being selected. In our experiment, the initial query result obtained with approximate matching forms the initial population, to which we apply the fitness function. If the fitness score is below some threshold value, we filter the initial query result with a higher matching rate and then apply the fitness function again until the fitness score exceeds the threshold. The chromosomes evolve over generations by means of genetic operators such as crossover and mutation. Fig. 10 shows our genetic algorithm. We calculated the fitness value of a chromosome using the following formula:

Fitness = \frac{1}{\sum_{i=1}^{N} relevance(M_i)} \sum_{i=1}^{N} \frac{1}{i} \sum_{j=1}^{i} relevance(M_j)    (6)

where N is the total number of music objects retrieved in population P and relevance(M_i) is a function that returns the relevance of music object M_i. The relevance is calculated as

relevance(M_i) = \frac{QueryLength - LD(M_i, Query)}{QueryLength}    (7)

where LD is a function that calculates the minimum edit cost between the music object and the query. Each relevance value of music object M_i ranges from 0 to 1, where 1 represents the case where the music is relevant to the user's query with full confidence and 0 indicates the opposite case. The Levenshtein distance (LD) function represents the distance between two strings by the number of operations, such as deletions, insertions, or substitutions, required to transform one string into the other (Mitchell, 1996).
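Eqs. (6) and (7) can be read as in the following sketch (ours). Melodies are assumed to be plain strings such as the uUdDr encodings, and a small edit-distance routine is included so no external library is needed.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def relevance(melody, query):
    """Eq. (7): 1 means a perfect match; clamped at 0 for very distant melodies."""
    return max(0.0, (len(query) - levenshtein(melody, query)) / len(query))

def fitness(retrieved, query):
    """Eq. (6): order-sensitive fitness of a ranked list of retrieved melodies."""
    rel = [relevance(m, query) for m in retrieved]
    total = sum(rel)
    if total == 0:
        return 0.0
    return sum(sum(rel[:i]) / i for i in range(1, len(rel) + 1)) / total

# Example: the fitness rewards rankings that put relevant melodies first.
print(fitness(["uUdDr", "uUuDr", "ddddd"], "uUdDr"))
```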
Fig. 10. Genetic algorithm.

Before the GA terminates, it produces a new generation repeatedly. For each generation, chromosomes are selected by a tournament selection method. Numerous selection schemes have been used in the GA literature, such as roulette wheel, rank, and tournament selection. Roulette wheel selection is intuitive and easy to implement, but it suffers from a scaling problem. The most frequently used selection methods that avoid this problem are ranking and tournament. Tournament selection is similar to rank selection in terms of selection pressure, but it is computationally more efficient (Mitchell, 1996). Therefore, we used the tournament selection method with a tournament size of 5; we choose five chromosomes randomly with equal probability from the population. We used the classical single-point crossover, which determines a crossover position, g_locus, that partitions each of the two chromosomes; the two chromosomes are then swapped at g_locus. Mutation in our algorithm was implemented as a random process. The mutation operator changes a randomly selected locus in a selected string with a small mutation probability (0–1). Mutation increases the diversity of the population and the probability of finding a better solution. We used a random mutation method with a randomly generated number between -10 and 10.
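The selection, crossover, and mutation operators described above could look roughly like the sketch below; it is ours, and it assumes that a chromosome is a list of integer values, while the tournament size of 5, the single crossover point g_locus, and the mutation range of -10 to 10 follow the values given in the text.

```python
import random

TOURNAMENT_SIZE = 5

def tournament_select(population, fitness_fn):
    """Pick five chromosomes uniformly at random and keep the fittest one.

    The population must contain at least TOURNAMENT_SIZE chromosomes."""
    candidates = random.sample(population, TOURNAMENT_SIZE)
    return max(candidates, key=fitness_fn)

def single_point_crossover(parent_a, parent_b):
    """Classical single-point crossover: swap the tails at a random locus g_locus."""
    g_locus = random.randint(1, len(parent_a) - 1)
    return (parent_a[:g_locus] + parent_b[g_locus:],
            parent_b[:g_locus] + parent_a[g_locus:])

def mutate(chromosome, p_mutation=0.05):
    """Random mutation: shift one randomly selected locus by a value in [-10, 10]."""
    child = list(chromosome)
    if random.random() < p_mutation:
        locus = random.randrange(len(child))
        child[locus] += random.randint(-10, 10)
    return child
```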
4.3. Implementation
We implemented a prototype music retrieval system based on the GA-based relevance feedback scheme. The prototype system provides a flexible user interface for querying and browsing the results. The client and server sides were implemented using Java Applets and JSP, respectively. We used a set of jMusic (http://jmusic.ci.qut.edu.au) APIs for extracting audio features and the Oracle database system for handling the audio metadata. Approximately 2000 MIDI files were used in the experiment; the average file size was about 40 KB.

4.3.1. Query interfaces
Currently, the user can formulate a query using one of four different query interfaces: QBC (query by contour), QBH (query by humming), QBE (query by example), and QBMN (query by music notation). Fig. 11a shows a snapshot of a melody contour sketch used for querying. In the case of the humming-based interface shown in Fig. 11b, users hum or sing into the microphone. Users can also specify their queries using a MIDI or wave file, as in Fig. 11c. A CMN (common music notation)-based query can also be created by clicking, dragging, or dropping notes on the music sheet applet, as shown in Fig. 11d. Formulated queries are stored in MIDI format and then transformed into uUdDr and LSR strings. As a traditional query interface, the system also supports text-based queries using metadata such as composer, title, and file name. All of this text information is collected from the MIDI files.

Fig. 11. Query interfaces.
Fig. 12. Query result interface.
4.3.2. Result browsing interface
Fig. 12 shows a list of matched songs from the FAI index and the music database for a queried melody. Matched songs are ranked and listed in descending order according to their similarity to the query. As shown
in the figure, the user can easily play back any matched melody segment by simply clicking the grey bar corresponding to the matched segment. This means that the user does not need to manually scroll or try several locations to listen to the matched melody segment of the retrieved music objects.

4.3.3. Feedback interface
The original query is displayed in CMN form, as shown in Fig. 13, after the user clicks the grey bar marking the matched segment, as illustrated in Fig. 12. The user can then listen to the query melody with its score notation. The user is finally asked to judge the melody as relevant or not, and simply marks the checkbox if the retrieved melody segments are relevant. If the user wants to modify the query, he or she left-clicks the mouse and drags to the desired position on the music sheet applet. There are two buttons on the right side of the generated notes: one for listening to the score and the other for re-querying. If the user wants to try for a better result, he or she just clicks the "Requery" button. The system then reformulates the query using the modified scores and repeats the whole process for the reformulated query.

Fig. 13. Feedback interface.

5. Experimental results
In this section, we first describe the details of the experiment that we performed to show the effectiveness and efficiency of our prototype system, and then report some of the results.

5.1. Experimental environments
The query signals were captured directly from a microphone and stored as PCM wave files (8-bit, 22.05 kHz, mono). We set the framing length to 20 ms, the minimum analyzable length, and the frame overlapping ratio to 50% for continuity between frames. In the experiment with the WAE method, we used the following parameters: a global threshold of 0.003 for the AE magnitude, a unit window size of 16 frames, and a differential ratio of 20%. For the ADF threshold, we used 50% of the average ADF. These values were observed to be optimal in the experiments we performed. In the pitch tracking step, any notes pitched outside the range of 87–800 Hz were discarded; it was seldom observed that hummed notes were beyond this frequency range.
5.2. Transcription performance
We collected about 160 queries overall for our experiment. A group of four males and four females participated; for experimental fairness, none of them was a musician or an experienced singer. Each user made 20 different short queries based on memorable tunes of popular songs, and each query consisted of query-by-singing with lyrics and a short humming sound with '[na]'. After the queries were recorded, we extracted their feature information using our proposed method. Then, we compared the melody transcription results with the user intention to measure their accuracy. The note segmentation and note pitch were also compared with the exact note information to measure their error rates.

In our experiment, we considered four different types of errors. A drop error indicates that a note was lost or merged into an adjacent one during the transcription. An add error is related to the appearance of non-existing notes, and a pitch error indicates that adjacent pitch changes are wrong. A duration error occurs when the difference between the detected duration and the original duration is larger than the smaller of the two. Each error type was measured by the following error estimation equation:

Error rate = \frac{\text{# of notes where errors occurred}}{\text{# of notes}}    (8)

Tables 1 and 2 show the error rates of the note pitch and duration detection. The difference between the AE with a global threshold and the WAE with cleaned note segmentation is shown in Table 1. From this result, it is clear that the WAE is more robust than the AE; the number of add errors was reduced significantly because of the merging step. From Table 2, we see that it is more efficient to use the ADF onset for rearranging the note onset/offset. There was also some improvement of the new version of the ADF over the old one (Hwang et al., 2005). By applying the query representation algorithm in Fig. 7, we reduced the numbers of drop, add, and duration errors.

Table 1
Error rates (%) of AE and WAE

Query type   Method   Drop errors   Add errors   Duration errors   Total errors
Singing      AE           6.8          8.4            8.3              23.5
             WAE          2.2          0.7            6.8               9.7
Humming      AE           2.5          4.3            7.9              14.7
             WAE          1.1          0.5            7.3               8.9

Table 2
Error rates (%) of singing/humming with/without ADF

Query type   Method           Drop errors   Add errors   Pitch errors   Duration errors
Singing      Without ADF          6.7           9.7           2.7             2.3
             With ADF (old)       4.8           8.9           2.6             2.2
             With ADF (new)       4.5           8.7           2.6             2.1
Humming      Without ADF          5.7           9.4           2.6             3.0
             With ADF (old)       5.0           8.0           2.4             2.4
             With ADF (new)       4.8           7.8           2.4             2.3
Table 3
Onset detection accuracy

Query type   Method      Detected onsets   False onsets   Recall (%)   Precision (%)
Singing      AE               248               22           80.4          91.1
             AE + ADF         254               20           83.3          92.1
             WAE              269               17           89.7          93.7
             WAE + ADF        273               15           91.1          94.5
Humming      AE               262               17           87.2          93.5
             AE + ADF         266               17           88.6          93.6
             WAE              274               15           92.2          94.6
             WAE + ADF        279               14           94.3          95.0
Table 3 and Fig. 14 show the effectiveness of our new method. As shown, our method is superior to the other methods in terms of error rates. The missed onset errors were reduced significantly when using the WAE, because the WAE contained frames that the AE could not produce. On the other hand, the missed onsets without applying the ADF contained many merged note segments; the main role of the ADF is to split a merged note segment into repeated ones. Thus, integrating AE or WAE with ADF helped to detect the missed onsets. We measured the effectiveness of our method in terms of recall and precision, which are defined by Eq. (9):

RECALL = \frac{\text{Detected onsets} - \text{False onsets}}{\text{Correct onsets}}, \quad PRECISION = \frac{\text{Detected onsets} - \text{False onsets}}{\text{Detected onsets}}    (9)

The table shows that the WAE and ADF methods together improved the transcription accuracy up to 95%.

Fig. 14. Effectiveness of ADF on AE/WAE.

We also performed a comparison with two other practical systems: AKoff (http://www.akoff.com/) and Digital Ear (http://www.digital-ear.com/). Both are music recognition applications that perform wave-to-MIDI conversion. From the results in Fig. 15, we can see that our system produced far fewer errors than the other applications for most types of errors. More specifically, AKoff produced many more drop and add errors than our system. Digital Ear detected all pitch information from each frame, but its method is aimed only at converting a wave sequence into a MIDI file.

Fig. 15. Comparison with other application systems.

5.3. Music retrieval performance
In order to evaluate the effectiveness of our retrieval scheme, we ranked the retrieved music objects based on their score and calculated the precision of the top n songs for each of the methods we considered. The experimental results are shown in Figs. 16 and 17, which give the average precision over the top 5, 10, and 20 retrieved songs, indicated as average@{5, 10, 20}, for each of the algorithms that we implemented, as well as the degree of improvement over the initial unoptimized query. As we expected, the GA with the fitness function behaved more reasonably than the other relevance feedback methods such as Ide Regular, Standard Rocchio and Ide Dec-Hi. As shown in Fig. 17, our GA-based feedback method improved the retrieval accuracy by up to 20–40%. In contrast, the GA with a random function gave quite poor performance compared with the other RF methods.

In Fig. 18a, we measured the relationship between the query length and the response time in each generation of our genetic algorithm. As we expected, a small number of generations with a query of a few notes, such as 5 or 10, gave much better results than a large number of generations with longer queries. This means that a large number of generations takes much more time for the genetic algorithm. We empirically observed that the optimal number of generations was 20 and a reasonable query size was about 10 notes.

Our prototype system allowed approximate matching for the humming query to compensate for the inaccuracy of the acoustic input. Such a query might return too many results with a low precision score, which are useless to the user. To solve this problem, when querying with approximate matching, we allowed users to specify the accuracy range from 10% to 100% (query boundary). The graph in Fig. 18b shows that the query length did not seem to affect the response time significantly when the user searched with a few notes, such as 5 or 10. On the other hand, it was quite slow with queries longer than 20 notes.
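For reference, the average@{5, 10, 20} figures used here can be computed as in the following sketch (ours); it assumes ranked result lists with binary relevance judgments.

```python
def precision_at(ranked_relevance, k):
    """Fraction of relevant songs among the top-k retrieved ones."""
    top = ranked_relevance[:k]
    return sum(top) / len(top)

def average_at(queries, k):
    """Average precision@k over all evaluated queries, as in Figs. 16 and 17."""
    return sum(precision_at(r, k) for r in queries) / len(queries)

# Two toy queries with binary relevance judgments for the top 5 results:
judgments = [[1, 1, 0, 1, 0], [1, 0, 0, 0, 1]]
print(average_at(judgments, 5))   # 0.5
```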
Fig. 16. Comparison with other RF methods.
Fig. 17. Comparison with other relevance feedback methods.
Fig. 18. Efficiency of our algorithm.
6. Conclusions
In this paper, we presented a new music retrieval system, "MUSEMBLE". The system automatically transcribes a user's humming signal into notes with improved accuracy and then applies a relevance feedback technique based on a genetic algorithm to improve the retrieval performance. For more robust pitch tracking, we revised the traditional method and proposed the WAE together with some cleaning procedures to obtain accurate note segments and onset/offset information. Furthermore, in order to obtain more accurate durations, the note onset/offset and the ADF onset were considered together. With our query representation algorithm, the overall error rates were decreased significantly, as shown in our experiments. Our GA-based feedback scheme returned perfect results after 20 generations, and we observed that a longer query with a large number of GA generations might result in a longer response time; we determined the optimal number of generations and query size through a series of tests. For the usability evaluation of our graphical user interface, we also conducted various experiments to measure the effectiveness and efficiency of our GA-based feedback method. Overall, our query interface with GA-based feedback improved flexibility and retrieval accuracy.
In the future, we will try to extract information about plosives and develop adaptive error-resilient preprocessing methods. We are also planning to consider two-point or multi-point crossover methods to reduce the positional bias.

Acknowledgements
This research was supported by the MIC (Ministry of Information and Communication), Korea, under the ITRC (Information Technology Research Center) support program supervised by the IITA (Institute of Information Technology Advancement) (IITA-2006-(C1090-06030002)), and by the Ubiquitous Computing and Network (UCN) Project, the Ministry of Information and Communication (MIC) 21st Century Frontier R&D Program in Korea.

References
Baeza-Yates, R., Ribeiro-Neto, B., 1999. Modern Information Retrieval. Addison Wesley.
Chai, W., 2001. Melody Retrieval on the Web. M.S. thesis, Media Arts and Sciences, Massachusetts Institute of Technology.
Gerhard, D., 2003. Pitch Extraction and Fundamental Frequency: History and Current Techniques. Technical Report TR-CS 2003-06.
Ghias, A., et al., 1995. Query by humming – musical information retrieval in an audio database. In: Proceedings of ACM Multimedia 95, pp. 231–236.
Hoashi, Zeitler, Inoue, 2002. Implementation of relevance feedback for content-based music retrieval based on user preferences. In: Proceedings of ACM SIGIR, pp. 385–386.
Hoashi, Matsumoto, Inoue, 2003. Personalization of user profiles for content-based music retrieval based on relevance feedback. In: Proceedings of ACM Multimedia, pp. 110–119.
Huang, C.M., et al., 2001. Synchronization and flow adaptation schemes for reliable multiple-stream transmission in multimedia presentation. Journal of Systems and Software 56 (2), 133–151.
Huang, R., Hansen, J.H.L., 2006. Advances in unsupervised audio classification and segmentation for the broadcast news and NGSW corpora. IEEE Transactions on Audio, Speech and Language Processing 14 (3), 907–919.
Hwang, E., Rho, S., 2004. FMF (fast melody finder): A web-based music retrieval system. In: Lecture Notes in Computer Science, vol. 2771. Springer-Verlag, pp. 179–192.
Hwang, E., Park, S., Kim, S., Byeon, K., 2005. Automatic voice query transformation for query-by-humming systems. In: Proceedings of IMSA 2005, pp. 197–202.
Hwang, E., Rho, S., 2006. FMF: Query adaptive melody retrieval system. Journal of Systems and Software 79 (1), 43–56.
Forberg, J., 1998. Automatic conversion of sound to the MIDI format. TMH-QPSR 1-2/1998.
Klapuri, A.P., 2005. A perceptually motivated multiple-F0 estimation method. In: 2005 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 291–294.
Kornstadt, A., 1998. Themefinder: A web-based melodic search tool. In: Computing in Musicology 11. MIT Press.
Lemström, K., Wiggins, G.A., Meredith, D., 2001. A three-layer approach for music retrieval in large databases. In: Second International Symposium on Music Information Retrieval, Bloomington, IN, USA, pp. 13–14.
Lopez-Pujalte, C., Guerrero-Bote, V., Moya-Anegon, F., 2003. Order-based fitness functions for genetic algorithms applied to relevance feedback. Journal of the American Society for Information Science 54 (2), 152–160.
MacQueen, J., 1967. Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297.
Ryynänen, M., Klapuri, A., 2006. Transcription of the singing melody in polyphonic music. In: Proceedings of ISMIR 2006.
McCann, J.A., et al., 2000. Kendra: Adaptive Internet system. Journal of Systems and Software 55 (1), 3–17.
McNab, R.J., et al., 1997. The New Zealand digital library melody index. D-Lib Magazine.
Mitchell, M., 1996. An Introduction to Genetic Algorithms. MIT Press.
Suzuki, M., et al., 2006. Music information retrieval from a singing voice based on verification of recognized hypotheses. In: Proceedings of ISMIR 2006.
Pickens, J., 2000. A comparison of language modeling and probabilistic text information retrieval approaches to monophonic music retrieval. In: Proceedings of the 1st Annual International Symposium on Music Information Retrieval (ISMIR 2000).
Rui, Y., Huang, T.S., Ortega, M., Mehrotra, S., 1998. Relevance feedback: A power tool for interactive content-based image retrieval. IEEE Transactions on Circuits and Systems for Video Technology 8 (5), 644–655.
Stejic, Z., Takama, Y., Hirota, K., 2003. Genetic algorithm-based relevance feedback for image retrieval using local similarity patterns. Information Processing and Management 39 (1), 1–23.
Tokui, N., Iba, H., 2000. Music composition with interactive evolutionary computation. In: Third International Conference on Generative Art, Milan, Italy, pp. 215–226.
Typke, R., Prechelt, L., 2001. An interface for melody input. ACM Transactions on Computer–Human Interaction, 133–149.
Uitdenbogerd, A., Zobel, J., 1999. Melodic matching techniques for large music databases. In: Proceedings of ACM Multimedia Conference, pp. 57–66.
Ukkonen, E., Lemström, K., Mäkinen, V., 2003. Sweepline the music. Lecture Notes in Computer Science 2598, 330–342.
Unehara, M., Onisawa, T., 2003. Construction of music composition system with interactive genetic algorithm. Journal of the Asian Design International Conference.
Zhuge, H., 2000. A problem-oriented and rule-based component repository. Journal of Systems and Software 50 (3), 201–208.
Zhuge, H., 2003. An inexact model matching approach and its applications. Journal of Systems and Software 67 (3), 201–212.
Web references
Foote, "The TreeQ Package", ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/tools/treeq1.3.tar.gz.
jMusic Java Library, http://jmusic.ci.qut.edu.au.
MIR Systems, http://mirsystems.info/index.php?id=mirsystems.
AKoff Sound Labs, http://www.akoff.com/.
Digital Ear, http://www.digital-ear.com/.

Seungmin Rho received his B.S. and M.S. degrees in Computer Science from Ajou University, Korea, in 2001 and 2003, respectively. Currently he is pursuing a Ph.D. degree in the Computer Science Department of Ajou University. He is currently working on audio analysis and intelligent music retrieval system development. His research interests include databases, audio and video retrieval, multimedia systems, machine learning, and intelligent agent technologies. Mr. Rho is a member of the IEEE.

Byeong-jun Han received his B.S. degree in Electrical Engineering from Korea University, Korea, in 2005. Currently he is pursuing the M.S. degree in the School of Electrical Engineering at Korea University. He is currently working on audio analysis and intelligent music retrieval system development. His research interests include multimedia feature extraction, audio/visual retrieval systems, multimedia data mining, and machine learning. Mr. Han is a student member of the IEEE.
Eenjun Hwang received his B.S. and M.S. degrees in Computer Engineering from Seoul National University, Seoul, Korea, in 1988 and 1990, respectively, and his Ph.D. degree in Computer Science from the University of Maryland, College Park, in 1998. From September 1999 to August 2004, he was with the Graduate School of Information and Communication, Ajou University, Suwon, Korea. Currently he is a member of the faculty in the School of Electrical Engineering, Korea University, Seoul, Korea. His current research interests include databases, multimedia systems, information retrieval, XML, and Web applications.
Minkoo Kim received his B.S. degree in Computer Engineering from Seoul National University, Seoul, Korea, in 1977; and M.S. degree in Computer Engineering from KAIST (Korea Advanced Institute of Science and Technology), Daejeon, Korea, in 1979. He received his Ph.D. degree in Computer Science from the Pennsylvania State University, in 1989. From January 1999 to January 2000, he was with the University of Louisiana, CACS as a visiting researcher. Since 1981, he has been a member of the faculty in the College of Information Technology, Ajou University, Suwon, Korea. His current research interests include multi-agent systems, information retrieval, ontology and its applications.