The Journal of Systems and Software 81 (2008) 1065–1080
www.elsevier.com/locate/jss

MUSEMBLE: A novel music retrieval system with automatic voice query transcription and reformulation

Seungmin Rho a, Byeong-jun Han b, Eenjun Hwang b,*, Minkoo Kim c

a Graduate School of Information and Communication, Artificial Intelligence Laboratory, Ajou University, Suwon 443-749, South Korea
b School of Electrical Engineering, Multimedia Information Laboratory, Korea University, Seoul 135-701, South Korea
c College of Information Technology, Ajou University, Suwon 443-749, South Korea

Received 14 March 2007; received in revised form 25 May 2007; accepted 26 May 2007
Available online 9 June 2007

* Corresponding author. Tel.: +82 2 3290 3256; fax: +82 2 921 0544.
E-mail addresses: [email protected] (S. Rho), [email protected] (B.-j. Han), [email protected] (E. Hwang), [email protected] (M. Kim).
doi:10.1016/j.jss.2007.05.038
Abstract
So far, much research has been done to develop efficient music retrieval systems, and query-by-humming has been considered one of the most intuitive and effective query methods for music retrieval. For voice humming to be a reliable query source, elaborate signal processing and acoustic similarity measurement schemes are necessary. On the other hand, there has recently been increased interest in query reformulation using relevance feedback with evolutionary techniques such as genetic algorithms for multimedia information retrieval. However, these techniques have not been widely exploited in the field of music retrieval. In this paper, we develop a novel music retrieval system called MUSEMBLE (MUSic ensEMBLE) based on two distinct features. (i) A sung or hummed query is automatically transcribed into a sequence of pitch and duration pairs with improved accuracy for music representation. More specifically, we developed two new techniques, called WAE (windowed average energy) and dynamic ADF (amplitude-based difference function) onsets, for more accurate note segmentation and onset/offset detection in the acoustic signal, respectively. The former improves on energy-based approaches such as AE by defining small but coherent windows with local and global threshold values, while the latter improves on the AF (amplitude function), which calculates the summation of the absolute values of signal differences for clustering the energy contour. (ii) A user query is reformulated using user relevance feedback with a genetic algorithm to improve retrieval performance. Even though we focus especially on humming queries in this paper, MUSEMBLE provides versatile query and browsing interfaces for various kinds of users. We have carried out extensive experiments on the prototype system to evaluate the performance of our voice query transcription and genetic algorithm-based relevance feedback schemes. We demonstrate that our proposed method improves the retrieval accuracy by up to 20–40% compared with other popular RF methods. We also show that the WAE and dynamic ADF methods together improve the transcription accuracy up to 95%.
© 2007 Elsevier Inc. All rights reserved.

Keywords: Genetic algorithm; Multimedia database; Music retrieval; Pitch tracking; Relevance feedback; Signal processing
1. Introduction
With the explosive expansion of digital music and audio content, efficient retrieval of such data is receiving more and more attention, especially in large-scale multimedia database applications. In the past, music information retrieval
was based on textual metadata such as title, composer, singer, or lyrics. However, these metadata-based schemes for music retrieval have suffered from many problems, including extensive human labor, incomplete knowledge, and personal bias. Compared with traditional keyword-based music retrieval, content-based music retrieval provides more flexibility and expressiveness. Content-based music retrieval is usually based on a set of extracted music features such as pitch, duration, and rhythm. Query-by-humming (QBH) is one of the popular content-based retrieval methods for
a large-scale music database (Ghias et al., 1995; McNab et al., 1997; Uitdenbogerd and Zobel, 1999; Hwang and Rho, 2004). QBH systems take a user's acoustic input (a short clip of singing, whistling or humming) through a microphone, extract useful features from it and then retrieve matching songs from a music database. This is very useful when the user does not know detailed information about the music, such as its title or singer, but just remembers a small segment of it. However, the quality of QBH depends strictly on the accuracy of the audio transcription, such as the duration and pitch of each note. Thus, an efficient algorithm to transcribe an audio signal into a note-like representation is one of the critical components of a QBH-based music retrieval system.

One common approach for developing a content-based music retrieval system is to represent music as a string of characters using three possible values for the pitch change: U(p), D(own), and S(ame) or R(epeat). In our previous work (Hwang and Rho, 2004, 2006), we described the limitations of a pure UDR notation based on the pitch contour and proposed new notations such as uUdDr and LSR to overcome them. Considering the volume of music data, query response time and storage requirements for indexing should also be taken into account when implementing a music retrieval system. For that purpose, we also proposed a dynamic indexing scheme called FAI, which intelligently collects frequently queried melody tunes for fast query matching (Hwang and Rho, 2006).

Relevance feedback (RF) is a well known technique in the information retrieval (IR) area. It reformulates a query based on the documents selected by the user as relevant. Recently, relevance feedback has been widely adopted to improve the performance of both text and multimedia information retrieval (Hoashi et al., 2002, 2003). Many RF methods have been studied in the CBIR (content-based image retrieval) area (Rui et al., 1998; Stejic et al., 2003), even though they were first used in text retrieval systems. However, few music retrieval systems have used RF techniques to improve their retrieval performance.

The genetic algorithm (GA) is a powerful problem-solving technique in the artificial intelligence area. It is based on Darwin's theory of evolution and the principles of biological inheritance. Very few researchers have tried to use evolutionary algorithms such as genetic algorithms in the field of music information retrieval. Previous attempts (Tokui and Iba, 2000; Unehara and Onisawa, 2003) to use GAs focused only on automatic music composition, not on adaptation of the query melody representation. Lopez-Pujalte et al. (2003) implemented a genetic algorithm for relevance feedback in textual information retrieval and ran it with different order-based fitness functions. Among the fitness functions in the literature, the ones that yield the best results are those that take into account not only which documents are retrieved, but also the order in which they are retrieved.
In this paper, we propose a novel music retrieval system, "MUSEMBLE", based on two prominent features: (i) a humming signal can be transcribed into notes more accurately, mainly due to two new methods, WAE and dynamic ADF, the latter being an improved version of the AF (amplitude function); and (ii) a relevance feedback mechanism based on a GA is provided to improve the quality of query results by reformulating the user query. In addition, the system provides versatile querying and browsing interfaces to improve query usability.

The rest of this paper is organized as follows. In Section 2, we present an overview of ongoing research on analyzing music features and constructing MIR systems. In Sections 3 and 4, we describe our music transcription scheme and our music retrieval system MUSEMBLE, respectively. In Section 5, we report some of the experimental results. Section 6 concludes this paper and describes our future directions.

2. Related work
In this section, we review some typical techniques and systems for music information retrieval. Music can be represented in two different ways. One is based on musical scores, such as MIDI and Humdrum (Kornstadt, 1998). The other is based on acoustic signals, which are sampled at a certain frequency and compressed to save space; Wave (.wav) and MPEG Layer-3 (.mp3) files are examples of this representation.

2.1. Symbolic analysis
Many research efforts to solve the music similarity problem have used symbolic representations such as MIDI, musical scores, and note lists. Based on such a representation, pitch tracking finds a "melody contour" for a piece of music, and a string matching technique can then be used to compare the transcriptions of songs (Ghias et al., 1995; McNab et al., 1997; Uitdenbogerd and Zobel, 1999; Hwang and Rho, 2004, 2006). String matching has been widely used in music retrieval because melodies are represented as string sequences of notes. To accommodate human input errors, dynamic programming can be applied to the string matching; however, this method tends to be rather slow. An inexact model matching approach (Zhuge, 2003) was proposed based on a quantified inexact signature-matching theory to find an approximate model for users' query requirements. It can enhance the reusability of a model repository and makes it possible to use and manage a model repository conveniently and flexibly. Zhuge applied this theory to a problem-oriented model repository system, PROMBS (Zhuge, 2000).

There is also research on symbolic MIR based on ideas from traditional text IR. The use of traditional IR techniques such as probabilistic modeling is described in Pickens (2000), and the use of approximate string matching
in Lemström et al. (2001). Some work has addressed other IR issues such as ranking and relevance. Hoashi et al. (2003) used relevance feedback for music retrieval based on the tree-structured vector quantization method (TreeQ) developed by Foote; the TreeQ method trains a vector quantizer instead of modeling the sound data directly.

2.2. Acoustic signal analysis
There are many techniques for extracting the pitch contour, pitch interval, and duration from a voice humming query. In general, methods for detecting pitches can be divided roughly into two categories: time-domain based and frequency-domain based. In the time domain, ZCR (zero crossing rate) and ACF (autocorrelation function) are two popular methods. The basic idea of ZCR is that it gives information about the spectral content by measuring how often the waveform crosses zero per unit time (Gerhard, 2003). In recent work, ZCR has appeared in variant forms such as VZCR (variance of ZCR) and SZCR (smoothing ZCR) (Huang and Hansen, 2006). ACF, in contrast, is based on the cross-correlation function: while a cross-correlation function measures the similarity between two waveforms over a time interval, the ACF compares a waveform with itself.

In the frequency domain, the FFT (fast Fourier transform) is one of the most popular methods. This method is based on the property that every waveform can be decomposed into simple sine waves. However, a longer window increases the frequency resolution while decreasing the time resolution. Another problem is that the frequency bins of the standard FFT are linearly spaced, while musical pitches are better mapped on a logarithmic scale. Therefore, Forberg (1998) used an alternative frequency transformation, the constant-Q transform, computed from tracked parts. Recent work on automatic transcription has used probabilistic machine learning techniques such as HMMs (hidden Markov models) and NNs (neural networks) to identify salient audio features and reduce the dimensionality of the feature space. Ryynänen and Klapuri (2006) proposed a singing transcription system based on HMM-based note event modeling. The system performs note segmentation and labeling and also applies a multiple-F0 estimation method (Klapuri, 2005) for calculating the fundamental frequency.

2.3. Recent MIR systems
For decades, many researchers have developed content-based MIR (music information retrieval) systems based on both acoustic and symbolic representations (Ghias et al., 1995; McNab et al., 1997; Typke and Prechelt, 2001; Hwang and Rho, 2006). Ghias et al. (1995) developed a QBH system that is capable of processing acoustic input in order to extract appropriate query information. However, this system used
only three types of contour information to represent melodies. The MELDEX system (McNab et al., 1997) was designed to retrieve melodies from a database using a microphone. It first transformed acoustic query melodies into music notation, and then searched the database for tunes containing the hummed (or similar) pattern. This web-based system provided several match modes, including approximate matching for interval, contour, and rhythm. MelodyHound (Typke and Prechelt, 2001), originally known as the "TuneServer", also used only three types of contour information to represent melodies. It recognized the tune based on an error-resistant encoding, using only the direction of the melody and ignoring the interval size and rhythm. The C-BRAHMS project (Ukkonen et al., 2003) developed nine different algorithms, known as P1, P2, P3, MonoPoly, IntervalMatching, PolyCheck, Splitting, ShiftOrAnd, and LCTS, for dealing with polyphonic music. Suzuki et al. (2006) proposed an MIR system that uses both lyrics and melody information in the singing voice. They used a finite state automaton (FSA) as a lyric recognizer to check the grammar and developed an algorithm for verifying the hypotheses output by the lyric recognizer. Melody information is extracted from an input song using several pieces of hypothesis information, such as song names, recognized text, recognition scores, and time alignment information.

Many other researchers have studied quality of service (QoS)-guaranteed multimedia systems over unpredictable-delay networks by monitoring network conditions such as available bandwidth. McCann et al. (2000) developed an audio delivery system called Kendra that used adaptability with a distributed caching mechanism to improve data availability and delivery performance over the Internet. Huang et al. (2001) presented the PARK approach for multimedia presentations over a best-effort network in order to achieve reliable transmission of continuous media such as audio or video.

3. Automatic voice humming transcription
This section describes the overall system architecture and the algorithms we developed for automatic voice humming transcription.

3.1. System architecture
Fig. 1 shows the overall system architecture for transcribing voice queries such as humming. In order to transcribe voice queries, we first preprocess them using the WAE and dynamic ADF. After the preprocessing, we analyze the notes and extract their pitch and duration features. More specifically, we use the WAE method to identify silent and voiced frames after framing, together with a heuristic method for merging dispersedly detected segments. We also use the average magnitude difference function (AMDF) to obtain the fundamental frequency of each frame.
Fig. 1. Process flow for automatic voice transcription.
Using this information, we were able to obtain note onset/offset information from the segmentation of silent or ignorable frames and the fundamental frequency. Furthermore, we applied the ADF to each frame in order to obtain ADF onsets from the WAE information, which makes it easier to recognize the note features.

3.2. Preprocessing
The human voice consists of diverse frequency elements. However, when a person is humming or singing, it is possible to recognize the pitch of a note under the assumption that the voice has a single fundamental period over a very short interval; that is, the voice carries a monophonic melody and has only one fundamental frequency at any short time. Thus, it is necessary to segment the voice signal into several frames. It is known that a framing size in the range of 20–50 ms is efficient for processing. The human voice can also change over intervals of any length; however, if the interval is too short, the analyzable frequency range becomes too narrow. Thus, we used a framing length of 20 ms as the minimum analyzable length. Furthermore,
we used a frame overlapping ratio of 50% for continuity between frames.

3.2.1. Windowed average energy
The WAE is an improved version of the average energy (AE), a traditional energy estimation method. The AE is defined by the following equation:

AE = \frac{1}{N} \sum_{k=0}^{N-1} |x(k)|^2    (1)
where x(k) is the input sequence and N is the sequence size. The AE indicates the average amount of energy over some signal range. It can be used to classify silent and voiced frames using one global threshold: if the energy is greater than the threshold, the frame is considered a voiced frame; otherwise, it is considered a silent frame. However, we observed several limitations of this traditional approach. First, the classification depends solely on one global threshold value, so it is not robust to variations in amplitude when the strength of the human voice changes. Also, each recording device has its own configuration and characteristics. To incorporate
such diversity, it might be more efficient to define local thresholds according to the change of environment. From this observation, we defined local thresholds for the AE instead of one global threshold. For each window, we define a local threshold to classify silent/voiced frames, which improves the accuracy of frame discrimination. Fig. 2 shows the AE with one global threshold and the WAE with multiple local thresholds; the figure shows that the local thresholds reflect the variance of the AE better. Fig. 3 shows the algorithm for calculating the local thresholds of the frames according to the WAE: if the fraction r of the maximum AE in a unit window is greater than the global threshold, that value becomes the local threshold; otherwise, the global threshold becomes the local threshold.

Fig. 2. AE with one global threshold vs. WAE with multiple local thresholds.
Fig. 3. Algorithm for WAE.
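To make the framing and WAE steps concrete, the following sketch (ours, not the authors' implementation) frames a mono signal into 20 ms windows with 50% overlap, computes the AE of each frame following Eq. (1), and assigns one local threshold per unit window as described above. NumPy is assumed, and the unit window size of 16 frames, the ratio r = 0.2 and the global threshold of 0.003 are the parameter values reported later in Section 5.1.

```python
import numpy as np

def frame_signal(x, sr, frame_ms=20, overlap=0.5):
    """Split a mono signal into overlapping frames (20 ms, 50% overlap).

    Assumes len(x) is at least one frame long."""
    size = int(sr * frame_ms / 1000)
    hop = int(size * (1 - overlap))
    n = 1 + (len(x) - size) // hop
    return np.stack([x[i * hop: i * hop + size] for i in range(n)])

def average_energy(frames):
    """Eq. (1): AE = (1/N) * sum |x(k)|^2, computed per frame."""
    return np.mean(np.abs(frames) ** 2, axis=1)

def wae_local_thresholds(ae, window=16, r=0.2, global_thr=0.003):
    """One local threshold per unit window of 16 frames (our reading of Fig. 3):
    use r * max(AE) of the window if it exceeds the global threshold,
    otherwise fall back to the global threshold."""
    thr = np.empty_like(ae)
    for start in range(0, len(ae), window):
        local = r * ae[start:start + window].max()
        thr[start:start + window] = max(local, global_thr)
    return thr

def classify_voiced(ae, thr):
    """A frame is voiced if its AE exceeds the local threshold of its window."""
    return ae > thr
```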
Fig. 4. Effects of WAE and merging tiny silent/voiced segments.
3.2.2. Merging note segments
"Tiny" silent/voiced segments are not very useful in classifying the note pitch, because their length is too short to process. Although the note segments obtained from the WAE are much smaller than those from the AE with a global threshold, the local AE threshold of each frame alone is not sufficient to classify silent and voiced frames clearly. For example, in Fig. 4, tiny silent and voiced segments are dispersed among long voiced and silent segments, respectively. We therefore merged the tiny silent segments with the neighboring voiced segments and the tiny voiced segments with the neighboring silent segments. The maximum length of a tiny segment can be defined relatively or absolutely. We used six frames as the maximum length, because the minimal length of a meaningful note must be guaranteed: if this length is too large, note segments may be merged with each other, resulting in overly long intervals, whereas if it is too small, the scattered segments do not cluster well. In our experiment, we observed that six frames were appropriate for merging the segments. After merging the note segments, we can obtain the note onset/offset information from the note segments.
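As an illustration of the merging rule, the sketch below (ours) absorbs tiny segments into their neighbours; representing a segment as a (start, end, is_voiced) triple and treating segments of at most six frames as tiny are assumptions of this sketch.

```python
MAX_TINY = 6  # maximum length (in frames) of a "tiny" segment, as used in the paper

def merge_tiny_segments(segments):
    """Absorb tiny voiced/silent segments into their neighbours.

    `segments` is a time-ordered list of (start, end, is_voiced) triples.
    A segment spanning at most MAX_TINY frames flips its label so that it
    merges with the surrounding segments of the opposite type."""
    flipped = [(s, e, (not v) if (e - s) <= MAX_TINY else v) for s, e, v in segments]
    merged = []
    for s, e, v in flipped:
        if merged and merged[-1][2] == v:          # same label as previous: extend it
            merged[-1] = (merged[-1][0], e, v)
        else:
            merged.append((s, e, v))
    return merged
```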
3.2.3. Pitch tracking
Pitch is an important parameter in voice signal analysis and can be determined from the fundamental frequency of the unit frame. Recently, frequency-domain methods such as the FFT have been widely used for pitch tracking. However, since we only needed the fundamental frequency information in this work, we used the average magnitude difference function (AMDF), which is basically a delayed autocorrelation function. For each frame, the AMDF value is defined as the sum of point-wise absolute differences between the signal and its delayed version:

AMDF(k) = \frac{1}{N-k} \sum_{n=0}^{N-k-1} |x(n+k) - x(n)|    (2)

where x(n) is the input sequence, N is the frame size, and k is a positive delay. Each frame consists of a number of frequency components. In general, the fundamental frequency of each frame can be obtained by applying an FFT. However, to find the frequency range of the strongest magnitude using the Fourier transform, it is necessary to compute the magnitude information for every bin. Under the assumption that there is only one voice in the input sequence, the result of the AMDF reflects the frequency of the strongest magnitude through its local extremes: the periodic valleys in the AMDF result are defined as local extremes. Thus, the fundamental period of a frame can be defined as the periodic distance between local extremes, and the fundamental frequency can be computed as the reciprocal of the fundamental period.
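The following sketch (ours, assuming NumPy) estimates the fundamental frequency of a single frame from Eq. (2): the AMDF is evaluated over a range of delays and the deepest valley inside the 87–800 Hz pitch range used in Section 5.1 is taken as the fundamental period, whose reciprocal gives the frequency. Picking the global minimum rather than the periodic distance between local extremes is a simplification of this sketch.

```python
import numpy as np

def amdf(frame, k):
    """Eq. (2): mean absolute difference between the frame and its k-sample delay."""
    n = len(frame)
    return np.sum(np.abs(frame[k:] - frame[:n - k])) / (n - k)

def fundamental_frequency(frame, sr, fmin=87.0, fmax=800.0):
    """Pick the delay with the deepest AMDF valley inside the allowed pitch range
    and return the corresponding frequency in Hz."""
    k_min = int(sr / fmax)                       # smallest period of interest
    k_max = min(int(sr / fmin), len(frame) - 1)  # largest period of interest
    values = np.array([amdf(frame, k) for k in range(k_min, k_max + 1)])
    best_k = k_min + int(np.argmin(values))
    return sr / best_k
```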
1070
S. Rho et al. / The Journal of Systems and Software 81 (2008) 1065–1080
3.2.4. Amplitude-based difference function
The amplitude function (AF) computes the summation of the absolute values of the amplitudes within the human voice's frequency range in each frame (Chai, 2001). In order to improve its accuracy, we proposed a new method called the amplitude-based difference function (ADF) for clustering the energy contour (Hwang et al., 2005). In this paper, we improve the method further. An ADF onset occurs when the energy rises rapidly from a low level or a long silence; conversely, an ADF offset occurs when the energy drops abruptly to a very low level. The ADF onset is another important feature for detecting note onsets. In this work, we ignored the offsets, because the end of each syllable was not always clear and it was therefore difficult to extract offset information. The algorithm for computing the ADF is shown in Fig. 5: continuous positive and negative differences are summed over the whole signal, that is, continuously increasing/decreasing intervals are merged into a single increasing/decreasing interval. Fig. 6 shows the algorithm for calculating the onset thresholds. Previous works used them as a global threshold for ADF onsets; in our work, however, we used them as local thresholds for each note segmentation group.
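One plausible reading of the ADF computation is sketched below; this is our interpretation, not the authors' code. Consecutive same-sign energy differences are accumulated into single rises or drops, and a rise whose accumulated difference exceeds the local onset threshold of its segmentation group (here a single scalar for simplicity) is reported as an ADF onset.

```python
import numpy as np

def adf_runs(frame_energy):
    """Merge consecutive same-sign energy differences into single rises/drops.

    Returns a list of (start_frame, end_frame, accumulated_difference)."""
    diffs = np.diff(frame_energy)
    runs, start, acc = [], 0, 0.0
    for i, d in enumerate(diffs):
        if acc != 0.0 and np.sign(d) != np.sign(acc):
            runs.append((start, i, acc))
            start, acc = i, 0.0
        acc += d
    runs.append((start, len(diffs), acc))
    return runs

def adf_onsets(frame_energy, local_threshold):
    """An ADF onset is the end of an accumulated rise larger than the local
    threshold of its note segmentation group."""
    return [end for _, end, acc in adf_runs(frame_energy) if acc > local_threshold]
```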
Fig. 5. Algorithm for computing ADF.
Fig. 6. Algorithm for the onset threshold.

3.3. Note analysis
From the voice query preprocessing step described in Section 3.2, we obtained the note onset/offset, the ADF onset, and the fundamental frequency of each frame. By using this information, it is possible to obtain more accurate note segments than by using the note onset/offset information alone. Specifically, we first integrated the note onset/offset and the ADF onset. After the segmentation process, we recalculated the fundamental frequency within each note segment for better accuracy. In each note segment, we applied the k-means clustering method to obtain the distribution of the calculated fundamental frequency values.

3.3.1. Note representation
After applying the AMDF, pitches were grouped if the differences between them were within the threshold. Continuous groups were merged if there was only one pitch value between the two groups that was beyond the scope of the pitch threshold. From the merged groups, the pitch onset and offset values, as well as the representative pitch
values, were extracted. The note onset/offset information was quite reliable, but it was not perfect because it was difficult to detect repeated notes. Hence, we considered the ADF onset information and the note onset/offset detected by the voice query preprocessing together. A detailed algorithm for doing this is described in Fig. 7.

Fig. 7. The query representation algorithm for integrating note onset/offset and ADF onset.

In the first step, we determined the nearest ADF onset candidate before the note onset. If the nearest onset candidate was beyond the threshold, the ADF onset candidate was ignored; otherwise, the position of the note onset was replaced with the ADF onset. This step helped to detect the note pitch and duration more accurately. In the second step, we found the breaking point in the current duration to divide a continuous note segment into two repeated ones. If a point exists within a note that has a larger interval than the smallest duration previously detected, the point is considered a breaking point and the note is segmented. Fig. 8a and c show the results of note segmentation and AMDF. The AMDF results are marked using gray lines, and note segments are marked using black lines. Fig. 8a shows the note onset/offset result before applying the ADF onset, and Fig. 8b shows the result of applying the ADF onset. The marked ADF onsets are valid ones; they exceeded the average ADF onset threshold, which was calculated as in Fig. 6. Fig. 8c shows the result of integrating the note onset/offset and ADF onset information.

3.3.2. The fundamental frequency recalculation
To integrate the note onset/offset and ADF onset, we used the procedure in (Hwang et al., 2005). Although the note onset/offset information became more accurate after integration, the pitch information might not be correct in
many cases. This was remedied by realigning the pitch information for each note of the segments. In our previous work, we applied a traditional averaging method because of its simplicity. However, the result may not represent the pitch information of each segment correctly even when the result of the AMDF contains only minor noisy values. Thus, we applied the k-means clustering method (MacQueen, 1967) to each note segment; k-means clustering is known to show good performance even when samples contain small noisy values.

Fig. 8. Final representation of each note: (a) AMDF and note segmentation (without ADF); (b) accumulated difference function (ADF, onset only); (c) AMDF and note segmentation (with ADF).

4. Implementation
We have implemented a prototype music retrieval system called "MUSEMBLE", an abbreviation of Music Ensemble and an acronym for MUSic retrieval systEM Based on a Learning Environment. In this section, we describe the overall system architecture, user query refinement, and some of the implementation details.

4.1. System architecture
Fig. 9 shows the overall system architecture of our prototype system for processing a user query and retrieving matched melodies from the database. The system consists of three main components: GUI, Analyzer, and GA Engine. A user can formulate an initial query using one of the following query interfaces: QBE, QBH, QBMN, and QBC. The analyzer module takes the user query as a signal or a sequence of notes and extracts audio features such as pitch and time contour. Based on those extracted features, the query is then transcribed into uUdDr and LSR strings. For the transcribed string, the FAI index is first looked up for a quick match; if the index lookup fails, the music database is searched. A detailed description of the FAI indexing scheme and its operations can be found in (Hwang and Rho, 2006). The matched melodies are then displayed according to their rank on the browsing interface. When the user selects a melody or its segment as the most relevant one, the GA engine generates new music segments and evaluates the fitness of each segment based on our genetic algorithm. A modified query is generated from the user's relevance judgment via the feedback interface. The whole query process is repeated until the user is satisfied.

4.2. Query refinement
As we mentioned above, we implemented GA-based relevance feedback in our prototype system to improve the retrieval performance.

4.2.1. Relevance feedback
Relevance feedback (RF) is one of the most popular query reformulation methods in the field of information retrieval. In an RF cycle, the user is provided with a list of retrieved documents and marks those that he or she considers relevant to the query. In practice, only the top 10–20 ranked documents are typically examined. Therefore, we set up our experiment such that the user listens to the top 20 songs resulting from an initial query and provides relevance feedback for the selected songs to the system. In our work, we implemented three classic RF methods (Baeza-Yates and Ribeiro-Neto, 1999): Standard Rocchio (Eq. (3)), Ide Regular (Eq. (4)) and Ide Dec-Hi (Eq. (5)). They are used to calculate a modified query \vec{q}_{new}:

\vec{q}_{new} = \alpha \vec{q}_{old} + \beta \frac{1}{|M_{rel}|} \sum_{\forall \vec{m}_j \in M_{rel}} \vec{m}_j - \gamma \frac{1}{|M_{nonrel}|} \sum_{\forall \vec{m}_j \in M_{nonrel}} \vec{m}_j    (3)
Fig. 9. System architecture of MUSEMBLE.
\vec{q}_{new} = \alpha \vec{q}_{old} + \beta \sum_{\forall \vec{m}_j \in M_{rel}} \vec{m}_j - \gamma \sum_{\forall \vec{m}_j \in M_{nonrel}} \vec{m}_j    (4)

\vec{q}_{new} = \alpha \vec{q}_{old} + \frac{\beta}{|M_{rel}|} \sum_{\forall \vec{m}_j \in M_{rel}} \vec{m}_j - \gamma \, Max_{nonrel}(\vec{m}_j)    (5)
where α, β and γ are constants, M_rel is the set of relevant melody segments identified by the user among the retrieved music objects, M_nonrel is the set of non-relevant melody segments among the retrieved music objects, and Max_nonrel(\vec{m}_j) is a reference to the highest ranked non-relevant melody segment. In the original formulations (Baeza-Yates and Ribeiro-Neto, 1999), Rocchio fixed α = 1, and Ide fixed α = β = γ = 1. However, we chose the RF constant factors α = 0.5, β = 1 and γ = 2, since they showed the best results in our experiment.
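The sketch below shows how Eqs. (3)–(5) could be applied. It is ours rather than the system's code, and it assumes that the query and the melody segments have already been mapped to fixed-length numeric feature vectors; the prototype actually works on uUdDr/LSR strings, so this vector view is purely illustrative. The default constants are the values chosen above (α = 0.5, β = 1, γ = 2).

```python
import numpy as np

def rocchio(q_old, rel, nonrel, alpha=0.5, beta=1.0, gamma=2.0):
    """Standard Rocchio, Eq. (3): move the query towards the centroid of the
    relevant segments and away from the centroid of the non-relevant ones."""
    q = alpha * q_old
    if len(rel):
        q = q + beta * np.mean(rel, axis=0)
    if len(nonrel):
        q = q - gamma * np.mean(nonrel, axis=0)
    return q

def ide_regular(q_old, rel, nonrel, alpha=0.5, beta=1.0, gamma=2.0):
    """Ide Regular, Eq. (4): unnormalized sums instead of centroids."""
    return alpha * q_old + beta * np.sum(rel, axis=0) - gamma * np.sum(nonrel, axis=0)

def ide_dec_hi(q_old, rel, top_nonrel, alpha=0.5, beta=1.0, gamma=2.0):
    """Ide Dec-Hi, Eq. (5): subtract only the highest-ranked non-relevant segment."""
    return alpha * q_old + (beta / len(rel)) * np.sum(rel, axis=0) - gamma * top_nonrel

# Illustrative use with made-up 4-dimensional feature vectors:
q0 = np.array([0.2, 0.5, 0.1, 0.7])
relevant = np.array([[0.3, 0.6, 0.2, 0.8], [0.1, 0.4, 0.0, 0.6]])
non_relevant = np.array([[0.9, 0.1, 0.8, 0.2]])
q1 = rocchio(q0, relevant, non_relevant)
```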
4.2.2. Genetic algorithm
Genetic algorithms are search and optimization methods that take their inspiration from natural selection and survival of the fittest in the biological world. A genetic algorithm works with a population of chromosomes, which represent possible solutions to a given problem. A chromosome is considered a candidate solution and is evaluated by a fitness function. Our main motivation for the fitness function is to test how well the chromosomes in the population are adapted to the given query. Given the chromosomes, the genetic algorithm requires a fitness function that returns a numeric value representing each chromosome's fitness score. This score is used when selecting the parents from the current population, so that the fittest chromosomes have a greater chance of being selected. In our experiment, the initial query result obtained with approximate matching forms the initial population, to which we apply the fitness function. If the fitness score is below some threshold value, we filter the initial query result with a higher matching rate and then apply the fitness function again until the fitness score exceeds the threshold. The chromosomes evolve over generations by means of genetic operators such as crossover and mutation. Fig. 10 shows our genetic algorithm. We calculated the fitness value of a chromosome using the following formula:

Fitness = \frac{1}{\sum_{i=1}^{N} relevance(M_i)} \sum_{i=1}^{N} \frac{1}{i} \sum_{j=1}^{i} relevance(M_j)    (6)

where N is the total number of music objects retrieved in population P and relevance(M_i) is a function that returns the relevance of music object M_i. The relevance is calculated as

relevance(M_i) = \frac{QueryLength - LD(M_i, Query)}{QueryLength}    (7)

where LD is a function that calculates the minimum edit cost between the music object and the query. Each relevance value of music object M_i ranges from 0 to 1, where 1 represents the case where the music is relevant to the user's query with full confidence and 0 indicates the opposite case. The Levenshtein distance (LD) function represents the distance between two strings by the number of operations, such as deletions, insertions, or substitutions, required to transform one string into the other (Mitchell, 1996).
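Eqs. (6) and (7) can be read as in the following sketch (ours). Melodies are assumed to be plain strings such as the uUdDr encodings, and a small edit-distance routine is included so no external library is needed.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def relevance(melody, query):
    """Eq. (7): 1 means a perfect match; clamped at 0 for very distant melodies."""
    return max(0.0, (len(query) - levenshtein(melody, query)) / len(query))

def fitness(retrieved, query):
    """Eq. (6): order-sensitive fitness of a ranked list of retrieved melodies."""
    rel = [relevance(m, query) for m in retrieved]
    total = sum(rel)
    if total == 0:
        return 0.0
    return sum(sum(rel[:i]) / i for i in range(1, len(rel) + 1)) / total

# Example: the fitness rewards rankings that put relevant melodies first.
print(fitness(["uUdDr", "uUuDr", "ddddd"], "uUdDr"))
```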
Fig. 10. Genetic algorithm.

Before the GA terminates, it produces a new generation repeatedly. For each generation, chromosomes are selected by a tournament selection method. Numerous selection schemes have been used in the GA literature, such as roulette wheel, rank, and tournament selection. Roulette wheel selection is intuitive and easy to implement, but it suffers from a scaling problem. The most frequently used selection methods that avoid this problem are ranking and tournament. Tournament selection is similar to rank selection in terms of selection pressure, but it is computationally more efficient (Mitchell, 1996). Therefore, we used the tournament selection method with a tournament size of 5; we choose five chromosomes randomly with equal probability from the population. We used the classical single-point crossover, which determines a crossover position, g_locus, that partitions each of the two chromosomes; the two chromosomes are then swapped at g_locus. Mutation in our algorithm was implemented as a random process. The mutation operator changes a randomly selected locus in a selected string with a small mutation probability (0–1). Mutation increases the diversity of the population and the probability of finding a better solution. We used a random mutation method with a randomly generated number between -10 and 10.
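The selection, crossover, and mutation operators described above could look roughly like the sketch below; it is ours, and it assumes that a chromosome is a list of integer values, while the tournament size of 5, the single crossover point g_locus, and the mutation range of -10 to 10 follow the values given in the text.

```python
import random

TOURNAMENT_SIZE = 5

def tournament_select(population, fitness_fn):
    """Pick five chromosomes uniformly at random and keep the fittest one.

    The population must contain at least TOURNAMENT_SIZE chromosomes."""
    candidates = random.sample(population, TOURNAMENT_SIZE)
    return max(candidates, key=fitness_fn)

def single_point_crossover(parent_a, parent_b):
    """Classical single-point crossover: swap the tails at a random locus g_locus."""
    g_locus = random.randint(1, len(parent_a) - 1)
    return (parent_a[:g_locus] + parent_b[g_locus:],
            parent_b[:g_locus] + parent_a[g_locus:])

def mutate(chromosome, p_mutation=0.05):
    """Random mutation: shift one randomly selected locus by a value in [-10, 10]."""
    child = list(chromosome)
    if random.random() < p_mutation:
        locus = random.randrange(len(child))
        child[locus] += random.randint(-10, 10)
    return child
```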
4.3. Implementation
We implemented a prototype music retrieval system based on the GA-based relevance feedback scheme. The prototype system provides a flexible user interface for querying and browsing the results. The client and server sides were implemented using Java Applets and JSP, respectively. We used a set of jMusic (http://jmusic.ci.qut.edu.au) APIs for extracting audio features and the Oracle database system for handling the audio metadata. Approximately 2000 MIDI files were used in the experiment; the average file size was about 40 KB.

4.3.1. Query interfaces
Currently, the user can formulate a query using one of four different query interfaces: QBC (query by contour), QBH (query by humming), QBE (query by example), and QBMN (query by music notation). Fig. 11a shows a snapshot of a melody contour sketch used for querying. In the case of the humming-based interface shown in Fig. 11b, users hum or sing into the microphone. Users can also specify their queries using a MIDI or wave file, as in Fig. 11c. A CMN (common music notation)-based query can also be created by clicking, dragging, or dropping notes on the music sheet applet, as shown in Fig. 11d. Formulated queries are stored in MIDI format and then transformed into uUdDr and LSR strings. As a traditional query interface, the system also supports text-based queries using metadata such as composer, title, and file name. All of this text information is collected from the MIDI files.

Fig. 11. Query interfaces.
Fig. 12. Query result interface.
4.3.2. Result browsing interface
Fig. 12 shows a list of matched songs from the FAI index and the music database for a queried melody. Matched songs are ranked and listed in descending order according to their similarity to the query. As shown
in the figure, the user can easily play back any matched melody segment by simply clicking the grey bar corresponding to the matched segment. This means that the user does not need to manually scroll or try several locations to listen to the matched melody segment of the retrieved music objects.

4.3.3. Feedback interface
The original query is displayed in CMN form, as shown in Fig. 13, after the user clicks the grey bar marking the matched segment, as illustrated in Fig. 12. The user can then listen to the query melody with its score notation. The user is finally asked to judge the melody as relevant or not, and simply marks the checkbox if the retrieved melody segments are relevant. If the user wants to modify the query, he or she left-clicks the mouse and drags to the desired position on the music sheet applet. There are two buttons on the right side of the generated notes: one for listening to the score and the other for re-querying. If the user wants to try for a better result, he or she just clicks the "Requery" button. The system then reformulates the query using the modified scores and repeats the whole process for the reformulated query.

Fig. 13. Feedback interface.

5. Experimental results
In this section, we first describe the details of the experiment that we performed to show the effectiveness and efficiency of our prototype system, and then report some of the results.

5.1. Experimental environments
The query signals were captured directly from a microphone and stored as PCM wave files (8-bit, 22.05 kHz, mono). We set the framing length to 20 ms, the minimum analyzable length, and the frame overlapping ratio to 50% for continuity between frames. In the experiment with the WAE method, we used the following parameters: a global threshold of 0.003 for the AE magnitude, a unit window size of 16 frames, and a differential ratio of 20%. For the ADF threshold, we used 50% of the average ADF. These values were observed to be optimal in the experiments we performed. In the pitch tracking step, any notes pitched outside the range of 87–800 Hz were discarded; it was seldom observed that hummed notes were beyond this frequency range.
5.2. Transcription performance
We collected about 160 queries overall for our experiment. A group of four males and four females participated; for experimental fairness, none of them was a musician or an experienced singer. Each user made 20 different short queries based on memorable tunes of popular songs, and each query consisted of query-by-singing with lyrics and a short humming sound with '[na]'. After the queries were recorded, we extracted their feature information using our proposed method. Then, we compared the melody transcription results with the user intention to measure their accuracy. The note segmentation and note pitch were also compared with the exact note information to measure their error rates.

In our experiment, we considered four different types of errors. A drop error indicates that a note was lost or merged into an adjacent one during the transcription. An add error is related to the appearance of non-existing notes, and a pitch error indicates that adjacent pitch changes are wrong. A duration error occurs when the difference between the detected duration and the original duration is larger than the smaller of the two. Each error type was measured by the following error estimation equation:

Error rate = \frac{\text{# of notes where errors occurred}}{\text{# of notes}}    (8)

Tables 1 and 2 show the error rates of the note pitch and duration detection. The difference between the AE with a global threshold and the WAE with cleaned note segmentation is shown in Table 1. From this result, it is clear that the WAE is more robust than the AE; the number of add errors was reduced significantly because of the merging step. From Table 2, we see that it is more efficient to use the ADF onset for rearranging the note onset/offset. There was also some improvement of the new version of the ADF over the old one (Hwang et al., 2005). By applying the query representation algorithm in Fig. 7, we reduced the numbers of drop, add, and duration errors.

Table 1
Error rates (%) of AE and WAE

Query type   Method   Drop errors   Add errors   Duration errors   Total errors
Singing      AE           6.8          8.4            8.3              23.5
             WAE          2.2          0.7            6.8               9.7
Humming      AE           2.5          4.3            7.9              14.7
             WAE          1.1          0.5            7.3               8.9

Table 2
Error rates (%) of singing/humming with/without ADF

Query type   Method           Drop errors   Add errors   Pitch errors   Duration errors
Singing      Without ADF          6.7           9.7           2.7             2.3
             With ADF (old)       4.8           8.9           2.6             2.2
             With ADF (new)       4.5           8.7           2.6             2.1
Humming      Without ADF          5.7           9.4           2.6             3.0
             With ADF (old)       5.0           8.0           2.4             2.4
             With ADF (new)       4.8           7.8           2.4             2.3
Table 3
Onset detection accuracy

Query type   Method      Detected onsets   False onsets   Recall (%)   Precision (%)
Singing      AE               248               22           80.4          91.1
             AE + ADF         254               20           83.3          92.1
             WAE              269               17           89.7          93.7
             WAE + ADF        273               15           91.1          94.5
Humming      AE               262               17           87.2          93.5
             AE + ADF         266               17           88.6          93.6
             WAE              274               15           92.2          94.6
             WAE + ADF        279               14           94.3          95.0
Table 3 and Fig. 14 show the effectiveness of our new method. As shown, our method is superior to the other methods in terms of error rates. The missed onset errors were reduced significantly when using the WAE, because the WAE contained frames that the AE could not produce. On the other hand, the missed onsets without applying the ADF contained many merged note segments; the main role of the ADF is to split a merged note segment into repeated ones. Thus, integrating AE or WAE with ADF helped to detect the missed onsets. We measured the effectiveness of our method in terms of recall and precision, which are defined by Eq. (9):

RECALL = \frac{\text{Detected onsets} - \text{False onsets}}{\text{Correct onsets}}, \quad PRECISION = \frac{\text{Detected onsets} - \text{False onsets}}{\text{Detected onsets}}    (9)

The table shows that the WAE and ADF methods together improved the transcription accuracy up to 95%.

Fig. 14. Effectiveness of ADF on AE/WAE.

We also performed a comparison with two other practical systems: AKoff (http://www.akoff.com/) and Digital Ear (http://www.digital-ear.com/). Both are music recognition applications that perform wave-to-MIDI conversion. From the results in Fig. 15, we can see that our system produced far fewer errors than the other applications for most types of errors. More specifically, AKoff produced many more drop and add errors than our system. Digital Ear detected all pitch information from each frame, but its method is aimed only at converting a wave sequence into a MIDI file.

Fig. 15. Comparison with other application systems.

5.3. Music retrieval performance
In order to evaluate the effectiveness of our retrieval scheme, we ranked the retrieved music objects based on their score and calculated the precision of the top n songs for each of the methods we considered. The experimental results are shown in Figs. 16 and 17, which give the average precision over the top 5, 10, and 20 retrieved songs, indicated as average@{5, 10, 20}, for each of the algorithms that we implemented, as well as the degree of improvement over the initial unoptimized query. As we expected, the GA with the fitness function behaved more reasonably than the other relevance feedback methods such as Ide Regular, Standard Rocchio and Ide Dec-Hi. As shown in Fig. 17, our GA-based feedback method improved the retrieval accuracy by up to 20–40%. In contrast, the GA with a random function gave quite poor performance compared with the other RF methods.

In Fig. 18a, we measured the relationship between the query length and the response time in each generation of our genetic algorithm. As we expected, a small number of generations with a query of a few notes, such as 5 or 10, gave much better results than a large number of generations with longer queries. This means that a large number of generations takes much more time for the genetic algorithm. We empirically observed that the optimal number of generations was 20 and a reasonable query size was about 10 notes.

Our prototype system allowed approximate matching for the humming query to compensate for the inaccuracy of the acoustic input. Such a query might return too many results with a low precision score, which are useless to the user. To solve this problem, when querying with approximate matching, we allowed users to specify the accuracy range from 10% to 100% (query boundary). The graph in Fig. 18b shows that the query length did not seem to affect the response time significantly when the user searched with a few notes, such as 5 or 10. On the other hand, it was quite slow with queries longer than 20 notes.
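For reference, the average@{5, 10, 20} figures used here can be computed as in the following sketch (ours); it assumes ranked result lists with binary relevance judgments.

```python
def precision_at(ranked_relevance, k):
    """Fraction of relevant songs among the top-k retrieved ones."""
    top = ranked_relevance[:k]
    return sum(top) / len(top)

def average_at(queries, k):
    """Average precision@k over all evaluated queries, as in Figs. 16 and 17."""
    return sum(precision_at(r, k) for r in queries) / len(queries)

# Two toy queries with binary relevance judgments for the top 5 results:
judgments = [[1, 1, 0, 1, 0], [1, 0, 0, 0, 1]]
print(average_at(judgments, 5))   # 0.5
```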
Fig. 16. Comparison with other RF methods.
Fig. 17. Comparison with other relevance feedback methods.
Fig. 18. Efficiency of our algorithm.
6. Conclusions
In this paper, we presented a new music retrieval system, "MUSEMBLE". The system automatically transcribes a user's humming signal into notes with improved accuracy and then applies a relevance feedback technique based on a genetic algorithm to improve the retrieval performance. For more robust pitch tracking, we revised the traditional method and proposed the WAE together with some cleaning procedures to obtain accurate note segments and onset/offset information. Furthermore, in order to obtain more accurate durations, the note onset/offset and the ADF onset were considered together. With our query representation algorithm, the overall error rates were decreased significantly, as shown in our experiments. Our GA-based feedback scheme returned perfect results after 20 generations, and we observed that a longer query with a large number of GA generations might result in a longer response time; we determined the optimal number of generations and query size through a series of tests. For the usability evaluation of our graphical user interface, we also conducted various experiments to measure the effectiveness and efficiency of our GA-based feedback method. Overall, our query interface with GA-based feedback improved flexibility and retrieval accuracy.
In the future, we will try to extract information about plosives and develop adaptive error-resilient preprocessing methods. We are also planning to consider two-point or multi-point crossover methods to reduce the positional bias.

Acknowledgements
This research was supported by the MIC (Ministry of Information and Communication), Korea, under the ITRC (Information Technology Research Center) support program supervised by the IITA (Institute of Information Technology Advancement) (IITA-2006-(C1090-06030002)), and by the Ubiquitous Computing and Network (UCN) Project, the Ministry of Information and Communication (MIC) 21st Century Frontier R&D Program in Korea.

References
Baeza-Yates, R., Ribeiro-Neto, B., 1999. Modern Information Retrieval. Addison Wesley.
Chai, W., 2001. Melody Retrieval on the Web. M.S. thesis, Media Arts and Sciences, Massachusetts Institute of Technology.
Gerhard, D., 2003. Pitch Extraction and Fundamental Frequency: History and Current Techniques. Technical Report TR-CS 2003-06.
Ghias, A., et al., 1995. Query by humming – musical information retrieval in an audio database. In: Proceedings of ACM Multimedia 95, pp. 231–236.
Hoashi, Zeitler, Inoue, 2002. Implementation of relevance feedback for content-based music retrieval based on user preferences. In: Proceedings of ACM SIGIR, pp. 385–386.
Hoashi, Matsumoto, Inoue, 2003. Personalization of user profiles for content-based music retrieval based on relevance feedback. In: Proceedings of ACM Multimedia, pp. 110–119.
Huang, C.M., et al., 2001. Synchronization and flow adaptation schemes for reliable multiple-stream transmission in multimedia presentation. Journal of Systems and Software 56 (2), 133–151.
Huang, R., Hansen, J.H.L., 2006. Advances in unsupervised audio classification and segmentation for the broadcast news and NGSW corpora. IEEE Transactions on Audio, Speech and Language Processing 14 (3), 907–919.
Hwang, E., Rho, S., 2004. FMF (fast melody finder): A web-based music retrieval system. In: Lecture Notes in Computer Science, vol. 2771. Springer-Verlag, pp. 179–192.
Hwang, E., Park, S., Kim, S., Byeon, K., 2005. Automatic voice query transformation for query-by-humming systems. In: Proceedings of IMSA 2005, pp. 197–202.
Hwang, E., Rho, S., 2006. FMF: Query adaptive melody retrieval system. Journal of Systems and Software 79 (1), 43–56.
Forberg, J., 1998. Automatic conversion of sound to the MIDI format. TMH-QPSR 1-2/1998.
Klapuri, A.P., 2005. A perceptually motivated multiple-F0 estimation method. In: 2005 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 291–294.
Kornstadt, A., 1998. Themefinder: A web-based melodic search tool. In: Computing in Musicology 11. MIT Press.
Lemström, K., Wiggins, G.A., Meredith, D., 2001. A three-layer approach for music retrieval in large databases. In: Second International Symposium on Music Information Retrieval, Bloomington, IN, USA, pp. 13–14.
Lopez-Pujalte, C., Guerrero-Bote, V., Moya-Anegon, F., 2003. Order-based fitness functions for genetic algorithms applied to relevance feedback. Journal of the American Society for Information Science 54 (2), 152–160.
MacQueen, J., 1967. Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297.
Ryynänen, M., Klapuri, A., 2006. Transcription of the singing melody in polyphonic music. In: Proceedings of ISMIR 2006.
McCann, J.A., et al., 2000. Kendra: Adaptive Internet system. Journal of Systems and Software 55 (1), 3–17.
McNab, R.J., et al., 1997. The New Zealand digital library melody index. D-Lib Magazine.
Mitchell, M., 1996. An Introduction to Genetic Algorithms. MIT Press.
Suzuki, M., et al., 2006. Music information retrieval from a singing voice based on verification of recognized hypotheses. In: Proceedings of ISMIR 2006.
Pickens, J., 2000. A comparison of language modeling and probabilistic text information retrieval approaches to monophonic music retrieval. In: Proceedings of the 1st Annual International Symposium on Music Information Retrieval (ISMIR 2000).
Rui, Y., Huang, T.S., Ortega, M., Mehrotra, S., 1998. Relevance feedback: A power tool for interactive content-based image retrieval. IEEE Transactions on Circuits and Systems for Video Technology 8 (5), 644–655.
Stejic, Z., Takama, Y., Hirota, K., 2003. Genetic algorithm-based relevance feedback for image retrieval using local similarity patterns. Information Processing and Management 39 (1), 1–23.
Tokui, N., Iba, H., 2000. Music composition with interactive evolutionary computation. In: Third International Conference on Generative Art, Milan, Italy, pp. 215–226.
Typke, R., Prechelt, L., 2001. An interface for melody input. ACM Transactions on Computer–Human Interaction, 133–149.
Uitdenbogerd, A., Zobel, J., 1999. Melodic matching techniques for large music databases. In: Proceedings of ACM Multimedia Conference, pp. 57–66.
Ukkonen, E., Lemström, K., Mäkinen, V., 2003. Sweepline the music. Lecture Notes in Computer Science 2598, 330–342.
Unehara, M., Onisawa, T., 2003. Construction of music composition system with interactive genetic algorithm. Journal of the Asian Design International Conference.
Zhuge, H., 2000. A problem-oriented and rule-based component repository. Journal of Systems and Software 50 (3), 201–208.
Zhuge, H., 2003. An inexact model matching approach and its applications. Journal of Systems and Software 67 (3), 201–212.
Web references
Foote, "The TreeQ Package", ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/tools/treeq1.3.tar.gz.
jMusic Java Library, http://jmusic.ci.qut.edu.au.
MIR Systems, http://mirsystems.info/index.php?id=mirsystems.
AKoff Sound Labs, http://www.akoff.com/.
Digital Ear, http://www.digital-ear.com/.

Seungmin Rho received his B.S. and M.S. degrees in Computer Science from Ajou University, Korea, in 2001 and 2003, respectively. Currently he is pursuing a Ph.D. degree in the Computer Science Department of Ajou University. He is currently working on audio analysis and intelligent music retrieval system development. His research interests include databases, audio and video retrieval, multimedia systems, machine learning, and intelligent agent technologies. Mr. Rho is a member of the IEEE.

Byeong-jun Han received his B.S. degree in Electrical Engineering from Korea University, Korea, in 2005. Currently he is pursuing the M.S. degree in the School of Electrical Engineering at Korea University. He is currently working on audio analysis and intelligent music retrieval system development. His research interests include multimedia feature extraction, audio/visual retrieval systems, multimedia data mining, and machine learning. Mr. Han is a student member of the IEEE.
Eenjun Hwang received his B.S. and M.S. degrees in Computer Engineering from Seoul National University, Seoul, Korea, in 1988 and 1990, respectively, and his Ph.D. degree in Computer Science from the University of Maryland, College Park, in 1998. From September 1999 to August 2004, he was with the Graduate School of Information and Communication, Ajou University, Suwon, Korea. Currently he is a member of the faculty in the School of Electrical Engineering, Korea University, Seoul, Korea. His current research interests include databases, multimedia systems, information retrieval, XML, and Web applications.
Minkoo Kim received his B.S. degree in Computer Engineering from Seoul National University, Seoul, Korea, in 1977; and M.S. degree in Computer Engineering from KAIST (Korea Advanced Institute of Science and Technology), Daejeon, Korea, in 1979. He received his Ph.D. degree in Computer Science from the Pennsylvania State University, in 1989. From January 1999 to January 2000, he was with the University of Louisiana, CACS as a visiting researcher. Since 1981, he has been a member of the faculty in the College of Information Technology, Ajou University, Suwon, Korea. His current research interests include multi-agent systems, information retrieval, ontology and its applications.