The Journal of Systems and Software 79 (2006) 43–56 www.elsevier.com/locate/jss
FMF: Query adaptive melody retrieval system

Seungmin Rho a,*, Eenjun Hwang b,*

a Graduate School of Information and Communication, Ajou University, ADTL Lab., Room 901-3 Paldal Hall, Suwon 443-749, Republic of Korea
b Department of Electronics and Computer Engineering, Korea University, Anam-dong, Sungbuk-gu, Seoul 136-701, South Korea

Received 29 August 2003; received in revised form 30 November 2004; accepted 30 November 2004
Available online 24 December 2004
Abstract

Recent progress in computer and network technologies has made it possible to store and retrieve large volumes of multimedia data in many applications. In such applications, an efficient indexing scheme is crucial for multimedia retrieval. Depending on the media type, multimedia data show distinct characteristics and require different handling. In this paper, we propose a fast melody finder (FMF) that can quickly retrieve melodies from an audio database based on frequently queried tunes. Those tunes are collected from user queries and incrementally updated into an index. Considering the empirical pattern of user requests for multimedia data, these tunes will cover a significant portion of user requests. FMF represents all acoustic and common music notational inputs using well-known string formats such as UDR and LSR and uses string matching techniques to find query results. We implemented a prototype system and report on its performance through various experiments.
© 2004 Elsevier Inc. All rights reserved.

Keywords: Indexing; Multimedia database; Music retrieval; String matching
1. Introduction

Most traditional approaches to retrieving music are based on titles, composers or file names. However, because such metadata are often incomplete and reflect personal preference, it can be difficult to find music that satisfies the particular requirements of an application. Even worse, such retrieval techniques cannot support queries such as "find music that contains pieces similar to the one being played." Content-based music retrieval is usually based on a set of extracted audio features such as pitch, interval, duration and scale. One common approach for
developing content-based music retrieval is to represent music as a string over a small set of characters. For example, we may use three characters to represent the pitch contour: U(p), D(own) and S(ame) or R(epeat). To find similar melody strings in a melody source, information retrieval techniques, especially string matching methods, are used.
Standard string matching algorithms such as Brute-Force, Knuth–Morris–Pratt or Boyer–Moore can find a given sequence of characters in a text. Unfortunately, these algorithms find only strings that exactly match the input. This is not suitable for acoustic input, since people do not sing or hum accurately, especially if they are inexperienced; even skilled musicians have difficulty maintaining correct pitch throughout a song. Therefore, it is common to use approximate matching algorithms instead of exact matching ones (Kosugi et al., 2000; McNab et al., 1997; Uitdenbogerd and Zobel, 1998; Uitdenbogerd and Zobel, 1999). In general, approximate matching algorithms are far less efficient than exact matching algorithms. For that reason, we use both exact
and approximate matching techniques: approximate matching is used where inaccuracy can be tolerated, and exact matching is used where search accuracy counts.
In this paper, we propose a novel audio retrieval system called FMF (Fast Melody Finder) that dynamically constructs an index from frequent query melodies and various other audio features for efficient audio retrieval.
The rest of the paper is structured as follows. Section 2 presents an overview of existing music information retrieval (MIR) systems. Section 3 describes the string representations of music and the matching algorithms used in the system. Section 4 presents our indexing scheme for fast music retrieval. Section 5 describes our prototype system and reports some of the experimental results, and the last section concludes this paper.
2. Related work

In this section, we review some typical techniques and systems for music information retrieval. Music can be represented in computers in two different ways. One is based on musical scores, as in MIDI (the most popular format) and Humdrum (Kornstadt, 1998). The other is based on acoustic signals, which are sampled at a certain frequency and compressed to save space. Examples of this representation include wave (.wav) and MPEG Layer-3 (.mp3). MIDI data can easily be synthesized into audio signals, but there is no known algorithm for reliable conversion in the opposite direction.
Many researchers have studied the music similarity problem by analyzing symbolic representations such as MIDI data and musical scores. A related technique is to use pitch tracking to find a 'melody contour' for each piece of music. String matching techniques are then used to compare the transcriptions for each song (Ghias et al., 1995; McNab et al., 1997; Uitdenbogerd and Zobel, 1998; Uitdenbogerd and Zobel, 1999; Hwang and Rho, 2004; Hwang and Park, 2002). String matching is the most widely used method in music retrieval because a melody is naturally represented as a string of note symbols. To allow for human input error, dynamic programming can be applied to string matching; however, this method tends to be rather slow.
The MELDEX (MELody inDEX) system (McNab et al., 1997) uses exact and approximate pattern matching algorithms with dynamic programming to compare a quantized version of the queried melody with the contents of the database. The inexact model matching approach (Zhuge, 2003), which is based on a quantified inexact signature-matching theory, finds the models that best approximate a user's query requirements. It can enhance the reusability of a model repository and enables users to use and manage the repository conveniently and flexibly.
Using this matching theory, Zhuge applied it to a problem-oriented model repository system, PROMBS (Zhuge, 2000). Lemstorm et al. (1998) used the suffix tree as an index, presented a coding scheme for music that is invariant under different keys and tempos, and investigated the application of two approximate matching algorithms to retrieve music. Subramanya et al. (1997) proposed content-based indexing schemes for audio data in multimedia databases. Their method is based on transformation techniques used in signal processing, which transform data from the time domain to the frequency domain. This offers several advantages such as easy removal of noise, efficient compression and different types of processing.
Acoustic approaches analyze the music content directly and thus can be applied to any music for which the audio is available. Blum et al. (1999) present an indexing system based on matching features such as pitch, loudness or Mel-frequency cepstral coefficients (MFCCs). Foote (1997) designed a music indexing system based on histograms of MFCC features derived from a discriminatively trained vector quantizer. Tzanetakis (2002) extracts a variety of features representing the spectrum, rhythm and chord changes and concatenates them into a single vector to determine similarity. Logan and Salomon (2001) and Aucouturier and Pachet (2002) model songs using local clustering of MFCC features, determining similarity by comparing the models. Berenzweig et al. (2003) use a suite of pattern classifiers to map MFCCs into an "anchor space", in which probability models are fit.
Several prototype MIR systems exist. For instance, Foote (1997) proposed a method for retrieving audio and music that is based on the classification of the music in the database; classification of a piece of music is based on the distribution of the similarities. Blackburn and DeRoure (1998) presented a new content- and time-based navigation tool and extended navigation so that music-based navigation is possible. Themefinder (Kornstadt, 1998) provides a web-based interface to the Humdrum thema command, which in turn allows database searches for musical themes. It also allows users to find common themes in Western classical music and in folksongs of the 16th century. Ghias et al. (1995) developed a QBH system that is capable of processing an acoustic input to extract the necessary query information. However, this system used only three types of contour information to represent melodies. The MELDEX system was designed to retrieve melodies from a database using a microphone. It first transforms acoustic query melodies into music notation, and then searches the database for tunes containing the hummed (or a similar) pattern. This web-based system provides several match modes including approximate
matching for interval, contour and rhythm. MelodyHound (Typke and Prechelt, 2001), originally known as "TuneServer", also used only three types of contour information to represent melodies. Tune recognition is based on error-resistant encoding and uses only the direction of the melody, ignoring the size of intervals and rhythm.
Many other researchers have studied QoS (Quality of Service)-guaranteed multimedia systems over unpredictable-delay networks by monitoring network conditions such as bandwidth. McCann et al. (2000) developed an audio delivery system called Kendra that uses adaptability with distributed caching mechanisms to improve data availability and delivery performance over the Internet. Huang et al. (2001) presented the PARK approach for multimedia presentations over a best-effort network in order to achieve reliable transmission of continuous media such as audio or video.
3. Melody representations and matching methods

3.1. Melody representation

So far, most work has considered only the UDR string parsed from pitch information to represent music. However, there are several restrictions in using the UDR string. First, the plain UDR string cannot describe sudden pitch transitions. For example, as in Fig. 1, although the pitch contours of the left and right bars are clearly different, they have the same string "UDU." Classifying intervals into five extended types relieves this: up, up a lot, repeat, down and down a lot. The classification for up, down and repeat is the same as before. The distinction between "down" and "down a lot" could depend on a threshold on the interval size, but a more reliable approach is to compare a note with the previous pitch contour: for instance, if the current note is lower than the note before the last one, it is classified as "down a lot." With this extension, music can be parsed into a single string over five pitch contour symbols: u, U, r, d and D for up, up a lot, repeat, down and down a lot, respectively. The first bar of the example can then be represented as "udu" and the second bar as "UDU."
Second, the pitch contour cannot represent information about a note's length, tempo and scale; therefore, an alternative representation scheme based on a time contour is also needed.
Fig. 1. Two different bars with the same pitch contour.
Fig. 2. Two different bars with the same time contour.
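To make the five-level classification concrete, the following Java sketch (ours, not code from the paper) derives the pitch contour from a sequence of MIDI note numbers and the time contour from inter-onset durations. The handling of the very first interval, for which no "note before the last one" exists, is our own simplification.

public class PitchContour {
    // Converts a sequence of MIDI note numbers into the five-level contour string:
    // u (up), U (up a lot), r (repeat), d (down), D (down a lot).
    // "Up a lot" / "down a lot" follow the rule of Section 3.1: the current note is
    // also compared with the note before the last one.
    public static String toContour(int[] notes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i < notes.length; i++) {
            int cur = notes[i], prev = notes[i - 1];
            if (cur == prev) {
                sb.append('r');
            } else if (cur > prev) {
                // the very first interval has no earlier reference and falls back to 'u'
                sb.append(i >= 2 && cur > notes[i - 2] ? 'U' : 'u');
            } else {
                sb.append(i >= 2 && cur < notes[i - 2] ? 'D' : 'd');
            }
        }
        return sb.toString();
    }

    // Converts inter-onset durations (e.g., in MIDI ticks) into the time contour:
    // L (longer), S (shorter), R (repeat of the previous duration).
    public static String toTimeContour(int[] durations) {
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i < durations.length; i++) {
            if (durations[i] == durations[i - 1]) sb.append('R');
            else sb.append(durations[i] > durations[i - 1] ? 'L' : 'S');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Illustrative note numbers only (small vs. large intervals)
        System.out.println(toContour(new int[] {60, 62, 60, 62})); // prints udu
        System.out.println(toContour(new int[] {60, 67, 55, 72})); // prints uDU
    }
}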
Similar to the way a pitch contour is described by relative pitch transitions, a time contour can be described by the durations between notes. Time in music is classified in one of three ways: R for a repetition of the previous duration, L for longer than the previous duration, and S for shorter than the previous duration. For instance, in Fig. 2, the pitch contour and time contour of both bars are "udr" and "LSR". Nevertheless, they are actually different melodies, because the rhythm of the melody has not been considered.
Therefore, third, we also consider the rhythm of the melody using the pitch name and duration of each note. The usual note pitch names C, D, E, F, G, A, B correspond to the musical scale "do re mi fa sol la si", and the note durations semiquaver, quaver, crotchet, minim and semibreve are represented by the corresponding weights "1", "2", "4", "8" and "16", respectively. For a dotted note, "." is appended, and "#" and "-" are used for sharp and flat. For example, the eight notes of the melody in Fig. 2 are coded by the following string (assuming the basic length "1" is a semiquaver):

F4|B8|G2|G2|F2|B4.|G4|G4

Suppose we compare this melody with a melody "F4|C4|G4|G4|F2|C4.|G4|G4" stored in the database. An exact matching technique fails to identify the similarity between the two. To solve this problem, we use the longest common subsequence (LCS) algorithm (Baeza-Yates and Ribeiro-Neto, 1999; Cormen et al., 2001). The LCS problem asks for the length of a longest common subsequence; the LCS of the two melodies above is "F4|G2|G2|F2|G4|G4".

3.2. Matching algorithms

We have implemented the following matching algorithms in our prototype system; their performance in the experiments is given in Section 5.3.
Dynamic programming algorithm: Dynamic programming has long been popular in approximate string matching. Since melody contours are represented as character strings, dynamic programming was applied to melody comparison and has become a standard technique in music information retrieval. When melodies are viewed as strings, one of the popular measures of similarity is the number or cost of editing operations that must be performed to make the
strings identical. The minimum such cost is called the "edit distance." The most common editing operations for melody comparison are inserting a note (insertion), deleting a note (deletion) and replacing a note (replacement). These three basic operations form the foundation of the dynamic programming algorithms applied to melody comparison. Assume we have two sequences A = a_1, a_2, ..., a_m and B = b_1, b_2, ..., b_n, and let d_{i,j} represent the dissimilarity between a_1, ..., a_i and b_1, ..., b_j. Then d_{i,j} is calculated by the following recurrence for 1 ≤ i ≤ m and 1 ≤ j ≤ n:

\[
d_{i,j} = \min \begin{cases}
d_{i-1,j} + w(\Lambda, a_i) & \text{(insertion)} \\
d_{i,j-1} + w(b_j, \Lambda) & \text{(deletion)} \\
d_{i-1,j-1} + w(a_i, b_j) & \text{(replacement)}
\end{cases}
\]

The calculation of d_{i,j} is illustrated in Fig. 3. Here, w(Λ, a_i), w(b_j, Λ) and w(a_i, b_j) represent the weights associated with the insertion of a_i, the deletion of b_j and the replacement of a_i by b_j, respectively, where Λ denotes the empty symbol. Most MIR systems use this method with all weights set to 1. For two melody segments, we can build a matrix according to the following formula, where c is the matrix, and q and p denote the query melody and the melody piece in the index, respectively. Index i ranges from 0 to the query length and index j ranges from 0 to the melody piece length:

\[
\begin{aligned}
c[0, j] &= 0 \\
c[i, 0] &= i \\
c[i, j] &= \begin{cases}
c[i-1, j-1] & \text{if } q[i] = p[j] \\
1 + \min(c[i-1, j],\, c[i, j-1],\, c[i-1, j-1]) & \text{otherwise}
\end{cases}
\end{aligned}
\]

Fig. 4 shows an example where the dynamic programming algorithm is used to match the melody segment [d, d, u, d, r, d, u] against the query [d, d, u, r, r, u]. Shaded squares indicate error positions. The optimal matching has an error cost of 2, which occurs in the lower right-hand corner of the matrix in Fig. 4. Tracing the path that led to the optimum aligns the following strings:

Melody in FAI: [d d u d r d u]
Query melody:  [d d u r r u]
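For concreteness, the following Java sketch fills this matrix with unit weights. It is an illustration of the recurrence above, not the prototype's actual code.

public class EditDistanceSearch {
    // Fills the dynamic-programming matrix for approximately matching a query contour
    // string against a melody piece, using unit insertion/deletion/replacement costs.
    // c[i][j] is the minimum cost of matching the first i query symbols ending at
    // position j of the melody piece.
    public static int[][] buildMatrix(String query, String piece) {
        int m = query.length(), n = piece.length();
        int[][] c = new int[m + 1][n + 1];
        for (int j = 0; j <= n; j++) c[0][j] = 0;   // a match may start anywhere
        for (int i = 1; i <= m; i++) c[i][0] = i;   // cost of deleting i query symbols
        for (int i = 1; i <= m; i++) {
            for (int j = 1; j <= n; j++) {
                if (query.charAt(i - 1) == piece.charAt(j - 1)) {
                    c[i][j] = c[i - 1][j - 1];
                } else {
                    c[i][j] = 1 + Math.min(c[i - 1][j - 1],
                                  Math.min(c[i - 1][j], c[i][j - 1]));
                }
            }
        }
        return c;
    }

    // Best (lowest) cost of matching the whole query somewhere in the piece.
    public static int bestCost(String query, String piece) {
        int[][] c = buildMatrix(query, piece);
        int best = Integer.MAX_VALUE;
        for (int j = 0; j <= piece.length(); j++) {
            best = Math.min(best, c[query.length()][j]);
        }
        return best;
    }

    public static void main(String[] args) {
        // The example of Fig. 4: query [d d u r r u] against melody piece [d d u d r d u]
        System.out.println(bestCost("ddurru", "ddudrdu")); // prints 2, the optimal error cost
    }
}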
Fig. 3. Calculation of d_{i,j}.

Fig. 4. The dynamic programming algorithm search.
Longest common subsequence algorithm: This algorithm also involves establishing a recurrence for the cost of an optimal solution. Let c[i, j] be the length of an LCS of the prefixes q[1..i] and p[1..j], where q and p are the query melody and the melody piece in the database, respectively. If either i or j is zero, one of the sequences has length 0, so the LCS has length 0. The optimal substructure of the LCS problem gives the following recursive formula:

\[
c[i, j] = \begin{cases}
0 & \text{if } i = 0 \text{ or } j = 0 \\
c[i-1, j-1] + 1 & \text{if } i, j > 0 \text{ and } q[i] = p[j] \\
\max(c[i, j-1],\, c[i-1, j]) & \text{if } i, j > 0 \text{ and } q[i] \neq p[j]
\end{cases}
\]

Fig. 5 gives an example of how the LCS algorithm compares the query melody and a melody piece in the database. The LCS length of the sequences [d, d, u, d, r, d, u] and [d, d, u, r, r, u] is 5. The square in row i and column j contains the value of c[i, j]. The entry 5 in c[6, 7], in the lower right-hand corner of the table, is the length of an LCS [d, d, u, r, u] of the two sequences.
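Likewise, a minimal Java sketch of the LCS length computation (illustration only; back-tracing to recover the subsequence itself is omitted):

public class LcsLength {
    // Length of the longest common subsequence of two contour strings,
    // following the recurrence above; row 0 and column 0 stay at 0.
    public static int lcs(String q, String p) {
        int m = q.length(), n = p.length();
        int[][] c = new int[m + 1][n + 1];
        for (int i = 1; i <= m; i++) {
            for (int j = 1; j <= n; j++) {
                if (q.charAt(i - 1) == p.charAt(j - 1)) {
                    c[i][j] = c[i - 1][j - 1] + 1;
                } else {
                    c[i][j] = Math.max(c[i][j - 1], c[i - 1][j]);
                }
            }
        }
        return c[m][n];
    }

    public static void main(String[] args) {
        // The example of Fig. 5: [d d u r r u] vs. [d d u d r d u]
        System.out.println(lcs("ddurru", "ddudrdu")); // prints 5
    }
}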
Fig. 5. Finding the longest common subsequence by LCS algorithm.
For i, j > 0, entry c[i, j] depends on whether q[i] = p[j] and on the values of the entries c[i-1, j], c[i, j-1] and c[i-1, j-1], which are computed before c[i, j]. To trace the elements of an LCS, follow the arrows starting from the upper left-hand corner; the path is shaded. Each arrow on the path points to a highlighted entry, which indicates a matched symbol belonging to a common subsequence.
Boyer–Moore algorithm: This algorithm is based on the idea that more information can be obtained by matching the pattern from the right than from the left, and it shows very good performance. During the search, the pattern characters are scanned for a match starting from the last character of the pattern, which allows large portions of the text to be skipped.
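To make the right-to-left scanning idea concrete, here is a compact Java sketch of Horspool's simplification of Boyer–Moore. It is a simplified variant for illustration, not the exact implementation used in our prototype.

public class HorspoolSearch {
    // Returns the index of the first occurrence of pattern in text, or -1.
    // Only the bad-character (occurrence) shift of Boyer-Moore is used, as in
    // Horspool's simplification; comparison proceeds from the right end of the
    // pattern, which is what allows shifts of up to pattern-length positions.
    public static int indexOf(String text, String pattern) {
        int n = text.length(), m = pattern.length();
        if (m == 0) return 0;
        int[] shift = new int[256];                      // contour symbols are ASCII
        java.util.Arrays.fill(shift, m);                 // default shift: whole pattern length
        for (int k = 0; k < m - 1; k++) {
            shift[pattern.charAt(k)] = m - 1 - k;        // distance from the last position
        }
        int pos = 0;
        while (pos <= n - m) {
            int k = m - 1;
            while (k >= 0 && pattern.charAt(k) == text.charAt(pos + k)) k--;
            if (k < 0) return pos;                       // full match found
            pos += shift[text.charAt(pos + m - 1)];      // shift by the char under the last position
        }
        return -1;
    }

    public static void main(String[] args) {
        // Searching a UDR pitch-contour string for an exact query pattern
        System.out.println(indexOf("uDUrddUdru", "rdd")); // prints 3
    }
}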
4. Database schema and indexing scheme

This section describes a schema for representing music and its metadata in the database and explains how to construct and maintain a dynamic index, called the FAI, of frequently queried melody tunes for fast music retrieval.

4.1. Schema for music collections

Many file formats, such as SMDL, NIFF and MIDI, are used to store musical scores. A graphical format like the Notation Interchange File Format (NIFF) is not
suitable for general interchange, which is one of the main reasons NIFF has not been adopted by many applications. The Standard Music Description Language (SMDL) was designed as a representational architecture for music information; however, there is no commercial software supporting it because of its overwhelming complexity. MP3 and other digital audio encoding formats represent music recordings, not music notation. Except for very simple music, computers cannot automatically derive accurate music notation from recordings, despite many decades of research. On the other hand, a Musical Instrument Digital Interface (MIDI) file contains more score information than the other formats. It can also hold text data such as titles, composers, track names and other descriptions. This is why the MIDI format is popular in this area, and we use it in our implementation.
Fig. 6 shows the schema diagram for the musical features, mostly derived from the MIDI format. In the figure, the Music element has two components: the total number of pieces and the scores of the music. Each MusicScore element has an id, meta-information, attributes and the parts of the score. The MetaInfo element holds information such as the title, composer and filename. The Attributes element describes the musical attributes of a score, such as the clef, key and time signature. The key signature is represented by the number of sharps or flats. The Time element contains the Beats and BeatType elements, which are the numerator and denominator of the time signature, respectively. The Part element contains its id and the number of phrases.
Fig. 6. XML schema for music meta-data.
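For illustration, a hypothetical instance document consistent with the structure described above might look as follows. The element and attribute names are inferred from the description of Fig. 6 and are not taken from the prototype's actual schema.

<Music total="1">
  <MusicScore id="ms-0001">
    <MetaInfo>
      <Title>Silver Bells</Title>
      <Composer>Jay Livingston, Ray Evans</Composer>
      <FileName>silver_bells.mid</FileName>
    </MetaInfo>
    <Attributes>
      <Clef>treble</Clef>
      <Key sharps="0" flats="0"/>
      <Time>
        <Beats>3</Beats>
        <BeatType>4</BeatType>
      </Time>
    </Attributes>
    <Part id="p1" phrases="4">
      <!-- phrase-level melody strings (UDR / LSR / pitch-duration) would follow here -->
    </Part>
  </MusicScore>
</Music>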
4.2. FAI indexing scheme

In content-based applications, indexing and similarity-searching techniques are the two keys to fast and successful data retrieval. Currently, acoustic features are widely used as indices (Wold et al., 1996; Foote, 1997; Typke and Prechelt, 2001). In this paper, we use the UDR notation based on the pitch contour and the LSR notation based on the time contour to represent music contents in the database.
Typically, a piece of music is recognized or memorized by a few specific melody segments, which means that most user queries will concentrate on those segments. If those segments are captured in an index into the music database, a great advantage in response time can be gained. Such segments can be identified or induced from previous user queries. In this paper, we organize them into an index structure called the FAI.
Fig. 7 shows the overall indexing scheme. Fig. 7(1) shows the initial stage of the FAI. Initially, the FAI could be either empty or initialized with predetermined segments for hot music. In the former case, the whole melody strings have to be searched for the query tune. Based on the user's response, the matched music and the segment matched by the query tune are recorded. When the frequency with which a segment has appeared in queries is high enough, an index entry is allocated for that segment with a pointer to the music in the database, as shown in Fig. 7(2). Eventually, the FAI becomes populated with short representative strings for hot music and provides fast access to the music in most cases, as shown in Fig. 7(3).
It may happen that the entry matched in the FAI for the query tune is not the one the user is looking for. Therefore,
just returning the music pointed to by the FAI entry would not be enough. If the user cannot find what he or she wants, the entire melody source has to be searched. However, since users' interest tends to be restricted to a small number of popular pieces with a few memorable segments, looking up the FAI alone will satisfy most user requests. The matching engine carries out a linear search over the FAI entries. By maintaining the FAI entries such that the more popular ones are looked up earlier, the overall retrieval performance can be further improved.

4.3. Maintenance of index entries

In this subsection, we explain the algorithms and policies for manipulating the FAI entries and processing user queries.

4.3.1. Entry management

The FAI is supposed to hold a few melody segments of the music that is hot at a certain time period. The number of FAI entries for one song can be restricted to between 2 and n, depending on implementation issues. If n becomes too big, the system will suffer from redundancy and face serious efficiency problems; in the worst case, enough query tunes accumulate in the index, according to the FAI expansion policy, that the index itself becomes almost the same as the whole melody collection in the database. It is therefore meaningful to find an appropriate number of FAI entries for each hot piece of music. In this paper, we assume that the maximum number of entries for each piece is 3.
We now describe how to maintain the FAI index structure. Each FAI entry has four variables: Access Count, Age, Repetition and Size. Here, the Access Count
Fig. 7. FAI indexing scheme.
keeps the access frequency of each entry. It is initialized to '1' and is incremented by 1 whenever a query refers to the entry. Two or more entries can be merged into one entry if they overlap after expansion; in that case, the larger of their access counts becomes the access count of the merged entry. Depending on the music, its popularity can fluctuate; popular music with a high access count can become unpopular and eventually be rarely accessed. Nevertheless, it would still remain in the index due to its high access count, preventing newly popular melody segments from getting into the index. Therefore, another variable, Age, records how long the entry has been in the index since its last access, so older entries have a larger Age value than younger ones. The other variables, Repetition and Size, represent the number of repeated occurrences of the melody segment in the FAI and the length of the entry, respectively. These values are used for calculating the ranking of the melodies in the query result set. Fig. 8 depicts the structure of an FAI entry with these variables.
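For concreteness, an FAI entry with these variables might be represented as follows. This is an illustrative Java sketch; the paper does not give a concrete data structure, and the reset-on-access behaviour of Age is our interpretation of "since its last access".

public class FaiEntry {
    // A melody segment stored in the FAI, with the bookkeeping variables of Section 4.3.1.
    String segment;      // contour string of the indexed melody segment, e.g. "DRDDRDD"
    String musicId;      // identifier of the music this segment points to
    int accessCount = 1; // incremented whenever a query hits this entry
    int age = 1;         // how long since the entry was last accessed
    int repetition;      // number of repeated occurrences of the segment in the melody
    int size;            // length of the segment

    FaiEntry(String segment, String musicId, int repetition) {
        this.segment = segment;
        this.musicId = musicId;
        this.repetition = repetition;
        this.size = segment.length();
    }

    // Called when a query refers to this entry.
    void touch() {
        accessCount++;
        age = 1;         // the entry becomes "young" again on access (our interpretation)
    }
}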
4.3.2. Entry expansion

Even though human beings tend to memorize or perceive music by a few specific segments, there can be many possible relationships between a query tune and its corresponding FAI entry. If a tune is totally contained in an FAI entry, there is no problem. However, when they merely overlap, the FAI search treats it as a mismatch and causes the whole melody database to be searched. To avoid such unnecessary searches in the future, FAI entries need to be expanded or modified so that a new entry can represent both the old entry and the query tune. Fig. 9 shows how to (a) expand and (b) modify the FAI entries depending on the query tunes. Suppose that af and bf are FAI entries for music A, and aq and bq are the corresponding query tunes. As can be seen, the query tunes and the FAI entries are very similar to each other; af and aq overlap, and bf is included in bq. In both cases, the query tune cannot be precisely matched with the FAI entry, and the whole melodies would have to be looked up; to improve the search efficiency, we need operations for expansion and modification of the index entries. Fig. 9(a) shows how to expand FAI entries by merging index entries with given query tunes. In Fig. 9(b), however, the queries aq and bq are not substrings but subsequences of the FAI entries af and bf, in that af (bf) contains an almost identical substring to aq (bq). Therefore, we also need other operations such as insertion, deletion and substitution in addition to expansion. To compute the common subsequence of the query tune and the FAI entry, unnecessary characters are deleted first, and then both strings af (bf) and aq (bq) are aligned from the first matched character. Strings af (bf) and aq (bq) after the deletion and alignment are shown below:

af : D R D D R D D
aq : D R R D R D D

bf : D U R R R U U
bq : D U R R R D U U

In this alignment, the R in aq is mismatched with the D in af, and all other characters match; therefore, we need the substitution operation.

4.3.3. Entry modification and deletion

When a new query tune is given, the system first looks up the FAI entries. If there is no match, a new entry with the query tune is inserted into the FAI if there is enough space for it. If there is no free space, we may need to remove some of the existing entries, namely the least accessed or oldest ones. To select the entry to be deleted, the system first checks the Access Count variable and deletes the entry with the lowest value.
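As an illustration of the expansion operation of Section 4.3.2 and Fig. 9(a), overlapping contour strings can be merged on their longest suffix-prefix overlap. This is our sketch, not the system's code, and the example strings are made up.

public class EntryExpansion {
    // Merges two contour strings that overlap, i.e., a suffix of one equals a prefix
    // of the other. Returns the expanded string covering both, or null if they do
    // not overlap at all (in which case the whole database would still be searched).
    public static String merge(String entry, String query) {
        String forward = overlap(entry, query);   // entry followed by query
        if (forward != null) return forward;
        return overlap(query, entry);             // query followed by entry
    }

    private static String overlap(String left, String right) {
        if (left.contains(right)) return left;    // containment: nothing to extend
        int max = Math.min(left.length(), right.length());
        for (int k = max; k > 0; k--) {           // try the longest overlap first
            if (left.endsWith(right.substring(0, k))) {
                return left + right.substring(k);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // An FAI entry and a query tune that overlap (illustrative strings)
        System.out.println(merge("UDRDUDRDDRRDD", "DRRDDDURRRUU")); // prints UDRDUDRDDRRDDDURRRUU
    }
}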
Fig. 8. The structure of FAI.
Fig. 9. Handling index entries when overlapped with input. (a) Exact matching; (b) approximate matching.
If there is more than one candidate, the system considers the Age variable and selects the oldest one. After deleting old entries, the new query tune is inserted into the FAI and its variables are initialized.
Suppose we denote by entry (a, c, r) an entry with Age a, Access Count c and Repetition r. In Fig. 10, each FAI entry has different values for these variables: '3, 2, 1' for Age, '4, 2, 2' for Access Count and '3, 1, 4' for Repetition. Therefore, we represent these FAI entries as follows: entry (3, 4, 3), entry (2, 2, 1), entry (1, 2, 4). When a new query tune is given, the system first checks the Access Count to find the least accessed melody and then deletes it. However, both entry (2, 2, 1) and entry (1, 2, 4) have the identical Access Count value '2', as in Fig. 10. In this case, the system deletes the older of the two, so entry (2, 2, 1) is deleted. When a new query tune for the melody is given, entry (1, 1, 2) is created and the other entries are aged by '1'.
As explained in Section 4.3.2, FAI entries are expanded in order to avoid further unnecessary searches of the entire database. Fig. 11 shows how the melody expansion proceeds in the FAI. When a new query tune located between entry (2, 2, 1) and entry (1, 4, 4) is given, these overlapped melodies are merged into a new entry. This expanded entry will have Age '1' and Access Count '5', which is obtained by increasing the larger Access Count '4' by one.
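A minimal sketch of this victim-selection policy (ours, for illustration): the least accessed entry is removed first, and ties are broken by choosing the oldest (largest Age) entry.

import java.util.Comparator;
import java.util.List;

public class FaiEviction {
    // Minimal stand-in for an FAI entry; only the fields needed here.
    static class Entry {
        int age, accessCount, repetition;
        Entry(int age, int accessCount, int repetition) {
            this.age = age; this.accessCount = accessCount; this.repetition = repetition;
        }
        public String toString() {
            return "entry (" + age + ", " + accessCount + ", " + repetition + ")";
        }
    }

    // Chooses the entry to delete: least accessed first; among those, the oldest.
    static Entry selectVictim(List<Entry> entries) {
        return entries.stream()
                .min(Comparator.comparingInt((Entry e) -> e.accessCount)
                        .thenComparing(Comparator.comparingInt((Entry e) -> e.age).reversed()))
                .orElse(null);
    }

    public static void main(String[] args) {
        // The example of Fig. 10: entry (3, 4, 3), entry (2, 2, 1), entry (1, 2, 4)
        List<Entry> fai = List.of(new Entry(3, 4, 3), new Entry(2, 2, 1), new Entry(1, 2, 4));
        System.out.println(selectVictim(fai)); // prints entry (2, 2, 1): count 2, older of the tie
    }
}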
4.3.4. FAI management algorithms

Fig. 12 shows the query processing steps for music retrieval. When a query melody is given, the system consults the FAI first, before looking up the whole melodies in the database. If the melody exists in the FAI, the system adjusts the entry's variables and includes it in the result list; otherwise, the matching engine searches the entire melody database. If a match is found in the database search, the matched melody is compared with the FAI entries. At this point, a verification step checks whether the found melody overlaps with any FAI entry. If an overlapping entry exists, the melody is merged with it into a new entry. If there is no overlapping entry, the found melody is inserted into the FAI as a new entry, provided there is enough space for it. (A detailed description of the FAI management and query processing algorithms can be found in Appendix A.)
Fig. 10. Deletion of index entry based on the Access Count.
Fig. 11. Expansion of index entry for overlapped melody.
Fig. 12. Query processing steps.

5. Implementation

This section describes the overall architecture of our prototype system and the interface for querying and browsing the result. In addition, to show the effectiveness of our index scheme, we have performed several experiments and report some of the results.

5.1. System architecture

We have implemented a prototype music retrieval system based on the FAI indexing scheme. The prototype system provides a flexible user interface for querying and browsing the query result. The client and server sides are implemented using a Java applet and JSP, respectively. We used the jMusic APIs for extracting audio features and the eXcelon database system for handling the audio meta-data. Approximately 12,000 MIDI files were used in the experiments, and the average file size was about 40 KB.
Fig. 13 shows all the steps and system components involved in query processing. When a query is given either by humming into the microphone or through the CMN-based interface, the system interprets it as a signal or a sequence of notes, respectively, and extracts audio features such as the pitch and time contours. Those extracted features are then transformed into the UDR, LSR and pitch-duration notations. For the transformed string, the FAI index is looked up first, and the music database is searched only if the index lookup fails. If a match is found in the index, the system adjusts the entry's variables and includes it in the result set. If a melody is found from the database search and the user confirms it, the query tune is inserted into the FAI for that music and its variables are initialized.
5.2. User interface

Unlike traditional database applications, audio data is not easy to query using a text-based interface such as SQL. Two popular approaches for querying an audio database are humming and CMN-based representation. Fig. 14 shows the query interfaces in our prototype system. A CMN-based query is created with the mouse by dragging and dropping notes on the music sheet applet. In the case of the humming-based interface, users hum or sing into the microphone directly. Both kinds of queries are stored in MIDI format and then transformed into strings as described in Section 3. In addition, the system allows text-based queries using metadata such as the composer, title or file name. All this text information is collected from the MIDI files. For query flexibility and efficiency, the two types of queries can be combined; that is, a user can formulate a query using both text information and humming or CMN-based input.
Fig. 15 shows the list of files matched from the FAI index and the music database for a queried melody. The user can play a MIDI file by clicking the corresponding hyperlink.
Fig. 13. Overall querying process flow.
Fig. 14. Query interfaces of CMN and QBH.
5.3. Experiment
To measure the effectiveness of the string matching algorithms for our indexing scheme, we implemented three exact matching algorithms and one approximate matching algorithm. The three exact matching algorithms are Naive, Knuth–Morris–Pratt (KMP) and Boyer–Moore (BM). Boyer–Moore is the fastest on-line algorithm, with an average running time of O(n/m), where n is the length of the text (the entire database) and m is the length of the pattern. Because of its better performance, we chose the Boyer–Moore algorithm for exact matching in the system, and edit distance with dynamic programming for approximate string matching.

5.3.1. Performance comparison

Fig. 16 shows the average response time for user queries. The upper curve represents the response time with the usual index scheme, and the lower curve represents the response time with the FAI index scheme. In both cases, as the number of songs increases, more time is spent finding matches in the database. However, using the FAI index shows a much better result, even for a larger data set.
In Fig. 17, we measured the relationship between the query length and the response time. The first graph shows that the query length did not seem to affect the response time seriously for the size of our database. However, we expect that with a larger database containing various types of music, the query length will have a significant influence on the response time.
Fig. 16. Query response time.
As shown in the second graph, more files match a query with fewer notes, suggesting that short note patterns are common to most music.
Fig. 18 shows the query response time for the Naive, KMP and BM algorithms. The two upper curves represent the query response times for Naive and KMP. The lower curve represents the query response time for BM, which gives better performance than the other two algorithms. For the comparison, we used UDR strings of length p = 3 and p = 10 and measured the average response time.

5.3.2. Performance under effectiveness assumption

For the evaluation of retrieval effectiveness in melody retrieval systems, two measures are usually used: precision and recall.
Fig. 15. Query result and MIDI player.
Fig. 17. The number of notes in query vs. response time.
Fig. 18. Query response time for the Naive, KMP and BM algorithms when p = 3 and p = 10.
Precision is defined as the proportion of retrieved melodies that are relevant; it can be taken as the ratio of the number of melodies judged relevant for a particular query to the total number of melodies retrieved.
Fig. 19. Query melody from the first theme occurrence in Christmas song ‘‘Silver Bells’’.
Recall is defined as the proportion of relevant melodies that are retrieved. Recall is considerably more difficult to calculate than precision because it requires identifying relevant melodies that are not retrieved by the user's initial search.
Fig. 19 depicts two queries based on the first theme occurrence in the Christmas song "Silver Bells", and Table 1 shows some of the results for queries that find exact and similar melodies under different conditions using the pitch contour, the time contour and their combination. Using the exact matching technique in conjunction with the combined pitch and time contours gives better precision than the other configurations and than the approximate matching technique.
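Expressed as formulas (the standard information retrieval definitions, stated here for completeness):

\[
\text{precision} = \frac{|\{\text{relevant melodies}\} \cap \{\text{retrieved melodies}\}|}{|\{\text{retrieved melodies}\}|},
\qquad
\text{recall} = \frac{|\{\text{relevant melodies}\} \cap \{\text{retrieved melodies}\}|}{|\{\text{relevant melodies}\}|}.
\]

For example, for Query 1 with the pitch contour alone and exact matching in Table 1, precision = 132/545 ≈ 24.2%.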
Table 1
Query result

Query     Contour         Query string                Relevant   Retrieved            Precision (%)
                                                                  Exact    Approx.    Exact    Approx.
Query 1   Pitch           ddduddd                     132        545      2863       24.2     4.61
          Time            RLRSRLR                     132        337      2205       39.2     5.98
          Pitch + Time    ddduddd + RLRSRLR           132        132      389        100      33.9
Query 2   Pitch           Query1 + Udddrr             18         61       337        29.5     5.34
          Time            Query1 + SRLRRL             18         49       189        36.7     9.52
          Pitch + Time    Query1 + Udddrr + SRLRRL    18         18       48         100      37.5
6. Conclusions

In this paper, we discussed various features of music content for content-based retrieval and proposed a fast music retrieval scheme based on frequently accessed tunes. Through a series of experiments performed on a prototype system, we showed that our FAI index scheme can provide a substantial performance gain over existing retrieval systems. This makes sense when we consider the observation that user queries are usually confined to a small number of pieces and that the query tunes hummed or sung concentrate on a few segments. For this reason, the FAI scheme usually requires only a small amount of system resources to maintain and can be plugged on top of arbitrary music retrieval systems to improve their performance.
For further study, we plan to enhance the FAI scheme in two directions. First, we will determine an appropriate number of FAI entries for better performance through various experiments. Second, we will compare its efficiency with tree index techniques when the FAI grows much larger than its initial size. We are also planning to consider other music data formats such as MPEG Layer-3 (.mp3) and alternative techniques for extracting monophonic melodies from polyphonic music data.

Acknowledgement

This research was supported by the MIC (Ministry of Information and Communication), Korea, under the ITRC (Information Technology Research Center) support program supervised by the IITA (Institute of Information Technology Assessment).

Appendix A

Query processing algorithm in FAI:

(1) Get a query from the input device (microphone or CMN interface);
(2) While (a query comes into the system)
        If (Exact Matching is selected) Call ExactMatch(q);
        Else Call ApproxMatch(q);
(3) List the candidate result melodies, highest ranked first;
(4) Play the retrieved melody;

/* Definition of functions */

Function ExactMatch(query q)
    If (melody q is equal to a melody in the FAI)
        Increase the access_count value of the FAI entry;
        Return matched_melody;
    Else
        ResultSet = search the whole music DB;
        If (a matching melody is found in ResultSet)
            Add the melody into the FAI;

Function ApproxMatch(query q)
    Create an empty matrix M for query q and a string in the FAI, with rows i and columns j;
    M[0, j] = 0 and M[i, 0] = i;
    For each cell (i, j)
        If (symbol q[i] is equal to the FAI string symbol p[j])
            M[i, j] = M[i-1, j-1];
        Else
            M[i, j] = 1 + Min(M[i-1, j], M[i, j-1], M[i-1, j-1]);
    If (M[i, j] < Threshold)
        Increase the access_count value of the FAI entry;
        Return matched_melody;
    Else
        ResultSet = search the whole music DB;
        If (a matching melody is found in ResultSet)
            Add the melody into the FAI;

FAI management algorithm:

Function FAI_Update
    Input: query melody q, current music number n, overlapped melody number i,
           music[n].FAI[i], music[n].FAI[i].access_count, music[n].FAI[i].age,
           music[n].FAI[i].repetition
    (1) Find the entry that overlaps with the query melody;
    (2) Merge the overlapped entries and the query melody into one melody;
    (3) Call DeleteEntryfromFAI(musicNum, entryTobeDeleted);
    (4) Call InsertMelodyintoFAI(musicNum, queryMelody);

Function InsertMelodyintoFAI
    Input: current music number n, query melody q
    Variables: q.position, max_FAI_num f = music[n].FAI.max,
               music[n].FAI[f].access_count, music[n].FAI[f].age,
               music[n].FAI[f].repetition
    (1) Find the position of the new entry;
    (2) music[n].FAI[f+1] = q;
    (3) music[n].FAI[f+1].access_count = 1;

Function DeleteEntryfromFAI
    Input: current music number n, number d of the FAI entry to be deleted
    (1) Remove music[n].FAI[d];
    (2) music[n].FAI.max = music[n].FAI.max - 1;

Function MelodyCompare
    Input: Result result
    Output: Overlapped_melody
    Variable: FAI temp
    (1) Copy the FAI entries into temp;
    (2) for (i = 0; i < n; i++)                      // loop over the melodies
            for (j = 0; j < temp[i].FAI.max; j++)    // loop over the FAI entries of melody i
                if (result == temp[i].FAI[j])
                    return temp[i].FAI[j];
        return;
References

Aucouturier, J.-J., Pachet, F., 2002. Music similarity measures: What's the use? In: Proceedings of the International Symposium on Music Information Retrieval.
Baeza-Yates, R., Ribeiro-Neto, B., 1999. Modern Information Retrieval. Addison-Wesley.
Berenzweig, A., Ellis, D.P.W., Lawrence, S., 2003. Anchor space for classification and similarity measurement of music. In: Proceedings of ICME 2003, pp. 29–32.
Blackburn, S., DeRoure, D., 1998. A tool for content based navigation of music. In: Proceedings of ACM Multimedia 98 - Electronic Proceedings, pp. 361–368.
Blum, T.L., Keislar, D.F., Wheaton, J.A., Wold, E.H., 1999. Method and article of manufacture for content-based analysis, storage, retrieval, and segmentation of audio information. US Patent 5,918,223.
Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C., 2001. Introduction to Algorithms, second ed. The MIT Press.
Foote, J.T., 1997. Content-based retrieval of music and audio. In: Multimedia Storage and Archiving Systems II - Proceedings of SPIE, pp. 138–147.
Ghias, A., et al., 1995. Query by humming - musical information retrieval in an audio database. In: Proceedings of ACM Multimedia 95 - Electronic Proceedings, pp. 231–236.
Huang, C.M., et al., 2001. Synchronization and flow adaptation schemes for reliable multiple-stream transmission in multimedia presentation. Journal of Systems and Software 56 (2), 133–151.
Hwang, E., Park, D., 2002. Popularity-adaptive index scheme for fast music retrieval. In: Proceedings of IEEE Multimedia and Expo.
Hwang, E., Rho, S., 2004. FMF (fast melody finder): a web-based music retrieval system. Lecture Notes in Computer Science, vol. 2771. Springer-Verlag, pp. 179–192.
Kornstadt, A., 1998. Themefinder: A web-based melodic search tool. Computing in Musicology 11, MIT Press.
Kosugi, N., et al., 2000. A practical query-by-humming system for a large music database. In: Proceedings of the 8th ACM International Conference on Multimedia, pp. 333–342.
Lemstorm, K., et al., 1998. Retrieving music - to index or not to index. In: ACM International Multimedia Conference (MM '98), pp. 64–65.
Logan, B., Salomon, A., 2001. A music similarity function based on signal analysis. In: Proceedings of ICME 2001, pp. 190–193.
McCann, J.A., et al., 2000. Kendra: adaptive Internet system. Journal of Systems and Software 55 (1), 3–17.
McNab, R.J., et al., 1997. The New Zealand digital library MELody inDEX. D-Lib Magazine.
Subramanya, S.R., et al., 1997. Transforms-based indexing of audio data for multimedia databases. In: IEEE International Conference on Multimedia Systems.
Typke, R., Prechelt, L., 2001. An interface for melody input. ACM Transactions on Computer-Human Interaction, 133–149.
Tzanetakis, G., 2002. Manipulation, analysis, and retrieval systems for audio signals. Ph.D. Thesis, Princeton University.
Uitdenbogerd, A., Zobel, J., 1998. Manipulation of music for melody matching. In: Proceedings of ACM Multimedia Conference, pp. 235–240.
Uitdenbogerd, A., Zobel, J., 1999. Melodic matching techniques for large music databases. In: Proceedings of ACM Multimedia Conference, pp. 57–66.
Wold, E., et al., 1996. Content-based classification, search and retrieval of audio. IEEE Multimedia 3 (3), 27–36.
Zhuge, H., 2000. A problem-oriented and rule-based component repository. Journal of Systems and Software 50 (3), 201–208.
Zhuge, H., 2003. An inexact model matching approach and its applications. Journal of Systems and Software 67 (3), 201–212.
Web references

Huron, D., Sapp, C.S., Aarden, B., 2000. Themefinder. http://www.themefinder.org.
jMusic Java library. http://jmusic.ci.qut.edu.au.
MiDiLiB project. Content-based indexing, retrieval, and compression of data in digital music libraries. http://www-mmdb.iai.uni-bonn.de/forschungprojekte/midilib/english.

Seungmin Rho received his B.S. and M.S. degrees in Computer Science from Ajou University, Korea, in 2001 and 2003, respectively. Currently he is pursuing the Ph.D. degree in the Computer Science Department of Ajou University. He is currently working on audio analysis and distributed multimedia presentation. His research interests include databases, audio and video retrieval, multimedia systems, QoS and resource management.

Eenjun Hwang received his B.S. and M.S. degrees in Computer Engineering from Seoul National University, Seoul, Korea, in 1988 and 1990, respectively, and the Ph.D. degree in Computer Science from the University of Maryland, College Park, in 1998. From September 1999 to August 2004, he was with the Graduate School of Information and Communication, Ajou University, Suwon, Korea. Currently he is a member of the faculty in the Department of Electronics and Computer Engineering, Korea University, Seoul, Korea. His current research interests include databases, multimedia systems, information retrieval, XML and web applications.