
Pattern Recognition Letters 21 (2000) 269-275

www.elsevier.nl/locate/patrec

Audio signal identification via pattern capture and template matching

Martin Kermit *, Åge J. Eide

School of Computer Sciences, Østfold College, Post box 1192, Valaskjold, 1705 Sarpsborg, Norway

Abstract

This research reports on a system able to classify different signals containing auditive information based on capture of small signal segments present in specific types of sound. After using a Haar wavelet transform at the preprocessing stage, a neural network known as the O-algorithm compares segments from candidate audio signals against predefined templates stored in the network. The classification performance is tested with three different applications. © 2000 Elsevier Science B.V. All rights reserved.

Keywords: Haar wavelets; Audio signals; O-algorithm

1. Introduction

In the last decade there has been an increasing demand for systems able to classify different types of auditive signals, especially speech recognition systems. This is very much due to the growing potential of voice activated software or electronic equipment and telephone services that require activation from spoken letters or words (Loizou and Spanias, 1996). Audio classification in general also has a wide range of applications other than speech recognition, such as the diagnosis of heart sounds (Nilssen, 1996). Sound classification systems that offer off-line classification can be designed to identify different types of audio signals from a limited set of training data with robust performance.

* Corresponding author. Tel.: +47-6910-4114; fax: +47-6910-4002. E-mail address: [email protected] (M. Kermit).

Such systems are mainly reported in the field of speech recognition, with different approaches (Renals and Rohwer, 1989; Arslan and Hansen, 1999; Lippmann, 1989; Unnikrishnan et al., 1992; Chen et al., 1996). General recognition of speech is also reported in several articles (Chudy et al., 1991; Lang et al., 1990; Tom and Tenorio, 1991; Wutiwiwatchai et al., 1998) which present systems for speaker independent word recognition. Other types of classification have been reported for vowels (Kermit et al., 1999; Irino and Kawahara, 1990). In this research, we present a general audio classification system based on identification of small audio segments. These segments are used as features, and the recognition is based on the presence of these features. The reported system is applicable to a variety of audio signals and is able to perform recognition on-line as well as off-line. Section 2 presents the details of the system by describing the preprocessing, identification algorithm and feature selection in separate parts. The data being used for system testing is then


presented in Section 3, while Section 4 reports the system performance when the described data is applied. Section 5 concludes and evaluates the system.

2. The proposed audio recognition scheme

A key to successful classification may be to isolate some kind of uniqueness which identifies the sound to be classified. Such uniqueness is sometimes present during a very short period of time in audio signals and is often repeated, with some degree of perturbation, for multiple succeeding periods throughout a signal. An example is shown in Fig. 1, where the amplitude spectrum of the spoken word entropy is indicated at the top. At the bottom of this figure, four periods of the first vowel are magnified for illustrative purposes. The four indicated periods have a plausible candidate for a segment representing a unique feature of the vowel E outlined for all periods.

Fig. 1. Illustrative example of a characteristic region used to represent a vowel identified in a speech signal. The vowel E is recognized with multiple periods based on capture of this specific region used as a feature for identification.

2.1. Preprocessing stage

Even though a small segment of an audio signal may be unique in some sense to a particular signal, significant variations occur. To compensate for some of these perturbations, the Haar wavelet transform is used as a preprocessing tool. Haar wavelets are known to have a dampening effect on the rich amplitude variations often present in audio signals, thus representing an average of adjacent sample values. The family of Haar wavelets \psi_{m,n}(t) is described by

    \psi_{m,n}(t) = 2^{-m/2}\,\psi(2^{-m} t - n),                               (1)

where \psi(t) represents the mother wavelet. Here, n denotes the translation parameter and m is the amount of scaling. The application of Haar wavelets to a characteristic segment of a signal produces a set of N Haar wavelet coefficients, w = [w_1, w_2, \ldots, w_j, \ldots, w_N], where

    w_j = \langle x, \psi_{m,n} \rangle,                                        (2)

and x = [x_1, x_2, \ldots, x_j, \ldots, x_N] are the recorded samples of sound forming the characteristic region chosen to represent the audio signal. The translation and scaling parameters are integers chosen such that

    m = 1, 2, \ldots, \log_2(N) + 1,    n = 0, 1, \ldots, \lceil 2^{-m} N - 1 \rceil,

where \lceil \cdot \rceil denotes the ceiling operator. [1] By performing this procedure for P different signals belonging to P different classes, each coefficient set

    w_p = [w_{p1}, w_{p2}, \ldots, w_{pj}, \ldots, w_{pN}],    p = 1, 2, \ldots, P,

thus gives a template of Haar coefficients representing a specific audio signal.

[1] For any x \in \mathbb{R}, \lceil x \rceil denotes the ceiling of x and \lfloor x \rfloor the floor of x (the greatest integer in x), where \lceil x \rceil = \lfloor x \rfloor = x for x \in \mathbb{Z}; \lfloor x \rfloor is the integer directly to the left of x for x \in \mathbb{R} \setminus \mathbb{Z}; \lceil x \rceil is the integer directly to the right of x for x \in \mathbb{R} \setminus \mathbb{Z}.

2.2. Identification procedure

The identification process is performed by a neural network known as the O-algorithm (Eide and Lindblad, 1992; Lindblad et al., 1997). The O-algorithm performs a similarity match between two sets of data. Similarity between a predefined data set a = [a_1, a_2, \ldots, a_N] and a candidate data set b = [b_1, b_2, \ldots, b_N] is calculated by

    \chi^2 = \sum_{j=1}^{N} \frac{(a_j - b_j)^2}{\sigma^2},                     (3)

where \sigma is a scaling parameter representing the expected deviation between a and b. The two data sets a and b are considered to be similar if \chi^2 < \theta, where \theta is a predefined threshold parameter depending on the data at hand. In practice, \sigma can be difficult to determine, and justification of \theta is used instead. Eq. (3) can be applied to several predefined data sets

    a_p = [a_{p1}, a_{p2}, \ldots, a_{pN}],    p = 1, 2, \ldots, P,

thus performing a similarity match between b and P different data sets a_p. The proposed setup can be viewed as a neural network with N inputs and P active neurons, each performing the similarity match in (3). The training of this neural network is then the selection of the predefined data sets a_p. Every neuron p thus stores a predefined pattern a_p and a threshold \theta_p related to the particular data in a_p against which the candidate pattern b is to be matched. In the case of audio data, let a_p = w_p, where w_p is a predefined characteristic template representing some class of audio signals; w_p is given by (2) for each audio signal class p. A candidate template b captured from an audio signal to be classified is thus compared against the predefined templates w_p, and class membership is determined from the similarity values \chi^2_p.
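To make the preprocessing and matching steps concrete, the following Python sketch computes the Haar coefficients of Eqs. (1)-(2) by direct inner products with the discretized wavelets and evaluates the similarity match of Eq. (3) against a set of stored templates. This is a minimal illustration only: the coefficient ordering, the default sigma and the function names are assumptions, since the paper does not specify an implementation.

```python
import numpy as np

def haar_coefficients(x):
    """Haar template of a segment x whose length N is a power of 2 (Eqs. (1)-(2)):
    w_j = <x, psi_{m,n}>, psi_{m,n}(t) = 2^(-m/2) psi(2^(-m) t - n),
    with m = 1..log2(N)+1 and n = 0..ceil(2^(-m) N - 1)."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    assert N > 0 and (N & (N - 1)) == 0, "segment length must be a power of 2"
    t = np.arange(N)
    coeffs = []
    for m in range(1, int(np.log2(N)) + 2):                   # m = 1 .. log2(N)+1
        for n in range(int(np.ceil(2.0 ** -m * N - 1)) + 1):  # n = 0 .. ceil(2^-m N - 1)
            u = 2.0 ** -m * t - n                             # argument of the mother wavelet
            psi = np.where((u >= 0.0) & (u < 0.5), 1.0,
                           np.where((u >= 0.5) & (u < 1.0), -1.0, 0.0))
            coeffs.append(2.0 ** (-m / 2.0) * np.dot(x, psi))
    return np.array(coeffs)                                   # N coefficients in total

def chi_squared(a, b, sigma=1.0):
    """Similarity measure of Eq. (3): chi^2 = sum_j (a_j - b_j)^2 / sigma^2."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum((a - b) ** 2) / sigma ** 2)

def o_algorithm_match(b, templates, theta, sigma=1.0):
    """Indices p of the stored templates a_p that the candidate b is similar to
    (chi^2_p < theta), i.e. the 'active neurons' that fire."""
    return [p for p, a_p in enumerate(templates) if chi_squared(a_p, b, sigma) < theta]
```

Storing one template per class, w_p = haar_coefficients(segment_p), and calling o_algorithm_match on the Haar transform of a candidate segment corresponds to the P-neuron view described above; theta is tuned by hand for the data at hand.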

2.3. Segment selection

The system presented in this text has no limitations on the selection of templates other than the restriction that the segment length N be an integer which is a power of 2. This is due to the Haar wavelet preprocessing step in (2). A straightforward method to select unique segments from audio signals is to visually inspect the amplitude spectrum for outstanding features. If the system is to be trained for recognition of a specific feature, e.g. a specific defect in a heart sound or a sharp transition region characteristic of some spoken phoneme, this method applies well. If no information about specific features is available for the audio at hand, a trial and error search for a characteristic region can be applied instead. In practice, this is often the case. Repeated training with different templates and testing against other candidate templates can be performed iteratively until satisfactory performance has been achieved.
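One possible reading of this trial-and-error search is sketched below, reusing haar_coefficients and chi_squared from the previous listing: candidate segments are tried in turn, and the first one whose template matches further periods of its own class while rejecting periods from other classes is kept. The acceptance criterion and the argument names are assumptions made for this example, not part of the original system.

```python
def pick_template_by_trial_and_error(candidate_segments, own_periods, other_periods,
                                     theta, sigma=1.0):
    """Hypothetical trial-and-error search (Section 2.3): keep the first candidate
    segment whose Haar template matches all further periods of its own class
    (chi^2 < theta) and none of the periods taken from other classes."""
    for segment in candidate_segments:
        template = haar_coefficients(segment)
        same = [chi_squared(template, haar_coefficients(p), sigma) for p in own_periods]
        other = [chi_squared(template, haar_coefficients(p), sigma) for p in other_periods]
        if all(s < theta for s in same) and all(o >= theta for o in other):
            return template
    return None  # no satisfactory segment found; retry with other candidates or another theta
```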

2.4. Template capture

The search for similar templates in signals to be classified is done simply by traversing the vector containing the audio signal in ascending order, one position at a time. For every sample x_i present in the audio signal, the N following samples are Haar transformed and used as input to the neural network. In this manner, every possible segment x = [x_i, x_{i+1}, \ldots, x_{i+N-1}] is Haar transformed and matched against one or more predefined templates w_p stored in the neural network, using the strategy outlined in Section 2.2. This enables the system to perform on-line classification of audio signals if the Haar wavelet window moves at the same rate as the input stream of audio data.
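This capture step amounts to a sliding window over the signal. A minimal sketch, again reusing the helpers defined above, could look as follows; the return format (sample position, template index) is an assumption made for illustration.

```python
def scan_signal(signal, templates, theta, N=32, sigma=1.0):
    """Traverse the signal one sample at a time (Section 2.4). The window of N
    samples starting at each position i is Haar transformed and matched against
    every stored template; each match is reported as (position, template index)."""
    matches = []
    for i in range(len(signal) - N + 1):
        b = haar_coefficients(signal[i:i + N])
        for p in o_algorithm_match(b, templates, theta, sigma):
            matches.append((i, p))
    return matches
```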


3. Data for classification

Three sets of data are used in the demonstration of the proposed recognition setup to highlight different perspectives of the system. The first data set is a database consisting of 90 files of data representing 10 separate trials of the 9 vowels present in the Norwegian tongue. The Norwegian vowels are all of the steady state type. [2] The nine Norwegian vowels are listed in Table 1, together with the corresponding International Phonetic Association (IPA) symbols using the notation of Kirshenbaum. Examples of English words containing the specific vowel sounds are also given, except for the vowel O, which is not easily recognized in the English tongue. The database containing the Norwegian vowels was also used in the experiments reported in (Kermit et al., 1999). All the vowels were uttered by the same male speaker. Each file consists of 1024 mono 16-bit samples recorded at a sampling rate of 8 kHz. These files of sampled steady state speech sound contain roughly 10-12 vowel periods each.

[2] If a vowel does not change its frequency characteristics throughout several periods of the pronunciation when spoken separately, the vowel is said to be of a steady state type. In the English language, only the vowel E has the steady state feature when spoken separately. However, the steady state vowels are often present in spoken words.

Table 1
ASCII approximated IPA transcription symbols for the Norwegian vowels. Examples from the English language are shown for each vowel.

Vowel   IPA symbol   Examples
A       /a/          father; car
E       /E/          bed; fed
I       /i/          beat; nosebleed
O       /o/          (not present)
U       /u/          rule
Y       /y/          system; symphony
Æ       /&/          map; apple
Ø       /W/          heard; nerd
Å       /O/          saw; all

The second set of data consists of 25 files representing five separate trials of five different words: car, pit, fed, apple and tall. Each of these words includes the pronunciation of one of the steady state vowels described previously. 8000 mono samples were recorded for each file at the same sampling rate as for the steady state vowels.

The last set of data differs slightly from the two sets of speech sound described above. A prerecorded musical composition on a compact disc was chosen such that some musical instruments could be recorded in isolation, without interference from other instruments. From this composition, one trial for each of three different music instruments, a bass guitar, a keyboard and a snare drum, was recorded separately for roughly 200-300 ms. At another point in the same composition, a complex musical mixture where all three instruments were present simultaneously was recorded. This mixture was recorded for 5 s. All signals were resampled in mono at a sampling rate of 8 kHz, as for the speech signals, giving 16-bit samples. Fig. 2 shows the amplitude spectra of two of the mentioned instruments at the top, and a part of the musical composition at the bottom of the figure.

4. System evaluation

The system performance was tested using the three different sets of data described in the previous section: the steady state vowels, the words and the music instruments. For the first two data sets, the neural network was trained once, using a single characteristic segment selected by visual inspection from each of the nine vowels. The third data set, containing the musical data, was used to train the O-algorithm by the trial and error procedure, since the visual inspection method is not likely to be applicable to data of this nature.

4.1. Application to steady state vowels

In this demonstration three window widths were used: 8, 16 and 32 samples. The results are presented as a confusion matrix in Table 2, where the vowel inputs are listed in the first column. The system response is shown on the corresponding row for each of the three window widths being used. The off-diagonal values refer to misclassifications. The values in the table refer to the actual number of times the network reported a match between the input data and the data set the neural network was trained to recognize. The characteristic segment was picked from the first period in

the first file in the group of 10 files of vowel data, for each vowel. With 1024 samples in each file, about 90 000 tests were performed for each vowel. The maximum number of obtainable identifications on the diagonal of the table is about 100, as there are about 10 periods in each file from the vowel database.

Table 2
Results from the recognition system when tested with off-line recognition of steady state vowels spoken in isolation. The rows list the vowel inputs (A, E, I, O, U, Y, Æ, Ø, Å) and the columns give the network output for each vowel period with three different window widths, N = 8, 16 and 32 samples.

4.2. Application to speech signals

As shown in Fig. 1, the vowels exist as slowly varying vowel-like periods when they are present within a spoken word. The idea was that these periods are similar enough to the steady state vowels that the neural network training information from the previous experiment could also be used in this test. Since each spoken word contains multiple vowel periods, it is possible to classify a vowel by using a winner-take-all strategy. If a word contains, say, 6 vowel periods of which 4 are correctly identified and 2 are identified wrongly, the vowel will still be classified correctly. Table 3 shows the vowel search results from five different trials for each of the five indicated words. The detected vowel periods are indicated along with their classification in the fourth row, while the last row gives the number of correct and wrong classifications of the vowels by the winner-take-all strategy. Only the 8-sample data window was used in this demonstration.

4.3. Application to instrument recognition

In the last experiment, the selection of the predefined templates used to represent the three different music instruments had to be found by trial and error. The training was stopped when the network no longer reported falsely identified instruments. The top of Fig. 2 shows the segments finally used to represent the snare drum and the keyboard after the trial and error search. The lower part of this figure shows the two instruments recognized in the musical composition. A window width of N = 32 samples was used. Table 4 shows the system classification for this experiment. The first and second rows indicate the actual instrument and the number of possible detections in the chosen 5 s of the musical composition. The last row gives the number of correct and wrong detections for each instrument.


Fig. 2. Example of characteristic segments from two different music instruments. A snare drum is shown in the upper left plot, while the upper right shows a keyboard. The lower plots show the two regions identified in a musical composition.

Table 3
Results from the recognition system when tested to classify five trials of the five different spoken words indicated, by identifying vowel periods contained in the words. The table gives the number of correct (C) and wrong (W) classifications for each vowel period detected in the word. The last row presents the classification of the word using the winner-take-all strategy. Network output, N = 8.

Used word             car        pit        fed        apple      tall
Phoneme               A          I          E          Æ          Å
Classification        C    W     C    W     C    W     C    W     C    W
Detected periods      35   27    4    6     7    14    6    22    37   6
Detected words        5    0     1    2     0    2     1    3     5    0

Table 4
Results from system testing with three different instruments present in 5 s of a complex musical composition. The table indicates the number of instrumental repetitions contained in the composition and the number of correct (C) and wrong (W) classifications. Network output, N = 32.

Instrument              Bass guitar    Keyboard    Snare drum
Possible detections     4              8           8
Classification          C     W        C     W     C     W
Detected instruments    3     0        5     0     7     0


5. Concluding remarks

In this report the O-algorithm has been discussed for audio signal identification and some applications have been demonstrated. In the first demonstration, windows with widths of 8-32 samples were used as unique representations of vowel periods. The fact that not all periods were detected in the speech sound is usually not a significant problem, as the vowels in speech sound consist of many vowel-like periods and may thus be classified correctly by a winner-take-all strategy. The spread in the results from the word recognition setup is ascribed to the difference in pronunciation of the steady state vowels compared to the pronunciation of the vowels in some of these words, pit, fed and apple. Further research will address this problem. The last experiment, testing the reported system with music instrument recognition, indicates that some degree of identification is possible for audio signals in general if the selection of unique segments is done properly.

References

Arslan, L.M., Hansen, J.L., 1999. Selective training for hidden Markov models with applications to speech classification. IEEE Trans. Speech and Audio Processing 7 (1), 46-54.
Chen, W.-Y., Chen, S.-H., Lin, C.-J., 1996. A speech recognition method based on the sequential multi-layer perceptrons. Neural Networks 9 (4), 655-669.
Chudy, V., Hapak, L., Chudy, L., 1991. Isolated word recognition in Slovak via neural nets. Neurocomputing 3 (5&6), 259-282.


Eide, Å., Lindblad, T., 1992. Artificial neural networks as measuring devices. Nuclear Instruments and Methods in Physics Research, Vol. A317. Elsevier, Amsterdam, pp. 607-608.
Irino, T., Kawahara, H., 1990. A method for designing neural networks using nonlinear multivariate analysis: Application to speaker-independent vowel recognition. Neural Computation 2 (3), 386-397.
Kermit, M., Bodal, K.A., Eide, Å.J., Haug, T.M., Kristiansen, A.C., Lindblad, T., Linden, T., 1999. Steady state vowel recognition of speech sound using the O-algorithm. In: Hamza, M.H. (Ed.), Artificial Intelligence and Soft Computing (ASC'99), IASTED, August 9-12, pp. 362-367.
Lang, K.J., Waibel, A.H., Hinton, G.E., 1990. A time-delay neural network architecture for isolated word recognition. Neural Networks 3 (1), 23-43.
Lindblad, T., Lindsey, C.S., Eide, Å., 1997. Radial basis function (RBF) neural networks. In: Irvin, J.D. (Ed.), The Industrial Electronics Handbook. CRC Press, Boca Raton, pp. 1014-1018.
Lippmann, R.P., 1989. Review of neural networks for speech recognition. Neural Computation 1 (1), 1-38.
Loizou, P.C., Spanias, A.S., 1996. High-performance alphabet recognition. IEEE Trans. Speech and Audio Processing 4 (6), 430-445.
Nilssen, E.J., 1996. Feature extraction and classification of heart sounds. Diploma Thesis in Applied Physics, Institute of Mathematical and Physical Sciences, University of Tromsø.
Renals, S., Rohwer, R., 1989. Phoneme classification experiments using radial basis functions. In: International Joint Conference on Neural Networks (IJCNN'89), Vol. 1, pp. 461-467.
Tom, M.D., Tenorio, M.F., 1991. Short utterance recognition using a network with minimum training. Neural Networks 4 (6), 711-722.
Unnikrishnan, K.P., Hopfield, J.J., Tank, D.W., 1992. Speaker-independent digit recognition using a neural network with time-delayed connections. Neural Computation 4 (1), 108-119.
Wutiwiwatchai, C., Jitapunkul, S., Luksaneeyanawin, S., Ahkuputra, V., Maneenoi, E., 1998. Thai polysyllabic word recognition using fuzzy-neural network. In: Namazi, N.M. (Ed.), Signal and Image Processing (SIP'98), IASTED, October 28-31, pp. 521-525.