Classification of speech dysfluencies with MFCC and LPCC features

Ooi Chia Ai, M. Hariharan, Sazali Yaacob, Lim Sin Chee
Biomedical Electronic Engineering Programme, School of Mechatronic Engineering, Universiti Malaysia Perlis (UniMAP), 02600 Arau, Ulu Pauh, Perlis, Malaysia

Expert Systems with Applications 39 (2012) 2157–2165

Keywords: Stuttering, MFCC, LPCC, kNN, LDA

Abstract

The goal of this paper is to compare two speech parameterization methods, Mel-Frequency Cepstrum Coefficients (MFCC) and Linear Prediction Cepstrum Coefficients (LPCC), for recognizing stuttered events. Speech samples from UCLASS are used for the analysis. The stuttered events are identified through manual segmentation and used for feature extraction. Two simple classifiers are used to test the proposed features, and conventional validation is used to test the reliability of the classifiers. The experimental investigation shows that both MFCC and LPCC features can be used for identifying stuttered events, with LPCC features slightly outperforming MFCC features.

1. Introduction

1.1. Related work

Speech is a verbal means used by humans to express their feelings, ideas and thoughts in communication; it consists of articulation, voice and fluency. However, about 1% of the population has a noticeable stuttering problem, and stuttering has been found to affect males three to four times more often than females (Awad, 1997; Chia Ai & Yunus, 2006; Van Borsel, Achten, Santens, Lahorte, & Voet, 2003). Stuttering is defined as a disruption of the normal flow of speech by involuntary dysfluencies such as repetitions, prolongations and interjections of syllables, sounds, words or phrases, as well as involuntary silent pauses or blocks in communication (Awad, 1997; Chia Ai & Yunus, 2006; Sin Chee, Chia Ai, Hariharan, & Yaacob, 2009; Tian-Swee, Helbin, Ariff, Chee-Ming, & Salleh, 2007). Stuttering cannot be completely cured, although it may go into remission for a time (Awad, 1997). Stutterers can learn to shape their speech into fluent speech with appropriate speech pathology treatment, so a stuttering assessment is needed to evaluate the performance of stutterers before and after therapy. Traditionally, speech-language pathologists (SLPs) count and classify occurrences of dysfluencies such as repetitions and prolongations in stuttered speech manually. However, these assessments are subjective, inconsistent, time consuming and prone to error (Awad, 1997; Howell, Sackin, & Glenn, 1997a, 1997b; Nöth et al., 2000; Ravikumar, Rajagopal, & Nagaraj, 2009; Ravikumar, Reddy, Rajagopal, & Nagaraj, 2008). It would therefore be useful if stuttering assessment could be done automatically, leaving more time for the treatment sessions between stutterer and SLP.

Researchers have focused on developing objective methods to assist the SLP during stuttering assessment. Table 1 summarizes, in chronological order, some of the significant research works conducted over the last two decades. Howell and Sackin (1995) located stuttered speech events, namely repetitions and prolongations. They extracted 39 acoustic parameters in total: a 20-element vector based on the autocorrelation function plus spectral coefficients based on a 19-channel vocoder. In addition, the envelope of the speech waveform was obtained by filtering the signal with a 10 Hz low-pass filter. An artificial neural network was employed to discriminate between the stuttered events; the best hit/miss rates were 0.82 for prolongations and 0.77 for repetitions using the autocorrelation function plus spectral coefficients. In 1997, an automatic dysfluency count was presented by Howell, Sackin and Glenn (1997a, 1997b). They employed 12 English-speaking children who stutter; the speech samples are available in the UCLASS database. They approached the recognition task using nine parameters, for instance whole-word and part-word duration; whole-word, first-part and second-part fragmentation; whole-word, first-part and second-part spectral measures; and part-word energy. Before the parameters were extracted, the speech signals were segmented manually into word units, and supra-lexical dysfluencies such as interjections, revisions, incomplete phrases and phrase repetitions were eliminated. Artificial neural networks (ANNs) were employed to classify each word as repetition, prolongation or fluent, with the nine parameters as input to the networks. The system yielded 95% accuracy for fluent words and 78% overall accuracy for dysfluent words (prolongations and repetitions combined), with only 58% for prolongations and 43% for repetitions individually.



Table 1
Summary of several research works on stuttering recognition, detailing the number of subjects, the features used and the classifiers employed.

| First author | Database | Features | Classifiers | Best results (%) |
|---|---|---|---|---|
| Howell (Howell & Sackin, 1995) | – | Autocorrelation function and envelope parameters | ANNs | 80% |
| Howell (Howell et al., 1997a, 1997b) | 12 speakers (UCLASS) | Duration, energy peaks, spectral measures of whole words and part words | ANNs | 78.01% |
| Geetha (Geetha et al., 2000) | 51 speakers | Age, sex, type of dysfluency, frequency of dysfluency, duration, physical concomitants, rate of speech, historical, attitudinal and behavioral scores, family history | ANNs | 92% |
| Nöth (Nöth et al., 2000) | 37 speakers | Duration and frequency of dysfluent portions, speaking rate | HMMs | – |
| Czyzewski (Czyzewski et al., 2003) | 6 normal speech samples + 6 stop-gap speech samples | Frequency, 1st to 3rd formant frequencies and their amplitudes | ANNs & rough set | 73.25% & ≥90.0% |
| Prakash (Prakash, 2003) | 10 normal + 10 stuttering children | Formant patterns, speed of transitions, F2 transition duration and F2 transition range | – | – |
| Szczurowska (Szczurowska et al., 2006) | 8 speakers | Spectral measure (FFT 512) | Multilayer Perceptron (MLP), Kohonen | 76.67% |
| Wiśniewski (Wiśniewski et al., October 18, 2007) | 38 samples for prolongation of fricatives + 30 samples for stop blockade + 30 silence-free samples | Mel Frequency Cepstral Coefficients (MFCC) | HMMs | 70% |
| Wiśniewski (Wiśniewski et al., 2007) | – | Mel Frequency Cepstral Coefficients (MFCC) | HMMs | Approximately 80% |
| Tian-Swee (Tian-Swee et al., 2007) | 15 normal speakers + 10 artificial stuttered speech samples | Mel Frequency Cepstral Coefficients (MFCC) | HMMs | 96% |
| Ravikumar (Ravikumar et al., 2008) | 10 speakers | Mel Frequency Cepstral Coefficients (MFCC) | Perceptron | 83% |
| Ravikumar (Ravikumar et al., 2009) | 15 speakers | Mel Frequency Cepstral Coefficients (MFCC) | SVM | 94.35% |
| Świetlicka (Świetlicka et al., 2009) | 8 stuttering speakers + 4 normal speakers (yielding 59 fluent + 59 non-fluent speech samples) | Spectral measure (FFT 512) | Kohonen, Multilayer Perceptron (MLP), Radial Basis Function (RBF) | 88.1%–94.9% |

Geetha, Pratibha, Ashok, and Ravindra (2000) presented research on the classification of childhood dysfluencies using ANNs. Fifty-one children participated; 25 were used to train the ANN and 26 to test it. Ten variables, namely age, sex, type of dysfluency, frequency of dysfluency, duration, physical concomitants, rate of speech, historical, attitudinal and behavioral scores, and family history, were used as input to the network. They achieved 92% accuracy in predicting normal non-fluency versus stuttering. Nöth et al. (2000) implemented a system that combined the work of the SLP and a speech recognition system to evaluate the degree of stuttering during therapy sessions. Thirty-seven patients with stuttering symptoms read an English passage. The frequency of dysfluent portions in the speech, the duration of dysfluency and the speaking rate were used to classify the degree of stuttering. With an HMM as the classifier, the system achieved a high correlation coefficient of 0.99 between the average actual dysfluencies per word and the average detected dysfluencies. In 2003, Czyzewski, Kaczmarek, and Kostek (2003) approached the recognition task through the detection of stop-gaps, the discernment of vowel prolongations and the detection of syllable repetitions. Six fluent speech samples and six stop-gap speech samples in Polish were used in the experiment. Two classifiers, ANNs and rough sets, were used to detect stuttering events. The results favoured the rough-set-based system, which yielded more than 90% classification accuracy, compared with 73.25% for the ANNs.

Prakash (2003) presented a study evaluating the speech of 10 normal and 10 stuttering children speaking Kannada (a south Indian language). The proposed acoustic parameters included formant patterns, speed of F2 transition, F2 transition duration and F2 transition range. Statistics such as the mean and standard deviation of each acoustic feature were computed, and the Walsh test was applied to find significant differences between the two groups; the results showed that these acoustic parameters are useful for the differential diagnosis of children with stuttering and normal non-fluency. In 2006, Szczurowska, Kuniszyk-Jozkowiak, and Smolka (2006) described neural-network tests of the ability to recognize and categorize non-fluent and fluent speech samples. Recordings taken from eight stuttering Polish speakers were used as research material and were analyzed using a 512-point FFT with 21 digital 1/3-octave filters with centre frequencies between 100 Hz and 10 kHz. Kohonen and Multilayer Perceptron networks were applied to recognize and classify fluent and dysfluent speech. The best result of 76.67% was achieved by a network architecture with 171 input neurons, 53 neurons in the hidden layer and 1 output neuron. In 2007, three papers related to automatic stuttering detection systems were presented, two of them by Wiśniewski, Kuniszyk-Jóźkowiak, Smołka, and Suszyński (2007a, 2007b). In Wiśniewski et al. (2007b), 38 samples were employed for the prolongation-of-fricatives recognition model, 30 samples for the stop-blockade recognition model and 30 samples for the summary model.


All the samples were in Polish and were down-sampled to 22,050 Hz, and they were parameterized using MFCCs. The best recognition accuracy, 70%, was achieved for the summary model with free silence. In their next paper (Wiśniewski, Kuniszyk-Jóźkowiak, Smołka, & Suszyński, 2007a), they proposed an automatic detection system concentrating entirely on the recognition of prolonged fricative phonemes, with an HMM as the classification method. The same parameterization method and sampling frequency were applied as in the previous paper; an accuracy of approximately 80% was achieved, better than in their previous work. Tian-Swee et al. (2007) presented an automatic stuttering recognition system that uses the HMM technique to evaluate speech problems in children. Twenty samples of normal speech and 15 samples of artificial stuttered speech in Malay were used to train and test the HMM model; before training and testing, MFCC features were extracted from the speech data. Average correct recognition rates of 96% and 90% were achieved for the normal speech data and the artificial stuttered speech data respectively. In 2008, Ravikumar et al. (2008) proposed an automatic method for detecting syllable repetition in read speech for the objective assessment of stuttered dysfluencies; it has four stages, namely segmentation, feature extraction, score matching and decision logic. Ten speakers were employed to read 150 standard English words. The speech samples were segmented manually into syllable units, 12 MFCC features were extracted from the segmented speech samples, and a perceptron was used to decide whether a syllable was repeated or not. An accuracy of 83% was achieved with the 10 speech samples; eight samples were used to train the perceptron classifier and the remaining two were used for testing. In 2009, Ravikumar et al. (2009) proposed the same technique with different decision logic as an improvement to their previous work: the decision logic was implemented using an SVM to classify fluent and dysfluent speech. Fifteen speech samples were collected from 15 adults who stutter; 12 samples were used for training and the remaining three for testing. The system yielded 94.35% accuracy, higher than their previous work. In 2009, Świetlicka, Kuniszyk-Jóźkowiak, and Smołka (2009) presented research on the automatic detection of dysfluency in stuttered speech. Eight stuttering speakers produced 59 non-fluent speech samples and four normal speakers produced fluent speech samples, all in Polish. Twenty-one digital 1/3-octave filters with centre frequencies between 100 Hz and 10 kHz were used to analyze the speech samples, and these parameters were used as input to the networks. Multilayer Perceptron (MLP) and Radial Basis Function (RBF) networks were applied to recognize and classify fluent and non-fluent speech samples; classification accuracy for all networks was between 88.1% and 94.9%.

Based on the previous works, it can be observed that feature extraction algorithms and classification methods play an important role in the recognition of stuttered events. In this work, two feature extraction algorithms, MFCC and LPCC, were used, and their efficacy was tested using two simple classifiers, kNN and LDA.

The remainder of the paper is organized as follows. The methodology of the system is presented in Section 2, which covers the database used in the experiments, the feature extraction algorithms and the classification techniques. Experimental results and discussion are presented in Section 3. Finally, conclusions and future work are discussed in Section 4.

Fig. 1. General block diagram of stuttering recognition: stuttered speech → segmentation → feature extraction → classification → output.

2. Methodology

In this study, LPCC is used as a feature extraction algorithm, and kNN and LDA are employed to evaluate the effectiveness of the LPCC features in the recognition of repetitions and prolongations in stuttered speech. Fig. 1 illustrates the general block diagram of stuttering recognition. The effectiveness of the LPCC features is compared with MFCC features, since most research groups have used MFCC as the feature extraction algorithm. However, it is hard to compare results when experiments are conducted on different databases. Thus, in this work, MFCC and LPCC are compared on the same database, the University College London Archive of Stuttered Speech (UCLASS), which is the only stuttered speech database that can be accessed easily on its website (Howell, Davis, & Bartrip, 2009).

2.1. Speech data

The database was obtained from UCLASS (Howell et al., 2009). It consists of recordings of monologues, readings and conversations; 43 different speakers contribute 107 reading recordings. Table 2 shows the distribution of the reading-recording database. In this work, only a subset of the available samples, 39 speech samples, was taken from the UCLASS archive (Howell & Huckvale, 2004) for analysis. It includes one sample each from two female speakers and 37 male speakers, with ages ranging between 11 years 2 months and 20 years 1 month; the samples were chosen to cover a broad range of both age and stuttering rate. The contents of the speech samples are the passages "One more week to Easter" and "Arthur the rat" (90% of whose words are monosyllabic) (Howell et al., 1997b); each of the two passages consists of more than 300 words. The filenames of the speech samples employed in this paper are F_0142_12y4m_1, F_0818_12y4m_1, M_0034_12y6m_1, M_0034_12y8m_1, M_0034_12y11m_1, M_0034_13y2m_1, M_0034_13y6m_1, M_0048_11y1m_1, M_0048_11y1m_2, M_0048_11y5m_1, M_0048_11y6m_1, M_0053_10y2m_1, M_0053_10y2m_2, M_0053_10y4m_1, M_0053_10y6m_1, M_0053_10y8m_1, M_0053_11y1m_1, M_0053_11y4m_1, M_0064_11y5m_1, M_0064_11y11m_1, M_0064_12y2m_1, M_0065_20y1m_1, M_0077_11y2m_1, M_0077_13y1m_1, M_0098_8y4m_1, M_0138_12y2m_1, M_0138_13y3m_1, M_0216_11y6m_1, M_0216_12y1m_1, M_0253_14y6m_1, M_0398_12y5m_1, M_0541_12y3m_1, M_0545_15y3m_1, M_0760_12y3m_1, M_0760_15y3m_1, M_0874_13y1m_1, M_0874_15y5m_1, M_0880_13y4m_1, M_0880_15y2m_1. The two types of dysfluencies in stuttered speech considered here, repetitions and prolongations, were identified and segmented manually by listening to the recorded speech signals.
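For illustration, speaker gender and age can be parsed from the UCLASS-style file names listed above. The following Python sketch assumes only the naming convention visible in that list (gender, four-digit speaker ID, age in years and months, recording index); it is not part of the UCLASS tooling.

```python
import re

# Pattern inferred from names such as "M_0034_12y6m_1":
# gender, speaker ID, age in years and months, recording index.
NAME_RE = re.compile(
    r"^(?P<gender>[MF])_(?P<speaker>\d{4})_(?P<years>\d+)y(?P<months>\d+)m_(?P<rec>\d+)$"
)

def parse_uclass_name(name):
    """Return (gender, speaker ID, age in years, recording index) for one file name."""
    m = NAME_RE.match(name)
    if m is None:
        raise ValueError(f"unrecognized file name: {name}")
    d = m.groupdict()
    age = int(d["years"]) + int(d["months"]) / 12.0
    return d["gender"], d["speaker"], age, int(d["rec"])

print(parse_uclass_name("M_0034_12y6m_1"))  # ('M', '0034', 12.5, 1)
```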


Table 2
Age and gender distribution of the reading recordings in the chosen subset of the UCLASS database (age range 7y10m–20y7m).

| | Male | Female |
|---|---|---|
| Recordings | 92 | 15 |
| Speakers | 38 | 5 |

The segmented speech samples are then subjected to speech signal pre-processing.

2.2. Speech signal pre-processing

The original sampling frequency of the speech samples is 44.1 kHz. For speech processing purposes, each speech sample is down-sampled to 16 kHz; this is reasonable for the present work, because most of the salient speech features lie within an 8 kHz bandwidth (Huang, Acero, & Hon, 2001). Fig. 2 displays the computation of the speech signal pre-processing. The recognition system was implemented in MATLAB. First, the sampled speech signals are pre-emphasized with a filter. A pre-emphasis filter is typically a simple first-order high-pass filter; its purpose is to even out the spectral energy envelope by amplifying the high-frequency components and removing the DC component in the signal. The z-transform of the filter is shown in Eq. (1):

$$H(z) = 1 - a\,z^{-1}, \qquad 0.9 \le a \le 1.0 \tag{1}$$

In the time domain, the relationship between the output s'_n and the input s_n of the pre-emphasis block is shown in Eq. (2). For fixed-point implementations, a value of a = 15/16 = 0.9375 is often used (Rabiner & Juang, 1993):

$$s'_n = s_n - a\,s_{n-1} \tag{2}$$

The pre-emphasized signal is then divided into short frame segments using a Hamming window.

Fig. 2. Block diagram of speech signal pre-processing: speech signal sampled at 16 kHz → pre-emphasis → frame blocking → windowing.
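The pre-processing of Eqs. (1) and (2) followed by frame blocking and Hamming windowing can be sketched in a few lines of Python/NumPy (the paper's own system was written in MATLAB; the frame length and overlap below are just placeholders for the values studied later):

```python
import numpy as np

def preemphasize(signal, a=0.9375):
    """Eq. (2): s'_n = s_n - a * s_{n-1}, applied to the whole signal."""
    return np.append(signal[0], signal[1:] - a * signal[:-1])

def frame_and_window(signal, fs=16000, frame_ms=20, overlap=0.5):
    """Split the signal into overlapping frames and apply a Hamming window to each."""
    frame_len = int(fs * frame_ms / 1000)
    hop = int(frame_len * (1 - overlap))
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    window = np.hamming(frame_len)
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return frames  # shape: (n_frames, frame_len)

# Example: one second of a synthetic signal sampled at 16 kHz
x = np.random.randn(16000)
frames = frame_and_window(preemphasize(x))
```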

2.3. Speech parameterization

Speech parameterization is an important step in speech recognition systems; it is used to extract relevant information, such as phonemes, from the audio signal (Gajsek & Mihelic, 2008). Two speech parameterization techniques, MFCC and LPCC, were investigated in this study to identify stuttering characteristics in the speech signal.

2.3.1. Mel-frequency Cepstral Coefficients

MFCC is a popular parameterization method developed over the past few decades and widely used in speech technology, especially speech recognition and speaker identification; several studies of speech stuttering have also used MFCC. The appeal of the method is that MFCC acoustic features are claimed to be robust in recognition tasks related to the human voice. MFCC analysis is similar to cepstral analysis, except that the frequency axis is warped according to the Mel scale, so that the Mel-frequency cepstrum approximates the response of the human auditory system more closely than the linearly spaced frequency bands of the conventional cepstrum. In MFCC, the logarithm is used to separate the excitation spectrum from the vocal-system spectrum; the resulting cepstrum represents the slow variations of the signal spectrum and characterizes the vocal tract shape of the uttered words. The computation of the MFCC features is illustrated in Fig. 3(a). First, the Fast Fourier Transform (FFT) is applied to the windowed signal to convert each frame of N samples from the time domain to the frequency domain. The power coefficients are then filtered by a triangular bandpass filter bank, also known as the Mel-scale filter bank (Dhanalakshmi, Palanivel, & Ramalingam, 2009). The Mel-frequency scale has linear frequency spacing below 1 kHz and logarithmic spacing above 1 kHz; the Mel-scale filter bank is listed in Table 3. The mapping of linear frequency to Mel frequency is shown in Eq. (3):

$$\mathrm{Mel}(f) = 2595\,\log_{10}\!\left(1 + \frac{f}{700}\right) \tag{3}$$

Finally, the log Mel spectrum is converted back to the time domain using the Discrete Cosine Transform (DCT); the output is called the Mel Frequency Cepstrum Coefficients (MFCC) (Jothilakshmi, Ramalingam, & Palanivel, 2009).
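The MFCC pipeline of Fig. 3(a) can be sketched as follows. This is a simplified NumPy illustration, not the authors' implementation: the filter edges are spaced uniformly on the Mel axis of Eq. (3) rather than taken exactly from Table 3, and the FFT size and coefficient count are assumptions.

```python
import numpy as np

def mel(f):
    """Eq. (3): map linear frequency (Hz) to the Mel scale."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def inv_mel(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, fs=16000, n_filters=25, n_coeffs=25, n_fft=512):
    # 1. FFT power spectrum of the windowed frame
    power = np.abs(np.fft.rfft(frame, n_fft)) ** 2
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    # 2. Triangular Mel-scale filter bank (edges equally spaced on the Mel axis)
    edges = inv_mel(np.linspace(mel(0.0), mel(fs / 2.0), n_filters + 2))
    fbank_energies = np.empty(n_filters)
    for i in range(n_filters):
        lo, ctr, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = np.clip((freqs - lo) / (ctr - lo), 0.0, 1.0)
        falling = np.clip((hi - freqs) / (hi - ctr), 0.0, 1.0)
        fbank_energies[i] = np.sum(power * np.minimum(rising, falling))
    # 3. Logarithm, then 4. DCT back to the cepstral ("time") domain
    log_energies = np.log(fbank_energies + 1e-10)
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_coeffs), 2 * n + 1) / (2 * n_filters))
    return dct @ log_energies  # MFCC feature vector
```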

2.3.2. Linear Prediction Cepstral Coefficients

LPCC consists of the Linear Prediction Coefficients (LPC) represented in the cepstrum domain (Antoniol, Rollo, & Venturi, 2005; Rabiner & Juang, 1993). The fundamental idea is that the LPCC can be derived directly from the LPC using a recursion, rather than by applying the inverse Fourier transform to the logarithm of the spectrum of the original signal. LPCC is therefore less computationally expensive, as it can be computed without calculating the Fourier transform needed to move the signal from the time domain to the frequency domain (Watts, 2006). Furthermore, LPCC inherits the advantages of LPC. The main idea of LPC is based on the speech production model: the all-pole LPC model provides a good approximation to the vocal tract spectral envelope when applied to the analysis of speech signals, which leads to a moderate source-vocal tract separation. Therefore, LPCC was employed in this work; the block diagram of LPCC feature extraction is depicted in Fig. 3(b).


Table 3
Mel-scale filter bank (Picone, 2002).

| Index | Lower frequency (Hz) | Center frequency (Hz) | Upper frequency (Hz) | Bandwidth (Hz) |
|---|---|---|---|---|
| 1 | 0 | 50 | 100 | 100 |
| 2 | 100 | 150 | 200 | 100 |
| 3 | 200 | 250 | 300 | 100 |
| 4 | 300 | 350 | 400 | 100 |
| 5 | 400 | 450 | 500 | 100 |
| 6 | 500 | 550 | 600 | 100 |
| 7 | 600 | 650 | 700 | 100 |
| 8 | 700 | 750 | 800 | 100 |
| 9 | 800 | 850 | 900 | 100 |
| 10 | 900 | 950 | 1000 | 100 |
| 11 | 1000 | 1071 | 1147 | 147 |
| 12 | 1147 | 1229 | 1317 | 169 |
| 13 | 1317 | 1410 | 1511 | 194 |
| 14 | 1511 | 1618 | 1733 | 223 |
| 15 | 1733 | 1857 | 1989 | 255 |
| 16 | 1989 | 2130 | 2282 | 293 |
| 17 | 2282 | 2444 | 2618 | 336 |
| 18 | 2618 | 2805 | 3004 | 386 |
| 19 | 3004 | 3218 | 3447 | 443 |
| 20 | 3447 | 3692 | 3955 | 508 |
| 21 | 3955 | 4237 | 4538 | 583 |
| 22 | 4538 | 4861 | 5207 | 669 |
| 23 | 5207 | 5578 | 5975 | 768 |
| 24 | 5975 | 6400 | 6855 | 881 |
| 25 | 6855 | 7343 | 7866 | 1011 |


Fig. 3. Block diagram of speech parameterization: (a) MFCC (FFT → spectrum → Mel-scale filter bank → logarithm → cepstrum → DCT → MFCC feature vector); (b) LPCC (autocorrelation analysis → LPC parameters → cepstral coefficients → LPCC feature vector).

2.3.2.1. Linear Prediction Analysis. LPC analysis represents each sample of the signal in the time domain by a linear combination of the p preceding values s(n − p) through s(n − 1), where p is the order of the LPC analysis. In this paper, LPC analysis uses the autocorrelation method of order p = 14 (Antoniol et al., 2005). The frame x(n) is assumed to be zero for n < 0 and n ≥ N by multiplying it with a Hamming window; error minimization of the pth-order linear predictor then leads to the well-known normal equations (Alexander & Rhee, 1987; Alexander & Zong, 1987; Makhoul, 1975), given in Eqs. (4) and (5):

$$\sum_{k=1}^{p} a_k\, R(|i-k|) = R(i), \qquad 1 \le i \le p \tag{4}$$

where

$$R(i) = \frac{1}{L-i} \sum_{n=0}^{N-1-i} y_n\, y_{n+i} \tag{5}$$

The coefficients R(i − k) form an autocorrelation matrix, which is a symmetric Toeplitz matrix (a Toeplitz matrix is one in which all the elements along each descending diagonal from left to right are equal (Makhoul, 1975)). The matrix form considerably simplifies the combination of (4) and (5) into

$$R\,a = r \tag{6}$$

where r is the p × 1 autocorrelation vector

$$r = [\,r(1)\; r(2)\; \ldots\; r(p)\,]^{T} \tag{7}$$

a is the p × 1 predictor coefficient vector

$$a = [\,a_1\; a_2\; \ldots\; a_p\,]^{T} \tag{8}$$

and R is the p × p Toeplitz autocorrelation matrix, which is nonsingular:

$$R = \begin{pmatrix} r(0) & r(1) & r(2) & \cdots & r(p-1) \\ r(1) & r(0) & r(1) & \cdots & r(p-2) \\ r(2) & r(1) & r(0) & \cdots & r(p-3) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ r(p-1) & r(p-2) & r(p-3) & \cdots & r(0) \end{pmatrix} \tag{9}$$

To find the predictor coefficient vector, the linear system is solved by matrix inversion:

$$a = R^{-1}\, r \tag{10}$$
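A compact NumPy/SciPy sketch of the autocorrelation-method LPC analysis of Eqs. (4)–(10) follows, with order p = 14 as in the paper. Solving R a = r directly (Eq. (10)) rather than with the Levinson–Durbin recursion is a simplification made only for clarity; it is not claimed to be the authors' code.

```python
import numpy as np
from scipy.linalg import toeplitz

def lpc_autocorrelation(frame, p=14):
    """Solve the normal equations R a = r (Eqs. (4)-(10)) for one windowed frame."""
    n = len(frame)
    # Autocorrelation sequence r(0)..r(p), normalized as in Eq. (5)
    r = np.array([np.dot(frame[:n - i], frame[i:]) / (n - i) for i in range(p + 1)])
    R = toeplitz(r[:p])                   # p x p symmetric Toeplitz matrix, Eq. (9)
    a = np.linalg.solve(R, r[1:p + 1])    # a = R^{-1} r, Eq. (10)
    # Prediction-error power, used as the gain term sigma^2 in the cepstral recursion
    gain = r[0] - np.dot(a, r[1:p + 1])
    return a, gain
```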

2.3.2.2. Cepstral Coefficients (CC). CCs are the coefficients of the Fourier transform representation of the logarithm magnitude spectrum (Antoniol et al., 2005). Once the LPC vector [a_0 a_1 a_2 ... a_p] has been obtained, the cepstral coefficient vector [c_0 c_1 c_2 ... c_p] can be computed from it using a recursion (Dhanalakshmi et al., 2009), defined in Eqs. (11)–(13):

$$c_0 = \ln \sigma^2 \tag{11}$$

$$c_m = a_m + \sum_{k=1}^{m-1} \frac{k}{m}\, c_k\, a_{m-k}, \qquad 1 \le m \le p \tag{12}$$

$$c_m = \sum_{k=1}^{m-1} \frac{k}{m}\, c_k\, a_{m-k}, \qquad m > p \tag{13}$$

where σ² is the gain term in the LPC model, c_m are the cepstral coefficients, a_m are the predictor coefficients, 1 < k < N − 1, and p is the order of the LPC analysis.

The cepstral coefficients, which are the coefficients of the Fourier transform representation of the log magnitude of the spectrum, have been shown to be more robust for speech recognition than the LPC coefficients (Dhanalakshmi et al., 2009). Generally, a cepstral representation with Q > p coefficients is used, where Q ≈ (3/2)p (Rabiner & Juang, 1993). In this study, 25 MFCCs and 21 LPCCs were extracted to discriminate the two types of dysfluencies, repetitions and prolongations, in stuttered speech. To investigate the effectiveness of LPCC and MFCC in stuttered-event recognition, three parameters were varied: the frame length was changed from 10 to 50 ms, the window overlap was set to no overlap, 33.33%, 50% or 75%, and the first-order pre-emphasis coefficient a was varied from 0.91 to 0.99. The results of this analysis are discussed in the results and discussion section.
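The LPC-to-cepstrum conversion of Eqs. (11)–(13) can be sketched as below (Q = 21 cepstral coefficients, consistent with the 21 LPCCs used in this study and with Q ≈ (3/2)p). This is a plain transcription of the recursion for illustration, not the authors' code; terms with m − k > p are skipped because a_k is only defined up to order p.

```python
import numpy as np

def lpc_to_lpcc(a, gain, q=21):
    """Convert LPC coefficients a[1..p] and gain to Q cepstral coefficients (Eqs. (11)-(13))."""
    p = len(a)
    c = np.zeros(q)
    c[0] = np.log(gain)                                   # Eq. (11): c_0 = ln(sigma^2)
    for m in range(1, q):
        acc = sum((k / m) * c[k] * a[m - k - 1]           # a[m-k-1] holds a_{m-k}
                  for k in range(1, m) if m - k <= p)
        c[m] = a[m - 1] + acc if m <= p else acc          # Eq. (12) for m <= p, Eq. (13) for m > p
    return c

# Usage together with the previous sketch:
# lpcc = lpc_to_lpcc(*lpc_autocorrelation(frame))
```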

2.4. Classification

This section explains the elementary theory of the k-Nearest Neighbor (kNN) classifier and Linear Discriminant Analysis (LDA). The input to both classifiers is the set of feature vectors discussed above. Conventional validation was performed, with 70% of the data used for training and 30% for testing. The data in both the training and testing sets are normalized to the range [0, 1] and shuffled randomly.
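A minimal sketch of this validation protocol (random shuffling, min–max normalization to [0, 1], 70/30 split) is given below; the function and variable names are placeholders, and here the scaling is computed over all data before splitting, which is one possible reading of the description above.

```python
import numpy as np

def conventional_validation_split(features, labels, train_fraction=0.7, seed=0):
    """Shuffle, normalize each feature dimension to [0, 1] and split into train/test sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(features))
    x, y = features[order], labels[order]
    # Min-max normalization to [0, 1] per feature dimension
    x_min, x_max = x.min(axis=0), x.max(axis=0)
    x = (x - x_min) / np.where(x_max > x_min, x_max - x_min, 1.0)
    n_train = int(train_fraction * len(x))
    return x[:n_train], y[:n_train], x[n_train:], y[n_train:]
```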

2.4.1. k-Nearest Neighbor (kNN)

kNN is an uncomplicated classification model that employs lazy learning (Pallabi & Bhavani, 2006). It is a supervised learning algorithm that classifies a new query instance according to the majority category among its k nearest neighbours. The minimum distance between the query instance and each member of the training set is calculated to determine the kNN category.


Each query instance (test speech signal) is compared against every training instance (training speech signal), and the kNN prediction for the query instance is determined by majority voting over the categories of its nearest neighbours. Because every query instance must be compared against all training speech signals, kNN incurs a high response time (Pallabi & Bhavani, 2006). In this work, for each test speech signal (to be predicted), the distance from the test speech signal to each training speech signal is calculated to locate its k nearest neighbours in the training set. A Euclidean distance measure d_E(x, y) is used to compute the distance between training and testing signals, where x and y are speech signals composed of N features, x = {x_1, x_2, ..., x_N} and y = {y_1, y_2, ..., y_N}. The Euclidean distance is defined in Eq. (14):

$$d_E(x, y) = \sqrt{\sum_{i=1}^{N} (x_i - y_i)^2} \tag{14}$$

From this kNN category, the class label of the test speech signal is determined by applying majority voting among the k nearest training speech samples. The value of k therefore plays an important role in kNN classification. Generally, k should be determined in advance, and the best choice depends on the data: larger values of k reduce the effect of noise on the classification but make the boundaries between classes less distinct (Liu, Lee, & Lin, 2010). In this study, for each experiment, k values ranging between 1 and 10 were applied; for each value of k the experiment was repeated 10 times, each time with different training and testing sets.
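The kNN decision rule described above (Euclidean distance of Eq. (14) followed by majority voting over the k nearest training samples) can be sketched as follows; k = 4 is only a default here, while the paper sweeps k from 1 to 10.

```python
import numpy as np
from collections import Counter

def knn_predict(train_x, train_y, test_x, k=4):
    """Classify each test vector by majority vote among its k nearest training vectors."""
    predictions = []
    for query in test_x:
        # Euclidean distances (Eq. (14)) from the query to every training sample
        dists = np.sqrt(np.sum((train_x - query) ** 2, axis=1))
        nearest = np.argsort(dists)[:k]
        votes = Counter(train_y[nearest])
        predictions.append(votes.most_common(1)[0][0])
    return np.array(predictions)
```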

2.4.2. Linear Discriminant Analysis (LDA)

LDA is used for data classification and feature selection and has been widely applied in areas such as speech recognition, face recognition and image retrieval. In this work, a discriminant function is used to assign the speech features to one of two feature-vector classes. The aim of LDA is to find a linear transformation that maximizes class separability in a reduced-dimensional space (Park & Park, 2008). The linear transformation, also known as the discriminant function, is generated by transforming the speech feature vector into a projection feature vector. The discriminant function is

$$f_i = \mu_i\, S_w^{-1}\, x_k^{T} - \frac{1}{2}\, \mu_i\, S_w^{-1}\, \mu_i^{T} + \ln(P_i) \tag{15}$$

where μ_i is the mean feature vector of group i (i = 1, 2), x_k is the feature vector of sample k, and P_i is the prior probability of group i, i.e. the number of samples in that group divided by the total number of samples. Initially, the mean of each class (μ_1 for class 1, μ_2 for class 2) and the mean μ_3 of the total data set are computed, as shown in Eq. (16), where P_1 and P_2 are the prior probabilities of class 1 and class 2 respectively:

$$\mu_3 = P_1\, \mu_1 + P_2\, \mu_2 \tag{16}$$

In LDA, class discrimination is measured using the within-class scatter S_w and the between-class scatter S_b (Park & Park, 2008). The within-class scatter, which is the expected covariance of each class, is computed using Eq. (17), with the covariance matrix of each class given by Eq. (18). The between-class scatter, calculated using Eq. (19), can be regarded as the covariance of a data set whose members are the class mean vectors, taken about the mean vector of the total data set:

$$S_w = P_1\, \mathrm{cov}_1 + P_2\, \mathrm{cov}_2 \tag{17}$$

$$\mathrm{cov}_i = (x_i - \mu_i)(x_i - \mu_i)^{T} \tag{18}$$

$$S_b = \sum_i (\mu_i - \mu_3)(\mu_i - \mu_3)^{T} \tag{19}$$

All feature data are transformed by the discriminant function, so that the training data and the prediction data are drawn into the new coordinate system. The class label of a prediction sample is determined by comparing the values of f_i.
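A simplified two-class sketch of the discriminant function of Eqs. (15)–(17) is given below (per-class means, priors, prior-weighted within-class scatter, and prediction by the larger score f_i). It is an illustration under these assumptions, not the authors' implementation.

```python
import numpy as np

def lda_fit(train_x, train_y):
    """Estimate class means, priors and the pooled within-class scatter S_w (Eq. (17))."""
    classes = np.unique(train_y)
    means, priors = {}, {}
    s_w = np.zeros((train_x.shape[1], train_x.shape[1]))
    for c in classes:
        xc = train_x[train_y == c]
        means[c] = xc.mean(axis=0)
        priors[c] = len(xc) / len(train_x)
        s_w += priors[c] * np.cov(xc, rowvar=False)   # P_i * cov_i
    return classes, means, priors, np.linalg.inv(s_w)

def lda_predict(x, classes, means, priors, s_w_inv):
    """Eq. (15): f_i = mu_i S_w^{-1} x^T - 0.5 mu_i S_w^{-1} mu_i^T + ln(P_i)."""
    scores = [means[c] @ s_w_inv @ x
              - 0.5 * means[c] @ s_w_inv @ means[c]
              + np.log(priors[c])
              for c in classes]
    return classes[int(np.argmax(scores))]
```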

3. Results and discussion

In this section, the results of kNN and LDA are discussed under various conditions: different frame sizes, different a values and different percentages of window overlap. The average recognition accuracy versus frame length is depicted in Fig. 4. The experiment was conducted with 50% window overlap, a fixed a value of 0.9375 and varying frame lengths. It was experimentally determined that 30 ms frames gave better recognition accuracy than the other frame lengths for LPCC features; similarly, a 20 ms frame length produced the best recognition accuracy for MFCC features.

Fig. 4. Average recognition accuracy versus frame size for MFCC features and LPCC features.


Fig. 5. Average recognition accuracy versus a values for MFCC features and LPCC features.

Fig. 6. Average recognition accuracy versus percentage of window overlapping for MFCC features and LPCC features.

The highest average accuracies of the LPCC and MFCC features for the kNN classifier are 92.75% and 92.55% respectively. The highest average accuracy for LDA was 90.20%, obtained for LPCC features with a 20 ms frame length, while 88.24% was obtained for MFCC features with a 40 ms frame length. It can be concluded that LPCC generally slightly outperforms MFCC for both kNN and LDA across all frame lengths except 20 ms. The average recognition accuracy versus the a value of the first-order high-pass filter is presented in Fig. 5. For LPCC features, a = 0.98 produced the highest recognition accuracy of 92.35%, while the highest average accuracy of 91.76% was achieved by MFCC at a = 0.96. With the LDA classifier, the highest average accuracies of the LPCC and MFCC features, 91.57% and 88.82%, were obtained at a = 0.93. The results show that LPCC somewhat outperformed MFCC for kNN as the a value varied from 0.91 to 0.99, except at 0.92, 0.93, 0.94, 0.95 and 0.99. For LDA, the LPCC features give better results than MFCC for a values between 0.91 and 0.99, except at a = 0.96.

Fig. 6 shows the average recognition accuracy versus the percentage of window overlap. The LPCC features produced higher average recognition accuracy than the MFCC features as the window overlap was varied from no overlap to 75%. The highest average accuracy of the MFCC features, 92.16%, was obtained with 50% window overlap; the highest average accuracy of the LPCC features was also 92.16%, obtained with 33.33% and 75% window overlap.

Table 4
Best parameters for MFCC and LPCC features after experiments.

| Features | Frame length (ms) | Window overlapping (%) | a | Classifier | Accuracy (%) |
|---|---|---|---|---|---|
| MFCC | 20 | 50 | 0.9375 | kNN | 92.55 |
| MFCC | 20 | 50 | 0.9375 | LDA | 88.82 |
| LPCC | 30 | 75 | 0.9800 | kNN | 94.51 |
| LPCC | 30 | 75 | 0.9800 | LDA | 90.00 |


Fig. 7. Average recognition accuracy versus k-values for MFCC features and LPCC features.

Table 5
Some of the previous works pertaining to the recognition of prolongation and repetition.

| First author | Database | Language | Features | Classifiers | Best results (%) |
|---|---|---|---|---|---|
| Howell (Howell & Sackin, 1995) | – | English | Autocorrelation function and envelope parameters | ANNs | 80% |
| Howell (Howell et al., 1997a, 1997b) | 12 speakers (UCLASS) | English | Duration, energy peaks, spectral measures of whole words and part words | ANNs | 78.01% |
| Czyzewski (Czyzewski et al., 2003) | 6 normal speech samples + 6 stop-gap speech samples | Polish | Frequency, 1st to 3rd formant frequencies and their amplitudes | ANNs & rough set | 73.25% & ≥90.0% |
| Wiśniewski (Wiśniewski et al., 2007) | – | Polish | MFCC | HMMs | Approximately 80% |
| Ravikumar (Ravikumar et al., 2008) | 10 speakers | English | MFCC | Perceptron | 83% |
| Ravikumar (Ravikumar et al., 2009) | 15 speakers | English | MFCC | SVM | 94.35% |
| Current study | 39 speech samples (2 female + 36 male) obtained from UCLASS | English | MFCC | kNN / LDA | 92.55% / 88.82% |
| Current study | 39 speech samples (2 female + 36 male) obtained from UCLASS | English | LPCC | kNN / LDA | 94.51% / 90.00% |

From these results, the highest accuracies of the MFCC and LPCC features are equal for the kNN classifier, while the highest average accuracies obtained with the LDA classifier are 91.18% for LPCC and 88.04% for MFCC. In general, the LPCC features achieved slightly better recognition accuracy than the MFCC features as the window overlap was varied, for both the kNN and LDA classifiers, except at 50% overlap with the kNN classifier. It can be seen from Figs. 4–6 that the LPCC features provided slightly higher overall recognition accuracy than the MFCC features when the three parameters (frame length, a value and percentage of window overlap) were varied, for both the kNN and LDA classifiers. The best parameters for both feature sets were found to be a 30 ms frame with 75% overlap and a = 0.98 for LPCC, and a 20 ms frame with 50% overlap and a = 0.9375 for MFCC. The results of this analysis are summarized in Table 4. Using the best parameters of MFCC and LPCC, the average overall accuracy of both feature sets was compared across k-values, as reported in Fig. 7, to study the impact of k on accuracy. From Fig. 7, k = 4 was found to be the best k value for the LPCC features, classifying the stuttered events with 94.51% accuracy, whereas k = 2 was the best choice for the MFCC features, achieving 92.55%. Finally, it can be seen from the figure that the LPCC features provided slightly higher overall accuracy than the MFCC features for all k-values except k = 1 and k = 2.

The comparison between this study and previous works on the recognition of the two types of stuttered events, prolongation and repetition, is tabulated in Table 5. Based on Table 5, the LPCC and MFCC features give results comparable to previous works, but they cannot be compared directly because of differences in language, the way the databases were handled, the size of the databases and their distribution across gender and age groups.

4. Conclusion

In this study, two speech parameterizations were compared for the recognition of repetitions and prolongations in the available stuttered-event data. The results show that LPCC slightly outperforms MFCC in all the conditions examined: frame length, window-overlap percentage and a value of the first-order high-pass filter. The best configuration of 21 LPCC features gave the best accuracy of 94.51%, while the optimal configuration of 25 MFCC features gave a best accuracy of 92.55%. This may be because LPCC captures salient information from the stuttered events and slightly increases the ability to classify both stuttered events, repetition and prolongation. This study also shows that kNN and LDA can be used as classifiers for repetition and prolongation, and conventional validation was performed to compare the accuracy of kNN and LDA.


In future work, more samples, other feature extraction algorithms and other classification techniques may be used to improve the recognition accuracy of stuttered events.

References

Alexander, S., & Rhee, Z. (1987). An analysis of finite precision effects for the autocorrelation method and Burg's method of linear prediction. In Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP '87.
Alexander, S., & Zong, R. (1987). Analytical finite precision results for Burg's algorithm and the autocorrelation method for linear prediction. IEEE Transactions on Acoustics, Speech and Signal Processing, 35(5), 626–635.
Antoniol, G., Rollo, V. F., & Venturi, G. (2005). Linear predictive coding and cepstrum coefficients for mining time variant information from software repositories. In Proceedings of the 2005 International Workshop on Mining Software Repositories.
Awad, S. S. (1997). The application of digital speech processing to stuttering therapy. In Instrumentation and Measurement Technology Conference, 1997. IMTC/97, Proceedings of IEEE 'Sensing, Processing, Networking'.
Prakash, B. (2003). Acoustic measures in the speech of children with stuttering and normal non fluency – A key to differential diagnosis. In Workshop on Spoken Language Processing.
Chia Ai, O., & Yunus, J. (2006). Overview of a computer-based stuttering therapy. In Regional Postgraduate Conference on Engineering and Science (RPCES 2006), Johore.
Czyzewski, A., Kaczmarek, A., & Kostek, B. (2003). Intelligent processing of stuttered speech. Journal of Intelligent Information Systems, 21(2), 143–171.
Dhanalakshmi, P., Palanivel, S., & Ramalingam, V. (2009). Classification of audio signals using SVM and RBFNN. Expert Systems with Applications, 36(3, Part 2), 6069–6075.
Gajsek, R., & Mihelic, F. (2008). Comparison of speech parameterization techniques for Slovenian language. In 9th International PhD Workshop on Systems and Control: Young Generation Viewpoint.
Geetha, Y. V., Pratibha, K., Ashok, R., & Ravindra, S. K. (2000). Classification of childhood disfluencies using neural networks. Journal of Fluency Disorders, 25(2), 99–117.
Howell, P., Davis, S., & Bartrip, J. (2009). The UCLASS archive of stuttered speech. Journal of Speech, Language, and Hearing Research: JSLHR.
Howell, P., & Huckvale, M. (2004). Facilities to assist people to research into stammered speech. Stammering Research: An on-line journal published by the British Stammering Association, 1(2), 130.
Howell, P., & Sackin, S. (1995). Automatic recognition of repetitions and prolongations in stuttered speech. In Proceedings of the First World Congress on Fluency Disorders.
Howell, P., Sackin, S., & Glenn, K. (1997a). Development of a two-stage procedure for the automatic recognition of dysfluencies in the speech of children who stutter: I. Psychometric procedures appropriate for selection of training material for lexical dysfluency classifiers. Journal of Speech, Language, and Hearing Research, 40(5), 1073.
Howell, P., Sackin, S., & Glenn, K. (1997b). Development of a two-stage procedure for the automatic recognition of dysfluencies in the speech of children who stutter: II. ANN recognition of repetitions and prolongations with supplied word segment markers. Journal of Speech, Language, and Hearing Research, 40(5), 1085.
Huang, X., Acero, A., & Hon, H. (2001). Spoken language processing: A guide to theory, algorithm, and system development. Upper Saddle River, NJ, USA: Prentice Hall PTR.
Jothilakshmi, S., Ramalingam, V., & Palanivel, S. (2009). Unsupervised speaker segmentation with residual phase and MFCC features. Expert Systems with Applications, 36(6), 9799–9804.
Liu, C.-L., Lee, C.-H., & Lin, P.-M. (2010). A fall detection system using k-nearest neighbor classifier. Expert Systems with Applications, 37(10), 7174–7181.
Makhoul, J. (1975). Linear prediction: A tutorial review. Proceedings of the IEEE, 63(4), 561–580.
Nöth, E., Niemann, H., Haderlein, T., Decher, M., Eysholdt, U., Rosanowski, F., et al. (2000). Automatic stuttering recognition using hidden Markov models.
Pallabi, P., & Bhavani, T. (2006). Face recognition using multiple classifiers. In 18th IEEE International Conference on Tools with Artificial Intelligence, ICTAI '06.
Park, C. H., & Park, H. (2008). A comparison of generalized linear discriminant analysis algorithms. Pattern Recognition, 41(3), 1083–1097.
Picone, J. W. (2002). Signal modeling techniques in speech recognition. Proceedings of the IEEE, 81(9), 1215–1247.
Rabiner, L., & Juang, B. (1993). Fundamentals of speech recognition. Prentice Hall.
Ravikumar, K., Reddy, B., Rajagopal, R., & Nagaraj, H. (2008). Automatic detection of syllable repetition in read speech for objective assessment of stuttered disfluencies. In Proceedings of World Academy of Science, Engineering and Technology.
Ravikumar, K. M., Rajagopal, R., & Nagaraj, H. C. (2009). An approach for objective assessment of stuttered speech using MFCC features. ICGST International Journal on Digital Signal Processing, DSP, 9(1), 19–24.
Sin Chee, L., Chia Ai, O., Hariharan, M., & Yaacob, S. (2009). Automatic detection of prolongations and repetitions using LPCC. In 2009 International Conference for Technical Postgraduates (TECHPOS), 14–15 Dec. 2009.
Świetlicka, I., Kuniszyk-Jóźkowiak, W., & Smołka, E. (2009). Artificial neural networks in the disabled speech analysis. In Computer Recognition Systems 3 (Vol. 57/2009, pp. 347–354). Springer Berlin/Heidelberg.
Szczurowska, I., Kuniszyk-Jozkowiak, W., & Smolka, E. (2006). The application of Kohonen and multilayer perceptron networks in the speech nonfluency analysis. Archives of Acoustics, 31(4), 205.
Tian-Swee, T., Helbin, L., Ariff, A. K., Chee-Ming, T., & Salleh, S. H. (2007). Application of Malay speech technology in Malay speech therapy assistance tools. In International Conference on Intelligent and Advanced Systems, ICIAS 2007.
Van Borsel, J., Achten, E., Santens, P., Lahorte, P., & Voet, T. (2003). fMRI of developmental stuttering: A pilot study. Brain and Language, 85(3), 369–376.
Watts, D. M. G. (2006). Speaker identification – prototype development and performance. University of Southern Queensland.
Wiśniewski, M., Kuniszyk-Jóźkowiak, W., Smołka, E., & Suszyński, W. (2007a). Automatic detection of prolonged fricative phonemes with the hidden Markov models approach. Journal of Medical Informatics & Technologies, 11, 293–298.
Wiśniewski, M., Kuniszyk-Jóźkowiak, W., Smołka, E., & Suszyński, W. (2007b). Automatic detection of disorders in a continuous speech with the hidden Markov models approach. In Computer Recognition Systems 2 (Vol. 45/2008, pp. 445–453). Springer Berlin/Heidelberg.