Phonation biomechanic analysis of Alzheimer׳s Disease cases


Neurocomputing 167 (2015) 83–93


Pedro Gómez-Vilda a,*, Victoria Rodellar-Biarge a, Víctor Nieto-Lluis a, Karmele López de Ipiña b, Agustín Álvarez-Marquina a, Rafael Martínez-Olalla a, Miriam Ecay-Torres c, Pablo Martínez-Lage c

a Biomedical Technology Center, Universidad Politécnica de Madrid, Campus de Montegancedo, s/n, 28223 Pozuelo de Alarcón, Madrid, Spain
b Department of Systems Engineering and Automation, University of the Basque Country, Donostia, Gipuzkoa, Spain
c Neurology Department/CITA-Alzheimer Foundation, Donostia, Gipuzkoa, Spain

Article info

Article history: Received 2 December 2014; received in revised form 4 March 2015; accepted 6 March 2015. Communicated by Carlos Manuel Travieso-González. Available online 22 May 2015.

Keywords: Speech processing; Cognitive disease monitoring; Neurodegenerative speech modeling; Aging-care e-Health systems

Abstract

Speech production in patients suffering from dementia of the Alzheimer's type is known to experience noticeable changes with respect to normative speakers. Classically this kind of speech has been described as presenting altered prosody, altered rhythmic pace, anomia, or impaired semantics. Phonation, conceived as the production of voice in voiced speech fragments, remains an unexplored field. The aim of the present paper is to open a preliminary study presenting biomechanical estimates from phonation produced by two patients (male and female) suffering from Alzheimer's Disease (AD), contrasted with two controls of both genders (CS: control speakers). A vocal fold biomechanical model is inverted to obtain estimates of the vocal fold stiffness, in order to analyze significant segments of phonated speech such as long vowels and fillers. The estimates of both the AD patients and CS subjects are contrasted with a database of phonation features from a normative speaker population of both genders, as well as in paired tests contrasting AD and CS subjects. Results show the possibility of establishing significant discrimination between AD and CS when using f0, as well as vocal fold body stiffness, although this last feature seems to be more relevant and shows larger statistical significance. © 2015 Elsevier B.V. All rights reserved.

* Corresponding author.
http://dx.doi.org/10.1016/j.neucom.2015.03.087
0925-2312/© 2015 Elsevier B.V. All rights reserved.

1. Introduction

One hundred and ten years have passed since Alois Alzheimer's pioneering study of the case of Auguste Deter, which gave a first description of what we nowadays recognize to be the main cause of dementia, severely impairing quality of life in old age in developed societies, and a definitive cure for the disease is still pending. A first difficulty comes from the fact that AD seems to be the result of severe biochemical degradation of neural tissues in the brain cortex, eventually extending to other deeper structures, whose nature is still the subject of intensive studies, which are far from conclusive [1]. Thus, the difficulty of finding pharmacological treatments is still high, and strong research efforts have to be devoted to this task. A related problem is that establishing a clear differential diagnosis of AD-type dementias is very complicated, according to expert neurologists. The prevalence of AD is around 30 million people worldwide [2], with an incidence of 4.6 million new cases per year (one case every 7 s), so that the number of AD cases is expected to climb to 81.1 million by 2040 (nearly doubling every 20 years). With this panorama in mind it is of the utmost importance that applied technology contributes to medicine, pharmacology and biochemistry in helping to diagnose and monitor AD with a view to patient treatment.

It is in this scenario where speech sciences can contribute inexpensive methodologies and procedures for establishing an early diagnosis and a continued longitudinal monitoring of patients. Early diagnosis is especially relevant, as it can contribute to a better treatment, slowing the progression of the disease. Speech technologies may offer a simple way of assessing early symptoms by the analysis of AD speech characteristics based on the “dry-lab” concept, in which on-line objective analysis may be available to the neurologist in real time, without depending on other, more expensive tests with long turnaround times. This fact is already recognized by neurologists, as tests where the patient is asked to produce speech are currently part of the diagnostic protocol. Nevertheless, the evaluation of AD speech using that methodology may become rather subjective and dependent on the opinion of the specific rater. Therefore, from this point of view a good description of AD-affected speech is of the utmost interest.

It is well documented that AD speech is characterized by a decline in semantic abilities, as well as by deficits in discourse and prosody, known under the global name of non-fluent aphasia. Anomia (the difficulty of assigning names to things or persons) and single-word comprehension deficits are other related symptoms. Some studies also mention deficits in production, repetition and comprehension. Other evidence suggests a degree of vulnerability at the articulatory and phonological levels as well. Some related descriptions can be found in [3–6]. The main observable behavioral trends in AD speech can be summarized under the terms of dysfluency (aphasic, or anomic), dysprosody (emotional decay, production, repetition and comprehension) and agrammatism (deficient syntax). The following is a brief description of the behavioral observables associated with these descriptions:

• Dysfluency may result in broken speech rhythm; therefore, the control of speech presence will be crucial. Intervals with no speech and with speech will have to be carefully detected and measured.
• Dysprosody may be traced from the fundamental frequency f0 and the intensity of speech production. Repetition and comprehension are left for other kinds of studies.
• Agrammatism does not leave important acoustical marks and is to be studied using other techniques, such as automatic linguistic speech recognition [7].

The most important referents from the speech processing point of view are the acoustic correlates which can be associated with each behavioral observable. These are some of the most relevant ones:

• Sub-segmental intervals: syllables defined as associations of consonant–vowel structures (V, CV, VC, CVC or VCV, where V stands for vowel and C for consonant), and inter-syllabic pauses. Verbal rate, as the number of syllables produced per unit time, is an important feature.
• Fillers, as prolonged vowels in syllable endings or independent insertions (such as /uh/, /ah/, /eh/ and /hmmm/-like phonations).
• Supra-segmental intervals: groups of continuous voiced speech with no separation (phonation groups), and pause intervals separating these groups. Mean duration of pauses is another important feature.
• f0 or fundamental frequency profile, estimated for phonation groups, defining the prosodic contour of the sub- and supra-segmental intervals. Cepstral peak [8], SWIPE [9] and DYPSA [10] are popular methods for f0 estimation.
• Intensity profile, estimated for both the voiced and unvoiced segments of speech, defining the energy envelope of sub- and supra-segmental intervals. Low-pass energy envelope filtering and Teager–Kaiser algorithms are popular estimation methods currently used [11].
• Emotional arousal and temperature are related concepts which are currently being defined, based on complexity studies as well as on non-linear methods to establish an average estimation of the patient's emotional state [12].
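As an illustration of the Teager–Kaiser approach mentioned in the list above, the following minimal sketch applies the standard discrete Teager–Kaiser energy operator to a sampled signal. It is not taken from the cited works; the function name and the test tone are hypothetical, chosen only to show that for a pure tone of amplitude A and digital frequency Ω the operator returns the constant A² sin²(Ω).

```python
import math

def teager_kaiser_energy(x):
    """Discrete Teager-Kaiser operator: psi[n] = x[n]**2 - x[n-1]*x[n+1].

    Returns one value per interior sample (the two edge samples are skipped).
    """
    return [x[n] ** 2 - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]

# Hypothetical test tone: amplitude 0.5, digital frequency 0.2 rad/sample.
amplitude, big_omega = 0.5, 0.2
tone = [amplitude * math.cos(big_omega * n) for n in range(200)]
energy = teager_kaiser_energy(tone)

# For a pure tone the operator output is constant: (A * sin(Omega))**2.
expected = (amplitude * math.sin(big_omega)) ** 2
```

In practice the operator output would typically be smoothed before being used as an intensity profile; that step is left out of this sketch.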

Researchers have studied the degradation of speech at different linguistic levels, including the phonological and articulatory ones [4,6]. The syntactic and lexical-semantic levels have also been studied in [13–17], although these are out of the scope of the present work. The f0 contour and intensity profile, as well as temporal interval estimations, are usual features at the acoustic level, and can be found in the same studies. More recently, language and speech researchers have concentrated on finding markers for early diagnosis and for monitoring the progression of the disease [16–19]. In [20] five temporal measures are explored, finding that verbal rate, mean duration of pauses, and standardized phonation time are significant markers in the differentiation between AD and healthy subjects, and concluding that the length of pauses is more relevant than their number. Following this study, some researchers have used these indices, combining them with other measures, to develop algorithms for automating feature extraction and analysis processes. Several other researchers have investigated the prosodic production and recognition of emotions in AD [5,21,22], finding that these markers were significantly altered in AD patients. The inclusion of voice quality, defined as a vehicular index for emotional expression [23,24], instead of taking into account only f0 and intensity profiles, can be considered a relevant contribution.

The present paper is intended to carry out an exploratory study of the possible influence of AD on basic phonation. It is well known that diseases of neuro-motor origin affect vocal fold biomechanics [25]. As the nature of AD is mainly cognitive, it cannot be directly inferred that this type of disease influences biomechanics. Nevertheless, it is well documented in the literature that AD speech presents distortions in f0 and energy, as commented above. These correlates can be closely related to vocal fold biomechanics. Therefore, it is plausible that AD also affects vocal fold biomechanics. It is well known that f0 is sustained by the biomechanical parameters of the vocal folds, mainly by the mechanical stiffness of the vocal fold body. This factor is controlled by the laryngeal nerves, activated by the neuro-motor speech planner through the bulbar system. Although secondary neurons do not seem to be affected by AD progress (at least at early stages), it is true that in late stages of AD many patients succumb to pneumonic infections when the laryngeal reflex is impaired. These two facts, i.e., progressive dystonic phonation and loss of laryngeal reflex in AD, may indicate that some kind of laryngeal biomechanical deterioration may be expected from AD as well [26]. The intention of this research is to give some hints on the biomechanical analysis of phonation in AD patients (Section 2) to determine possible markers showing these effects in their statistical distributions, as an extension to [27].

The ultimate objective will be to launch a massive study on a large database, which is to be built based on these premises. Section 3 discusses the materials and methods used in this exploratory study, necessarily concentrated on a few study cases. Preliminary results are discussed in Section 4. Concluding observations and remarks are given in Section 5.

2. Biomechanical analysis of phonation

Studies of AD speech production have concentrated on the phonological and articulatory, syntactic, lexical-semantic, and acoustic levels, including both temporal measurements and f0 production, as discussed in Section 1. Not much interest has been devoted to phonation in itself (glottal source production in voiced speech). Apparently this line has not been pursued, allegedly because AD is a disease of cognitive origin, whereas phonation is considered mainly a feature controlled by neuro-motor activity (peripheral to the central nervous system) [28]. However, this view cannot be upheld without confirmation or refutation. The key technique used for the analysis of voice quality in the present work is adaptive vocal tract inversion to produce an estimate of the glottal source. From this signal an approach to vocal fold biomechanics becomes possible. Specific behavior of the vocal folds, such as muscular tension, is directly related to neuro-motor activity. Whether irregular behavior of this activity should be considered peripheral or central is a matter of investigation. This work proposes the use of accurate spectral domain techniques [29], which allow the estimation of a set of biomechanical parameters associated with a 2-mass model of the vocal folds [30] such as the one depicted in Fig. 1. Template (a) shows the classical body-cover structure of the vocal folds. Template (b) in turn gives a biomechanical model of the structures depicted in template (a).

Fig. 1. Vocal fold 2-mass biomechanical model used in the study. (a) Structural description of vocal folds seen in transversal section. The body and cover structures (in orange) behave mainly as dynamic masses. The visco-elastic ligaments (in green) behave mainly as damped springs. (b) Equivalent model in masses and visco-elasticities.

The average dynamic mass of the body contributing to vibration is represented by masses Mbr and Mbl, for the right and left vocal folds. These masses are attached to the rigid walls of the larynx cartilages by springs with stiffness given as Kbr and Kbl, respectively. The complex structure of the cover and visco-elastic ligaments in Reinke's space is represented by masses Mcr and Mcl, and the linking stiffness springs Kcr and Kcl. Transversal forces fc resulting from the pressure difference between the sub-glottal and supra-glottal chambers as well as from the flow of air (intra-glottal forces) act on both systems. As a result the masses move with transversal velocities vcl, vcr, vbl and vbr along the horizontal axis (there is also a vertical component, assumed null in the present model). It may be shown that in this biomechanical model the relation between acting forces (fc) and the transversal velocities, for the case of the body (musculus vocalis), can be expressed by the trans-admittance functional given as

$$ T_b(\omega; \mu_b, \xi_b, \sigma_b) = \left[ \left( \omega \mu_b - \omega^{-1} \xi_b \right)^2 + \sigma_b^2 \right]^{-1} \tag{1} $$

where μb, σb and ξb stand for the estimates of the massive, viscous and elastic parameters of the vocal fold body biomechanical model, corresponding to Mbl,r, Rbl,r and Kbl,r respectively. It may be shown [29] that under certain assumptions the modulus of this functional can be associated with the power spectral density of the glottal source sr(t), a sound wave corresponding to the supraglottal pressure just at the point where the glottal flow is injected. The reconstruction of the glottal source sr(t) from a voiced segment of speech (preferably an open vowel) requires the inversion of the vocal tract [29]. The power spectral density of the glottal source sr(t)

$$ \| S_r(\omega) \| = \left\| \int_{-\pi}^{\pi} s_r(t) \, e^{-j \omega t} \, dt \right\| \tag{2} $$

can be associated with the trans-admittance functional in (1) to estimate the biomechanical parameters μb, σb and ξb by minimizing the cost function

$$ L(\omega; \mu_b, \xi_b, \sigma_b) = \oint_{2\pi} \left( \| S_r(\omega) \| - \| T_b(\omega; \mu_b, \xi_b, \sigma_b) \| \right)^2 d\omega \tag{3} $$

Fig. 2. Simplified electromechanical model of the body dynamic mass, losses and stiffness when stimulated by intra-glottal forces. A similar model stands for cover dynamics.
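The fitting idea behind the cost function in (3) can be sketched numerically. The code below is only an illustration under stated assumptions, not the estimation procedure of [29]: the trans-admittance is taken as [(ωμ − ξ/ω)² + σ²]⁻¹, the integral is replaced by a sum over a uniform frequency grid, and the search runs over a hypothetical list of candidate stiffness values with mass and losses held fixed. All function names and numeric values are invented for the example.

```python
import math

def trans_admittance(omega, mu, xi, sigma):
    """Assumed form of Eq. (1): [(omega*mu - xi/omega)**2 + sigma**2]**-1."""
    return 1.0 / ((omega * mu - xi / omega) ** 2 + sigma ** 2)

def cost(omegas, spectrum, mu, xi, sigma):
    """Discrete stand-in for Eq. (3): squared mismatch summed on a uniform grid."""
    d_omega = omegas[1] - omegas[0]
    return d_omega * sum(
        (s - trans_admittance(w, mu, xi, sigma)) ** 2
        for w, s in zip(omegas, spectrum)
    )

def fit_stiffness(omegas, spectrum, mu, sigma, xi_candidates):
    """Pick the candidate stiffness with the lowest cost (mass and losses fixed)."""
    return min(xi_candidates, key=lambda xi: cost(omegas, spectrum, mu, xi, sigma))

# Synthetic check with hypothetical parameters: resonance placed at 100 Hz.
mu_true, sigma_true = 1.0, 50.0
xi_true = (2 * math.pi * 100.0) ** 2 * mu_true
omegas = [2 * math.pi * f for f in range(50, 301, 5)]
spectrum = [trans_admittance(w, mu_true, xi_true, sigma_true) for w in omegas]
best_xi = fit_stiffness(omegas, spectrum, mu_true, sigma_true,
                        [0.5 * xi_true, xi_true, 1.5 * xi_true])
```

A one-dimensional candidate search is used here only to keep the example short; a real fit would have to estimate all three parameters jointly.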

The numerical estimation of the biomechanical parameters and, particularly, of the stiffness induced by the neuro-motor activity on the transversal and oblique laryngeal muscles controlling phonation (considered proportional to ξb) can be carried out using different approaches. The electromechanical equivalent, represented by the system shown in Fig. 2, is used to determine the parameters of the body and cover dynamics. The estimation of the body parameters from the glottal source power spectral density is not complicated, because this magnitude is well conditioned in frequency. The process of estimation starts with the determination of the dynamic masses from the power spectral profile of the glottal source, given as

$$ M_{bl,r} = \frac{\omega_2}{\omega_2^2 - \omega_r^2} \left[ \frac{T_r - T_2}{T_r T_2} \right]^{1/2} \tag{4} $$

T2 and ω2 being the amplitude and angular frequency corresponding to the second harmonic in the glottal source power spectral density, and Tr and ωr being the respective ones at the resonance frequency given by

$$ \omega_r^2 = \frac{K_{bl,r}}{M_{bl,r}} \tag{5} $$

where the square modulus of the admittance in (1) is rewritten as

$$ T(\omega) = \frac{1}{R_{bl,r}^2 + \varpi^2 M_{bl,r}^2} \tag{6} $$

whereas the frequency relative to the resonance point is defined as

$$ \varpi = \frac{\omega^2 - \omega_r^2}{\omega} \tag{7} $$

and the two selected points in the power spectral density of the glottal source, corresponding to the peak and the second harmonic, are given as

$$ T_r = T(\omega = \omega_r) = \frac{1}{R_{bl,r}^2}, \qquad T_2 = T(\omega = 2\omega_r) \tag{8} $$
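The chain of Eqs. (4)–(8) can be followed step by step in a short sketch: losses from the spectral value at resonance, mass from the two spectral points, and stiffness from the resonance relation. This is an illustrative reading of those equations, assuming that the resonance frequency equals the measured f0 and that the second harmonic lies at 2ωr; the function name and the numeric values are hypothetical.

```python
import math

def body_parameters(f0, t_r, t_2):
    """Closed-form body estimates from two points of the spectral profile.

    f0  : fundamental frequency in Hz, assumed to mark the resonance omega_r
    t_r : spectral value at omega_r, Eq. (8): T_r = 1 / R**2
    t_2 : spectral value at the second harmonic, omega_2 = 2 * omega_r
    Returns (losses R, dynamic mass M, stiffness K).
    """
    omega_r = 2.0 * math.pi * f0
    omega_2 = 2.0 * omega_r
    losses = 1.0 / math.sqrt(t_r)                                   # from Eq. (8)
    mass = (omega_2 / (omega_2 ** 2 - omega_r ** 2)
            * math.sqrt((t_r - t_2) / (t_r * t_2)))                 # Eq. (4)
    stiffness = omega_r ** 2 * mass                                 # Eq. (5)
    return losses, mass, stiffness

# Round trip on hypothetical values: build T_r and T_2 from Eqs. (6)-(7),
# then recover the parameters that generated them.
R_true, M_true, f0 = 2.0, 0.01, 110.0
w_r = 2.0 * math.pi * f0
varpi_2 = ((2.0 * w_r) ** 2 - w_r ** 2) / (2.0 * w_r)               # Eq. (7)
t_r = 1.0 / R_true ** 2
t_2 = 1.0 / (R_true ** 2 + varpi_2 ** 2 * M_true ** 2)              # Eq. (6)
losses, mass, stiffness = body_parameters(f0, t_r, t_2)
```

The round trip makes the dependence on f0 explicit: any error in f0 shifts ωr and propagates into both the mass and the stiffness estimates.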

The evaluation methodology must first produce a very accurate estimation of f0, which is used to evaluate ωr. This leads to the determination of the losses from (8) and of the mass and stiffness from (4) and (5), respectively. Fig. 3 shows a sample of how the reconstruction of the glottal source works. The stiffness of the vocal fold body is the most meaningful biomechanical parameter because it is directly and strongly related to the neuron firing rate acting on the laryngeal muscles [28], and retains neurologic disease behavior in marks of hypo- and hyper-tension, as well as in tremor [25]. Important correlates quantifying neurodegenerative behavior in speech are thus vocal fold stress, as well as its statistical dispersion.

Fig. 3. Example of glottal source reconstruction: (a) original segment of phonated speech; (b) residual after vocal tract removal; (c) glottal source; (d) glottal flow. The red stars in the glottal source negative spikes mark the minimum flow declination rate (MFDR), and the negative spikes are considered the source of excitation of the vocal tract.

Besides stiffness, the following parameters, from a list of 68 estimated from the glottal source, are of interest (parameter numbers refer to the set evaluated in [31]):

• Parameter 1: fundamental frequency f0.
• Parameter 2: jitter, relative variation of fundamental frequency between each two neighbor phonation cycles.
• Parameter 3: shimmer, relative variation of the glottal source area between each two neighbor phonation cycles.
• Parameter 35: dynamic mass of the body μb, average of Mbl and Mbr.
• Parameter 37: stiffness parameter of the body ξb, average of Kbl and Kbr.
• Parameter 38: unbalance of dynamic body mass, relative variation between each two neighbor cycles k and k−1: 2(μb,k − μb,k−1)/(μb,k + μb,k−1).
• Parameter 40: unbalance of body stiffness, relative variation between each two neighbor cycles k and k−1: 2(ξb,k − ξb,k−1)/(ξb,k + ξb,k−1).
• Parameter 41: dynamic mass of the cover μc, average of Mcl and Mcr.
• Parameter 43: stiffness parameter of the cover ξc, average of Kcl and Kcr.
• Parameter 44: unbalance of dynamic cover masses, relative variation between each two neighbor cycles k and k−1: 2(μc,k − μc,k−1)/(μc,k + μc,k−1).
• Parameter 46: unbalance of cover stiffness, relative variation between each two neighbor cycles k and k−1: 2(ξc,k − ξc,k−1)/(ξc,k + ξc,k−1).

Estimates of each parameter on a balanced database (HUGMDB) of 50 male and 50 female normative speakers, collected and evaluated by endoscopy and GRBAS at Hospital Universitario Gregorio Marañón in Madrid (Spain), are used as a normative database. See Fig. 4 for the normative gender distributions of f0 and the body stiffness ξb. These distributions will be used in the study described in Section 4 to fix the normative baseline reference.
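The unbalance parameters in the list above share one formula, the relative variation between neighbor cycles k and k−1, which also underlies jitter and shimmer when applied to f0 or amplitude. A minimal sketch follows; the function name and the sample stiffness values are hypothetical and only illustrate the arithmetic.

```python
def relative_variation(values):
    """Cycle-to-cycle relative variation: 2*(x[k] - x[k-1]) / (x[k] + x[k-1]).

    `values` holds one estimate per phonation cycle (e.g. f0, body mass or
    body stiffness); the result has one entry per pair of neighbor cycles.
    """
    return [2.0 * (cur - prev) / (cur + prev)
            for prev, cur in zip(values, values[1:])]

# Hypothetical per-cycle body stiffness estimates (arbitrary units).
stiffness_per_cycle = [10.0, 12.0, 12.0, 9.0]
unbalance = relative_variation(stiffness_per_cycle)
```

Because the difference is divided by the sum of the two neighbor values, the measure is dimensionless and can be compared across parameters with different units.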

3. Materials and methods

The materials used in the present study were supplied by Foundation CITA Alzheimer in Donostia (Basque Country). These corresponded to running speech from interviews with Alzheimer Disease patients (AD) and healthy control subjects (CS). The speech of AD and CS was recorded while the speaker described a picture with a specific scene including a landscape, people and common objects of daily life. Recordings were taken at 44.1 kHz on a digital tape recorder. As the recordings were not initially intended for automatic speech analysis, audio was band-limited to 15 kHz prior to being stored. Due to the exploratory nature of the present work, the decision to use speech recordings “as they were” was consciously taken, among other things to face real problems and to serve as the basis for designing proper recording protocols. Therefore the results are conditioned by this decision, and it is expected that they can be greatly improved if higher quality recordings are used.

Two types of analysis were conducted. The first one consisted in estimating the fundamental frequency f0 and the signal intensity along the phonated segments of speech. This will be referred to as Frequency and Intensity Analysis (FIA). This type of analysis is already a classical approach to the problem. The second type of analysis is focused on estimating the biomechanical parameters associated with phonated segments within the vocalic nuclei, and especially the stiffness of the vocal fold, using a specific tool developed initially for voice quality analysis in organic laryngeal pathology characterization by acoustic analysis [30]. This will be referred to as Bio-Mechanical Analysis (BMA). This type of study is prospective, as not much has been done before in the field of voice production in AD patients. Some restrictions had to be taken into account when estimating phonation features in AD patients or in CS speakers in running speech:

• Speech segments acceptable for phonation analysis must be of clear vocalic nature; therefore long tonic syllables lasting at least 50 ms should be used to provide enough glottal pulses to give some statistical stability to estimates.
• Nasal vowels and consonants should be initially disregarded due to constraints imposed by inverse filtering.
• Analysis should not be extended to segments exceeding 150 ms because the vocal tract conditions may change during running speech phonation, and vowels may change in their phonation tension because of the message conveyed rather than the speaker's phonation condition (this was especially noticeable for case CSF000008, see next section).
• Non-modal phonation styles, such as creaky or falsetto, should be initially removed from the study.
• Diphthongs and glides are excluded.

Around 100 ms from vocalic nuclei in stable vowels were used in the study.

Fig. 4. Reference distributions for f0 (top) and body stiffness (bottom) derived from the HUGMDB database of 50 male and 50 female normative speakers. The male distribution for f0 is more slender than the female distribution, and shows a peak around 110 Hz, whereas the female distribution seems to be multimodal around 210 Hz. The distributions for the vocal fold body stiffness (bottom) are also different, the male one being more slender with a mode around 11.560 N/m whereas the female distribution is more widespread with no clear mode, its median being around 20.800 N/m.

4. Results and discussion

The results presented in this section correspond to the two main types of analysis on a group of AD patients and CS speakers as listed in Table 1.

Table 1 Study case description. CSM: control subject (male). CSF: control subject (female). ADM: Alzheimer Disease patient (male). ADF: Alzheimer Disease patient (female).

Code        Diagnosis        Gender  Age
CSF000008   Healthy control  F       53
ADF000003   AD patient       F       65
CSM000006   Healthy control  M       53
ADM000011   AD patient       M       68

The first type of analysis (FIA) evaluated and compared the fundamental frequency f0 and the signal intensity, estimated every 5 ms on phonation groups from speech intervals around 25–30 s long, using a sliding window. Cepstral f0 detection was used. The results can be seen in Figs. 5 and 6. The four templates in Fig. 5 give a perspective of the different behavior of the long time analysis (LTA) f0 histograms for the four subjects studied. The first difference appears in the kurtosis of the four distributions, which are more slender for controls. The second difference appears in the second mode of the control histograms, which is found about a fifth above the main mode (100 Hz and 150 Hz for CSM06, and 160 Hz and 230 Hz for CSF08). In general, second modes are smaller and less neat in AD cases, and are associated with the end of phonation groups, except when these fall at a sentence end. The third difference is the presence of multiple small modes in AD patients well above the main mode, due to whinny phonation at the ends of phonation groups. The fourth difference is the presence of small modes below the main one in AD patients, indicating the appearance of creaky or ‘broken’ speech typical of aging. Some of these results will be better explained later using biomechanical parameter analysis.

The LTA intensity histograms in Fig. 6 also reveal a slightly different behavior in AD patients relative to controls. The controls show a larger, gentle dispersion in the shape of a foothill extending from the main peak, a mark of more vivid intensity modulation than in AD patients, who in turn show a main peak with much less activity in the higher intensity bins, possibly due to less vivid intensity modulation. Assessing whether these findings reflect a general behavior of AD patients relative to controls is out of the scope of this study, but it is an interesting objective for wider population studies.
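The long time analysis histograms discussed above can be sketched as follows: frame-level f0 values are binned, counts are normalized to the total number of counts, and the main mode is read from the highest bin. This is a minimal illustration, not the tool used in the study; the bin limits, bin width and sample values are hypothetical.

```python
def lta_histogram(values, lo, hi, bin_width):
    """Histogram of `values` over [lo, hi), normalized to the total count."""
    n_bins = int(round((hi - lo) / bin_width))
    counts = [0] * n_bins
    for v in values:
        if lo <= v < hi:
            # Clamp the index to guard against floating-point edge effects.
            k = min(int((v - lo) // bin_width), n_bins - 1)
            counts[k] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * n_bins

def main_mode(hist, lo, bin_width):
    """Center frequency of the highest histogram bin."""
    k = max(range(len(hist)), key=hist.__getitem__)
    return lo + (k + 0.5) * bin_width

# Hypothetical frame-level f0 estimates in Hz (one value every 5 ms).
f0_frames = [100.0, 101.0, 102.0, 150.0]
hist = lta_histogram(f0_frames, lo=50.0, hi=250.0, bin_width=10.0)
mode_hz = main_mode(hist, lo=50.0, bin_width=10.0)
```

A second mode about a fifth above the main one would appear as a smaller peak near 1.5 times `mode_hz`, as in the 100 Hz and 150 Hz pair reported for CSM06.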
Fig. 5. From top to bottom: results from the comparison of f0 profiles (fundamental frequency) from controls CSM06 (male, 53y) and CSF08 (female, 53y), and patients ADM11 (male, 68y) and ADF03 (female, 65y). Vertical axes are normalized to total number of counts. Horizontal axes are given in Hz.

Fig. 6. From top to bottom: results from the comparison of intensity profiles from controls CSM06 (male, 53y) and CSF08 (female, 53y), and patients ADM11 (male, 68y) and ADF03 (female, 65y). Vertical axes are normalized to total number of counts. Horizontal axes are given in relative intensity level.

It has to be said that the differentiation capability of the behavior observed for f0 and intensity in the cases studied cannot grant enough statistical significance to clearly dissociate AD from controls on the sole basis of f0 and intensity. A test considering the identity between AD and control distributions as the null hypothesis fails to reject it with a p-value below 10⁻⁶ for a confidence level of 0.05. This means that significant distinctions between the AD and the CS speaker conditions cannot be established based solely on f0 and intensity histograms of counts for the four speakers compared. This relative inability of LTA histograms to mark significant differences between AD cases and controls led us to consider a more precise analysis of f0, detecting it on the glottal source from vowel nuclei after vocal tract inversion. This is a more elaborate and sensitive methodology based on short time analysis (STA), as f0 is estimated from the period histograms obtained after detecting the negative spikes associated with the MFDR (see Fig. 3c) in each vowel nucleus [10]. The plots in Fig. 7 show the results of applying this methodology to the longest vowel nuclei present in the speech of the four cases studied.

The first impression is that both male and female controls show a span of vowel nuclei below their normative distributions. The dispersion of each estimate is not large. Besides, the tonal line given by the average red balls is stable (a bit below 100 Hz for the male control CSM06, and a bit below 160 Hz for the female control CSF08). Observe that these results are well in agreement with their respective LTA histograms in Fig. 5. In the case of the AD patients, the results for the vowel nuclei studied are also below the respective normative distributions (around 90 Hz for ADM11 and 140 Hz for ADF03), also well in agreement with their LTA histograms shown in Fig. 5, and the dispersion of each estimate is even smaller than the corresponding ones from the controls, but the tonal line described by the average red balls corresponding to separate phonations is much less stable, showing noticeable jumps. This result is really striking, as it was considered that AD patients are not prone to show prosodic capabilities, mainly based on the modulation of f0. But considering it in the light of the LTA results, it seems that the instability in the average f0 derived from vowel nuclei does not come from conscious prosodic planning, but from the inability to sustain a uniform and stable phonation tension. Assessing whether this finding is due to impaired neuro-motor activity or not is an objective for further study.

The final objective of this study is to shed light on the possible role of phonation biomechanics in AD speech. To this end, estimates of the vocal fold body stiffness were derived from the same vowel nuclei used for the STA of f0. The results are shown in Fig. 8.
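The short time analysis of f0 described in this section, where f0 is derived from the intervals between consecutive MFDR spikes in a vowel nucleus, can be illustrated with a small sketch. Spike detection itself is not shown; the spike instants are assumed to be already available, and the function names and sample values are hypothetical.

```python
import statistics

def cycle_f0(spike_times):
    """Per-cycle f0 in Hz from consecutive MFDR spike instants in seconds."""
    return [1.0 / (t1 - t0) for t0, t1 in zip(spike_times, spike_times[1:])]

def nucleus_summary(spike_times):
    """Mean and population standard deviation of f0 within one vowel nucleus."""
    f0 = cycle_f0(spike_times)
    return statistics.mean(f0), statistics.pstdev(f0)

# Hypothetical spike instants for a perfectly regular 100 Hz phonation.
spikes = [0.00, 0.01, 0.02, 0.03, 0.04]
mean_f0, std_f0 = nucleus_summary(spikes)
```

Plotting one such mean and standard deviation per vowel nucleus gives the kind of tonal line referred to in the text, whose stability across nuclei can then be compared between speakers.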

Fig. 7. Statistical distributions of f0 for the longest vowel nuclei for the four cases studied. Each vowel segment included in the study is labeled on the horizontal axis. The label gives the subject code, and the start and end of the segment analyzed in hundredths of seconds. The vertical axis gives frequency in Hz. Each distribution is marked by a central red ball (mean) and two horizontal lines (mean plus and minus one standard deviation). The blue ball at the left of each plot gives the mean of the male or female population of reference, estimated from the balanced database (HUGMDB, see Section 2 and Fig. 4). The standard deviation around the average is also given by two horizontal lines. The numbers of vowel nuclei analyzed are respectively 10 and 11 for the controls, and 10 and 6 for the AD patients.

The results are to be analyzed from the same perspective as for f0: comparison of averages against their normative distributions, relative dispersion of estimates, and stability of the average stiffness line. Comparing the alignment of averages relative to their normative distributions, it may be seen that the male control CSM06 is near or below the lower distribution mark given by the average minus one standard deviation, whereas the female control overlaps her normative distribution. The AD patients are below their normative distributions except for one vowel nucleus each. The male control CSM06 and the two AD subjects show low intra-nucleus dispersion compared to their normative distributions, whereas the female control CSF08 shows very strong intra-nucleus dispersion. Comparing this result with the performance of the same speaker in f0 (relatively small dispersion and stable tonal line), one can conclude that the ability to sustain stable phonation with highly changing vocal fold stiffness may be due to a very healthy and flexible vocal use, as changes in stiffness have to be compensated by changes in dynamic mass, which implies the presence of a very healthy and complex mucosal wave in CSF08.

It seems that a speaker with intact production–perception structures is able to tune f0 in quasi-real time within a wider span of vocal fold stiffness, perhaps due to fine control by hearing phonation-production in a feed-back loop. The confirmation of this observation is also the subject of further study. Finally, the stability of the phonation tension is large for CSM06, relatively low for ADM11 and ADF03, and very low for CSF08. Analyzing the semantics of this finding is also the subject of further study. As the consequences derived from these simple observations are not conclusive enough, hypothesis testing was carried out to better assess the results. Due to the specificity of the data available, the following tests were proposed:

BMA.a) Features from each subject are compared against the normative population from HUGMDB according to their respective gender.
BMA.b) Features from AD subjects are compared against their CS counterparts.
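A comparison of two means under a Gaussian assumption, as used for this kind of test, can be sketched with a two-sided z statistic. This is only one common form of a two-mean test and is not necessarily the exact procedure followed in the study; the function name and the numeric values are hypothetical.

```python
import math

def two_mean_z_test(mean_a, std_a, n_a, mean_b, std_b, n_b):
    """Two-sided z test for the difference of two means.

    Assumes approximately Gaussian sample means with known (or well
    estimated) standard deviations. Returns (z, p_value).
    """
    standard_error = math.sqrt(std_a ** 2 / n_a + std_b ** 2 / n_b)
    z = (mean_a - mean_b) / standard_error
    # Two-sided tail probability of a standard normal: erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

# Hypothetical f0 summaries in Hz: one speaker (10 vowel nuclei) against a
# normative population of 50 speakers.
z, p = two_mean_z_test(95.0, 5.0, 10, 110.0, 15.0, 50)
reject_null = p < 0.05
```

With the small numbers of vowel nuclei available per speaker, a t-based variant might be preferable; the z form is shown only because it needs nothing beyond the standard library.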


Fig. 8. Statistical distributions of ξb for the longest vowel nuclei for the four cases studied. Each vowel segment included in the study is labeled on the horizontal axis. The label gives the subject code and the start and end of the segment analyzed, in hundredths of a second. The vertical axis gives stiffness in 10⁻³ N/m. Averages for each vowel nucleus are given by red balls. Reference distributions are given by blue balls. Standard deviations are marked by short horizontal lines above and below. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

The tests will be based on a null hypothesis assuming no difference in the feature sample distributions between each subject and the normative population, and between AD patients and CS controls. Gaussian conditions are assumed although the actual distributions in Fig. 4 are not strictly Gaussian. It may be seen that although f0 distributions are relatively narrow, their counterparts for ξb are more widespread. In general, this seems to be a typical behavior of healthy controls. The question now is how to evaluate the estimates from each of the four cases considered regarding their phonation condition. In other words, having the normative distributions shown in Fig. 4 approximated by Gaussians, what would be the phonation conditions for each of the speakers CSM000006, CSF000008, ADM000011 and ADF000003 relative to the normative ones? The parameter hypothesis tests used were based on two-mean comparisons [32]. Regarding the first test (BMA.a), the means of all the estimations from each speaker were tested against the means of the normative distributions of the reference database. The features used in the test were the fundamental frequency f0 and the vocal fold body stiffness ξb. The null hypothesis assumed that both means were produced by the same distribution. The significance level was set at 0.05. The results are listed in Table 2. The mean f0 and its standard deviation for each speaker are given in the second and third columns (from the left). The corresponding parameters for the vocal fold body stiffness ξb are given in the fourth and fifth columns. The two right-most columns give the corresponding p-values resulting from the test.

Table 2
Hypothesis test results based on Gaussianity with reference to the normative database HUGMDB.

Code     Mean f0 (Hz)   Std. Dev. f0 (Hz)   Mean ξb (10⁻³ N/m)   Std. Dev. ξb (10⁻³ N/m)   p-Value f0   p-Value ξb
CSM06    97.54          3.35                9750                 370                       0.135        0.153
ADM11    86.02          8.90                8748                 702                       0.044        0.059
CSF08    155.77         8.70                22434                5118                      0.018        0.237
ADF03    149.63         22.73               15354                3488                      0.010        0.027
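A test of this kind, in which a subject's mean is contrasted with a Gaussian approximation of the normative distribution, might be sketched as follows. This is only an illustrative sketch: the function names, the choice of a one-sided tail and the normative mean and standard deviation used in the example are assumptions and are not taken from the paper or from HUGMDB. Only the subject mean (97.54 Hz for CSM06) comes from Table 2.

```python
import math


def normal_cdf(z: float) -> float:
    """Cumulative distribution function of the standard Gaussian."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def normative_z_test(subject_mean: float, norm_mean: float, norm_std: float):
    """Contrast a subject mean with a Gaussian normative distribution.

    Returns the z-score and the one-sided tail probability of observing a
    value at least as far from the normative mean as the subject mean.
    The one-sided tail is an assumption made for this sketch.
    """
    z = (subject_mean - norm_mean) / norm_std
    p_value = normal_cdf(-abs(z))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical normative parameters (placeholders, NOT values from HUGMDB).
    HYPOTHETICAL_NORM_MEAN_HZ = 120.0
    HYPOTHETICAL_NORM_STD_HZ = 20.0
    # Subject mean f0 for CSM06, taken from Table 2.
    z, p = normative_z_test(97.54, HYPOTHETICAL_NORM_MEAN_HZ, HYPOTHETICAL_NORM_STD_HZ)
    print(f"z = {z:.3f}, one-sided p = {p:.3f}")
```

Since the normative parameters are placeholders, the printed numbers are not expected to match Table 2. The sketch only shows how a p-value follows from the distance between a subject mean and the normative mean, measured in normative standard deviations.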


Table 2 allows a preliminary evaluation of both controls and patients. For control CSF000008, the p-value for f0 (0.018) is below 0.05, so the null hypothesis must be rejected and, as far as f0 is concerned, this subject should not be included in the normative population (H0 states that the subject's estimated distribution cannot be differentiated from that of the normative population, so rejecting it is equivalent to admitting that her distribution cannot be considered normative). According to the p-value for the vocal fold body stiffness (0.237), however, H0 cannot be rejected, and the control has to be classified within the normal population. This is an important result: as far as f0 is concerned, the voice of this person (a 53-year-old female) has a pitch well below the average, but according to the vocal fold tension she is completely normative. Female voice variability due to hormonal conditions, as well as other related factors (voice health, life habits, etc.), should be carefully taken into account when selecting control subjects. Patient ADF000003 is clearly outside the normative standards under both f0 and vocal fold tension, as the respective p-values (0.010 and 0.027) force the rejection of H0. Case CSM000006 is well within normative standards, as H0 cannot be rejected according to the respective p-values (0.135 and 0.153), both well over 0.05. Patient ADM000011 is on the boundary of being excluded from the normative group, as H0 must be rejected according to f0 (p-value equal to 0.044), but cannot be rejected according to vocal fold tension (p-value equal to 0.059). These results will influence the patient vs. control tests given later in Table 3.
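The decisions discussed above follow from applying the 0.05 threshold to the p-values of Table 2. A minimal sketch of that rule is given below. The p-values and the significance level are those reported in the text, while the function and variable names are illustrative only.

```python
ALPHA = 0.05  # significance level stated in the text

# p-values copied from Table 2: (p-value for f0, p-value for body stiffness)
TABLE2_P_VALUES = {
    "CSM06": (0.135, 0.153),
    "ADM11": (0.044, 0.059),
    "CSF08": (0.018, 0.237),
    "ADF03": (0.010, 0.027),
}


def reject_null(p_value: float, alpha: float = ALPHA) -> bool:
    """H0 (the subject is indistinguishable from normative) is rejected if p < alpha."""
    return p_value < alpha


def classify(p_values=TABLE2_P_VALUES, alpha: float = ALPHA):
    """Return, per subject, whether H0 is rejected for f0 and for stiffness."""
    return {
        code: (reject_null(p_f0, alpha), reject_null(p_xi, alpha))
        for code, (p_f0, p_xi) in p_values.items()
    }


if __name__ == "__main__":
    for code, (rej_f0, rej_xi) in classify().items():
        print(f"{code}: reject H0 on f0 = {rej_f0}, on stiffness = {rej_xi}")
```

Under this rule only ADF03 is rejected on both features and only CSM06 on neither, which matches the reading of Table 2 given in the text.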
Regarding the second test (BMA.b), bearing in mind that CSM000006 and CSF000008 were labeled as healthy controls free from AD, the estimates from these speakers could be used to check the estimates from ADF000003 and ADM000011 in hypothesis tests based on Student's t distributions, given the size of the samples. The results of these tests are given in Table 3. In this case the means of the phonation estimates from a given AD speaker were contrasted with the means of the phonation estimates from the CS speaker of the same gender in two-mean comparisons. The normalized t-scores of the mean differences are given in the second column (f0) and the fourth column (ξb) from the left. The corresponding p-values are given in the third column (f0) and the fifth column (ξb). Again, a significance level of 0.05 was used as the null acceptance-rejection threshold. It may be seen now that when comparing ADF000003 against CSF000008 the null hypothesis cannot be rejected on the basis of f0, but it has to be rejected in view of vocal fold tension, as the p-value for f0 is over 0.05, but for ξb it is below 0.05. In other words, these AD and CS speakers are not different enough on the basis of f0, but they are clearly different on the basis of vocal fold tension (note that the phonation of ADF000003 is clearly hypotonic on average when compared with the female distribution in Fig. 8, except for one vowel nucleus). On the other hand, comparing ADM000011 against CSM000006 produces p-values for ξb well below the significance level, allowing the rejection of the null hypothesis and helping to classify ADM000011 as “hypotonic” and CSM000006 as a good control subject. The reference threshold considered for the p-values is the significance level customarily fixed at 0.05 in the literature to accept or reject the null hypothesis. Of course this threshold is arbitrary, and others may be used. The main reason for its use is that it fits well with the two-standard-deviation interval in Gaussian statistics, and it is therefore generally accepted by researchers as good practice. Once a large database of patients is available (a task to which Foundation CITA is strongly committed) there will be enough material to establish adequate criteria based on sensitivity and specificity curves and to fix quasi-optimal threshold levels for the parameters proposed and for others eventually added to the study.

Table 3
Hypothesis test results based on Student's t distributions.

Test                   Norm. t-score (f0)   p-Value (f0)   Norm. t-score (ξb)   p-Value (ξb)
ADM000011/CSM000006    3.829                0.0006         3.991                0.0004
ADF000003/CSF000008    0.639                0.2654         3.521                0.0013

These results induce the following reflections:

• ADM000011 and CSM000006 differ in age (see Table 1). Looking at the features considered, they differ substantially in f0 as well as in vocal fold tension ξb, which allows a good discrimination between them based on both features. Whether this difference is only due to AD neuro-degeneration or to voice aging is a matter of further study.
• Although ADF000003 and CSF000008 also differ in age, the phonations of both speakers tend to be similar in f0, but they are much more different in vocal fold tension ξb. This indicates that this parameter may be a better detection feature than f0 according to strict biomechanical criteria (considering that f0 is a purely biomechanical estimate in itself, but very much affected by biological factors other than neuro-degeneration).
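The patient versus control contrast of Table 3 relies on a two-mean comparison under a Student's t distribution. The sketch below shows one common form of such a test (pooled variance, one-sided tail) using only the Python standard library. It is not claimed to be the exact procedure used in the study: the pooled-variance form, the one-sided tail, the numerical integration and the sample sizes in the example are all assumptions. Only the means and standard deviations in the example come from Table 2.

```python
import math


def pooled_t_statistic(mean_a, std_a, n_a, mean_b, std_b, n_b):
    """Student's two-sample t statistic with pooled variance, from summary data."""
    dof = n_a + n_b - 2
    pooled_var = ((n_a - 1) * std_a ** 2 + (n_b - 1) * std_b ** 2) / dof
    t = (mean_a - mean_b) / math.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    return t, dof


def student_t_pdf(x: float, dof: float) -> float:
    """Probability density of Student's t distribution with `dof` degrees of freedom."""
    log_norm = (math.lgamma((dof + 1) / 2.0) - math.lgamma(dof / 2.0)
                - 0.5 * math.log(dof * math.pi))
    return math.exp(log_norm - (dof + 1) / 2.0 * math.log1p(x * x / dof))


def student_one_sided_p(t: float, dof: float, steps: int = 2000) -> float:
    """One-sided tail probability P(T > |t|), integrated with Simpson's rule."""
    upper = abs(t)
    if upper == 0.0:
        return 0.5
    h = upper / steps  # steps must be even for Simpson's rule
    total = student_t_pdf(0.0, dof) + student_t_pdf(upper, dof)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * student_t_pdf(i * h, dof)
    return max(0.0, 0.5 - total * h / 3.0)


if __name__ == "__main__":
    # Means and standard deviations of f0 from Table 2 (CSM06 vs. ADM11).
    # The sample sizes are HYPOTHETICAL; they are not given in this section.
    HYPOTHETICAL_N = 12
    t, dof = pooled_t_statistic(97.54, 3.35, HYPOTHETICAL_N,
                                86.02, 8.90, HYPOTHETICAL_N)
    print(f"t = {t:.3f}, dof = {dof}, one-sided p = {student_one_sided_p(t, dof):.4f}")
```

Because the sample sizes are placeholders and Table 3 reports normalized t-scores, the output is not expected to reproduce the values in Table 3. When the two groups have clearly different variances, a Welch-type correction would be a reasonable alternative to the pooled form used here.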

These observations raise again the old discussion on the selection of a normative set. If normative subjects are closer in age to pathologic ones, differentiation based on age-sensitive parameters will be difficult. If normative subjects are far apart in age from pathologic ones, differentiation will be easier, but it will be difficult to tell whether the differentiation is due to the effects of age or of disease. This consideration leads us to conclude that a well-grounded normative database of control speakers is a fundamental step before any other study based on parameter hypothesis tests can be conducted. Subjects of both genders in the same age interval as the patients, known to be free of other organic or neurological disease, should be selected. Besides, subjects showing presbyphonic voice should also be included. In this way, differentiation between AD and other age-induced dysphonias should be based on the differential sensitivity and discrimination capability of the parameters, assessed by tests such as the ones shown in the present work.

5. Conclusions

The research presented is of an exploratory nature due to the low number of subjects examined, the different features considered, and the lack of a normative database in the same age range. LTA of f0 or intensity profiles does not seem to grant enough discrimination between AD and CS phonations, although some differences are observable. The STA biomechanical parameters used in the study (f0 and ξb) help in differentiating patients within a certain level of significance when evaluated against a normative database of middle-aged speakers. The differentiation is not as clear as would be desirable when contrasting AD speakers against their gender-respective controls relative to f0, but it is stronger regarding vocal fold stiffness. It also seems that age is a crucial factor in carrying out this kind of comparison. Whether these results may be confirmed from other patient pairs is still a matter of study. Regarding phonation, the use of a general-purpose database for pathology detection studies may be somewhat risky, as the subjects collected in this kind of repository are in a range of middle ages. It has been shown that a strong age difference can establish the capability of discriminating AD and CS subjects from normative ones, but when they are compared among themselves the differentiation improves. Therefore a well-designed database of subjects for normative purposes, in the proper age range and under normophonic conditions, is a priority. Controls have to be selected knowing that they are representative of an age range and are free of AD or other neurological diseases which may confound results. As not much attention has been devoted to phonation studies in AD, it will be necessary to define a protocol to produce this database in order to advance further in this field. As this resource is not available yet, the discrimination capability of the different parameters which may be used in the study is still a question open to research. The work is being continued by Foundation CITA, which is recruiting a large database oriented to prosody, articulation and phonation studies in order to evaluate the differentiation capability of other biomechanical parameters besides vocal fold body stiffness.

Acknowledgments

This work is being funded by grants TEC2012-38630-C04-01 and TEC2012-38630-C04-04 from Plan Nacional de I+D+i, Ministry of Economy and Competitiveness of Spain. The authors acknowledge the contribution of the Foundation CITA in Donostia (Basque Country) for providing part of the data used in the study as well as for their advisory role.

References

[1] E. Younesi, A Knowledge-based, Integrative Modeling Approach for In-Silico Identification of Mechanistic Targets in Neurodegeneration with Focus on Alzheimer's Disease (Ph.D. dissertation), Rheinischen Friedrich-Wilhelms-Universität Bonn, Germany, 2014.
[2] C.P. Ferri, et al., Global prevalence of dementia: a Delphi consensus study, Lancet 366 (2005) 2112–2117.
[3] V. Taler, N.A. Phillips, Language performance in Alzheimer's disease and mild cognitive impairment: a comparative review, J. Clin. Exp. Neuropsychol. 30 (5) (2008) 501–556.
[4] K. Croot, et al., Phonological and articulatory impairment in Alzheimer's disease: a case series, Brain Lang. 75 (2000) 277–309.
[5] G. Tosto, M. Gasparini, Prosodic impairment in Alzheimer's disease: assessment and clinical relevance, J. Neuropsychiatry Clin. Neurosci. 23 (2) (2011) 21–23.
[6] P. Östberg, N. Bogdanović, L.O. Wahlund, Articulatory agility in cognitive decline, Folia Phoniatr. Logop. 61 (2009) 269–274.
[7] N. Chater, C.D. Manning, Probabilistic models of language processing and acquisition, Trends Cogn. Sci. 10 (7) (2006) 335–344.
[8] J.R. Deller, J.G. Proakis, J.H.L. Hansen, Discrete-Time Processing of Speech Signals, Macmillan, New York, 1993.
[9] A. Camacho, J.G. Harris, A sawtooth waveform inspired pitch estimator for speech and music, J. Acoust. Soc. Am. 124 (2008) 1638–1652.
[10] A. Kounoudes, P.A. Naylor, M. Brookes, The DYPSA algorithm for estimation of glottal closure instants in voiced speech, in: Proceedings of the 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1, 2002, pp. 349–52.
[11] D. Dimitriadis, A. Potamianos, P. Maragos, A comparison of the squared energy and Teager–Kaiser operators for short-term energy estimation in additive noise, IEEE Trans. Signal Process. 57 (7) (2009) 2569–2581.
[12] K. López-de-Ipiña, et al., On automatic diagnosis of Alzheimer's disease based on spontaneous speech analysis and emotional temperature, Cogn. Comput. 7 (1) (2015) 44–55.
[13] J. Illes, Neurolinguistic features of spontaneous language production dissociate three forms of neurodegenerative disease: Alzheimer's, Huntington's, and Parkinson's, Brain Lang. 37 (4) (1989) 628–642.
[14] K.E. Forbes-McKay, A. Venneri, Detecting subtle spontaneous language decline in early Alzheimer's disease with a picture description task, Neurol. Sci. 26 (2005) 243–254.
[15] A.J. Astell, R.S. Bucks, Strategy prompts increase verbal fluency in people with Alzheimer's disease, Brain Lang. 99 (1-2) (2006) 141–142.
[16] A. Habash, C. Guinn, D. Kline, L.C. Patterson, Language analysis of speakers with dementia of the Alzheimer's type, Annals of the Master of Science in Computer Science and Information Systems at UNC Wilmington, 6, 1, 2012, pp. 8–13.
[17] B. Roark, M. Mitchell, J.P. Hosom, K. Hollingshead, J. Kaye, Spoken language derived measures for detecting mild cognitive impairment, IEEE Trans. Audio Speech Lang. Process. 19 (7) (2011) 2081–2090.
[18] K. López-de-Ipiña, et al., On the selection of non-invasive methods based on speech analysis oriented to automatic Alzheimer disease diagnosis, Sensors 13 (2013) 6730–6745.
[19] C. Laske, et al., Innovative diagnostic tools for early detection of Alzheimer's disease, Alzheimer Dement. (2015), in press, http://dx.doi.org/10.1016/j.jalz.2014.06.004.
[20] S. Singh, R.S. Bucks, J.M. Cuerden, An evaluation of an objective technique for analysing temporal variables in DAT spontaneous speech, Aphasiology 15 (6) (2001) 571–583.
[21] R.S. Bucks, S.A. Radford, Emotion processing in Alzheimer's disease, Aging Ment. Health 8 (3) (2004) 222–232.
[22] K. Horley, A. Reid, D. Burnham, Emotional prosody perception and production in dementia of the Alzheimer's type, J. Speech Lang. Hear. Res. 53 (2010) 1132–1146.
[23] T. Johnstone, K. Scherer, The effects of emotions on voice quality, in: Proceedings of the 14th International Conference of Phonetic Sciences, San Francisco, USA, 1999, pp. 2029–2032.
[24] C. Gobl, A.N. Chasaide, The role of voice quality in communicating emotion, mood and attitude, Speech Commun. 40 (2003) 189–212.
[25] P. Gómez, et al., Characterizing neurological disease from voice quality biomechanical analysis, Cogn. Comput. 5 (2013) 399–425.
[26] M. Kutová, J. Mrzílková, D. Kirdajová, D. Řípová, P. Zach, Simple method for evaluation of planum temporale pyramidal neurons shrinkage in postmortem tissue of Alzheimer disease patients, BioMed Res. Int. (2014), http://dx.doi.org/10.1155/2014/607171.
[27] P. Gómez, et al., Biomechanical characterization of phonation in Alzheimer's disease, in: Proceedings of the IEEE 3rd International Work Conference on Bioinspired Intelligence (IWOBI2014), IEEE Press, Liberia, Costa Rica, 2014, pp. 14–20.
[28] U. Jürgens, Neural pathways underlying vocal control, Neurosci. Biobehav. Rev. 26 (2002) 235–258.
[29] P. Gómez, et al., Glottal source biometrical signature for voice pathology detection, Speech Commun. 51 (2009) 759–781.
[30] I.R. Titze, B.H. Story, Rules for controlling low-dimensional vocal fold models with muscle activation, J. Acoust. Soc. Am. 112 (3) (2002) 1064–1076.
[31] P. Gómez, et al., BioMetroPhon: a system to monitor phonation quality in the clinics, in: Proceedings of the 5th International Conference on e-Health, Telemedicine and Social Medicine, Nice, France, February, 2013, pp. 253–258.
[32] J.P. Marques de Sá, Applied Statistics using SPSS, STATISTICA and MATLAB, Springer, Berlin, 2003.

Pedro Gómez-Vilda was born in Burgo de Osma, Spain. He received the M.Sc. degree in Communications Engineering in 1978 and the Ph.D. degree in Computer Science from the Universidad Politécnica de Madrid, Madrid, Spain, in 1983. He has been Professor in the Computer Science and Engineering Department at Universidad Politécnica de Madrid since 1988 and Director of the Neuromorphic Speech Processing Laboratory at the Center for Biomedical Technology since 2010. His current research interests are biomedical signal processing, functional, neuromotor and cognitive disease monitoring by voice, speaker identification, and cognitive speech production and perception. Dr. Gómez-Vilda is a member of the IEEE, ISCA and EURASIP.

Dr Victoria Rodellar-Biarge was born in Huesca, Spain. She received the M.Sc. and the Ph.D degree in Computer Science from the Universidad Politécnica de Madrid, Madrid, Spain. She is Associate Professor in the Computer Science and Engineering Department, at Universidad Politécnica de Madrid. Her current research interests are biomedical signal processing and reconfigurable logic designs for DSP. Dr Rodellar-Biarge is a member of the IEEE.

Víctor Nieto-Lluis was born in Moscow, Russia. He graduated in Mechanical Engineering from the Polytechnic University of Havana, Cuba. He received the Ph.D. degree in Computer Science from the Universidad Politécnica de Madrid, Spain, in 1991. He has been Associate Professor in the Computer Science and Engineering Department at Universidad Politécnica de Madrid since 2000 and is Director of the Vocal Communication Laboratory “R. W. Newcomb” at the Instituto Técnico Superior de Ingenieros Informáticos. His research interests are in speaker identification, neuromotor and cognitive disease monitoring by voice, and cognitive speech production and perception.

Karmele López-de-Ipiña received the Ph.D. degree in Computer Science in 2003, and a Master's degree in Electronics and Automation and the B.Sc. degree in Physics in 1990, at the Universidad del País Vasco/Euskal Herriko Unibertsitatea (UPV/EHU). She worked for enterprises until 1995, when she joined the Department of Systems Engineering and Automation of the University of the Basque Country. She was Director at the UPV/EHU (2004–2009). She is currently Head of the Engineering and Society research group. Her research interests are in Bioengineering and Biomedical Engineering, Pattern Recognition, Signal Processing, Ambient Intelligence and Robotics.

Agustín Álvarez-Marquina was born in Madrid, Spain, in 1969. He received the M.Sc. degree in Computer Science in 1994 and the Ph.D. degree in Computer Science from the Universidad Politécnica de Madrid, Madrid, Spain, in 1999. He has been Associate Professor in the Computer Science and Engineering Department at Universidad Politécnica de Madrid since 2000. His current research interests are speech recognition, speaker identification and architectures for digital signal processing.

Rafael Martínez-Olalla was born in Madrid, Spain. He received the M.Sc. degree in Communications Engineering in 1995 and the Ph.D. degree in Computer Science from the Universidad Politécnica de Madrid, Madrid, Spain, in 2002. He has been Associate Professor in the Computer Science and Engineering Department at Universidad Politécnica de Madrid since 2009. His current research interests are biomedical signal processing, speaker identification, and cognitive speech production.

Miriam Ecay-Torres obtained her psychology degree in 2007 at the University of Deusto, a master's degree in Neuropsychology in 2010 and a master's degree in Psychology Research in 2013. She started the Ph.D. programme in 2013. Her project is focused on the study of potential neuropsychological profiles in the preclinical stages of Alzheimer's disease.

Pablo Martínez-Lage obtained his medical degree in 1988 at the Medical School of the University of Navarra and then became a clinical neurologist. After his Ph.D. in 1994 at the University of Navarra he attended a clinical program in Cerebrovascular Diseases and Dementia at the Department of Clinical Neurological Sciences of the University of Western Ontario (London, Ontario, Canada) until 1996. Since 2010 he has been director of the Neurology area at the Center for Research and Advanced Therapies of the Fundación CITA-alzheimer Fundazioa in San Sebastián. He is a member of the Spanish Society of Neurology and a founding member of the International Society on Vascular Behavioral and Cognitive Disorders (Vas-Cog). He was the Coordinator of the Dementia Study Group of the Spanish Society of Neurology (2009–2012) and has been a member of the Dementia Panel of the European Federation of Neurological Societies since November 2008.