Journal of Voice
Vol. 8, No. 1, pp. 1-7 © 1994 Raven Press, Ltd., New York
The G. Paul Moore Lecture
Toward Standards in Acoustic Analysis of Voice
Ingo R. Titze
Department of Speech Pathology and Audiology, and National Center for Voice and Speech, The University of Iowa, Iowa City, Iowa; and The Recording and Research Center, The Denver Center for the Performing Arts, Denver, Colorado, U.S.A.
Summary: A brief summary of the purpose of standardization is presented. This is followed by a few suggestions about methods and procedures in acoustic voice analysis that could benefit from consensus, if not fully developed standards. In particular, better agreement on test utterances, techniques for recording voices, and methods of extracting fundamental frequency is called for. Some new databases and calibration materials are in need of development, with better portability and shareability in mind.
Key Words: Standards--Voice analysis--Consensus--Recording.
Accepted February 2, 1993. Address correspondence and reprint requests to Dr. Ingo R. Titze at Department of Speech Pathology and Audiology, The University of Iowa, 330-WJSHC, Iowa City, IA 52242, U.S.A.

This article serves two purposes: to review the rationale for standards in any field, recalling the benefits and possible liabilities that are part of standardization, and to propose a few procedures that could (in time) be standardized in a specific area of acoustic analysis of voice, namely assessment of laryngeal control in phonation. There is no claim to originality of any of the material presented here. Many articles and books have been written on every aspect of standardization, but every field of science somewhere has to rediscover the steps that lead to better communication, better product control, and greater economy in having several people at different locations working on similar things. This is what standardization is really all about: fostering a team approach, even though the players don't always practice or play together on the same field.

GENERAL DISCUSSION OF STANDARDS

In what way are standards helpful?

The rationale for standardization is outlined clearly by Sullivan (1). First, standards educate. In the process of writing a standard, unusual attention
is given to detail and to proper definition of terms. This allows newcomers to the field to acquaint themselves with the level of the science and some of its history. Often ideals are set forth for future achievement. Second, standards simplify. In industry, the number of stock sizes, shapes, and material components is limited to make the selection process easier. This, then, also simplifies accounting, advertising, storing, and other handling procedures. By reducing the variety of the processes, more attention can be given to improving the quality of a few. This leads to the third point. Standards conserve. They save time, money, and effort. Better tools can be designed when multiple copies of the same item are produced. Henry Ford is credited with mass production of the automobile. Although consumers are often intrigued by handcrafted, custom-made products, they also appreciate the value of assembly-line merchandise. Finally, standards help to certify. Institutions that give professional degrees and accreditation usually have standardized some criteria for giving a stamp of approval. Also, when patents or trademarks are sought by individuals or companies, a description is needed that provides some disclosure of the idea, design, or procedure. This disclosure allows the consumer the option to put his trust into an accreditation or protection agency or to test the product himself.

In what way are standards not helpful?

There is also a downside to standardization; in particular, premature standardization or overstandardization can be a hindrance to progress. Just as an economy can be stifled by too much regulatory action, so can a scientific field be slowed down by too much zeal for order. Premature standardization may squelch personal initiative and entrepreneurship. There are always individuals who prefer to go against the grain. By putting the steering wheel on the right side of the car or by turning an electroglottogram (EGG) upside down, clever dissenters discover something that the rest of the world did not see. Standards should therefore not be a threat to those who do not wish to conform; they should at most be a slight inconvenience. The inconvenience may actually become an unnecessary burden, however. Those who have the flexibility to deal with large varieties of processes and products may find it limiting (or even frustrating) not to be able to have it their way. An example is the carpenter who can get only certain sizes of lumber to build an odd-shaped structure. He either has to cut pieces that are too long or make undesirable seams by combining two shorter pieces. In addition, standards may be confusing or erroneous. Ambiguities may have been created by the writers of a standard that may make it difficult to interpret even the intent, much less the detailed procedure. Everyone then uses their own interpretation. The result is an ineffective standard and a cynical attitude about standardization. In a worst-case scenario, an entire field may be led astray by a hidden error. For this reason, standards require frequent reviewing and updating.
The process of standardization

The process of standardization usually begins with an individual or a small group perceiving a need to unify and simplify. In some cases, it is a need to protect the public from certain hazards. In either case, the decision is made early whether the standard should be voluntary or mandatory. Mandatory standards require a regulatory agency for enforcement, which makes the process much more complicated and expensive. Fortunately, the greater percentage of standards is voluntary. Basically, the field then regulates itself and requires no outside arbitration.
The key word in voluntary standardization is consensus. Majority rule does not get the job done. All minority opinions must be heard exhaustively until there is no reasonable argument against the standard. Otherwise, disgruntled individuals or groups will crusade against the standard and voluntary adherence becomes improbable. Ultimately, of course, a large majority will sway the dissenters simply on the basis of convenience and economics. An appropriate professional society is sought for establishing a working group. This society typically has a liaison with the American National Standards Institute (ANSI) or the International Organization for Standardization (ISO) who provides guidance in policy and procedures. The working group reads all the pertinent literature and begins to formulate some proposals. Representation is then sought from industry, education, government, consumers, and any other group that may have an interest in the standard. Lengthy discussions ensue, the aim being consensus at every step of the way.

AREAS WHERE SOME CONSENSUS MAY BE ACHIEVABLE

Acoustic analysis of voice is performed for many reasons. Some of the more obvious ones are telecommunications (speech transmission), basic studies in speech science and linguistics, and assessment of disorders in human communication. The technical needs for these applications are quite varied and cannot easily be described in a single article. The focus here will be the use of acoustic voice analysis to determine the phonatory capabilities of the larynx. Phonatory capability includes, as a minimum, control of pitch, loudness, phonation mode, and register. We consider these four variables to be fundamental in the control of the larynx. If phonatory capability is to be assessed acoustically, each of the control variables must have an acoustic counterpart, a quantity that is measurable from a microphone signal. Table 1 shows a pairing of the control variables with primary acoustic variables, the first two being rather obvious. As far as the phonation mode is concerned, it has been difficult to find a single parameter that quantifies changes in spectral envelope when the voice changes from breathy to pressed. The strength of the fundamental component, relative to the harmonics, is one useful indicator. Register is even more difficult to define in terms of one or two spectral or temporal parameters. A temporal gap is a primary feature of pulse (fry) register, whereas an abrupt spectral slope transition is a primary feature of a modal-falsetto register change. It is clear that none of the control variables has a single acoustic dimension. Loudness, for example, is governed not only by intensity but also by spectral balance, and pitch is influenced by both intensity and spectrum. The nonorthogonality between control variables and acoustic variables is an inconvenience but not a major theoretical stumbling block.

TABLE 1. Phonatory control variables and acoustic correlates

Control variable                  Acoustic variable
Loudness (soft to loud)           Intensity
Pitch (low to high)               Fundamental frequency
Mode (breathy to pressed)         Fundamental-to-harmonics ratio
Register (falsetto/modal/pulse)   Temporal gap or spectral slope

Following the concepts developed in motor control of limb and body movement, each of the control variables can presumably be assessed in terms of (a) strength, as measured by the range of a control variable, (b) accuracy in executing a planned movement pattern, (c) stability in maintaining a posture, and (d) speed of executing a task. For these assessments, it is useful to develop a battery of test utterances from which the relevant acoustic measures can be extracted.

Design of a vocal treadmill (test utterances)

The traditional clinical goals of constructing test utterances are to determine (a) how voice impacts on speech intelligibility and communication effectiveness, and (b) what insight can be gained about laryngeal health or general body condition. An additional goal, a pedagogical one, would be to determine (c) how the effectiveness of vocal training can be quantified.
Historically, clinicians have used a battery of test utterances that progress from primitive vocalizations to isolated syllables or words to complete sentences or paragraphs. Almost everyone is in agreement that the tasks must reveal control of pitch, loudness, and some aspect of vocal quality. In addition, the interaction between respiratory, phonatory, and articulatory components of speech is important to most clinicians. The following list includes nearly all of the test utterances that have been described in the literature:

1. Sustained expiration with no major vocal tract constriction to assess airflow management (tidal volume, vital capacity, inspiratory and expiratory reserves)
2. Prolonged [s] and [z] to compare airflow management with one or two vocal tract constrictions
3. Sustained vowels, typically some combination of [a], [e], [i], [o], or [u], at
   (a) comfortable pitch and loudness
   (b) prescribed pitch and loudness (matched tones)
   (c) gliding pitch and loudness
4. Prolonged vowels to assess phonatory endurance at
   (a) comfortable (habitual) pitch and loudness
   (b) prescribed pitch and loudness (high, medium, low)
5. Singing of scales on vowels (including falsetto register) to determine phonation range and stability
6. Repeated syllables, such as [pae] or [pa]
7. Counting or using some emotionally neutral utterance like "ah-hum" to assess normal speaking pitch and intensity. A vowel in one of the isolated words can be prolonged to assess fundamental frequency (Fo) more accurately. Variations include fast, slow, soft, loud, high, low counting
8. Chant talk, an alternation and mixture of speaking and singing to obtain optimal pitch
9. Speaking of prescribed sentences to simulate situations such as
   (a) soft conversation (near whisper)
   (b) normal conversation
   (c) loud conversation or shouting
   (d) exaggerated articulatory movement (clear speech)
   (e) accelerated rate (rapid speech)
   (f) simulated psychological stress or emotion
   (g) during or after physical exertion (running, lifting, pushing, etc.) or relaxation (meditation, etc.)
   (h) assuming different roles (lecturer, boss, subordinate, etc.)
10. Oral reading of a passage (with variations as in 9)
11. Conversational speech (with variations as in 9)
Assessment of vocal control across clinics, studios, and laboratories could benefit from some consensus reached in the use of diagnostic utterances, particularly if they also were to serve as a vocal treadmill. By "treadmill" we mean that some of the utterances would be taxing enough to expose both the strengths and the weaknesses of the larynx in control of pitch, loudness, mode of phonation, and registration. The use of vowels and conversational speech alone does not seem to do that. Table 2 presents a proposed set of utterances of this type. The top half of the table lists a set of nonspeech utterances and the bottom half lists some speech utterances. The primary aim in constructing this table was completeness, not brevity, at this point. The battery is probably too hefty to be used for clinical diagnostics but trimming is a second task. As it stands, the table includes most of the utterances used historically, but expands the list significantly in the direction of dynamic testing, that
is, using phonatory glides to allow assessment of coordinated muscle activity in the larynx and respiratory system. All utterances are customized to the individual's voice range profile (VRP). This must be obtained first to establish the bounds for further testing. Low, medium, and high pitch can then be defined as some percentage of the Fo range, say 10%, 50%, and 90%. The same can be done for defining soft, medium, and loud intensity. With these definitions, sustained vowels are elicited at strategic locations within the VRP to determine phonatory stability. This is followed by [s] and [z] consonants and, finally, by a series of pitch, loudness, adduction, and register glides. In the second half of the table, speech material is used with increasing phonetic, emotional, and artistic complexity. After traditional counting, an all-voiced sentence is first used to test Fo control independent of adductory control. Three loudness conditions can be used: (a) as speaking into someone's ear, (b) as speaking across the dinner table, and (c) as shouting across a busy street. This is followed by a sentence with frequent voicing onset and offset, again at different loudness. The "Rainbow Passage" and the "Cookie Theft" description are then administered as de facto standards. At this point, the treadmill advances to some parent-child speech. It is expected that exaggerated Fo, intensity, and register patterns will emerge in this test as subjects mimic typical "parentese." Further testing of extreme Fo and intensity patterns (with highly expressive vocalizations) comes with a dramatic recitation, such as one of Shakespeare's soliloquies, or recounting of a highly emotional experience. Finally, a portion of a familiar song ("Happy Birthday") is sung in both modal and falsetto register to examine "heavy" and "light" production in a singing mode.

TABLE 2. Proposed test utterances

Nonspeech
  Voice range profile
    Defines test frequencies and intensities [low = 10% of fundamental frequency (Fo) range, medium = 50% of Fo range, high = 90% of Fo range; soft = 10% of intensity range, medium = 50% of intensity range, loud = 90% of intensity range]
  Sustained [a], [i], [u] vowels
    1. Low, soft, 2 s
    2. Low, loud, 2 s
    3. High, soft, 2 s
    4. High, loud, 2 s
    5. Medium high, medium loud, 2 s
    6. Comfortable pitch and loudness, 2 s
    7. Comfortable pitch and loudness, maximum duration
  Sustained [s] consonant
    Comfortable pitch and loudness, maximum duration
  Sustained [z] consonant
    Comfortable pitch and loudness, maximum duration
  Pitch glides
    1. Low-high-low, one octave, 0.1 Hz
    2. Low-high-low, one octave, 2.0 Hz
    3. Low-high-low, one octave, maximum rate
  Loudness glides
    1. Soft-loud-soft, 0.1 Hz
    2. Soft-loud-soft, 2.0 Hz
    3. Soft-loud-soft, maximum rate
  Adductory glides [a] and [ha]
    1. Onset-pressed-offset, 0.1 Hz
    2. Onset-pressed-offset, 2.0 Hz
    3. Onset-pressed-offset, maximum rate
  Register glides
    1. Modal-pulse-modal, 0.1 Hz
    2. Modal-falsetto-modal, 0.1 Hz
    3. Modal-falsetto-modal, maximum rate, as in yodeling

Speech
  Counting from 1 to 100, comfortable pitch and loudness
  All-voiced sentence, "Where are you going?" soft, medium, loud
  Sentence with frequent voice onset and offset, "The blue spot is on the key again," soft, medium, loud
  Oral reading of "Rainbow Passage"
  Descriptive speech, "Cookie Theft" picture
  Parent-child speech, "Goldilocks and The Three Little Bears"
  Dramatic speech involving deep emotions (fear, anger, sadness, happiness, disgust)
  Singing part of "Happy Birthday to you," modal and falsetto register

A major unanswered question is whether a person's ability to speak or sing can, in any way, be assessed with nonspeech tasks. One would hope that a wide range of pitch and loudness in the VRP, for example, would predict highly expressive intonation, stress, and loudness patterns in speech, but there is no guarantee for that. For assessment of voice disorders, large inaccuracies in pitch and intensity glides should be a predictor of abnormal prosodic contours in speech, but again, this remains an open research question. Table 3 shows how range, stability, speed, and accuracy of the primary control variables of the larynx might be assessed with selected portions of the nonspeech utterances. The ranges of pitch and loudness are assessed with the VRP. The ranges of adduction and register are assessed with the slow glides. Stability of posturing is assessed with the sustained vowels, and speed and accuracy are assessed with the rapid and slow glides, respectively.
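The percentage-based definitions of low, medium, and high pitch (and of soft, medium, and loud intensity) can be illustrated with a short sketch. The text does not say on which scale the percentages of the Fo range are taken; the sketch assumes a semitone (logarithmic) scale for Fo and a decibel scale for intensity, and the function names are hypothetical.

```python
import math


def hz_to_semitones(f_hz, ref_hz=1.0):
    """Convert a frequency in Hz to semitones above a reference frequency."""
    return 12.0 * math.log2(f_hz / ref_hz)


def semitones_to_hz(st, ref_hz=1.0):
    """Convert semitones above a reference frequency back to Hz."""
    return ref_hz * 2.0 ** (st / 12.0)


def percent_of_range(lo, hi, percent):
    """Value located `percent` of the way from lo to hi (linear interpolation)."""
    return lo + (hi - lo) * percent / 100.0


def vrp_targets(fo_min_hz, fo_max_hz, spl_min_db, spl_max_db,
                percents=(10, 50, 90)):
    """Derive test pitches and intensities from the limits of a voice range profile.

    Assumption (not stated in the article): percentages of the Fo range
    are taken on a semitone scale, and percentages of the intensity
    range on a dB scale.
    """
    st_lo = hz_to_semitones(fo_min_hz)
    st_hi = hz_to_semitones(fo_max_hz)
    pitch = {
        name: semitones_to_hz(percent_of_range(st_lo, st_hi, p))
        for name, p in zip(("low", "medium", "high"), percents)
    }
    loudness = {
        name: percent_of_range(spl_min_db, spl_max_db, p)
        for name, p in zip(("soft", "medium", "loud"), percents)
    }
    return pitch, loudness
```

For a hypothetical subject whose VRP spans 100 to 400 Hz and 60 to 100 dB, `vrp_targets(100.0, 400.0, 60.0, 100.0)` gives test pitches of roughly 115, 200, and 348 Hz and test intensities of 64, 80, and 96 dB. A linear-Hz interpretation of the percentages would give different pitches, which is one small example of a detail on which consensus would be needed.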
A major problem with administration of any task is to determine the cognitive ability of the subject or patient. Do they understand what they are supposed to do? Ample dialogue between the clinician and the patient can, of course, minimize this problem. But what if administration of the tasks is to be automated? Is it sufficient to give examples by videotape, audiotape, or by some other type of preprogrammed message? Some consensus along these lines would be helpful. Another problem may be a perceptual deficit of the subject. Are auditory or visual stimuli that may be used to elicit the vocal tasks adequately perceived by the listeners? If not, it may be necessary to conduct some auditory or visual training to acquaint subjects with octave glides, crescendos, diminuendos, register changes, or other vocal exotics. Once the task is clear to the subject, there is still the question of practice to obtain the level of performance that is most useful for diagnostic purposes. The objective here is not necessarily to get top level performance, but there should be sufficient success to deem the task completed. If a treadmill is to be useful, a person must be able to walk on it at some speed and for some length of time. But the speed of the mill or the length of the walk may not have to be the ultimate (or the only) measures of physical fitness. Other tests can be made, such as heart rate, blood pressure, and general laboratory analysis. Likewise, subtle vocal irregularities (tremor, involuntary register shifts, intermittent phonation, subharmonics) may give the complete story of vocal fitness. The vocal treadmill acts as a catalyst in exposing the system to induce "vocal stumbles."

Measurement of fundamental frequency (Fo)

Extraction of Fo is necessary for most of the acoustic measures obtained from the vocal utterances. There is still much uncertainty among investigators with regard to the extraction of Fo and its variability.
The field could benefit from some consensus with regard to the following: (a) clarification of the meaning of Fo when signals show evidence of bifurcations, chaos, or highly stochastic behavior; (b) determination of an upper limit of the extent of perturbation for which jitter and shimmer measures have validity; (c) selection of appropriate microphone type and placement (with respect to the mouth) for the highest fidelity Fo extractions; and (d) the effect of noise, room acoustics, and source-receiver stationarity on jitter and shimmer measures.

TABLE 3. Nonspeech tasks for assessment of laryngeal motor control

Control variable  Range                  Stability         Speed                   Accuracy
Pitch             Voice range profile    Sustained vowels  Rapid pitch glides      Slow pitch glides
Loudness          Voice range profile    Sustained vowels  Rapid intensity glides  Slow intensity glides
Adduction         Slow adductory glides  Sustained vowels  Rapid adductory glides  Slow adductory glides
Register          Slow register glides   Sustained vowels  Rapid register glides   Slow register glides

Some work is underway in these areas. For example, methods of nonlinear dynamics are being applied to voice signals to classify them in terms of well-known attractors (2). It appears that the signals fall into three distinct categories: periodic with small random perturbations, periodic with subharmonic structure and modulation, and nonperiodic (chaotic). Traditional jitter and shimmer analysis can be applied only to the first category, but it is not yet clear how to determine the boundary between the categories. Some work is also being done to select the most appropriate microphones (3). Preliminary results suggest that microphone sensitivity, shielding from ground loops, and distance from the mouth can all affect measured jitter and shimmer of normal voices. This is significant in light of the earlier study by Doherty and Shipp (4) on the effect of tape recorders on jitter and shimmer measurements. A complete set of recommendations is now needed for the entire recording system, including the room environment.
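To make the discussion of perturbation measures concrete, the following is a minimal sketch of two generic statistics applied to a sequence of cycle periods (for jitter) or cycle peak amplitudes (for shimmer). These are common textbook definitions, not a procedure prescribed in this article, and the function names are hypothetical. The sketch assumes that Fo extraction has already produced a clean cycle-by-cycle sequence, which is reasonable only for the first signal category above (nearly periodic with small random perturbations).

```python
import math


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean, in percent."""
    m = mean(values)
    variance = sum((v - m) ** 2 for v in values) / len(values)
    return 100.0 * math.sqrt(variance) / m


def mean_rectified_perturbation(values):
    """Mean absolute cycle-to-cycle difference relative to the mean, in percent.

    Applied to cycle periods this is one common form of "jitter";
    applied to cycle peak amplitudes, one common form of "shimmer".
    """
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return 100.0 * mean(diffs) / mean(values)
```

For a hypothetical period sequence of 10.0, 10.1, 10.0, 9.9, and 10.0 ms, the mean rectified perturbation is about 1%. Neither function checks whether the signal is periodic enough for the result to be meaningful; as noted above, where that boundary lies is still an open question.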
Database of high-fidelity recordings

Once consensus has been reached on test utterances and recording techniques, it would seem appropriate to build a new database of voice recordings. This database would include representative numbers of normal and abnormal populations "stepping onto the vocal treadmill" and following the protocol agreed upon. This database should then be made available to a large number of researchers and clinicians for comparative studies.

Synthetic calibration materials

In addition to the database of human voice recordings, the field would benefit from a randomly accessible library of synthetic voice samples. This library could be distributed on digital audiotape (DAT), compact disk read-only memory (CD ROM), or over computer networks. Table 4 shows an outline of how such a library might be structured. Agreement would be reached on some constants, such as stimulus duration, magnitude, and perhaps onset and offset duration. Parameters to be varied would include waveform type, fundamental frequency, sampling frequency, vowel, modulation (type, frequency, and extent), and additive noise. A triangular (or ramp) modulation would provide the synthetic equivalent of glides produced by human subjects. This would allow ample comparisons between natural and synthetic voice stimuli.

TABLE 4. Synthetic voice samples

Constants
  Duration: 5 s
  Onset: 0.1 s
  Offset: 1.0 s
  Magnitude: ~20,000 peak to peak (computer units)

Parameters
  Signal type: Sinusoids; glottal flow, mouth flow, mouth pressure, and contact area analogs
  Fundamental frequency: ~100 Hz, 200 Hz, 400 Hz, but noninteger
  Sampling frequency: 20 kHz, odd-ball (44.1 kHz)
  Vowels: [a], [i], [u]
  Modulation type: Sinusoidal, Gaussian, ramp; AM and FM for each
  Modulation frequency: 1 Hz, 5 Hz, 50 Hz, Fo/2, Fo/3
  Modulation extent: 0.05%, 0.5%, 5%; 10%, 50%
  Additive noise: 40 dB, 80 dB signal-to-noise ratio
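As an illustration of how one entry in such a library might be generated, the sketch below synthesizes the simplest signal type in Table 4, a sinusoid, with optional sinusoidal amplitude and frequency modulation and additive Gaussian noise. Several details are assumptions rather than part of the proposal: the linear onset and offset ramps, the interpretation of modulation extent as a percentage of the unmodulated value, the definition of signal-to-noise ratio relative to the RMS of the unmodulated sinusoid, the default Fo of 100.3 Hz, and all function and parameter names.

```python
import math
import random


def synth_sample(fo_hz=100.3, fs_hz=20000, duration_s=5.0,
                 onset_s=0.1, offset_s=1.0, peak_to_peak=20000,
                 am_extent_pct=0.0, fm_extent_pct=0.0, mod_hz=5.0,
                 snr_db=None, seed=0):
    """Return a list of 16-bit integer samples of a modulated sinusoid."""
    rng = random.Random(seed)          # seeded so the noise is reproducible
    n = int(round(duration_s * fs_hz))
    amp = peak_to_peak / 2.0
    noise_sd = 0.0
    if snr_db is not None:
        # RMS of an unmodulated sinusoid of peak amplitude amp is amp / sqrt(2).
        noise_sd = (amp / math.sqrt(2.0)) / (10.0 ** (snr_db / 20.0))

    samples = []
    phase = 0.0
    for i in range(n):
        t = i / fs_hz
        mod = math.sin(2.0 * math.pi * mod_hz * t)
        f_inst = fo_hz * (1.0 + fm_extent_pct / 100.0 * mod)   # FM
        a_inst = amp * (1.0 + am_extent_pct / 100.0 * mod)     # AM
        # Linear onset and offset ramps (assumed shape).
        rise = t / onset_s if onset_s > 0 else 1.0
        fall = (duration_s - t) / offset_s if offset_s > 0 else 1.0
        env = max(0.0, min(1.0, rise, fall))
        x = env * a_inst * math.sin(phase)
        if noise_sd > 0.0:
            x += rng.gauss(0.0, noise_sd)
        # Clip to the 16-bit range before converting to an integer.
        samples.append(int(round(max(-32768.0, min(32767.0, x)))))
        # Accumulate phase so that frequency modulation stays continuous.
        phase += 2.0 * math.pi * f_inst / fs_hz
    return samples
```

A call such as `synth_sample(fm_extent_pct=0.5, mod_hz=5.0, snr_db=40)` would correspond to one cell of the parameter grid in Table 4. The glottal flow, mouth flow, mouth pressure, and contact area analogs listed there would require a voice production model and are outside the scope of this sketch.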
File structures

For databases to be less site- and machine-specific, it would be advisable to agree on a highly flexible file structure. One suggestion, attributed to Professor Paul Milenkovic at the University of Wisconsin, has been to use only single channel binary data strings (16 bits, with perhaps a later expansion). No delimiters would be used for data records, and the file header would contain only character strings. The header would have an agreed-upon fixed length that is known to everyone. The text strings in the header would contain typical file specifications, such as file length, sampling frequency, patient or subject information, date of recording or generation, or any other useful information. Users of the shared databases would need to program their computers to read the unformatted text strings rather than field-specific number-character combinations (the typical header organization).

Nomenclature

Some agreement on nomenclature would be helpful, but this is not as important as other foregoing considerations. It was argued before (5) that an attempt should be made to use accepted engineering and statistical terminology, such as root-mean-squared, mean rectified, mean squared, variance, coefficient of variation, modulation index, etc., in making actual calculations on the voice signal. Terms like jitter, shimmer, tremor, vibrato, flutter, etc., are best retained as generic terms that give a qualitative feeling for the variability. Assigning numbers to them may be taking a step backward. Terminology of nonlinear dynamics has been helpful in sorting out some of the ill-defined terms for vocal quality. For example, the term diplophonia is ascribed to a voice for which two independent pitches are perceived. The attractor in phase space for such a signal is a torus, assuming that two incommensurate frequencies indeed exist. On the other hand, creaky voice often refers to phonations for which subharmonic frequencies occur. The attractor for this signal is a bifurcated limit cycle. If two or more pitches are perceived for these phonations, they should be in octave ratios. Creaky voice should therefore not be confused with diplophonia. Creaky voice becomes vocal fry (or pulse register) when a subharmonic frequency dominates (perceptually) and is below about 70 Hz. The term voice range profile has recently been adopted by the International Association of Logopedics and Phoniatrics as a display of intensity range versus fundamental frequency. The display has also been referred to as Stimmfeld (German for voice area) or phonetogram in the literature. It would be helpful to settle on the name of this display.
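Returning to the file structure suggested earlier (a fixed-length header of plain text followed by undelimited single-channel 16-bit samples), a minimal sketch of writing and reading such a file is given here. The header length of 1024 bytes, the `key=value` line layout, the ASCII encoding, and the little-endian byte order are all assumptions made for illustration; the suggestion itself specifies none of them, and the function names are hypothetical.

```python
import struct

HEADER_BYTES = 1024  # hypothetical agreed-upon fixed header length


def pack_file(fields, samples):
    """Build file contents: padded text header plus raw 16-bit samples."""
    text = "".join(f"{key}={value}\n" for key, value in fields.items())
    raw = text.encode("ascii")
    if len(raw) > HEADER_BYTES:
        raise ValueError("header text exceeds the fixed header length")
    header = raw.ljust(HEADER_BYTES, b" ")      # pad with spaces to fixed size
    data = struct.pack(f"<{len(samples)}h", *samples)  # little-endian int16
    return header + data


def unpack_file(blob):
    """Split file contents back into header fields and a list of samples."""
    header = blob[:HEADER_BYTES].decode("ascii")
    fields = {}
    for line in header.splitlines():
        if "=" in line:                          # skip the padding line
            key, value = line.split("=", 1)
            fields[key.strip()] = value.strip()
    body = blob[HEADER_BYTES:]
    count = len(body) // 2                       # two bytes per sample
    samples = list(struct.unpack(f"<{count}h", body[:2 * count]))
    return fields, samples
```

Because the header is free text of known length, a reader that does not recognize a given key can simply ignore it, which is the flexibility the suggestion aims for. Byte order is one detail that a shared convention would have to settle explicitly.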
CONCLUSION

Although it is difficult to identify procedures and measurement techniques in voice analysis that are clear candidates for formal standardization at this point, much progress can be made by reaching some preliminary consensus. In particular, the field is bogged down by lack of specificity in test utterances that have universal appeal and proven diagnostic value. Once such test utterances are defined, work can proceed more rapidly in establishing the technical criteria for extraction of fundamental frequency, intensity, and spectral measurements. Shared calibration materials and databases would then be extremely helpful.

Acknowledgment: This work was supported by grant RO1 DC00387 from the National Institute on Deafness and Other Communication Disorders. The manuscript was prepared by Julie Lemke.
REFERENCES

1. Sullivan CD. Standards and standardization: basic principles and applications. New York: Marcel Dekker, 1983.
2. Titze IR, Baken R, Herzel H. Evidence of chaos in vocal fold vibration. In: Titze IR, ed. Vocal fold physiology: frontiers in basic science. San Diego: Singular Publishing, 1993.
3. Titze IR, Winholtz W. The effect of microphone type and placement on voice perturbation measures. J Speech Hear Res 1993 (in press).
4. Doherty E, Shipp T. Tape recorder effects on jitter and shimmer extraction. J Speech Hear Res 1988;31:485-90.
5. Pinto N, Titze I. Unification of perturbation measures in speech analysis. J Acoust Soc Am 1990;87(3):1278-89.