Journal of Voice, Vol. 6, No. 2, pp. 155-158 © 1992 Raven Press, Ltd., New York
Special Article
Perceptual Evaluation
Sören Fex
University Hospital, Lund, Sweden
Summary: Perceptual evaluation of the voice, commonly and erroneously termed psychoacoustic evaluation, is subjective and is based on comparisons with another voice or with the listener's previous impressions of the same voice. Although it is applied universally, it is terminologically confusing. To increase reliability, continuous training in listening for voice parameters is essential, and frequent tape recordings are needed to facilitate comparisons. Key Words: Diagnosis--Perception--Perceptual rating.
Perceptual evaluation of a person's voice signifies that a listener is making a comparison between a (perhaps unspecified) number of qualities that the listener can hear in the speaker's voice and the listener's own opinion about how these different qualities should sound in the normal voice. In the past, perceptual judgments have been called psychoacoustic evaluation. The reference levels of the listener are more or less unstable (with one possible exception, if the listener has absolute pitch). There are good reasons to believe that the reference levels vary from listener to listener, either in quality or quantity. Normal voice quality is a conception based on subjective opinion, may vary with different cultures, and certainly is difficult to define; a vast number of people are supposed to have normal but nevertheless individually differentiated voices. Despite the fact that no good definition exists, the term normal voice is widely used, even in otherwise objective studies. Indeed, it is difficult to find articles on the human voice in which perceptual analysis has been omitted. This is one reason why a complete bibliography on perceptual analysis is difficult, if not impossible, to compile. Another reason is that before instruments became available, the listener's judgment was the only available assessment
modality, and thus a complete bibliography would have to include almost everything written on the human voice before the recent research era. One of the greatest difficulties in perceptual analysis of voice is achieving unambiguous descriptions of different voice qualities. Few standard terms exist; the listener is often forced to use words that were not originally meant to describe sound. Sonninen (1) gives the following 59 examples "all of which are used in describing the singing voice: (visual) bright, clear, limpid, pale, cloudy, muddy, dark; (caloric sense) sultry, warm, cold, icy; (kinesthetic sense) strained, forced, breathy, firm, tight, hushed, lilting, relaxed, scooping, sliding, wobbly, biting, strident, supported, tense, vibrant, resonant; (anatomical) back, front, head-, mouth-, throat-, chest-, nasal, naked, strangled; (instrumental) flutelike, piping; (material) metallic, brassy, wooden, woolly, gravelly, airy; (spatial) broad, narrow, flat, rounded; (aesthetic) beautiful, ugly, pleasing, unpleasant; (taste) sweet, sugary, dulcet, mellow; and also the adjectives 'rich' and 'luxuriant.'" Defining such words (and numerous others) when used for describing voice quality has not been successful so far. There is no reason to believe, for example, that the term hoarse has a definition that has been generally accepted by all who use the term. Terminology is thus a serious problem in need of a solution. The inadequacy of current
Address correspondence and reprint requests to Dr. Sören Fex, Phoniatric Dept., University Hospital, S-221 85 Lund, Sweden.
terminology has been pointed out in many ways, and "it is apparent that labels traditionally used for the description of voice deviations should be viewed somewhat skeptically" (2). For many years the International Association of Logopedics and Phoniatrics has had a committee charged with establishing a terminology acceptable to at least the majority of its members; thus far it has not succeeded. The necessity of translating among languages complicates the matter considerably, because the final outcome must be a set of terms that have the same semantic value to all who use them, irrespective of their language.
A few interesting attempts have been made to avoid, or at least minimize, these terminological difficulties. Using the semantic differential method of Osgood (3), several investigators have studied the auditory impression of human voice and speech [e.g., Isshiki et al. (4) and Takahashi and Koike (5)]. The semantic differential technique is a psychometric procedure originated for measuring the meaning of various abstractions. A listener is given pairs of polar-opposite adjectives assumed to be most related to the judgment of the voice quality. The listener is asked to use a seven-point, equal-appearing interval scale to rate the scalar concept represented by the polar pair. Isshiki et al. (4) used 17 pairs, three of which were also found in the 12 pairs used by Takahashi and Koike (5). Although the two studies used the same procedure, their results did not correspond very well. Isshiki et al. (4) found that the analysis of 17 pairs was too laborious and time consuming to be practiced in out-patient clinics. Based on these data, several Japanese clinical investigators proposed a simplified version of the rating scale consisting of four factors, R, B, A, and D, signifying rough, breathy, asthenic, and degree, but they emphasized that no single adjective is adequate to precisely express any one of the factors.
For each factor a four-point, equal-appearing interval scale (0 = normal, 1 = slight, 2 = fair, 3 = extreme) is used. This technique was adopted by the Committee for Phonatory Function Tests of the Japan Society of Logopedics and Phoniatrics in its proposal of the GRBAS scale for evaluating hoarseness. It is composed of five subscales: grade (G), rough (R), breathy (B), asthenic (A), and strained (S). Each scale has four grading levels.
Since the evaluation with the use of the GRBAS scale is subjective, the examiner must possess a trained ear. For this purpose, the Committee for Phonatory Function Tests of the Japan Society of Logopedics and Phoniatrics has made a standard tape which has typical voice samples represented by the GRBAS scale. The committee feels that the psychoacoustic evaluation using the GRBAS scale is not an absolute method but needs to be improved upon (6).
Such a standard tape represents a considerable improvement: comparison can be based on an audible reference rather than on a vague description.
RELIABILITY
Reliable perceptual analysis implies a standardized ability and/or quality in the listener. Yet Jensen (2) found that "neither inter-judge nor intra-judge reliability was sufficiently high to place confidence in [the listener's] judgements." Similarly, the results of a study by Blaustein and Bar (7) "demonstrated poor inter-judge agreement between listeners . . . ." This study lends support to an issue that may be recognized by speech pathologists but has not been sufficiently documented: the questionable reliability of perceptual judgment by any one examiner. Few investigators, however, have used a multidimensional approach that requires listeners to rate several perceptual dimensions on a multiple-point scale. Darley et al. (8) used a seven-point scale to rate 38 dimensions, the majority of which did not refer to voice. Two investigations by Hammarberg (9,10) used a five-point scale to rate 28 responses on 26 dimensions. Both studies reported high reliability coefficients for their respective dimensions. Bassich and Ludlow (11) were interested in determining whether inexperienced judges could be trained to use a multidimensional perceptual rating system to assess vocal function. Four judges were given what was deemed appropriate training, after which they rated voices on 13 dimensions. The judges' ratings did not remain stable, and interjudge agreement as well as intrajudge reliability were unsatisfactory. The authors concluded that their "results suggest that the task of perceptually rating voice quality is difficult, requiring professional experience and sophistication" (11). On the other hand, Laver et al. (12), in testing their Vocal Profile Analysis Protocol, had encouraging results in training speech pathologists in perceptual voice analysis. Similarly, very satisfactory results reported by Hammarberg (10) "indicate that perceptual evaluation by clinically well trained listeners is reliable and reproducible and can be used for systematic evaluation purposes, if
handled with precaution. The jury group [was composed of] the staff members of a clinic in which systematic training and discussions in matters of perceptual voice evaluation had been going on for many years. Among other things, the voice quality parameters have been discussed and re-defined within the group by listening to tapes of voice disorders" (9,10). This seems to indicate that long exposure to, and active listening for, various voice qualities are necessary to make reliable perceptual judgments. It might even be supposed that the longer the experience, the better the judgment reliability.
QUALITIES IN THE LISTENERS
Most high-quality music performers, at least instrumentalists and orchestra conductors, maintain or increase their quality into old age. There is no special reason to suppose that such people are less affected by presbyacusis than others. This must mean that they adjust to a perceptual change. It is quite possible that elderly experienced voice judges have similarly adjusted their "internal reference levels," and thus do not have the same need for normal hearing as the judge in training, for whom normal hearing should be a prerequisite. The talent for selective listening is more difficult to test, but the possibility of its existence should not be neglected. Judging a painting requires not only that the contours be clearly perceived but also that color vision be accurate. Some people have defective color vision. To what extent this also holds for "colors of voice" is largely unknown. Much more knowledge is needed.
In any test situation in which instruments are being used it is important to know their capabilities, and to be certain of their calibration. Such accuracy is, for obvious reasons, not possible when it comes to listeners' judgment of voice qualities; people cannot be calibrated. In almost all comparisons of listeners' judgments the listeners have come from the same language area and have had the same cultural background.
The only exception was described by Wendler and Anders (13), who sent taped voice samples to listeners in Finland, the German Democratic Republic, Sweden, and the United States. Listeners were asked to judge the degree of hoarseness using a four-point grading scale. A total of 11,440 judgments were made, mostly by professionals but also by some laypeople. "The auditory tests, on the whole, showed a high degree of accordance of judgments passed by experts as well as by naive
listeners in respect to the classification of hoarse voices" (13). It has not been possible, however, to find an investigation that used a multidimensional approach with listeners from different language areas and different cultures. Two strong reasons for this might be the difficulties connected with terminology and translation of terms and definitions, as exemplified in Report on Vocal Registers (14).
In summary, it seems clear that neither the concept "normal" nor deviations from it have a single definition; opinions may differ from area to area and from listener to listener. These differences seriously limit the use of perceptual evaluation of voice. Nevertheless it is the most common technique used by voice clinicians regardless of their particular discipline, and for an unknown but probably vast number of such workers it is their only means to judge voice production. Thus, in spite of the limitations inherent in the procedure, perceptual evaluation constitutes an important component of routine clinical evaluations.
NEEDS FOR DATABASE
Acoustic recordings of different vocal qualities are needed to facilitate development of standard terminology to describe voice production, and to train clinicians to listen for the same vocal qualities. If high-quality recordings of different vocal qualities were available, they could be used to train beginning clinicians, as well as for continual tuning for mature clinicians. Ideally, standardized tape samples would include sustained vowels and running speech of speaker samples representing various diagnoses and vocal qualities, as well as synthesized samples with well-defined noise components.
RECOMMENDATIONS
On balance the following appear to be important in assuring the reliability and validity of perceptual judgments:
1. Voice workers should strive for unambiguous terminology and definitions.
2. Listeners should have demonstrably adequate hearing and should have had long, and preferably ongoing, training for the task.
3. Tape recordings used for assessment should be of high quality and must be played back on excellent equipment. Failure to do so could introduce artifacts and invalidate interpretation of results.
4. Judgments should be made from running speech samples.
5. A standard tape should be made available to be used as a reference, including samples representing typical diagnostic categories and synthesized noises.
REFERENCES
1. Sonninen A. Phoniatric viewpoint on hoarseness. Acta Otolaryngol 1970;263:68-81.
2. Jensen PJ. Adequacy of terminology for clinical judgment of voice quality deviation. Eye Ear Nose Throat Monthly 1965;44:77-82.
3. Osgood CE, Suci GJ, Tannenbaum PH. The measurement of meaning. Urbana: University of Illinois Press, 1957.
4. Isshiki N, Okamura H, Tanabe M, Morimoto M. Differential diagnosis of hoarseness. Folia Phoniatr 1969;21:9-19.
5. Takahashi H, Koike Y. Some perceptual dimensions and acoustical correlates of pathologic voices. Acta Otolaryngol 1976;338(suppl):3-24.
6. Hirano M. Clinical examination of voice. Vienna: Springer-Verlag, 1981.
7. Blaustein S, Bar A. Reliability of perceptual voice assessment. J Commun Disord 1983;16:157-61.
8. Darley FL, Aronson AE, Brown JR. Differential diagnostic patterns of dysarthria. J Speech Hear Res 1969;12:246-69.
9. Hammarberg B, Fritzell B, Gauffin J, Sundberg J, Wedin L. Perceptual and acoustic correlates of abnormal voice qualities. Acta Otolaryngol 1980;90:441-51.
10. Hammarberg B, Fritzell B, Gauffin J, Sundberg J. Acoustic and perceptual analysis of vocal dysfunction. J Phonet 1986;14:533-47.
11. Bassich CJ, Ludlow CL. The use of perceptual methods by new clinicians for assessing voice quality. J Speech Hear Disord 1986;51:125-33.
12. Laver J, Wirz S, Mackenzie J, Hiller S. A perceptual protocol for the analysis of voice profiles. Work in Progress, Department of Linguistics, University of Edinburgh 1981;14:139-55.
13. Wendler J, Anders LC. Hoarse voices - on the reliability of acoustic and auditory classifications. Proceedings of the 20th congress of the International Association of Logopedics and Phoniatrics. Tokyo, Japan: 1986:438-39.
14. Hollien H. Report on vocal registers. Acta Phoniatrica Latina 1984;6(suppl):11-22.
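The inter-judge and intra-judge reliability discussed under RELIABILITY can be expressed numerically. The following is a minimal sketch, not a method taken from any of the studies cited: it assumes grades on the four-point scale (0 = normal, 1 = slight, 2 = fair, 3 = extreme) and uses two common agreement statistics, exact percent agreement and Cohen's kappa. The function names and the sample grades are hypothetical and were invented for illustration only.

```python
from collections import Counter

# Four-point grading levels: 0 = normal, 1 = slight, 2 = fair, 3 = extreme.
LEVELS = (0, 1, 2, 3)


def percent_agreement(a, b):
    """Fraction of voice samples on which two rating lists give the same grade."""
    if len(a) != len(b) or not a:
        raise ValueError("rating lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for agreement expected by chance."""
    n = len(a)
    p_obs = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: product of the two raters' marginal proportions per level.
    p_exp = sum(count_a[k] * count_b[k] for k in LEVELS) / (n * n)
    if p_exp == 1.0:
        return 1.0  # both lists use one identical grade throughout
    return (p_obs - p_exp) / (1.0 - p_exp)


# Hypothetical grades for eight voice samples (invented, not from the article).
judge_1_first = [0, 1, 2, 3, 1, 0, 2, 3]   # judge 1, first session
judge_1_repeat = [0, 1, 2, 2, 1, 0, 2, 3]  # judge 1, same tape heard again
judge_2_first = [0, 2, 2, 3, 0, 0, 1, 3]   # judge 2, first session

intra = cohen_kappa(judge_1_first, judge_1_repeat)  # intra-judge reliability
inter = cohen_kappa(judge_1_first, judge_2_first)   # inter-judge agreement
print(f"intra-judge kappa = {intra:.2f}, inter-judge kappa = {inter:.2f}")
```

With these invented grades the script reports a kappa of about 0.83 within one judge and 0.50 between two judges. Unweighted kappa treats a one-step disagreement the same as a three-step one; for an ordinal scale such as this, a weighted variant may be more appropriate.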