Effects of Training on Time-Varying Spectral Energy and Sound Pressure Level in Nine Male Classical Singers

*Sam Ferguson, †Dianna T. Kenny, and *Densil Cabrera, *†New South Wales, Australia

Summary: The male classical singing voice is a musical instrument that is very important in western culture. It has many acoustic features which should change and improve over the period in which the singer trains. In this study we compare nine singers in different stages of training, from university level students through to international soloists. Typically, Energy Ratio (ER; a measure of mean spectral slope) and mean sound pressure level (SPL) may be calculated to summarize an entire singing sample. We investigate an alternative approach, by calculating the time-varying ER and SPL. The inspection of the distribution of these descriptors over an aria’s time period yields a more detailed picture of the strategies for high-frequency energy production used by singers with different levels of training.

Key Words: Singing quality–Singer’s formant–Long-term average spectrum.
Accepted for publication May 6, 2008.
From the *Acoustics Research Laboratory, Faculty of Architecture, Design and Planning, University of Sydney, New South Wales, Australia; and the †Australian Centre for Applied Research in Music Performance (ACARMP), Sydney Conservatorium of Music, The University of Sydney, New South Wales, Australia.
Address correspondence and reprint requests to Sam Ferguson, Faculty of Architecture, Design and Planning, University of Sydney, Wilkinson Building, 148 City Road, NSW 2006, Australia. E-mail: [email protected]
Journal of Voice, Vol. 24, No. 1, pp. 39-46
0892-1997/$36.00
© 2010 The Voice Foundation
doi:10.1016/j.jvoice.2008.05.002

INTRODUCTION
The professionally trained singing voice is an acoustic source of great beauty, complexity, and communicative power. Since Bartholomew1 described the prevalence of high-frequency energy in recordings of female operatic singers, the acoustic correlates of vocal perceptual quality have been investigated by both acousticians and music researchers.2–4

Spectral energy distribution in vocal sources can be explained using the traditional source-filter model of speech as defined by Fant.5 Air pressure either causes the vocal folds to vibrate periodically, forming an acoustic oscillator, or, if the folds are retracted, causes air to escape through the glottis, producing a signal that resembles acoustic white noise. Regardless of phonation type, the periodic (voiced) or aperiodic (unvoiced) signal is passed through the vocal tract, which acts as an acoustic filter, enhancing some frequencies and attenuating others. The acoustic filter characteristics of the human vocal tract are traditionally described in terms of their peaks, which are known as formants. Current terminology assigns three dimensions to a formant: center frequency, bandwidth, and amplitude, which Fant et al showed can be approximated from the conjugate complex poles of the vocal tract transfer function.6 The observed level of a formant is dependent on the proximity and bandwidth of other formants. Closely clustered formants, such as the clustering of formants 3, 4, and 5 observed by Sundberg,7,8 increase the spectral energy in the higher formant region of the spectrum, as shown by Fant.6 Sundberg argued that this “singer’s formant cluster” may have resulted from the combined effect of a lowered larynx and a widened pharynx in the higher pitch range.9

Specific vowels also have an effect on spectral energy levels. Front vowels containing a high second formant will contribute toward increasing the spectral energy in the higher formant regions above 2 kHz, whereas back vowels, with typically lower second formants, will show less energy there in comparison. Sundberg10 has pointed out that the levels of the higher formants vary less between vowels in classically trained singers’ voices than in nonsingers’ voices. Formant bandwidth narrowing has also been shown to increase the level of formants: Kang and Coulter11 showed this effect for spoken voice, and it was again shown for the sung voice by Millhouse et al.12

ER has been used to describe the singer’s formant cluster peak in relation to the other peak that is commonly seen around the first formant area. It is calculated as the difference between the sums of the energy content in the bands 0–2 kHz and 2–4 kHz expressed in decibels.13 Alpha (α), introduced by Frøkjaer-Jensen,14 is a measure that predates ER and refers to the ratio of spectral energy contained in the regions above and below 1 kHz. “Spectral Balance” (SB), described by Ternström et al,15 is defined as the difference between the energy in the band 0.1–1 kHz and the band 2–6 kHz. “Singing Power Ratio” (SPR) is related to these measures; Omori et al define SPR as the ratio between the maximum peak energy within the 0–2 kHz and 2–4 kHz bands.16

These four measures of spectral energy, ER, α, SB, and SPR, are similar in their definition and purpose. They differ in their use of the bandwidth between 1 and 2 kHz: α assigns it to the upper band, ER assigns it to the lower band, and SB does not use it at all. However, the calculation of SPR differs from the other three measures: it is based on the peak energy within each band, rather than the sum of energy within each band.
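The band-sum measures defined above can be sketched in a few lines. This is a minimal illustration, not the cited authors' implementations: the function names are hypothetical, the sign convention (upper band relative to lower band, so that more high-frequency energy gives a larger value) is an assumption consistent with the negative ER values plotted later in this article, and the 5 kHz upper limit for α is assumed because none is stated here. SPR is omitted because it uses band peaks rather than band sums.

```python
import numpy as np

def band_energy(freqs_hz, power, lo_hz, hi_hz):
    """Sum the power-spectrum bins whose frequency lies in [lo_hz, hi_hz)."""
    mask = (freqs_hz >= lo_hz) & (freqs_hz < hi_hz)
    return float(np.sum(power[mask]))

def band_level_difference_db(freqs_hz, power, upper_band, lower_band):
    """Level of the upper band minus level of the lower band, in dB (assumed sign)."""
    upper = band_energy(freqs_hz, power, *upper_band)
    lower = band_energy(freqs_hz, power, *lower_band)
    return 10.0 * np.log10(upper / lower)

def energy_ratio_db(freqs_hz, power):
    # ER: 2-4 kHz relative to 0-2 kHz (the 1-2 kHz region goes to the lower band).
    return band_level_difference_db(freqs_hz, power, (2000.0, 4000.0), (0.0, 2000.0))

def alpha_db(freqs_hz, power, top_hz=5000.0):
    # Alpha: above 1 kHz relative to below 1 kHz (1-2 kHz goes to the upper band).
    # top_hz is an assumed upper limit; expressing alpha in dB is also an assumption.
    return band_level_difference_db(freqs_hz, power, (1000.0, top_hz), (0.0, 1000.0))

def spectral_balance_db(freqs_hz, power):
    # SB: 2-6 kHz relative to 0.1-1 kHz (the 1-2 kHz region is not used at all).
    return band_level_difference_db(freqs_hz, power, (2000.0, 6000.0), (100.0, 1000.0))
```

Applied to the same power spectrum, the three functions differ only in where the 1–2 kHz region is counted, which is the distinction drawn in the text.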
Omori et al found evidence that professional singers possess higher values of SPR than nonprofessional singers, who in turn possessed higher values than untrained singers.16 Watts et al have shown a similar SPR relationship between untrained singers who were assessed as talented or untalented by an audition panel.17 Sundberg compared an orchestra’s spectrum to a solo singer’s spectrum and found that the two are similar in contour, except for an increase in energy in the spectral region around the singer’s formant cluster.18 Rossing et al showed a similar result, by comparing a group of eight bass/baritones singing in solo and as part of a choir.19 This result was
replicated with a group of five soprano singers.20 Dmitriev and Kiselev showed that the centre frequency of the singer’s formant peak depended on singer type.21 However, although these spectral energy distributions may be associated with high singing quality, Kenny et al showed that a singer who is perceptually rated as low in quality by expert judges may still exhibit high values of these parameters.22 Further research is needed into the relationship between perceived vocal quality and the modulation of these parameters by the singer for expressive or performative purposes.

ER, α, SB, and SPR are usually calculated from the long-term average spectrum (LTAS) of a singing sample. The spectral structure that is due to a particular voice’s phonetic, pitch, and timbral qualities is averaged into a spectral contour that is representative of the formant structure being used by the singer. This process removes the time dimension entirely, and ignores the effect of the song’s pitch and dynamic range on the high-frequency energy. Jansson and Sundberg23 describe how the overall range of a musical sample may affect a resulting LTAS, but the particular notes played within that range will generally not affect it. They also show that the LTAS converges rapidly, after approximately 20 s. However, this process produces problematic effects that are investigated further in this study.

The purpose of ER is to provide a summary of the spectrum, in terms of the relative distribution of energy in the bands 0–2 kHz and 2–4 kHz. The LTAS as the basis for this calculation is problematic, as it provides information regarding both the spectral distribution over time and the sound pressure level (SPL) at which the spectral distribution is presented. Both the amount of time a particular spectral distribution is maintained and the SPL with which it is maintained are conflated into a single variable, ER. This is due to the decibel transformation that occurs in the calculation of the spectrum.
In terms of physical power or energy, the underlying power unit of SPL is the squared sound pressure, and thus time-varying levels that differ by 30 dB (such as those described in this study) differ by three orders of magnitude when they are transformed to squared sound pressure, which is where the averaging process is undertaken. Therefore sound segments with a low SPL in a singing sample are unlikely to significantly affect the result of an LTAS across time.

Sjölander and Sundberg24 showed that subglottal pressure affected the level of the singer’s formant cluster (when compared with the level of the first formant area) for professional baritone singing. Nordenberg and Sundberg25 demonstrated that this effect exists for running speech as well. Sundberg and Nordenberg26 showed the correlation between high-frequency energy usage and Leq (mean squared sound pressure for a sample expressed in decibels) is strong enough to develop equations that may be used to predict the amount of high-frequency energy. Ternström et al15 have described a similar correlation between SPL and high-frequency energy usage in very loud speech over noise. Their study also reported an upper SPL saturation point at which there was no further increase in high-frequency energy for a corresponding increase in SPL.

Knowledge of average high-frequency energy is not likely to be the only pertinent aspect of a voice’s quality. More complex analyses of the way in which high-frequency energy is used may elucidate the other factors involved. For example, a voice of quality may maintain a similar amount of high-frequency energy across its dynamic range, whereas a lesser voice shows level dependence. In this article we seek to investigate these effects through the calculation and statistical analysis of the time-varying spectral energy distribution.

In the following sections we use the term “high-frequency energy.” In various contexts “high-frequency” can refer to different parts of the spectrum, but for our purpose we use it to describe the range 2–4 kHz. This range surrounds the singer’s formant region (between 2.4 and 3.6 kHz depending on singer type21) and therefore any increase in high-frequency energy implies the likely use of the singer’s formant.
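The energy-domain averaging argument made earlier in this section can be checked with a short calculation. The sketch below is illustrative only: the function name and the two example levels (90 and 60 dB, each occupying half of the sample) are hypothetical values chosen to match the 30 dB difference mentioned in the text.

```python
import numpy as np

def energy_mean_db(levels_db):
    """Average levels in the squared-pressure (energy) domain, then return dB."""
    levels_db = np.asarray(levels_db, dtype=float)
    return float(10.0 * np.log10(np.mean(10.0 ** (levels_db / 10.0))))

# A 30 dB level difference is a factor of 1000 in squared sound pressure.
power_ratio = 10.0 ** (30.0 / 10.0)

# Two equally long segments, one loud and one quiet (hypothetical levels).
with_quiet_segment = energy_mean_db([90.0, 60.0])

# The same loud segment followed by complete silence of equal duration.
with_silence = float(10.0 * np.log10(10.0 ** 9.0 / 2.0))

# Arithmetic mean of the decibel values, which weights both segments equally.
arithmetic_mean = float(np.mean([90.0, 60.0]))
```

Under these assumptions the energy-domain mean is about 87 dB whether the second half of the sample is at 60 dB or silent (the two cases differ by less than 0.01 dB), whereas the arithmetic mean of the decibel values is 75 dB. The quiet segment therefore contributes almost nothing to an energy-domain average.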
METHOD
Participants
Four professional and five university level student baritones participated in the study. Of the four professional singers, two had sung as international soloists, and two had sung as national soloists, according to the Bunch and Chapman taxonomy.27 Their ages ranged from 44 to 63 years with a mean of 52.75 years (SD = 8.1). These baritones had received singing lessons for an average of 27 years (range 20–43; SD = 10.9 years). Three of the professional singers sang primarily opera, either as soloists, operatic chorus artists, or both. The other sang primarily in classical recitals.

Of the five tertiary level student baritones, four were undertaking a Bachelor of Music degree at a University Conservatorium of Music and one was receiving private singing lessons while engaged in another occupation. All students were studying the western classical style of singing. The average age was 22.6 years (range 20–27 years; SD = 2.96). They had studied singing on average for 6.4 years (range 4–9 years). Two students nominated opera as their predominant singing style. The other three students sang a combination of western classical, choral, music theater, and contemporary styles. Two advanced students had professional singing engagements at local and state opera companies. The other three students performed in local “gigs” in various venues such as nightclubs and pubs, or in bands at functions and in choirs. Table 1 presents the divisions we have used for the level of training of the singers.

TABLE 1. Singers’ self-reported levels of training
Singer   Professional/Student   Fine Classification
2St      Student                Student
3St      Student                Student
4St      Student                Student
5ASt     Student                Advanced student
9ASt     Student                Advanced student
1NSo     Professional           National soloist
6NSo     Professional           National soloist
7ISo     Professional           International soloist
8ISo     Professional           International soloist
Effects of Voice Training in Nine Male Classical Singers
Recording Method
Recordings were undertaken in an acoustically treated room. As the room was lined only with thick felt curtains, the acoustic conditions were not anechoic, but were as nonreverberant as possible without being unnatural for singing purposes. The microphone layout method described by Cabrera et al28 was used in this recording process. The primary signal was taken from a Bruel & Kjaer microphone (Bruel & Kjaer Sound and Vibration Measurement, Naerum, Denmark), which was positioned 70 mm from the left corner of the singer’s mouth by means of a headset mounting frame. The benefit of this particular microphone position was that it allowed the accurate measurement of the source without interference due to movements of the head, the air stream of the voice, or external noise sources; the signal-to-noise ratio at this distance is great enough to largely overwhelm any acoustic reflections or background noise. At this position, the SPLs were quite high (approximately 120 dB at their maximum), and a small diaphragm microphone such as the Type 4135 was necessary to cope with the high sound output of the singers. However, as this position is not a typical listening position, these recordings were scaled to a 2-m distance, which is somewhat representative of a singing teacher’s proximity during a lesson. The microphone was calibrated with a reference tone of 93.98 dB(SPL) produced by a Bruel & Kjaer coupled calibrator (Type 4231).

The microphone signals were preamplified by a Millennia Preamp Model HV-3D (Millennia Music Media Systems, Placerville, CA); analog to digital conversion was performed by an Apogee AD-16 analog to digital converter (Apogee Electronics Corp, Santa Monica, CA); and the digital signal was sent to an RME Audio Hammerfall DSP36 audio interface (Haimhausen, Germany) and then recorded to hard disc using Adobe Audition software (Version 1.5, Adobe Systems Inc. San Jose, CA).
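The two level conversions mentioned above can be illustrated as follows. The text does not state how the 70-mm recordings were scaled to 2 m, so the inverse-distance (spherical spreading) law used here is an assumption that may not hold so close to the mouth, and the function names are hypothetical. The calibration level itself follows from the standard reference pressure of 20 µPa: a 1 Pa RMS tone corresponds to approximately 93.98 dB SPL.

```python
import math

P_REF_PA = 20e-6  # standard reference sound pressure in air, 20 micropascals

def pressure_to_spl_db(p_rms_pa):
    """Convert an RMS sound pressure in pascals to dB SPL."""
    return 20.0 * math.log10(p_rms_pa / P_REF_PA)

def rescale_spl_db(spl_db, from_distance_m, to_distance_m):
    """Rescale a level between distances, ASSUMING free-field 1/r spreading."""
    return spl_db - 20.0 * math.log10(to_distance_m / from_distance_m)

calibrator_level = pressure_to_spl_db(1.0)        # level of a 1 Pa RMS tone
peak_at_2m = rescale_spl_db(120.0, 0.070, 2.0)    # 120 dB at 70 mm moved to 2 m
```

Under the inverse-distance assumption the correction from 70 mm to 2 m is about 29 dB, so a 120 dB maximum at the headset microphone would correspond to roughly 91 dB at 2 m.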
Singing Tasks
For the purposes of this study, we recorded an aria, Curtis’s “Torna A Surriento,” which was chosen due to its range of over an octave despite its relatively low level of difficulty. We analyzed only the last 14 bars of the aria, although the singers recorded the entire aria. The musical score for the song was provided to all singers before the recording session, and a starting note was given to make sure each singer was singing within the same range. No attempt was made to specify tempo, apart from the musical instructions in the score.

Digital Analysis Parameters
Fundamental Frequency (F0), SPL, and LTAS were calculated using Praat.29 Methods for calculating spectral indicators such as ER or SPR from the LTAS have been described elsewhere,16,23 but there are some considerations about such spectral indicators that need to be discussed in this context. The calculation of an LTAS of a sample of singing conflates two variables (the overall SPL, and the spectral distribution of energy) into one representation of the spectral energy distribution in the sample. This is often not desired, and it seems that much information is lost in the averaging processes that have occurred to provide this single representation. Alternatively, by using a short-term rather than long-term Fourier transform, it is possible to calculate the
variation within the spectral energy distribution over the time period of the sample. An identical process to ER calculation is used, but for many short-term frames instead of the entire sample. We have termed this measure Short-Term Energy Ratio (STER) to distinguish it from ER within this article. It results in time series data, rather than a single number representing the whole sample. STER was calculated by use of a short-term Fourier transform employing a 1024-point Hanning window with a time step of 10 ms. At a 48 kHz sampling rate, a 1024-point window is approximately 21-ms long, and thus there is an overlap between frames. From these time series data, we examined the distribution of the spectral variation over time and compared these with the LTAS-based ER calculations.

RESULTS
Unvoiced Frames
The data were investigated to determine the voiced proportion. The data were thresholded based on the strength of the first nonzero peak in the normalized autocorrelation function. A correlation coefficient of at least 0.45 was required for an F0 candidate to be considered voiced. According to Sundberg and Bauer-Huppmann,30 vowels are the primary musical elements, as they are usually synchronized with the start of notes played by the piano accompaniment, and are also much longer in duration than consonants. However, consonants cannot be discounted, because they often contain large amounts of high-frequency energy. For this analysis, we focused on the voiced portion of the sound as a general approximation of the vowel content. We then split unvoiced sound on the basis of its SPL. Unvoiced sound with an SPL greater than 60 dB was classified as consonants and unvoiced sound with a level less than 60 dB was classified as silence. The choice of this threshold is supported by the data in Figure 3. Figure 1 gives the proportions of the total time considered voiced, unvoiced consonant, and silence for each of these samples.
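The three-way split of frames described above can be sketched as follows. This is an illustration rather than the analysis code used in the study: it assumes that a per-frame voicing strength (the normalized autocorrelation peak, as reported by a pitch tracker) and a per-frame SPL are already available, the function names are hypothetical, and assigning an unvoiced frame at exactly 60 dB to silence is an assumption because the text does not specify that case.

```python
import numpy as np

def classify_frames(voicing_strength, spl_db,
                    voicing_threshold=0.45, silence_threshold_db=60.0):
    """Label each frame as 'voiced', 'consonant' (loud unvoiced) or 'silence'."""
    voicing_strength = np.asarray(voicing_strength, dtype=float)
    spl_db = np.asarray(spl_db, dtype=float)
    # Unvoiced frames are split by level; exactly 60 dB falls to 'silence' (assumed).
    unvoiced_label = np.where(spl_db > silence_threshold_db, "consonant", "silence")
    return np.where(voicing_strength >= voicing_threshold, "voiced", unvoiced_label)

def proportions_percent(labels):
    """Percentage of frames in each class, as summarized in Figure 1."""
    labels = np.asarray(labels)
    return {name: float(100.0 * np.mean(labels == name))
            for name in ("voiced", "consonant", "silence")}
```

Because the frames are equally spaced in time, the percentage of frames in each class is also the percentage of the sample duration in that class.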
FIGURE 1. Nonvoiced frames seem to form a smaller proportion of the sample with singers who are not students. However, advanced students and all soloists seem to exhibit similar levels of unvoiced sound. [Axes: Proportion (%) by Subject.]

The silent proportion was generally greater in the students’ performances, whereas the advanced students and soloists exhibited similar voiced proportions. The proportion of consonants was variable but did not follow a pattern related to the training level of the singers. Singer 2St gave a very tentative performance, and hence his performance includes a much larger proportion of silence. The spectrograms for singer 2St (and for singers 3St and 4St as well) show that the extra silence is due mainly to the lengthening of breathing pauses between phrases; when compared with singer 7ISo (an international soloist), the pauses were often lengthened by a factor of approximately 2. By contrast, singers 7ISo and 8ISo also lengthened many of the long notes before the pauses, further increasing their voiced proportion. We omitted all unvoiced frames from all subsequent analyses.
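Before turning to the spectral results, the STER calculation defined under Digital Analysis Parameters can be sketched as below. This is a minimal reading of that description (1024-point Hanning window, 10-ms step, 48 kHz sampling rate, energy in 2–4 kHz relative to 0–2 kHz) and not the code used in the study; the function name, the handling of the band edges, and the small constant that guards against silent frames are assumptions. Removing unvoiced frames would be a separate step applied to the returned series.

```python
import numpy as np

def short_term_energy_ratio(signal, fs=48000, n_fft=1024, step_s=0.010):
    """Return one energy ratio (dB) per analysis frame of the signal."""
    signal = np.asarray(signal, dtype=float)
    if len(signal) < n_fft:
        raise ValueError("signal is shorter than one analysis window")
    hop = int(round(step_s * fs))             # 480 samples at 48 kHz
    window = np.hanning(n_fft)                # about 21 ms long at 48 kHz
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    low = freqs < 2000.0                      # 0-2 kHz band
    high = (freqs >= 2000.0) & (freqs < 4000.0)   # 2-4 kHz band
    n_frames = 1 + (len(signal) - n_fft) // hop
    tiny = np.finfo(float).tiny               # avoids log of zero on silent frames
    ster_db = np.empty(n_frames)
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + n_fft] * window
        power = np.abs(np.fft.rfft(frame)) ** 2
        ster_db[i] = 10.0 * np.log10((power[high].sum() + tiny)
                                     / (power[low].sum() + tiny))
    return ster_db
```

For a one-second test signal containing a 500 Hz tone and a 3 kHz tone at one tenth of its amplitude, every frame should return a value close to -20 dB, because the power ratio between the two bands is 0.01.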
FIGURE 2. ER calculated in the traditional way, using the LTAS. [Axes: Energy Ratio (dB) by Subject; groups: Student, Advanced Student, National Solo, International Solo.]
Spectral Energy
Professional singers exhibited greater spectral energy in the 2–4 kHz range than student singers. Figure 2 shows the ER results for this set of singers, calculated from the LTAS of the aria sample. There is a trend showing higher average values of ER, indicating greater energy in the high-frequency range, for singers with greater levels of training and experience; each of the training categories yielded systematically greater maximum values of ER. However, there is some overlap between the four categories; one of the advanced students shows a greater ER value than one of the international soloists.

The distribution of STER variation over the course of each sample demonstrates interesting effects. From Figure 3A it can be seen that professional singers had a more consistent usage of high-frequency energy than those with less training. Figure 3B gives the SPL distribution for each singer, showing no corresponding systematic variation. In Figure 3C, comparison of the range between the lower and upper whiskers and the 25th and 75th percentiles in both groups shows that singers in this study with higher levels of training use a narrower range of high-frequency energy values across the time period of the sample. This effect is systematic with training level for both the range shown by each of the “whiskers” and by the box “hinges.” Although the upper whiskers for each singer do not vary systematically with the training level, the lower whiskers vary quite systematically between professionals and students. Indeed, in this regard, steps between each of the training levels rise systematically except for one subject from each of the student and advanced student classifications, who have almost identical values. This is a strong effect, especially considering the small sample size. This pattern indicates that training increases the lower limit of the distribution of STER more than the upper limit, thus also decreasing the overall range of STER.

FIGURE 3. Distributions of STER and SPL. [Panels: (a) ER (dB) by Subject; (b) Sound Pressure Level (dB) by Subject; (c) Interquartile Range (dB) and 5th–95th Percentile Range (dB) by Subject.] In boxplots (a) and (b) the extents of the box denote the 25th and 75th percentile of the data and therefore the box represents the interquartile range. The whiskers extend to the most extreme data point that is no more than the interquartile range multiplied by 1.5 from the box extents. Three systematic effects can be seen. In (a) the increase in the lower whisker is marked, with each group separated excepting singers 3St and 5ASt, whose results are almost identical. In (c) the Interquartile Range of the STER data shows no overlap between the students (Student and Adv. Student classifications) and professionals (Nat. Soloist and Int. Soloist classifications) groups; however, there is an overlap between the National Soloists and International Soloists. The 95th to 5th Percentile ranges show that advanced student singer 9ASt overlaps with professionals, and a similar overlap between National Soloist and International Soloist is seen again.

The scatter plot in Figure 4 is better able to present some of the patterns within the STER and SPL data for each singer. A linear regression model has been fitted to the data, shown in red, as well as a local regression model, shown in dotted black. A local regression is a method that applies a polynomial regression to a subset of the data in a piece-wise manner, providing a demonstration of trend within smaller “neighborhoods” of the data, rather than being bound by the entire data distribution. The linear regression shows that nearly all the student singers (except singer 3St) have a steeper upwards slope than their professional counterparts. Singer 3St attained a similar pattern to the professional singers, albeit with spectral energy slightly lower than the professionals, while singing at a higher level. The local regression, however, describes a slightly different effect: for some singers there is an SPL value at which the increasing STER starts to decrease. This occurs more often at higher spectral energy levels, and is similar to the “spectrum saturation” effect described by Ternström15 for loud speech. However, for most of the student singers the local regression is relatively close to the linear regression; they do not seem to reach this saturation point. This effect may be due to the singers with greater training singing more loudly overall. But as we see in Figure 3B, the distributions of singers’ SPLs over the time period show no clear pattern. Professional singers use a similar dynamic range to many of the students, but have greater spectral energy.
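The difference between the two regression models applied to the STER and SPL data can be illustrated with a small sketch. The text does not specify the polynomial degree, neighborhood size, or weighting of the local regression, so the choices here (a straight line fitted to the nearest 30% of points with tricube weights, in the style of LOESS) are assumptions, and the function names are hypothetical.

```python
import numpy as np

def linear_fit(x, y):
    """Ordinary least-squares line over all points: returns (slope, intercept)."""
    slope, intercept = np.polyfit(x, y, 1)
    return float(slope), float(intercept)

def local_regression(x, y, x_eval, frac=0.3):
    """Locally weighted straight-line fit evaluated at each point in x_eval."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_eval = np.asarray(x_eval, dtype=float)
    k = max(2, int(np.ceil(frac * len(x))))       # size of each neighborhood
    fitted = np.empty(len(x_eval))
    for i, x0 in enumerate(x_eval):
        dist = np.abs(x - x0)
        nearest = np.argsort(dist)[:k]            # the k closest data points
        h = max(dist[nearest].max(), np.finfo(float).tiny)
        weights = (1.0 - (dist[nearest] / h) ** 3) ** 3   # tricube weights
        # np.polyfit applies its weights to the residuals, hence the square root.
        coeffs = np.polyfit(x[nearest], y[nearest], 1, w=np.sqrt(weights))
        fitted[i] = np.polyval(coeffs, x0)
    return fitted
```

On data that rise with SPL and then level off, the global line keeps a single positive slope, whereas the local fit follows the rise at low SPL and the plateau at high SPL. This is the kind of departure from linearity that the saturation effect would produce.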
Another method that may be appropriate for finding the average ER, rather than using the LTAS, is to use the mean along the time dimension of STER results. This takes the mean ER based on equally weighted time segments, ignoring the nonlinear weighting effect of each segment’s SPL. Of course, an increase in the SPL in some segments may increase the STER somewhat, but Figure 4 shows that some singers can control STER somewhat independently of SPL. The high STERs at low SPLs that professionals are able to produce more consistently are taken into account by using this method, whereas the LTAS largely obscures low SPL data. Figure 5 shows these results. Four of the five students’ values decrease when this is implemented, and all of the professional singers’ values increase. Again singer 3St is an outlier in this analysis, as his ER value increases when recalculated using mean STER, in contrast with the rest of the student singers. The overall effect of this method is to increase the range of the results. It can also be noted that although the LTAS-based ER results (Figure 2) exhibited some overlap between advanced student 9ASt and professionals 1NSo and 7ISo, the use of this STER-based calculation method results in the disappearance of that overlap. Although there is a small difference between the results obtained using the LTAS-based ER method and those obtained using the mean STER method, the primary benefit of using STER is that the distribution of STER over time may be investigated in addition to the average ER.

DISCUSSION
This study investigated singers’ use of spectral energy over the course of an aria. Although these effects are demonstrated adequately in this study, our sample size is small, and this study is exploratory. We have shown that over the course of an aria singers with more training tend to use a narrower range of spectral energy, without using a narrower range of SPL. Singers with greater training also demonstrated higher minimum levels of spectral energy, but did not show similarly higher upper levels of spectral energy. At the same time, singers with more training demonstrated an upper limit of SPL at which their STER ceases to increase, and in some cases it decreased as SPL increased. These three effects demonstrated that the more highly trained singers increased the lower limits of their time-variant spectral energy (mainly evident at lower SPLs) rather than increasing the entire distribution. As the high-frequency energy at lower SPLs increased, the lack of a concomitant increase in the high-frequency energy at high SPLs meant that the overall distribution narrowed. This produced more consistent high-frequency energy over the time window under investigation for the more highly trained singers.
FIGURE 4. Scatter plot of STER compared with SPL. Local regression is overlaid in dashed black, whereas linear regression is overlaid in solid red. SPL is scaled to a distance of 2 m in this analysis.
FIGURE 5. ER is recalculated by using the mean of STER, thus avoiding the biasing effect of SPL. Removing this bias increases the values of all the professionals and decreases the ratings of all the students except singer 3St. [Axes: dB by Subject; series: LTAS-based ER and STFT-based Mean STER; groups: Student, Adv. Student, Nat. Solo, Int. Solo.]
An outlying case was singer 3St (see Figure 3A and B). This student singer possessed much greater SPLs and high-frequency energy than his peer students. However, when listening to his sound sample, although the “ring” of the voice was clearly evident, the quality of singing was marred by other musical factors such as less accurate intonation and phrasing. When we compare singer 3St with singer 5ASt, we can see it is possible to achieve similar STER values (Figure 3A) without using as much SPL (Figure 3B). Although there is clearly an association between singers of quality and high-frequency energy usage, the two are not necessarily related causally. It may be possible to increase ER simply by increasing SPL, without necessarily increasing the quality of singing. It is also likely that factors such as intonation inaccuracy could negate any improvements made to tonal quality through high-frequency energy usage.

Another interesting comparison is between each of the two National Soloists (singers 1NSo and 6NSo) and International Soloists (singers 7ISo and 8ISo). Whereas singer 8ISo and singer 6NSo seem to have the greatest Mean STER (Figure 5) in their respective classifications, singers 1NSo and 7ISo had the smallest STER ranges in their respective classifications for both their Interquartile Range and their 95th to 5th Percentile Range. This effect may highlight the existence of a trade-off between the consistency of STER and the use of high values of STER. It may be that it is difficult to maintain the use of high STER values for the entire time period, thus leading to a necessarily larger range. Singers like 8ISo and 6NSo may employ higher values for short periods within the aria before returning to values commensurate with their narrower range peers (singers 1NSo and 7ISo). This effect needs further research.
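The per-singer summaries compared in this discussion (mean STER, Interquartile Range, 95th to 5th Percentile Range, and the boxplot whiskers of Figure 3) could be computed from a STER time series along the following lines. The function name and the returned dictionary keys are hypothetical, and the percentiles use NumPy's default linear interpolation, which may differ slightly from the software used for the figures.

```python
import numpy as np

def ster_summary(ster_db):
    """Summarize the distribution of a STER time series (values in dB)."""
    ster_db = np.asarray(ster_db, dtype=float)
    p05, q1, median, q3, p95 = np.percentile(ster_db, [5, 25, 50, 75, 95])
    iqr = q3 - q1
    # Whiskers: the most extreme data points within 1.5 x IQR of the box edges.
    lower_whisker = ster_db[ster_db >= q1 - 1.5 * iqr].min()
    upper_whisker = ster_db[ster_db <= q3 + 1.5 * iqr].max()
    return {
        "mean": float(ster_db.mean()),
        "median": float(median),
        "iqr": float(iqr),
        "p05_to_p95_range": float(p95 - p05),
        "lower_whisker": float(lower_whisker),
        "upper_whisker": float(upper_whisker),
    }
```

A singer with a high mean but a wide range, and a singer with a lower mean but a narrow range, would be distinguished by the `iqr` and `p05_to_p95_range` entries even where their means are similar.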
Mechanisms that trained singers use for achieving high-frequency energy were described by Sundberg,9 who pointed out that “the pharynx widening has the effect of isolating the larynx tube from the rest of the vocal tract so that its resonance frequency is not altered by articulatory movements outside it.” As an example, this disconnection between the larynx tube and the rest of the vocal tract is evident in a comparison between spectrograms of singers 4St and 8ISo (Figure 6). For singer 8ISo the singer’s formant cluster is not only very densely clustered, but also very stable compared with the vowel changes. Another observation of Sundberg was that the larynx lowering generally decreases the frequency of the fifth formant, which is possibly supported by the spectrograms in this case as well.9 Although it is beyond the scope of this article, a similar time series approach that correlates formant frequencies to high-frequency energy usage has the potential to show these effects more robustly.

FIGURE 6. Spectrograms of singer 4St (top) and 8ISo (bottom) singing the second phrase of the aria show different approaches to the usage of the upper formants.

Perceived Effort
The ratio of high-frequency energy to low-frequency energy in a voice can be related to the amount of subglottal pressure being used.24 Conversely, this means that a listener may perceive a change in the ratio of high- to low-frequency energy as a change in the effort being made by the singer. Performance ability by singers may be related to a narrowing of the range of perceived effort in the listener; it may sound “easier” to the listener if it sounds like the singer is doing the same amount of work at all times. The primary cue to the “perceived effort” of the singer may perhaps be variance in the spectral energy distribution, rather than purely variance in SPL. Subjective testing that separates these two acoustic parameters may elucidate their contribution.

Measurement Implications
There are significant implications for the measurement, and especially for the summary, of high-frequency energy in the singing voice. The term “Long-Term Average Spectrum” implies a spectral summary across time, but the averaging process is actually dominated by the levels within the sample. An LTAS largely ignores lower SPLs, as the analysis method is overwhelmed when portions of the sample with 20–30 dB greater SPLs are averaged with lower level portions. A spectral indicator that attempts to conflate more than one variable into a single representative variable must do so circumspectly, by ensuring that the loss of information does not result in undue biases. The LTAS of a sample is biased toward the high-frequency energy distribution that was present when the maximum levels occurred in the sample. Perhaps the use of spectral summaries like ER and SPR, due to their reliance on the LTAS process, makes implicit assumptions about the relative importance of SPL and time in reducing the sample to a single number. It is unlikely these assumptions can be easily supported, and they may be unintended.

Using a short-term Fourier transform seems to be a better way to separate ER and SPL and investigate each independently. The spectral distribution, and thus the spectral slope measure, should be found for many short windows in time (here, approximately 21-ms windows at 0.01-s steps) before averaging them over the time period, thereby avoiding the biasing effect of SPL. Medians and percentiles are other methods that may be appropriate ways to summarize a spectral distribution. Although it is tempting to point to the more systematic effects of training that were evident when using this method, our results would need to be replicated before generalizing these findings. However, this method is certainly promising, and a larger replication study with an adequately powered sample seems worthwhile from the results observed. In this study, however, we sought to highlight the large range of STERs obtained in each sample, and the necessity for a careful assessment of any averaging process.
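The difference between the two summaries can be made concrete with per-frame band powers. The sketch below assumes, as described above, that the LTAS averages frames in the power domain before the band ratio is taken, whereas mean STER takes the ratio in each frame first and then averages the decibel values. The function names and the example values (loud frames with an ER of -10 dB, and frames about 30 dB quieter with an ER of -20 dB) are hypothetical.

```python
import numpy as np

def ltas_based_er_db(low_band_power, high_band_power):
    """Average the band powers over frames first, then take the ratio in dB."""
    return float(10.0 * np.log10(np.mean(high_band_power) / np.mean(low_band_power)))

def mean_ster_db(low_band_power, high_band_power):
    """Take the ratio in dB for every frame first, then average over frames."""
    low = np.asarray(low_band_power, dtype=float)
    high = np.asarray(high_band_power, dtype=float)
    return float(np.mean(10.0 * np.log10(high / low)))

# Hypothetical sample: half loud frames (ER = -10 dB), half quiet frames (ER = -20 dB).
low_power = np.array([1000.0] * 50 + [1.0] * 50)
high_power = np.array([100.0] * 50 + [0.01] * 50)

er_from_ltas = ltas_based_er_db(low_power, high_power)
er_from_ster = mean_ster_db(low_power, high_power)
```

In this constructed example the LTAS-based value stays within about 0.01 dB of the loud frames' -10 dB, while the mean STER is -15 dB, weighting both halves of the sample equally in time.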
Limitations This study was cross-sectional, using a sample of singers at different stages of training and professional experience. It is therefore inadvisable to invoke these effects as a model of how singers learn their art; this would require a longitudinal study, which might need to be lengthy considering that the ages of the singers we investigated spanned 43 years. Our primary focus was to compare singers with different levels of training, to look for robust effects of training on the distribution of high-frequency energy over time, and to construct a peer-reviewed method for measuring spectral energy modulation and consistency over a singing sample. A follow-up study would need a larger sample of singers to confirm the effects that have been explored in this study. The purpose of a future study would determine the number of singers necessary: if it is primarily to distinguish between students and professionals (ie, two groups), then other studies have used sample sizes of approximately 20–25 (eg, Ternström et al15). However, if the study were to collect more data at each of the Bunch and Chapman27 taxonomy levels (instead of a selection), it might be able to describe at which level spectral energy usage changes significantly, although significance would need to be achieved between neighboring levels to show this, increasing the necessary sample size. Such a study might also examine which is more perceptually important: the consistency of the high-frequency energy over time, the mean high-frequency energy, or the high-frequency energy compared with the SPL. These three effects have not been adequately compared in terms of their perceptual significance in this study.
Acknowledgments This research was carried out with the technical assistance of Robert Sazdov and Peter Thomas. A conversation with Thomas Millhouse and comments from two anonymous reviewers are gratefully acknowledged. This research was supported under the Australian Research Council’s Discovery Projects funding scheme (project number DP0558186). The first author was supported by an Australian Postgraduate Award. We are most grateful to the nine singers who gave their time to participate in this study.
REFERENCES
1. Bartholomew WT. A physical definition of ‘good voice-quality’ in the male voice. J Acoust Soc Am. 1934;6:25-33.
2. Ekholm E, Papagiannis GC, Chagnon FP. Relating objective measurements to expert evaluation of voice quality in western classical singing: Critical perceptual parameters. J Voice. 1998;12(2):182-196.
3. Robison C, Bounous B, Bailey R. Vocal Beauty: A study proposing its acoustical definition and relevant causes in classical baritones and female belt singers. J Singing. 1994;51:19-30.
4. Mendoza E, Valencia N, Munoz J, Trujillo H. Difference in voice quality between men and women: use of the long-term average spectrum (LTAS). J Voice. 1996;10(1):59-66.
5. Fant G. Acoustic Theory of Speech Production. The Hague: Mouton; 1960.
6. Fant G, Fintoff K, Liljencrants J, Lindblom B, Martony J. Formant-Amplitude Measurements. J Acoust Soc Am. 1963;35(11):1753-1761.
7. Sundberg J. Formant structure and articulation of spoken and sung vowels. Folia Phoniatr. 1970;22(1):28-48.
8. Sundberg J. The source spectrum in professional singing. Folia Phoniatr. 1973;25:71-90.
9. Sundberg J. Articulatory interpretation of the singing-formant. J Acoust Soc Am. 1974;55(4):838-844.
10. Sundberg J. Level and center frequency of the singer’s formant. J Voice. 2001;15(2):176-186.
11. Kang GS, Coulter DC. 600 BPS voice digitizer. Proc. of the 1st Int. Conf. on Acoustics, Speech and Signal Proc., Philadelphia, Pennsylvania 1976;91-94.
12. Millhouse T, Clermont F, Davis P. Exploring the importance of formant bandwidths in the production of the singer’s formant. Proc. 9th Australian Int. Conf. Speech Sc. and Tech 2002;373-378.
13. Thorpe CW, Cala SJ, Chapman J, Davis PJ. Patterns of breath support in projection of the singing voice. J Voice. 2001;15(1):86-104.
14. Frøkjaer-Jensen B, Prytz S. Registration of voice quality. Brüel & Kjaer Tech Rev. 1976;3:3-17.
15. Ternström S, Bohman M, Sodersten M. Loud speech over noise: some spectral attributes, with gender differences. J Acoust Soc Am. 2006;119(3):1648-1665.
16. Omori K, Kacker A, Carroll L, Riley W, Blaugrund S. Singing power ratio: quantitative evaluation of singing voice quality. J Voice. 1996;10(3):228-235.
17. Watts C, Barnes-Burroughs K, Estis J, Blanton D. The singing power ratio as an objective measure of singing voice quality in untrained talented and nontalented singers. J Voice. 2006;20(1):82-88.
18. Sundberg J. Singing and timbre. In: Music, Room and Acoustics, Vol. 17. Stockholm: Royal Swedish Academy of Music; 1977. 57-81.
19. Rossing TD, Sundberg J, Ternström S. Acoustic comparison of voice use in solo and choir singing. J Acoust Soc Am. 1986;79(6):1975-1981.
20. Rossing TD, Sundberg J, Ternström S. Acoustic comparison of soprano solo and choir singing. J Acoust Soc Am. 1987;82(3):830-836.
21. Dmitriev L, Kiselev A. Relationship between the formant structure of different types of singing voice and the dimension of supra-glottal cavities. Folia Phoniatr. 1979;31:231-241.
22. Kenny D, Mitchell H. Acoustic and perceptual appraisal of vocal gestures in the female classical voice. J Voice. 2006;20(1):55-70.
23. Jansson EV, Sundberg J. Long-time-average-spectra applied to analysis of music. Part 1: method and general applications. Acustica. 1975;34:15-19.
24. Sjölander P, Sundberg J. Spectrum effects of subglottal pressure variation in professional baritone singers. J Acoust Soc Am. 2004;115(3):1270-1273.
25. Nordenberg M, Sundberg J. Effect on LTAS of vocal loudness variation. Logoped Phoniatr Vocol. 2004;29(4):183-191.
26. Sundberg J, Nordenberg M. Effects of vocal loudness variation on spectrum balance as reflected by the alpha measure of long-term-average spectra of speech. J Acoust Soc Am. 2006;120(1):453-457.
27. Bunch M, Chapman J. Taxonomy of singers used as subjects in scientific research. J Voice. 2000;14(3):363-369.
28. Cabrera D, Davis P, Barnes J, Jacobs M, Bell D. Recording the operatic voice for acoustic analysis. Acoustics Aust. 2002;30(3):103-108.
29. Boersma P, Weenink D. Praat: doing phonetics by computer (Version 4.4.04) [Computer Program]; 2006.
30. Sundberg J, Bauer-Huppmann J. When does a sung tone start? J Voice. 2007;21(3):285-293.
Journal of Voice, Vol. 24, No. 1, 2010