Equalization in ambisonics

Applied Acoustics 139 (2018) 129–139

Shu-Nung Yao Department of Electrical Engineering, National Taipei University, No. 151, University Rd., San Shia District, New Taipei City 23741, Taiwan

Keywords: Ambisonic; Equalization; Timbral; Spatial

Abstract

In ambisonics, the number of loudspeakers must be greater than or equal to the minimum required by the ambisonic order. On the one hand, if the number of loudspeakers in an ambisonic system satisfies the minimum requirement exactly, localization in the lateral regions can be poor. On the other hand, the use of a large number of loudspeakers in an ambisonic array induces spectral impairment. In this paper, we present a binaural ambisonic decoder equipped with a 1/3-octave equalizer and show how it improves sound localization without perceptible spectral impairment. The equalization decoder estimates the level of spectral impairment in a virtual high-density loudspeaker array and uses a 1/3-octave filterbank to equalize the frequency components that are low-pass filtered or comb filtered, so that the magnitude of the treated signal is nearly uniform from low to high frequencies. Both objective and subjective listening tests were conducted. The experimental results show that the proposed method reinforces sound localization, especially at low frequencies. Additionally, the use of a 1/3-octave filterbank facilitates the higher-order extension of binaural ambisonic decoding.

1. Introduction

Ambisonics [1] can reproduce a two-dimensional sound field in terms of cylindrical harmonics by using a horizontal loudspeaker layout, or a three-dimensional sound field in terms of spherical harmonics by using a three-dimensional loudspeaker array. Because the encoding procedure is independent of the loudspeaker configuration, the same encoded signals can be played on many types of loudspeaker arrays. In addition to this encoding compatibility, the loudspeakers do not require a rigid decoding arrangement, owing to the flexibility of the ambisonic loudspeaker configuration. Several decoding methods have been proposed to enhance the listening experience. The basic ambisonic decoder can technically reconstruct the sound field at a single "sweet spot". However, not all of the loudspeaker feeds are in phase for listeners distributed over a listening area. Malham [2] first reported the out-of-phase loudspeaker signals in first-order three-dimensional ambisonics. Based on Malham's discovery, Monro [3] extended the concept to higher-order ambisonics by reducing the ratio of the directional signals to the omnidirectional signal, a method known as in-phase decoding. Daniel [4] proposed max-rE decoding, which passes the ambisonic components through filters with different gains below and above a transition frequency. This technique is advantageous because human beings use interaural time differences (ITDs) to localize low-frequency sound and interaural intensity differences (IIDs) to localize high-frequency sound.

The transition frequency depends on two factors. The first is the distance between the ears: a listener with a small head has a higher transition frequency. The second is the position of the sound source: placing the source directly to the left or right of the listener creates the maximum ITD, which results in the lowest transition frequency. Hence, Gerzon suggested that the transition frequency lies between 100 Hz and 1000 Hz [5]. In [6], Yao et al. presented the relationship between spectral impairment, ambisonic order, and the number of loudspeakers. Although a dense loudspeaker array helps localization, it causes spectrum impairment at high frequencies. The number of loudspeakers used in ambisonics is therefore a trade-off: a large number is excellent for low-frequency reconstruction, whereas a small number is beneficial for high-frequency reconstruction. A split-band method was proposed in [6] to reinforce audio quality: the frequency components were split, and the nearly perfectly reconstructed components from a large loudspeaker array and from a small loudspeaker array were subsequently combined to enhance the audio quality. According to [7], localization in the lateral regions can be reinforced by adding more loudspeakers. Yao et al. [6], on the other hand, observed that a large number of loudspeakers has a negative impact on timbral fidelity. Therefore, we propose a binaural ambisonic decoder equipped with an equalizer for a virtual dense loudspeaker array. With a large number of virtual loudspeakers, the ambisonic decoder improves localization accuracy and, when combined with a one-third-octave filterbank, enhances timbral fidelity. The equalizer


estimates the spectral impairment at the center of the loudspeaker array by a root-mean-square (RMS) level calculation and compensates for the low-pass-filtered or comb-filtered frequency components.

2. Ambisonics by cylindrical harmonic signals

Ambisonics is used to represent a two- or three-dimensional acoustic pressure field as a function of cylindrical or spherical harmonic components, respectively. The expression of the sound field as a combination of spherical harmonic signals was discussed in [6]. In this paper, we consider the horizontal-only case as an example and show that a two-dimensional sound field can be expressed as a sum of cylindrical harmonic components. Consider a monophonic sound ψ coming from θ_ψ. The sound field is [8]

p(kr, \theta) = \psi \, e^{i k r \cos(\theta - \theta_\psi)},  (1)

where θ is the anti-clockwise angle from the center front, r is the distance from the origin, k is the wave number, and i is defined as \sqrt{-1}. If this expression is written using cylindrical harmonics [9],

p(kr, \theta) = \sum_{m=-\infty}^{\infty} i^m q_m e^{i m \theta} J_m(kr),  (2)

where J_m(kr) is the Bessel function of the first kind. The coefficients q_m are expressed as [9]

q_m = \psi \, e^{-i m \theta_\psi}.  (3)

Assume that Eq. (2) is truncated at the Mth order,

p(kr, \theta)_M = \sum_{m=-M}^{M} i^m q_m e^{i m \theta} J_m(kr).  (4)

As shown in Appendix A, when an N-loudspeaker array is used to model the sound field, the superposition of the N loudspeakers is

p(kr, \theta)_S = \frac{\psi}{N} \sum_{n=1}^{N} e^{i k r \cos\left(\frac{2 n \pi}{N} - \theta\right)} \sum_{m=-M}^{M} e^{-i m \left(\frac{2 n \pi}{N} - \theta_\psi\right)}.  (5)

3. Reconstruction errors in different loudspeaker arrays

Since the ambisonic loudspeaker configuration is flexible, we do not need to mount the loudspeakers in a rigid decoding arrangement as long as they are equally distributed around the listening area and are arranged in diametrically opposed pairs. However, basic ambisonics was designed for perfect reconstruction at the center point of the loudspeaker array, not at the ear positions. Impairment of the sound field may occur at an off-center listening position or at high frequencies. The reconstruction errors at the listeners' ear positions are discussed in this section.

3.1. Relative sound intensity

In [10], Solvang used the relative sound intensity to compare the squared pressures of p(kr, θ) and p(kr, θ)_S; the relative sound intensity is

I_{rel}(kr, \theta, \theta_\psi) = \frac{|p(kr, \theta)_S|^2}{|p(kr, \theta)|^2} = \frac{p(kr, \theta)_S \, \bar{p}(kr, \theta)_S}{p(kr, \theta) \, \bar{p}(kr, \theta)} = \frac{1}{N^2} \sum_{n=1}^{N} \sum_{s=1}^{N} \sum_{m=-M}^{M} \sum_{l=-M}^{M} e^{i k r \left[\cos\left(\frac{2 n \pi}{N} - \theta\right) - \cos\left(\frac{2 s \pi}{N} - \theta\right)\right]} e^{-i \left[m \left(\frac{2 n \pi}{N} - \theta_\psi\right) - l \left(\frac{2 s \pi}{N} - \theta_\psi\right)\right]},  (6)

where \bar{p} denotes the complex conjugate of p. The relative intensities at the left-ear position, 0.1 m away from the center, are presented in Fig. 1 for first-order and second-order ambisonics with varying numbers of loudspeakers, when the incident angle is zero. If the values are close to 0 dB, the reconstructed sound intensities are similar to the original one. Because the relative intensities of lower-order ambisonic systems tend to converge to 0 dB slowly, Solvang [10] indicated that the number of loudspeakers needs to match the minimum requirement of the ambisonic order.

Fig. 1. Relative level of the (a) first-order and (b) second-order ambisonic reproduction of the original sound field at the listener's ear position. The direction of the incident sound is at 0°. (First order: 3, 4, 8, and 50 loudspeakers; second order: 5, 8, 16, and 360 loudspeakers.)
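To make the reconstruction-error computation concrete, the following Python/NumPy sketch evaluates the original field of Eq. (1) and the N-loudspeaker superposition of Eq. (5) at an off-center ear position and prints the relative level of Eq. (6) in decibels. It is an illustrative sketch rather than the code used for Fig. 1; the frequency grid, evaluation angle, and loudspeaker counts below are assumptions chosen only to mirror the figure.

import numpy as np

def relative_level_db(freqs, N, M, r=0.1, theta=np.pi / 2, theta_src=0.0, c=343.0):
    """Relative level 10*log10(|p_S|^2 / |p|^2) of Eqs. (1), (5), and (6).

    freqs     : array of frequencies in Hz
    N         : number of loudspeakers (uniform circular array)
    M         : ambisonic (truncation) order
    r, theta  : evaluation point, e.g. a left-ear position 0.1 m from the center
    theta_src : incident angle of the ambisonics-generated source
    """
    kr = 2.0 * np.pi * np.asarray(freqs) / c * r        # wave number times radius
    m = np.arange(-M, M + 1)
    phi_n = 2.0 * np.pi * np.arange(1, N + 1) / N        # loudspeaker azimuths

    # Loudspeaker feeds, Eq. (A.7): A_n = (psi/N) * sum_m exp(-i m (phi_n - theta_src)), with psi = 1
    A = np.exp(-1j * np.outer(phi_n - theta_src, m)).sum(axis=1) / N

    # Superposition at the ear, Eq. (5): one plane wave per loudspeaker
    p_S = (np.exp(1j * np.outer(kr, np.cos(phi_n - theta))) * A).sum(axis=1)

    # Original field at the ear, Eq. (1)
    p = np.exp(1j * kr * np.cos(theta - theta_src))

    return 10.0 * np.log10(np.abs(p_S) ** 2 / np.abs(p) ** 2)

if __name__ == "__main__":
    freqs = np.linspace(20.0, 12000.0, 512)
    for N in (5, 8, 16, 360):                            # second-order example, loosely mirroring Fig. 1b
        level = relative_level_db(freqs, N=N, M=2)
        print(f"{N:3d} loudspeakers: level at {freqs[-1]:.0f} Hz = {level[-1]:6.1f} dB")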

Despite the occurrence of spectral impairment, a dense loudspeaker array has a major advantage. By using Eq. (5) and considering that e^{i z \cos\theta} = \sum_{t=-\infty}^{\infty} i^t J_t(z) e^{i t \theta}, Appendix B shows that

p(kr, \theta)_S = \psi \sum_{m=-M}^{M} e^{i m (\theta_\psi - \theta)} \sum_{a=-\infty}^{\infty} i^{aN+m} J_{aN+m}(kr) \, e^{-i a N \theta},  (7)

where a is an arbitrary integer. According to Eq. (3) and Eq. (4), in Appendix C we obtain

p(kr, \theta)_M = \psi \sum_{m=-M}^{M} e^{i m (\theta_\psi - \theta)} \, i^m J_m(kr).  (8)

Thus, if a is equal to 0 in Eq. (7), p(kr, θ)_S and p(kr, θ)_M are equivalent. In other words, the terms i^{aN+m} J_{aN+m}(kr) e^{-iaNθ} lead to a reconstruction error unless a = 0. The location of the maximum value of a Bessel function is controlled by its order: a higher-order Bessel function has its maximum in the higher-kr region. Therefore, for a large number of loudspeakers, J_{aN+m}(kr) becomes a high-order function, which shifts the reconstruction error to the high-kr region. This is why Solvang [10] noted that utilizing a large number of loudspeakers produces higher fidelity at low frequencies.

Additionally, a large number of loudspeakers has been reported to enhance localization accuracy when kr < N [6].

We considered second-order ambisonics as an example and arranged five loudspeakers in the first loudspeaker array and 360 loudspeakers in the second array. Because the loudspeakers were positioned with an angularly uniform geometry, Eq. (6) can be applied to compute the spectrum distortion. The ambisonics-generated sound source moved from 1° to 360°, and the measurement point was always at the same position, 0.1 m (approximately the average radius of a human head) away from the origin. Fig. 2 shows the relative levels for each source angle. Because the measurement position was fixed at r = 0.1 m, the x-axis can be relabelled as frequency in Hz. We determined that the spectrum of the 360-loudspeaker array tended to be low-pass filtered, as demonstrated in Fig. 2a. However, at low frequencies, the curves of the 360-loudspeaker array approached 0 dB, as displayed in Fig. 2b, while the curves of the pentagon array were widely distributed.

Fig. 2. Relative levels of the five-loudspeaker array and the 360-loudspeaker array (a) in the entire frequency range and (b) at low frequencies. Each loudspeaker array has 360 curves corresponding to 360 angles of the ambisonics-generated source.

The most obvious low-frequency spectrum impairment of the regular pentagon array occurred in the lateral region. We generated two ambisonics-generated sources in the regular pentagon array: one in the direction of a loudspeaker at 0° and the other at 44.5°, between two loudspeakers. Fig. 3a shows the severe low-frequency spectrum impairment that occurred when the source was positioned in the lateral region. If kr equals N, where the frequency is approximately 1.1 kHz, the relative level decreases to −19.06 dB. On the other hand, when we examine the same source angles in the 360-loudspeaker array, the location of the ambisonics-generated source has a smaller impact on the spectrum in the kr < N frequency bandwidth, as shown in Fig. 3b. If kr equals N, the relative level decreases to −6.867 dB in the lateral region, and the distortion is much lower than that of the regular pentagon array.

Fig. 3. Spectrum impairment in (a) the regular pentagon array and (b) the 360-loudspeaker array.

4. Equalization decoding

As discussed in the previous section, a dense loudspeaker array produces more accurate low-frequency cues but also suffers from severe spectrum impairment. In this section, we describe the reproduction of a sound field in a virtual dense loudspeaker array equipped with an equalizer for spectrum compensation. The binaural architecture of the proposed equalization ambisonics contains a controller, which is used to assess the spectrum impairment and then to adjust the equalizer for spectrum compensation.

The equalizer divides all frequency components into 15 frequency bands. The bandwidths of the band-pass filters are one third of an octave. The cut-off frequency of the low-pass filter depends on the lower-corner (3 dB) frequency of the first one-third-octave filter, and that of the high-pass filter on the upper-corner (3 dB) frequency of the last one-third-octave filter. The gain of each filter band is adjusted by the controller, which predicts the level of spectrum impairment at the listener's ear position.

We considered a dense circular loudspeaker array and placed an ambisonics-generated impulse at the desired source position in the virtual acoustic space. The superposition of the loudspeaker signals at the listener's ear position was calculated by the controller. Because the listener's ear is at an off-center position, the spectrum of the received signal is not flat. The RMS power of the received signal in each frequency band was then computed for the equalization. The received signal that needed equalization was fed into the filterbank. The RMS level of the jth frequency band is G_j, which is calculated by

G_j = \sqrt{ \frac{ \sum_{n=0}^{K-1} O_j^2(n) }{K} }, \quad j = 0, 1, \ldots, 14,  (9)

where O_j is the magnitude of the frequency component in the jth frequency band when the input is the impaired signal. Similarly, an impulse was then sent into the filterbank. The RMS level of the jth frequency band is A_j, which is calculated by

A_j = \sqrt{ \frac{ \sum_{n=0}^{K-1} P_j^2(n) }{K} }, \quad j = 0, 1, \ldots, 14,  (10)

where K is the number of frequency points and P_j the magnitude of the frequency component in the jth frequency band. Finally, the gain for equalizing each frequency band was computed as

E_j = \frac{A_j}{G_j}, \quad j = 0, 1, \ldots, 14.  (11)

The impaired signal was sent into the one-third-octave filterbank with the cascaded gains. The output signal was the equalized signal, whose RMS level in each one-third-octave frequency band was expected to equal that of the impulse signal.
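The gain computation of Eqs. (9)-(11) can be sketched as follows. The paper does not specify the filter design of the one-third-octave filterbank, so the fourth-order Butterworth band-pass filters, the 15 band center frequencies, and the 44.1 kHz sampling rate below are assumptions used only for illustration.

import numpy as np
from scipy.signal import butter, sosfilt

def third_octave_bank(fs, f_centres, order=4):
    """Band-pass filters roughly one third of an octave wide (sketch only)."""
    bank = []
    for fc in f_centres:
        lo, hi = fc / 2 ** (1 / 6), fc * 2 ** (1 / 6)    # -/+ one-sixth-octave band edges
        bank.append(butter(order, [lo, hi], btype="bandpass", fs=fs, output="sos"))
    return bank

def equalizer_gains(impaired, fs, f_centres):
    """Per-band gains E_j = A_j / G_j of Eqs. (9)-(11).

    G_j: RMS level of the impaired signal in band j (Eq. (9)).
    A_j: RMS level of a unit impulse in band j, i.e. the flat reference (Eq. (10)).
    """
    bank = third_octave_bank(fs, f_centres)
    impulse = np.zeros_like(impaired)
    impulse[0] = 1.0
    gains = []
    for sos in bank:
        o = sosfilt(sos, impaired)                       # band-limited impaired signal, O_j
        p = sosfilt(sos, impulse)                        # band-limited impulse, P_j
        G = np.sqrt(np.mean(o ** 2))                     # Eq. (9)
        A = np.sqrt(np.mean(p ** 2))                     # Eq. (10)
        gains.append(A / G)                              # Eq. (11)
    return np.array(gains)

def equalize(impaired, fs, f_centres):
    """Apply the band gains and sum the bands (one possible realisation of the cascade)."""
    bank = third_octave_bank(fs, f_centres)
    gains = equalizer_gains(impaired, fs, f_centres)
    return sum(g * sosfilt(sos, impaired) for g, sos in zip(gains, bank))

if __name__ == "__main__":
    fs = 44100
    # 15 assumed band centres spanning a wide range (the exact centres are not given in the paper)
    f_centres = 250.0 * 2 ** (np.arange(15) / 3)
    rng = np.random.default_rng(0)
    impaired = rng.standard_normal(fs)                   # stand-in for the impaired ear signal
    print(np.round(equalizer_gains(impaired, fs, f_centres), 3))

In practice the impaired signal here would be the controller's prediction of the ear-position response to an ambisonics-generated impulse, so the gains can be computed once per source direction and then applied to the programme material.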

5. Methods and experimental results

To assess the audio quality of the proposed ambisonic decoder, both objective measurements and subjective evaluations were conducted. The objective listening tests were conducted using 45 head-related impulse response (HRIR) datasets from a public database [11], while the subjective listening tests were performed by eight male and four female listeners. In the objective listening tests, six ambisonic decoders were tested: the basic quadraphonic decoder [1], the in-phase decoder [3], the max-rE decoder [4], the 50-loudspeaker decoder, the split-band decoder [6], and the equalization decoder. In the subjective listening test, to avoid listener exhaustion, only three binaural ambisonic decoders were involved: the four-loudspeaker decoder, the 50-loudspeaker decoder, and the equalization decoder.

5.1. Localization errors and spectral distortion in objective listening test

The duplex theory introduced by Rayleigh [12] provides a model describing how listeners distinguish the positions of sound sources by discriminating ITDs and IIDs. To evaluate ITDs and IIDs, we virtually built ambisonic loudspeaker arrays with HRIRs from the Center for Image Processing and Integrated Computing (CIPIC) HRIR database [11]. The CIPIC database consists of high-spatial-resolution HRIR measurements for 45 subjects, recorded at 25 interaural-polar azimuths and 50 interaural-polar elevations, giving 2500 HRIR pairs for each subject.

Two types of circular loudspeaker arrangements were virtually constructed: a quadraphonic (four-loudspeaker) array and a 50-loudspeaker array. The four loudspeakers were uniformly distributed in the shape of a square. Although the distribution of the loudspeakers should be uniform around the listening position [13], the 50 loudspeakers were placed at the following azimuthal angles in a head-centered interaural-polar coordinate system: −80°, −65°, −55°; from −45° to 45° in steps of 5°; and 55°, 65°, and 80°. This configuration was adopted to match the locations of the HRIR measurements described in [11]. Both loudspeaker arrangements were in the first-order ambisonic format.
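As an illustration of how such a virtual loudspeaker array can be auralized with HRIRs, the sketch below encodes a source into cylindrical harmonic signals (Eq. (3)), derives the loudspeaker feeds with the pseudo-inverse decoder of Appendix A (Eq. (A.5)), and convolves each feed with a left/right HRIR pair. It is a sketch of the idea only; the array geometry, HRIR shapes, and random test data are placeholders and do not reproduce the CIPIC measurement layout or the authors' implementation.

import numpy as np
from scipy.signal import fftconvolve

def decode_matrix(speaker_azimuths, order=1):
    """pinv(D) of Eq. (A.5): maps harmonic signals q_m (m = -M..M) to loudspeaker feeds."""
    m = np.arange(-order, order + 1)
    D = np.exp(1j * np.outer(m[::-1], speaker_azimuths))   # rows e^{iM theta_n} .. e^{-iM theta_n}, Eq. (A.3)
    return np.linalg.pinv(D)

def binaural_render(mono, source_azimuth, speaker_azimuths, hrirs_left, hrirs_right, order=1):
    """Binaural signal of a virtual ambisonic loudspeaker array (illustrative sketch).

    mono             : (T,) source signal
    speaker_azimuths : (N,) virtual loudspeaker azimuths in radians
    hrirs_left/right : (N, L) HRIRs from each loudspeaker direction to the left/right ear
    """
    m = np.arange(-order, order + 1)
    q = np.exp(-1j * m * source_azimuth)                    # encoding coefficients, Eq. (3), psi = 1
    feeds = decode_matrix(speaker_azimuths, order) @ q      # Eq. (A.5); real up to numerical residue
    left = sum(fftconvolve(g.real * mono, h) for g, h in zip(feeds, hrirs_left))
    right = sum(fftconvolve(g.real * mono, h) for g, h in zip(feeds, hrirs_right))
    return left, right

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    N, L = 4, 200                                           # quadraphonic array, 200-tap HRIRs (placeholders)
    angles = np.deg2rad([45.0, 135.0, 225.0, 315.0])
    hl, hr = rng.standard_normal((N, L)), rng.standard_normal((N, L))
    left, right = binaural_render(rng.standard_normal(4410), np.deg2rad(30.0), angles, hl, hr)
    print(left.shape, right.shape)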

The ITD estimation used in this study was based on the interaural cross-correlation function [14,15] between a listener's ears. Suppose that h_l(t) is the left HRIR and h_r(t) is the right HRIR. We must determine the value τ that maximises the function

\Phi(\tau) = \frac{ \int_{t_1}^{t_2} h_l(t) \, h_r(t + \tau) \, dt }{ \sqrt{ \int_{t_1}^{t_2} h_l^2(t) \, dt \int_{t_1}^{t_2} h_r^2(t) \, dt } },  (12)

where t_1 and t_2 are the time limits of the integration, depending on the length of the HRIRs. The desired τ is the ITD between h_l(t) and h_r(t). On the other hand, the IID cues are calculated by [16,17]

IID = 10 \log_{10} \frac{ \int_0^T h_l^2(t) \, dt }{ \int_0^T h_r^2(t) \, dt },  (13)

where T is the length of the HRIRs. Since the ITD is especially relevant below around 700 Hz, for a fair comparison the impulse response was low-pass filtered before calculating the ITD. The resulting ITDs, produced by ambisonics and by the pure HRIRs, are presented in Fig. 4a; the sampling angles are the same as those of the HRIRs in the CIPIC database. The figure shows that a dense loudspeaker array can achieve finer ITD resolution. For greater precision, the absolute errors between the HRIR-generated ITDs and the ambisonics-generated ITDs were computed according to [18]

AE_{ITD}(\theta) = | ITD_{HRIR}(\theta) - ITD_{Ambisonics}(\theta) |,  (14)

where θ is the sampling angle, ITD_HRIR is the ITD from the HRIRs, and ITD_Ambisonics is the ITD from one of the six ambisonic decoders. Using an example HRIR dataset, the ITD errors of the four-loudspeaker array and the 50-loudspeaker array are plotted in Fig. 4b as a function of azimuthal angle. The mean absolute errors of the 50-loudspeaker and the four-loudspeaker arrays are 0.0694 ms and 0.0880 ms, respectively.

Fig. 4. (a) Calculated ITDs of the virtual loudspeaker arrays obtained using HRIR 156 from the CIPIC database and (b) their absolute errors.

Additionally, the impulse response was high-pass filtered at 700 Hz before the IID calculation. The IIDs were obtained using Eq. (13), and the absolute IID errors (in dB) between the HRIR-generated IIDs and the ambisonics-generated IIDs were computed by

AE_{IID}(\theta) = | IID_{HRIR}(\theta) - IID_{Ambisonics}(\theta) |,  (15)

where θ is the sampling angle, IID_HRIR is the IID from the HRIRs, and IID_Ambisonics is the IID for one of the ambisonic decoders. Using an example HRIR dataset, the IID cues with the band-limited impulse are shown in Fig. 5a. As shown in the figure, the dense loudspeaker array produced less accurate IID cues. Fig. 5b shows the comparison between the 50-loudspeaker array and the four-loudspeaker array. The mean absolute errors were 6.0446 dB and 4.5747 dB for the 50-loudspeaker and the four-loudspeaker array, respectively.

Fig. 5. (a) Calculated IIDs of the virtual loudspeaker arrays obtained using HRIR 156 from the CIPIC database and (b) their absolute errors.
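A minimal sketch of the cue extraction in Eqs. (12)-(15) is given below, assuming discrete-time HRIRs. The 700 Hz crossover follows the text; the Butterworth filters used to band-limit the responses are an assumption, as the paper does not state which filters were applied.

import numpy as np
from scipy.signal import butter, sosfilt

def itd_seconds(h_left, h_right, fs):
    """ITD as the lag maximising the normalised interaural cross-correlation, Eq. (12)."""
    num = np.correlate(h_left, h_right, mode="full")       # sum_t h_l(t) h_r(t + tau), all lags
    den = np.sqrt(np.sum(h_left ** 2) * np.sum(h_right ** 2))
    lags = np.arange(-(len(h_right) - 1), len(h_left))
    return -lags[np.argmax(num / den)] / fs                 # positive when the right ear lags

def iid_db(h_left, h_right):
    """IID as the energy ratio between the ears, Eq. (13)."""
    return 10.0 * np.log10(np.sum(h_left ** 2) / np.sum(h_right ** 2))

def duplex_cues(h_left, h_right, fs, crossover=700.0):
    """ITD from the low-passed HRIRs and IID from the high-passed HRIRs, as in Section 5.1."""
    lp = butter(4, crossover, btype="lowpass", fs=fs, output="sos")
    hp = butter(4, crossover, btype="highpass", fs=fs, output="sos")
    itd = itd_seconds(sosfilt(lp, h_left), sosfilt(lp, h_right), fs)
    iid = iid_db(sosfilt(hp, h_left), sosfilt(hp, h_right))
    return itd, iid

def absolute_errors(cue_hrir, cue_ambisonic):
    """Absolute cue errors of Eqs. (14) and (15)."""
    return np.abs(np.asarray(cue_hrir) - np.asarray(cue_ambisonic))

if __name__ == "__main__":
    fs = 44100
    rng = np.random.default_rng(2)
    h = rng.standard_normal(256)
    h_right = np.concatenate([np.zeros(20), h])[:256]       # right ear delayed by 20 samples
    print(duplex_cues(h, h_right, fs))                      # ITD close to 20/fs seconds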

It is interesting to investigate whether the differences between the ITDs and IIDs were noticeable. For the just noticeable difference (JND) in ITDs, we virtually placed a sound source at a distance of 1 m from the listener and assumed that the radius of a subject's head was about 10 cm. Fig. 6 shows the non-linear variation of the ITD with the azimuthal angle of the sound source, from the midline to one side of the listener's head (0° to 90°). The variation rate of the ITD is greatest at 0° (the midline), which indicates that human beings localize sounds optimally from this position. On the other hand, this rate is smallest when the source is located on one side of the head, at 90°. In [19], the minimal audible angle was reported as approximately 1° at the midline, increasing to approximately 10° on one side of the head. Psychophysical measures have shown that the JND in the ITD for human beings is 6 μs [20,21].

Fig. 6. Relationship between ITD and azimuthal angle.

For the JND in IIDs, Mills [22] measured the just noticeable interaural difference in the intensity of a dichotic tone pulse as a function of its frequency. The subject listened to a dichotic tone pulse with equal loudness at the left and right ears, and reported whether a second, variable dichotic tone pulse appeared to the right or left of the fixed tone pulse. Several single tone pulses were involved in the experiment: 250, 750, 1000, 1500, 2000, 3000, 4000, 6000, 8000, and 10,000 Hz. The experiment shows that the JND in intensity reaches its peak (about 1 dB) at 1000 Hz, and the minimum of the JND in IID thresholds is about 0.5 dB at 2000 and 10,000 Hz.

We used 45 HRIR datasets from the CIPIC database, and the mean ITD errors were calculated with Eq. (14). Fig. 7 shows the ITD analysis for each decoder. The ITD accuracy of the basic 50-loudspeaker array was similar to that observed in the same array equipped with an equalizer. The split-band decoder ranked third, and the fourth smallest ITD error was exhibited by the max-rE decoder. The ITD accuracy of the four-loudspeaker array with the in-phase technique was the poorest. When we compare the exact mean values, the split-band decoder slightly improves the ITD performance of the four-loudspeaker array, as stated in [6]. Because the JND is 6 μs, both the 50-loudspeaker array and our proposed system produced a noticeable improvement in ITD performance.

Fig. 7. The ITD error of each decoder. The circles denote the means and the vertical lines symbolize 95% confidence intervals. Level of significance: p < 0.001.

Fig. 8 shows the IID analysis of each decoder using Eq. (15). The four-loudspeaker array with the max-rE technique produced the most accurate IID cues, followed by the basic four-loudspeaker array and the split-band decoder. Because the dense loudspeaker arrays tended to exhibit the poorest IID performance, we observed degraded IID accuracy in both the basic 50-loudspeaker array and the equalization decoder. According to [22], the JND threshold in each high-frequency interval (above 700 Hz) is rarely lower than 0.5 dB, which means that although the equalization decoder degraded the IID accuracy of the 50-loudspeaker array, the degradation is unlikely to be noticeable. The smallest IID error is consistent with the statement in [4] that the max-rE decoder aims to correct energy vectors at high frequencies. The similar IID errors of the split-band decoder and the four-loudspeaker array also conform to the description provided in [6].

Fig. 8. The IID error of each decoder. The circles denote the means and the vertical lines symbolize 95% confidence intervals. Level of significance: p < 0.001.

After finishing all objective measurements of the 45 subjects in the CIPIC database, each ambisonic decoder yielded a group of absolute ITD errors and a group of absolute IID errors. Because the same subjects were used to evaluate the different decoders, repeated measures analysis of variance (ANOVA) was used to determine whether there are any statistically significant differences between the mean absolute errors of the different ambisonic decoders. The repeated measures ANOVA tests for differences between related, not independent, population means [23]. The null hypothesis states that the mean absolute errors of the different binaural decoders are equal; the alternative hypothesis states that the mean absolute error is significantly different in one or more ambisonic decoders. A small p value indicates that the experimental result is statistically significant. The p values in Figs. 7 and 8 are both smaller than 0.001. Post-hoc tests were conducted to determine the significance levels between the decoders, as shown in Tables 1 and 2. Table 1 shows that the ITD accuracy of the equalization decoding outperformed the other decoders significantly. Table 2 shows that the equalizer did not considerably degrade the IID accuracy of the basic 50-loudspeaker array.

Table 1. Post-hoc tests for multiple ITD accuracy comparisons. Significance levels: * p < 0.05; ** p < 0.01; *** p < 0.001; n.s. = not significant.

              | 4-speaker | In-phase | Max-rE | 50-speaker | Split-band | Equalization
4-speaker     | –         | ***      | n.s.   | ***        | ***        | ***
In-phase      | ***       | –        | ***    | ***        | ***        | ***
Max-rE        | n.s.      | ***      | –      | ***        | ***        | ***
50-speaker    | ***       | ***      | ***    | –          | ***        | ***
Split-band    | ***       | ***      | ***    | ***        | –          | ***
Equalization  | ***       | ***      | ***    | ***        | ***        | –

Table 2. Post-hoc tests for multiple IID accuracy comparisons. Significance levels: * p < 0.05; ** p < 0.01; *** p < 0.001.

              | 4-speaker | In-phase | Max-rE | 50-speaker | Split-band | Equalization
4-speaker     | –         | ***      | ***    | ***        | ***        | ***
In-phase      | ***       | –        | ***    | ***        | ***        | ***
Max-rE        | ***       | ***      | –      | ***        | ***        | ***
50-speaker    | ***       | ***      | ***    | –          | ***        | *
Split-band    | ***       | ***      | ***    | ***        | –          | ***
Equalization  | ***       | ***      | ***    | *          | ***        | –
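The statistical analysis can be reproduced along the following lines, assuming the per-subject mean absolute errors are stored in a long-format table. statsmodels' AnovaRM is used for the repeated measures ANOVA, and Bonferroni-corrected paired t-tests are shown as one common post-hoc choice; the paper does not state which post-hoc procedure was actually applied, so this is illustrative only.

from itertools import combinations

import pandas as pd
from scipy.stats import ttest_rel
from statsmodels.stats.anova import AnovaRM

def repeated_measures_analysis(df):
    """df columns: 'subject', 'decoder', 'abs_error' (one mean absolute error per subject and decoder)."""
    anova = AnovaRM(df, depvar="abs_error", subject="subject", within=["decoder"]).fit()
    print(anova)

    # Post-hoc: pairwise paired t-tests with a Bonferroni correction (an assumption,
    # not necessarily the procedure used for Tables 1-5).
    decoders = sorted(df["decoder"].unique())
    n_pairs = len(decoders) * (len(decoders) - 1) // 2
    for a, b in combinations(decoders, 2):
        x = df[df["decoder"] == a].sort_values("subject")["abs_error"].to_numpy()
        y = df[df["decoder"] == b].sort_values("subject")["abs_error"].to_numpy()
        stat, p = ttest_rel(x, y)
        print(f"{a} vs {b}: corrected p = {min(1.0, p * n_pairs):.4f}")

Here the table would hold 45 subjects by six decoders for the ITD errors, and likewise for the IID errors and, later, for the subjective difference grades.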

In the objective listening test, we also evaluated the spectral distortion (SD) by the mean square error [24,25] in the frequency domain:

SD = \sqrt{ \frac{1}{K} \sum_{k=1}^{K} \left( 20 \log_{10} \frac{ |H(f_k)| }{ |\hat{H}(f_k)| } \right)^2 },  (16)

where |\hat{H}(f_k)| is the ambisonics-generated frequency magnitude response and |H(f_k)| is the head-related transfer function (HRTF) at the same position. f_k is the frequency in Hz and K is 100, so f_1 = 0 Hz and f_K = 22.05 kHz. The reference signal is the HRTF and the signal under test is the ambisonic reproduction, so the levels of timbre distortion can be compared. Although a dense loudspeaker array effectively enhanced localization, it caused spectrum impairment, as mentioned in the previous section. The mean of the spectral distortions at the listener's left and right ears is shown in Fig. 9a for different measuring positions, and the average spectral distortion over all measuring positions is shown in Fig. 9b. Comparing the averages of the 50-loudspeaker decoder and the equalization decoder in Fig. 9b, the equalizer decreased the spectral distortion of the 50-loudspeaker array by more than 4 dB on average; the decreased value is significantly closer to the spectral distortion of the square array. A post-hoc test was also conducted to determine the significance levels between the decoders, as shown in Table 3.

Fig. 9. (a) Spectral distortions of the different decoders at varying positions. (b) The mean spectral distortions over all measuring positions. The circles are the means and the vertical lines are 95% confidence intervals. Level of significance: p < 0.001.

Table 3. Post-hoc tests for spectral distortion comparisons. Significance levels: * p < 0.05; ** p < 0.01; *** p < 0.001.

              | 4-speaker | 50-speaker | Equalization
4-speaker     | –         | ***        | ***
50-speaker    | ***       | –          | ***
Equalization  | ***       | ***        | –
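The spectral distortion of Eq. (16) can be computed as in the short sketch below. The text fixes K = 100 frequency points up to 22.05 kHz; obtaining the magnitude responses from impulse responses with an FFT, and the small numerical floor, are implementation assumptions rather than details taken from the paper.

import numpy as np

def spectral_distortion_db(h_ref, h_test, n_points=100, eps=1e-12):
    """Spectral distortion of Eq. (16) between a reference HRTF and an ambisonics-generated response.

    h_ref, h_test : impulse responses (HRIR and ambisonics-rendered response at the same position)
    n_points      : K frequency points from 0 Hz up to the Nyquist frequency
    eps           : small floor to avoid a log of zero (implementation detail)
    """
    n_fft = 2 * (n_points - 1)                        # an rfft of this length yields exactly K bins
    H_ref = np.abs(np.fft.rfft(h_ref, n_fft)) + eps
    H_test = np.abs(np.fft.rfft(h_test, n_fft)) + eps
    ratio_db = 20.0 * np.log10(H_ref / H_test)
    return np.sqrt(np.mean(ratio_db ** 2))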

5.2. Timbral and spatial fidelity in subjective listening test

Because an HRIR dataset varies from person to person, the approach to selecting the best fit is an important issue: if the HRTF dataset does not match the real HRTF dataset of the user, poor localization occurs [26]. The first stage of the subjective listening test therefore aimed at choosing the best-fitting HRIR dataset for a user from a database composed of 18 randomly selected HRIR datasets from the CIPIC Laboratory [11]. Listeners were asked to wear SENNHEISER PX 100-II headphones during the experiments. According to ITU-R BS.708 [27], the tolerance mask for the frequency response of studio monitor headphones should be flat within the limits specified in Fig. 10: the magnitude level at 1 kHz is normally placed on the 0 dB line, the magnitude level from 500 Hz to 4 kHz should be within ±1.5 dB, and the maximum tolerances at 125 Hz and 16 kHz are ±2 dB and ±4 dB, respectively. We used SonarWorks software to correct the headphone frequency response from 2 Hz to 22.05 kHz. Fig. 11 shows the original frequency response of the SENNHEISER PX 100-II and the flat response that SonarWorks intended to create.

Fig. 10. Tolerance mask from [27] for the diffuse-field frequency response of studio monitoring headphones.

Fig. 11. Screen grab of the response curve view in SonarWorks. The blue curve is the frequency response of the SENNHEISER PX 100-II and the red curve is the target response curve. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Each listener listened to 36 audio sequences generated by the 18 sets of HRIRs. Each dataset produced two audio sequences corresponding to the following criteria: up-down discrimination and front-back discrimination. All HRIR datasets were convolved with the same piano music. In the first question, we fixed the azimuthal angle and changed the elevation to test up-down discrimination; listeners were asked to state whether or not they could discriminate the sources at the high and low elevations. In the second question, the elevation was fixed and different azimuthal angles were employed, and we asked the listeners to assess front-back confusion. The categorization was estimated using the continuous grade scale shown in Fig. 12. We calculated the average score for each HRIR dataset, and the dataset with the best score was used to generate the sound sources in the listening test for the ambisonic decoders.

Fig. 12. The continuous grade scale used in the first listening test: "I can discriminate the source at the high and low elevation" and "I can discriminate the source in front from the source in the back".

After calibration, the selected HRIR dataset was used to develop the binaural ambisonic decoders. To avoid listener fatigue, only three types of decoders were involved in the subjective listening test: a four-loudspeaker decoder, a 50-loudspeaker decoder, and a 50-loudspeaker decoder equipped with an equalizer. The subjective listening tests contained two sections. The first section aimed to find whether there is any timbre difference between the decoders, so we fixed the position of the sound source. Each sound source was virtually placed in front of the subject, because listeners' best localization performance is for loudspeakers located directly in front of them [28]. The sounds included wide-frequency piano, low-frequency cello, mid-frequency vocals, and high-frequency flute. In the second section, to reduce distraction to the user, the stimulus was changed to broadband white noise, which has been widely used in the literature for sound localization [29,30]. The white noise, originating from −45°, −175°, 140°, and 35° as shown in Fig. 13, was produced by the different binaural decoders, and we asked the listeners to compare whether there is any localization difference between the decoders. Although the listener was equipped with a head tracker applied to the rotation of the virtual auditory space during the subjective test, head movements were discouraged in order to fix the test angles of the sound source.

Fig. 13. Sound source coming from −45°, −175°, 140°, and 35°.

The "double-blind triple-stimulus with hidden reference" method described in [31] was applied for the assessment in the subjective listening test. Specifically, we consider three stimuli, S1, S2, and S3. While S1 is always the known reference, the hidden reference and the stimulus under test are randomly assigned to S2 and S3. Subjects are asked to rate the impairments of S2 compared to S1 and of S3 with reference to S1. Finally, the subjective difference grade (SDG) is defined as

SDG = G_s - G_r,  (17)

where G_s is the grade of the stimulus under test and G_r is the grade of the hidden reference. Both grades are quasi-continuous and determined according to the five-grade impairment scales in terms of timbral fidelity and localization accuracy, as shown in Fig. 14. Participants were asked to assess the difference between an HRIR-generated sound as the reference and the ambisonics-generated sound as the signal under test.

Fig. 14. Assessment grades used in the listening test questionnaires to rate the audio quality in terms of timbral fidelity and localization: 5 = Imperceptible, 4 = Perceptible but not annoying, 3 = Slightly annoying, 2 = Annoying, 1 = Very annoying.

The mean subjective difference grades and the 95% confidence intervals for timbral fidelity and localization accuracy are summarized in Figs. 15 and 16, respectively. The more negative the subjective difference grade, the worse the audio quality. Repeated measures ANOVA was again applied to investigate, by means of p values, the significance of the different settings for the subjective difference grades.

The experimental results in Fig. 15 show that the square array is more likely to present the best audio quality at high frequencies, with performance declining as the presented sound moves toward the low frequencies. The hypothesis that the equalization technique compensates for the high-frequency distortion in the 50-loudspeaker array was validated by the subjective difference grades displayed in Fig. 15d. However, the timbral fidelity produced by the equalization decoder was not as good as that generated by the square array, which shows the limitation of the equalizer. The large p value in Fig. 15c indicates that there is little difference between the three decoders in terms of mid-frequency timbre. The p values in Fig. 15a, b, and d are all smaller than 0.05, indicating statistical significance. Overall, the equalization decoding was ranked in the middle, presenting robust quality from low to high frequencies.

Fig. 15. Timbral fidelity subjective difference grades for (a) wide-frequency piano, (b) low-frequency cello, (c) mid-frequency vocal, and (d) high-frequency flute sounds in the different ambisonic decoders. The circles denote the means and the vertical lines symbolize 95% confidence intervals.

The experimental results in Fig. 16 show that the square array is more likely to exhibit good localization performance when the sound source is placed in the direction of a loudspeaker of the square array, as shown in Fig. 16a. If the sound source is in the lateral regions of the square array, the 50-loudspeaker array tends to achieve better localization accuracy than the square array, as shown in Fig. 16b and d. Compared with the results for sound sources in front of the listener, the decoders exhibit more similar localization performance when the sound sources are behind. The p values in Fig. 16a, b, and d are all smaller than 0.05, indicating statistical significance. The localization performance of the equalization decoder is similar to that of the 50-loudspeaker decoder.

Fig. 16. Localization subjective difference grades for white noise originating from (a) −45°, (b) −175°, (c) 140°, and (d) 35° in the different ambisonic decoders. The circles denote the means and the vertical lines symbolize 95% confidence intervals.

To compare the results of specific decoders, post-hoc tests were performed. The comparisons of timbral fidelity and localization accuracy are shown in Table 4 and Table 5, respectively. The square array outperforms the 50-loudspeaker array in timbral fidelity when using the wide-frequency piano music and the high-frequency flute music; the 50-loudspeaker array, on the other hand, achieved better timbral fidelity when playing the low-frequency cello music. The improved timbral fidelity obtained with the equalization technique is supported by the small p values in the last column of Table 4 for the piano and flute sounds. However, for a mid-frequency sound source, such as the human voice, the timbral fidelity difference becomes insignificant.

Table 4. Post-hoc tests for timbral fidelity in the subjective listening test. Significance levels: * p < 0.05; ** p < 0.01; *** p < 0.001; n.s. = not significant.

       | 50-speaker versus 4-speaker | 4-speaker versus equalization | Equalization versus 50-speaker
Piano  | ***                         | n.s.                          | *
Cello  | ***                         | **                            | *
Vocal  | *                           | n.s.                          | n.s.
Flute  | ***                         | *                             | *

The p values resulting from the post-hoc tests in the localization assessment are shown in Table 5. The hypothesis that the localization of the square array is especially accurate in the direction of one of its loudspeakers is validated by the p values when the source is at −45°. On the other hand, the p values at −175° and 35° indicate that the 50-loudspeaker array and the equalization decoder can both locate the lateral image more precisely than the square array. No significant difference is observed when the sound source is at 140°, an angle that is close to, but not in, the direction of a loudspeaker. In most cases, the equalization decoder showed accuracy similar to that of the dense circular array, because the localization difference between the 50-loudspeaker array and the equalization decoder is not significant.

Table 5. Post-hoc tests for localization in the subjective listening test. Significance levels: * p < 0.05; ** p < 0.01; *** p < 0.001; n.s. = not significant.

       | 50-speaker versus 4-speaker | 4-speaker versus equalization | Equalization versus 50-speaker
−45°   | ***                         | *                             | **
−175°  | *                           | *                             | n.s.
140°   | n.s.                        | n.s.                          | n.s.
35°    | n.s.                        | ***                           | **

6. Conclusion and future work

In this paper, we proposed an equalization decoding method based on equalizing the RMS power in each one-third-octave frequency band and compensating for the low-pass-filtered components in a high-density loudspeaker array. In much of the literature concerning ambisonic decoders, optimization is based on the magnitudes of the velocity vectors and the energy vectors, which indicate the direction of localization at low and at high frequencies, respectively. However, these vectors are calculated only at the center point of the loudspeaker array, not at the ear positions. In this study, the ITD and IID cues were applied for the localization evaluation. The objective listening tests showed that the proposed decoding method achieved ITD accuracy comparable to that of a high-density loudspeaker array but did not suffer severe spectrum impairment at high frequencies. Additionally, the results of the subjective listening tests were very encouraging. To the best of our knowledge, the split-band decoder [6] should be preferred when the loudspeaker feed is high-pitched music, because most of the frequency components then originate from a small loudspeaker array. The equalization decoder, on the other hand, is a modification of a dense loudspeaker array; therefore, listeners who are interested in bass performance should select the equalization solution.

Higher-order ambisonic systems and systems with a multi-channel microphone array [32] will be investigated further. When using equalization decoding, we need only adjust the gain of each one-third-octave filter in the filterbank. Therefore, higher-order extensions are easy to compute and can be integrated into mobile playback devices to provide a spatial auditory and visual space for virtual reality applications [33].

Acknowledgement

This work was partially supported by the Ministry of Science and Technology, Taiwan (Grant No. MOST 106-2221-E-305-009).

Appendix A. Derivation of Eq. (5)

All of the q_m terms in Eq. (4) can be expressed in a matrix as

q = \begin{bmatrix} q_{-M} \\ \vdots \\ q_{M} \end{bmatrix}.  (A.1)

When we use an N-loudspeaker array to model the sound field,

q = D \times p,  (A.2)

where D represents the cylindrical harmonics,

D = \begin{bmatrix} e^{i M \theta_1} & \cdots & e^{i M \theta_N} \\ \vdots & \ddots & \vdots \\ e^{-i M \theta_1} & \cdots & e^{-i M \theta_N} \end{bmatrix},  (A.3)

and p is the matrix of loudspeaker feeds,

p = \begin{bmatrix} A_1 \\ \vdots \\ A_N \end{bmatrix}.  (A.4)

By moving D to the other side of the equation, we can design the loudspeaker feeds:

p = \mathrm{pinv}(D) \times q,  (A.5)

where pinv(D) is the pseudo-inverse of D. If all loudspeakers are placed uniformly on a circle,

\mathrm{pinv}(D) = \frac{1}{N} D^H = \frac{1}{N} \begin{bmatrix} e^{-i M \theta_1} & \cdots & e^{i M \theta_1} \\ \vdots & \ddots & \vdots \\ e^{-i M \theta_N} & \cdots & e^{i M \theta_N} \end{bmatrix} = \frac{1}{N} \begin{bmatrix} e^{-i M \frac{2\pi}{N}} & \cdots & e^{i M \frac{2\pi}{N}} \\ \vdots & \ddots & \vdots \\ e^{-i M \frac{2N\pi}{N}} & \cdots & e^{i M \frac{2N\pi}{N}} \end{bmatrix},  (A.6)

where D^H denotes the conjugate transpose of D. The nth loudspeaker feed is

A_n = \frac{1}{N} \sum_{m=-M}^{M} e^{-i m \frac{2 n \pi}{N}} q_{-m} = \frac{1}{N} \sum_{m=-M}^{M} e^{-i m \frac{2 n \pi}{N}} \psi e^{i m \theta_\psi} = \frac{\psi}{N} \sum_{m=-M}^{M} e^{-i m \left(\frac{2 n \pi}{N} - \theta_\psi\right)}.  (A.7)

Therefore, the superposition of the N loudspeakers is

p(kr, \theta)_S = \sum_{n=1}^{N} e^{i k r \cos\left(\frac{2 n \pi}{N} - \theta\right)} A_n = \frac{\psi}{N} \sum_{n=1}^{N} e^{i k r \cos\left(\frac{2 n \pi}{N} - \theta\right)} \sum_{m=-M}^{M} e^{-i m \left(\frac{2 n \pi}{N} - \theta_\psi\right)}.  (A.8)
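A quick numerical check of Eq. (A.6): for loudspeakers placed uniformly on a circle, the pseudo-inverse of D reduces to D^H/N. The order, loudspeaker count, and test below are arbitrary choices made only for this check.

import numpy as np

M, N = 2, 8                                    # ambisonic order and loudspeaker count (arbitrary)
theta = 2.0 * np.pi * np.arange(1, N + 1) / N  # uniform circular layout, theta_n = 2 n pi / N
m = np.arange(M, -M - 1, -1)                   # harmonic indices ordered +M .. -M, as in Eq. (A.3)

D = np.exp(1j * np.outer(m, theta))            # D of Eq. (A.3), shape (2M+1, N)
lhs = np.linalg.pinv(D)                        # pinv(D) of Eq. (A.5)
rhs = D.conj().T / N                           # D^H / N of Eq. (A.6)

print(np.allclose(lhs, rhs))                   # True: the two decoders coincide for a uniform array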

Appendix B. Derivation of Eq. (7)

p(kr, \theta)_S = \frac{\psi}{N} \sum_{n=1}^{N} e^{i k r \cos\left(\frac{2 n \pi}{N} - \theta\right)} \sum_{m=-M}^{M} e^{-i m \left(\frac{2 n \pi}{N} - \theta_\psi\right)} = \frac{\psi}{N} \sum_{m=-M}^{M} e^{i m \theta_\psi} \sum_{n=1}^{N} \sum_{t=-\infty}^{\infty} i^t J_t(kr) \, e^{-i t \theta} \, e^{-i (m - t) \frac{2 n \pi}{N}}.  (B.1)

If -(m - t) is a multiple of N, \sum_{n=1}^{N} e^{-i (m - t) \frac{2 n \pi}{N}} is N. Otherwise, \sum_{n=1}^{N} e^{-i (m - t) \frac{2 n \pi}{N}} is 0. Therefore,

p(kr, \theta)_S = \psi \sum_{m=-M}^{M} \sum_{a=-\infty}^{\infty} i^{aN+m} J_{aN+m}(kr) \, e^{-i (aN+m) \theta} \, e^{i m \theta_\psi} = \psi \sum_{m=-M}^{M} e^{i m (\theta_\psi - \theta)} \sum_{a=-\infty}^{\infty} i^{aN+m} J_{aN+m}(kr) \, e^{-i a N \theta}.  (B.2)

Appendix C. Derivation of Eq. (8)

p(kr, \theta)_M = \sum_{m=-M}^{M} i^m q_m e^{i m \theta} J_m(kr) = \sum_{m=-M}^{M} i^m \psi e^{-i m \theta_\psi} e^{i m \theta} J_m(kr) = \sum_{m=-M}^{M} i^m \psi e^{-i m (\theta_\psi - \theta)} J_m(kr) = \sum_{m=-M}^{M} i^{-m} \psi e^{i m (\theta_\psi - \theta)} J_{-m}(kr).  (C.1)

Because m is an integer, the Bessel function of the first kind has the following property:

J_{-m}(kr) = (-1)^m J_m(kr).  (C.2)

Then,

p(kr, \theta)_M = \sum_{m=-M}^{M} i^{-m} \psi e^{i m (\theta_\psi - \theta)} (-1)^m J_m(kr) = \sum_{m=-M}^{M} \left(\frac{-1}{i}\right)^m \psi e^{i m (\theta_\psi - \theta)} J_m(kr) = \psi \sum_{m=-M}^{M} e^{i m (\theta_\psi - \theta)} i^m J_m(kr).  (C.3)
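As a numerical sanity check of the truncated cylindrical-harmonic expansion in Eqs. (2), (4), and (8), the snippet below compares the plane wave of Eq. (1) with its order-M approximation using scipy.special.jv; the chosen kr, angles, and truncation orders are arbitrary.

import numpy as np
from scipy.special import jv

def plane_wave(kr, theta, theta_src):
    """Eq. (1) with psi = 1."""
    return np.exp(1j * kr * np.cos(theta - theta_src))

def truncated_expansion(kr, theta, theta_src, M):
    """Eq. (8): psi * sum_{m=-M}^{M} e^{i m (theta_src - theta)} i^m J_m(kr)."""
    m = np.arange(-M, M + 1)
    return np.sum(np.exp(1j * m * (theta_src - theta)) * (1j ** m) * jv(m, kr))

if __name__ == "__main__":
    kr, theta, theta_src = 3.7, np.deg2rad(90.0), np.deg2rad(30.0)
    exact = plane_wave(kr, theta, theta_src)
    for M in (1, 2, 4, 8, 16):
        approx = truncated_expansion(kr, theta, theta_src, M)
        print(f"M = {M:2d}  |error| = {abs(exact - approx):.2e}")   # shrinks once M exceeds kr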

Appendix D. Supplementary material

Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.apacoust.2018.04.027.

References

[1] Gerzon MA. Periphony: with-height sound reproduction. J Audio Eng Soc 1973;21:2–10.
[2] Malham DG. Experience with large area 3D ambisonic sound systems. In: Proc Institute of Acoustics; 1992. p. 212–5.
[3] Monro G. In-phase corrections for ambisonics. In: Proc International Computer Music Conference; 2000.
[4] Daniel J, Rault JB, Polack JD. Ambisonics encoding of other audio formats for multiple listening conditions. In: 105th AES Convention, San Francisco, CA; September 1998.
[5] Gerzon MA. Multidirectional sound reproduction systems. U.S. Patent 3997725; December 1976.
[6] Yao SN, Collins T, Jančovič P. Timbral and spatial fidelity improvement in ambisonics. Appl Acoust 2015;93:1–8.
[7] Collins T. Binaural ambisonic decoding with enhanced lateral localization. In: 134th AES Convention, Rome, Italy; May 2013.
[8] Vanderkooy J, Lipshitz S. Anomalies of wavefront reconstruction in stereo and surround sound reproduction. In: 83rd AES Convention; October 1987.
[9] Poletti MA. A unified theory of horizontal holographic sound systems. J Audio Eng Soc 2000;48(12):1155–82.
[10] Solvang A. Spectral impairment for two-dimensional higher order ambisonics. J Audio Eng Soc 2008;56(4):267–79.
[11] Algazi VR, Duda RO, Thompson DM, Avendano C. The CIPIC HRTF database. In: Proc IEEE Workshop on Applications of Signal Processing to Audio and Acoustics; 2001. p. 99–102.
[12] Rayleigh L. On our perception of sound direction. Philosophical Magazine 1907;13.
[13] Okamoto T, Enomoto S, Nishimura R. Least squares approach in wavenumber domain for sound field recording and reproduction using multiple parallel linear arrays. Appl Acoust 2014;86:95–103.
[14] D'Orazio D, Guidorzi P, Garai M. A Matlab toolbox for the analysis of Ando's factors. In: 126th AES Convention, Munich, Germany; May 2009.
[15] Sato S. MATLAB program for calculating the parameters of the autocorrelation and interaural cross-correlation functions based on Ando's auditory-brain model. In: 137th AES Convention, Los Angeles, CA; October 2014.
[16] Satongar D, Dunn C, Lam Y, Li F. Localisation performance of higher-order ambisonics for off-centre listening. White Paper WHP254; October 2013.
[17] Gaik W. Combined evaluation of interaural time and intensity differences: psychoacoustic results and computer modelling. J Acoust Soc Am 1993;94(1):98–110.
[18] Estrella J. On the extraction of interaural time differences from binaural room impulse responses. PhD thesis, Tech Univ Berlin, Germany; 2010. p. 14.
[19] Smith RC, Price SR. Modelling of human low frequency sound localization acuity demonstrates dominance of spatial variation of interaural time difference and suggests uniform just-noticeable differences in interaural time difference. PLoS ONE 2014;9.
[20] Carlile S. The physical basis and psychophysical basis of sound localization. In: Virtual auditory space: generation and applications; 1996. p. 55 [Chapter 2].
[21] Conn PM. Neuroscience in medicine. Humana Press; 2003. p. 553 [Chapter 5.4].
[22] Mills AW. Lateralization of high-frequency tones. J Acoust Soc Am 1960;32:132.
[23] Gueorguieva R, Krystal JH. Move over ANOVA: progress in analyzing repeated-measures data and its reflection in papers published in the Archives of General Psychiatry. Arch Gen Psychiatry 2004;61(3):310–7.
[24] Hu H, Zhou L, Ma H, Wu Z. HRTF personalization based on artificial neural network in individual virtual auditory space. Appl Acoust 2008;69(2):163–72.
[25] Nishino T, Nakai Y, Takeda K, Itakura F. Estimating head related transfer function using multiple regression analysis. IEICE Trans A 2001;J84-A:260–8.
[26] Yao SN, Chen LJ. HRTF adjustments with audio quality assessments. Arch Acoust 2013;38(1):55–62.
[27] Rec. ITU-R BS.708. Determination of the electro-acoustical properties of studio monitor headphones. International Telecommunication Union; 1990.
[28] Yost WA, Loiselle L, Dorman M, Burns J, Brown CA. Sound source localization of filtered noises by listeners with normal hearing: a statistical analysis. J Acoust Soc Am 2013;133(5):2876–82.
[29] Hebrank J, Wright D. Spectral cues used in the localisation of sound sources on the median plane. J Acoust Soc Am 1974;56(6):1829–34.
[30] Rakerd B, Hartmann WM. Localization of sound in rooms. V. Binaural coherence and human sensitivity to interaural time differences in noise. J Acoust Soc Am 2010;128(5):3052–63.
[31] Rec. ITU-R BS.1116-1. Methods for the subjective assessment of small impairments in audio systems including multichannel sound systems. International Telecommunication Union, Geneva; 1997.
[32] Martellotta F. On the use of microphone arrays to visualize spatial sound field information. Appl Acoust 2013;74(8):987–1000.
[33] Yao SN. Headphone-based immersive audio for virtual reality headsets. IEEE Trans Consum Electron 2017;63(3):300–8.
