On the use of microphone arrays to visualize spatial sound field information

Applied Acoustics 74 (2013) 987–1000
Contents lists available at SciVerse ScienceDirect
journal homepage: www.elsevier.com/locate/apacoust

Francesco Martellotta *
Dipartimento di Scienze dell'Ingegneria Civile e dell'Architettura, Politecnico di Bari, via Orabona 4, I-70125 Bari, Italy


Article history: Received 3 September 2012 Received in revised form 12 November 2012 Accepted 15 February 2013

Keywords: Microphone arrays Sound field visualization Room acoustics

Abstract: Microphone arrays represent today a state-of-the-art solution to many acoustic problems. In architectural acoustics, for example, one of the most interesting applications is the possibility to analyse the directional information associated with a given reflection. Ambisonics microphones can provide similar information based on zeroth- and first-order spherical harmonic decomposition, but larger microphone arrays allow the determination of higher-order components, providing even better accuracy. In this case, directional information may be obtained through beamforming techniques that, although potentially more accurate and capable of resolving simultaneous reflections, are computationally heavier and provide a "discrete" sampling of the sound field. The paper compares the localization accuracy of a 32-channel microphone array by processing its output using a simple Ambisonics decomposition and a spatial sampling carried out using 32 "virtual" third-order hyper-cardioid microphones. In addition, a comparison with conventional Ambisonics microphones is provided in order to point out possible differences. Results show that, when single reflections are involved and the sound field is highly polarized, the Ambisonics decomposition given by the microphone array gives good accuracy over the whole spectrum, while conventional Ambisonics microphones show less stable results and greater variations as a function of frequency. Spatial sampling is intrinsically less accurate but allows a clearer resolution of simultaneous reflections. © 2013 Elsevier Ltd. All rights reserved.

1. Introduction

Sound source identification and localization has been one of the most interesting outcomes of early sound intensity measurements [1,2]. The most widespread applications are based on 1-D transducers (requiring two spaced pressure microphones), so that spatial information about sound sources can be grasped only by properly scanning the space. However, there have been applications involving 3-D transducers (using six pressure microphones). In this way the three Cartesian components of the particle velocity can be determined simultaneously and, hence, the direction of arrival of a given sound. The most recent developments in this field are represented by pressure–velocity sensors [3], which show quite interesting performance. A viable alternative, although originally not well understood in its potential, was proposed by Gerzon [4,5], who first introduced the idea of an Ambisonics decomposition of the sound field. As will be described in detail in the next section, Ambisonics is based on measuring the sound field by means of four nearly coincident microphones arranged on a tetrahedral volume. Such a set of signals (named "A-format"), after proper processing, provides an omnidirectional signal (pressure information) and three figure-of-eight signals oriented along the Cartesian axes (particle velocity information), known together as "B-format". This technique ideally corresponds to decomposing the sound field according to the zeroth- and first-order spherical harmonics, from which directional information can be grasped, and reproduction over several spatial loudspeaker arrays is made possible. This method, originally patented in 1975 by Gerzon and Craven [5], is now implemented in many commercially available Ambisonics microphones, offering a wide spectrum of choices. One of the most interesting aspects of this technique is the possibility to determine the directional characteristics of the sound field by means of just four channels. However, the intrinsic limits of the hardware (microphones and sound processing devices) may sometimes affect the accuracy of the localization [6]. In addition, when used to play back the recorded sound field this system has several disadvantages [7] due to the low order of the spherical decomposition. In fact, the reproduced signals tend to be highly coherent, resulting in colouring and distortion of the spatial image, preventing a satisfactory subjective listening experience. In order to achieve improved realism in sound reproduction, higher-order Ambisonics components need to be determined [8], but this requires more microphones than the four used for first-order Ambisonics.

* Tel.: +39 080 5963631. E-mail address: [email protected]
0003-682X/$ - see front matter © 2013 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.apacoust.2013.02.004


However, after many years during which large microphone arrays were developed only by isolated research groups [9–13], mostly focussing on the most appropriate combination of hardware setup and mathematical processing to optimize a reliable retrieval of spatial information [14–18], a few commercial solutions are now available. This is opening new perspectives also for end-users not interested in (or not able to understand) the complex mathematical details that lie behind microphone array theory. In fact, most commercial solutions include a basic set of tools to process the signals and interpret the results according to the desired purpose of the measurement. Different fields of acoustics, from industrial and machinery noise identification, to forensic applications, architectural acoustics, and virtual sound field reproduction, are now interested in exploring the extremely wide potential offered by such microphones. The present paper aims at discussing the performance of such types of microphones in the specific field of sound field analysis for architectural acoustic purposes, comparing different "low level" ways to use the microphone outputs and, at the same time, comparing the accuracy of the spatial information resulting from microphone arrays with that resulting from traditional 4-channel microphones.

2. Overview of theoretical background

2.1. Ambisonics decomposition

Microphone arrays based on Ambisonics decomposition of the sound field may provide two different outputs. The raw signal corresponding to the four capsules is the so-called A-format, while B-format corresponds to the zeroth- and first-order components of a spherical decomposition of the sound field. Traditionally such components are identified as W, X, Y, and Z. The first one (0th-order component) represents the omnidirectional response of the microphone at the centre of the array, and its value is traditionally normalized by dividing the amplitude by √2 in order to have four signals with comparable amplitude. In the following discussion only the "ideal" W signal will be considered, assuming therefore that any "normalization" is immediately compensated before any subsequent analysis. The latter three (1st-order components) correspond to figure-of-eight (or, better, 1st-order dipole) responses of microphones oriented along the three Cartesian axes, so that they provide the sound pressure multiplied by the direction cosine of the sound along that axis. In other words, considering that particle velocity (u) is a vector quantity oriented along the direction of sound propagation, and that (at least for plane waves) u is proportional to sound pressure (p) through the characteristic impedance Z0 (u = p/Z0), X, Y, and Z may also be taken as the Cartesian components of the particle velocity [19] and used to determine sound intensity properties [20]. All this stated, considering the relationship between sound intensity, sound pressure, and particle velocity, the following equations may be written for the instantaneous intensity components (time and position dependencies are omitted for brevity):

$$ I_x = p\,u_x = \frac{w\,x}{Z_0}, \qquad I_y = p\,u_y = \frac{w\,y}{Z_0}, \qquad I_z = p\,u_z = \frac{w\,z}{Z_0} \qquad (1) $$

where w, x, y, and z are the output signals of the B-format microphone. Hence the norm of the intensity vector is:

$$ |\mathbf{I}| = \sqrt{I_x^2 + I_y^2 + I_z^2} = \frac{1}{Z_0}\sqrt{(w\,x)^2 + (w\,y)^2 + (w\,z)^2} \qquad (2) $$

In addition, considering that the energy density in a reverberant field [2] is given by:

$$ E = \frac{1}{2}\rho_0\left(\frac{p^2}{Z_0^2} + u^2\right) = \frac{\rho_0}{2Z_0^2}\left(w^2 + x^2 + y^2 + z^2\right) \qquad (3) $$

it is now possible to calculate the "degree of diffusion" (ψ) of the sound field as one minus the ratio between the norm of the time-averaged intensity vector (given by the vector sum of the instantaneous values), divided by the speed of sound c, and the time-averaged energy density, so that:

$$ \psi = 1 - \frac{\left|\left\langle \mathbf{I}/c \right\rangle\right|}{\langle E \rangle} \qquad (4) $$

In this way ψ may vary between 0 and 1, where 0 corresponds to a fully polarized sound field (resulting from direct sound or strong reflections), and 1 corresponds to a fully diffuse sound field (resulting from an intensity vector that, being the sum of contributions from uniformly distributed directions, is very small compared to the fluctuating pressure). Other approaches have been proposed to calculate this parameter [21], but a discussion of this topic is beyond the scope of this paper. With reference to the B-format components, and replacing the average over a given time window with a discrete summation, Eq. (4) yields:

$$ \psi = 1 - \frac{2\sqrt{\left(\sum w\,x\right)^2 + \left(\sum w\,y\right)^2 + \left(\sum w\,z\right)^2}}{\sum \left(w^2 + x^2 + y^2 + z^2\right)} \qquad (5) $$

A further development of this way of interpreting the data is that, given the Ix, Iy, and Iz components of the sound intensity, the azimuth (θ) and elevation (φ) of the direction of arrival of the sound at a given time may be easily calculated from the following equations:

$$ \theta = \arctan\left(\frac{I_y}{I_x}\right), \qquad \phi = \arctan\left(\frac{I_z}{\sqrt{I_x^2 + I_y^2}}\right) \qquad (6) $$
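As a concrete illustration of Eqs. (1)–(6), the sketch below computes the direction of arrival and the degree of diffusion from B-format signals, and bins energy onto an equirectangular grid of the kind discussed later. The paper's own scripts were written in MATLAB; this NumPy re-implementation is only illustrative, and the value of Z0 and the grid resolution are assumptions.

```python
import numpy as np

Z0 = 413.3  # characteristic impedance of air [Pa s/m] at ~20 degC (assumed value)

def intensity_components(w, x, y, z):
    """Instantaneous intensity components from B-format signals, Eq. (1)."""
    return w * x / Z0, w * y / Z0, w * z / Z0

def direction_of_arrival(w, x, y, z):
    """Azimuth and elevation [deg] of the time-averaged intensity, Eq. (6)."""
    Ix, Iy, Iz = (np.sum(I) for I in intensity_components(w, x, y, z))
    azimuth = np.degrees(np.arctan2(Iy, Ix))
    elevation = np.degrees(np.arctan2(Iz, np.hypot(Ix, Iy)))
    return azimuth, elevation

def degree_of_diffusion(w, x, y, z):
    """psi, Eq. (5): 0 = fully polarized field, 1 = fully diffuse field."""
    num = 2.0 * np.sqrt(np.sum(w * x) ** 2 + np.sum(w * y) ** 2 + np.sum(w * z) ** 2)
    den = np.sum(w ** 2 + x ** 2 + y ** 2 + z ** 2)
    return 1.0 - num / den

def bin_energy(azimuth, elevation, energy, d_deg=5.0):
    """Accumulate per-window energy on an equirectangular (elevation x azimuth)
    grid; the 5 deg cell size is an arbitrary example resolution."""
    n_el, n_az = int(180 / d_deg), int(360 / d_deg)
    grid = np.zeros((n_el, n_az))
    i = np.clip(((elevation + 90.0) / d_deg).astype(int), 0, n_el - 1)
    j = np.clip(((azimuth + 180.0) / d_deg).astype(int), 0, n_az - 1)
    np.add.at(grid, (i, j), energy)  # unbuffered accumulation per cell
    return grid
```

For a single plane wave the intensity vector points exactly at the source and ψ is zero, consistent with the "fully polarized" limit described above.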

So, combining the direction of sound with its intensity and degree of diffusion, sound field properties may be reproduced using innovative multichannel techniques [22] or, as in the present case, used to obtain a spatial map of the sound intensity as a function of time. Different approaches may be followed to graphically render this information, depending on the purpose to be obtained. The simplest and, at the same time, most detailed approach may be that of using an equirectangular projection, which transposes angular coordinates onto a Cartesian plane, using colours (or grayscale) to represent the intensity of sound at each point. This way of mapping the information may also be particularly useful in combination with a panoramic view of the room (such as those used in virtual tours) to allow easy identification of the actual origin of the sound reflections.

2.2. Beamforming

The basic output of any microphone array is a set of raw signals recorded by each single microphone capsule (conceptually equivalent to Ambisonics A-format). However, in order to make such information of any practical use, a complex mathematical processing (named "beamforming") is required. Beamforming takes into account the delays and phase changes of the signals arriving at each of the microphones in order to accurately determine the direction of the impinging sound wave (which, to keep calculations simpler, is assumed to be a plane wave). The general idea behind beamforming is that, given a single monopole source at an assumed position, and a set of M microphones in an array, a proper processing of the output signals of the microphones via a set of linear filters may allow an estimate of the monopole output. Following this principle, properly focussing the beamformer onto a spatial grid of points allows a mapping of the sound field. The general approach may be somewhat simplified when dealing with arrays mounted on a rigid sphere. In fact, not only does this make it possible to take advantage of the symmetry and decompose the sound field into orthonormal spherical harmonics, but the scattering effects due to the rigid sphere are calculable, the diffraction on the sphere improves the signal-to-noise ratio at low frequencies, and a wider frequency range can be analyzed. Assuming that only plane waves are involved, the sound field may be decomposed according to the theoretical framework outlined in Refs. [12,13]: given a generic square-integrable function p(Ω) on the unit sphere, Ω being the position in angular coordinates, it may be decomposed using its spherical Fourier transform pnm and the corresponding spherical harmonics Yₙᵐ, n being the order and m the degree:

$$ p(\Omega) = \sum_{n=0}^{\infty} \sum_{m=-n}^{n} p_{nm}\, Y_n^m(\Omega) \qquad (7) $$

When a discrete number M of microphones is used to sample the sound pressure p on a sphere at positions Ωj, the spherical Fourier integral is approximated by a summation to give:

$$ p_{nm} = \sum_{j=1}^{M} a_j\, p(\Omega_j)\, \left[Y_n^m(\Omega_j)\right]^* \qquad (8) $$

where aj are coefficients that need to be appropriately chosen in order to get an exact approximation, and the asterisk denotes the complex conjugate. Now, by replacing pnm in Eq. (7), the finite form of the inverse transform is obtained. The final step is the plane-wave decomposition. In fact, the pressure due to a single unit-amplitude plane wave arriving from direction Ωl can be represented using spherical harmonics as:

$$ p(\Omega, \Omega_l) = \sum_{n=0}^{\infty} \sum_{m=-n}^{n} b_n\, \left[Y_n^m(\Omega_l)\right]^*\, Y_n^m(\Omega) \qquad (9) $$

where bn is a complex operator, known as the "mode amplitude", that depends on the Bessel and Hankel functions and shows (Fig. 1 in Ref. [13]) a clear dependence on the frequency multiplied by the sphere radius. This means that, for a given array radius, higher-order spherical harmonics are weaker in the low-frequency range. As the accuracy of the sound field reconstruction depends on the order of the harmonics included in the summation, it can be concluded that, unless larger spheres are used, low frequencies will be rendered less accurately than the others. In a sound field composed of infinite plane waves, the measure of the intensity of the sound field in a given direction Ωl is finally given by the so-called "directional amplitude density":

$$ w(\Omega_l) = \sum_{n=0}^{N} \sum_{m=-n}^{n} \frac{p_{nm}}{b_n}\, Y_n^m(\Omega_l) \qquad (10) $$
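A small numerical sanity check of the discrete spherical Fourier transform in Eq. (8) can clarify the role of the coefficients aj. The sketch below uses a midpoint quadrature grid whose weights play the role of aj; this grid is an illustrative choice, not the actual capsule layout of any array discussed here, and the spherical harmonics are built from SciPy's associated Legendre functions.

```python
import numpy as np
from math import factorial
from scipy.special import lpmv  # associated Legendre functions P_n^m

def Ynm(n, m, az, pol):
    """Orthonormal spherical harmonic Y_n^m (az = azimuth, pol = polar angle)."""
    if m < 0:
        return (-1) ** m * np.conj(Ynm(n, -m, az, pol))
    norm = np.sqrt((2 * n + 1) / (4 * np.pi) * factorial(n - m) / factorial(n + m))
    return norm * lpmv(m, n, np.cos(pol)) * np.exp(1j * m * az)

# Midpoint-rule quadrature grid; the weights below play the role of a_j in Eq. (8).
n_az, n_pol = 90, 180
az = (np.arange(n_az) + 0.5) * 2 * np.pi / n_az
pol = (np.arange(n_pol) + 0.5) * np.pi / n_pol
AZ, POL = np.meshgrid(az, pol)
a_j = np.sin(POL) * (2 * np.pi / n_az) * (np.pi / n_pol)

def spherical_transform(p, N):
    """Discrete spherical Fourier transform up to order N, Eq. (8)."""
    return {(n, m): np.sum(a_j * p * np.conj(Ynm(n, m, AZ, POL)))
            for n in range(N + 1) for m in range(-n, n + 1)}
```

Feeding the transform a field equal to a single harmonic should return a coefficient of one for that (n, m) pair and approximately zero elsewhere, by orthonormality; this is the round-trip that Eqs. (7) and (8) guarantee in the limit of exact quadrature.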

It can easily be understood that, from a computational point of view, this approach is quite demanding. A slightly different approach, which starts from the same assumptions but uses a two-step processing, was proposed by Meyer and Elko [10,23]. This method provides the theoretical background on which one of the microphones used in this study is based. According to this method the impinging sound field is first decomposed into orthonormal components (named "Eigenbeams"), a process equivalent to picking up the sound with a set of ideal microphones having polar patterns corresponding to the different spherical harmonic functions. The order of the Eigenbeams (and hence the resulting spatial resolution of the analysis) depends on the number of microphones in the array, since to determine an Nth-order spherical harmonic a minimum of (N + 1)² microphones is required. As stated above, the frequency dependence of the mode amplitude coefficients (which tend to drop dramatically in the low-frequency range as the order N increases) prevents one from fully benefitting from the higher-order components in this part of the spectrum, so that the spatial resolution will be intrinsically lower unless, as stated above, larger spheres are used. The final step of the proposed process is the so-called modal beamforming. In this case the beamformer processes only the orthonormal components instead of the set of microphone outputs, with the advantage of considerably simplifying the calculations needed to shape the beam and to steer it. The shaping consists of a linear combination of a subset of the Eigenbeams (just those with degree equal to zero), aimed at obtaining the desired directivity. In this way different polar patterns may be obtained, with a maximum directivity index depending on the maximum order of the harmonics, according to the relation 20 log(N + 1). Once the beam has been shaped it can be aimed at the desired direction by means of a further processing that involves the spherical harmonics of degree different from zero (see Ref. [23] for details). Apart from the rather complex mathematical aspects of the beamforming process, there are a number of issues that still need to be addressed in order to have a reliable output. As partly described before, and as shown in several papers [13–18], arranging proper beamforming procedures is certainly not easy, as the bandwidth over which the process can work correctly is often a fraction of the whole band of interest in acoustics. The already mentioned weakness of the higher-order mode amplitudes related to sphere dimension, as well as the small pressure gradient, limits the reliability of the process at low frequencies.
Conversely, spatial aliasing due to the discrete number of sensors used in the array limits the high-frequency range. A careful balance of the different needs is therefore necessary to get the best results. A possible alternative to traditional beamforming techniques has been proposed by Farina et al. [24]. They suggest a purely numerical approach that consists of measuring the response of the array microphones to an ideal impulse (measured in anechoic conditions) emitted from a given direction. By processing the set of responses by means of robust inversion techniques [25,26], a set of inverse filters is obtained for the given directions. Thus, convolving the recorded signals with the corresponding inverse filters yields exactly the (single-channel) response of the array in that direction. In this way the response should be perfectly flat in frequency and free of any artifacts resulting from beamforming processing and from harmonic decomposition. The main limit of the method is the discretization of the aiming directions, but this may be overcome by proper interpolation and by increasing the number of sampled directions. Once this spatial sampling has been carried out, polar patterns of any shape (and order) can easily be obtained by means of the mathematical equations that describe the microphone sensitivity, allowing virtually hyper-directive patterns to be used.
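To make these constraints concrete, a back-of-the-envelope calculation is useful. For a 32-capsule sphere of 42 mm radius (the geometry of the array described in Section 3.1), the highest complete harmonic order, the minimum capsule count, the maximum directivity index, and a rule-of-thumb upper frequency can be estimated; the kr ≤ N bandwidth rule used here is a common heuristic from the spherical-array literature, not a figure stated in this paper.

```python
import math

c = 343.0   # speed of sound [m/s]
r = 0.042   # sphere radius [m] (84 mm diameter array)
M = 32      # number of capsules

# Highest complete order N such that (N + 1)^2 <= M microphones
N = math.isqrt(M) - 1                 # 5^2 = 25 <= 32 < 36, so N = 4
mics_needed = (N + 1) ** 2            # minimum capsules for order N
di_max = 20 * math.log10(N + 1)       # maximum directivity index [dB]
f_upper = N * c / (2 * math.pi * r)   # kr <= N rule-of-thumb upper limit [Hz]
```

The numbers illustrate the text's point: the order-4 budget of a 32-capsule array buys roughly 14 dB of directivity index, while aliasing limits the usable band to a few kilohertz for a sphere of this size.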

3. Methods

3.1. Hardware configuration

The test was carried out using three different microphones. The first was an Eigenmike32™ (EM32), a spherical array with 32 electret microphones embedded in a rigid sphere with a diameter of 84 mm. The spherical baffle also includes all the preamplifiers and the A/D conversion hardware, which is connected via a CAT6 cable to an external box that communicates with the PC via a FireWire interface. The 32-channel output can be handled by any multichannel audio tool, but the microphone usually ships with Zynewave Podium, which also hosts a VST plugin to perform beamforming using one of the available "virtual" microphones, including omnidirectional and first-order dipole (so allowing Ambisonics B-format decomposition), torus, and first- to third-order cardioid, hyper-cardioid, and super-cardioid patterns. With the exception of the omnidirectional, dipole, and first-order super- and hyper-cardioid patterns, all the patterns have a frequency-dependent directivity figure, with a more or less dramatic increase around 1 kHz. The second microphone was a Soundfield MkV, bought in 2002. This is a studio microphone with a dedicated control unit that directly outputs a four-channel B-format signal. Finally, the last microphone under test was a Soundfield ST-350, bought in 2009. This is a portable microphone with a much smaller control unit that is nonetheless capable of providing B-format output.
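The "virtual microphone" patterns listed above can be written in closed form. A common textbook parameterization of an nth-order cardioid is shown below; the plugin's actual pattern coefficients are not documented in this paper, so this form is only indicative of how beamwidth narrows with order.

```python
import numpy as np

def cardioid(order, theta):
    """nth-order cardioid sensitivity ((1 + cos(theta)) / 2)**n (textbook form;
    the EM32 plugin's actual coefficients may differ)."""
    return ((1.0 + np.cos(theta)) / 2.0) ** order

# Beamwidth narrows with order: locate the -3 dB (half-power) angle numerically.
theta = np.linspace(0.0, np.pi, 100000)
half_power_deg = {n: np.degrees(theta[cardioid(n, theta) <= 0.5][0])
                  for n in (1, 2, 3)}
```

With this parameterization the half-power angle shrinks from 90° at first order to about 54° at third order, which is consistent in trend (though not in exact figures) with the frequency-dependent beamwidths discussed for the third-order patterns in Section 3.2.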

3.2. Processing of measured signals

For the purposes of the present paper a simplified approach was followed to process the EM32 signals. In fact, only the standard polar responses provided by the proprietary VST plugin were used, assuming that they have been optimized for the microphone and allow the end-user (without any specific beamforming background or the laboratory facilities required by Farina's approach) to take advantage of the huge potential offered by the microphone over the largest frequency band. In order to retrieve spatial information from the measured responses, a series of MATLAB scripts was developed and later organized under a unique graphical user interface (GUI) capable of also handling Soundfield signals. Two different approaches were used to analyse the directional accuracy of the microphone.

The first was a sort of "low level" use of the microphone potential, as it considers only the lower-order Ambisonics components of the sound field, i.e. the omnidirectional response and the 1st-order dipoles oriented along the three Cartesian axes. The resulting signals were processed according to the procedure described above, obtaining for each sample the intensity vector and its direction of arrival. Thus, a "single" direction of arrival corresponds to any given time. This fact, as will be shown later, represents the most significant limit to the application of the method in reverberant conditions, where simultaneous reflections are highly probable. However, if the reflections are sufficiently spaced (compared to the wavelength), the direction of arrival of each sound can be safely estimated (Fig. 1, early part), the transition from one to the other being very steep. On the contrary, when reflections from different directions are closer (or the wavelength gets longer), they tend to be superimposed, and the vector sum of the particle velocity components determines a much slower transition between the actual directions of arrival (Fig. 1, late part), appearing as a sort of "moving trace". Anyway, for the sake of clarity of representation it may be preferable to average the direction of arrival over longer time intervals (which may be chosen by the user depending on the desired result). In this case, if sound is coming from different directions, the corresponding energy is assigned to a point on a 2D equirectangular grid having azimuth and elevation angles as coordinates. A sort of spatial discretization is also possible in order to have a clearer map. Energy contributions may be "binned" over grids of different resolution, depending on the purpose of the representation and on the speed of calculation. In this way, shorter intervals and higher-resolution grids may be conveniently used to detect single reflections, while longer intervals and less refined grids may be useful to show diffuse reflections. Finally, independent of the chosen visualization grid, for the selected time interval the time-averaged intensity (based on a vector sum of the instantaneous components) is calculated, as well as the corresponding angular coordinates (according to Eq. (6)) and the degree of diffusion (according to Eq. (5)).

Fig. 1. Effect of the superposition of two wavefronts with different delays and different directions of arrival. (a) Velocity components of the wave arriving from +45°; (b) velocity components of the wave arriving from −45°; (c) velocity components of the resulting wave; (d) resulting azimuthal angle calculated using Eq. (6).

The second approach was based on the "spatial sampling" idea, and involved the use of the third-order super-cardioid pattern (cardioid and super-cardioid provided the smoothest curves) aiming at fixed directions (which, for ease of calculation, were assumed to coincide with the layout of the 32 capsules of the microphone). In this way a discretized spherical map of the sound field could be obtained. It should be noted that the angular distance between the microphones is quite large (about 30°), suggesting that the resulting angular resolution of the map could not be better than that. According to Ref. [13] the spatial resolution of a 32-microphone array (taken as half of the zero-to-zero width of the array response to a plane wave) is 60°, confirming that a more refined analysis could only be obtained by increasing the number of microphones. However, considering also the polar response of the third-order super-cardioid (for which a 3 dB drop in high-frequency sensitivity appears at about 30° from the aiming axis, and a 10 dB drop at about 60°, both being even larger at frequencies below 1 kHz), increasing the number of "virtual" microphones would have been a useless effort. The spatial sampling representation was based on the simultaneous acquisition of the 32 signals of the virtual microphones. This means that they are substantially immune to simultaneous reflections arriving from different directions, the only limit being the mutual excitation of neighbouring microphones. This improvement is, however, likely to be negatively compensated by a substantial lack of spatial accuracy due to the strong discretization of the directional information. In order to make results comparable with the previous analysis, "spatially sampled" signals were processed by


averaging the acoustic energy arriving at each of the 32 virtual microphones during the selected time interval. The integrated energy was then attributed to the direction at which the microphone was aiming, and the resulting values were finally interpolated to obtain a full equirectangular projection similar to that obtained from the Ambisonics decomposition. At a given time sample the direction of arrival of the sound was simply assessed as the point from which the loudest sound was coming. However, in order to prevent possible random fluctuations, it was preferred to assume a finite time interval over which the sound energy and its directions of arrival were averaged. So, if sound is coming from different directions during the selected time interval it can easily be shown on the map. Similarly, when different reflections are involved, a map of the different directions of arrival may be obtained by simply superimposing plots referring to time intervals centred on the individual reflections.

In order to quantify the peakiness of the wavefront direction detection, two measures were used. First, following Park and Rafaely [13], the directional gain (DG) was introduced, defined as the logarithmic ratio between the maximum and the average energy arriving in the selected time interval. As stated by Gover et al. [11], this may be taken as an "anisotropy index" that mostly depends on the characteristics of the sound field. However, if used to compare the same time portion of the signal, it can provide an interesting measure of the variations as a function of frequency. In addition, the average angular distance between the peak and the points at which the intensity decreases to half the maximum value was determined. This parameter, named here angular scatter (AS), also depends largely on the time interval used, as a larger interval will likely involve sound coming from several directions; therefore, in order to compare different values they must refer to equal time intervals. Finally, the accuracy of the microphones in detecting the correct direction of arrival of the sound was estimated in terms of the angular difference between the actual origin of the sound and the peak position.
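The two metrics can be sketched in a few lines. Note that the base of the "logarithmic ratio" defining DG is not stated in the text, so the 10·log10 convention below is an assumption; the AS sketch likewise works on a simple 1-D angular scan rather than the full 2-D map.

```python
import numpy as np

def directional_gain(energies):
    """DG: logarithmic ratio (assumed 10*log10 convention) of the peak to the
    average energy arriving in the analysis window [11,13]."""
    e = np.asarray(energies, dtype=float)
    return 10.0 * np.log10(e.max() / e.mean())

def angular_scatter(angles_deg, energy):
    """AS on a 1-D angular scan: mean distance from the peak to the two points
    where energy first falls to half of the maximum."""
    a = np.asarray(angles_deg, dtype=float)
    e = np.asarray(energy, dtype=float)
    k = int(e.argmax())
    half = e[k] / 2.0
    left = k
    while left > 0 and e[left] > half:      # walk down the left flank
        left -= 1
    right = k
    while right < len(e) - 1 and e[right] > half:  # walk down the right flank
        right += 1
    return 0.5 * (abs(a[k] - a[left]) + abs(a[right] - a[k]))
```

A perfectly isotropic window gives DG = 0 dB, while a single dominant arrival pushes DG up; a narrow triangular peak yields an AS equal to its half-power half-width, matching the definitions in the text.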


Fig. 2. Schematic representation of the different microphone–loudspeaker configurations tested in the dry room.


3.3. Experimental set-up

In order to investigate microphone performance under different conditions and with different excitation signals, several experimental set-ups were used. Two rooms were used for testing: first a small listening room with very dry acoustics, then a larger and more reverberant empty room. Similarly, two different types of signals were used. First, impulse responses were obtained from the deconvolution of logarithmic sweeps, in order to obtain a better signal-to-noise ratio and replicate typical measurement conditions in room acoustics. Then continuous white noise was used in bursts of variable length, depending on the effect to be investigated. In the first room the microphone was located in one of the corners of a 1.5 m wide square rig of loudspeakers used for Ambisonics playback. In this way the three loudspeakers were on the horizontal plane (El = 0°) with an angular spacing of 45°. The loudspeakers were three two-way, bi-amplified Yamaha MSP5 units, with a frequency response from 40 Hz to 40 kHz. The 1″ tweeter was 12 cm above the 5″ woofer, resulting in a 4.5° distance in angular coordinates as seen from the receiver. In order to normalize the measurements, the microphone was aligned horizontally with the centre of the woofer; this means that an upward shift is expected when dealing with high-frequency signals. During the test the microphones were also rotated, and one of the loudspeakers was moved to the floor in order to test the localization accuracy along the vertical direction (Fig. 2). The second room (Fig. 3) was a more reverberant empty office of about 60 m³, with reverberation time varying from 1.0 s at low

Fig. 3. Schematic representation of the reverberant office used during the second set of measurements.

Table 1. Reverberation time as a function of frequency measured in the second test room.

Octave-band centre frequency (Hz):  125   250   500   1k    2k    4k
T30 (s):                            1.09  1.27  1.80  2.20  2.29  1.83

frequencies, up to about 2 s at medium and high frequencies (Table 1). A single source was used this time, as the scope of the measurement was to understand how reverberation affects the microphone accuracy. The microphone was placed along the longitudinal axis of the room and the source (of the same type used in the dry room) was placed on a stand, at the same height as the microphone, at angular coordinates (30°; 90°). As usual, first the impulse response was obtained from the deconvolution of a logarithmic sine sweep and processed using both the Ambisonics and the 32-channel decomposition; then continuous white noise was used to excite the room. As stated above, the Ambisonics decomposition may be more or less refined, it being possible to visualize the spatial information over a tight or loose grid. For the purpose of the following analysis, unless otherwise specified, a 1° resolution was used, in order to point out even the faintest variation. However, this level of accuracy also implies that small parallax errors in source and receiver placement may appear in the results despite any reasonable effort to avoid them. Finally, in order to make results comparable over different situations, a time window of analysis needs to be defined in order to "pick" individual reflections. As a general rule, an analysis carried out over a larger time interval may be very useful to show a spatial map of the direction of arrival of the loudest sounds and may be very informative. This method works fine as long as the direct sound (or any other reflection we want to locate) is sufficiently loud; however, as soon as the pattern of diffuse reflections increases, the interpretation of the results may become more complex. In addition, reflections coming from close surfaces may easily arrive a few ms after the direct sound, thus preventing correct localization. So, in order to normalize the approach, a 1 ms window was used to analyse the results unless otherwise specified.

4. Results

4.1. Dry room, single impulsive sound

[Fig. 4. Equirectangular maps (azimuth vs. elevation) of the relative intensity level (0 to −20 dB) at 250 Hz and 1000 Hz.]

60

30

0

-5 -10 -15 -20

-30 -60 -90 -120 -150 -180

Azimuth [°] 0

30

Elevation [°]

The first configuration under investigation was aimed at assessing the localization accuracy on the horizontal plane, as a function of frequency. The microphone was aimed at loudspeaker L3 (Fig. 2a), and the signal was fed to each of the three loudspeakers with a delay of 100 ms, in order to keep the direct sound coming from each one well separated. According to this configuration the sound should arrive from the horizontal plane at azimuth angles of 0°, 45° and 90°. For convenience of representation the directional information pertaining to the three loudspeakers was plotted on the same diagram, although it refers to different 1 ms time windows. Fig. 4 shows that the accuracy obtained by means of the Ambisonics representation was quite impressive, in agreement with the use of the highest spatial and temporal resolution to obtain the most accurate representation of the directional information. The analysis as a function of frequency also showed similar results. In fact, at low frequencies the traces had an oblong shape, with an angular scatter of about 5°, while at higher frequencies they clearly appeared as points with an angular scatter of no more than 1°. The observed low frequency behaviour does not appear when measurements are performed in larger rooms, at greater distance from the source, with the same time window. This supports the idea that the observed phenomenon is not specific to the microphone but rather to the room. In fact, its small dimensions (and its lower sound absorption in the low frequency range) may easily cause a superimposition of different wavefronts which, as shown in Fig. 1, results in a gradual shift of the sound source position (later on referred to as a "moving trace"), compared to the neat transition that takes place when wavefronts are clearly distinct. However, despite this problem, the direction of the wavefront was sufficiently correct, although shifted by about 4° towards the left and by about 3° below the horizontal axis. Considering also the wavelength, such lack of accuracy seemed perfectly acceptable. At 1 kHz the localization of the sound sources was nearly perfect: all the peaks appeared on the horizontal axis and at the correct azimuth (only the central loudspeaker was 1° off the expected position, but it might well be a positioning error). Finally, at 4 kHz the sound direction moved, as expected, 4° above the horizon, while a 2° shift towards the left also appeared for all the loudspeakers. In all cases the directional gain was very high, varying between 44 dB at 4 kHz and 41 dB at 125 Hz, while the diffusivity index was nearly zero, confirming a highly polarized sound field.

Fig. 4. Directional intensity map as a function of frequency for the three sound sources located along the horizontal plane in the dry room. Angular resolution was set to 5° to ease visibility, 1 ms time window. The background image of the room was taken with the camera just above the EM32 microphone, so the loudspeakers appear slightly below their actual angular position.

In order to understand whether such behaviour might be influenced by a rotation of the microphone (i.e. to check if the microphone behaviour was consistent in all directions), the experiment was repeated with a sequence of 45° rotations on the microphone axis. Results were substantially identical, with just the corresponding shift appearing in the azimuth angle. Finally, in order to test also the accuracy in rendering the elevation angle (even though the good behaviour shown with the tweeters provided a first interesting indication), a second loudspeaker configuration was investigated (Fig. 2b). In this case the microphone was aimed at the central loudspeaker (along the square diagonal), while loudspeaker L3 was moved to the floor so as to determine an angular difference of 30° along the vertical direction (resulting in an elevation of 120°). In this case the analysis was carried out with all three microphones under investigation (the EM32 and the two native Ambisonics microphones) in order to compare their accuracy. Results given in Table 2 show that the Ambisonics decomposition resulting from the EM32 again provided impressive performance, with a deviation from the actual position generally within 2°. The least accurate results were observed, as already mentioned, at low frequencies, where "moving" traces appeared again. DG varied between 48 dB at 4 kHz and 41 dB at 125 Hz. Results from the Soundfield Mk-V (Table 3) showed a substantially uniform distribution of the angular scatter, which was about 5° with the exclusion of the 4 kHz band, where larger scatter appeared. However, it is interesting to notice that the phenomenon was less evident for source L3. At low frequencies the "moving

Table 2
Localization accuracy as a function of frequency for the Ambisonics decomposition obtained from EM32 measurements of distinct impulses arriving from (45; 90), (0; 90), and (45; 120). Angular scatter (AS) is assumed as the average angular distance from the maximum intensity to half its value.

              Loudspeaker 1            Loudspeaker 2            Loudspeaker 3
Frequency     Az (°)  El (°)  AS (°)   Az (°)  El (°)  AS (°)   Az (°)  El (°)  AS (°)
Actual pos.   45      90      –        0       90      –        45      120     –
125 Hz        46      85      8        2       93      10       54      121     8
250 Hz        42      90      6        2       88      5        44      125     6
500 Hz        44      90      3        0       90      3        46      122     5
1 kHz         44      90      1        0       90      1        46      121     1
2 kHz         44      89      1        0       88      1        44      118     1
4 kHz         42      84      1        2       86      1        46      118     1
WB            44      88      1        0       89      1        47      121     1
Max. err.     3       6       –        2       4       –        9       5       –

Table 3
Localization accuracy as a function of frequency for the Ambisonics decomposition obtained from Soundfield Mk-V measurements of distinct impulses arriving from (45; 90), (0; 90), and (45; 120). Angular scatter (AS) is assumed as the average angular distance from the maximum intensity to half its value.

              Loudspeaker 1            Loudspeaker 2            Loudspeaker 3
Frequency     Az (°)  El (°)  AS (°)   Az (°)  El (°)  AS (°)   Az (°)  El (°)  AS (°)
Actual pos.   45      90      –        0       90      –        45      120     –
125 Hz        34      96      3        0       94      2        44      116     2
250 Hz        36      92      4        0       90      1        42      111     3
500 Hz        36      93      5        0       90      1        42      115     6
1 kHz         36      94      8        0       90      1        42      115     4
2 kHz         36      96      7        2       90      4        38      120     6
4 kHz         40      85      17       2       81      6        38      121     5
WB            36      93      25       1       89      5        41      116     6
Max. err.     11      6       –        2       4       –        7       5       –

Table 4
Localization accuracy as a function of frequency for the Ambisonics decomposition obtained from Soundfield ST-350 measurements of distinct impulses arriving from (45; 90), (0; 90), and (45; 120). Angular scatter (AS) is assumed as the average angular distance from the maximum intensity to half its value.

              Loudspeaker 1            Loudspeaker 2            Loudspeaker 3
Frequency     Az (°)  El (°)  AS (°)   Az (°)  El (°)  AS (°)   Az (°)  El (°)  AS (°)
Actual pos.   45      90      –        0       90      –        45      120     –
125 Hz        39      89      4        3       88      2        47      124     3
250 Hz        43      92      2        1       91      1        45      120     2
500 Hz        43      92      4        1       95      1        43      111     2
1 kHz         41      89      5        1       92      1        45      117     1
2 kHz         43      91      10       3       88      2        43      116     4
4 kHz         45      85      17       7       95      10       45      124     5
WB            42      90      18       1       92      13       45      119     15
Max. err.     6       5       –        7       5       –        2       4       –



traces" were not as evident as in the previous cases. However, in terms of localization accuracy this microphone's performance was below expectation. In fact, the right source (L1) was shifted about 9° towards the central source, while the left source showed a less dramatic shift of 3° towards the centre, thus reducing the angular span between the extreme sources to 78° instead of 90°. In terms of elevation angles, the upward shifting of the tweeter signal could be seen with sufficient accuracy for the centre source, while it was moved downward for the L3 signal. At the other frequencies, on average, the localization of the L3 source was shifted upwards by 5°. DG varied between 38 dB at 4 kHz and 36 dB at 125 Hz.

Results from the more recent ST350 microphone (Table 4) showed substantially better performance. The 4 kHz scatter appeared again, with large fluctuations for source L1. At the other frequencies the direction of arrival of the sounds was not as stable as observed for the EM32 signals, but on average the angular scatter was only 3° and the directional accuracy was good, with sources L2 and L3 correctly located within a 1° error (for both azimuth and elevation), and source L1 shifted to the centre by 3°. In this case the DG varied between 39 dB at 4 kHz and 44 dB at 1 kHz and 125 Hz.

At this point, in order to better understand some of the behaviours observed for the native B-format microphones, the measurements were repeated by simply rotating the microphone around the vertical axis by ±90°. In this way a clearer description of the angular accuracy was obtained. Results are given in Fig. 5 for all the frequencies. Again, the angular scatter at 4 kHz appeared, and this happened with both microphones. A detailed analysis of the B-format components showed that the origin of the scatter was a small misalignment between the Cartesian components of the velocity. A similar effect may well cause (as explained in Fig. 1) a more or less evident shift in the source placement. Such behaviour has been shown elsewhere [27], appearing as an increased scattering (or diffuseness) in the sound field localization, and can be explained as a consequence of the non-perfect

Fig. 5. Schematic spatial representation of localization accuracy as a function of frequency for the Ambisonics decomposition of signals obtained from (a) Soundfield Mk-V; (b) Soundfield ST-350. Plots refer to dry room measurements of distinct impulses arriving from (45; 90), (0; 90), and (45; 120), with microphones rotated by ±90° around their axes. Circles are centred at the maximum position and their radius corresponds to the average angular scatter.

Table 5
Localization accuracy as a function of frequency for the 32 channel spatial sampling obtained from EM32 measurements of distinct impulses arriving from (45; 90), (0; 90), and (45; 120). Angular scatter (AS) is assumed as the angular distance from the maximum at which the value is halved.

              Loudspeaker 1            Loudspeaker 2            Loudspeaker 3
Frequency     Az (°)  El (°)  AS (°)   Az (°)  El (°)  AS (°)   Az (°)  El (°)  AS (°)
Actual pos.   45      90      –        0       90      –        45      120     –
125 Hz        35      85      60       11      92      58       41      113     60
250 Hz        35      92      55       11      95      53       47      122     55
500 Hz        44      92      45       2       92      45       44      113     55
1 kHz         45      92      40       2       85      43       47      119     45
2 kHz         38      88      40       2       82      40       47      122     42
4 kHz         38      88      43       2       79      42       44      122     40
WB            39      88      42       5       82      42       45      122     38
Max. err.     10      5       –        11      11      –        4       7       –

coincidence of the four microphone capsules, which may cause spatial aliasing [22], or of phase/time misalignments of the signals [27] in the high frequency range (from 3.15 kHz on). In terms of angular accuracy the two microphones showed very different behaviours. The ST350 rendered source positions with good accuracy, with an average error of ±3° in both directions, never exceeding 5°. In addition, sound sources that after rotation shared the same azimuth angle appeared clearly aligned along the vertical axis. On the contrary, the Mk-V showed considerably less accurate results, with variations appearing both as a function of frequency (with differences up to 10°) and as a function of the arrival direction. In particular, it can be observed that the azimuths of sources that should have coincided after microphone rotation differed on average by 9°, with peaks of 13° at given frequencies. Such variations are coherent with the slight rotation of the polar pattern (also depending on frequency) found by Farina [6] when using similar microphones.
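The localization figures discussed above all rest on the direction of the active intensity vector obtained from the zeroth and first order (B-format) components. A minimal sketch of this estimator, assuming W, X, Y, Z are available as equally scaled numpy arrays and using the paper's elevation convention (0° at the zenith, 90° on the horizontal plane):

```python
import numpy as np

def doa_from_bformat(w, x, y, z):
    """Direction of arrival from first-order B-format signals.

    The time-averaged active intensity is proportional to the product of
    the pressure signal (W) with each velocity component (X, Y, Z); with
    the usual B-format sign convention this vector points toward the
    source.  Elevation is measured from the zenith (90 = horizontal).
    """
    ix, iy, iz = np.mean(w * x), np.mean(w * y), np.mean(w * z)
    norm = np.sqrt(ix**2 + iy**2 + iz**2)
    azimuth = np.degrees(np.arctan2(iy, ix))
    elevation = np.degrees(np.arccos(iz / norm))
    return azimuth, elevation
```

Applying this to 1 ms slices of the band-filtered impulse response yields per-reflection maps of the kind shown in Fig. 4. Note that any common gain on W (e.g. the conventional weighting of the omnidirectional channel) rescales all three intensity components equally and therefore drops out of the angle estimate.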

Fig. 6. Directional intensity map obtained in the presence of simultaneous reflections. (a) Ambisonics decomposition (5° angular resolution) at 1 kHz for reflections arriving from (45; 90) and (−45; 90); (b) spatial sampling at 1 kHz for reflections arriving from (45; 90) and (−45; 90); (c) same as (b) but at 4 kHz; (d) spatial sampling at 4 kHz for reflections arriving from (45; 90) and (0; 90).

Fig. 7. Directional intensity map obtained from Ambisonics decomposition of time spaced continuous white noise bursts arriving from (45; 90), (0; 90), and (45; 120). Image refers to the 1 kHz octave band, 1° spatial resolution and a 100 ms time window. Note the "spotted" distribution of the levels around the actual source position.

To complete this analysis, the directional accuracy resulting from the discrete spatial sampling of the sound field was determined. The same signal recorded during the second set-up with the EM32 and processed to obtain the B-format components was simply reprocessed to get the response of 32 virtual microphones aimed in the same directions in which the real capsules are located. The spatial map showed larger areas around each peak, resulting in the larger angular scatter shown in Table 5. The origin of each sound was identified with sufficient accuracy, with angular errors generally within 5° and a maximum of 10° in a few cases. As already observed, the accuracy tended to improve as the frequency increased. However, what emerged dramatically was the much larger angular scatter that resulted from the interpolation process over the 32 fixed positions. Its values varied between 40° at the highest frequencies and 60° at the lowest, in good agreement with the polar plot of the "virtual microphone" sensitivity. In other words, considering the "boundary" conditions, the results could not be any better. This result was perfectly expected and suggests that, in order to allow a clear separation between two sound sources according to this approach, they should be spaced by an angular distance at least twice the angular scatter. Nonetheless, in the presence of single reflections, despite the larger angular scatter, the identification of the arrival direction seems satisfactory.

In conclusion of this comparison between conventional B-format microphones and the Ambisonics decomposition of the sound field given by the EM32, it can be stated that the latter offered outstanding performance, particularly at medium and high frequencies, while at low frequencies the performance was slightly poorer. This problem is generally related to the lower pressure gradient that typically affects low frequencies. The traditional microphones showed critical behaviour at 4 kHz, while at lower frequencies they showed substantially different performances, with the ST350 performing better than the Mk-V.
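The scatter figures above are consistent with the directivity of the virtual microphones themselves. As a sketch, assuming for illustration only a cardioid-family pattern of order 3, g(θ) = ((1 + cos θ)/2)³ (the actual EM32 virtual-microphone beam differs in detail), the energy picked up by each of the 32 fixed beams for a single plane wave can be computed as:

```python
import numpy as np

def virtual_mic_energies(aim_dirs, arrival_dir, order=3):
    """Energy received by fixed virtual microphones for one plane wave.

    aim_dirs: (N, 3) array of unit vectors along each microphone axis.
    arrival_dir: unit vector pointing toward the source.
    An assumed third-order cardioid-family amplitude pattern,
    ((1 + cos(theta)) / 2) ** order, is used for illustration.
    """
    cos_theta = aim_dirs @ arrival_dir            # cosine of off-axis angle
    gains = ((1.0 + cos_theta) / 2.0) ** order    # amplitude gain per beam
    return gains**2                               # energy per beam

# For this pattern the energy halves roughly 39 degrees off axis, of the
# same order as the 40-60 degree angular scatter reported in Table 5.
```

The loudest of the 32 beams (the argmax of the returned energies) then gives the discrete direction estimate, which explains why the peak position can be accurate even though the map around it is broad.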
In practice this meant that using a very detailed grid with a 1° resolution may prove unnecessary in most cases, particularly towards the extremes of the frequency range. A 5° spacing could be more than adequate in most circumstances, possibly masking unavoidable aiming errors. Spatial sampling was characterized by a larger angular scatter (varying between about 40° at high frequencies and 60° at low frequencies), but the localization accuracy was quite good considering the large angular span between the sampled areas.

4.2. Dry room, simultaneous impulsive sound

The next step of the validation procedure involved the use of simultaneously arriving impulses, in order to analyse their effect on the localization accuracy. As stated above, the performance was expected to worsen significantly for the Ambisonics decomposition. The loudspeaker configuration used for the test was the same shown in Fig. 2a, i.e. the same used to compare Ambisonics

Table 6
Localization accuracy as a function of frequency for the 32 channel spatial decomposition obtained from EM32 measurements of time spaced continuous white noise bursts arriving from (45; 90), (0; 90), and (45; 120). Angular scatter (AS) is assumed as the average angular distance from the maximum intensity to half its value.

              Loudspeaker 1            Loudspeaker 2            Loudspeaker 3
Frequency     Az (°)  El (°)  AS (°)   Az (°)  El (°)  AS (°)   Az (°)  El (°)  AS (°)
Actual pos.   45      90      –        0       90      –        45      120     –
125 Hz        n.d.    n.d.    n.d.     n.d.    n.d.    n.d.     47      122     65
250 Hz        26      88      70       2       76      55       53      122     60
500 Hz        59      95      55       17      92      65       53      110     60
1 kHz         47      92      42       2       88      43       47      119     45
2 kHz         38      92      40       2       88      40       44      122     40
4 kHz         38      88      43       2       80      45       44      122     38
WB            38      88      44       2       82      43       44      122     38
Max. err.     19      5       –        17      14      –        8       10      –

Fig. 8. Directional intensity map obtained from Ambisonics decomposition (at 1 kHz) of dry room recordings of simultaneous emission of white noise bursts arriving from (45; 90) and (−45; 90).

microphones performance, with the only difference that this time the sweep signals were sent to the lateral loudspeakers (L1 and L3) with exactly the same delay and the same intensity. Fig. 6a shows that the Ambisonics decomposition was unable to correctly identify the two sources and, with reference to the 1 ms time window, showed a "moving trace" that seemed to connect the actual positions of the sound sources, placing the energy maximum (which normally identifies the direction of arrival) nearly half way between the two. For clarity of representation the plot refers to 1 kHz on a 5° grid and to a 1 ms time window. At lower frequencies an even more focused single source located in between the actual sources appeared, while at high frequencies the peaks were distributed along the whole interval. In real-world situations such "ghost" sources, resulting from combinations of wave fronts arriving from different directions, need to be clearly identified in order to avoid interpreting them as actual sources (or reflections). An interesting instrument to discriminate a real source from a ghost source might be the degree of diffusion ψ. In fact, it has been seen before that a sound arriving from a real source is usually characterized by a nearly zero ψ, independent of frequency, while in the case of the ghost source ψ varied between 0.1 at 125 Hz and 0.7 at 4 kHz.

The same analysis was carried out using the 32 channel spatial sampling, showing that the two simultaneously arriving reflections determined in this case (despite the already observed angular scatter) a sufficiently clear image resembling a "dipole", with two marked peaks at the L1 and L3 positions that could be identified with sufficient accuracy from 1 kHz on. At 1 kHz (DG = 7 dB) the localization of the peaks appeared shifted towards the centre at an azimuth of ±35°, while the elevation was within 5° of the horizontal plane, confirming a certain reciprocal influence between the two simultaneous reflections. At 4 kHz the peaks appeared clearly distinct (DG = 7 dB) and their location was within 4° of the actual position with reference to azimuth and within 1° with reference to elevation, confirming that they were correctly identified. At low frequencies the sound sources tended to merge despite the 90° span. Reducing the angular separation to 45° (i.e. using sources L3 and L2), even at 4 kHz the DG remained equal to 7 dB but no clear identification was possible (Fig. 6d), confirming that below the estimated angular resolution of 60° the identification of simultaneous sources became critical.

4.3. Dry room, continuous sound from single source
At this point it was interesting to understand how the microphones behaved when excited by continuous noise-like signals rather than impulsive ones. Therefore, taking advantage of the same experimental set-up previously described, the loudspeakers were fed with white noise. With this type of signal an "instantaneous" picture of the directions of arrival (i.e. a very narrow time window for the analysis) proved quite useless due to the random nature of the process. So, the time window over which the directional information was calculated was kept larger, so that contributions arriving from the actual source could sum up, while random reflections were distributed all over the space. The more or less reverberant nature of the space was expected to further emphasize such behaviour, so that in a dry room a 100 ms interval was assumed to be sufficient to allow reliable direction identification, even though a larger interval of at least 500 ms provided better results. As will be discussed later, in more reverberant rooms even longer intervals are needed in order to ensure a sufficient collection of energy contributions from the actual source origin. As shown in Fig. 7, the resulting directional map obtained from the Ambisonics decomposition was characterized by areas that, although well centred around the actual source positions, showed an angular scatter significantly larger than observed for impulsive signals. On average it was 16° at 250 Hz and 12° at 1 kHz, reducing to 8° at 4 kHz, confirming that low frequencies were more difficult to identify correctly. DG was also considerably reduced (compared to impulsive sound) and varied between 17 dB at 125 Hz and 23 dB at 4 kHz, confirming that even the few and weak reflections of the dry room were enough to worsen the performance.

Given the amount of angular scatter observed in dry conditions, it could be concluded that for the Ambisonics decomposition of continuous signals a grid spacing of 5° might be more than adequate, providing an even clearer representation of the directional properties of the sound field. The same analysis was carried out using the 32 channel decomposition (Table 6). At 125 Hz the direction identification proved very difficult (DG = 3 dB) and, in some cases, was affected by larger errors (most likely because the room was slightly more reverberant at low frequencies). At the other frequencies the average angular error

Fig. 9. Directional map obtained in a reverberant rectangular room (shown in the background), superimposing on the same plot the direct sound localized at (30; 90) and the first order reflections. The actual origin of the first order reflections (shown by crosses) was estimated using the image source method. (a) Ambisonics decomposition (5° angular resolution); (b) 32 channel spatial sampling; (c) omni-directional impulse response.

between actual positions and the observed peaks never exceeded 5°, with the only exception of the 500 Hz band, for which the centre source was strangely shifted by 20° to the left. DG was also similar to that of the impulsive sound, reaching 9 dB at 4 kHz. In terms of angular scatter no substantial variation appeared compared to the impulsive sound, with values varying from 40° at 4 kHz to 60° (or slightly more) at 125 Hz. This suggested that, although slightly less accurate, this approach could be considered fairly insensitive to the type of signal (at least in dry rooms). In this case the use of smaller time windows of 50 ms gave equally good results.
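The long-window analysis used for noise-like excitation can be sketched by averaging the instantaneous intensity over consecutive windows, so that contributions from the true source direction accumulate while random reflections tend to cancel. A hypothetical helper, reusing the intensity-based angle estimate (window lengths of 50 to 500 ms correspond to the values discussed above):

```python
import numpy as np

def doa_windowed(w, x, y, z, fs, window_s=0.1):
    """One (azimuth, elevation) estimate in degrees per analysis window.

    The intensity components are averaged over non-overlapping windows of
    window_s seconds; longer windows give more stable estimates for
    continuous noise-like signals.  Elevation is measured from the zenith.
    """
    n = int(window_s * fs)
    estimates = []
    for start in range(0, len(w) - n + 1, n):
        s = slice(start, start + n)
        ivec = np.array([np.mean(w[s] * x[s]),
                         np.mean(w[s] * y[s]),
                         np.mean(w[s] * z[s])])
        az = np.degrees(np.arctan2(ivec[1], ivec[0]))
        el = np.degrees(np.arccos(ivec[2] / np.linalg.norm(ivec)))
        estimates.append((az, el))
    return estimates
```

The spread of the per-window estimates is a direct measure of the stability discussed in the text: in a dry room the estimates converge quickly, while reverberation requires windows of several seconds before the fluctuations settle.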

4.4. Dry room, continuous sound from simultaneous sources

To conclude the analysis in the dry room, white noise was used to feed the extreme loudspeakers (configuration given in Fig. 2b) simultaneously, in order to investigate the microphone performance in this case. The Ambisonics decomposition showed a distributed peak located between the two actual sources (Fig. 8). The area appeared larger and more "blurred" than it was for impulsive sounds, but no "dipole" suggesting the presence of two distinct sources appeared. At higher frequencies the area


appeared more distributed between the actual locations, covering the whole range from L1 to L3. This confirmed that continuous sounds worsened the localization performance of the Ambisonics decomposition in the presence of multiple emitting sources, resulting in a "ghost" source located in between the actual sources. When this configuration was analyzed using the spatial sampling, no substantial difference appeared compared to the results in Fig. 6, confirming that, at least in a dry room, this approach was clearly independent of signal characteristics, allowing a better discrimination of simultaneous reflections provided that they were separated by an angular distance larger than the minimum angular resolution.

Table 7
Localization accuracy as a function of frequency for the Ambisonics and the 32 channel spatial sampling decompositions obtained from EM32 measurements, with reference to the first order reflection arriving from the floor (geometric configuration given in Fig. 2).

              Ambisonics           Spatial sampling
              Az (°)   El (°)      Az (°)   El (°)
Image method  30       126         30       126
Wide band     30       125         41       122
4 kHz         30       127         41       125
2 kHz         32       114         38       122
1 kHz         28       113         29       128
500 Hz        28       112         38       125
250 Hz        30       110         35       110
125 Hz        30       110         38       107

4.5. Reverberant room, impulsive sound

The analysis of the sound propagation in the reverberant room provided additional information on the ability of the different spatial decomposition approaches to identify and localize not only the direct sound, but also the reflections due to the room surfaces. The Ambisonics decomposition allowed the identification of the direct sound with the usual accuracy, with slight variations as a function of frequency in terms of DG (generally between 43 dB and 38 dB) and angular accuracy (2° at 4 kHz and 10° at 125 Hz). The first reflections were also identified with similar accuracy. As shown in Fig. 9a, the first order reflections (mapped on the same plane for ease of representation) fell within a maximum of 8° from their actual positions, theoretically calculated using a simplified image source method. The largest errors appeared for the back wall reflection, for which, as confirmed by the image source model, a bunch of other reflections (up to fourth order) arrive very closely spaced in time (Fig. 9c). In this case a 1 ms time window might not be sufficiently selective; but considering that the spatial map is built by adding the intensity contribution pertaining to each sample to its direction of arrival, a further reduction of the time window would simply have removed the contribution of the scattered reflections and could not improve the localization accuracy. Anyway, considering that the direct sound was located at an elevation of 86° and the sound reflected from the back wall was located at 81°, the relative localization error remains within acceptable limits, well compatible with small inaccuracies in source or receiver positioning.
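The theoretical positions used as references in Fig. 9 follow from a simplified image source model: each room surface mirrors the source, and the image's distance and bearing give the arrival time and expected direction of the first order reflection. A minimal sketch for a shoebox room (the dimensions and positions below are placeholders, not the measured ones):

```python
import numpy as np

def first_order_images(src, room):
    """First order image sources for src inside a shoebox room.

    src: (x, y, z) source position; room: (Lx, Ly, Lz) dimensions, with
    walls at coordinate 0 and L along each axis.  Returns six images,
    one per surface.
    """
    src = np.asarray(src, dtype=float)
    images = []
    for axis, size in enumerate(room):
        lo = src.copy(); lo[axis] = -src[axis]            # wall at 0
        hi = src.copy(); hi[axis] = 2 * size - src[axis]  # wall at L
        images += [lo, hi]
    return images

def arrival(image, receiver, c=343.0):
    """Delay in ms and unit direction of arrival for one image source."""
    d = np.asarray(image, dtype=float) - np.asarray(receiver, dtype=float)
    r = np.linalg.norm(d)
    return 1000.0 * r / c, d / r
```

Comparing the delays of these images (relative to the direct path) with the omnidirectional impulse response of Fig. 9c is what reveals the cluster of closely spaced arrivals around the back wall reflection.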
In particular, it was interesting to observe that, due to the relatively close time spacing of the reflections (the first four arrive within 12 ms after the direct sound), moving from high to low frequencies made it increasingly difficult to identify the individual reflections in the filtered impulse responses, making the correct estimation of the direction of arrival harder as well. In such cases Park and Rafaely [13] suggested first applying a time window of 1 ms around the selected reflection and then processing the signal. Application of this method to the Ambisonics decomposition proved very effective. In fact, for the first reflection arriving from the floor the direction of arrival as a function of frequency is given in Table 7 and proved to be estimated quite accurately, although the elevation angle was slightly underestimated (with a maximum error of 16°) at low frequencies. The strong reflection determined a significant polarization, with ψ = 0.03, independent of frequency. For the other two strong reflections ψ increased to about 0.1. Conversely, when the intensity of the reflections was reduced and became comparable with the nearby diffuse reflections, or the early reflections were too closely spaced to allow a sufficient separation after time windowing, the degree of diffusion increased considerably (e.g. for the back wall reflection it was, on average, 0.3), and the low frequency accuracy worsened consequently.
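The degree of diffusion used above to tell genuine reflections (ψ near zero) from blended or ghost arrivals can be estimated directly from the B-format signals. A sketch using the common energetic definition ψ = 1 − 2‖⟨p·v⟩‖ / ⟨p² + ‖v‖²⟩ (an assumed formulation; the paper's exact estimator may differ):

```python
import numpy as np

def diffuseness(w, x, y, z):
    """Diffuseness estimate: 0 for a single plane wave, close to 1 for a
    fully diffuse field.  Assumes W, X, Y, Z are scaled so that a plane
    wave gives equal pressure and velocity magnitudes."""
    i = np.array([np.mean(w * x), np.mean(w * y), np.mean(w * z)])
    energy = np.mean(w**2 + x**2 + y**2 + z**2)
    return 1.0 - 2.0 * np.linalg.norm(i) / energy
```

For a single plane wave the velocity is aligned with the pressure, the ratio is one and ψ = 0; for uncorrelated components the averaged intensity vanishes and ψ approaches 1, mirroring the 0.03 versus 0.3 values reported above.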

When the same impulse response was processed using spatial sampling, the results were again coherent with those observed in the dry room. The direct sound was identified with good angular accuracy independent of frequency (the peaks corresponding to each reflection fell within 5° of the expected position, with a DG varying between 7 dB at 125 Hz and 10 dB at 4 kHz). As usual, the angular scatter was large (varying between 30° at 4 kHz and 60° at 125 Hz). When dealing with reflections, the first analysis was carried out using unfiltered signals and the same 1 ms time window, to allow comparison with the Ambisonics results. As shown in Fig. 9b, apart from the larger angular scatter, the results were equally satisfactory and, although by a small extent, even more accurate than Ambisonics, possibly because this method is immune from the creation of ghost reflections in the presence of simultaneous arrivals. The analysis of band-filtered signals (not shown in the figure for brevity) showed that the above conclusions applied with sufficient accuracy only in the high frequency range. At low frequencies the identification of the direction of arrival became critical, mostly as a consequence of the difficulty of "picking" the single reflection in the impulse response (as close reflections tend to blend at low frequencies). The preliminary time windowing was hence applied in this case as well, with satisfactory results. Table 7 shows that for the floor reflection the azimuth angle was estimated correctly, with a maximum error of 11° at 4 kHz. Conversely, for the elevation angle the maximum error was 19° and appeared (as for Ambisonics) moving towards low frequencies. In this case the almost systematic shift towards the left could be explained by the proximity of one of the aiming directions (45°; 125°), while the low frequency movement towards 110° in elevation might result from the combined effect of the close 2nd order reflection from the side wall.
When the other reflections were taken into account, the method proved effective at least as long as the energy content of the reflections was high compared to the nearby diffuse reflections. This condition was particularly difficult to satisfy at low frequencies, which were more easily identified in the wrong place. This confirmed that, as shown in the previous sections, where the dry room conditions contributed to make the task easier, the correct localization of low frequency sounds remains difficult. As shown in Ref. [21], small microphone arrays always show poorer performance in the low frequency range (independent of the type of processing). In fact, the so-called "noise amplification effect" takes place because the small spacing of the microphone capsules compared to the wavelength determines a lower pressure gradient. Hence, even assuming the same level of noise over the whole spectrum, a lower pressure gradient results in a lower signal-to-noise ratio and, consequently, in larger errors in the estimation of direction-related parameters.
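The effect can be quantified with a one-line model: for two capsules spaced d apart, the differential (gradient) signal produced by a unit-amplitude axial plane wave has magnitude 2·sin(kd/2), with k = 2πf/c. A sketch assuming a 4 cm spacing (of the order of the EM32 sphere, used here only for illustration):

```python
import numpy as np

def gradient_level_db(freq_hz, spacing_m=0.04, c=343.0):
    """Relative level (dB) of the finite-difference signal between two
    capsules spaced spacing_m apart, for an axial plane wave of unit
    amplitude: |p1 - p2| = 2 * sin(k * d / 2), with k = 2*pi*f/c."""
    k = 2 * np.pi * np.asarray(freq_hz, dtype=float) / c
    return 20 * np.log10(2 * np.sin(k * spacing_m / 2))

# With these numbers the differential signal at 125 Hz is roughly 23 dB
# below its 2 kHz value, so the same capsule self-noise erodes the
# velocity estimate, and hence the direction estimate, much faster at
# low frequencies.
```

This is the "noise amplification" mechanism in miniature: the array does not amplify the noise, it attenuates the low frequency gradient signal relative to a flat noise floor.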


F. Martellotta / Applied Acoustics 74 (2013) 987–1000

Table 8. Localization accuracy and angular scatter (AS) as a function of frequency, obtained from EM32 measurements in a reverberant rectangular room excited by a white noise source located at (30°; 90°), using Ambisonics decomposition and 32 channel spatial sampling.

                  Ambisonics                           Spatial sampling
              Azimuth (°)  Elevation (°)  AS (°)   Azimuth (°)  Elevation (°)  AS (°)
Actual pos.       30            90          –          30            90          –
125 Hz            65           130          60         86           137         120
250 Hz           150            65          60         26            61         110
500 Hz            35            95          50         41            70          80
1 kHz             15            85          50         40            70          35
2 kHz             30            90          50         29            85          40
4 kHz             20            90          40         32            92          45
Wide band         20            80          55         35            82          60

Table 9. Average angular coordinates and relative standard deviations resulting from ten random samples of different duration, analyzed with 32 channel spatial sampling. Measurements were carried out using the EM32 microphone in a reverberant rectangular room excited by a white noise source located at (30°; 90°). Wide band signal only.

              Azimuth (°)  Elevation (°)  Std. dev. (°)
Actual pos.       30            90              –
5 s               35            82              0
1 s               35            79              1
0.5 s             35            80              2
0.1 s             36            78              4
50 ms             37            73              8
10 ms             24            95            120

4.6. Reverberant room, continuous emission

Given the above conclusions, source identification in a reverberant room under continuous sound emission was expected to be quite critical. This is not an issue in architectural acoustics (which is based on IR analysis), but the use of continuous signals (such as white or pink noise) could be interesting in building acoustics to detect elements with poor sound insulation. When white noise was used to excite the room and the Ambisonics decomposition was used to process the spatial information, the identification of the source direction became considerably more difficult, even with a large time window (up to 5 s). In fact, the reverberant field, contributing simultaneous reflections from every direction, raised the noise threshold, as confirmed by the directional gain, which dropped to 6 dB in this case. Large angular scatter and inaccurate identification of the sound source location appeared below 500 Hz (Table 8). From 500 Hz upwards the results were somewhat more reassuring (with a maximum error of 15°), but the angular scatter remained large. In all cases a large time interval of 5 s had to be used in order to keep the arrival direction of the maxima sufficiently stable; otherwise continuous fluctuations of the directional map were observed. The spatial sampling showed results very similar to those given by Ambisonics, with the wide-band (unfiltered) signal and the high frequencies localized with good accuracy, even though the directional gain was no larger than 3 dB. The unfiltered signal showed a peak at (35°; 82°), with an error of 8°. At medium frequencies the direction of arrival was shifted upwards and to the left, possibly as a result of the combined effect of the reflections from the closest surfaces. At 125 Hz the localization accuracy again worsened considerably, as a result of the noise amplification effect.
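The Ambisonics-based localization used throughout rests on the time-averaged active (pseudo-)intensity vector that can be formed from the zeroth- and first-order signals. The following sketch illustrates the idea on a synthetic plane wave arriving from the measurement's source position (30°; 90°); the channel scaling and sign conventions are simplified assumptions, not the paper's exact processing chain.

```python
import numpy as np

def pseudo_intensity_doa(w, x, y, z):
    """Estimate (azimuth, colatitude) in degrees from B-format-like signals
    using the time-averaged pseudo-intensity vector mean(w * [x, y, z]).
    With the sign convention assumed here (dipole channels positive toward
    the arrival direction), the vector points toward the arriving sound."""
    d = np.array([np.mean(w * x), np.mean(w * y), np.mean(w * z)])
    r = np.linalg.norm(d)
    azimuth = np.degrees(np.arctan2(d[1], d[0])) % 360.0
    colatitude = np.degrees(np.arccos(d[2] / r))
    return azimuth, colatitude

# Synthetic check: a broadband plane wave from azimuth 30°, colatitude 90°.
rng = np.random.default_rng(0)
s = rng.standard_normal(48000)                     # hypothetical 1 s signal
az, col = np.radians(30.0), np.radians(90.0)
u = np.array([np.cos(az) * np.sin(col), np.sin(az) * np.sin(col), np.cos(col)])
w_sig, x_sig, y_sig, z_sig = s, u[0] * s, u[1] * s, u[2] * s
```

On this noiseless synthetic input the estimate recovers (30°; 90°) essentially exactly; the fluctuations discussed in the text arise only once reverberant energy is superposed on the direct path.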
An interesting difference compared to the Ambisonics decomposition was that shorter time windows (down to 50 ms) could be safely used in this case without affecting the stability of the sound localization. In fact, Table 9 shows that picking random samples of different length causes fluctuations in the localization that tend to increase as the time window shortens, becoming strongly variable (angular standard deviation of more than 100°) only below 50 ms.
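The stability check behind Table 9 (random excerpts of decreasing length, spread of the resulting estimates) can be sketched as below; `doa_estimate` is a hypothetical stand-in for either localization method, and circular statistics are used so that azimuths near 0°/360° average correctly.

```python
import numpy as np

def localization_spread(signal_len, window_len, doa_estimate, n_trials=10, seed=0):
    """Circular mean and circular standard deviation (both in degrees) of the
    azimuth estimates obtained on n_trials random windows of window_len samples.
    doa_estimate(start, stop) is a hypothetical callable returning an azimuth."""
    rng = np.random.default_rng(seed)
    az = []
    for _ in range(n_trials):
        start = int(rng.integers(0, signal_len - window_len))
        az.append(doa_estimate(start, start + window_len))
    # Mean resultant vector of the unit phasors; its angle is the circular
    # mean and its length gives the circular standard deviation.
    z = np.exp(1j * np.radians(np.asarray(az, dtype=float))).mean()
    mean_az = np.degrees(np.angle(z)) % 360.0
    spread = np.degrees(np.sqrt(max(0.0, -2.0 * np.log(np.abs(z)))))
    return float(mean_az), float(spread)
```

Running this with progressively smaller `window_len` on real per-window estimates would reproduce the trend of Table 9: a spread near zero for long windows, exploding once the window no longer spans enough of the reverberant field.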


5. Conclusions

The paper presented the results of a series of measurements aimed at assessing the accuracy of microphone arrays in localizing sound sources and reflections in enclosed rooms. A 32 channel spherical array and two traditional Ambisonics microphones were used in the tests. The latter were used only during the first stage of the tests, to provide a sort of "reference" condition. Measurements were carried out in a small dry room and in a larger reverberant office, using both impulsive and continuous signals. The results of the measurements were interpreted using a simple Ambisonics decomposition (for all the microphones) and, for the 32 channel array only, using a spatial sampling obtained by aiming 32 virtual microphones (with a third-order super-cardioid response) along the same directions in which the capsules are located. Measurements in the dry room showed that, when impulsive sounds from single sources were used, the Ambisonics decomposition based on the 32 channel array recordings provided the highest accuracy (with good localization and angular scatter limited to a few degrees), while the traditional Ambisonics microphones showed fluctuating performance, with better results observed for the newer microphone and less reliable localization for the older instrument. Spatial sampling proved to be sufficiently accurate but was characterized by substantially larger angular scatter, which tended to increase from 30° at 4 kHz to 60° at 125 Hz. After this step the traditional Ambisonics microphones were no longer considered, given the tougher tasks to be performed. In the presence of simultaneous signals arriving from multiple sources (again in the dry room) the Ambisonics decomposition showed poor accuracy, creating "ghost" sources in between the actual sources. The degree of diffusion ψ was proposed as a criterion to discriminate between a real and a ghost source (or reflection).
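The discrimination criterion can be sketched with the standard energetic diffuseness estimate ψ = 1 − ‖⟨p·u⟩‖ / (c⟨E⟩) (cf. Refs. [7,21]): a single strong arrival drives ψ toward 0, while a ghost peak produced by two simultaneous arrivals sits in a locally diffuse field with ψ closer to 1. The sketch below uses simplified, unit-scaled pressure and velocity signals, not the paper's exact channel conventions.

```python
import numpy as np

def diffuseness(p, vx, vy, vz):
    """Energetic diffuseness estimate: 1 - |mean intensity| / mean energy,
    with pressure and particle velocity assumed scaled to common units so
    that a single plane wave gives 0 and a fully diffuse field tends to 1."""
    intensity = np.array([np.mean(p * vx), np.mean(p * vy), np.mean(p * vz)])
    energy = 0.5 * np.mean(p**2 + vx**2 + vy**2 + vz**2)
    return 1.0 - np.linalg.norm(intensity) / energy

rng = np.random.default_rng(1)
s = rng.standard_normal(48000)
# Single plane wave from azimuth 0 in the horizontal plane: v = u * p, u = (1, 0, 0).
psi_wave = diffuseness(s, s, 0 * s, 0 * s)
# Uncorrelated pressure and velocity components, mimicking a diffuse field.
psi_diff = diffuseness(s, rng.standard_normal(48000),
                       rng.standard_normal(48000), rng.standard_normal(48000))
```

The contrast between the two extremes (ψ ≈ 0 versus ψ ≈ 1) is what makes the quantity usable as a real-versus-ghost discriminator once it is evaluated around each candidate peak.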
The spatial sampling proved able to discriminate two distinct sources provided that high frequencies (above 1 kHz) were considered and that the angular separation between the sources was larger than the minimum theoretical resolution of the array (in this case about 60°). The use of continuous signals in the dry room increased the angular scatter when the Ambisonics decomposition was used, as a consequence of the superposition of simultaneous (although weak) reflections. Under the same conditions spatial sampling was, as expected, scarcely affected by the simultaneous reflections, identifying the direction of arrival with the same accuracy apart from a slight reduction in the directional gain. Simultaneous sources under the same conditions again showed the creation of a ghost source when the Ambisonics decomposition was used, while spatial sampling showed results comparable with those obtained with impulsive sources. Finally, in reverberant conditions the ability of the 32 channel array to detect the direction of arrival of the early reflections was demonstrated with both approaches. The best results were obtained



using unfiltered signals, so that the early reflection spikes could be easily picked. However, in the presence of closely spaced reflections, a preliminary 1 ms time windowing of the signal, applied before octave-band filtering, allowed fairly good results to be obtained as a function of frequency as well. Only the lowest bands showed critical performance in the presence of weak reflections, in line with the well-known limitations of microphone arrays in this frequency range. Once again, the Ambisonics decomposition proved to be affected by some inaccuracies in the presence of simultaneous arrivals, while spatial sampling, despite its larger angular scatter, was clearly better at detecting directions of arrival in such critical conditions. In conclusion, the work presented here showed that microphone arrays, although used according to a low-level approach, may offer new and interesting opportunities for research in architectural acoustics, through a better knowledge and understanding of sound field properties, improving at the same time their virtual reproduction by means of loudspeaker arrays. Several architectural acoustics applications would benefit from additional information on the direction of arrival of the reflections, complementing the picture given by the intensity and time of arrival of the sound. This could allow an improved knowledge of sound propagation, or a detailed study of the effects that reflecting surfaces may induce at given listening positions. For such applications the amount of spatial information and the level of accuracy found in this paper may be satisfactory. Nonetheless, the availability of more refined beamforming techniques, capable of taking advantage of higher order harmonics and hence creating even more directive "beams" to sample the space, will likely increase the spatial resolution of such microphones in the future.
In the meantime, provided that simultaneous reflections are not involved, the Ambisonics decomposition may still provide very detailed spatial information.

References

[1] Fahy F. Sound intensity. E&FN Spon; 1995.
[2] Fahy F. Engineering acoustics. Elsevier Academic Press; 2001.
[3] Wind J, Basten T, de Bree HE, Xu B. 3D sound source localization and sound mapping using a p-u sensor array. In: Noise-Con 2010, Baltimore; 2010.
[4] Gerzon MA. Periphony: with-height sound reproduction. J Audio Eng Soc 1973;21:2–10.
[5] Gerzon MA. The design of precisely coincident microphone arrays for stereo and surround acoustics. Presented at the 50th convention of the Audio Engineering Society. J Audio Eng Soc 1975;23:402–4.
[6] Farina A. Advancements in impulse response measurement by sine sweep. In: 122nd AES Convention, Vienna, Austria, 5–8 May 2007.
[7] Pulkki V. Spatial sound reproduction with directional audio coding. J Audio Eng Soc 2007;55(6):503–16.
[8] Daniel J, Nicol R, Moreau S. Further investigations of high-order Ambisonics and wavefield synthesis for holophonic sound imaging. Presented at the 114th convention of the Audio Engineering Society, convention paper 5788. J Audio Eng Soc (Abstracts) 2003;51(May):425.
[9] Meyer J. Beamforming for a circular microphone array mounted on spherically shaped objects. J Acoust Soc Am 2001;109(1):185–93.
[10] Meyer J, Elko G. A highly scalable spherical microphone array based on an orthonormal decomposition of the soundfield. In: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2002). p. 1781–4.
[11] Gover JBN, Ryan JG, Stinson MR. Measurements of directional properties of reverberant sound fields in rooms using a spherical microphone array. J Acoust Soc Am 2004;116:2138–48.
[12] Rafaely B. Plane-wave decomposition of the sound field on a sphere by spherical convolution. J Acoust Soc Am 2004;116(4):2149–57.
[13] Park M, Rafaely B. Sound-field analysis by plane-wave decomposition using spherical microphone array. J Acoust Soc Am 2005;118:3094–103.
[14] Capon J. High resolution frequency-wavenumber spectrum analysis. Proc IEEE 1969;57:1408–18.
[15] Stoica P, Wang Z, Li J. Robust Capon beamforming. IEEE Signal Process Lett 2003.
[16] Rafaely B, Balmages I, Eger L. High-resolution plane-wave decomposition in an auditorium using a dual-radius scanning spherical microphone array. J Acoust Soc Am 2007;122:2661–8.
[17] Khaykin D, Rafaely B. Acoustic analysis by spherical microphone array processing of room impulse responses. J Acoust Soc Am 2012;132(1):261–70.
[18] Holland KR, Nelson PA. An experimental comparison of the focused beamformer and the inverse method for the characterisation of acoustic sources in ideal and non-ideal acoustic environments. J Sound Vib 2012;331(20):4425–37.
[19] Farina A, Ugolotti E. Subjective comparison between Stereo Dipole and 3D Ambisonics surround systems for automotive applications. In: 16th AES Conference, Rovaniemi, Finland, 10–12 April 1999.
[20] Schiffrer G, Stanzial D. Energetic properties of acoustic fields. J Acoust Soc Am 1995;96:3645–53.
[21] Del Galdo G, Taseska M, Thiergart O, Ahonen J, Pulkki V. The diffuse sound field in energetic analysis. J Acoust Soc Am 2012;131(3):2141–51.
[22] Merimaa J, Pulkki V. Spatial impulse response rendering I: analysis and synthesis. J Audio Eng Soc 2005;53:1115–27.
[23] Meyer J, Elko GW. Spherical microphone array for spatial sound recording. In: 115th AES Convention, New York, 10–12 October 2003.
[24] Farina A, Amendola A, Capra A, Varani C. Spatial analysis of room impulse responses captured with a 32-capsules microphone array. In: 130th AES Convention, London, 13–16 May 2011.
[25] Kirkeby O, Nelson PA. Digital filter design for inversion problems in sound reproduction. J Audio Eng Soc 1999;47(7/8):583–95.
[26] Kirkeby O, Rubak P, Farina A. Analysis of ill-conditioning of multi-channel deconvolution problems. In: 106th AES Convention, Munich, Germany, 8–11 May 1999.
[27] Vilkamo J. Spatial sound reproduction with frequency band processing of B-format audio signals. Master's thesis, Helsinki University of Technology; 2008 [date last viewed 06.12.12].