Sensitivity-based region selection in the steered response power algorithm




Signal Processing 153 (2018) 1–10

Contents lists available at ScienceDirect

Signal Processing journal homepage: www.elsevier.com/locate/sigpro

Daniele Salvati∗, Carlo Drioli, Gian Luca Foresti

Department of Mathematics, Computer Science and Physics, University of Udine, Italy

Article info

Article history:
Received 28 August 2017
Revised 18 May 2018
Accepted 2 July 2018
Available online 3 July 2018

Keywords: Acoustic source localization; Microphone array; SRP-PHAT; Sensitivity map; Region selection; Geometrically sampled grid

Abstract

The steered response power (SRP) algorithm is a well-studied method for acoustic source localization using a microphone array. Recently, different improvements based on the accumulation of all time difference of arrival (TDOA) information have been proposed in order to achieve spatial resolution scalability of the grid search map and reduce the computational cost. However, the TDOA information distribution is not uniform with respect to the search grid, as it depends on the geometry of the array, the sampling frequency, and the spatial resolution. In this paper, we propose a sensitivity-based region selection SRP (R-SRP) algorithm that exploits the nonuniform TDOA information accumulation on the search grid. First, high and low sensitivity regions of the search space are identified using an array sensitivity estimation procedure; then, through the formulation of a peak-to-peak ratio (PPR) measuring the peak energy distribution in the two regions, the source is classified to belong to a high or to a low sensitivity region, and this information is used to design an ad hoc weighting function of the acoustic power map on which the grid search is performed. Simulated and real experiments show that the proposed method improves the localization performance in comparison to the state-of-the-art.

© 2018 Elsevier B.V. All rights reserved.

∗ Corresponding author. E-mail addresses: [email protected] (D. Salvati), [email protected] (C. Drioli), [email protected] (G.L. Foresti). https://doi.org/10.1016/j.sigpro.2018.07.002 0165-1684/© 2018 Elsevier B.V. All rights reserved.

1. Introduction

Acoustic source localization using microphone arrays has received significant attention from the scientific community due to its importance in sound scene analysis, signal enhancement, and speaker recognition and tracking [1–10]. In general, the localization can be computed with indirect or direct methods. The former are based on the computation of a set of time differences of arrival (TDOAs), obtained by measurements across various combinations of microphones [11,12], and on the estimation of the source position using geometric reasoning [13–15]. Direct methods are based on steered response power (SRP) beamformers [16–18], on subspace algorithms [19–21], or on maximum-likelihood estimators [22–24]. They are very attractive for acoustic applications due to their robustness in noisy and reverberant conditions.

The conventional SRP algorithm is based on the delay-and-sum beamforming technique [25]. Broadband SRP is typically implemented with the phase transform (PHAT) pre-whitening [11,26], which provides a normalization of narrowband SRPs and increases the spatial resolution [27]. This allows a better identification of the direct path and early reflections in a reverberant environment. SRP-PHAT has the advantage that it can be computed by considering the generalized cross-correlation (GCC) [11] between each microphone pair, and by summing the GCC values at the TDOAs related to the search space [17]. This implementation is computationally more efficient than methods that require the computation of narrowband SRP maps and their fusion [27]. However, the search procedure can be very expensive. Thus, iterative volume-search-based procedures have been recently proposed [28–30], which aim at reducing the computational complexity of this step. These methods take into account the accumulation of TDOA information [29–31] to reduce the spatial grid resolution without loss of information, and use sequential volumetric refinement steps to increase the localization accuracy.

It has been demonstrated, using the geometrically sampled grid (GSG) algorithm [32], that the accumulation of all TDOA values from GCC functions is not uniform within the search space, and as a consequence the acoustic map is characterized by high and low sensitivity regions. The advantage of using all TDOA information is to obtain a robust localization in the high sensitivity region under adverse noisy and reverberant conditions. If the sound source is located in a low sensitivity region, however, its localization is more prone to be unstable and affected by errors. This is due to the fact that the acoustic map energy peak corresponding to the actual source position might be lower than the peaks corresponding to noise and reverberation in the high sensitivity region, emphasized by the prominent TDOA accumulation.

SRP-based methods that use all TDOA information were proposed in [29,31,32]. In [28], an SRP method that uses all TDOA information was also proposed, which however applies a power normalization in each volume with respect to the number of TDOA values. This approach mitigates the problem due to the nonuniform TDOA accumulation, but also reduces the robustness in the high sensitivity region. In [30], an SRP method based on the use of two grids (a coarser one and a finer one) was proposed. This method uses a uniform TDOA accumulation in each volume, mitigating the problem of nonuniform distribution but discarding part of the available information, thus reducing the TDOA accumulation that can be positively used in the high sensitivity region. In fact, the final resolution is given by the finer grid, implemented with a conventional SRP approach: for each microphone pair and for each point on the grid, a unique integer TDOA value is selected to be the acoustic delay information linked to that point. This uniform regular grid procedure does not guarantee that all TDOA samples are associated to points on the grid and does not exploit the accumulation of TDOA values that can be positively used in the high sensitivity region also with a finer grid. Note that it was demonstrated in [32] that using all TDOA samples can improve the localization performance in the high sensitivity region with a coarser grid and with a finer grid up to a resolution of 0.01 m.

In this paper, we consider the localization of a single source in noisy and reverberant conditions. This scenario can be of interest in different practical applications such as videoconferencing systems or human-computer interaction systems. We propose a sensitivity-based region selection SRP algorithm, named R-SRP, which has the following characteristics:

1. it uses all the TDOA information provided by the GCC functions;
2. it exploits the localization robustness in the high sensitivity region;
3. with respect to other methods, it allows coarser search grids to be used more effectively, thus reducing the computational cost.

The algorithm is organized in two steps. First, it establishes whether the source is positioned in a high or low sensitivity region, through the formulation of a peak-to-peak ratio (PPR) measuring the peak energy distribution in the high and low sensitivity regions of the array, determined through the GSG algorithm. Then, it proceeds with the search of the acoustic source in the selected region using, when appropriate, the sensitivity map to weight the acoustic power map and reduce the impact of noise. It will be shown that this array sensitivity-informed method effectively reduces the localization errors due to the nonuniform distribution of the TDOA accumulation in the acoustic power map.

2. Steered response power

Let us consider a reverberant room G, M microphones positioned at coordinates $r_m = [x_m, y_m, z_m]^T$ ($m = 1, 2, \ldots, M$), where $(\cdot)^T$ denotes the transpose operator, and a single source $r_s(k) = [x_s(k), y_s(k), z_s(k)]^T$ active at frame index k. The SRP-PHAT based on all the TDOA information can then be expressed in terms of GCC functions as [17,28,29,31,32]

$$\phi(r, k) = \sum_{m_1=1}^{M-1} \sum_{m_2=m_1+1}^{M} \sum_{\tau=\tau^{\min}_{m_1 m_2}(r)}^{\tau^{\max}_{m_1 m_2}(r)} R_{m_1 m_2}(\tau, k), \tag{1}$$

where $r = [x, y, z]^T \in G$ is a generic grid position with spatial resolution $\Delta$, $\tau^{\min}_{m_1 m_2}(r)$ and $\tau^{\max}_{m_1 m_2}(r)$ denote the bounds of the accumulated TDOAs between microphones $m_1$ and $m_2$ for the position r, and the GCC-PHAT [11] function is

$$R_{m_1 m_2}(\tau, k) = \frac{1}{2\pi} \int_{-\pi}^{\pi} \frac{X_{m_1}(w, k)\, X^{*}_{m_2}(w, k)}{|X_{m_1}(w, k)\, X^{*}_{m_2}(w, k)|}\, e^{jw\tau}\, dw, \tag{2}$$

where $\tau$ is the time lag, w is the angular frequency, $X_m(w, k)$ is the Fourier transform of the signal observed at microphone m, $(\cdot)^{*}$ denotes the complex conjugate, j denotes the imaginary unit, and $|\cdot|$ denotes the absolute value. The GCC-PHAT is computed in the frequency domain using the discrete Fourier transform (DFT), and hence the SRP is computed on a block-by-block basis. If $\tau^{\min}_{m_1 m_2}(r) = \tau^{\max}_{m_1 m_2}(r)$, Eq. (1) represents the conventional SRP-PHAT algorithm [17]. The accumulation limits can be determined with different strategies, which can rely on the gradient of the inter-microphone time delay function corresponding to each microphone pair in the M-SRP [31], on the gradient of the inter-microphone time delay function exploiting the mean of the accumulated GCC-PHAT values for each volume in the I-SRP [28], on the surrounding cube taking into account the vertices of the volume in the H-SRP [29] or selecting only some points related to a finer grid in the RV-SRP [30], or on discrete representations of the hyperboloids related to all possible TDOA values in the GSG-based method (G-SRP) [32]. Once the array steered response power function $\phi(r, k)$ is available, the source position can be estimated by searching its maximum in the search region

$$\hat{r}_s(k) = \underset{r}{\operatorname{argmax}}\, [\phi(r, k)]. \tag{3}$$

3. Geometrically sampled grid

The proposed R-SRP algorithm extends the G-SRP [32] algorithm by including a region selection procedure. The G-SRP is based on the GSG method, in which the search space is obtained by discretizing, with a given spatial resolution, the hyperboloids representing the surfaces on which the TDOAs are constant, and by finally computing a grid related to the intersections between these discrete curves. It thus allows the accumulation of the whole TDOA information provided by the GCC functions into the search space, the design of an acoustically-coherent space grid, and the design of a sensitivity map for the array in use. The use of all the TDOA information available from the GCC-PHAT functions solves the problem of arbitrarily selecting the spatial grid resolution without loss of information. The acoustically-coherent space grid guarantees that every point of the grid is consistent with the condition of being the locus where at least three half-hyperboloids intersect. Note that the coherent grid may discard the points of the uniform regular grid (used in the conventional approach) which are not covered by sufficient acoustic information, especially when a finer grid is used. The sensitivity map refers to a quantified measure of the change of the response power with respect to the change of the spatial position, predicting where the search space will be characterized by higher and lower localization accuracy. Note that another approach to identify the spatial localization accuracy was proposed in [33], in which a discriminability measure is defined to distinguish a given point in space from its neighbors. This method does not consider the spatial resolution, and it does not give useful information when a larger resolution is used [32].

Let us now consider the discretization of the search space G with a spatial resolution $\Delta$. A discrete hyperboloid related to a microphone pair $(m_1, m_2)$ and a TDOA $\tau_{m_1 m_2}$ can be represented as a finite set $\Psi^{\tau}_{m_1 m_2}$ of points in $\mathbb{R}^3$, describing the hyperboloid when the x, y, and z axes are discretized with spatial resolution $\Delta$ (for a detailed discussion on the hyperboloid discretization procedure, see [32]).

In the implementation of the G-SRP, the discrete hyperboloids and the TDOA information are stored in four look-up tables. The tables are computed off-line, and then used on-line to estimate the acoustic energy by accumulating the GCC-PHAT function information due to all the sensor pairs involved. To each discrete hyperboloid point we assign an index q, so that we have a table $\gamma_r(q)$ that stores the position to which each hyperboloid point is related, a table $\gamma_p(q)$ that stores the microphone pair index, and a table $\gamma_\tau(q)$ that stores the TDOA. The last look-up table, $\delta(r)$, is the GSG sensitivity map, which contains the number of all the discrete surfaces intersecting at position r. The sensitivity map provides information on the distribution of TDOAs in the search space, and thus it defines a measure of the localization accuracy of the array and a means to identify those areas in which the array is more accurate.

If we call $T_{m_1 m_2} = \left\lfloor \frac{\|r_{m_1} - r_{m_2}\|\, f_s}{c} \right\rfloor$ the maximum TDOA in samples for the sensor pair $(m_1, m_2)$, where $\lfloor \cdot \rfloor$ denotes the floor function that maps a real number to the largest previous integer, $f_s$ is the sampling frequency, c is the speed of sound, and $\|\cdot\|$ denotes the Euclidean norm, we have $(2T_{m_1 m_2} + 1)M(M-1)/2$ discrete hyperboloids. The procedure to build the GSG and the sensitivity map $\delta(r)$ is given by the following steps:

1. Initialize $\delta(r) = 0$ for all $r \in G$ and set the index $q = 0$.
2. For each sensor pair $(m_1, m_2)$ and for all TDOA values $\tau_{m_1 m_2}$ in the range $[-T_{m_1 m_2}, T_{m_1 m_2}]$, calculate the discrete hyperboloid $\Psi^{\tau}_{m_1 m_2}$, and for each grid position $r \in \Psi^{\tau}_{m_1 m_2}$, fill the look-up tables $\gamma_r(q)$, $\gamma_p(q)$, and $\gamma_\tau(q)$, increment by one the value of the look-up table $\delta(r)$, and increment q by one.
3. After the geometric discrete analysis of the hyperboloids has terminated, apply the constraint $\delta(r) < \mu \Rightarrow \delta(r) = 0$, $\forall r \in G$, where $\mu = 3$ and $\mu = 2$ in case of 3D and 2D localization, respectively. The constraint has the goal of discarding those space grid points that are useless for the localization, i.e., those space grid points that do not guarantee the condition of being the locus where at least three half-hyperboloids intersect in 3D or two half-hyperbolas intersect in 2D. Finally, update the look-up tables $\gamma_r(q)$, $\gamma_p(q)$, and $\gamma_\tau(q)$, and define the acoustically-coherent grid as $G_r = \{r : \delta(r) \neq 0\}$.
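The three steps above can be sketched in a few lines of NumPy for the 2D case. This is only an illustrative stand-in, not the discretization procedure of [32]: it assumes that a grid cell is crossed by the discrete hyperbola of an integer TDOA whenever that integer lies between the smallest and largest real-valued TDOA (in samples) at the four cell corners. The function name `build_gsg_2d`, its parameters, and the default values of `fs` and `c` are hypothetical.

```python
import itertools
import numpy as np

def build_gsg_2d(mics, xs, ys, fs=44100.0, c=343.0, mu=2):
    """Toy 2D sensitivity map and look-up table.

    ASSUMPTION: a cell is hit by the discrete hyperbola of integer TDOA tau
    when tau lies between the min and max real-valued TDOA (in samples) at the
    cell corners. This replaces the discretization of [32], not shown here.
    """
    mics = np.asarray(mics, dtype=float)
    half = (xs[1] - xs[0]) / 2.0              # half of the spatial resolution
    gx, gy = np.meshgrid(xs, ys, indexing="ij")
    delta = np.zeros(gx.shape, dtype=int)     # sensitivity map delta(r)
    entries = []                              # rows: (grid index, mic pair, TDOA)
    corners = [(-half, -half), (-half, half), (half, -half), (half, half)]
    for m1, m2 in itertools.combinations(range(len(mics)), 2):
        tdoa = []
        for dx, dy in corners:
            d1 = np.hypot(gx + dx - mics[m1, 0], gy + dy - mics[m1, 1])
            d2 = np.hypot(gx + dx - mics[m2, 0], gy + dy - mics[m2, 1])
            tdoa.append((d1 - d2) * fs / c)   # TDOA in samples at this corner
        lo = np.ceil(np.min(tdoa, axis=0)).astype(int)
        hi = np.floor(np.max(tdoa, axis=0)).astype(int)
        for i, j in np.ndindex(gx.shape):
            for tau in range(lo[i, j], hi[i, j] + 1):
                entries.append(((i, j), (m1, m2), tau))
                delta[i, j] += 1
    # Step 3: discard grid points crossed by fewer than mu discrete curves.
    keep = delta >= mu
    delta[~keep] = 0
    table = [e for e in entries if keep[e[0]]]
    return delta, table
```

With three microphones and a coarse grid, the resulting map is visibly nonuniform: cells between the microphones collect many TDOA values, while distant cells collect few or none.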

The GSG algorithm is summarized in Algorithm 1.

Algorithm 1 GSG: grid and look-up tables computation.

M: number of microphones
$\Delta$: spatial resolution
Initialization: $\delta(r) = 0$, $\forall r \in G$, $q = 0$
for $m_1 = 1$ to $M - 1$ do
  for $m_2 = m_1 + 1$ to M do
    for $\tau_{m_1 m_2} = -T_{m_1 m_2}$ to $T_{m_1 m_2}$ do
      Calculate the discrete hyperboloid $\Psi^{\tau}_{m_1 m_2}$
      for all $r \in \Psi^{\tau}_{m_1 m_2}$ do
        $\gamma_r(q) = r$, $\gamma_p(q) = [m_1, m_2]^T$, $\gamma_\tau(q) = \tau_{m_1 m_2}$
        $\delta(r) = \delta(r) + 1$
        $q = q + 1$
      end for
    end for
  end for
end for
Apply the constraint $\delta(r) < \mu \Rightarrow \delta(r) = 0$, $\forall r \in G$
Update $\gamma_r(q)$, $\gamma_p(q)$, and $\gamma_\tau(q)$
Define the acoustically-coherent grid as $G_r = \{r : \delta(r) \neq 0\}$

We can then write the G-SRP as

$$\phi(r, k) = \sum_{m_1=1}^{M-1} \sum_{m_2=m_1+1}^{M} \sum_{z \in Z_{r, m_1 m_2}} R_{m_1 m_2}(\gamma_\tau(z), k), \tag{4}$$

where

$$Z_{r, m_1 m_2} = \{q : [\gamma_r(q) = r] \wedge [\gamma_p(q) = [m_1, m_2]^T]\} \tag{5}$$

is the set of look-up table indices corresponding to the TDOAs of the sensor pair $(m_1, m_2)$ for the position $r \in G_r$ (i.e., the set of all hyperboloid points intersecting in r). For each r, $Z_{r, m_1 m_2}$ may contain three or more elements (or two or more, in the 2D case) if the acoustic information is sufficient to provide localization cues, or it can be empty when the position r is not covered by sufficient acoustic information, i.e. it does not correspond to an intersection point of a bare minimum of three hyperboloids (or two hyperbolas, in 2D). Note that the conventional approach (Eq. (1), $\tau^{\min}_{m_1 m_2}(r) = \tau^{\max}_{m_1 m_2}(r)$) leads to exactly $M(M-1)/2$ TDOA values associated with each point on the grid, whereas the GSG procedure and the G-SRP approach (Eq. (4)) might associate fewer than $M(M-1)/2$, exactly $M(M-1)/2$, or more than $M(M-1)/2$ TDOAs to a point on the grid. A larger amount of TDOA information is the principal reason for the increased localization performance in the high sensitivity region.

In [29,31], we have $M(M-1)/2$ or more than $M(M-1)/2$ TDOA values associated with each point on the grid. Both methods hence have a nonuniform accumulation in the space. In the I-SRP method [28], the same volume accumulation criterion of [31] is used; however, the accumulation is normalized for each microphone pair. In practice, we can consider that there are $M(M-1)/2$ TDOA values in the I-SRP as in the conventional SRP, allowing however the use of a coarser grid without loss of information. In the RV-SRP method [30], the same volume accumulation criterion of [29] is used; however, the TDOA values are selected on the basis of some points of a finer grid that is calculated with the conventional approach, i.e. by linking each point of the search grid with a TDOA for each microphone pair. This allows a complexity reduction by discarding some information and a uniform accumulation of TDOA values in the space, reducing however the useful TDOA information that can be used in the high sensitivity region.

4. Sensitivity-based region selection

We consider a scenario in which a single acoustic source is located in a noisy and reverberant environment. We make the assumption that the uncorrelated spatially white noise and the signal reflections due to reverberation can both be modeled in the GCC-PHAT functions as additive white Gaussian terms. Hence, we neglect in the model the early reflection components, which are in general sparse in the TDOA function, since the main problem herein is the accumulation of TDOA values in the high sensitivity region mainly due to the non-sparse noise components. The approximation in the model is reasonable also taking into account the fact that the prominent peaks due to early reflections in the GCC-PHAT are theoretically accumulated outside the room, i.e., in the phantom source positions, and that such components become hardly distinguishable from the late reverberation components as the reverberation time increases. For simplicity, we drop the frame index k from now on. We can model the GCC-PHAT function as

$$R = R_s + R_r + \sigma_v^2, \tag{6}$$

where $R_s$ is the generalized cross-correlation of the source, which can be modeled as a sinc function [4] with a theoretical value of 1 at the actual TDOA for anechoic and noiseless conditions, $R_r$ is the reverberation component that is a linear combination of sinc functions, and $\sigma_v^2$ is the noise component due to the uncorrelated spatially white noise at the microphones. If we model the generalized cross-correlation of early reflections as $R_e$ and the late reverberation as a noise component $\sigma_l^2$ (note that late reverberation is characterized by low intensity and high density of reflections, and no longer depends on the source position [34]), we have that $R_r = R_e + \sigma_l^2$. By neglecting the term $R_e$, we obtain an approximate model defined as

$$R \approx R_s + \sigma_l^2 + \sigma_v^2 = R_s + \sigma^2. \tag{7}$$
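The ideal behavior of the source term, a peak of value 1 at the actual TDOA in anechoic and noiseless conditions, can be checked numerically. The following is a minimal NumPy sketch of a discrete GCC-PHAT, given only as an illustration: the function name `gcc_phat`, the `nfft` parameter, and the small regularization constant are assumptions and are not taken from the paper.

```python
import numpy as np

def gcc_phat(x1, x2, max_lag, nfft=None):
    """Discrete GCC-PHAT of two frames for lags -max_lag..max_lag (in samples).

    A positive lag means that x1 is delayed with respect to x2.
    """
    if nfft is None:
        nfft = 2 * len(x1)                 # zero-padding limits circular wrap-around
    X1 = np.fft.rfft(x1, nfft)
    X2 = np.fft.rfft(x2, nfft)
    cross = X1 * np.conj(X2)               # cross-power spectrum
    cross /= np.abs(cross) + 1e-12         # PHAT weighting; constant avoids 0/0
    r = np.fft.irfft(cross, nfft)
    # Reorder so that output index 0 corresponds to lag -max_lag.
    return np.concatenate((r[-max_lag:], r[:max_lag + 1]))
```

For a frame and its circularly delayed copy, processed with `nfft` equal to the frame length, the output is a unit peak at the imposed delay. With noise and reverberation added, further contributions appear at other lags, which is what the terms grouped under the noise component in (7) account for.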


Fig. 1. Example of GCC-PHAT functions and G-SRP maps in different reverberant conditions. The microphone array setup and the sensitivity map are shown in Fig. 6. The GCC-PHAT is related to microphones $r_1$ and $r_2$, when the active acoustic source is positioned in the low sensitivity region. The spatial resolution is 0.05 m. The room conditions are: (a) and (e) anechoic; (b) and (f) RT60 = 0.1 s; (c) and (g) RT60 = 0.6 s; (d) and (h) RT60 = 1.0 s.

We thus assume that the noise plus reverberation component of each GCC function R in (2) has normal distribution $\mathcal{N}(0, \sigma^2)$, and we can write the acoustic map as

$$\phi(r) = \phi_s(r) + \phi_n(r) = \phi_s(r) + \sigma^2 \delta(r). \tag{8}$$

We can see from (4) that the accumulation of TDOA values in each grid position is given by the number of sample values from all sensor pairs, i.e. the information contained in the sensitivity map δ (r), resulting in

$$\phi_s(r) = \sum_{m_1=1}^{M-1} \sum_{m_2=m_1+1}^{M} \sum_{z \in Z_{r, m_1 m_2}} R_s^{m_1 m_2}, \tag{9}$$

where $R_s^{m_1 m_2}$ is the cross-correlation of the source for the microphone pair $(m_1, m_2)$, and

$$\phi_n(r) = \sum_{m_1=1}^{M-1} \sum_{m_2=m_1+1}^{M} \sum_{z \in Z_{r, m_1 m_2}} \sigma^2 = \sigma^2 \delta(r). \tag{10}$$
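As a purely illustrative reading of (8)–(10), consider hypothetical values that are not taken from the experiments: a noise level of 0.1, a source at a position $r_a$ with sensitivity 4 and source response 1, and a source-free position $r_b$ with sensitivity 20.

```latex
\sigma^2 = 0.1, \qquad \delta(r_a) = 4, \quad \phi_s(r_a) = 1, \qquad \delta(r_b) = 20, \quad \phi_s(r_b) = 0
\phi(r_a) = \phi_s(r_a) + \sigma^2 \delta(r_a) = 1 + 0.1 \cdot 4 = 1.4
\phi(r_b) = \phi_s(r_b) + \sigma^2 \delta(r_b) = 0 + 0.1 \cdot 20 = 2.0
```

Under these assumed numbers the map is larger at $r_b$ than at the source position, although $r_b$ contains only accumulated noise.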

We empirically verified that neglecting the term $R_e$ in (7) and (8) is reasonable by a set of numerical simulations in which the source, noise and reflection components were controlled on purpose. In Fig. 1, we show the behavior of the model in different reverberant (noiseless) situations:

1. when an acoustic source located in the low sensitivity region is active in anechoic conditions, (a) and (e), the GCC-PHAT function related to a single generic microphone pair shows a peak corresponding to the actual TDOA (a) and the source position is correctly estimated with the G-SRP (e);
2. when the environment has a reverberation time (RT60) of 0.1 s, we can note the peaks due to early reflections (b) and the correct estimation of the source position (f);
3. when the level of reverberation increases (RT60 = 0.6 s), both signal and noise (late reverberation) components are discernible in the GCC function (c), the early reflection components become indistinguishable, and the G-SRP fails to localize the source (g), since the noise component is accumulated in the high sensitivity region, providing a greater response power (the central area in (g)) than at the source position;
4. the source component becomes indistinguishable in case of a very high reverberation level (RT60 = 1.0 s) (d), and the source localization fails (h).

In conclusion, we can say that the accuracy of the model in (7) and (8) increases with the reverberation level: for moderate and high reverberation conditions, early reflections and late reflections become indistinguishable and assume a noise-like behavior. For low reverberation levels or anechoic conditions, the model accuracy decreases; however, the nonuniform TDOA information accumulation is in general less problematic in this case, since the acoustic map energy corresponding to noise and reverberation in the high sensitivity region, emphasized by the prominent TDOA accumulation, is low. According to [32], we can divide the search space sensed by the array into two regions with different sensitivity:

$$H = \{r \in G : \delta(r) \geq \eta\}, \qquad L = \{r \in G : \delta(r) < \eta\}, \tag{11}$$

where H and L denote the high and low sensitivity regions respectively, and $\eta$ is a threshold computed as

$$\eta = \bar{\delta}(r), \tag{12}$$

with $\bar{\delta}(r)$ denoting the mean value of the sensitivity map in the search space. Based on the available data, i.e. the power function $\phi(r)$ and the function $\delta(r)$, with $r \in G$, a rough region classification criterion will check if the maximum of $\phi(r)$ has been found in L or H, and assign the source to that region. Figs. 2 and 3 represent two qualitative examples of SRP functions for the source in H and L, respectively. This criterion would misclassify the region in those cases in which, even though the source is located in L (i.e., the maximum of $\phi_s(r)$ is in L), the maximum of $\phi(r)$ is found in H due to the additive noise component, amplified in H by the function $\delta(r)$ (see Fig. 3). The opposite situation, i.e. when the source is in H but the maximum of $\phi(r)$ is found in L, is very unlikely, since the function $\delta(r)$ is low-valued in this region and would hardly be responsible for a high-energy noise peak able to affect the global maximum.

Fig. 2. A schematic representation of the SRP profile along the x axis when the source is positioned in the high sensitivity region.

Fig. 3. A schematic representation of the SRP profile along the x axis when the source is positioned in the low sensitivity region and the maximum of the overall region is positioned in the H region.

We thus aim at improving the baseline criterion by finding a more effective, data-dependent threshold for the region selection. We define the following peak-to-peak ratio

$$\mathrm{PPR} = \frac{\max_{r \in H} [\phi(r)]}{\max_{r \in L} [\phi(r)]}, \tag{13}$$

which is a measure of the difference between the maximum energy peak in the high sensitivity region and the one in the low sensitivity region. The baseline criterion would classify the source as belonging to L if PPR < 1, and to H otherwise. Since this criterion can be assumed robust for PPR < 1 (and thus maximum of $\phi(r)$ in L), we will focus on the PPR ≥ 1 case in what follows. Let us call $\tilde{r}$ the position of the maximum of $\phi(r)$, and let us suppose now that the source is actually positioned in the high sensitivity region H. From what was said so far, we can assume that $\tilde{r}$ will fall in H and thus restrict the maximum search to the high sensitivity region, i.e. $\tilde{r} = \operatorname{argmax}_{r \in H} [\phi(r)]$. We can say in this case that

$$\max_{r \in L} [\phi_s(r) + \sigma^2 \delta(r)] \approx \max_{r \in L} [\sigma^2 \delta(r)], \tag{14}$$

i.e. the contribution of the source will be negligible in the computation of the SRP maximum in L. We can thus write the PPR as

$$\mathrm{PPR} = \frac{\max_{r \in H} [\phi_s(r) + \sigma^2 \delta(r)]}{\max_{r \in L} [\sigma^2 \delta(r)]} = \frac{\phi_s(\tilde{r}) + \sigma^2 \delta(\tilde{r})}{\sigma^2 \max_{r \in L} [\delta(r)]}. \tag{15}$$

Since $\max_{r \in L} [\delta(r)] = \eta$ (note that the maximum sensitivity map value for the region L corresponds to the value $\eta$ defined in (12), which determines the boundary between the low and the high sensitivity regions), Eq. (15) leads to the condition $\mathrm{PPR} \geq \delta(\tilde{r})/\eta$. In fact, the threshold in (15) can be found by assuming $\phi_s(\tilde{r}) = 0$, which leads to $\frac{\sigma^2 \delta(\tilde{r})}{\sigma^2 \max_{r \in L} [\delta(r)]} = \frac{\sigma^2 \delta(\tilde{r})}{\sigma^2 \eta} = \frac{\delta(\tilde{r})}{\eta}$. Note that $\sigma^2$ cancels out. When the source is active, i.e. $\phi_s(\tilde{r}) > 0$, we hence have $\mathrm{PPR} \geq \delta(\tilde{r})/\eta$.

We can now show that this threshold also correctly classifies the sensitivity region when the source is located in L but the maximum of $\phi(r)$ is found in H due to the effect of noise. The contribution of the source in H can be assimilated to the noise component $\sigma^2$. In this case, we can write

$$\max_{r \in H} [\phi_s(r) + \sigma^2 \delta(r)] \approx \max_{r \in H} [\sigma^2 \delta(r)] = \sigma^2 \delta(\tilde{r}), \tag{16}$$

and the peak-to-peak ratio becomes

$$\mathrm{PPR} = \frac{\sigma^2 \delta(\tilde{r})}{\max_{r \in L} [\phi_s(r) + \sigma^2 \delta(r)]} < \frac{\delta(\tilde{r})}{\eta}. \tag{17}$$

The inequality in (17) can be found by assuming $\phi_s(r) = 0$, resulting in $\mathrm{PPR} = \frac{\sigma^2 \delta(\tilde{r})}{\max_{r \in L} [\sigma^2 \delta(r)]} = \frac{\sigma^2 \delta(\tilde{r})}{\sigma^2 \max_{r \in L} [\delta(r)]} = \frac{\delta(\tilde{r})}{\eta}$. When the source is active in the L region, i.e. $\phi_s(r) > 0$, the denominator increases and we hence obtain the inequality in (17). We can thus adopt the following L − H classification criterion:

$$\hat{r}_s \in \begin{cases} L & \text{if } \mathrm{PPR} < 1, \\ L & \text{if } 1 \leq \mathrm{PPR} < \dfrac{\delta(\tilde{r})}{\eta}, \\ H & \text{if } \mathrm{PPR} \geq \dfrac{\delta(\tilde{r})}{\eta}. \end{cases} \tag{18}$$

We can now note that

$$1 \leq \frac{\delta(\tilde{r})}{\eta} \leq \frac{\max [\delta(r)]}{\eta}. \tag{19}$$

The threshold for the PPR region selection will be equal to 1 when $\delta(\tilde{r}) = \eta$, i.e. when the maximum of the power response is positioned on the boundary between the two regions. In this case, the amplification of the noise in the high sensitivity region has no influence. On the other hand, we have a larger noise amplification when $\delta(\tilde{r}) > \eta$, which has no influence on the classification if the source is in H, but might affect it if the source is in L. Therefore, a threshold value larger than 1 has the effect of compensating the amplification of noise due to the sensitivity of the array and of improving the decision on the region where the source should be searched.

When the PPR criterion selects the high sensitivity region as the searching region, the source position is estimated as

$$\hat{r}_s = \underset{r \in H}{\operatorname{argmax}}\, [\phi(r)]. \tag{20}$$

On the other hand, when the PPR criterion indicates to search in the low sensitivity region, the source is localized by searching the maximum of the steered response power normalized by the array sensitivity map:

$$\hat{r}_s = \underset{r \in L}{\operatorname{argmax}} \left[ \frac{\phi(r)}{\delta(r)} \right]. \tag{21}$$
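The region selection and the two search rules can be sketched as follows, given a power map and a sensitivity map already computed on the same grid. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `r_srp_select` is hypothetical, the threshold is taken as the mean of the sensitivity map over all grid points passed in, grid points with zero sensitivity are excluded from the low sensitivity search, and the map is assumed to have positive peaks in both regions.

```python
import numpy as np

def r_srp_select(phi, delta):
    """Classify the source region with the PPR rule (18), then search with (20) or (21).

    ASSUMPTIONS: eta is the mean of delta over all supplied grid points,
    points with delta == 0 are skipped in the low sensitivity search, and
    both regions are non-empty with positive peak values of phi.
    """
    phi = np.asarray(phi, dtype=float)
    delta = np.asarray(delta, dtype=float)
    eta = delta.mean()                           # threshold (12)
    high = delta >= eta                          # region H in (11)
    low = (delta < eta) & (delta > 0)            # region L, covered points only
    phi_h = np.where(high, phi, -np.inf)
    phi_l = np.where(low, phi, -np.inf)
    r_tilde = np.unravel_index(np.argmax(phi_h), phi.shape)
    ppr = phi_h.max() / phi_l.max()              # peak-to-peak ratio (13)
    # Since delta(r_tilde) / eta >= 1, one comparison covers both L cases of (18).
    if ppr >= delta[r_tilde] / eta:
        return "H", r_tilde                      # search rule (20)
    safe_delta = np.where(low, delta, 1.0)       # avoid division by zero outside L
    weighted = np.where(low, phi / safe_delta, -np.inf)
    return "L", np.unravel_index(np.argmax(weighted), phi.shape)  # rule (21)
```

For example, with `delta = [2, 2, 2, 2, 10, 10]` and `phi = [0.4, 1.4, 0.4, 0.4, 2.0, 2.0]`, the global maximum of `phi` lies in H, but the PPR (about 1.43) is below the threshold (about 2.14), so the sketch selects L and returns index 1.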

Eq. (21) provides a more robust sound localization in the region L, since it reduces the nonuniform accumulation and the ambiguity that may arise when the maximum value for the L region is positioned close to the boundary of the two regions. Fig. 4 illustrates the situation in which the L region maximum is positioned close to the boundary and is larger than the source maximum (continuous line). Eq. (21) provides a uniform TDOA accumulation (dotted line) that allows the correct estimation of the source position in this case. The proposed R-SRP increases the localization accuracy in the low sensitivity region while keeping a high accuracy in the high sensitivity region due to the accumulation of all TDOA information. Note that by using a uniform steered response power in the overall region, the localization performance in the high sensitivity region considerably degrades, since the mean operation attenuates the TDOA accumulation in the grid points corresponding to the highest number of hyperboloid intersections. An example of uniform steered response power in the overall region was proposed in [28] (I-SRP), in which the normalization reduces the problem due to nonuniform accumulation. However, it also discards part of the information in the high sensitivity region that can be positively used to improve the localization performance in that region [32].

Fig. 4. A schematic representation of the SRP profile along the x axis when the source is positioned in the low sensitivity region and the L region maximum is not positioned on the source position.

5. Region selection steered response power

The implementation of the R-SRP method can be divided in two steps. In the off-line step, the sampled space grid is computed with the GSG method (Algorithm 1), providing the look-up tables ($\gamma_r(q)$, $\gamma_p(q)$, $\gamma_\tau(q)$), which link all TDOA values of the microphone pairs with the grid positions in space, and the sensitivity map $\delta(r)$. From Eq. (11), the high and low sensitivity regions can be identified, providing two sets of discrete grid positions, H and L, one for each region. In the on-line step, the G-SRP is computed on a frame-by-frame basis to estimate the source position. For each analysis frame, the R-SRP is computed through the following steps:

1. The values from the estimated GCC-PHAT functions are accumulated in the grid map (4).
2. The maximum values of the SRP for the low and high sensitivity regions are identified, and the PPR is estimated through Eq. (13).
3. By using the classification criterion in (18), the region selection is computed to estimate the area in which the source is positioned.
4. The source position is finally estimated using (20) or (21), depending on whether it was estimated to lie in the high or in the low sensitivity region.

6. Computational complexity analysis

The computational cost for the SRP-based algorithm is the sum of two components, the calculation of the GCC-PHAT functions and the summation of TDOA values in the search space. The relationship between TDOAs and the positions in space is pre-calculated off-line using the look-up tables. Let N denote the frame size for the DFT and Q = M (M − 1 )/2 denote the number of microphone pairs; we can express the complexity of GCC-PHAT computation in terms of the approximated number of arithmetic operations. We obtain (1.5Q + Q )5Nlog2 N operations for the DFTs and inverse DFTs computation, and 20QN for the cross-power spectrums computation [35]. The summation of TDOA values is related to the computation of Eq. (1) for all grid positions. With the same resolution, the complexity of all SRP-based algorithms is approximately the same since the differences in the number of summations is negligible. The iterative step of I-SRP, H-SRP and RV-SRP adds in general few summations in the refinement step, while the accumulation normalization in the I-SRP can be computationally demanding when the number of grid points is large. Let denote the number of grid points. The proposed R-SRP adds few operations for the normalization in the low sensitivity region, and few operations for the PPR and threshold estimation. We will show in the experimental section that the computational load increment is negligible if compared to G-SRP. The approximate number of arithmetic operations of the R-SRP can be summarized as follows:

R-SRPcost ≈ (1.5Q + Q) 5N log2 N + 20QN + r + L + 2,    (22)

where  r is the number of additions in (1) for position r and L is the number of grid points in the low sensitivity region. The last 2 operations are due to the calculation of the PPR and of the threshold value. Note that if the source is positioned in the high sensitivity region, then L is null.

7. Experimental results

Experiments were conducted using a distributed sensor array for acoustic source localization on both simulated and real-world data. Distributed microphone arrays [36–38] and wireless acoustic sensor networks [39–41] have recently been investigated by the research community. In ad hoc microphone arrays, the sensors can be distributed randomly over the environment, covering a much larger area with the potential to increase the localization capabilities. We compare the performance of the proposed R-SRP algorithm with the following ones: SRP [17], M-SRP [31], I-SRP [28], H-SRP [29], RV-SRP [30], and G-SRP [32]. The same grid resolution was used for all SRP methods. Specifically, the spatial resolution was set to 0.25 m and 0.1 m for the 3D localization and to 0.05 m for the 2D localization. For I-SRP, H-SRP, and RV-SRP, the volumetric refinement steps were run with a final spatial resolution of 0.01 m. Performance is reported in terms of the root mean square error (RMSE) and of the accuracy rate (AR), for which the estimated source must lie inside the area surrounding the grid point given by the spatial resolution AR:

|\hat{x}_s(k) − x_s(k)| ≤ AR,  |\hat{y}_s(k) − y_s(k)| ≤ AR,  |\hat{z}_s(k) − z_s(k)| ≤ AR.    (23)

The AR value was set to 0.25 m.
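The two performance measures can be computed as in the following sketch. It assumes that the AR is the percentage of frames satisfying Eq. (23) on every coordinate, and that the RMSE is taken over the Euclidean position error; the latter definition is an assumption, since the text does not spell it out. The function names `accuracy_rate` and `rmse` are illustrative.

```python
import numpy as np

def accuracy_rate(est, true, tol=0.25):
    """Percentage of frames whose error is within tol on every coordinate."""
    est = np.asarray(est, dtype=float)
    true = np.asarray(true, dtype=float)
    # Eq. (23): all per-coordinate absolute errors must be within the tolerance.
    inside = np.all(np.abs(est - true) <= tol, axis=1)
    return 100.0 * inside.mean()

def rmse(est, true):
    """Root mean square of the Euclidean position error (assumed definition)."""
    est = np.asarray(est, dtype=float)
    true = np.asarray(true, dtype=float)
    sq_err = np.sum((est - true) ** 2, axis=1)
    return float(np.sqrt(sq_err.mean()))

# Two frames: the first estimate is within 0.25 m on all axes, the second is not.
est = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]
true = [[1.1, 1.0, 0.9], [2.0, 2.5, 2.0]]
print(accuracy_rate(est, true))  # 50.0
print(rmse(est, true))
```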


Fig. 5. The simulated room with the position of 10 microphones (a) and the high sensitivity points with a spatial resolution of 0.25 m (b).

Fig. 6. The simulated and real room with the position of 6 microphones (a), the sensitivity map (b), and the high and low sensitivity regions with a spatial resolution of 0.05 m (c). In the 2D simulations, twenty source positions were randomly selected in the two regions. In the real experiment, 4 sources (s1, s2, s3, s4) were considered.

7.1. Simulation

The localization performance has been evaluated with 20-trial Monte Carlo simulations, using a distributed sensor array of 10 microphones for 3D localization and of 6 microphones for 2D localization. The microphones of each distributed array were positioned using a single fixed, randomly chosen set of locations. The 3D localization performance is analyzed under varying noise and reverberation conditions with spatial resolutions of 0.25 m and 0.1 m. We also provide a 2D simulation with a finer grid of 0.05 m, with the same setup used for the real-data analysis. Practical examples of acoustic localization in 2D can be found in situations in which the activity generally occurs at ground level, e.g. in traffic monitoring systems or public area surveillance, or when it is reasonable to search for the acoustic source on a given plane, e.g. in a meeting room or a conference room in which the active speaker has to be localized for camera steering and audio enhancement through beamforming.

The image-source model was used to simulate reverberant audio data in room acoustics [42], implemented using the improved algorithm reported in [43]. Uniform absorption coefficients were considered for all room boundaries. A room of (6.4 × 3 × 3.6) m was used. The tests were conducted with different signal-to-noise ratios (SNRs), which were obtained by adding mutually independent white Gaussian noise samples to each channel.

In the first set of simulations, the 3D localization performance is analyzed. The room setup and the high sensitivity region with a spatial resolution of 0.25 m are shown in Fig. 5. In all experiments, a source emitting a 25-s male speech signal was randomly located in each trial. Twenty source positions were considered. The sampling frequency was 44.1 kHz and the analysis frame length was set to 8192 samples.
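The noise conditions described above can be reproduced with a sketch like the one below. It assumes that the target SNR is enforced separately on each channel by scaling the noise variance to that channel's signal power; the paper does not state how the scaling was done, and the function name `add_white_noise` is illustrative.

```python
import numpy as np

def add_white_noise(signals, snr_db, rng=None):
    """Add mutually independent white Gaussian noise to each channel.

    signals : array of shape (channels, samples).
    snr_db  : target signal-to-noise ratio in dB (assumed per channel).
    """
    rng = np.random.default_rng() if rng is None else rng
    signals = np.asarray(signals, dtype=float)
    # Per-channel signal power sets the noise power needed for the target SNR.
    signal_power = np.mean(signals ** 2, axis=1, keepdims=True)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    # Independent unit-variance Gaussian samples, scaled channel by channel.
    noise = rng.standard_normal(signals.shape) * np.sqrt(noise_power)
    return signals + noise

# Example: two channels of a test tone at 44.1 kHz, degraded to an SNR of 10 dB.
t = np.arange(44100) / 44100.0
clean = np.vstack([np.sin(2 * np.pi * 440 * t), 0.5 * np.sin(2 * np.pi * 440 * t)])
noisy = add_white_noise(clean, snr_db=10, rng=np.random.default_rng(0))
```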
Tables 1 and 2 report the AR and RMSE localization performance with spatial resolutions of 0.25 m and 0.1 m, respectively, for the whole region Gs and for the two regions Hs (high sensitivity) and Ls (low sensitivity). The RT60 was set to 0.3 s in the room simulator. As can be observed, the R-SRP algorithm delivers in general a better performance than the other SRP-based methods, since it has the lowest RMSE and the highest AR in the whole region Gs. When the SNR decreases, the R-SRP improves the localization in the Ls region and gives an AR and RMSE similar to those of G-SRP in the Hs region. The H-SRP iterative-volume search provides the best performance in the Hs region in low noise conditions (20 dB and 10 dB) due to the use of a finer grid (0.01 m) in the refinement step. However, its performance tends to degrade in the Ls region and when the noise level increases, due to the nonuniform TDOA information accumulation. In some cases, the iterative step was computed on an incorrect volume. The lower performance of the RV-SRP is due to the loss of TDOA information, since in the first step it uses only some points of the finer grid that is based on the conventional approach to reduce the computational cost. The lower performance of I-SRP is due to the average operation in the GCC-PHAT accumulation, which reduces the robustness in the high sensitivity region. In accordance with [31,32], the conventional SRP degrades the localization accuracy when a coarser grid is used, due to the loss of information of GCC functions that are not linked with any grid position.

The advantage of the proposed R-SRP becomes more prominent when reverberation increases. Table 3 shows the 3D localization results with spatial resolutions of 0.25 m and 0.1 m for an SNR of 20 dB under different reverberation times. R-SRP outperforms the other methods for both spatial resolutions in the overall region Gs.

The complexity of all algorithms is reported in Table 4. As we can observe, the approximate number of arithmetic operations calculated with Eq. (22) is very similar for all methods, with a lower computational cost for the proposed R-SRP and the conventional SRP. The average number of arithmetic operations for R-SRP and G-SRP is practically the same.
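As a worked check of the operation count, the GCC-PHAT part of Eq. (22), that is, the terms (1.5Q + Q) 5N log2 N + 20QN, can be evaluated for the simulated setups. The sketch below covers only these two terms; the grid-dependent terms are left out because they depend on the grid itself. The function name `gcc_phat_cost` is illustrative.

```python
import math

def gcc_phat_cost(num_mics, frame_size):
    """Approximate arithmetic operations per frame for the GCC-PHAT stage."""
    q = num_mics * (num_mics - 1) // 2  # number of microphone pairs Q
    # DFTs and inverse DFTs: (1.5Q + Q) * 5N * log2(N)
    dft_ops = (1.5 * q + q) * 5 * frame_size * math.log2(frame_size)
    # Cross-power spectra: 20 * Q * N
    cross_spectrum_ops = 20 * q * frame_size
    return dft_ops + cross_spectrum_ops

print(gcc_phat_cost(10, 8192))  # 67276800.0 (3D setup, 10 microphones)
print(gcc_phat_cost(6, 8192))   # 22425600.0 (2D setup, 6 microphones)
```

For the 3D setup this gives about 6.73 × 10^7 operations per frame, slightly below the 6.7438 × 10^7 listed for the conventional SRP at the 0.25 m resolution in Table 4. This is consistent with the GCC-PHAT stage dominating the cost and the grid summation accounting for the remainder.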


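Table 1. AR (%) and RMSE (m) 3D localization performance using simulated data with RT60 = 0.3 s and a spatial resolution of 0.25 m for R-SRP, G-SRP, SRP, and M-SRP, and a spatial resolution of 0.25 m with a refinement step of 0.01 m for I-SRP, H-SRP, and RV-SRP.

| SNR (dB) | Metric | Region | R-SRP | G-SRP | SRP | M-SRP | I-SRP | H-SRP | RV-SRP |
|---|---|---|---|---|---|---|---|---|---|
| 20 | AR | Ls | 85.00 | 76.25 | 51.87 | 78.12 | 78.12 | 77.50 | 46.87 |
| 20 | AR | Hs | 91.66 | 94.58 | 51.66 | 98.33 | 89.16 | 98.33 | 58.75 |
| 20 | AR | Gs | 90.00 | 87.250 | 51.75 | 90.00 | 84.75 | 90.00 | 54.00 |
| 20 | RMSE | Ls | 0.230 | 0.282 | 0.513 | 0.328 | 0.340 | 0.252 | 0.864 |
| 20 | RMSE | Hs | 0.155 | 0.149 | 0.933 | 0.147 | 0.463 | 0.091 | 1.453 |
| 20 | RMSE | Gs | 0.174 | 0.212 | 0.792 | 0.237 | 0.418 | 0.198 | 1.251 |
| 10 | AR | Ls | 82.50 | 71.87 | 42.50 | 68.75 | 72.50 | 72.50 | 39.37 |
| 10 | AR | Hs | 91.83 | 95.00 | 46.25 | 98.33 | 81.66 | 98.33 | 51.66 |
| 10 | AR | Gs | 88.50 | 85.75 | 44.75 | 86.50 | 78.00 | 88.00 | 46.75 |
| 10 | RMSE | Ls | 0.240 | 0.419 | 0.802 | 0.422 | 0.431 | 0.407 | 1.062 |
| 10 | RMSE | Hs | 0.165 | 0.157 | 0.959 | 0.150 | 0.473 | 0.088 | 1.473 |
| 10 | RMSE | Gs | 0.199 | 0.292 | 0.899 | 0.291 | 0.457 | 0.266 | 1.324 |
| 0 | AR | Ls | 65.00 | 49.37 | 23.75 | 55.00 | 48.75 | 58.12 | 18.75 |
| 0 | AR | Hs | 90.83 | 87.50 | 23.75 | 89.83 | 71.66 | 89.83 | 28.33 |
| 0 | AR | Gs | 79.50 | 72.25 | 23.75 | 76.50 | 62.50 | 77.50 | 24.50 |
| 0 | RMSE | Ls | 0.589 | 0.774 | 1.435 | 0.634 | 0.755 | 0.669 | 2.021 |
| 0 | RMSE | Hs | 0.262 | 0.299 | 1.650 | 0.300 | 0.811 | 0.289 | 1.945 |
| 0 | RMSE | Gs | 0.424 | 0.542 | 1.567 | 0.464 | 0.789 | 0.492 | 1.975 |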

Table 2. AR (%) and RMSE (m) 3D localization performance using simulated data with RT60 = 0.3 s and a spatial resolution of 0.1 m for R-SRP, G-SRP, SRP, and M-SRP, and a spatial resolution of 0.1 m with a refinement step of 0.01 m for I-SRP, H-SRP, and RV-SRP.

| SNR (dB) | Metric | Region | R-SRP | G-SRP | SRP | M-SRP | I-SRP | H-SRP | RV-SRP |
|---|---|---|---|---|---|---|---|---|---|
| 20 | AR | Ls | 100.00 | 99.37 | 94.37 | 99.37 | 99.37 | 98.12 | 71.87 |
| 20 | AR | Hs | 100.00 | 100.00 | 86.66 | 100.00 | 99.16 | 100.00 | 55.41 |
| 20 | AR | Gs | 100.00 | 99.75 | 89.750 | 99.75 | 99.25 | 99.25 | 62.00 |
| 20 | RMSE | Ls | 0.097 | 0.165 | 0.214 | 0.091 | 0.113 | 0.201 | 0.867 |
| 20 | RMSE | Hs | 0.055 | 0.055 | 0.461 | 0.079 | 0.246 | 0.029 | 1.344 |
| 20 | RMSE | Gs | 0.075 | 0.113 | 0.382 | 0.094 | 0.204 | 0.113 | 1.177 |
| 10 | AR | Ls | 96.25 | 91.87 | 85.62 | 92.50 | 95.62 | 90.62 | 45.62 |
| 10 | AR | Hs | 99.16 | 99.16 | 83.75 | 99.58 | 95.83 | 99.16 | 51.66 |
| 10 | AR | Gs | 98.00 | 96.25 | 84.50 | 96.75 | 95.75 | 95.75 | 49.25 |
| 10 | RMSE | Ls | 0.207 | 0.320 | 0.361 | 0.340 | 0.235 | 0.284 | 1.082 |
| 10 | RMSE | Hs | 0.069 | 0.070 | 0.516 | 0.075 | 0.427 | 0.064 | 1.361 |
| 10 | RMSE | Gs | 0.141 | 0.209 | 0.461 | 0.223 | 0.363 | 0.186 | 1.257 |
| 0 | AR | Ls | 88.12 | 65.62 | 52.50 | 67.50 | 68.75 | 66.25 | 19.37 |
| 0 | AR | Hs | 92.08 | 90.83 | 55.83 | 90.416 | 76.66 | 91.25 | 28.75 |
| 0 | AR | Gs | 90.50 | 80.75 | 54.50 | 81.25 | 73.50 | 81.25 | 25.00 |
| 0 | RMSE | Ls | 0.351 | 0.690 | 1.114 | 0.710 | 0.791 | 0.742 | 1.630 |
| 0 | RMSE | Hs | 0.307 | 0.372 | 1.168 | 0.378 | 0.917 | 0.327 | 2.017 |
| 0 | RMSE | Gs | 0.336 | 0.523 | 1.147 | 0.536 | 0.869 | 0.526 | 1.872 |

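Table 3. AR (%) and RMSE (m) 3D localization performance using simulated data in Gs with SNR = 20 dB. The refinement spatial resolution is 0.01 m for I-SRP, H-SRP, and RV-SRP.

| Spatial resolution (m) | RT60 (s) | Metric | R-SRP | G-SRP | SRP | M-SRP | I-SRP | H-SRP | RV-SRP |
|---|---|---|---|---|---|---|---|---|---|
| 0.25 | 0.4 | AR | 89.00 | 86.75 | 45.00 | 86.75 | 81.25 | 85.50 | 37.25 |
| 0.25 | 0.4 | RMSE | 0.193 | 0.284 | 0.915 | 0.299 | 0.459 | 0.289 | 1.586 |
| 0.25 | 0.6 | AR | 82.25 | 76.25 | 31.00 | 78.75 | 68.00 | 79.00 | 21.50 |
| 0.25 | 0.6 | RMSE | 0.374 | 0.476 | 1.310 | 0.523 | 0.753 | 0.504 | 2.086 |
| 0.25 | 0.8 | AR | 73.50 | 65.25 | 22.75 | 70.00 | 58.50 | 69.50 | 14.75 |
| 0.25 | 0.8 | RMSE | 0.612 | 0.767 | 1.6409 | 0.709 | 1.058 | 0.761 | 2.301 |
| 0.1 | 0.4 | AR | 99.25 | 97.75 | 86.75 | 97.00 | 95.25 | 97.25 | 46.00 |
| 0.1 | 0.4 | RMSE | 0.084 | 0.179 | 0.421 | 0.223 | 0.342 | 0.217 | 1.446 |
| 0.1 | 0.6 | AR | 94.75 | 89.25 | 73.75 | 92.00 | 87.00 | 91.75 | 29.00 |
| 0.1 | 0.6 | RMSE | 0.216 | 0.412 | 0.749 | 0.375 | 0.640 | 0.405 | 1.804 |
| 0.1 | 0.8 | AR | 91.75 | 83.75 | 60.75 | 85.00 | 75.25 | 85.00 | 20.00 |
| 0.1 | 0.8 | RMSE | 0.254 | 0.579 | 1.177 | 0.549 | 0.995 | 0.516 | 2.143 |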

Table 4. Approximate number of arithmetic operations per frame (× 10^7).

| Spatial resolution (m) | R-SRP | G-SRP | SRP | M-SRP | I-SRP | H-SRP | RV-SRP |
|---|---|---|---|---|---|---|---|
| 0.25 | 7.0602 | 7.0602 | 6.7438 | 7.1850 | 7.6784 | 7.3079 | 9.9497 |
| 0.1 | 9.0831 | 9.0831 | 7.0155 | 10.0902 | 13.4552 | 10.7885 | 10.4113 |


Table 5. AR (%) and RMSE (m) 2D localization performance and approximate number of operations per frame using simulated data in Gs with RT60 = 0.6 s, SNR = 20 dB, and a spatial resolution of 0.05 m for R-SRP, G-SRP, SRP, and M-SRP, and a spatial resolution of 0.05 m with a refinement step of 0.01 m for I-SRP, H-SRP, and RV-SRP.

| Metric | R-SRP | G-SRP | SRP | M-SRP | I-SRP | H-SRP | RV-SRP |
|---|---|---|---|---|---|---|---|
| AR | 93.06 | 87.23 | 81.85 | 89.58 | 88.63 | 89.54 | 37.19 |
| RMSE | 0.373 | 0.571 | 0.755 | 0.479 | 0.774 | 0.492 | 1.622 |
| Complexity (× 10^7) | 2.2961 | 2.2961 | 2.2538 | 2.3157 | 2.3890 | 2.3246 | 2.2988 |

Table 6. AR (%) and RMSE (m) with a spatial resolution of 0.05 m for the whole search space Gr using real data with RT60 = 0.6 s. The average SNR is 20 dB.

| Metric | R-SRP | G-SRP | SRP | M-SRP | I-SRP | H-SRP | RV-SRP |
|---|---|---|---|---|---|---|---|
| AR | 89.31 | 73.73 | 55.43 | 82.42 | 65.21 | 77.35 | 25.00 |
| RMSE | 0.484 | 0.774 | 1.065 | 0.596 | 1.142 | 0.587 | 2.261 |

Next, an evaluation of the 2D localization performance with a finer grid (spatial resolution of 0.05 m) is shown in Table 5, which also reports the complexity in terms of the approximate number of arithmetic operations per frame. The room is the same as in the 3D simulation. A distributed array of 6 microphones was positioned at the same height, at a distance of 0.88 m from the floor. The setup is the same as in the real-data experiment, with RT60 = 0.6 s and SNR = 20 dB. The room setup, the sensitivity map, and the high and low sensitivity regions with a spatial resolution of 0.05 m are shown in Fig. 6. Table 5 shows that the proposed R-SRP gives the best performance, with a significant RMSE reduction and a better AR in comparison to the other methods.

7.2. Real data

Real-world tests were carried out in a room of dimensions (6.4 × 3 × 3.6) m (the same used in simulations) that had an RT60 of 0.6 s. A grid resolution of 0.05 m was used for all SRP methods (for I-SRP, H-SRP, and RV-SRP the refinement resolution was set to 0.01 m). A distributed array of 6 microphones was positioned at a distance of 0.88 m from the floor. Four source positions were considered: s1 and s4 located in the low sensitivity region, and s2 and s3 located in the high sensitivity region. A speech signal (the same used in simulations) was reproduced with a loudspeaker at each position. The loudspeaker has a small oval driver with a size of 9.5 cm × 5 cm, a frequency response of 90–20000 Hz, and an RMS power of 1 W. The average SNR measured at the microphones is about 20 dB. The microphones used were six omnidirectional iSEMcon EMM-7101-CSTB, specifically designed for measurement in array configuration. Fig. 6 depicts the room setup, the sensitivity map, and the sensitivity regions calculated with the GSG algorithm. The result of 2D localization performance is reported in Table 6 for the whole search space Gr.
We can observe that the R-SRP algorithm outperforms the other SRP methods, providing an accuracy rate of about 89%, whereas the others reach an 82% accuracy rate at best.

8. Conclusions

By taking into account a nonuniform distribution of the whole TDOA information in the search space, we presented a sensitivity-based region selection method for the GCC-based SRP-PHAT using GSG accumulated TDOA functions and coarser grids to reduce the computational cost. By relying on a sensitivity measure of the sensor array and on a classification of the search space into high and low sensitivity regions, we have proposed a peak-to-peak ratio (PPR) measuring the peak energy distribution in the two regions to identify the region in which the source is positioned. The noise component of a GCC function was modeled as a Gaussian distribution, resulting in a space distribution related to the array sensitivity estimation procedure. The proposed R-SRP adds little complexity compared to the G-SRP method, and the arithmetic operations for the region selection are in general negligible, as we have seen in the experimental section. Simulation and real-world results show that the localization error can be reduced, especially when the source is positioned in the low sensitivity region, without significant additional computational cost.

Acknowledgments

The authors are grateful to the anonymous reviewers for their constructive comments, which greatly contributed to improving the manuscript.

References

[1] B. Laufer-Goldshtein, R. Talmon, S. Gannot, Semi-supervised sound source localization based on manifold regularization, IEEE/ACM Trans. Audio Speech Lang. Process. 24 (8) (2016) 1393–1407.
[2] D. Salvati, C. Drioli, G.L. Foresti, A weighted MVDR beamformer based on SVM learning for sound source localization, Pattern Recognit. Lett. 84 (2016) 15–21.
[3] L. Kumar, R.M. Hegde, Near-field acoustic source localization and beamforming in spherical harmonics domain, IEEE Trans. Signal Process. 64 (13) (2016) 3351–3361.
[4] J. Velasco, C.J. Martín-Arguedas, J. Macias-Guarasa, D. Pizarro, M. Mazo, Proposal and validation of an analytical generative model of SRP-PHAT power maps in reverberant scenarios, Signal Process. 119 (2016) 209–228.
[5] D. Yook, T. Lee, Y. Cho, Fast sound source localization using two-level search space clustering, IEEE Trans. Cybern. 46 (1) (2016) 20–26.
[6] D. Salvati, C. Drioli, G.L. Foresti, Sound source and microphone localization from acoustic impulse responses, IEEE Signal Process. Lett. 23 (10) (2016) 1459–1463.
[7] L. Petrica, An evaluation of low-power microphone array sound source localization for deforestation detection, Appl. Acoust. 113 (2016) 162–169.
[8] M. Cobos, F. Antonacci, A. Alexandridis, A. Mouchtaris, B. Lee, A survey of sound source localization methods in wireless acoustic sensor networks, Wirel. Commun. Mob. Comput. 2017 (2017) 1–24.
[9] D. Salvati, C. Drioli, G.L. Foresti, A low-complexity robust beamforming using diagonal unloading for acoustic source localization, IEEE/ACM Trans. Audio Speech Lang. Process. 26 (3) (2018) 609–622.
[10] D. Salvati, C. Drioli, G.L. Foresti, Exploiting CNNs for improving acoustic source localization in noisy and reverberant conditions, IEEE Trans. Emerg. Top. Comput. Intell. 2 (2) (2018) 103–116.
[11] C. Knapp, G. Carter, The generalized correlation method for estimation of time delay, IEEE Trans. Acoust. 24 (4) (1976) 320–327.
[12] J. Benesty, Adaptive eigenvalue decomposition algorithm for passive acoustic source localization, J. Acoust. Soc. Am. 107 (1) (2000) 384–391.
[13] J.O. Smith, J.S. Abel, Closed-form least-squares source location estimation from range-difference measurements, IEEE Trans. Acoust. 35 (12) (1987) 1661–1669.
[14] Y. Huang, J. Benesty, G.W. Elko, R.M. Mersereau, Real-time passive source localization: a practical linear-correction least-squares approach, IEEE Trans. Speech Audio Process. 9 (8) (2001) 943–956.
[15] P. Stoica, J. Li, Source localization from range-difference measurements, IEEE Signal Process. Mag. 23 (3) (2006) 63–66.
[16] M. Omologo, P. Svaizer, R. De Mori, Spoken Dialogue with Computers, Academic Press, 1998, Ch. Acoustic Transduction.
[17] J.H. DiBiase, H.F. Silverman, M.S. Brandstein, Microphone Arrays: Signal Processing Techniques and Applications, Springer, 2001, Ch. Robust localization in reverberant rooms.
[18] P. Pertilä, T. Korhonen, A. Visa, Measurement combination for acoustic source localization in a room environment, EURASIP J. Audio Speech Music Process. 2008 (2008) 1–14.
[19] R.O. Schmidt, Multiple emitter location and signal parameter estimation, IEEE Trans. Antennas Propag. 34 (3) (1986) 276–280.


[20] R. Roy, T. Kailath, ESPRIT - Estimation of signal parameters via rotational invariance techniques, IEEE Trans. Acoust. 37 (7) (1989) 984–995.
[21] B.D. Rao, K.V.S. Hari, Performance analysis of root-music, IEEE Trans. Acoust. 37 (12) (1989) 1939–1949.
[22] K. Harmanci, J. Tabrikian, J.L. Krolik, Relationships between adaptive minimum variance beamforming and optimal source localization, IEEE Trans. Signal Process. 48 (1) (2000) 1–12.
[23] C. Zhang, D. Florencio, D.E. Ba, Z. Zhang, Maximum likelihood sound source localization and beamforming for directional microphone arrays in distributed meetings, IEEE Trans. Multimed. 10 (3) (2008) 538–548.
[24] J. Traa, D. Wingate, N.D. Stein, P. Smaragdis, Robust source localization and enhancement with a probabilistic steered response power model, IEEE/ACM Trans. Audio Speech Lang. Process. 24 (3) (2016) 493–503.
[25] M.S. Bartlett, Smoothing periodograms from time-series with continuous spectra, Nature 161 (1948) 686–687.
[26] K.D. Donohue, J. Hannemann, H.G. Dietz, Performance of phase transform for detecting sound sources with microphone arrays in reverberant and noisy environments, Signal Process. 87 (7) (2007) 1677–1691.
[27] D. Salvati, C. Drioli, G.L. Foresti, Incoherent frequency fusion for broadband steered response power algorithms in noisy environments, IEEE Signal Process. Lett. 21 (5) (2014) 581–585.
[28] A. Marti, M. Cobos, J.J. Lopez, J. Escolano, A steered response power iterative method for high-accuracy acoustic source localization, J. Acoust. Soc. Am. 134 (4) (2013) 2627–2630.
[29] L.O. Nunes, W.A. Martins, M.V.S. Lima, L.W.P. Biscainho, M.V.M. Costa, F.M. Gonçalves, A. Said, B. Lee, A steered-response power algorithm employing hierarchical search for acoustic source localization using microphone arrays, IEEE Trans. Signal Process. 62 (19) (2014) 5171–5183.
[30] M.V.S. Lima, W.A. Martins, L.O. Nunes, L.W.P. Biscainho, T.N. Ferreira, M.V.M. Costa, B. Lee, A volumetric SRP with refinement step for sound source localization, IEEE Signal Process. Lett. 22 (8) (2015) 1098–1102.
[31] M. Cobos, A. Marti, J.J. Lopez, A modified SRP-PHAT functional for robust real-time sound source localization with scalable spatial sampling, IEEE Signal Process. Lett. 18 (1) (2011) 71–74.
[32] D. Salvati, C. Drioli, G.L. Foresti, Exploiting a geometrically sampled grid in the steered response power algorithm for localization improvement, J. Acoust. Soc. Am. 141 (1) (2017) 586–601.
[33] L.O. Nunes, W.A. Martins, M.V.S. Lima, L.W.P. Biscainho, B. Lee, A. Said, R.W. Schafer, Discriminability measure for microphone array source localization, in: Proceedings of the International Workshop on Acoustic Signal Enhancement, 2012, pp. 1–4.
[34] H. Kuttruff, Room Acoustics, Spon Press, 2009.
[35] H.F. Silverman, Y. Yu, J.M. Sachar, W.R.I. Patterson, Performance of real-time source-location estimators for a large-aperture microphone array, IEEE Trans. Speech Audio Process. 13 (4) (2005) 593–606.
[36] O. Thiergart, G.D. Galdo, M. Taseska, E.A.P. Habets, Geometry-based spatial sound acquisition using distributed microphone arrays, IEEE Trans. Audio Speech Lang. Process. 21 (12) (2013) 2583–2594.
[37] M. Souden, K. Kinoshita, M. Delcroix, T. Nakatani, Location feature integration for clustering-based speech separation in distributed microphone arrays, IEEE/ACM Trans. Audio Speech Lang. Process. 22 (2) (2014) 354–367.
[38] M. Taseska, E.A.P. Habets, Informed spatial filtering for sound extraction using distributed microphone arrays, IEEE/ACM Trans. Audio Speech Lang. Process. 22 (7) (2014) 1195–1207.
[39] A. Bertrand, M. Moonen, Distributed adaptive node-specific signal estimation in fully connected sensor networks - part i: sequential node updating, IEEE Trans. Signal Process. 58 (10) (2010) 5277–5291.
[40] S. Markovich-Golan, S. Gannot, I. Cohen, Performance of the SDW-MWF with randomly located microphones in a reverberant enclosure, IEEE Trans. Audio Speech Lang. Process. 21 (7) (2013) 1513–1523.
[41] A. Griffin, A. Alexandridis, D. Pavlidi, Y. Mastorakis, A. Mouchtaris, Localizing multiple audio sources in a wireless acoustic sensor network, Signal Process. 107 (2015) 54–67.
[42] J.B. Allen, D.A. Berkley, Image method for efficiently simulating small-room acoustics, J. Acoust. Soc. Am. 65 (4) (1979) 943–950.
[43] E. Lehmann, A. Johansson, Prediction of energy decay in room impulse responses simulated with an image-source model, J. Acoust. Soc. Am. 124 (1) (2008) 269–277.