Applied Acoustics 157 (2020) 106995
Contents lists available at ScienceDirect
Applied Acoustics journal homepage: www.elsevier.com/locate/apacoust
Prediction of head-related transfer function based on tensor completion Jing Wang a,b,⇑, Min Liu a, Xinyao Wang a, Teijun Liu b, Xiang Xie a a b
School of Information and Electronics, Beijing Institute of Technology, 100081 Beijing, China State key Laboratory of Robotics, Shenyang Institute of Automation, Chinese Academy of Sciences, 110016 Shenyang, China
a r t i c l e
i n f o
Article history: Received 21 November 2018 Received in revised form 11 April 2019 Accepted 1 August 2019
Keywords: Head-related transfer function Tensor completion Prediction
a b s t r a c t A new head-related transfer function (HRTF) prediction method using tensor completion is proposed to estimate the HRTF in unmeasured directions based on sparse measurements and achieve high spatial resolution HRTF for individual subject. A three-order incomplete tensor prediction model is constructed to describe HRTF with a multi-dimensional structure depending on the subspaces of azimuth, elevation and frequency. Then, low rank tensor completion method is applied to complete this tensor by exploiting the hidden spatial structures of HRTF data, aiming to obtain the HRTF in unmeasured directions based on a small set of sparse measurements, in which gradient descent algorithm and line search strategy are used. Experiments have been conducted based on CIPIC HRTF database to evaluate the prediction performance. The objective results suggest that the proposed method has a better prediction performance than other tensor and linear interpolation methods. The subjective tests show that the proposed predicted HRTF is more approximate to the original one in sound localization similarity. Ó 2019 Elsevier Ltd. All rights reserved.
1. Background Head-related transfer function (HRTF) is an acoustic transfer function which capture transformations of a sound wave propagating from the source to human ears. Some of the transformations include diffraction and reflections of head, pinnae, shoulders and torso [1]. Its corresponding time-domain representations are Head-Related Impulse Response (HRIR), which includes the information related to the sound localization such as the interaural time differences (ITD) and interaural level differences (ILD). The convolution of the HRIR with a sound source signal will virtualize sound image with different spatial effects, which enables the listener to perceive the sound from any random direction in virtual space through headphones. Therefore, HRTF is usually applied in binaural hearing, sound localization, and the synthesis for virtual auditory reproduction, etc. [2,3]. HRTF varies with the sound source directions (azimuths and elevations), distance and frequency, and it is also closely related to different human subjects due to different diffraction and reflection properties of different human heads, torsos and ear pinnas. Sound localization for HRTF from one subject to another can be affected critically due to these different external physiological ⇑ Corresponding author at: School of Information and Electronics, Beijing Institute of Technology, 100081 Beijing, China. E-mail addresses:
[email protected] (J. Wang),
[email protected] (T. Liu),
[email protected] (X. Xie). https://doi.org/10.1016/j.apacoust.2019.08.001 0003-682X/Ó 2019 Elsevier Ltd. All rights reserved.
structures. That is, different listeners may have different perceptions of spatial sound localization for the same non-individual HRTF. Experimental measurement is the most direct and accurate way to generate individual HRTFs. HRTFs are measured for persons or dummy heads at different positions in an anechoic room. When the sound source is played at a certain position from the rdistance of the head center of the subject via a loudspeaker in an anechoic room, the resulting signals are recorded using small microphones placed in the left and right ear canals of the respective subject. The procedure needs to be repeated many times to obtain HRTF data for other directions in space. This complex work is costly and time-consuming especially for measuring higher spatial resolution HRTF. Therefore, it is more important and useful to analyze the structure information of HRTF and employ some theoretical techniques to predict HRTF in unmeasured directions based on a small set of measurements. In order to better represent the characteristics of the HRTF, several theoretical calculation models have been proposed, such as spherical head model [4], the snowman model [5], boundary element model [6,7] and the filter-band model [8]. To extract the directional cues from HRTF effectively, Evans et al. expressed HRTF as a weighted sum of surface spherical harmonics called spherical harmonics functions [9], which can generate the HRTF at a random direction in the space, while existing dependency relationship between the number of azimuth and elevation angle for the measured HRTF. In an attempt to extend the harmonics functions, Zhang et al. proposed an functional HRTF expansion modeling to
2
J. Wang et al. / Applied Acoustics 157 (2020) 106995
represent and expand frequency-portion [10]. Furthermore, an HRTF representation model was constructed by Hu et al. based on local analysis functions [11]. Though the sound synthesized by all these models resembles the spatial sound in real space, intensive computation is required to compute their parameters. Since experimental measurement only generates HRTF in a discrete set of directions, HRTF in the unmeasured direction can be obtained by interpolation on a small set of measures to improve spatial resolution. That is, HRTF can be reconstructed by interpolation techniques based on a small set of measurements. Some different interpolation methods have been proposed, such as linear, bilinear and nonlinear solutions, etc. While the most straightforward and simple method to estimate HRTF in an unmeasured direction is linear and bilinear interpolation using nearby measurements [12,13]. To directly apply and store such a large HRTF dataset, Kistler and Wightman [14] proposed a HRTF matrix model based on principal component analysis (PCA), Wang et al. applied PCA in HRTF interpolation and compression [15,16], Xie et al. recovered highdirectional-resolution HRTFs from a small set of measurements by combining the SBFs and the individual weights [17]. PCA can successfully represent the high-dimensional HRTF as vector or matrix in a low-dimension space and reduce the dimensionality of HRTF data. However, it can not capture the multi-way structure and complex characteristic features in HRTF. To address these issues, Grindlay and Vasilescu [18] proposed a multilinear (tensor) framework to analyze and synthesize HRTF. This multilinear model depends on individual, sound source direction and frequency and explores the interaction among these multiple variables to achieve individualized HRTF by using regression and N-mode SVD. Rothbucher et al. applied multiway array analysis in HRTF compression [19] and customization [20], respectively. Huang et al. modeled individual HRTF tensor using high-order partial least squares [21]. These work makes it possible for multilinear model to be applied to model HRTF. Tensors can represent more naturally the structural properties of multidimensional data and to capture useful information in HRTF effectively. Moreover, tensor completion is the high-order generalization of matrix completion, which is widely applied in traffic data [22,23], image data [24] to recover lost data or estimate missing entries by exploring interactions among observations. However, little attention has been paid on the estimation or reconstruction of HRTF in unmeasured directions to obtain high-spatialresolution HRTF for individual subjects using data loss recovery method like low-rank tensor completion. Kulharni et al. analyzed the sensitivity of human ear to the sound phase by approaching the minimum phase of HRTF function and found that the human ear hardly can distinguish the difference between the HRTF reconstructed by the minimum phase and the original HRTF [25]. This implies that the HRTF can be approximated by a minimum phase function followed by a pure time delay relating mainly to the ITDs. Since the minimum phase function of HRTF is only related to the magnitude, some method to process HRTF data can be effective only in the magnitude components.
2. Introduction HRTF data with multiple variables contains natural multidimensional structure with robust correlations among different sound source directions and inter-frequency. Motivated by previous related work, we propose a three-order prediction tensor model based on HRTF sparse measurements of an individual to keep multiple natural structures in HRTF and use a completion method to predict HRTF magnitude functions in unmeasured direc-
tions in high-dimensional space aiming to reconstruct highspatial-resolution HRTF. As for phase information, the minimum phase information of HRTF and the ITD obtained by linear interpolation are used to reconstruct the phase information of HRTF in unmeasured direction. This is different from the work of [18–21] that mostly analyzed and learned the interaction between human anthropometric parameters and corresponding HRTF data to obtain individual HRTF of a new subject. In our proposed method, the HRTF data with the sparse measured entries and unmeasured ones are represented as a threeorder incomplete tensor model in subspaces of azimuth, elevation and frequency. Then estimating the unmeasured HRTF in other directions was treated as reconstructing or completing the incomplete tensor data by low-rank approximation based on tensor completion. The HRTF in all directions can be obtained with gradient descent algorithm and line search strategy by solving the convex optimization problem converted by some relaxing techniques. With the decrease of the ratio of known HRTF entries or the data of azimuth and elevation mode, that is, the HRTF measurements are more sparse, more HRTF data in other directions can be obtained with acceptable distortion. Moreover, except for other tensor methods, we compared linear interpolation method [13], one of the classical mainstream approaches, with our proposed method. In summary, the main contributions of this work are: The new prediction three-order tensor model for HRTF data in all directions along azimuth, elevation and frequency; Development of a method for HRTF prediction in unmeasured directions based on tensor completion; Demonstration of the performance superiority of the proposed prediction method over other tensor methods. The rest of this paper is organized as follows: Section 3 elaborates on the preliminary tensor algebra used in this paper and describes the prediction model and method of HRTF based on tensor in detail. Section 4 shows the experiment designing. Section 5 presents some results and discussion, and we conclude this paper in Section 6. 3. The Proposed prediction method for HRTF 3.1. Preliminary tensor algebra Tensor, a multidimensional array, represents an element of Nway multifactor space. The way of tensor is the number of dimensions or factor spaces, also known as order or mode [26]. In this paper, tensor is denoted by upper case letters e.g., X , matrix is denoted by bold upper case letters e.g., X, vector is denoted by bold lower case letters e.g., x, and scalar is denoted by lower case letters e.g., x. We use the lowercase letters with sub-scripts xi1 ;i2 ;...;iN for the i1 ; i2 ; . . . ; iN entry of an N-way tensor X 2 RI1 I2 ...IN . An N-way tensor X 2 RI1 I2 ...IN can be rearranged as a matrix in the subspace of mode-n by changing the position of elements, which is called mode-n unfolding, also known as matricization or flattening. This Q can be denoted by X ðnÞ 2 RIn J , where J ¼ m–n Im . For the inner product of two N-way tensors X 2 RI1 I2 ...IN ; Y 2 RI1 I2 ...IN , it can be computed as a sum of element-wise products over all the indices and expressed as:
hX ; Yi ¼
I1 X I2 X i1 ¼1 i2 ¼1
IN X
xi1 ;i2 ;...;iN yi1 ;i2 ;...;iN
ð1Þ
iN ¼1
For an N-way tensor X 2 RI1 I2 ...IN , its Frobenius norm is defined as the square root of the sum of the squares of all elements as:
J. Wang et al. / Applied Acoustics 157 (2020) 106995
vffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi u I1 I 2 IN X X pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi uX kX kF ¼ hX ; X i ¼ t x2i1 ;i2 ;...;iN i1 ¼1 i2 ¼1
ð2Þ
iN ¼1
P Let X ðnÞ 2 Rpq , whose rank is r, X ¼ U V T is the singular value P decomposition, where ¼ diagðrÞ; r ¼ ðr1 ; r2 ; ; rr ÞT . For any s > 0, singular value shrinkage operator Ds ðXÞ is defined as:
Ds ðXÞ ¼ U rs V T
ð3Þ
3.2. The general description of the scheme Fig. 1 shows the outline of HRTF prediction method, which aims to obtain HRTF in all directions for individual subjects by predicting HRTF in unmeasured directions based on a small set of sparse measurements. This procedure includes three main parts: prediction model (details are shown in Section 3.3), tensor completion method (details are shown in Section 3.4), and the ITDs reconstruction. We focus on the prediction of HRTF magnitudes using the proposed method. The minimum-phase function of HRTF and the ITDs are used to reconstruct the phase information of HRTF in unmeasured direction based on [25]. HRTF is usually saved as its time domain form called HRIR in the actual experimental measurements. Firstly, some HRIR measurements and the measured and estimated directions (i.e., azimuth and elevation angles) are used as the input dataflow, which are expressed as hin and win , respectively. As for the ITDs reconstruction, we calculated the ITDs in the measured directions according to [25] and used linear interpolation to obtain the target ITDs of the unmeasured direction based on these known ITDs. On the other hand, these measured HRIR can be transformed to HRTF by time-frequency transformation. On the other hand, all directions are used to construct observed matrix along azimuth and elevation. It is best that the angular resolution between azimuths is uniform observed from the azimuth by fixing the index of elevation, and the same is true for the elevation dimension to maintain the prediction accuracy. Corresponding with locations of HRTF frequencies, the observed matrix is extended to a threeorder index tensor by adding frequency dimensionality, and whether the element at this position is known or not is represented by 1 or 0. Subsequently, with index tensor combined with HRTF in known directions, two three-order incomplete tensor model for
3
left and right ears can be constructed along azimuth, elevation and frequency subspaces. Therefore, prediction for HRTF in unmeasured directions can be converted to a completion of missing data based on observations. Gradient descent algorithm was used for solving this completion problem with line search strategy applied for reducing the time complexity, resulting in the complete tensor. The unmeasured HRTF entries in other directions are predicted by unfolding the complete tensor. At last, the HRIRs data (i.e., hout ) in unmeasured direction as the output can be obtained by the minimum phase functions of HRTFs following the corresponding ITD. In this paper, all the public HRTF databases can be used to build up the incomplete tensor model with parameters of measured and unmeasured directions. Based on these known elements, the tensor completion can capture the structure correlations among azimuths and elevations so as to predict the HRTF in other directions. Then, the subjective listening test and objective evaluations, such as spectral distortion, relative square error and prediction time, have been conducted to evaluate the HRTF data in estimated directions. 3.3. Tensor model for HRTF prediction HRTF can be expressed as a function of azimuth, elevation and frequency. Therefore, a three-dimensional spatial model can be constructed based on some sparse measurements from left or right ear. Compared with the traditional vector or matrix, tensor works well in representing this multi-dimensional spatial structure information for the purpose of predicting the HRTF in unmeasured directions based on a small set of sparse measurements. The steps of model construction are as follows: Given that L1 -point HRIR samples from a specific position in total of PQ directions are represented by hij 2 RL1 ði ¼ 1; 2; . . . ; P; j ¼ 1; 2; . . . ; Q Þ; L0 -point HRTF frequency samples can be obtained through Fast Fourier Transform (FFT), which is described as Hij 2 RL0 . The estimated and measured directions including azimuths and elevations are represented by wij 2 R1 called observed matrix. And the Gand K represent the number of rows and columns of the matrix wij and determine the total number of azimuth and elevation angles. Here, P < G , Q < K and it is worth noting that the estimated angles for azimuth dimension are set in our own way with a uniform angle interval, and the same is true for the elevation dimension to ensure predic-
Fig. 1. An overview of the proposed prediction method. (‘‘hin ” means left and right HRIR data in some measured directions; ‘‘win ” means estimated directions including azimuth and elevation angles; ‘‘hout ” means HRIR data in unmeasured directions by prediction;).
4
J. Wang et al. / Applied Acoustics 157 (2020) 106995
tion accuracy. Moreover, for the smaller angle intervals, the bigger the values Gand Kare, the higher spatial resolution HRTF will be. We extend wij from one point to L0 points inconsistent with Hij . Combining the extended observed matrix wij with HRTF samples in a total of P Q directions, a total of G K HRTFs with a length of L0 points can be described as:
2
H11 6H 6 21 6 6 4 :
H12 H22
HG1
HG2
:
.. .
3 H1K H2K 7 7 7 7 : 5
HGK
ð4Þ
X
s:t: X X ¼ MX :
ð6Þ
Tensor can represent more than two dimensional data spaces to capture the potential structures between subspaces, which have a wider application in the era of big data. Therefore, in this paper we focus on tensor in the low-rank approximation, which can be described as a convex optimization problem as:
minX kX k ; s:t: X X ¼ T X :
GK
where the estimated values are 0. Since the tensor completion method does not perform well for predicting the entire slice matrix, the matrix of 4 needs to satisfy that any row or column is not all 0. Finally, a three-order incomplete tensor model X 2 RGKL0 is constructed in three subspaces: azimuth, elevation and frequency as shown in Fig. 2, and the extended observed matrix including angle information is used to construct a tensor of the same size called index tensor. Fig. 2 shows the incomplete tensor model ðX Þ based on HRTF values in some measured and estimated directions along azimuth and elevation. The known measurements generally in a specified direction is displayed as the darker bars, that is, the higher-order analogues of matrix rows or columns, and the lighter one with slash pattern means the estimated HRTF data in measured directions. The HRTF correlation relationship between azimuth and elevation can be represented by high-order tensor structure. Here, G; K and L0 are the dimension number of the azimuth, elevation and frequency subspaces respectively. The known HRTF directions and the size of model can also be changed by adjusting the parameter G; K and L0 in order to obtain HRTF data of high resolutions, which is particular in spatial audio.
The multi-dimensional data can be approximated by a number of low-dimensional linear subspaces with some unknown entries, which can be described as low-rank approximation which in essence an issue of rank constraint. The optimization problem for low-rank matrix completion can be expressed as:
min rankðXÞ; s:t: X X ¼ M X :
ð7Þ PN
i¼1 ai kX ðiÞ k ; ai ; i ¼ 1; ::; NðN ¼ 3Þ are the weighted P3 coefficients satisfying ai P 0 and i¼1 ai ¼ 1. The three-order Ia Iz If incomplete tensor space is X 2 R . T 2 RIa Iz If is equivalent to the original tensor X o , and the set X contains locations of the observed data, which has identical meanings with tensor W. The operator k k is defined as trace norm of tensor that is a convex combination of the trace norms of all matrices unfolded along each mode (azimuth, elevation and frequency) in this paper. The Eq. (8) below can be rewritten as a general minimization function of tensor trace norm.
where, kX k ¼
min f ðX Þ ¼ X
s:t:
N X
ai kX ðiÞ k ;
i¼1
X X ¼ T X:
ð8Þ
Three dual variables Y 1 ; Y 2 ; Y 3 2 RIa Iz If and three parameters
l1 ; l2 ; l3 are introduced. Then, the objective function will be obtained as follows: N X
minf l ðX Þ ¼ X
i¼1
max ai hX ; Y i i
kY iðiÞ k1
li 2
kY ðiÞ k2F ;
s:t: X X ¼ T X :
3.4. Low rank approximation with tensor completion
X
minkXk ;
ð5Þ
in which the matrix X and M have same size of p q and the entries of set X are given while the remaining ones are unknown. Since this issue is a non-convex optimization, it is difficult to obtain global solution. Since the trace norm has been shown to be the tightest convex approximation for the rank of matrices [27], non-convex optimization was converted to convex by using the following Eq. (6).
ð9Þ
To solve the Eq. (9), we use gradient descent algorithm to minimize the objective function and the line search to improve the convergence speed [28,29]. X kþ1 and Z kþ1 are updated by the Eqs. (10) and (11).
X kþ1 ¼ Gkþ1
Z kþ1 ¼ Z k where
L
5f l ðGkþ1 Þ
hkþ1 Lk is
Lkþ1
;
ð10Þ
5 f l ðGkþ1 Þ: the
ð11Þ
Lipschitz constant of the objective pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 1 þ 4Lkþ1 Bk Þ and Bkþ1 ¼ Bk þ hkþ1 =Lk .
f l ðGÞ; hkþ1 ¼ Lk =2Lkþ1 ð1 þ
equals to T For the Eq. (10) and (11), G kþ1 X as follows:
Gkþ1 ¼
X
and can be updated
hkþ1 =Lk
Bk k Z þ Xk Bk þ hkþ1 =Lk Bk þ hkþ1 =Lk
ð12Þ
Its gradient can be computed by:
ð5f l ðGkþ1 ÞÞi
1 ;i2 ;i3
¼
8 ! X ða Þ2 > > < i T li ðGkþ1 Þ > > :
i
li
ðiÞ
ai
; ði1 ; i2 ; i3 Þ R X
i1 ;i2 ;i3
0;
ði1 ; i2 ; i3 Þ 2 X ð13Þ
Fig. 2. The prediction tensor model X 2 RGKL0 . (‘‘azimuth”, ‘‘elevation” and ‘‘frequency” represent three subspaces of the prediction model).
For the HRTF prediction, the values in other directions are initialized as the average of other observed data. For each iteration, G; X ; Z and other parameters can be updated. The procedure will not stop until the maximum iterations or the minimum tolerable
J. Wang et al. / Applied Acoustics 157 (2020) 106995
error. Then the eventual updated tensor X with the HRTF data in all directions will be obtained. The algorithm of tensor completion is shown as Algorithm 1. Algorithm 1 Tensor completion algorithm for HRTF prediction. Require: tensor X 2 RIa Iz If ; X X ¼ T X , sets X; c 2 ð0; 1Þ; L; ai , the maximum number of iterations K, the tolerable error tol Ensure: X 1: Initialize:Z ¼ G ¼ T ; B ¼ 0; L ¼ 1e 5; pffiffiffi c ¼ 0:6; li ¼ ai = Ii ; 2: fork = 1 to K do 3: while true do 4:
Update hkþ1 ; G kþ1 ; X kþ1 ;
5: 6: 7:
iff l ðX kþ1 Þ 6 f l ðG kþ1 Þ k 5 f l ðG kþ1 Þk2F =2Lkþ1 then break; end if
8: 9:
3.5. Compared with other tensor methods Tensor decomposition emphasizes the underlying factor estimation and exploits the model factors in low-rank approximation. Two classical models are included: the CANDECOMP/PARAFAC (CP) model and Tucker model, which have been successfully applied in the processing of signals [30], such as multi-channel audio signal [31], and features extraction in audio [32]. 3.5.1. CP-WOPT The completion problem for a weighted CANDECOMP/PARAFAC (CP) model is named CP-WOPT (CP Weighted OPTimization) [33], which can be described as: X ;U a ;U z ;U f
ð14Þ
where, U a ; U z ; U f are the factor matrices of the incomplete tensor X in azimuth, elevation and frequency subspaces, respectively, U a 2 RIa R ða ¼ a; z; f Þ and X ; T ; W 2 RIa Iz If . CP decomposition can be expressed as a sum of rank-one tensor component as follows:
sU a ; U z ; U f t ¼
R X uar uzr ufr
ð16Þ
where S is the core tensor with a size of R1 ; R2 ; R3 ,representing the main features of the incomplete tensor X. A1 ; A2 ; A3 2 RIi Ri ; ð1 6 i 3Þ are the factor matrices of tensor along each mode (azimuth, elevation, frequency), characterizing correlation relationships between subspaces. Alternating least squares (ALS) algorithm can usually be used to solve this problem by iteratively optimizing X and S; A1 ; A2 ; A3 , respectively. However, the quality of the eventual prediction will be affected by the n-mode ranks of tensor X (R1 ; R2 ; R3 ) and the prediction time will increase with the increase of ALS algorithm iterations. Therefore, selecting proper n-mode ranks for the HRTF prediction is very critical in this paper.
4.1. CIPIC HRTF database
Lkþ1 ¼ Lk =c; end while
1 kW ðX sU a ; U z ; U f tÞk2F s:t: X X ¼ T X : 2
1 kW ðS1 A1 2 A2 3 A3 Þk2F s:t: X X ¼ T X : 2
4. Simulation experiment
10: Update Z kþ1 ; Bkþ1 ; 11: end for
min
min
X ;S;A1 ;A2 ;A3
5
ð15Þ
r¼1
in which the operator denotes the outer product and uar is the ath vector of U a (uar 2 RIa 1 ; r ¼ 1; . . . ; R) for a ¼ a; z; f . It is worth noting that the tensor rank (R) is defined as the smallest number of rank-one tensors for tensor X , which is important for the reconstruction effects. However, there is currently no way to directly define a best rank for an arbitrary tensor, which turns out to be an NP-hard problem [26]. Generally, solving this problem entails many experiments or solid prior background knowledge about data. In this paper, we choose two ranks (R = 20, 50) for comparison. The eventual prediction results for HRTF can be obtained by updating the tensor X with the optimization of the factor matrices. 3.5.2. Tucker-opt Tucker optimization (also called Tucker-opt) [34] is formulated as a kind of completion problem based on the Tucker factorization model, which can be described as the Eq. (16).
The CIPIC HRTF Database [35] is a public-domain database of high-spatial-resolution HRIR measurements for 45 different subjects. The database includes 1250 locations at 25 different azimuths (i.e., 5°along most azimuths) and 50 different elevations (i.e., 5.625°along all elevations) for each subject in the interaural coordinate system. Therefore, we use the interaural coordinate system in this experiment. Whereas it is noting that the proposed prediction method is not limited to the interaual coordinate system of this experiment. HRIR has 200 samples long and its sampling rate is 44.1 kHz. Moreover, the delay of left and right ear and the corresponding ITD are also included in the database. The CIPIC database contains the non-uniform azimuth interval which is also used to explore more interaction in this experiment. Moreover, all experiments were done using MATLAB R2014a with Intel Core E3-1225 3.4 GHz processor and 8 GB RAM. 4.2. Prediction model construction and experiment set For a random subject, we firstly transformed the time domain parameters into frequency domain through a 200-point Fast Fourier Transform (FFT). Then, the parameters for each position were normed. Take the left ear for example, a three-order incomplete tensor model X 2 RIa Iz If (Ia ¼ 25; Ie ¼ 50; If ¼ 200) was constructed in azimuth, elevation and frequency mode by multiplying the index tensor W and the original tensor X o . These two tensors have the same size as tensor X . The tensor X o was constructed based on all HRTF measurements for a random subject in CIPIC database. Whether these parameters are unmeasured or measured in such directions (including azimuth and elevation) in tensor X can be designed by the index tensor W including just 0 and 1. A parameter ratio was introduced to describe the percentage of unmeasured directions and we can call it the missing rate. We use the missing rate to simulate sparse measurements. To explore the performance of different prediction methods with various ratios, the missing rate was set as 0.1, 0.3, 0.5 and 0.8, respectively. The estimated HRTF data as a test set were used for evaluation. Due to the randomness of observed and estimated directions at a fixed ratio, we conducted eight experiments repeatedly and the average results were recorded for each ratio in this paper. Moreover, tensor-related operations are implemented in Tensor_toolbox 2.6 [36]. The paper does not focus on the generation of new individual HRTF but the prediction and estimation of the HRTF in unmeasured directions based on sparse measurements to increase spatial resolution, which is more similar to the objective of interpolation methods. Moreover, among all of interpolation methods, linear
6
J. Wang et al. / Applied Acoustics 157 (2020) 106995
interpolation is more simple and effective to generate more accurate HRTF in unmeasured directions. Therefore, we chose the mainstream linear interpolation method [13] aimed to apply tensor completion in HRTF. Apart from the traditional linear interpolation method, other tensor methods like CP-WOPT and Tucker-opt were also employed for comparison. For the proposed method, the maximum iteration can be set as 500 times. Based on experiences, a was set as ½0:5; 0:5; 5 104 . Since the rank of CP-WOPT is an NP-hard problem, two ranks R = 20, 50 were selected for this method based on our experiences. For Tucker-opt method, rank parameters can be respectively set as 10, 25 and 200 in the azimuth, elevation and frequency mode, and the tolerance and maximum number of iterations were set as 106 and 100, respectively. For the linear interpolation method in [13], we converted the tensor model to matrices at the same ratio to ensure that the locations of the observed and unmeasured data are identical.
Table 1 Sound localization similarity scale. Description of sound localization similarity
Score
No difference Very Similar Similar Different Very Different
5 4 3 2 1
file was rendered using the original and predicted HRTF with the same azimuth in the horizontal plane at the ratio of 0.8. A total of 11 directions were tested, whose azimuths including 65 ; 45 ; 30 ; 15 ; 10 ; 0 ; 25 ; 30 ; 35 ; 65 and 80 .
4.3. Performance evaluations 4.3.1. Objective performance metrics To evaluate the experiment performances of each method, we calculated with three metrics: Relative Standard Error (RSE), Spectral Difference (SD) and the prediction time. The relative standard Error (RSE) between the incomplete tensor and the reconstructed tensor is defined as follows:
RSE ¼
^k kX X F kX kF
ð17Þ
where k kF is the Frobenius norm of tensor. The prediction performance will be better with smaller RSE. The spectral difference (SD) is the average of spectrum differences in all test directions over the whole frequency spectrum. It is defined as follows:
sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi M 1X 1 X SD ¼ 20log 10 ðjHi j jHoi jÞ2 M i¼1 Nf
ð18Þ
Fig. 3. The SD of different methods to predict HRTF at different ratios. (SD, Spectral Difference; CP-WOPT, CANDECOMP/PARAFAC Weighted OPTimization; Tucker-opt, Tucker optimization; see also in Fig. 4 and Table 2).
where Hi is the magnitude of the ground-truth HRTF of dataset and Hoi is the magnitude of reconstructed HRTF in one direction. N f is the number of frequency samples for each HRTF and M is the number of test directions (unmeasured positions). It is noted that SD (expressed in dB) can be obtained from the frequency-byfrequency squared differences between the original HRTF measurements Hi and the reconstructed HRTF data Hoi in one direction. The smaller the SD is, the closer the predicted parameters will be to the original HRTF measurements. 4.3.2. Design of subjective test It is meaningful and desirable for the predicted HRTF to provide accurate sound localization. Thus, the design of the subjective listening test is to reflect sound localization difference of the original HRTF and the predicted HRTF from different methods. Only the magnitude of HRTF can be obtained by these methods. While the ITD also affects the accuracy of sound localization especially for the horizontal plane, which need to be taken into account. We used the minimum-phase function of HRTF and the ITD to reconstruct the phase of HRTF in unmeasured direction based on [25]. The minimum phase of HRTF can be obtained by hilbert-transforming on the corresponding magnitude. The changes of ITD are very small at different elevations and the same azimuth. Thus we used linear interpolation to unify the ITD of the unmeasured direction HRTF obtained by all methods. In this test, we selected two representative sound files including drum and speech with a length of 8 s. The sampling rate of these sounds were set as 44.1 kHz and resolution was set at 16 bit. Each
Fig. 4. The RSE of different methods to predict HRTF at different ratios. (RSE, Relative Standard Error).
Table 2 Prediction Time for different tensor methods (s). Methods
Tensor completion CP-WOPT (R = 50) CP-WOPT (R = 20) Tucker-opt
Ratio 0:1
0:3
0:5
0:8
4:6 200:9 154:7 30:3
5:9 180:5 144:2 24:9
7:3 167:1 135:4 29:7
11:3 85:4 69:7 37:2
J. Wang et al. / Applied Acoustics 157 (2020) 106995
7
Fig. 5. Amplitude spectrums for different methods at the ratio of 0.6 (elevation angles ranging from 45 to 225 degree with 5.625-degree division and 25 degree azimuth angle. (a) original HRTF; (b) tensor completion; (c) CP-WOPT (R = 50); (d) CP-WOPT (R = 20); (e) Tucker-opt; (f) Linear interpolation[13]. (Different colors means different magnitude values of HRTF, See also Fig. 6).
To reduce the influence of personalization of HRTF on perceptual sound localization, 12 experienced participants (7 female and 5 male, age from 21 to 26) have attended this test, who are sensitive to the variation of sound position. Before the experiment, each subject performed a brief training to be familiar with the task. During the experiment, each rendering test signal (different directions, different methods) is played back through a headphone randomly. The participant needs to compare the sound localization difference between reference and test sample and score the latter using the five point degradation category scale according to Degradation Category Rating (DCR) method of ITU-T Recommendations P:800[37]. The Table 1 shows the grading criteria of the level of sound localization similarity, which resembles with [21]. 5. Results analysis and discussion Fig. 3 shows the value of SD averaged from the results of eight experiments at a 95% confidence level for tensor completion and other methods. It can be seen that the proposed method signifi-
cantly yields a quite smaller SD around 4 dB than other methods except 0.8. This is especially notable that the SD has a slight increase for tensor completion when ratio increases largely up to 0.8, but the values for other tensor methods increase sharply especially for ratio 0.8. This is that overfitting for these two tensor-methods occurred without the best rank based on a fewer observations. For different ranks set in CP-WOPT method, the value of SD for R = 50 is smaller than that for R =20 when the ratio is 0.1 and 0.3, while the result is the opposite when ratio = 0.5, 0.8. This illustrates that the tensor rank may change due to different tensors constructed by the HRTF data of different unmeasured and measured locations. Curvilinear trend chart with averaged RSE from eight experiments for tensor completion and other methods is shown in Fig. 4. We can see from the figure that RSE values for all methods have an upward trend at different ratios, which means that the prediction accuracy was affected as the ratio increases because HRTF data in more unmeasured directions need to be estimated by using more sparse measurements. There is a sharp increase
8
J. Wang et al. / Applied Acoustics 157 (2020) 106995
Fig. 6. Amplitude spectrums for different methods at the ratio of 0.6 (azimuth angles ranging in 80 to 80 degree with most 5-degree division and 28.125 degree elevation angle. (a) original HRTF; (b) tensor completion; (c) CP-WOPT (R = 50); (d) CP-WOPT (R = 20); (e) Tucker-opt; (f) Linear interpolation[13].
for RSE especially at the ratio of 0.8 except for tensor completion and linear interpolation methods, which is consistent with the previous SD results. Evidently, the proposed method generally yields smaller RSE, indicating that the proposed method can achieve higher accuracy when predicting the overall HRTF. In order to compare the time complexity of different tensor approaches, Table 2 shows the prediction time for tensor completion and other tensor methods. The results were recorded by averaging the time for eight experiments for each ratio. Tensor completion can acquire all HRTF data in the shortest time for each ratio, which illustrates that the proposed method can reduce the time complexity and estimate HRTF data in unmeasured directions more effectively than other tensor methods. It is worth noting that the prediction time decreases with the increase of ratio for CPWOPT method. For the CP-WOPT at similar ratio, the smaller the tensor rank is to set, the less factor matrices is to decompose. With the increase of ratio, there will be less observed values and therewith less calculating complexity. This may lead to data overfitting and then affect the prediction performance.
To some extent, the visual representations are useful to illustrate the results, which could reflect the amplitude spectrum information of the predicted HRTF and the differences between the ground-truth HRTF and predicted HRTF. Therefore, Fig. 5 observed at the ratio of 0.6 are as an example to illustrate some results. Besides, the number of points in the FFT is set to 512 for better image visuals. Amplitude spectrums across all elevations for a specific azimuth are shown in Fig. 5, which are referred to as lateral plane spectrums. Elevation angles range from 45 to 225 degree with a 25 degree azimuth angle. Fig. 5 separately represent original measurements of HRTF and reconstructed results obtained from different methods. It can be seen that all reconstructed lateral plane spectrums have more or less difference from the original especially in high frequency, which will affect elevation perception. The recovery of amplitude spectrums for CP-WOPT method with R = 20 among tensors is similar to that for linear interpolation, which implies our tensor prediction model is useful in reconstruction of HRTF. Moreover, Fig. 6b implies that the proposed method can predict the HRTF data with acceptable error.
J. Wang et al. / Applied Acoustics 157 (2020) 106995
9
Fig. 7. Amplitude spectrums comparison for different methods from a specific direction at 15 degree azimuth, 123.75 degree elevation and at the ratio of 0.6. (a) tensor completion; (b) CP-WOPT (R = 50), CP-WOPT (R = 20); (c) Tucker-opt; (d) Linear interpolation[13]. (‘‘ORIGIN HRTF” means the original HRTF at the same position in CIPIC HRTF database).
Fig. 8. The subjective test results for different methods (‘‘M1” means the tensor completion method, ‘‘M2” means the CP-WOPT method with rank of 20, ‘‘M3” means the Tucker-opt method, ‘‘M4” means the linear interpolation method).
Similar to Fig. 5, amplitude spectrums across all azimuths for a specific elevation are shown in Fig. 6, which are referred to as frontal plane spectrums. The azimuth angles range from 80 to 80 degree with an elevation angle of 28.125 degree. The overall estimated effects for frontal plane for all methods outperform the lateral plane obviously, especially tensor completion method can basically obtain the same frequency information as the original frontal plane spectrums by Fig. 6 (a), and (b). The amplitude spectrums for CP-WOPT method with smaller tensor rank and for original ground-truth HRTF are similar as observed from Fig. 6(a)–(d). And Fig. 6(d) and (f) show that linear interpolation approach results in greater differences around 0 degree, where there may be unmeasured HRTF entries in several continuous directions. To directly compare frequency information between the original and the predicted of different methods, the amplitude spectrum for a specific direction at 15 degree azimuth and 123.75
degree elevation are shown in Fig. 7. We can see from the Fig. 7 (a)–(d) that the tensor methods can exactly predict the HRTF parameters at low frequency of roughly less than 12 kHz. However, not only tensor methods but also linear interpolation can cause slight spectrum distortions when the frequency is more than 12 kHz, especially ranging from 12 kHz to 16 kHz. This is because HRTF sharply changes at high frequencies. Moreover, previous studies suggested that the spectral detail of HRTF at high frequency is inaudible [38]. We calculated the average values of the sound localization similarity and the 95% confidential interval according to the listeners’ scores. Fig. 8 shows the listening test results with comparing virtual sounds rendered by the original HRTF and the predicted HRTF from different methods. It is noted that the predicted HRTF for tensor completion is very close to that of linear interpolation. The scores for both of two methods are about 4 points and the former are higher, which means ‘‘very similar” according to Table 1. While the scores for CP-WOPT and Tucker-opt methods are about 3 points, corresponding to ‘‘similar”. Which all imply that the predicted HRTF for tensor completion is more similar to the original HRTF in sound localization similarity. 6. Conclusion HRTF plays an important role in reconstructing 3D spatial audios. However, it is costly and time-consuming to measure HRTF in actual experiments, especially for high spatial resolution HRTF. The paper describes a prediction method to estimate HRTF in unmeasured directions or to reconstruct overall HRTF for an individual only based on a small set of sparse ground-truths. A three-order tensor model with unmeasured entries firstly was constructed in azimuth, elevation and frequency mode, which represents the multi-dimension structure of HRTF magnitude functions by exploiting the potential spectrum information among three modes. Then, low rank tensor completion method was applied to obtain the eventual complete tensor, which is formulated as a convex optimization problem. To tackle this problem
10
J. Wang et al. / Applied Acoustics 157 (2020) 106995
fast, the gradient descent algorithm and line search strategy were used to complete the incomplete HRTF tensor and obtain the HRTF magnitude functions in unmeasured directions. At last, the HRIR entries in unmeasured directions were achieved by the minimum phase functions model of HRTF following the pure time delay. When it comes to the lost data for frequency mode, the proposed method can also be used in the reconstruction of the compressed HRTF, which may be directly applied in Virtual Reality audio. In all simulations performed, experimental results demonstrate that the proposed prediction model based on tensor completion outperforms other methods and achieves high prediction accuracy. The subjective tests show that the predicted HRTF is more approximate to the original ones for sound localization similarity than other methods. While with the increase of the ratio of unmeasured elements, that is more sparse directions, the HRTF prediction performance will decrease. Moreover, the proposed method in this paper only predicts the unmeasured HRTFs based on an individual incomplete measurements to increase the spatial resolution, which has certain limitations such as not applicable to predicting the HRTF values of the entire slice directions. In the future work, we will focus on the prediction of HRTFs based on more sparse measurements. Besides, the anthropometric measurements will also be included to predict the individual HRTFs.
[13]
[14]
[15]
[16]
[17]
[18]
[19]
[20]
[21]
[22]
Acknowledgment [23]
The authors would like to thank the reviewers for their kind suggestions. The work in this paper is supported by the National Natural Science Foundation of China (No. 61571044, No. 11590772, No. 61473041 and No. 61620106002) and the State Key Laboratory of Robotics. References [1] Potisk T. Head-related transfer function. Seminar Ia. Faculty of Mathematics and Physics, University of Ljubljana, Ljubljana, Slovenia; 2015. [2] Usman M, Kamal K, Qayyum R, Akram S, Mathavan S. 3d sound generation using kinect and hrtf. 2017 IEEE 2nd International Conference on Signal and Image Processing (ICSIP). p. 307–10. [3] Savioja L, Huopaniemi J, Lokki T, Väänänen R. Creating interactive virtual acoustic environments. Journal of the Audio Engineering Society 1999;47 (9):675–705. [4] Duda RO, Martens WL. Range dependence of the response of a spherical head model. J Acoust Soc Am 1998;104(5):3048–58. https://doi.org/10.1121/ 1.423886. [5] Algazi VR, Duda RO, Duraiswami R, Gumerov NA, Tang Z. Approximating the head-related transfer function using simple geometric models of the head and torso. J Acoust Soc Am 2002;112(5):2053–64. https://doi.org/10.1121/ 1.1508780. [6] Otani M, Ise S. Fast calculation system specialized for head-related transfer function based on boundary element method. J Acoust Soc Am 2006;119 (5):2589–98. https://doi.org/10.1121/1.2191608. [7] Salvador CD, Sakamoto S, Treviño J, Suzuki Y. In: Dataset of near-distance head-related transfer functions calculated using the boundary element method. Science, Audio Engineering Society; 2018. [8] Kulkarni A, Colburn HS. Infinite-impulse-response models of the head-related transfer function. J Acoust Soc Am 2004;115(4):1714–28. https://doi.org/ 10.1121/1.1650332. [9] Evans MJ, Angus JA, Tew AI. Analyzing head-related transfer function measurements using surface spherical harmonics. J Acoust Soc Am 1998;104 (4):2400–11. https://doi.org/10.1121/1.423749. [10] Zhang M, Kennedy RA, Abhayapala TD. Empirical determination of frequency representation in spherical harmonics-based hrtf functional modeling. IEEE/ ACM Trans Audio Speech Language Processing (TASLP) 2015;23(2):351–60. https://doi.org/10.1109/TASLP.2014.2381881. [11] Hu S, Trevino J, Salvador C, Sakamoto S, Li J, Suzuki Y. A local representation of the head-related transfer function. J Acoust Soc Am 2016;140(3):EL285–90. https://doi.org/10.1121/1.4962805. [12] Christensen F, Møller H, Minnaar P, Polgsties J, Olesen SK. Interpolating between head-related transfer functions measured with low-directional
[24]
[25]
[26] [27]
[28]
[29]
[30]
[31]
[32]
[33]
[34]
[35]
[36]
[37] [38]
resolution. Audio Engineering Society Convention, vol. 107. Audio Engineering Society; 1999. Reddy CS, Hegde RM. Horizontal plane hrtf interpolation using linear phase constraint for rendering spatial audio. 2016 24th European Signal Processing Conference (EUSIPCO). IEEE; 2016. p. 1668–72. Kistler DJ, Wightman FL. A model of head-related transfer functions based on principal components analysis and minimum-phase reconstruction. J Acoust Soc Am 1992;91(3):1637–47. https://doi.org/10.1121/1.402444. Wang L, Yin F, Chen Z. Head-related transfer function interpolation through multivariate polynomial fitting of principal component weights. Acoust Sci Technol 2009;30(6):395–403. https://doi.org/10.1250/ast.30.395. Wang L, Yin F, Chen Z. Hrtf compression via principal components analysis and vector quantization. IEICE Electron Express 2008;5(9):321–5. https://doi.org/ 10.1587/elex.5.321. Xie B-S. Recovery of individual head-related transfer functions from a small set of measurements. J Acoust Soc Am 2012;132(1):282–94. https://doi.org/ 10.1121/1.4728168. Grindlay G, Vasilescu MAO. A multilinear (tensor) framework for hrtf analysis and synthesis. IEEE International Conference on Acoustics, Speech and Signal Processing, 2007. ICASSP 2007, vol. 1. IEEE; 2007. p. I–161. Rothbucher M, Shen H, Diepold K. Dimensionality reduction in hrtf by using multiway array analysis. In: Human Centered Robot Systems. Springer; 2009. p. 103–10. Rothbucher M, Durkovic M, Shen H, Diepold K. Hrtf customization using multiway array analysis. 2010 18th European Signal Processing Conference. IEEE; 2010. p. 229–33. Huang Q, Li L. Modeling individual hrtf tensor using high-order partial least squares. EURASIP J Adv Signal Process 2014;2014(1):58. https://doi.org/ 10.1186/1687-6180-2014-58. Xie K, Wang L, Wang X, Xie G, Wen J, Zhang G, Cao J, Zhang D, Xie K, Wang X, et al. Accurate recovery of internet traffic data: A sequential tensor completion approach. IEEE/ACM Trans Networking (TON) 2018;26(2):793–806. https:// doi.org/10.1109/TNET.2018.2797094. Tan H, Wu Y, Shen B, Jin PJ, Ran B. Short-term traffic prediction based on dynamic tensor completion. IEEE Trans Intell Transp Syst 2016;17 (8):2123–33. https://doi.org/10.1109/TITS.2015.2513411. Bengua JA, Phien HN, Tuan HD, Do MN. Efficient tensor completion for color image and video recovery: Low-rank tensor train. IEEE Trans Image Process 2017;26(5):2466–79. https://doi.org/10.1109/TIP.2017.2672439. Kulkarni A, Isabelle S, Colburn H. On the minimum-phase approximation of head-related transfer functions. IEEE ASSP Workshop on Applications of Signal Processing to Audio and Acoustics. IEEE; 1995. p. 84–7. Kolda TG, Bader BW. Tensor decompositions and applications. SIAM Rev 2009;51(3):455–500. https://doi.org/10.1137/07070111X. Recht B, Fazel M, Parrilo PA. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Rev 2010;52 (3):471–501. https://doi.org/10.1137/070697835. Devolder O, Glineur F, Nesterov Y. First-order methods of smooth convex optimization with inexact oracle. Math Programming 2014;146(1–2):37–75. https://doi.org/10.1007/s10107-013-0677-5. Liu J, Musialski P, Wonka P, Ye J. Tensor completion for estimating missing values in visual data. IEEE Trans Pattern Anal Mach Intell 2013;35(1):208–20. https://doi.org/10.1109/TPAMI.2012.39. Wu Y, Tan H, Li Y, Li F, He H. Robust tensor decomposition based on cauchy distribution and its applications. Neurocomputing 2017;223:107–17. https:// doi.org/10.1016/j.neucom.2016.10.030. Jing W, Xiang X, Jingming K. A novel multichannel audio signal compression method based on tensor representation and decomposition. China Commun 2014;11(3):80–90. https://doi.org/10.1109/CC.2014.6825261. Yang L, Wang J, Xie X, Kuang J. Application of tucker decomposition in speech signal feature extraction. 2013 International Conference on Asian Language Processing (IALP). IEEE; 2013. p. 155–8. Acar E, Dunlavy DM, Kolda TG, Mørup M. Scalable tensor factorizations with missing data. Proceedings of the 2010 SIAM international conference on data mining. SIAM; 2010. p. 701–12. Filipovic´ M, Jukic´ A. Tucker factorization with missing data with application to low-n-rank tensor completion. Multidimensional Syst Signal Process 2015;26 (3):677–92. https://doi.org/10.1007/s11045-013-0269-9. Algazi VR, Duda RO, Thompson DM, Avendano C. The cipic hrtf database. 2001 IEEE Workshop on the Applications of Signal Processing to Audio and Acoustics. IEEE; 2001. p. 99–102. Bader BW, Kolda TG, et al., Matlab tensor toolbox version 2.6, available online, february 2015, http://www.sandia.gov/tgkolda/TensorToolbox/index-2.6. html.. I. Rec, P. 800, Methods for subjective determination of transmission quality (1996) 800–899.. Xie B, Zhang T. The audibility of spectral detail of head-related transfer functions at high frequency. Acta Acustica United Acustica 2010;96 (2):328–39. https://doi.org/10.3813/AAA.918282.