Robust Acoustic Event Classification using Fusion Fisher Vector features

Manjunath Mulimani*, Shashidhar G. Koolagudi

Department of Computer Science and Engineering, National Institute of Technology Karnataka, Surathkal 575 025, India


Article history: Received 24 September 2018 Received in revised form 15 April 2019 Accepted 19 May 2019

Keywords: Acoustic Event Classification (AEC) Fusion Fisher Vector (FFV) features Pseudo-color spectrogram Principal Component Analysis (PCA)

Abstract

In this paper, novel Fusion Fisher Vector (FFV) features are proposed for Acoustic Event Classification (AEC) in meeting room environments. The monochrome images of the pseudo-color spectrogram of an acoustic event are represented as Fisher vectors. First, irrelevant feature dimensions of each Fisher vector are discarded using Principal Component Analysis (PCA); then, the resulting Fisher vectors are fused to get the FFV features. The performance of the FFV features is evaluated on acoustic events of the UPC-TALP dataset in clean and different noisy conditions. Results show that the proposed FFV features are robust to noise and achieve an overall recognition accuracy of 94.32% across clean and different noisy conditions.

1. Introduction

Acoustic Event Classification (AEC) refers to the task of identifying the semantic label of an audio clip that represents a specific sound in a surrounding environment. Every environment has its own set of sounds. For instance, applause, chair moving, footsteps, phone ringing and door slams belong to the meeting room environment. Such sounds are popularly known as acoustic events. An environment in which acoustic events are present is known as an acoustic scene. Every acoustic event in the scene has its own spectral and temporal characteristics, which are effectively characterized by the human auditory system through Auditory Scene Analysis (ASA) [1]. While classifying acoustic events, humans are also supported by many other types of evidence, such as visual cues and intellectual experience. Vast experience, diversified training examples and context (such as the distinct time and frequency characteristics of acoustic events) help human beings classify more effectively and efficiently. Natural sounds like temple bell, school bell and bicycle bell ringing are made up of distinct frequency components, which are easily identified by human beings. A machine is able to recognize acoustic events with the help of computational algorithms. This process is known as Computational Auditory Scene Analysis (CASA) [2]. AEC is a branch of CASA. It has many applications, such as intelligent audio-based personal archives [3], audio-based surveillance systems [4–6], machine hearing [7] and so on.

* Corresponding author. E-mail addresses: [email protected] (M. Mulimani), [email protected] (S.G. Koolagudi). https://doi.org/10.1016/j.apacoust.2019.05.020

The traditional approaches used for AEC are "frame-based", which extract features from the continuous acoustic event signal frame by frame. The most common frame-based features are the Mel-frequency cepstral coefficients (MFCCs), specifically designed for and used in speech/speaker recognition tasks [8–12]. However, it has been clearly pointed out that speech is different from acoustic events when one considers their phonetic structure [13]. This understanding indicates that frame-based features may not be suitable for AEC. Due to variations in the length of different acoustic event signals (audio clips), frame-based feature extraction gives a different number of fixed-dimensional feature vectors per clip. If an acoustic event is represented by a sequence of feature vectors, then the sequences of different acoustic events have different lengths, depending on the length of the events. Such variable-length sequences are effectively modeled using Gaussian Mixture Model (GMM) based Hidden Markov Models (HMMs) [14]. However, HMMs require huge training data to capture the distinct variations among acoustic events and are not fit for simple and compact applications of the modern digital world. Recently, the Support Vector Machine (SVM) classifier proved to be highly effective for AEC [15–17]. However, an SVM requires a fixed-dimensional input feature vector. Several encoding methods are used to map variable-length sequences into fixed-length representations. The most popular methods are the Bag-of-Audio-Words (BoAW) [18] and the Fisher kernel [15]. The BoAW approach represents the frame-based features (a sequence of feature vectors) as a fixed-dimensional histogram ("bag") known as BoAW. This histogram is used as a feature vector for the SVM. Recently, it has been reported that the BoAW approach even outperforms popular Deep Neural Networks (DNNs) [19,20].
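As a concrete illustration of the encoding step described above, the following sketch builds a BoAW histogram from frame-level features; the codebook size and the use of MiniBatchKMeans are illustrative assumptions rather than the setup of any of the cited works.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def learn_codebook(train_frames, n_words=128, seed=0):
    """train_frames: (N, D) frame-level features pooled over all training clips."""
    return MiniBatchKMeans(n_clusters=n_words, random_state=seed).fit(train_frames)

def boaw_histogram(clip_frames, codebook):
    """clip_frames: (T, D) frames of one clip -> fixed-length, L1-normalized histogram."""
    words = codebook.predict(clip_frames)                       # nearest codeword per frame
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)                          # histogram fed to the SVM
```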


Alternatively, in this work, the Fisher kernel [21] is used to represent frame-based features as a fixed-dimensional feature vector, known as the Fisher vector. A generalized Fisher kernel, named the score-space kernel, has been used for speech recognition [22], speaker verification [23] and AEC [15].

Generally, real-time acoustic events are overlapped with high background noise. Traditional frame-based features are mostly sensitive to noise [14]. Hence, even Fisher vector/BoAW representations of frame-based features may not be suitable for AEC, especially in noisy conditions. Acoustic events are short in duration and have more distinct Time-Frequency Representations (TFRs) [24]. Hence, features from the time-frequency image have proved to be effective for robust AEC [17,24–26]. A conventional and widely used TFR is the spectrogram, which is generated from the Short-Time Fourier Transform (STFT) [13] of a signal. Recently, a set of grayscale spectrogram images was used as Bag-of-Visual-Words (BoVW) for AEC with a non-linear Chi-square SVM [25]. BoVW feature representations were shown to be more robust and performed significantly better than MFCCs and BoAW representations in different noisy conditions. In [24], the spectrogram image is divided into blocks; second and third order central moments are computed from each block and considered as Spectrogram Image Features (SIFs) for robust AEC. SIFs were shown to be more robust and performed significantly better than MFCCs in different noisy conditions. Acoustic events have higher intensity values in the lower frequency regions of the spectrogram images. Gammatone spectrograms (also known as cochleagrams) represent such higher intensity values more clearly than conventional STFT-based spectrograms by using narrow bandwidths at the lower frequencies and wide bandwidths at the higher frequencies. In [26], the central moments are computed from each block of the Gammatone spectrogram instead of the conventional spectrogram, referred to as Cochleagram Image Features (CIFs), and used for AEC in a surveillance application. CIFs performed better than SIFs. However, computing two central moments from each image block leads to a loss of important information from the spectrogram image. BoVW features fully capture the significant information by considering the entire spectrogram and performed better than SIFs and CIFs [25]. In [17], features are selected from Gammatone spectrogram images using the Sequential Backward Feature Selection (SBFS) algorithm and used for AEC. However, SBFS is a greedy algorithm, which demands high computational time and is impractical on larger datasets. Recently, there are references in the literature on the use of Convolutional Neural Networks (CNNs) for AEC [27–29]. However, CNNs require larger training data to learn, and it is difficult to get significant features from the learned CNNs.

The BoVW approach performs well with non-linear kernel classifiers such as Intersection and Chi-square kernel SVMs, which demand higher computational time than a simple linear SVM. The BoVW approach also suffers from a large quantization error. The advantages of using the Fisher kernel over BoVW are mainly two: Fisher vectors can be evaluated from much smaller vocabularies with lower computational time, and they perform well with a linear SVM [30]. In computer vision, patch-wise Scale Invariant Feature Transform (SIFT) descriptors from digital images are used to compute Fisher vectors [30]. SIFT descriptors effectively characterize objects appearing at different scales, locations and poses [31]. However, acoustic events in spectrograms may not have such variations, except variation along time. Hence, Fisher vector representations of SIFT descriptors from spectrogram images may not be suitable for AEC.


In this paper, the intensity values of the Red (R), Green (G) and Blue (B) monochrome images of the HSV pseudo-color spectrogram are considered as feature descriptors and represented as high-dimensional Fisher vectors. Further, PCA is applied to reduce the dimension of each Fisher vector. The resulting Fisher vectors from the monochrome images are concatenated (fused) to get the Fusion Fisher Vector (FFV). The FFV is the combination of prominent features from the Fisher vectors of all monochrome images of the pseudo-color spectrogram. The performance of the FFV features is compared with state-of-the-art systems, and their robustness is verified in different noisy conditions. The main contributions of our proposed work are summarized below.

• Novel FFV features are proposed for AEC and they are shown to be more robust in noisy environments.
• Traditional AEC systems are based on features used for speech/speaker recognition tasks. In this work, acoustic event specific Fisher vector features are evaluated from the distinct visual information of the acoustic events in the monochrome images of the pseudo-color spectrograms.
• Prominent features from the Fisher vectors of the monochrome images of the pseudo-color spectrogram are selected and fused to get the novel FFV features.
• The significance of the FFV features for AEC is analyzed and compared with other state-of-the-art methods.

The rest of this paper is organized as follows. Section 2 explains the proposed Fusion Fisher Vector features in brief. Section 3 describes the experiments carried out in this work. Results are discussed in Section 4. The conclusions are given in Section 5.

2. Fusion Fisher Vector features

An overview of the proposed approach is shown in Fig. 1. First, the pseudo-color spectrogram of an acoustic event is generated. Further, the monochrome images of the pseudo-color spectrogram are represented as Fisher vectors. Finally, PCA removes the irrelevant feature dimensions of the Fisher vectors, which are then fused to get the FFV features.

2.1. Pseudo-color spectrogram generation

In this step, acoustic events are first represented as a Gammatone spectrogram $S(f,t)$, where $f$ (ranging from 1 to $F$) is the center frequency of the Gammatone filter and $t$ is the time frame, obtained by windowing the signal into frames using a Hamming window of length 20 ms with 50% overlap. The sampling rate is 44100 Hz and $F = 64$ filters are equally spaced on the Equivalent Rectangular Bandwidth (ERB) scale. The logarithmic Gammatone spectrogram is generated using (1).

$S(f,t) = \log(S(f,t))$    (1)

Further, the values of the time-frequency matrix $S(f,t)$ are normalized to [0, 1] using (2) to get the grayscale intensity spectrogram image.

$GI(f,t) = \frac{S(f,t) - \min(S)}{\max(S) - \min(S)}$    (2)

Acoustic events are highly variable in duration, which causes dimensional variations along time. Hence, the grayscale spectrogram image $GI(f,t)$ is transposed as given in (3) so that each time frame becomes a fixed 64-dimensional row vector.

$G(t,f) = GI(f,t)^{T}$    (3)

Fig. 1. Overview of the proposed approach.
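A minimal NumPy sketch of Eqs. (1)–(3), assuming a precomputed Gammatone spectrogram S of shape (64, T) (64 ERB-spaced filters, 20 ms Hamming frames with 50% overlap); the Gammatone filterbank itself is not shown, and the small epsilon guarding log(0) is an added assumption.

```python
import numpy as np

def grayscale_spectrogram(S, eps=1e-10):
    """S: (64, T) Gammatone spectrogram -> (T, 64) grayscale image, rows as descriptors."""
    S_log = np.log(S + eps)                                    # Eq. (1): log compression
    GI = (S_log - S_log.min()) / (S_log.max() - S_log.min())   # Eq. (2): normalize to [0, 1]
    return GI.T                                                # Eq. (3): each row is a 64-dim vector
```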

The grayscale spectrogram $G(t,f)$ is quantized and mapped onto different monochrome images using (4); this process is called pseudo-color mapping [24].

$X_q(t,f) = f_q(G(t,f)) \quad \forall\, q \in (q_1, \ldots, q_N)$    (4)

where $X_q$ is the Red (R), Green (G) or Blue (B) monochrome image, $f_q$ is the nonlinear mapping function and $q$ is the quantization region (three regions: R, G and B). In this work, the popular HSV colormap is used to map the intensity values of $G(t,f)$ onto the RGB monochrome components; the resulting image is known as the pseudo-color spectrogram image (shown in Fig. 2). Unlike the 128-dimensional SIFT descriptors extracted from image patches in image processing, in this work each row of a monochrome image $X_q(t,f)$ is considered as a 64-dimensional descriptor of intensity values for the Fisher vector representation.
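The pseudo-color mapping of Eq. (4) can be sketched as below; using matplotlib's built-in 'hsv' colormap is an assumption about the exact colormap implementation, but it reproduces the idea of splitting the mapped image into R, G and B monochrome components.

```python
import matplotlib.pyplot as plt

def pseudo_color_channels(G):
    """G: (T, 64) grayscale spectrogram in [0, 1] -> three (T, 64) monochrome images."""
    rgba = plt.get_cmap('hsv')(G)                    # HSV colormap gives an RGBA value per pixel
    return rgba[..., 0], rgba[..., 1], rgba[..., 2]  # R, G and B monochrome images X_q(t, f)
```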

Fig. 2. HSV pseudo-color mapped spectrograms of an acoustic event at clean and 0 dB SNR. (a) Acoustic event chair moving; (b) pseudo-color spectrogram at clean condition; (c) pseudo-color spectrogram at 0 dB SNR; (d) HSV colormap.

2.2. Fusion Fisher Vector features

In this step, each monochrome image of the pseudo-color mapped spectrogram is represented as a Fisher vector (see Fig. 1). Let $X = \{x_t,\ t = 1, 2, \ldots, T\}$ be the set of $T$ 64-dimensional row vectors (feature descriptors) of a monochrome image, where $T$ is the number of time frames. Generally, the Fisher vector is derived from the Fisher kernel. The generation of a Fisher vector involves two stages: first, a generative model of the local descriptors is built; then, the coding vector (Fisher vector) is obtained by computing the gradients of the likelihood of the local descriptors with respect to the model parameters. In this work, the generative model is a Gaussian Mixture Model (GMM), which is trained using the local descriptors of five randomly selected monochrome images per class (a small training partition). It is to be noted that five randomly selected monochrome images per class are sufficient to train the GMM quickly. The set of parameters of the trained GMM is denoted as $\lambda$:

$\lambda = \{w_j, \mu_j, \Sigma_j\}_{j=1}^{K}$    (5)

where $w_j$, $\mu_j$ and $\Sigma_j$ are the weight, mean vector and covariance matrix of Gaussian $j$, respectively, and $K$ is the number of Gaussians. Each Gaussian is also known as a visual word. All the visual words together constitute the visual vocabulary. We assume that $\Sigma_j$ is a diagonal matrix and denote the diagonal of $\Sigma_j$ by $\sigma_j$, the standard deviation vector of Gaussian $j$. Once the GMM is trained, a monochrome image $X$ is represented as a Fisher vector by assigning its row vectors $x_t$ to the Gaussians. Let $\gamma_t(j)$ be the soft assignment of $x_t$ to Gaussian $j$:

$\gamma_t(j) = \frac{\exp\left[-\frac{1}{2}(x_t - \mu_j)^{T}\Sigma_j^{-1}(x_t - \mu_j)\right]}{\sum_{i=1}^{K}\exp\left[-\frac{1}{2}(x_t - \mu_i)^{T}\Sigma_i^{-1}(x_t - \mu_i)\right]}$    (6)

$\gamma_t(j)$ is also known as the posterior probability [30]. $V^{X}_{\mu,j}$ and $V^{X}_{\sigma,j}$ are the gradients with respect to $\mu_j$ and $\sigma_j$ of Gaussian $j$. They are computed using (7) and (8).

$V^{X}_{\mu,j} = \frac{1}{T\sqrt{w_j}} \sum_{t=1}^{T} \gamma_t(j)\left(\frac{x_t - \mu_j}{\sigma_j}\right)$    (7)

$V^{X}_{\sigma,j} = \frac{1}{T\sqrt{2w_j}} \sum_{t=1}^{T} \gamma_t(j)\left[\left(\frac{x_t - \mu_j}{\sigma_j}\right)^{2} - 1\right]$    (8)
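The encoding in Eqs. (6)–(8) can be sketched as follows, using scikit-learn's diagonal-covariance GMM as the visual vocabulary; this is an illustrative implementation under the stated assumptions, not the authors' code (note that scikit-learn's posteriors include the mixture weights).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_vocabulary(descriptors, K=64, seed=0):
    """descriptors: (N, 64) row vectors pooled from a few training images per class."""
    return GaussianMixture(n_components=K, covariance_type='diag',
                           random_state=seed).fit(descriptors)

def fisher_vector(X, gmm):
    """X: (T, 64) row descriptors of one monochrome image -> 2*D*K dimensional Fisher vector."""
    T, _ = X.shape
    w, mu, sigma = gmm.weights_, gmm.means_, np.sqrt(gmm.covariances_)   # (K,), (K, D), (K, D)
    gamma = gmm.predict_proba(X)                                         # Eq. (6): soft assignments (T, K)
    diff = (X[:, None, :] - mu[None, :, :]) / sigma[None, :, :]          # (T, K, D)
    v_mu = (gamma[:, :, None] * diff).sum(0) / (T * np.sqrt(w)[:, None])                    # Eq. (7)
    v_sigma = (gamma[:, :, None] * (diff ** 2 - 1)).sum(0) / (T * np.sqrt(2 * w)[:, None])  # Eq. (8)
    return np.concatenate([v_mu.ravel(), v_sigma.ravel()])               # gradients w.r.t. mu and sigma
```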

The final Fisher vector $V$ is the concatenation of the 64-dimensional gradient vectors $V^{X}_{\mu,j}$ and $V^{X}_{\sigma,j}$ of all $K$ Gaussians, resulting in a $2 \times D \times K$-dimensional vector, where $D$ is the dimension of the local descriptors, i.e., 64. In this work, values of $K$ ranging from 8 to 256 are considered and their impact on the final AEC accuracy is analyzed. Further, Principal Component Analysis (PCA) is applied to reduce the dimension of the Fisher vectors from $2 \times 64 \times K$ to $M$ dimensions using the percentage of cumulative variance, which is set to 99% [32]. There is no general best practice for selecting the percentage of cumulative variance; 99% is chosen because it retains the maximum variation among the discriminative features in the $M$-dimensional Fisher vector with minimal loss of information. To avoid feature biasing, the Fisher vectors are normalized using the Signed Square Root (SSR), computed as $V = \mathrm{sign}(V)\sqrt{|V|}$, which is also known as power normalization. Further, the Fisher vectors are normalized using the $\ell_2$ norm [33]. The normalized Fisher vectors of the three monochrome images of an acoustic event are concatenated (fused) to generate the Fusion Fisher Vector (FFV) features.
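A sketch of this post-processing, under the assumption that one PCA model is fitted per color channel on the training Fisher vectors: per-channel PCA keeping 99% of the cumulative variance, signed-square-root (power) normalization, l2 normalization and fusion into the FFV.

```python
import numpy as np
from sklearn.decomposition import PCA

def fit_channel_pca(train_fisher_vectors):
    """train_fisher_vectors: (n_clips, 2*D*K) Fisher vectors of one channel (training set)."""
    return PCA(n_components=0.99, svd_solver='full').fit(train_fisher_vectors)  # 99% cumulative variance

def normalize(v):
    v = np.sign(v) * np.sqrt(np.abs(v))            # signed square root (power normalization)
    return v / (np.linalg.norm(v) + 1e-12)         # l2 normalization

def fusion_fisher_vector(fvs, pcas):
    """fvs: [fv_R, fv_G, fv_B]; pcas: matching per-channel PCA models -> fused FFV feature."""
    reduced = [normalize(p.transform(v[None, :])[0]) for v, p in zip(fvs, pcas)]
    return np.concatenate(reduced)                 # Fusion Fisher Vector (FFV)
```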


3. Experiments

3.1. Acoustic event dataset

The performance of the proposed approach is evaluated on the UPC-TALP dataset [34]. Twelve different isolated meeting room acoustic events, namely applause (ap), cup jingle (cl), chair moving (cm), cough (co), door slam (ds), key jingle (kj), knock (kn), keyboard typing (kt), laugh (la), phone ring (pr), paper wrapping (pw) and walking sounds of steps (st), are selected for AEC. Approximately 60 acoustic events per class are recorded using 84 microphones: an array of 64 Mark III microphones, 12 T-shaped cluster microphones, and 8 table-top and omni-directional microphones. In this work, only the third channel of the Mark III array is considered for evaluation. Acoustic events are trimmed to the length of the given annotations and the resulting data is divided into five disjoint folds to perform fivefold cross-validation. Each fold has an equal number of acoustic event clips per class. To compare the robustness of the proposed approach, 'speech babble' noise from the NOISEX'92 database [35] is added to the acoustic events at 20, 10 and 0 dB SNR. The 'speech babble' noise is diffuse and most of its energy is distributed at lower frequencies. All acoustic event clips are available at a 44100 Hz sampling rate.

3.2. Evaluation methods

The performance of the proposed approach is compared with the following approaches.

1. Baseline systems: (a) The mean and standard deviation of 13 MFCCs and their first and second-order derivatives are taken over the frames, resulting in a 39 × 2 dimensional feature vector. (b) The mean and standard deviation of 13 GTCCs and their first and second-order derivatives are taken over the frames [36], resulting in a 39 × 2 dimensional feature vector.
2. BoAW approaches: (a) 13 GTCCs and their first and second-order derivatives are taken over each frame, resulting in a 39-dimensional feature vector represented as BoAW. (b) Pancoast et al. [18] considered MFCCs and their first and second-order derivatives with their log energies, represented as BoAW. (c) Grzeszick et al. [20] considered 13 GTCCs, MFCCs and a loudness value combined over each frame for the BoAW representation.
3. SIFs: Concatenating the second and third order moments over 9 × 9 blocks of the monochrome spectrogram images gives the Spectrogram Image Features (SIFs) [24].
4. DNNs: Mel band energies are used as features to train Deep Neural Networks (DNNs) with three fully connected layers followed by a softmax output. Each layer uses 500 units with the ReLU activation function and 10% dropout. Categorical cross-entropy is used as the loss function [37].
5. CNNs: 60-dimensional log Mel features are used as input to Convolutional Neural Networks (CNNs) [29], whose layer architecture is similar to the VGG net [38].
6. BoVW: Grayscale spectrograms of acoustic events are represented as BoVW [25].

The MFCC and GTCC features used in our experiments are extracted using a 20 ms Hamming window with 50% overlap and normalized to zero mean and unit variance. The performance of the baseline systems, SIFs and the proposed FFV is reported using a linear kernel SVM, which achieves higher recognition accuracy with lower computational cost; hence, it is chosen as the classifier. A one-versus-rest multi-class SVM is trained with fivefold cross-validation. The BoAW and BoVW representations are reported to perform better with a non-linear kernel SVM than with a linear kernel SVM [25]. Hence, the performance of the BoAW and BoVW representations is reported using a Chi-square kernel SVM.
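A minimal sketch of the evaluation protocol described above (one-versus-rest linear SVM with fivefold cross-validation); X is the (n_clips, d) FFV feature matrix and y the class labels, both assumed to be prepared as in Section 2.

```python
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

def evaluate(X, y, seed=0):
    clf = OneVsRestClassifier(LinearSVC(C=1.0, max_iter=10000))        # linear one-vs-rest SVM
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)  # fivefold cross-validation
    return cross_val_score(clf, X, y, cv=cv, scoring='accuracy').mean()  # overall recognition accuracy
```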


Table 1
Comparison of overall recognition accuracy (%) of the proposed FFV features with other methods using SVM at clean and different SNRs.

| Method           | Ref. | Clean | 20 dB | 10 dB | 0 dB  | Average |
|------------------|------|-------|-------|-------|-------|---------|
| MFCCs            | –    | 74.79 | 66.46 | 58.55 | 46.88 | 61.67   |
| GTCCs            | [36] | 76.87 | 74.83 | 72.54 | 59.02 | 70.81   |
| GTCCs-BoAW       | –    | 92.09 | 89.38 | 86.09 | 77.13 | 86.17   |
| Pancoast et al.  | [18] | 88.97 | 86.88 | 80.42 | 57.30 | 78.39   |
| Grzeszick et al. | [20] | 89.07 | 86.97 | 81.47 | 59.38 | 79.22   |
| SIFs             | [24] | 75.42 | 74.01 | 72.16 | 55.31 | 69.22   |
| DNN              | [37] | 70.42 | 69.38 | 58.55 | 35.63 | 58.49   |
| CNNs             | [29] | 86.33 | 84.05 | 80.17 | 65.97 | 79.13   |
| BoVW             | [25] | 93.54 | 92.51 | 88.76 | 79.54 | 88.58   |
| FFV              | –    | 97.29 | 96.23 | 94.18 | 89.59 | 94.32   |

4. Results and discussion

4.1. Performance comparison

Experimental results for the different noisy conditions are given in Table 1. These results demonstrate that the proposed FFV-SVM outperforms all other approaches in both clean and noisy conditions, with an average recognition accuracy of 94.32%. The proposed FFV features are robust to noise and achieve a recognition accuracy of 89.59% at 0 dB SNR, which is only 4.73% lower than the overall average recognition accuracy (94.32%) and 10.05% higher than the next best-performing BoVW (79.54%). As mentioned earlier, the energy of speech babble noise is concentrated at lower frequencies. MFCC features are highly sensitive to lower-frequency noise components, leading to a clear drop of the recognition accuracy of the baseline system at 0 dB SNR. The GTCCs are obtained from the same Gammatone filter bank that underlies the proposed pseudo-color spectrogram; hence, GTCCs are considered for performance comparison in this work. The Gammatone filter bank resolution at lower frequencies is much higher with the ERB scale than that of the Mel filter bank with the Mel scale [36,39]. Hence, GTCCs discriminate the spectral components at lower frequencies belonging to the acoustic event and the noise more accurately than MFCCs. However, the performance of GTCCs still does not match that of the proposed FFV features. We also tried to improve the recognition accuracy of GTCCs by concatenating them with MFCCs to build a more competitive baseline; however, this further deteriorated the performance of GTCCs and is therefore not reported in Table 1.

The values of mean and standard deviation computed over the frames lead to an inevitable loss of acoustic event information. Hence, the baseline systems achieve poor performance in all conditions. BoAW representations [18,20] of frame-based features effectively capture the vital information of the acoustic events over frames and outperform the baseline systems. However, BoAW representations are built from low-level frame-based features, which are sensitive to noise. This reduces the significance of the BoAW model in noisy conditions. BoVW representations of the grayscale spectrogram images effectively discriminate acoustic events from the noise and perform better than BoAW representations. The dimension of a Fisher vector from each monochrome image is $2 \times D \times K$, while the dimension of a BoVW representation is only $K$. Hence, the Fisher vector contains significantly more information about the acoustic events by including the gradients with respect to the mean and standard deviation. Thus, FFV-SVM outperforms the BoVW representations in clean and different noisy conditions. In this work, each Fisher vector is evaluated from the entire monochrome spectrogram image rather than by taking second and third order central moments over each block of an image [24]. The spatial information of the intensity values of the monochrome spectrogram image that is lost in the central moment features is fully captured in the FFV. Hence, the proposed FFV features perform significantly better than SIFs in all noisy conditions. The performance of the FFV-SVM approach is also compared with emerging DNNs and CNNs [37], which normally learn effectively with large training data. It can be seen from Table 1 that DNNs and CNNs do not perform well in the present case with limited training data. The magnitude of the intensity values of an acoustic event in the monochrome image is much higher than that of noise.

The noise is distributed over the spectrogram compared to the acoustic event, and its maximum energy is concentrated at the lower region of the pseudo-color mapped spectrogram image. However, the higher intensity values (stronger peaks) of the acoustic events are unaffected by the noise (see Fig. 2); they are effectively captured by the Fisher vector and discriminated from the noise, achieving higher recognition accuracy in all conditions compared to the other methods. The Fisher vectors are evaluated from the intensity distribution (normalized spectral energy distribution) in the monochrome images. For instance, the intensity distributions of the acoustic events 'chair moving' and 'laugh' at clean and 0 dB SNR are given in Fig. 3. In this context, the intensity distribution of a monochrome image is computed as the mean of the intensity values (normalized spectral values) of each frequency bin (a total of 64 frequency bins are considered). It is clear from Fig. 3 that the red, green and blue monochrome images of the pseudo-color spectrogram and the grayscale spectrogram image (generated using Eq. (2)) of acoustic events have distinct distributions, which are effectively clustered by the GMM and quantized by the posterior probability; the resulting Fisher vectors recognize the acoustic events with the highest recognition rate.

To verify the robustness of the individual distributions, the Euclidean distance between the clean and noisy (0 dB SNR) distributions of the same acoustic events (shown in Fig. 3) is calculated and given in Table 2. Similar distributions have a smaller distance, indicating that they are robust and less affected by the noise. One can observe from Fig. 3 that the green and grayscale spectrogram images have more low-intensity values, which are affected by the 'speech babble' noise (distributed over the lower regions of the spectrograms). Hence, the distribution distances of the green and grayscale images are high compared to those of the blue and red monochrome images. The higher intensity values have minimal or no effect of noise on them. It is worth pointing out that the intensity distributions of the red, green and blue monochrome images of the pseudo-color spectrogram are more robust than those of the grayscale spectrogram images. This observation motivates the use of Fisher vector representations of the red, green and blue monochrome images to generate more discriminative and robust FFV features for AEC.

Fisher vectors from the red, green and blue monochrome images are evaluated and named RFV, GFV and BFV, respectively. A summary of the recognition performance of RFV, GFV and BFV along with FFV at clean and 0 dB SNR is shown in Fig. 4. As mentioned earlier, the green monochrome image represents the acoustic event in the time-frequency plane with more low-intensity values, which are susceptible to noise compared to the red and blue monochrome images. Hence, RFV and BFV perform significantly better than GFV in both clean and 0 dB SNR conditions. The values with maximum variation in the Fisher vector represent the higher intensity values of the acoustic events in the monochrome images. These significant values are selected from the Fisher vectors using PCA and fused to get the FFV.
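A small sketch of the robustness check described above: the intensity distribution of a monochrome image is its mean intensity per frequency bin, and clean/noisy versions are compared with the Euclidean distance (cf. Table 2).

```python
import numpy as np

def intensity_distribution(X):
    """X: (T, 64) monochrome image -> 64-dim mean intensity per frequency bin."""
    return X.mean(axis=0)

def distribution_distance(X_clean, X_noisy):
    """Euclidean distance between the clean and noisy intensity distributions."""
    return np.linalg.norm(intensity_distribution(X_clean) - intensity_distribution(X_noisy))
```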


Fig. 3. Intensity distribution of acoustic events chair moving (a) and laugh (b) at clean and 0 dB SNR. (a1) & (b1) Spectral energy distribution of red monochrome images; (a2) & (b2) spectral energy distribution of green monochrome images; (a3) & (b3) spectral energy distribution of blue monochrome images; (a4) & (b4) spectral energy distribution of grayscale spectrogram images. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Table 2
Example intensity distribution distances between acoustic events at clean and 0 dB SNR.

| Acoustic Events | Red   | Green | Blue  | Grayscale |
|-----------------|-------|-------|-------|-----------|
| chair moving    | 0.487 | 0.648 | 0.293 | 0.659     |
| laugh           | 0.609 | 0.835 | 0.432 | 0.851     |

The FFV is the combination of the selected prominent features from RFV, GFV and BFV. Hence, the FFV features give the highest recognition accuracy, outperforming the Fisher vectors of the individual monochrome images as well as the baseline and state-of-the-art methods. The recognition accuracy of the Fisher vector representation of the grayscale spectrograms generated using Eq. (2) is lower than that of BFV and RFV; hence, it is not considered for comparison.

4.2. Number of Gaussians versus recognition accuracy

The recognition accuracy of FFV-SVM with respect to the number of Gaussians (K) in the clean condition is shown in Fig. 5. One can observe that the recognition accuracy improves as the number of Gaussians increases at first and slowly decreases after 64 Gaussians. Acoustic events are brief and occupy a sparse frequency spectrum. Hence, the number of Gaussians is set to 64 (K = 64), which is sufficient to recognize acoustic events with higher accuracy and lower computational cost.

4.3. Performance comparison with the state-of-the-art method on the UPC-TALP dataset

A BoAW approach, which is presently treated as the state of the art, achieves 96.46% acoustic event recognition accuracy [40], which is lower than that of the proposed FFV features (97.29%) in the clean condition on the UPC-TALP dataset.


It is noted that noise-sensitive frame-based features are used to represent the acoustic events as BoAW, which leads to serious performance degradation in noisy conditions.

4.4. Performance analysis

In this subsection, we present the recognition performance between acoustic event classes using confusion matrices. The confusion matrices obtained using the proposed FFV features and the competitive BoVW features (see Table 1) at clean and 0 dB SNR conditions are given in Tables 3–6. One can observe that the proposed FFV features significantly reduce the misclassification between classes compared to the BoVW features in both clean and 0 dB SNR conditions. One can notice from the confusion matrix of the proposed FFV features at the clean condition, given in Table 3, that the non-speech acoustic events 'cough' (co) and 'laugh' (la), both generated by the human vocal tract, are confused with each other more than with other acoustic events. As mentioned earlier, FFV features carry much more significant information about the acoustic events than BoVW features. Hence, FFV features discriminate the acoustical characteristics of 'cough' and 'laugh' better than BoVW features (see the confusion matrix of the BoVW features given in Table 5). It is observed from the confusion matrix of the proposed FFV features at 0 dB SNR, given in Table 4, that once again 'cough' and 'laugh' are largely misclassified. It is worth pointing out that all other acoustic events except 'steps' (st) are fairly correctly classified at 0 dB SNR, with a recognition accuracy of at least 90%. However, 'steps' has more low-energy components, which are strongly influenced by the 'speech babble' noise, leading to a lower recognition accuracy of 82.50%.
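The class-wise confusion matrices of Tables 3–6 can be computed as sketched below: counts are normalized per true class and expressed in percent, so each row sums to 100 and the diagonal gives the per-class recognition accuracy.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def confusion_percent(y_true, y_pred, labels):
    """Row-normalized confusion matrix in percent (rows: true class, columns: predicted class)."""
    cm = confusion_matrix(y_true, y_pred, labels=labels).astype(float)
    return 100.0 * cm / cm.sum(axis=1, keepdims=True)
```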

Fig. 4. Acoustic event recognition accuracy of Red, Green, Blue and Fusion Fisher Vectors at clean and 0 dB SNR. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 5. Recognition accuracy versus number of Gaussians.

Table 3
Confusion matrix for the classification of test samples at clean condition using proposed FFV features (values are in % and the recognition accuracy of a class is highlighted along the diagonal).

|    | ap     | cl     | cm    | co    | ds     | kj    | kn    | kt    | la    | pr    | pw     | st    |
|----|--------|--------|-------|-------|--------|-------|-------|-------|-------|-------|--------|-------|
| ap | 100.00 | 0      | 0     | 0     | 0      | 0     | 0     | 0     | 0     | 0     | 0      | 0     |
| cl | 0      | 100.00 | 0     | 0     | 0      | 0     | 0     | 0     | 0     | 0     | 0      | 0     |
| cm | 0      | 0      | 97.50 | 0     | 0      | 0     | 0     | 2.50  | 0     | 0     | 0      | 0     |
| co | 0      | 0      | 0     | 92.50 | 0      | 0     | 0     | 0     | 7.50  | 0     | 0      | 0     |
| ds | 0      | 0      | 0     | 0     | 100.00 | 0     | 0     | 0     | 0     | 0     | 0      | 0     |
| kj | 0      | 0      | 0     | 0     | 0      | 97.50 | 0     | 0     | 0     | 0     | 0      | 2.50  |
| kn | 0      | 0      | 0     | 0     | 0      | 0     | 97.50 | 0     | 0     | 0     | 0      | 2.50  |
| kt | 0      | 0      | 2.50  | 0     | 2.50   | 0     | 0     | 95.00 | 0     | 0     | 0      | 0     |
| la | 0      | 0      | 0     | 5.00  | 0      | 0     | 0     | 0     | 92.50 | 0     | 0      | 2.50  |
| pr | 0      | 0      | 0     | 0     | 0      | 0     | 0     | 0     | 2.50  | 97.50 | 0      | 0     |
| pw | 0      | 0      | 0     | 0     | 0      | 0     | 0     | 0     | 0     | 0     | 100.00 | 0     |
| st | 0      | 0      | 0     | 0     | 0      | 0     | 0     | 0     | 2.50  | 0     | 0      | 97.50 |

ap: applause, cl: cup jingle, cm: chair moving, co: cough, ds: door slam, kj: key jingle, kn: knock, kt: keyboard typing, la: laugh, pr: phone ringing, pw: paper work, st: steps.

Table 4
Confusion matrix for the classification of test samples at 0 dB SNR using proposed FFV features (values are in % and the recognition accuracy of a class is highlighted along the diagonal).

|    | ap     | cl    | cm    | co    | ds    | kj    | kn    | kt    | la    | pr    | pw    | st    |
|----|--------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| ap | 100.00 | 0     | 0     | 0     | 0     | 0     | 0     | 0     | 0     | 0     | 0     | 0     |
| cl | 0      | 97.50 | 0     | 0     | 0     | 0     | 0     | 2.50  | 0     | 0     | 0     | 0     |
| cm | 0      | 0     | 97.50 | 0     | 0     | 2.50  | 0     | 0     | 0     | 0     | 0     | 0     |
| co | 0      | 0     | 0     | 75.00 | 0     | 0     | 0     | 0     | 25.00 | 0     | 0     | 0     |
| ds | 0      | 0     | 0     | 0     | 92.50 | 0     | 0     | 0     | 2.50  | 0     | 0     | 5.00  |
| kj | 0      | 0     | 0     | 2.50  | 0     | 90.00 | 0     | 2.50  | 2.50  | 0     | 0     | 2.50  |
| kn | 2.50   | 0     | 0     | 0     | 0     | 0     | 95.00 | 0     | 2.50  | 0     | 0     | 0     |
| kt | 0      | 0     | 0     | 0     | 0     | 5.00  | 0     | 95.00 | 0     | 0     | 0     | 0     |
| la | 2.50   | 0     | 5.00  | 17.50 | 0     | 2.50  | 5.00  | 2.50  | 60.00 | 0     | 0     | 5.00  |
| pr | 0      | 0     | 0     | 0     | 0     | 0     | 2.50  | 0     | 0     | 95.00 | 0     | 2.50  |
| pw | 0      | 0     | 0     | 0     | 0     | 0     | 0     | 0     | 5.00  | 0     | 95.00 | 0     |
| st | 2.50   | 0     | 0     | 2.50  | 0     | 2.50  | 2.50  | 2.50  | 5.00  | 0     | 0     | 82.50 |

ap: applause, cl: cup jingle, cm: chair moving, co: cough, ds: door slam, kj: key jingle, kn: knock, kt: keyboard typing, la: laugh, pr: phone ringing, pw: paper work, st: steps.


Table 5
Confusion matrix for the classification of test samples at clean condition using BoVW features (values are in % and the recognition accuracy of a class is highlighted along the diagonal).

|    | ap     | cl    | cm    | co    | ds     | kj    | kn     | kt     | la    | pr    | pw    | st    |
|----|--------|-------|-------|-------|--------|-------|--------|--------|-------|-------|-------|-------|
| ap | 100.00 | 0     | 0     | 0     | 0      | 0     | 0      | 0      | 0     | 0     | 0     | 0     |
| cl | 0      | 97.50 | 0     | 0     | 0      | 0     | 0      | 0      | 0     | 0     | 2.50  | 0     |
| cm | 0      | 0     | 90.00 | 0     | 0      | 0     | 2.50   | 0      | 0     | 0     | 0     | 7.50  |
| co | 0      | 0     | 2.50  | 85.00 | 0      | 2.50  | 0      | 0      | 7.50  | 0     | 2.50  | 0     |
| ds | 0      | 0     | 0     | 0     | 100.00 | 0     | 0      | 0      | 0     | 0     | 0     | 0     |
| kj | 2.50   | 0     | 0     | 0     | 0      | 95.00 | 0      | 2.50   | 0     | 0     | 0     | 0     |
| kn | 0      | 0     | 0     | 0     | 0      | 0     | 100.00 | 0      | 0     | 0     | 0     | 0     |
| kt | 0      | 0     | 0     | 0     | 0      | 0     | 0      | 100.00 | 0     | 0     | 0     | 0     |
| la | 0      | 0     | 0     | 12.50 | 0      | 0     | 0      | 5.00   | 82.50 | 0     | 0     | 0     |
| pr | 0      | 0     | 0     | 5.00  | 0      | 0     | 0      | 2.50   | 2.50  | 85.00 | 2.50  | 2.50  |
| pw | 0      | 0     | 0     | 0     | 0      | 0     | 0      | 5.00   | 0     | 0     | 92.50 | 2.50  |
| st | 0      | 0     | 0     | 0     | 0      | 0     | 0      | 0      | 0     | 0     | 5.00  | 95.00 |

ap: applause, cl: cup jingle, cm: chair moving, co: cough, ds: door slam, kj: key jingle, kn: knock, kt: keyboard typing, la: laugh, pr: phone ringing, pw: paper work, st: steps.

Table 6
Confusion matrix for the classification of test samples at 0 dB SNR using BoVW features (values are in % and the recognition accuracy of a class is highlighted along the diagonal).

|    | ap     | cl    | cm    | co    | ds    | kj    | kn    | kt    | la    | pr    | pw    | st    |
|----|--------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| ap | 100.00 | 0     | 0     | 0     | 0     | 0     | 0     | 0     | 0     | 0     | 0     | 0     |
| cl | 0      | 87.50 | 5.00  | 0     | 0     | 2.50  | 0     | 0     | 0     | 0     | 2.50  | 2.50  |
| cm | 0      | 0     | 77.50 | 2.50  | 0     | 5.00  | 2.50  | 0     | 5.00  | 0     | 2.50  | 5.00  |
| co | 2.50   | 0     | 2.50  | 57.50 | 0     | 0     | 2.50  | 0     | 35.00 | 0     | 0     | 0     |
| ds | 2.50   | 0     | 5.00  | 0     | 82.50 | 0     | 0     | 0     | 5.00  | 5.00  | 0     | 0     |
| kj | 2.50   | 0     | 5.00  | 5.00  | 0     | 75.00 | 0     | 0     | 0     | 2.50  | 5.00  | 5.00  |
| kn | 0      | 0     | 10.00 | 0     | 0     | 0     | 90.00 | 0     | 0     | 0     | 0     | 0     |
| kt | 0      | 0     | 0     | 0     | 2.50  | 0     | 0     | 90.00 | 0     | 2.50  | 5.00  | 0     |
| la | 5.00   | 0     | 2.50  | 23.50 | 0     | 10.00 | 0     | 0     | 56.50 | 2.50  | 0     | 0     |
| pr | 2.50   | 0     | 0     | 0     | 5.00  | 2.50  | 0     | 2.50  | 5.00  | 70.00 | 5.00  | 7.50  |
| pw | 0      | 2.50  | 0     | 0     | 0     | 2.50  | 0     | 2.50  | 2.50  | 0     | 87.50 | 2.50  |
| st | 2.50   | 0     | 0     | 2.50  | 2.50  | 0     | 0     | 2.50  | 2.50  | 0     | 7.00  | 80.50 |

ap: applause, cl: cup jingle, cm: chair moving, co: cough, ds: door slam, kj: key jingle, kn: knock, kt: keyboard typing, la: laugh, pr: phone ringing, pw: paper work, st: steps.

The dominant intensity values of the acoustic events (see the dominant values of the acoustic event in Fig. 2) are more effectively discriminated from those of the noise by the FFV features, which therefore exhibit relatively lower misclassification between classes at 0 dB SNR compared to the BoVW features (see the confusion matrix of the BoVW features given in Table 6).

5. Summary and conclusions

In this paper, novel FFV features are proposed for AEC. The FFV is the combination of prominent features selected from the individual Fisher vectors of the monochrome images of the pseudo-color spectrogram of an acoustic event. Hence, the FFV features are acoustic event specific features and outperform the baseline and state-of-the-art approaches. Fisher vectors encode the higher order statistics (gradients with respect to the mean and standard deviation) of the visual words (Gaussians) from the monochrome images of the acoustic events and outperform BoVW representations. The intensity values of the acoustic events in the monochrome images are higher than those of the noise, and these values are effectively discriminated by the FFV features, leading to an average recognition accuracy of around 94% in clean and different noisy conditions. This indicates that the proposed FFV features are robust and make a significant contribution towards AEC. In future work, a Fisher vector may be computed from another Fisher vector to obtain more discriminative FFV features for AEC; the computation of a Fisher vector from another Fisher vector is referred to as a deep Fisher network. The combination of Fisher vectors from the monochrome images of different colormaps may improve the performance of the AEC system. Advanced dimension reduction approaches, such as Linear Discriminant Analysis (LDA) and max-margin learning, may be explored in place of the traditional PCA.

The concatenation of spectral and other image-related features with the FFV features may further improve the performance of acoustic event characterization.

Acknowledgements

The authors would like to thank and acknowledge the associate editor and the anonymous reviewers for their helpful suggestions and comments. We also would like to thank the Ministry of Electronics & Information Technology (MeitY), Government of India for their support of part of this research.

References

[1] Bregman AS. Auditory scene analysis: the perceptual organization of sound. MIT Press; 1994.
[2] Brown GJ, Cooke M. Computational auditory scene analysis. Computer Speech Lang 1994;8(4):297–336.
[3] Ellis DP, Lee K. Minimal-impact audio-based personal archives. In: Proceedings of the 1st ACM workshop on continuous archival and retrieval of personal experiences. ACM; 2004. p. 39–47.
[4] Foggia P, Petkov N, Saggese A, Strisciuglio N, Vento M. Reliable detection of audio events in highly noisy environments. Pattern Recogn Lett 2015;65:22–8.
[5] Harma A, McKinney MF, Skowronek J. Automatic surveillance of the acoustic activity in our living environment. In: IEEE International Conference on Multimedia and Expo (ICME). IEEE; 2005. p. 4–8.
[6] Mulimani M, Koolagudi SG. Extraction of MapReduce-based features from spectrograms for audio-based surveillance. Digital Signal Process 2019;87:1–9.
[7] Lyon RF. Machine hearing: an emerging field [exploratory DSP]. IEEE Signal Process Mag 2010;27(5):131–9.
[8] Perperis T, Giannakopoulos T, Makris A, Kosmopoulos DI, Tsekeridou S, Perantonis SJ, Theodoridis S. Multimodal and ontology-based fusion approaches of audio and visual processing for violence detection in movies. Expert Syst Appl 2011;38(11):14102–16.



[9] Temko A, Nadeu C. Classification of acoustic events using SVM-based clustering schemes. Pattern Recogn 2006;39(4):682–94.
[10] Kim K, Ko H. Hierarchical approach for abnormal acoustic event classification in an elevator. In: Proceedings of the 8th IEEE International Conference on Advanced Video and Signal Based Surveillance. IEEE; 2011. p. 89–94.
[11] Zhang Z, Schuller B. Semi-supervised learning helps in sound event classification. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2012. p. 333–6.
[12] Lojka M, Pleva M, Kiktová E, Juhár J, Čižmár A. Efficient acoustic detector of gunshots and glass breaking. Multimedia Tools Appl 2016;75(17):10441–69.
[13] Mulimani M, Koolagudi SG. Segmentation and characterization of acoustic event spectrograms using singular value decomposition. Expert Syst Appl 2019;120:413–25.
[14] Dennis J, Tran HD, Chng ES. Image feature representation of the subband power distribution for robust sound event classification. IEEE Trans Audio Speech Lang Process 2013;21(2):367–77.
[15] Temko A, Monte E, Nadeu C. Comparison of sequence discriminant support vector machines for acoustic event classification. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 5. IEEE; 2006. p. 721–4.
[16] Mulimani M, Jahnavi U, Koolagudi SG. Acoustic event classification using graph signals. In: Region 10 Conference (TENCON). IEEE; 2017. p. 1812–6.
[17] Sharan RV, Moir TJ. Pseudo-color cochleagram image feature and sequential feature selection for robust acoustic event recognition. Appl Acoust 2018;140:198–204.
[18] Pancoast S, Akbacak M. Bag-of-audio-words approach for multimedia event classification. In: Thirteenth Annual Conference of the International Speech Communication Association. p. 2105–8.
[19] Schmitt M, Ringeval F, Schuller BW. At the border of acoustics and linguistics: bag-of-audio-words for the recognition of emotions in speech. In: INTERSPEECH. p. 495–9.
[20] Grzeszick R, Plinge A, Fink GA. Bag-of-features methods for acoustic event detection and classification. IEEE/ACM Trans Audio Speech Lang Process 2017;25(6):1242–52.
[21] Jaakkola T, Haussler D. Exploiting generative models in discriminative classifiers. Adv Neural Inform Process Syst 1999:487–93.
[22] Smith N, Gales MJ. Using SVMs and discriminative models for speech recognition. In: Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1. IEEE; 2002. p. 77–80.
[23] Wan V, Renals S. Speaker verification using sequence discriminant support vector machines. IEEE Trans Speech Audio Process 2005;13(2):203–10.
[24] Dennis J, Tran HD, Li H. Spectrogram image feature for sound event classification in mismatched conditions. IEEE Signal Process Lett 2011;18(2):130–3.
[25] Mulimani M, Koolagudi SG. Robust acoustic event classification using bag-of-visual-words. In: INTERSPEECH. p. 3319–22.
[26] Sharan RV, Moir TJ. Subband time-frequency image texture features for robust audio surveillance. IEEE Trans Inf Forensics Secur 2015;10(12):2605–15.
[27] Sharan RV, Moir TJ. Acoustic event recognition using cochleagram image and convolutional neural networks. Appl Acoust 2019;148:62–6.
[28] Salamon J, Bello JP. Deep convolutional neural networks and data augmentation for environmental sound classification. IEEE Signal Process Lett 2017;24(3):279–83.
[29] Li J, Dai W, Metze F, Qu S, Das S. A comparison of deep learning methods for environmental sound detection. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2017. p. 126–30.
[30] Sánchez J, Perronnin F, Mensink T, Verbeek J. Image classification with the Fisher vector: theory and practice. Int J Comput Vision 2013;105(3):222–45.
[31] Lowe DG. Distinctive image features from scale-invariant keypoints. Int J Comput Vision 2004;60(2):91–110.
[32] Jolliffe IT. Choosing a subset of principal components or variables. In: Principal component analysis. Springer; 1986. p. 92–114.
[33] Perronnin F, Sánchez J, Mensink T. Improving the Fisher kernel for large-scale image classification. In: European Conference on Computer Vision. Springer; 2010. p. 143–56.
[34] Temko A, Malkin R, Zieger C, Macho D, Nadeu C, Omologo M. CLEAR evaluation of acoustic event detection and classification systems. In: International Evaluation Workshop on Classification of Events, Activities and Relationships. Springer; 2006. p. 311–22.
[35] Varga A, Steeneken HJ. Assessment for automatic speech recognition: II. NOISEX-92: a database and an experiment to study the effect of additive noise on speech recognition systems. Speech Commun 1993;12(3):247–51.
[36] Valero X, Alias F. Gammatone cepstral coefficients: biologically inspired features for non-speech audio classification. IEEE Trans Multimedia 2012;14(6):1684–9.
[37] Kong Q, Sobieraj I, Wang W, Plumbley M. Deep neural network baseline for DCASE challenge 2016. In: Proceedings of DCASE.
[38] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556.
[39] Unoki M, Akagi M. A method of signal extraction from noisy signal based on auditory scene analysis. Speech Commun 1999;27(3–4):261–79.
[40] Phan H, Hertel L, Maass M, Mazur R, Mertins A. Learning representations for nonspeech audio events through their similarities to speech patterns. IEEE/ACM Trans Audio Speech Lang Process 2016;24(4):807–22.