Acoustic classification of frog within-species and species-specific calls


Applied Acoustics 131 (2018) 79–86


Jie Xie a,c,⁎,1, Karlina Indraswari a,1, Lin Schwarzkopf b, Michael Towsey a, Jinglan Zhang a, Paul Roe a

a Electrical Engineering and Computer Science School, Queensland University of Technology, Brisbane, Australia
b College of Science and Engineering, James Cook University, Townsville, Australia
c Electrical and Computer Engineering, University of Waterloo, Waterloo, Canada

Keywords: Soundscape ecology; Frog community interactions; Acoustic features; Machine learning algorithms

Abstract

There have been various studies using automated recognisers, based on acoustic features and machine learning algorithms, to classify frog species within a chorusing community. Such studies rarely consider within-species call variation in the classification process. Individual frog species may make a range of different calls, with different purposes. Including call variants in automated recognition has the potential not only to increase the accuracy of classifying calls to species, but also to provide information on frog calling behaviour within species. Here we use acoustic feature extraction and machine learning algorithms (1) to investigate the importance of acoustic features for identifying species-specific calls, and (2) to determine which acoustic features can be used to classify within-species calls. Our method was tested on four frog species (Litoria bicolor, Litoria rothii, Litoria wotjulumensis, and Uperoleia inundata) and four call types of L. wotjulumensis (normal, click, response, and long trill). Mean classification accuracy was high: 84.0% at the species level and 83.7% at the call type level. The overall classification accuracy reached 93.0% when the four call types of L. wotjulumensis were treated as individual classes and combined with the other three frog species. Two techniques, principal component analysis and the Fisher discriminant ratio, were used for dimension reduction and to select important features for discriminating among calls of different species and call types within species. In conclusion, our proposed classification mechanism could effectively classify not only different frog species but also different call types within the same species. Moreover, we found that time-domain features were important for classification of within-species calls, whereas frequency-domain features were more useful for classification of species-specific calls.

1. Introduction

Acoustic signals are used by many animals to convey information [2]. Such signals can be used by conspecifics or other species, including humans, to extract information about species presence or absence, or more detailed information on individuals, such as size, sex, or fitness. Recently, advances in recording and storage technology have allowed researchers to collect large amounts of acoustic data, which can be used for a variety of ecological and environmental studies (e.g. [3]). It is now easy to collect large volumes of acoustic recordings, but extracting certain types of information from them, especially species presence, absence, and activity, is extremely time consuming [37]. As a result, a growing number of studies have examined automated techniques for extracting such information from acoustic recordings (e.g. [20,6,25,34,19,18]). Such studies typically focus on the acoustic




signals made by mammals and birds, although a few studies have examined the calls of anurans [24]. Anurans make excellent subjects for automated extraction of information because their calls tend to be relatively simple and repetitive compared with those of mammals and birds, and because they call at night, when there is less background noise than during the day [10]. In addition to species-specific calls, individual frogs within a species may make up to four types of calls: advertisement calls, reciprocation calls, release calls, and distress calls [36]. Advertisement calls are typically the ones used for extracting acoustic features to identify species [31,39]. However, recognition of species alone may not be sufficient for ecologists interested in the behaviour of species within frog communities. A single species tends to have multiple call modifications. Variation in call properties within species may be manifest as changes in fundamental frequency, call duration, or call

⁎ Corresponding author at: Electrical Engineering and Computer Science School, Queensland University of Technology, Brisbane, Australia. E-mail address: [email protected] (J. Xie).
1 These authors contributed equally to this work.

http://dx.doi.org/10.1016/j.apacoust.2017.10.024 Received 25 April 2017; Received in revised form 14 October 2017; Accepted 19 October 2017 0003-682X/ © 2017 Elsevier Ltd. All rights reserved.


Fig. 1. Our classification system for studying within-species calls and species-specific calls. PCA and FDR denote principal component analysis and the Fisher discriminant ratio; NB, K-NN, and RF denote naive Bayes, K-nearest neighbour, and random forest. The two dimension reduction techniques, PCA and FDR, are used selectively in the analysis, and the three classifiers are compared to find the best one.
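The pipeline of Fig. 1 can be sketched end-to-end as follows. This is a hypothetical scikit-learn analogue of the Matlab/Weka system described later in the paper, with random data standing in for the 48-dimensional acoustic features; the parameter choices (K = 12 neighbours, 100 trees, first seven PCs, five-fold cross-validation) follow Sections 2.5 and 2.6.

```python
# Hypothetical sketch of the Fig. 1 pipeline: features -> PCA -> classifier
# comparison (NB, K-NN, RF). Random data stands in for real frog-call features.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 48))      # 48-dimensional feature vectors (placeholder)
y = rng.integers(0, 4, size=200)    # four species labels (placeholder)

classifiers = {
    "NB": GaussianNB(),
    "K-NN": KNeighborsClassifier(n_neighbors=12),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
}
for name, clf in classifiers.items():
    model = make_pipeline(PCA(n_components=7), clf)   # first seven PCs
    scores = cross_val_score(model, X, y, cv=5)       # five-fold cross-validation
    print(f"{name}: {scores.mean():.3f}")
```

With real features in place of the random placeholders, the loop reproduces the comparison of the three classifiers that the paper reports in Section 3.2.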

complexity, by adding or removing certain call components [11,29,13].

A variety of previous studies have used acoustic features for automated recognition of frogs [5,12,17,1,39,26,38,9]. In automated recognition, the process is typically broken into two tasks: signal detection and signal classification. In signal detection, structured sounds of interest are separated from random background noise; signal classification involves labelling sounds into biologically relevant groups (usually species) [8]. Calls are broken down into acoustic features, which are then classified as belonging to a particular species using machine learning algorithms. The methods used for these two processes vary and are under active development [40]. For example, in [5], the duration of each individual call was used for pre-classification of species, because individual frog calls of different species tend to have different durations [35]; a multi-stage average spectrum was then used to recognise species via template matching. Xie et al. [39] extracted a novel cepstral feature set using adaptive frequency-scaled wavelet packet decomposition (WPD). The decomposition tree of the WPD was constructed adaptively from a frequency scale generated by applying k-means clustering to dominant frequencies. Because the frequency scale was adapted to the frog species being classified, it could better distinguish the frequency components of different species, and its classification performance improved over fixed frequency scales such as the Mel scale [39]. Both [26] and [38] fused features from different domains to classify frog calls: Noda et al. [26] proposed a fusion of temporal and cepstral features, whereas Xie et al. [38] combined features from the temporal, spectral, and cepstral domains. Compared with classification using features from one domain, experimental results demonstrated higher classification accuracy when features from different domains were combined. Examining more features, to discriminate within-species calls as well as species-specific calls, should therefore improve species recognition.

In this study we propose a robust classification system to study frog within-species calls and species-specific calls (Fig. 1). We examined four widely distributed Australian frog species from a single community: Litoria bicolor, Litoria rothii, Litoria wotjulumensis, and Uperoleia inundata. In addition, four within-species call types from a single species, L. wotjulumensis, were classified. We investigated a set of 14 acoustic features for classification and used principal component analysis (PCA) and the Fisher discriminant ratio (FDR) to reduce feature dimension and identify important components. Finally, we created a large tagged frog call dataset including both within-species and species-specific calls, which registered users can access through our group website2.

2. Materials and methods

2.1. Study site and species

Recordings used for this study were collected from Bickerton Island, located near Groote Eylandt (Fig. 2), Northern Territory, Australia (Latitude: −13.77°, Longitude: 136.19°). We sampled calls from four frog species: northern dwarf tree frogs (Litoria bicolor), northern laughing tree frogs (Litoria rothii), watjulum frogs (Litoria wotjulumensis), and floodplain toadlets (Uperoleia inundata). We sampled recordings on three days, between 11 and 13 December 2013, and between 20:00 and 21:00. Frogs call principally at night, so these times were selected to ensure that the sun had completely set at the start of the recording sample. We listed, sampled, and labelled calls within this window. The numbers of call instances identified for L. rothii, L. wotjulumensis, U. inundata, and L. bicolor were 2583, 1803, 211, and 1233, respectively. In addition, we could identify four within-species call types of L. wotjulumensis, both visually and audibly. We refer to these within-species calls as: normal call (1113),


2 https://www.ecosounds.org/.


Fig. 2. Location of Bickerton Island.

Table 1
Description of L. wotjulumensis within-species calls.

Normal call: The common advertisement call of the species, without modification in the spectral or temporal domain.
Click call: Click-like notes produced by a male frog.
Long trill call: A series of continuous pulses that make up a trill-like sound. The long trill is assumed to be a modification of the common advertisement call, predicted to be either an aggressive response or a form of call plasticity, a calling behaviour that allows males to appear of higher quality than they actually are.
Response call: Made when two frogs call consecutively. The interval between the first and second calls is very short, and at times the calls may even overlap; this overlap could cause the pair to be misidentified as a different call type rather than as two individual frogs calling consecutively. A common characteristic of a response call is that the dominant frequency of the second call either increases or decreases in response to the first.

Table 2
Acoustic features for the recognition of frog within-species calls and species-specific calls. An asterisk denotes a time-domain feature; the others are frequency-domain features. The total dimension of all 14 features is 48.

1. Mel-frequency cepstral coefficients (MFCCs; dim. 18): short-term power spectrum of a sound based on a linear cosine transform of a log power spectrum on a nonlinear Mel scale of frequency [23].
2. Linear frequency cepstral coefficients (LFCCs; dim. 18): short-term power spectrum of a sound based on a linear cosine transform of a log power spectrum on a linear scale of frequency [26].
3. Spectral centroid (SC; dim. 1): amplitude-weighted mean of the frequencies present in the frog syllable [17].
4. Spectral flux (SX; dim. 1): frame-to-frame spectral difference, characterising the change in the shape of the spectrum [17].
5. Spectral rolloff (SR; dim. 1): amount of right-skewness of the power spectrum [17].
6. Spectral flatness (SF; dim. 1): ratio between the geometric and arithmetic means, indicating whether a frequency spectrum is smooth or spiky [17].
7. Signal bandwidth (BW; dim. 1): difference between the upper and lower cut-off frequencies [17].
8. Fundamental frequency (FY; dim. 1): average frequency of the amplitude peaks over all frames within one frog syllable [38].
9. Averaged energy* (AE; dim. 1): squared signal amplitude accumulated and divided by the length of the signal [38].
10. Zero-crossing rate* (ZR; dim. 1): number of time-domain zero-crossings in each individual frog syllable [38].
11. Oscillation rate* (OSR; dim. 1): click periodicity within a specified frequency band [38].
12. Shannon entropy* (SE; dim. 1): average of the information contents weighted by their probabilities of occurrence [38].
13. Rényi entropy* (RE; dim. 1): a different averaging of the probabilities via one parameter [15].
14. Tsallis entropy* (TE; dim. 1): another generalisation of SE for signal complexity measurement; a high value indicates low complexity [7].

click call (296), response call (244), and long trill call (150). The definitions we used to distinguish the four within-species calls are provided in Table 1.

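To make the feature definitions concrete, the sketch below applies the amplitude normalisation of Section 2.2 (Eq. (1)) and then computes three of the Table 2 descriptors on a synthetic syllable. It is an illustrative Python/numpy reimplementation, not the authors' Matlab code, and the 5%/95% spectral-energy cut-offs used for the bandwidth are an assumption (the paper does not specify its cut-off criterion).

```python
# Sketch of three Table 2 features on a synthetic 2.5 kHz syllable; exact
# definitions in the paper (framing, windowing, cut-offs) may differ.
import numpy as np

fs = 16000                                   # re-sampling rate from Section 2.2
t = np.arange(0, 0.5, 1 / fs)
x = np.sin(2 * np.pi * 2500 * t) * np.hanning(t.size)   # stand-in frog syllable

# Amplitude normalisation to [-1, 1] as in Eq. (1)
y = 2 * (x - x.min()) / (x.max() - x.min()) - 1

# Spectral centroid (SC): amplitude-weighted mean frequency
mag = np.abs(np.fft.rfft(y))
freqs = np.fft.rfftfreq(y.size, 1 / fs)
sc = np.sum(freqs * mag) / np.sum(mag)

# Zero-crossing rate (ZR): sign changes over the whole syllable
zr = int(np.count_nonzero(np.diff(np.signbit(y))))

# Signal bandwidth (BW): upper minus lower cut-off frequency, with the
# cut-offs taken (as an assumption) at 5% and 95% of cumulative energy
cum = np.cumsum(mag ** 2) / np.sum(mag ** 2)
bw = freqs[np.searchsorted(cum, 0.95)] - freqs[np.searchsorted(cum, 0.05)]

print(round(sc), zr, round(bw))
```

For this narrowband tone the centroid lands near 2500 Hz and the bandwidth is small, as expected; real frog syllables would give broader spectra.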

2.2. Call preprocessing

Segmented individual calls were obtained by drawing boxes around frog calls through manual annotation of the recordings on our group's website [33]. Each frog call was re-sampled at 16 kHz and saved in 32-bit monaural format. Re-sampling was used to remove high-frequency components, such as insect calls, and to reduce the computational burden. The amplitude of each frog call was normalised as follows:

y(t) = 2 (x(t) − min(x(t))) / (max(x(t)) − min(x(t))) − 1    (1)

where x(t) is the original frog syllable and min(·) and max(·) denote the minimal and maximal values, respectively.

Feature extraction in this study was conducted in Matlab 2014b (The MathWorks, Inc., Natick, Massachusetts, United States). All machine learning algorithms were run in Weka [14]. Mel-frequency cepstral coefficients (MFCCs) and linear frequency cepstral coefficients (LFCCs) were computed with the LFCC-rastamat toolbox developed by Zhou et al. [42]; the remaining features were programmed by the authors.

2.3. Feature extraction

To fully reflect the characteristics of frog calls, fourteen common acoustic features were investigated; the full list is given in Table 2. These features have been used in previous studies for the recognition of frog species [15,17,26,38]. A window size of 20 ms with 50% overlap between frames was used to calculate the MFCCs and LFCCs. For the other features, the window size was 32 ms with the same overlap. These values were determined experimentally by varying the window size from 5 ms to 1 s and evaluating the results.

2.4. Feature dimension reduction

2.4.1. Principal component analysis (PCA)

PCA is a common technique used to decorrelate feature vectors and transform a high-dimensional feature set into a low-dimensional orthogonal feature space, while retaining the maximum variance of the original high-dimensional feature set. The feature set, consisting of fourteen features, has a dimension of 48 (Table 2). Each resulting orthogonal feature is referred to as a principal component (PC). PCs are ranked by their corresponding eigenvalues, which quantify the variance captured by each PC; consequently, PC1 captures the largest share of the variance of the original feature set, and PC2, which is orthogonal to PC1, captures the next largest share.

2.4.2. Fisher discriminant ratio (FDR)

We also used the Fisher discriminant ratio to reduce the dimensionality of the call features. For each acoustic feature i and pair of classes a and b, the FDR is defined as

FDR(i) = (μ_{a,i} − μ_{b,i})² / (σ²_{a,i} + σ²_{b,i})    (2)

where μ_{a,i} and μ_{b,i} are the means of feature i for frog species or call types a and b, and σ²_{a,i} and σ²_{b,i} are the corresponding variances. A higher FDR value indicates better discrimination by that feature: a feature with a large distance between the class means but low intra-class variance has good discrimination ability.

2.5. Machine learning (ML) algorithms

To verify that frequency-domain information is more important for classifying frog species-specific calls, while time-domain information is more important for distinguishing within-species calls, three standard ML algorithms were used in this study to perform the classification: naive Bayes, K-nearest neighbour, and random forest.

2.5.1. Naive Bayes (NB)

NB is a simple technique for constructing classifiers: models that assign class labels, drawn from some finite set, to problem instances represented as vectors of feature values [28]. There is no single algorithm for training such classifiers, but a family of algorithms based on a common principle: all naive Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable.

2.5.2. K-nearest neighbour (K-NN)

K-NN is a supervised learning algorithm [22] that predicts the species or call type of a feature vector from its K closest neighbours in feature space: the class most common among the K nearest neighbours is assigned to the new feature vector.

2.5.3. Random forest (RF)

RF is a tree-based algorithm [16] that builds a specified number of classification trees without pruning. Nodes are split on a random draw of m features from the entire feature set M, and each tree is built from a bootstrapped sample of the training data.

For each classifier, the parameters were optimised by grid search to achieve the best overall performance. The number of nearest neighbours was varied from 2 to 15; the K-NN classifier had the highest accuracy with 12 nearest neighbours. For random forest, we varied the number of trees from 20 to 120 in steps of 10, and found that the default setting of 100 trees provided the highest accuracy.

2.6. Performance statistics

The dataset was first divided into five folds; four folds were used as training data and the remaining fold for testing. The performance of the proposed frog call classification system was evaluated with quantitative classification metrics: accuracy, sensitivity, and specificity, defined as

Sensitivity = TP / (TP + FN)    (3)

Specificity = TN / (TN + FP)    (4)

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (5)

where TP denotes true positives, FP false positives, TN true negatives, and FN false negatives. Because the number of instances varied among frog species, a weighted classification accuracy (WACC) and analogous weighted metrics were used, defined as

Weighted Metric = Σ_{n=1}^{N} Metric(n) · (n/N)    (6)

where Metric is the average accuracy, precision, or specificity, n is the index of the within-species or species-specific call class, and N is the total number of species or within-species call types.
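The feature ranking of Section 2.4.2 can be sketched as follows. This is an illustrative Python implementation of the FDR of Eq. (2) on synthetic two-class data, not the authors' code; the class means and variances are placeholders.

```python
# Sketch of the Fisher discriminant ratio (Eq. (2)) for ranking features.
import numpy as np

def fdr(a, b):
    """FDR(i) = (mu_a,i - mu_b,i)^2 / (var_a,i + var_b,i), per column i."""
    return (a.mean(axis=0) - b.mean(axis=0)) ** 2 / (a.var(axis=0) + b.var(axis=0))

rng = np.random.default_rng(1)
# Three synthetic features: only feature 0 has well-separated class means.
class_a = rng.normal(0.0, 1.0, size=(100, 3))
class_b = rng.normal([3.0, 0.5, 0.0], 1.0, size=(100, 3))

scores = fdr(class_a, class_b)
ranking = np.argsort(scores)[::-1]   # most discriminative feature first
print(ranking)                       # feature 0 should rank first
```

In the paper this ranking is applied to all 48 feature dimensions, and features above an FDR threshold (e.g. 0.2 or 0.3 in Table 5) are retained.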

3. Results

3.1. Feature reduction analysis

To give an intuitive sense of the power of specific features for our classification tasks, boxplots of the MFCCs and LFCCs are provided in Fig. 3. If the ranges of two features in the boxplots are clearly separable, using both features often leads to better discrimination than when their ranges overlap.


Fig. 3. Boxplots of the Mel-frequency cepstral coefficient and linear frequency cepstral coefficient values (horizontal axes: MFCC index and LFCC index, 1–18). Red lines are medians; box edges are the 25% and 75% quantiles. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)


Fig. 4. Bi-plot of PC coefficients and PC scores for species-specific call classification (horizontal axis: Component 1; vertical axis: Component 2).

We can see from Fig. 3 that the first ten values of both the MFCCs and LFCCs are separately distributed, whereas the ranges of the last five values overlap substantially. Consequently, it is necessary to select important features and reduce the dimension of the feature vectors.

To visualise both the orthonormal PC coefficients for each feature and the PC scores for each frog syllable in a single plot, a bi-plot of PC1 and PC2 is shown in Fig. 4. Each feature is represented by a vector whose direction and length indicate how the feature contributes to the two PCs. PC1 has positive coefficients for features such as MFCC1 and MFCC10, but negative coefficients for features such as LFCC1 and MFCC11; PC1 therefore separates species-specific calls that have high values of the positively weighted features from those with high values of the negatively weighted features.

To rank the features, the FDR was computed for all 48 feature dimensions; the normalised FDR values are shown in Fig. 5. The five features contributing most to species classification were all cepstral features, indicating that frequency-domain information provided better discrimination than time-domain information for species recognition. Nevertheless, the features that contributed least to call discrimination were also cepstral features. For classifying call variation within L. wotjulumensis, by contrast, the first and fourth most important features were time-domain features (Tsallis entropy and zero-crossing rate). Thus, time-domain features were more important than frequency-domain features for discriminating among calls within L. wotjulumensis.

Fig. 5. Normalised FDR values of all 48 acoustic feature dimensions: (a) species-specific call classification and (b) within-species call classification. The top-ranked features for species-specific call classification are LFCC1, MFCC1, MFCC2, MFCC3, LFCC5, MFCC5, MFCC12, LFCC4, TE, SX, MFCC11, and LFCC12; the top-ranked features for within-species call classification are TE, FY, MFCC7, ZR, SC, MFCC3, MFCC17, MFCC11, MFCC2, BW, MFCC13, and MFCC6.

3.2. Species-specific call classification

The results using all features for species-specific call classification are shown in Table 3. The best classification accuracy was obtained by random forest (84.0%), higher than naive Bayes (80.0%) and K-nearest neighbour (81.2%). All results were compared against manual annotation of the sample recordings. These results are in accordance with our previous study [38], in which the random forest method also produced the highest classification accuracy. The accuracy obtained with random forest was not significantly higher than that of the K-nearest neighbour classifier (Z = −0.98, P = 0.32), but was significantly higher than that of the naive Bayes classifier (Z = −3.82, P < 0.001).

The confusion matrix for random forest using all features is shown in Table 4. The classification accuracies for L. rothii, L. wotjulumensis, U. inundata, and L. bicolor were 82.0%, 81.3%, 96.8%, and 90.1%, respectively.

PCA and FDR were applied to all 48 feature dimensions. Table 5 shows the classification results using the top five and top seven features selected by FDR, and using the first five and first seven PCs; for these assessments we used the random forest classifier, as it showed the best performance (Table 3). The highest classification accuracy was achieved using all features (84.0%). The accuracies obtained using the top five (FDR ⩾ 0.3) and top seven (FDR ⩾ 0.2) features were 79.2% and 81.6%; for the first five and seven PCs, the accuracies were 79.8% and 80.3%. Thus, classification accuracy after dimension reduction was slightly lower than when using all features.

3.3. Within-species call classification

The confusion matrix for classification of the four within-species call types of L. wotjulumensis (normal, click, response, and long trill), obtained with the random forest classifier, is shown in Table 6. The accuracy, specificity, and sensitivity for within-species call classification were 83.7%, 70.9%, and 78.1%, respectively (see Table 7).

3.4. Classification of combined within-species and species-specific calls

We first classified species calls and the call types within L. wotjulumensis separately. We then classified all four species together with the four call types of L. wotjulumensis in a single step. Surprisingly, the single-step classification accuracy (93.0%) was much higher than when species and call types were classified separately. This might have occurred because calls (and call types) of different species overlapped in time: when species calls and call types were classified separately, features from only a single domain (either the time domain or the frequency domain) contributed strongly to each classification, producing less accurate results. When within-species and species-specific calls were classified together, both time-domain and frequency-domain features contributed, producing more accurate classification.
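As a check on the per-class figures quoted in Section 3.2, the accuracies can be recomputed from the Table 4 confusion matrix by treating each species one-vs-rest and applying Eq. (5); the printed values (82.1%, 81.3%, 96.9%, 90.1%) agree with the reported 82.0%, 81.3%, 96.8%, and 90.1% to within rounding. This sketch is illustrative and not part of the original analysis.

```python
# Per-class (one-vs-rest) accuracies from the Table 4 confusion matrix.
# Rows: true species; columns: classified as.
import numpy as np

species = ["L. rothii", "L. wotjulumensis", "U. inundata", "L. bicolor"]
cm = np.array([
    [2216,  287,  6,  74],   # L. rothii (2583 instances)
    [ 484, 1195,  6, 118],   # L. wotjulumensis (1803)
    [  32,   56, 50,  73],   # U. inundata (211)
    [ 162,  139,  9, 923],   # L. bicolor (1233)
])

total = cm.sum()
for i, name in enumerate(species):
    tp = cm[i, i]
    fn = cm[i].sum() - tp          # true class i, predicted otherwise
    fp = cm[:, i].sum() - tp       # other classes predicted as i
    tn = total - tp - fn - fp
    acc = (tp + tn) / total        # Eq. (5), one-vs-rest
    print(f"{name}: {100 * acc:.1f}%")
```

The same one-vs-rest computation on Table 6 reproduces the 97.6% long trill accuracy discussed in Section 4.1.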

different species but also call types within a species. For species-specific call classification, we achieved the best classification result for U. inundata, which might be caused by the more instance of calls of this species than others. For within-species call classification, we found that the most common mistake was misclassification of response calls as normal calls, probably because there is a high structural similarity between response calls and normal calls. Response calls consist of adjacent normal calls made by two different frog individuals responding to each other. The time of the second call after the first is so close to and they over overlap. We found that response calls were mostly classified as normal calls in spite of a light change in call structure due to adjacency. In contrast, long trill calls were classified with high accuracy (97.6%). The longer the trills are, the more distinct they are in structure compared to other call types. For L. wotjulumensis, long trills may be up to 18 s, making them distinct from other call types, which were 0.5–2 s in length. We found that the highest classification performance was achieved using the random forest classifier. Frequency-domain features were best used to distinguish the calls of different species, while time-domain features were most useful to distinguish among the call types of a single species (Table 8). Our results suggest that L. wotjulumensis vary the temporal components of calls for various purposes, while species are most easily distinguished using frequency components of calls.

Table 3 The performance of three common classifiers using all features. Classifiers

Accuracy (%)

Specificity (%)

Sensitivity (%)

Naive Bayesian (NB) K-nearest neighbour (K-NN) Random forest (RF)

80.0 81.2 84.0

85.7 84.5 85.8

64.4 69.6 75.2

Table 4 Confusion matrix of species-specific call classification of RF using all features. Classified as →

L. rothii

L. wotjulumensis

U. inundata

L.bicolor

L. rothii L. wotjulumensis U. inundata L. bicolor

2216 484 32 162

287 1195 56 139

6 6 50 9

74 118 73 923

Table 5 Call species classification results with features after dimension reduction using PCA and FDR. # of features

Accuracy (%)

All original 48 features

Specificity (%)

Sensitivity (%)

84

85.8

75.2

PCA analysis

First five PCs First seven PCs

79.8 80.3

82.8 82.8

68.3 69.1

FDR analysis

Top five features Top seven features

79.2

81.9

68.1

81.6

83.9

71.7

4.2. Within-species call classifications and frog communication behaviour The fact that differences in the frequency, and other frequency related features of calls, are best used to distinguish among species in machine learning algorithms is consistent with the predictions of the acoustic niche hypothesis. The acoustic niche hypothesis predicts that, to allow optimal transmissions of calls to receivers, species require an acoustic niche, which might be a specialized frequency band or a specific time period, in which to call [21]. With an acoustic niche, optimal propagation of calls can occur, preventing call masking by other sounds. The hypothesis is controversial, as it is difficult to determine whether there is evidence of selection against overlap, or whether communities of sounds assemble randomly, and there is some avoidance of overlap accidentally [4]. Other studies have found evidence contradicting the acoustic niche hypothesis, as, for example, bird calls overlaps extensively in morning chorus[32]. We cannot provide evidence that the range of calls in our frog community was not assembled by climate change, nor can we provide evidence of avoidance of overlap when species are together, and no such avoidance when they are apart, both of which would be required to demonstrate that the acoustic niche hypothesis was supported. We can, however, argue that lack of species call overlap in the frequency

Table 6 Confusion matrix for classification of four within-species call types of L. wotjulumensis. Classified as →

Normal call

Response call

Click call

Long trill call

Normal call Response call Click call Long trill call

1053 189 109 19

36 99 1 8

15 1 134 0

9 7 0 123

4. Discussion 4.1. Classification performance We found that combining signal processing and machine learning algorithms allowed us to successfully classify not only the calls of 84

Applied Acoustics 131 (2018) 79–86

J. Xie et al.

Table 7
Confusion matrix for classifying combined within-species and species-specific calls (rows: actual class; columns: classified as; Normal, Response, Click and Long trill are call types of L. wotjulumensis).

Classified as →     L. rothii   U. inundata   L. bicolor   Normal   Response   Click   Long trill
L. rothii                2478             4           93        8          0       0            0
U. inundata                56            60           87        8          0       0            0
L. bicolor                191            13         1024        4          1       0            0
L. wotjulumensis
  Normal                   44             0            9     1015         19      17            9
  Response                 50             0            1      163         75       1            6
  Click                     9             0            9      104          1     121            0
  Long trill               42             0            0        9          5       0           94

Table 8
Classification performance for combined within-species and species-specific calls using the three most important frequency-domain and time-domain features, respectively.

Top three features                 Accuracy (%)   Specificity (%)   Sensitivity (%)
Call species   Frequency-domain        76.3            79.5              63.6
               Time-domain             70.5            75.0              54.6
Call types     Frequency-domain        71.0            55.7              60.0
               Time-domain             77.4            67.3              68.6

domain is consistent with specialised acoustic niche occupation, and allows species to optimally propagate their calls for communication. Thus, in our system, even though the calls of different species overlapped in time, call masking did not occur, because of the variation in frequencies. Time-domain features, on the other hand, were more important for classifying within-species calls. This may have occurred because dominant frequency is a species-specific trait, while pulse rate and call length, both of which are time-domain features, can be varied to create different call types within a species.

5. Conclusions and limitations

Our results demonstrate that combining signal processing techniques with machine learning algorithms has considerable potential for the study of frog communication. We reduced the dimensionality of call features, and achieved high performance for call classification. Temporal features were more important for classifying call types within a species, while frequency-domain features were more useful for distinguishing among the calls of different species. The random forest classifier was the best classifier, as it was both robust and high-performing, achieving better results than the naive Bayes and k-nearest neighbour classifiers. We were able to achieve high classification accuracy for four call types of one species, L. wotjulumensis. We found that using both frequency-domain and temporal features achieved a higher overall classification accuracy for distinguishing among the calls of different species and the different types of calls of a single species. Compared to the naive Bayes and k-nearest neighbour classifiers, the random forest classifier, applied to suitable acoustic features, provided an ideal basis for the development of individual recognition software. For feature extraction, we used 14 features with a dimension of 48, eight of which were frequency-domain features and the rest time-domain. Frequency-domain features were best for classifying calls among species, whereas time-domain features were more useful for classifying call types within species. It is worth noting that selecting suitable features can achieve acceptable performance with much lower dimensionality. We used two techniques, PCA and FDR, to reduce dimensionality and select important features. For some pattern recognition applications, such as acoustic event detection and speech recognition [30,27,41], there has been a tendency to unify the feature extraction and classification steps into a single machine learning model, because of recent developments in deep learning (learned feature representations). However, a feature learning approach has not yet been successful in classifying frog calls, possibly because of the lack of a large-scale dataset. Therefore, a future research direction would be to prepare a large dataset to which deep learning techniques could be applied, to improve the current classification performance.

One drawback of this study was that the frog community studied included only a single species characterised by several call types, namely L. wotjulumensis, making it impossible to test the generality of our conclusions on identifying call types from temporal features. A solution to this problem is to find other frog communities containing more species with varied call modifications. Another drawback was that all frog calls were manually segmented in this study; it would be worthwhile to develop an automatic frog call segmentation method to aid ecologists in frog call analysis.

6. Applications

This method could be used to aid continuous, long-duration ecological monitoring of frog communities through the analysis of frog calling behaviour. It allows more accurate automated classification of frog species and reduces the need for ecologists to manually listen to hours, weeks or even years of recordings to identify calling frogs. The proportion of false positives is greatly outweighed by the number of correctly classified frog calls. In addition, the ability to distinguish among call types of the same species allows ecologists to examine patterns in the calling behaviour of individuals as well as species. Animal behaviour research may also benefit from the discrimination of calls within species, reducing the listening effort required to retrieve information on changes in the dynamics of calling behaviour.
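As an illustration of how the per-class sensitivity, specificity, and overall accuracy reported above can be derived from a confusion matrix such as Table 7, here is a minimal sketch. The counts are taken from Table 7; the helper functions themselves are ours, for illustration only, and are not part of the original study's code.

```python
# Per-class metrics from a confusion matrix (rows: actual class, columns: predicted).
# Counts taken from Table 7; helper functions are an illustrative sketch.
LABELS = ["L. rothii", "U. inundata", "L. bicolor",
          "Normal", "Response", "Click", "Long trill"]
MATRIX = [
    [2478,  4,   93,    8,   0,   0,  0],
    [  56, 60,   87,    8,   0,   0,  0],
    [ 191, 13, 1024,    4,   1,   0,  0],
    [  44,  0,    9, 1015,  19,  17,  9],
    [  50,  0,    1,  163,  75,   1,  6],
    [   9,  0,    9,  104,   1, 121,  0],
    [  42,  0,    0,    9,   5,   0, 94],
]

def class_metrics(matrix, k):
    """Return (sensitivity, specificity) for class index k."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                 # actual k, predicted as something else
    fp = sum(row[k] for row in matrix) - tp  # predicted k, actually something else
    tn = total - tp - fn - fp
    return tp / (tp + fn), tn / (tn + fp)

def overall_accuracy(matrix):
    """Fraction of all calls on the diagonal (correctly classified)."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total

for k, label in enumerate(LABELS):
    sens, spec = class_metrics(MATRIX, k)
    print(f"{label}: sensitivity={sens:.3f}, specificity={spec:.3f}")
print(f"overall accuracy={overall_accuracy(MATRIX):.3f}")
```

Note that the exact figures reported in the paper depend on the authors' own evaluation protocol; the sketch only shows the standard definitions of the metrics.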
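The Fisher discriminant ratio (FDR) used here for feature selection can be sketched as follows: for a single feature and two classes it is (μ1 − μ2)² / (σ1² + σ2²), so features whose class means are well separated relative to their spread score highly. The feature names and values below are invented for illustration and are not from the study's dataset.

```python
# Sketch of Fisher discriminant ratio (FDR) feature ranking for two classes.
# Feature values are invented for illustration only.
from statistics import mean, variance

def fisher_discriminant_ratio(a, b):
    """FDR of one feature between two classes: (mu_a - mu_b)^2 / (var_a + var_b)."""
    denom = variance(a) + variance(b)
    return (mean(a) - mean(b)) ** 2 / denom if denom > 0 else float("inf")

# A dominant-frequency-like feature that separates two hypothetical species well,
# and a noisy feature that does not.
dominant_freq = {"species_a": [2.4, 2.5, 2.6, 2.5],
                 "species_b": [4.1, 4.0, 4.2, 3.9]}
noise_feature = {"species_a": [1.0, 1.4, 0.8, 1.2],
                 "species_b": [1.1, 0.9, 1.3, 1.0]}

for name, feat in [("dominant_freq", dominant_freq), ("noise", noise_feature)]:
    fdr = fisher_discriminant_ratio(feat["species_a"], feat["species_b"])
    print(f"{name}: FDR = {fdr:.2f}")
```

Ranking features by this score and keeping the top few is one simple way to reach the "acceptable performance with much lower dimensionality" noted in the conclusions.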

Acknowledgements

Thanks to the QUT Eco-acoustics Research Group for providing the datasets used in this experiment, and to the Wet Tropics Management Authority, Queensland, Australia, for their support. Thanks also to the anonymous reviewers for their careful work and thoughtful suggestions, which have helped improve this paper substantially. All funding for this research was provided by the Queensland University of Technology, the Indonesian Endowment Fund for Education (LPDP) and the China Scholarship Council (CSC).

Appendix A. Supplementary material

Supplementary data associated with this article can be found in the online version at http://dx.doi.org/10.1016/j.apacoust.2017.10.024.


References

[1] Bedoya C, Isaza C, Daza JM, López JD. Automatic recognition of anuran species based on syllable identification. Ecol Inform 2014;24:200–9.
[2] Bradbury JW, Vehrencamp SL. Principles of animal communication; 1998.
[3] Bridges AS, Dorcas ME, Montgomery W. Temporal variation in anuran calling behavior: implications for surveys and monitoring programs. Copeia 2000;2000(2):587–92.
[4] Chek AA, Bogart JP, Lougheed SC. Mating signal partitioning in multi-species assemblages: a null model test using frogs. Ecol Lett 2003;6(3):235–47.
[5] Chen W-P, Chen S-S, Lin C-C, Chen Y-Z, Lin W-C. Automatic recognition of frog calls using a multi-stage average spectrum. Comput Math Appl 2012;64(5):1270–81.
[6] Cortopassi KA, Bradbury JW. The comparison of harmonically rich sounds using spectrographic cross-correlation and principal coordinates analysis. Bioacoustics 2000;11(2):89–127.
[7] Dayou J, Han NC, Mun HC, Ahmad AH, Muniandy SV, Dalimin MN. Classification and identification of frog sound based on entropy approach. In: International conference on life science and technology (ICLST 2011); 2011. p. 7–9.
[8] El Ayadi M, Kamel MS, Karray F. Survey on speech emotion recognition: features, classification schemes, and databases. Pattern Recogn 2011;44(3):572–87.
[9] Gage SH, Farina A. Ecoacoustics challenges. Ecoacoust: Ecol Role Sounds 2017:313.
[10] Gerhardt HC. The evolution of vocalization in frogs and toads. Annu Rev Ecol Syst 1994:293–324.
[11] Gerhardt HC, Huber F. Acoustic communication in insects and anurans: common problems and diverse solutions. University of Chicago Press; 2002.
[12] Gingras B, Fitch WT. A three-parameter model for classifying anurans into four genera based on advertisement calls. J Acoust Soc Am 2013;133(1):547–59.
[13] Grafe TU. A function of synchronous chorusing and a novel female preference shift in an anuran. Proc Roy Soc Lond B: Biol Sci 1999;266(1435):2331–6.
[14] Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH. The WEKA data mining software: an update. ACM SIGKDD Explor Newslett 2009;11(1):10–8.
[15] Han NC, Muniandy SV, Dayou J. Acoustic classification of Australian anurans based on hybrid spectral-entropy approach. Appl Acoust 2011;72(9):639–45.
[16] Ho TK. The random subspace method for constructing decision forests. IEEE Trans Pattern Anal Mach Intell 1998;20(8):832–44.
[17] Huang C-J, Chen Y-J, Chen H-M, Jian J-J, Tseng S-C, Yang Y-J, et al. Intelligent feature extraction and classification of anuran vocalizations. Appl Soft Comput 2014;19(0):1–7.
[18] Kasten EP, McKinley PK, Gage SH. Ensemble extraction for classification and detection of bird species. Ecol Inform 2010;5(3):153–66.
[19] Kirschel AN, Earl DA, Yao Y, Escobar IA, Vilches E, Vallejo EE, et al. Using songs to identify individual Mexican antthrush Formicarius moniliger: comparison of four classification methods. Bioacoustics 2009;19(1–2):1–20.
[20] Kogan JA, Margoliash D. Automated recognition of bird song elements from continuous recordings using dynamic time warping and hidden Markov models: a comparative study. J Acoust Soc Am 1998;103(4):2185–96.
[21] Krause B. The great animal orchestra: finding the origins of music in the world's wild places. Little, Brown; 2012.
[22] Larose DT. Discovering knowledge in data: an introduction to data mining. John Wiley & Sons; 2014.
[23] Lee C-H, Chou C-H, Han C-C, Huang R-Z. Automatic recognition of animal vocalizations using averaged MFCC and linear discriminant analysis. Pattern Recogn Lett 2006;27(2):93–101.
[24] Márquez R, Bosch J, Eekhout X. Intensity of female preference quantified through playback setpoints: call frequency versus call rate in midwife toads. Anim Behav 2008;75(1):159–66.
[25] Mellinger DK, Clark CW. Recognizing transient low-frequency whale sounds by spectrogram correlation. J Acoust Soc Am 2000;107(6):3518–29.
[26] Noda JJ, Travieso CM, Sánchez-Rodríguez D. Methodology for automatic bioacoustic classification of anurans based on feature fusion. Exp Syst Appl 2016;50:100–6.
[27] Parascandolo G, Huttunen H, Virtanen T. Recurrent neural networks for polyphonic sound event detection in real life recordings. In: 2016 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE; 2016. p. 6440–4.
[28] Rish I. An empirical study of the naive Bayes classifier. In: IJCAI 2001 workshop on empirical methods in artificial intelligence. Vol. 3. IBM New York; 2001. p. 41–6.
[29] Ryan MJ, Fox JH, Wilczynski W, Rand AS. Sexual selection for sensory exploitation in the frog Physalaemus pustulosus; 1990.
[30] Schwarz A, Huemmer C, Maas R, Kellermann W. Spatial diffuseness features for DNN-based speech recognition in noisy and reverberant environments. In: 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE; 2015. p. 4380–4.
[31] Stevenson BC, Borchers DL, Altwegg R, Swift RJ, Gillespie DM, Measey GJ. A general framework for animal density estimation from acoustic detections across a fixed microphone array. Methods Ecol Evol 2015;6(1):38–48.
[32] Tobias JA, Planqué R, Cram DL, Seddon N. Species interactions and the structure of complex communication networks. Proc Natl Acad Sci 2014;111(3):1020–5.
[33] Truskinger A, C.-F.M.. R.P. Acoustic workbench (version 19.2) [computer software]. Brisbane: QUT Ecoacoustics Research Group; 2016. Retrieved from .
[34] Urazghildiiev IR, Clark CW. Acoustic detection of North Atlantic right whale contact calls using the generalized likelihood ratio test. J Acoust Soc Am 2006;120(4):1956–63.
[35] Welch AM, Semlitsch RD, Gerhardt HC. Call duration as an indicator of genetic quality in male gray tree frogs. Science 1998;280(5371):1928–30.
[36] Wells KD. The ecology and behavior of amphibians. University of Chicago Press; 2010.
[37] Wimmer J, Towsey M, Roe P, Williamson I. Sampling environmental acoustic recordings to determine bird species richness. Ecol Appl 2013;23(6):1419–28.
[38] Xie J, Towsey M, Zhang J, Roe P. Acoustic classification of Australian frogs based on enhanced features and machine learning algorithms. Appl Acoust 2016;113:193–201.
[39] Xie J, Towsey M, Zhang J, Roe P. Adaptive frequency scaled wavelet packet decomposition for frog call classification. Ecol Inform 2016;32:134–44.
[40] Xie J, Towsey M, Zhang J, Roe P. Frog call classification: a survey. Artif Intell Rev 2016:1–17.
[41] Xu Y, Huang Q, Wang W, Plumbley MD. Hierarchical learning for DNN-based acoustic scene classification; 2016. Also available at: arXiv preprint arXiv:1607.03682.
[42] Zhou X, Garcia-Romero D, Duraiswami R, Espy-Wilson C, Shamma S. Linear versus mel frequency cepstral coefficients for speaker recognition. In: 2011 IEEE workshop on automatic speech recognition and understanding (ASRU). IEEE; 2011. p. 559–64.