Acoustic classification of frog within-species and species-specific calls


Applied Acoustics 131 (2018) 79–86


Jie Xie a,c,⁎,1, Karlina Indraswari a,1, Lin Schwarzkopf b, Michael Towsey a, Jinglan Zhang a, Paul Roe a

a Electrical Engineering and Computer Science School, Queensland University of Technology, Brisbane, Australia
b College of Science and Engineering, James Cook University, Townsville, Australia
c Electrical and Computer Engineering, University of Waterloo, Waterloo, Canada

Keywords: Soundscape ecology; Frog community interactions; Acoustic features; Machine learning algorithms

Abstract

There have been various studies using automated recognisers, based on acoustic features and machine learning algorithms, to classify frog species within a chorusing community. Such studies rarely consider within-species call variation in the classification process. Individual frog species may make a range of different calls, with different purposes. Including call variants in automated recognition has the potential not only to increase the accuracy of classifying calls to species, but also to provide information on frog calling behaviour within species. Here we use acoustic feature extraction and machine learning algorithms (1) to investigate the importance of acoustic features for identifying species-specific calls, and (2) to determine which acoustic features can be used to classify within-species calls. Our method was tested on four frog species (Litoria bicolor, Litoria rothii, Litoria wotjulumensis, and Uperoleia inundata) and four call types of L. wotjulumensis (normal, click, response, and long trill). Mean classification accuracy was high: 84.0% at the species level and 83.7% at the call type level. The overall classification accuracy reached 93.0% when the four call types of L. wotjulumensis were treated as individual classes and combined with the other three frog species. Two techniques, principal component analysis and the Fisher discriminant ratio, were used for dimension reduction and to select important features for discriminating among calls of different species and call types within species. In conclusion, our proposed classification mechanism could effectively classify not only different frog species but also different call types within the same species. Moreover, we found that time-domain features were important for classification of within-species calls, whereas frequency-domain features were more useful for classification of species-specific calls.

1. Introduction

Acoustic signals are used by many animals to convey information [2]. Such signals can be used by conspecifics or other species, including humans, to extract information about species presence or absence, or more detailed information on individuals, such as size, sex, or fitness. Recently, advances in recording and storage technology have allowed researchers to collect large amounts of acoustic data, which can be used for a variety of ecological and environmental studies (e.g. [3]). It is now easy to collect large volumes of acoustic recordings, but extracting certain types of information from them, especially species presence, absence, and activity, is extremely time consuming [37]. As a result, a growing number of studies have examined automated techniques for extracting such information from acoustic recordings (e.g. [20,6,25,34,19,18]). Such studies typically focus on the acoustic




signals made by mammals and birds, although a few studies have examined the calls of anurans [24]. Anurans make excellent subjects for automated extraction of information because their calls tend to be relatively simple and repetitive compared with those of mammals and birds, and because they call at night, when there is less background noise than during the day [10]. In addition to species-specific calls, individual frogs within a species may make up to four types of calls: advertisement calls, reciprocation calls, release calls, and distress calls [36]. Advertisement calls are typically the ones used for extracting acoustic features to identify species [31,39]. However, recognition of species alone may not be sufficient for ecologists interested in the behaviour of species within frog communities. A single species tends to have multiple call modifications. Variation in call properties within species may be manifest as changes in fundamental frequency, call duration, or call

⁎ Corresponding author at: Electrical Engineering and Computer Science School, Queensland University of Technology, Brisbane, Australia. E-mail address: [email protected] (J. Xie).
1 These authors contributed equally to this work.

http://dx.doi.org/10.1016/j.apacoust.2017.10.024 Received 25 April 2017; Received in revised form 14 October 2017; Accepted 19 October 2017 0003-682X/ © 2017 Elsevier Ltd. All rights reserved.


Fig. 1. Our classification system for studying within-species calls and species-specific calls. PCA and FDR denote principal component analysis and the Fisher discriminant ratio; NB, K-NN, and RF denote naive Bayes, K-nearest neighbour, and random forest. The two dimension reduction techniques, PCA and FDR, are used selectively in the analysis, and the three classifiers are compared to find the best one.
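The pipeline of Fig. 1 can be sketched end-to-end as follows. This is a hypothetical scikit-learn analogue of the Matlab/Weka system described later in the paper, with random data standing in for the 48-dimensional acoustic features; the parameter choices (K = 12 neighbours, 100 trees, first seven PCs, five-fold cross-validation) follow Sections 2.5 and 2.6.

```python
# Hypothetical sketch of the Fig. 1 pipeline: features -> PCA -> classifier
# comparison (NB, K-NN, RF). Random data stands in for real frog-call features.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 48))      # 48-dimensional feature vectors (placeholder)
y = rng.integers(0, 4, size=200)    # four species labels (placeholder)

classifiers = {
    "NB": GaussianNB(),
    "K-NN": KNeighborsClassifier(n_neighbors=12),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
}
for name, clf in classifiers.items():
    model = make_pipeline(PCA(n_components=7), clf)   # first seven PCs
    scores = cross_val_score(model, X, y, cv=5)       # five-fold cross-validation
    print(f"{name}: {scores.mean():.3f}")
```

With real features in place of the random placeholders, the loop reproduces the comparison of the three classifiers that the paper reports in Section 3.2.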

complexity, by adding or removing certain call components [11,29,13].

A variety of previous studies have used acoustic features for automated recognition of frogs [5,12,17,1,39,26,38,9]. In automated recognition, the process is typically broken into two tasks: signal detection and signal classification. In signal detection, structured sounds of interest are separated from random background noise; signal classification involves labelling sounds into biologically relevant groups (usually species) [8]. Calls are broken down into acoustic features, which are then classified as belonging to a particular species using machine learning algorithms. The methods used for these two processes vary and are under active development [40]. For example, in [5], the duration of each individual call was used for pre-classification of species, because individual frog calls of different species tend to have different durations [35]; a multi-stage average spectrum was then used to recognise species via template matching. Xie et al. [39] extracted a novel cepstral feature set using adaptive frequency-scaled wavelet packet decomposition (WPD). The decomposition tree of the WPD was constructed adaptively from a frequency scale generated by applying k-means clustering to dominant frequencies. Because the frequency scale was adapted to the frog species being classified, it could better distinguish the frequency components of different species, and its classification performance improved over fixed frequency scales such as the Mel scale [39]. Both [26] and [38] fused features from different domains to classify frog calls: Noda et al. [26] proposed a fusion of temporal and cepstral features, whereas Xie et al. [38] combined features from the temporal, spectral, and cepstral domains. Compared with classification using features from one domain, experimental results demonstrated higher classification accuracy when features from different domains were combined. Examining more features, to discriminate within-species calls as well as species-specific calls, should therefore improve species recognition.

In this study we propose a robust classification system to study frog within-species calls and species-specific calls (Fig. 1). We examined four widely distributed Australian frog species from a single community: Litoria bicolor, Litoria rothii, Litoria wotjulumensis, and Uperoleia inundata. In addition, four within-species call types from a single species, L. wotjulumensis, were classified. We investigated a set of 14 acoustic features for classification and used principal component analysis (PCA) and the Fisher discriminant ratio (FDR) to reduce feature dimension and identify important components. Finally, we created a large tagged frog call dataset including both within-species and species-specific calls, which registered users can access through our group website2.

2. Materials and methods

2.1. Study site and species

Recordings used for this study were collected from Bickerton Island, located near Groote Eylandt (Fig. 2), Northern Territory, Australia (Latitude: −13.77°, Longitude: 136.19°). We sampled calls from four frog species: northern dwarf tree frogs (Litoria bicolor), northern laughing tree frogs (Litoria rothii), watjulum frogs (Litoria wotjulumensis), and floodplain toadlets (Uperoleia inundata). We sampled recordings on three days, between 11 and 13 December 2013, and between 20:00 and 21:00. Frogs call principally at night, so these times were selected to ensure that the sun had completely set at the start of the recording sample. We listed, sampled, and labelled calls within this window. The numbers of call instances identified for L. rothii, L. wotjulumensis, U. inundata, and L. bicolor were 2583, 1803, 211, and 1233, respectively. In addition, we could identify four within-species call types of L. wotjulumensis, both visually and audibly. We refer to these within-species calls as: normal call (1113),


2 https://www.ecosounds.org/.


Fig. 2. Location of Bickerton Island.

Table 1
Description of L. wotjulumensis within-species calls.

Normal call: The common advertisement call of the species, without modification in the spectral or temporal domain.
Click call: Click-like notes produced by a male frog.
Long trill call: A series of continuous pulses that make up a trill-like sound. The long trill is assumed to be a modification of the common advertisement call, predicted to be either an aggressive response or a form of call plasticity, a calling behaviour that allows males to appear of higher quality than they actually are.
Response call: Made when two frogs call consecutively. The interval between the first and second calls is very short, and at times the calls may even overlap; this overlap could cause the pair to be misidentified as a different call type rather than as two individual frogs calling consecutively. A common characteristic of a response call is that the dominant frequency of the second call either increases or decreases in response to the first.

Table 2
Acoustic features for the recognition of frog within-species calls and species-specific calls. An asterisk denotes a time-domain feature; the others are frequency-domain features. The total dimension of all 14 features is 48.

1. Mel-frequency cepstral coefficients (MFCCs; dim. 18): short-term power spectrum of a sound based on a linear cosine transform of a log power spectrum on a nonlinear Mel scale of frequency [23].
2. Linear frequency cepstral coefficients (LFCCs; dim. 18): short-term power spectrum of a sound based on a linear cosine transform of a log power spectrum on a linear scale of frequency [26].
3. Spectral centroid (SC; dim. 1): amplitude-weighted mean of the frequencies present in the frog syllable [17].
4. Spectral flux (SX; dim. 1): frame-to-frame spectral difference, characterising the change in the shape of the spectrum [17].
5. Spectral rolloff (SR; dim. 1): amount of right-skewness of the power spectrum [17].
6. Spectral flatness (SF; dim. 1): ratio between the geometric and arithmetic means, indicating whether a frequency spectrum is smooth or spiky [17].
7. Signal bandwidth (BW; dim. 1): difference between the upper and lower cut-off frequencies [17].
8. Fundamental frequency (FY; dim. 1): average frequency of the amplitude peaks over all frames within one frog syllable [38].
9. Averaged energy* (AE; dim. 1): squared signal amplitude accumulated and divided by the length of the signal [38].
10. Zero-crossing rate* (ZR; dim. 1): number of time-domain zero-crossings in each individual frog syllable [38].
11. Oscillation rate* (OSR; dim. 1): click periodicity within a specified frequency band [38].
12. Shannon entropy* (SE; dim. 1): average of the information contents weighted by their probabilities of occurrence [38].
13. Rényi entropy* (RE; dim. 1): a different averaging of the probabilities via one parameter [15].
14. Tsallis entropy* (TE; dim. 1): another generalisation of SE for signal complexity measurement; a high value indicates low complexity [7].

click call (296), response call (244), and long trill call (150). The definitions we used to distinguish the four within-species calls are provided in Table 1.

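To make the feature definitions concrete, the sketch below applies the amplitude normalisation of Section 2.2 (Eq. (1)) and then computes three of the Table 2 descriptors on a synthetic syllable. It is an illustrative Python/numpy reimplementation, not the authors' Matlab code, and the 5%/95% spectral-energy cut-offs used for the bandwidth are an assumption (the paper does not specify its cut-off criterion).

```python
# Sketch of three Table 2 features on a synthetic 2.5 kHz syllable; exact
# definitions in the paper (framing, windowing, cut-offs) may differ.
import numpy as np

fs = 16000                                   # re-sampling rate from Section 2.2
t = np.arange(0, 0.5, 1 / fs)
x = np.sin(2 * np.pi * 2500 * t) * np.hanning(t.size)   # stand-in frog syllable

# Amplitude normalisation to [-1, 1] as in Eq. (1)
y = 2 * (x - x.min()) / (x.max() - x.min()) - 1

# Spectral centroid (SC): amplitude-weighted mean frequency
mag = np.abs(np.fft.rfft(y))
freqs = np.fft.rfftfreq(y.size, 1 / fs)
sc = np.sum(freqs * mag) / np.sum(mag)

# Zero-crossing rate (ZR): sign changes over the whole syllable
zr = int(np.count_nonzero(np.diff(np.signbit(y))))

# Signal bandwidth (BW): upper minus lower cut-off frequency, with the
# cut-offs taken (as an assumption) at 5% and 95% of cumulative energy
cum = np.cumsum(mag ** 2) / np.sum(mag ** 2)
bw = freqs[np.searchsorted(cum, 0.95)] - freqs[np.searchsorted(cum, 0.05)]

print(round(sc), zr, round(bw))
```

For this narrowband tone the centroid lands near 2500 Hz and the bandwidth is small, as expected; real frog syllables would give broader spectra.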

2.2. Call preprocessing

Segmented individual calls were obtained by drawing boxes around frog calls through manual annotation of the recordings on our group's website [33]. Each frog call was re-sampled at 16 kHz and saved in 32-bit monaural format. Re-sampling was used to remove high-frequency components, such as insect calls, and to reduce the computational burden. The amplitude of each frog call was normalised as follows:

y(t) = 2 (x(t) − min(x(t))) / (max(x(t)) − min(x(t))) − 1    (1)

where x(t) is the original frog syllable and min(·) and max(·) denote the minimal and maximal values, respectively.

Feature extraction in this study was conducted in Matlab 2014b (The MathWorks, Inc., Natick, Massachusetts, United States). All machine learning algorithms were run in Weka [14]. Mel-frequency cepstral coefficients (MFCCs) and linear frequency cepstral coefficients (LFCCs) were computed with the LFCC-rastamat toolbox developed by Zhou et al. [42]; the remaining features were programmed by the authors.

2.3. Feature extraction

To fully reflect the characteristics of frog calls, fourteen common acoustic features were investigated; the full list is given in Table 2. These features have been used in previous studies for the recognition of frog species [15,17,26,38]. A window size of 20 ms with 50% overlap between frames was used to calculate the MFCCs and LFCCs. For the other features, the window size was 32 ms with the same overlap. These values were determined experimentally by varying the window size from 5 ms to 1 s and evaluating the results.

2.4. Feature dimension reduction

2.4.1. Principal component analysis (PCA)

PCA is a common technique used to decorrelate feature vectors and transform a high-dimensional feature set into a low-dimensional orthogonal feature space, while retaining the maximum variance of the original high-dimensional feature set. The feature set, consisting of fourteen features, has a dimension of 48 (Table 2). Each resulting orthogonal feature is referred to as a principal component (PC). PCs are ranked by their corresponding eigenvalues, which quantify the variance captured by each PC; consequently, PC1 captures the largest share of the variance of the original feature set, and PC2, which is orthogonal to PC1, captures the next largest share.

2.4.2. Fisher discriminant ratio (FDR)

We also used the Fisher discriminant ratio to reduce the dimensionality of the call features. For each acoustic feature i and pair of classes a and b, the FDR is defined as

FDR(i) = (μ_{a,i} − μ_{b,i})² / (σ²_{a,i} + σ²_{b,i})    (2)

where μ_{a,i} and μ_{b,i} are the means of feature i for frog species or call types a and b, and σ²_{a,i} and σ²_{b,i} are the corresponding variances. A higher FDR value indicates better discrimination by that feature: a feature with a large distance between the class means but low intra-class variance has good discrimination ability.

2.5. Machine learning (ML) algorithms

To verify that frequency-domain information is more important for classifying frog species-specific calls, while time-domain information is more important for distinguishing within-species calls, three standard ML algorithms were used in this study to perform the classification: naive Bayes, K-nearest neighbour, and random forest.

2.5.1. Naive Bayes (NB)

NB is a simple technique for constructing classifiers: models that assign class labels, drawn from some finite set, to problem instances represented as vectors of feature values [28]. There is no single algorithm for training such classifiers, but a family of algorithms based on a common principle: all naive Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable.

2.5.2. K-nearest neighbour (K-NN)

K-NN is a supervised learning algorithm [22] that predicts the species or call type of a feature vector from its K closest neighbours in feature space: the class most common among the K nearest neighbours is assigned to the new feature vector.

2.5.3. Random forest (RF)

RF is a tree-based algorithm [16] that builds a specified number of classification trees without pruning. Nodes are split on a random draw of m features from the entire feature set M, and each tree is built from a bootstrapped sample of the training data.

For each classifier, the parameters were optimised by grid search to achieve the best overall performance. The number of nearest neighbours was varied from 2 to 15; the K-NN classifier had the highest accuracy with 12 nearest neighbours. For random forest, we varied the number of trees from 20 to 120 in steps of 10, and found that the default setting of 100 trees provided the highest accuracy.

2.6. Performance statistics

The dataset was first divided into five folds; four folds were used as training data and the remaining fold for testing. The performance of the proposed frog call classification system was evaluated with quantitative classification metrics: accuracy, sensitivity, and specificity, defined as

Sensitivity = TP / (TP + FN)    (3)

Specificity = TN / (TN + FP)    (4)

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (5)

where TP denotes true positives, FP false positives, TN true negatives, and FN false negatives. Because the number of instances varied among frog species, a weighted classification accuracy (WACC) and analogous weighted metrics were used, defined as

Weighted Metric = Σ_{n=1}^{N} Metric(n) · (n/N)    (6)

where Metric is the average accuracy, precision, or specificity, n is the index of the within-species or species-specific call class, and N is the total number of species or within-species call types.
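The feature ranking of Section 2.4.2 can be sketched as follows. This is an illustrative Python implementation of the FDR of Eq. (2) on synthetic two-class data, not the authors' code; the class means and variances are placeholders.

```python
# Sketch of the Fisher discriminant ratio (Eq. (2)) for ranking features.
import numpy as np

def fdr(a, b):
    """FDR(i) = (mu_a,i - mu_b,i)^2 / (var_a,i + var_b,i), per column i."""
    return (a.mean(axis=0) - b.mean(axis=0)) ** 2 / (a.var(axis=0) + b.var(axis=0))

rng = np.random.default_rng(1)
# Three synthetic features: only feature 0 has well-separated class means.
class_a = rng.normal(0.0, 1.0, size=(100, 3))
class_b = rng.normal([3.0, 0.5, 0.0], 1.0, size=(100, 3))

scores = fdr(class_a, class_b)
ranking = np.argsort(scores)[::-1]   # most discriminative feature first
print(ranking)                       # feature 0 should rank first
```

In the paper this ranking is applied to all 48 feature dimensions, and features above an FDR threshold (e.g. 0.2 or 0.3 in Table 5) are retained.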

3. Results

3.1. Feature reduction analysis

To give an intuitive sense of the power of specific features for our classification tasks, boxplots of the MFCCs and LFCCs are provided in Fig. 3. If the ranges of two features in the boxplots are clearly separable, using both features often leads to better discrimination than when their ranges overlap.


Fig. 3. Boxplots of the Mel-frequency cepstral coefficient and linear frequency cepstral coefficient values (horizontal axes: MFCC index and LFCC index, 1–18). Red lines are medians; box edges are the 25% and 75% quantiles. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)


Fig. 4. Bi-plot of PC coefficients and PC scores for species-specific call classification (horizontal axis: Component 1; vertical axis: Component 2).

We can see from Fig. 3 that the first ten values of both the MFCCs and LFCCs are separately distributed, whereas the ranges of the last five values overlap substantially. Consequently, it is necessary to select important features and reduce the dimension of the feature vectors.

To visualise both the orthonormal PC coefficients for each feature and the PC scores for each frog syllable in a single plot, a bi-plot of PC1 and PC2 is shown in Fig. 4. Each feature is represented by a vector whose direction and length indicate how the feature contributes to the two PCs. PC1 has positive coefficients for features such as MFCC1 and MFCC10, but negative coefficients for features such as LFCC1 and MFCC11; PC1 therefore separates species-specific calls that have high values of the positively weighted features from those with high values of the negatively weighted features.

To rank the features, the FDR was computed for all 48 feature dimensions; the normalised FDR values are shown in Fig. 5. The five features contributing most to species classification were all cepstral features, indicating that frequency-domain information provided better discrimination than time-domain information for species recognition. Nevertheless, the features that contributed least to call discrimination were also cepstral features. For classifying call variation within L. wotjulumensis, by contrast, the first and fourth most important features were time-domain features (Tsallis entropy and zero-crossing rate). Thus, time-domain features were more important than frequency-domain features for discriminating among calls within L. wotjulumensis.

Fig. 5. Normalised FDR values of all 48 acoustic feature dimensions: (a) species-specific call classification and (b) within-species call classification. The top-ranked features for species-specific call classification are LFCC1, MFCC1, MFCC2, MFCC3, LFCC5, MFCC5, MFCC12, LFCC4, TE, SX, MFCC11, and LFCC12; the top-ranked features for within-species call classification are TE, FY, MFCC7, ZR, SC, MFCC3, MFCC17, MFCC11, MFCC2, BW, MFCC13, and MFCC6.

3.2. Species-specific call classification

The results using all features for species-specific call classification are shown in Table 3. The best classification accuracy was obtained by random forest (84.0%), higher than naive Bayes (80.0%) and K-nearest neighbour (81.2%). All results were compared against manual annotation of the sample recordings. These results are in accordance with our previous study [38], in which the random forest method also produced the highest classification accuracy. The accuracy obtained with random forest was not significantly higher than that of the K-nearest neighbour classifier (Z = −0.98, P = 0.32), but was significantly higher than that of the naive Bayes classifier (Z = −3.82, P < 0.001).

The confusion matrix for random forest using all features is shown in Table 4. The classification accuracies for L. rothii, L. wotjulumensis, U. inundata, and L. bicolor were 82.0%, 81.3%, 96.8%, and 90.1%, respectively.

PCA and FDR were applied to all 48 feature dimensions. Table 5 shows the classification results using the top five and top seven features selected by FDR, and using the first five and first seven PCs; for these assessments we used the random forest classifier, as it showed the best performance (Table 3). The highest classification accuracy was achieved using all features (84.0%). The accuracies obtained using the top five (FDR ⩾ 0.3) and top seven (FDR ⩾ 0.2) features were 79.2% and 81.6%; for the first five and seven PCs, the accuracies were 79.8% and 80.3%. Thus, classification accuracy after dimension reduction was slightly lower than when using all features.

3.3. Within-species call classification

The confusion matrix for classification of the four within-species call types of L. wotjulumensis (normal, click, response, and long trill), obtained with the random forest classifier, is shown in Table 6. The accuracy, specificity, and sensitivity for within-species call classification were 83.7%, 70.9%, and 78.1%, respectively (see Table 7).

3.4. Classification of combined within-species and species-specific calls

We first classified species calls and the call types within L. wotjulumensis separately. We then classified all four species together with the four call types of L. wotjulumensis in a single step. Surprisingly, the single-step classification accuracy (93.0%) was much higher than when species and call types were classified separately. This might have occurred because calls (and call types) of different species overlapped in time: when species calls and call types were classified separately, features from only a single domain (either the time domain or the frequency domain) contributed strongly to each classification, producing less accurate results. When within-species and species-specific calls were classified together, both time-domain and frequency-domain features contributed, producing more accurate classification.
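As a check on the per-class figures quoted in Section 3.2, the accuracies can be recomputed from the Table 4 confusion matrix by treating each species one-vs-rest and applying Eq. (5); the printed values (82.1%, 81.3%, 96.9%, 90.1%) agree with the reported 82.0%, 81.3%, 96.8%, and 90.1% to within rounding. This sketch is illustrative and not part of the original analysis.

```python
# Per-class (one-vs-rest) accuracies from the Table 4 confusion matrix.
# Rows: true species; columns: classified as.
import numpy as np

species = ["L. rothii", "L. wotjulumensis", "U. inundata", "L. bicolor"]
cm = np.array([
    [2216,  287,  6,  74],   # L. rothii (2583 instances)
    [ 484, 1195,  6, 118],   # L. wotjulumensis (1803)
    [  32,   56, 50,  73],   # U. inundata (211)
    [ 162,  139,  9, 923],   # L. bicolor (1233)
])

total = cm.sum()
for i, name in enumerate(species):
    tp = cm[i, i]
    fn = cm[i].sum() - tp          # true class i, predicted otherwise
    fp = cm[:, i].sum() - tp       # other classes predicted as i
    tn = total - tp - fn - fp
    acc = (tp + tn) / total        # Eq. (5), one-vs-rest
    print(f"{name}: {100 * acc:.1f}%")
```

The same one-vs-rest computation on Table 6 reproduces the 97.6% long trill accuracy discussed in Section 4.1.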

different species but also call types within a species. For species-specific call classification, we achieved the best classification result for U. inundata, which might be caused by the more instance of calls of this species than others. For within-species call classification, we found that the most common mistake was misclassification of response calls as normal calls, probably because there is a high structural similarity between response calls and normal calls. Response calls consist of adjacent normal calls made by two different frog individuals responding to each other. The time of the second call after the first is so close to and they over overlap. We found that response calls were mostly classified as normal calls in spite of a light change in call structure due to adjacency. In contrast, long trill calls were classified with high accuracy (97.6%). The longer the trills are, the more distinct they are in structure compared to other call types. For L. wotjulumensis, long trills may be up to 18 s, making them distinct from other call types, which were 0.5–2 s in length. We found that the highest classification performance was achieved using the random forest classifier. Frequency-domain features were best used to distinguish the calls of different species, while time-domain features were most useful to distinguish among the call types of a single species (Table 8). Our results suggest that L. wotjulumensis vary the temporal components of calls for various purposes, while species are most easily distinguished using frequency components of calls.

Table 3 The performance of three common classifiers using all features. Classifiers

Accuracy (%)

Specificity (%)

Sensitivity (%)

Naive Bayesian (NB) K-nearest neighbour (K-NN) Random forest (RF)

80.0 81.2 84.0

85.7 84.5 85.8

64.4 69.6 75.2

Table 4 Confusion matrix of species-specific call classification of RF using all features. Classified as →

L. rothii

L. wotjulumensis

U. inundata

L.bicolor

L. rothii L. wotjulumensis U. inundata L. bicolor

2216 484 32 162

287 1195 56 139

6 6 50 9

74 118 73 923

Table 5 Call species classification results with features after dimension reduction using PCA and FDR. # of features

Accuracy (%)

All original 48 features

Specificity (%)

Sensitivity (%)

84

85.8

75.2

PCA analysis

First five PCs First seven PCs

79.8 80.3

82.8 82.8

68.3 69.1

FDR analysis

Top five features Top seven features

79.2

81.9

68.1

81.6

83.9

71.7

4.2. Within-species call classifications and frog communication behaviour The fact that differences in the frequency, and other frequency related features of calls, are best used to distinguish among species in machine learning algorithms is consistent with the predictions of the acoustic niche hypothesis. The acoustic niche hypothesis predicts that, to allow optimal transmissions of calls to receivers, species require an acoustic niche, which might be a specialized frequency band or a specific time period, in which to call [21]. With an acoustic niche, optimal propagation of calls can occur, preventing call masking by other sounds. The hypothesis is controversial, as it is difficult to determine whether there is evidence of selection against overlap, or whether communities of sounds assemble randomly, and there is some avoidance of overlap accidentally [4]. Other studies have found evidence contradicting the acoustic niche hypothesis, as, for example, bird calls overlaps extensively in morning chorus[32]. We cannot provide evidence that the range of calls in our frog community was not assembled by climate change, nor can we provide evidence of avoidance of overlap when species are together, and no such avoidance when they are apart, both of which would be required to demonstrate that the acoustic niche hypothesis was supported. We can, however, argue that lack of species call overlap in the frequency

Table 6 Confusion matrix for classification of four within-species call types of L. wotjulumensis. Classified as →

Normal call

Response call

Click call

Long trill call

Normal call Response call Click call Long trill call

1053 189 109 19

36 99 1 8

15 1 134 0

9 7 0 123

4. Discussion 4.1. Classification performance We found that combining signal processing and machine learning algorithms allowed us to successfully classify not only the calls of 84

Applied Acoustics 131 (2018) 79–86

J. Xie et al.

Table 7
Confusion matrix for classifying combined within-species and species-specific calls (rows: actual class; columns: classified as; Normal, Response, Click and Long trill are call types of L. wotjulumensis).

Classified as →     L. rothii   U. inundata   L. bicolor   Normal   Response   Click   Long trill
L. rothii                2478             4           93        8          0       0            0
U. inundata                56            60           87        8          0       0            0
L. bicolor                191            13         1024        4          1       0            0
L. wotjulumensis
  Normal                   44             0            9     1015         19      17            9
  Response                 50             0            1      163         75       1            6
  Click                     9             0            9      104          1     121            0
  Long trill               42             0            0        9          5       0           94

Table 8
Classification performance for combined within-species and species-specific calls using the three most important frequency-domain and time-domain features, respectively.

Top three features                 Accuracy (%)   Specificity (%)   Sensitivity (%)
Call species   Frequency-domain        76.3            79.5              63.6
               Time-domain             70.5            75.0              54.6
Call types     Frequency-domain        71.0            55.7              60.0
               Time-domain             77.4            67.3              68.6

domain is consistent with specialised acoustic niche occupation, and allows species to optimally propagate their calls for communication. Thus, in our system, even though the calls of different species overlapped in time, call masking did not occur, because of the variation in frequencies. Time-domain features, on the other hand, were more important for classifying within-species calls. This may have occurred because dominant frequency is a species-specific trait, while pulse rate and call length, both of which are time-domain features, can be varied to create different call types within a species.

5. Conclusions and limitations

Our results demonstrate that combining signal processing techniques with machine learning algorithms has considerable potential for the study of frog communication. We reduced the dimensionality of call features, and achieved high performance for call classification. Temporal features were more important for classifying call types within a species, while frequency-domain features were more useful for distinguishing among the calls of different species. The random forest classifier was the best classifier, as it was both robust and high-performing, achieving better results than the naive Bayes and k-nearest neighbour classifiers. We were able to achieve high classification accuracy for four call types of one species, L. wotjulumensis. We found that using both frequency-domain and temporal features achieved a higher overall classification accuracy for distinguishing among the calls of different species and the different types of calls of a single species. Compared to the naive Bayes and k-nearest neighbour classifiers, the random forest classifier, applied to suitable acoustic features, provided an ideal basis for the development of individual recognition software. For feature extraction, we used 14 features with a dimension of 48, eight of which were frequency-domain features and the rest time-domain. Frequency-domain features were best for classifying calls among species, whereas time-domain features were more useful for classifying call types within species. It is worth noting that selecting suitable features can achieve acceptable performance with much lower dimensionality. We used two techniques, PCA and FDR, to reduce dimensionality and select important features. For some pattern recognition applications, such as acoustic event detection and speech recognition [30,27,41], there has been a tendency to unify the feature extraction and classification steps into a single machine learning model, because of recent developments in deep learning (learned feature representations). However, a feature learning approach has not yet been successful in classifying frog calls, possibly because of the lack of a large-scale dataset. Therefore, a future research direction would be to prepare a large dataset to which deep learning techniques could be applied, to improve the current classification performance.

One drawback of this study was that the frog community studied included only a single species characterised by several call types, namely L. wotjulumensis, making it impossible to test the generality of our conclusions on identifying call types from temporal features. A solution to this problem is to find other frog communities containing more species with varied call modifications. Another drawback was that all frog calls were manually segmented in this study; it would be worthwhile to develop an automatic frog call segmentation method to aid ecologists in frog call analysis.

6. Applications

This method could be used to aid continuous, long-duration ecological monitoring of frog communities through the analysis of frog calling behaviour. It allows more accurate automated classification of frog species and reduces the need for ecologists to manually listen to hours, weeks or even years of recordings to identify calling frogs. The proportion of false positives is greatly outweighed by the number of correctly classified frog calls. In addition, the ability to distinguish among call types of the same species allows ecologists to examine patterns in the calling behaviour of individuals as well as species. Animal behaviour research may also benefit from the discrimination of calls within species, reducing the listening effort required to retrieve information on changes in the dynamics of calling behaviour.
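As an illustration of how the per-class sensitivity, specificity, and overall accuracy reported above can be derived from a confusion matrix such as Table 7, here is a minimal sketch. The counts are taken from Table 7; the helper functions themselves are ours, for illustration only, and are not part of the original study's code.

```python
# Per-class metrics from a confusion matrix (rows: actual class, columns: predicted).
# Counts taken from Table 7; helper functions are an illustrative sketch.
LABELS = ["L. rothii", "U. inundata", "L. bicolor",
          "Normal", "Response", "Click", "Long trill"]
MATRIX = [
    [2478,  4,   93,    8,   0,   0,  0],
    [  56, 60,   87,    8,   0,   0,  0],
    [ 191, 13, 1024,    4,   1,   0,  0],
    [  44,  0,    9, 1015,  19,  17,  9],
    [  50,  0,    1,  163,  75,   1,  6],
    [   9,  0,    9,  104,   1, 121,  0],
    [  42,  0,    0,    9,   5,   0, 94],
]

def class_metrics(matrix, k):
    """Return (sensitivity, specificity) for class index k."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                 # actual k, predicted as something else
    fp = sum(row[k] for row in matrix) - tp  # predicted k, actually something else
    tn = total - tp - fn - fp
    return tp / (tp + fn), tn / (tn + fp)

def overall_accuracy(matrix):
    """Fraction of all calls on the diagonal (correctly classified)."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total

for k, label in enumerate(LABELS):
    sens, spec = class_metrics(MATRIX, k)
    print(f"{label}: sensitivity={sens:.3f}, specificity={spec:.3f}")
print(f"overall accuracy={overall_accuracy(MATRIX):.3f}")
```

Note that the exact figures reported in the paper depend on the authors' own evaluation protocol; the sketch only shows the standard definitions of the metrics.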
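The Fisher discriminant ratio (FDR) used here for feature selection can be sketched as follows: for a single feature and two classes it is (μ1 − μ2)² / (σ1² + σ2²), so features whose class means are well separated relative to their spread score highly. The feature names and values below are invented for illustration and are not from the study's dataset.

```python
# Sketch of Fisher discriminant ratio (FDR) feature ranking for two classes.
# Feature values are invented for illustration only.
from statistics import mean, variance

def fisher_discriminant_ratio(a, b):
    """FDR of one feature between two classes: (mu_a - mu_b)^2 / (var_a + var_b)."""
    denom = variance(a) + variance(b)
    return (mean(a) - mean(b)) ** 2 / denom if denom > 0 else float("inf")

# A dominant-frequency-like feature that separates two hypothetical species well,
# and a noisy feature that does not.
dominant_freq = {"species_a": [2.4, 2.5, 2.6, 2.5],
                 "species_b": [4.1, 4.0, 4.2, 3.9]}
noise_feature = {"species_a": [1.0, 1.4, 0.8, 1.2],
                 "species_b": [1.1, 0.9, 1.3, 1.0]}

for name, feat in [("dominant_freq", dominant_freq), ("noise", noise_feature)]:
    fdr = fisher_discriminant_ratio(feat["species_a"], feat["species_b"])
    print(f"{name}: FDR = {fdr:.2f}")
```

Ranking features by this score and keeping the top few is one simple way to reach the "acceptable performance with much lower dimensionality" noted in the conclusions.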

Acknowledgements

Thanks to the QUT Eco-acoustics Research Group for providing the datasets used in this experiment, and to the Wet Tropics Management Authority, Queensland, Australia, for their support. Thanks also to the anonymous reviewers for their careful work and thoughtful suggestions, which have helped improve this paper substantially. All funding for this research was provided by the Queensland University of Technology, the Indonesian Endowment Fund for Education (LPDP) and the China Scholarship Council (CSC).

Appendix A. Supplementary material

Supplementary data associated with this article can be found in the online version at http://dx.doi.org/10.1016/j.apacoust.2017.10.024.


References

[1] Bedoya C, Isaza C, Daza JM, López JD. Automatic recognition of anuran species based on syllable identification. Ecol Inform 2014;24:200–9.
[2] Bradbury JW, Vehrencamp SL. Principles of animal communication; 1998.
[3] Bridges AS, Dorcas ME, Montgomery W. Temporal variation in anuran calling behavior: implications for surveys and monitoring programs. Copeia 2000;2000(2):587–92.
[4] Chek AA, Bogart JP, Lougheed SC. Mating signal partitioning in multi-species assemblages: a null model test using frogs. Ecol Lett 2003;6(3):235–47.
[5] Chen W-P, Chen S-S, Lin C-C, Chen Y-Z, Lin W-C. Automatic recognition of frog calls using a multi-stage average spectrum. Comput Math Appl 2012;64(5):1270–81.
[6] Cortopassi KA, Bradbury JW. The comparison of harmonically rich sounds using spectrographic cross-correlation and principal coordinates analysis. Bioacoustics 2000;11(2):89–127.
[7] Dayou J, Han NC, Mun HC, Ahmad AH, Muniandy SV, Dalimin MN. Classification and identification of frog sound based on entropy approach. In: International conference on life science and technology (ICLST 2011); 2011. p. 7–9.
[8] El Ayadi M, Kamel MS, Karray F. Survey on speech emotion recognition: features, classification schemes, and databases. Pattern Recogn 2011;44(3):572–87.
[9] Gage SH, Farina A. Ecoacoustics challenges. Ecoacoust: Ecol Role Sounds 2017:313.
[10] Gerhardt HC. The evolution of vocalization in frogs and toads. Annu Rev Ecol Syst 1994:293–324.
[11] Gerhardt HC, Huber F. Acoustic communication in insects and anurans: common problems and diverse solutions. University of Chicago Press; 2002.
[12] Gingras B, Fitch WT. A three-parameter model for classifying anurans into four genera based on advertisement calls. J Acoust Soc Am 2013;133(1):547–59.
[13] Grafe TU. A function of synchronous chorusing and a novel female preference shift in an anuran. Proc Roy Soc Lond B: Biol Sci 1999;266(1435):2331–6.
[14] Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH. The WEKA data mining software: an update. ACM SIGKDD Explor Newslett 2009;11(1):10–8.
[15] Han NC, Muniandy SV, Dayou J. Acoustic classification of Australian anurans based on hybrid spectral-entropy approach. Appl Acoust 2011;72(9):639–45.
[16] Ho TK. The random subspace method for constructing decision forests. IEEE Trans Pattern Anal Mach Intell 1998;20(8):832–44.
[17] Huang C-J, Chen Y-J, Chen H-M, Jian J-J, Tseng S-C, Yang Y-J, et al. Intelligent feature extraction and classification of anuran vocalizations. Appl Soft Comput 2014;19(0):1–7.
[18] Kasten EP, McKinley PK, Gage SH. Ensemble extraction for classification and detection of bird species. Ecol Inform 2010;5(3):153–66.
[19] Kirschel AN, Earl DA, Yao Y, Escobar IA, Vilches E, Vallejo EE, et al. Using songs to identify individual Mexican antthrush Formicarius moniliger: comparison of four classification methods. Bioacoustics 2009;19(1–2):1–20.
[20] Kogan JA, Margoliash D. Automated recognition of bird song elements from continuous recordings using dynamic time warping and hidden Markov models: a comparative study. J Acoust Soc Am 1998;103(4):2185–96.
[21] Krause B. The great animal orchestra: finding the origins of music in the world's wild places. Little, Brown; 2012.
[22] Larose DT. Discovering knowledge in data: an introduction to data mining. John Wiley & Sons; 2014.
[23] Lee C-H, Chou C-H, Han C-C, Huang R-Z. Automatic recognition of animal vocalizations using averaged MFCC and linear discriminant analysis. Pattern Recogn Lett 2006;27(2):93–101.
[24] Márquez R, Bosch J, Eekhout X. Intensity of female preference quantified through playback setpoints: call frequency versus call rate in midwife toads. Anim Behav 2008;75(1):159–66.
[25] Mellinger DK, Clark CW. Recognizing transient low-frequency whale sounds by spectrogram correlation. J Acoust Soc Am 2000;107(6):3518–29.
[26] Noda JJ, Travieso CM, Sánchez-Rodríguez D. Methodology for automatic bioacoustic classification of anurans based on feature fusion. Exp Syst Appl 2016;50:100–6.
[27] Parascandolo G, Huttunen H, Virtanen T. Recurrent neural networks for polyphonic sound event detection in real life recordings. In: 2016 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE; 2016. p. 6440–4.
[28] Rish I. An empirical study of the naive Bayes classifier. In: IJCAI 2001 workshop on empirical methods in artificial intelligence. Vol. 3. IBM New York; 2001. p. 41–6.
[29] Ryan MJ, Fox JH, Wilczynski W, Rand AS. Sexual selection for sensory exploitation in the frog Physalaemus pustulosus; 1990.
[30] Schwarz A, Huemmer C, Maas R, Kellermann W. Spatial diffuseness features for DNN-based speech recognition in noisy and reverberant environments. In: 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE; 2015. p. 4380–4.
[31] Stevenson BC, Borchers DL, Altwegg R, Swift RJ, Gillespie DM, Measey GJ. A general framework for animal density estimation from acoustic detections across a fixed microphone array. Methods Ecol Evol 2015;6(1):38–48.
[32] Tobias JA, Planqué R, Cram DL, Seddon N. Species interactions and the structure of complex communication networks. Proc Natl Acad Sci 2014;111(3):1020–5.
[33] Truskinger A, C.-F.M.. R.P. Acoustic workbench (version 19.2) [computer software]. Brisbane: QUT Ecoacoustics Research Group; 2016. Retrieved from .
[34] Urazghildiiev IR, Clark CW. Acoustic detection of North Atlantic right whale contact calls using the generalized likelihood ratio test. J Acoust Soc Am 2006;120(4):1956–63.
[35] Welch AM, Semlitsch RD, Gerhardt HC. Call duration as an indicator of genetic quality in male gray tree frogs. Science 1998;280(5371):1928–30.
[36] Wells KD. The ecology and behavior of amphibians. University of Chicago Press; 2010.
[37] Wimmer J, Towsey M, Roe P, Williamson I. Sampling environmental acoustic recordings to determine bird species richness. Ecol Appl 2013;23(6):1419–28.
[38] Xie J, Towsey M, Zhang J, Roe P. Acoustic classification of Australian frogs based on enhanced features and machine learning algorithms. Appl Acoust 2016;113:193–201.
[39] Xie J, Towsey M, Zhang J, Roe P. Adaptive frequency scaled wavelet packet decomposition for frog call classification. Ecol Inform 2016;32:134–44.
[40] Xie J, Towsey M, Zhang J, Roe P. Frog call classification: a survey. Artif Intell Rev 2016:1–17.
[41] Xu Y, Huang Q, Wang W, Plumbley MD. Hierarchical learning for DNN-based acoustic scene classification; 2016. Also available at: arXiv preprint arXiv:1607.03682.
[42] Zhou X, Garcia-Romero D, Duraiswami R, Espy-Wilson C, Shamma S. Linear versus mel frequency cepstral coefficients for speaker recognition. In: 2011 IEEE workshop on automatic speech recognition and understanding (ASRU). IEEE; 2011. p. 559–64.