Applying an ensemble convolutional neural network with Savitzky–Golay filter to construct a phonocardiogram prediction model

Applied Soft Computing Journal 78 (2019) 29–40

Jimmy Ming-Tai Wu^a, Meng-Hsiun Tsai^b,*, Yong Zhi Huang^b, SK Hafizul Islam^c, Mohammad Mehedi Hassan^d, Abdulhameed Alelaiwi^e, Giancarlo Fortino^f

a College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao, Shandong, China
b Department of Management Information Systems, National Chung Hsing University, Taichung, Taiwan
c Department of Computer Science and Engineering, Indian Institute of Information Technology Kalyani, West Bengal 741235, India
d Chair of Smart Technologies and Information Systems Department, College of Computer and Information Sciences, King Saud University, Riyadh 11543, Saudi Arabia
e Chair of Smart Technologies and Software Engineering Department, King Saud University, Riyadh 11543, Saudi Arabia
f Department of Informatics, Modeling, Electronics, and Systems, University of Calabria, 87036 Rende, Italy

* Corresponding author. E-mail addresses: [email protected] (J.M.-T. Wu), [email protected] (M.-H. Tsai), [email protected] (Y.Z. Huang), [email protected] (S.K.H. Islam), [email protected] (M.M. Hassan), [email protected] (A. Alelaiwi), [email protected] (G. Fortino).

Highlights

• This paper focuses on constructing an automatic phonocardiogram classification model.
• An ensemble convolutional neural network with a Savitzky–Golay filter is utilized.
• The experimental results show that the proposed method is very competitive.
• This method will assist physicians in the diagnosis of heart sounds.

Article history: Received 28 October 2018; Received in revised form 10 January 2019; Accepted 13 January 2019; Available online 2 February 2019.

Keywords: Coronary artery; Phonocardiograms; Convolutional neural network; Ensemble deep learning; Savitzky–Golay filter

Abstract

Coronary artery disease is a common chronic disease, also known as ischemic heart disease; it is a cardiac dysfunction caused by an insufficient blood supply to the heart, and it kills countless people every year. In recent years, coronary artery disease has ranked first among the world's top ten causes of death. Cardiac auscultation is still an important examination for diagnosing heart diseases, and many heart diseases can be diagnosed effectively by auscultation. However, cardiac auscultation relies on the subjective experience of physicians. To provide an objective diagnostic means and assist physicians in the diagnosis of heart sounds in the clinic, this study uses phonocardiograms to build an automatic classification model. The study proposes an automatic classification approach for phonocardiograms using deep learning and ensemble learning with a Savitzky–Golay filter. The experimental results show that the proposed method is very competitive: the performance of the phonocardiogram classification model was 86.04% MAcc (86.46% sensitivity, 85.63% specificity) in hold-out testing and 89.81% MAcc (91.73% sensitivity, 87.91% specificity) in ten-fold cross-validation. Both results are better than those of state-of-the-art algorithms and show the potential for application in real clinical situations.

1. Introduction

The computing capability and storage capacity of computers have been greatly improved with advancements in hardware. Therefore, many machine learning

algorithms can now be implemented, and research on artificial intelligence and its applications has started to flourish. Simple machine learning algorithms (Naive Bayes, logistic regression, decision trees, etc.) are gradually being used to solve problems such as understanding speech and images and assisting in medical diagnosis. However, simple machine learning algorithms require features to be extracted manually, which consumes substantial labor and time; in addition, important features may be overlooked. Therefore, the performance of these simple machine learning algorithms is still inferior to human capabilities [1].


The advantages of deep learning are representation learning and hierarchical feature extraction, which replace manual feature extraction and overcome the problem of feature engineering. In recent years, deep learning has been widely used in natural language understanding, image recognition, activity recognition, and emotion recognition [2–4], and it has achieved good results. Its classification effectiveness is very close to that of human beings, and in some fields it has even surpassed human ability.

Cardiovascular disease (CVD) is a general term for noncommunicable diseases involving the heart and blood vessels. Common CVDs include coronary artery disease (CAD), stroke, cardiomyopathy, valvular heart disease, and hypertensive heart disease [5,6]. Cardiovascular diseases cause tens of millions of deaths worldwide every year. As shown in Fig. 1, according to data released by the World Health Organization in 2015, ischemic heart disease ranked first among the world's top ten causes of death. In the same year, ischemic heart disease caused the death of 8.7 million people worldwide, and between 2005 and 2015 the number of deaths from CVDs worldwide increased by 12.5% [7]. Ischaemic heart disease (IHD), also known as coronary artery disease (CAD), is a common chronic disease that includes stable angina pectoris, unstable angina pectoris, myocardial infarction, and sudden death; it is a cardiac dysfunction or lesion caused by insufficient blood supply to the heart [8]. It has remained the world's biggest killer for the past fifteen years [9].

In countries with stronger economies, many sophisticated examinations assist physicians in diagnosis (such as cardiac ultrasound, electrocardiography, and magnetic resonance imaging), but cardiac auscultation is still an effective and low-cost diagnostic method for physicians, with the added advantage of being non-intrusive. In contrast, low- and middle-income countries are unable to afford the cost of sophisticated examinations, and access to expert diagnosis is often impeded, with patient-to-doctor ratios as high as 50,000:1 [10]. Auscultation of heart sound recordings, or phonocardiograms (PCG), has been proven to be valuable for detecting diseases and lesions [11–13]. Cardiac auscultation is one of the important examination methods for diagnosing heart disease: if the heart is abnormal, this is often reflected in the heart sounds and heart rhythms. The accuracy and sensitivity of cardiac auscultation are not high compared to other clinical cardiac diagnostic methods, but it is a very simple and valuable diagnostic method with the advantage of being non-invasive. Many heart diseases can be effectively diagnosed at an early stage through a physician's auscultation of heart abnormalities. However, traditional cardiac auscultation relies heavily on the subjective experience of the physician and on the quality of the auscultation environment.

More and more state-of-the-art computer science techniques have recently been applied in the medical field. Thanks to efficient new hardware and modern mining algorithms, diseases can be diagnosed with high performance and at low cost. Some related works are introduced below. Hussein et al. applied a cloud-based system to monitor heart rate variability remotely [14]; it can detect heart disease immediately. A measuring and analysis method [15] for nonlinear characterization of cardiotocographic examinations was proposed by Marques et al.
The method applies entropy values to distinguish normal from suspicious/pathological groups. Mobile phones are also useful tools for the diagnosis of heart diseases: Hemanth et al. proposed an augmented reality-supported mobile system [16] for heart sound analysis and disease diagnosis. The photoplethysmography (PPG) signal has also been used for biomedical applications [17]; PPG is an optical technique applied in the monitoring of heart rate variability.

The purpose of this study is to construct a phonocardiogram classification model using deep learning and ensemble learning

with a Savitzky–Golay filter. This method will assist physicians in the diagnosis of heart sounds. The key contributions of the study are listed below:

1. The heart sound segmentation method of Springer et al. [18] is used by other approaches to preprocess their input data; however, this preprocessing stage takes a long time to finish. The proposed method does not need this process and can therefore reduce training time effectively.
2. In this study, we directly transform small segments of the original heart sound into spectra and cepstra, which does not incur much computational complexity in feature engineering and obtains good sensitivity and specificity.
3. The method in this paper applies a filter to deal with the noise in heart sounds, thereby improving the performance of the phonocardiogram classification model.

2. Related work

2.1. Heart sound

Heart sounds are vibrations generated by a series of actions of the heart, including blood flowing through the heart against the ventricular wall and the closing of the heart valves. The human ear can perceive these vibrations on the chest surface over the heart area, and a clear heart sound can be heard with a stethoscope [20]. Heart sounds can be divided into main heart sounds, extra heart sounds, and heart murmurs. The main heart sounds are the first heart sound (S1) and the second heart sound (S2); the extra heart sounds are the third heart sound (S3) and the fourth heart sound (S4).

1. First heart sound: This occurs in systole with the largest amplitude, and is caused by the closure or contraction of the tricuspid and mitral valves at the beginning of ventricular contraction [21].
2. Second heart sound: This occurs in the ventricular diastolic phase and is caused by aortic valve closure and pulmonary valve closure. Its amplitude is second only to the first heart sound; in addition, its frequency is high and its duration short [21].
3. Third heart sound: The third heart sound is an extra heart sound that occurs after the first and second heart sounds. Its amplitude is low, and therefore it is difficult to hear. It is usually heard in children, adolescents, and athletes. Some normal people may also have this heart sound, but it should disappear before middle age; if it appears after middle age, it may indicate a problem with the heart, such as heart failure [21].
4. Fourth heart sound: This is also an extra heart sound, and its amplitude is also low. The fourth heart sound occurs briefly before the first heart sound; it is caused by atrial contraction as blood quickly fills the ventricle [21].
5. Heart murmur: Heart murmurs can be divided into pathological and physiological heart murmurs. Pathological heart murmurs are often caused by abnormalities in the heart or heart valves, such as a ventricular septal defect (VSD) or narrowing or leaking valves. Physiological heart murmurs arise when blood flows through the heart faster than normal, and often occur during pregnancy, fever, or anemia; these are benign heart murmurs and disappear over time.


Fig. 1. The top ten causes of death in 2015 [19].

Fig. 2. Phonocardiogram.

2.2. Phonocardiogram

A phonocardiogram is a recording of heart sound information made by an electronic device that converts the vibrations of the chest into signals. Heart disease often leads to changes in heart sounds or to heart murmurs. In addition, the recorded heart sound signals convey more information than the human ear alone can assess [22]. By sampling the patient's heart sounds, a lot of information about the state of the heart is provided over time and vibration [23]. Fig. 2 shows a phonocardiogram between 0 and 8 s, depicting the amplitude change of the heart sound over time.

2.3. Related works on phonocardiograms

Potes et al. proposed an algorithm based on an AdaBoost classifier and a convolutional neural network (CNN) in 2016 [24]; it won first place in the PhysioNet/CinC Challenge of 2016. Fig. 3 shows the phonocardiogram classification algorithm proposed by Potes et al. The process consists of the following steps. The first step is data preprocessing: each phonocardiogram is resampled to 1000 Hz, and a bandpass filter is used to extract heart sounds between 25 Hz and 400 Hz. Furthermore, the preprocessed heart sound is segmented using the method proposed by Springer et al. [18] and divided into four frequency bands. Second, 36 time-domain features and 52 frequency-domain features are extracted, and the AdaBoost classifier and the CNN are built. In this approach, the one-dimensional phonocardiogram in the four frequency bands (25–45 Hz, 45–80 Hz, 80–200 Hz, 200–400 Hz) is input into the model, and the convolution results from the four frequency bands are flattened and concatenated before entering the fully connected layer. Finally, decision rules combine the results of the AdaBoost classifier and the CNN to predict the class of the phonocardiogram.

Fig. 3. The phonocardiogram classification algorithm proposed by Potes et al. [24].

Homsi et al. proposed a method combining Random Forest, LogitBoost, and cost-sensitive classifiers in 2016, which achieved fifth place in the PhysioNet Challenge of 2016 [25]. This method divides the process into three phases: pre-processing, classification, and evaluation. In the pre-processing phase, each phonocardiogram is resampled to 1000 Hz using a polyphase antialiasing filter. Moreover, the four states of the heart sound are identified using the segmentation method proposed by Springer et al., and 131 features are extracted from the time domain, the frequency domain, and statistics. In the classification and evaluation phases, the authors tried several model combinations and parameter settings to train better models.

2.4. Related works on deep learning

Deep learning is a branch of machine learning. It is a method for generating predictive models based on a series of nonlinear transformations and weight adjustments. There are many deep learning architectures, such as the CNN, the recurrent neural network (RNN), and long short-term memory. The deep learning architecture used in this study is a CNN, and therefore the literature discussion of deep learning will focus on CNNs.

2.4.1. LeNet-5

A CNN was first proposed by LeCun et al. in 1990 and applied to the identification of handwritten zip codes [26]. In 1998, LeCun et al. presented their classic LeNet-5 model after a series of experiments and improvements, and achieved excellent performance on the MNIST handwritten digit database [27]. The biggest difference between this model and traditional neural networks (multilayer perceptrons) is the concept of convolution and pooling, which enables the neural network to extract features directly from images or 2D arrays without feature engineering. The architecture of the LeNet-5 model includes convolutional layers, pooling layers, and fully connected layers. The convolutional layer of the CNN provides the properties of sparse connectivity and parameter sharing. In short, sparse connectivity means that a neuron is only connected to a local area of the input image,



and the size of the connected area is called the receptive field, which is the size of the convolution kernel. Parameter sharing means that one convolution kernel corresponds to one feature map, which greatly reduces the number of parameters. A convolution kernel can simply be understood as a set of weights.

2.4.2. AlexNet

AlexNet was proposed by Krizhevsky et al. [28] in 2012 and won first place in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), with a Top-5 error rate significantly ahead of second place. Since then, various CNN models have emerged one after another, creating a booming era for CNN usage in large-scale visual recognition. AlexNet has 60 million parameters and 650,000 neurons, and consists of five convolutional layers, two fully connected layers, and a softmax output layer. Compared to LeNet-5, AlexNet not only makes the network deeper and increases the number of parameters, it also implements state-of-the-art tricks to speed up neural network training and avoid overfitting, which include:

1. Replacing the popular hyperbolic tangent (tanh) function with the rectified linear unit (ReLU) as the activation function.
2. Using a graphics processing unit (GPU) to accelerate the training of the CNNs.
3. Adopting the dropout training method [29] to avoid the overfitting problem.
4. Using data augmentation techniques, such as random rotations, shifts, shears, and flips, to boost performance.
5. Local response normalization and overlapping pooling.

2.4.3. VGGNet

VGGNet is a CNN model proposed by Simonyan and Zisserman of the Visual Geometry Group at Oxford University in 2014 [30], and it achieved second place in the ILSVRC of 2014. In its network architecture, VGGNet continues the framework of LeNet-5 and AlexNet. VGGNet has five convolutional groups, two fully connected layers, and a softmax classification output layer. There are several differences between AlexNet and VGGNet. To reduce the number of parameters and obtain larger feature maps, VGGNet deepens the convolutional layers and stacks multiple 3 × 3 small-kernel convolution layers in each convolutional group. Instead of overlapping pooling, the pooling layers of VGGNet use a 2 × 2 max pooling window with a stride of 2. In this study, the proposed framework establishes three VGGNets to build a prediction system for heart disease.

2.5. Dataset

The phonocardiograms used in this study were released by the PhysioNet/Computing in Cardiology Challenge of 2016 (PhysioNet 2016) and come from seven different research institutions in different clinical or non-clinical environments. The sampling rate of each heart sound is a consistent 2000 Hz, and each phonocardiogram is an uncompressed WAV file. The experts who provided the phonocardiograms labeled them into two categories: normal and abnormal. Normal heart sounds were collected from healthy individuals. Most of the abnormal heart sounds are from patients with heart valve defects and coronary artery disease; the patients with heart valve defects include mitral valve prolapse, mitral insufficiency, aortic insufficiency, aortic stenosis, and individuals who had undergone valvular surgery [10]. The contents of the dataset are shown in Table 1. The training dataset consists of training subsets a to f, of which 2575 heart sounds are labeled as normal and 665 as abnormal. The contents of the hidden testing set are unknown, because the testing set is not open to non-participants.

Table 1
Summary of the training sets.

Subset       Recordings   Abnormal   Normal
Training-a   409          292        117
Training-b   490          104        386
Training-c   31           24         7
Training-d   55           28         27
Training-e   2141         183        1958
Training-f   114          34         80
Total        3240         665        2575
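For illustration, a single recording from this dataset can be loaded as follows. This sketch is ours, not part of the paper, and the file path is hypothetical:

```python
# Sketch of loading one PhysioNet 2016 recording with SciPy; the file
# path below is hypothetical. Each recording is an uncompressed WAV
# file sampled at 2000 Hz, labeled normal or abnormal by experts.
from scipy.io import wavfile

rate, pcg = wavfile.read('training-a/a0001.wav')  # hypothetical path
assert rate == 2000                               # dataset-wide sampling rate
pcg = pcg.astype(float) / 32768.0                 # int16 -> roughly [-1, 1)
```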

3. Method

This section describes and explains in detail the methods and algorithms used in this study, and includes five subsections. Section 3.1 gives an overview of the method. Section 3.2 describes the data preprocessing phase, which includes the Savitzky–Golay filter, one-hot encoding, and Z-score standardization. Section 3.3 describes the feature engineering of phonocardiograms. Section 3.4 details the deep learning architecture of this study. Section 3.5 describes the implementation and algorithms of ensemble learning in this study.

3.1. Overview of methods

The methods and algorithms of this study are summarized in Fig. 4.

Fig. 4. Overview of methods in this study.

First, a Savitzky–Golay filter is used to denoise each heart sound, and then the filtered and unfiltered time-domain signals are transformed to the frequency domain by the short-time Fourier transform. The Fourier transform produces spectrograms, which describe the frequency distribution over time and can be treated as images. The second phase is feature engineering. Besides the spectrogram, this study further transforms the spectrogram using the mel scale, triangular bandpass filters, and the discrete cosine transform to obtain mel spectrograms and mel frequency cepstral coefficients (MFCCs).



The third phase is the training of the CNNs. This study uses the three features (spectrogram, mel spectrogram, and MFCCs) as model inputs to build the phonocardiogram classifier and trains multiple CNNs. Finally, the ensemble model is built through two ensemble strategies, and the prediction results of the multiple CNNs are combined by majority voting. The ensemble model outputs the final prediction, which determines whether a phonocardiogram is normal or abnormal.

3.2. Preprocessing

This section describes the preprocessing of the proposed method in detail. The original phonocardiograms cannot be input directly into the CNN architecture; they also include noise, which may affect the accuracy of the proposed method. The first step of the preprocessing is noise removal: the proposed method applies the Savitzky–Golay filter for noise reduction and transfers the phonocardiograms to the frequency domain by the short-time Fourier transform. The following step is standardization: Z-score standardization is applied to calculate the standard score for each input value. In addition, the proposed method uses one-hot encoding to define the class of each input phonocardiogram. At this point, a set of input data for one CNN is available. However, the proposed method generates three CNNs to predict phonocardiograms, so the final step of preprocessing is to transform the first input dataset into the other two sets, the mel spectrograms and the MFCCs. Each step is described in the following subsections.

3.2.1. Savitzky–Golay filter

The Savitzky–Golay filter was proposed by Savitzky and Golay in 1964 for signal smoothing and noise reduction [31]. This method smooths a signal by convolving the time-domain signal and performing a polynomial least-squares fit in each convolution window. The filter preserves the shape and length of the signal while removing small amounts of noise.

According to the description in the literature [32], the principle of this filter is to find a polynomial fitted to a selected number of local signal samples, thereby smoothing the local signal. The polynomial is shown in Eq. (3.1), with 2M + 1 signal samples as input (the number of samples must be odd); let n = 0 be the center point, and let N be the order of the polynomial, with N ≤ 2M + 1:

p(n) = \sum_{k=0}^{N} a_k n^k    (3.1)

Next, Eq. (3.2) describes the mean-squared approximation error; the polynomial that best approximates the input samples centered at n = 0 is found by minimizing this error, thereby smoothing the local signal:

\varepsilon_N = \sum_{n=-M}^{M} \left( p(n) - x[n] \right)^2 = \sum_{n=-M}^{M} \left( \sum_{k=0}^{N} a_k n^k - x[n] \right)^2    (3.2)
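As an illustration, this filtering step is available off the shelf in SciPy. The sketch below uses the window length of 13 (M = 6) and polynomial order of 2 (N = 2) reported in Section 4.2, applied to a synthetic signal:

```python
# Minimal sketch of the Savitzky-Golay denoising step using SciPy's
# built-in implementation; window length 13 (2M+1, M=6) and polynomial
# order 2 (N=2) follow the settings reported in Section 4.2.
import numpy as np
from scipy.signal import savgol_filter

fs = 2000                               # PCG sampling rate in the dataset (Hz)
t = np.arange(fs) / fs                  # one second of signal
pcg = np.sin(2 * np.pi * 40 * t) + 0.05 * np.random.randn(fs)  # toy noisy PCG

# Fit a degree-2 polynomial by least squares inside each 13-sample window.
smoothed = savgol_filter(pcg, window_length=13, polyorder=2)
```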

3.2.2. Z-score standardization

Z-score standardization, also known as the standard score, is one of the common standardization methods in statistics. Before the data are mapped to the space of the standard score, the mean and the standard deviation of the original data must be calculated. The concept can be understood simply as the distance between the original value and the population mean, measured in units of the standard deviation. The standard score is obtained by Eq. (3.3):

z = \frac{x - \mu}{\sigma}, \quad \sigma \neq 0    (3.3)

where x is the original value, i.e., the value that needs to be standardized; µ is the population mean; and σ is the population standard deviation. The converted values assume a normal distribution and are scaled to within roughly plus or minus five or six.

3.2.3. One-hot encoding

Categorical data, also known as nominal data, are variables that contain label values rather than numeric values (for example,



sunny, rainy, and cloudy). However, many machine learning algorithms are based on operations in vector spaces and cannot operate on label data directly. If categorical data are converted into integers by integer encoding, the distances between categories become unequal, which indirectly affects the performance of the model: integer values have a natural ordering, and machine learning algorithms may understand and learn this spurious relationship. To address the problems of integer encoding, one-hot encoding encodes according to all the states of the category. Consider a weather example: there are three states of the weather (sunny, cloudy, and rainy), so the weather category can be encoded with 3 bits (100, 010, and 001). In addition to preserving the non-ordered relationship, one-hot encoding makes the distances between categories more reasonable.

3.3. Feature engineering

The sound signal is a one-dimensional time-domain signal that records the amplitude at each time point. The time-domain waveform shows how a signal changes over time. However, by observing the one-dimensional signal, it is difficult to understand the frequency components of the signal. Therefore, observing signals in the frequency domain has become a common method of analysis.

3.3.1. Spectrogram

A spectrogram is used to observe and describe changes in amplitude and frequency over a series of discrete times; it is a combination of multiple spectra. A spectrum only describes the relationship between the amplitude and the frequency of a signal, whereas the spectrogram adds the dimension of time. As shown in Eq. (3.4), the spectrogram of a signal can be obtained by the short-time Fourier transform:

X(t, f) = \int_{-\infty}^{+\infty} x(\tau)\, w(t - \tau)\, e^{-j 2\pi f \tau}\, d\tau    (3.4)

3.3.2. Mel spectrogram

Human hearing does not perceive frequency linearly: the human ear is more sensitive to low-frequency sounds, and in the high-frequency range larger frequency changes are required to distinguish pitches. Based on these characteristics of the human ear's frequency discrimination, Stevens, Volkmann, and Newman proposed a nonlinear frequency scale for the human ear in 1937 to describe the mapping between frequency (Hz) and the mel scale. The mapping relationship can be described by Eq. (3.5) [33]:

m(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right) = 1127 \log_{e}\left(1 + \frac{f}{700}\right)    (3.5)

Mel is derived from the word melody, which indicates that the scale is based on pitch comparisons; the reference point is 1000 Hz. As shown in Fig. 5, the mapping relation between the mel scale and frequency can be observed. Using this correspondence, spectrograms can be converted into mel spectrograms.

Fig. 5. The mel scale.

3.3.3. Mel frequency cepstral coefficients

First proposed by Davis and Mermelstein in 1980, MFCCs are features widely used in the field of speech recognition [34]. The steps to extract the MFCCs are as follows:

1. Calculate the spectrogram using the short-time Fourier transform, as shown in Section 3.3.1.
2. Apply the triangular bandpass filters to the spectrogram, multiplying the energy spectrum by a set of n triangular bandpass filters, as shown in Fig. 6. The log energy E_k [35] of each filter output is obtained, where k indexes the filters.
3. Apply the discrete cosine transform (DCT), which takes the k logarithmic energies E_k through the discrete cosine transform to find the mel-scale cepstrum parameters of order L [35]. The equation is described by Eq. (3.6):

C_m = \sum_{k=1}^{N} E_k \cos\left[\frac{m(k - 0.5)\pi}{N}\right], \quad m = 1, 2, \ldots, L    (3.6)

Fig. 6. Mel triangular bandpass filters.
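For illustration, the three transforms above can be sketched in a few lines of Python. This is our own reimplementation with assumed parameter values (FFT window length, 40 mel filters, 13 coefficients), not the authors' code:

```python
# Illustrative reimplementation (not the authors' code) of the three
# feature transforms: spectrogram (Eq. 3.4), mel spectrogram (Eq. 3.5),
# and MFCCs (Eq. 3.6). The FFT window length, number of mel filters,
# and number of coefficients are our own assumed parameters.
import numpy as np
from scipy.signal import stft
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)   # Eq. (3.5)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def pcg_features(x, fs=2000, n_mels=40, n_mfcc=13):
    # Short-time Fourier transform -> power spectrogram (Eq. 3.4)
    f, t, Z = stft(x, fs=fs, nperseg=256)
    spec = np.abs(Z) ** 2

    # Triangular bandpass filters spaced evenly on the mel scale
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, f.size))
    for i in range(n_mels):
        left, center, right = edges[i], edges[i + 1], edges[i + 2]
        rising = (f - left) / (center - left)
        falling = (right - f) / (right - center)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)

    mel_spec = fb @ spec                       # mel spectrogram
    log_e = np.log(mel_spec + 1e-10)           # log filter-bank energies E_k
    mfcc = dct(log_e, type=2, axis=0, norm='ortho')[:n_mfcc]  # Eq. (3.6)
    return spec, mel_spec, mfcc
```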

3.3.4. An illustrative example of preprocessing

This section shows a simple example of the preprocessing, in which one record of the input training data is transformed into an input image for the proposed CNN framework. Assume a record of the input data is, in simplified form:

[−0.01116943 −0.04907227 −0.08248901 . . . −0.16204834 −0.13397217 −0.10409546]

The preprocessing first applies the Savitzky–Golay filter to smooth the input record and reduce noise. After the Savitzky–Golay filtering, the record becomes a new matrix such as:

[−0.00794716 −0.00221903 0.00427395 . . . 0.09947642 0.10785766 0.1136656]

The matrix after this step is still one-dimensional. However, the input to a CNN is usually a multi-dimensional matrix, so the proposed method transforms it into a two-dimensional matrix by the short-time Fourier transform. This is a frequency-domain representation of the original matrix:

[[1.5904831e−06 4.7905727e−05 9.7855320e−07 . . . 7.3406829e−07 8.6158622e−07 4.9366743e−11]
[1.6949858e−04 2.8835117e−05 2.3331378e−04 . . . 4.5718207e−07 2.3732489e−05 7.3456545e−06]
[2.1208012e−04 7.0644857e−04 7.8546564e−04 . . . 2.0085528e−04 6.2020007e−04 6.0669723e−04]
. . .
[9.7654246e−11 1.5555661e−10 2.4851379e−10 . . . 4.3029106e−13 1.5517296e−10 1.3627786e−10]
[2.6275285e−10 7.1316252e−11 1.8038597e−10 . . . 3.0988434e−11 8.1496442e−11 2.4139293e−10]
[2.8115674e−13 4.2477820e−11 2.0004181e−11 . . . 4.4037215e−11 2.8752675e−12 1.7700053e−11]]

Now that the matrix has been transformed into a two-dimensional matrix, it is logically an image that can be input into a CNN. In the proposed method, this matrix is also transformed into the other two matrices, the mel spectrogram and the MFCCs; therefore, three CNNs will be trained. These input matrices are then normalized by Z-score standardization. After normalization, the matrix looks like:

[[−0.04444204 0.17136419 −0.04729333 . . . −0.04843251 −0.04783834 −0.05185268]
[0.7379271 0.08250455 1.0352745 . . . −0.04972266 0.05872881 −0.01762577]
[0.9363361 3.2398496 3.60803 . . . 0.88403386 2.8379738 2.7750573]
. . .
[−0.05185245 −0.05185218 −0.05185175 . . . −0.05185291 −0.05185218 −0.05185227]
[−0.05185168 −0.05185258 −0.05185207 . . . −0.05185276 −0.05185252 −0.05185178]
[−0.05185291 −0.05185271 −0.05185281 . . . −0.0518527 −0.05185289 −0.05185283]]

Finally, the proposed method generates three sets of input images with normalized values and trains three CNNs by the following steps.

3.4. Convolutional neural network

A CNN is a special artificial neural network, able to describe an image layer by layer and perform small-area operations; CNNs are especially suitable for computer-vision-related applications [36]. A convolution is a special kind of linear operation, and the term "convolutional neural network" indicates that this network uses convolution operations [37]. A CNN consists of an input layer, multiple hidden layers, and an output layer. The hidden layers usually contain convolutional layers, pooling layers, and fully connected layers. After the model is constructed layer by layer, the input data are fed forward through the network to produce the network's prediction. Then, based on the selected loss function, the error between the predicted value and the ground truth is calculated. Finally, the error is fed back, the convolution kernels and hidden units are re-weighted via backpropagation, and the neural network is trained by updating its weights through continuous iteration. The hidden units and loss functions of a typical CNN are described below.

In the convolutional layer, convolution is performed on the input image through n filters to obtain n feature maps. In contrast to CNNs, traditional neural networks must flatten two-dimensional image data into one-dimensional vectors, which loses the spatial information of the original two-dimensional images. If the neural network directly uses two-dimensional or even three-dimensional data as input, this loss of spatial information can be effectively avoided. Furthermore, the CNN can use filters to extract various local spatial features of the image. Convolution is described by Eq. (3.7) [37]:

S(i, j) = (I * K)(i, j) = \sum_{m} \sum_{n} I(m, n)\, K(i - m, j - n)    (3.7)

The rectified linear unit (ReLU) was first used by Nair et al. [38] and quickly became the default nonlinear activation function. This study also uses this activation function to perform nonlinear transformations, accelerating the training process and avoiding the problems of gradient explosion and vanishing [36]. Eq. (3.8) describes the ReLU:

ReLU(x) = \max(0, x)    (3.8)

The concept of the pooling layer is to use a pooling function to obtain an overall feature representation of a sub-area and to reduce the feature space; it can be regarded as a down-sampling method. As shown in Fig. 7, the pooling function used in this study is max pooling.

Fig. 7. Illustration of max pooling [39].

The fully connected layer is the same as in traditional artificial neural networks (multilayer perceptrons). When feeding the three-dimensional feature maps to the fully connected layer, all image features must be flattened before they can enter the fully connected layer. The operation of the fully connected layer is given by Eq. (3.9), where W is the weight matrix, x is the input vector, and b is the bias vector:

z = W^{T} x + b    (3.9)

The loss function is also called the cost function. Its purpose is to reduce the gap between the predicted value and the ground truth and to find the weight combination with the smallest error. The weights of the fully connected layers and the convolutional layers are updated via backpropagation to optimize the model. In this study, cross entropy is chosen as the loss function, as shown in Eq. (3.10), and the softmax unit is used as the output prediction unit. Softmax is defined in Eq. (3.11):

Loss = -\sum_{i} y_i \log(y'_i)    (3.10)

softmax(x)_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)}    (3.11)
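A naive NumPy rendering may make Eqs. (3.7) and (3.8) concrete. Note that deep learning libraries actually compute the closely related cross-correlation (no kernel flip), which is what the loop below does:

```python
# Naive NumPy rendering of the 2-D convolution of Eq. (3.7) and the
# ReLU of Eq. (3.8). Illustration only: real frameworks use optimized
# kernels, and they compute cross-correlation (as below) rather than
# the flipped-kernel convolution written in Eq. (3.7).
import numpy as np

def conv2d(I, K):
    """'Valid' 2-D cross-correlation of image I with kernel K."""
    h, w = K.shape
    H, W = I.shape
    S = np.zeros((H - h + 1, W - w + 1))
    for i in range(S.shape[0]):
        for j in range(S.shape[1]):
            S[i, j] = np.sum(I[i:i + h, j:j + w] * K)
    return S

def relu(x):
    return np.maximum(0, x)  # Eq. (3.8)

feature_map = relu(conv2d(np.random.rand(8, 8), np.random.rand(3, 3)))
```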


The convolutional neural network architecture used in this study follows the VGG network architecture proposed in [30], modified to our own requirements, as shown in Table 2.

Table 2
The VGG model in this study.

Name      Layer type        Output number   Filter size/stride
Conv1_1   Conv              16              3 × 3/1
Conv1_2   Conv              16              3 × 3/1
Pool1     Max pool          16              2 × 2/2
Conv2_1   Conv              32              3 × 3/1
Conv2_2   Conv              32              3 × 3/1
Pool2     Max pool          32              2 × 2/2
Conv3_1   Conv              64              3 × 3/1
Conv3_2   Conv              64              3 × 3/1
Conv3_3   Conv              64              3 × 3/1
Pool3     Max pool          64              2 × 2/2
Fc4       Fully connected   1024            –
Fc5       Fully connected   512             –
Fc6       Fully connected   128             –
Fc7       Fully connected   2               –
Loss      Softmax           2               –

In addition, L2 regularization was added to the model constructed in this study to avoid overfitting in the training of the CNNs. Regularization aims to modify the learning algorithm to reduce the generalization error rather than the training error, which is one of the core issues in the field of machine learning [37]. L2 regularization is defined in Eq. (3.12):

C = C_0 + \frac{\lambda}{2n} \sum_{w} w^2    (3.12)
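For concreteness, the Table 2 architecture can be sketched in Keras (the paper states the implementation used TensorFlow, but this exact code, and in particular the input shape and the dropout placement, are our own assumptions):

```python
# Sketch of the VGG-style network of Table 2 in tf.keras. The input
# shape is an assumption for illustration; lambda = 0.0001 and the
# keep probability of 0.8 (drop rate 0.2) follow Table 3.
import tensorflow as tf
from tensorflow.keras import layers, regularizers

l2 = regularizers.l2(1e-4)  # Eq. (3.12) penalty on each layer's weights

def conv(n):
    return layers.Conv2D(n, 3, padding='same', activation='relu',
                         kernel_regularizer=l2)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(64, 64, 1)),                     # assumed input size
    conv(16), conv(16), layers.MaxPool2D(2, 2),            # Conv1_x, Pool1
    conv(32), conv(32), layers.MaxPool2D(2, 2),            # Conv2_x, Pool2
    conv(64), conv(64), conv(64), layers.MaxPool2D(2, 2),  # Conv3_x, Pool3
    layers.Flatten(),
    layers.Dropout(0.2),  # keep probability 0.8 (Table 3); placement assumed
    layers.Dense(1024, activation='relu', kernel_regularizer=l2),  # Fc4
    layers.Dense(512, activation='relu', kernel_regularizer=l2),   # Fc5
    layers.Dense(128, activation='relu', kernel_regularizer=l2),   # Fc6
    layers.Dense(2, activation='softmax'),                         # Fc7 + softmax
])
model.compile(optimizer=tf.keras.optimizers.Adam(3e-4),  # learning rate, Table 3
              loss='categorical_crossentropy')           # cross entropy, Eq. (3.10)
```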

3.5. Ensemble learning

Ensemble learning combines multiple weak classifiers to obtain better predictive performance and a strong classifier. To reduce the overall error and obtain a classifier with a better classification effect, each ensemble classifier combines k different classifiers M_1, M_2, M_3, ..., M_k and integrates multiple classification boundaries [40]. In the ensemble learning phase, this study uses three different phonocardiogram features, the spectrogram, the mel spectrogram, and the MFCCs, to train the CNNs. After training, one model is constructed for each of the three features. Finally, the prediction results of the three models are combined by majority voting to produce the voting result.

4. Experimental results

Substantial experiments were conducted to compare the effectiveness and efficiency of the proposed CNN framework with previous state-of-the-art methods [24,25,41]. The algorithms in the experiments were implemented in Python with the well-known deep learning library TensorFlow. The detailed parameter settings are described in the following sections.

4.1. Classification model evaluation

To confirm and evaluate the classification ability of the deep learning model proposed in this study, we used the model evaluation indicators of the PhysioNet 2016 competition [10]: Se (sensitivity), Sp (specificity), and MAcc (modified accuracy). The sensitivity, also called the true positive rate or recall, is the proportion of actual positives that are correctly classified as positive (e.g., the percentage of abnormal phonocardiograms that are correctly identified as abnormal). The sensitivity is calculated as shown in Eq. (4.13):

Se = \frac{True\ Positive}{Positive} = \frac{True\ Positive}{True\ Positive + False\ Negative}    (4.13)

The specificity, also called the true negative rate, is the proportion of actual negatives that are correctly classified as negative (e.g., the percentage of normal phonocardiograms that are correctly identified as normal). The specificity is calculated as shown in Eq. (4.14):

Sp = \frac{True\ Negative}{Negative} = \frac{True\ Negative}{True\ Negative + False\ Positive}    (4.14)

The accuracy measure used in the competition is shown in Eq. (4.15):

MAcc = \frac{Se + Sp}{2}    (4.15)
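The majority-voting rule of Section 3.5 and the three measures above are straightforward to implement; a sketch with hypothetical 0/1 prediction arrays:

```python
# Majority voting (Section 3.5) and the Se/Sp/MAcc measures of
# Eqs. (4.13)-(4.15), sketched with hypothetical 0/1 label arrays
# (1 = abnormal/positive, 0 = normal/negative).
import numpy as np

def majority_vote(*preds):
    # Each preds[k] is one CNN's 0/1 predictions; a strict majority wins.
    return (np.sum(preds, axis=0) * 2 > len(preds)).astype(int)

def se_sp_macc(y_true, y_pred):
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    se = tp / (tp + fn)            # Eq. (4.13)
    sp = tn / (tn + fp)            # Eq. (4.14)
    return se, sp, (se + sp) / 2   # Eq. (4.15)
```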

4.2. Results of the Savitzky–Golay filter

Because the sound files came from different environments and different devices, some heart sounds in the dataset contained background noise or high-frequency noise. Before the phonocardiograms were converted to spectra and cepstra, we used the Savitzky–Golay filter to deal with the noise. Fig. 8 shows one second of a phonocardiogram; after filtering, a slight smoothing of the time-domain signal around 0.6 to 0.8 s can be observed. In the parameter settings of the filter, we set the window size to 13 (window length = 2M + 1, M = 6), and the order of the polynomial was set to 2 (N = 2).

This study also observes the filtered response of the signal from the perspective of the frequency domain, since it is difficult to observe the filtered changes in the time-domain signal. After a short-time Fourier transform, we obtain the spectrogram of the signal. As shown in Fig. 9, taking an 8-second heart-sound spectrogram as an example, the time–frequency spectrum before and after filtering can be compared: (a) before filtering, there is some salt-and-pepper noise in the figure; (b) after filtering, most of the noise is smoothed away.

4.3. Experimental results of deep learning and ensemble learning

In the modeling stage, in addition to training the CNNs, we also used ensemble learning to improve the prediction accuracy by using multiple features to train multiple CNN classifiers. As shown in Fig. 10, this study used three different features as model inputs for training, combined the results of the three models by majority voting, and finally determined the phonocardiogram classification. Fig. 11 shows the VGG network architecture used in the training phase. The spectrogram, mel spectrogram, and MFCCs were used as model inputs, all of which are two-dimensional data. The hyperparameter settings are shown in Table 3. Each feature trained one model, and the hyperparameter settings for each training model were the same: the number of epochs was set to 500, the batch size was 1024, the learning rate was 0.0003, the λ value of L2 regularization was 0.0001, and the dropout (keep probability) was 0.8. Several other hyperparameter settings were also tried in the experiments; the settings in Table 3 obtained the best experimental results among them and are therefore the recommended parameters for the proposed framework.


Fig. 8. Phonocardiogram before and after comparison using the Savitzky–Golay filter.

Table 3
Hyperparameter settings.

Hyperparameter   Value    Description
Epoch            500      Number of iterations of training*
Batch size       1,024    Number of batch samples at each iteration
Learning rate    0.0003   Learning rate
λ                0.0001   Lambda value in L2 regularization
Dropout          0.8      Keep probability of neurons

* All samples completing one forward propagation and one back propagation counts as 1 epoch.
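Continuing the hypothetical Keras sketch from Section 3.4, training with the Table 3 settings reduces to a single call (array names are hypothetical):

```python
# Hypothetical training call matching Table 3: 500 epochs, batch size 1024.
# X_train has shape (samples, height, width, 1); y_train is one-hot encoded.
model.fit(X_train, y_train, epochs=500, batch_size=1024)
```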

In the hold-out validation, we divided the dataset into a training set and a testing set in a proportion of 80/20. Table 4 shows the results of the ensemble CNN with the Savitzky–Golay filter in the hold-out validation. In this experiment, although the MAcc reached approximately 80%, there was a significant gap between sensitivity and specificity, indicating that the model did not classify the positive samples well. The proportion of positive to negative examples in the original data is 665:2575, which biases the weights of the model toward fitting negative samples. Therefore, we attempted to adjust the proportion of the positive and negative categories to balance the classes.


Fig. 9. Phonocardiogram before and after comparison using the Savitzky–Golay filter (spectrogram).

Table 4
Classification results using the ensemble CNN with the Savitzky–Golay filter (data imbalanced).

Model             Sensitivity   Specificity   MAcc
Spectrogram       68.4210%      95.5339%      81.9775%
Mel spectrogram   68.4210%      92.0338%      80.2299%
MFCC              71.4285%      91.8446%      81.6366%
Ensemble          67.6691%      94.9514%      81.3103%

For category balance, we directly copied the positive samples in the training set to increase the number of positive samples five-fold, so that the numbers of positive and negative samples in the training set were approximately the same; the testing set kept the original class proportion. Table 5 shows the hold-out validation results after category balancing. The sensitivity was greatly improved, and the gap between sensitivity (86.46%) and specificity (85.63%) was reduced to less than one percentage point. The MAcc also increased from 81.31% to 86.04% after category balancing. In addition, we used the 10-fold cross-validation method to evaluate the classification ability of the model, verifying the robustness of the model by dividing the dataset into multiple non-overlapping subsets and testing them in turn. Table 6 shows the cross-validation classification results using the ensemble CNN with the SG filter.
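This balancing step amounts to naive oversampling of the positive class; a sketch with hypothetical array names:

```python
# Naive oversampling used for category balance: replicate the positive
# (abnormal) training samples so the positive class is five times larger.
# X_train / y_train are hypothetical arrays; the test set keeps its
# original class proportion.
import numpy as np

pos = y_train == 1
X_bal = np.concatenate([X_train] + [X_train[pos]] * 4)  # 1 original + 4 copies
y_bal = np.concatenate([y_train] + [y_train[pos]] * 4)

rng = np.random.default_rng(0)
order = rng.permutation(len(y_bal))                     # shuffle before training
X_bal, y_bal = X_bal[order], y_bal[order]
```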


Fig. 10. Ensemble CNN model.

Fig. 11. Illustration of the VGGNet in this study.

Table 5
Classification results using the ensemble CNN with the SG filter (category balanced 1:1).

Model             Sensitivity   Specificity   MAcc
Spectrogram       78.1954%      81.9417%      80.0686%
Mel spectrogram   80.4511%      85.6310%      83.0410%
MFCC              88.7218%      82.3300%      85.5295%
Ensemble          86.4661%      85.6310%      86.0486%

Table 6
Cross-validation results using the ensemble CNN with the SG filter (category balanced 1:1).

Fold      Sensitivity   Specificity   MAcc
1         92.1052%      84.7389%      88.4221%
2         89.7058%      86.3813%      88.0436%
3         96.7741%      81.3688%      89.0715%
4         90.0%         91.7647%      90.8823%
5         96.4285%      86.9888%      91.7087%
6         93.5483%      91.5708%      92.5596%
7         91.5254%      87.1212%      89.3233%
8         85.9375%      87.6447%      86.7911%
9         86.8421%      87.8542%      87.3481%
10        94.4444%      93.6254%      94.0349%
Average   91.7311%      87.9059%      89.8185%

The average sensitivity was 91.73%, the average specificity was 87.91%, and the average MAcc was 89.81%. This demonstrates that the model has good classification ability and robustness.

4.4. Comparison with other studies

As shown in Table 7, we compared our results with the studies by Potes et al. [24], Homsi et al. [25], and Singh-Miller et al. [41]; the rank in the table is the ranking of each study in the competition. Since the PhysioNet 2016 competition does not provide open access to the hidden test dataset, the testing data for our model were randomly extracted from the training set provided by the competition. For comparison, we therefore looked for studies with in-house testing results.

Table 7
Comparison with the results of other studies.

Method                     Se      Sp      MAcc    Division            Rank
Potes et al. [24]          88      82      85      80/20, train/test   1
Homsi et al. [25]          –       –       88.4    10-fold CV          5
Singh-Miller et al. [41]   81      89      85      10-fold CV          16
This study                 86.46   85.63   86.04   80/20, train/test   –
This study (CV)            91.73   87.91   89.81   10-fold CV          –

This study employed the hold-out validation and cross-validation methods to evaluate the classification performance of the model. In hold-out validation, with the training and testing sets split 80/20, the MAcc reached 86.04, and both the sensitivity and the MAcc were better than those of Potes et al. In the cross-validation setting, the 10-fold cross-validation obtained a MAcc of 89.81, which was better than the 88.4 of Homsi et al. and the 85 of Singh-Miller et al.

The comparison between this study and the other research methods is shown in Table 8. First, for heart sound segmentation, Potes et al. and Homsi et al. used the segmentation method proposed by Springer et al. to locate the first heart sound (S1) and the second heart sound (S2); this step was not implemented in our study. Second, in the feature extraction phase, the features of the phonocardiogram in this study were obtained through three transformations, all of which produce two-dimensional features, whereas Potes et al. and Homsi et al. used a variety of methods to extract features, including features in the time and frequency domains and feature sets generated by statistical methods. Third, regarding feature selection, because a CNN is used in this study, the convolutional layers perform feature selection automatically. Fourth, in classifier selection, whereas Potes et al. and Homsi et al. used different classifiers to construct an ensemble model, this study only employed CNNs to build the ensemble. Lastly, regarding data category balance, the other studies did not balance the data categories; to improve the model's classification ability for positive cases, this study balanced the positive and negative categories in the training phase.


Table 8
Comparison with other research methods.

Method                     Segment   Feature method   Feature selection   Classifier         Data balance
Potes et al. [24]          Yes       T, F             No                  AdaBoost & CNN     No
Homsi et al. [25]          Yes       T, F, W, St      No                  RF+LB+CSC          No
Singh-Miller et al. [41]   No        Sp               Yes                 RF                 No
This study                 No        TF & MFCC        No                  Ensemble of CNNs   Yes

* CNN = convolutional neural network, LB = LogitBoost, RF = random forest, CSC = cost-sensitive classification, T = time domain, F = frequency domain, Sp = spectral, TF = time–frequency, W = wavelet, St = statistical, MFCC = mel frequency cepstral coefficients.

Fig. 12. A flow chart of the proposed framework for diagnosing heart disease.

4.5. Application in real scenarios

The average training time of the proposed framework in the experiments is about 8 min and 35 s, while evaluating a testing record is almost instantaneous. This makes it very suitable for large-scale pre-screening, especially in environments without sufficient medical instruments. The remaining issue is whether the accuracy of the proposed method is good enough to meet real clinical requirements. In the first part of the experiments, we divided the dataset into a training set and a testing set; this is closer to reality than the second part of the experiments. The first experiment did not balance the class quantities in the training data, and the final MAcc was about 81%; the second experiment improved the MAcc to 86% by using a balanced dataset. Even though the MAcc of the proposed method is better than that of the previous methods, it is unfortunately still not good enough for a real clinical environment.

Besides improving the accuracy in the future, is it possible to apply the currently proposed method in real clinical scenarios? We note the high specificity values in the first experiment, especially for the spectrogram model. Taking advantage of this high specificity, we can design a diagnosis flowchart as in Fig. 12. With this flowchart, we can quickly select a set of candidates for further heart disease diagnosis: if the result from the proposed CNN framework is positive, the patient should be checked for heart disease by a professional approach; if the result is negative, thanks to the high specificity, we can trust the result from the system. However, the sensitivity is still too low, so too many patients would need further diagnosis, and the dependability of the specificity is also not yet good enough. The accuracy of the proposed framework still needs to be increased in future work.

5. Conclusions

In studies of phonocardiograms, many researchers have employed the segmentation method proposed by Springer et al. [18] to identify the fundamental heart sounds (FHSs), which divides the heart sound into S1 and S2 heart sounds and systolic and diastolic phases. After heart sound segmentation, time-domain, frequency-domain, and statistical features are extracted from the fundamental heart sounds. Although segmenting the heart sounds allows a variety of features to be acquired, it makes the heart sound analysis process more complicated and increases the computational complexity.

Therefore, we directly turned small segments of the original heart sound into spectra and cepstra, which does not require much computational cost in feature engineering, and the final classification results were comparable to other competitive models. Data category imbalance is a typical problem in medical datasets, because the number of normal samples is always greater than that of abnormal samples. Before the category balancing of this study, the sensitivity of the model was always poor; by increasing the minority samples through copying, this study significantly improved the sensitivity of the model and the MAcc.

In the future, we will expand upon this study in two directions. First, to further validate the model's ability to generalize to other heart sounds, we will look for other public phonocardiogram data or cooperate with hospitals to obtain more phonocardiograms. Second, we believe better results could be obtained by adding other phonocardiogram features to the ensemble model alongside the existing features of this study.

Acknowledgments

The authors are grateful to the Deanship of Scientific Research at King Saud University for funding this work through the Vice Deanship of Scientific Research Chairs: Chair of Smart Technologies.

References

[1] P. Pace, G. Aloi, R. Gravina, G. Caliciuri, G. Fortino, A. Liotta, An edge-based architecture to support efficient applications for healthcare industry 4.0, IEEE Trans. Ind. Inf. (2019).
[2] M.M. Hassan, M.G.R. Alam, M.Z. Uddin, S. Huda, A. Almogren, G. Fortino, Human emotion recognition using deep belief network architecture, Inf. Fusion 51 (2019) 10–18.
[3] M.Z. Uddin, M.M. Hassan, Activity recognition for cognitive assistance using body sensors data and deep convolutional neural network, IEEE Sens. J. (2018).
[4] M.M. Hassan, M.Z. Uddin, A. Mohamed, A. Almogren, A robust human activity recognition system using smartphone sensors and deep learning, Future Gener. Comput. Syst. 81 (2018) 307–313.
[5] S. Mendis, P. Puska, B. Norrving, World Health Organization, et al., Global Atlas on Cardiovascular Disease Prevention and Control, World Health Organization, Geneva, 2011.
[6] M.M. Hassan, S. Huda, J. Yearwood, H.F. Jelinek, A. Almogren, Multistage fusion approaches based on a generative model and multivariate exponentially weighted moving average for diagnosis of cardiovascular autonomic nerve dysfunction, Inf. Fusion 41 (2018) 105–118.
[7] H. Wang, M. Naghavi, C. Allen, R. Barber, A. Carter, D. Casey, F. Charlson, A. Chen, M. Coates, M. Coggeshall, et al., Global, regional, and national life expectancy, all-cause mortality, and cause-specific mortality for 249 causes of death, 1980–2015: a systematic analysis for the Global Burden of Disease Study 2015, Lancet 388 (10053) (2016) 1459–1544.
[8] N.D. Wong, Epidemiological studies of CHD and the evolution of preventive cardiology, Nat. Rev. Cardiol. 11 (5) (2014) 276.
[9] C. Mathers, G. Stevens, W. Mahanani, J. Ho, D. Fat, D. Hogan, WHO Methods and Data Sources for Country-Level Causes of Death 2000–2015, Global Health Estimates Technical Paper WHO/HIS/IER/GHE/2016.3.
[10] G.D. Clifford, C. Liu, B. Moody, D. Springer, I. Silva, Q. Li, R.G. Mark, Classification of normal/abnormal heart sound recordings: The PhysioNet/Computing in Cardiology Challenge 2016, in: Computing in Cardiology Conference (CinC), IEEE, 2016, pp. 609–612.


[11] A. Raghu, D. Praveen, D. Peiris, L. Tarassenko, G. Clifford, Engineering a mobile health tool for resource-poor settings to assess and manage cardiovascular disease risk: SMARThealth study, BMC Med. Inform. Decis. Mak. 15 (1) (2015) 36.
[12] A. Leatham, Auscultation of the Heart and Phonocardiography, Churchill Livingstone, 1975.
[13] G.D. Clifford, C. Liu, B. Moody, J. Millet, S. Schmidt, Q. Li, I. Silva, R.G. Mark, Recent advances in heart sound analysis, Physiol. Meas. 38 (2017) E10.
[14] A.F. Hussein, A. Kumar, M. Burbano-Fernandez, G. Ramirez-Gonzalez, E. Abdulhay, V.H.C. de Albuquerque, An automated remote cloud-based heart rate variability monitoring system, IEEE Access (2018).
[15] J.A.L. Marques, P.C. Cortez, J.P. Madeiro, V.H.C. de Albuquerque, S.J. Fong, F.S. Schlindwein, Nonlinear characterization and complexity analysis of cardiotocographic examinations using entropy measures, J. Supercomput. (2018) 1–16.
[16] J.D. Hemanth, U. Kose, O. Deperlioglu, V.H.C. de Albuquerque, An augmented reality-supported mobile application for diagnosis of heart diseases, J. Supercomput. (2018) 1–26.
[17] J. Moraes, M. Rocha, G. Vasconcelos, J. Vasconcelos Filho, V. de Albuquerque, Advances in photopletysmography signal analysis for biomedical applications, Sensors 18 (6) (2018) 1894.
[18] D.B. Springer, L. Tarassenko, G.D. Clifford, Logistic regression-HSMM-based heart sound segmentation, IEEE Trans. Biomed. Eng. 63 (4) (2016) 822–832.
[19] World Health Organization, Top 10 causes of death, Global Health Observatory (GHO) data, http://www.who.int/gho/mortality_burden_disease/causes_death/top_10/en/.
[20] K.D. Pickrell, Miller-Keane encyclopedia and dictionary of medicine, nursing, and allied health, Hosp. Health Netw. 77 (8) (2003) 70.
[21] V.L. Clark, J.A. Kruse, Clinical methods: the history, physical, and laboratory examinations, JAMA 264 (21) (1990) 2808–2809.
[22] R.M. Rangayyan, R.J. Lehner, Phonocardiogram signal analysis: a review, Crit. Rev. Biomed. Eng. 15 (3) (1987) 211–236.
[23] R.M. Youngson, Collins Dictionary of Medicine, HarperCollins, 1992.
[24] C. Potes, S. Parvaneh, A. Rahman, B. Conroy, Ensemble of feature-based and deep learning-based classifiers for detection of abnormal heart sounds, in: Computing in Cardiology Conference (CinC), IEEE, 2016, pp. 621–624.
[25] M.N. Homsi, N. Medina, M. Hernandez, N. Quintero, G. Perpiñan, A. Quintana, P. Warrick, Automatic heart sound recording classification using a nested set of ensemble algorithms, in: Computing in Cardiology Conference (CinC), IEEE, 2016, pp. 817–820.

[26] Y. LeCun, B.E. Boser, J.S. Denker, D. Henderson, R.E. Howard, W.E. Hubbard, L.D. Jackel, Handwritten digit recognition with a back-propagation network, in: Advances in Neural Information Processing Systems, 1990, pp. 396–404.
[27] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proc. IEEE 86 (11) (1998) 2278–2324.
[28] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, in: F. Pereira, C.J.C. Burges, L. Bottou, K.Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 25, Curran Associates, Inc., 2012, pp. 1097–1105, http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.
[29] G.E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, R.R. Salakhutdinov, Improving neural networks by preventing co-adaptation of feature detectors, arXiv preprint arXiv:1207.0580, 2012.
[30] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556, 2014.
[31] A. Savitzky, M.J. Golay, Smoothing and differentiation of data by simplified least squares procedures, Anal. Chem. 36 (8) (1964) 1627–1639.
[32] R.W. Schafer, What is a Savitzky–Golay filter? [Lecture Notes], IEEE Signal Process. Mag. 28 (4) (2011) 111–117.
[33] D. O'Shaughnessy, Speech Communication: Human and Machine, Universities Press, 1987.
[34] S.B. Davis, P. Mermelstein, Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences, in: Readings in Speech Recognition, Elsevier, 1990, pp. 65–74.
[35] J.-S.R. Jang, Audio Signal Processing and Recognition, available via the on-line course links at the author's homepage, http://www.cs.nthu.edu.tw/~jang.
[36] I. Hadji, R.P. Wildes, What do we understand about convolutional networks?, arXiv preprint arXiv:1803.08834v1, 2018.
[37] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, 2016, http://www.deeplearningbook.org.
[38] V. Nair, G.E. Hinton, Rectified linear units improve restricted Boltzmann machines, in: Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807–814.
[39] A. Karpathy, CS231n Convolutional Neural Networks for Visual Recognition, 2016.
[40] J. Han, J. Pei, M. Kamber, Data Mining: Concepts and Techniques, Elsevier, 2011.
[41] N.E. Singh-Miller, N. Singh-Miller, Using spectral acoustic features to identify abnormal heart sounds, in: Computing in Cardiology Conference (CinC), IEEE, 2016, pp. 557–560.