Stacked pruning sparse denoising autoencoder based intelligent fault diagnosis of rolling bearings

Haiping Zhu a, Jiaxin Cheng a, Cong Zhang a, Jun Wu b,∗, Xinyu Shao a

a School of Mechanical Science and Engineering, Huazhong University of Science and Technology, Wuhan, China
b School of Naval Architecture and Ocean Engineering, Huazhong University of Science and Technology, Wuhan, China

∗ Corresponding author. E-mail address: [email protected] (J. Wu).

Article history: Received 3 April 2019; Received in revised form 26 December 2019; Accepted 30 December 2019; Available online 2 January 2020

Keywords: Stacked pruning sparse denoising autoencoder; Deep learning; Rolling bearing; Fault diagnosis; Pruning operation

Abstract

This paper proposes a new stacked pruning sparse denoising autoencoder (sPSDAE) model for intelligent fault diagnosis of rolling bearings. Different from the traditional autoencoder, the proposed sPSDAE model, comprising a fully connected autoencoder network, uses the superior features extracted in all the previous layers to participate in the subsequent layers. This means that new channels are created to connect the front layers and the back layers, which reduces information loss. To improve the training efficiency and precision of the sPSDAE model, a pruning operation is added to prohibit non-superior units from participating in all the subsequent layers. Meanwhile, a feature fusion mechanism is introduced to ensure the uniqueness of the feature dimensions. After that, the sparse expression of the sPSDAE model is strengthened, thereby improving the generalization ability. The proposed method is evaluated using a public bearing dataset and is compared with other popular fault diagnosis models. The results show that the ability of the sPSDAE model to extract features is significantly enhanced and the phenomenon of gradient disappearance is further reduced. The proposed model achieves higher diagnostic accuracy than the other popular fault diagnosis models.

1. Introduction

Rolling bearings are among the most critical components for transmitting force and torque, and are widely used in various types of industrial equipment. Unexpected faults in rolling bearings might cause serious safety losses and large maintenance costs. To ensure the reliability and safety of industrial equipment, accurate and timely fault diagnosis of rolling bearings is always needed. Since rolling bearings often operate in harsh environments, weak fault features are buried in excessive noise and in the disturbance of the rotor rotating frequency with its harmonics, which makes it difficult to diagnose the faults accurately [1,2].

In traditional bearing fault diagnosis, the main idea is to demodulate the amplitude of faulty bearing signals. For example, the fast Fourier transform [3] and wavelet analysis [4] are used to extract different features by transforming signals into the frequency domain or time-frequency domain. The empirical mode decomposition method [5,6] is used to select basis functions from the original signals and analyze their instantaneous frequency for fault diagnosis. Nevertheless, these methods rely heavily on engineering experiments and signal analysis experience.

With the rapid development of computer technologies, machine learning has become one of the popular data-driven fault detection approaches. For example, Shatnawi et al. [7] used an extension neural network combined with wavelet packet decomposition to extract and diagnose fault signals of internal combustion engines. Fuzzy logic was introduced to establish the correspondence between faults and signals, achieving fault pattern recognition [8,9]. The support vector machine (SVM) and its improved algorithms were adopted for classifying faults using small sample datasets [10]. However, machine learning requires numerous signal processing methods and rich engineering experience to extract effective features. Moreover, machine learning is not very efficient on big datasets.

In order to process such massive data, deep learning methods have been introduced into the field of fault diagnosis. Since Hinton et al. [11] proposed the deep belief network (DBN) model, deep learning methods have received extensive attention from academia and industry. More and more deep learning methods, such as the deep restricted Boltzmann machine [12,13], the recurrent neural network (RNN) [14] and the convolutional neural network (CNN) [15], have been proposed with excellent learning ability. With the continuous improvement of deep learning methods, several new network structures have achieved great success in improving model performance. For example, Highway Networks [16], ResNet [17] and DenseNet [18] improve model performance by strengthening the information transfer between the layers of the models.


In terms of specific fault diagnosis applications, Tamilselvan et al. [19] established a relatively complete equipment health monitoring system using a DBN and conducted experiments on data related to aircraft engines and power transformers. Wen et al. [20] converted fault data into 2D images to train a CNN model and achieved good results. Cheng et al. [21] proposed a fault detection model based on deep long short-term memory recurrent neural networks (LSTM-RNN). Marcin et al. [22] used an RNN to diagnose dynamic nonlinear systems and solved the robustness problems that may occur under actuator failures. Compared with traditional machine learning methods, the biggest advantage of deep learning is its powerful feature extraction ability, which greatly reduces the need for practical engineering experience.

The stacked autoencoder (sAE) is one of the widely used deep learning models. Unlike supervised deep learning models such as the CNN and RNN, the sAE combines unsupervised feature extraction with supervised overall fine-tuning. Its primary purpose is to achieve dimensionality reduction of data features; to complete a classification task, different classifiers are added to the output layer of the autoencoder [23]. Vincent et al. [24] established the stacked denoising autoencoder (sDAE), which is based on the idea of reconstructing a noisy signal to obtain a better eigenvector, to enhance the robustness of the model to noisy signals. Poultney et al. [25] used the sparse autoencoder (SAE) to complete the unsupervised extraction of overcomplete features. Masci et al. [26] proposed a convolutional autoencoder, which replaces the original fully connected layers with a convolutional layer and a pooling layer to preserve the two-dimensional information of the data. Kingma et al. [27] proposed the variational autoencoder (VAE), whose decoding part automatically generates an output similar to the training data, so that it can be trained quickly without strong assumptions. Chen et al. [28] proposed marginalized denoising autoencoders for nonlinear representations, which marginalize the noise interference based on the sDAE and greatly reduce its training time. In addition, various autoencoders are widely used in fault diagnosis scenarios for many types of industrial equipment. For example, Sakurada et al. [29] used the denoising autoencoder to detect subtle anomalies in spacecraft telemetry data that linear PCA misses. Sun et al. [30] used a sparse autoencoder for fault diagnosis of induction motors and achieved good results. Lu et al. [31] used a stacked denoising autoencoder for fault diagnosis of rotating mechanical systems.

The improvements in most autoencoders consist of optimizing the error functions or adding different classifiers. However, these improvements are limited by the performance of the original autoencoder model, whose network structure determines its information processing ability. In order to fundamentally improve the model's information processing ability, this paper proposes a stacked pruning sparse denoising autoencoder (sPSDAE) model for intelligent fault diagnosis of rolling bearings, which optimizes the traditional autoencoder model by changing the network structure. The proposed sPSDAE model is dedicated to enhancing the fault diagnosis ability by improving the information delivery paths and the sharing efficiency.
The main contributions of this paper are summarized as follows:

(1) A new sPSDAE model is built, which uses a fully connected encoder network structure for sharing information among multiple layers and broadens the information transmission sources from which each unit extracts features.
(2) To address the increased network complexity caused by the fully connected structure, a pruning operation is introduced into the sPSDAE model. The pruning operation uses the reconstruction error of each sparse denoising autoencoder (SDAE) unit as an indicator to find the units with a low contribution to reconstruction. These units are then prohibited from participating in the training of subsequent units, thereby reducing the complexity of model training and speeding up network iteration.
(3) Multiple features extracted from the pruned units are fused to ensure the uniformity of the data structure of the output feature, which simplifies the training task and makes the fine-tuning process go smoothly. Meanwhile, sparse methods such as dropout and the Leaky-ReLU function are adopted for the sparse expression of the sPSDAE model in order to reduce over-fitting.

The rest of this paper is organized as follows. Section 2 presents the basic theory of the sDAE model. In Section 3, the sPSDAE model is built and the procedure of fault diagnosis using the proposed sPSDAE model is given. Section 4 describes the experimental validation. Section 5 draws the conclusions.

2. Basic theory of sDAE model

The basic structure of deep artificial neural networks for implementing unsupervised feature extraction is a single autoencoder (AE) [32]. As shown in Fig. 1, each AE is regarded as an independent three-layer network model, which reconstructs the input by exploiting encoder and decoder operators. The output target of the AE is set equal to the original input data Xn, which forms a neural network of the type 'X-C-Y'. Once the output Yn successfully replicates the original data, it shows that Cn contains most of the information of the original data. Then, the feature Cn can be used as the input of the next AE.

Fig. 1. The network structure of an AE.

The denoising autoencoder (DAE) [33] is a variant of the AE; its main structure is shown in Fig. 2. A noise signal obeying a specific distribution is added to the original data Xn, generating a new input Xn*. The target of the output is set to Xn, and in training the reconstruction Yn is made as similar to Xn as possible. This makes the hidden feature C robust to a noisy environment. In general, the types of noise added to the DAE are Gaussian noise, random noise, and mask noise. In this paper, Gaussian noise is used, which is expressed as:

$$X_n^* = X_n + N\left(0, \sigma^2 I\right), \qquad \sigma = \sigma_0 \cdot k^2 \tag{1}$$

where σ0 is the variance of the original data and k is the noising parameter.
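As a concrete illustration, a minimal NumPy sketch of the corruption step in Eq. (1) might look as follows (the function name and array layout are illustrative, not from the paper):

```python
import numpy as np

def add_gaussian_noise(X, k, rng=None):
    """Sketch of Eq. (1): X* = X + N(0, sigma^2 I) with sigma = sigma0 * k^2,
    where sigma0 is taken as the variance of the original data and k is the
    noising parameter."""
    rng = np.random.default_rng() if rng is None else rng
    sigma = X.var() * k ** 2                      # sigma0 * k^2, per Eq. (1)
    return X + rng.normal(0.0, sigma, size=X.shape)
```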

The first layer of the DAE extracts the intermediate feature C1 from the noise-added data Xn* by using the encoding function, which is described as:

$$C_1 = f_1\left(\omega X_n^* + b_1\right) \tag{2}$$


Fig. 2. The DAE model.

where ω is the weight of the DAE encoding unit and b1 is the bias correction term. The input data is used as the target output, and the intermediate feature is decoded to obtain the reconstructed data Yn, which is defined as:

$$Y_n = h_1\left(\varepsilon C_1 + b_2\right) \tag{3}$$

where ε is the weight of the DAE decoding unit and b2 is the bias correction term. Furthermore, the reconstruction error LDAE between the reconstruction Yn and the original data Xn is minimized to complete the establishment of a single DAE model:

$$L_{DAE} = \frac{1}{M}\sum_{m=1}^{M}\left\|X_n^* - Y_n\right\| \tag{4}$$

where M is the number of single-layer sample sets.
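For readers who prefer code, the single DAE unit of Eqs. (2)-(4) can be sketched as below. PyTorch, the sigmoid activations (named as f1/h1 in the paper), and the L1 approximation of the un-squared norm in Eq. (4) are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class DAE(nn.Module):
    """One denoising autoencoder unit: encode the noised input, decode,
    and reconstruct the clean target (Eqs. (2)-(4))."""

    def __init__(self, n_in, n_hidden):
        super().__init__()
        self.encoder = nn.Linear(n_in, n_hidden)   # weights omega, bias b1
        self.decoder = nn.Linear(n_hidden, n_in)   # weights epsilon, bias b2
        self.act = nn.Sigmoid()                    # f1 / h1 in the paper

    def forward(self, x_noised):
        c = self.act(self.encoder(x_noised))       # intermediate feature C
        y = self.act(self.decoder(c))              # reconstruction Y
        return c, y

def dae_step(dae, optimizer, x_clean, x_noised):
    """One training step against the clean target (a sketch; Eq. (4)'s
    un-squared norm is approximated here with an L1 loss)."""
    c, y = dae(x_noised)
    loss = nn.functional.l1_loss(y, x_clean)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return c.detach(), loss.item()
```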

In order to improve the feature extraction capability of a single DAE model, multiple DAE units are connected (see Fig. 3). In this figure, noise is added to the raw data to create the noised data applied to the first unit of the network; the features extracted after the noise addition are then transmitted layer by layer to form the sDAE network. The intermediate feature Cn-1 of the previous DAE unit is used as the input data and the output target of the nth DAE unit, thereby training the corresponding intermediate feature Cn until the last DAE training ends.

3. Proposed sPSDAE model for fault diagnosis

The traditional sDAE network only focuses on the encoding and decoding of the intermediate features of adjacent layers, which means that the information transmission is layer-to-layer. If there are L DAE units, the L reconstruction processes form one information transmission channel. However, as the number of hidden layers increases and the dimension of the features decreases continuously, the original information gradually disappears under the influence of the reconstruction error. Therefore, as the network deepens, the gradient disappears noticeably, resulting in over-fitting or under-fitting [34]. To solve this problem, a new sPSDAE model is built.

3.1. The construction of sPSDAE model

The core idea of the proposed sPSDAE model is to cut off the layers that do not help the latter layers' training, while ensuring the maximum flow of information between the layers in the network. The latter layers can thus obtain the information contained in all the previous superior layers, which improves training speed and feature extraction performance. Using the DAE model as the basis, the sPSDAE model is constructed as follows.


3.1.1. Unpruned fully connected network layout

As shown in Fig. 4, this paper first designs a new network, called the stacked unpruned sparse denoising autoencoder (sUPSDAE). This network is the basic network structure of the sPSDAE model. The bottom unit of the sUPSDAE is the SDAE unit, whose sparsity is discussed in Section 3.1.4. Each double arrow in Fig. 4 represents an SDAE unit, in which the signal undergoes noise addition, encoding, decoding and extraction of the intermediate features. The sUPSDAE differs from the stacked sparse denoising autoencoder (sSDAE), whose units are simply connected in series (marked red). The structure of the sUPSDAE is a fully connected training network of all the units, that is, layers with feature mapping are directly connected in the network. Its purpose is to share the feature information of all the previous layers on top of the layer-by-layer transfer, and to pass the self-fused feature map to all subsequent layers. Simply put, the input to each layer comes from the outputs of all the previous layers.

A sUPSDAE model is stacked from multiple unpruned sparse denoising autoencoder (UPSDAE) units. In the first UPSDAE unit, the noise-added data Xn* is considered the original input data. Its goal is to extract the first feature C1 from Xn* and use this feature in the following training. The second UPSDAE unit contains two SDAE units, whose input data are Xn* and C1, respectively. The features trained by the two SDAE units are then weighted and fused in a certain way to obtain the second feature C2. Immediately after, Xn*, C1 and C2 become the inputs to the third unit, respectively. By analogy, a multi-channel stack structure is formed. Assume that there are n UPSDAE units in the entire sUPSDAE model, and that the ith UPSDAE unit consists of i SDAE units. The i - 1 features Cn (n = 1, 2, ..., i - 1) of the previous layers and Xn* each undergo feature extraction, and the multiple intermediate features are then fused by weights to obtain a new feature C̄i. Therefore, in an n-layer sPSDAE model, the number of connected SDAE units is n(n + 1)/2, and the number of information channels formed is (n² - n + 2)/2.
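These two counts can be checked with a trivial sketch (the function name is ours):

```python
def supsdae_counts(n):
    """Unit/channel counts for an n-layer fully connected sUPSDAE
    (Section 3.1.1): n*(n+1)/2 SDAE units, (n**2 - n + 2)/2 channels."""
    units = n * (n + 1) // 2
    channels = (n * n - n + 2) // 2
    return units, channels

print(supsdae_counts(5))  # -> (15, 11)
```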

3.1.2. Non-superior units pruning

Since the structure of the sUPSDAE is fully connected, the number of SDAE units and the training time increase greatly. To solve this problem, a pruning operation for non-superior units is proposed, which prohibits the input data of units with large reconstruction errors from participating in the subsequent training process. The pruning operation consists of the following four steps:

Step 1: The reconstruction error corresponding to each SDAE unit is recorded after forming the stack structure. In the training process of the ith UPSDAE unit, i intermediate features Cj (j = 1, 2, ..., i) and the corresponding reconstruction errors Lij (j = 1, 2, ..., i) are obtained, which are defined as:

$$L_{ij} = \frac{1}{M}\sum_{n=1}^{M}\left(X_n^* - \tilde{X}_n^*\right) + \beta\sum_{w} L\left(\hat{q}\,\|\,q\right) + \frac{\rho}{2M}\left\|w\right\|^2 \tag{5}$$

$$L\left(\hat{q}\,\|\,q\right) = q\log\frac{q}{\hat{q}} + (1-q)\log\frac{1-q}{1-\hat{q}} \tag{6}$$

where Lij represents the reconstruction error of the jth SDAE in the ith UPSDAE unit, M is the number of samples, Xn* is the input data of this layer, X̃n* is the layer reconstruction data, β is a sparsity control coefficient, q is a sparsity penalty parameter, q̂ is the average activation of a neuron's input data, ρ is the regularization coefficient, and w is the activation weight of the layer's SDAE unit.

Step 2: The information transfer channel of the traditional sDAE within the sUPSDAE model is taken as the fine-tuning channel, which is marked red in Fig. 4. The SDAE units contained in the fine-tuning channel are called the fine-tuning units.


Fig. 3. The stack structure of sDAE network.

Fig. 4. Schematic diagram of sUPSDAE fully connected network model.

Step 3: Apart from the fine-tuning units (the blue circles in Fig. 5), the reconstruction errors of the other SDAE units are compared to find the minimum reconstruction error and its corresponding SDAE unit.

Step 4: If the reconstruction error of another non-fine-tuning unit is much larger than the minimum reconstruction error (more than 20 times), its input data cannot compress the data into as good a feature as the minimum-error unit can. Therefore, the input data of these units (the yellow circles) are prohibited from participating in the operation of the subsequent layers through the fully connected paths.

After the pruning operation is performed, all the subsequent units corresponding to a non-superior unit (circles marked by the red line) are not trained, and are called the pruned units. This forms the sPSDAE model, which is stacked from pruning sparse denoising autoencoder (PSDAE) units. In the PSDAE units, only the superior units (hollow circles) are trained, which reduces the network growth rate and the amount of network computation.
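The selection rule of Steps 3-4 reduces to a simple comparison. The sketch below assumes one list of per-unit reconstruction errors and uses the 20-times factor from Step 4 (all names are illustrative):

```python
def prune_non_superior(recon_errors, finetune_idx, factor=20.0):
    """Sketch of the pruning rule in Section 3.1.2: within one UPSDAE unit,
    compare each non-fine-tuning SDAE unit's reconstruction error against
    the minimum one; units whose error exceeds `factor` times the minimum
    are barred from all subsequent layers. Returns the indices kept."""
    candidates = [j for j in range(len(recon_errors)) if j != finetune_idx]
    min_err = min(recon_errors[j] for j in candidates)
    kept = [finetune_idx]  # the fine-tuning unit is never pruned
    kept += [j for j in candidates if recon_errors[j] <= factor * min_err]
    return sorted(kept)
```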

Fig. 5. Schematic diagram of sUPSDAE pruning process.


3.1.3. Feature fusion in PSDAE unit

To ensure the uniqueness of the feature dimensions extracted by each PSDAE unit and to simplify model training, it is necessary to fuse the intermediate features extracted by the superior SDAE units within a PSDAE unit. This paper proposes a feature fusion method, whose main process is described as follows.

Assuming that N SDAE units are left after the pruning operation in the ith PSDAE unit, the N intermediate features and N reconstruction errors obtained by training construct the feature matrix ϕ = [C1, C2, ..., CN] and the error matrix σ = [Li1, Li2, ..., LiN], respectively. The feature matrix ϕ of dimension i × n is fused into a new n-dimensional matrix C̄i, referred to as the fused feature of the ith PSDAE unit, which is expressed as:

$$\bar{C}_i = \sum_{j=1}^{N}\lambda\left(1 - \frac{L_{ij}}{\sum_{j=1}^{i} L_{ij}}\right)C_j + B \tag{7}$$

where i denotes the ith PSDAE unit, N is the number of SDAE units, j indexes the jth SDAE unit of this layer, λ is the threshold offset, Cj is an intermediate feature matrix, and B is the deviation matrix.

After a single PSDAE unit is trained, it needs to be connected to the previous units to form a multi-layer stack structure. Therefore, by using Eq. (8), the fused feature C̄i trained by the ith PSDAE unit and the other fused features trained by all the previous layers are combined into a fused feature matrix ϕ = [C̄1, C̄2, ..., C̄i], which is brought into the training of the next-layer unit:

$$\mathrm{PSDAE}_n = f_\psi\left(\phi\left[\bar{C}_1, \bar{C}_2, \ldots, \bar{C}_i\right]\right) \tag{8}$$

where fψ(·) is a composite function, which includes the activation function, the feature fusion function, the error function and the reconstruction error function.
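A minimal sketch of the fusion rule of Eq. (7), assuming NumPy feature matrices and treating λ as a scalar threshold offset and B as an optional bias matrix:

```python
import numpy as np

def fuse_features(features, errors, lam=1.0, bias=None):
    """Sketch of Eq. (7): weight each surviving SDAE unit's feature C_j by
    1 - L_ij / sum(L_ij), so that low-error units dominate the fusion."""
    errors = np.asarray(errors, dtype=float)
    weights = lam * (1.0 - errors / errors.sum())
    fused = sum(w * c for w, c in zip(weights, features))
    return fused if bias is None else fused + bias
```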

Thus, the network structure of the sPSDAE model is formed. In this network structure, the fused features of all previous superior PSDAE units participate in the feature training process of the next PSDAE unit. The features obtained by fusion not only have low reconstruction errors, but also contain multiple sources of information. This allows the latter PSDAE units to reuse the excellent fused features already trained and to maximize the sharing of information.

3.1.4. Group sparse expression

The sPSDAE model uses the feature fusion method to share information, which reduces the loss of information and widens the horizontal transmission of the network. However, after multiple layers are trained, the sPSDAE model becomes complicated and tends to over-fit because of the increased number of training parameters. In order to speed up the learning rate and reduce over-fitting, the following sparsity strategies are applied to the sPSDAE model.

(1) Dropout for SDAE units and the fine-tuning process

As mentioned above, the training of i SDAE units is included in the PSDAE unit of the ith layer. To introduce sparsity into the SDAE unit, we randomly select some of the input-layer features C̄i in each SDAE unit training loop and temporarily discard them by Eqs. (9)-(10). Then, in each SDAE training process, this introduction of sparsity is repeated every cycle until all units are trained.

$$R_1 = \mathrm{Bernoulli}(1 - p_1) \tag{9}$$

$$\bar{C}_i = R_1 * \bar{C}_i^{\,*} \tag{10}$$

where p1 is the SDAE unit drop probability (the probability of taking zeros in the Bernoulli distribution), and C̄i* is the input matrix before the dropout.

After the training of the entire sPSDAE unit is over, the back-propagation neural network (BPNN) algorithm is required to perform an overall fine-tuning of the weights. Similarly, in this process, a dropout unit is added by Eqs. (11)-(12) to further reduce the model's over-fitting probability.

$$R_2 = \mathrm{Bernoulli}(1 - p_2) \tag{11}$$

$$X_i^* = R_2 * X_i \tag{12}$$

where p2 is the drop probability of the fine-tuning process (the probability of taking zeros in the Bernoulli distribution), Xi is the ith network input on the fine-tuning channel, and Xi* is the input data after random discarding in one cycle.
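The Bernoulli-mask dropout of Eqs. (9)-(12) can be sketched as follows (NumPy, with illustrative shapes and drop probability):

```python
import numpy as np

def dropout_mask(shape, p, rng):
    """Sketch of Eqs. (9)-(12): draw a Bernoulli(1 - p) mask, so a
    fraction p of the entries is zeroed for this training cycle."""
    return rng.binomial(1, 1.0 - p, size=shape)

rng = np.random.default_rng(0)
C = rng.normal(size=(32, 100))        # a batch of intermediate features
R1 = dropout_mask(C.shape, p=0.2, rng=rng)
C_dropped = R1 * C                    # Eq. (10): elementwise masking
```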

(2) Leaky-ReLU activation function

In traditional DAE unit training, the activation function is the sigmoid function. However, when the sigmoid output is close to 0 or 1, the data is close to saturation and the iteration speed becomes very slow, which causes the gradient to disappear. The rectified linear unit (ReLU) is an activation function commonly used in deep network training to overcome this shortcoming. It is defined as:

$$\mathrm{ReLU}(x) = \begin{cases} x & (x > 0) \\ 0 & (x \le 0) \end{cases} \tag{13}$$

ReLU turns negative values into zero, which means that the neuron enters hard saturation there; when the input is greater than zero, the gradient does not decay, which solves the saturation problem. Moreover, Krizhevsky et al. [35] found that the ReLU function increases the iterative convergence speed of stochastic gradient descent, performing much better than the sigmoid. However, some ReLU neurons may "die" during training, that is, the gradient through such a neuron is always zero, leading to training failure. For this reason, the activation function selected for training the sPSDAE model is the Leaky-ReLU (LReLU) function, shown in Eq. (14). The negative part is multiplied by a relatively small constant α specified from experience. LReLU modifies the data distribution of the ReLU function while retaining some negative information in the input data set. He et al. [36] obtained good results by repeatedly experimenting with specific values of α when comparing the ReLU and LReLU functions.

$$\mathrm{LReLU}(x) = \begin{cases} x & (x > 0) \\ \alpha x & (x \le 0) \end{cases} \tag{14}$$
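Eq. (14) in code form (a trivial sketch; the default α = 0.05 anticipates the choice made in Section 4.2.3):

```python
import numpy as np

def lrelu(x, alpha=0.05):
    """Sketch of Eq. (14): identity for positive inputs, small negative
    slope alpha otherwise."""
    return np.where(x > 0, x, alpha * x)
```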

The LReLU function does not require complicated power operations, which greatly reduces the computation time while completing the sparse expression of the information.

3.1.5. Top-layer fine-tuning process

The four main processes above help each PSDAE unit reduce the feature dimensions, and together constitute the pre-training process. To optimize the weight coordination between networks, it is necessary to fine-tune the weights of the entire network on the basis of pre-training, which strengthens the relationship between the weights of each unit. The greedy layer-wise pre-training method proposed by Hinton [11] is used for training. This method uses gradient descent to complete the training of each unit one by one (training here refers to the process of extracting intermediate features through encoding and decoding), and then uses supervised overall fine-tuning to escape local optima.

As shown in Fig. 6, the encoding weight ω of each PSDAE unit is determined layer by layer in the pre-training process. In the fine-tuning phase, a fine-tuning channel with connected SDAE units is first established, as mentioned in Section 3.1.2. These encoding weights are then assigned to the corresponding SDAE units according to the feature dimension. Finally, the BPNN acts as the classifier, and the weights are updated by back-propagation. It is worth mentioning that most models achieving better training results only use the weight of the first PSDAE unit from pre-training to initialize the first layer in fine-tuning, as shown in Fig. 6. A possible reason for this phenomenon is that the subsequent weights are null under this sparse assignment method, which preserves the adjustment ability of the network during fine-tuning and achieves better training results.
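A compact sketch of the greedy layer-wise scheme (pre-training each unit, then stacking the encoders for supervised fine-tuning); PyTorch, the L1 loss, and the omission of noise addition and feature fusion are simplifying assumptions:

```python
import torch
import torch.nn as nn

def greedy_pretrain(layer_dims, x, epochs=50, lr=0.1):
    """Sketch of the greedy layer-wise scheme [11]: each unit is trained
    to reconstruct its own input, then its encoder output becomes the
    next unit's input."""
    encoders, feats = [], x
    for n_in, n_hid in zip(layer_dims[:-1], layer_dims[1:]):
        enc = nn.Sequential(nn.Linear(n_in, n_hid), nn.LeakyReLU(0.05))
        dec = nn.Sequential(nn.Linear(n_hid, n_in), nn.LeakyReLU(0.05))
        opt = torch.optim.SGD(list(enc.parameters()) + list(dec.parameters()), lr=lr)
        for _ in range(epochs):
            loss = nn.functional.l1_loss(dec(enc(feats)), feats)
            opt.zero_grad(); loss.backward(); opt.step()
        feats = enc(feats).detach()     # features feed the next unit
        encoders.append(enc)
    # Fine-tuning (sketch): stack the pre-trained encoders, append a
    # classifier head, and update all weights with back-propagation.
    return nn.Sequential(*encoders)

model = greedy_pretrain([800, 400, 200, 100], torch.randn(64, 800))
```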


Fig. 6. Pre-training and fine-tuning processes of the PSDAE model.

3.2. Fault diagnosis model based on sPSDAE

In this paper, a new intelligent fault diagnosis method for rolling bearings is proposed by using the sPSDAE model described above. As shown in Fig. 7, the main steps are as follows:

(1) Collect vibration signals of the rolling bearing in different states. The vibration signals are input into the sPSDAE model as raw data with state labels.
(2) Set the parameters of the sPSDAE model, including the number of hidden layers, the learning rate, the noise ratio, the feature fusion ratio, and the overall and per-unit dropout ratios.
(3) Train the sPSDAE model. Different features in each PSDAE unit are extracted by cyclically reducing the reconstruction error. Then, the minimum reconstruction error is searched to find the non-superior units and perform the pruning operation, and the features remaining after the pruning operation are merged to obtain the fusion feature. When the entire pre-training cycle is over, the fine-tuning method is used to update the optimal network weights.
(4) A classifier is used to classify the features and output the classification labels. The fault types of the rolling bearing can then be identified.

4. Experimental study

4.1. Experimental setup and data description

The effectiveness of the proposed model is validated by using the bearing fault diagnosis dataset from Case Western Reserve University. The dataset is generated by a bearing test bench, as shown in Fig. 8. The bench mainly consists of a motor (left), a torque sensor/encoder (center), a dynamometer (right) and control electronics (not shown). The tested bearings of type SKF 6205 are located at the drive end of the motor shaft. In the experiments, damages of 7 mils or 21 mils in diameter are induced into the inner race, the ball and the outer race of the tested bearing by means of electric spark machining. The processed faulty bearing is then reloaded into the motor, and the rotation speed of the motor varies from 1730 to 1797 rpm. The vibration acceleration signal is collected and recorded at a sampling frequency of 12 kHz. Finally, a total of 8 types of data samples are obtained, as shown in Table 1. For each data sample, 90% of the samples are randomly selected as the training dataset, and the remaining 10% as the testing dataset.
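The 90/10 split can be sketched as follows (assuming NumPy arrays of samples and labels; names are illustrative):

```python
import numpy as np

def split_dataset(samples, labels, train_frac=0.9, seed=0):
    """Sketch of the 90/10 random split described in Section 4.1."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(samples))
    cut = int(train_frac * len(samples))
    return (samples[idx[:cut]], labels[idx[:cut]],
            samples[idx[cut:]], labels[idx[cut:]])
```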

Fig. 7. Flow chart of fault diagnosis model based on sPSDAE model.


Table 1. Summary of sample sizes and labels for different fault types collected in the bearing fault experiments.

Bearing load (HP) | Fault size (mils) | Fault location          | Number of samples | Label
0-3               | -                 | Normal                  | 1,680,000         | 1
0-3               | 7                 | Inner race              | 480,000           | 2
0-3               | 7                 | Ball                    | 480,000           | 3
0-3               | 7                 | Outer raceway 6 o'clock | 480,000           | 4
0-3               | 7                 | Outer raceway 3 o'clock | 480,000           | 5
0-3               | 21                | Inner race              | 480,000           | 6
0-3               | 21                | Ball                    | 480,000           | 7
0-3               | 21                | Outer raceway 6 o'clock | 480,000           | 8

Table 2. Reconstruction error statistics for different network layer settings of sPSDAE.

Network layer setting                 | Reconstruction error
800-400-200-100-8                     | 0.00001
800-400-200-100-50-8                  | 0
800-400-200-100-50-25-8               | 0.000005
800-400-200-100-50-25-10-8            | 0.000038
800-400-400-200-100-50-50-25-25-10-8  | 0.000023

Fig. 8. Bearing test bench.

Fig. 9. Comparison of reconstruction errors of different initial nodes.

4.2. Parameter setting of the fault diagnosis model

Before the model is trained, several parameters of the sPSDAE model must be set, including the number of initial input nodes and network layers, the parameter of the LReLU function, the learning rate, and the denoising parameter.

4.2.1. Determination of the initial input nodes

In this paper, the number of initial nodes of the sPSDAE model is set to 200, 400, 600 and 800, respectively. The reconstruction error of the model is recorded as shown in Fig. 9. It can be seen from Fig. 9 that as the number of initial nodes increases, the reconstruction error of the training becomes smaller and smaller. However, the larger the number of initial nodes, the deeper the required network, which might cause information loss and affect the accuracy and speed of classification. Therefore, 800 is chosen as the number of initial nodes, which maintains a balance between high accuracy and training speed.

4.2.2. Network layer settings

If there are too many layers, the latter network layers learn wrong feature information, resulting in over-fitting. In this paper, five experiments with different network layer settings of the sPSDAE model are implemented. It can be seen in Table 2 that as the number of network layers increases, the reconstruction error of the training dataset decreases first and then increases. When the model is set to 3 or 4 hidden layers, the reconstruction error is very small. Therefore, 4 or 5 hidden layers are selected.

Fig. 10. The α value selection experiment of Leaky-ReLU function.

4.2.3. Parameter selection of the LReLU function

In this paper, α of the LReLU function is set to 0.05, 0.5, 5.5 and 100, respectively. The experimental results are shown in Fig. 10. When α is set to 0.05, the fastest iterative convergence speed and the lowest mean square error are maintained. Therefore, this paper uses the LReLU function with α set to 0.05 as the sPSDAE activation function.

4.2.4. Learning rate setting

It is well known that if the learning rate is set too small, the convergence speed will be very slow, while if it is set too large, training will overshoot the minimum and never reach it.


Fig. 11. The effect of PSDAE unit learning rate and NN process learning rate on classification accuracy.

Fig. 12. Influence of different noise-added ratios of original data on classification results.

As shown in Fig. 11, several learning rates are set to train the model, and the classification accuracy is recorded. In this experiment, the learning rate of the PSDAE is set to two levels (1 and 0.1) and the learning rate of the NN to four levels (1, 0.5, 0.1 and 0.01), giving eight combinations. After many training runs, we found that when the PSDAE learning rate is too low, some of the superior units over-fit easily, reducing the accuracy; a higher PSDAE learning rate therefore gives good training results. It can be seen in Fig. 11 that when the PSDAE unit learning rate Lr_PSDAE is set to 1 and the fine-tuning learning rate Lr_BPNN is set to 1 or 0.1, the classification accuracy is the highest.

4.2.5. Denoising parameter setting

The ratio of the noise signal added to the original data is called the denoising parameter. It can be seen in Fig. 12 that when the denoising parameter is too small, the denoising ability of the model is not ideal and the classification accuracy is low. However, when the denoising parameter is too large, the noise signal masks too much of the original information, which increases the difficulty of extracting the correct features. In the experiment, when the denoising parameter is 0.5, the classification accuracy of the model reaches its highest value of 0.999. Thus, the denoising parameter is set to 0.5 in this paper.

4.3. Results comparison and analysis

4.3.1. Results comparison

To evaluate the performance of the sPSDAE model, it is compared with other popular models: the stacked sparse autoencoder (sSAE), sDAE, CNN, BPNN, and SVM.

(1) Comparison of fault classification accuracy

The results of the classification test are shown in Fig. 13. The average classification accuracies of sPSDAE, CNN, sDAE, sSAE, BPNN, and SVM are 99.94%, 99.6%, 98.02%, 95.186%, 81.96%, and 87.31%, respectively. Among them, the sPSDAE has the highest classification accuracy, with CNN second. This is because these two network models have relatively good complex data-mapping capability for identifying data characteristics. Compared with the sDAE and the sSAE, the sPSDAE shows better data classification and recognition capability, indicating that the pruned connected network of the sPSDAE model can improve the performance of the original autoencoder and enhance adaptive feature recognition through multi-channel information sharing.

Fig. 13. Classification accuracy of different models in five randomized experiments.

Compared with the shallow machine learning models such as BPNN and SVM, the classification accuracy of the sPSDAE model is much higher. Especially in the first experiment, the classification accuracy of BPNN was lower than 80%. This shows that the sPSDAE constructs a deeper model that solves the problem of the insufficient learning ability of shallow networks. In conclusion, the pruning connection not only solves the insufficient learning ability of shallow models, but also provides a clear performance improvement over the earlier deep learning models.

(2) Comparison of training reconstruction error

Fig. 14 shows a comparison of the reconstruction error (training error) curves during the training iterations of the six models (CNN, sDAE, sPSDAE, BPNN, SVM, sSAE). It can be seen from Fig. 14 that the reconstruction errors of CNN and sPSDAE are about the same, with the sPSDAE error slightly lower than that of CNN. The traditional autoencoder models (sSAE and sDAE) have a significantly slower descent speed than the sPSDAE model, and both show a pronounced plateau period. The comparison of sSAE and sDAE shows that the introduction of sparsity helps sSAE overcome the first gradient disappearance problem faster. However, sPSDAE shows no obvious gradient disappearance at this stage, indicating that it solves the gradient disappearance problem very well. For the shallow neural network models (BPNN and SVM), significant gradient disappearance problems occur in the training process.

Fig. 14. Comparison of reconstruction error reduction speeds of different fault diagnosis models.

(3) Comparison of feature extraction capabilities

In a deep learning network, the autoencoder is characterized by the ability to adaptively extract features from the original signal. To verify the feature extraction ability of the sPSDAE model on complex bearing fault data, noise with a ratio of 1:1 is added to the original data, making the data more complicated before training the different models. Meanwhile, principal component analysis (PCA) is used to analyze the fault features extracted by the different models, and the two main components are displayed in two-dimensional scatter plots (Fig. 15(b)-(e)). The distribution of the original data is shown in Fig. 15(a). It can be seen in Fig. 15(a) that the overlapping of the original vibration signals is severe, indicating that they are difficult to classify. In Fig. 15(b), classification with the shallow BPNN method is not ideal, and most features still overlap. In Fig. 15(c), after the sSAE is used to extract the signal characteristics, the various bearing faults begin to separate, but some features overlap. In Fig. 15(d), after the data is processed by the sDAE, the extracted features are basically separated, but some edge points remain overlapping. It can be seen from Fig. 15(e) that after the data is processed by the sPSDAE, the data points corresponding to the same bearing fault type are well separated and form well differentiated clusters. This proves that the sPSDAE model has a stronger ability to extract features and classify.

Fig. 15. Comparison of sPSDAE model and other models on raw data extraction features.

(4) Comparison of noise robustness of different models

At an actual machining site, the collected data is inevitably contaminated by industrial noise. To compare the sensitivity of different models to noise, we use Eq. (15) to randomly add binomially distributed noise with a certain ratio q to the original data. Then, each model is used for fault diagnosis, and the classification accuracy is recorded.

$$X^N \sim l_D\left(X \mid q\right) \tag{15}$$

where l_D is a random noise that satisfies the binomial distribution, X is the original data, X^N is the noise-added data, and q is the ratio of the original data to the noise in the noise-added data.

It can be clearly seen from Fig. 16 that as the noise increases (the signal-to-noise ratio decreases), the classification accuracy of each model decreases. The accuracy of the traditional autoencoders and shallow network models drops markedly, but the accuracy of the sPSDAE is always higher than that of the other models. In particular, even when the signal-to-noise ratio is 1, the classification accuracy of the sPSDAE model is still over 93%. This shows that sPSDAE has stronger noise robustness than the other models.
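One plausible reading of Eq. (15) in code, replacing a fraction of the entries with noise according to the keep ratio q; this interpretation, and the Gaussian fill-in for the corrupted entries, are our assumptions:

```python
import numpy as np

def add_binomial_noise(X, q, rng=None):
    """Sketch of Eq. (15): draw a binomial (Bernoulli) keep-mask with
    ratio q and replace the remaining entries with random noise."""
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.binomial(1, q, size=X.shape)        # 1 = keep original entry
    noise = rng.normal(0.0, X.std(), size=X.shape)
    return mask * X + (1 - mask) * noise
```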


Fig. 16. Comparison of the robustness of different models to noise.

Fig. 17. Comparison of training times of different models.

Fig. 18. Weight ratio diagram of the information channel selection method of the sPSDAE model.

4.3.2. Further analysis of the sPSDAE model

(1) Training time analysis

To further illustrate the sPSDAE pruning effect and multi-channel information fusion, we construct an sUPSDAE model, an sPSDAE model and an sDAE model, each with the layer configuration '800-400-200-100-50-25-8'. The training times and classification accuracies of the three models are compared. As shown in Fig. 17, the classification accuracy of the sUPSDAE model and the sPSDAE model is about 99.8%, which is 3.4% higher than the 96.4% of the sDAE model. The training time of the sPSDAE model is reduced by 40.8% compared with the sUPSDAE model, and is only about 5% longer than that of the traditional sDAE model. This shows that the sPSDAE model avoids the large increase in training time caused by full connection, and further improves the classification accuracy within an acceptable increase in training time.

(2) Pruning operation analysis

To visualize the pruning and fusion process, this paper narrows the gap between the fusion weights of each unit in the sUPSDAE and displays the weights used in layer fusion. As shown in Fig. 18, the ordinate represents the input data and output target of each SDAE unit, and each row corresponds to the units trained on the same input data. The abscissa represents the intermediate hidden feature values trained in an SDAE unit. The number of small squares in one column is therefore the number of SDAE units in a UPSDAE unit (or a PSDAE unit after pruning), that is, the number of units whose features need to be fused. The color of each small square represents the weight of the SDAE unit in the feature fusion of the UPSDAE unit; the lighter the color, the higher the fusion weight. In addition, a small square marked with a red cross represents an SDAE unit that has been pruned.

From Fig. 18, we can analyze the advantages of sPSDAE over the previous traditional models. The traditional sSDAE has only one

information transmission channel, which is the fine-tuning channel indicated by the red arrow in the figure. In contrast, the sPSDAE model has multiple channels of information transfer (select any one SDAE unit in each column and connect them together). The green arrow in the figure marks the main information transmission channel selected by the sPSDAE model through feature fusion. This channel ensures that PSDAE unit training extracts features along the direction in which the reconstruction error is sufficiently small.

The reason why the sPSDAE model chooses a new channel is the following. In the process of extracting the second fused feature C̄2, an SDAE unit (C̄1, C̄21) with a low reconstruction error appears (meaning that the training structure of this SDAE unit is 'C̄1, C̄21, C̄1'), which indicates that the feature C̄21 extracted by this layer fully captures the characteristics of its input data C̄1. The reconstruction error of the feature extracted by the other SDAE unit (X0, C̄22) is much larger than that of the first one, so it receives a very low weight (navy blue) in the feature fusion process of C̄2. This shows that the input-layer data C̄1 of the superior unit is more suitable for being converted into the lower-dimensional feature C̄2 than the input data X0 of the non-superior unit. Therefore, the superior unit has both an excellent intermediate feature and excellent input data. The sPSDAE preserves these two advantages of the superior units during training, reducing the degree to which the error is amplified layer by layer.

Furthermore, in each PSDAE unit, the superior unit is given a greater weight when producing the fused feature, thereby forming a channel dominated by the superior units, which ensures the maximum transmission of information with minimum reconstruction error at each layer. In this experiment, since the input-layer data C̄1 participates in the training of all the superior units after the first layer, the main information transmission channel marked by the green arrow is formed. The formation of this minimum-error descent channel is also the reason why the sPSDAE model has higher classification accuracy and a faster iteration speed.

It can also be clearly seen from Fig. 18 that the weights of all pruned units in the fusion process of the UPSDAE units are extremely low, meaning that these pruned units contribute little to the fusion compared with the superior units. This indicates that the PSDAE strategy of prohibiting non-superior units from participating in subsequent training is reasonable. It not only retains the information of all the previous superior units and realizes information sharing between units, but also greatly reduces the complexity of the fully connected network.

Table 3. Average accuracy of sPSDAE diagnostics for two bearing fault datasets.

Bearing dataset          | Classification accuracy (%)
Small dataset (4 types)  | 100
Big dataset (8 types)    | 99.86

(3) Classification ability analysis for small datasets

In particular, 4 types of fault data are selected randomly from the original data, and this small dataset is used to train the proposed model and test its ability to classify small samples. The experimental results are shown in Table 3. It can be seen that in the fault diagnosis on the small dataset, the classification accuracy of the model is 100%, and in the diagnostic test on the large dataset, the test set accuracy reaches 99.86%. This shows that the full connection mode of the sPSDAE model can learn useful information from both big and small datasets very well.

5. Conclusion

This paper proposes a new sPSDAE model for rolling bearing fault diagnosis. The sPSDAE is based on a fully connected network formed by linking all the feature extraction layers of the SDAE. Then, according to the reconstruction error of each layer, the pruning operation is performed on the non-superior units in the fully connected network to decrease the training load of the network. After that, the features extracted from the same layer after pruning are fused to ensure the uniqueness of the feature dimensions. In the unit training process, the sparsity adjustment method and the LReLU function are introduced to reduce the data complexity and complete the sparse expression. According to the experimental results, sPSDAE is superior to other popular fault diagnosis methods in classification accuracy and training iteration rate. As can be seen from the further analysis, the advantages of the proposed sPSDAE model are:

(1) Reusing and sharing the information of each layer in the network, which improves the adaptive feature mining ability of the model.
(2) Selecting the best information flow channel in the network, which speeds up the decline of the training iteration error.
(3) Preventing over-fitting through the sparse network structure, which improves the prediction accuracy and generalization of the model.

The basic unit of the proposed method is the sDAE model. However, the noise adding method might lead to longer training times. Future research will be carried out in the following directions. Firstly, the noise interference of the basic sDAE unit in the model can be marginalized, that is, the unit can be replaced with a better basic autoencoder unit. Secondly, the feature fusion procedure should be improved to increase the speed of the model while maximizing the information sharing of each layer. Thirdly, in the pruning operation, future research is needed to determine how large the reconstruction error gap between units should be in order to find better pruning thresholds.

Declaration of competing interest

No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.asoc.2019.106060.


CRediT authorship contribution statement

Haiping Zhu: Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing - review & editing. Jiaxin Cheng: Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Resources, Software, Validation, Visualization, Writing - original draft, Writing - review & editing. Cong Zhang: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Resources, Software, Supervision, Validation, Visualization, Writing - original draft, Writing - review & editing. Jun Wu: Conceptualization, Data curation, Formal analysis, Funding acquisition, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing - original draft, Writing - review & editing. Xinyu Shao: Conceptualization, Formal analysis, Funding acquisition, Investigation, Project administration, Resources, Supervision, Validation, Visualization, Writing - review & editing.

Acknowledgments

This work is supported in part by the National Key Research and Development Program of China under Grant No. 2018YFB1703204, in part by the National Natural Science Foundation of China under Grant No. 51875225, and in part by the Key Research and Development Program of Guangdong Province, China under Grant No. 2019B090916001. The authors express their gratitude to the editor and the anonymous reviewers for their valuable comments.

References

[1] Z. Ming, Z. Jiang, K. Feng, Research on variational mode decomposition in rolling bearings fault diagnosis of the multistage centrifugal pump, Mech. Syst. Signal Process. 93 (2017) 460–493.
[2] J. Wu, C. Wu, S. Cao, Degradation data-driven time-to-failure prognostics approach for rolling element bearings in electrical machines, IEEE Trans. Ind. Electron. PP (99) (2018) 1.
[3] V.K. Rai, A.R. Mohanty, Bearing fault diagnosis using FFT of intrinsic mode functions in Hilbert–Huang transform, Mech. Syst. Signal Process. 21 (6) (2007) 2607–2615.
[4] P.W. Tse, W.X. Yang, H.Y. Tam, Machine fault diagnosis through an effective exact wavelet analysis, J. Sound Vib. 277 (4–5) (2004) 1005–1024.
[5] N.E. Huang, Z. Shen, S.R. Long, The empirical mode decomposition and the Hilbert spectrum for nonlinear and non-stationary time series analysis, Proc. R. Soc. Lond. A 454 (1971) (1998) 903–995.
[6] J. Wu, C. Wu, Y. Lv, Design a degradation condition monitoring system scheme for rolling bearing using EMD and PCA, Ind. Manag. Data Syst. 117 (4) (2017).
[7] Y. Shatnawi, M. Al-Khassaweneh, Fault diagnosis in internal combustion engines using extension neural network, IEEE Trans. Ind. Electron. 61 (3) (2013) 1434–1443.
[8] L.A. Zadeh, Fuzzy logic, Computer 21 (4) (1988) 83–93.
[9] J. Wu, Y. Su, Y. Cheng, Multi-sensor information fusion for remaining useful life prediction of machining tools by adaptive network based fuzzy inference system, Appl. Soft Comput. 68 (2018) 13–23.
[10] C. Cortes, V. Vapnik, Support-vector networks, Mach. Learn. 20 (3) (1995) 273–297.
[11] G.E. Hinton, R.R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science 313 (5786) (2006) 504–507.
[12] A.R. Mohamed, G. Hinton, Phone recognition using restricted Boltzmann machines, in: IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2010, pp. 4354–4357.
[13] V.T. Tran, F. AlThobiani, A. Ball, An approach to fault diagnosis of reciprocating compressor valves using Teager–Kaiser energy operator and deep belief networks, Expert Syst. Appl. 41 (9) (2014) 4113–4122.
[14] F.-J. Lin, I.-F. Sun, K.-J. Yang, et al., Recurrent fuzzy neural cerebellar model articulation network fault-tolerant control of six-phase permanent magnet synchronous motor position servo drive, IEEE Trans. Fuzzy Syst. 24 (1) (2016) 153–167.
[15] T. Ince, S. Kiranyaz, L. Eren, M. Askar, M. Gabbouj, Real-time motor fault detection by 1-D convolutional neural networks, IEEE Trans. Ind. Electron. 63 (11) (2016) 7067–7075.
[16] R.K. Srivastava, K. Greff, J. Schmidhuber, Training very deep networks, Comput. Sci. (2015).
[17] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[18] G. Huang, Z. Liu, L. van der Maaten, Densely connected convolutional networks, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2017.
[19] P. Tamilselvan, P. Wang, Failure diagnosis using deep belief learning based health state classification, Reliab. Eng. Syst. Saf. 115 (7) (2013) 124–135.
[20] L. Wen, X. Li, L. Gao, A new convolutional neural network-based data-driven fault diagnosis method, IEEE Trans. Ind. Electron. 65 (7) (2018) 5990–5998.
[21] Y. Cheng, H. Zhu, J. Wu, X. Shao, Machine health monitoring using adaptive kernel spectral clustering and deep long short-term memory recurrent neural networks, IEEE Trans. Ind. Inf. PP (99) (2019) 1.
[22] M. Marcin, L. Marcel, P. Marcin, Neural network-based robust actuator fault diagnosis for a non-linear multi-tank system, ISA Trans. 61 (2016) 318–328.
[23] H. Bourlard, Y. Kamp, Auto-association by multilayer perceptrons and singular value decomposition, Biol. Cybernet. 59 (4–5) (1988) 291–294.
[24] P. Vincent, H. Larochelle, Y. Bengio, P.A. Manzagol, Extracting and composing robust features with denoising autoencoders, in: International Conference on Machine Learning, ACM, 2008.
[25] C. Poultney, S. Chopra, Y. LeCun, Efficient learning of sparse representations with an energy-based model, Adv. Neural Inf. Process. Syst. (2007) 1137–1144.
[26] J. Masci, U. Meier, D. Ciresan, J. Schmidhuber, Stacked convolutional auto-encoders for hierarchical feature extraction, in: International Conference on Artificial Neural Networks, 2011.
[27] D.P. Kingma, M. Welling, Auto-encoding variational Bayes, 2013.
[28] M. Chen, K. Weinberger, F. Sha, Marginalized denoising auto-encoders for nonlinear representations, in: International Conference on Machine Learning, 2014.
[29] M. Sakurada, T. Yairi, Anomaly detection using autoencoders with nonlinear dimensionality reduction, in: MLSDA Workshop on Machine Learning for Sensory Data Analysis, ACM, 2014.
[30] W. Sun, S. Shao, R. Zhao, R. Yan, X. Zhang, A sparse auto-encoder-based deep neural network approach for induction motor faults classification, Measurement 89 (2016) 171–178.
[31] C. Lu, Z.Y. Wang, W.L. Qin, Fault diagnosis of rotary machinery components using a stacked denoising autoencoder-based health state identification, Signal Process. 130 (2017) 377–388.
[32] Y. Bengio, P. Lamblin, D. Popovici, Greedy layer-wise training of deep networks, Adv. Neural Inf. Process. Syst. 19 (2007) 153–160.
[33] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion, J. Mach. Learn. Res. 11 (12) (2010) 3371–3408.
[34] S. Meidi, W. Hui, L. Ping, A sparse stacked denoising autoencoder with optimized transfer learning applied to the fault diagnosis of rolling bearings, Measurement 146 (2019) 305–314.
[35] A. Krizhevsky, I. Sutskever, G. Hinton, ImageNet classification with deep convolutional neural networks, in: International Conference on Neural Information Processing Systems, 2012.
[36] K. He, X. Zhang, S. Ren, J. Sun, Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, 2015.
[27] Kingma, P. Diederik, M. Welling, Auto-Encoding Variational Bayes, 2013. [28] M. Chen, K. Weinberger, F. Sha, Marginalized denoising auto-encoders for nonlinear representations, in: International Conference on International Conference on Machine Learning, 2014. [29] Sakurada, Mayu, T. Yairi, Anomaly detection using autoencoders with nonlinear dimensionality reduction, in: Mlsda Workshop on Machine Learning for Sensory Data Analysis, ACM, 2014. [30] W. Sun, S. Shao, R. Zhao, R. Yan, X. Zhang, A sparse auto-encoder-based deep neural network approach for induction motor faults classification, Measurement 89 (ISFA) (2016) 171–178. [31] C. Lu, Z.Y. Wang, W.L. Qin, Fault diagnosis of rotary machinery components using a stacked denoising autoencoder-based health state identification, Signal Process. 130 (C) (2017) 377–388. [32] Y. Bengio, P. Lamblin, P. Dan, Greedy layer-wise training of deep networks, Adv. Neural Inf. Process. Syst. 19 (2007) 153–160. [33] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion, J. Mach. Learn. Res. 11 (12) (2010) 3371–3408. [34] S. Meidi, W. Hui, L. Ping, A sparse stacked denoising autoencoder with optimized transfer learning applied to the fault diagnosis of rolling bearings, Measurement 146 (2019) 305–314. [35] A. Krizhevsky, I. Sutskever, G. Hinton, Imagenet classification with deep convolutional neural networks, in: International Conference on Neural Information Processing Systems, 2012. [36] K. He, X. Zhang, S. Ren, Delving Deep Into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, 2015.