An adaptive deep transfer learning method for bearing fault diagnosis


Journal Pre-proof

Zhenghong Wu, Hongkai Jiang, Ke Zhao, Xingqiu Li

PII: S0263-2241(19)31092-9
DOI: https://doi.org/10.1016/j.measurement.2019.107227
Reference: MEASUR 107227
To appear in: Measurement
Received Date: 11 August 2019
Revised Date: 30 September 2019
Accepted Date: 30 October 2019

Please cite this article as: Z. Wu, H. Jiang, K. Zhao, X. Li, An adaptive deep transfer learning method for bearing fault diagnosis, Measurement (2019), doi: https://doi.org/10.1016/j.measurement.2019.107227

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

© 2019 Published by Elsevier Ltd.

An adaptive deep transfer learning method for bearing fault diagnosis

Zhenghong Wu, Hongkai Jiang1, Ke Zhao, Xingqiu Li

School of Aeronautics, Northwestern Polytechnical University, 710072 Xi'an, China

Abstract: Bearing fault diagnosis has made some achievements based on massive labeled fault data. In practical engineering, machines are mostly in a healthy state and faults seldom happen, so it is difficult or expensive to collect massive labeled fault data. To solve this problem, an adaptive deep transfer learning method for bearing fault diagnosis is proposed in this paper. Firstly, a long-short term memory recurrent neural network model based on instance-transfer learning is constructed to generate some auxiliary datasets. Secondly, joint distribution adaptation, a feature-transfer learning method, is used to reduce the differences in probability distributions between an auxiliary dataset and the target domain dataset. Finally, the grey wolf optimization algorithm is introduced to adaptively learn the key parameters of joint distribution adaptation. The proposed method is verified with two kinds of datasets, and the results demonstrate its effectiveness and robustness when labeled fault data are scarce.

Keywords: Feature-transfer learning; Instance-transfer learning; Long-short term memory recurrent neural network; Joint distribution adaptation; Grey wolf optimization algorithm

1. Introduction

Bearings are a significant part of mechanical systems and have been widely used in rotating machinery, working in complex and variable environments for a long time [1-4]. When a bearing fails, it affects the working efficiency of the machine and causes economic losses, and in serious cases it becomes a threat to personal safety, so bearing fault diagnosis has been a research hotspot all along [5-7]. With the rapid development of machine learning, deep learning has been introduced to bearing fault diagnosis [8,9]. Under a common assumption, namely that the training and the testing data are drawn from the same feature space and the same distribution, deep learning has made some achievements in bearing fault diagnosis [10]. Shao et al. proposed an adaptive deep belief network (DBN) for rolling bearing fault diagnosis, which achieved good classification and robustness [11]. Yang et al. applied the long-short term memory recurrent neural network (LSTM) to rotating machine fault diagnosis, and the effectiveness of the method was verified by a large number of comparative experiments [12]. Zhang et al. used a convolutional neural network (CNN) for fault diagnosis under noisy environments and different working loads [13]. Xiang et al. designed a stacked auto-encoder (SAE) for rolling bearing fault diagnosis [14]. In practical engineering, it is often impossible to meet the common assumption mentioned above due to complex and varied working environments. Therefore, how to use a small number of labeled data to build a reliable classification model becomes particularly important. Transfer learning (TL) is a new deep learning method that uses existing knowledge to tackle problems in different but related fields, which eases the requirements for data characteristics [15].

1 Corresponding author. Email address: [email protected]

Considering the different probability distributions of the training dataset and the testing dataset in TL research, the dataset with the same probability distribution as the testing dataset is called the target domain (Dtar), and the dataset without the same probability distribution as the testing dataset is called the source domain (Dsrc) [16]. TL can be classified into instance-TL, feature-TL, parameter-TL and relational-knowledge-TL. Instance-TL learns more knowledge in Dtar by reweighting; feature-TL learns a "good" feature representation for Dtar, with the knowledge encoded into the learned feature representation [15]. In recent years, TL has attracted more and more attention [17-19]. In real life, applications of TL can be found in fields such as text recognition [20], emotion analysis [21], image recognition [22] and software defect recognition [23]. Scholars have also made some attempts to apply TL to bearing fault diagnosis. For example, Wu et al. utilized a support vector machine framework for fault diagnosis [24]. Wei et al. designed a domain adaptive fault diagnosis method under different working conditions to improve the classification accuracies [25]. Wen et al. applied an automatic encoder to bearing fault diagnosis [26]. Zhong et al. proposed a CNN-based method for gas turbine fault diagnosis, and abundant experiments confirmed that it achieved great classification accuracies under the condition of small samples [27]. Chen et al. applied the transfer component analysis method to fault diagnosis to achieve cross-domain feature extraction [28].

The above researches show that TL has achieved good results, but it still has three weaknesses: 1) the data used for TL are all from the same bearing, so the transferred knowledge is only from one working condition to another; moreover, these methods are not verified with practical data, and their generalization abilities have not been confirmed; 2) some previous studies extracted data features before TL, which puts forward higher requirements for scholars' professional knowledge and engineering experience; 3) they took a lot of time to manually find the optimal parameters of their methods.

To overcome the above weaknesses, an adaptive deep transfer learning method for bearing fault diagnosis is proposed here. Because bearing vibration signals are time sequences, LSTM has a powerful ability to learn the long-term dependencies hidden in such sequential data. Firstly, the proposed method uses a LSTM model based on instance-TL to establish the mapping relationship between Dsrc and Dtar and to generate some auxiliary datasets. Then, joint distribution adaptation (JDA) is used to adapt the marginal probability and the conditional probability of an auxiliary dataset and Dtar. As a feature-TL method, JDA was first proposed by Long et al. [29] and has been applied to bearing fault diagnosis [30, 31]. Finally, the grey wolf optimization (GWO) algorithm is introduced to adaptively learn the JDA key parameters, which improves the classification accuracies and saves the time spent searching for optimal parameters. The proposed method is verified with an experimental bearing dataset and a practical locomotive bearing dataset. To obtain clearer bearing health information and decrease sample dimensions [32, 33], the raw vibration dataset is converted into the frequency spectra dataset.
The proposed method is compared with JDA and the deep learning methods CNN, LSTM, DBN and SAE, and its feasibility is demonstrated by four experiments. The contributions of the proposed method can be summarized as follows:
1) Building a LSTM model based on instance-TL: the model can generate some auxiliary datasets to alleviate the scarcity of labeled data in Dtar.
2) Using JDA to adapt two datasets: JDA is used to reduce the differences in probability distributions between an auxiliary dataset and Dtar.
3) Optimizing the JDA key parameters by the GWO algorithm: the GWO algorithm is introduced to improve classification accuracies and save the time spent finding the optimal parameters.
The organization of the remainder is as follows: Section 2 introduces the principles of LSTM and JDA. Section 3 elaborates on the construction process of the proposed method. The proposed method is verified by sufficient experiments in Section 4. General conclusions are given in Section 5.

2. Theoretical background

2.1. Long-short term memory

Deep learning methods such as CNN, DBN and SAE share some commonalities: the input data must have the same dimensions, and signals can only be transmitted from the current layer to the next layer [34, 35]. However, the recurrent neural network (RNN) is different from the methods mentioned above. RNN has a certain memory function, and the output of a neuron can be directly applied to itself at the next moment. That is, at time t, the input of the hidden layer neurons includes not only the input layer neurons at that time, but also the output of the neurons at time (t-1) [36]. However, as the time interval increases, gradient vanishing or gradient explosion arises in RNN; if there are too many layers, RNN loses the ability to learn more distant information. Therefore, LSTM is widely used in practice. As shown in Fig. 1, the model improves the traditional RNN by introducing the input gate, the forget gate and the output gate to process information [37, 38].


Fig. 1. Architecture of LSTM.

The operation processes of the three gates can be described by the following formulas [37]:

g_t = g(W_g · [h_{t-1}, x_t] + s_g)    (1)
i_t = σ(W_i · [h_{t-1}, x_t] + e_i · c_{t-1} + s_i)    (2)
f_t = σ(W_f · [h_{t-1}, x_t] + e_f · c_{t-1} + s_f)    (3)
o_t = σ(W_o · [h_{t-1}, x_t] + e_o · c_{t-1} + s_o)    (4)
c_t = i_t * g_t + f_t * c_{t-1}    (5)
h_t = o_t * h(c_t)    (6)

in which W_g, W_i, W_f and W_o are the connection weight matrices from the input layer to the LSTM layer, the input gate, the forget gate, and the output gate, respectively; e_i, e_f and e_o are the weight matrices of the peephole connections to the input gate, the forget gate, and the output gate, respectively; c_{t-1} and c_t denote the cell state at time (t-1) and time t, respectively; s_g, s_i, s_f and s_o are the corresponding offset vectors; σ(·) denotes the activation function of the gate structure; g(·) and h(·) are the activation functions of the input and the output, respectively.

In Fig. 1, the cell acts as a processor to judge whether information is useful. The three dotted lines are called peephole connections. The peephole structure introduces the cell state at the last moment into the three gates and controls them. The input gate determines whether new information can be added to the neural network; the forget gate selectively forgets information in the cell state; the output gate decides which part of the cell information will be output. LSTM utilizes the gate structure flexibly, which is an effective way to solve the long-term dependence problem of RNN.
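To make Eqs. (1)-(6) concrete, the following is a minimal NumPy sketch of one peephole LSTM step. The parameter container p and the vector shapes are illustrative assumptions, and g(·) and h(·) are taken to be tanh here, which the paper does not specify.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One peephole LSTM step following Eqs. (1)-(6).

    p maps names to parameters: W_g/W_i/W_f/W_o act on [h_{t-1}, x_t],
    e_i/e_f/e_o are peephole weights on the cell state, s_* are offsets."""
    z = np.concatenate([h_prev, x_t])                            # [h_{t-1}, x_t]
    g_t = np.tanh(p["W_g"] @ z + p["s_g"])                       # Eq. (1)
    i_t = sigmoid(p["W_i"] @ z + p["e_i"] * c_prev + p["s_i"])   # Eq. (2): input gate
    f_t = sigmoid(p["W_f"] @ z + p["e_f"] * c_prev + p["s_f"])   # Eq. (3): forget gate
    o_t = sigmoid(p["W_o"] @ z + p["e_o"] * c_prev + p["s_o"])   # Eq. (4): output gate
    c_t = i_t * g_t + f_t * c_prev                               # Eq. (5): cell update
    h_t = o_t * np.tanh(c_t)                                     # Eq. (6): hidden output
    return h_t, c_t
```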

2.2. Joint distribution adaptation

A domain D is composed of an r-dimensional feature space X and a marginal probability P(x), i.e., Dsrc = {Xsrc, P(xsrc)} and Dtar = {Xtar, P(xtar)}, where xsrc ∈ Xsrc and xtar ∈ Xtar. For easy understanding, Dsrc and Dtar can be written as Dsrc = {(x_1, y_1), …, (x_n, y_n)} and Dtar = {(x_1, y_1), …, (x_m, y_m)}, where n = n_src and m = n_tar. Dimensionality reduction is devoted to obtaining a new feature representation by minimizing the reconstruction error of the input data [29]. Principal Component Analysis (PCA) is used to reduce the data dimension in the proposed method, and the process can be generalized as equation (8):

X = [x_1, …, x_{n+m}] ∈ R^{r×(n+m)}    (7)

max_{H^T H = I} tr(H^T X (I − (n+m)^{-1} L) X^T H)    (8)

where Xsrc and Xtar are the feature spaces of the source domain and the target domain, respectively; P(xsrc) and P(xtar) are the marginal probabilities of the source domain data and the target domain data, respectively; x_n and y_n represent the nth sample and the corresponding label, respectively; H is the goal matrix that PCA looks for; I is the identity matrix; L is the (n+m)×(n+m) matrix of ones; tr(·) is the trace of the matrix.

How can unlabeled data in Dtar be accurately diagnosed with the help of massive labeled data in Dsrc? JDA provides a promising idea to tackle this problem. As a feature-TL method, JDA finds a matrix H by using PCA to reduce the dimension of the input data. Under the conditions that P(xsrc) and P(xtar), as well as P(ysrc|xsrc) and P(ytar|xtar), obey different probability distributions, the difference between P(H^T xsrc) and P(H^T xtar) is reduced as much as possible by the matrix H, as well as the difference between P(ysrc|H^T xsrc) and P(ytar|H^T xtar). P(ysrc|xsrc) and P(ytar|xtar) denote the conditional probabilities of the source domain data and the target domain data, respectively. The most prominent feature of JDA is that it can simultaneously adapt the marginal probability and the conditional probability; JDA measures the differences in probability distributions between two datasets by the Maximum Mean Discrepancy (MMD) [29].

2.2.1. Marginal probability adaptation

To reduce the discrepancy between the marginal probabilities P(xsrc) and P(xtar), JDA aims to find a matrix H that decreases the discrepancy, which is evaluated by MMD. After continual simplification, the final optimization goal can be summarized as equation (10):

(M_m)_{ij} =
  1/n_src^2,          x_i, x_j ∈ Dsrc
  1/n_tar^2,          x_i, x_j ∈ Dtar
  −1/(n_src n_tar),   otherwise    (9)

D(Dsrc, Dtar) = tr(H^T X M_m X^T H)    (10)

where n_src and n_tar denote the total numbers of source domain samples and target domain samples, respectively; M_m is the central matrix. JDA reduces the dimensions of Dsrc and Dtar in the process of finding the matrix H, thus reducing their marginal probability difference.

2.2.2. Conditional probability adaptation

However, reducing the marginal probability discrepancy does not guarantee that the conditional probability discrepancy will also be decreased [29]. The conditional probabilities P(ysrc|xsrc) and P(ytar|xtar) are closely related to robust distribution adaptation. The matrix H is again fully utilized to adapt the conditional probability. The optimization objective can be presented as equation (12):

(M_s)_{ij} =
  1/n_src,s^2,             x_i, x_j ∈ Dsrc^(s)
  1/n_tar,s^2,             x_i, x_j ∈ Dtar^(s)
  −1/(n_src,s n_tar,s),    x_i ∈ Dsrc^(s), x_j ∈ Dtar^(s) or x_i ∈ Dtar^(s), x_j ∈ Dsrc^(s)
  0,                       otherwise    (11)

D(Dsrc, Dtar) = Σ_{s=1}^{S} tr(H^T X M_s X^T H)    (12)

where M_s denotes the matrix involved with classification label s; n_src,s and n_tar,s are the numbers of s-class samples in Dsrc and Dtar, respectively; Dsrc^(s) and Dtar^(s) represent the s-class samples in Dsrc and Dtar, respectively.

To address the scarcity of labeled data in Dtar when adapting the conditional probability, JDA trains a classifier on (xsrc, ysrc), then predicts xtar to generate pseudo-labels, and finally improves the accuracy of the pseudo-labels by increasing the number of iterations. The classification accuracy is obtained by comparing the pseudo-labels and ytar. By combining the two adaptations mentioned above, an overall optimization goal can be obtained:

(X Σ_{s=0}^{S} M_s X^T + λI) A = X M_m X^T A Φ    (13)

where I, λ and Φ represent the identity matrix, the regularization parameter and the Lagrange multipliers, respectively. The JDA adaptation principle is illustrated in Fig. 2. By comparing Fig. 2(a) and Fig. 2(b), it can be seen that similar characteristics are extended by JDA, which makes a high accuracy of bearing fault diagnosis possible.
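For concreteness, the following is a minimal linear sketch of one JDA projection step built from Eqs. (9)-(13). The kernel width β used later in the paper implies a kernelized variant, which is omitted here; following the original JDA paper [29], the right-hand-side constraint uses the PCA centering matrix of Eq. (8). All function and variable names are illustrative assumptions.

```python
import numpy as np
import scipy.linalg

def jda_projection(Xs, Xt, ys, yt_pseudo, k, lam, n_class=5):
    """Sketch of one JDA projection step (Eqs. (9)-(13)).

    Columns of Xs (r x n) and Xt (r x m) are samples. Builds the marginal
    MMD matrix M_m (Eq. 9) plus the class-wise matrices M_s (Eq. 11) from
    pseudo-labels, then solves a generalized eigenproblem for the
    k-dimensional projection A."""
    X = np.hstack([Xs, Xt])
    n, m = Xs.shape[1], Xt.shape[1]
    N = n + m
    e = np.vstack([np.full((n, 1), 1.0 / n), np.full((m, 1), -1.0 / m)])
    M = e @ e.T                                  # M_m, Eq. (9)
    if yt_pseudo is not None:                    # conditional part, Eq. (11)
        for s in range(1, n_class + 1):
            es = np.zeros((N, 1))
            src = np.where(ys == s)[0]
            tar = n + np.where(yt_pseudo == s)[0]
            if len(src) and len(tar):
                es[src] = 1.0 / len(src)
                es[tar] = -1.0 / len(tar)
                M = M + es @ es.T                # add class-s MMD matrix M_s
    Hc = np.eye(N) - np.ones((N, N)) / N         # I - (n+m)^{-1} L, from Eq. (8)
    lhs = X @ M @ X.T + lam * np.eye(X.shape[0])
    rhs = X @ Hc @ X.T                           # centering constraint, as in [29]
    w, V = scipy.linalg.eig(lhs, rhs)            # generalized eigenproblem, Eq. (13)
    order = np.argsort(np.abs(w))[:k]            # keep the k smallest eigenvalues
    return V[:, order].real                      # projection matrix A (r x k)
```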


Fig. 2. The adaptation principle of JDA: (a) without JDA; (b) with JDA.

3. The proposed method

In this paper, an adaptive deep transfer learning method is proposed for bearing fault diagnosis, in which the LSTM model based on instance-TL generates some auxiliary datasets, JDA adapts an auxiliary dataset and Dtar, and the GWO algorithm is introduced to adaptively learn the JDA key parameters. This part is composed of the following sections: Section 3.1 constructs the LSTM model based on instance-TL; Section 3.2 presents the design of the JDA model based on feature-TL; adaptive learning of the JDA key parameters is shown in Section 3.3; Section 3.4 describes the general procedures and relevant processing effort of the proposed method.

3.1. Constructing the LSTM model based on instance-TL

When a mechanical system operates under complex conditions, factors such as variable speed and variable load cause differences in probability distribution between Dsrc and Dtar. In this situation, if a model is trained with Dsrc and Dtar is then directly classified with it, the fault identification accuracy is difficult to guarantee. In practical engineering, when new machinery must be diagnosed, its fault data are directly classified with a model built from related machinery's fault data because labeled fault data of the new machinery are scarce; thus the classification accuracy is low and the credibility is difficult to guarantee. On the one hand, compared with CNN, DBN and SAE, LSTM, which has memory ability owing to the memory cell and three gates in its network structure, is capable of learning the long-term dependencies hidden in sequential data; LSTM can take full advantage of both spatial and temporal dependencies to capture data features for fault diagnosis [39]. On the other hand, time dependence of bearing fault causes is very common in mechanical systems because bearing failure has time-degradation characteristics [40]. Taking these two reasons into consideration, LSTM is selected as the model to learn the mapping relationship between Dsrc and Dtar. The constructed model is shown in Fig. 3. So that the model can learn clearer bearing fault information, it is fed the frequency spectra dataset, which is transformed from the raw vibration dataset by the Fast Fourier Transformation (FFT).


Fig. 3. The structure of the LSTM model based on instance-TL.

Two kinds of frequency spectra with different probability distributions are input into the model, namely Dsrc-I and Dtar-I. Both Dsrc-I and Dtar-I contain five types of bearing health status, and the number of samples for each type is i, so each contains (5×i) samples. Dsrc(1), Dsrc(2), …, Dsrc(5×i) represent the first, the second and the (5×i)th sample of Dsrc, respectively; Dtar(1), Dtar(2), …, Dtar(5×i) represent those of Dtar. By inputting Dsrc-I and Dtar-I, the LSTM model parameters are trained, and finally the LSTM model based on instance-TL is obtained. After the model is obtained, Dsrc-II, which has the same probability distributions as Dsrc-I, is used as its input. Dsrc-II and Dsrc-I contain the same five types of bearing health states, with j samples per type, so Dsrc-II has (5×j) samples in total. After inputting Dsrc-II into the model, its output is the auxiliary dataset, which can also be considered to contain five types of bearing health states with j samples per type, (5×j) samples in total. The differences between the auxiliary dataset and Dtar-I in probability distributions are small. The auxiliary dataset alleviates the scarcity of labeled data in Dtar and provides a feasible way to classify with a small amount of labeled data.

3.2. Designing the JDA model based on feature-TL

As a feature-TL method, JDA can complete new feature representations of two datasets in the process of reducing their dimensions. JDA essentially looks for similarities between two datasets at the feature level, and the available data surface is greatly expanded by transferring at the feature level. After the auxiliary dataset is generated, JDA plays the role of the classifier: the auxiliary dataset and Dtar-II are used as the training dataset and the testing dataset of JDA, respectively. The JDA classification process is shown in Fig. 4. At the beginning, there are many differences between the auxiliary dataset and Dtar-II in probability distributions; the similarity is small and little information can be used. If a similar surface of this size were used to train a classification model for bearing fault diagnosis, the classification accuracy would be low. To expand the amount of available information, JDA finds a matrix H to narrow the relationship between the two datasets, and the similar surface is obviously enlarged. Using the enlarged similarity surface to train a classification model, the classification accuracy is significantly improved.
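The classification loop just described can be sketched as follows, reusing the jda_projection function from Section 2.2. The 1-NN base classifier is an assumption for illustration; the paper does not name the classifier trained on the adapted features.

```python
from sklearn.neighbors import KNeighborsClassifier

def jda_classify(Xs, ys, Xt, k, lam, T):
    """Iterative JDA classification (a sketch of the Section 3.2 process).

    Repeatedly project both domains, fit a classifier on the projected
    auxiliary (source) data, and refresh the target pseudo-labels for
    T iterations."""
    yt_pseudo = None
    for _ in range(T):
        A = jda_projection(Xs, Xt, ys, yt_pseudo, k, lam)  # from Section 2.2
        Zs, Zt = A.T @ Xs, A.T @ Xt                        # project both domains
        clf = KNeighborsClassifier(n_neighbors=1).fit(Zs.T, ys)
        yt_pseudo = clf.predict(Zt.T)                      # refreshed pseudo-labels
    return yt_pseudo
```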


Fig. 4. JDA classification process.

The auxiliary dataset is generated via the LSTM model based on instance-TL, which learns the connection between Dsrc and Dtar; therefore, the difference in probability distributions between the auxiliary dataset and Dtar-II is smaller than that between Dsrc and Dtar-II. When labeled data are scarce, JDA can be used as a classifier on top of the LSTM model based on instance-TL, which yields better recognition results.

3.3. Adaptive learning of JDA key parameters

When JDA is used as a classifier, the values of the parameters α, β and T directly affect the JDA classification accuracy, where α is the final dimension to which the two datasets are reduced, β is the width of the selected kernel, and T is the number of iterations. It would be a time-consuming and laborious process to maximize the classification accuracy by continuously trying to find the optimal values of α, β and T. The GWO algorithm is a popular optimization method that simulates the predation of grey wolves; it can find the global maximum with stronger robustness and generalization ability than other optimization methods [41, 42]. Thus, the GWO algorithm is used to optimize the JDA key parameters, which helps to improve the classification accuracy and the stability of the classification results, and saves the time spent finding the optimal parameters [43]. The process of the GWO algorithm is summarized in Table 1.

Table 1 The process of GWO algorithm.

Step 1: Set the parameters of the GWO algorithm: the number of grey wolves N; the initial positions of the grey wolves Init-S; the number of attempts try-N; the maximum number of iterations MAX-T; the optimized parameter set ai = [αi, βi, Ti] (i = 1, 2, …, N).
Step 2: Design the fitness function: to judge the quality of the parameters found by the GWO algorithm intuitively, the classification accuracy is designed as the fitness function, and the quality of each parameter set is judged by the output accuracy.
Step 3: Select the range of the optimized parameters: within the corresponding range, each parameter randomly takes a value to form ai.
Step 4: Calculate the value of the fitness function corresponding to ai, and save the maximum fitness value in each iteration.
Step 5: Determine the termination condition of the GWO algorithm: when MAX-T is reached, end the adaptive process to prevent the algorithm from searching indefinitely.
Step 6: Verify the reliability of the parameters: to verify the quality and stability of the found parameters, they are put into JDA without the GWO algorithm to run, and it is checked whether the classification results differ significantly. If there is no significant difference, ai can be determined; otherwise, the above steps are repeated until there is no significant difference.
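As one way to realize Table 1, the following is a minimal GWO sketch. The population size, bounds and the fitness wrapper (JDA accuracy) are illustrative assumptions, not values from the paper.

```python
import numpy as np

def gwo(fitness, lb, ub, n_wolves=10, max_iter=20, seed=0):
    """Minimal grey wolf optimizer sketch for Table 1.

    Each wolf encodes a candidate [alpha, beta, T]; `fitness` is assumed
    to run JDA with those parameters and return the classification
    accuracy, which is maximised."""
    rng = np.random.default_rng(seed)
    lb, ub = np.asarray(lb, float), np.asarray(ub, float)
    dim = lb.size
    pos = rng.uniform(lb, ub, size=(n_wolves, dim))
    scores = np.array([fitness(p) for p in pos])
    for t in range(max_iter):
        leaders = pos[np.argsort(scores)[::-1][:3]]   # alpha, beta, delta wolves
        a = 2.0 * (1.0 - t / max_iter)                # coefficient decreases linearly
        for i in range(n_wolves):
            step = np.zeros(dim)
            for leader in leaders:                    # pull toward the three leaders
                r1, r2 = rng.random(dim), rng.random(dim)
                A, C = 2.0 * a * r1 - a, 2.0 * r2
                step += leader - A * np.abs(C * leader - pos[i])
            pos[i] = np.clip(step / 3.0, lb, ub)
            scores[i] = fitness(pos[i])               # Step 4: track fitness values
    best = int(np.argmax(scores))
    return pos[best], scores[best]
```

For example, one might call gwo(fit, lb=[50, 1, 1], ub=[200, 20, 15]) with fit(p) running JDA at α = int(p[0]), β = p[1], T = int(p[2]); these ranges are hypothetical.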

3.4. General procedures and relevant processing effort of the proposed method

In this paper, an adaptive deep transfer learning method for bearing fault diagnosis is proposed. The framework of the proposed method is shown in Fig. 5; the general procedures and relevant processing effort are summarized as follows:

Step 1: Dtar acquisition: acquire the raw vibration signals of the locomotive bearing with a vibration sensor. Dsrc can be downloaded directly from the Internet.
Step 2: Obtain the frequency spectra of Dsrc and Dtar by FFT (see the sketch after this list). To confirm the effectiveness of the proposed method without considering noise, the frequency spectra of Dsrc and Dtar are obtained directly from the corresponding raw vibration signals by FFT and employed in comparative experiments. When an experiment is conducted to demonstrate the robustness of the proposed method to noise, the raw vibration signals of Dsrc and Dtar are first deteriorated with five different levels of White Gaussian Noise (at 10, 20, 30, 40 and 50 dB SNR); then FFT is used to acquire the corresponding frequency spectra from these noisy data; finally, these noisy data are used to obtain classification accuracies. The noise experiment is introduced in subsection 4.6.
Step 3: Construct the LSTM model based on instance-TL. In experiments without noise, partial noise-free frequency spectra of Dsrc and Dtar are used to train the LSTM model to generate some auxiliary datasets; when analyzing noise, the corresponding noisy data are used to train the LSTM model.
Step 4: Design the JDA model based on feature-TL. JDA, as the classifier of the proposed method, is compared with CNN, DBN, LSTM and SAE; the inputs of these methods are obtained by Steps 1 and 2.
Step 5: Adaptively learn the JDA model parameters and output the final diagnosis result.
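Step 2 can be sketched as follows. The 400-point segment length is an inference from the dataset description in Section 4 (122,000 points per category over 305 samples), not an explicitly stated parameter.

```python
import numpy as np

def add_awgn(signal, snr_db, rng=None):
    """Deteriorate a raw vibration signal with White Gaussian Noise
    at a given SNR in dB (Step 2)."""
    rng = rng or np.random.default_rng()
    p_signal = np.mean(signal ** 2)
    p_noise = p_signal / (10.0 ** (snr_db / 10.0))
    return signal + rng.normal(0.0, np.sqrt(p_noise), size=signal.shape)

def to_spectra(segments):
    """Convert raw segments (n_samples x 400 points, an assumption) into
    200-point one-sided FFT magnitude spectra, as used throughout Section 4."""
    return np.abs(np.fft.rfft(segments, axis=1))[:, :segments.shape[1] // 2]
```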


Fig. 5. The framework of the proposed method.


4. Experimental verification

4.1. Datasets description

Two categories of datasets are used to verify the effectiveness of the proposed method for bearing fault diagnosis. One is the experimental bearing dataset provided by Case Western Reserve University, and the other is a locomotive bearing dataset collected in practical engineering. To input clearer bearing fault information to the model, the raw vibration signals of the two datasets are transformed into frequency spectra [33]. In the proposed method, the experimental bearing dataset is Dsrc, and the locomotive bearing dataset is Dtar. This arrangement not only verifies the effectiveness of the proposed method, but also provides an effective solution for the scarcity of labeled data in practical engineering.

4.1.1. Dsrc description

The Dsrc acquisition equipment is shown in Fig. 6(a), and the schematic diagram is shown in Fig. 6(b). The acquisition equipment is mainly composed of a motor, a torque sensor, a power meter and electronic control. Single point faults were arranged on the bearing (SKF6205) by electric discharge machining, and raw vibration signals were collected by an accelerometer arranged on the bearing [13]. Dsrc has five types of bearing health states, including no fault (N), slight outer race defect fault (Slight ORD), serious outer race defect fault (Serious ORD), roller defect fault (RD) and inner race defect fault (IRD). More details are shown in Table 2.

Table 2 Introduction of Dsrc (source: Case Western Reserve University).

Fault category | Motor speed (rpm) | Defect diameter (inches) | Number of samples | Points per sample | Label
N           | 1797 | 0     | 305 | 200 | 1
Slight ORD  | 1797 | 0.007 | 305 | 200 | 2
Serious ORD | 1797 | 0.021 | 305 | 200 | 3
RD          | 1797 | 0.007 | 305 | 200 | 4
IRD         | 1797 | 0.007 | 305 | 200 | 5

Fig. 6. (a) Dsrc acquisition equipment; (b) schematic diagram.

4.1.2. Dtar description

The Dtar acquisition equipment is shown in Fig. 7(a); raw vibration signals were acquired by an accelerometer mounted on the bearing (552732QT) [32]. Dtar contains five health states, as shown in Table 3. The sampling frequency is 12.8 kHz and the rotating speed is about 500 rpm. The Slight ORD, Serious ORD, RD and IRD damage are shown in Fig. 7(b). 122,000 data points are taken for each fault category and the raw vibration signals are converted into frequency spectra; therefore, there are 305 samples for each fault category and 200 sampling points for each sample.

Table 3 Introduction of Dtar (source: locomotive bearing dataset).

Fault category | Motor speed (rpm) | Number of samples | Points per sample | Label
N           | 490 | 305 | 200 | 1
Slight ORD  | 490 | 305 | 200 | 2
Serious ORD | 481 | 305 | 200 | 3
RD          | 531 | 305 | 200 | 4
IRD         | 498 | 305 | 200 | 5

Fig. 7. (a) Dtar acquisition equipment; (b) faults in the locomotive roller bearings.

4.1.3. Dsrc-I, Dsrc-II, Dtar-I and Dtar-II description

In Dsrc, each type of fault takes 20 samples in turn to form (5×20) samples, and this dataset is named Dsrc-20. According to this rule, Dsrc-5, Dsrc-10, Dsrc-15, Dsrc-25, Dsrc-40, Dsrc-60, Dsrc-80, Dsrc-100, Dtar-5, Dtar-10, Dtar-15, Dtar-20, Dtar-25 and Dtar-280 can be obtained in turn. Dsrc-I consists of Dsrc-5, Dsrc-10, Dsrc-15, Dsrc-20 and Dsrc-25. Dsrc-II consists of Dsrc-40, Dsrc-60, Dsrc-80 and Dsrc-100. Dtar-I consists of Dtar-5, Dtar-10, Dtar-15, Dtar-20 and Dtar-25. Dtar-II consists of Dtar-280. The relationships between these datasets are shown in Fig. 8, and detailed information on some of the datasets is shown in Table 4; the others are similar.

Fig. 8. The relationships of these datasets.

4.2. The sufficiency of data for training the LSTM model based on instance-TL

To make the best of the knowledge from the insufficient labeled Dtar, the LSTM model based on instance-TL is designed to learn the relationship between Dsrc and Dtar. To train the model, the same numbers of labeled Dsrc-I and labeled Dtar-I samples are input into the model. To demonstrate the sufficiency of data when different numbers of labeled Dtar samples are used to train the model, different numbers of labeled Dtar samples are selected in this experiment: Dtar-5, Dtar-10, Dtar-15, Dtar-20 and Dtar-25, respectively. Dtar-5 denotes a total of 25 samples, covering 5 fault categories with 5 samples each; the others are similar. The LSTM model based on instance-TL trained with Dsrc-I and labeled Dtar-I can generate auxiliary datasets with different numbers of samples, and these auxiliary datasets have the same labels as Dsrc-II. An auxiliary dataset and Dtar-280 are used as the training dataset and the testing dataset of JDA, respectively. The particular steps for classifying with LSTM+JDA (the proposed method) are listed as follows (a model sketch follows this list):

Step 1: Select (Dsrc-5 and Dtar-5), (Dsrc-10 and Dtar-10), (Dsrc-15 and Dtar-15), (Dsrc-20 and Dtar-20) and (Dsrc-25 and Dtar-25) in turn as inputs of the LSTM model based on instance-TL to fully train the model. The model is shown in Fig. 3.
Step 2: Dsrc-40, Dsrc-60, Dsrc-80 and Dsrc-100 are input into the pre-trained model in turn, which generates (5×40), (5×60), (5×80) and (5×100) auxiliary samples in sequence. According to the rule proposed above, they are named auxiliary dataset-40, auxiliary dataset-60, auxiliary dataset-80 and auxiliary dataset-100 in order. Details of these auxiliary datasets are shown in Table 5.
Step 3: An auxiliary dataset is input into JDA as the training dataset, and Dtar-280 is used as the testing dataset. The GWO algorithm is then introduced to adaptively learn the JDA key parameters, and finally the classification accuracy is output. For each number of labeled Dtar samples used to train the LSTM model based on instance-TL, and to avoid the contingency of results, each type of experiment was run for ten trials in the same environment. The testing classification accuracies are summarized in Table 6.
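For illustration, a minimal Keras sketch of such a mapping network follows. The layer sizes, optimizer and the variable names (D_src_20_spectra, etc.) are assumptions; the paper does not give the architecture details.

```python
from tensorflow import keras

def build_mapping_model(n_points=200):
    """Sketch of the instance-TL mapping model of Fig. 3: an LSTM network
    trained to map a Dsrc spectrum to the paired Dtar spectrum."""
    model = keras.Sequential([
        keras.layers.Input(shape=(n_points, 1)),          # spectrum as a sequence
        keras.layers.LSTM(64, return_sequences=True),
        keras.layers.LSTM(64, return_sequences=True),
        keras.layers.TimeDistributed(keras.layers.Dense(1)),
    ])
    model.compile(optimizer="adam", loss="mse")           # regression to Dtar spectra
    return model

# Step 1: model.fit(D_src_20_spectra, D_tar_20_spectra, ...) with spectra
#         reshaped to (n_samples, 200, 1).
# Step 2: model.predict(D_src_II_spectra) yields an auxiliary dataset.
```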

Table 4 Detailed information of Dsrc-I, Dsrc-II, Dtar-I and Dtar-II.

Source of datasets | Dataset   | Fault categories (labels)                                | Number of samples
Dsrc-I             | Dsrc-20   | N (1), Slight ORD (2), Serious ORD (3), RD (4), IRD (5) | 5×20
Dsrc-II            | Dsrc-40   | N (1), Slight ORD (2), Serious ORD (3), RD (4), IRD (5) | 5×40
Dsrc-II            | Dsrc-60   | N (1), Slight ORD (2), Serious ORD (3), RD (4), IRD (5) | 5×60
Dsrc-II            | Dsrc-80   | N (1), Slight ORD (2), Serious ORD (3), RD (4), IRD (5) | 5×80
Dsrc-II            | Dsrc-100  | N (1), Slight ORD (2), Serious ORD (3), RD (4), IRD (5) | 5×100
Dtar-I             | Dtar-20   | N (1), Slight ORD (2), Serious ORD (3), RD (4), IRD (5) | 5×20
Dtar-II            | Dtar-280  | N (1), Slight ORD (2), Serious ORD (3), RD (4), IRD (5) | 5×280

Table 5 Detailed information of these auxiliary datasets.

Dataset used to generate | Generated auxiliary dataset | Fault categories (labels)                                | Number of samples
Dsrc-40                  | Auxiliary dataset-40        | N (1), Slight ORD (2), Serious ORD (3), RD (4), IRD (5) | 5×40
Dsrc-60                  | Auxiliary dataset-60        | N (1), Slight ORD (2), Serious ORD (3), RD (4), IRD (5) | 5×60
Dsrc-80                  | Auxiliary dataset-80        | N (1), Slight ORD (2), Serious ORD (3), RD (4), IRD (5) | 5×80
Dsrc-100                 | Auxiliary dataset-100       | N (1), Slight ORD (2), Serious ORD (3), RD (4), IRD (5) | 5×100

Table 6 Classification accuracies of the proposed method for different datasets selected to train the LSTM model based on instance-TL.

Datasets used to train the LSTM model | (Auxiliary dataset-40)-(Dtar-280) | (Auxiliary dataset-60)-(Dtar-280) | (Auxiliary dataset-80)-(Dtar-280) | (Auxiliary dataset-100)-(Dtar-280)
Dsrc-5 and Dtar-5   | 92.86% | 94.07% | 96.57% | 99.07%
Dsrc-10 and Dtar-10 | 94.14% | 96.50% | 97.36% | 99.21%
Dsrc-15 and Dtar-15 | 94.93% | 96.93% | 98.07% | 99.29%
Dsrc-20 and Dtar-20 | 96.56% | 97.83% | 98.87% | 99.35%
Dsrc-25 and Dtar-25 | 99.07% | 100%   | 100%   | 100%

Remarks: (Auxiliary dataset-40)-(Dtar-280) represents that auxiliary dataset-40 and Dtar-280 are selected as the training dataset and the testing dataset of the proposed method, respectively; the others are similar.

As shown in Table 6, for each choice of datasets used to train the LSTM model, the accuracies increase with the increasing number of labeled samples from Dtar-I. When Dsrc-25 and Dtar-25 are used, the testing result of auxiliary dataset-40 is 99.07% and the rest are all 100%. When Dsrc-20 and Dtar-20 are selected, its auxiliary dataset-40 result of 96.56% is higher than the other three results of 92.86%, 94.14% and 94.93%, and the other three results of (Dsrc-20 and Dtar-20) show the same phenomenon. Comparing the four accuracies of (Dsrc-20 and Dtar-20) with those of (Dsrc-25 and Dtar-25), the gaps are acceptable; (Dsrc-20 and Dtar-20) is therefore sufficient to train the LSTM model based on instance-TL. Hereafter, in the remaining experiments of this article, (Dsrc-20 and Dtar-20) is fixed to train the LSTM model based on instance-TL.


Fig. 9. Diagnosis results of different experiments for ten trials. Experiment 1 represents that auxiliary dataset-40 is selected as the training dataset; the others are similar.

The confusion matrix of Fig. 10 (rows: actual labels 1-5; columns: predicted labels 1-5) is:

1: 1.00 0.00 0.00 0.00 0.00
2: 0.00 1.00 0.00 0.00 0.00
3: 0.00 0.00 1.00 0.00 0.00
4: 0.00 0.00 0.00 1.00 0.00
5: 0.01 0.00 0.14 0.01 0.84

Fig. 10. The confusion matrix of the proposed method for experiment 1. Experiment 1 represents that auxiliary dataset-40 is selected as the training dataset.

When (Dsrc-20 and Dtar-20) is used, the concrete diagnosis results of the four experiments for each trial are shown in Fig. 9, and the confusion matrix of experiment 1 is shown in Fig. 10. It can be seen that the highest average accuracy is 99.35%, while the lowest is 96.56%, and the classification results were stable. The experimental results can be summarized as follows: (1) as the number of auxiliary samples increases, the average classification accuracy increases gradually; (2) the diagnosis results of the experiments are not much different, which confirms that the proposed method has good stability.

4.3. The effectiveness of the auxiliary dataset

This part mainly verifies the effectiveness of the auxiliary dataset generated by the LSTM model based on instance-TL, so it is compared with JDA and the deep learning methods CNN, SAE, DBN and LSTM; these five comparison methods are run without the auxiliary dataset. To ensure the fairness of the experiment and the credibility of the results, each method runs ten trials in its respective environment. The average classification accuracies and standard deviations of each method are shown in Table 7, and the diagnosis accuracies of each method for ten trials are summarized in Fig. 11.

Table 7 Average accuracies and standard deviations.

Method              | Training dataset      | Testing dataset | Average accuracy (%) ± standard deviation (%)
The proposed method | Auxiliary dataset-100 | Dtar-280        | 99.35±0.16
JDA                 | Dsrc-100              | Dtar-280        | 52.93±1.38
CNN                 | Dsrc-100              | Dtar-280        | 59.56±1.46
SAE                 | Dsrc-100              | Dtar-280        | 47.83±0.23
DBN                 | Dsrc-100              | Dtar-280        | 44.37±0.44
LSTM                | Dsrc-100              | Dtar-280        | 40.65±0.77


Fig. 11. Diagnosis results of different methods for ten trials.

As shown in Table 7, JDA, CNN, SAE, DBN and LSTM are selected as classifiers without the auxiliary dataset. The training dataset and the testing dataset of these five classifiers are Dsrc-100 and Dtar-280, respectively, and their classification accuracies are 52.93%, 59.56%, 47.83%, 44.37% and 40.65%, respectively. Auxiliary dataset-100 and Dtar-280 are the training dataset and the testing dataset of the proposed method, respectively; its classification accuracy is 99.35%, much higher than that of the other five methods. By comparing the standard deviations of the six methods, it can be found that the classification stability of the proposed method is better. The experimental results support the following conclusions: (1) the auxiliary dataset is beneficial for improving the classification accuracy and provides an effective solution for fault diagnosis with scarce labeled data; (2) the proposed method has a lower standard deviation and better fault recognition stability over ten trials.

4.4. The effectiveness of JDA as a classifier

No matter how strong the data generated by the LSTM model are, the final classification result may not be satisfactory without a powerful classifier. On the premise that some auxiliary datasets have been generated, JDA is compared with CNN, DBN, LSTM and SAE. To present the effectiveness of JDA as a classifier, Dsrc-20 and Dtar-20 are used to train the LSTM model to generate four different auxiliary datasets, which are selected as JDA's training dataset in turn. The testing datasets of the five methods are all Dtar-280. Under different training datasets, the average classification accuracies and standard deviations of the various methods are shown in Table 8, and the diagnosis accuracies of each method under different training datasets are summarized in Fig. 12.

Table 8 Average accuracies and standard deviations of different methods under different training datasets (testing dataset: Dtar-280).

Training dataset      | Proposed method (%) | LSTM+CNN (%) | LSTM+DBN (%) | LSTM+LSTM (%) | LSTM+SAE (%)
Auxiliary dataset-40  | 96.56±0.45 | 90.86±0.52 | 81.00±0.62 | 79.14±1.35 | 77.50±0.49
Auxiliary dataset-60  | 97.83±0.27 | 93.50±0.33 | 81.86±0.35 | 79.71±1.04 | 78.86±0.36
Auxiliary dataset-80  | 98.87±0.19 | 95.07±0.32 | 82.79±0.25 | 81.36±0.95 | 79.00±0.31
Auxiliary dataset-100 | 99.35±0.16 | 95.51±0.30 | 85.04±0.19 | 81.57±0.83 | 79.46±0.20

Table 8 presents a comparison of the classification results of five different methods under different training datasets. When auxiliary dataset-40 is used, the average accuracy of the proposed method is 96.56%, significantly higher than that of CNN, DBN, LSTM and SAE, whose classification accuracies are 90.86%, 81.00%, 79.14% and 77.50%, respectively. As the number of auxiliary samples increases, the performance of all methods gradually improves. In particular, the classification accuracy of the proposed method is 99.35%, much higher than that of the other four methods, when auxiliary dataset-100 is employed. As shown in Table 8 and Fig. 12, JDA not only has better classification accuracy but also a smaller standard deviation than the other four methods. The experimental results show that JDA classification is more effective on the premise that the auxiliary datasets have been generated.


Fig. 12. Diagnosis results of different methods under different training datasets.


Fig. 13. Diagnosis results of all methods for ten trials when auxiliary dataset-100 is used: (a) from the first trial to the fifth trial; (b) from the sixth trial to the tenth trial.

In addition, with auxiliary dataset-100 selected as the training dataset, Fig. 13 is drawn by combining Table 7, Fig. 11, Table 8 and Fig. 12. Observing Fig. 13, it can be concluded that the generated auxiliary datasets not only greatly improve classification accuracies but also provide strong generalization ability. With the auxiliary datasets generated, the classification results of JDA, CNN, SAE, DBN and LSTM are much higher than those of the same methods without the auxiliary datasets. Therefore, the proposed method has good development potential in the case of scarce labels.

4.5. The effectiveness of GWO algorithm

When classifying with JDA, the final data dimension α, the width of the selected kernel β and the number of algorithm iterations T together affect the classification accuracy. Only by coordinating the values of the three parameters can satisfactory accuracies be obtained. However, finding the parameters manually would be a time-consuming and laborious operation. To address this problem, the GWO algorithm is introduced to adaptively learn these parameters. To show the influence of these parameters on the performance of JDA and verify the effectiveness of the GWO algorithm, three sets of manually selected parameters are compared with the set of parameters given by GWO for four different training datasets in this experiment. The particular details are presented in Table 9.

As shown in Table 9, some points can be summarized. First, when α, β and T take different values with the same training dataset, there are significant differences in the average accuracies. When auxiliary dataset-40 is selected as the training dataset, the values of α-β-T obtained manually are 140-7-8, 95-4-9 and 106-6-9; the highest average accuracy reaches 96.56% while the lowest is only 80.36%, a difference of more than 15% in average accuracy. When auxiliary dataset-100 is used, the average classification accuracy of the parameters given by the GWO algorithm reaches 99.35%, whereas the highest accuracy obtained with manually adjusted parameters is 90.47% and the lowest is only 81.57%, still a big gap compared with 99.35%. A similar situation exists for the other two training datasets. In conclusion, the values of these three parameters greatly impact the classification performance of JDA.

Second, the classification results obtained by the GWO algorithm are better and more stable than the results obtained manually for the four different training datasets. If auxiliary dataset-60 is selected as the training dataset, the classification accuracy of 126-4-8 given by GWO is 97.83%, higher than the results acquired manually. These average accuracies demonstrate that the GWO algorithm is effective in obtaining higher and more stable fault recognition results.

Third, the values given by GWO change with different training datasets. When auxiliary dataset-40, auxiliary dataset-60, auxiliary dataset-80 and auxiliary dataset-100 are selected as the training dataset in turn, the average accuracies given by GWO remain at a high level and do not change much. The result that the highest accuracy is 99.35% with the lowest standard deviation confirms the feasibility of the proposed method in generating auxiliary datasets. To sum up, it is feasible to learn the JDA key parameters α, β and T by the GWO algorithm.

With auxiliary dataset-100 selected as the training dataset, the results of the four sets of parameters for ten trials are summarized in Fig. 14, and the variation of the classification accuracy with the number of iterations when α=120 and β=6 is shown in Fig. 15. It can be seen from Fig. 15 that the classification accuracy increases with the number of iterations; when the number of iterations reaches 7, the classification accuracy tends to be stable at 99.64%.

Table 9 Classification results under different parameter values for different training datasets (testing dataset: Dtar-280).

Training dataset      | Values of α-β-T        | Average accuracy ± standard deviation (%)
Auxiliary dataset-40  | 115-5-6 (given by GWO) | 96.56±0.45
                      | 140-7-8                | 90.36±0.57
                      | 95-4-9                 | 84.29±0.49
                      | 106-6-9                | 80.36±0.54
Auxiliary dataset-60  | 126-4-8 (given by GWO) | 97.83±0.27
                      | 96-5-8                 | 85.21±0.36
                      | 149-10-7               | 82.36±0.41
                      | 135-15-10              | 95.29±0.30
Auxiliary dataset-80  | 104-7-6 (given by GWO) | 98.87±0.19
                      | 138-19-4               | 80.36±0.54
                      | 165-11-8               | 94.07±0.29
                      | 84-16-7                | 85.14±0.48
Auxiliary dataset-100 | 120-6-9 (given by GWO) | 99.35±0.16
                      | 90-12-10               | 81.57±0.45
                      | 150-6-6                | 90.47±0.23
                      | 170-5-12               | 85.77±0.37


Fig. 14. Diagnosis results of four sets of parameters for ten trials when auxiliary dataset-100 is selected.


Fig. 15. Diagnosis results for nine iterations.

4.6. The robustness of the proposed method to noise

To demonstrate the robustness of the proposed method to noise, the proposed method is evaluated by deteriorating the datasets (Dsrc-20, Dsrc-40, Dsrc-60, Dsrc-80, Dsrc-100, Dtar-20 and Dtar-280) with White Gaussian Noise (at 10, 20, 30, 40 and 50 dB SNR) [44]. Taking the noise level of 10 dB SNR as an example, the two datasets Dsrc-20 and Dtar-20 with the same noise level are used to train the LSTM model based on instance-TL; then the 10 dB SNR versions of Dsrc-40, Dsrc-60, Dsrc-80 and Dsrc-100 are input into the pre-trained model to generate auxiliary datasets in turn. These auxiliary datasets and Dtar-280 with the noise level of 10 dB SNR serve as the training dataset and the testing dataset of JDA, respectively; the other noise levels are handled similarly. More detailed results are shown in Table 10.

Table 10 Classification results under different noise levels.

Noise level  | (Auxiliary dataset-40)-(Dtar-280) | (Auxiliary dataset-60)-(Dtar-280) | (Auxiliary dataset-80)-(Dtar-280) | (Auxiliary dataset-100)-(Dtar-280)
At 10 dB SNR | 95.79% | 97.07% | 98.36% | 99.07%
At 20 dB SNR | 96.00% | 97.21% | 98.43% | 99.07%
At 30 dB SNR | 96.29% | 97.43% | 98.71% | 99.21%
At 40 dB SNR | 96.36% | 97.57% | 98.79% | 99.29%
At 50 dB SNR | 96.50% | 97.79% | 98.86% | 99.35%
No noise     | 96.56% | 97.83% | 98.87% | 99.35%

Remarks: (Auxiliary dataset-40)-(Dtar-280) represents that auxiliary dataset-40 and Dtar-280 are selected as the training dataset and the testing dataset of the proposed method, respectively; the others are similar. No noise indicates that the experiment is performed without adding noise to the datasets.

As can be seen from Table 10, the proposed method performs well under different training datasets at different noise levels, with accuracies above 95% in all experiments, which indicates that the proposed method is highly robust to noise. Moreover, the accuracies of 95.79%, 97.07%, 98.36% and 99.07% at 10 dB SNR differ only slightly from those at the other four noise levels, and the differences between the accuracies obtained at 10 dB SNR and those obtained without noise are all less than 1%. At all noise levels, the testing results increase steadily as the number of auxiliary samples increases, which further confirms the effectiveness of the proposed method.

5. Conclusions

In this paper, an adaptive deep transfer learning method is proposed for bearing fault diagnosis. Firstly, the LSTM model based on instance-TL is constructed to establish the mapping relationship between Dsrc and Dtar and to generate some auxiliary datasets. Secondly, JDA is used to reduce the differences in probability distributions between an auxiliary dataset and Dtar. Finally, the GWO algorithm is introduced to optimize the JDA key parameters. The proposed method is confirmed on the experimental bearing dataset and the practical locomotive bearing dataset. With a small amount of labeled fault data, the proposed method achieves more effective and robust fault diagnosis performance than the other methods. Applying transfer learning to bearing fault diagnosis is a very rewarding task; the authors will further

investigate this topic in future studies.

Acknowledgements

This research is supported by the National Natural Science Foundation of China (No. 51875459), the major research plan of the National Natural Science Foundation of China (No. 91860124), the Aeronautical Science Foundation of China (No. 20170253003) and Research Funds for Interdisciplinary Subject, NWPU.

References

[1] H.K. Jiang, C.L. Li, H.X. Li, An improved EEMD with multiwavelet packet for rotating machinery multi-fault diagnosis, Mechanical Systems and Signal Processing 36 (2013) 225-239.
[2] H.K. Jiang, Y. Xia, X.D. Wang, Rolling bearing fault detection using an adaptive lifting multiwavelet packet with a 1½ dimension spectrum, Measurement Science and Technology 24 (2013) 125002.
[3] L. Zhang, N.Q. Hu, Fault diagnosis of sun gear based on continuous vibration separation and minimum entropy deconvolution, Measurement 141 (2019) 332-344.
[4] J.C. Yin, M.Q. Xu, H.L. Zheng, Fault diagnosis of bearing based on Symbolic Aggregate approximation and Lempel-Ziv, Measurement 138 (2019) 206-216.
[5] H.D. Shao, H.K. Jiang, H.W. Zhao, F.A. Wang, A novel deep autoencoder feature learning method for rotating machinery fault diagnosis, Mechanical Systems and Signal Processing 95 (2017) 187-204.
[6] Y.B. Li, M.Q. Xu, H.Y. Zhao, W.H. Huang, Hierarchical fuzzy entropy and improved support vector machine based binary tree approach for rolling bearing fault diagnosis, Mechanism and Machine Theory 98 (2016) 114-132.
[7] A.K.S. Jardine, D.M. Lin, D. Banjevic, A review on machinery diagnostics and prognostics implementing condition-based maintenance, Mechanical Systems and Signal Processing 20 (2006) 1483-1510.
[8] G.D. Hadden, P. Bergstrom, G. Vachtsevanos, B.H. Bennett, J. Vandyke, Shipboard machinery diagnostics and prognostics/condition based maintenance: a progress report, IEEE Aerospace Conference Proceedings (2000) 277-292.
[9] W. Jiang, J.Z. Zhou, H. Liu, Y.H. Shan, A multi-step progressive fault diagnosis method for rolling element bearing based on energy entropy theory and hybrid ensemble auto-encoder, ISA Transactions 87 (2019) 235-250.
[10] Y.T. Yang, H.L. Zheng, Y.B. Li, M.Q. Xu, Y.S. Chen, A fault diagnosis scheme for rotating machinery using hierarchical symbolic analysis and convolutional neural network, ISA Transactions (2019) 201-219.
[11] H.D. Shao, H.K. Jiang, F.A. Wang, Y.N. Wang, Rolling bearing fault diagnosis using adaptive deep belief network with dual-tree complex wavelet packet, ISA Transactions 69 (2017) 187-201.
[12] R. Yang, M.J. Huang, Q.D. Lu, M.Y. Zhong, Rotating machinery fault diagnosis using long-short-term memory recurrent neural network, IFAC-PapersOnLine 51 (2018) 228-232.
[13] W. Zhang, C.H. Li, G.L. Peng, Y.H. Chen, Z.J. Zhang, A deep convolutional neural network with new training methods for bearing fault diagnosis under noisy environment and different working load, Mechanical Systems and Signal Processing 100 (2018) 439-453.
[14] Z. Xiang, X.N. Zhang, W.W. Zhang, X.R.X., Fault diagnosis of rolling bearing under fluctuating speed and variable load based on TCO spectrum and stacking auto-encoder, Measurement 138 (2019) 163-174.
[15] S.J. Pan, Q. Yang, A survey on transfer learning, IEEE Transactions on Knowledge & Data Engineering 22 (2010) 1345-1359.
[16] V.M. Patel, R. Gopalan, R. Li, R. Chellappa, Visual domain adaptation, IEEE Signal Processing Magazine 32 (2015) 53-69.
[17] G.L. Sun, L.L. Liang, T. Tang, F. Xiao, F. Lang, Network traffic classification based on transfer learning, Computers & Electrical Engineering 69 (2018) 920-927.
[18] Y.Y. Wang, J. Zhai, Y. Li, K.J. Chen, H. Xue, Transfer learning with partial related "instance-feature" knowledge, Neurocomputing 310 (2018) 115-124.
[19] S.L. Lu, Q.B. He, J.W. Zhao, Bearing fault diagnosis of a permanent magnet synchronous motor via a fast and online order analysis method in an embedded system, Mechanical Systems and Signal Processing 113 (2017) 36-49.
[20] B. Liu, Y.S. Xiao, Z.F. Hao, A selective multiple instance transfer learning method for text categorization problems, Knowledge-Based Systems 141 (2018) 178-187.
[21] S. Ozawa, S. Yoshida, J. Kitazono, S. Takahiro, H. Tatsuya, A sentiment polarity prediction model using transfer learning and its application to SNS flaming event detection, IEEE Symposium Series on Computational Intelligence (2017) 1-7.
[22] M. Claudia, B. Jose, T. Maria, A. Enrique, Transfer learning for classification of cardiovascular tissues in histological images, Computer Methods and Programs in Biomedicine 165 (2018) 69-76.
[23] M. Ribeiro, K. Grolinger, H.F. Elyamany, W.A. Higashino, M.A.M. Capretz, Transfer learning with seasonal and trend adjustment for cross-building energy forecasting, Energy and Buildings 165 (2018) 352-363.
[24] P. Wu, T.G. Dietterich, Improving SVM accuracy by training on auxiliary data, Proceedings of the 21st International Conference on Machine Learning (2004).
[25] W.N. Lu, B. Liang, Y. Cheng, D.S. Meng, J. Yang, T. Zhang, Deep model based domain adaptation for fault diagnosis, IEEE Transactions on Industrial Electronics 64 (2017) 2296-2305.
[26] L. Wen, L. Gao, X.Y. Li, A new deep transfer learning based on sparse auto-encoder for fault diagnosis, IEEE Transactions on Systems, Man, and Cybernetics: Systems 49 (2019) 136-144.
[27] S.S. Zhong, S. Fu, L. Lin, A novel gas turbine fault diagnosis method based on transfer learning with CNN, Measurement 137 (2019) 435-453.
[28] C. Chen, Z.H. Li, J. Yang, B. Liang, A cross domain feature extraction method based on transfer component analysis for rolling bearing fault diagnosis, 2017 29th Chinese Control and Decision Conference (2017) 5622-5626.
[29] M.S. Long, J.M. Wang, G.G. Ding, J.G. Sun, P.S. Yu, Transfer feature learning with joint distribution adaptation, 2013 IEEE International Conference on Computer Vision (2013) 2200-2207.
[30] W.W. Qian, S.M. Li, P.X. Yi, K.C. Zhang, A novel transfer learning method for robust fault diagnosis of rotating machines under variable working conditions, Measurement 138 (2019) 514-525.
[31] M.S. Long, J.M. Wang, M.I. Jordan, Unsupervised domain adaptation with residual transfer networks, (2016).
[32] B. Yang, Y.G. Lei, F. Jia, S.B. Xing, An intelligent fault diagnosis approach based on transfer learning from laboratory bearings to locomotive bearings, Mechanical Systems and Signal Processing 122 (2019) 692-706.
[33] F. Jia, Y.G. Lei, J. Lin, X. Zhou, N. Lu, Deep neural networks: A promising tool for fault characteristic mining and intelligent diagnosis of rotating machinery with massive data, Mechanical Systems and Signal Processing 72-73 (2016) 303-315.
[34] A.B. Patil, J.A. Gaikwad, J.V. Kulkarni, Bearing fault diagnosis using discrete wavelet transform and artificial neural network, 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (2016) 399-405.
[35] H. Yan, B.P. Tang, L. Deng, Multi-level wavelet packet fusion in dynamic ensemble convolutional neural network for fault diagnosis, Measurement 127 (2018) 246-255.
[36] H. Liu, J.Z. Zhou, Y. Zheng, W. Jiang, Y.C. Zhang, Fault diagnosis of rolling bearings with recurrent neural network-based auto-encoders, ISA Transactions 77 (2018) 167-178.
[37] J.J. Zhang, P. Wang, R.Q. Yan, R.X. Gao, Long short-term memory for machine remaining life prediction, Journal of Manufacturing Systems 48 (2018) 78-86.
[38] L. Guo, N.P. Li, F. Jia, Y.G. Lei, J. Lin, A recurrent neural network based health indicator for remaining useful life prediction of bearings, Neurocomputing 240 (2017) 98-109.
[39] J.H. Lei, C. Liu, D.X. Jiang, Fault diagnosis of wind turbine based on long short-term memory networks, Renewable Energy 133 (2019) 422-432.
[40] R. Yang, M.J. Huang, Q.D. Lu, M.Y. Zhong, Rotating machinery fault diagnosis using long-short-term memory recurrent neural network, IFAC-PapersOnLine 51-24 (2018) 228-232.
[41] S. Mirjalili, S.M. Mirjalili, A. Lewis, Grey wolf optimizer, Advances in Engineering Software 69 (2014) 46-61.
[42] X. Zhang, Z.W. Liu, Q. Miao, L. Wang, An optimized time varying filtering based empirical mode decomposition method with grey wolf optimizer for machinery fault diagnosis, Journal of Sound and Vibration 418 (2018) 55-78.
[43] X. Zhang, Q. Miao, Z.W. Liu, Z.J. He, An adaptive stochastic resonance method based on grey wolf optimizer algorithm and its application to machinery fault diagnosis, ISA Transactions 71 (2017) 206-214.
[44] P.D. Swami, R. Sharma, A. Jain, D.K. Swami, Speech enhancement by noise driven adaptation of perceptual scales and thresholds of continuous wavelet transform coefficients, ScienceDirect 70 (2015) 1-12.

Highlights:
- Constructing a LSTM model to generate some auxiliary datasets.
- Applying JDA to reduce the differences in probability distributions of two datasets.
- GWO algorithm is introduced to adaptively learn JDA key parameters.