Computers and Chemical Engineering (2020). https://doi.org/10.1016/j.compchemeng.2020.106755
A new unsupervised data mining method based on the stacked autoencoder for chemical process fault diagnosis

Shaodong Zheng a, Jinsong Zhao a,b,*

a State Key Laboratory of Chemical Engineering, Department of Chemical Engineering, Tsinghua University, Beijing, China
b Beijing Key Laboratory of Industrial Big Data System and Application, Tsinghua University, Beijing, China

Abstract

Process monitoring plays an important role in chemical process safety management, and fault diagnosis is a vital step of process monitoring. Among fault diagnosis studies, supervised ones are inappropriate for industrial applications because labeled historical data are rarely available in real situations. Unsupervised methods capable of dealing with unlabeled data should therefore be developed for fault diagnosis. In this work, a new unsupervised data mining method based on deep learning is proposed for isolating different conditions of a chemical process, including normal operations and faults, so that a labeled database can be created efficiently for constructing a fault diagnosis model. The proposed method consists of three main steps: feature extraction by a convolutional stacked autoencoder (SAE), feature visualization by the t-distributed stochastic neighbor embedding (t-SNE) algorithm, and clustering. The benchmark Tennessee Eastman process (TEP) and an industrial hydrocracking instance are used to illustrate the effectiveness of the proposed data mining method.

Keywords: data mining; fault diagnosis; unsupervised; SAE; clustering; TEP

1. Introduction

The rapid development of the chemical engineering industry brings benefits and convenience, but increasingly complex and large-scale processes also raise risks and accidents. Fortunately, progress has been made in process monitoring techniques, including fault prognosis, detection, diagnosis and root cause analysis. While fault detection aims at determining whether a fault has occurred, fault diagnosis focuses on judging which type of fault has happened, and can thus assist operators in taking appropriate actions to eliminate the fault. The accumulation of historical data and the growth of computing power enable data-driven fault diagnosis methods such as machine learning to outperform model-based methods. Fault diagnosis studies of chemical processes based on machine learning can be broadly partitioned into supervised and unsupervised ones. Supervised studies require data samples along with their corresponding labels.

Unsupervised studies, by contrast, do not involve labels and instead mine the inherent characteristics of the data samples. Although supervised studies have reached high diagnosis accuracy (Chiang et al., 2004; Ma and Wang, 2009; Raich and Γ‡inar, 1995; Zhang and Zhao, 2017), there exists a non-negligible gap between them and industrial application, since historical data from real plants commonly lack labels. It is therefore important to pursue unsupervised studies that can cope with unlabeled data samples. Notice that some researchers used unsupervised algorithms such as Principal Component Analysis (PCA), but the construction of fault diagnosis models based on them still demands prior knowledge of the data labels, which makes those studies supervised. For instance, separate PCA models based on data collected during each specific fault situation were developed to handle multiple faults (Raich and Γ‡inar, 1995).

As shown in Fig. 1, there are two major approaches to unsupervised fault diagnosis. The first is to train an unsupervised fault diagnosis model, which is commonly based on clustering algorithms. The model is trained on unlabeled historical data and then applied to diagnose real-time data samples by calculating their class metrics. A class metric can be a membership (Alaei et al., 2013; Bhushan and Romagnoli, 2008), a possibility (Bahrampour et al., 2011; Yu, 2012) or a visualized position (Zhong et al., 2016), and it indicates the class/fault that a sample belongs to. Since this approach is not the emphasis of our study, the details of these works are not described here. Our study follows the second approach, which uses unsupervised methods for historical data mining and knowledge discovery to create a labeled database for constructing a conventional supervised diagnosis model. As can be seen in Fig. 1(b), we refer to the diagnosis model as a pseudo-supervised model, since the labels used for training are not actual labels but pseudo labels obtained through unsupervised data mining.

This approach has been studied before. An unsupervised Bayesian automatic classification method was applied by Wang and McGreavy to the refinery fluid catalytic cracking (R-FCC) process, and 42 cases were divided into five clusters (Wang and McGreavy, 1998). Chen and Wang et al. proposed a framework integrating wavelet analysis with the Adaptive Resonance Theory net (ARTnet) (Chen et al., 1999; Wang et al., 1999); the framework was used on the R-FCC process for clustering 64 cases. Sebzalli and Wang developed a method using PCA and fuzzy c-means clustering to identify operational spaces of the R-FCC process and projected 303 cases onto four clusters or operational zones (Sebzalli and Wang, 2001). Singhal and Seborg modified the K-means clustering algorithm with two similarity factors based on PCA and the Mahalanobis distance to cluster multivariate time-series data from both batch and continuous chemical systems (Singhal and Seborg, 2005). These studies achieved promising results, but they handled at most 303 cases, which is insufficient for generating a high-quality pseudo-supervised diagnosis model. Escobar et al. put forward a combined generative topographic mapping and graph theory approach for clustering Tennessee Eastman process (TEP) data (Escobar et al., 2015, 2017). The method can clearly distinguish the normal data and one single type of fault data at a time, but

failed to output a meaningful result when dealing with datasets containing multiple fault types. Thomas et al. (2018) realized data mining of a TEP dataset consisting of normal data and multiple fault types by unsupervised feature extraction and a clustering algorithm, as did Zheng and Zhao (2018). However, neither they nor the other aforementioned studies explored whether and how the created labeled database can be used for constructing a pseudo-supervised diagnosis model. Wang and Li combined conceptual clustering and PCA for knowledge discovery from historical data and then generated a pseudo-supervised decision tree based on it (Wang and Li, 1999), but the scale of the decision tree grows explosively as the samples, measurements and conditions of the process increase. He et al. isolated normal and abnormal data clusters by the K-means algorithm before applying pairwise Fisher Discriminant Analysis (FDA) for diagnosis (He et al., 2005), but the K-means algorithm has the deficiency that it only discovers convex clusters.

Fig. 1. (a) The first approach of unsupervised fault diagnosis; (b) the second approach of unsupervised fault diagnosis.

With these contributions made to data mining for unsupervised chemical process fault diagnosis, there is still room for improvement. Firstly, deep learning algorithms, which have been proved to perform better than conventional statistical methods on data compression of images (Hinton and Salakhutdinov, 2006) and have been employed in many other tasks (Heo and Lee, 2018; Ji et al., 2013; Sarikaya et al., 2014), were rarely adopted as feature extraction techniques in these studies. Feature extraction is the core of an unsupervised data mining method because it determines whether a satisfying clustering result can be achieved at all, while the clustering algorithm merely determines how to achieve it. Secondly, most clustering algorithms run with parameters which have a significant impact on the clustering results and must be adjusted manually. This leads to time-consuming trials before an ideal clustering result is reached, especially when no prior knowledge of the process is available. Finally, only one of these studies mentioned how to fulfill cluster annotation of the data mining result, and it only discussed annotating the normal data cluster (Escobar et al., 2015). Since a perfect clustering result is difficult to obtain when coping with complex and high-dimensional process data, there may exist clusters composed of multiple classes of data. To minimize the number of mislabeled data samples and ensure the reliability of the pseudo-supervised diagnosis model, it is crucial to propose a knowledge-based strategy that annotates each cluster with the label representing the most data samples in it. There is a tradeoff between introducing minimal process knowledge and acquiring the expected annotation result.

The main contribution of this study is a new unsupervised data mining method combining feature extraction, data visualization and clustering techniques, which helps isolate chemical process data of different process conditions and creates a pseudo-labeled database for constructing the fault diagnosis model. The method has the following advantages compared to existing studies: 1) data features that are more efficient for data discrimination are extracted by deep learning algorithms; 2) a feature visualization step is designed to indicate the appropriate clustering algorithm as well as to narrow down the value ranges of the algorithm parameters; 3) the strategy of cluster annotation is discussed. Based on these advantages, we succeeded in isolating the normal operation and 11 fault types of the TEP, which, to the best of our knowledge, has not been reported by former unsupervised studies. In addition, the completeness of this study is ensured by constructing the pseudo-supervised fault diagnosis model based on the data mining result and evaluating its performance.

The rest of this paper is organized as follows: Section 2 introduces the basic theory of the unsupervised learning algorithms used in the proposed method. The details of the proposed method are described in Section 3. Application of the method on the benchmark TEP is illustrated in Section 4. An industrial hydrocracking instance is used to demonstrate the applicability of the method in real situations in Section 5. Finally, conclusions and outlooks are drawn in Section 6.

2. Unsupervised learning algorithms

The main target of unsupervised data mining is dividing data into different clusters, but clustering in high-dimensional spaces presents much difficulty (Berkhin, 2006). Feature extraction and visualization techniques are thus conducted beforehand to reduce the dimensionality of the data while preserving its effective information.

Notice that feature visualization is actually a specific case of feature extraction in which the data dimensionality is reduced to 2 or 3. Visualization is not strictly necessary, but it provides an intuitive display of the data characteristics and guidance for the clustering task.

2.1. Feature extraction and visualization techniques

2.1.1. Stacked autoencoder (SAE)

The autoencoder is a three-layer neural network consisting of an encoder and a decoder for learning compressed data representations (Rumelhart et al., 1986). The input data sample x ∈ R^m is mapped into a low-dimensional feature space R^n, with n < m, by the encoder through a nonlinear function f, as shown in Eq. (1):

z = f(x)    (1)

where z represents the n-dimensional feature. The decoder maps the feature back to the input space through another nonlinear function g to produce the reconstruction x̃, as shown in Eq. (2):

x̃ = g(z)    (2)

The two functions are learned by minimizing the reconstruction error in Eq. (3):

L = (1/2a) Ξ£_{i=1}^{a} ||x̃^(i) βˆ’ x^(i)||Β²    (3)
where a is the number of input samples. The two sets of function parameters are learned simultaneously during the minimization. Autoencoders and the conceptually similar autoassociative neural networks have been investigated in chemical process monitoring (Fan et al., 2017; Kramer, 1991), but their performance was limited by their shallow architectures. Since pre-training and fine-tuning techniques were introduced into the training of deep neural networks (Hinton and Salakhutdinov, 2006), deep learning has become feasible in practice and has shown stronger learning capability than shallow learning. The stacked autoencoder (SAE) is a deep neural network built by stacking autoencoders, and it is widely used for unsupervised compression coding and dimensionality reduction in many fields, including chemical process monitoring (Lv et al., 2017; Zhang et al., 2018). The performance of the SAE can be improved when the long short-term memory (LSTM) layer, which efficiently handles time-series data, is introduced into the network architecture (Park et al., 2019; Zhang et al., 2019). The convolutional layer, the basic element of convolutional neural networks (CNN), which have shown superiority in fault diagnosis (Lee et al., 2017; Wu and Zhao, 2018), has rarely been combined with the SAE for process monitoring. In this study, the SAE is incorporated with the LSTM layer and the convolutional layer for unsupervised feature extraction.

The LSTM layer of neural network

The LSTM network is a variant of the recurrent neural network (RNN). Different from other feedforward neural networks, an RNN can pass information across time steps and retain a state or memory that reflects an arbitrarily long context window (Lipton, 2015). The value of hidden unit h^(t) in an RNN depends on h^(tβˆ’1) at the previous moment, which can be denoted as h^(t) = f(i^(t), h^(tβˆ’1)), where t is the timestamp of the data and i represents the input. RNNs are powerful dynamic systems, but the backpropagation through time (BPTT) algorithm used for training has the drawback that backpropagated gradients either grow or shrink at each time step, which eventually causes gradient explosion or vanishing (Bengio et al., 1994). The LSTM is designed to overcome this error back-flow problem, and it can learn to bridge time intervals in excess of 1,000 steps without losing short time lag capabilities (Hochreiter and Schmidhuber, 1997). This is realized by using the unique computational cell shown in Fig. 2 to construct the hidden layer of the RNN.

Fig. 2. Schematic of the LSTM unit (Wu et al., 2018)

In Fig. 2, Οƒ represents the sigmoid activation function and tanh represents the tanh activation function. C is the cell state running horizontally through the top of the diagram; like a conveyor belt, it carries long-term information straight down the entire chain to each unit. Information can be removed from or added to the cell state by structures called the forget gate, input gate and output gate. The forget gate takes x^(t) and h^(tβˆ’1) as input and outputs a value between 0 and 1 which controls how much of C^(tβˆ’1) to throw away, as shown in Eq. (4):

O_f = Οƒ(W_f Β· [x^(t), h^(tβˆ’1)] + b_f)    (4)

where W_f and b_f represent the weights and bias of the forget gate, respectively, and "Β·" denotes matrix multiplication. The input gate then determines what new information is to be stored in C^(t). Its output is given by Eq. (5):

O_i = Οƒ(W_i1 Β· [x^(t), h^(tβˆ’1)] + b_i1) βŠ™ tanh(W_i2 Β· [x^(t), h^(tβˆ’1)] + b_i2)    (5)

where βŠ™ is the pointwise multiplication operation. C^(t) is updated with the outputs of the forget gate and input gate as in Eq. (6):

C^(t) = O_f βŠ™ C^(tβˆ’1) + O_i    (6)

Finally, the output of the LSTM unit h^(t) is decided by filtration through the output gate:

h^(t) = O_o βŠ™ tanh(C^(t)),  where O_o = Οƒ(W_o Β· [x^(t), h^(tβˆ’1)] + b_o)    (7)
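As a minimal illustration, the cell computations of Eqs. (4)-(7) can be sketched in NumPy; the weight matrices and their shapes are illustrative assumptions, and, following Eq. (5), the input-gate output already contains the tanh candidate term:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, p):
        """One LSTM step following Eqs. (4)-(7); p holds the weight
        matrices and biases of the gates (shapes illustrative)."""
        v = np.concatenate([x_t, h_prev])                 # [x(t), h(t-1)]
        o_f = sigmoid(p["Wf"] @ v + p["bf"])              # Eq. (4), forget gate
        o_i = sigmoid(p["Wi1"] @ v + p["bi1"]) * \
              np.tanh(p["Wi2"] @ v + p["bi2"])            # Eq. (5), input gate
        c_t = o_f * c_prev + o_i                          # Eq. (6), cell state update
        o_o = sigmoid(p["Wo"] @ v + p["bo"])              # output gate
        h_t = o_o * np.tanh(c_t)                          # Eq. (7), hidden output
        return h_t, c_t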

The convolutional layer of neural network

Since the convolutional neural network (CNN) was proposed in 1989 (LeCun et al., 1989), noteworthy progress has been made in processing images and videos, and advanced architectures such as GoogLeNet (Szegedy et al., 2015) and ResNet (He, 2016) have been developed. The details of these architectures are beyond the scope of this research; the focus here is the convolutional layer, which is at the core of the outstanding performance of the CNN. The convolutional layer embodies two key ideas of the CNN: local connections and shared weights. Units in the convolutional layer are organized in feature maps, and each unit is connected to local patches in the feature maps of the previous layer through a set of weights. All units in the same feature map share the same weights, also called filter banks, which helps detect patterns regardless of patch location while effectively reducing the number of network parameters (Lecun et al., 2015). Assuming there are M feature maps in the input layer and N filter banks, the output layer has N feature maps. The jth feature map x_j in the output layer is calculated by Eq. (8):

x_j = Οƒ(Ξ£_{i=1}^{M} x̃_i * k_ij + b_j)    (8)

where Οƒ is the activation function, x̃_i is the ith feature map of the input layer, k_ij is the kernel of the jth filter connected with the ith input feature map, and "*" represents the convolutional operation. The schematic of the convolutional layer is given in Fig. 3, and the convolutional operation is described in Fig. 4 with the stride set to 1.

Fig. 3. Schematic of the convolutional layer

Fig. 4. The convolutional operation
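To make Eq. (8) and the stride-1 operation of Fig. 4 concrete, a minimal NumPy sketch follows; the array shapes are illustrative assumptions:

    import numpy as np

    def conv_layer(x, kernels, biases, sigma=np.tanh):
        """Stride-1 convolution of Eq. (8).
        x: (M, H, W) input feature maps; kernels: (M, N, kh, kw); biases: (N,)."""
        M, H, W = x.shape
        _, N, kh, kw = kernels.shape
        out = np.empty((N, H - kh + 1, W - kw + 1))
        for j in range(N):                            # jth output feature map
            for r in range(H - kh + 1):
                for c in range(W - kw + 1):
                    patch = x[:, r:r + kh, c:c + kw]  # local patch of all M maps
                    out[j, r, c] = np.sum(patch * kernels[:, j]) + biases[j]
        return sigma(out)                             # activation sigma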

2.1.2. The t-distributed stochastic neighbor embedding (t-SNE) algorithm

Stochastic neighbor embedding (SNE) was proposed in 2002 as a manifold learning algorithm for dimensionality reduction; it embeds high-dimensional vectors into a lower-dimensional space in a way that preserves neighbor identities (Hinton and Roweis, 2003). In the algorithm, the asymmetric probability p_{j|i} that point i would pick point j as its neighbor is computed by Eq. (9):

p_{j|i} = exp(βˆ’||x_i βˆ’ x_j||Β² / 2Οƒ_iΒ²) / Ξ£_{kβ‰ i} exp(βˆ’||x_i βˆ’ x_k||Β² / 2Οƒ_iΒ²)    (9)

where Οƒ_i is the variance of the Gaussian distribution centered on datapoint x_i. A similar conditional probability q_{j|i} for the lower-dimensional counterparts y_i and y_j of x_i and x_j is computed by Eq. (10):

q_{j|i} = exp(βˆ’||y_i βˆ’ y_j||Β²) / Ξ£_{kβ‰ i} exp(βˆ’||y_i βˆ’ y_k||Β²)    (10)

where the Gaussian variance is set to 1/√2. The SNE algorithm aims at matching the two distributions as well as possible by minimizing the cost function C, which is a sum of KL divergences:

C = Ξ£_i KL(P_i || Q_i) = Ξ£_i Ξ£_j p_{j|i} log(p_{j|i} / q_{j|i})    (11)

The SNE algorithm can construct reasonably good visualizations; however, it is hampered by a cost function that is hard to optimize and by a crowding problem, meaning that low-dimensional points tend to gather together. The t-SNE algorithm was proposed in 2008; it employs a heavy-tailed Student-t distribution rather than a Gaussian in the low-dimensional space to alleviate both the crowding problem and the optimization problem of SNE (Van Der Maaten and Hinton, 2008). Based on the symmetrized version of the SNE cost function, a Student t-distribution with one degree of freedom is employed, and the joint probability q_ij is defined as Eq. (12) (Van Der Maaten and Hinton, 2008):

q_ij = (1 + ||y_i βˆ’ y_j||Β²)^(βˆ’1) / Ξ£_{kβ‰ l} (1 + ||y_k βˆ’ y_l||Β²)^(βˆ’1)    (12)

The Student-t distribution with a single degree of freedom brings a particularly nice numerator term which makes the map's representation of joint probabilities for points that are far apart almost invariant to changes in the scale of the map, thus alleviating the crowding problem (Van Der Maaten and Hinton, 2008). The t-SNE algorithm has recently been applied to chemical process data visualization (Tang and Yan, 2017; Zhu et al., 2019), but its performance becomes unsatisfactory as the number of fault types in the data increases. The SAE and the t-SNE algorithm can each be used alone for unsupervised feature extraction and visualization, but the former has difficulty dealing with very high-dimensional data, while the latter cannot generate information-rich features if it reduces the data dimensionality to 2 directly. In this study, the two algorithms are combined to compensate for each other's shortcomings: the SAE is applied first to reduce the data dimensionality to a value appropriate for the t-SNE.

2.2. Clustering algorithms

Clustering aims to excavate the internal structure of unlabeled data samples and allocate samples to different clusters based on their similarities. Each cluster consists of samples that are similar to each other while dissimilar to samples in other clusters. Clustering algorithms can be categorized as partitioning-based, hierarchical-based, density-based, grid-based and model-based methods (Fahad et al., 2014), and they have been broadly investigated in chemical process monitoring and data mining (Abonyi et al., 2005; JΓ€msΓ€-Jounela et al., 2003; Srinivasan et al., 2004; Thomas et al., 2018). No universal algorithm is suitable for all kinds of data, and the selection of an algorithm mainly depends on the data characteristics. In this study, the density-based spatial clustering of applications with noise (DBSCAN) and the K-means algorithm are applied.

The K-means algorithm was recognized as one of the top 10 data mining algorithms by the IEEE (Wu et al., 2008). The algorithm assigns data samples to clusters based on distance calculation, which brings simplicity and efficiency along with the limitation that it falters when the clusters are not reasonably spherical. Due to limited space, details of the algorithm are not described here; excellent literature on this topic is available (Hartigan and Wong, 2013; Lloyd, 1982; Wu et al., 2008).

The DBSCAN algorithm was proposed by Ester et al. as a density-based clustering algorithm whose key idea is that, for each point of a cluster, the density in its neighborhood of a given radius has to exceed some threshold (Ester et al., 1996). The algorithm is used in this study for its capability of recognizing clusters of arbitrary shape and its minimal requirements of process knowledge. The algorithm is based on the following definitions.

1. The Ξ΅-neighborhood: the Ξ΅-neighborhood N_Ξ΅(x_i) of a sample x_i ∈ D is defined by

N_Ξ΅(x_i) = {x_j ∈ D | ||x_i βˆ’ x_j|| ≀ Ξ΅}    (13)

where D represents the dataset and Ξ΅ is the distance threshold.

2. The core object: a core object is a sample x_i whose Ξ΅-neighborhood N_Ξ΅(x_i) contains more than MinPts samples.

3. Directly density-reachable and density-reachable: a sample x_j is directly density-reachable from x_i when x_i is a core object and x_j is in N_Ξ΅(x_i); two samples are density-reachable as long as they are connected by a sequence of samples that are successively directly density-reachable.

4. Density-connected: a sample x_i is density-connected to x_j if there exists an x_k from which both are density-reachable.

The definitions are illustrated in Fig. 5.

Fig. 5. Definitions in the DBSCAN algorithm (Zhou, 2016)

Based on these definitions, all samples that are density-reachable from a core object x_i constitute a cluster, and the clustering result is determined by the parameters Ξ΅ and MinPts. The procedure of the algorithm can be concisely described as follows:

Input: dataset, values of parameters Ξ΅ and MinPts.
Step 1: Find all core objects in the input dataset.
Step 2: Start from an arbitrary core object x_i and find all samples density-reachable from it. These samples make up a cluster.
Step 3: Repeat Step 2 until all core objects are reviewed.
Step 4: Annotate the samples that do not belong to any cluster as noise.
Output: the clustering result.

3. The proposed unsupervised data mining method

The proposed unsupervised data mining method is shown in Fig. 6.

Fig. 6. The diagram of the proposed unsupervised data mining method

The method requires no prior knowledge of the process at all, and it produces a fully unsupervised clustering result. Its procedure can be described as follows:

Step 1: Unlabeled historical data are obtained from the chemical process.
Step 2: The data are preprocessed by variable selection, data combination and data normalization.
Step 3: The structure and hyperparameters of the SAE are tuned by grid search.
Step 4: Features of the data are extracted by the SAE.
Step 5: The extracted features are transformed into 2-dimensional vectors by the t-SNE algorithm for visualization.
Step 6: The cluster count and the clustering algorithm are determined according to the visualization result.
Step 7: The chosen clustering algorithm is applied to the visualized sample features to obtain the data mining result.

Temporal information is introduced into the samples by the data combination task in Step 2. A chemical process is dynamic and time-varying, so data sampled at a single moment cannot reflect the process condition accurately. Therefore, data sampled during a certain time window are combined end to end chronologically to form new samples which better reveal the process condition. The general rule for determining the clustering algorithm in Step 6 can be summarized as follows:

If there exist nonconvex clusters:
    The DBSCAN algorithm is used.
Else:
    The K-means algorithm is used.
End

4. Application on the Tennessee Eastman process (TEP)

The TEP was proposed in 1993 as a realistic simulation model of a chemical process for developing, studying and evaluating process control technology (Downs and Vogel, 1993). The process produces two products G and H and a byproduct F from four reactants A, C, D and E, with an inert B in the system. This benchmark model is widely studied due to the lack of real industrial data, and in this research a revised version of the TEP Simulink model is used for acquiring datasets. The model is shown in Fig. 7, and the code can be downloaded from http://depts.washington.edu/control/LARRY/TE/download.html (Bathelt et al., 2015). The model contains 21 process conditions, including one normal operation and twenty different fault types. There are in total 22 process measurements, 12 manipulated variables and 19 component analysis variables.

Fig. 7. P&ID of the TEP(Bathelt et al., 2015)

4.1. The datasets for research

In industrial situations, component analysis is asynchronous with the sampling of process variables. Therefore, the 19 component analysis variables are not used in this study. Among the 34 remaining process measurements, three are constant: the compressor recycle valve, the stripper steam valve and the agitator speed. Besides, the stripper steam flow values obtained from the Simulink model deviate seriously from the base case value in the reference, and some of them are negative. These 4 variables are eliminated from the variable set, so 30 variables are selected to constitute the data sample at each moment. The time window for data combination is set to 1 hour with a sampling interval of 3 minutes, which means that 20 samples are combined end to end chronologically to form a 600-dimensional sample. In this way, the dynamic nature of the process is introduced into the samples, and the process condition at moment t is represented by the sample formed from the data of the previous hour.

Existing studies on supervised fault diagnosis of the TEP have shown that several fault types are difficult to diagnose, including Faults 3, 5, 9, 15 and 16 (Lee et al., 2006; Raich and Γ‡inar, 1995; Wu and Zhao, 2018; Yin et al., 2012). These fault types share analogous data characteristics with the normal operation, which impairs the performance of supervised fault diagnosis models. It can be seen from Fig. 8(a)-(h) that the variable curves of these faults are almost indistinguishable from the normal operation, while Fault 1 and Fault 2 show evident dissimilarity. Unsupervised learning is affected even more by this fact; therefore, not all 20 fault types are involved in this study. The mutual information (MI) of the fault types is calculated by the formula proposed by Verron et al. to measure the dissimilarity between each fault and the normal operation (Verron et al., 2008), and the result is shown in Fig. 8(i). Fault types with an MI greater than 3 are studied in this research, namely Faults 1, 2, 4, 6, 7, 11, 13, 14, 17, 19 and 20. The following research is thus conducted on twelve TEP conditions consisting of one normal operation and the abovementioned eleven fault types.
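The data combination step described above can be sketched as a sliding-window operation in NumPy; the array name is illustrative:

    import numpy as np

    def combine_window(data, window=20):
        """Combine `window` consecutive measurements end to end so that the
        sample at moment t carries the previous hour of process behaviour.
        data: (T, 30) array at 3-minute intervals -> (T - window + 1, 600)."""
        T = data.shape[0]
        return np.stack([data[t - window + 1:t + 1].ravel()
                         for t in range(window - 1, T)])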

Fig. 8 (a)-(h): Variable curves of the normal operation and different fault types. The x axis represents time and the y axis represents variable value; each curve represents a process variable. Fault 0 stands for the normal state. (i): Mutual information of the TEP fault types.

For each initial status of the TEP Simulink model, the simulator runs in the normal operation for 65 hours to generate 1,200 normal data samples. For each of the 11 fault types, the corresponding disturbance is introduced into the simulator after it has run for 10 hours in the normal operation, and the simulator then continues to run for 7 hours before 120 fault samples are collected. The simulator is run with 4 different initial statuses to obtain 4 batches of data. The details of each batch are shown in Table 1. The first batch is referred to as Dataset 1, which plays the role of the "unlabeled historical process data" in Fig. 6. The remaining three batches are bound together to compose Dataset 2, the "unlabeled real-time process data" in Fig. 13. The volumes of the datasets are shown in Table 2.

Table 1 Details of each batch of data.

Samples index   | The corresponding TEP condition
1st – 1200th    | Normal operation
1201st – 1320th | Fault 1: step error in A/C feed ratio
1321st – 1440th | Fault 2: step error in B composition
1441st – 1560th | Fault 4: step error in reactor cooling water inlet temperature
1561st – 1680th | Fault 6: step error in A feed loss
1681st – 1800th | Fault 7: step error in C header pressure loss - reduced availability
1801st – 1920th | Fault 11: random variation in reactor cooling water inlet temperature
1921st – 2040th | Fault 13: drift error in reaction kinetics
2041st – 2160th | Fault 14: sticking error in reactor cooling water valve
2161st – 2280th | Fault 17: unknown type
2281st – 2400th | Fault 19: unknown type
2401st – 2520th | Fault 20: unknown type

Table 2 Volumes of the datasets.

Dataset | Composition   | Normal samples count | Fault samples count | Total samples count
1       | Batch 1       | 1,200                | 120 Γ— 11            | 2,520
2       | Batch 2, 3, 4 | 3,600                | 360 Γ— 11            | 7,560

4.2. Feature extraction and visualization

The structure and hyperparameters of the SAE are tuned and determined through grid search after narrowing down the search space. The candidate structures are shown in Table 3, and the output dimensionality (OD) of the SAE ranges from 16 to 25.

Table 3 Candidate structures of the SAE.

Structure | Details
1 | input(600)-fc(100)-fc(OD)*-fc(100)-fc(600)
2 | input(600)-fc(200)-fc(50)-fc(OD)*-fc(50)-fc(200)-fc(600)
3 | input(600)-reshape-conv(32)-flatten(4800)-fc(OD)*-fc(4800)-reshape-convT(32)-convT(1)-reshape(600)
4 | input(600)-reshape-upsampling-conv(32)-conv(64)-flatten(4800)-fc(OD)*-fc(4800)-reshape-convT(64)-convT(32)-convT(1)-pool-reshape(600)
5 | input(600)-reshape-lstm(50)-fc(OD)*-fc(50)-repeat-lstm(30)-reshape(600)
6 | input(600)-reshape-lstm(100)-lstm(50)-fc(OD)*-fc(50)-repeat-lstm(100)-lstm(30)-reshape(600)
7 | input(600)-reshape-conv(32)-reshape-lstm(100)-fc(OD)*-fc(100)-repeat-lstm(480)-reshape-convT(1)-reshape(600)
8 | input(600)-reshape-conv(32)-reshape-lstm(50)-fc(OD)*-fc(50)-repeat-lstm(480)-reshape-convT(1)-reshape(600)

fc: fully connected layer. conv: convolutional layer. convT: the transpose of the convolutional layer. fc*: the output layer of the SAE, which outputs the extracted features. The numbers in parentheses represent filter counts for conv and convT layers and output dimensionality for other layers. The default kernel size in conv and convT layers is 3 Γ— 3.
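As an illustration of how such candidates can be realized, a minimal Keras sketch of Structure 1 follows; the activation functions and optimizer are assumptions, since Table 3 specifies only the layer topology:

    from tensorflow import keras
    from tensorflow.keras import layers

    def build_structure1(od):
        """Structure 1: input(600)-fc(100)-fc(OD)*-fc(100)-fc(600)."""
        inputs = keras.Input(shape=(600,))
        h = layers.Dense(100, activation="relu")(inputs)
        code = layers.Dense(od, name="features")(h)   # fc(OD)*: extracted features
        h = layers.Dense(100, activation="relu")(code)
        outputs = layers.Dense(600)(h)
        sae = keras.Model(inputs, outputs)
        sae.compile(optimizer="adam", loss="mse")     # reconstruction loss of Eq. (3)
        return sae

Training is unsupervised, with the input serving as its own target, e.g. sae.fit(X, X, epochs=50, batch_size=64) (epoch and batch-size values illustrative).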

Five runs are conducted for each grid point and the mean of the MSEs (MMSE) is calculated. According to the grid search result shown in Table 4, the SAE with Structure 4 and an OD of 20 leads to the minimal MMSE and is thus selected for feature extraction of Dataset 1.

Table 4 Grid search for the SAE structure and hyperparameters (MMSE).

OD \ Structure | 1      | 2      | 3      | 4      | 5      | 6      | 7      | 8
16             | 0.1647 | 0.1606 | 0.1567 | 0.1572 | 0.8822 | 0.9541 | 0.1711 | 0.1749
17             | 0.1673 | 0.1632 | 0.1558 | 0.1509 | 0.8841 | 0.9469 | 0.1692 | 0.1660
18             | 0.1662 | 0.1595 | 0.1557 | 0.1555 | 0.8781 | 1.0199 | 0.1670 | 0.1694
19             | 0.1607 | 0.1540 | 0.1578 | 0.1444 | 0.8796 | 1.0212 | 0.1670 | 0.1701
20             | 0.1609 | 0.1663 | 0.1491 | 0.1361 | 0.8701 | 1.0098 | 0.1712 | 0.1652
21             | 0.1589 | 0.1560 | 0.1636 | 0.1437 | 0.8849 | 0.9769 | 0.1682 | 0.1720
22             | 0.1582 | 0.1543 | 0.1492 | 0.1385 | 0.8990 | 0.9877 | 0.1718 | 0.1627
23             | 0.1669 | 0.1593 | 0.1536 | 0.1434 | 0.8877 | 0.9927 | 0.1668 | 0.1686
24             | 0.1567 | 0.1616 | 0.1507 | 0.1438 | 0.8720 | 0.9863 | 0.1684 | 0.1655
25             | 0.1593 | 0.1476 | 0.1514 | 0.1453 | 0.8778 | 0.9689 | 0.1715 | 0.1684
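The grid search itself amounts to a double loop with repeated runs; a sketch follows, where build_sae(structure, od) is a hypothetical constructor returning a compiled autoencoder of the given candidate structure:

    import numpy as np

    def grid_search_mmse(X, structures, od_values, runs=5):
        """Mean of MSEs (MMSE) over repeated runs for every grid point."""
        mmse = {}
        for s in structures:
            for od in od_values:
                losses = []
                for _ in range(runs):                 # five runs per grid point
                    sae = build_sae(s, od)            # hypothetical constructor
                    hist = sae.fit(X, X, epochs=50, batch_size=64, verbose=0)
                    losses.append(hist.history["loss"][-1])
                mmse[(s, od)] = float(np.mean(losses))
        best = min(mmse, key=mmse.get)                # (4, 20) per Table 4
        return best, mmse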

The extracted 20-dimensional features are visualized by the t-SNE algorithm with the dimensionality reduced to 2, and the visualization result is shown in Fig. 9(a), which presents clusters with clear separations. The result is compared with those obtained when other unsupervised data compression techniques are applied, including PCA, Independent Component Analysis (ICA), Kernel PCA (KPCA), Locally Linear Embedding (LLE), Spectral Embedding (SE), Multidimensional Scaling (MDS) and Isometric Feature Mapping (ISOMAP). The comparison, shown in Fig. 9, demonstrates that the proposed unsupervised feature visualization method combining the SAE and the t-SNE algorithm outperforms the other techniques.
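For reference, the bottleneck features can be read out of the trained autoencoder and embedded in two dimensions with scikit-learn's t-SNE; the sketch assumes the selected SAE exposes a bottleneck layer named "features", as in the earlier sketch, and the perplexity value is an assumed setting:

    from tensorflow import keras
    from sklearn.manifold import TSNE

    # read the 20-dimensional features out of the trained SAE via its
    # bottleneck layer, then embed them in 2-D for visualization
    encoder = keras.Model(sae.input, sae.get_layer("features").output)
    features_20d = encoder.predict(X)                     # shape (2520, 20)
    features_2d = TSNE(n_components=2, perplexity=30).fit_transform(features_20d)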

Fig. 9. Visualization results with different techniques.

4.3. Clustering

The 2-dimensional features are clustered, and the expected cluster count is 12, as can be seen in Fig. 9(a). The DBSCAN algorithm is adopted first for two main reasons: 1) there exist nonconvex clusters that the K-means algorithm cannot recognize; 2) the visualization provides a general value range of the algorithm parameter Ξ΅ and thus accelerates parameter tuning. The parameters Ξ΅ and MinPts of the DBSCAN algorithm require manual tuning, and the cluster counts of the clustering results with various parameter sets are shown in Table 5.

Table 5 The cluster counts of clustering results with various parameter sets.

MinPts \ Ξ΅ | 40  | 45  | 50  | 55  | 60  | 65  | 70  | 75  | 80  | 85  | 90
6          | 13* | 15* | 17* | 13* | 7*  | 3*  | 3*  | 1*  | 0*  | 0*  | 0*
7          | 11* | 12* | 12* | 11* | 8*  | 11* | 11* | 8*  | 5*  | 3*  | 2*
8          | 11  | 11  | 11* | 12* | 11* | 11* | 8*  | 8*  | 9*  | 11* | 9*
9          | 11  | 11  | 11  | 11  | 11  | 12* | 11* | 9*  | 8*  | 8*  | 7*
10         | 10  | 10  | 10  | 10  | 10  | 11  | 11  | 10* | 10* | 9*  | 8*
11         | 8   | 8   | 8   | 8   | 8   | 8   | 9   | 8*  | 8*  | 8*  | 8*
12         | 8   | 8   | 8   | 8   | 8   | 8   | 8   | 7*  | 7*  | 7*  | 8*

*: there are samples recognized as noise in the clustering result.
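A Table-5-style scan is straightforward with scikit-learn's DBSCAN implementation (an assumed implementation choice); label βˆ’1 marks noise samples:

    import numpy as np
    from sklearn.cluster import DBSCAN

    for eps in range(40, 95, 5):
        for min_pts in range(6, 13):
            labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(features_2d)
            n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
            flag = "*" if np.any(labels == -1) else ""   # * : noise present
            print(eps, min_pts, n_clusters, flag)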

The clustering results whose cluster count is consistent with the visualization all contain samples misidentified as noise; hence the result with a cluster count of 11 and no noise is selected and shown in Fig. 10(a). The two unseparated clusters in this result are then clustered by the K-means algorithm. With the feature coordinates known from the visualization, appropriate initial cluster centers can be given to the K-means algorithm instead of random ones, which leads to a better clustering result, as shown in Fig. 10(c)(d). The final integrated clustering result after employing the DBSCAN and K-means algorithms successively is shown in Fig. 10(e); it surpasses the results of other clustering algorithms, including Mean Shift, Birch, Agglomerative Clustering and Spectral Clustering, as shown in Fig. 10(f). The details of the final clustering result are shown in Table 6.

Fig. 10 (a): The selected DBSCAN clustering result. (b): Clustering results with various parameter sets of the DBSCAN algorithm. (c): The K-means clustering result of the two unseparated clusters in (a) with given initial cluster centers. (d): The K-means clustering result of the two unseparated clusters in (a) with random initial cluster centers. (e): The final integrated clustering result. (f): Clustering results with other clustering algorithms.

Table 6 Details of the clustering result.

Cluster | Corresponding samples index                    | Involved TEP condition
1       | 1st – 1200th, 1921st – 1970th, 2281st – 2285th | Normal operation, Fault 13, Fault 19
2       | 1201st – 1320th                                | Fault 1
3       | 1321st – 1440th                                | Fault 2
4       | 1801st – 1920th                                | Fault 11
5       | 1561st – 1680th                                | Fault 6
6       | 1681st – 1800th                                | Fault 7
7       | 1971st – 2040th                                | Fault 13
8       | 2041st – 2160th                                | Fault 14
9       | 2161st – 2280th                                | Fault 17
10      | 2286th – 2400th                                | Fault 19
11      | 2401st – 2520th                                | Fault 20
12      | 1441st – 1560th                                | Fault 4

4.4. Discussion of the data mining result

A quantified metric Q is defined to evaluate the performance of the data mining method. Suppose that the dataset contains N_i samples of process condition i (i = 1, 2, β‹―, 12) and the cluster count in the data mining result is C. Cluster j (j = 1, 2, β‹―, C) consists of N_j samples, N_{i,j} of which belong to process condition i. For the jth cluster, the cluster purity p_j is defined by Eq. (14) and the cluster efficiency e_j by Eq. (15):

p_j = max_i N_{i,j} / N_j    (14)

e_j = max_i N_{i,j} / N_i    (15)

where the N_i in Eq. (15) is the sample count of the dominant condition i of cluster j. The purity indicates the quality of the cluster itself, while the efficiency assesses the contribution that the cluster makes to the data mining result. The metric Q_j for cluster j combines p_j and e_j as shown in Eq. (16), where w is the weight of p_j, and Q is the mean of the Q_j as shown in Eq. (17). The purity is valued more in this study since it determines whether the cluster can be annotated correctly and how many samples will be mislabeled; thereby, w is set to 0.7. The Q value of a high-quality data mining result will be close to 1.

Q_j = w p_j + (1 βˆ’ w) e_j    (16)

Q = (1/C) Ξ£_{j=1}^{C} Q_j    (17)
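Eqs. (14)-(17) translate directly into a short sketch; condition and cluster indices are assumed to be encoded as non-negative integers:

    import numpy as np

    def q_metric(condition, cluster, w=0.7):
        """Q of Eqs. (14)-(17) from per-sample condition and cluster indices."""
        q_vals = []
        for j in np.unique(cluster):
            members = condition[cluster == j]
            counts = np.bincount(members, minlength=condition.max() + 1)
            i_star = counts.argmax()                            # dominant condition
            p_j = counts[i_star] / len(members)                 # Eq. (14): purity
            e_j = counts[i_star] / np.sum(condition == i_star)  # Eq. (15): efficiency
            q_vals.append(w * p_j + (1.0 - w) * e_j)            # Eq. (16)
        return float(np.mean(q_vals))                           # Eq. (17)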

The Q values of the data mining results with different methods are listed in Table 7. As can be seen, the data mining result of the proposed method stands out. In addition, the method combining PCA and the K-means algorithm requires prior knowledge of the clusters, and the method using the DBSCAN algorithm alone suffers from a tedious parameter tuning process.

Table 7 The Q values (Q_j per cluster) of the data mining results with different methods.

Cluster | Proposed method | ST+DBSCAN | ST+K-means | ST+Birch | t-SNE+DBSCAN | PCA+K-means | DBSCAN
1       | 0.969 | 0.969 | 0.997 | 0.657 | 0.855 | 0.762 | 0.831
2       | 1.000 | 1.000 | 0.883 | 0.791 | 0.942 | 0.775 | 0.772
3       | 1.000 | 0.920 | 0.599 | 0.650 | 1.000 | 0.845 | 0.928
4       | 1.000 | 0.918 | 0.956 | 0.701 | 0.951 | 1.000 | 0.967
5       | 1.000 | 1.000 | 0.443 | 0.742 | 0.797 | 0.785 | 0.758
6       | 1.000 | 1.000 | 0.659 | 0.757 | 1.000 | 0.922 | 0.773
7       | 0.875 | 0.96  | 0.985 | 1.000 | 1.000 | 0.745 | 0.992
8       | 1.000 | 0.865 | 0.967 | 0.721 | 0.893 | 0.827 | 0.745
9       | 1.000 | 1.000 | 1.000 | 1.000 | 0.810 | 0.855 | 0.725
10      | 0.988 | 1.000 | 0.842 | 1.000 | 0.96  | 0.883 | 0.717
11      | 1.000 | 0.987 | 1.000 | 1.000 | 1.000 | 0.947 | 0.722
12      | 1.000 | 1.000 | 0.878 | 1.000 | /     | 0.795 | 0.742
Q value | 0.986 | 0.968 | 0.851 | 0.835 | 0.928 | 0.845 | 0.806

ST: the proposed feature extraction and visualization method combining the SAE and t-SNE. PCA + K-means: (He et al., 2005); DBSCAN: (Thomas et al., 2018).

The reason why the 1921st to 1970th samples are allocated to the wrong cluster is explored as follows. As can be seen in Fig. 11(a) and (b), the clustering task itself is perfectly accomplished. The main problem is therefore that these samples, which are supposed to belong to Fault 13, assemble with the normal samples after feature extraction, as shown in Fig. 11(c).

Fig. 11. (a) Features visualization; (b) The clustering result; (c) True labels.

Fault 13 is a slow drift error in the reaction kinetics, and the process measurements do not deviate from the normal operation until a transition period has passed after the disturbance is introduced. Some variable curves are shown in Fig. 12. The 1921st to 1970th samples, the first 50 samples of Fault 13, are practically the same as the normal samples and thus cannot be distinguished.

Fig. 12. Variables comparison between the normal operation and Fault 13

4.5. Application of the data mining result for online fault diagnosis

Application of the unsupervised data mining result for online fault diagnosis is shown in Fig. 13. All data samples are allocated to several clusters in the result, and these clusters can be assigned to specific process conditions, such as the normal operation or Fault 1, by experts annotating the clusters based on their process knowledge. The demand for process knowledge in the cluster annotation step is inevitable in any unsupervised fault diagnosis method, but it was seldom discussed in former studies. The cluster annotation converts the unlabeled historical data into pseudo-labeled data for training the pseudo-supervised fault diagnosis model, which is based on the CNN in this study, and the trained model is capable of online fault diagnosis of unlabeled real-time process data. Whether the data mining result can lead to a valid fault diagnosis model is investigated in the following sections.

Fig. 13. Application of the unsupervised data mining result for online fault diagnosis

4.5.1. Creating pseudo-labeled database by cluster annotation

The clusters are annotated with specific process condition labels by experts identifying a few samples of each cluster, following the strategy shown in Fig. 14. Since a chemical process runs normally most of the time, the largest cluster, which contains the highest number of samples, is annotated as the normal operation. The strategy for annotating the remaining clusters is based on two definitions: if more than 50% of the samples in a cluster C belong to the same process condition P, then C is referred to as an "ideal cluster" and P is referred to as the "ideal label" of C in this study. The number of mislabeled samples is minimized when a cluster is annotated with its ideal label. The strategy is designed mainly for two reasons: 1) not too many samples should be labeled manually, otherwise the unsupervised study would be meaningless; 2) according to Maximum Likelihood Estimation, the clusters annotated with this strategy are ideal clusters annotated with their ideal labels.

Fig. 14. The strategy of cluster annotation
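A sketch of this strategy follows; expert_label is a hypothetical oracle standing in for the expert who identifies a handful of samples per cluster, and the number of checked samples is an assumed setting:

    import numpy as np
    from collections import Counter

    def annotate_clusters(cluster, expert_label, n_check=3):
        """Fig. 14 strategy: the largest cluster becomes the normal operation;
        every other cluster takes the majority label of a few checked samples."""
        sizes = Counter(cluster)
        largest = max(sizes, key=sizes.get)
        annotations = {largest: "Normal operation"}
        for j in sizes:
            if j == largest:
                continue
            idx = np.flatnonzero(cluster == j)[:n_check]  # samples shown to the expert
            votes = Counter(expert_label(i) for i in idx)
            annotations[j] = votes.most_common(1)[0][0]
        return annotations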

Based on the data mining result and the cluster annotation strategy, the annotation result is shown in Table 8.

Table 8 The cluster annotation result.

Cluster | Samples in the cluster                         | Is the largest cluster | Samples composition                           | Annotated label  | Mislabeled samples
1       | 1st – 1200th, 1921st – 1970th, 2281st – 2285th | TRUE                   | 95.6%: Normal; 4.0%: Fault 13; 0.4%: Fault 19 | Normal operation | 1921st – 1970th, 2281st – 2285th
2       | 1201st – 1320th                                | FALSE                  | 100%: Fault 1                                 | Fault 1          | None
3       | 1321st – 1440th                                | FALSE                  | 100%: Fault 2                                 | Fault 2          | None
4       | 1801st – 1920th                                | FALSE                  | 100%: Fault 11                                | Fault 11         | None
5       | 1561st – 1680th                                | FALSE                  | 100%: Fault 6                                 | Fault 6          | None
6       | 1681st – 1800th                                | FALSE                  | 100%: Fault 7                                 | Fault 7          | None
7       | 1971st – 2040th                                | FALSE                  | 100%: Fault 13                                | Fault 13         | None
8       | 2161st – 2280th                                | FALSE                  | 100%: Fault 17                                | Fault 17         | None
9       | 2286th – 2400th                                | FALSE                  | 100%: Fault 19                                | Fault 19         | None
10      | 2041st – 2160th                                | FALSE                  | 100%: Fault 14                                | Fault 14         | None
11      | 2401st – 2520th                                | FALSE                  | 100%: Fault 20                                | Fault 20         | None
12      | 1441st – 1560th                                | FALSE                  | 100%: Fault 4                                 | Fault 4          | None

The label accuracy is defined by Eq. (18):

ACC = (number of samples with correct labels / number of all samples) Γ— 100%    (18)

The pseudo-labeled Dataset 1 is thus created with a label accuracy of (2520 βˆ’ 55)/2520 Γ— 100% = 97.8%, while process knowledge of only 33 sample labels is required. The time series plot of the pseudo-labeled dataset is compared with the plot that Thomas et al. obtained in Fig. 15.

Fig. 15. Time series plots in (a): this study; (b): the study of Thomas et al.(Thomas et al., 2018)

4.5.2. Online fault diagnosis

The practicability of the pseudo-labeled database for training an online fault diagnosis model is validated in this section. The fault diagnosis rate (FDR) and false positive rate (FPR), defined in Eqs. (19) and (20) based on the confusion matrix shown in Table 9, are commonly used to evaluate fault diagnosis results. FDR and FPR both range from zero to one; an FDR equal to 1 together with an FPR equal to 0 represents a perfect diagnosis result. The FDR usually draws more attention since it intuitively shows how many samples are classified correctly.

Table 9 Confusion matrix for the ith class in fault diagnosis.

Actual \ Predicted             | belonging to the ith class | not belonging to the ith class
belonging to the ith class     | a                          | c
not belonging to the ith class | b                          | d

FDR = a / (a + c)    (19)

FPR = b / (b + d)    (20)
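Eqs. (19) and (20) translate directly into code; a minimal sketch:

    import numpy as np

    def fdr_fpr(y_true, y_pred, cls):
        """Per-class FDR and FPR from the Table 9 confusion-matrix entries."""
        a = np.sum((y_true == cls) & (y_pred == cls))   # correctly diagnosed as cls
        c = np.sum((y_true == cls) & (y_pred != cls))   # missed samples of cls
        b = np.sum((y_true != cls) & (y_pred == cls))   # false alarms for cls
        d = np.sum((y_true != cls) & (y_pred != cls))
        return a / (a + c), b / (b + d)                 # Eq. (19), Eq. (20)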

Dataset 1 with pseudo labels and with actual labels is used, respectively, to train the pseudo-supervised and supervised fault diagnosis models based on the CNN, and the comparison of the diagnosis results is shown in Table 10. The models are tested on Dataset 2.

Table 10 Comparison of fault diagnosis results of the supervised and pseudo-supervised models.

                  | Supervised diagnosis model         | Pseudo-supervised diagnosis model
Process condition | FDR(train) | FDR(test) | FPR(test) | FDR(train) | FDR(test) | FPR(test)
Normal   | 0.975 | 0.976 | 0.045 | 0.973 | 0.975 | 0.036
Fault 1  | 1.000 | 1.000 | 0.000 | 1.000 | 1.000 | 0.000
Fault 2  | 1.000 | 0.997 | 0.020 | 1.000 | 1.000 | 0.016
Fault 4  | 1.000 | 0.997 | 0.000 | 1.000 | 0.994 | 0.000
Fault 6  | 1.000 | 1.000 | 0.000 | 1.000 | 1.000 | 0.001
Fault 7  | 1.000 | 1.000 | 0.000 | 1.000 | 1.000 | 0.000
Fault 11 | 1.000 | 1.000 | 0.000 | 1.000 | 1.000 | 0.005
Fault 13 | 0.900 | 0.256 | 0.005 | 0.958 | 0.319 | 0.005
Fault 14 | 1.000 | 1.000 | 0.000 | 1.000 | 1.000 | 0.000
Fault 17 | 1.000 | 0.833 | 0.000 | 1.000 | 0.839 | 0.000
Fault 19 | 1.000 | 0.911 | 0.000 | 1.000 | 0.808 | 0.000
Fault 20 | 1.000 | 0.789 | 0.011 | 1.000 | 0.819 | 0.009
Average  | 0.983 | 0.931 | 0.005 | 0.985 | 0.930 | 0.005

The comparison demonstrates that the pseudo-supervised diagnosis model performs almost as well as the supervised model. In other words, the proposed data mining method is competent for processing unlabeled data as a preliminary step of online fault diagnosis. The confusion matrix of the pseudo-supervised testing result is shown in Fig. 16.

Fig. 16. The confusion matrix of the pseudo-supervised testing result

5. Application on an industrial hydrocracking instance

In this section, the proposed unsupervised data mining method is applied to a real hydrocracking instance to verify its capability of identifying different operating conditions of an industrial process. Due to a confidentiality agreement, only a simplified flow diagram of the fractionation system in the hydrocracking process is shown in Fig. 17. A is an Hβ‚‚S stripping tower, B is a heating furnace, C is a fractionating tower, and D and E are stripping towers.

Fig. 17. Schematic of the fractionation system in the industrial hydrocracking process

The product of siding S2 is switched between product P1 and product P2 according to the production schedule. It is assumed that the process operating condition changes when a product switch takes place; thus, the proposed data mining method can be conducted on the process historical data, and the start of a product switch can be inferred from the data mining result. The ten variables most correlated with the switch are chosen from the total of 480 process measurements to form the data samples, and they are listed in Table 11.

Table 11 Description of selected variables.

Variable | Description             | Variable | Description
V1       | bottom temperature of A | V6       | flow of siding S1
V2       | fuel gas flow of B      | V7       | bottom temperature of D
V3       | feed temperature of C   | V8       | reboiler thermal load of D
V4       | top temperature of C    | V9       | temperature of siding S1
V5       | top backflow of C       | V10      | temperature of siding S2

Three days of data, from 2018/09/18 to 2018/09/21, covering the 10 measurements at 3-minute intervals are studied, and the data preprocessing step proceeds as it did for the TEP dataset. The resulting dataset consists of 1,440 200-dimensional samples. After the grid search of the SAE structure and hyperparameters, an SAE with the LSTM layer, shown in Fig. 18, is used to extract 16-dimensional features from the input data.

Fig. 18. Structure of the LSTM SAE
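A minimal Keras sketch in the spirit of Fig. 18 is given below; apart from the 200-dimensional input (20 time steps Γ— 10 variables) and the 16-dimensional code, the layer sizes and activations are assumptions:

    from tensorflow import keras
    from tensorflow.keras import layers

    inputs = keras.Input(shape=(200,))
    x = layers.Reshape((20, 10))(inputs)            # 20 time steps x 10 variables
    x = layers.LSTM(50)(x)
    code = layers.Dense(16, name="features")(x)     # 16-dimensional features
    x = layers.Dense(50, activation="relu")(code)
    x = layers.RepeatVector(20)(x)
    x = layers.LSTM(10, return_sequences=True)(x)
    outputs = layers.Reshape((200,))(x)
    lstm_sae = keras.Model(inputs, outputs)
    lstm_sae.compile(optimizer="adam", loss="mse")  # reconstruction loss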

The extracted features are visualized by the t-SNE algorithm, and the visualization result is shown in Fig. 19(a). Clustering is accomplished by the DBSCAN algorithm, and the clustering result is shown in Fig. 19(b).

Fig. 19. The visualization and clustering results

The details of the data mining result are shown in Table 12, and the corresponding time series plot is shown in Fig. 20. The result indicates that there existed two process conditions during those 3 days, with a product switch starting at the 658th sample. According to the factory records, a switch actually began at 12 a.m. on 2018/09/20, which corresponds to the 640th sample of the dataset. The predicted start of the product switch is thus only about 1 hour later than the actual start.

Table 12 The data mining result.

Cluster | Corresponding samples serial number
1       | 1st – 658th
2       | 659th – 1440th

Fig. 20. Time series plot of the data mining result

6. Conclusions and outlooks

In this paper, an unsupervised data mining method based on the SAE is proposed for unsupervised fault diagnosis of chemical processes. Lower-dimensional features are extracted from raw high-dimensional data by the SAE with a convolutional or LSTM layer and then visualized by the t-SNE algorithm. The visualized features are clustered to obtain the data mining result, from which a pseudo-labeled database can be created efficiently for constructing a pseudo-supervised model for online fault diagnosis.

The proposed method is applied to 12 conditions of the TEP, comprising Faults 1, 2, 4, 6, 7, 11, 13, 14, 17, 19 and 20 and the normal operation. A convolutional SAE is selected by the grid search for feature extraction. The metric Q is defined to evaluate the data mining result, and the method leads to a result with a Q value of 0.986, which exceeds the other methods. Based on the data mining result, all data samples can be given specific labels efficiently by cluster annotation, with the label accuracy reaching 97.8%. The pseudo-labeled dataset is used to train a pseudo-supervised fault diagnosis model, and the average testing FDR reaches 93.0%, which is as good as when the actually labeled dataset is used for training. In addition, the method is applied to a simplified real industrial situation to identify different operating conditions of the hydrocracking process.

This method is dependable as the core and basis of unsupervised chemical process fault diagnosis when confronting unlabeled historical data. However, the diagnosis model based on this method cannot recognize new fault types which do not exist in the historical dataset. Besides, the method does not perform well in discriminating conditions with similar data characteristics. To overcome these obstacles, more efficient feature extraction techniques will be investigated in the near future. One idea is to integrate clustering algorithms into the SAE and use the clustering performance as the loss function of the SAE.

Conflict of interest We confirm that the manuscript has been read and approved by all named authors and there are no other persons who satisfied the criteria for authorship but are not listed. We further confirm that the order of authors listed in the manuscript has been approved by all of us.

Author statement Shaodong Zheng: Methodology, Writing-Original draft preparation, Software, Validation, Formal analysis, Investigation, Data Curation, Visualization. Jinsong Zhao (Corresponding author): Conceptualization, Writing-Reviewing and Editing, Supervision, Project administration, Funding acquisition.

Acknowledgement The authors gratefully acknowledge support from the National Natural Science Foundation of China (No. 21878171). References Abonyi, J., Feil, B., Nemeth, S., Arva, P., 2005. Modified Gath-Geva clustering for fuzzy segmentation of multivariate time-series. Fuzzy Sets Syst. 149, 39–56. https://doi.org/10.1016/j.fss.2004.07.008 Alaei, Hesam Komari, Salahshoor, K., Alaei, Hamed Komari, 2013. A new integrated

on-line fuzzy clustering and segmentation methodology with adaptive PCA approach for process monitoring and fault detection and diagnosis. Soft Comput. 17, 345–362. https://doi.org/10.1007/s00500-012-0910-9 Bahrampour, S., Moshiri, B., Salahshoor, K., 2011. Weighted and constrained possibilistic C-means clustering for online fault detection and isolation. Appl. Intell. 35, 269–284. https://doi.org/10.1007/s10489-010-0219-2 Bathelt, A., Ricker, N.L., Jelali, M., 2015. Revision of the Tennessee Eastman process model, in: 9th IFAC Symposium on Advanced Control of Chemical Processes ADCHEM 2015. Elsevier Ltd., pp. 309–314. https://doi.org/10.1016/j.ifacol.2015.08.199 Bengio, Y., Simard, P., Frasconi, P., 1994. Learning long-term dependencies with Gradient Descent is difficult. IEEE Trans. Neural Networks 5, 157–166. https://doi.org/10.1109/72.279181 Berkhin, P., 2006. Survey of clustering data mining techniques, in: Kogan, J., Nicholas, C., Teboulle, M. (Eds.), Grouping Multidimensional Data. Springer, Berlin, Heidelberg, pp. 25–71. https://doi.org/10.1007/3-540-28349-8_2 Bhushan, B., Romagnoli, J.A., 2008. Self-organizing self-clustering network: A strategy for unsupervised pattern classification with its application to fault diagnosis. Ind. Eng. Chem. Res. 47, 4209–4219. https://doi.org/10.1021/ie071549a Chen, B.H., Wang, X.Z., Yang, S.H., McGreavy, C., 1999. Application of wavelets and neural networks to diagnostic system development, 1, feature extraction. Comput. Chem. Eng. 23, 899–906. https://doi.org/10.1016/S0098-1354(99)00258-6 Chiang, L.H., Kotanchek, M.E., Kordon, A.K., 2004. Fault diagnosis based on Fisher Discriminant Analysis and Support Vector Machines. Comput. Chem. Eng. 28, 1389–1401. https://doi.org/10.1016/j.compchemeng.2003.10.002 Downs, J.J., Vogel, E.F., 1993. A plant-wide industrial problem process. Comput. Chem. Eng. 17, 245–255. https://doi.org/10.1016/0098-1354(93)80018-I Escobar, M.S., Kaneko, H., Funatsu, K., 2017. On Generative Topographic Mapping and Graph Theory combined approach for unsupervised non-linear data visualization and fault identification. Comput. Chem. Eng. 98, 113–127. https://doi.org/10.1016/j.compchemeng.2016.12.009 Escobar, M.S., Kaneko, H., Funatsu, K., 2015. Combined Generative Topographic Mapping and Graph Theory unsupervised approach for nonlinear fault identification. AIChE J. 61, 1559–1571. https://doi.org/10.1002/aic.14748 Ester, M., Kriegel, H., Sander, J., Xu, X., 1996. A density-based algorithm for discovering clusters in large spatial databases with noise, in: KDD-96 Proceedings. Second International Conference on Knowledge Discovery and Data Mining. pp. 226–231. Fahad, A., Alshatri, N., Tari, Z., Alamri, A., Khalil, I., Zomaya, A.Y., Foufou, S., Bouras, A., 2014. A survey of clustering algorithms for big data: taxonomy and empirical analysis. IEEE Trans. Emerg. Top. Comput. 2, 267–279. https://doi.org/10.1109/TETC.2014.2330519

Fan, J., Wang, W., Zhang, H., 2017. AutoEncoder based high-dimensional data fault detection system, in: 2017 IEEE 15th International Conference on Industrial Informatics. pp. 1001–1006. https://doi.org/10.1109/INDIN.2017.8104910
Hartigan, J.A., Wong, M.A., 1979. A K-means clustering algorithm. Appl. Stat. 28, 100–108. https://doi.org/10.2307/2346830
He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition. pp. 770–778. https://doi.org/10.1109/CVPR.2016.90
He, Q.P., Qin, S.J., Wang, J., 2005. A new fault diagnosis method using fault directions in Fisher Discriminant Analysis. AIChE J. 51, 555–571. https://doi.org/10.1002/aic.10325
Heo, S., Lee, J.H., 2018. Fault detection and classification using artificial neural networks, in: 10th IFAC Symposium on Advanced Control of Chemical Processes. pp. 470–475. https://doi.org/10.1016/j.ifacol.2018.09.380
Hinton, G., Roweis, S., 2003. Stochastic Neighbor Embedding. Adv. Neural Inf. Process. Syst. 15, 833–840.
Hinton, G.E., Salakhutdinov, R.R., 2006. Reducing the dimensionality of data with neural networks. Science 313, 504–507. https://doi.org/10.1126/science.1127647
Hochreiter, S., Schmidhuber, J., 1997. Long short-term memory. Neural Comput. 9, 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
JΓ€msΓ€-Jounela, S.L., Vermasvuori, M., EndΓ©n, P., Haavisto, S., 2003. A process monitoring system based on the Kohonen self-organizing maps. Control Eng. Pract. 11, 83–92. https://doi.org/10.1016/S0967-0661(02)00141-7
Ji, S., Xu, W., Yang, M., Yu, K., 2013. 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 35, 221–231. https://doi.org/10.1109/TPAMI.2012.59
Kramer, M.A., 1991. Nonlinear Principal Component Analysis using autoassociative neural networks. AIChE J. 37, 233–243. https://doi.org/10.1002/aic.690370209
LeCun, Y., Bengio, Y., Hinton, G., 2015. Deep learning. Nature 521, 436–444. https://doi.org/10.1038/nature14539
LeCun, Y., Jackel, L.D., Boser, B., Denker, J.S., 1989. Handwritten digit recognition: applications of neural network chips and automatic learning. IEEE Commun. Mag. 27, 41–46. https://doi.org/10.1109/35.41400
Lee, J., Qin, S.J., Lee, I., 2006. Fault detection and diagnosis based on modified Independent Component Analysis. AIChE J. 52, 3501–3514. https://doi.org/10.1002/aic.10978
Lee, K.B., Cheon, S., Kim, C.O., 2017. A convolutional neural network for fault classification and diagnosis in semiconductor manufacturing processes. IEEE Trans. Semicond. Manuf. 30, 135–142. https://doi.org/10.1109/TSM.2017.2676245
Lipton, Z.C., Berkowitz, J., Elkan, C., 2015. A critical review of Recurrent Neural Networks for sequence learning. arXiv Prepr. arXiv:1506.00019v4
Lloyd, S.P., 1982. Least squares quantization in PCM. IEEE Trans. Inf. Theory 28, 129–137. https://doi.org/10.1109/TIT.1982.1056489
Lv, F., Wen, C., Liu, M., Bao, Z., 2017. Weighted time series fault diagnosis based on a stacked sparse autoencoder. J. Chemom. 31, 1–16. https://doi.org/10.1002/cem.2912
Ma, C.Y., Wang, X.Z., 2009. Inductive data mining based on genetic programming: Automatic generation of decision trees from data for process historical data analysis. Comput. Chem. Eng. 33, 1602–1616. https://doi.org/10.1016/j.compchemeng.2009.04.005
Park, P., Di Marco, P., Shin, H., Bang, J., 2019. Fault detection and diagnosis using combined autoencoder and Long Short-Term Memory network. Sensors 19, 1–17. https://doi.org/10.3390/s19214612
Raich, A.C., Γ‡inar, A., 1995. Multivariate statistical methods for monitoring continuous processes: assessment of discrimination power of disturbance models and diagnosis of multiple disturbances. Chemom. Intell. Lab. Syst. 30, 37–48. https://doi.org/10.1016/0169-7439(95)00035-6
Rumelhart, D.E., Hinton, G.E., Williams, R.J., 1986. Learning representations by back-propagating errors. Nature 323, 533–536. https://doi.org/10.1038/323533a0
Sarikaya, R., Hinton, G.E., Deoras, A., 2014. Application of Deep Belief Networks for natural language understanding. IEEE/ACM Trans. Audio Speech Lang. Process. 22, 778–784. https://doi.org/10.1109/TASLP.2014.2303296
Sebzalli, Y.M., Wang, X.Z., 2001. Knowledge discovery from process operational data using PCA and fuzzy clustering. Eng. Appl. Artif. Intell. 14, 607–616. https://doi.org/10.1016/S0952-1976(01)00032-X
Singhal, A., Seborg, D.E., 2005. Clustering multivariate time-series data. J. Chemom. 19, 427–438. https://doi.org/10.1002/cem.945
Srinivasan, R., Wang, C., Ho, W.K., Lim, K.W., 2004. Dynamic Principal Component Analysis based methodology for clustering process states in agile chemical plants. Ind. Eng. Chem. Res. 43, 2123–2139. https://doi.org/10.1021/ie034051r
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A., 2015. Going deeper with convolutions, in: 2015 IEEE Conference on Computer Vision and Pattern Recognition. pp. 1–9.
Tang, J., Yan, X., 2017. Neural network modeling relationship between inputs and state mapping plane obtained by FDA-t-SNE for visual industrial process monitoring. Appl. Soft Comput. J. 60, 577–590. https://doi.org/10.1016/j.asoc.2017.07.022
Thomas, M.C., Zhu, W., Romagnoli, J.A., 2018. Data mining and clustering in chemical process databases for monitoring and knowledge discovery. J. Process Control 67, 160–175. https://doi.org/10.1016/j.jprocont.2017.02.006
Van der Maaten, L.J.P., Hinton, G.E., 2008. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605.
Verron, S., Tiplica, T., Kobi, A., 2008. Fault detection and identification with a new feature selection based on mutual information. J. Process Control 18, 479–490. https://doi.org/10.1016/j.jprocont.2007.08.003
Wang, X.Z., Chen, B.H., Yang, S.H., McGreavy, C., 1999. Application of wavelets and neural networks to diagnostic system development, 2, an integrated framework and its application. Comput. Chem. Eng. 23, 945–954. https://doi.org/10.1016/S0098-1354(99)00260-4
Wang, X.Z., Li, R.F., 1999. Combining conceptual clustering and Principal Component Analysis for state space based process monitoring. Ind. Eng. Chem. Res. 38, 4345–4358. https://doi.org/10.1021/ie990144q
Wang, X.Z., McGreavy, C., 1998. Automatic classification for mining process operational data. Ind. Eng. Chem. Res. 37, 2215–2222. https://doi.org/10.1021/ie970620h
Wu, H., Zhao, J., 2018. Deep convolutional neural network model based chemical process fault diagnosis. Comput. Chem. Eng. 115, 185–197. https://doi.org/10.1016/j.compchemeng.2018.04.009
Wu, X., Kumar, V., Ross Quinlan, J., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G.J., Ng, A., Liu, B., Yu, P.S., Zhou, Z.H., Steinbach, M., Hand, D.J., Steinberg, D., 2008. Top 10 algorithms in data mining. Knowl. Inf. Syst. 14, 1–37. https://doi.org/10.1007/s10115-007-0114-2
Wu, Y., Yuan, M., Dong, S., Lin, L., Liu, Y., 2018. Remaining useful life estimation of engineered systems using vanilla LSTM neural networks. Neurocomputing 275, 167–179. https://doi.org/10.1016/j.neucom.2017.05.063
Yin, S., Ding, S.X., Haghani, A., Hao, H., Zhang, P., 2012. A comparison study of basic data-driven fault diagnosis and process monitoring methods on the benchmark Tennessee Eastman process. J. Process Control 22, 1567–1581. https://doi.org/10.1016/j.jprocont.2012.06.009
Yu, J., 2012. A support vector clustering-based probabilistic method for unsupervised fault detection and classification of complex chemical processes using unlabeled data. AIChE J. 59, 407–419. https://doi.org/10.1002/aic.13816
Zhang, X., Zou, Y., Li, S., Xu, S., 2019. A weighted auto regressive LSTM based approach for chemical processes modeling. Neurocomputing 367, 64–74. https://doi.org/10.1016/j.neucom.2019.08.006
Zhang, Z., Jiang, T., Li, S., Yang, Y., 2018. Automated feature learning for nonlinear process monitoring – An approach using stacked denoising autoencoder and K-Nearest Neighbor rule. J. Process Control 64, 49–61. https://doi.org/10.1016/j.jprocont.2018.02.004
Zhang, Z., Zhao, J., 2017. A Deep Belief Network based fault diagnosis model for complex chemical processes. Comput. Chem. Eng. 107, 395–407. https://doi.org/10.1016/j.compchemeng.2017.02.041
Zheng, S., Zhao, J., 2018. States identification of complex chemical process based on unsupervised learning, in: Computer Aided Chemical Engineering, Vol. 44. Elsevier. https://doi.org/10.1016/B978-0-444-64241-7.50368-2
Zhong, B., Wang, J., Wu, H., Zhou, J., Jin, Q., 2016. SOM-based visualization monitoring and fault diagnosis for chemical process, in: Proceedings of the 28th Chinese Control and Decision Conference, CCDC 2016. IEEE, pp. 5844–5849. https://doi.org/10.1109/CCDC.2016.7532043
Zhou, Z., 2016. Machine Learning, 1st ed. Tsinghua University Press, Beijing.
Zhu, W., Webb, Z.T., Mao, K., Romagnoli, J., 2019. A deep learning approach for process data visualization using t-Distributed Stochastic Neighbor Embedding. Ind. Eng. Chem. Res. 58, 9564–9575. https://doi.org/10.1021/acs.iecr.9b00975