Data-driven root cause diagnosis of faults in process industries
Gang Li, S. Joe Qin, Tao Yuan
Chemometrics and Intelligent Laboratory Systems
DOI: http://dx.doi.org/10.1016/j.chemolab.2016.09.006
Received 4 July 2016; revised 16 September 2016; accepted 17 September 2016
Data-driven root cause diagnosis of faults in process industries

Gang Li (a,*), S. Joe Qin (b), Tao Yuan (c)

a Department of Inertia Technology & Navigation Guidance Instruments, Beihang University, Beijing 100191, China
b Chinese University of Hong Kong, Shenzhen 518172, China; on leave from the Viterbi School of Engineering, University of Southern California, Los Angeles, CA 90089, USA
c Ming Hsieh Department of Electrical Engineering, University of Southern California, Los Angeles, CA 90089, USA
Abstract

Data-driven fault detection and diagnosis methods have become increasingly attractive in modern industries, especially process industries. They not only help guarantee safe operation but also greatly improve product quality. For example, dynamic principal component analysis models and reconstruction-based contributions are widely applicable in many situations. However, one issue has not received enough attention: locating the root cause of a fault once it occurs. In this paper, a root cause diagnosis framework is proposed to address this issue, covering both stationary and nonstationary faults. A case study on the Tennessee Eastman process demonstrates the usage and effectiveness of these approaches, and the results show that the proposed framework is valid.

Keywords: root cause diagnosis, dynamic principal component analysis, reconstruction-based contribution, Granger causality analysis, dynamic time warping
1. INTRODUCTION

With the huge amount of data collected by distributed control systems, it has become increasingly popular to apply data-driven techniques to process monitoring in modern industries [1, 2, 3]. Statistical process monitoring, as a common tool for the analysis of correlated data, has been used successfully for fault detection and diagnosis in various processes [4]. For slowly varying processes sampled at low rates, principal component analysis (PCA) is an effective model to describe the correlations of observed variables. Multivariate samples are projected onto a principal subspace and a residual subspace generated by a PCA model, and statistics in the different subspaces are used to monitor the process. PCA models do not take the autocorrelation of measurements into consideration. To address this problem, Ku et al. proposed dynamic PCA (DPCA), which performs the PCA modeling algorithm on an augmented data matrix [5]. There are also other approaches to dynamic modeling, such as canonical variate analysis [6], subspace identification methods (SIM) [7, 8], and the dynamic latent variable model [9].

∗ Corresponding author: Gang Li, Email: [email protected]
Preprint submitted to Chemometrics and Intelligent Laboratory Systems, September 19, 2016
After a fault is detected in an industrial process, a diagnosis tool is needed to identify the faulty variables. The most popular method for online fault diagnosis is contribution plot analysis [10]. However, contribution plots suffer from the smearing effect in many situations, which leads to erroneous results [11]. To overcome this drawback, Alcala and Qin proposed reconstruction-based contributions (RBC) for the monitoring of sensor faults (only one sensor affected per fault), integrating contribution plots and reconstruction methods [11]. Furthermore, Li et al. extended the RBC method to general process faults, which affect multiple sensors simultaneously [12]. Although it identifies faulty variables efficiently, the RBC technique cannot locate the exact root cause, because it is based on correlation rather than causality in the data. Only causal models, such as physical models or certain empirical models, provide causality information directly; correlation-based data models are not indicative of causality among variables. Temporal data, however, retain the ability to reveal hidden causality.

The root cause diagnosis issue has been addressed in the scenario of plant-wide oscillation identification, which aims to discover the control loop or device that excites undesired oscillations in other loops or manipulated variables. Duan et al. reviewed recent methods for the root cause diagnosis of plant-wide oscillations [13], including the spectral envelope, the adjacency matrix, causality analysis, transfer entropy, and Bayesian networks. The spectral envelope is a frequency-domain method that performs singular value decomposition in the frequency domain to detect potential plant-wide oscillations effectively [14]. Recently, an information transfer method was also proposed in the frequency domain for the cause diagnosis of plant-wide oscillations [15].
The adjacency matrix requires building a connection graph from process knowledge, so it is not completely data driven [16]. Causality analysis, including Granger causality and transfer entropy, is a direct method that reveals causality between variables based on temporal data [17, 18]; both define causality between pairs of variables. The dynamic Bayesian network (DBN) is another widely used graphical model for probabilistic reasoning, which can indicate causal information among variables [19]. A DBN can be derived either from process knowledge or from process data [20]. However, training a DBN model on a large-scale data set costs more time than Granger causality analysis.

Although many approaches are available for analyzing plant-wide oscillations, they are not directly applicable to the root cause diagnosis of process faults. A process fault here means an undesired event or disturbance affecting an actuator or a device, which in turn affects more than one measurement simultaneously, such as a stuck valve or a sudden change in the ratio of reactants. There are two main problems. First, almost all causality analysis methods are sensitive to the number of candidate series analyzed: the more series a method covers, the less accurate and clear the result. Therefore, reducing the number of candidates as much as possible is important for causality analysis. In root cause analysis of plant-wide oscillations, the related series are preselected with the spectral envelope method or principal component analysis, which does not always work in the fault diagnosis setting. It is therefore desirable to use fault identification to preselect candidate faulty variables effectively. Second, the aforementioned causality analysis
methods all assume the process is stationary, or more precisely that the prediction errors of these models are stationary. This assumption holds for plant-wide oscillations but generally not for process faults, because a faulty process may be nonstationary for some fault types, such as a step fault. In such cases, none of the above methods are applicable.

In this work, a framework based on statistical process monitoring is proposed to locate the root cause once a fault is detected. First, a dynamic PCA model is used for fault detection. Then, faulty candidates are identified with an RBC-based method. Granger causality analysis is employed to locate the root cause of quasi-stationary faulty processes, while a dynamic time warping based method is proposed for nonstationary faulty processes. A case study on the Tennessee Eastman process demonstrates the usage and effectiveness of the proposed methods. Conclusions are presented in the last section.

2. Fault detection and identification for dynamic processes

2.1. Fault detection with a DPCA model

Suppose all the measurements collected in a process at time i are represented by an m-dimensional vector x_i ∈ R^m, which includes control signals as inputs, actuator signals as outputs, process indicators for the environment, and quality indicators for products. A dynamic PCA model is used to capture both the correlations among variables and the autocorrelations among lagged variables, by applying PCA to the following augmented matrix [5]:

Z = \begin{bmatrix} x_s & x_{s+1} & \cdots & x_{n+s-1} \\ x_{s-1} & x_s & \cdots & x_{n+s-2} \\ \vdots & \vdots & \ddots & \vdots \\ x_1 & x_2 & \cdots & x_n \end{bmatrix}^T    (1)

where s is the number of lagged measurements and n is the number of augmented sample vectors used for training a model. The objective of DPCA is essentially

\max_p \; p^T Z^T Z p \quad \text{s.t.} \quad \|p\| = 1    (2)

where p is a loading vector, which can be obtained directly by performing singular value decomposition (SVD) on Z with singular values in descending order. Let z_i = [x_i^T, x_{i-1}^T, \ldots, x_{i-s+1}^T]^T denote the augmented sample vector, and let \tilde{P} collect the eigenvectors corresponding to the last sm − A eigenvalues, where sm is the dimension of z_i and A is the number of principal components, often selected by the cumulative percent variance criterion [5]. The squared norm of the residual is frequently used as a fault detection index in many kinds of applications [7, 8]:

Q_i = z_i^T \tilde{P} \tilde{P}^T z_i    (3)
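As a concrete illustration, the augmented matrix (1) and the Q statistic (3) can be sketched in a few lines of Python. This is a minimal sketch under the stated equations, not the authors' code; all names are illustrative, and scaling and control limits are omitted.

```python
import numpy as np

def dpca_q(X, s, A):
    """Fit a DPCA residual model on normal data X (rows are samples x_1..x_N)
    and return a function computing the Q statistic (3) for a lagged sample.
    s: number of lags, A: number of retained principal components."""
    N, m = X.shape
    n = N - s + 1
    # augmented matrix Z (Eq. 1): row i holds [x_{i+s-1}, x_{i+s-2}, ..., x_i]
    Z = np.hstack([X[s - 1 - k : s - 1 - k + n, :] for k in range(s)])
    Z = Z - Z.mean(axis=0)                    # mean-center
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    P_res = Vt[A:, :].T                       # residual loadings P~ (last sm - A directions)
    def Q(z):
        r = P_res.T @ z                       # projection onto the residual subspace
        return float(r @ r)                   # Q = z^T P~ P~^T z
    return Q
```

A new sample is then scored by stacking its last s measurement vectors into z and comparing Q(z) against a control limit estimated from normal data.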
2.2. Fault identification via contribution analysis

Once a fault is detected, it is desirable to identify the faulty variables. The contribution plot is a very popular tool for fault identification in multivariate statistical process monitoring. Traditional contribution plots assume that variables with large contributions are the most likely fault-related variables. However, this assumption does not hold even for sensor faults [21]. To overcome this problem, Alcala and Qin proposed the reconstruction-based contribution (RBC) method to locate contributing variables as follows [11]:

RBC_{ij} = \frac{(\xi_j^T M x_i)^2}{\xi_j^T M \xi_j}    (4)

where ξ_j is the j-th column of the identity matrix, M is the matrix corresponding to the fault detection index (e.g., M = \tilde{P}\tilde{P}^T for Q), and RBC_{ij} is the RBC value of variable j for the i-th sample. For a sensor fault, the sensor with a fault of large magnitude can, in theory, be identified successfully by the RBC method. However, for process faults that affect many measurements simultaneously, RBC selects too many faulty variables. To improve identification performance, multidirectional RBC is adopted [22]. Multidirectional RBC is an extension of generalized RBC, whose calculation is used to preselect faulty candidates as follows [12]:

RBC_{i,\Xi} = x_i^T M \Xi (\Xi^T M \Xi)^+ \Xi^T M x_i    (5)

where Ξ is an orthonormal matrix whose columns form a basis of the faulty subspace, called the fault direction matrix, RBC_{i,Ξ} is the RBC value for the fault direction Ξ at the i-th sample, and '+' denotes the Moore-Penrose pseudoinverse. Multidirectional RBC searches all possible combinations of faulty sensors and locates the smallest bundle of faulty variables that can reconstruct a sample completely. After that, the contribution of each selected variable j to the MRBC of the i-th sample is calculated as [22]

MRBC_{ij} = \{\xi_j^T [(\Xi_i^T M \Xi_i)^+]^{1/2} \Xi_i^T M x_i\}^2    (6)

where Ξ_i is the Ξ obtained for the i-th sample, and j is not taken from 1 to m but is indicated by the columns of Ξ_i. For example, if Ξ_i = [ξ_1, ξ_3, ξ_5], then j is selected only from {1, 3, 5} for the i-th sample.

In many applications of RBC, faulty variables are identified sample by sample, which provides online fault identification but is inconvenient for further analysis. To select an overall candidate set of faulty variables, the following quantitative criterion determines the candidates from all identification results:

TR_j = \frac{1}{n_f} \sum_{i=1}^{n_f} \overline{MRBC}_{ij}, \quad (j = 1, \ldots, m)    (7)

where n_f is the number of faulty samples after detection and \overline{MRBC}_{ij} = MRBC_{ij} / \sum_{j=1}^{m} MRBC_{ij} is the normalized MRBC of variable j for the i-th sample. Since \sum_{j=1}^{m} \overline{MRBC}_{ij} = 1 holds for every sample, it follows that \sum_{j=1}^{m} TR_j = 1. In this work, x_j is selected into the candidate set of faulty variables if TR_j > 1/m.
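Equations (4) and (5) translate directly into code. The sketch below is illustrative (the names are not from the paper's implementation); M is the detection-index matrix, e.g. \tilde{P}\tilde{P}^T for Q.

```python
import numpy as np

def rbc(M, x, j):
    """Reconstruction-based contribution of variable j for sample x (Eq. 4)."""
    xi = np.zeros(len(x)); xi[j] = 1.0
    return float((xi @ M @ x) ** 2 / (xi @ M @ xi))

def rbc_multi(M, x, idx):
    """Generalized RBC for a set of fault directions idx (Eq. 5).
    Xi collects the corresponding columns of the identity matrix."""
    Xi = np.eye(len(x))[:, idx]
    G = np.linalg.pinv(Xi.T @ M @ Xi)         # Moore-Penrose pseudoinverse
    v = Xi.T @ M @ x
    return float(v @ G @ v)
```

Multidirectional RBC then searches index sets idx of increasing size until rbc_multi reconstructs the sample completely, i.e. the remaining detection index drops below its control limit.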
3. Methods for root cause diagnosis of process faults

It should be noted that not all selected variables are the root cause. In fact, some of them are implicated in the abnormality through feedback control. This is because DPCA is a correlation-based statistical model, and RBC only indicates whether a variable is related to the detected fault. Therefore, it is desirable to further locate the root cause of the fault.

3.1. Methods for stationary faulty processes

Several methods exist to determine the causal relations among a cluster of variables. For a set of time series, one of the most popular methodologies is calculating the Granger causality (GC) [17]. Two alternatives are dynamic Bayesian networks and the transfer entropy index. Granger causality uses a statistical hypothesis test to judge whether one time series is significantly and causally affected by another, based on predictability. It has been widely used in economics, neuroscience, and energy [23]. Recently, Yuan and Qin proposed a Granger causality based method for the root cause diagnosis of oscillation sources and their propagation [17], which successfully located the source of undesired plant-wide oscillations. In this paper, the Granger causality test is employed for the root cause diagnosis of stationary faulty processes.

Suppose X^f = {x_1^f, x_2^f, ..., x_l^f} are the variables selected by the preceding identification method, where l is the number of elements in the candidate set. The simplest definition of Granger causality is derived from a two-variable system. Considering x_u^f, x_v^f ∈ X^f as two faulty candidates, the traditional GC method assumes they follow a two-dimensional vector autoregressive (VAR) model:

x_u^f(i) = \sum_{k=1}^{p} \alpha_{uk} x_u^f(i-k) + \sum_{k=1}^{p} \beta_{uk} x_v^f(i-k) + e_u(i)
x_v^f(i) = \sum_{k=1}^{p} \alpha_{vk} x_u^f(i-k) + \sum_{k=1}^{p} \beta_{vk} x_v^f(i-k) + e_v(i)    (8)

where α_{uk}, α_{vk}, β_{uk}, β_{vk} are the VAR coefficients, e_u, e_v are prediction errors, and p is the order of the VAR model. This model is called the unrestricted or full model. To indicate the causal effect between the two variables, the following restricted model is compared against the full model:

x_u^f(i) = \sum_{k=1}^{p} \alpha_{uk} x_u^f(i-k) + e_{u/v}(i)
x_v^f(i) = \sum_{k=1}^{p} \beta_{vk} x_v^f(i-k) + e_{v/u}(i)    (9)

where e_{u/v}, e_{v/u} are the prediction errors for variables u, v in the restricted models without variables v, u, respectively. If there is causality from variable u to v, there must be information in variable u that helps predict v; otherwise, there is no causal effect in this direction. That is, if the error e_v is greatly reduced compared to e_{v/u}, it can be concluded that variable u has a causal impact on variable v. Following this idea, the Granger causality from variable u to variable v is defined as [17]:

GC_{uv} = \frac{Var(e_{v/u}) - Var(e_v)}{Var(e_v)} \cdot \frac{l - p_1}{p_1 - p_2} \sim F_{p_1 - p_2,\, l - p_1}    (10)
where GC_{uv} is the Granger causality index from x_u^f to x_v^f, Var(e_v) is the variance of the full-model prediction error for x_v^f with p_1 parameters in model (8), Var(e_{v/u}) is the variance of the restricted-model prediction error for x_v^f in model (9) with p_2 parameters, and F_{a,b} is the F-distribution with the corresponding degrees of freedom; the distribution holds under the null hypothesis of no causality from u to v. If the index exceeds the critical value at a given significance level, the null hypothesis that there is no causality from x_u^f to x_v^f is rejected, indicating that x_u^f is causal to x_v^f.

Although each pair x_u^f, x_v^f of faulty candidates can be tested as above, it is more efficient to put all variables together in one full model:

x^f(i) = A_1 x^f(i-1) + \cdots + A_p x^f(i-p) + e(i)    (11)

where x^f = [x_1^f, x_2^f, ..., x_l^f]^T and A_k (k = 1, ..., p) are the coefficient matrices of the VAR model. By removing x_u^f from this model, a restricted model is easily obtained to calculate Var(e_{v/u}) and then the GC index as in (10).

Remark 1. Theoretically, the hypothesis test (10) only indicates whether the corresponding coefficient is zero. Therefore, the pairwise test based on (8) and the global test based on (11) should give the same result. In practice they do not, because causality can also spread among the variables. This issue is discussed in detail in the case study.

Remark 2. Transfer entropy (TE) is another definition of causality that depends only on the data distribution rather than on a regression model. The advantage of TE is that it adapts to linear and nonlinear autoregressive data automatically, treating causality as an inherent property of the data without model assumptions. The initial transfer entropy index is defined as [18]:

TE_{uv} = \int_D p(x_v, x_v^p, x_u^p) \log \frac{p(x_v \mid x_v^p, x_u^p)}{p(x_v \mid x_v^p)} \, dx    (12)

where x_v^p = [x_v(i-1), ..., x_v(i-p)] and x_u^p = [x_u(i-1), ..., x_u(i-p)]. If the data are sampled independently and traverse the whole distribution, the above integral can be approximated by

TE_{uv} \approx \frac{1}{n} \sum_n \log \frac{p(x_v, x_v^p, x_u^p)\, p(x_v^p)}{p(x_v, x_v^p)\, p(x_u^p, x_v^p)}    (13)

where all probabilities can be estimated with kernel density estimation. A TE index for multivariate causality has also been defined as the direct TE [13]. Although TE is appealing in theory, its calculation is a heavy burden when the number of samples is large; details are discussed in the case study.

Remark 3. A Bayesian network (BN) is a probabilistic graphical model widely used for statistical inference, defined as a directed acyclic graph. If it is built from process knowledge or engineering expertise, a BN can indicate causal directions among the variables. However, if it is derived from a static structure,
the directions of the arrows in the graph do not indicate any causality, because only correlation information is used during modeling. A dynamic BN (DBN), which involves lagged variables in the model, can capture causality by exploring the correlations from past values to future values [19]. If a Gaussian distribution is used for every node, a DBN measures causality in a way very similar to Granger causality. Recent work has revealed that Granger causality and DBNs have very similar abilities to uncover causality among variables when enough samples are available [20].

To sum up, Granger causality uncovers causality among variables by examining whether there is a significant increase in prediction error after one variable (say x_u^f) is removed from a complete VAR model used to predict another variable (x_v^f). The idea of transfer entropy is similar to Granger causality, although it uses mutual information to measure the reduction in uncertainty of a random variable (x_v^f) given knowledge of a second variable (x_u^f). A DBN finds causality by estimating the conditional probability of one variable (x_u^f) at time i given another variable (x_v^f) at times (i-1, ..., i-p) with the help of parameter estimation methods.

3.2. Methods for nonstationary faulty processes

It is not appropriate to use a VAR model for nonstationary multivariate series, as the prediction error variance can vary over time. Therefore, the previous framework works only for stationary or quasi-stationary faulty processes. This section proposes a causality analysis index based on a similarity measure. The basic idea is as follows. If the correlations between faulty variables are all linear, the disturbance introduced by a fault keeps its shape along the time axis, up to some stretching. However, under strong nonlinearity, the shape of the disturbance does not stay similar across faulty variables.

For time series that have a similar shape, it is possible to sort them in chronological order; the series with the earliest occurrence is believed to be the root cause of the fault. For time series that do not resemble each other, causality remains difficult to discover. In summary, all series are first clustered into groups, and for each group the earliest series is found by a search algorithm.

In the time series domain, dynamic time warping (DTW) is widely used to measure the similarity between two temporal sequences that may vary in time or speed [24]. Generally, DTW computes an optimal match between two time series subject to some restrictions; the series are warped in the time dimension to measure their similarity. Given two time series x_u^f, x_v^f in the faulty candidate set, the two sequences can be arranged on the sides of a grid, with x_u^f on the bottom and x_v^f on the left-hand side. Inside each cell, a distance measure is calculated for the corresponding elements of the two series. The objective is to find the route from the starting point to the ending point with the minimum sum of distances among all possible routes. The DTW algorithm uses dynamic programming to keep track of the best path at each point [24]:

\gamma(i, j) = d(i, j) + \min\{\gamma(i-1, j-1),\, \gamma(i, j-1),\, \gamma(i-1, j)\}    (14)
where the boundary conditions are γ(1,1) = d(1,1), γ(i,1) = d(i,1) + γ(i-1,1), and γ(1,j) = d(1,j) + γ(1,j-1); d(i,j) = |x_u^f(i) − x_v^f(j)| is the distance between each pair of points in the grid, i.e., point i in x_u^f and point j in x_v^f; and γ(i,j) is the shortest distance from point (1,1) to (i,j). From the values γ(i,j) in all cells, the shortest path can be found efficiently by moving only in the bottom, left, and bottom-left directions.

Since DTW is sensitive to the baseline and scale of the series, the series should be normalized to a standard form before the DTW distance is calculated. First, each time series is shifted so that its minimum is zero. Second, it is scaled by a factor so that its maximum is one. Finally, the DTW value is divided by the larger of the two series lengths. After this normalization, DTW focuses only on the shape of the series, regardless of the baselines, scales, and lengths of the two time series.

With DTW introduced as a distance measure between two time series, a clustering algorithm can be applied with this new measure. k-means clustering is popular for cluster analysis in data mining. Given a set of series samples (x_1^f, x_2^f, ..., x_l^f), k-means clustering divides the l observations into n_c groups {S_1, S_2, ..., S_{n_c}} [25]. Here, the Euclidean distance in k-means clustering is replaced by the DTW distance, which is more effective:

Algorithm 1 (k-means clustering with the DTW distance)
1. Select a proper number of clusters n_c. Initialize the n_c cluster centers randomly, c_v (v = 1, ..., n_c).
2. For each time series x_u^f, calculate DTW(x_u^f, c_v) and assign the sample to the class with the minimum DTW distance, i.e., class(x_u^f) = argmin_v DTW(x_u^f, c_v).
3. When all objects have been assigned, recalculate the center of each cluster, i.e., c_v = (1/n_v) \sum_{x_u^f \in S_v} x_u^f, where n_v is the number of elements in cluster S_v.
4. Go to Step 2 until c_v converges for all v.

After clustering the faulty series, it is clear which series are similar in shape and share a similar trend. To determine the ordering of the members within a group, a causality index is defined as follows (an initial version was presented in the conference paper [26]):

DCI_{uv} = \frac{\min_w D_{uv}(w)}{D_{uv}(0)}    (15)

D_{uv}(w) = DTW([x_u^f(1), ..., x_u^f(n_f - w)],\, [x_v^f(w+1), ..., x_v^f(n_f)])    (16)

where D_{uv}(w) is the DTW distance between x_u and x_v with shift w (0 ≤ w ≤ w_max), and DCI_{uv} indicates whether x_u precedes x_v significantly. If DCI_{uv} < α, there is significant causality from x_u to x_v, where α is a given significance level; if DCI_{uv} ≥ α, the two series are considered causal to each other. Note that once the faulty variables are clustered into groups, the relations among the variables within one group can only be of three types: x_u → x_v, x_u ← x_v, and x_u ↔ x_v. If two series are not similar in shape, it is not appropriate to use this index to indicate causality.
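The DTW recursion (14), the normalization described above, and the shift index (15)-(16) can be sketched as follows. This is an illustrative implementation under those equations, not the authors' code.

```python
import numpy as np

def dtw(a, b):
    """Normalized DTW distance via the dynamic program of Eq. (14).
    Each series is shifted/scaled to [0, 1] so only shape matters,
    and the result is divided by the longer series length."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    a = (a - a.min()) / max(np.ptp(a), 1e-12)
    b = (b - b.min()) / max(np.ptp(b), 1e-12)
    na, nb = len(a), len(b)
    g = np.full((na + 1, nb + 1), np.inf)
    g[0, 0] = 0.0
    for i in range(1, na + 1):
        for j in range(1, nb + 1):
            d = abs(a[i - 1] - b[j - 1])
            g[i, j] = d + min(g[i - 1, j - 1], g[i, j - 1], g[i - 1, j])
    return g[na, nb] / max(na, nb)

def dci(xu, xv, wmax):
    """DTW causality index (Eqs. 15-16): values near 0 suggest xu leads xv.
    Assumes the two series are already known to be similar in shape,
    so that D_uv(0) is nonzero."""
    n = len(xu)
    d0 = dtw(xu, xv)
    dmin = min(dtw(xu[:n - w], xv[w:]) for w in range(wmax + 1))
    return dmin / d0
```

In practice the same dtw function would also replace the Euclidean distance inside Algorithm 1's assignment step.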
The whole procedure of fault detection and causality analysis in this paper is summarized in Figure 1.

[Figure 1: Flowchart of root cause diagnosis for industrial processes. Historical normal data are used to build a DPCA model; the online data stream is monitored for faults; multidirectional RBC identifies a candidate set of faulty variables; if the faulty time series are stationary, Granger causality analysis or transfer entropy yields the causality graph, otherwise wavelet-based denoising, DTW-based clustering, and the DTW-based causality index are applied to the variables within each group to produce the causality graph.]
4. Case study on the Tennessee Eastman process

In this section, the effectiveness of both the introduced and the proposed root cause diagnosis methods is demonstrated in a case study on the Tennessee Eastman process (TEP) [27]. The flowchart of the TE process is shown in Fig. 2. The TEP benchmark provides 12 manipulated variables and 41 measurements for analysis, including 19 quality indices of the product. As the quality indices exhibit evident delays, only the 22 direct process measurements (X1-X22) and 11 manipulated variables (X23-X33) are chosen as X (see Table 1). There are 15 known faults in TEP, comprising seven step faults, five random faults, and three valve or slow-drift faults, as described in [27]. To monitor TEP, 480 normal samples were used to build a DPCA model with s = 2 and A = 42, where A is selected by the cumulative percent variance (CPV) criterion. The Q statistic of the model residual is used to detect faults in the process. With a 99% confidence level, the overall false alarm rate of this model is 8.5%. Among the 21 faulty datasets of TEP, three are studied here: IDV(8), IDV(10), and IDV(1). IDV(8) and IDV(10) are random faults, which lead to quasi-stationary processes, while IDV(1) is a step fault, which produces a nonstationary process after it is introduced. Therefore, causality analysis with Granger causality and transfer entropy is tested on the quasi-stationary faults, and the DTW-based causality index is tested on the nonstationary fault.
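For reference, CPV-based selection of A on the augmented matrix Z can be sketched as below. This is an illustrative helper; the 90% default threshold is a common convention and an assumption here, as the paper does not state the threshold it used.

```python
import numpy as np

def n_components_by_cpv(Z, threshold=0.90):
    """Smallest number of principal components whose cumulative
    percent variance (CPV) reaches the given threshold."""
    Zc = Z - Z.mean(axis=0)
    svals = np.linalg.svd(Zc, compute_uv=False)
    cpv = np.cumsum(svals ** 2) / np.sum(svals ** 2)
    return int(np.searchsorted(cpv, threshold) + 1)
```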
[Figure 2: Tennessee Eastman benchmark process flowchart]
IDV(8) introduces random variations in the total feed of A, B, and C in stream 4, which affects X26 directly, as component B is fed only in stream 4. The variations of X26 spread to other variables because of feedback control, especially the measured variable X4. Fig. 3 depicts the detection results for IDV(8) with the DPCA model, where the overall detection rate is 98.38%. To discover the faulty sensors for this fault, MRBC is then calculated for all samples and variables after the fault. Fig. 4 shows the normalized MRBC for all faulty samples and variables in color. The mean MRBC over all faulty samples is used to select the faulty variables, as shown in Table 2. As a result, eleven highlighted variables are selected as candidates and passed to Granger causality analysis at a 5% significance level. Fig. 5 shows the result of the Granger causality analysis, where red lines indicate a bidirectional causal effect between a pair of nodes, and green lines with arrows indicate unidirectional causal effects. From Fig. 5, it is observed that X4 and X26 act as the root cause of the fault. From the fault description, both X4 (flow sensor in stream 4) and X26 (flow controller in stream 4) are the variables most related to this fault. When the variation of the A/B/C flow is introduced in stream 4 (IDV(8)), the controller X26 is manipulated to compensate for the variation immediately, and the flow sensor X4 measures the abnormality. In addition, X3 (flow sensor in stream 3) and X24 (flow controller in stream 3) have a causal impact on the others; this can be interpreted as the feedback control strategy keeping pace with stream 4. Fig. 6 compares the pair X3, X24 with the pair X4, X26, which indicates
Table 1: Process variables (X1-X22: measured variables; X23-X33: manipulated variables)

X1  A feed (stream 1)                  X18 stripper temperature
X2  D feed (stream 2)                  X19 stripper steam flow
X3  E feed (stream 3)                  X20 compressor work
X4  total feed (stream 4)              X21 reactor cooling water outlet temp
X5  recycle flow (stream 8)            X22 condenser cooling water outlet temp
X6  reactor feed rate (stream 6)       X23 D feed flow (stream 2)
X7  reactor pressure                   X24 E feed flow (stream 3)
X8  reactor level                      X25 A feed flow (stream 1)
X9  reactor temperature                X26 total feed flow (stream 4)
X10 purge rate (stream 9)              X27 compressor recycle valve
X11 separator temperature              X28 purge valve (stream 9)
X12 separator level                    X29 separator pot liquid flow (stream 10)
X13 separator pressure                 X30 stripper liquid product flow
X14 separator underflow (stream 10)    X31 stripper steam valve
X15 stripper level                     X32 reactor cooling water flow
X16 stripper pressure                  X33 condenser cooling water flow
X17 stripper underflow (stream 11)

Table 2: Average MRBC (TR) for each variable (threshold: 1/m = 0.0303)

Var   1       2       3       4       5       6       7
TR    0       0.0223  0.0442  0.0359  0.0065  0.0315  0.0012

Var   8       9       10      11      12      13      14
TR    0.0132  0.0682  0.0089  0.0250  0       0.0010  0.0358

Var   15      16      17      18      19      20      21
TR    0       0.0040  0.0000  0.0161  0.0238  0.0106  0.1790

Var   22      23      24      25      26      27      28
TR    0.0131  0.0651  0.1096  0       0.1001  0.0161  0.0029

Var   29      30      31      32      33
TR    0       0       0.0339  0.1319  0
that the former follows the trend of the latter as a whole. GC analysis cannot capture this causality because the regression models used in GC do not cover such a long time span (around 50 steps).
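For reference, the pairwise Granger test of Eqs. (8)-(10) reduces to an F-test comparing the residual sums of squares of the full and restricted regressions. The helper below is an illustrative sketch (zero-mean series, no intercept term), not the code used for the results reported here.

```python
import numpy as np
from scipy import stats

def granger_f(xu, xv, p):
    """F-test of Granger causality from xu to xv.
    Returns (F statistic, p-value); a small p-value rejects 'no causality'."""
    n = len(xv)
    Y = xv[p:]
    # lagged regressors: p lags of xv, then p lags of xu
    lags = [xv[p - k : n - k] for k in range(1, p + 1)] + \
           [xu[p - k : n - k] for k in range(1, p + 1)]
    Xfull = np.column_stack(lags)
    Xrest = Xfull[:, :p]                       # restricted model: xv lags only
    def rss(Xm):
        beta, *_ = np.linalg.lstsq(Xm, Y, rcond=None)
        r = Y - Xm @ beta
        return float(r @ r)
    rss_f, rss_r = rss(Xfull), rss(Xrest)
    l, p1, p2 = len(Y), 2 * p, p               # sample and parameter counts
    F = (rss_r - rss_f) / rss_f * (l - p1) / (p1 - p2)
    return F, float(stats.f.sf(F, p1 - p2, l - p1))
```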
To show the superiority of multivariate GC analysis, bivariate GC analysis based only on pairs of candidates is shown in Fig. 7, which finds too many causal edges. This is because bivariate GC analysis also detects indirect causality that passes through intermediate nodes, whereas the multivariate test omits indirect causality with a small contribution to the prediction error. Transfer entropy also works for causality analysis and is able to detect nonlinear causality. Fig. 8 shows the causality result of transfer entropy, which plots a similar but more detailed causality graph than Granger causality. This is because transfer entropy analysis uses the probability density of the data, which can reveal nonlinear causality between nodes. However, it suffers from a huge computational load. By theoretical analysis, the computation of TE is O(N² m² p²), where N is the sample size, m is the number of variables, and p is the largest dimension of the joint probability density estimate, whereas the computation of GC is proportional to O(N m). In this case study, GC analysis took 0.3 s while TE took 920 s, approximately 3000 times longer (the experiment was run on an Intel Core i5 CPU at 3.2 GHz with 4 GB RAM). Therefore, TE is hardly usable for data analysis with a large number of samples.
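As a rough illustration of Eq. (13), a histogram (binned) plug-in estimate of TE is sketched below. The paper uses kernel density estimation, which is more accurate but far more expensive; the binning choice and all names here are illustrative assumptions.

```python
import numpy as np
from collections import Counter

def transfer_entropy(xu, xv, p=1, bins=8):
    """Binned plug-in estimate of TE from xu to xv (cf. Eq. 13).
    A crude stand-in for kernel density estimation; carries a
    positive bias for small samples."""
    def disc(x):
        edges = np.histogram_bin_edges(x, bins)[1:-1]   # inner bin edges
        return np.digitize(x, edges)
    du, dv = disc(xu), disc(xv)
    n = len(xv)
    joint = Counter()                         # counts of (xv(i), xv past, xu past)
    for i in range(p, n):
        joint[(dv[i], tuple(dv[i - p:i]), tuple(du[i - p:i]))] += 1
    # marginal counts for each factor in Eq. (13)
    c_vvp, c_vp, c_pp = Counter(), Counter(), Counter()
    for (v, vp, up), c in joint.items():
        c_vvp[(v, vp)] += c; c_vp[vp] += c; c_pp[(vp, up)] += c
    N = n - p
    return sum(c / N * np.log(c * c_vp[vp] / (c_vvp[(v, vp)] * c_pp[(vp, up)]))
               for (v, vp, up), c in joint.items())
```

Even this crude estimator illustrates the cost issue: the joint-count table grows with the number of bins raised to the embedding dimension, which is why kernel-based TE becomes prohibitive for long records.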
[Figure 3: Fault detection for IDV(8) with the Q statistic based on DPCA]
To further demonstrate the effectiveness of Granger causality analysis, another random fault, IDV(10), was tested. IDV(10) introduces a variation into the temperature of the C feed, which significantly affects the temperature of the gas mixture in stream 4. The influence should be observed earliest at the stripper temperature (X18). After the fault is detected (Fig. 9), MRBC is performed to provide faulty candidates (Fig. 10). It is clearly shown that X18 has the largest contribution to RBC during the whole fault process, although
Figure 4: Normalized RBC contribution for IDV(8) with MRBC algorithm
Figure 5: Multivariate Granger causality analysis for IDV(8)
the faulty effect spreads to other nodes. The selected set of candidates is highlighted in Table 3. Fig. 11 shows that X18 and X3 have a dominant causal impact on the others. That X18 is a source is clear from the fault description; that X3 is a source, however, is questionable. After observing
Figure 6: Comparison between X4, X26 and X3, X24 in IDV(8)
Figure 7: Bivariate Granger causality analysis for IDV(8)
Figure 8: Transfer entropy based causality analysis for IDV(8)
the records of X3 and X18 together in Fig. 12, it can be found that although X3 is not affected directly and immediately by X18, there is a slight impact from X18 on X3 in the late stage of the fault process. This latent relation arises from a long-term feedback control effect and is hardly captured by causality analysis.
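The contribution values reported in Table 3 are of the reconstruction-based type. As a minimal sketch (assuming an ordinary, non-dynamic PCA model and the SPE/Q index, with simulated data rather than TE process data), the single-variable RBC can be computed as RBC_i = (xi_i' C x)^2 / (xi_i' C xi_i) with C = I - PP' the residual projection:

```python
import numpy as np

def rbc_spe(x, P):
    """Reconstruction-based contribution of each variable to the SPE (Q)
    index of a PCA model with orthonormal loading matrix P (m x l):
    RBC_i = (xi_i' C x)^2 / (xi_i' C xi_i), where C = I - P P'."""
    C = np.eye(P.shape[0]) - P @ P.T   # projection onto the residual subspace
    return (C @ x) ** 2 / np.diag(C)

# Illustrative model: 2 latent factors driving 5 measured variables
rng = np.random.default_rng(1)
A = np.array([[1.0, 0.8, 0.0, 0.5, 0.3],
              [0.0, 0.5, 1.0, -0.5, 0.7]])
X = rng.standard_normal((500, 2)) @ A + 0.05 * rng.standard_normal((500, 5))
mu = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
P = Vt[:2].T                      # loading matrix of the 2 leading PCs

x_fault = X[0] - mu
x_fault[3] += 5.0                 # simulated sensor bias on variable 4
rbc = rbc_spe(x_fault, P)         # largest entry points at the biased variable
```

Normalizing each sample's RBC vector to unit sum and stacking the vectors over time yields contribution maps of the kind shown in Figs. 4 and 10.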
Table 3: Total RBC ratio for fault IDV(10)

Var Index   1      2      3      4      5      6      7      8      9      10     11
TR          0.0006 0.0323 0.0483 0.0112 0.0057 0.0464 0.0003 0.0039 0.0267 0.0071 0.0058
Var Index   12     13     14     15     16     17     18     19     20     21     22
TR          0      0.0002 0.0712 0      0.0017 0      0.3171 0.0605 0.0020 0.0434 0.0165
Var Index   23     24     25     26     27     28     29     30     31     32     33
TR          0.0144 0.0055 0.0006 0.0315 0.0131 0.0029 0      0      0.0120 0.1948 0
If a fault leads to a non-stationary process, it is no longer suitable to use Granger causality or transfer entropy. IDV(1) is used as an example of a non-stationary faulty process. When IDV(1) is introduced at the 160th sample, a step change is induced in the A/C feed ratio in Stream 4, which causes an increase in the C feed and a decrease in the A feed. As with IDV(8), this fault implies that X4 and X26 are directly affected,
Figure 9: Fault detection for IDV(10) based on DPCA
which then spreads to other parts. The fault detection is shown in Fig. 13. Fig. 14 depicts the result of multi-directional RBC, from which the candidates {23, 21, 32, 19, 9, 31, 26} are picked out. If Granger causality analysis is used as before, the result, shown in Fig. 15, implies X23 as the root cause, which is not correct. Figs. 16 and 17 show all candidates from the 161st to the 350th sample before and after denoising, respectively, which demonstrates that the denoising procedure is beneficial for trend extraction. The denoising is implemented with a Haar wavelet using 5 layers of decomposition and a soft threshold. Two groups are obtained via DTW based k-means clustering: one consists of {X23, X21, X32, X26} and the other contains {X19, X9, X31}, which is convincing from the patterns in Fig. 17. Finally, DCIij is calculated for each group with a significance level α = 0.99, shown in Figs. 18-19, which indicates that X26 is the root cause in Group 1 and X19 is the root cause in Group 2. However, it is difficult to determine the causal relation between the two groups. There may be a long-term causality between them through the feedback control and process dynamics; discovering it is more challenging and will be studied in the future. In this case study, both stationary and non-stationary faulty processes have been used to demonstrate the different schemes of root cause diagnosis. The proper scheme can be selected only when it is known beforehand whether the faulty process is stationary.
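For reference, the pairwise distance underlying the DTW based k-means step can be written in a few lines. This is the textbook dynamic programming recursion [24], shown here with illustrative sine sequences rather than TE process data; the clustering and DCI computations are omitted.

```python
import numpy as np

def dtw_distance(a, b):
    """Classical dynamic time warping distance between 1-D sequences,
    computed by an O(len(a) * len(b)) dynamic program."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of match / insertion / deletion
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# A time-shifted copy stays close under DTW; an inverted trend does not
t = np.linspace(0, 2 * np.pi, 60)
a = np.sin(t)
b = np.sin(t - 0.5)   # same shape, delayed
c = -a                # opposite trend
d_same, d_diff = dtw_distance(a, b), dtw_distance(a, c)
```

k-means over such distances (with suitably averaged centroids) groups candidates whose denoised trends share a common shape, as in the {X23, X21, X32, X26} and {X19, X9, X31} split above.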
Figure 10: Normalized RBC contribution for IDV(10) with MRBC algorithm
However, this information is usually unavailable beforehand. Therefore, it is desirable to distinguish stationary faults from non-stationary faults based on data or process knowledge. One simple possibility is to observe the whole sequence of the fault detection index and make the decision manually. There are also data-driven methods to make the judgement automatically: Li et al. summarized several non-stationarity tests for univariate time series that are applicable to this situation [28].

5. Conclusions

In this paper, a framework has been established for root cause diagnosis of dynamic processes. When a fault is detected by a DPCA model, contribution plot analysis is used for effective identification of faulty variables. However, it is difficult to locate the root cause from a correlation-based model, so two scenarios are considered. For random faults that lead to a quasi-stationary faulty process, Granger causality analysis is introduced to discover the source of the fault; in such cases transfer entropy is also effective, but not an efficient choice. For faults that lead to non-stationary faulty processes, a dynamic time warping based causality index is proposed to uncover the causal relations among the candidates. The case study on the Tennessee Eastman process shows the effectiveness of the proposed framework.
Figure 11: Multivariate Granger causality analysis on faulty candidates for IDV(10)
Acknowledgments

This work was supported by members of the Texas-Wisconsin-California Control Consortium (TWCCC) and the Center for Interactive Smart Oilfield Technologies (CiSoft). It was also supported by the NSFC under grants 61020106003, 61490704, 61333005, 61273173, 61473002, 61473033 and 61673032, and by the SAPI Fundamental Research Funds of Northeastern University of China (2013ZCX02-01).

References

[1] P. Zhou, S.-W. Lu, T. Chai, Data-driven soft-sensor modeling for product quality estimation using case-based reasoning and fuzzy-similarity rough sets, IEEE Transactions on Automation Science and Engineering 11 (4) (2014) 992–1003.
[2] N. Sheng, Q. Liu, S. J. Qin, T. Chai, Comprehensive monitoring of nonlinear processes based on concurrent kernel projection to latent structures, IEEE Transactions on Automation Science and Engineering, doi:10.1109/TASE.2015.2477272.
[3] W. Du, Y. Tian, F. Qian, Monitoring for nonlinear multiple modes process based on LL-SVDD-MRDA, IEEE Transactions on Automation Science and Engineering 11 (4) (2014) 1133–1148.
Figure 12: Comparison between X18 and X3 in IDV(10)
[4] S. J. Qin, Survey on data-driven industrial process monitoring and diagnosis, Annual Reviews in Control 36 (2012) 220–234.
[5] W. Ku, R. Storer, C. Georgakis, Disturbance detection and isolation by dynamic principal component analysis, Chemometrics and Intelligent Laboratory Systems 30 (1) (1995) 179–196.
[6] A. Negiz, A. Çinar, Statistical monitoring of multivariable dynamic processes with state-space models, AIChE Journal 43 (8) (1997) 2002–2020.
[7] W. Li, S. Qin, Consistent dynamic PCA based on errors-in-variables subspace identification, Journal of Process Control 11 (6) (2001) 661–678.
[8] S. Ding, P. Zhang, A. Naik, E. Ding, B. Huang, Subspace method aided data-driven design of fault detection and isolation systems, Journal of Process Control 19 (9) (2009) 1496–1510.
[9] G. Li, S. J. Qin, D. Zhou, A new method of dynamic latent-variable modeling for process monitoring, IEEE Transactions on Industrial Electronics 61 (2014) 6438–6445.
[10] J. F. MacGregor, C. Jaeckle, C. Kiparissides, M. Koutoudi, Process monitoring and diagnosis by multiblock PLS methods, AIChE Journal 40 (5) (1994) 826–838.
[11] C. Alcala, S. J. Qin, Reconstruction-based contribution for process monitoring, Automatica 45 (7) (2009) 1593–1600.
[12] G. Li, C. F. Alcala, S. J. Qin, D. Zhou, Generalized reconstruction-based contributions for output-relevant fault diagnosis with application to the Tennessee Eastman process, IEEE Transactions on Control Systems Technology 19 (5) (2011) 1114–1127.
[13] P. Duan, T. Chen, S. L. Shah, F. Yang, Methods for root cause diagnosis of plant-wide oscillations, AIChE Journal 60 (6) (2014) 2019–2034.
[14] H. Jiang, M. S. Choudhury, S. L. Shah, Detection and diagnosis of plant-wide oscillations from industrial data using the spectral envelope method, Journal of Process Control 17 (2) (2007) 143–155.

Figure 13: Fault detection for IDV(1) with Q statistic based on DPCA

Figure 14: Normalized RBC contribution for IDV(1) with MRBC algorithm
Figure 15: Granger causality analysis for the whole group of variables
Figure 16: Original measurements of selected faulty variables
[15] S. Xu, M. Baldea, T. F. Edgar, W. Wojsznis, T. Blevins, M. Nixon, Root cause diagnosis of plant-wide oscillations based on information transfer in the frequency domain, Industrial & Engineering Chemistry Research 55 (6) (2016) 1623–1629.
[16] H. Jiang, R. Patwardhan, S. L. Shah, Root cause diagnosis of plant-wide oscillations using the concept of adjacency matrix, Journal of Process Control 19 (8) (2009) 1347–1354.
[17] T. Yuan, S. J. Qin, Root cause diagnosis of plant-wide oscillations using Granger causality, Journal of Process Control 24 (2) (2014) 450–459.
[18] M. Bauer, J. W. Cox, M. H. Caveness, J. J. Downs, N. F. Thornhill, Finding the direction of disturbance propagation in a chemical process using transfer entropy, IEEE Transactions on Control Systems Technology 15 (1) (2007) 12–21.
[19] J. Yu, M. M. Rashid, A novel dynamic Bayesian network-based networked process monitoring approach for fault detection, propagation identification, and root cause diagnosis, AIChE Journal 59 (7) (2013) 2348–2365.
Figure 17: Measurements of selected faulty variables after wavelet denoising
Figure 18: Causality analysis for the 1st group of variables
[20] C. Zou, J. Feng, Granger causality vs. dynamic Bayesian network inference: a comparative study, BMC Bioinformatics 10 (1) (2009) 122.
[21] S. Yoon, J. F. MacGregor, Fault diagnosis with multivariate statistical models Part I: using steady state fault signatures, Journal of Process Control 11 (4) (2001) 387–400.
[22] G. Li, S. J. Qin, T. Y. Chai, Multi-directional reconstruction based contributions for root-cause diagnosis of dynamic processes, in: American Control Conference (ACC), IEEE, Oregon, USA, June 2014, pp. 3500–3505.
Figure 19: Causality analysis for the 2nd group of variables
[23] H. White, D. Pettenuzzo, Granger causality, exogeneity, cointegration, and economic policy analysis, Journal of Econometrics 178 (2014) 316–330.
[24] D. J. Berndt, J. Clifford, Using dynamic time warping to find patterns in time series, in: KDD Workshop, Vol. 10, Seattle, WA, 1994, pp. 359–370.
[25] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, A. Y. Wu, An efficient k-means clustering algorithm: analysis and implementation, IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (7) (2002) 881–892.
[26] G. Li, T. Yuan, S. J. Qin, T. Chai, Dynamic time warping based causality analysis for root-cause diagnosis of nonstationary fault processes, IFAC-PapersOnLine 48 (8) (2015) 1288–1293.
[27] L. H. Chiang, E. Russell, R. D. Braatz, Fault Detection and Diagnosis in Industrial Systems, Springer, London, 2001.
[28] G. Li, S. J. Qin, T. Yuan, Nonstationarity and cointegration tests for fault detection of dynamic processes, IFAC Proceedings Volumes 47 (3) (2014) 10616–10621.