A Data-Driven Multidimensional Visualization Technique for Process Fault Detection and Diagnosis

Shriram Gajjar and Ahmet Palazoglu

Department of Chemical Engineering and Materials Science, University of California, Davis, Davis, CA 95616

To appear in: Chemometrics and Intelligent Laboratory Systems
Received 3 February 2016; Revised 9 March 2016; Accepted 23 March 2016
DOI: 10.1016/j.chemolab.2016.03.027

Please cite this article as: Shriram Gajjar, Ahmet Palazoglu, A Data-Driven Multidimensional Visualization Technique for Process Fault Detection and Diagnosis, Chemometrics and Intelligent Laboratory Systems (2016), doi: 10.1016/j.chemolab.2016.03.027

Abstract
A multidimensional visualization technique is described for fault detection and diagnosis of a multivariate process based on principal component analysis (PCA) of historical data. The technique uses a parallel coordinate system to visualize the data, which allows abnormal process events that lead to process faults to be monitored, enables multiple principal components to be visualized effectively, and facilitates the study of how each principal component varies with time. Control limits are established in the principal component and residual spaces for fault detection, and the Random Forests machine learning tool is adopted for fault diagnosis. The key features of the methodology are demonstrated through a study of the benchmark Tennessee Eastman process.
Keywords: Fault detection, Fault diagnosis, Random Forests, Principal component analysis (PCA), Multivariate statistical process monitoring, Multidimensional visualization, Tennessee Eastman process.
INTRODUCTION

As the focus of the process industries moves towards optimizing and troubleshooting unit operations, statistical process monitoring (SPM) methods offer powerful and effective solutions to analyze large volumes of multivariate process data. Such methods provide the means to foresee and prevent faulty process operations that can trigger catastrophic events. Multivariate statistical analysis and SPM techniques are applied in various industrial processes, including chemicals, polymers, semiconductor manufacturing and pharmaceutical processes. The tasks involved often include: (i) fault detection; (ii) fault identification and diagnosis; (iii) fault estimation; and (iv) fault reconstruction [1].

A univariate control chart has proven to be a useful statistical process control (SPC) technique in manufacturing industries [2]; in the presence of unusual variations in the process, the sample average will violate the control limits, and such violations alert the operator to investigate the root cause and take the necessary actions to stabilize the process. Since the data collected in chemical processes generally exhibit high correlations among variables, however, multivariate statistical methods have been proposed for fault detection and diagnosis, e.g., Principal Component Analysis (PCA) [1, 3-8]. Kourti and MacGregor [8] provide early overviews of SPM methods and their use for the statistical process control of both continuous and batch multivariate processes. Wise and Gallagher [9] review approaches that rely on the formation of a mathematical/statistical model based on historical process data using PCA, PLS and their variations. Other similar approaches and applications are also reported in the literature [10-13].

A number of researchers have also suggested techniques to address fault detection. For example, Chiang et al. and Lee et al. observed that data-driven analysis is enhanced by incorporating fundamental causal relationships among variables [14, 15]. Kano et al. [16] put forward an SPM method based on the dissimilarity of process data, exploiting the idea that a change in the operating conditions can be detected by monitoring the distribution of time-series process data, because the distribution reflects the corresponding operating condition. Zhang [17] proposed the use of improved kernel independent component analysis (improved KICA) as an unsupervised technique for fault detection, and compared the fault detection performance of improved KICA, kernel principal component analysis (KPCA) [18] and kernel independent component analysis (KICA) using the Tennessee Eastman benchmark process [19]. Chiang and Braatz [14] proposed a method to compare distributions, where a modified distance (DI), based on the Kullback-Leibler information distance, is used to measure
the similarity of the measured variables between the current operating conditions and the historical operating conditions. They also suggested a modified causal dependency (CD) measure to assess the dependency of two variables. Russell, Chiang and Braatz [20] compared the sensitivity, promptness and robustness of the statistics obtained from PCA, dynamic principal component analysis (DPCA) and canonical variate analysis (CVA), and also demonstrated the use of residual-space canonical variate analysis statistics for detecting faults in a large-scale process.
Once a process fault is detected, it is imperative to determine its root cause. Fault diagnosis is the process of investigating the process variables to pinpoint one or more root causes that may have led to the fault [21]. The contribution plot [22] is a widely used method for fault isolation that does not rely on historical fault data. It depicts the contribution of each process variable to the monitored statistics, which assists in identifying the variables that are most likely to have caused the fault in question. Its effectiveness is limited to simple faults, e.g., sensor and actuator faults [1, 23], and it appears to be less useful for diagnosing complex faults. Dunia and Qin [24] proposed fault identification by reconstruction of each possible fault and finding the maximum reduction in the error (residual) space. Ku, Storer and Georgakis [25] used dynamic principal component analysis (DPCA) to account for auto-correlation in the data and to identify and isolate several disturbances in the Tennessee Eastman process. Raich and Çinar [26] proposed novel distance- and angle-based methods to detect and diagnose process disturbances in the Tennessee Eastman process. Yoon and MacGregor [23] proposed an isolation technique based on a joint plot of the angles between the vectors of the current fault and those of the known faults.

A critical element of fault detection and diagnosis methods is the manner in which the outcomes of the analyses are visualized and presented to decision makers. For real-time process monitoring, a visualization technique such as a control chart is essential in fault detection, i.e., recognizing that a fault has occurred even if the root cause is as yet unassigned. However, for a large-scale process, an operator would have to constantly monitor several process variables using as many univariate control charts. Historically, PCA of a large volume of data is generally presented in Cartesian coordinates or Euclidean space, which can represent at most three principal components (PCs). Furthermore, traditional statistics such as Hotelling's T2 and Q (squared prediction error, or SPE) aggregate the variation and errors in the principal components into a single statistic that is then plotted for monitoring purposes [1]. Several researchers have explored alternative visualization techniques for process
monitoring and knowledge discovery [27, 28]. Fua, Ward and Rundensteiner [29] demonstrated the advantages of interactive visualization of large multivariate data sets based on a number of novel extensions to the parallel coordinates display technique. They developed a multiresolution view of the data via hierarchical clustering and used a variation on parallel coordinates to convey aggregation information for the resulting clusters. Albazzaz, Wang and Marhoon [30] used parallel coordinates to plot the principal component scores for historical process data analysis, identifying the process as faulty if there were clear deviations from the dense data area representing normal operation on the parallel coordinates. Albazzaz and Wang [31] proposed the use of parallel coordinates for multidimensional data visualization of process variables and of independent components obtained from ICA of the process dataset, using the Box-Cox transformation and the percentile approach to define the upper and lower control limits of the independent components. Zhang [17] offered the use of kernel-transformed scores from improved KICA as inputs to a support vector machine (SVM) for fault classification/diagnosis. Dunia, Edgar and Nixon [32] developed a principal component monitoring method named PC2 that incorporates plotting of Hotelling's T2, SPE and score contributions in a parallel coordinate plot for early detection and identification, and demonstrated how the visualization of principal components in parallel coordinates can be used to compare faulty events and determine the frequency of false alarms. Recently, Wang et al. [33] introduced time-explicit Kiviat plots to visualize the multidimensional time series data obtained from process operations. For fault detection, the original variables or their projections on the PCs are plotted as the axes of a Kiviat plot and a centroid is obtained for each sample. An n-dimensional confidence ellipsoid is defined for a Kiviat plot from steady-state operation of the process with good performance. They demonstrated that centroids corresponding to normal operation lie within the confidence ellipsoid, whereas centroids corresponding to data collected when the process deviates from the reference state lie outside the confidence ellipsoid and are thus flagged as faults.

This paper, inspired by the above-mentioned studies, integrates the use of parallel coordinates with a process monitoring methodology that relies on the multidimensional visualization of PCs. Univariate control limits are developed for the PCs in parallel coordinates to carry out the fault detection step, and the signatures of faults in the parallel coordinate space are used to facilitate the fault diagnosis step. The salient features of the proposed method are demonstrated through the benchmark Tennessee Eastman process [34]. It will be shown in
this work that the dynamic score and loading plots can be used to readily identify variables that behave abnormally during the occurrence of different faults, as well as to determine how the faults may propagate in the process.

The paper is organized as follows: the next section briefly introduces some key SPM concepts for the sake of completeness, followed by the presentation of the case study based on the Tennessee Eastman benchmark process simulation. The visualization in parallel coordinates is introduced next, articulating the fault detection, fault diagnosis and fault propagation concepts in the parallel coordinate space of PCs. Subsequently, the results of the case study are discussed with comparisons to existing methods. Finally, the conclusions and directions for future work are presented.
PRELIMINARIES
Principal Component Analysis (PCA) and Fault Detection

PCA is a commonly used multivariate tool for dimensionality reduction that transforms correlated original variables into uncorrelated variables called principal components, while preserving the most important information in the original data set [35]. PCA is concerned with explaining the covariance structure of a set of variables through a few linear combinations of these variables. Mathematically, PCA is the eigenvector decomposition of the covariance or correlation matrix obtained from the data matrix $X \in R^{n \times m}$, which contains $n$ regularly sampled observations of $m$ process variables and is scaled to zero mean and unit variance, into a transformed subspace of reduced dimension. The covariance matrix of $X$ is defined as:

$$\mathrm{cov}(X) = \frac{X^T X}{n-1} \qquad (1)$$

The decomposition is then expressed as follows:

$$X = T P^T = \hat{X} + E \qquad (2)$$

where $T \in R^{n \times m}$ and $P \in R^{m \times m}$ are the score matrix and the loading matrix, respectively. The matrices $\hat{X}$ and $E$ represent the estimation of $X$ and the residual part of the PCA model, respectively, which are defined as follows:

$$\hat{X} = T_l P_l^T = \sum_{i=1}^{l} T_i P_i^T \qquad (3)$$

$$E = T_{m-l} P_{m-l}^T = \sum_{i=l+1}^{m} T_i P_i^T \qquad (4)$$
The principal component projection reduces the original set of variables to $l$ PCs, where $l$ must be equal to or less than the smaller dimension of $X$. The decomposition assumes that the PC loadings are orthonormal ($P_i^T P_j = 0$ for $i \neq j$, $P_i^T P_j = 1$ for $i = j$) and that the PC scores are orthogonal ($T_i^T T_j = 0$ for $i \neq j$). The $P_i$ are the eigenvectors of the covariance or correlation matrix, given by:

$$\mathrm{cov}(X) P_i = \lambda_i P_i \qquad (5)$$

where $\lambda_i$ is the eigenvalue associated with the eigenvector $P_i$. The loadings ($P_i$) contain information on how the variables relate to each other, whereas the score vectors $T_i$ contain information on how the samples relate to each other. Note that for $X$ and any $T_i$, $P_i$ pair, $P_i$ is related to $T_i$ as shown below:

$$X P_i = T_i \qquad (6)$$

This is because the score vector $T_i$ is the linear combination of the original $X$ data defined by $P_i$. The $T_i$, $P_i$ pairs are arranged in descending order according to the associated $\lambda_i$. The $\lambda_i$ are a measure of the amount of variance described by the corresponding $T_i$, $P_i$ pair. Since the pairs are in descending order of $\lambda_i$, the first pair captures the largest amount of information of any pair in the decomposition.
When the PCA model is employed for monitoring process systems, Hotelling's T2 and Q (or SPE) statistics are usually used for fault detection [4]. The T2 index is the squared Mahalanobis distance of the projections in the principal component sub-space (PCS), designed to measure the variability of the mean and covariance within the PCS. The Q statistic is a measure of lack of fit of the PCA model and captures the variance in the error or residual sub-space (RS). In real-time process monitoring, when a new sample vector $x$ (after scaling with the mean and variance obtained from the normal data) becomes available, it is projected with the help of the PCA model onto either the PCS or the RS. In PCA, the major variation of the variables in the PCS is measured by

$$T^2 = \sum_{i=1}^{l} \frac{t_i^2}{\lambda_i}$$

where $t_i$ is the $i$th element of the score vector $t = P_l^T x$.
T2 is calculated as the Mahalanobis distance of a score vector, $t$, in the PCS, and Q is the Euclidean distance of a residual vector in the RS; they are given by

$$T^2 = t^T \Lambda^{-1} t = x^T P_l \Lambda^{-1} P_l^T x \qquad (7)$$

$$Q = e^T e = x^T (I - P_l P_l^T) x \qquad (8)$$

where $\Lambda = \mathrm{diag}\{\lambda_1, \lambda_2, \ldots, \lambda_l\}$, $e$ is the residual vector, and $I \in R^{m \times m}$ denotes the identity matrix. The T2 statistic is defined to measure the variation in the latent variable space [4]. The approximate control limits on the T2 and Q statistics, with a confidence level $\alpha$, are determined from the normal operating data in several ways by applying probability distribution assumptions [1, 8]:

$$T^2 = t^T S^{-1} t \leq \frac{(n-1)\,l}{(n-l)} F_{\alpha}(l, n-l) \qquad (9)$$

where $S = T^T T / (n-1)$ is the covariance matrix of the score matrix $T$ of the training dataset, and $F_{\alpha}(l, n-l)$ is the upper $\alpha$ percentile of the F-distribution with degrees of freedom $l$ and $n-l$. The Q statistic is a measure of variation that breaks the normal process correlation and is given as:

$$Q = e^T e = \sum_{j=1}^{m} e_j^2 \leq Q_{\alpha} \qquad (10)$$

where $e_i = x_i - T_{l,i} P_l^T$ is the residual vector of sample $i$ and $Q_{\alpha}$ is the corresponding control limit, which can be calculated by using a weighted $\chi^2$-distribution [36]:

$$Q_{\alpha} = g \chi^2_{h,\alpha}; \quad g = \frac{v}{2m}, \quad h = \frac{2m^2}{v} \qquad (11)$$

where $m$ and $v$ are the estimated mean and variance of the squared prediction error from the modeling data, respectively. This approximating distribution is found to work well even in cases where the errors are not normally distributed [37]. In summary, the T2 statistic measures the significant variability within the PCs, while Q indicates how well the reduced-dimensional PCA model can describe the significant process variation. In most cases, Q is considered a complementary statistic to T2, since the value of Q can be significantly affected by process noise. The process condition is considered faulty if the statistics of a new observation exceed the control limits.
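For concreteness, the following sketch (our own illustration, not code from the original study) assembles these quantities with NumPy/SciPy: it scales the normal operating data, performs the eigen-decomposition of Eqs. (1) and (5), retains $l$ principal components by a cumulative percent variance threshold (Eq. (12) below), and evaluates the T2 and Q statistics of Eqs. (7)-(8) together with the limits of Eqs. (9) and (11). Names such as `fit_pca_monitor` and `X_normal` are illustrative.

```python
import numpy as np
from scipy import stats

def fit_pca_monitor(X_normal, cpv_target=0.85, alpha=0.95):
    """Build a PCA monitoring model (loadings, eigenvalues, T2/Q limits) from normal data."""
    n, m = X_normal.shape
    mu, sd = X_normal.mean(axis=0), X_normal.std(axis=0, ddof=1)
    Xs = (X_normal - mu) / sd                          # scale to zero mean, unit variance
    cov = (Xs.T @ Xs) / (n - 1)                        # covariance matrix, Eq. (1)
    lam, P = np.linalg.eigh(cov)                       # eigen-decomposition, Eq. (5)
    lam, P = lam[::-1], P[:, ::-1]                     # sort eigenpairs in descending order
    l = int(np.searchsorted(np.cumsum(lam) / lam.sum(), cpv_target) + 1)  # CPV threshold
    T2_lim = (n - 1) * l / (n - l) * stats.f.ppf(alpha, l, n - l)         # Eq. (9)
    E = Xs - Xs @ P[:, :l] @ P[:, :l].T                # residuals of the training data
    q = np.sum(E ** 2, axis=1)                         # Q value of each training sample
    g, h = q.var(ddof=1) / (2 * q.mean()), 2 * q.mean() ** 2 / q.var(ddof=1)
    Q_lim = g * stats.chi2.ppf(alpha, h)               # weighted chi-square limit, Eq. (11)
    return dict(mu=mu, sd=sd, P=P, lam=lam, l=l, T2_lim=T2_lim, Q_lim=Q_lim)

def t2_q(model, x):
    """T2 (Eq. 7) and Q (Eq. 8) of a single new sample vector x."""
    xs = (x - model["mu"]) / model["sd"]               # scale with the offline statistics
    Pl, lam_l = model["P"][:, :model["l"]], model["lam"][:model["l"]]
    t = Pl.T @ xs                                      # scores on the retained PCs
    e = xs - Pl @ t                                    # residual vector in the RS
    return np.sum(t ** 2 / lam_l), e @ e
```

Applied to the 33 scaled TEP monitoring variables used later in the paper, this kind of procedure retains 14 PCs at the 85% CPV threshold.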
As mentioned earlier, the dimension of the data is initially equal to the number of principal components. The number of principal components is often reduced to a set of size $l$, where $1 \leq l \leq m$. The optimal number of components is chosen such that the model captures the variation in the dataset and not the noise. The trade-off is that the closer the value of $l$ is to $m$, the better the PCA model fits the data, since more information is retained, while the closer $l$ is to 1, the simpler the model. However, to keep the analysis manageable, only the components capturing most of the variation in the dataset should be retained. Several techniques to determine $l$, i.e., the number of "meaningful" components, have been proposed in the literature. These techniques are both heuristic and statistics-based; some can be computed easily while others are computationally intensive: the Kaiser-Guttman test, the broken stick model, the log-eigenvalue (LEV) diagram, Velicer's partial correlation procedure, Cattell's SCREE test, cross-validation, bootstrapping techniques, the cumulative percentage of total variance, and Bartlett's test for equality of eigenvalues [35, 38]. An overview of these techniques can be found in Cangelosi and Goriely [39]. Tamura and Tsujita [7] studied the relationship between the sensitivity of fault detection and the number of PCs retained, and concluded that the number of PCs retained greatly affects the ability to detect faults. Thus, selecting $l$ is one of the most critical steps for fault detection using PCA.

The cumulative percent variation is a measure of the percent variation captured by the first $l$ ordered PCs, given by:

$$\mathrm{CPV}(l) = \frac{\sum_{i=1}^{l} \lambda_i}{\sum_{i=1}^{m} \lambda_i} \times 100\% \qquad (12)$$

where CPV denotes the cumulative percent of variance accounted for. The amount of variance is frequently pre-defined as 70, 80 or 85%. The first $l$ PCs are retained when CPV reaches a predetermined limit $\eta$, and we suggest that 70% < $\eta$ < 90%. In this work, we use $\eta$ = 85%.

TENNESSEE EASTMAN PROCESS (TEP)

The Tennessee Eastman Process (TEP) is a well-known benchmark for testing the performance of various fault detection methods [40-43]. The TEP simulator has been widely used by the process monitoring community as a source of data for comparing control strategies and various other approaches [14]. A flowchart of the TEP is shown schematically in Fig. 1.
There are five major unit operations in the process: an exothermic two-phase reactor, a condenser, a recycle compressor, a flash separator, and a reboiled stripper. Four reactants A, C, D, E and the inert B are fed to the reactor, where the products G and H are formed and a by-product F is also produced. The reactions in the reactor are irreversible, exothermic, and approximately first-order with respect to the reactant concentrations.

In the TEP, reactants A, D, E and the recycle from the compressor are fed to the reactor (stream 6). The reactor product stream (stream 7) is cooled through a condenser and then fed to a vapor-liquid separator. The vapor from the separator is recycled to the reactor feed through the compressor (stream 8). A fraction of the recycle stream (stream 9) is purged to keep the inerts and by-products from accumulating in the process. The liquid from the separator (stream 10) is pumped to the stripper. Stream 4 is used to strip the remaining reactants from the separator liquid and is combined with the recycle stream at the compressor outlet. The products G and H exiting the base of the stripper are sent to a downstream process that is not included in this simulation.

The TEP has 22 continuous process measurements, 12 manipulated variables, and 19 composition measurements that are sampled less frequently. Details of the process description are given in [5]. A total of 33 variables are used for monitoring in this study, as shown in Table 1. All composition measurements are excluded since they are hard to measure online in practice; the sampling rate of the composition variables can also differ from that of the operating variables in an operating unit. In the TEP, Faults 1-7 are deterministic faults whereas Faults 8-12 are stochastic faults. A detailed description of the programmed faults (Faults 1-21) is listed in Table 2.
The data are generated at a sampling interval of 3 min and can be downloaded from http://web.mit.edu/braatzgroup/links.html. For comparison, 960 normal samples are used to build the model. The simulation time for each run is 48 hr. The simulation is started with no faults, and the faults are introduced 8 simulation hours into the run. The total number of observations generated for each run is n = 960. Each fault data set contains 960 samples, with the fault introduced after sample 160.
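For readers reproducing the study, a run can be loaded in a few lines; the file-naming convention below (dXX_te.dat for the 960-sample test run of fault XX) reflects the commonly distributed version of this archive and is an assumption on our part rather than something stated in the text.

```python
import numpy as np

# Assumed naming for the downloaded archive: dXX_te.dat holds the 960-sample test run
# for fault XX (00 = normal operation); each row is one 3-min sample of the recorded variables.
fault_id = 1
data = np.loadtxt(f"d{fault_id:02d}_te.dat")

normal_part, faulty_part = data[:160], data[160:]   # the fault enters at sample 160
print(data.shape, normal_part.shape, faulty_part.shape)
```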
Figure 1. Flow sheet of Tennessee Eastman benchmark process.
Table 1: Monitoring variables in the TEP.

No. | Process variable              | No. | Process variable                           | No. | Process variable
1   | A feed                        | 12  | Product separator level                    | 23  | D feed flow valve
2   | D feed                        | 13  | Product separator pressure                 | 24  | E feed flow valve
3   | E feed                        | 14  | Product separator underflow                | 25  | A feed flow valve
4   | Total feed                    | 15  | Stripper level                             | 26  | Total feed flow valve
5   | Recycle flow                  | 16  | Stripper pressure                          | 27  | Compressor recycle valve
6   | Reactor feed rate             | 17  | Stripper underflow                         | 28  | Purge valve
7   | Reactor pressure              | 18  | Stripper temperature                       | 29  | Separator pot liquid flow valve
8   | Reactor level                 | 19  | Stripper steam flow                        | 30  | Stripper liquid product flow valve
9   | Reactor temperature           | 20  | Compressor work                            | 31  | Stripper steam valve
10  | Purge rate                    | 21  | Reactor cooling water outlet temperature   | 32  | Reactor cooling water flow
11  | Product separator temperature | 22  | Separator cooling water outlet temperature | 33  | Condenser cooling water flow
Table 2: Process faults for the TEP.

No. | Description                                                    | Type
1   | A/C feed ratio, B composition constant (stream 4)              | Step
2   | B composition, A/C ratio constant (stream 4)                   | Step
3   | D feed temperature (stream 2)                                  | Step
4   | Reactor cooling water inlet temperature                        | Step
5   | Condenser cooling water inlet temperature                      | Step
6   | A feed loss (stream 1)                                         | Step
7   | C header pressure loss - reduced availability (stream 4)       | Step
8   | A, B, C feed composition (stream 4)                            | Random variation
9   | D feed temperature (stream 2)                                  | Random variation
10  | C feed temperature (stream 4)                                  | Random variation
11  | Reactor cooling water inlet temperature                        | Random variation
12  | Condenser cooling water inlet temperature                      | Random variation
13  | Reaction kinetics                                              | Slow drift
14  | Reactor cooling water valve                                    | Sticking
15  | Condenser cooling water valve                                  | Sticking
16  | Unknown                                                        | Unknown
17  | Unknown                                                        | Unknown
18  | Unknown                                                        | Unknown
19  | Unknown                                                        | Unknown
20  | Unknown                                                        | Unknown
21  | The valve for stream 4 was fixed at the steady-state position  | Constant position
To determine the number of retained PCs, Stevens [44] investigated the accuracy of the Kaiser criterion and recommends its use when a dataset with a small to moderate number of variables (fewer than 30), or with at least 250 observations, is being analyzed and the variable communalities are high. Applying this criterion, one would retain 14 components for the dataset obtained from the TEP, as shown in Fig. 2.
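A minimal sketch of the Kaiser criterion as used here, retaining the components whose eigenvalues of the correlation matrix exceed one (the matrix `X` and the function name are illustrative, not taken from the paper), is:

```python
import numpy as np

def kaiser_num_components(X):
    """Kaiser criterion: count the eigenvalues of the correlation matrix greater than 1."""
    eigvals = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))
    return int(np.sum(eigvals > 1.0))
```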
Figure 2. Scree plot for Tennessee Eastman benchmark process.
VISUALIZATION, FAULT DETECTION AND DIAGNOSIS USING PCA
Visualization
Inselberg first presented the method of parallel coordinates in 1985 as a device for computational geometry. Subsequent work was carried out by Inselberg, Reif and Chomut [45], Inselberg [46], Wegman [47] and Inselberg [48]. With a parallel coordinate representation of data, one can diagnose one-dimensional features such as marginal densities, two-dimensional features such as correlations and non-linear structures, and multi-dimensional features such as clustering, hyper-planes and modes. In addition to these diagnostic abilities, Miller et al. [49] discussed the interpretation of parallel coordinate density plots. A two-dimensional scatter diagram of a completely uncorrelated pair of variables forms a circumscribing circle; in a parallel coordinate set-up, ellipses map into hyperbolas and the data thus approximate a figure with a hyperbolic envelope. Wegman [47] stated that such a hyperbolic envelope deepens as the correlation between the pair of observations approaches -1, and in that limit one observes a pencil of lines, or a crossover effect. On the other hand, if the correlation approaches +1, the hyperbolic envelope widens, with fewer and fewer crossovers, and in that limit the lines become parallel.
A visualization method based on all process variables can be cumbersome if the number of variables is high and, due to its univariate nature, it ignores the correlations among process variables needed to correctly pinpoint the onset of a fault. Furthermore, detecting faults based on the control limits of individual variables in such a graph is unrealistic and ineffective. Such drawbacks can be addressed by the use of PCA, which finds the strongest orthogonal directions of variation in the data, beginning with the largest direction of variation.
Figure 3. Process monitoring visualization in parallel coordinates for the 33 process variables in fault #1 operating data. Each line represents a time instance (or a sample). Note that not all sample points are shown on this and subsequent plots to avoid clutter.
Albazzaz, Wang and Marhoon [30] described the use of parallel coordinates as a technique to visualize multidimensional data using two-dimensional presentations and to allow the identification of clusters and outliers for detecting faulty events. They observed that, using parallel coordinates, faulty events can be clearly identified by their clear deviation from the dense data area representing normal operation, as shown in Fig. 3. Furthermore, Wegman [47] demonstrated the easy diagnosis of correlation structure and clustering in a large data set using parallel coordinates. As shown in Fig. 3, the parallel coordinates yield a pairwise comparison of variable 1 with variable 2, variable 2 with variable 3, and so on. However, the pairwise comparison of variable 1 with variable 3, variable 2 with variable 5, and so on cannot be made easily, since these variables do not lie on adjacent axes. In this case, one can use PCA and arrange the resulting PCs in decreasing order of variance captured, from PC 1 to PC n. The number of permutations required to arrange every axis adjacent to every other axis is (n+1)/2 in an n-dimensional data set [47]. However, since the PCs are uncorrelated, no such permutations are required or would be meaningful.
Figure 4. Process monitoring visualization in parallel coordinates for the 14 principal components obtained from fault #1 operating data.
Traditionally, the PCs with the largest variation are represented in Euclidean space, but such a visual representation can only be made for at most three PC directions and presents a limited view of the correlation structure expressed by all retained PCs. This limits our ability to interpret the patterns and relationships between the uncorrelated data and the process variables. Unlike the traditional representation, our proposed approach is to use parallel coordinates to plot all the retained PCs. For the data presented in Fig. 3, a PCA model is developed with 14 PCs capturing 85% of the variance, and Fig. 4 shows the retained PCs in parallel coordinates, displaying how the faulty data in Fig. 3 are projected onto the parallel PCA coordinates. As seen in Fig. 4, the PC scores from faulty process operation are easily identified. Figure 5 presents a detailed view of the data just around the time when the fault is introduced. The transition from normal to faulty operation is clearly observed.
Figure 5. Fault #1, transition of process from normal to faulty, plotted on parallel coordinates.
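Such parallel-coordinate views of the retained scores can be generated directly from the score matrix; the sketch below uses pandas/matplotlib purely as an illustration (the paper does not prescribe a plotting tool), with `scores` holding the retained PC scores and `labels` marking each sample as normal or faulty.

```python
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

def plot_scores_parallel(scores, labels):
    """Draw each sample's retained PC scores as one polyline across parallel axes."""
    df = pd.DataFrame(scores, columns=[f"PC{i+1}" for i in range(scores.shape[1])])
    df["state"] = labels                                  # e.g. "normal" / "faulty"
    parallel_coordinates(df, class_column="state",
                         color=("tab:blue", "tab:red"), alpha=0.3)
    plt.ylabel("PC score")
    plt.show()
```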
Fault Detection

We note that T2 is calculated as the Mahalanobis distance of a score vector, $t$, in the PCS, and that $\lambda_i$ is the variance of the $i$th PC obtained from the offline modelling. T2 is the summary over all components for a given sample (row) $i$, given by:

$$T^2 = \sum_{i=1}^{m} \frac{t_i^2}{\lambda_i} = \sum_{i=1}^{l} \frac{t_i^2}{\lambda_i} + \sum_{i=l+1}^{m} \frac{t_i^2}{\lambda_i} \qquad (13)$$

where $t_i$ is the $i$th element of $P^T x$.
By scaling each $t_i^2$ by the reciprocal of its variance, each PC term plays an equal role in the computation of T2, irrespective of the amount of variance it explains in the $X$ matrix. This illustrates some of the problems with using T2 when the variables are highly correlated and the covariance matrix is highly ill-conditioned [8]. When the number of variables $m$ is large, the covariance matrix is often singular and cannot be inverted, nor can all the eigenvectors be obtained. Even if it can be inverted, the last PCs ($i = l+1, \ldots, m$) explain little of the variance of $X$ and generally represent random noise. If the variances $\lambda_{l+1}, \ldots, \lambda_m$ are incorporated, slight deviations in these $t_i$'s, which have almost no effect on $X$, will lead to an out-of-control signal in T2. Therefore, T2 based on the first $l$ PCs provides a test for deviations in the process variables that are of greatest importance to the variance of $X$. Furthermore, if the scores are not much different, $\lambda_i$ remains essentially the same in both cases, and the T2 limit will not be violated. If the real-time data have a variance similar to that captured by the $\lambda_i$ obtained from the offline model, T2 will fail to detect a small change in a score, since it is a summation of the variation across all principal components. Thus, we propose to look at the variance of the scores within each principal component independently. Equation (13) can then be rephrased as follows:
$$T^2 = \sum_{i=1}^{l} \frac{t_i^2}{\lambda_i} = \sum_{i=1}^{l} \frac{t_i^2}{\sigma_i^2}, \quad \text{where } \sigma_i = \sqrt{\lambda_i} \qquad (14)$$

where $\sigma_i$ is the standard deviation of the scores in column $i$. Also, $\sigma_1 \geq \sigma_2 \geq \sigma_3 \geq \ldots \geq \sigma_l$. The resulting limit is henceforth called the principal component control limit (PCCL). Similarly, using the same approach, control limits for the residual space are obtained and called the residual space control limit (RSCL).

The Central Limit Theorem (CLT) states that, no matter what distribution $X$ may have in the population, if the sample size is large enough, the sampling distribution of $\bar{X}$ will be approximately Gaussian [50]. In this study, the sample size is 960 and it is therefore
assumed that the variables follow a Gaussian distribution at normal operating (NO) conditions. It is due to this assumption that the upper and lower control limits (UCL and LCL) are defined as:

$$\mathrm{UCL} = \mu_i + (z \times \sigma_i) \qquad (15)$$

$$\mathrm{LCL} = \mu_i - (z \times \sigma_i) \qquad (16)$$

where $\mu_i$ is the mean of the $i$th PC, $z$ = 1.96 for a 95% confidence interval (CI), and $\sigma_i$ is the standard deviation of the scores (and of the errors) on the $i$th PC. The means of the scores and residuals of each PC obtained from the normal operating data are zero. Thus, the upper and lower control limits (UCL and LCL) on each principal component (PCCL) and on the error/residual space (RSCL) can be defined as:

$$\mathrm{PCCL_{UCL}} = +(z \times \sigma_i); \quad i = 1, 2, \ldots, l \qquad (17)$$

$$\mathrm{PCCL_{LCL}} = -(z \times \sigma_i); \quad i = 1, 2, \ldots, l \qquad (18)$$

$$\mathrm{RSCL_{UCL}} = +(z \times \sigma_{e_j}); \quad j = 1, 2, \ldots, m \qquad (19)$$

$$\mathrm{RSCL_{LCL}} = -(z \times \sigma_{e_j}); \quad j = 1, 2, \ldots, m \qquad (20)$$

where $z$ = 1.96 for a 95% CI, and $\sigma_i$ and $\sigma_{e_j}$ are the standard deviation of the scores on the $i$th PC and of the error in the $j$th variable, respectively.
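In practice these limits reduce to a few lines of array arithmetic. The sketch below (our own illustration, with z = 1.96 as above) computes the PCCL and RSCL from the scores and residuals of the scaled normal operating data, assuming the loading matrix `P` and the number of retained components `l` come from the offline PCA model.

```python
import numpy as np

def control_limits(X_normal_scaled, P, l, z=1.96):
    """PCCL and RSCL: +/- z standard deviations of the normal-operation scores on each
    retained PC and of each variable's residual (the means are zero by construction)."""
    T = X_normal_scaled @ P[:, :l]              # scores on the l retained PCs
    E = X_normal_scaled - T @ P[:, :l].T        # residual part of the PCA model
    pccl = z * T.std(axis=0, ddof=1)            # one limit per retained PC, Eqs. (17)-(18)
    rscl = z * E.std(axis=0, ddof=1)            # one limit per variable, Eqs. (19)-(20)
    return pccl, rscl
```

The run rule discussed next can then be applied to the Boolean conditions |t_i| > PCCL_i and |e_j| > RSCL_j evaluated for each new sample.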
These limits allow us to treat each PC as a univariate monitoring chart following Shewhart's original construct [11]. Along with the control limits, one needs to use run rules to qualify a sample as faulty. These run rules are defined to control the false alarm rate associated with the algorithm. In this work, a fault is declared when two successive samples exceed the 2-sigma warning limits for the same PC. This run rule yields a ~5% false alarm rate for most of the normal operating data in the faulty datasets generated from the TEP simulations.

Fault Diagnosis

Several researchers have proposed data-based fault diagnosis methods built on historical process data and process behavior. Some well-known models based on process knowledge are the signed directed graph (SDG), bond graphs, temporal causal graphs, structural equation models (SEM), rule-based models and ontological models [51]. In the case when process knowledge and behavior are not known but process data are available, pairwise causality
capture methods have been developed to identify cause and effect. In a real process, which is often multivariate in nature, a topology should be constructed based on the pairwise analysis results. In process data analysis, it is known that a cause will precede its effect, and this is first tested by cross-correlation with an assumed lag or by fitting the input-output variables to dynamic models. Granger causality [51] and frequency-domain methods are used to obtain the relationships between process variables. Information-transfer based methods, known as transfer entropy, are model-free methods that measure the information transfer from A to B through the uncertainty reduction, assuming predictability. Causal relationships also show probabilistic properties; thus Bayesian nets are used to describe these relationships. The Bayesian net is a probabilistic graphical method in which nodes denote fault modes as well as process variables and arcs denote conditional probabilities [52]. Some researchers have also worked on data-driven techniques such as qualitative trend analysis (QTA) [53]. QTA is a process-history-based, data-driven technique that works by representing measured signals using a sequence of seven primitives (basic shapes), or trends. Dong, Beike, Xin and Chongguang [54] proposed the use of SDG in combination with QTA for fault diagnosis. Maurya, Rengaswamy and Venkatasubramanian [53] proposed applying QTA to the PCs instead of to the original sensors and showed that the computation time was substantially reduced. Recently, Li and Xiao [55] proposed a signal geometry matching technique for fault diagnosis. They developed a fault diagnosis method by pattern matching using a one-dimensional adaptive rank-order morphological filter (PCA1DARMF) to process unrecognized signals under the supervision of different standard signals. Li and Xiao [55] created an offline database of supervisory signal templates for PCs extracted from the TEP; thus, there exists a signal template for each fault as well as for normal process operation. For fault diagnosis, the PCA1DARMF algorithm matches an unrecognized signal pattern obtained from a new sample against every template to yield the final classification result according to each PC.

Recently, considerable work has been carried out using ensemble learning algorithms to improve classification accuracy by voting over the predictions of several classifiers. The two main methods are boosting [56] and bagging [57]. In boosting, successive trees give extra weight to points incorrectly predicted by earlier predictors, and in the end a weighted vote is taken for prediction. In bagging, successive trees do not depend on earlier trees; each is independently constructed using a bootstrap sample of the data set, and in the end a simple majority vote is taken for prediction. In such a manner, the combination of different learning
models is used to improve the classification accuracy. Breiman [58] proposed Random Forests, which uses a random selection of features at each node to determine the split. Random Forests builds a large collection of de-correlated decision trees. The algorithm has desirable characteristics: its accuracy is as good as or better than Adaboost [56], it is relatively robust to outliers and noise, it is faster than bagging or boosting, and it provides useful internal estimates of error, strength, correlation and variable importance. Thus, the Random Forests method is an effective prediction tool that performs better than many other classifiers, including discriminant analysis, support vector machines and neural networks, and is robust against overfitting [58].

This paper focuses on the use of the Random Forests algorithm for machine learning and prediction of faults, i.e., fault diagnosis. The Random Forests algorithm offers several advantages over others: it is fast to build, requires minimal user input, has a training step that is completely parallelizable, does not require any preprocessing of the data, is resistant to outliers, and can handle missing values in the training dataset. The Random Forests algorithm is described below [58]:
1. For b = 1 to B, where B is the user-defined total number of decision trees to be created:
   a. Draw a bootstrap sample Z* of size N from the training data.
   b. Grow a random-forest tree $T_b$ on the bootstrapped data by recursively repeating the following steps for each terminal node of the tree, until the minimum node size $n_{min}$ is reached:
      i. Select p variables at random from the m variables.
      ii. Pick the best variable/split-point among the p variables.
      iii. Split the node into two daughter nodes.
2. Output the ensemble of trees $\{T_b\}_1^B$.
3. To make a classification prediction at a new point x, let $\hat{C}_b(x)$ be the class prediction of the bth random-forest tree. Then $\hat{C}_{rf}(x) = \text{majority vote } \{\hat{C}_b(x)\}_1^B$.
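As an illustration of how such an ensemble can be trained in practice, the sketch below uses scikit-learn's RandomForestClassifier; the library choice, function name and parameter settings (other than the 1500 trees and the 75/25 split mentioned in the text) are our assumptions rather than the authors' implementation.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def train_fault_classifier(features, fault_labels, n_trees=1500, seed=0):
    """Fit a Random Forest fault classifier; features = [PC scores | residuals]."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, fault_labels, test_size=0.25, random_state=seed, stratify=fault_labels)
    clf = RandomForestClassifier(
        n_estimators=n_trees,        # B trees, each grown on a bootstrap sample
        max_features="sqrt",         # p variables tried at each split
        oob_score=True,              # internal out-of-bag error estimate
        random_state=seed)
    clf.fit(X_tr, y_tr)
    print("OOB accuracy:", clf.oob_score_, "| test accuracy:", clf.score(X_te, y_te))
    return clf
```

The `oob_score` flag corresponds to the internal out-of-bag error estimate discussed below.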
As shown in Fig. 6, one initially creates a training data set with the fault numbers as the response variable and the corresponding scores and residuals as predictors. For the TEP problem, different training datasets were created, as shown in Table 6, each consisting of 700 samples from each fault and/or from normal operating data, of which 75% of the samples were randomly chosen for training and the remaining 25% were used for testing. The Random Forests
algorithm randomly selects a small subset of the training data. From r randomly selected subsets of data, r decision trees are obtained, and p variables are randomly sampled as candidates at each split in each tree. Thus, all trees are trained with the same parameters but on different training sets. The Random Forests algorithm uses all r decision trees to rank the classification; votes are obtained from all trees and a final prediction class is then obtained. In random forests there is no need for accuracy estimation procedures such as cross-validation or bootstrap, or for a separate test set to obtain an estimate of the training error. The error is estimated internally during the training: when the training set for the current tree is drawn by sampling with replacement, some vectors are left out (the out-of-bag, or oob, data). The size of the oob data is about N/3. Moreover, it is noteworthy that Random Forests does not overfit, which implies that the user can run as many trees as desired.
Figure 6. Simple schematic for Random Forests algorithm.
The trained model is then used to identify or diagnose the fault from the scores obtained for each sample. The model is trained and tested on different datasets.

PROCESS MONITORING PROCEDURE

A real-time monitoring scheme is implemented using the two-step algorithm shown in Fig. 7. The complete monitoring procedure is summarized as follows.

Offline modeling:
(1) Sufficient data are acquired while the process is operated under normal conditions. Each column (variable) of the data matrix is normalized, i.e., scaled to zero mean and unit variance.
(2) PCA is applied to the data matrix, and the loading matrix $P \in R^{m \times m}$ and the eigenvalue matrix $\Lambda = \mathrm{diag}\{\lambda_1, \lambda_2, \ldots, \lambda_m\}$ are obtained.
(3) The threshold of cumulative percent variation, $\eta$, is specified.
(4) The PCCL and RSCL are computed.
(5) The Random Forests machine learning algorithm is trained for fault diagnosis on a dataset that contains an equal number of samples from normal operation and from each fault.
Figure 7. Overview of monitoring procedures.
Real-time monitoring:
When real-time observations become available, the monitoring task is initiated.
(1) The data vector x representing the current operating conditions is updated at each time step and is scaled with the mean and variance obtained from the offline model. It is then projected onto P to obtain the new score vector Tnew.
(2) The first l PCs that capture the dominant variability of the current score vector Tnew are retained.
(3) If the real-time sample is within the control limits, the process is judged to be normal and the time window is moved to the next step. Conversely, if the sample being evaluated violates any of the control limits, the newly added data point may or may not be considered faulty, as per the run rules.
(4) If the sample is flagged as faulty, the fault diagnosis step is initiated to identify the fault.
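A compact sketch of this two-step loop, under the assumptions of the earlier snippets (`model` holds the offline PCA quantities, `pccl`/`rscl` the control limits, and `clf` the trained classifier; `stream` is any iterable of raw sample vectors), might look as follows:

```python
import numpy as np

def monitor(stream, model, pccl, rscl, clf):
    """Flag a fault when the same score or residual limit is violated by two successive
    samples, then pass the violating sample's scores and residuals to the classifier."""
    P, l = model["P"], model["l"]
    prev = None
    for x in stream:
        xs = (x - model["mu"]) / model["sd"]          # scale with the offline statistics
        t = P[:, :l].T @ xs                           # scores on the retained PCs
        e = xs - P[:, :l] @ t                         # residual part
        viol = np.concatenate([np.abs(t) > pccl, np.abs(e) > rscl])
        if prev is not None and np.any(viol & prev):  # run rule: same limit, twice in a row
            print("fault detected; diagnosed as fault",
                  clf.predict(np.hstack([t, e]).reshape(1, -1))[0])
        prev = viol
```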
RESULTS AND DISCUSSION
Figure 8 demonstrates the application of traditional SPM tools such as the Hotelling's T2 and Q statistics. The process condition is considered faulty (samples 161-960) if the statistics of a new observation exceed the control limits. Although both the T2 and Q statistics are used for process monitoring, the Q statistic measures variability away from the normal process correlation, which is often indicative of a faulty situation. The large normal process variance is typically captured by the PCs, while the relatively small variance caused by noise is captured by the residuals. Moreover, if a given sample violates the T2 limit but does not exceed the Q control limit, it can be either a fault or simply a change in the operating condition whereby the system has moved, or is moving, to a new steady state. Figure 8 indicates that the traditional statistics capture this faulty behavior quite well.
Figure 8. Statistical Process Monitoring (SPM) results of fault #1 operating data.
Traditionally, the Hotelling's T2 and Q statistic values are summations over the principal component and residual spaces, respectively. Thus, the T2 and Q charts show the final value of each statistic over time but fail to represent the component contributions to those values. With the proposed fault detection visualization, however, each retained PC can be displayed in a 2-D parallel coordinate plot, as shown in Fig. 9. When a new sample vector x becomes available, it is projected onto the PC sub-space and the corresponding PC scores are plotted for each PC. The control limits shown in Fig. 9 are the PCCL. It can be observed in Fig. 9 that samples 159-160 are in control whereas samples 161-162 violate the control limit.
Figure 9. Process monitoring using proposed approach, results for detecting fault #14 operating data.
Figure 9 also provides information about which PC is violated as time progresses. The operator can then focus on the variables that contribute the most to the specific violated PC(s) and stabilize the process.
Figure 10. PCCL fault detection visualized using proposed approach, results for detecting fault #1 operating data.
Figure 11. RSCL fault detection visualized using proposed approach, results for detecting fault #1 operating data.
Figure 10 shows the visualization of fault detection using the PCCL in parallel coordinates, where each PC is plotted as a parallel coordinate along with the corresponding sample number. The blue dots on the PCs for sample numbers less than 160 represent the false alarms associated with the proposed algorithm, and the red dots represent when a fault was detected on each PC. Similarly, Fig. 11 illustrates the use of the parallel coordinate plot for fault detection visualization in the residual space using the RSCL. As noted before, for a given sample, the figures show which PC (or PCs) was violated, and for diagnosis the variable loadings on the violated PCs can then be investigated to eliminate the fault.

To quantify the fault detection performance, in this work the missed detection rate is expressed as the ratio of faulty data points that are not detected as faulty to the total number of faulty data points specific to a fault. As seen in Table 3, the missed detection rate is significantly lower for all faults when our method is compared to the traditional monitoring statistics. Moreover, for the difficult-to-detect faults 3, 9 and 15, the missed detection rate is significantly reduced compared to the conventional Hotelling's T2 or Q statistic based detection.

Table 3: Missed detection rate. Blank cells indicate that either no fault was detected or it was not reported. The first six columns give the numbers of Type I and Type II errors for the PCCL, the RSCL, and either PCCL or RSCL; the next three columns give the corresponding missed detection rates of the proposed method (lower is better).

Fault | PCCL I | PCCL II | RSCL I | RSCL II | Either I | Either II | PCCL | RSCL | Either | Wang et al.* | PCA T2** | PCA Q**
1  | 9  | 7   | 18 | 0   | 26 | 0   | 0.009 | 0.000 | 0.000 | 0.018 | 0.008 | 0.003
2  | 8  | 10  | 19 | 11  | 26 | 5   | 0.013 | 0.014 | 0.006 | 0.018 | 0.020 | 0.014
3  | 22 | 664 | 51 | 632 | 70 | 534 | 0.830 | 0.790 | 0.668 | 0.542 | 0.998 | 0.991
4  | 10 | 334 | 20 | 1   | 27 | 0   | 0.418 | 0.001 | 0.000 | 0.04  | 0.956 | 0.038
5  | 10 | 499 | 20 | 1   | 27 | 1   | 0.624 | 0.001 | 0.001 | 0.002 | 0.775 | 0.746
6  | 4  | 4   | 19 | 1   | 23 | 1   | 0.005 | 0.001 | 0.001 | 0.005 | 0.011 | 0.000
7  | 7  | 1   | 22 | 1   | 28 | 1   | 0.001 | 0.001 | 0.001 | 0.007 | 0.085 | 0.000
8  | 8  | 12  | 36 | 30  | 43 | 5   | 0.015 | 0.038 | 0.006 | 0.102 | 0.034 | 0.024
9  | 49 | 673 | 41 | 647 | 77 | 550 | 0.841 | 0.809 | 0.688 | 0.829 | 0.994 | 0.981
10 | 7  | 356 | 9  | 76  | 15 | 61  | 0.445 | 0.095 | 0.076 | 0.138 | 0.666 | 0.659
11 | 8  | 331 | 17 | 82  | 24 | 64  | 0.414 | 0.103 | 0.080 | 0.171 | 0.794 | 0.356
12 | 38 | 11  | 18 | 5   | 52 | 3   | 0.014 | 0.006 | 0.004 | 0.069 | 0.029 | 0.025
13 | 11 | 45  | 17 | 29  | 27 | 28  | 0.056 | 0.036 | 0.035 | 0.283 | 0.060 | 0.045
14 | 11 | 13  | 11 | 1   | 20 | 1   | 0.016 | 0.001 | 0.001 | 0.04  | 0.158 | 0.000
15 | 14 | 638 | 18 | 611 | 31 | 506 | 0.798 | 0.764 | 0.633 | 0.188 | 0.988 | 0.973
16 | 61 | 438 | 37 | 74  | 80 | 41  | 0.548 | 0.093 | 0.051 | 0.138 | 0.834 | 0.755
17 | 7  | 102 | 29 | 14  | 35 | 13  | 0.128 | 0.018 | 0.016 | 0.266 | 0.259 | 0.108
18 | 6  | 74  | 21 | 66  | 27 | 56  | 0.093 | 0.083 | 0.070 | 0.266 | 0.113 | 0.101
19 | 13 | 563 | 24 | 219 | 36 | 140 | 0.704 | 0.274 | 0.175 |       | 0.996 | 0.873
20 | 8  | 368 | 21 | 154 | 29 | 116 | 0.460 | 0.193 | 0.145 |       | 0.701 | 0.550
21 | 23 | 411 | 67 | 206 | 78 | 194 | 0.514 | 0.258 | 0.243 | 0.411 | 0.736 | 0.570

* 1-minute sampling interval; ** 3-minute sampling interval (PCA T2 and PCA Q as reported by Russell et al.).
Once a disturbance occurs in the process, the longer the fault remains active, the longer it takes to bring the process back under control and the larger the economic losses may be. Thus, a fault detection algorithm should be able to detect a fault as soon as it occurs. Table 4 shows the time at which each fault was first detected by the PCCL, the RSCL, or either of the two. The performance of the proposed method is uniformly very good.

Table 4: Detection time/delay. Blank cells indicate that either no fault was detected or it was not reported.
Detection time (min); Faults 1-21:
Proposed method** 0 15 6
3
6 6 6 0 0 18 0 0 0 6
Russell et al.**
Wang et al.*
0
6
0 12 6
0 0
3 8 17 2
2 2 2 46 201 52 39 14 124 8 32 56 112 113
180
PCA T2
21 51
48 30 3 69
288 912 66 147 12
936 87 279
261
PCA Q
9 36
3 3 3 60
147 33 24 111 3 2220 591 75 252
261
303 585 9 135 18
267
DPCA T2 18 48
9
453 6 633 3 69
24
597 84 279
6 3 3 63
150 21 24 120 3
CVA Ts2
6 39
CVA Tr2
9 45
CVA Q
6 75
PCA Q
21 27
15 15 54 894 99
48
PCA T2
21 33
15 15 60 903 150
48
KPCA
15 30
9
KICA
9 36
Imp. KICA 6 33
1386 3 3 3 60 3
588 72 252 246 252
75 876 6 126 6 2031 42 81 249
246
3 3 3 60
69 33 6 117 3
27 60 237 33 198
0 0 0 63
132 81 0 129 3
33 69 252
3
216 39
624 192 252
45
3 3 3 75
60 69 9 123 3 27 27 57 222
177
6
3 3 3 69
51 57 6 114 3 27 21 51 198
165
6
3 0 0 60
42 45 3 99 3 21
153 189 237
Zhang**
Tamura and Tsujita**
DPCA Q 15 39
*1 minute sampling interval
9 51 195
**3 minute sampling interval
We also examined the detection rate, expressed as the ratio of faulty data points that are correctly detected as faulty to the total number of faulty data points specific to a fault. Table 5 depicts the detection rate for all faults simulated in the TEP. Again, the performance of our method is superior to that of the traditional approach.
Table 5: Detection rate.
All values are detection rates in %; PCCL, RSCL and Either are the proposed method.

Fault | PCA T2 | PCA Q  | PCCL  | RSCL   | Either
1  | 99.37 | 100.00 | 99.13 | 100.00 | 100.00
2  | 98.50 | 98.37  | 98.75 | 98.63  | 99.38
3  | 1.00  | 1.38   | 17.00 | 21.00  | 33.25
4  | 5.89  | 99.75  | 58.25 | 99.88  | 100.00
5  | 23.43 | 21.43  | 37.63 | 99.88  | 99.88
6  | 99.62 | 100.00 | 99.50 | 99.88  | 99.88
7  | 55.01 | 100.00 | 99.88 | 99.88  | 99.88
8  | 96.87 | 94.86  | 98.50 | 96.25  | 99.38
9  | 1.13  | 1.25   | 15.88 | 19.13  | 31.25
10 | 34.96 | 25.19  | 55.50 | 90.50  | 92.38
11 | 24.44 | 70.55  | 58.63 | 89.75  | 92.00
12 | 97.62 | 94.11  | 98.63 | 99.38  | 99.63
13 | 94.11 | 95.36  | 94.38 | 96.38  | 96.50
14 | 84.21 | 100.00 | 98.38 | 99.88  | 99.88
15 | 2.38  | 2.26   | 20.25 | 23.63  | 36.75
16 | 20.93 | 24.19  | 45.25 | 90.75  | 94.88
17 | 75.31 | 92.86  | 87.25 | 98.25  | 98.38
18 | 89.10 | 90.10  | 90.75 | 91.75  | 93.00
19 | 0.25  | 22.81  | 29.63 | 72.63  | 82.50
20 | 33.46 | 46.99  | 54.00 | 80.75  | 85.50
Specifying the control limits is one of the critical decisions that must be made in designing a control chart. By moving the control limits farther from the center line, we decrease the risk of a Type I error, that is, the risk of a point falling beyond the control limits and indicating an out-of-control condition when no assignable cause is present. However, widening the control limits also increases the risk of a Type II error, that is, the risk of a point falling between the control limits when the process is really out of control. If we move the control limits closer to the center line, the opposite effect is obtained: the risk of a Type I error is increased while the risk of a Type II error is decreased. The proposed PCCL/RSCL approach is applied to the TEP simulation data and its fault detection performance is compared with the conventional Hotelling's T2 and Q statistic monitoring strategy. Table 3 also captures the Type I/Type II error performance of the proposed method.

Fault Diagnosis

Once a fault is detected, the next task is to identify the type of fault and/or diagnose its root cause. When each faulty dataset is projected in parallel coordinates, it is observed that each fault has a unique signature in the multidimensional space, as exemplified in Figs. 12-14 for faults #1, #3 and #6. The unique fault signatures in parallel coordinates hint at the possibility of using the PC scores to classify different types of faults. This forms the basis of the Random Forests machine learning algorithm, which is trained on data obtained from all faults and then used to classify and diagnose faults in real time.
Figure 12. Fault #1 signature in parallel coordinates.
Figure 13. Fault #3 signature in parallel coordinates.
Figure 14. Fault #6 signature in parallel coordinates.
To develop the offline fault diagnosis model, a dataset containing the PC scores and the corresponding fault number is created. The dataset consists of samples 161-860, which covers the period from the point at which the fault was introduced to when the system has reached a new steady state after the fault. From the large dataset consisting of 700 samples from each normal and fault condition, 75% of the samples (11550 samples in models #1 and #2) were randomly selected for training; the remaining 25% (3850 samples in models #1 and #2) constituted the testing dataset. During training, the Random Forests algorithm knows which PC scores correspond to which fault number; in other words, the fault signature is captured during the training stage. 1500 trees were used in this case study to build the offline model. In Table 6, model #2 is the one with the largest dataset (15400 x 48), with 15400 samples (700 samples each from normal operation and the 21 faulty datasets) and 48 variables (1 fault number, 14 PC scores and 33 residuals). The computational time for generating 1500 trees for model #2 is 247.37 seconds; this time is of no concern as the step is performed offline. For real-time diagnosis, once a sample is detected as faulty, the diagnosis step is carried out. The time taken to diagnose one sample based on model #2 is 0.64 seconds. To illustrate the real-time diagnosis step, Fig. 15 shows a sample classification tree created from a training dataset consisting of 14 PC scores from all faults. Note that it is not possible to show all the final trees developed by Random Forests, since the model is an ensemble of trees.
Figure 15. An example of a fault diagnosis tree generated based on training dataset of 14 principal component scores.
As shown in Fig. 15, if the PC1 and PC8 scores are greater than or equal to 18 and -20, respectively, the sample would be diagnosed as fault 1.

For real-time fault diagnosis, a sample collected at a given time instance is projected onto the PC sub-space and the PC scores are then used to classify that sample. This is done only when the sample is detected as faulty. Thus, using the tree in Fig. 15, if the real-time sample has PC1 and PC8 scores greater than or equal to 18 and -20, respectively, it would be diagnosed as fault 1. In addition, such a classification is obtained from all trees in the model and the majority vote is used as the final diagnosis. As shown in Table 6, six different datasets were created to study the diagnosis accuracy for different training datasets. The final model selected for diagnosis was model #5 from Table 6. This model did not include the normal operating data, the argument being that diagnosis is performed only when a fault has been detected, and thus a fault is either diagnosed correctly or misclassified as another fault. It is noteworthy that faults 3, 9 and 15 have higher missed detection rates (>70%) relative to the other faults, as their fault signatures are very similar to those obtained for the normal operating dataset. Thus, we observe increased model fidelity when faults with higher missed detection rates are removed from the training dataset. In addition to the PC scores, we also added the respective residuals as additional variables for training, to investigate how this affects the diagnosis outcome.
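The per-tree voting can be made explicit with the fitted ensemble from the earlier sketch, since scikit-learn exposes the individual trees through the `estimators_` attribute (again an illustration rather than the authors' code; `clf.predict` performs an equivalent vote internally).

```python
import numpy as np

def majority_vote_diagnosis(clf, sample_features):
    """Collect each tree's class prediction for one faulty sample and return the mode."""
    x = np.asarray(sample_features).reshape(1, -1)
    votes = [clf.classes_[np.argmax(tree.predict_proba(x)[0])] for tree in clf.estimators_]
    values, counts = np.unique(votes, return_counts=True)
    return values[np.argmax(counts)]
```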
Table 6: Overall fault diagnosis accuracy.

Model # | Training dataset                     | Variables included in training data | Fault diagnosis accuracy on the testing dataset (%)
1       | All faults and normal operating data | 14 PC scores                        | 65.12
2       | All faults and normal operating data | 14 PC scores and residuals          | 67.69
3       | All faults except faults 3, 9, 15    | 14 PC scores                        | 79.26
4       | All faults except faults 3, 9, 15    | 14 PC scores and residuals          | 84.87
5       | All faults                           | 14 PC scores                        | 67.86
6       | All faults                           | 14 PC scores and residuals          | 74.64
To evaluate the performance of a classification model on a set of test data for which the true values are known, a confusion matrix is often used, as shown in Table 7. In Table 7, each column of the confusion matrix represents the instances in the reference (actual) class, while each row represents the instances of a predicted class. To evaluate the diagnosis performance of the Random Forests algorithm, we use the sensitivity (true positive rate) and specificity (true negative rate) metrics. Figure 16 shows how these measures are defined and calculated. The sensitivity for each fault is the proportion of a positive fault scenario that is correctly identified as such, whereas the specificity is the proportion of a negative fault scenario that is identified as such. Using the formulas in Fig. 16 and the confusion matrix in Table 7, one can calculate the sensitivity and specificity of the Random Forests classification model for each fault.
Figure 16. Random Forests diagnosis performance measures.
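These measures follow directly from any square confusion matrix; a short sketch (our own illustration, with columns taken as the reference class and rows as the predicted class, matching the layout of Table 7) is:

```python
import numpy as np

def sensitivity_specificity(confusion):
    """Per-class sensitivity and specificity from a confusion matrix whose columns are
    reference (actual) classes and whose rows are predicted classes."""
    C = np.asarray(confusion, dtype=float)
    tp = np.diag(C)
    fn = C.sum(axis=0) - tp        # reference samples predicted as some other class
    fp = C.sum(axis=1) - tp        # other-class samples predicted as this class
    tn = C.sum() - tp - fn - fp
    return tp / (tp + fn), tn / (tn + fp)
```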
Table 7: Confusion matrix for the testing dataset consisting of 14 PC scores from all faults (model #5). The Random Forests model evaluated was trained on a training dataset consisting of 14 PC scores from all faults. Columns are the reference (actual) fault classes; rows are the predicted classes.
[21 × 21 confusion matrix of predicted fault classes (rows) versus reference fault classes (columns) for the 3675 testing samples, with the specificity of each predicted class in the final column and the column sums and sensitivity of each reference class in the final rows.]
The Random Forests performance results are then compared to the results reported by Li and Xiao [55]. Table 8 lists the correct diagnosis rates for the 20 fault types using five PCs, as reported by Li and Xiao [55], and the results obtained using Random Forests on 14 PCs. Li and Xiao [55] employ an adaptive rank-order morphological filter to develop a pattern classification algorithm for fault diagnosis in the TEP. Their algorithm achieves pattern matching by applying a one-dimensional adaptive rank-order morphological filter to unrecognized signals under the supervision of different standard signal patterns, and the matching degree is characterized by the error between the standard signal and the filter output signal. In their approach, if a signal pattern and its best matching signal template derive from the same fault type, the diagnosis is regarded as correct; otherwise it is incorrect. Li and Xiao [55] reported that the performance of the one-dimensional adaptive rank-order morphological filter (PCA1DARMF) algorithm degrades when more sets of signal templates representing more fault types are provided. As shown in Table 8, Random Forests provides a significantly higher true positive rate for most faults. It should also be noted that for faults 3, 9 and 15, which are deemed hard to detect, the true negative rate for diagnosis is more than 94%.
Table 8: Fault diagnosis comparison.
Fault   Correct diagnosis rate (%) [55]   Random Forests sensitivity (%)   Random Forests specificity (%)
1       30                                98.75                            100
2       95                                97.29                            100
3       0                                 19.63                            95.87
4       5                                 89.14                            97.11
5       25                                41.57                            97.54
6       100                               98.93                            100
7       65                                100                              100
8       0                                 95.34                            99.65
9       5                                 23.56                            96.34
10      0                                 51.63                            97.36
11      15                                45.97                            98.97
12      0                                 85.93                            99.31
13      5                                 88.88                            99.8
14      5                                 92.85                            99.68
15      5                                 32                               94.85
16      0                                 30.33                            97.65
17      85                                75.37                            99.42
18      30                                81.17                            99.91
19      0                                 51.42                            97.85
20      0                                 49.7                             97.43
Visualization of fault propagation

Once a fault occurs in the process, the disturbance propagates through the process units. Our preliminary results show that when the score and loading plots are viewed together, additional information on how the disturbance travels through the system can be obtained. Figure 17 shows the loading and score plots for fault #1. Once the fault is introduced after sample #160, the scores start breaking away from the normal operating cluster. As time progresses, for samples from #170 onwards, the loading plot points towards the A feed flow and its control valve as the root cause of the disturbance. For later samples, the disturbance is seen to affect the stripper, reactor and product separator pressures. The change in the A feed flow also affects the total feed flow. Such in-depth correlations for each fault can be derived from the data without any prior process knowledge or understanding of the process dynamics.

A high loading of a variable on a PC indicates that the variable is associated with the direction of maximum variation captured by that PC. For example, as shown in Fig. 17, the A feed and its control valve are correlated with PC1. Once a fault occurs and time progresses, any of the PCs can be violated, and the variables correlated with those PCs are then either major contributors to or consequences of the disturbance for the given sample. As shown in Fig. 17, at sample #175 the A feed and its control valve have high loadings on PC1 and are the root cause; at later samples the fault shifts to PC2, and the pressure variables that have high loadings on PC2 become the likely root causes. Thus, at later samples the operator would focus on stabilizing the pressures to bring the process back to steady state.
Figure 17. Tennessee Eastman Benchmark Process fault propagation for fault #1.
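As a rough illustration of this reasoning, the sketch below fits PCA to normal operating data and, for a faulty sample, reports which variables load most heavily on each PC whose score violates its limit; the variable names, limits and data are placeholders rather than the actual TEP configuration.

```python
# Minimal sketch: use PC loadings to suggest candidate root-cause variables for the
# PC that a faulty sample violates. Data, variable names and limits are illustrative.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
var_names = [f"var_{i}" for i in range(10)]       # placeholder process variable names

# Placeholder normal operating data (samples x variables), standardized before PCA.
X_normal = rng.normal(size=(500, 10))
scaler = StandardScaler().fit(X_normal)
pca = PCA(n_components=4).fit(scaler.transform(X_normal))

# Illustrative per-PC control limits, e.g. +/- 3 standard deviations of the scores.
scores_normal = pca.transform(scaler.transform(X_normal))
limits = 3.0 * scores_normal.std(axis=0)

# A (made-up) faulty sample: project it and find which PCs violate their limits.
x_fault = rng.normal(size=(1, 10)) + 4.0
t = pca.transform(scaler.transform(x_fault)).ravel()
for k in np.where(np.abs(t) > limits)[0]:
    # Variables with the largest absolute loadings on the violated PC are the
    # candidate contributors to (or consequences of) the disturbance.
    top = np.argsort(-np.abs(pca.components_[k]))[:3]
    print(f"PC{k + 1} violated; high-loading variables:", [var_names[i] for i in top])
```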
CONCLUSIONS

PCA is widely used to transform the original variables into a reduced set of uncorrelated PCs. With the proposed parallel visualization technique, all PCs, or only the significant ones, can be viewed at once. Plotting these PCs in parallel coordinates provides a visually comprehensible means of analyzing large datasets. Faults can be spotted visually on the PCs and validated by the control limits applied to the retained PCs.

Traditionally, the Hotelling's T2 and Q statistics are used for process fault detection, but they fail to indicate which PCs caused the statistics to violate their limits. We instead propose a method that applies a control limit to each PC and thus allows the operator to narrow down the root causes of the fault. The fault detection rates obtained with the proposed method are considerably improved compared to the classical approaches.
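A minimal sketch of this kind of display, assuming matplotlib and placeholder scores and limits rather than the values derived in this work, is the following: each PC is assigned one parallel axis, the normal operating cluster forms a band of lines, samples that break the cluster stand out, and the per-PC control limits appear as dashed lines across the axes.

```python
# Minimal sketch of PC scores in parallel coordinates with a control limit per PC;
# the data and the +/- 3-sigma limits are placeholders, not the limits of this work.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
n_pcs = 6
scores_normal = rng.normal(size=(200, n_pcs))                        # placeholder scores
scores_fault = scores_normal[-20:] + np.array([4, 0, 2, 0, 0, 0])    # shifted samples
limits = 3.0 * scores_normal.std(axis=0)
axes_x = np.arange(1, n_pcs + 1)

fig, ax = plt.subplots(figsize=(7, 4))
for row in scores_normal:
    ax.plot(axes_x, row, color="grey", alpha=0.2)     # normal operating cluster
for row in scores_fault:
    ax.plot(axes_x, row, color="red", alpha=0.6)      # samples breaking the cluster
ax.plot(axes_x, limits, "k--", label="upper control limit")
ax.plot(axes_x, -limits, "k--", label="lower control limit")
ax.set_xticks(axes_x)
ax.set_xticklabels([f"PC{i}" for i in axes_x])
ax.set_ylabel("score")
ax.legend()
plt.show()
```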
We also show that the PCs plotted in parallel coordinates highlight common features of faulty events and can help reveal underlying patterns that can be used for fault detection and isolation. We further propose a method based on the Random Forests technique that provides an alternative way to classify and identify faults in a process. With more data being collected from processes every day, a larger database of faults can be incorporated to train this machine-learning algorithm and used for real-time fault diagnosis.
Another aspect of the proposed technique that needs further investigation is the analysis of the PC scores to discern characteristic features of the PCs during faulty operating conditions. Such characteristics will further guide fault isolation as sampled data are acquired by the DCS. This knowledge can then be utilized for fault prediction [59] and for taking preemptive action to prevent faults.

REFERENCES
[1] S.J. Qin, Statistical process monitoring: basics and beyond, Journal of Chemometrics, 17 (2003) 480-502.
[2] M. Riaz, F. Muhammad, An application of control charts in manufacturing industry, Journal of Statistical and Econometric Methods, 1 (2012) 77-92.
[3] NIST/SEMATECH, e-Handbook of Statistical Methods, 2015.
[4] C.-d. Tong, X.-f. Yan, Y.-x. Ma, Statistical process monitoring based on improved principal component analysis and its application to chemical processes, J. Zhejiang Univ. Sci. A, 14 (2013) 520-534.
[5] L.H. Chiang, E.L. Russell, R.D. Braatz, Fault Detection and Diagnosis in Industrial Systems, 2001.
[6] V. Venkatasubramanian, R. Rengaswamy, S.N. Kavuri, K. Yin, A review of process fault detection and diagnosis: Part III: Process history based methods, Computers & Chemical Engineering, 27 (2003) 327-346.
[7] M. Tamura, S. Tsujita, A study on the number of principal components and sensitivity of fault detection using PCA, Computers & Chemical Engineering, 31 (2007) 1035-1046.
[8] T. Kourti, J.F. MacGregor, Process analysis, monitoring and diagnosis, using multivariate projection methods, Chemometrics and Intelligent Laboratory Systems, 28 (1995) 3-21.
[9] B.M. Wise, N.B. Gallagher, The process chemometrics approach to process monitoring and fault detection, Journal of Process Control, 6 (1996) 329-348.
[10] E.L. Russell, L.H. Chiang, R.D. Braatz, Data-driven Methods for Fault Detection and Diagnosis in Chemical Processes, Springer-Verlag, London, 2000.
[11] A. Cinar, A. Palazoglu, F. Kayihan, Multivariate statistical monitoring techniques, in: Chemical Process Performance Evaluation, CRC Press, 2007, pp. 37-71.
[12] A. Cinar, S.J. Parulekar, C. Undey, G. Birol, Batch Fermentation, CRC Press, 2003.
[13] S.L. Shah, Dynamics and Control of Process Systems 2004 (DYCOPS-7): A Proceedings Volume from the 7th IFAC Symposium, Cambridge, Massachusetts, USA, 5-7 July 2004, Elsevier, 2005.
[14] L.H. Chiang, R.D. Braatz, Process monitoring using causal map and multivariate statistics: fault detection and identification, Chemometrics and Intelligent Laboratory Systems, 65 (2003) 159-178.
[15] J.-M. Lee, C. Yoo, I.-B. Lee, Statistical monitoring of dynamic processes based on dynamic independent component analysis, Chemical Engineering Science, 59 (2004) 2995-3006.
[16] M. Kano, K. Nagao, S. Hasebe, I. Hashimoto, H. Ohno, R. Strauss, B.R. Bakshi, Comparison of multivariate statistical process monitoring methods with applications to the Eastman challenge problem, Computers & Chemical Engineering, 26 (2002) 161-174.
[17] Y. Zhang, Fault detection and diagnosis of nonlinear processes using improved kernel independent component analysis (KICA) and support vector machine (SVM), Industrial & Engineering Chemistry Research, 47 (2008) 6961-6971.
[18] B. Scholkopf, S. Mika, C.J.C. Burges, P. Knirsch, K. Muller, G. Ratsch, A.J. Smola, Input space versus feature space in kernel-based methods, IEEE Transactions on Neural Networks, 10 (1999) 1000-1017.
[19] J. Yang, X. Gao, D. Zhang, J.-y. Yang, Kernel ICA: An alternative formulation and its application to face recognition, Pattern Recognition, 38 (2005) 1784-1787.
[20] E.L. Russell, L.H. Chiang, R.D. Braatz, Fault detection in industrial processes using canonical variate analysis and dynamic principal component analysis, Chemometrics and Intelligent Laboratory Systems, 51 (2000) 81-93.
[21] M.A. Bin Shams, H.M. Budman, T.A. Duever, Fault detection, identification and diagnosis using CUSUM based PCA, Chemical Engineering Science, 66 (2011) 4488-4498.
[22] P. Miller, R.E. Swanson, C.F. Heckler, Contribution plots: the missing link in multivariate quality control, International Journal of Applied Mathematics and Computer Science, 8 (1998) 775-792.
[23] S. Yoon, J.F. MacGregor, Fault diagnosis with multivariate statistical models part I: using steady state fault signatures, Journal of Process Control, 11 (2001) 387-400.
[24] R. Dunia, J.S. Qin, Joint diagnosis of process and sensor faults using principal component analysis, Control Engineering Practice, 6 (1998) 457-469.
[25] W. Ku, R.H. Storer, C. Georgakis, Disturbance detection and isolation by dynamic principal component analysis, Chemometrics and Intelligent Laboratory Systems, 30 (1995) 179-196.
[26] A. Raich, A. Çinar, Diagnosis of process disturbances by statistical distance and angle measures, Computers & Chemical Engineering, 21 (1997) 661-673.
[27] F. Usama, G.G. Georges, W. Andreas (Eds.), Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann Publishers Inc., 2002, 440 pp.
[28] R.W. Brooks, Viewing process information multi-dimensional for improved process understanding, operation and control, Aspen World Conference, Boston, 1997.
[29] Y.-H. Fua, M.O. Ward, E.A. Rundensteiner, Hierarchical parallel coordinates for exploration of large datasets, in: Proceedings of the Conference on Visualization '99: Celebrating Ten Years, IEEE Computer Society Press, San Francisco, California, USA, 1999, pp. 43-50.
[30] H. Albazzaz, X.Z. Wang, F. Marhoon, Multidimensional visualisation for process historical data analysis: a comparative study with multivariate statistical process control, Journal of Process Control, 15 (2005) 285-294.
[31] H. Albazzaz, X.Z. Wang, Historical data analysis based on plots of independent and parallel coordinates and statistical control limits, Journal of Process Control, 16 (2006) 103-114.
[32] R. Dunia, T.F. Edgar, M. Nixon, Process monitoring using principal components in parallel coordinates, AIChE Journal, 59 (2013) 445-456.
[33] R.C. Wang, T.F. Edgar, M. Baldea, M. Nixon, W. Wojsznis, R. Dunia, Process fault detection using time-explicit Kiviat diagrams, AIChE Journal, 61 (2015) 4277-4293.
[34] J.J. Downs, E.F. Vogel, A plant-wide industrial process control problem, Computers & Chemical Engineering, 17 (1993) 245-255.
[35] J.E. Jackson, A User's Guide to Principal Components, John Wiley & Sons, 2004, pp. 4-25.
[36] P. Nomikos, J.F. MacGregor, Multivariate SPC charts for monitoring batch processes, Technometrics, 37 (1995) 41-59.
[37] E.N.M. van Sprang, H.-J. Ramaker, J.A. Westerhuis, S.P. Gurden, A.K. Smilde, Critical evaluation of approaches for on-line batch process monitoring, Chemical Engineering Science, 57 (2002) 3979-3991.
[38] I.T. Jolliffe, Principal Component Analysis, Springer, New York, 2002.
[39] R. Cangelosi, A. Goriely, Component retention in principal component analysis with application to cDNA microarray data, Biology Direct, 2 (2007).
[40] P.R. Lyman, C. Georgakis, Plant-wide control of the Tennessee Eastman problem, Computers and Chemical Engineering, 19 (1995) 321-331.
[41] J. Yu, S.J. Qin, Multimode process monitoring with Bayesian inference-based finite Gaussian mixture models, AIChE Journal, 54 (2008) 1811-1829.
[42] X. Chen, X. Yan, Using improved self-organizing map for fault diagnosis in chemical industry process, Chemical Engineering Research and Design, 90 (2012) 2262-2277.
[43] S. Stubbs, J. Zhang, J. Morris, Fault detection in dynamic processes using a simplified monitoring-specific CVA state space modelling approach, Computers & Chemical Engineering, 41 (2012) 77-87.
[44] J.P. Stevens, Applied Multivariate Statistics for the Social Sciences, Taylor & Francis, New York, 2009.
[45] A. Inselberg, M. Reif, T. Chomut, Convexity algorithms in parallel coordinates, J. ACM, 34 (1987) 765-801.
[46] A. Inselberg, The plane with parallel coordinates, The Visual Computer, 1 (1985) 69-91.
[47] E.J. Wegman, Hyperdimensional data analysis using parallel coordinates, Journal of the American Statistical Association, 85 (1990) 664-675.
[48] A. Inselberg, Visualization and data mining for high dimensional datasets, in: O. Maimon, L. Rokach (Eds.), Data Mining and Knowledge Discovery Handbook, Springer US, 2005, pp. 297-319.
[49] J.J. Miller, E.J. Wegman, Construction of line densities for parallel coordinate plots, in: Computing and Graphics in Statistics, Springer-Verlag, New York, 1991, pp. 107-123.
[50] M.L. Samuels, J. Witmer, A. Schaffner, Statistics for the Life Sciences, Pearson, 2011.
[51] F. Yang, D. Xiao, Progress in root cause and fault propagation analysis of large-scale industrial processes, Journal of Control Science and Engineering, 2012 (2012) 10.
[52] R.G. Cowell, P. Dawid, S.L. Lauritzen, D.J. Spiegelhalter, Probabilistic Networks and Expert Systems, Springer-Verlag, New York, 1999.
[53] M. Maurya, R. Rengaswamy, V. Venkatasubramanian, Fault diagnosis by qualitative trend analysis of the principal components, Chemical Engineering Research and Design, 83 (2005) 1122-1132.
[54] G. Dong, Z. Beike, M. Xin, W. Chongguang, Application of signed directed graph based fault diagnosis of atmospheric distillation unit, in: 2010 2nd International Workshop on Intelligent Systems and Applications (ISA), 2010, pp. 1-5.
[55] H. Li, D.-y. Xiao, Fault diagnosis of Tennessee Eastman process using signal geometry matching technique, EURASIP J. Adv. Signal Process., 2011 (2011) 1-19.
[56] R.E. Schapire, Y. Freund, P. Barlett, W.S. Lee, Boosting the margin: A new explanation for the effectiveness of voting methods, in: Proceedings of the Fourteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc., 1997, pp. 322-330.
[57] L. Breiman, Bagging predictors, Machine Learning, 24 (1996) 123-140.
[58] L. Breiman, Random Forests, Machine Learning, 45 (2001) 5-32.
[59] C. Skittides, W.-G. Früh, Wind forecasting using Principal Component Analysis, Renewable Energy, 69 (2014) 365-374.