Robust monitoring of industrial processes using process data with outliers and missing values

Robust monitoring of industrial processes using process data with outliers and missing values

Journal Pre-proof Robust monitoring of industrial processes using process data with outliers and missing values Lijia Luo, Shiyi Bao, Xin Peng PII: S...

2MB Sizes 0 Downloads 58 Views

Journal Pre-proof Robust monitoring of industrial processes using process data with outliers and missing values Lijia Luo, Shiyi Bao, Xin Peng PII:

S0169-7439(19)30073-5

DOI:

https://doi.org/10.1016/j.chemolab.2019.103827

Reference:

CHEMOM 103827

To appear in:

Chemometrics and Intelligent Laboratory Systems

Received Date: 4 February 2019 Revised Date:

5 August 2019

Accepted Date: 14 August 2019

Please cite this article as: L. Luo, S. Bao, X. Peng, Robust monitoring of industrial processes using process data with outliers and missing values, Chemometrics and Intelligent Laboratory Systems (2019), doi: https://doi.org/10.1016/j.chemolab.2019.103827. This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain. © 2019 Published by Elsevier B.V.

Robust monitoring of industrial processes using process data with outliers and missing values Lijia Luoa∗, Shiyi Baoa, Xin Pengb a

Institute of Process Equipment and Control Engineering, Zhejiang University of Technology, Hangzhou 310014, China b Key Laboratory of Advanced Control and Optimization for Chemical Processes of Ministry of Education, East China University of Science and Technology, Shanghai 200237, China Abstract: The quality of process data has great effect on the performance of data-driven process monitoring methods. Two main problems that reduce the quality of data are the contamination (i.e., outliers) and missing values. To address these problems, a two-step method called COF-GRE is developed for the robust estimation of multivariate location and scatter matrix in the presence of missing data and both casewise and cellwise outliers. The first step applies cellwise outlier filters (COF) to identify and remove the contaminated cells in a multivariate data matrix. In the second step the generalized Rock S-estimators (GRE) are used to handle casewise outliers and missing values simultaneously. Based on the COF-GRE method, a robust process monitoring method is then developed to achieve high-performance monitoring even if the reference data of industrial processes contain outliers and missing values. A fault detection index, called the robust T2 statistic, is defined using the multivariate location and scatter matrix obtained by COF-GRE. A fault diagnosis method is also proposed to determine faulty variables and at the same time to compute their fault magnitudes. The effectiveness and advantages of the proposed methods are illustrated with a simulation example and an industrial case study. Keywords: Robust estimation; Cellwise outliers; Missing data; Process monitoring; Fault detection



Corresponding Author. Tel.: +86-571-85290401. E-mail address: [email protected] 1

and diagnosis. 1. Introduction Online monitoring of industrial processes is an important means to enhance the safety of process operations and to ensure the high product quality. Process monitoring consists of two main tasks: fault detection that aims to detect abnormal events (referred to as process faults) occurred during process operations, and fault diagnosis that aims to identify faulty variables and to diagnose fault sources [1]. As industrial processes become more heavily instrumented, large quantities of process data are available for use in process monitoring. This greatly promotes the development of data-driven process monitoring methods [2-4]. Data-driven monitoring methods are especially suited for complex, large-scale and distributed industrial processes, because they rarely need the detailed engineering knowledge about processes and are very easy to implement. For this reason, lots of data-driven monitoring methods have been proposed over the past few decades [2-12]. The general procedure of data-driven process monitoring is as follows: (1) obtain a set of reference/training data representing the normal process behaviors; (2) apply multivariate statistical analysis methods to the process data for building monitoring models or extracting important data information; (3) construct monitoring indices and determine their control limits for detecting process faults; (4) diagnose the root cause of detected faults. To cope with the high dimensionality of process data and the strong dependency among process variables, most of data-driven monitoring methods are based on latent variable models that are built by using dimensionality reduction approaches [5-9], such as principal component analysis (PCA) [7], canonical correlation analysis (CCA) and independent component analysis (ICA) [9]. On the other hand, probabilistic models and Bayesian methods are also widely 2

used for fault detection and diagnosis [5,10-12]. For example, the Bayesian network has shown a number of advantages in fault diagnosis, due to its powerful statistical inference ability [11,12]. The reference data represent the normal region of an industrial process. They are used not only for process modeling but also for defining monitoring indices and determining their control limits for fault detection. There is no doubt that the quality of reference data has great effect on the process monitoring performance. Based on the fact that the reference data are collected in normal operating conditions of a process, most of data-driven monitoring methods simply assume that all the reference data are normal. In practical situations, however, the data collected from industrial processes frequently contain outliers and missing values. Outliers can be gross measurement errors, recording errors, faulty data, and the data obtained during atypical operating periods (e.g., startup and shutdown) or in different operating modes [13]. Missing values may be caused by the partial loss of recorded data, the malfunction of sensors, or failures in measurement. Both outliers and missing values reduce the quality of reference data and the performance of process monitoring. Multivariate analysis methods, such as PCA, CCA, factor analysis and regression analysis, are known to be very sensitive to outliers. From statistical point of view, the presence of outliers disturbs the accurate estimation of the mean and variance/covariance of the data. From the viewpoint of process monitoring, outliers in the reference data may distort the monitoring model, reduce the efficiency of monitoring indices and inflate the control limits of monitoring indices, which result in poor process monitoring performance [13]. Robust statistics has been developed for the outlier detection and for the robust estimation of multivariate location and scatter matrix in the presence of outliers [14-18], such as the S-estimators [16], M-estimators [17], and minimum covariance determinant (MCD) 3

estimators [18]. Robust multivariate analysis methods, such as robust PCA [19], were also proposed to yield reliable results even if there are outliers. Some robust analysis methods have been applied to process monitoring [13,20-22]. Two simple ways to deal with missing values are to delete the observations containing missing values or to replace the missing values with the corresponding mean/median of variables. However, the deletion of data leads to the loss of useful information, which is undesired when the number of missing values is large or the data set is small. The data replacement may change statistical properties (e.g., mean, variance and covariance) of the data and destroy the correlations between variables. When there is no outlier, a better way to cope with missing values is using the expectation maximization (EM) algorithm to maximize the likelihood of the observed data [19,23]. Robust versions of the EM algorithm have been developed to handle missing data and outliers simultaneously [24]. Except for the robust EM algorithm, some other methods were also proposed for dealing with data sets with both missing values and outliers [23,25], such as the generalized S-estimators [23]. Suppose the multivariate data are stored in the form of a matrix with rows as cases (i.e., samples) and columns as variables. In robust statistics for multivariate data analysis, the outlier typical refers to a casewise outlier, i.e., a row (or a case) in a data matrix [26]. Most of existing robust analysis methods were developed in the context of casewise outliers. However, the casewise outlier paradigm is insufficient for multivariate data sets. It often happens that only a few data cells (entries) in a row are anomalous but the other cells are normal, particularly in the cases where variables are measured separately and/or obtained from different sources, such as the data from industrial processes. To address this problem, Alqallaf et al. [27] proposed the cellwise outlier paradigm based on an 4

independent contamination model, where the basic units of outliers are the cells in a data matrix. Assume that each data cell has a probability ε of being independently contaminated. For a data matrix with p variables, the expected fraction of contaminated rows/cases is ̅ = 1 − (1 − ) , which could quickly exceed 50% as ε or p increases. This is fatal to the traditional robust methods designed for casewise outliers, because they cannot handle well the data with over 50% of contaminated cases [14,26]. A few methods have been proposed to address the problem of cellwise outliers [26,28,29]. Agostinelli et al. [28] applied a univariate filter to flag the anomalous cells in a data matrix. However, this univariate filter has limited performance because the correlations between variables are not taken into account. To overcome this deficiency, Rousseeuw and Van den Bossche [26] developed a new method called DetectDeviatingCells (DDC) to filter and impute cellwise outliers. In the DDC method, the value of each cell is predicted using variable correlations. A cell is flagged as an outlier if the observed value deviates far from the predicted value. Unfortunately, the DDC method uses a fixed cutoff value to identify cellwise outliers, and hence some cells will still be flagged even if there is on outlier in the data. Recently, Leung et al. [29] developed a new filter called UBF-DDC by intersecting DDC with a consistent univariate and bivariate filter (UBF). The UBF-DDC first performs both UBF and DDC methods, and then it computes the intersection of results of UBF and DDC. Therefore, UBF-DDC has lower computational efficiency than DDC and UBF, and it may not yield better results than DDC or UBF in some cases. Dealing with cellwise outliers is an ongoing problem. The problem becomes more difficult if the data contain both cellwise and casewise outliers and missing values simultaneously, which may occur in the data obtained from actual industrial processes. However, there are few works that address this problem together with the 5

problem of process monitoring. In this paper, a two-step method called COF-GRE is proposed to estimate the location and scatter matrix of multivariate data in the presence of cellwise and casewise outliers and missing values simultaneously. The first step uses cellwise outlier filters (COF) to identify and remove the contaminated cells in a data matrix. The second step applies the generalized Rocke S-estimators (GRE), which are able to deal with missing data and casewise outliers, to the incomplete data matrix coming from the first step to estimate the multivariate location and scatter matrix. Based on COF-GRE, a robust process monitoring method is then developed. The multivariate mean and scatter matrix obtained by COF-GRE are used to define a robust T2 statistic for fault detection. A fault diagnosis method is also proposed to identify faulty variables and compute their fault magnitudes. The rest of the article is organized as follows. Section 2 introduces the two-step COF-GRE method and its components: cellwise outlier filters and the generalized Rocke S-estimators. The robust process monitoring method is presented in Section 3. In Section 4, the effectiveness and advantages of the proposed methods are illustrated by two case studies. Conclusions are given in Section 5. 2. Two-step method for robust estimation of multivariate location and scatter matrix Consider a data matrix ∈ ℛ × consisting of n samples of p variables. Some cells in X may be missing values or contaminated by outliers. A two-step method, called COF-GRE, is developed for the robust estimation of multivariate location and scatter matrix of X in the presence of missing data and outliers. The first step uses cellwise outlier filters (COF) to flag the contaminated cells in X and set the flagged cells to missing values. The second step applies the generalized Rocke S-estimators [29] to the incomplete data matrix to estimate the multivariate location and scatter matrix. 6

2.1 Cellwise outlier filters 2.1.1 A univariate filter Agostinelli et al. [28] applied the univariate GY filter of Gervini and Yohai [30] on each variable separately to identify cellwise outliers in a data matrix. Let  =  ,  , … ,    be a sample of

the variable j. The observations  are standardized by  =

   

(1)

where ! and " are robust estimators of location and scale of the variable j. In this paper, ! and " are chosen as the median and the Sn scale estimator [31] respectively, given by ! = median ( )

(2)

" = 1.1926 ∙ median (median. ( − . ))

(3)

The empirical distribution function of the absolute value / / is

 2 01 , (3) = ∑ 7 5/ / ≤ 3

(4)

2 To detect outliers, the 01 , (3) is compared with the real distribution of / / when  ~09, .

However, the real distribution 09, of  is unknown in practice. The standard normal distribution, 0 , is often chosen as a reference distribution for  . The non-normal data could be transformed to

be approximately normally distributed using a normalizing transformation, such as the Box–Cox and Yeo–Johnson transformations [32]. According to the GY filter [28,30], the fraction of outliers in the sample  is computed by 2 ℎ , = sup>?@ 02 (3) − 01 , (3)

= maxBC D02 ||()  −

 2

F

2

where G∙H2 denotes the positive part, 02 denotes the distribution of / /,

(5) I = 02  (J) is the 

J quantile of 02 , K9 = maxK: ||() < I , and ||() is the order statistics of / /. The NOℎ , P 7

(QRS denotes the largest integer no larger than a) observations with the largest absolute values / /

are flagged as outliers. The cutoff value for / / is [28,30]

2 ̅ = min3: 01 , (3) ≥ 1 − ℎ ,  3 ,

(6)

̅ = ||(U ) with K = O − NOℎ , P. Equivalently, the observations  with / / ≥ 3 , ̅ that is, 3 ,

̅ is adaptively computed from are flagged as outliers. It is important to note that the cutoff value 3 , the data. Moreover, the univariate GY filter is a consistent filter [28,29], which asymptotically will

not wrongly flag any outliers if the selected reference distribution 0 has heavier tails than the real distribution 09, .

2.1.2 A bivariate filter The univariate GY filter does not take into account the correlations between variables. The deviating data with values inside the normal range of every variable but disobey the pattern of relations between variables will pass the univariate GY filter, especially for the highly correlated data. To overcome this deficiency, a bivariate filter that takes variable correlations into account is developed. Let . = ,. , ,. , … ,  ,.  be a sample of two variables j and k, with ,. =  , . 

V

(neither  nor . is a missing value). The robust correlation coefficient between variables j and k is estimated by [33] WX. =

[ Y,Z  [ Y,Z 2

[ [,Z [ [,Z

(7)

where ",. and ",. are robust scale estimators of  + .  and  − . ,  =  ⁄" and

. = . ⁄". are the standardized form of  and . , and " and ". are robust scale estimators

of   and G. H. The Sn scale estimator (see Eq. (3)) is used in this paper. Two variables are 8

considered to be significantly correlated if WX. ≥ W̅ , with W̅ a predefined threshold.

For two correlated variables j and k, the location ^. and scatter matrix _. can be estimated

from . using the multivariate MM-estimator with the smoothed hard rejection (SHR) weight

function [15]. The squared robust distance of ,. is then computed by

  `,. = ,. − ^.  _. ,. − ^.  V

 , given by Let a1 ,. (3) be the empirical distribution function of `,.   a1 ,. (3) = ∑ 7 5`,. ≤ 3

(8)

(9)

Similar to the univariate GY filter, bivariate outliers are detected by comparing a1 ,. (3) with a reference distribution function a. (3). In this paper, a. is chosen as the c  distribution with two degrees of freedom, i.e., a. = c . The fraction of bivariate outliers in the sample . is 2 ℎ ,. = sup>?@Z a. (3) − a1 ,. (3)  = maxBC Da. `(),. −

 2

F

(10)

   where I. = a. (d) is the d quantile of a. , K9 = maxK: `(),. < I. , and `(),. is the order   statistics of `,. . The cutoff value for `,. is

̅ 3 ,. = min3: a1 ,. (3) ≥ 1 − ℎ ,. 

 ̅ The observations ,. with `,. ≥ 3 ,. are flagged as bivariate outliers.

(11)

2.1.3 Cellwise outlier identification The above univariate and bivariate filters are used together to identify cellwise outliers in a data matrix ∈ ℛ × . The univariate filter is first applied on each variable in X separately to flag the contaminated cells. The flagged cells are then set as missing values, denoted by NA’s (from “Not e . Let U be an auxiliary matrix of zeros and Available”), which results in an incomplete data matrix

e , that is, f = 0 if xij is NA. Next, the bivariate ones with zeros indicating the missing values in 9

e to flag bivariate filter is applied on all pairs of correlated variables (h, i) (1 ≤ h < i ≤ j) in

outliers. When a data point ,. =  , . 

V

is flagged by the bivariate filter, this does not

necessarily prove that two cells  and . are both cellwise outliers, but at least one of them is.

For the flagged bivariate points, the following criterion is used to further determine which cells are to be flagged as cellwise outliers. Criterion for determining cellwise outliers: Let Ω, = (h, i): f = 1, f. = 1, WX. ≥ W̅ 

(12)

Ω, = (h, i): ,. is a flagged by the bivariate filter

(13)

e . Let be the set of pairwise cells that are evaluated by the bivariate filter in rows i = 1, …, n of

be the set of pairwise cells flagged by the bivariate filter in rows i = 1, …, n. For each cell (i, j), we respectively count the numbers of evaluated and flagged pairs in which the cell (i, j) is involved, i.e., u, = #i: (h, i) ∈ Ω, 

u, = #i: (h, i) ∈ Ω, 

(14) (15)

The cell (i, j) is flagged as a cellwise outlier if u, > xu, , where x (typically 0.5 < x < 1) is a parameter to control the number of flagged cells. The procedure for identifying cellwise outliers in a data matrix X is summarized as follows: Step 1: standardization. For each variable j in X, compute robust location and scale estimators !

and " using Eq. (2) and Eq. (3). Standardize X to Z via Eq. (1).

Step 2: applying the univariate filter on each variable. For each variable j, compute the fraction of

̅ using Eq. (5) and Eq. (6). Flag the data cell (i, j) as an outlier outliers ℎ , and the cutoff value 3 , ̅ . Set the flagged cells to missing values, NA’s. if / / ≥ 3 , 10

Step 3: computing bivariate correlations. For any two variables j and k, compute their correlation coefficient WX. using Eq. (7). Given a threshold W̅ , label two variables as correlated if WX. ≥ W̅ . Step 4: applying the bivariate filter on correlated variables. For each pair of correlated variables j  for data points ,. =  , .  and k, use Eq. (8) to compute the squared robust distances `,.

V

without missing values (i.e., neither  nor . is NA). Compute the fraction of outliers ℎ ,. and  ̅ ̅ . the cutoff value 3 ,. using Eq. (10) and Eq. (11). Flag the bivariate point ,. if `,. ≥ 3 ,.

Step 5: flagging cellwise outliers. Based on the flagged bivariate points, count the numbers u,

and u, for each cell (i, j) using Eq. (14) and Eq. (15). Flag the cells (i, j) with u, > xu, as cellwise outliers. Set the flagged cells to missing values, NA’s. 2.2 Generalized Rocke S-estimators

z Applying cellwise outlier filters to the data matrix ∈ ℛ × yields an incomplete data matrix

with missing values that correspond to the contaminated cells. The second step then applies the z for computing the multivariate location and generalized Rocke S-estimators (GRE) [23,29] to

scatter matrix. The GRE is designed to handle the multivariate data with missing values and it is robust against the casewise outliers. Let { ∈ ℛ × be an auxiliary matrix of zeros and ones, with zeros indicating the missing entries

z . The dimension of the available part of each row xi in z is j = ∑7 f , where uij is the in

(i,j)th element of U. Given a vector | ∈ ℛ  of zeros and ones, a vector } ∈ ℛ  and a matrix ~ ∈ ℛ × , let }(|) and ~(|) denote the sub-vector of y and the sub-matrix of A corresponding to

nonzero entries in u. For a data point  ∈ ℛ  , a center  ∈ ℛ  and a scatter matrix € ∈ ℛ × , the normalized squared Mahalanobis distance of x is defined as 11

`(, , €) = ( − )V (€∗ ) ( − )

(16)

e9 be an initial estimator where € ∗ = €/|€|/ (so |€ ∗ | = 1) and |€| is the determinant of €. Let €

e9 , , z {, is defined as the solution in s of [23] of the scatter matrix. The generalized M-scale, ", €, € | 

…† ‡ ,|‡  ,€|‡  ˆ ∑ 7 ƒ W „ Y/Š Œ |  ‰Š ‹€C ‡ ‹

=  ∑ 7 ƒ

(17)

where  ∈ (0,1) controls the breakdown point, W(∙) is a loss function, and ƒ are constants for ensuring consistency under the multivariate normal distribution, which are chosen such that [23] Ž W † ‰ ˆ‘ = , ~u‰Š (’, “) ‖ ‖[



Š

(18)

A modified Rocke-ρ function [29] is used to achieve a controllable tradeoff between efficiency and robustness in higher dimensions, given by

0 for 0 ≤ f ≤ 1 − –

W(f) = ”— with

› œ3 − —

˜ ™š

˜  š

› ž +  for 1 − – < f < 1 + –

1 for f ≥ 1 + – – = min —

[ ( ) ŸŠ





, 1›

(19)

(20)

where c (1 − ¡) is the (1 − ¡) quantile of the χ2 distribution with p degrees of freedom, and θ is small and depends on p to control the efficiency. Finally, the generalized Rocke S-estimators (GRE) are given by [29]

e = arg min,€ ", €, € e9 , , z {  ¢, € z { = 1 subject to ", €, €, ,

(21)

3. Robust monitoring of industrial processes 3.1 Fault detection Denote the reference data of an industrial process as = ( ,  , … ,  )V ∈ ℛ × , where 12

 ∈ ℛ  is a sample of p variables. The raw data set may contain both missing values and

outliers. The two-step COF-GRE method is therefore applied to estimate the multivariate location  ¢

e from X. Given a sample  ∈ ℛ  , and an auxiliary vector | ∈ ℛ  of zeros and and scatter matrix € ones with zeros indicating the missing entries in x, the robust T2 statistic is defined as V e (|)  (|) −  ¥ = (|) −  ¢ (|)  € ¢ (|) 

(22)

This T2 statistic is used for fault detection by comparing the computed values with the control limit given by

¥¦ = c§ (1 − ¨)

(23)

where c§ (1 − ¨) is the (1 − ¨) (typically α is chosen to be 0.01 or 0.05) quantile of the χ2

distribution with degrees of freedom of j = ∑7 f . A sample is detected as faulty if its T2 value exceeds the control limit, i.e., ¥ > ¥¦ . 3.2 Fault diagnosis When a faulty sample is detected using the T2 statistic, the faulty variables in this sample are

identified for the purpose of fault diagnosis. It is important to note that, when applying the two-step e COF-GRE method to the reference data X, we obtain not only the location  ¢ and scatter matrix €

̅ (see Eq. (6)) for each but also other useful data information, such as the univariate cutoff value 3 ,

̅ variable and the bivariate cutoff value 3 ,. (see Eq. (11)) for each pair of correlated variables.

These data information can be used to identify the faulty variables in a faulty sample. For a faulty sample  =  ,  , … ,  , the measurement xj of variable j is standardized by  =

¢  © ª«

(24)

e respectively. The ¢ and the jth diagonal element of € where ¬̂  and Σ are the jth element of  13

̅ , and the fault magnitude of this variable is computed by variable j is identified as faulty if  > 3 , fm =

̅ [ ,9› ¯°±—²[ >U, ̅[ >U,

(25)

Note that examining the variables based solely on their own values may miss the faulty variables that disobey the pattern of relations between variables. Therefore, for the remaining variables except the identified faulty variables, we further identify faulty variables by examining the relations between variables. For each pair of correlated variables (j, k) (with WX. ≥ W̅ , and both of them are not identified as faulty by the univariate examination), the squared Mahalanobis distance is computed by V   e. . − .  `. = . −  ¢ .  €

(26)

V e. are the sub-vector of  where . =  , .  is a sub-sample of x, and  ¢ . and € ¢ and the

 e corresponding to variables j and k. The . is detected as faulty if `. ̅ . sub-matrix of € > 3 ,.

The criterion in Section 2.1.3 is then used to determine the faulty variables. In brief, we first count the numbers u, and u, for each variable j using Eq. (14) and Eq. (15) respectively, and then

¦¦¦¦ of the variable flag the variable j as faulty if u, > xu, . Finally, the average fault magnitude fm j is computed by ¦¦¦¦ = ∑´Y, ³. fm. fm .7 ³. =

fm. =

µ¯Z

¶Y,

∑Z·Y µ¯Z

[ ̅ ¯°±—…Z >U,Z ,9›

>̅U,Z



(27)

The fault magnitude of a fault-free variable is zero. The faulty variable with the largest fault magnitude is most likely the root variable responsible for the fault. Moreover, the number of faulty variables can be used to evaluate the influence scope of a fault.

14

3.3 Process monitoring procedure The procedure of the robust process monitoring method is illustrated in Fig. 1, and main steps are summarized as follows. Stage I: offline data analysis Step 1: Collect a set of reference data from an industrial process and store the data in a matrix X. Step 2: Flag the cellwise outliers in X using the univariate and bivariate filters.

z with missing values. Step 3: Delete the flagged cells in X to get an incomplete matrix

z to estimate the Step 4: Apply the generalized Rocke S-estimators to the incomplete matrix multivariate location and scatter matrix. Stage II: online process monitoring For each new sample, xnew

Step 1: Compute the robust T2 value, ¥U¸¹ , for xnew via Eq. (22).

Step 2: Compare ¥U¸¹ with the control limit ¥¦U¸¹ computed by Eq. (23). If ¥U¸¹ ≤ ¥¦U¸¹ , return back to Step 1; otherwise, detect a fault and go to Step 3. Step 3: Identify faulty variables in xnew and compute their fault magnitudes. Step 4: Determine the root variable responsible for the fault.

15

Fig. 1. Robust process monitoring procedure. 4. Case studies 4.1 A simulation example The performance of the two-step method COF-GRE is illustrated using a simulation example. A clean data set ∈ ℛ × is generated from a multivariate normal distribution u (9 , €9 ) with the

dimension p = 20 and the sample size n = 500. The mean 9 is equal to 0. The covariance matrix

€9 has entries of Σ9,. = 0.9|.| , with both high and low correlations between variables. The data

matrix X is then contaminated by cellwise and casewise outliers respectively. Cellwise outliers are added into X by randomly replacing 10% of cells by the data generated from a univariate normal distribution, u(º, 0.1 ), with the contamination value λ of 1, 1.5, 2, 2.5, 3, …, 10. Casewise outliers are added into X by randomly replacing 10% of rows by the data generated from a multivariate normal distribution, u(ƒ», 0.1 “), where ƒ = ¼ºc  (0.99) with λ = 0.5, 1, 1.5, 2, …, 10 and v 

is the last eigenvector (corresponding to the smallest eigenvalue) of the matrix €9 with the length 16

rescaled to meet (» − 9 )V €9 (» − 9 ) = 1. Moreover, noisy data are generated by adding white Gaussian noise to the clean data set X, with the signal noise ratio (SNR) in the range between 30dB and 0dB. Then, 10% of the noisy data are replaced by above cellwise and casewise outliers respectively, with the contamination value λ of 4. The two-step COF-GRE method is applied to estimate the scatter matrix from the contaminated data. The parameter γ of the univariate filter is set as 0.95. Two parameters of the bivariate filter are set as d = 0.95 and W̅ = 0.6. In the criterion for determining cellwise outliers, the parameter x is set as 0.5. The performance of COF-GRE is compared with other three methods: (1) the UBF-DDC-GRE method of Leung et al. [29] that first uses the UBF-DDC method to remove cellwise outliers and then uses the GRE to estimate the scatter matrix, (2) the iterated reweighted minimum covariance determinant (IRMCD) method of Cerioli [34], and (3) the GY-GRE method that first uses the univariate GY filter [28] to flag and remove cellwise outliers and then uses the GRE to estimate the scatter matrix. For these three methods, the default values of parameters are used. The likelihood ratio test (LRT) distance is used to evaluate how far the estimated scatter matrix

e is from the true covariance matrix €9 , given by €

e€9  − log/€ e€9 / − j LRT = trace€

(28)

A smaller LRT distance represents a better estimation. Fig. 2(a) shows the LRT distances of four methods for different contamination values λ under the cellwise contamination. When º < 3, the COF-GRE method has better performance than other three methods. Both COF-GRE and UBF-DDC-GRE perform better than GY-GRE, because the univariate GY filter does not take into account the correlations between variables. When º ≥ 3 , the 17

performance of COF-GRE, UBF-DDC-GRE and GY-GRE is quite similar. The IRMCD method has good performance when º is small, but it tends to break down (i.e., yielding very high LRT

distances) as the contamination value º increases. The IRMCD method can only handle the

casewise outliers. It excludes too many data cells when the contamination value º is large (if a row

of X is identified as a casewise outlier, all cells in this row will be excluded). This leads to the breakdown of the IRMCD method. Unlike the IRMCD method, COF-GRE, UBF-DDC-GRE and GY-GRE are designed to handle the cellwise outliers. They only exclude the contaminated cells in X. Therefore, they yield reliable estimations even in the cases of a large contamination value º.

Fig. 2. LRT distances of four methods for different contamination values. (a) cellwise contamination, (b) casewise contamination. Fig. 2(b) shows that COF-GRE, UBF-DDC-GRE and GY-GRE have similar performance under the casewise contamination. Their LRT distances are slightly larger for contamination values of 0.5 and 1.0, but they have very small LRT distances when contamination values are larger than 1.5. The IRMCD method yields the largest LRT distances for all contamination values, and its LRT distance increases with increasing the contamination value. This indicates that COF-GRE, UBF-DDC-GRE and GY-GRE also outperform IRMCD in the condition of casewise contamination. The main reason 18

is that the generalized Rocke S-estimators have higher efficiency than the IRMCD estimators. The performance of four methods is further tested with the noisy data containing outliers. Fig. 3 and Fig. 4 show the LRT distances of four methods for different SNR values under the cellwise and casewise contaminations respectively. In two figures, LRT-estimated/true denotes the LRT distance between the estimated scatter matrix and the true covariance matrix of clean data (without noise and outliers), and LRT-estimated/noisy denotes the LRT distance between the estimated scatter matrix and the covariance matrix of noisy data (without outliers). Clearly, the COF-GRE, UBF-DDC-GRE and GY-GRE methods have comparable performance, but they are far better than the IRMCD method. As shown in Fig. 3(a) and Fig. 4(a), as the SNR increases (i.e., the decrease of noise level), the LRT-estimated/true distance decreases quickly. Especially for the SNR values larger than 20dB, the values of LRT-estimated/true are very close to the noise-free case. Fig. 3(b) and Fig. 4(b) show that, for COF-GRE, UBF-DDC-GRE and GY-GRE, the LRT-estimated/noisy distance is very small and it is less affected by the noise level, especially in the cases of low noise (SNR ≥ 10dB). However, for the IRMCD method, the LRT-estimated/noisy distance increases as the SNR increases. The results in Fig. 3(b) and Fig. 4(b) indicate that the noise has little influence on the cellwise outlier filters used in COF-GRE, UBF-DDC-GRE and GY-GRE, but the outlier detection performance of IRMCD is strongly affected by the noise. Since Fig 3(b) and Fig. 4(b) already show that the outliers have been well filtered by the cellwise outlier filters, it is easy to know that the larger LRT distances in Fig 3(a) and Fig. 4(a) for smaller SNR values (SNR < 20 dB) are mainly due to the high noise rather than outliers.

19

Fig. 3. LRT distances for the noisy data with different SNR values under the cellwise contamination.

Fig. 4. LRT distances for the noisy data with different SNR values under the casewise contamination. 4.2 An industrial case study 4.2.1 Process description The performance of the robust process monitoring method is illustrated by the Tennessee Eastman (TE) process simulation [35], a well-known benchmark and realistic representation of a real industrial process. Fig. 5 shows a flow sheet of the TE process. This process consists of a reactor/separator/recycle arrangement. It involves eight chemical components [35]: four gas reactants (A, C, D, and E), an inert (B), two liquid products (G and H), and a liquid byproduct (F). Downs and 20

Vogel [35] have built a simulation model for the TE process. The benchmark TE simulation data, downloaded from http://web.mit.edu/braatzgroup/links.html, are used in this case study. These data contain a reference data set and 21 faulty data sets generated in the normal and 21 faulty operating conditions respectively. Each data set has 960 continuous samples of 52 process variables. The data of 33 process variables in Table 1 are used. In each faulty data set, a process disturbance in Table 2 was introduced to the TE process at the 160th sample to create faulty operations. The reference data set is contaminated by randomly replacing 10% measurements of each variable with outliers.

21

Fig. 5. Flow sheet of the Tennessee Eastman process [1]. 22

No.

v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 v11 v12 v13 v14 v15 v16 v17

Variable A feed rate (stream 1) D feed rate (stream 2) E feed rate (stream 3) A and C feed rate (stream 4) Recycle flow rate (stream 8) Reactor feed rate (stream 6) Reactor pressure Reactor level Reactor temperature Purge rate (stream 9) Product separator temperature Product separator level Product separator pressure Product separator underflow Stripper level Stripper pressure Stripper underflow (stream 11)

Table 1. Monitored variables in the TE process. Units No. Variable kscmh Stripper temperature v18 -1 kg h Stripper steam flow v19 -1 kg h Compress work v20 kscmh Reactor cooling water outlet temperature v21 kscmh Condenser cooling water outlet temperature v22 kscmh D feed flow valve (stream 2) v23 kPa E feed flow valve (stream 3) v24 % A feed flow valve (stream 1) v25 A and C feed flow valve (stream 4) ℃ v26 kscmh Compressor recycle value v27 Purge valve (stream 9) ℃ v28 % Separator pot liquid flow valve (stream 10) v29 kPa Stripper liquid product flow valve (stream 11) v30 3 -1 m h Stripper steam valve v31 % Reactor cooling water flow valve v32 kPa Condenser cooling water flow valve v33 3 -1 m h

23

Units ℃ kg h-1 kW ℃ ℃ % % % % % % % % % % %

Table 2. Disturbances in the TE process [35]. No. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21

Variable A/C feed ratio, B composition constant (stream 4), B composition, A/C feed ratio constant (stream 4) D feed temperature (stream 2) Reactor cooling water inlet temperature Condenser cooling water inlet temperature A feed loss (stream 1) C header pressure loss-reduced availability (stream 4) A, B, C feed composition (stream 4) D feed temperature (stream 2) C feed temperature (stream 4) Reactor cooling water inlet temperature Condenser cooling water inlet temperature Reaction kinetics Reactor cooling water valve Condenser cooling water valve Unknown Unknown Unknown Unknown Unknown The valve for stream 4 was fixed at the steady state position

Type Step Step Step Step Step Step Step Random Random Random Random Random Slow drift Sticking Sticking Unknown Unknown Unknown Unknown Unknown Constant

4.2.2 Fault detection results The two-step COF-GRE method is applied on the contaminated reference data to first remove

e of reference data. In the cellwise outliers and then to estimate the location  ¢ and scatter matrix €

univariate filter, the parameter γ is set as 0.99. Two parameters of the bivariate filter are set as d = 0.99 and W̅ = 0.6. In the criterion for determining cellwise outliers, the parameter x is set as e, the robust T2 statistic is defined by Eq. (22) and used for 0.5. Based on the estimators  ¢ and €

detecting faulty samples in faulty data sets. The control limit of the T2 statistic is determined by Eq.

(23) with α chosen to be 0.01. The COF-GRE-based robust monitoring method (shorten to COF-GRE-T2) is compared with other two methods: a non-robust method called NR-T2, and a robust method based on a casewise outlier filter, called IRMCD-T2. The NR-T2 method uses the classical T2 24

statistic that is computed using the empirical mean and covariance matrix of the contaminated reference data. In the IRMCD-T2 method, a casewise outlier filter called IRMCD [34] is first applied to eliminate outliers from the contaminated data, and then the classical T2 statistic is used for fault detection. Fault detection performance is quantified by the fault detect rate (FDR) and false alarm rate (FAR). FDR is defined as the percentage of faulty samples that are successfully detected by the T2 statistic. FAR is defined as the percentage of normal samples that are wrongly detected as faulty. Table 3 shows the detection results of three methods for 21 faults. The COF-GRE-T2 clearly detected 18 faults, except that the FDRs for faults 3, 9 and 15 are very low due to the weak effect of these faults on the process variables. The IRMCD-T2 detected 16 faults (except for faults 3, 4, 9, 15 and 19), and the NR-T2 only detected 12 faults (except for faults 3, 4, 9-11, 15, 16, 19 and 21). For the easily detectable faults, such as faults 1, 2, 6-8, 13 and 18, FDRs of IRMCD-T2 and NR-T2 are comparable to or slightly lower than that of COF-GRE-T2. For the difficult-to-detect faults, such as faults 4, 10, 11, 16, 19 and 20, the COF-GRE-T2 has much higher FDRs than IRMCD-T2 and NR-T2. The NR-T2 and IRMCD-T2 have poor fault detection performance due to the cellwise contamination in the reference data. The IRMCD-T2 is slightly better than the NR-T2, because a part of contaminated data cells were eliminated from the reference data using the IRMCD filter. However, unlike the COF-GRE method, the IRMCD filter was designed to cope with the casewise outliers and it cannot handle well the cellwise outliers. For this reason, the COF-GRE-T2 performed better than the IRMCD-T2.

25

2

Fault 1 2 3 4 5 6 7 8 9 10 11

COF-GRE-T FDR FAR 100 2 99 0 4 3 100 1 100 1 100 0 100 1 98 2 4 3 87 1 81 2

Table 3. Detection results for 21 faults. IRMCD-T2 NR-T2 COF-GRE-T2 IRMCD-T2 NR-T2 Fault FDR FAR FDR FAR FDR FAR FDR FAR FDR FAR 12 99 0 99 0 100 1 75 0 49 0 13 98 0 97 0 95 1 88 0 85 0 14 0 0 0 0 100 2 79 0 29 0 15 0 0 0 0 5 2 0 0 0 0 16 17 0 13 0 88 3 3 0 0 0 17 100 0 100 0 97 3 42 0 13 0 18 97 0 75 0 90 2 87 0 86 0 19 89 0 77 0 92 1 0 0 0 0 20 0 0 0 0 90 1 21 0 15 0 21 3 0 0 0 55 3 29 0 1 0 19 0 0 0

Fig. 6 and Fig. 7 show the monitoring charts of three methods for faults 5 and 11, which illustrate how the COF-GRE-T2 outperforms NR-T2 and IRMCD-T2. In Fig. 6(a), after the fault 5 occurred at the 160th sample, the COF-GRE-T2 shows a sharp increase in its value immediately, and all the subsequent samples have large T2 values outside the control limit. In Fig. 6(b) and Fig. 6(c), T2 values of most samples are very small, and only a few samples (between the 181st sample and the 220th sample) have large T2 values that exceed the control limit. Fig. 7(a) shows that the fault 11 was detected by the COF-GRE-T2 at the 166th sample, and more than 80% of faulty samples were detected. However, no faulty sample was detected by the NR-T2 (see Fig. 7(c)), and the IRMCD-T2 detected only about 20% of faulty samples (see Fig. 7(b)). Both NR-T2 and IRMCD-T2 suffer from the masking effect of outliers, that is, outliers in the reference data affect the mean and covariance of the data such that some faulty samples may have small T2 values and hence cannot be detected. Compared to the NR-T2, the IRMCD-T2 is able to reduce the masking effect by eliminating a part of outliers from the contaminated reference data. Therefore, the IRMCD-T2 detected more faulty samples than the NR-T2. The COF-GRE-T2 eliminates the masking effect, because it is robust against 26

the cellwise outliers.

Fig. 6. Monitoring chars for fault 5. (a) COF-GRE-T2, (b) IRMCD-T2, (c) NR-T2. 27

Fig. 7. Monitoring chars for fault 11. (a) COF-GRE-T2, (b) IRMCD-T2, (c) NR-T2.

28

4.2.3 Fault diagnosis results Variables responsible for a detected fault are identified using the method in Section 3.2. Based on the identified faulty variables, the root cause of a fault is further diagnosed. Faults 4 and 6 are used as two examples to illustrate the implementation and effectiveness of the fault diagnosis method. First, consider fault 4 that was caused by the step change of the reactor cooling water inlet temperature. Fig. 8(a) shows fault magnitudes of all variables for the 161st sample, at which the fault 4 was first detected. Two variables, v9 (reactor temperature) and v32 (reactor cooling water flow valve), were identified as faulty, because both of them have large fault magnitudes. Since these two faulty variables are closely related to the reactor cooling water system, we may infer that fault 4 was caused by disturbances in the reactor cooling water system. This diagnosis is consistent with the actual cause of fault 4. Fig. 8(b) shows fault magnitudes of 33 monitored variables for all faulty samples. Clearly, the variable v32 has the largest fault magnitude at almost all faulty samples. The other variables were less affected by fault 4. Therefore, fault 4 is a local fault with the influence focused on the reactor cooling water system. This confirms that an industrial process itself, due to the intrinsic fault tolerance ability, is able to limit the influence of some faults in a local region or compensate for faults to make them harmless [36]. For fault 4 in the TE process, the fault tolerance ability comes from the control loop for the reactor temperature that consists of v9, v21 and v32 (see Fig. 5).

29

Fig. 8. Fault magnitudes of 33 monitored variables for fault 4 at (a) the 161st sample, and (b) all faulty samples. For each sample in Fig. 8(b), fault magnitudes of variables are scaled to between 0 and 1 by dividing the largest value. Next, consider fault 6 that was caused by the A feed loss. Fig. 9 shows fault magnitudes of variables for all faulty samples of fault 6. When fault 6 was first detected at the 161st sample, two variables, v1 (A feed rate) and v25 (A feed flow valve), have large fault magnitudes (see Fig. 9(a)), and hence they were identified as faulty variables responsible for fault 6. Note that v1 and v25 constitute the control loop for the A feed rate (see Fig. 5). Based on this meaningful variable connection, we may infer that fault 6 was caused by changes of the A feed rate. Again, this diagnosis aligns well with the actual cause of fault 6. Different from fault 4, fault 6 is a large-scale fault that has significant influence on many process variables (see Fig. 9(b)). This explains why fault 6 was easily detected by both IRMCD-T2 and NR-T2 but fault 4 was not detected (see Table 3). Moreover, Fig. 9(b) shows that fault 6 has the largest influence on variables v1 and v25 at first, and then three pressure variables v7, v13 and v16 were affected most by fault 6, and finally variables v32 and v27 have larger fault magnitudes. This indicates that, due to the complicated interactions between variables, a fault may propagate/shift from one variable to other correlated variables or even spreads across the 30

entire process. The fault propagation/shift in industrial processes increases the difficulty of fault diagnosis, particularly for faults detected with longer time delays after they occurred. Therefore, the quick and accurate detection of faults before the fault propagation/shift is very important to get reliable fault diagnosis results.

Fig. 9. Fault magnitudes of 33 monitored variables for fault 6 at (a) the 161st sample, and (b) all faulty samples. For each sample in Fig. 9(b), fault magnitudes of variables are scaled to between 0 and 1 by dividing the largest value. Table 4 shows the identified faulty variables for 18 clearly detected faults, with the exception of faults 3, 9 and 15. It is important to note that the real faulty variables responsible for most of faults (except for faults 6, 14, 15 and 21) are neither measured nor monitored (see Table 1 and Table 2). Therefore, for these faults, it is impossible to identify the real faulty variables directly using any fault diagnosis methods. This brings great difficulty to fault diagnosis. Nevertheless, the proposed method yields reliable diagnosis results for most of faults. For faults 4-7, 10-12 and 14, the identified faulty variables (in Table 4) are either exactly the real faulty variables or closely related to the real faulty variables (in Table 2). Therefore, as we illustrated above using faults 4 and 6 as two examples, it is possible to use the identified faulty variables to infer the real causes of these faults. For fault 1, two 31

variables, v16 (stripper pressure) and v27 (compressor recycle value), are identified as the faulty variables. This implies that fault 1 is related to the process fluctuations in the stripper, which is most likely caused by changes of the stream 4 that feeds three gases (A, B and C) to the stripper. This diagnosis is close to the real cause of fault 1 in Table 2. For faults 2 and 8, variables v10 (purge rate) and v28 (purge valve) are identified as two faulty variables. Note that v10 and v28 are controlled by the mole fraction of B component in the purge gas (see Fig. 5). Because the B component is an inert fed by the stream 4, it is natural to infer that faults 2 and 8 may be caused by changes of B component in the feed stream 4. This diagnosis is also close to the real causes of faults 2 and 8 in Table 2. Because the root causes of faults 16–20 are unknown (Table 2), it is hard to determine whether the identified faulty variables for these faults are reliable or not. For faults 13 and 21, the identified faulty variables are far from the real faulty variables. This may be due to two reasons: (1) the real faulty variable of fault 13 is not monitored, and (2) faults 13 and 21 were detected with long time delays after they occurred, so that their effects have permeated other variables before they were detected. Table 4. Identified faulty variables responsible for faults. Fault FDT Faulty variable Fault FDT Faulty variable 1 163 12 162 v22 v16, v27 2 173 13 198 v10, v28 v7, v16, v24 3 / / 14 161 v9, v32 4 161 15 / / v9, v32 5 161 v22 16 171 v31 6 161 17 180 v21 v1, v25 7 161 v4 18 241 v22 8 176 v10, v11, v22, v28 19 162 v20 9 / / 20 223 v22 10 185 v18 21 441 v31 11 166 v9, v32 FDT (fault detection time) is the sample at which the fault was first detected. The variable with the largest fault magnitude is highlighted in bold.

32

5. Conclusions A two-step method, called COF-GRE, is proposed for the robust estimation of location and scatter matrix of the multivariate data with outliers and missing values. In the first step, a univariate filter is used together with a bivariate filter to flag and remove the cellwise outliers in a multivariate data matrix. The second step applies the generalized Rocke S-estimators, which can handle missing values and are robust against the casewise outliers at the same time, to an incomplete data matrix for computing the multivariate location and scatter matrix. As shown by a simulation example, the COF-GRE method has high robustness against both cellwise and casewise outliers, and therefore it yields reliable results for the multivariate data contaminated by outliers. Based on the COF-GRE method, a robust process monitoring method is developed, which addresses a common problem in actual industrial processes: the presence of outliers and missing values in the process data. A fault detection index, called the robust T2 statistic, is defined using the multivariate location and scatter matrix obtained by COF-GRE. A fault diagnosis method is also proposed to identify faulty variables and compute their fault magnitudes. We show via the TE process simulation that the robust T2 statistic has superior fault detection performance although the reference data contain a large amount of outliers, whereas the classical T2 statistic has poor performance due to the masking effect of outliers. It is also shown that the proposed fault diagnosis method yields reliable diagnosis results. An incomplete data matrix with missing values is obtained after using the proposed outlier filters to remove cellwise outliers. However, most of existing fault detection and diagnosis methods cannot handle this incomplete data matrix. To address this issue, a topic for further research is to develop 33

efficient methods (such as the expectation maximization algorithm) to fill the missing data in the incomplete data matrix by their predicted values. Acknowledgements This study was supported by the National Natural Science Foundation of China (no. 61304116). References [1] L. Luo, S. Bao, J. Mao, D. Tang, Fault detection and diagnosis based on sparse PCA and two-level contribution plots, Ind. Eng. Chem. Res. 56 (2017) 225−240. [2] S.J. Qin, Survey on data-driven industrial process monitoring and diagnosis, Annu. Rev. Control 36 (2012) 220–234. [3] J.F. MacGregor, A. Cinar, Monitoring, fault diagnosis, fault tolerant control and optimization: Data driven methods, Comput. Chem. Eng. 47 (2012) 111−120. [4] Z. Ge, Review on data-driven modeling and monitoring for plant-wide industrial processes, Chemom. Intell. Lab. Syst. 171 (2017) 16–25. [5] Z. Ge. Process data analytics via probabilistic latent variable models: A tutorial review, Ind. Eng. Chem. Res. 57 (2018) 12646–12661. [6] L. Luo, S. Bao, J. Mao, Adaptive selection of latent variables for process monitoring, Ind. Eng. Chem. Res. 58 (2019) 219075–9086. [7] P. Nomikos, J.F. MacGregor, Monitoring batch processes using multiway principal component analysis, AIChE J. 40 (1994) 1361−1375. [8] L. Luo, Process monitoring with global-local preserving projections, Ind. Eng. Chem. Res. 53 (2014) 7696–7705. [9] M. Kano, S. Tanaka, S. Hasebe, I. Hashimoto, H. Ohno, Monitoring independent components for fault detection, AIChE J. 49 (2003) 969−976. [10] B. Cai, L. Huang, M. Xie, Bayesian networks in fault diagnosis, IEEE Trans. Ind. Inform. 13 (2017) 2227−2240. [11] H. Yu, F. Khan, V. Garaniya, Modified independent component analysis and bayesian network-based two-stage fault diagnosis of process operations, Ind. Eng. Chem. Res. 54 (2015) 102724−2742. [12] J. Yu, M.M. Rashid, A novel dynamic bayesian network-based networked process monitoring approach for fault detection, propagation identification, and root cause diagnosis, AIChE J. 59 (2013) 2348−2365. [13] S. Bao, L. Luo, J. Mao, D. Tang, Z. Ding, Robust monitoring of industrial processes in the presence of outliers in training data, Ind. Eng. Chem. Res. 57 (2018) 8230−8239. [14] M. Daszykowski, K. Kaczmarek, Y.V. Heyden, B. Walczak, Robust statistics in data analysis-A review basic concepts, Chemom. Intell. Lab. Syst. 85 (2007) 203−219. [15] R.A. Maronna, V.J. Yohai, Robust and efficient estimation of multivariate scatter and location, Comput. Stat. Data An. 109 (2017) 64–75 34

[16] P.L. Davies, Asymptotic behaviour of S-estimates of multivariate location parameters and dispersion matrices, Ann. Statist. 15 (1987) 1269–1292. [17] R.A. Maronna, Robust M-estimators of multivariate location and scatter, Ann. Statist. 4 (1976) 51−67. [18] P.J. Rousseeuw, A.M. Leroy, Robust regression and outlier detection, John Wiley & Sons Inc, New York, 1987. [19] S. Serneels, T. Verdonck, Principal component analysis for data containing outliers and missing elements, Comput. Stat. Data An. 52 (2008) 1712−1727. [20] J. Chen, J.A. Bandoni, J.A. Romagnoli, Robust PCA and normal region in multivariate statistical process monitoring, AIChE J. 42(12) (1996) 3563−3566. [21] Y. Tharrault, G. Mourot, J. Ragot, Fault detection and isolation with robust principal component analysis, In 16th Mediterranean Conference on Control and Automation, Congress Centre, Ajaccio, France, 2008. [22] L. Luo, S. Bao, C. Tong, Sparse robust principal component analysis with applications to fault detection and diagnosis, Ind. Eng. Chem. Res. 58 (2019) 1300–1309. [23] M. Danilov, V.J. Yohai, R.H. Zamar, Robust estimation of multivariate location and scatter in the presence of missing data, J. Am. Stat. Assoc. 107(499) (2012) 1178−1186. [24] R.J.A. Little, D.B. Rubin, Statistical analysis with missing Data (2nd ed.), Wiley, New York, 2002. [25] M. Templ, A. Kowarik, P. Filzmoser, Iterative stepwise regression imputation using standard and robust methods, Comput. Stat. Data An. 55 (2011) 2793–2806. [26] P.J. Rousseeuw, W. Van Den Bossche, Detecting deviating data cells, Technometrics 60(2) (2018) 135–145. [27] F. Alqallaf, S. Van Aelst, V.J. Yohai, R.H. Zamar, Propagation of outliers in multivariate data, Ann. Statist. 37 (2009) 311–331. [28] C. Agostinelli, A. Leung, V.J. Yohai, R.H. Zamar, Robust estimation of multivariate location and scatter in the presence of cellwise and casewise contamination, TEST 24 (2015) 441–461. [29] A. Leunga, V.J. Yohai, R.H. Zamar. Multivariate location and scatter matrix estimation under cellwise and casewise contamination, Comput. Stat. Data An. 111 (2017) 59–76. [30] D. Gervini, V.J. Yohai, A class of robust and fully efficient regression estimators. Ann. Stat. 30(2) (2002) 583–616. [31] P.J. Rousseeuw, C. Croux, Alternatives to median absolute deviation, J. Am. Stat. Assoc. 88 (1993) 1273–1283. [32] I.K. Yeo, R.A. Johnson, A new family of power transformations to improve normality or symmetry, Biometrika 87 (2000) 954–959. [33] R. Gnanadesikan, J.R. Kettenring, Robust estimates, residuals, and outlier detection with multiresponse data, Biometrics 28 (1972) 81–124. [34] A. Cerioli, Multivariate outlier detection with high-breakdown estimators, J. Am. Stat. Assoc. 105(489) (2010) 147−156. [35] J.J. Downs, E.F. Vogel, A plant-wide industrial process control problem, Comput. Chem. Eng. 17 (1993) 245−255. [36] L. Luo, R.J. Lovelett, B.A. Ogunnaike, Hierarchical monitoring of industrial processes for fault 35

detection, fault grade evaluation, and fault diagnosis, AIChE J. 63(7) (2017) 2781–2795.

36

Highlights 1. A method is proposed for the robust estimation of multivariate location and scatter matrix in the presence of outliers and missing data. 2. Cellwise outlier filters are developed to identify the contaminated cells in a data matrix. 3. The generalized Rock S-estimators are used to handle casewise outliers and missing data. 4. A method is proposed for process monitoring when the data contain outliers and missing values. 5. A fault diagnosis method is proposed to determine faulty variables and their fault magnitudes.