10th IFAC Symposium on Fault Detection, Supervision and Safety for Technical Processes
Warsaw, Poland, August 29-31, 2018
Available online at www.sciencedirect.com
ScienceDirect
IFAC PapersOnLine 51-24 (2018) 1311–1316

An Information-Theoretic Framework for Fault Detection Evaluation and Design of Optimal Dimensionality Reduction Methods

Benben Jiang*,§, Weike Sun*,§, Richard D. Braatz*

*Massachusetts Institute of Technology, Cambridge, MA 02139, USA (e-mail: {bbjiang, vickysun, braatz}@mit.edu)
§ These authors contributed equally to this work
Corresponding author: R.D. Braatz. Tel.: +1-617-253-3112; fax: +1-617-258-5042.

Abstract: Data-based fault detection is a growing area, with various dimensionality reduction techniques being most commonly used in the manufacturing industries. The evaluation among these methods is generally based on false alarm rate and fault detection rate comparisons given a specific dataset. This article aims to propose a universal criterion for the evaluation of different fault detection approaches. To this end, an information-theoretic framework is presented that embeds the fault detection problem into an information-theoretic point of view. The basis for fault detection evaluation is then established in terms of the information contained in the extracted feature space. The developed theory shows that mutual information is not merely another performance index that may be useful in some problems, but rather a universal indicator of how well fault detection methods can perform: the larger the information preserved in the extracted features by a dimensionality reduction technique, the better the fault detection performance. The framework is used to derive an optimal iso-information transformation matrix for dimensionality reduction methods for fault detection, which is demonstrated in the application of principal component analysis and canonical variate analysis to an oscillatory process with random bias.

© 2018, IFAC (International Federation of Automatic Control) Hosting by Elsevier Ltd. All rights reserved.
2405-8963. Peer review under responsibility of International Federation of Automatic Control.
10.1016/j.ifacol.2018.09.565

Keywords: Fault detection; Process monitoring; Data-driven method; Dimensionality reduction technique; Information theory.

1. INTRODUCTION

Fault detection is important in commercial systems for providing alerts on small abnormal conditions (aka faults) before their propagation into catastrophic incidents. Data-based fault detection methods have been widely applied to the manufacturing and other industries given that only historical process data are needed and their implementation cost is low relative to other methods. Standard data-based fault detection includes feature extraction and decision as shown in Fig. 1 (Venkatasubramanian et al., 2003), which can be viewed as a series of transformations on the process data. The methods most commonly applied in practice are linear, in which case the measurement vector that is the input to the process monitoring system can be represented as a linear mapping of the system states ($x_1, x_2, \ldots$) plus additive sensor noise. The data-based dimensionality reduction methods then extract useful features ($\hat{x}_1, \hat{x}_2, \ldots$) from the measurement space based on prior knowledge and different extraction principles to facilitate fault detection. Feature extraction approaches including principal component analysis (PCA) (Hotelling, 1933; Severson et al., 2016), partial least squares (PLS) (Wold, 1984), and canonical variate analysis (CVA) (Larimore, 1997; Russell et al., 2000; Jiang et al., 2015ab) have been used in the feature extraction for fault detection. The major trends in the data are extracted using a small number of relevant factors by these methods. The transformation from feature to decision is usually based on the evaluation of the designed functions for the extracted features. For example, a popular approach is to compare the squared norm of the residual for a given online measurement with a design threshold established by normal operating data.

Fig. 1. Fault detection problem. [Block diagram labels: Source, Measurement, Additive Noise, Feature Extraction, Decision; signals $x_1, x_2, \ldots$ and $\hat{x}_1, \hat{x}_2, \ldots$]

Fig. 2. Shannon's information theory problem (Galdos and Gustafson, 1977; Cover and Thomas, 2012). [Block diagram labels: Source, Encoder, Channel, Additive Noise, Decoder, User; signals $x_1, x_2, \ldots$ and $\hat{x}_1, \hat{x}_2, \ldots$]

Shannon's information theory problem, as shown in Fig. 2, uses probability theory as a basis for the extraction of information from signals corrupted with noise (Shannon, 1959; Cover and Thomas, 2012). The literature on fault detection and the literature on information theory have grown largely independently, although the parallels can be seen by comparison of Figs. 1 and 2. A few existing feature extraction methods utilize some knowledge from information theory to build their feature selection principles, and many of such methods belong to the supervised pattern recognition problem (Joshi et al., 2005; Verron et al., 2008). In these cases, mutual information is used as an index to select the features that are closely related to the class labels. In
unsupervised data-based dimensionality reduction, independent component analysis uses the minimization of mutual information as a criterion to extract independent components (Hyvärinen, 2004; Yu et al., 2013). Enhanced detection performance was observed for an electrical motor system by utilizing the extracted independent components (Parra et al., 1996), but no direct theoretical connection was made towards quantifying the best theoretically achievable fault detection performance. While concepts from information theory have been used by some researchers looking for improved features (Friston, 2010; Takashi et al., 2017), most researchers in fault detection have not incorporated results from information theory into their research.

This article argues that information theory not only facilitates feature extraction but is also a valuable indicator for the evaluation and design of fault detection methods. Given the information loss during feature extraction, the fault detection performance can be evaluated and compared directly without referring to false alarm rates or fault detection rates, which builds the foundation for further study on designing optimal dimensionality reduction techniques for fault detection.

The article is organized as follows. Section 2.1 discusses a theoretical analysis of the fault detection problem related to residual generation. This analysis establishes a criterion that lays a foundation for fault detection evaluation. Section 2.2 presents an information-theoretic formulation for fault detection evaluation and the theory that relates the mutual information to fault detection evaluation based on this framework. Section 3 provides a numerical example for fault detection that compares two widely used dimensionality reduction methods, PCA and CVA, to illustrate the developed concepts. The conclusions are in Section 4.
2. METHOD

2.1 Evaluation Framework for the Fault Detection Problem

This section briefly introduces the relevant concepts for residual generation and fault detection used in subsequent sections. Consider a linear dynamical system modeled by

\[ x(t+1) = A x(t) + B u(t) + w(t), \tag{1} \]

\[ y(t) = C x(t) + D u(t) + v(t), \tag{2} \]

where $w$ and $v$ are process and measurement noise; $u \in \mathbb{R}^{n_u}$ denotes the input signals, $x \in \mathbb{R}^{n_x}$ are state variables, $y \in \mathbb{R}^{n_y}$ are output variables, and $n_\alpha$ denotes the number of elements in the vector $\alpha$.

First a worst-case formulation is presented. Given the measured values for $y$ and $u$, the states $x$ can be estimated as $\hat{x}$ based on a specified criterion, such as mean-squared error (MSE), and the corresponding residual is defined as

\[ r(t) = x(t) - \hat{x}(t). \tag{3} \]
From Ding (2014), the model residual can be decomposed as
\[ r(t) = r_0(t) + r_f(t) = x_0(t) - \hat{x}_0(t) + r_f(t), \tag{4} \]

where $r_0(t) = x_0(t) - \hat{x}_0(t)$ is the nominal residual induced by measurement noise and process noise (disturbances), and $r_f(t)$ is the effect of faults on the system residual.

A fault detection statistic based on residuals is defined by

\[ R = \|r\| = \|x - \hat{x}\|. \tag{5} \]
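As an illustration of (1), (2), (3), and (5), the following sketch simulates a small system and records the residual statistic $R$ at each time step. The fixed observer gain `L_gain`, the function name `simulate_residual_norms`, and the noise model (independent Gaussian noise with a common standard deviation) are assumptions made for this example and are not taken from the article; in particular, a fixed-gain observer is used here in place of an MSE-optimal estimator. Because (3) compares the estimate with the true state, the statistic can only be evaluated this way in simulation.

```python
import numpy as np

def simulate_residual_norms(A, B, C, D, L_gain, u, x0, xhat0,
                            w_std=0.0, v_std=0.0, seed=0):
    """Simulate (1)-(2) and return R(t) = ||x(t) - xhat(t)|| as in (3) and (5).

    u is an array of shape (T, n_u); L_gain is a hypothetical fixed
    observer gain of shape (n_x, n_y) chosen by the user.
    """
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    xhat = np.array(xhat0, dtype=float)
    norms = []
    for u_t in u:
        u_t = np.atleast_1d(u_t)
        w = w_std * rng.standard_normal(x.shape)      # process noise
        v = v_std * rng.standard_normal(C.shape[0])   # measurement noise
        y = C @ x + D @ u_t + v                       # measurement equation (2)
        norms.append(np.linalg.norm(x - xhat))        # statistic (5)
        innovation = y - (C @ xhat + D @ u_t)
        xhat = A @ xhat + B @ u_t + L_gain @ innovation  # observer update
        x = A @ x + B @ u_t + w                       # state equation (1)
    return np.array(norms)

if __name__ == "__main__":
    # Hypothetical two-state example; all numbers are illustrative only.
    A = np.array([[0.9, 0.1], [0.0, 0.8]])
    B = np.array([[0.0], [1.0]])
    C = np.array([[1.0, 0.0]])
    D = np.array([[0.0]])
    L_gain = np.array([[0.5], [0.1]])
    u = np.ones((100, 1))
    R = simulate_residual_norms(A, B, C, D, L_gain, u,
                                x0=[1.0, -1.0], xhat0=[0.0, 0.0],
                                w_std=0.05, v_std=0.05)
    print("largest nominal residual norm:", R.max())
```

With the noise switched off and identical initial conditions, the estimate tracks the state exactly and $R$ stays at zero, which matches the remark later in this section that accurate state estimation drives the threshold toward zero.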
A fault detection method should be insensitive to noise and disturbances while remaining sensitive to faults. A tradeoff in the design of fault detection methods is that increasing the sensitivity to faults tends to increase the sensitivity to noise and disturbances. In other words, an increase in the fault detection rate is usually associated with an increase in the false alarm rate. A practically useful specification for a fault detection method is to maximize the fault detection rate at a fixed false alarm rate. For example, if no false alarms are allowed, then the fault detection threshold can be set as the maximum value over all of the nominal residuals,

\[ J_{th} = \sup_{f = 0,\, w,\, v} \|r_0 + r_f\| = \sup_{w,\, v} \|r_0\|. \tag{6} \]

In this case, the fault is detected only if

\[ R = \|r_0 + r_f\| > J_{th}. \tag{7} \]

The set of faults $\Omega_f$ whose detection can be ensured satisfies

\[ \inf_{f \in \Omega_f,\, w,\, v} \|r_0 + r_f\| > J_{th}. \tag{8} \]

A lower bound on the fault residual for faults within the set $\Omega_f$ is given by (Emami-Naeini et al., 1988)

\[ \inf_{f \in \Omega_f} \|r_f\| > 2 J_{th}. \tag{9} \]
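The threshold in (6) and the decision rule in (7) can be sketched numerically as follows. This is a minimal illustration under stated assumptions: the supremum in (6) is replaced by the maximum over a finite set of nominal residual samples, which only approximates it, and the `false_alarm_rate` option, which uses an empirical quantile of the nominal residual norms, is one possible way to realize a fixed nonzero false alarm rate rather than a procedure given in the article. The function names are hypothetical.

```python
import numpy as np

def detection_threshold(nominal_residuals, false_alarm_rate=0.0):
    """Empirical threshold J_th from nominal residuals (rows are samples).

    With false_alarm_rate == 0 this is the sample analogue of (6); otherwise
    an empirical quantile is used (an assumption of this sketch).
    """
    norms = np.linalg.norm(np.asarray(nominal_residuals, dtype=float), axis=1)
    if false_alarm_rate == 0.0:
        return float(norms.max())
    return float(np.quantile(norms, 1.0 - false_alarm_rate))

def is_fault(residual, threshold):
    """Decision rule (7): declare a fault only if ||r|| exceeds J_th."""
    return bool(np.linalg.norm(residual) > threshold)

if __name__ == "__main__":
    # Illustrative nominal residuals only; not data from the article.
    rng = np.random.default_rng(0)
    nominal = 0.1 * rng.standard_normal((1000, 2))
    j_th = detection_threshold(nominal)
    print("J_th:", j_th)
    print("fault flagged:", is_fault([1.0, 1.0], j_th))
```

A residual whose norm equals the threshold is not flagged, because (7) uses a strict inequality.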
If the states can be accurately estimated, $J_{th}$ would be nearly zero so that all faults could be detected without false alarms. However, many states in manufacturing processes are not accurately estimated because of noise and disturbances. In this sense, a model with a smaller $J_{th}$ is preferable because, according to (9), it yields a larger detectable fault set and greater insensitivity to disturbances, noise, and model uncertainties. This analysis can be generalized to employ measures appropriate for random noise in (1) and (2), account for a fixed false alarm rate not equal to zero, employ squared norms, and include effects of model uncertainties as in Emami-Naeini et al. (1988). The core evaluation principle for fault detection performance is to compare the residual statistic $R$ for normal operating data at a fixed false alarm rate. An ideal fault detection system would generate $J_{th} = 0$, whereas a practical suboptimal fault detection system has $J_{th}$ as small as possible.

2.2 Fault Detection Evaluation by Information Methods

1) An information-theoretic formulation of the fault detection problem

This section presents a framework that embeds information theory into fault detection evaluation. In order to establish the relationship between fault detection and information theory, an information-theoretic formulation of the fault detection problem in discrete time is first established as in Fig. 3. In contrast to a deterministic system, the real system is subject to random noise and disturbances. As such, the future states $x$ of the stochastic system are not completely determined by the historical information, and the measurements should be interpreted as the realization of an underlying stochastic
process with random inputs. The formulation of the process and its measurements in a probabilistic setting is therefore reasonable, and the corresponding probability density functions (pdfs) can be denoted as $p_x(x)$ and $p_{x|y}(x|y)$, whose existence is verified by Galdos and Gustafson (1977). The feature extraction transformation is characterized by the random map specified by the conditional pdf $p_{\hat{x}|y}(\hat{x}|y)$ in Fig. 3. The fault detection problem makes an assessment based on the state estimate $\hat{x}$.

[Figure: block diagram with blocks labeled Source, Source Map, Feature, and Fault Detection, signals $x$, $y$, $\hat{x}$, and pdfs $p_x(x)$, $p_{x|y}(x|y)$, $p_{\hat{x}|y}(\hat{x}|y)$.]
Fig. 3. Information-theoretic formulation of fault detection.

The relevant mutual information $I(x;y)$ and $I(x;\hat{x})$ are defined as (Cover and Thomas, 2012)

$$ I(x;\hat{x}) = \int_{\hat{x}} \int_{x} \log\left(\frac{p_{x\hat{x}}}{p_x\, p_{\hat{x}}}\right) p_{x\hat{x}} \, dx \, d\hat{x}, \tag{10} $$

where

$$ p_{x\hat{x}} = \int_{y} p_{x|y}\, p_{\hat{x}|y}\, p_y \, dy, \qquad p_x = \int_{\hat{x}} p_{x\hat{x}} \, d\hat{x}, \qquad p_{\hat{x}} = \int_{x} p_{x\hat{x}} \, dx. $$

If the process data are assumed to follow a multivariate normal distribution, (10) can be shown to be equivalent to (Cover and Thomas, 2012)

$$ I(x;\hat{x}) = \frac{1}{2} \log \frac{|\Sigma_x|\,|\Sigma_{\hat{x}}|}{|\Sigma_{[x,\hat{x}]}|}, \tag{11} $$

where $\Sigma_\alpha$ is the covariance matrix of the random variable $\alpha$, and $|\Sigma_\alpha|$ denotes the determinant of $\Sigma_\alpha$. The mutual information $I(x;\hat{x})$ is a function of the choice of feature extraction $p_{\hat{x}|y}(\hat{x}|y)$. The averaged squared norm of the residual can be written as a function of $p_{\hat{x}|y}(\hat{x}|y)$ as

$$ \varepsilon(p_{\hat{x}|y}) \triangleq E\big((x-\hat{x})^T(x-\hat{x})\big) = \int_{x}\int_{y}\int_{\hat{x}} (x-\hat{x})^T(x-\hat{x})\, p_{x|y}\, p_{\hat{x}|y}\, p_y \, dx \, dy \, d\hat{x}. \tag{12} $$

2) Main results for fault detection evaluation

This section describes two main results based on the above information-theoretic formulation. First, the relationship is developed between the mutual information $I(x;\hat{x})$ and the fault detection performance of a given feature extraction method. The resulting relation indicates that the mutual information $I(x;\hat{x})$ is not just a different performance index that may be useful, but is a true indicator for distortion evaluation. A higher mutual information between the true and approximated states results in a lower squared norm of the residual, which in turn improves the fault detection performance according to Section 2.1. Second, a lower bound of the rate distortion function is obtained by applying Shannon's lower bound. The resulting lower bound indicates how well the fault detection method can perform based on the information that the method extracts from the measurements.

Lemma 1 [Lower Bound]: Given a state estimator with its distortion measure defined as the averaged squared norm of residuals, i.e.,

$$ \varepsilon \triangleq E\big(\|x-\hat{x}\|^2\big), \tag{13} $$

and the information loss of the estimator as

$$ IL(\hat{x}) \triangleq I(x;y) - I(x;\hat{x}), \tag{14} $$

define the optimal estimates and errors by

$$ \hat{x}^* = E(x \mid y), \tag{15} $$

$$ \varepsilon^* = E\big(\|x-\hat{x}^*\|^2\big). \tag{16} $$

Then the equivalent relations hold:

$$ \varepsilon \ge \varepsilon^* \big(\exp(2\, IL(\hat{x}))\big)^{1/n_x}, \tag{17} $$

$$ \varepsilon \ge n_x \big(|\Sigma_x|\big)^{1/n_x} \big(\exp(-2\, I(x;\hat{x}))\big)^{1/n_x}. \tag{18} $$

Proof: See the Appendix for the proof.

Lemma 2 [Achievement of Lower Bound]: Consider the normal operating measurements $y \in \mathbb{R}^{n_y}$ and states $x \in \mathbb{R}^{n_x}$, and a dimensional reduction technique that produces loading scores (features) $\hat{s} \in \mathbb{R}^{n_s}$ ($n_s \le n_x$), with information $I(x;\hat{s})$ preserved in the feature space. The lower bounds in Lemma 1 can be achieved, namely, equality holds in (17) and (18), by using an iso-information transformation (IIT) $\hat{x} = M\hat{s}$, where $M \in \mathbb{R}^{n_x \times n_s}$ is a matrix of full column rank. The matrix $M$ that achieves the lower bound is given by

$$ M^* = E(x\hat{s}^T)\, E(\hat{s}\hat{s}^T)^{-1}. \tag{19} $$

Proof: See the Appendix.
In Lemma 2, the matrix $M$ denotes an iso-information transformation from feature space to estimated state space that minimizes the corresponding squared norm of the residual without changing the information contained in the feature space (i.e., $I(x;\hat{x}) = I(x;\hat{s})$).

Theorem 1 [Fault Detection Performance Evaluation]: Given normal operating training data, consider two dimensional reduction techniques $\alpha$ and $\beta$, and suppose that the information preserved in the retained feature spaces produced by the techniques $\alpha$ and $\beta$ are $I_\alpha(x;\hat{s}_\alpha)$ and $I_\beta(x;\hat{s}_\beta)$, respectively. If

$$ I_\alpha(x;\hat{s}_\alpha) > I_\beta(x;\hat{s}_\beta), \tag{20} $$

then

$$ E\big(\|x-\hat{x}_\alpha^*\|^2\big) < E\big(\|x-\hat{x}_\beta^*\|^2\big), \tag{21} $$

where $\hat{x}_\alpha^* = M_\alpha^* \hat{s}_\alpha$ and $\hat{x}_\beta^* = M_\beta^* \hat{s}_\beta$, and $M_\alpha^*$ and $M_\beta^*$ are determined by (19) in Lemma 2.

Proof: The proof follows directly from Lemmas 1 and 2.
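As a rough numerical illustration of Lemma 2, the sketch below estimates $M^*$ in (19) from sample moments and compares the resulting mean squared residual with that of a random transformation. The data model (standard normal states, a random measurement matrix `c`, features taken as the first two measurements) and the helper names `iit_matrix` and `mean_sq_residual` are assumptions made for illustration and do not come from the paper.

```python
import numpy as np


def iit_matrix(x, s):
    """Sample version of (19): M* = E(x s^T) E(s s^T)^{-1}.

    x: (N, n_x) states, s: (N, n_s) features, one sample per row.
    """
    n = x.shape[0]
    exs = x.T @ s / n  # estimate of E(x s^T), shape (n_x, n_s)
    ess = s.T @ s / n  # estimate of E(s s^T), shape (n_s, n_s)
    # Solve M ess = exs rather than forming the inverse explicitly.
    return np.linalg.solve(ess.T, exs.T).T


def mean_sq_residual(x, s, m):
    """Sample estimate of E(||x - M s||^2)."""
    r = x - s @ m.T
    return float(np.mean(np.sum(r ** 2, axis=1)))


rng = np.random.default_rng(0)
n_samples, n_x, n_y = 5000, 3, 5
x = rng.standard_normal((n_samples, n_x))                  # hypothetical states
c = rng.standard_normal((n_y, n_x))                        # hypothetical measurement matrix
y = x @ c.T + 0.1 * rng.standard_normal((n_samples, n_y))  # noisy measurements
s = y[:, :2]                                               # toy features with n_s = 2

m_star = iit_matrix(x, s)
m_rand = rng.standard_normal((n_x, 2))                     # random transformation

mse_star = mean_sq_residual(x, s, m_star)
mse_rand = mean_sq_residual(x, s, m_rand)
print(f"M*: {mse_star:.3f}, random M: {mse_rand:.3f}")
```

Because `iit_matrix` is the least-squares solution on the same samples, its residual cannot exceed that of any other matrix applied to these features, which is the sample analogue of the optimality claimed for (19).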
Eq. (21) can be used to show that $J_{th}^{\alpha} < J_{th}^{\beta}$, which indicates that dimensional reduction technique $\alpha$ is preferable to technique $\beta$ for fault detection according to (9). These results illustrate the useful role of information theory in the evaluation of fault detection methods. When comparing different dimensional reduction methods, a true indicator of the performance of a fault detection method is the mutual information between the features that the method extracts and the real states. A higher $I(x;\hat{s})$ yields a smaller $E(\|x-\hat{x}^*\|^2)$ and thus results in better fault detection performance. The optimal mapping from feature space to state space can be obtained by (19), which utilizes all the useful information contained in $\hat{s}$.
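To make the comparison of methods by mutual information concrete, the following sketch evaluates the Gaussian expression (11) from sample covariances. It is applied to the features $\hat{s}$ rather than to $\hat{x} = M\hat{s}$, because $\Sigma_{\hat{x}}$ is singular whenever $n_s < n_x$. The function name `gaussian_mi`, the synthetic data, and the two toy feature sets are illustrative assumptions, not the PCA and CVA features used in the paper.

```python
import numpy as np


def gaussian_mi(x, s):
    """I(x; s) in nats under a joint Gaussian assumption, following (11)."""
    n_x = x.shape[1]
    cov = np.cov(np.hstack([x, s]), rowvar=False)  # joint covariance of [x, s]
    # slogdet returns (sign, log|det|), which avoids overflow or underflow.
    _, logdet_x = np.linalg.slogdet(cov[:n_x, :n_x])
    _, logdet_s = np.linalg.slogdet(cov[n_x:, n_x:])
    _, logdet_joint = np.linalg.slogdet(cov)
    return 0.5 * (logdet_x + logdet_s - logdet_joint)


rng = np.random.default_rng(0)
n_samples, n_x, n_y = 5000, 3, 5
x = rng.standard_normal((n_samples, n_x))                  # hypothetical states
c = rng.standard_normal((n_y, n_x))                        # hypothetical measurement matrix
y = x @ c.T + 0.5 * rng.standard_normal((n_samples, n_y))  # noisy measurements

s_alpha = y[:, :2]  # toy technique alpha: keeps two features
s_beta = y[:, :1]   # toy technique beta: keeps only the first of them

mi_alpha = gaussian_mi(x, s_alpha)
mi_beta = gaussian_mi(x, s_beta)
print(f"I_alpha = {mi_alpha:.3f} nats, I_beta = {mi_beta:.3f} nats")
```

In this toy setup the features of technique beta are a subset of those of technique alpha, so alpha cannot preserve less information than beta.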
3. CASE STUDY

This section illustrates the usefulness of the key concepts using a simple oscillatory process with a random bias (Galdos and Gustafson, 1977),
0 x1 1 0 0 w1 x1 0 0 2 x x + 0 1 0 0 = − w 0 w2 , 2 2 x3 0 1 x3 0 0 w3 1 1 (22)

with random initial condition $[x_1(0), x_2(0), x_3(0)]^T \sim N(0, \mathrm{diag}\{x_{01}, x_{02}, x_{03}\})$ and process noise $w_i \sim N(0, q)$. Discrete noisy measurements are collected according to
y1 1 0 y 0 1 2 y3 = 0 0 y4 1 0 y5 0 −1
0 v1 0 x1 v2 x , + v3 1 2 x3 1 v4 v5 5
(23)

where $v_i \sim N(0, q)$. The numerical values of the parameters are $x_{01} = 22$, $x_{02} = (0.02)^2$, $x_{03} = (0.25)^2$, $w = 21$ rad/h, and $q = (0.05)^2$. The data sampling time is 3 minutes. Consider a fault that is a slow drift added to $x_2$ with magnitude 0.0008. The datasets for both the normal and fault cases contain 1000 samples, and the fault is introduced to the system at time 0.

To illustrate the results in Section 2, two widely used dimensionality reduction techniques, PCA and CVA, are applied for fault detection. The corresponding projection matrices $P_{PCA}$ and $P_{CVA}$ are obtained by the PCA and CVA methods, and the transformation matrices $M_{PCA}$ and $M_{CVA}$ can be obtained by (19). For PCA, the number of principal components is set to $a = 1$. For CVA, the lag variables for past and future are set to $l = h = 2$ and the number of states is taken as $k = 3$. Details about the PCA and CVA application can be found in Chiang et al. (2001). To provide a sound comparative assessment of the detection performances, false alarm rates are maintained at the same level ($\alpha = 1\%$) for comparing the fault detection rate (FDR), which is the ratio between the detected fault samples and total fault samples.

[Figure: four panels plotting the squared residual $R^2$ against sample number (0 to 1000) for normal and fault data: (a) PCA with optimum transformation, (b) CVA with optimum transformation, (c) PCA with a random transformation, (d) CVA with a random transformation.]
Fig. 4. Squared residuals for fault detection methods for a single simulation
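The evaluation protocol described in the case study, fixing the false alarm rate at 1% and then comparing fault detection rates, can be sketched as follows. The sketch assumes that the threshold is taken as an empirical quantile of the squared residuals under normal operation, which the paper does not specify, and it uses placeholder residuals rather than the PCA or CVA residuals of the case study. The function names are hypothetical.

```python
import numpy as np


def threshold_at_far(r2_normal, far=0.01):
    """Threshold exceeded by about a fraction `far` of the normal residuals."""
    return float(np.quantile(r2_normal, 1.0 - far))


def fault_detection_rate(r2_fault, threshold):
    """Ratio of detected fault samples to total fault samples."""
    return float(np.mean(r2_fault > threshold))


rng = np.random.default_rng(1)
n_samples = 1000

# Placeholder squared residuals (not the paper's data): noise only for normal
# operation, noise plus a slowly growing offset for the faulty case.
r2_normal = rng.standard_normal(n_samples) ** 2
drift = 0.004 * np.arange(n_samples)
r2_fault = (rng.standard_normal(n_samples) + drift) ** 2

j_th = threshold_at_far(r2_normal, far=0.01)
far_hat = float(np.mean(r2_normal > j_th))  # realized false alarm rate
fdr = fault_detection_rate(r2_fault, j_th)
print(f"threshold = {j_th:.3f}, FAR = {far_hat:.3f}, FDR = {fdr:.3f}")
```

Holding the false alarm rate fixed in this way means that two methods are compared only through how far their fault residuals rise above their own normal-operation residuals.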
Fault detection results for one simulation are in Fig. 4. As seen in Figs. 4a and 4b, using the iso-information transformation from Theorem 1, PCA has a nearly zero detection rate whereas CVA has a detection rate of nearly 85%, which is consistent with CVA's ability to take into account process dynamics (Chiang et al., 2001). CVA results in a tighter threshold than PCA, which results in better fault detection. As a comparison with the optimum iso-information transformation by $M^*$ in (19),
the results obtained using a random transformation matrix $M = [m_{i,j}] \in \mathbb{R}^{n_x \times n_s}$, where $m_{i,j} \sim N(0,1)$, are in Figs. 4c and 4d. The optimum transformation matrix yields much lower squared residuals $E(\|r_0\|^2) = E(\|x_0 - \hat{x}_0^*\|^2)$ and higher fault detection performance for both PCA and CVA, which realizes the maximum benefit from the information produced in the feature extraction step.

The averaged performance for 1000 simulations is shown in Fig. 5 and Table 1. A higher mutual information was associated with a lower $E(\|r_0\|^2) = E(\|x_0 - \hat{x}_0^*\|^2)$ and a lower detectable threshold $J_{th}$, thereby leading to a higher fault detection rate, which is consistent with the discussions in Section 2.

Table 1. Fault detection performances
Methods   Fault detection rate   Mutual information   $E(\|r_0\|^2)$
PCA       7.70%                  5.20                 2.00
CVA       83.20%                 13.94                0.07
[Figure: two panels plotted against time (0 to 1000) for PCA and CVA: (a) fault detection rate (FDR), (b) mutual information (MI).]
Fig. 5. The fault detection rate and mutual information for PCA and CVA averaged over 1000 simulations
4. CONCLUSIONS

This article presents a fault detection evaluation that embeds the fault detection problem into an information-theoretic framework. The significance of mutual information in fault detection has been demonstrated: mutual information
is not just another performance index that may be useful in some fault detection problems, but rather a true indicator of how well a fault detection method can perform. The criteria for fault detection evaluation are discussed in Section 2.1. Optimal dimensionality reduction techniques should extract features that generate residuals that are as small as possible for the normal operating condition. Based on this statement, an information-theoretic formulation for the fault detection problem is introduced in Section 2.2, and the relationship between mutual information and the squared norm of the residual is established. A higher mutual information between the estimated and true states is shown to generate a smaller residual, thereby resulting in better fault detection performance. Rather than focusing on specific dimensionality reduction techniques, the mutual information is a true indication of how well a feature extraction method can perform given its preserved information. The theory was illustrated by a numerical example in which the feature space with high mutual information with respect to the real states yielded a higher fault detection rate. The theory establishes a solid foundation for developing an optimal dimensional reduction method based on maximizing mutual information for fault detection.

REFERENCES

Chiang, L.H., Russell, E.L., and Braatz, R.D. (2001). Fault Detection and Diagnosis in Industrial Systems. London, UK: Springer Verlag.
Cover, T.M., and Thomas, J.A. (2012). Elements of Information Theory (2nd Edition). John Wiley & Sons.
Ding, S.X. (2014). Data-Driven Design of Fault Diagnosis and Fault-Tolerant Control Systems. London, UK: Springer Verlag.
Emami-Naeini, A., Akhter, M.M., and Rock, S.M. (1988). Effect of model uncertainty on failure detection: the threshold selector. IEEE Transactions on Automatic Control, 33(12), 1106–1115.
Friston, K. (2010). The free-energy principle: A unified brain theory? Nature Reviews Neuroscience, 11, 127–138.
Galdos, J., and Gustafson, D. (1977). Information and distortion in reduced-order filter design. IEEE Transactions on Information Theory, 23(2), 183–194.
Jiang, B., Zhu, X., Huang, D., Paulson, J.A., and Braatz, R.D. (2015a). A combined canonical variate analysis and Fisher discriminant analysis (CVA–FDA) approach for fault diagnosis. Computers & Chemical Engineering, 77, 1–9.
Jiang, B., Huang, D., Zhu, X., Yang, F., and Braatz, R.D. (2015b). Canonical variate analysis-based contributions for fault identification. Journal of Process Control, 26, 17–25.
Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24(7), 417–441.
Hyvärinen, A., Karhunen, J., and Oja, E. (2004). Independent Component Analysis. New York: John Wiley & Sons.
Joshi, A., Deignan, P., Meckl, P., King, G., and Jennings, K. (2005). Information theoretic fault detection. Proceedings of the American Control Conference, Portland, OR, USA, 1642–1647.
Larimore, W.E. (1997). Canonical variate analysis in control and signal processing, in: T. Katayama, S. Sugimoto (Eds.), Statistical Methods in Control and Signal Processing, 83–120. New York: Marcel Dekker Inc.
Parra, L., Deco, G., and Miesbach, S. (1996). Statistical independence and novelty detection with information preserving nonlinear maps. Neural Computation, 8(2), 260–269.
Russell, E.L., Chiang, L.H., and Braatz, R.D. (2000). Fault detection in industrial processes using canonical variate analysis and dynamic principal component analysis. Chemometrics & Intelligent Laboratory Systems, 51, 81–93.
Severson, K., Chaiwatanodom, P., and Braatz, R.D. (2016). Perspectives on process monitoring of industrial systems. Annual Reviews in Control, 42, 190–200.
Shannon, C.E. (1959). Coding theorems for a discrete source with a fidelity criterion. IRE National Convention Record, 4, 325–350.
Tanaka, T., Kim, K.K.K., Parrilo, P., and Mitter, S. (2017). Semidefinite programming approach to Gaussian sequential rate-distortion trade-offs. IEEE Transactions on Automatic Control, 62(4), 1896–1910.
Venkatasubramanian, V., Rengaswamy, R., Kavuri, S.N., and Yin, K. (2003). A review of process fault detection and diagnosis: Part III: Process history based methods. Computers & Chemical Engineering, 27(3), 327–346.
Verron, S., Tiplica, T., and Kobi, A. (2008). Fault detection and identification with a new feature selection based on mutual information. Journal of Process Control, 18(5), 479–490.
Wold, S., Ruhe, A., Wold, H., and Dunn, W.J. (1984). The collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses. SIAM Journal on Scientific and Statistical Computing, 5(3), 735–743.
Yu, J., Chen, J., and Rashid, M.M. (2013). Multiway independent component analysis mixture model and mutual information based fault detection and diagnosis approach of multiphase batch processes. AIChE Journal, 59(8), 2761–2779.
APPENDIX

Proof of Lemma 1: Based on the Shannon lower bound (Galdos and Gustafson, 1977; Cover and Thomas, 2012), the mutual information and mean-squared-error (MSE) distortion measure $\varepsilon(p_{\hat{x}|y})$ in (12) are related by

$$ \inf_{p_{\hat{x}|y} \in CQ_\varepsilon} I(x;\hat{x}) \ge H(x) - \frac{1}{2}\log\left(2\pi e\, \frac{\varepsilon}{n_x}\right)^{n_x}, \tag{A-1} $$

where $H(x)$ denotes the Shannon entropy (Galdos and Gustafson, 1977; Cover and Thomas, 2012) and $CQ_D$ is defined as

$$ CQ_D = \{\, p_{\hat{x}|y} : \varepsilon(p_{\hat{x}|y}) \le D \,\}, \tag{A-2} $$

which contains all the feature extractors $p_{\hat{x}|y}$ whose MSE distortion measure satisfies $\varepsilon(p_{\hat{x}|y}) \le D$. Since

$$ \inf_{p_{\hat{x}|y} \in CQ_\varepsilon} I(x;\hat{x}) \le I(x;\hat{x}), $$

(A-1) can be written as

$$ \left(\frac{\varepsilon}{n_x}\right)^{n_x} \ge \frac{1}{(2\pi e)^{n_x}} \exp\big(2H(x) - 2I(x;\hat{x})\big). \tag{A-3} $$

Based on the Gaussian assumption, $H(x)$ is expressible as

$$ H(x) = \frac{1}{2}\log\big((2\pi e)^{n_x} |\Sigma_x|\big). \tag{A-4} $$

Substituting (A-4) into (A-3) results in the inequality (18). To prove (17), (14) can be used to rewrite (A-3) as

$$ \left(\frac{\varepsilon}{n_x}\right)^{n_x} \ge \frac{1}{(2\pi e)^{n_x}} \exp\big(2H(x) - 2I(x;y)\big) \exp\big(2\, IL(\hat{x})\big). \tag{A-5} $$

From (15), it follows that

$$ I(x;y) = I(x;\hat{x}^*) = \frac{1}{2}\log\frac{|\Sigma_x|\,|\Sigma_{\hat{x}^*}|}{|\Sigma_{[x,\hat{x}^*]}|} $$

since $\hat{x}^*$ are optimal state estimates. Together with (A-4), it can be shown that

$$ \frac{1}{(2\pi e)^{n_x}} \exp\big(2H(x) - 2I(x;y)\big) = |\Sigma_x|\, \frac{|\Sigma_{[x,\hat{x}^*]}|}{|\Sigma_x|\,|\Sigma_{\hat{x}^*}|} = \big|\Sigma_x - \Sigma_{x\hat{x}^*} \Sigma_{\hat{x}^*}^{-1} \Sigma_{\hat{x}^* x}\big|. \tag{A-6} $$

As pointed out in Galdos and Gustafson (1977), $\hat{x}^*$ has the properties

$$ E(x\hat{x}^{*T}) = E(\hat{x}^*\hat{x}^{*T}), \quad \text{or} \quad E\big((x-\hat{x}^*)\hat{x}^{*T}\big) = 0. \tag{A-7} $$

Equation (A-6) can be further rearranged to

$$ \frac{1}{(2\pi e)^{n_x}} \exp\big(2H(x) - 2I(x;y)\big) = |\Sigma_x - \Sigma_{\hat{x}^*}| = \big|E\big((x-\hat{x}^*)(x-\hat{x}^*)^T\big)\big|. \tag{A-8} $$

Since $\hat{x}^*$ are optimal state estimates, we have that

$$ n_x \left(\frac{1}{(2\pi e)^{n_x}} \exp\big(2H(x) - 2I(x;y)\big)\right)^{1/n_x} = n_x \big|E\big((x-\hat{x}^*)(x-\hat{x}^*)^T\big)\big|^{1/n_x} = \mathrm{tr}\Big(E\big((x-\hat{x}^*)(x-\hat{x}^*)^T\big)\Big) = E\big((x-\hat{x}^*)^T(x-\hat{x}^*)\big) = \varepsilon^*. \tag{A-9} $$

Substituting (A-9) into (A-5) derives the inequality (17).

Proof of Lemma 2: The MSE distortion measure is defined as

$$ \varepsilon \triangleq E\big(\|x-\hat{x}\|^2\big) = E\big(\|x - M\hat{s}\|^2\big). \tag{A-10} $$

Differentiating (A-10) with respect to the transformation matrix $M$ results in

$$ \frac{\partial \varepsilon}{\partial M} = 2E\big((M\hat{s} - x)\hat{s}^T\big) = 0. \tag{A-11} $$

Then the optimal transformation matrix that achieves the lower bound is

$$ M^* = E(x\hat{s}^T)\, E(\hat{s}\hat{s}^T)^{-1}. $$
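The covariance identities used in (A-6) to (A-8) can be checked numerically for a single linear Gaussian instance. The sketch below assumes a hypothetical model $y = Cx + v$ with zero-mean Gaussian $x$ and $v$, for which $\hat{x}^* = E(x \mid y) = Ky$ with $K = \Sigma_x C^T \Sigma_y^{-1}$. The covariances `sigma_x` and `sigma_v` and the matrix `c` are arbitrary illustrative choices, not the case-study values, and the check does not cover step (A-9).

```python
import numpy as np

rng = np.random.default_rng(2)
n_x, n_y = 3, 5

a = rng.standard_normal((n_x, n_x))
sigma_x = a @ a.T + np.eye(n_x)      # hypothetical state covariance
c = rng.standard_normal((n_y, n_x))  # hypothetical measurement matrix
sigma_v = 0.25 * np.eye(n_y)         # hypothetical measurement noise covariance

# Linear Gaussian model y = C x + v, so x_hat* = E(x | y) = K y.
sigma_y = c @ sigma_x @ c.T + sigma_v
sigma_xy = sigma_x @ c.T
k = sigma_xy @ np.linalg.inv(sigma_y)
sigma_xhat = k @ sigma_y @ k.T       # covariance of x_hat*
sigma_x_xhat = sigma_xy @ k.T        # cross-covariance E(x x_hat*^T)

# (A-7): E(x x_hat*^T) = E(x_hat* x_hat*^T)
orthogonality_gap = float(np.max(np.abs(sigma_x_xhat - sigma_xhat)))

# (A-6), (A-8): |Sigma_[x, x_hat*]| / |Sigma_x_hat*| = |Sigma_x - Sigma_x_hat*|
joint_xxhat = np.block([[sigma_x, sigma_x_xhat], [sigma_x_xhat.T, sigma_xhat]])
lhs = np.linalg.det(joint_xxhat) / np.linalg.det(sigma_xhat)
rhs = np.linalg.det(sigma_x - sigma_xhat)


def gaussian_mi(cov_a, cov_b, cov_joint):
    """Gaussian mutual information in nats, following (11)."""
    logdets = [np.linalg.slogdet(m)[1] for m in (cov_a, cov_b, cov_joint)]
    return 0.5 * (logdets[0] + logdets[1] - logdets[2])


# I(x; y) = I(x; x_hat*), as used after (A-5).
joint_xy = np.block([[sigma_x, sigma_xy], [sigma_xy.T, sigma_y]])
mi_xy = gaussian_mi(sigma_x, sigma_y, joint_xy)
mi_xxhat = gaussian_mi(sigma_x, sigma_xhat, joint_xxhat)

print(f"gap = {orthogonality_gap:.2e}, lhs = {lhs:.6f}, rhs = {rhs:.6f}")
print(f"I(x;y) = {mi_xy:.6f}, I(x;x_hat*) = {mi_xxhat:.6f}")
```

All covariances here are computed in closed form, so the agreement is limited only by floating-point rounding rather than by sampling error.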