Copvright © IFAC Dynamics and Control of Process Systems, Corfu, Greece, 1998
DETERMINING THE NUMBER OF PRINCIPAL COMPONENTS FOR BEST RECONSTRUCTION S. Joe Qin 1 and Ricardo Dunia
Department of Chemical Engineenng University of Texas, Austin, TX 78712, USA
Abstract: A well-defined unreconstructed variance (URY) is proposed to determine the number of principal components in a PCA model for best reconstruction. Unlike most other methods in the literature, this proposed l ;RV method has a guaranteed minimum over the number of PC's corresponding to the best reconstruction. Therefore, it avoids the arbitrariness of other methods with monotonic indices. The CRY can also be used to remove variables that are little correlated with others and cannot be reliably reconstructed from the correlation-based PCA model. The effectiveness of this method is demonstrated with a simulated process. Copyright © 1998 IFAC Keywords: principal component analysis. missing values, sensor reconstruction , principal component subspace. residual subspace
1.
I\,TRODCCTIO~
corresponds to a well defined minimum prediction error . the same approach does not work for PCA. The reason is that PCA actually models an identity mapping from the variables to themselves. for both the calibration (or training) data set and the test set. If a cross-validat.ion approach is used . the predicted error sum of squares (PRESS) would decrease monotonically. and would become zero as the number of PC's approaches the number of variables. For this reason, the determination of the number of PC's has been somewhat subjective. A simple approach is to choose the number of PCs for the variance to achieve a predetermin ed percentage. say 95%. Wold (\void . 1978) uses the PRESS to choose the number of PCs at which including an additional PC will not s1gnificantly reduces the PRESS. Since the PRESS in Pc..\. is monotonically decreasing , this method is subjective. Joreskog et al. (Joreskog et al .. 1976) discussed a criterion in which each PC must contribute at least on 171th of the total variable. where m is the number of variables. Cattell (Cattell. 1966) presented a scree test based on a plot of the eigenvalues of the correlation matrix.
The use of principal component analysis (PCA) in chemical process monitoring and analysis has progressed significantly. Some of the representing applications include: (i) monitoring of batch and continuous processes (!'\omikos and MacGregor. 1995) . (l\lacGregor and Koutodi. 199.')): (ii) extraction of active biochemical reactions (Harmon ff IJI .. 1995); (iii) product quality control in the principal component subspace (Piovoso et al .. 1992): (iv) missing value replacement (nlartens and Naes. 1989): (v) sensor fault detection and reconstruction (\rise and Ricker. 1991) . (Dunia el al .. 1996): (\"i) process fault identification and reconstruction(Dunia and Qin . 1997): and (vii) disturbance detection (Ku et al .. 1995) Although the PCA method has been widely used. selecting the number of principal components (PC) is rather subjective. l"nlike partial least squares. for which a cross validation approach is a\'ailable and the optimal number of factors
1
Author to whom all correspondence should be addressed.
357
The number of PC's is chosen at which the graph drops sharply followed by a straight line with much smaller slope. Harmon, et al. (Harmon El al. , 1995) proposed an alternative method based on the auto-correlation test for the scores for a specific type of applications. The situation is that the variables are serially autocorrelated. Therefore, the significant scores are auto-correlated. If for a particular component the autocorrelation of the scores is almost zero , then this and the remaining components resemble to random noise and are not used. ~Ialinowski summarized numerous other approaches to determining the number of PC's (Malinowski , 1991).
which is projected on the PC'S and RS and is shown t.o have a minimum with respect to the number of PC's. Section 4 presents the use of the proposed method to to determine the optimal number of PC's for a simulated process. The last section presents conclusions.
2. SE).'SOR RECOi\'STRUCTION BASED 0;\1 PCA 2.1 PGA Projection
arm
Let x E denote a sample vector of In sensors. Assuming there are n samples for each sensor , a data matrix X E 5j(nxm is composed with each row representing a sample . By either the NIPALS (Wold et al. , 1987) or a singular value decomposition (SVO) algorithm. the matrix X can be decomposed into a score matrix T and a loading matrix P whose columns are the right singular vectors of X . In the case of SVD , the singular values are ordered with decreasing magnitudes. A particular sample vector x can be projected to the PCS (Sp), which is spanned by P , and RS (Sr), respectively. The projection on the PCS is:
Most existing approaches to determining the number of PC 's use an index that is monotonlcally decreasing. The number of PC's is chosen when there is no significant decrease in the index after adding a PC. These approaches based on monotonic indices are subjective because (i) there may be a rather constant decrement in the index; and (ii) there can be more than one location which satisfies the criterion. In this paper , we propose a new criterion to determine the number of PC's based on the best reconstruction of the variables. An important feature of this new approach is that the proposed index has a minimum (i.e ., non-monotonic) corresponding to the best reconstruction. \\-'hen the PCA model is used to reconstruct missing values or faulty sensors , the reconstruction error is a function of the number of PC's. We propose the use of an unreconstructed rarwllce (CRV) which is the variance of the reconstruction error to determine the number of PC's . The CRY can be decomposed into a portion in the principal component subspace (PCS) and a portion in the residual subspace (RS). The portion in the RS is shown to decrease monotonically with the number of PC's and that in the PCS in general increases with the number of PC 's. As a result , the CRY always has a minimum which points to the optimal number of PC's. In the case of missing value reconstruct ion. the unreconstructed variances for different missing patterns are combined with weights based on the frequency of occurrence . In the case of faulty sensor identification and reconstruction. the unreconstructed variances are weighted based on the importance of each fault. In the special case where each sensor is reconstructed once , the procedure can be thought of as a cross validation by Iwnng-one-sensor-out. in contrast to the traditionalleaving-one-sample-out cross validation.
x = ppTx == Cx
( 1)
The projection on the RS is:
x = (I -
ppT)x
= (I -
C)x
(2)
Since Sp and Sr are orthogonal.
xTx = 0
(3)
2.2 SEnsor Reconstruction
In this subsection we discuss the issue of sensor reconstruction given a PCA model and a known direction. Let x' denote the normal sample vector. The actual sensor measurement including an upset is x
= x' + f~,
(4)
which is corrupted by an upset :Fj along a di1 and f the rection ~i E 5j(m. where II~i 11 magnitude . The task of sensor reconstruction is to find an estimate for x' along the direction of upset ~i, i.e .. to best correct the effect of the upset. In other words . we try to find fi such that
=
The organization of the remaining sections of t he paper is as follows . Section 2 presents a general description for missing and faulty sensors reconstruction using principal component models. Sect ion 3 defines t.he unreconstructed variance
(5)
is most consistent with the PC A model , or x, has minimum model error. i.e ..
358
fi
x, !1 2
= arg min ll x, J. = arg min llxi 112 J.
(I)
= arg min Ilx J.
(8)
J;~i I 12
x· - Xi
(6)
f.! x = ---~r x = ---e~i
J; -
Xi
= (I -
~i~i,
~r~i
)x
( 12)
f
i.r x' = -,-,
(13)
e~i
(9)
~r~i
= (I -
f)~,
Therefore . the reconstruction error . (14)
Substituting the above fi into Eq.(5) we obtain the best sensor reconstruction as follows. - 'T
-
Substituting Eq.(4) into Eq.(9) we get:
The solution can be easily found USlI1g least squares. fi
= (f,
',
iOiOT)x
and
.
If.!
x'l Ilx -xill=~
(10)
(
(15)
~i
From Eqs.(13) and (14) we observe that where ii = idllid l. The above derivation is a rather general representation of many specific tasks. By defining more specifically the upset direction ~i , the following situations can be considered as special cases for the reconstruction result in Eq.(10):
(1) £{x' - x;}
(1) Process faults (Dunia and Qin , 1997) . By choosing ~i as the direction of a process fault, e.g., leakage in a unit. Eq.(5) provides the best estimate of the magnitude of the leakage . Eq.(lO) provides what the sensor measurements should be if there were no leak . (2) Sensor validation (Dunia. et ai, 1996). For the case of sensor faults. ~i is the ith column of the identity matrix. that is , ~i
= [0
of 0 0 0 of
0 .. . 1
.
=
=
0 because £{x"} O. In other words , Xi is an unbiased estimate of x". (2) The variance of (x' - Xi) occurs only in the reconstruction direction ~i , which is the same as the variance of f - J;. (3) The reconstruction error (x" -x;) is independent of the fault magnitude f. (4) The reconstruction error (x" - x;) depends on the number of PC 's retained in the PCA model.
The last two items indicate that one can determine the number of PC 's by achieving the minimum reconstruction error. Further , If the reconstruction error is minimized for a particular magnitude f, it is minimized for all magnitudes. Therefore , we can simply determine the number of PC's for the case of f = 0 to achieve the best reconst ruction.
(11)
For example . ~l = [1 represents the first sensor is faulty among five sensors. 6 = [0 0 1 0 represents the third sensor is faulty. In this case . Eg. ( 10) provides the best reconstruction of the faulty sensor based on the PCA model. (3) Data reconciliatioll. The same representation can be used to reconcile gross errors detected in the ? h sensor . (4) Missing data replacement. If sensor i for a particular observation is missing , one can fill in a zero for the missing entry and use Eq.(10) to estimate the best replacement. ~1artens and !'\aes (1989) described this method of missing value replacement. If more than one sensor have missing values. one can augment the direction vectors ~i into a matrix and similar least squares solutions can be derived.
of
3.
DETER~lIl\E
THE l\U,\lBER OF PC'S FOR BEST RECONSTRl;CTION
3.1 Unreconstructed \ 'anance
The unreconstructed variance (l'RV) in the direction ~i can be calculated as follows assuming f O.
=
Ui
== rar{
e(x - x;)} = t'al'{fd =
q~i
(~T~,j2
(16)
where R == £(xxT) is the correlation matrix which can be estimated from the data . Intuitively. Ui is the variance of the reconstruction error in the estimation of x" by using Xi. The orthogonal properties of P allows one to represent R using.
R = CRC + (I - C)R(I - C) = R+ R(17) 2.3 Reconstruction Error
where R == £(xxT) and it == £(i(XT) are the modeled and unmodeledportions of R. respectively.
Substituting Eq.(4) into Eq.( .'}). we obtain ,
359
Substitution of this expression to.
Ui
=
III
and when I gets close to 171, the denominator II~tlI2 usually tends to zero faster than ii j • making
Eq.( 16) leads
lim Uj
(18)
The equat.ion above relates variations of the data.
Ui
l-+rn
although Ui is not necessarily mono tonically increasing with I. Intuitively, when I = 171. Sp == iJ?7n. Therefore. any Xj E ~m has a minimal (zero) distance to Sp. which makes it a valid reconstruction for x. As a result, its variance Ui == Uj is infinity. The sum ii j + Uj = Uj provides the existence of a minimum with respect to I. Therefore , we can formulate an optimization problem to minimize Ui with respect to the number of principal components,
with the unmodeled
3.2 Number of Pnnelpal Components for Best ReconstructIOn
The condition of making Ui as small as possible can be used to determine the number of principal components and the set of sensors for process monitoring. To illustrate how Ui can be minimized with respect to I and m. we make use of the following identity,
n1ln Uj I
.
= Ui +Ui
T
mill q u I
. (TT.) = m1l1 q u +q U I
(23)
where u represents the vector of the unreconstructed variances for all Fj E {Fj }, and q is a weighting vector with positive entries. Such a vector allows one to adjust the model depending on how critical eaeh fault is to process operation. Not.e t.hat the linear combination of monotonically decreasing functions qT U is also monotonieally decreasing. and qT U -+ 00 when I -+ m.
(20)
where.
is the variance of (fi - f) in the direction of projected on Sr.
(22)
Because Eq.(22) provides the optimal I only for F j , it is necessary to redefine the objective function by considering all possible faults, {Fj },
and we obtain from the expectation of Eq.(19) multiplied by (fi - f)2, Ui
=X
~i
3.3
Sensor SelectIOn for Best ReconstructIOn
We have described Uj as a parameter that quantifies the amount of variance not captured in Xj when estimating X'. However. Ui can still be very large after the minimization for a particular fault or sensors. A more restrictive criterion than the reconstructability condition is required in order to determine if x, is a reliable estimation of x'. For that reason we define a reference to determine if a particularu, provides an acceptable reconstruction for the identificat.ion of F j • This reference is obtained by using the average sample vect.or x instead of the reconstructed vector Xi in Eq. (16),
is the variance of (fi - f) projected on Sp , and
The total unreconst ructed variance is represented by the sum of the unreconstructed variances on the PCS and the RS. The addition of the profiles obtained from the relations U, (I) and Ui (I) pro\'ides the possibility for a minimum with respect to I. Substituting of Eq.( 16) in the definition of Ui we obtain:
Because x is normalized to zero mean and unit 0, the reference case corresponds variance, x to using a model with zero principal components. In particular , udO) = udO) and udO) = O. By adding PC it is possible that Uj(l) 2 udO) if the increase in Uj is larger than the decrease in Uj. If the unreconstructed variance Ui is greater than ud 0), the mean x provides less unreconstructed variance for the estimation than Xi. Therefore,
=
Dunia and Qin (1997) shows that Ui is mono ton ically decreasing with respect to l. Intuitively, the great.er I. the s_maller are the non-zero eigenvalues remaining in R, making Ui (l + 1) :S Ui (I). In the case of Uj.
360
the PCA model derived from the data does not pro\'ide a better estimation of x' than x when :Fj occurred . In this case it is better to use x as the reconstructed values , which can be interpreted a" using zero principal components. It is desirable to consider as many correlated variables as possible for process monitoring. But if a sensor chosen for monitoring is not correlated with others. the objective function qT u tends to increase, particularly when sensor faults for uncorrelated sensors are being considered. Indeed, a fault on a sensor with lack of correlation provides a large Uj , making the fault unreconstructable by means of multi variable reconstruction. The inclusion of such a sensor as part of the monitoring measurements could affect the fault. diagnosis and t.he parsimony of t.he model. Therefore. for the multi variable monitoring procedure we suggest removing not only the sensor faults with large Uj from the set of possible faults , but also the sensors associated to the unreconstructable sensor faults. The selection of variables with reliable sensor fault reconstruction affects the number of variables considered for monitoring , making the procedure iterative.
Fig. l. Flow diagram of the simulated process. the streams for different purposes. :\ total of 12 flow measurements are taken along the process. To illustrate the reconstructability condition and the use of u in the calculation of / and m , we consider all the sensor faults and leaks at each reflux condenser as possible abnormal conditions. The first twelve faults correspond to sensor faults , which are characterized by the following directions:
The previous discussion suggests the following procedure for the calculation of m and t using the unreconstructed variance:
and the last two faults correspond to the leaks at the reflux condensers with the following direction vectors:
(1) Consider all monitoring sensors as the initial
guess for m. (2) Estimate the correlat.ion matrix R from the data. (3) l\Iinimize the objective function qT u with respect to t. (4) Discard the faults for which the individual unreconstructed variances are greater than ~TR<" In the case of sensor faults not only is:Fj discarded from {:F;} but also the sensor is removed from process monitoring. (5) li pdate the number of sensors used for monitoring. If m remains the same, the optimization procedure ends. Otherwise , return to step 2.
~r3
= [0 ~r4 = [0
E:X:\~IPLE
OF
SI~ll'L:\TED
0 0 0 0 0 1 0 0 0 0 0]
:\ total of 1000 samples were randomly generated for the calculation of the correlation matrix at normal process conditions. The minimization of the unreconstructed variance will determine the optimal number of principal components. faults, and sensors to consider for fault identification by reconst ruct ion. \Ve consider all the faults have the same importance in reconstruction , i.e .. all elements of q are one. The profile gi\'en by qT U and qT U are shown in Figure 2. ;\'otice that qT U is monotonically decreasing with / and qT U increases when / gets large , conserving the properties of Uj anduj . respectively. Figure 2 illustrates that / 3 provides the minimum for qT u = qT U + qT U. According to step 4 of the procedure presented earlier we discard faults for \\'hich the individual unreconstructed variance is greater than ~TR 1 at / 3. Therefore, :F12 cannot be reliably reconstructed from the PC A model and is taken out from {:F;}.
The discarded faults due to high unreconstructed variance could be detected and identified with the parameter ~T x. Note that such an approach is not based on data correlation but a probabilistic distribution of ~T x.
4. :\;\'
Ji
0 1 1 0 0 0 0 0 0 0 0 1'2
=
PROCESS
Consider the process flow diagram of Figure l. Two separation columns are used to obtain three different products. The bottom product of the first column is taken to the second column to obtain a second stream of light products. The control valves allow to distribute the amounts of
=
361
AC]\:;\'O\\TEDG \1 E:\"1'
_q
T-
.... . . . q
T /,
U
U
T
--qu
--- ---
/
This work is partly supported by Air Products. ALCOA , Dupont. and Fisher-Rosemount .
6.
------
Cattell. R.B . (1966). Muliivanatc Bchav. Res. 1. 245. Dunia. R. and S. J. Qin (1997). A unified geometric approach to process and sensor fault identification. Comput. Chem. Dunia, R .. J. Qin. T. F. Edgar and T. J . McAvoy (1996). Identification of faulty sensors using principal component analysis. AIChE J.
!
i
iil'1''------2~----3---------'5
l Fig. 2. L"RV vs . the number of PC's for
In
= 12.
42 . 2797- 2812.
Harmon, J.L., Ph. Duboc and D Bonvin (1995). Factor analytical modeing of biochemical data. Comput . Chem. 19, 1287-1300. Joreskog , K. , J. Klovan and R. Reyment (1976). GeologIcal Factor Analysis. Elsevier, Amsterdam . Ku, \-V. , R.H. Storer and C . Georgakis (1995). Disturbance detection and isolation by dynamic principal component analysis. Chem . Intell. Lab Sys . 30 , 179. MacGregor , J. F. and ~1. Koutodi (199.5). Statistical process control of multivariate processes. Control Engineering Practice 3 , 403-414. Malinowski, E.R. (1991). Factor Analysis zn Chemistry. Wiley-Interscience, New York. Martens. H. and T. Naes (1989). lHultivariate Calibration. John Wiley and Sons, New York. :"iomikos , P. and J.F. ~lacGregor (1995). i\lultivariate spc charts for monitoring batch processes. Tcchnometl'lcs 37( 1).41-59. Piovoso . 1\1.. K. Kosanovich and R . Pearson (1992). Monitoring process performance in real-time . In: Proceedzngs of ACe. Chicago. pp. 2359-2363. Wise , B. M. and f\. L. Ricker (1991). Recent advances in multivariate statistical process control: Improving robustness and sensitivity. In: Proceedings of the IF4C. ADCHEM Symposium . pp. 12.5-130. ',"old. S. (1978). Cross validatory estimation of the number of components in factor and principal component analysis. Technomflrics
14, . -------------~---,
I
_q T-U
I
i2r 8; i I
6f I
i
•' I
....
....
... ... ...
1
,
...
! I
4~ I
2f
J
I I
.. .
,i
,
,.
REFERE~CES
I
()1:- ----~=-<.----=-3----4-:------:5
l Fig. 3. URV vs . the number of principal components for m = 11. Fu rthermore. because :1"12 corresponds to a sensor fault, this sensor is dropped out and the number of sensors reduces to In 11 . Figure 3 illustrates the case of In = 11. The minimum is also reached at I = 3. The above results show that :1"12 can not be reconstructed with the PCA approaches; the best reconstructed value for sensor 12 is its average.
=
5. CO:\CLl'SIONS
20. 397-406 .
Wold , S .. K. Esbensen and P. Geladi (1987). Principal component analysis. Chemometl'lcs alld IntclhgeTlt Laboratory Systems 2,37 - 52.
A well-defined unreconstructed variance IS proposed to determine the number of principal components in the PCA model for best reconstruction. L"nlike most other methods in the literature. this proposed l'RV method has a guaranteed minimum over the number of PC's corresponding to the best reconstruction. Therefore. it avoids the arbitrariness of other methods with monotonic indices. The effectiveness of this method is successfully demonstrated with a simulated process.
362