Chinese J. Chem. Eng., 14(1) 65-72 (2006)
Accelerated Recursive Feature Elimination Based on Support Vector Machine for Key Variable Identification*

MAO Yong a, PI Daoying a, LIU Yuming b and SUN Youxian a

a National Laboratory of Industrial Control Technology, Institute of Modern Control Engineering, Zhejiang University, Hangzhou 310027, China
b National Laboratory of Industrial Control Technology, Institute of System Control Engineering, Zhejiang University, Hangzhou 310027, China
Abstract  Key variable identification for classification is related to many trouble-shooting problems in process industries. Recursive feature elimination based on support vector machine (SVM-RFE) was recently proposed for feature selection in cancer diagnosis. In this paper, SVM-RFE is applied to key variable selection in fault diagnosis, and an accelerated SVM-RFE procedure based on a heuristic criterion is proposed. Data from the Tennessee Eastman process (TEP) simulator are used to evaluate the effectiveness of key variable selection using accelerated SVM-RFE (A-SVM-RFE). A-SVM-RFE integrates computational rate and algorithm effectiveness into a consistent framework. It not only correctly identifies the key variables, but also does so quickly. In comparison with contribution charts combined with principal component analysis (PCA) and two other SVM-RFE algorithms, A-SVM-RFE performs better and is more suitable for industrial application.
Keywords  variable selection, support vector machine, recursive feature elimination, fault diagnosis
1 INTRODUCTION
Many changes frequently occur in process industries, both intentional (e.g. product changes, grade changes, and production rate changes) and unintentional (e.g. process faults, process disturbances, and operator errors). A thorough understanding and analysis of collected historical data is pivotal for running chemical plants[1-3]. Among these industrial data mining problems, selecting the variables that drive the difference between faulty and normal system status is a hot topic, especially for unintentional changes. Proper identification of these key variables allows rapid recovery from process faults, so it is crucial for production of consistent quality product. Identifying key process variables between two distinct operating conditions has also been of interest in Ref. [4]. In the remainder of this paper, we address the problem of determining the key variables that drive the difference between datasets from two distinct operating conditions. From this perspective, it is a key variable selection problem for classification. In Refs. [5, 6], contribution charts combined with principal component analysis (PCA) are used for
identifying key process variables for a single faulty observation; however, the results may be misleading, as illustrated in Ref. [6]. A genetic algorithm combined with Fisher discriminant analysis, a wrapper method, is also used for variable selection in Ref. [6], but its heavy computational burden makes it difficult to apply in industrial fields. In this paper, recursive feature elimination based on support vector machine is used for key variable selection, and an accelerated strategy is proposed that trades off computational rate against algorithm performance. A comparison among contribution charts combined with PCA, SVM-RFE eliminating half of the remaining features at each step, SVM-RFE eliminating features one by one, and accelerated SVM-RFE demonstrates the effectiveness of our method. Cross-validation is used as an optimization tool to evaluate the effectiveness of a specific number of ranked variables on sample datasets. Recursive feature elimination based on support vector machine is a data-driven approach, which means that this technique only requires representative data to generate results.
Received 2005-01-11, accepted 2005-11-15.
* Supported by China 973 Program (No.2002CB312200), the National Natural Science Foundation of China (No.60574019 and No.60474045), the Key Technologies R&D Program of Zhejiang Province (No.2005C21087) and the Academician Foundation of Zhejiang Province (No.2005A1001-13). ** To whom correspondence should be addressed.
However, process knowledge is still required to interpret and validate the results.
2 METHODS
2.1 Recursive feature elimination based on support vector machine
Recursive feature elimination based on support vector machine was first proposed by Guyon et al.[7], and has been applied in several gene-expression studies[8] for selecting important features. Here, it is used to determine the key variables in industrial processes. Recursive feature elimination is an iterative procedure for eliminating features by a criterion. It consists of three steps: (1) train the classifier; (2) compute the ranking criterion; (3) remove the feature with the smallest ranking score. A support vector machine is used here as the classifier, and the ranking criterion is tied to the realization of the SVM as well; hence, we call this method SVM-RFE for simplicity. In a support vector machine[9], a classification function is realized as
$$f(x) = \sum_{i=1}^{N} \alpha_i y_i K(x_i, x) + b$$

where the coefficients $\alpha = (\alpha_i)$ and $b$ are obtained by training over a set of examples $\{(x_i, y_i)\}_{i=1,\dots,N}$, $x_i \in \mathbb{R}^n$, $y_i \in \{-1, 1\}$. In the linear case, the SVM expansion defines the hyperplane $f(x) = \langle w, x \rangle + b$, with $w = (w_i) = \sum_{i=1}^{N} \alpha_i y_i x_i$.
The idea is to define the importance of a feature to the SVM in terms of its contribution to a cost function $J(\alpha)$. At each step of the RFE procedure, the SVM is trained on the given dataset, $J$ is computed, and the feature with the smallest contribution to $J$ is discarded. In the case of a linear SVM, the variation due to the elimination of the $i$-th feature is

$$\delta J(i) = w_i^2$$

while in the non-linear case

$$\delta J(i) = \frac{1}{2}\alpha^{\mathrm{T}} Z \alpha - \frac{1}{2}\alpha^{\mathrm{T}} Z^{(-i)} \alpha$$

where $Z_{ij} = y_i y_j K(x_i, x_j)$ and $Z^{(-k)}$ is $Z$ computed with the $k$-th feature removed. In this paper, only the linear SVM case is discussed; in non-linear cases, there remain many computational problems that prevent application in industrial fields, as shown in Ref. [10]. The SVM must be retrained after each elimination operation, because the importance of a feature
with medium-low importance may be promoted by removing a correlated feature. By SVM-RFE, a ranked list of features is thus obtained, from which the key variables may be selected according to the predictive accuracy of cross-validation.
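As an illustration, the loop just described can be sketched as follows. This is a minimal sketch in Python with scikit-learn rather than the authors' MATLAB implementation; the function name and defaults are ours, and the penalty parameter C = 100 is taken from the experimental setup in Section 3.

```python
import numpy as np
from sklearn.svm import SVC

def svm_rfe(X, y, C=100.0):
    """Rank features by linear SVM-RFE, eliminating one feature per loop.

    X: (n_samples, n_features) array; y: class labels in {-1, +1}.
    Returns feature indices ordered from most to least important.
    """
    remaining = list(range(X.shape[1]))
    eliminated = []
    while remaining:
        svm = SVC(kernel="linear", C=C)
        svm.fit(X[:, remaining], y)        # (1) train the classifier
        w = svm.coef_.ravel()              # hyperplane weights of remaining features
        scores = w ** 2                    # (2) ranking criterion w_i^2
        worst = int(np.argmin(scores))     # (3) feature with the smallest score
        eliminated.append(remaining.pop(worst))
    return eliminated[::-1]                # last eliminated = most important
```

Note that the SVM is retrained inside the loop on the surviving features only, which is exactly why the procedure can promote a feature whose correlated partner has been removed.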
2.2 Accelerated recursive feature elimination based on support vector machine
Many industrial processes have hundreds of variables; if features are eliminated one by one, as many SVMs as variables will have to be trained over the whole SVM-RFE procedure. On the other hand, several hundred samples are usually needed to identify the key variables exactly. The computational burden is therefore heavy, and it can take several hours to obtain a list of variables ranked by importance, as mentioned in Ref. [7]. If half of the remaining features are eliminated at each step, the computational rate is much faster, but the most informative features may be eliminated prematurely. How to trade off performance against computational rate is thus an obstacle to industrial application. Note that in every loop of the SVM-RFE algorithm, the absolute values of many weights are generally similar and concentrated close to zero, or are relatively very small. Thus removing a trunk of features corresponding to minimum contributions $w_i^2$ may reduce the number of steps of the RFE procedure significantly. Based on this idea, a strategy for the elimination process based on the value of the contribution $w_i^2$ is proposed. Assume there are $\Lambda$ remaining features in the current training dataset, with one weight per feature from the SVM trained in each step of RFE. Denote the contributions of the features by $\gamma_1, \dots, \gamma_\Lambda$ in ascending order, where $\gamma_i = w_i^2$. The $\Lambda$ contributions are split into $z$ intervals by $p$-quantiles, where

$$p = n \cdot u; \quad n = 1, \dots, z-1 \quad \text{and} \quad u = 1/z$$

The value of the $p$-quantile is defined as

$$M_p = \begin{cases} \gamma_{[\Lambda p]+1}, & \Lambda p \text{ is not an integer} \\ (\gamma_{\Lambda p} + \gamma_{\Lambda p + 1})/2, & \text{otherwise} \end{cases}$$

where $[\Lambda p]$ denotes the integral part of $\Lambda p$. By these $p$-quantiles, all contributions are divided into $z$ intervals. For every interval, its length is calculated as the difference between the maximal and the minimal contribution in the interval.
Denote the lengths of the intervals as $l_m$, $m = 1, \dots, z$. The distribution of $w_i^2$ is estimated from two aspects. We use the coefficient of variability

$$\mathrm{CV} = \frac{1}{\bar{l}}\sqrt{\frac{1}{z}\sum_{m=1}^{z}\left(l_m - \bar{l}\right)^2}, \qquad \bar{l} = \frac{1}{z}\sum_{m=1}^{z} l_m$$

as the first measure to estimate the dispersivity of the weight trunk. $\mathrm{CV} = 0$ means the lengths of all intervals are the same, and when CV is maximal, the lengths of almost all intervals are concentrated near zero. CV is used to estimate the relative distribution density of $\gamma_i$. If CV is less than a threshold $\mathrm{CV}_t$, the lengths of all intervals are considered similar enough; nevertheless, in some cases the contributions within every interval may be of similar magnitude, and it is then unsuitable to eliminate many features at a time. So it is necessary to introduce a further discriminant rule. To estimate the distribution deflection of the contributions $\gamma$, the ratio vector $R = [r_1, \dots, r_z]$ is introduced as the second measure, where

$$r_k = \bar{\gamma}_k / \bar{\gamma}, \quad k = 1, \dots, z; \qquad \bar{\gamma}_k = \sum_{i=1}^{[k\Lambda/z]} \gamma_i \Big/ [k\Lambda/z]; \qquad \bar{\gamma} = \sum_{i=1}^{\Lambda} \gamma_i \Big/ \Lambda$$

If $r_k$ is bigger than a given threshold $\eta_t$, there is considered to be no distinct distribution deflection of the contributions when the $k$-th $p$-quantile is used as a dividing point, and the features on both sides of the $k$-th $p$-quantile are considered to have similar contributions; otherwise, the features ranked ahead of the $k$-th $p$-quantile are considered much less informative than those ranked behind it. Based on these two measures, we summarize the elimination rules for several conditions on the distribution of contributions. In our experiments, $\mathrm{CV}_t$ is set to 1.
(1) If $\mathrm{CV} < \mathrm{CV}_t$, we define $\eta_t = 0.4$.
(a) If $\min\{r_1, \dots, r_z\} > \eta_t$, the features are eliminated one by one.
(b) If $r_k < \eta_t < r_{k+1}$, the leftmost $k$ trunks are discarded.
(2) If $\mathrm{CV} > \mathrm{CV}_t$, we define $\eta_t = 0.15$.
(a) If $\min\{r_1, \dots, r_z\} > \eta_t$, the features are eliminated one by one.
(b) If $r_k < \eta_t < r_{k+1}$, the leftmost $k$ trunks are discarded.
For short, the accelerated recursive feature elimination based on support vector machine is called A-SVM-RFE, SVM-RFE eliminating half of the remaining features at each step is called H-SVM-RFE, and SVM-RFE eliminating features one by one is called O-SVM-RFE.
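One elimination decision under these rules can be sketched as below. This is our illustrative reading of the criterion, not the authors' code: the number of intervals z is assumed to be ceil(sqrt(Λ)) (the original definition of z is not fully legible), the quantile indices are approximated by rounding, and CV_t = 1 with η_t in {0.4, 0.15} follows the text.

```python
import numpy as np

def n_features_to_drop(w, cv_t=1.0):
    """Return how many of the lowest-ranked features A-SVM-RFE would drop
    in the current loop, given the weight vector w of the trained linear SVM."""
    gamma = np.sort(w ** 2)                      # contributions, ascending
    lam = len(gamma)                             # Lambda: remaining features
    z = int(np.ceil(np.sqrt(lam)))               # number of intervals (assumed)
    if z < 2:
        return 1
    # split the sorted contributions into z intervals at the p-quantiles
    edges = [0] + [int(round(lam * k / z)) for k in range(1, z)] + [lam]
    lengths = np.array([gamma[edges[m + 1] - 1] - gamma[edges[m]]
                        for m in range(z)])      # interval lengths l_m
    lbar = lengths.mean()
    cv = lengths.std() / lbar if lbar > 0 else 0.0
    eta_t = 0.4 if cv < cv_t else 0.15           # thresholds from the text
    gbar = gamma.mean()
    # r_k: mean contribution below the k-th quantile over the overall mean
    r = np.array([gamma[:edges[k]].mean() / gbar for k in range(1, z + 1)])
    if r.min() > eta_t:                          # no distribution deflection:
        return 1                                 # fall back to one-by-one
    k = int(np.searchsorted(r, eta_t))           # largest k with r_k < eta_t
    return max(edges[k], 1)                      # drop the leftmost k trunks
```

In A-SVM-RFE this decision replaces step (3) of the RFE loop sketched in Section 2.1: instead of always removing one feature, the loop removes the whole low-contribution trunk when the two measures indicate it is safe to do so.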
3 RESULTS
The Tennessee Eastman process (TEP) is derived from an industrial chemical process. It is a simulation of a real chemical plant, whose flow sheet is shown in Fig.1. The process involves five major units: a reactor, a condenser, a compressor, a stripper, and a vapor/liquid separator. The gaseous reactants A, C, D, E and the inert B are the five inputs to the process; G and H are liquid products, and F is a by-product. All reactions in the reactor are irreversible, and all reaction heat is removed by the cooler on the reactor. Reaction products and surplus reactants are transferred in gaseous state from the reactor to the condenser through line 7. The condensed vapor/liquid mixture is fed into the vapor/liquid separator, from which gaseous reactants are recycled back to the reactor by the compressor through line 8, while the liquid is sent to the stripper through line 10. Residual reactants are stripped out through line 5, connected to the top of the stripper, and mixed with the recycled gas in line 8. Products G and H are drawn from the bottom of the stripper through line 11. Detailed descriptions of the TEP are available in Refs. [11-13]. The plant-wide control structure recommended by Lyman and Georgakis[11] was implemented to generate the closed-loop simulated process data for each fault in Ref. [6], which can be downloaded from http://brahms.scs.uiuc.edu [6]. In the datasets, 480 observations of 52 process variables are simulated for the normal system status, and 800 observations are produced for each faulty operating condition. Of these 52 process variables, 41 are measured variables and 11 are manipulated variables. The sampling period is 3 s. In this paper, the data corresponding to Fault 2, Fault 5, and the normal system condition are drawn out to construct two cases for verifying the effectiveness of our algorithm. Each case has 1280 observations, consisting of 480 normal observations and 800 observations of the specific fault. The same cases are also analyzed in Ref. [6]. As in Ref. [6], the focus of this paper is on faults with stable operating conditions before and after the fault occurs. For Faults 2 and 5, the process goes through a transient before settling to a new stable operating condition.
Figure 1 The process flow sheet of the TEP
Therefore, data points immediately after the fault occurs were not included in the abnormal dataset in any case study. Note that abnormal datasets must contain enough data points to represent the new operating condition. One hundred observations, an observation-to-variable ratio of roughly 2:1, were found sufficient to define the abnormal datasets. For the normal datasets, the same number of observations as in the abnormal datasets is used. The same key variables were obtained when more observations were used in the abnormal or normal datasets. For each case study, 5-fold cross-validation is used as an optimization tool to evaluate the effectiveness of a specific number of ranked variables on the sample datasets. In our experiments, the penalty parameter of the linear-kernel SVM is set to 100. Our algorithms are implemented in MATLAB.
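This evaluation step can be sketched as follows: for each number k of top-ranked variables, a linear SVM is scored by 5-fold cross-validation, producing the success-rate curves plotted in Figs. 4 and 7. Again this is a Python illustration under the same assumptions as before; the helper name is ours.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def success_rate_curve(X, y, ranking, C=100.0):
    """5-fold cross-validated success rate using the top-k ranked variables.

    ranking: feature indices ordered most important first (e.g. from SVM-RFE).
    Returns a list of (k, mean accuracy) pairs, one per candidate subset size.
    """
    curve = []
    for k in range(1, len(ranking) + 1):
        top_k = list(ranking[:k])
        scores = cross_val_score(SVC(kernel="linear", C=C),
                                 X[:, top_k], y, cv=5)
        curve.append((k, scores.mean()))
    return curve
```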
3.1 Case study on Fault 5
Fault 5 is a step change in the condenser cooling water inlet temperature at t = 24 h, which is unmeasured, as described in Ref. [6]. The fault induces transients in 34 variables, some of which are shown in Fig.2 (variable 52: the condenser cooling water flow valve; variable 11: the temperature in the vapor/liquid separator; variable 22: the separator cooling water outlet temperature; variable 18: the stripper temperature). The control loops are able to compensate for the change, and the temperature in the separator returns to its set point. This case was analyzed by contribution charts combined with PCA for key variable selection in Ref. [6]. By the T² statistic combined with PCA, variables 20 and 52 are wrongly indicated as the key variables; however, when these two variables are plotted against each other, no separation between the normal and abnormal samples is seen. By O-SVM-RFE, H-SVM-RFE, and A-SVM-RFE, variables 17 and 52 are indicated as the key variables, which is consistent with the results achieved by the genetic algorithm combined with Fisher discriminant analysis in Ref. [6]. Using these two variables, the samples of the whole Fault 5 procedure are shown in the left plot of Fig.3, and the samples used in key variable selection in the right plot. In fact, variable 52 is used in a PI loop to control variable 17 in the original TEP simulator[6], so the correlation between these two variables is reasonable. In Fig.4, the variable selection performance of O-SVM-RFE, H-SVM-RFE, A-SVM-RFE, and the T² statistic contribution charts combined with PCA is compared by the success rate of 5-fold cross-validation. A linear SVM is used as the classifier in cross-validation.
Figure 2 The time series plots for (a) the condenser cooling water valve (variable 52), (b) the separator temperature (variable 11), (c) the separator cooling water outlet temperature (variable 22), and (d) the stripper temperature (variable 18) in Fault 5

Figure 3 (a) The bivariate plot for variables 17 and 52 in the whole fault procedure and (b) the bivariate plot for variables 17 and 52 in the normal and abnormal training datasets (○ normal sample; ■ abnormal sample)
Figure 4 Comparison of four key variable selection methods by the success rate of 5-fold cross-validation (A-SVM-RFE; H-SVM-RFE; O-SVM-RFE; contribution chart)
The computational performance of O-SVM-RFE, H-SVM-RFE, and A-SVM-RFE is compared in Table 1, using ranking time, number of ranking loops, and effectiveness of the ranked variables as indices.

Table 1 Computational performance comparison of O-SVM-RFE, H-SVM-RFE and A-SVM-RFE on the Fault 5 dataset

Algorithm     Ranking time, s    Ranking loops    Effectiveness of ranked variables
O-SVM-RFE     239.5              51               high
H-SVM-RFE     32.3               6                high
A-SVM-RFE     23.3               4                high
Using the top 2 variables selected by these three methods, the success rate of cross-validation is 96.5%; using the top 2 variables selected by contribution charts combined with PCA, it is only 53.5%. There is little difference among the results achieved by A-SVM-RFE, O-SVM-RFE, and H-SVM-RFE. In this experiment, the contributions of the useless variables are very small, the contributions of the informative variables are relatively very large, and the number of useless variables is large; so A-SVM-RFE eliminated 42 variables in the first loop and 6 variables in the second loop, using 4 loops in total to rank all features. Its computational rate is the fastest of the three algorithms, and its effectiveness does not degrade. The results obtained by these three methods are clearly better than those obtained by the T² statistic contribution chart method combined with PCA, although the latter is very time-saving. A similar result is achieved by the genetic algorithm combined with Fisher discriminant analysis in Ref. [6], which spends 22 minutes to obtain the key variables. A-SVM-RFE spends 23.3 seconds on a 1533 MHz AMD 1800+ computer with 256 MB RAM.
3.2 Case study on Fault 2
Fault 2 is a step change of the gaseous inert B in a stream input channel at t = 24 h, as described in Ref. [6]. It causes changes in the stripper pressure (variable 16) and in the B compositions in the reactor feed, the purge stream, and the product stream (variables 24 and 30). The control loop reacts and increases the purge rate (variables 10 and 47) to counteract these changes. As a result, the inert B composition in these places returns to normal, while the by-product F composition decreases (variable 28). All of them are shown in Fig.5. This case was also analyzed by contribution charts combined with PCA and by the genetic algorithm combined with Fisher discriminant analysis in Ref. [6]. By both methods, variables 10, 28, 34, and 47 are identified as key variables. However, contribution charts combined with PCA do not identify all the key variables that are identified by the genetic algorithm combined with Fisher discriminant analysis. By the latter, a total of 16 variables are selected as key variables: variables 3, 7, 10, 11, 13, 16, 18, 19, 22, 28, 33, 34, 35, 43, 47, and 50. Using the three most important variables 34, 10, and 28, the samples of the whole Fault 2 procedure are shown in the left plot of Fig.6, and the samples used in key variable selection in the right plot. By A-SVM-RFE, 15 of these 16 key variables are ranked among its top 16 variables, and variables 10, 28, 34, and 47 are ranked as the top 4 informative variables. The results are similar to those achieved by O-SVM-RFE. In this experiment, the contributions of 30 informative variables are very close. A-SVM-RFE eliminated 21 variables in the first loop, and the remaining 31 variables were ranked in 14 further loops. Although its computational rate is not the fastest of the three algorithms, its effectiveness is the highest. H-SVM-RFE is the fastest ranking method of the three on this dataset; by H-SVM-RFE, variables 10, 28, and 34 are ranked in the top 4 variables, but variable 47 is lost. The relative importance of variables is not considered in H-SVM-RFE, and many informative variables are eliminated prematurely, so its ranking performance is limited. In Fig.7, the variable selection performance of O-SVM-RFE, H-SVM-RFE, A-SVM-RFE, and the T² statistic contribution chart combined with PCA is compared by the success rate of 5-fold cross-validation. A linear SVM is used as the classifier in cross-validation. Using the top 4 variables (34, 10, 28, and 47) selected by A-SVM-RFE, the success rate of cross-validation is 98%, a little better than using the top 4 variables ranked by H-SVM-RFE, where the success rate is 96%. The difference between the results achieved by A-SVM-RFE and O-SVM-RFE is very small. In Ref. [6], 22 minutes are consumed to achieve the key variables by the genetic algorithm combined with Fisher discriminant analysis; by A-SVM-RFE, 73.8 seconds are needed. The computational performance comparison of O-SVM-RFE, H-SVM-RFE, and A-SVM-RFE is listed in Table 2.
Figure 5 The time series plots for (a) the stripper pressure (variable 16), (b) the purge valve (variable 47), (c) the purge rate (variable 10), (d) the inert B composition in the purge stream (variable 24), (e) the by-product F composition in the purge stream (variable 28), and (f) the inert B composition in the reactor feed (variable 30) in Fault 2
Figure 6 (a) The tri-variate plot for variables 34, 10, and 28 in the whole fault procedure 2 and (b) the tri-variate plot for variables 34, 10, and 28 in the normal and abnormal training datasets (○ normal sample; * abnormal sample)
Figure 7 Comparison of four key variable selection methods by the success rate of 5-fold cross-validation (A-SVM-RFE; H-SVM-RFE; O-SVM-RFE; contribution chart)
Table 2 Computational performance comparison of O-SVM-RFE, H-SVM-RFE, and A-SVM-RFE on the Fault 2 dataset

Algorithm     Ranking time, s    Ranking loops    Effectiveness of ranked variables
O-SVM-RFE     241.0              51               high
H-SVM-RFE     32.5               6                medium
A-SVM-RFE     73.8               15               high
4 CONCLUSIONS
In this paper, we have studied the problem of key variable identification by support vector machine based recursive feature elimination (SVM-RFE) and its accelerated algorithm (A-SVM-RFE). Data from the Tennessee Eastman process (TEP) simulator were used to evaluate the effectiveness of these methods. A-SVM-RFE integrates computational rate and algorithm effectiveness into one framework. It not only correctly identifies the key variables, but also does so quickly. Compared with contribution charts combined with PCA, SVM-RFE eliminating features one by one, and SVM-RFE eliminating half of the remaining features at each step, it performs better in the two TEP cases tested, and is therefore more suitable for industrial application.
REFERENCES
1  Liang, J., Qian, J.X., "Multivariate statistical process monitoring and control: Recent developments and applications to chemical industry", Chinese J. Chem. Eng., 11(2), 191-203 (2003).
2  Venkatasubramanian, V., Rengaswamy, R., Kavuri, S.N., Yin, K.W., "A review of process fault detection and diagnosis Part III: Process history based methods", Comp. Chem. Eng., 27, 327-346 (2003).
3  Chu, Y.H., Qin, S.J., Han, C., "Fault detection and operation identification based on pattern classification with variable selection", Ind. Eng. Chem. Res., 43, 1701-1710 (2004).
4  Chen, F.Z., Wang, X.Z., "Discovery of operational spaces from process data for production of multiple grades of products", Ind. Eng. Chem. Res., 39, 2378-2383 (2000).
5  Miller, P., Swanson, R.E., "Contribution plots: A missing link in multivariate quality control", Appl. Math. Comp. Sci., 8, 775-792 (1998).
6  Chiang, L.H., Pell, R.J., "Genetic algorithms combined with discriminant analysis for key variable identification", J. Process Control, 14, 143-155 (2004).
7  Guyon, I., Weston, J., Barnhill, S., Vapnik, V., "Gene selection for cancer classification using support vector machines", Mach. Learn., 46, 389-422 (2002).
8  Mao, Y., Zhou, X., Pi, D., Sun, Y., Wong, S.T.C., "Multi-class cancer classification by using fuzzy support vector machine and binary decision tree with gene selection", J. Biomed. Biotechnol., 2, 160-171 (2005).
9  Vapnik, V.N., The Nature of Statistical Learning Theory, 2nd ed., Springer, New York (2000).
10 Mao, Y., Zhou, X., Pi, D., Sun, Y., Wong, S.T.C., "Parameters selection in gene selection using Gaussian kernel support vector machines by genetic algorithm", J. Zhejiang Univ. Sci., 6B, 961-973 (2005).
11 Lyman, P.R., Georgakis, C., "Plant-wide control of the Tennessee Eastman problem", Comp. Chem. Eng., 19, 321-331 (1995).
12 Downs, J.J., Vogel, E.F., "A plant-wide industrial-process control problem", Comp. Chem. Eng., 17, 245-255 (1993).
13 Chiang, L.H., Russell, E.L., Braatz, R.D., Data-driven Methods for Fault Detection and Diagnosis in Chemical Processes, Springer, London (2000).