SVM-tree and SVM-forest algorithms for imbalanced fault classification in industrial processes

Gecheng Chen, Zhiqiang Ge

State Key Laboratory of Industrial Control Technology, Institute of Industrial Process Control, College of Control Science and Engineering, Zhejiang University, Hangzhou 310027, Zhejiang, P. R. China
Abstract: Fault classification plays a central role in process monitoring and fault diagnosis for complex industrial processes. Plenty of fault classification methods have been proposed under the assumption that the sizes of the different fault classes are similar. In practical industrial processes, however, it is common that a large amount of normal data (the majority) and only a few fault samples (the minority) are collected. In other words, fault classification usually has to be carried out under an imbalanced data scenario, which restricts the performance of traditional classification algorithms. In this paper, a K-means based SVM-tree algorithm is proposed to deal with the nonlinear multiple-classification problem under imbalanced data. Meanwhile, an SVM-forest scheme is further developed for sensitive data selection and performance enhancement when the degree of imbalance among different classes becomes larger. The effectiveness of the proposed methods is verified on the Tennessee Eastman (TE) benchmark process.
Keywords: Imbalanced data; K-means; Support vector machine; Sensitive data selection; Fault classification.

Corresponding author: Tel. +86-571-87951442; E-mail address: [email protected] (Z. Ge)
1. Introduction
With the development of modern complex industrial processes, exceptionally large amounts of data have been produced and stored for process analytics (Feital T, Kruger U, Dutra J, Pinto J, Lima E. 2013; Ge Z, Song Z, Gao, F. 2013; Evchina, Y., Puttonen, J., Dvoryanchikova, A., & Lastra, J. L. M. 2015; Ge et. al., 2017; Ge, 2018; Wang, Tianzhen, et al, 2016). As a result, data-driven approaches are developing rapidly and have become increasingly popular for process monitoring (Ge et. al., 2013; Yin, S., Ding, S. X., Xie, X., & Luo, H. 2014; Ge, 2017; Zhu, et. al., 2018). Particularly in recent years, plenty of data-driven methods have been developed and applied to fault classification in industrial processes, such as Fisher discriminant analysis (FDA), support vector machines (SVM) (Namdari, M., & Jazayeri-Rad, H. 2014), and artificial neural networks (ANN) (Jane, A. P., & Pund, M. A. 2012; Liu, Y., & Ge, Z. 2018; Jing, C., & Hou, J. 2015; Ge and Liu, 2019). The support vector machine (SVM), proposed by Vapnik and his colleagues (Cortes, C., & Vapnik, V. 1995), has been successfully applied to many problems. Compared to traditional classification methods, SVM has a stronger generalization ability and high classification accuracy in practical applications, especially for datasets with nonlinearities. Some other algorithms have also been developed for nonlinear classification, such as the Gaussian mixture model (GMM) (Rasmussen C E, Williams C K I. 2005) and the LWPR-aided predictive control strategy (Gao, et al. 2018).
The imbalanced data problem refers to classification scenarios in which one or several classes are much larger in size than the others. It is very common in practical industrial processes, because there is usually a large amount of normal data samples (the majority) but only a few fault data samples (the minority). This imbalance breaks the basic assumption of most classification methods and leads to poor classification performance. When imbalance and nonlinearity both exist in a dataset, the classification precision may become even worse. In recent years, the imbalanced classification problem has therefore drawn much attention.
Generally, there are two types of imbalanced data problems, namely absolute scarcity of data and relative scarcity of data. Absolute scarcity of data means that the minority samples are so scarce that they cannot describe the class boundary clearly. As a result, a classifier trained on such minority samples will show unsatisfactory classification accuracy. In fact, absolute scarcity of data is the greatest challenge that the imbalanced classification problem poses to traditional methods. Weiss showed that the classification error rate is much higher than in other cases when absolute scarcity of data occurs (Weiss, G. M. 1995; Weiss, G. M., & Hirsh, H. 2000). Wang developed a method called SMOTE-biased-SVM (Wang, H. Y. 2008), in which minority samples are added by the Synthetic Minority Oversampling Technique (SMOTE) during the SVM hyperplane training procedure. This method tries to reduce the imbalance degree and the absolute scarcity of data by over-sampling the minority. However, the SMOTE strategy cannot add new information to the training process, so it is of little use against absolute scarcity; moreover, SMOTE may cause serious overfitting.
Relative scarcity of data means that the number of minority samples is much smaller than that of the majority, so the information in the minority is overwhelmed by the majority during training. For example, under the assumption that the sample sizes of the classes are not very different, SVM trains the hyperplane based on the distances of all samples to the hyperplane. Under imbalance, the training result is inevitably dominated by the majority, which usually causes poor classification performance for the minority; in extreme cases, the minority may even be treated as noise and ignored during training. Lin proposed the under-sampling-assemble-SVM algorithm to reduce the degree of imbalance among different classes (Lin, Z. Y., Hao, Z. F., Yang, X. W., & Liu, X. L. 2009). In that work, sub-classifiers are trained by SVM using the minority classes together with randomly selected subsets of the majority classes that have the same size as the minority classes, and these sub-classifiers are then assembled to obtain the final classification result. This method can deal with relative scarcity, but it inevitably loses some useful information from the majority, and it cannot handle possible absolute scarcity either.
To this end, a K-means based SVM-tree is proposed in this paper for nonlinear imbalanced data classification. First, the majority is clustered into M sub-classes according to the degree of imbalance by a T-threshold K-means method. These sub-classes are combined with the N minority classes to form a training dataset with (M+N) classes, on which an SVM-tree classifier is trained. Compared to traditional methods, the K-means based SVM-tree neither adds nor removes samples. As a result, the original information and distribution of the training set are preserved to the greatest extent, which supports the good performance of the trained classifier.
In addition, this method helps the minority to describe its boundary precisely, and thus performs better than other multiple-classification methods. The SVM-tree also keeps all classes (and sub-classes) unbiased throughout the training process, which is an important consideration for imbalanced datasets. When the degree of imbalance among different classes becomes very large, the SVM-tree needs to cluster the majority into a large number of sub-classes to ensure that there is no imbalance between the new sub-classes and the minority classes, which causes both a heavy computational burden and low precision. To handle this problem, an SVM-forest sensitive data selection method is further proposed to select the samples that carry the most useful information for classifying the imbalanced data, so as to decrease the imbalance. In this way, a large amount of data that has little effect on the description of the boundaries is ignored during training, which makes the SVM-forest based SVM-tree effective in extreme imbalance cases.
The rest of this paper is organized as follows. Section 2 briefly introduces the SVM method. Section 3 provides detailed information on the proposed K-means based SVM-tree method for nonlinear imbalanced fault classification, followed by the further developed SVM-forest method for sensitive data selection under a larger degree of imbalance in the next section. The performance of the proposed methods is evaluated on the Tennessee Eastman (TE) benchmark process in Section 5. Finally, conclusions are drawn in the last section.
2. Introduction to SVM algorithm
Consider a binary classification problem with labeled training set $D = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_m, y_m)\}$, where $y_i \in \{-1, +1\}$. The SVM algorithm tries to find the best hyperplane
$$\boldsymbol{\omega}^{T}\mathbf{x} + b = 0$$
which separates the samples of the two classes. The distance of a sample to the hyperplane is (Cortes, C., & Vapnik, V. 1995)
$$r = \frac{|\boldsymbol{\omega}^{T}\mathbf{x} + b|}{\|\boldsymbol{\omega}\|} \qquad (1)$$
If the hyperplane separates the samples correctly, then for every $(\mathbf{x}_i, y_i) \in D$
$$\boldsymbol{\omega}^{T}\mathbf{x}_i + b \geq +1, \; y_i = +1; \qquad \boldsymbol{\omega}^{T}\mathbf{x}_i + b \leq -1, \; y_i = -1. \qquad (2)$$
The training samples closest to the hyperplane, for which the equality in Eq. (2) holds, are called support vectors. The distance between two support vectors belonging to different classes is called the margin:
$$\gamma = \frac{2}{\|\boldsymbol{\omega}\|} \qquad (3)$$
The hyperplane that maximizes $\gamma$ is exactly the hyperplane we want. That is, we seek $\boldsymbol{\omega}$ and $b$ that maximize $\gamma$, which leads to the optimization problem
$$\max_{\boldsymbol{\omega}, b} \; \frac{2}{\|\boldsymbol{\omega}\|} \quad \text{s.t.} \quad y_j(\boldsymbol{\omega}^{T}\mathbf{x}_j + b) \geq 1, \quad j = 1, 2, \ldots, m \qquad (4)$$
More detailed information on the basic SVM algorithm can be found in (Smola, A., Bartlett, P., Schölkopf, B., & Schuurmans, D. 2000; Schuldt, C., Laptev, I., & Caputo, B. 2004).
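To make the role of Eq. (4) concrete, the following is a minimal sketch (not taken from this paper) of training a soft-margin SVM with an RBF kernel on a synthetic imbalanced binary dataset using scikit-learn; the data, class sizes and parameter values are illustrative assumptions only. It also prints the minority recall, which is typically low under such a strong imbalance and motivates the methods in the next sections.

```python
# Illustrative sketch: soft-margin RBF SVM on a synthetic imbalanced dataset.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)
# Majority (label +1): 2000 samples; minority (label -1): 40 samples (assumed toy data).
X_major = rng.normal(loc=0.0, scale=1.0, size=(2000, 2))
X_minor = rng.normal(loc=1.5, scale=0.6, size=(40, 2))
X = np.vstack([X_major, X_minor])
y = np.hstack([np.ones(2000), -np.ones(40)])

clf = SVC(kernel="rbf", C=1.0, gamma="scale")   # solves the soft-margin form of Eq. (4) in its dual
clf.fit(X, y)

y_hat = clf.predict(X)
print("minority recall:", recall_score(y, y_hat, pos_label=-1))
```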
3. K-means SVM-tree algorithm
3.1 The introduction to K-means SVM-tree algorithm
Given a training set $W_l = [\mathbf{X}_1; \mathbf{X}_2; \ldots; \mathbf{X}_{C+1}]$ that contains $C$ fault modes and one normal mode, $\mathbf{X}_i = [\mathbf{x}_1; \mathbf{x}_2; \ldots; \mathbf{x}_{n_i}]$, $i = 1, 2, \ldots, C+1$, denotes the samples of the $i$th class, where $\mathbf{X}_i \in R^{n_i \times m}$, $\mathbf{x}_j = [a_1, a_2, \ldots, a_m]$, $j = 1, 2, \ldots, n_i$, $n_i$ is the number of samples in the $i$th class, and $m$ is the number of variables. The labels of the normal mode and the $C$ fault modes are $1$ to $C+1$ in sequence, that is, $\mathbf{Y}_i = [i, i, \ldots, i]$, $i = 1, 2, \ldots, C+1$. The normal class $\mathbf{X}_1$ is assumed to be the majority, and the other classes are assumed to be the minority. Furthermore, the sample sizes of the minority classes are assumed to be similar, and the degree of imbalance $n$ is defined as the ratio of the amount of data between the majority and the minority classes, that is,
$$n = \frac{n_1}{n_2} \approx \frac{n_1}{n_3} \approx \ldots \approx \frac{n_1}{n_{C+1}}.$$
Figure 1(a) takes a binary classification problem as an example: the blue samples represent the majority and the black samples represent the minority.
The main idea of the K-means SVM-tree algorithm can be illustrated as follows.

Table 1: T-threshold K-means algorithm
Input: majority training set $\mathbf{X}_1 = [\mathbf{x}_1; \mathbf{x}_2; \ldots; \mathbf{x}_{n_1}]$, where $\mathbf{x}_j \in R^m$, $j = 1, 2, \ldots, n_1$; number of clusters $k$; threshold $T$
Process:
  Select $k$ samples randomly from $\mathbf{X}_1$ as initial mean vectors $\{\boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \ldots, \boldsymbol{\mu}_k\}$;
  Repeat
    Let $U_i = \varnothing$ $(i = 1, 2, \ldots, k)$;
    For $j = 1, 2, \ldots, n_1$ do
      Calculate the distance between sample $\mathbf{x}_j$ in $\mathbf{X}_1$ and each mean vector $\boldsymbol{\mu}_i$, $i = 1, 2, \ldots, k$: $d_{ji} = \|\mathbf{x}_j - \boldsymbol{\mu}_i\|$;
      Sort the distances in ascending order, $D_j = [d_{ji}]$, $i = 1, 2, \ldots, k$;
      For $g = 1, 2, \ldots, k$ do
        Assume $D_j[g] = d_{jt}$, where $t \in \{1, 2, \ldots, k\}$;
        If $|U_t| = T$: continue;
        If $|U_t| < T$: $U_t = U_t \cup \{\mathbf{x}_j\}$, break;
      End for
    End for
    For $i = 1, 2, \ldots, k$ do
      Calculate the new mean vector $\boldsymbol{\mu}_i' = \frac{1}{|U_i|}\sum_{\mathbf{x} \in U_i} \mathbf{x}$;
      If $\boldsymbol{\mu}_i' \neq \boldsymbol{\mu}_i$, let $\boldsymbol{\mu}_i = \boldsymbol{\mu}_i'$;
    End for
  Until the current mean vectors do not change
  Update the labels of the samples: $y_j = i$ if $\mathbf{x}_j \in U_i$, where $y_j$ is the label of $\mathbf{x}_j$
Output: $U_{major} = \{U_1, U_2, \ldots, U_k\}$, $Y_{major} = \{\mathbf{Y}_1; \mathbf{Y}_2; \ldots; \mathbf{Y}_k\}$
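A minimal sketch of the T-threshold K-means step in Table 1 is given below (an illustrative re-implementation, not the authors' code); the function name and the convergence test are assumptions. Each majority sample is assigned to the nearest mean vector whose cluster has not yet reached the threshold T, and the means are re-estimated until they stop changing.

```python
import numpy as np

def t_threshold_kmeans(X1, k, T, max_iter=100, seed=0):
    """Cluster the majority X1 (n1 x m array) into k sub-classes of at most T samples each."""
    assert k * T >= len(X1), "T must be large enough for all samples to be assigned"
    rng = np.random.default_rng(seed)
    mu = X1[rng.choice(len(X1), size=k, replace=False)]          # initial mean vectors
    labels = np.full(len(X1), -1)
    for _ in range(max_iter):
        counts = np.zeros(k, dtype=int)
        new_labels = np.empty(len(X1), dtype=int)
        for j, x in enumerate(X1):
            order = np.argsort(np.linalg.norm(x - mu, axis=1))   # distances in ascending order
            for t in order:                                      # nearest cluster that is not full
                if counts[t] < T:
                    new_labels[j], counts[t] = t, counts[t] + 1
                    break
        new_mu = np.array([X1[new_labels == i].mean(axis=0) if counts[i] > 0 else mu[i]
                           for i in range(k)])
        converged = np.allclose(new_mu, mu)
        mu, labels = new_mu, new_labels
        if converged:
            break
    return labels, mu   # sub-class index of every majority sample, and the final cluster centers
```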
In the first step, the majority is clustered into $k$ sub-classes by the T-threshold K-means method, denoted as $\mathbf{X}_1 = [U_1; U_2; \ldots; U_k]$. Because the number of samples in one cluster may otherwise be much larger than in the others, a threshold $T$ is set to ensure that the amounts of data in the resulting sub-classes do not differ too much. The details of the T-threshold K-means algorithm are given in Table 1. As can be seen in Figure 1(b), the majority has been clustered into 11 sub-classes. In the second step, the center of each class (or sub-class) is obtained, denoted as $O = [\mathbf{o}_1, \mathbf{o}_2, \ldots, \mathbf{o}_{C+k}]$; the stars in Figure 1(c) represent the centers of the different classes. In the third step, $O$ is clustered into two classes $O_1$ and $O_2$ by K-means. In Figure 1(c), $O_1$ represents the center set in the upper circle and $O_2$ the center set below, while $D_1$ and $D_2$ denote the training sets corresponding to the centers belonging to $O_1$ and $O_2$, respectively. In the fourth step, an SVM hyperplane is trained on the datasets $D_1$ and $D_2$; the red plane in Figure 1(d) represents the best plane separating $D_1$ and $D_2$. This hyperplane is called the root node of the SVM-tree, and $D_1$ and $D_2$ are the two branches of the node. Table 2 summarizes the node training procedure. In the fifth step, the node training algorithm is repeated on $D_1$ and $D_2$ until every new node contains samples of one class only. These nodes with only one class are called leaf nodes, and the label of the class belonging to a leaf node is called the label of that leaf node.

Table 2: Node training algorithm for SVM-tree
Input: node set $W_l' = [U_1; U_2; \ldots; U_k; \mathbf{X}_2; \ldots; \mathbf{X}_{C+k}]$, $Y = [\mathbf{Y}_1; \mathbf{Y}_2; \ldots; \mathbf{Y}_{C+k}]$
Process:
  Calculate the center of every class in $W_l'$ to form the center set $O = [\mathbf{o}_1, \mathbf{o}_2, \ldots, \mathbf{o}_{C+k}]$;
  Cluster $O$ into two classes $O_1$ and $O_2$; $D_1$ is the training set corresponding to the centers belonging to $O_1$, and likewise for $D_2$;
  Train a hyperplane $S$ between $D_1$ and $D_2$ by using SVM.
Output: hyperplane $S$; branches of the node $D_1$, $D_2$
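The node-training step of Table 2 can be sketched as follows (an illustrative implementation with assumed function and variable names, not the authors' code): the class centers are clustered into two groups with ordinary K-means, and an SVM hyperplane is trained between the two corresponding sample sets.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def train_node(class_datasets):
    """class_datasets: list of arrays, one per class or sub-class present at this node."""
    centers = np.array([Xi.mean(axis=0) for Xi in class_datasets])      # o_1, ..., o_{C+k}
    side = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(centers)
    D1 = [Xi for Xi, s in zip(class_datasets, side) if s == 0]          # branch 1
    D2 = [Xi for Xi, s in zip(class_datasets, side) if s == 1]          # branch 2
    X = np.vstack(D1 + D2)
    y = np.hstack([np.zeros(sum(len(Xi) for Xi in D1)),
                   np.ones(sum(len(Xi) for Xi in D2))])
    svm = SVC(kernel="rbf", gamma="scale").fit(X, y)                    # hyperplane S of the node
    return svm, D1, D2
```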
When testing a sample $\mathbf{x}_o = [a_1, a_2, \ldots, a_m]$, the sample is moved to the corresponding branch according to the classification result of each node's hyperplane, step by step, starting from the root node until it reaches a leaf node with label $y_o$. The label of the test sample $y_r$ is then determined as
$$y_r = \begin{cases} 1, & y_o \in [1, k] \\ y_o - k + 1, & y_o \in [k+1, C+k] \end{cases} \qquad (5)$$
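The traversal and the label decoding of Eq. (5) can be sketched as follows (illustrative, not the authors' code). A node is assumed to be a dictionary with keys "svm", "left" and "right" for internal nodes, or "leaf_label" for leaves, e.g. as obtained by repeating the node-training step above until single-class nodes remain.

```python
import numpy as np

def svm_tree_predict(root, x, k):
    """Descend from the root to a leaf, then map the tree label y_o back to the class label."""
    node = root
    while "leaf_label" not in node:
        side = node["svm"].predict(np.asarray(x).reshape(1, -1))[0]
        node = node["left"] if side == 0 else node["right"]
    y_o = node["leaf_label"]
    # Eq. (5): sub-classes 1..k all belong to the normal class (label 1);
    # tree labels k+1..C+k correspond to the original minority labels 2..C+1.
    return 1 if y_o <= k else y_o - k + 1
```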
Figure 1: (a) a binary imbalanced classification problem; (b) the majority clustered into some sub-classes; (c) cluster the center set into two sub-classes; (d) train SVM hyperplane according to the result in (c).
3.2 Performance analyses
As illustrated in Figure 2(a), stars represent the center points of the classes (or sub-classes), diamonds are the testing samples of the minority, the red thick dotted line is the boundary that can be described by the minority, and the black thick dotted line is the real boundary. Obviously, both relative scarcity and absolute scarcity exist in this example. When an SVM classifier is trained on this training set, the hyperplane is biased toward the minority because of the relative scarcity; worse still, the minority cannot describe its boundary clearly. When testing the six samples, the three samples closer to the minority center are classified relatively accurately, while the other three samples near the majority center are likely to be classified wrongly.
For the relative scarcity problem, the K-means SVM-tree classification method divides the majority into several sub-classes to eliminate the impact of data imbalance. Compared to the existing under-sampling or over-sampling methods, this method neither adds nor removes samples, which preserves the integrity of the original data to the greatest extent and effectively prevents adverse effects such as overfitting or loss of information. For the absolute scarcity problem, the SMOTE method can only add samples within the boundary described by the given minority class, that is, only within the red thick dashed box in Figure 2(a), which cannot help the minority describe the real boundary. With the K-means SVM-tree, the majority class can instead be used to outline the boundary of the minority class from the opposite side. As the number of sub-classes increases, the results change as shown in Figure 2(b)-(c), in which the red thin dotted lines connect the centers of the sub-classes closest to the minority. With an increasing number of sub-classes for the majority, the red thin dotted lines get closer and closer to the real boundary between the two classes. Ideally, when the majority class is clustered into an infinite number of sub-classes, as in Figure 2(d), the red dashed line connects the majority samples closest to the boundary and can be regarded as the actual boundary.
Figure 2: (a) a binary imbalanced classification problem with relative and absolute scarcity of data; (b) the binary classification problem transferred to a multi-classification problem; (c) the line between the centers of the majority sub-classes gets closer to the real boundary after clustering the majority into more sub-classes; (d) the line between the centers of the majority sub-classes approximately overlaps the real boundary if the majority is clustered into infinitely many sub-classes.
4. SVM-forest for sensitive data selection and imbalanced fault classification
When the degree of data imbalance grows larger, the majority needs to be divided into many sub-classes by the SVM-tree algorithm, which introduces a high computational burden and may deteriorate the classification performance. To this end, an SVM-forest based sensitive data selection method is further developed in this paper. Considering the same kind of training set, Figure 3(a) again takes a binary problem as an example, where the blue samples represent the majority and the black samples represent the minority. The main steps of the algorithm are as follows.
Step 1: Randomly select the same number of samples from each minority class to form the contemporary test set $Z_l = [\mathbf{Z}_2; \mathbf{Z}_3; \ldots; \mathbf{Z}_{C+1}]$.
Step 2: Cluster the majority into $k$ sub-classes by the T-threshold K-means algorithm, that is, $\mathbf{X}_1 = [U_1, U_2, \ldots, U_k]$. Because the degree of imbalance is large, these sub-classes may still be imbalanced with respect to the minority. As can be seen in Figure 3(b), the majority has been divided into 11 sub-classes, and these sub-classes are still imbalanced toward the minority.
Step 3: Train an SVM-tree $G$ on $W_l = [U_1; U_2; \ldots; U_k; \mathbf{X}_2; \ldots; \mathbf{X}_{C+k}]$, and test $G$ with the contemporary test set $Z_l$. If the classification accuracy meets the requirement $\varepsilon$, the training process is stopped; otherwise, the following steps are carried out.
Step 4: Train an SVM-forest for every minority class and the sub-classes of the majority. For a minority class $\mathbf{X}_i$ ($i \neq 1$), combine it with every sub-class of the majority to obtain $k$ training sub-sets $Q_{ri} = [U_r; \mathbf{X}_i]$, $r = 1, 2, \ldots, k$. Then train an SVM-tree on every sub-set to obtain the tree set $T_i = [t_{1i}, t_{2i}, \ldots, t_{ki}]$, which is called an SVM-forest in this paper. Test all trees in this forest with the contemporary test set $\mathbf{Z}_i$ and calculate the precisions $P_i = [p_{1i}, p_{2i}, \ldots, p_{ki}]$ for the minority class $\mathbf{X}_i$, which reflect the separability between each sub-class and $\mathbf{X}_i$. Basically, the smaller the precision, the worse the separability between that sub-class and $\mathbf{X}_i$, and the more significant the effect of that sub-class on the classification performance. The $n$ sub-classes with the smallest precisions, i.e. those most relevant to the classification performance, are retained and re-formed into a new majority class $\mathbf{X}_1^i$, where $n/k$ is called the selection rate.
Step 5: Repeat Step 4 for every minority class to obtain the SVM-forests (one per minority class). From these SVM-forests, a new set of majority subsets $[\mathbf{X}_1^2, \mathbf{X}_1^3, \ldots, \mathbf{X}_1^{C+1}]$ is obtained; their union gives the new majority training set $\mathbf{X}_1'$ for the next iteration. In Figure 3(c), five sub-classes with relatively low separability from the minority are chosen, and they are re-formed into the new majority training set shown as the blue samples in Figure 3(d).
Step 6: Repeat Steps 1-5 until the requirement in Step 3 is met. In Figure 3(d), the majority is clustered again into sub-classes as in Figure 3(e), and Steps 1-5 are repeated. The detailed procedure of the whole algorithm is given in Table 3.

Table 3: Sensitive data selection method based on SVM-forest
Input: training set $X_l = [\mathbf{X}_1; \mathbf{X}_2; \ldots; \mathbf{X}_{C+1}]$; precision requirement $\varepsilon$; selection rate $n/k$
Process:
  Repeat
    Randomly select the same number of samples from each minority class to form the contemporary test set $Z_l = [\mathbf{Z}_2; \mathbf{Z}_3; \ldots; \mathbf{Z}_{C+1}]$;
    Use the T-threshold K-means algorithm to cluster $\mathbf{X}_1$ and update the training set to $W_l' = [U_1; U_2; \ldots; U_k; \mathbf{X}_2; \ldots; \mathbf{X}_{C+k}]$;
    Train an SVM-tree $G$ on $W_l'$, then test $G$ with $Z_l$ to obtain the precision $\varepsilon_0$;
    If $\varepsilon_0 \geq \varepsilon$: Break
    Else
      For $i = 2, 3, \ldots, C+1$ do
        Let $T_i = \varnothing$
        For $r = 1, 2, \ldots, k$ do
          Train an SVM-tree $t_{ri}$ on $Q_{ri} = [U_r; \mathbf{X}_i]$;
          $T_i = T_i \cup \{t_{ri}\}$
        End for
        Test the SVM-forest $T_i = [t_{1i}, t_{2i}, \ldots, t_{ki}]$ with $\mathbf{Z}_i$ to obtain the precisions $P_i = [p_{1i}, p_{2i}, \ldots, p_{ki}]$;
        Select the $n$ lowest precisions and re-form their corresponding sub-classes into a new majority set $\mathbf{X}_1^i$;
      End for
      Take the union of $[\mathbf{X}_1^2, \mathbf{X}_1^3, \ldots, \mathbf{X}_1^{C+1}]$ to obtain the new majority training set $\mathbf{X}_1'$.
Output: SVM-tree $G$
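One round of the sensitive data selection in Table 3 can be sketched as follows (illustrative, with assumed names, not the authors' code). For one minority class, a classifier is trained for each pairing of a majority sub-class with that minority class (with only two classes, the SVM-tree of Table 3 reduces to a single SVM), each classifier is scored on the contemporary minority test samples, and the sub-classes with the lowest scores, i.e. those hardest to separate from the minority, are retained.

```python
import numpy as np
from sklearn.svm import SVC

def select_sensitive_subclasses(majority_subclasses, X_minor, Z_minor, n_keep):
    """majority_subclasses: list of arrays U_1..U_k; X_minor: training samples of one minority
    class; Z_minor: its contemporary test samples; n_keep/k is the selection rate n/k."""
    scores = []
    for U_r in majority_subclasses:
        X = np.vstack([U_r, X_minor])
        y = np.hstack([np.zeros(len(U_r)), np.ones(len(X_minor))])
        clf = SVC(kernel="rbf", gamma="scale").fit(X, y)       # classifier t_ri on Q_ri = [U_r; X_i]
        scores.append((clf.predict(Z_minor) == 1).mean())      # p_ri on the minority test samples
    keep = np.argsort(scores)[:n_keep]                         # lowest precision = least separable
    return [majority_subclasses[r] for r in keep]              # re-formed majority set X_1^i
```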
Figure 3: SVM-forest method, (a) a binary classification problem with a large degree of imbalance; (b) the majority has been clustered into 10 sub-classes; (c) select 5 sub-classes from the sub-classes in (b) based on their separability from the minority; (d) re-formulate the 5 sub-classes into a new majority; (e) repeat the steps in (a)-(d).
5. Case study
In this section, the Tennessee Eastman (TE) process is used to evaluate the developed SVM-tree and SVM-forest algorithms for imbalanced fault classification. This process was first introduced by Downs and Vogel (Downs, J. J., & Vogel, E. F. 1993; Ge, 2018; Zhu et. al., 2018) and has been widely used for testing and evaluating various process monitoring algorithms and control strategies in the past years. As shown in Figure 4, the TE process consists of five operation units: a two-phase reactor, a partial condenser, a recycle compressor, a product stripper and a separator. There are four gaseous reactants A, C, D and E, two liquid products G and H, and one by-product F. Data for 21 faults are available for simulation in this process, detailed descriptions of which are listed in Table 4.
Figure 4: Flow chart of the Tennessee Eastman process

Table 4: Faults in the TE process
Fault | Description | Type
1 | A/C feed ratio, B composition constant (stream 4) | Step
2 | B composition, A/C ratio constant (stream 4) | Step
3 | D feed temperature (stream 2) | Step
4 | Reactor cooling water inlet temperature | Step
5 | Condenser cooling water inlet temperature | Step
6 | A feed loss (stream 1) | Step
7 | C header pressure loss - reduced availability (stream 4) | Step
8 | A, B, C feed composition (stream 4) | Random variation
9 | D feed temperature (stream 2) | Random variation
10 | C feed temperature (stream 4) | Random variation
11 | Reactor cooling water inlet temperature | Random variation
12 | Condenser cooling water inlet temperature | Random variation
13 | Reaction kinetics | Slow drift
14 | Reactor cooling water valve | Sticking
15 | Condenser cooling water valve | Sticking
16 | Unknown | Unknown
17 | Unknown | Unknown
18 | Unknown | Unknown
19 | Unknown | Unknown
20 | Unknown | Unknown
21 | Valve position constant (stream 4) | Constant
To verify the effectiveness of the K-means SVM-tree method and the SVM-forest sensitive data selection method for imbalanced fault classification, the normal operating mode, fault 7 and fault 8 are selected in this case study. Compared to the other faults in this process, fault 7 and fault 8 have relatively low separability from the normal condition and are thus difficult to classify. The imbalanced training set consists of 20,000 labeled samples of the normal operating condition (the majority) and 50 labeled samples each of fault 7 and fault 8 (the minority). The test set consists of 310 samples, in which samples 1~100 are normal, 101~180 are fault 7, and 181~310 are fault 8. The classification results are shown in Figure 5, in which the x axis represents the sample index and the y axis represents the labels given by the classifiers; the classification precision for every class is also shown in the figures. Figure 5(a) shows the result of the "one-to-one" SVM method: because of the high degree of data imbalance, the basic SVM cannot obtain a satisfactory hyperplane to classify the minority classes. Figure 5(b) shows the result of the under-sampling-assemble-SVM classifier. In its training procedure, 500 samples of the normal condition are randomly selected each time and combined with the minority samples to train an SVM classifier; after 20 rounds of training, the 20 sub-classifiers are assembled to give the result by a voting strategy. This method can handle the influence of data imbalance to a certain degree, but the precision on the minority is still not satisfactory, and the classification accuracy of the majority is sacrificed. Figure 5(c) shows the result of SMOTE-biased-SVM. In this experiment, 500 fault samples are added to the training set for each minority class by the SMOTE method, and then the "one-to-one" SVM model is trained. It works better than the under-sampling method, but the precision for the minority is still quite low. Figure 5(d) shows the result of the K-means SVM-tree method, in which the majority is clustered into 100 sub-classes by T-threshold K-means and an SVM-tree is then built for fault classification on the test dataset. With this method, the degree of imbalance is eliminated without adding or discarding samples. What is more, the sub-classes of the majority help the minority describe its boundary, which contributes greatly to the precision for the minority. Compared to the under-sampling method, the SVM-tree algorithm avoids sacrificing the precision for the majority; at the same time, it classifies the minority more accurately than the over-sampling method. Figure 5(e) shows the result of the SVM-forest method after 3 iterations. In this experiment, the precision requirement is set as [100%, 78%, 75%], which is the precision obtained by SVM on regular (balanced) data, and the selection rate is set to 80/100. The results show that this method improves the precision for all classes by 5%~10% on the basis of the SVM-tree.
Figure 5: Test results of the five methods: (a) the "one-to-one" SVM cannot classify the samples of fault 7; (b) under-sampling-assemble-SVM triggers many classification errors for the majority; (c) SMOTE-biased-SVM works badly for the samples of fault 8; (d) SVM-tree performs better than the previous three methods for all classes; (e) the SVM-forest method further improves the result of SVM-tree.
Table 5: Confusion matrices for the five methods

Method | Reality | Predicted Normal | Predicted Fault 7 | Predicted Fault 8 | Recall | Precision
One-to-one SVM | Normal | 89 | 11 | 0 | 89.0% | 34.9%
One-to-one SVM | Fault 7 | 80 | 0 | 0 | 0.0% | 0.0%
One-to-one SVM | Fault 8 | 86 | 0 | 44 | 33.8% | 100.0%
Under-sampling-assemble-SVM | Normal | 82 | 0 | 18 | 82.0% | 49.7%
Under-sampling-assemble-SVM | Fault 7 | 27 | 53 | 0 | 66.2% | 93.0%
Under-sampling-assemble-SVM | Fault 8 | 56 | 4 | 70 | 53.8% | 79.5%
SMOTE-biased-SVM | Normal | 100 | 0 | 0 | 100.0% | 53.8%
SMOTE-biased-SVM | Fault 7 | 28 | 52 | 0 | 65.0% | 100.0%
SMOTE-biased-SVM | Fault 8 | 58 | 0 | 72 | 55.4% | 100.0%
SVM-tree | Normal | 95 | 0 | 5 | 95.0% | 58.3%
SVM-tree | Fault 7 | 24 | 56 | 0 | 70.0% | 93.3%
SVM-tree | Fault 8 | 44 | 4 | 82 | 63.1% | 94.3%
SVM-forest | Normal | 100 | 0 | 0 | 100.0% | 66.7%
SVM-forest | Fault 7 | 20 | 60 | 0 | 75.0% | 95.2%
SVM-forest | Fault 8 | 30 | 3 | 97 | 75.4% | 100.0%
Besides, many criteria have been proposed to evaluate a classifier, such as the F1-means and G-means (Joshi, M. V. 2002) and the method developed in DB-KIT (Jiang, Yuchen, and S. Yin. 2018). In this paper we choose the former to give a more intuitive comparison. The F1-means is defined as
$$F_1 = \frac{2PR}{P + R}$$
and the G-means is defined as
$$G = \sqrt{PR},$$
where $P$ refers to the precision and $R$ refers to the recall. Figure 6 shows both the F1-means and the G-means for the five methods. The magenta line represents the result of the SVM-forest and the black line the result of the SVM-tree. It is obvious that the two methods proposed in this paper work better than the other methods.
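For reference, the per-class F1 and G-means reported in Figure 6 can be reproduced from a confusion matrix such as Table 5; the short sketch below (not from the paper) does this for the SVM-forest rows.

```python
import numpy as np

# Rows = true class (Normal, Fault 7, Fault 8), columns = predicted class (SVM-forest, Table 5).
cm = np.array([[100,  0,  0],
               [ 20, 60,  0],
               [ 30,  3, 97]])

recall = np.diag(cm) / cm.sum(axis=1)                       # R: per-class recall
precision = np.diag(cm) / np.maximum(cm.sum(axis=0), 1)     # P: per-class precision (guard /0)
f1 = 2 * precision * recall / (precision + recall)          # F1 = 2PR / (P + R)
g_mean = np.sqrt(precision * recall)                        # G = sqrt(PR)
print(np.round(f1, 3), np.round(g_mean, 3))
```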
Figure 6: (a) F1-means and (b) G-means of the five methods (one-to-one SVM, under-sampling-assemble-SVM, SMOTE-biased-SVM, SVM-tree, SVM-forest) for the Normal, Fault 7 and Fault 8 classes.
6. Conclusion
In this paper, a K-means based SVM-tree classification method was proposed to deal with the imbalanced fault classification problem in industrial processes. K-means clustering is used to reduce the degree of imbalance by dividing the majority into sub-classes without changing the original training set. The sub-classes of the majority help the minority describe its boundaries, which is difficult for most existing imbalanced-learning methods. In addition, an SVM-forest sensitive data selection method was further proposed to deal with cases with a larger degree of imbalance by selecting the part of the majority data that is most relevant to the classification performance. Detailed comparative studies between the proposed methods and conventional methods were carried out on the Tennessee Eastman (TE) benchmark process. The results demonstrated that both the SVM-tree and SVM-forest methods are effective and achieve better performance. In fact, the SVM-tree and SVM-forest algorithms were proposed for nonlinear data under a "clustering based classification framework"; by combining the K-means and SVM methods, these algorithms are particularly suitable for data with high nonlinearity among clusters. However, these methods may incur a large computational burden on large datasets due to the limitations of both K-means and SVM. Therefore, how to handle the big data problem under the clustering framework is a promising issue; a possible way is to extend the proposed method to a distributed parallel form (Zhu et. al, 2017; Yao and Ge, 2018, 2019). In addition, suitable criteria for determining the number of sub-classes for the SVM-tree and the selection rate for the SVM-forest are still open questions, which also need further investigation in future work.
Acknowledgement This work was supported in part by the National Natural Science Foundation of China (61722310, 61673337), the Natural Science Foundation of Zhejiang Province (LR18F030001), and the Fundamental Research Funds for the Central Universities 2018XZZX002-09.
References
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20, 273-297.
Downs, J. J., & Vogel, E. F. (1993). A plant-wide industrial process control problem. Computers & Chemical Engineering, 17(3), 245-255.
Evchina, Y., Puttonen, J., Dvoryanchikova, A., & Lastra, J. L. M. (2015). Context-aware knowledge-based middleware for selective information delivery in data-intensive monitoring systems. Engineering Applications of Artificial Intelligence, 43, 111-126.
Feital, T., Kruger, U., Dutra, J., Pinto, J., & Lima, E. (2013). Modeling and performance monitoring of multivariate multimodal processes. AIChE Journal, 59, 1557-1569.
Gao, T., Yin, S., Gao, H., Yang, X., Qiu, J., & Kaynak, O. (2018). A locally weighted project regression approach-aided nonlinear constrained tracking control. IEEE Transactions on Neural Networks and Learning Systems, 29, 5870-5879.
Ge, Z., Song, Z., & Gao, F. (2013). Review of recent research on data-based process monitoring. Industrial & Engineering Chemistry Research, 52, 3543-3562.
Ge, Z., Song, Z., Ding, S., & Huang, B. (2017). Data mining and analytics in the process industry: the role of machine learning. IEEE Access, 5, 20590-20616.
Ge, Z. (2017). Review on data-driven modeling and monitoring for plant-wide industrial processes. Chemometrics and Intelligent Laboratory Systems, 171, 16-25.
Ge, Z. (2018). Process data analytics via probabilistic latent variable models: A tutorial review. Industrial & Engineering Chemistry Research, 57, 12646-12661.
Ge, Z. (2018). Distributed predictive modeling framework for prediction and diagnosis of key performance index in plant-wide processes. Journal of Process Control, 65, 107-117.
Ge, Z., & Liu, Y. (2019). Analytic hierarchy process based fuzzy decision fusion system for model prioritization and process monitoring application. IEEE Transactions on Industrial Informatics, 15, 357-365.
He, H., & Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9), 1263-1284.
Jane, A. P., & Pund, M. A. (2012). Recognition of similar shaped handwritten Marathi characters using artificial neural network. Science, 260(5107), 511-5.
Jiang, Y., & Yin, S. (2018). Recent advances in key-performance-indicator oriented prognosis and diagnosis with a MATLAB toolbox: DB-KIT. IEEE Transactions on Industrial Informatics. DOI: 10.1109/TII.2018.2875067.
Jing, C., & Hou, J. (2015). SVM and PCA based fault classification approaches for complicated industrial process. Neurocomputing, 167, 636-642.
Joshi, M. V. (2002). On evaluating performance of classifiers for rare classes. In Proceedings of the IEEE International Conference on Data Mining (ICDM 2002) (pp. 641-644). IEEE.
Lin, Z. Y., Hao, Z. F., Yang, X. W., & Liu, X. L. (2009). Several SVM ensemble methods integrated with under-sampling for imbalanced data learning, 5678, 536-544.
Liu, Y., & Ge, Z. (2018). Weighted random forests for fault classification in industrial processes with hierarchical clustering model selection. Journal of Process Control, 64, 62-70.
Namdari, M., & Jazayeri-Rad, H. (2014). Incipient fault diagnosis using support vector machines based on monitoring continuous decision functions. Engineering Applications of Artificial Intelligence, 28, 22-35.
Rasmussen, C. E., & Williams, C. K. I. (2005). Gaussian Processes for Machine Learning. MIT Press.
Schuldt, C., Laptev, I., & Caputo, B. (2004). Recognizing human actions: a local SVM approach. In Proceedings of the International Conference on Pattern Recognition (Vol. 3, pp. 32-36). IEEE.
Smola, A., Bartlett, P., Schölkopf, B., & Schuurmans, D. (2000). Gaussian processes and SVM: mean field and leave-one-out. In Advances in Large-Margin Classifiers. MIT Press.
Wang, H. Y. (2008). Combination approach of SMOTE and biased-SVM for imbalanced datasets. In Proceedings of the IEEE International Joint Conference on Neural Networks (pp. 228-231). IEEE.
Wang, T., Wu, H., Ni, M., Dong, J., Benbouzid, M., & Hu, X. (2016). An adaptive confidence limit for periodic non-steady conditions fault detection. Mechanical Systems and Signal Processing, 72-73, 328-345.
Weiss, G. M. (1995). Learning with rare cases and small disjuncts. In Proceedings of the Twelfth International Conference on Machine Learning (pp. 558-565). Morgan Kaufmann.
Weiss, G. M., & Hirsh, H. (2000). A quantitative study of small disjuncts. In Proceedings of the Seventeenth National Conference on Artificial Intelligence (pp. 665-670). AAAI Press.
Xie, X., Sun, W., & Cheung, K. (2016). An advanced PLS approach for key performance indicator-related prediction and diagnosis in case of outliers. IEEE Transactions on Industrial Electronics, 63, 2587-2594.
Yao, L., & Ge, Z. (2018). Big data quality prediction in the process industry: a distributed parallel modeling framework. Journal of Process Control, 68, 1-13.
Yao, L., & Ge, Z. (2019). Scalable semi-supervised GMM for big data quality prediction in multimode processes. IEEE Transactions on Industrial Electronics, 66, 3681-3692.
Yin, S., Ding, S. X., Xie, X., & Luo, H. (2014). A review on basic data-driven approaches for industrial process monitoring. IEEE Transactions on Industrial Electronics, 61(11), 6418-6428.
Yin, S., Xie, X., & Sun, W. (2016). A nonlinear process monitoring approach with locally weighted learning of available data. IEEE Transactions on Industrial Electronics, 64, 1507-1516.
Zhu, J., Ge, Z., & Song, Z. (2017). Distributed parallel PCA for modeling and monitoring of large-scale plant-wide processes with big data. IEEE Transactions on Industrial Informatics, 13, 1877-1885.
Zhu, J., Ge, Z., Song, Z., & Gao, F. (2018). Review and big data perspectives on robust data mining approaches for industrial process modeling with outliers and missing data. Annual Reviews in Control, 46, 107-133.
Zhu, J., Ge, Z., Song, Z., Zhou, L., & Chen, G. (2018). Large-scale plant-wide process modeling and hierarchical monitoring: a distributed Bayesian network approach. Journal of Process Control, 65, 91-106.