Information Fusion 51 (2019) 244–258

Full Length Article
Outlier detection based on a dynamic ensemble model: Applied to process monitoring

Biao Wang, Zhizhong Mao*

Department of Control Theory and Control Engineering, Northeastern University, 110819, Shenyang, China


Keywords: Outlier detection; Process monitoring; Ensemble learning; Dynamic classifier selection; One-class classification

Abstract

This paper focuses on outlier detection and its application to process monitoring. The main contribution is a dynamic ensemble detection model in which one-class classifiers are used as base learners. Developing a dynamic ensemble model for one-class classification is challenging because labeled training samples are absent. To this end, we propose a procedure that generates pseudo outliers, prior to which we transform the outputs of all base classifiers into probabilistic form. We then use a probabilistic model to evaluate the competence of all base classifiers. The Friedman test and the Nemenyi test are used together to construct a switching mechanism, which determines whether a single classifier should be nominated to make the decision or a fusion method should be applied instead. Extensive experiments on 20 data sets and an industrial application verify the effectiveness of the proposed method.

1. Introduction

Modern industry has embraced the dawn of a data-based epoch, owing to the difficulty of deriving physical models for complicated processes [1]. With the development of computer technology, industrial process data are rapidly collected and stored, and these historical databases are invaluable for developing process monitoring models. Process monitoring methods based on multivariate statistics, such as PCA and PLS, are the most popular ones and have been applied successfully in many industrial applications [2]. Recently, methods developed from the notion of one-class classification (OCC) have also been proposed for process monitoring [3–6]. Because one-class classifiers can be trained when counterexamples are unavailable, they have drawn increasing attention in the field of process monitoring.

1.1. Motivations

On the one hand, an important step in applying data-based techniques is to obtain the portion of data representing normal operating conditions [7]. Historical databases usually contain samples from normal conditions as well as from faulty conditions, various operating modes, and startup and shutdown periods. In this paper, we define process data collected under normal operating conditions as normal data (or target data), and data originating from all other conditions, together with abnormal objects caused by sensor malfunctions and data transmission errors, as



outliers. Outliers complicate the extraction of normal data from databases, and as a result the corresponding process monitoring models deteriorate. Singling out these outliers is therefore significant in this situation. On the other hand, methods based on OCC also suffer from outliers in training sets. Although some techniques have been applied to enhance robustness to outliers, their negative influence still exists. Here we demonstrate this problem with a toy example of PCA process monitoring. A data set containing 400 six-dimensional samples (6 process variables) is used as the training set, which is deemed representative of the normal process. The T2 and SPE statistics are used to monitor subsequent measurements. Intuitively, the T2 statistic measures the variation within the PCA model triggered by new data, and the SPE statistic measures the variation of new data that has not been captured by the PCA model. We then use the well-trained PCA model to monitor a test set that contains a fault. The result is shown in Fig. 1(a), from which the fault at t = 501 can be easily identified. For comparison, varying fractions of outliers are added into the original training set, and the monitoring results of these ill-trained PCA models are shown in Fig. 1(b)–(d). It can be clearly observed that outliers in the training sets negatively affect the PCA models, and this influence worsens as the proportion of outliers increases.

1.2. Contributions

In this paper, we propose a novel dynamic ensemble outlier detection method and apply it to process monitoring. The main challenge of

Corresponding author. E-mail address: [email protected] (Z. Mao).

https://doi.org/10.1016/j.inffus.2019.02.006 Received 4 May 2018; Received in revised form 15 January 2019; Accepted 19 February 2019 Available online 20 February 2019 1566-2535/© 2019 Elsevier B.V. All rights reserved.


Fig. 1. An example showing effect of outliers in training sets with varying fractions: (a) 0%; (b) 5%; (c) 10%; (d) 50%.

detecting outliers in industrial data stems from its natural characteristics, such as the absence of data labels. Learning an outlier detector in such a situation is referred to as the missing label problem, whose solution can only be derived from data consisting of a mixture of normal samples and outliers whose labels are missing [8]. Here we exploit the strength of one-class classifiers in addressing this problem. In contrast to traditional multi-class classifiers, constructing a dynamic ensemble outlier detector from one-class classifiers is subject to several restrictions due to the absence of labeled training data. To this end, we use a heuristic mapping to transform the outputs of one-class classifiers into continuous form. Then we propose a procedure that can generate pseudo outliers

from the original training data. Note that most one-class classifiers have a user-specified parameter, namely the fraction of rejected target samples. With this parameter and all well-trained classifiers, we nominate the corresponding fraction of training samples with the highest outlier probability as pseudo outliers. This procedure is reasonable and appropriate for industrial processes, since collected measurements usually contain outliers. We then divide the processed data into several clusters. At the test phase, we first determine the region of competence, with which we can compute the competence of each base learner. This region of competence is determined by finding the cluster to which the test point belongs, and the classifier competence is computed by a


probabilistic model. We then propose a switching mechanism based on two statistical tests, the Friedman test and the Nemenyi test, which determines whether one classifier should be nominated to make the decision or a fusion method should be applied instead. If more than one base classifier is singled out by this switching mechanism, we use a fusion method called the decision template to fuse them. The remainder of this paper is organized as follows. Section 2 reviews related work on outlier detection for industrial data and ensembles of one-class classifiers. In Section 3, we briefly present some basic concepts of one-class classification, followed by the proposed outlier detection approach in Section 4. Experiments on benchmark data sets are carried out in Section 5, and an industrial case study is presented in Section 6. Finally, conclusions are drawn in Section 7.

2. Related work

In this section, the literature on outlier detection methods for industrial processes and on ensemble models dedicated to one-class classification is reviewed.

2.1. Outlier detection for industrial data

At an early stage, outliers were usually picked out through visual inspection of data charts based on engineers' experience. Such a method is subjective and becomes inappropriate for the large and complex systems of the modern process industry. A popular outlier detection approach called the "3σ edit rule" was then developed, based on the idea that a data sequence is approximately normally distributed [9]. Unfortunately, this procedure usually fails in practice because the presence of outliers tends to inflate the variance estimate, causing too few outliers to be detected. Accordingly, a robust method called the Hampel identifier, which replaces the outlier-sensitive mean and standard deviation estimates with the outlier-resistant median and the median absolute deviation from the median, respectively, was developed in [10,11]. In the modern process industry, nevertheless, enormous numbers of process variables can be measured, and correlated variables pose a great challenge for univariate methods. Multivariate outlier detection methods are hence more appropriate in such a situation. In [12], techniques based on the Mahalanobis distance (MD) were proposed and applied in different fields of chemometrics. To mitigate the unreliability of MD-related outlier detection techniques, two resampling-based methods called resampling by half-mean (RHM) and smallest half volume (SHV) were developed in [13]. A method called closest distance to center (CDC) was then proposed in [7] as an alternative in order to alleviate the heavy computational complexity of RHM and SHV. To improve identification by subspace system identification algorithms for errors-in-variables state space models, the minimum covariance determinant (MCD) estimator was employed to detect outliers in [14]. A hybrid approach using three multivariate outlier detection algorithms, i.e. MD, RHM, and SHV, was proposed in [15] to improve the identification of a blast furnace ironmaking process. In addition, several prediction-based residual methods have also been developed. Considering the characteristics of time series in process control systems, an outlier detection algorithm that adopts an improved RBF neural network (NN) and a hidden Markov model (HMM) was proposed in [16]; the NN was used to model the controlled object and the HMM to analyze residuals between predictions and true measurements. Another residual-based method was proposed in [17] to improve fault detection performance: a neural network was used to estimate the actual system states, and deviations from the true measurements were used to identify outliers through the Hampel identifier. In [8] the outlier detection problem was reduced to a matrix decomposition problem with low-rank and sparse matrices, which was further converted into a semidefinite programming problem. By using the recovered clean data from this method, the performance of system identification can be improved.

According to their forms of implementation, we simply categorize these outlier detection methods into two groups. One is named "internal detection (ID)", which indicates detecting outliers inside the given raw data [5–11]. The other is referred to as "external detection (ED)", which implies that an outlier detection model trained on the raw data can be used for external unseen data [16,17]. Note that both types of methods are important for industrial processes. ID methods are usually used for processing historical data in an offline manner; the performance of process monitoring or control models constructed on the recovered "clean" data sets can be much better than that based on the raw data. ED methods can be trained either online or offline, but they are usually used for online outlier detection, and how they process the complicated raw data determines their online detection performance. Our proposed outlier detection method can be regarded as an ED method, since we construct an ensemble detection model from the raw data. Furthermore, it can also be deemed an ID method, since outliers in the raw data can be singled out with the procedure for generating pseudo outliers (Section 4.3). Therefore, in contrast to most outlier detection methods, our method can be applied both online and offline.

2.2. Ensembles of one-class classifiers

We categorize methods tailored for one-class classifier ensembles into three groups according to their combination techniques. We refer to the first group as "Static Ensemble (SE)", in which base classifiers are aggregated by fixed or well-trained functions. The performance of the averaging and multiplying rules is discussed in [18]. To cope with all possible situations with missing data, one-class classifiers trained on one-dimensional problems, n-dimensional problems, or features are combined in dissimilarity representations in [19]. Bagging and Random Subspace (RS) are combined to generate diverse training subsets for an improved SVDD in [20] to detect outliers in process control systems. SVDD-based classifiers are combined for image database retrieval to improve on the performance of a single one-class classifier in [21]. Clustering algorithms such as k-means and fuzzy c-means are also proposed to generate diverse training subsets in [22]. The second group is referred to as "Pruned Ensemble (PE)", where only a subset of elite classifiers is selected from the pool to construct a sub-ensemble model. In contrast to multi-class classification, to which many diversity measures can be applied [23], designing diversity measures for one-class classifiers is nontrivial since no counterexamples are available. With a combined criterion consisting of consistency and pairwise diversity, Krawczyk [24] employs a firefly algorithm to search for the best possible subset of classifiers. Similarly, with an exponential consistency measure and a non-pairwise diversity measure as the criterion, Parhizkar and Abadi [25] use a novel binary artificial bee colony algorithm to prune the initial ensemble so that a near-optimal sub-ensemble can be found. According to their empirical results, PE methods usually outperform SE methods. The third group is referred to as "Dynamic Ensemble (DE)" owing to its dynamic selection mechanisms. DE techniques rely on the assumption that different classifiers are competent ("experts") in different local regions of the feature space, so the key issue in DE is how to estimate the competence of a base classifier for the classification of a new test sample [26]. Due to the difficulty of measuring competence, studies concerning DE tailored to one-class classifiers are limited. In [27], three competence measures for one-class classifiers are developed, and one of the base classifiers is delegated dynamically to the decision area where it is the most competent. Since the authors assume that all samples at their disposal are from the target concept, the estimated competence may be biased in the application of interest here, in which the training set contains outliers.


3. One-class classification

One-class classification differs substantially from the traditional classification problem: it assumes that counterexamples are unavailable during the training phase. Theoretically, a good one-class classifier should have both a small false negative fraction and a small false positive fraction. Many techniques have been developed for one-class classifiers; more details can be found in [28]. Here we use a toy example to show the differences between one-class classifiers. Fig. 2 shows six one-class classifiers trained on a 'banana-shape' data set. In this figure, '∗' indicates a target data point and '+' represents an outlier. The error on the target set is set to 0.1 for all methods. The results show that each method has a particular decision boundary. Note that this is only a toy example using a two-dimensional data set; for higher-dimensional data sets, the differences may be more distinct. On the other hand, this example also implies that it is hard to select an appropriate classifier when no prior knowledge about the data is available. Even for the same classifier, we hardly have a proper way of determining suitable parameters, since we have no labeled data that can be used for optimization by algorithms such as cross validation. As a result, utilizing more models in an ensemble may significantly improve the robustness of the constructed system and prevent us from choosing a weaker model [29]. This is also the essential motivation for ensemble pruning and dynamic classifier selection.
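As a rough illustration of this point, the following sketch (our own construction, not the paper's dd_tools setup) fits several heterogeneous one-class models from scikit-learn on the same two-dimensional data with a comparable target rejection rate and measures how often their decisions disagree:

```python
# Sketch: different one-class models disagree on the same data
# (scikit-learn models used here as stand-ins for the dd_tools classifiers).
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.covariance import EllipticEnvelope

rng = np.random.default_rng(0)
t = rng.uniform(0, np.pi, 400)
X = np.c_[np.cos(t), np.sin(t)] + 0.1 * rng.standard_normal((400, 2))  # banana-like target data
X_test = rng.uniform(-1.5, 1.5, size=(500, 2))                         # points spread over the plane

models = {
    "ocsvm": OneClassSVM(nu=0.1, gamma=2.0),                # boundary-based
    "iforest": IsolationForest(contamination=0.1, random_state=0),
    "lof": LocalOutlierFactor(contamination=0.1, novelty=True),
    "gauss": EllipticEnvelope(contamination=0.1),           # density-based
}
preds = {name: m.fit(X).predict(X_test) for name, m in models.items()}  # +1 target / -1 outlier

names = list(preds)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        d = np.mean(preds[names[i]] != preds[names[j]])
        print(f"{names[i]} vs {names[j]}: disagreement {d:.1%}")
```

The disagreement rates make concrete why committing to a single one-class model is risky when no labels are available to validate the choice.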

4. Methodology

In this section, we present the methodology of the proposed method. First, an outline of the method is provided, followed by the detailed techniques.

4.1. Outline

Assume we are given a training set S consisting of a target set ST and an outlier set SO with missing labels (the labels of samples in ST and SO are unknown). First, a pool of one-class classifiers Ψ = {ψ1, …, ψL} (either homogeneous or heterogeneous) is trained with S. As the outputs of most classifiers are binary, we use a heuristic mapping to transform the outputs of all classifiers with respect to all training points into continuous form; this output form is analogous to a posteriori probability. Then we use these continuous outputs, along with the rejection parameter of the classifiers in Ψ, to generate a fraction of pseudo outliers SPO. The data structure of S thereby becomes more legible, i.e. S = {SPO, S\SPO}. For a new test point, we first determine its region of competence using a clustering algorithm (the clustering itself is carried out at the training phase). We then calculate the competence of all classifiers with respect to this test point, and a switching mechanism decides whether one classifier should be nominated to make the decision or a fusion method should be applied instead. These steps are illustrated in Fig. 3.

4.2. Output transformation

Of all one-class classifiers, only those based on density estimation produce outputs in probabilistic form. For the other types of one-class classifiers, outputs take the form of a distance or a reconstruction error. Such output forms are difficult to interpret, especially in ensemble models, and combining different types of one-class classifiers may also lead to unnecessary errors. Accordingly, the first issue is to transform the outputs of all classifiers into a unified form. In this paper, we choose the probabilistic form used by density-based classifiers; it is easy to understand because it indicates the probability that a point is an outlier. Assume ω_T and ω_O denote the normal (target) class and the outlier class, respectively. According to Bayes' theorem, the posterior probability p(ω_T | x) of a data point x can be derived from p(x | ω_T):

p(ω_T | x) = p(x | ω_T) p(ω_T) / p(x) = p(x | ω_T) p(ω_T) / [p(x | ω_T) p(ω_T) + p(x | ω_O) p(ω_O)]    (1)

Actually, this perfect solution can hardly be obtained in an OCC problem, since the outlier distribution p(x | ω_O) is unknown, and some assumptions are indispensable. As claimed in [30], p(x | ω_T) can be used instead of p(ω_T | x) provided that p(x | ω_O) is independent of x. Accordingly, our goal is to transform binary outputs into continuous outputs, and some heuristic mapping has to be applied. Let us take SVDD (a boundary-based one-class classifier) as an example. The resemblance ρ(x, ω_T) between sample x and the target class ω_T should be transformed with a function f(·) fulfilling the following three criteria:

(1) 0 < f(ρ(x, ω_T)) < 1;
(2) f(·) is monotonic;
(3) f(ρ(x, ω_T)) = 0.5 for samples on the boundary, f(ρ(x, ω_T)) < 0.5 for samples outside the boundary, and f(ρ(x, ω_T)) > 0.5 for samples inside the boundary.

An exponential function of the following form is then appropriate:

p̂(x | ω_T) = exp(−C · ρ(x, ω_T)) = exp(−(C′/R) ‖x − o‖²)    (2)

where o and R represent the center and radius of the SVDD hypersphere, respectively, and the parameter C′ = ln 2 can be calculated via criterion (3).
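A minimal sketch of such a mapping is given below. It assumes a boundary-based model exposing a signed distance to its boundary (scikit-learn's OneClassSVM decision_function, positive inside and negative outside, is used as a stand-in for the dd_tools SVDD); the logistic form below is our own substitute that fulfills criteria (1)–(3), whereas the paper's Eq. (2) uses an exponential of the distance to the hypersphere center.

```python
# Sketch: map a boundary-based one-class output to a pseudo-probability
# satisfying criteria (1)-(3): 0.5 exactly on the boundary, >0.5 inside.
import numpy as np
from sklearn.svm import OneClassSVM

def target_probability(model, X, scale=1.0):
    """Monotonic mapping of the signed boundary distance to (0, 1).

    `scale` plays a role analogous to C in Eq. (2); its value is a user
    choice here, not prescribed by the paper.
    """
    d = model.decision_function(X)          # >0 inside, <0 outside, ~0 on boundary
    p_target = 1.0 / (1.0 + np.exp(-scale * d))
    return np.c_[p_target, 1.0 - p_target]  # columns: [d_T, d_O]

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 6))
svdd_like = OneClassSVM(nu=0.1, gamma="scale").fit(X)
P = target_probability(svdd_like, X)
print(P[:3])                                # each row sums to 1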

4.3. Pseudo outliers

Here we again take SVDD as the example. Without loss of generality, we set the parameter ν = 1/(nC) = 0.1 (the error on the target set) in order to reject approximately 10% of the target data points when training the model. Assume that all classifiers in the pool Ψ are well trained with the data set S and that the outputs of all classifiers have been transformed by the procedure in Section 4.2. We summarize these outputs in a matrix, usually referred to as the decision profile (DP) [31]:

DP(x) = [ d_{1,T}(x)  d_{1,O}(x) ; … ; d_{i,T}(x)  d_{i,O}(x) ; … ; d_{L,T}(x)  d_{L,O}(x) ]    (3)

where d_{i,T}(x) and d_{i,O}(x) represent the supports of classifier i for the target class and the outlier class, respectively. A column vector thus collects the supports of all classifiers for one class, and a row vector contains the supports of one classifier for the target and outlier classes. For each data point x_j ∈ S, j = 1, …, N, we can derive the corresponding decision profile DP(x_j). With this DP we define the following quantity, which we refer to as the difference between classes (DC):

DC(x) = Σ_{i=1}^{L} ( d_{i,O}(x) − d_{i,T}(x) )    (4)

After the DC values of all data points have been derived, we arrange them in ascending order [DC_1, DC_2, …, DC_N], DC_1 ≤ DC_2 ≤ ⋯ ≤ DC_N, and select the top 10%, i.e. the 0.1N data points with the largest DC values, to construct the pseudo outlier set SPO. Note that the DC value of a data point can be deemed the accumulated support difference between the outlier class and the target class over all classifiers, so it is reasonable to use it as a criterion for assigning pseudo outliers. Moreover, the number of pseudo outliers is consistent with the prior assumption about the target set. Suppose we are given a data set containing a certain fraction of outliers and we happen to know this value; then the pseudo outliers selected by our procedure should closely resemble the true outliers.
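A minimal sketch of the pseudo-outlier step follows; it assumes a support array P of shape (L, N, 2) with per-classifier columns [d_T, d_O], as produced by the output-transformation sketch above, and the helper name is ours.

```python
# Sketch: nominate the reject_frac most outlier-like training points
# as pseudo outliers using the DC criterion of Eqs. (3)-(4).
import numpy as np

def generate_pseudo_outliers(P, reject_frac=0.1):
    """P: array of shape (L, N, 2) with per-classifier supports [d_T, d_O]."""
    dc = (P[:, :, 1] - P[:, :, 0]).sum(axis=0)      # DC(x_j), Eq. (4)
    n_pseudo = int(np.ceil(reject_frac * P.shape[1]))
    pseudo_idx = np.argsort(dc)[-n_pseudo:]         # largest DC values
    labels = np.zeros(P.shape[1], dtype=int)        # 0 = target, 1 = pseudo outlier
    labels[pseudo_idx] = 1
    return pseudo_idx, labels

# Toy usage with random supports for L = 5 classifiers and N = 100 points.
rng = np.random.default_rng(2)
d_T = rng.uniform(size=(5, 100))
P = np.stack([d_T, 1.0 - d_T], axis=2)
idx, y_pseudo = generate_pseudo_outliers(P, 0.1)
print(len(idx), y_pseudo.sum())                     # 10 pseudo outliers
```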


Fig. 2. Decision boundaries of different OC classifiers: (a) SVDD; (b) Parzen density data description; (c) K-means data description; (d) K-nearest neighbor data description (k = 5); (e) Self-organizing map data description; (f) Minimum spanning tree data description.



Fig. 3. Different phases of our method, including classifier generation, post processing, classifier selection and fusion.

4.4. Dynamic classifier selection

Creating a monolithic classifier that covers all the variability inherent in most pattern recognition problems is somewhat unfeasible [32], so designing dynamic selection schemes for ensemble models has become increasingly popular. Like most dynamic classifier selection techniques for multi-class classification, we also need to define a region of competence, in which the competence of all classifiers can be evaluated by a specified criterion, be it local accuracy [33], a probabilistic model [34], or even complexity [26].

4.4.1. Region of competence

The region of competence with respect to one test point can be deemed a validation set used for computing each classifier's competence for classifying this point; these competences are then used by the selection mechanism. Generally, two types of methods can be used to determine such a region: one is based on the K-nearest neighborhood (KNN) rule, and the other uses a clustering algorithm [35]. Under the KNN rule, it is necessary to calculate the distances from each test point to all data points in order to find its K nearest neighbors, which is expensive for online applications. For the clustering-based method, only the cluster to which the test point belongs must be determined, provided that all data points have been divided into several clusters at the training phase. In this paper, we prefer the cluster-based method to define the region of competence because of its convenience at the test phase. Once the original data set has been processed by the pseudo-outlier procedure, k-means clustering is used to divide the training points into several clusters. As for deciding the number of clusters, we simply determine it according to the size of the corresponding training set, i.e. a larger data set has more clusters. We admit that this setting is simple and may be unreasonable, since the cluster number is influenced by many factors; this paper does not focus on this problem due to its complexity. In our view, moreover, the influence of the cluster number may be small, because we only aim to discover the data points close to the test point. This also means that we cannot determine the optimal number of data points in the region of competence. When classifying a test point, we first decide its cluster; all data points in that cluster then constitute the region of competence with respect to this test point.
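A minimal sketch of the cluster-based region of competence follows; the cluster-count heuristic tied to data-set size is our own placeholder, not a rule fixed by the paper.

```python
# Sketch: k-means clusters at training time; at test time the region of
# competence is simply the training points sharing the test point's cluster.
import numpy as np
from sklearn.cluster import KMeans

def fit_regions(S, clusters_per_500=5):
    # Heuristic: more clusters for larger training sets (placeholder rule).
    k = max(2, int(np.ceil(len(S) / 500)) * clusters_per_500)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(S)
    return km

def region_of_competence(km, S, x):
    cluster = km.predict(x.reshape(1, -1))[0]
    return np.where(km.labels_ == cluster)[0]       # indices into S

rng = np.random.default_rng(3)
S = rng.standard_normal((400, 6))
km = fit_regions(S)
idx = region_of_competence(km, S, S[0])
print(len(idx), "points in the region of competence")
```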

4.4.2. Calculation of competence

Once the region of competence of the test point has been determined, the competence of all classifiers can be calculated and used for selection. This paper proposes a probabilistic model to calculate the competence, the motivation of which is explained below. Note that although the structure of the raw data becomes more transparent thanks to the pseudo outliers, another problem remains, namely data imbalance. The fundamental issue with imbalanced learning is that imbalanced data can significantly compromise the performance of most standard learning algorithms: when presented with complex imbalanced data sets, these algorithms fail to properly represent the distributive characteristics of the data and consequently yield unfavorable accuracies across the classes [36]. This implies that we may obtain a biased result if accuracy-based or rank-based [32] measures are used to calculate the competence. In contrast, a probabilistic model is more suitable; moreover, the output transformation (Section 4.2) and pseudo-outlier (Section 4.3) procedures can be fully exploited by a probabilistic model. Naturally, a probabilistic competence measure of classifier ψ_i for the sample x can be represented by the probability of correct classification:

P_c(ψ_i | x) = Σ_{j=1}^{M} Pr{ x belongs to the j-th class ∧ ψ_i(x) = j }    (5)

where Pr(A) indicates the probability that an event A is true, and M = 2 for OCC corresponds to the target and outlier classes. Nonetheless, this probability equals either 0 or 1 unless at least one of the two events is random, which hinders the direct application of such a probabilistic model to calculating classifier competence. To this end, this paper uses an indirect method inspired by [37]. First, we construct a hypothetical classifier called the randomized reference classifier (RRC), which can replace the corresponding classifier when calculating its competence. For OCC, the RRC uses two random variables λ_T and λ_O to represent the support values for the target and outlier classes, respectively. In order to construct a classifier equivalent to ψ_i with the


RRC with respect to sample x, we restrict these two random variables to satisfy the following criteria:

(1) 0 ≤ λ_{T(O)}(x′) ≤ 1, ∀x′ ∈ CS_x;
(2) E(λ_T(x′)) = d_{i,T}(x′) and E(λ_O(x′)) = d_{i,O}(x′);
(3) λ_T(x′) + λ_O(x′) = 1.

where E is the expected value operator, CS_x is the competence set of x, and x′ denotes a sample in this set. Criteria (1) and (3) ensure the normalization properties of the class supports. Criterion (2) means that, with respect to sample x′, the expectations of the two variables equal the support values of the corresponding classifier; this restriction ensures that the RRC is equivalent to its corresponding classifier when calculating its competence. According to these three restrictions, we can then learn the distributions of the two variables. Distributions fulfilling the above three conditions are not unique, but the beta distribution seems reasonable and has also been used previously:

λ_j(x) ~ Beta(λ_j | a_j, b_j) = Γ(a_j + b_j) / (Γ(a_j) Γ(b_j)) · λ_j^{a_j − 1} (1 − λ_j)^{b_j − 1}    (6)

where j = T (O) indicates the random variable of the target (outlier) class. From condition (2) we have

E(λ_j(x′)) = a_j / (a_j + b_j) = d_{i,j}(x′)    (7)

a_j + b_j = M = 2    (8)

Thus, for each sample in the competence set, each RRC is associated with two beta distributions whose hyperparameters can be derived from (7) and (8). We then use these two random variables to calculate the classifier competence. Assuming without loss of generality that x′ belongs to the target class (x′ ∈ CS_x), the competence C(ψ|x′) of classifier ψ is equal to the probability that x′ is correctly classified by the RRC:

C(ψ|x′) = P(λ_T > λ_O) = ∫_0^1 Beta(λ_T | a_T, b_T) ( ∫_0^{λ_T} Beta(λ_O | a_O, b_O) dλ_O ) dλ_T    (9)
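A minimal numerical sketch of Eq. (9) follows, assuming the class supports come from the transformed decision profile of Section 4.2; the quadrature-based evaluation is our own implementation choice.

```python
# Sketch: competence of one classifier for one sample x' via Eq. (9),
# i.e. P(lambda_T > lambda_O) with Beta-distributed class supports.
import numpy as np
from scipy import stats
from scipy.integrate import quad

def rrc_competence(d_T, d_O, M=2):
    """d_T, d_O: transformed supports of the classifier for sample x'."""
    a_T, b_T = M * d_T, M - M * d_T          # hyperparameters from Eqs. (7)-(8)
    a_O, b_O = M * d_O, M - M * d_O
    integrand = lambda t: stats.beta.pdf(t, a_T, b_T) * stats.beta.cdf(t, a_O, b_O)
    value, _ = quad(integrand, 0.0, 1.0)
    return value                              # probability of a correct decision

# A confident, correct classifier on a target sample gets high competence.
print(round(rrc_competence(0.8, 0.2), 3))     # well above 0.5
print(round(rrc_competence(0.4, 0.6), 3))     # below 0.5
```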

Once the competence of classifier ψ with respect to all samples in the competence set has been derived, we should extend the competence encapsulated in the competence set to the entire feature space [38], because the classifier competence at one test point should reflect the cumulative influence of all sources in the feature space. The competence with respect to the competence set is

c(ψ|x) = Σ_{i=1}^{N} C(ψ|x′_i) K(x, x′_i),  x′_i ∈ CS_x    (10)

where K(x, x′_i) is a potential function that decreases as the distance between x and x′_i increases. It is reasonable to assign different weights to samples in the competence set, since they affect the calculation of competence to different degrees; here we use the distance to the test point to determine these weights. Usually, a Gaussian potential function with Euclidean distance, K(x, x′_i) = exp(−dist(x, x′_i)²), is used. The competence with respect to the entire feature space is then

c(ψ|x) = Σ_{i=1}^{N} μ_i C(ψ|x′_i) = Σ_{i=1}^{N} C(ψ|x′_i) exp(−dist(x, x′_i)²) / Σ_{i=1}^{N} exp(−dist(x, x′_i)²)    (11)

Therefore, the potential function model essentially assigns a weight to each sample in the competence set when averaging them.
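A minimal sketch of Eqs. (10)–(11), assuming per-sample competences such as those produced by the RRC sketch above:

```python
# Sketch: aggregate per-sample competences over the region of competence
# with Gaussian potential weights (Eq. (11)).
import numpy as np

def aggregate_competence(x, region_X, region_comp):
    """region_X: samples in CS_x; region_comp: C(psi | x'_i) for each of them."""
    d2 = np.sum((region_X - x) ** 2, axis=1)      # squared Euclidean distances
    w = np.exp(-d2)                               # Gaussian potential K(x, x'_i)
    return float(np.sum(w * region_comp) / np.sum(w))

rng = np.random.default_rng(4)
region_X = rng.standard_normal((30, 6))
region_comp = rng.uniform(0.4, 0.9, size=30)
x = rng.standard_normal(6)
print(aggregate_competence(x, region_X, region_comp))
```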

4.4.3. Switching mechanism based on statistical tests

Once the competences of all classifiers in the pool have been derived, we use a statistical test to check whether there exists a fraction of significantly better ones. The motivation for this step is to alleviate overfitting, which usually gives a deceptively low training error. The authors of [39] used an adjusted F-test proposed in [40] to statistically compare different classifiers, but admitted that it is an easy yet not very accurate way to check whether classifiers are significantly different. As concluded in [41], non-parametric tests are safer than parametric tests since they do not assume normal distributions or homogeneity of variance; moreover, they can be applied to a wide range of measures, such as classification accuracies, error ratios, and even model sizes and computation times. In this paper, accordingly, we employ a non-parametric test for multiple classifiers, the Friedman test, followed by a post-hoc test, the Nemenyi test, to check whether the estimated competences of the classifiers with respect to the samples in the competence set are significantly different. Note that here we treat each data point in the competence set as a "data set" when calculating the statistic. The Friedman test compares the average ranks of the classifiers; under the null hypothesis that all classifiers are equivalent, the Friedman statistic follows a χ²_F distribution. Iman and Davenport [42] showed that Friedman's χ²_F is undesirably conservative and derived a better statistic. With regard to the Nemenyi test, the performance of two classifiers is significantly different if their average ranks differ by more than a critical difference. On the basis of these two statistical tests we define the switching mechanism.

Switching Mechanism: If we fail to reject the null hypothesis of the Friedman test, all classifiers are fused. Otherwise, the Nemenyi test is executed to check whether the best classifier differs significantly from the second best. If it does, only the best one is preserved (DCS); otherwise, both are kept, and the procedure is repeated for the second and third best classifiers, and so on: a significant difference means preserving only the better classifier, otherwise both (Classifier Fusion). It is necessary to underline that the motivation for this mechanism originates mainly from the fact that redundant and inaccurate classifiers in MCSs can adversely affect the performance of a system based on combination functions [43].
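A minimal sketch of this switching logic follows; it assumes a matrix of competences (samples in the competence set × classifiers), omits the Iman–Davenport correction, and takes the Nemenyi critical values from a small hard-coded table, so it only illustrates the mechanism rather than reproducing the paper's exact procedure.

```python
# Sketch: Friedman test + Nemenyi-style selection over classifier competences.
# comp: array (n_samples_in_CS, n_classifiers); larger competence is better.
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# q_0.05 values for the Nemenyi test (k = 2..10), as tabulated by Demsar (2006).
Q_05 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850,
        7: 2.949, 8: 3.031, 9: 3.102, 10: 3.164}

def switching_mechanism(comp, alpha=0.05):
    n, k = comp.shape
    _, p = friedmanchisquare(*[comp[:, j] for j in range(k)])
    if p >= alpha:
        return list(range(k))                        # no difference: fuse everything
    ranks = np.mean([rankdata(-row) for row in comp], axis=0)  # rank 1 = most competent
    cd = Q_05[k] * np.sqrt(k * (k + 1) / (6.0 * n))  # Nemenyi critical difference
    order = np.argsort(ranks)                        # best classifier first
    keep = [order[0]]
    for nxt in order[1:]:
        if ranks[nxt] - ranks[keep[-1]] > cd:        # significantly worse: stop
            break
        keep.append(nxt)
    return keep                                      # 1 index => DCS, else fusion

rng = np.random.default_rng(5)
comp = rng.uniform(0.4, 0.9, size=(40, 6))
comp[:, 2] += 0.3                                    # make classifier 2 clearly better
print(switching_mechanism(np.clip(comp, 0, 1)))
```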

4.5. Classifier fusion

According to the proposed switching mechanism, a classifier fusion procedure is necessary unless the best classifier significantly outperforms the others. In this paper, we use a fusion method called the decision template (DT) to aggregate all or part of the classifiers dynamically. DT is a robust classifier fusion scheme that combines classifier outputs by comparing them to a characteristic template for each class. It uses all classifier outputs to calculate the final support for each class, in sharp contrast to most other fusion methods, which use only the support for that particular class to make their decision [31]. Assume without loss of generality that the classifier set {C_1, C_2, …, C_K} (2 ≤ K ≤ L) with respect to test point x is singled out by the selection procedure described in Section 4.4. First, we summarize the continuous outputs (Section 4.2) of these classifiers into N_x (the size of the competence set of sample x) decision profiles (Section 4.3), which are N_x matrices of size K × 2, {DP(x)_1, DP(x)_2, …, DP(x)_{N_x}}. The DT with respect to each class is then calculated by

DT_j = (1 / N_{xj}) Σ_{x_i ∈ ω_j ∩ CS_x} DP(x_i)    (12)

where j = T (O) represents the target (outlier) class, the set ω_j contains the corresponding samples, N_{xj} indicates the number of samples belonging to class j, and CS_x is the competence set of sample x. Once DP(x) has been derived, we calculate its resemblance to DT_j in order to determine the support values. As both quantities are matrices (not vectors), we use the Frobenius norm to describe their similarity, which is analogous to the Euclidean distance between two vectors:

S_{x,j} = ‖DP(x) − DT_j‖²_F    (13)

The supports of x for the two classes are then calculated by

D_j(x) = 1 − S_{x,j} / (2L)    (14)

Finally, the classification result for sample x is derived with the maximum rule.
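A minimal sketch of decision-template fusion over the selected classifiers follows, in the spirit of Eqs. (12)–(14); the inputs are assumed to come from the earlier sketches (per-sample decision profiles and pseudo-outlier labels), and the normalization uses the number of selected classifiers K in place of L, which is our own reading for the selected subset.

```python
# Sketch: decision-template fusion for the selected subset of classifiers.
import numpy as np

def decision_template_fusion(DP_region, labels_region, DP_x):
    """DP_region: (N_x, K, 2) decision profiles of the competence set;
    labels_region: 0 (target) / 1 (pseudo outlier) for those samples;
    DP_x: (K, 2) decision profile of the test point."""
    K = DP_x.shape[0]
    supports = []
    for j in (0, 1):                                  # target, outlier
        members = DP_region[labels_region == j]
        DT_j = members.mean(axis=0)                   # Eq. (12)
        S_xj = np.sum((DP_x - DT_j) ** 2)             # squared Frobenius norm, Eq. (13)
        supports.append(1.0 - S_xj / (2.0 * K))       # Eq. (14), K selected classifiers
    return int(np.argmax(supports)), supports         # maximum rule

rng = np.random.default_rng(6)
DP_region = rng.uniform(size=(30, 4, 2))
labels_region = np.zeros(30, dtype=int)
labels_region[:3] = 1                                 # a few pseudo outliers
DP_x = rng.uniform(size=(4, 2))
print(decision_template_fusion(DP_region, labels_region, DP_x))
```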


5. Experiments and analysis

In this section, we carry out extensive experiments on several benchmark data sets. The objective is to verify whether the proposed dynamic ensemble model can significantly improve on the performance of single-model methods. In addition, we also compare our method with other ensemble models, including SE, PE, and DE methods.

5.1. Data sets and preprocessing

We employ 20 data sets taken from the KEEL repository [44] (http://sci2s.ugr.es/keel/datasets.php). Because some data sets in KEEL are similar, we only select 20 representative ones, chosen so that their imbalance ratios (IR) cover different ranges. Brief descriptions are given in Table 1. These data sets are binary and arranged according to their degree of imbalance. Due to the lack of one-class benchmarks, we deliberately denote the majority class as the target class (negative class) and the other class as the outlier class (positive class).

Table 1 Summary of the 20 datasets used in the experiments: label, name, number of examples, number of attributes, and imbalance ratio (IR).

Label  Name            #Examples  #Attributes  IR
1      wisconsin       683        9            1.86
2      pima            768        8            1.87
3      iris0           150        4            2
4      glass0          214        9            2.06
5      yeast1          1484       8            2.46
6      vehicle3        846        18           2.99
7      vehicle0        846        18           3.25
8      ecoli2          336        7            5.46
9      segment0        2308       19           6.02
10     yeast3          1484       8            8.1
11     ecoli3          336        7            8.6
12     page-blocks0    5472       10           8.79
13     vowel0          988        13           9.98
14     shuttle-c0-c4   1829       9            13.87
15     yeast-1-7       459        7            14.3
16     ecoli4          336        7            15.8
17     abalone9-18     731        8            16.4
18     yeast4          1484       8            28.1
19     yeast5          1484       8            32.73
20     yeast6          1484       8            41.4

Data scaling is an important preprocessing step, especially for high-dimensional data sets whose features usually have different ranges. Assume {a_i} is a sequence of the d-dimensional data set A. Each attribute of a_i is scaled by

a_i^j ← (a_i^j − μ_j) / s_j,  j = 1, ⋯, d    (15)

where μ_j and s_j represent the mean and standard deviation of attribute j, respectively. For data sets contaminated by outliers, however, this scaling approach is risky, since outliers may be concealed. Here we employ a modification developed in [7]: the mean is replaced by the median, and the standard deviation is calculated with only the half of the observations nearest to this median. It can be proved that such estimates of the mean and standard deviation are very close to those calculated from normal data, provided that the fraction of outliers is smaller than 1/2.
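A minimal sketch of this robust scaling is given below, following the median/half-sample rule described above (the exact estimator in [7] may differ in detail):

```python
# Sketch: outlier-resistant standardization using the median and the
# standard deviation of the half of the observations closest to it.
import numpy as np

def robust_scale(A):
    A = np.asarray(A, dtype=float)
    med = np.median(A, axis=0)
    scaled = np.empty_like(A)
    for j in range(A.shape[1]):
        dist = np.abs(A[:, j] - med[j])
        half = np.argsort(dist)[: max(2, A.shape[0] // 2)]   # closest half
        s = A[half, j].std(ddof=1)
        s = s if s > 0 else 1.0                               # guard degenerate columns
        scaled[:, j] = (A[:, j] - med[j]) / s
    return scaled

rng = np.random.default_rng(7)
X = rng.standard_normal((200, 4))
X[:10] += 8.0                                                 # inject gross outliers
print(robust_scale(X).mean(axis=0).round(2))                  # bulk stays near zero
```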

𝑇𝑁 + 𝑇𝑃 ; (𝑇 𝑁 + 𝐹 𝑃 ) + (𝑇 𝑃 + 𝐹 𝑁 )

𝐸𝑟𝑟𝑜𝑟 𝑅𝑎𝑡𝑒 = 1 − 𝑎𝑐 𝑐 𝑢𝑟𝑎𝑐 𝑦 (16)

two vectors. ‖2

‖ 𝑆𝑥,𝑗 = ‖𝐷𝑃 (𝑥) − 𝐷𝑇𝑗 ‖ ‖ ‖𝐹

But for certain situations where the ratio between sizes of two classes is very large, the accuracy metric can be deceiving [36]. Then we use G-mean as the metric √ 𝑇𝑁 𝑇𝑃 𝐺 − 𝑚𝑒𝑎𝑛 = × (17) 𝑇𝑃 + 𝐹𝑁 𝑇𝑁 + 𝐹𝑃

(13)

Then, supports of x for two classes are calculated by 𝐷 𝑗 (𝑥 ) = 1 −

1 𝑆 2𝐿 𝑥,𝑗

(14)

Following G-mean metric, statistical comparison is also necessary. Here we use the two non-parametric tests introduced in Section 4.4.3 (Friedman and Nemenyi test). Significance level is 5% for all data sets.
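A minimal sketch of these metrics, with the outlier class encoded as the positive label:

```python
# Sketch: accuracy and G-mean from predicted and true labels
# (1 = outlier/positive, 0 = target/negative).
import numpy as np

def gmean_and_accuracy(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    accuracy = (tp + tn) / len(y_true)
    tpr = tp / (tp + fn) if tp + fn else 0.0       # sensitivity on outliers
    tnr = tn / (tn + fp) if tn + fp else 0.0       # specificity on targets
    return np.sqrt(tpr * tnr), accuracy

y_true = [0] * 90 + [1] * 10
y_pred = [0] * 85 + [1] * 5 + [1] * 7 + [0] * 3
print(gmean_and_accuracy(y_true, y_pred))
```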

5.3. Baselines

To put the experimental results into context, we compare our method with the following competitors:

(1) The single best (SB) OC classifier in the ensemble.
(2) Majority voting (MV) of all OC classifiers in the ensemble; here the outputs of the base learners are in binary form.
(3) The average (AVR) of the outputs of all OC classifiers in the ensemble; here the outputs of the base learners are in continuous form.
(4) The product (PRO) of the outputs of all OC classifiers in the ensemble; here the outputs of the base learners are in continuous form.
(5) OCDCS-EM: a dynamic classifier selection method for OC classifiers proposed in [27]. Note that the validation set in the original paper is generated randomly, which differs from ours; for a fair comparison we use the same cluster-based procedure here.
(6) The Oracle (OR), which assigns the correct class label to a test point if and only if at least one individual OC classifier produces the correct class label.

5.4. Experimental setup

The experiments are conducted using two ensemble types: homogeneous and heterogeneous. In the homogeneous ensemble models, SVDD is employed as the base learner, the fraction of target rejection is set to 10% for all SVDD models, and Bagging and Random Subspace are used to generate 20 diverse individual base classifiers. In the heterogeneous ensemble models, we use the 10 different one-class classifiers listed in Table 3; implementations of these classifiers are available in the MATLAB toolbox "dd_tools" (http://homepage.tudelft.nl/n9d04/functions/Contents.html). The parameter "error on target class" is set to 0.1 for all classifiers, and the other parameters are the defaults of the corresponding procedures. A sketch of how such a homogeneous pool can be generated is given below.
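A minimal sketch of generating a homogeneous pool by Bagging plus Random Subspace, with scikit-learn's OneClassSVM (nu = 0.1) standing in for the dd_tools SVDD; the subspace fraction is our own placeholder value.

```python
# Sketch: 20 diverse one-class base learners via bootstrap resampling (Bagging)
# combined with random feature subspaces (Random Subspace).
import numpy as np
from sklearn.svm import OneClassSVM

def build_pool(S, n_models=20, subspace_frac=0.7, seed=0):
    rng = np.random.default_rng(seed)
    n, d = S.shape
    pool = []
    for _ in range(n_models):
        rows = rng.integers(0, n, size=n)                       # bootstrap sample
        cols = rng.choice(d, size=max(1, int(subspace_frac * d)), replace=False)
        model = OneClassSVM(nu=0.1, gamma="scale").fit(S[np.ix_(rows, cols)])
        pool.append((model, cols))                              # remember the subspace
    return pool

rng = np.random.default_rng(8)
S = rng.standard_normal((400, 6))
pool = build_pool(S)
x = S[:1]
votes = [m.predict(x[:, cols])[0] for m, cols in pool]          # +1 target / -1 outlier
print(sum(v == 1 for v in votes), "of", len(pool), "members accept the point")
```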


Table 3 Ten one-class classifiers used in the experiments.

Label  OC classifier                                     Name in dd_tools
1      Mixture of Gaussians data description             mog_dd
2      Parzen density estimator                          parzen_dd
3      Auto-encoder neural network                       autoenc_dd
4      K-means data description                          kmeans_dd
5      K-center data description                         kcenter_dd
6      Principal component analysis data description     pca_dd
7      Self-organizing map data description              som_dd
8      Minimum spanning tree data description            mst_dd
9      K-nearest neighbor data description               knndd
10     Support vector data description                   svdd

5.5. Results

Results for the homogeneous and heterogeneous ensemble models in terms of G-mean are shown in Tables 4 and 5, respectively. For each data set, the best result is highlighted in the original tables, and the rank of each method on that data set is appended to its entry; average ranks over all 20 data sets are given as the last value of each row. For the homogeneous ensemble models, the Friedman test rejects the null hypothesis (p = 0.0083 < 0.05), indicating that significant differences exist between these classifiers; the critical difference of the Nemenyi test is 1.69. For the heterogeneous ensembles, the Friedman test also rejects the null hypothesis (p = 0.0062 < 0.05), and the critical difference of the Nemenyi test is again 1.69.

5.6. Analysis

5.6.1. Homogeneous ensemble

In this experiment, the pool of base learners is generated by the Bagging and Random Subspace techniques, and the parameters of all SVDD models are identical. We can easily see from Table 4 that the oracle obtains the best result; however, as it is an abstraction rather than a real model, we exclude it from the comparison. We then compare the single-best model with our method. It is reasonable to expect classifier selection methods to perform better when the individual classifiers have different accuracies, and this is confirmed not only by our method but also by OCDCS-EM. Our method outperforms SB with respect to the G-mean metric for all data sets, and the Nemenyi test also shows a significant difference. Our dynamic selection mechanism ensures that the most competent classifier classifies each test point, whereas the single model uses a fixed classifier for all test points; it is therefore reasonable that our method outperforms SB significantly. The success of classifier selection methods mainly resides in the accurate estimation of classifier competence, so it can be concluded that the competence estimation procedure used in our method provides accurate estimates and contributes much to the improvement over the single classifier. Certainly, accurate competence estimation owes much to its preliminaries, i.e. the procedures of output transformation, pseudo outliers, and selection of the region of competence. The significance of the switching mechanism is examined in Section 5.6.3.

We then compare the static ensemble models with our dynamic model. Before looking at the results, we make the following conjecture based on their mechanisms: classifier selection methods will probably outperform classifier fusion methods, since fusion models tend to smooth out the differences between individual classifiers and may be slightly inferior to the most outstanding base classifier in the pool, whereas classifier selection models can choose the "best" individual for each test point and will probably obtain results near the oracle. Provided that the true best individual can be singled out for each test point, the classifier selection model in this situation is exactly the oracle. In practice, selecting the best individual for each test point is almost impossible, due to the limitations of the training and validation sets as well as the essential difficulty of the classification task. Among the three classifier fusion methods used, AVR performs best, so it is enough to compare our method with AVR. For 18 of the 20 data sets, our method outperforms AVR with respect to the G-mean metric, and the Nemenyi test further indicates a significant difference. For data sets 9 and 12, AVR obtains the best result, but its difference from our method should not be significant; the reason may lie in the overconfident assumption above, i.e. the best individuals have not been singled out for certain test points. For 16 out of 20 data sets, our method outperforms OCDCS-EM, but the Nemenyi test does not support a significant difference. In our view, the reasons may be threefold: our region of competence including pseudo outliers is helpful, our competence calculation procedure performs better, and the switching mechanism also contributes.

5.6.2. Heterogeneous ensemble

The most outstanding result in this experiment is that the performance of SB improves greatly, becoming even better than AVR in terms of average rank (3.85 vs. 3.95). The reason may be that the base classifiers in the heterogeneous ensemble are well trained, since all training points are used, in contrast to the first experiment, and different model structures are included. The performances of the three static ensemble models are inferior. This implies that classifier fusion methods can improve the performance of weak base learners, but achieving improvement over strong learners is difficult, a statement that has also been verified in several studies. For classifier selection methods, however, improvements can still be obtained: for 17 out of 20 data sets our method performs better than SB, and for 15 it outperforms OCDCS-EM. The result of the Nemenyi test implies a significant difference, although this difference is smaller than that in the first experiment (if such a comparison is meaningful). Apart from these, other similar observations can also be made.

5.6.3. About the switching mechanism

Here we investigate the significance of the switching mechanism proposed in Section 4.4.3. As stated, its goal is to alleviate the overfitting problem of dynamic classifier selection methods, so its effect relates closely to the relationship between the training and testing sets. If this relationship is very close, the switching mechanism may seem redundant and may even deteriorate the performance of dynamic classifier selection; if the relationship is distant, the mechanism can realize its function, and it may therefore be more significant in practical applications. We show comparisons with and without the switching mechanism for the two types of ensemble in Fig. 4. In fact, it is hard to claim a significant improvement from the switching mechanism; in our view, the reason is that the training set closely resembles the testing set, so the errors on both sets are close. It is noteworthy that in most cases more than one individual is singled out via the switching mechanism; in this situation, the model can also be deemed a dynamic ensemble selection (DES) model. Although significant improvement can hardly be obtained for these data sets, comparable results can be guaranteed.

5.6.4. Scalability

Ensemble learning has been widely used in the field of data mining [45], and its ability to address big data has also been verified [46]. Most ensemble models, however, belong to static ensemble learning, whereas our method is a dynamic ensemble model. At the training phase, the computational complexity is similar, since both require training a pool of base learners; this indicates that the two types of ensemble learning share similar training times for the same data set. At the test phase, however, dynamic ensemble learning has higher computational complexity, because we need to determine the region of competence and compute the competence of each base learner with respect to each test point. From the descriptions of the determination of the region of competence and the calculation of competence in Section 4, we can see that this heavy computational complexity can be alleviated if we can control the number of samples in the region of competence. As mentioned, two methods are usually used to determine this region, i.e. the KNN and clustering methods. With the KNN method, we must calculate the distances from the test point to all data points in the training set in order to find its k nearest neighbors, which is prohibitive for large training sets. With a clustering method, the clustering is executed at the training phase, and at the test phase we only need to determine which cluster the test point belongs to, so the problem of heavy computational complexity is alleviated. For large training sets, however, the number of data points in each cluster is also large; in this paper, we try to address this by increasing the number of clusters.


Table 4 Results from the homogeneous ensembles in terms of G-mean (%). Each row lists the values for datasets 1–20; the final digit appended to each entry is its per-dataset rank (no ranks for the Oracle), and the last value in each row is the average rank.

SB:        87.924 54.174 98.633 52.966 49.176 62.903 87.294 83.814 80.155 73.983 71.332 73.015 80.596 89.104 65.813 67.725 60.853 59.884 74.006 59.976 | AVR. rank 4.30
MV:        85.665 53.235 96.126 55.704 52.155 57.796 86.115 81.945 80.794 69.195 68.776 74.301 81.374 88.915 63.966 69.733 59.636 58.746 74.035 67.913 | AVR. rank 4.75
AVR:       88.973 57.673 96.375 56.193 53.194 58.055 89.553 84.383 84.191 71.204 71.053 74.301 81.993 88.436 64.985 69.733 59.905 60.673 78.313 67.913 | AVR. rank 3.45
PRO:       84.026 51.476 97.554 54.805 54.493 60.724 85.906 80.886 76.986 64.256 69.035 70.996 80.915 89.733 65.264 65.446 60.104 59.105 77.094 61.905 | AVR. rank 4.85
OCDCS-EM:  91.982 61.391 1001 57.592 57.802 63.442 90.641 84.093 82.763 74.202 70.974 73.124 82.482 91.912 67.961 70.972 62.892 64.132 80.692 68.902 | AVR. rank 2.10
Our:       92.121 60.422 1001 58.241 58.111 64.081 90.002 84.611 83.102 75.541 71.461 73.913 83.981 92.711 67.522 71.661 63.281 65.281 81.101 70.061 | AVR. rank 1.30
OR:        99.87 89.54 100 92.73 93.84 96.73 99.96 98.87 99.91 95.94 94.40 96.64 99.54 100 98.19 97.73 89.91 94.58 99.13 98.01 | AVR. rank NA

Table 5 Results from the heterogeneous ensembles in terms of G-mean (%). Each row lists the values for datasets 1–20; the final digit appended to each entry is its per-dataset rank (no ranks for the Oracle), and the last value in each row is the average rank.

SB:        91.464 59.831 1001 55.265 50.116 61.593 87.245 83.581 88.852 73.984 66.593 80.731 83.514 89.134 58.106 72.593 66.103 62.284 77.574 62.145 | AVR. rank 3.45
MV:        89.735 55.206 99.126 54.746 52.094 58.756 87.166 81.246 80.776 72.145 65.754 76.346 83.195 88.096 59.745 70.116 64.794 60.226 74.905 63.884 | AVR. rank 5.35
AVR:       91.972 57.665 99.505 55.193 52.133 58.756 90.253 81.934 84.195 75.263 64.055 78.584 84.933 88.435 64.593 72.034 64.015 61.595 78.623 64.933 | AVR. rank 3.95
PRO:       86.976 58.913 99.604 54.894 50.495 60.734 88.984 81.245 86.554 72.056 61.936 77.235 82.056 90.463 63.974 71.365 61.986 63.773 73.336 61.196 | AVR. rank 4.75
OCDCS-EM:  91.853 58.194 1001 56.521 54.872 63.742 91.131 82.073 88.133 77.522 68.742 79.193 85.442 91.002 66.671 72.972 66.742 64.912 80.112 69.802 | AVR. rank 2.10
Our:       93.721 59.482 1001 56.182 55.651 64.221 91.131 83.112 90.131 78.491 69.211 80.112 86.031 91.691 66.012 73.661 67.281 66.301 81.231 71.301 | AVR. rank 1.25
OR:        100 89.54 100 91.73 93.84 97.73 100 100 99.88 97.59 97.63 100 99.62 100 97.83 99.62 96.91 97.74 100 99.68 | AVR. rank NA

6. Applied to the Tennessee Eastman process

The Tennessee Eastman (TE) process is a widely used simulation process for evaluating different approaches to process monitoring [47]. The technological process is illustrated in Fig. 5. The gaseous reactants A, C, D, and E and the inert B are fed to the reactor, where the liquid products G and H are formed. The product stream is then cooled by a condenser and fed to a vapor-liquid separator, where the vapor is recycled to the reactor feed through a compressor. A portion of the recycled stream is purged to keep the inert and byproducts from accumulating in the process. The condensed components from the separator are pumped to the stripper. Stream 4 is used to strip the remaining reactants in Stream 10, and is combined with the recycle stream.


Fig. 4. Influence of the switching mechanism for (a) homogeneous ensemble models; (b) heterogeneous ensemble models.

Fig. 5. Technological process of TE process.

6.1. Data sets

Twenty different faults are usually simulated by this process, and we list them in Table 6. Accordingly, 21 data sets can be generated, representing one normal condition and 20 faulty conditions; these data sets can be used to train process monitoring models. In this paper, only the normal data set is assumed to be available at the training phase. It contains 500 data points sampled under the normal working condition. Note that some faulty data points are usually added to the training set, but they are assumed unknown. At the test phase, we have 20 data sets, each of which contains 960 samples; they are usually used to check the process monitoring models. For each test data set, the first 160 data points are normal and the subsequent points represent a fault. To imitate a complex working scenario, 50 faulty examples from one type of fault are randomly selected and added to the training set. Note that the labels of all points in this training set are unknown, so as to simulate a missing label problem. The data scaling technique introduced in Section 5.1 is also used here.

6.2. Experimental setup

We carry out two series of experiments to verify the effectiveness of our method with respect to offline and online outlier detection; both issues are related to process monitoring. For the methods in the first series of experiments, the goal is to extract data that accurately represent the target concept. To evaluate the performance of different


methods, we use the extracted data to construct the corresponding PCA structure as the process monitoring model for the test sets, and we compare our method with three multivariate outlier detection methods: Mahalanobis distance (MD) [12], RHM, and SHV [13]. In the literature, MD, RHM, and SHV have also been proposed for extracting normal data; in a nutshell, we choose these competitors according to the conditions of detection and application. For the methods in the second series of experiments, the goal is to directly construct a process monitoring model from the training set with unlabeled outliers; this model is then used on the test sets to assess its performance. Here we compare our method with five other competitors: D-SVDD [48], L-SVDD [49], OCClustE [22], Feature Bagging (FB) [50], and Rotated Bagging (RB) [51]. D-SVDD and L-SVDD are two methods developed from the original SVDD with the objective of improving its robustness; by modifying the optimization problem in SVDD, the negative influence of outliers in the training set can be alleviated efficiently. OCClustE, FB, and RB are three state-of-the-art ensemble outlier detection methods.

Table 6 Process faults for the Tennessee Eastman process.

Fault   Description                                        Type
1       A/C feed ratio, B composition constant             Step
2       B composition, A/C ratio constant                  Step
3       D feed temperature                                 Step
4       Reactor cooling water inlet temperature            Step
5       Condenser cooling water inlet temperature          Step
6       A feed loss                                        Step
7       C header pressure loss-reduced availability        Step
8       A, B, C feed composition                           Random
9       D feed temperature                                 Random
10      C feed temperature                                 Random
11      Reactor cooling water inlet temperature            Random
12      Condenser cooling water inlet temperature          Random
13      Reaction kinetics                                  Slow drift
14      Reactor cooling water valve                        Sticking
15      Condenser cooling water valve                      Sticking
16–20   Unknown                                            NA

Fig. 6. Monitoring results on fault 1 data set of PCA models, whose training sets are processed by different methods: (a) MD; (b) RHM; (c) SHV; (d) Our method.


Fig. 6. Continued

To avoid differences caused by the base learners, the specification of the base learner of these competitors is identical to that of our method. Note that a principled way for most one-class classifiers to set the outlier threshold is to supply the fraction of the target set that should be accepted [30]. In this paper, we set the error on the target class to 0.1 for all SVDDs used, which indicates that nearly 90% of the target data will be accepted. In an unsupervised learning paradigm such as outlier detection, we have no labeled training data for parameter optimization by algorithms such as cross validation; this setting is entirely user-specified, and 0.9 is also recommended by the authors of [30]. The setting is independent of the training data, which means that we generally use it for any training set unless we have some prior knowledge about the data. We should also stress that our focus in this paper is on the ensemble learning rather than on SVDD itself; SVDD only plays the role of the base learner, and other base learners could also be used. Moreover, 0.9 is used for all SVDD models for the sake of fairness.

6.3. Results and analysis

6.3.1. Offline detection

Results from the first series of experiments are shown in Table 7. In order to show how one method outperforms another, i.e. by a higher true positive rate (TPR) or a lower false positive rate (FPR), we list the two indexes separately; from a practical point of view it also matters which metric a method is better on. As defined in Section 5, the outlier class is regarded as the positive class and the target class as the negative class, so TPR is the rate of correctly detected outliers, TP/(TP + FN), and FPR is the rate at which target points are flagged as outliers, FP/(FP + TN), also referred to as the false alarm rate. Average values over all 20 data sets are shown at the bottom of the table. The PCA models are constructed from training sets processed by the offline outlier detection methods. PCA is chosen as the monitoring method because of its wide usage in process monitoring, and the SPE statistic is chosen arbitrarily; the results in Table 7 thus compare how well the different methods process outliers, the simple principle being that a method with stronger outlier-processing ability leads to a better PCA monitoring result.

For the TPR metric, our method outperforms the other three competitors on 13 out of 20 data sets, i.e. it detects outliers better. The reason is that our detection mechanism is distinct from that of the multivariate methods, which suffer more from outliers in the training set. For the FPR metric, MD gives the best result, followed by our method. In fact, the results of all methods are not fully satisfactory, because the number of misclassified target points is large and would lead to many false alarms: although most outliers can be identified, the cost is so high that this monitoring scheme becomes risky. We emphasize again that this may stem from a drawback of PCA itself, which is not our concern here; what we focus on is the outlier-processing ability, as reflected by the PCA models, and the average values show that our method achieves the better result. Limited by space, we only show the monitoring results on the fault 1 test set in Fig. 6, using both the SPE and T2 statistics; such monitoring curves cannot be read off from Table 7. In terms of SPE, all methods fail because of the high FPR values (alarms are triggered from the start). This does not happen for the T2 statistic, but the alarms are delayed and some outliers may be hidden.
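For clarity, the two reported metrics can be computed as below (a small sketch with made-up labels; as in the paper, the outlier class is treated as the positive class):

```python
# Minimal sketch of the TPR/FPR metrics; 1 marks an outlier, 0 a target point.
import numpy as np

def tpr_fpr(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    return tp / (tp + fn), fp / (fp + tn)   # TPR, FPR (false alarm rate)

print(tpr_fpr([1, 1, 0, 0, 0], [1, 0, 0, 1, 0]))   # -> (0.5, 0.333...)
```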

Table 7
Results from the first series of experiments: PCA (SPE statistic). Best results are in bold.

        True positive rate              False positive rate
Fault   MD      RHM     SHV     Our     MD      RHM     SHV     Our
1       91.25   97.88   99.88   100     25.63   35.00   39.38   37.50
2       86.25   89.37   88.19   91.38   20.63   24.38   23.75   23.13
3       77.30   79.42   78.58   83.90   19.38   21.88   20.63   20.00
4       78.17   82.54   81.63   81.79   31.88   35.00   37.50   31.25
5       75.49   77.10   76.92   86.42   17.50   21.25   18.75   18.13
6       94.36   96.58   96.85   98.59   13.13   14.38   13.75   13.75
7       88.74   90.25   92.36   91.70   20.63   21.88   22.50   18.13
8       90.67   93.85   92.17   99.50   35.63   44.38   39.38   31.88
9       76.65   77.88   80.53   81.69   15.63   18.75   19.38   17.50
10      89.46   95.58   94.17   95.14   12.50   13.75   14.38   13.13
11      88.55   89.57   91.33   90.98   21.88   21.25   20.63   18.75
12      95.91   96.18   98.83   98.62   8.750   10.00   11.88   8.125
13      91.07   91.86   92.59   94.06   11.25   11.25   13.13   10.63
14      85.63   84.82   87.79   86.50   22.50   24.38   24.38   23.75
15      70.59   72.38   75.64   81.90   3.750   11.25   11.25   9.375
16      90.55   92.47   93.87   95.59   6.785   8.125   7.500   5.000
17      84.75   85.64   86.33   89.51   16.88   15.00   15.00   14.38
18      94.38   95.80   95.17   98.88   31.25   39.38   36.88   31.88
19      76.65   77.88   80.53   81.69   16.25   17.50   19.38   17.50
20      89.46   91.58   94.17   93.14   15.63   16.25   18.75   11.88
AVR     85.80   87.93   88.88   91.05   18.37   21.25   21.40   18.78

6.3.2. Online application

Results of the second series of experiments are shown in Table 8. Here our method and the five competitors are used directly for online outlier detection on the TE process. Note that we only report results in terms of the TPR and FPR metrics: when classifier selection methods are applied to process monitoring, the fault detection threshold changes from one test point to another, so a single monitoring curve cannot be drawn. Comparing Tables 7 and 8, the FPR results of the methods in Table 8 are greatly improved, and the TPR results are improved as well, as shown by the average values. This implies that our method and the five competitors outperform the PCA models even though the training sets in Table 7 have been processed.

We then compare our method with the five competitors. As described in Section 6.2, D-SVDD and L-SVDD improve the robustness of the original SVDD by modifying its optimization problem, while OCClustE, FB, and RB are ensemble methods whose base learners are identical to ours. With respect to TPR, our method outperforms D-SVDD on 19 data sets and L-SVDD on 19 data sets as well, indicating that our dynamic ensemble model is more robust to outliers in the training set than D-SVDD and L-SVDD; the reason may be that outliers in the training set are handled more reasonably by our method. OCClustE is a cluster-based ensemble method and usually performs well when the data are distributed in several clusters. Our method outperforms OCClustE with respect to TPR on 15 data sets, and on data sets 1, 4, 12, 14, and 20 the superiority of OCClustE over our method is marginal. For the other two ensemble methods, our method outperforms FB on 19 data sets and RB on 17 data sets. Although ensemble methods can improve the robustness of an algorithm, their performance can evidently be improved further by a dynamic mechanism. From the perspective of FPR, our method again shows a clear superiority over its competitors; together with the TPR results, this indicates that our method identifies most outliers while producing the fewest false alarms, which can also be seen from the average values.
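To make the ensemble baselines more tangible, the sketch below (our own simplification, in the spirit of FB [50], not the authors' dynamic ensemble nor the exact FB/RB implementations) builds a feature-bagging style one-class ensemble by training SVDD-like base learners on random feature subsets and averaging their outlier scores:

```python
# Hedged sketch of a feature-bagging one-class ensemble with OneClassSVM base learners.
import numpy as np
from sklearn.svm import OneClassSVM

def fb_ensemble_scores(X_train, X_test, n_learners=10, seed=None):
    rng = np.random.default_rng(seed)
    d = X_train.shape[1]
    scores = np.zeros(len(X_test))
    for _ in range(n_learners):
        k = rng.integers(d // 2, d + 1)               # random subspace size
        feats = rng.choice(d, size=k, replace=False)  # random feature subset
        clf = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale")
        clf.fit(X_train[:, feats])
        # decision_function is larger for "more normal" points, so negate it
        scores += -clf.decision_function(X_test[:, feats])
    return scores / n_learners                        # average outlier score

rng = np.random.default_rng(2)
X_train = rng.normal(size=(400, 6))
X_test = np.vstack([rng.normal(size=(50, 6)), rng.normal(loc=3.0, size=(50, 6))])
s = fb_ensemble_scores(X_train, X_test, seed=2)
print("mean score (normal vs shifted):", s[:50].mean(), s[50:].mean())
```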

Table 8
Results from the second series of experiments.

        True positive rate                                      False positive rate
Fault   D-SVDD  L-SVDD  OCClustE  FB      RB      Our           D-SVDD  L-SVDD  OCClustE  FB      RB      Our
1       98.64   98.17   100       100     100     100           6.875   7.500   5.000     9.375   10.63   3.125
2       91.37   90.64   93.88     92.79   94.62   96.90         11.87   11.25   6.880     6.880   9.375   5.000
3       89.75   84.37   85.42     87.94   88.90   89.06         8.750   8.750   2.500     5.625   5.625   0
4       88.53   89.73   95.64     94.37   94.30   95.01         13.13   11.25   8.750     10.63   5.000   6.880
5       93.07   94.16   97.69     98.74   99.01   99.58         13.13   13.75   9.375     11.88   13.75   10.00
6       95.11   94.97   96.85     95.43   98.82   98.66         5.625   6.250   2.500     10.00   5.625   0.625
7       90.66   92.73   89.54     89.27   89.03   92.01         10.00   9.375   10.63     7.500   5.625   4.375
8       95.67   93.19   94.46     96.44   97.91   100           15.63   17.50   11.88     18.13   16.88   13.13
9       86.97   88.67   90.74     91.62   93.10   94.91         14.38   15.00   11.88     10.63   10.00   8.750
10      92.33   91.15   93.90     95.42   96.58   97.64         11.88   9.375   13.75     6.250   13.13   11.88
11      95.73   95.73   96.98     95.42   97.14   95.73         10.63   11.25   5.625     10.00   8.125   3.750
12      95.91   96.18   98.83     97.90   98.57   98.62         9.375   10.00   10.63     9.375   14.38   0
13      94.07   92.93   94.76     95.48   95.23   97.70         11.88   13.13   10.00     11.25   16.25   7.500
14      91.62   90.09   94.55     92.71   93.42   93.18         18.13   21.25   16.88     16.88   16.88   8.125
15      85.59   86.77   88.64     87.74   90.55   94.37         7.500   6.250   8.125     13.13   6.250   3.125
16      93.49   92.17   92.66     93.20   94.89   97.00         5.625   7.500   10.00     7.500   9.375   7.500
17      91.69   92.15   94.36     92.89   93.28   96.59         18.75   16.25   15.63     15.00   17.50   12.50
18      96.68   96.10   94.44     94.01   93.57   100           11.25   13.13   8.125     9.375   11.25   8.750
19      93.97   93.05   95.71     94.24   96.10   97.88         3.750   7.500   9.375     9.375   7.500   7.500
20      94.47   92.28   97.79     93.90   96.26   96.02         11.25   14.38   17.50     11.88   13.75   2.500
AVR     92.77   92.26   94.34     93.98   95.06   95.54         10.97   11.53   9.752     10.53   10.85   6.251



7. Conclusions

A dynamic ensemble outlier detection approach is proposed in this study. One-class classifiers are used as base learners due to the absence of labeled training data. With the help of the proposed output transformation and pseudo-outlier generation procedures, a probabilistic model is used to evaluate the competence of all base classifiers on regions determined by a clustering technique. A switching mechanism based on two statistical tests is proposed to decide whether a single one-class classifier should be nominated to make the decision or the decision template should be applied instead. Twenty data sets are used to compare the performance of the different methods. For both homogeneous and heterogeneous ensemble models, our method obtains the best results, and the improvements over the single best model and over static ensemble models are significant. In addition, another dynamic classifier selection method is compared with ours; although the advantage is not significant, the improvement over it is still clear. Finally, we apply our method to process monitoring (the TE process). Experiments are carried out in two ways: offline outlier processing and online outlier detection (process monitoring). In both series of experiments, our method performs better than the corresponding competitors. Our method may still be improved in several ways, such as refining the calculation of classifier competence, enhancing the determination of the region of competence, and even the dynamic selection mechanism itself. These are directions for our further research.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (grant no. 51634002) and the National Key R&D Program of China (grant no. 2017YFB0304104).

Supplementary material

Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.inffus.2019.02.006.

References

[1] S. Yin, X. Li, H. Gao, O. Kaynak, Data-based techniques focused on modern industry: an overview, IEEE Trans. Indust. Electron. 62 (1) (2015) 657–667.
[2] M. Kano, Y. Nakagawa, Data-based process monitoring, process control, and quality improvement: recent developments and applications in steel industry, Comput. Chem. Eng. 32 (1–2) (2008) 12–24.
[3] Q. Jiang, X. Yan, Probabilistic weighted NPE-SVDD for chemical process monitoring, Control Eng. Pract. 28 (1) (2014) 74–89.
[4] J. Huang, X. Yan, Related and independent variable fault detection based on KPCA and SVDD, J. Process Control 39 (2016) 88–99.
[5] Z.Q. Ge, Z.H. Song, Bagging support vector data description model for batch process monitoring, J. Process Control 23 (8) (2013) 1090–1096.
[6] G. Li, et al., An improved fault detection method for incipient centrifugal chiller faults using the PCA-R-SVDD algorithm, Energy Build. 116 (2015) 104–113.
[7] L.H. Chiang, R.J. Pell, M.B. Seasholtz, Exploring process data with the use of robust outlier detection algorithms, J. Process Control 13 (5) (2003) 437–449.
[8] C. Yu, Q.G. Wang, D. Zhang, L. Wang, J. Huang, System identification in presence of outliers, IEEE Trans. Cybern. 46 (5) (2016) 1202–1216.
[9] F. Pukelsheim, The three sigma rule, Am. Stat. 48 (2) (1994) 88–91.
[10] R.K. Pearson, Exploring process data, J. Process Control 11 (2) (2001) 179–194.
[11] R.K. Pearson, Outliers in process modeling and identification, IEEE Trans. Control Syst. Technol. 10 (1) (2002) 55–63.
[12] R.D. Maesschalck, D. Jouan-Rimbaud, D.L. Massart, The Mahalanobis distance, Chemom. Intell. Lab. Syst. 50 (1) (2000) 1–18.
[13] W.J. Egan, S.L. Morgan, Outlier detection in multivariate analytical chemical data, Anal. Chem. 70 (11) (1998) 2372.
[14] J. Almutawa, Identification of errors-in-variables model with observation outlier based on MCD, J. Process Control 19 (5) (2007) 879–887.
[15] J.S. Zeng, C.H. Gao, Improvement of identification of blast furnace ironmaking process by outlier detection and missing value imputation, J. Process Control 19 (9) (2009) 1519–1528.
[16] F. Liu, Z. Mao, W. Su, Outlier detection for process control data based on a non-linear Auto-Regression Hidden Markov Model method, Trans. Inst. Meas. Control 34 (5) (2012) 527–538.
[17] H. Ferdowsi, S. Jagannathan, M. Zawodniok, An online outlier identification and removal scheme for improving fault detection performance, IEEE Trans. Neural Netw. Learn. Syst. 25 (5) (2014) 908–919.
[18] D.M.J. Tax, M.V. Breukelen, R.P.W. Duin, J. Kittler, Combining multiple classifiers by averaging or by multiplying? Pattern Recognit. 33 (9) (2000) 1475–1485.
[19] P. Juszczak, R.P.W. Duin, Combining one-class classifiers to classify missing data, in: International Workshop on Multiple Classifier Systems, Cagliari, Italy, 2004, pp. 92–101.
[20] B. Wang, Z. Mao, One-class classifiers ensemble based anomaly detection scheme for process control systems, Trans. Inst. Meas. Control 40 (12) (2017) 3466–3476.
[21] C. Lai, D.M.J. Tax, R.P.W. Duin, E. Pękalska, P. Paclík, in: International Workshop on Multiple Classifier Systems, 2002, pp. 212–221.
[22] B. Krawczyk, M. Woźniak, B. Cyganek, Clustering-based ensembles for one-class classification, Inf. Sci. 264 (6) (2014) 182–195.
[23] L.I. Kuncheva, C.J. Whitaker, Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy, Mach. Learn. 51 (2) (2003) 181–207.
[24] B. Krawczyk, One-class classifier ensemble pruning and weighting with firefly algorithm, Neurocomputing 150 (2015) 490–500.
[25] E. Parhizkar, M. Abadi, BeeOWA: a novel approach based on ABC algorithm and induced OWA operators for constructing one-class classifier ensembles, Neurocomputing 166 (2015) 367–381.
[26] D.V.R. Oliveira, G.D.C. Cavalcanti, R. Sabourin, Online pruning of base classifiers for dynamic ensemble selection, Pattern Recognit. 72 (2017) 44–58.
[27] B. Krawczyk, Dynamic classifier selection for one-class classification, Knowl.-Based Syst. 107 (2016) 43–53.
[28] D.M.J. Tax, One-class Classification, Delft University of Technology, 2001.
[29] M. Woźniak, M. Graña, E. Corchado, A survey of multiple classifier systems as hybrid systems, Inf. Fusion 16 (1) (2014) 3–17.
[30] D.M.J. Tax, R.P.W. Duin, Combining one-class classifiers, in: International Workshop on Multiple Classifier Systems, 2001, pp. 299–308.
[31] L.I. Kuncheva, J.C. Bezdek, R.P.W. Duin, Decision templates for multiple classifier fusion: an experimental comparison, Pattern Recognit. 34 (2) (2001) 299–314.
[32] A.S.B. Jr, R. Sabourin, L.E.S. Oliveira, Dynamic selection of classifiers—a comprehensive review, Pattern Recognit. 47 (11) (2014) 3665–3680.
[33] K. Woods, W.P.K. Jr, K. Bowyer, Combination of multiple classifiers using local accuracy estimates, IEEE Trans. Pattern Anal. Mach. Intell. 19 (4) (1997) 405–410.
[34] G. Giacinto, F. Roli, Methods for dynamic classifier selection, in: Proceedings of the 10th International Conference on Image Analysis and Processing, 1999, pp. 659–664.
[35] L.I. Kuncheva, Clustering-and-selection model for classifier combination, in: International Conference on Knowledge-Based Intelligent Engineering Systems and Allied Technologies, 1, 2000, pp. 185–188.
[36] H. He, E.A. Garcia, Learning from imbalanced data, IEEE Trans. Knowl. Data Eng. 21 (9) (2009) 1263–1284.
[37] T. Woloszynski, M. Kurzynski, A probabilistic model of classifier competence for dynamic ensemble selection, Pattern Recognit. 44 (10–11) (2011) 2656–2668.
[38] W.S. Meisel, Potential functions in mathematical pattern recognition, IEEE Trans. Comput. C-18 (10) (1969) 911–918.
[39] L.I. Kuncheva, Switching between selection and fusion in combining classifiers: an experiment, IEEE Trans. Syst. Man Cybern.-Part B 32 (2) (2002) 146–156.
[40] S.W. Looney, A statistical technique for comparing the accuracies of several classifiers, Pattern Recognit. Lett. 8 (1) (1988) 5–9.
[41] J. Demsar, Statistical comparisons of classifiers over multiple data sets, J. Mach. Learn. Res. 7 (1) (2006) 1–30.
[42] R.L. Iman, J.M. Davenport, Approximations of the critical region of the Friedman statistic, Commun. Stat. 9 (6) (1979) 571–595.
[43] H. Ykhlef, D. Bouchaffra, An efficient ensemble pruning approach based on simple coalitional games, Inf. Fusion 34 (2017) 28–42.
[44] J. Alcalá-Fdez, A. Fernández, J. Luengo, J. Derrac, S. García, KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework, J. Mult.-Valued Logic Soft Comput. 17 (2011) 255–287.
[45] B. Krawczyk, L.L. Minku, J. Gama, J. Stefanowski, M. Wozniak, Ensemble learning for data stream analysis: a survey, Inf. Fusion 37 (2017) 132–156.
[46] A. Galicia, R. Talavera-Llames, A. Troncoso, I. Koprinska, F. Martinez-Alvarez, Multi-step forecasting for big data time series based on ensemble learning, Knowl.-Based Syst. 163 (2019) 830–841.
[47] N.L. Ricker, Decentralized control of the Tennessee Eastman Challenge process, J. Process Control 6 (4) (1996) 205–221.
[48] K. Lee, D.W. Kim, K.H. Lee, D. Lee, Density-induced support vector data description, IEEE Trans. Neural Netw. 18 (1) (2007) 284–289.
[49] B. Liu, SVDD-based outlier detection on uncertain data, Knowl. Inf. Syst. 34 (3) (2013) 597–618.
[50] A. Lazarevic, V. Kumar, Feature bagging for outlier detection, in: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, 2005.
[51] C.C. Aggarwal, S. Sathe, Theoretical foundations and algorithms for outlier ensembles, ACM SIGKDD Explorations Newsl. 17 (1) (2015) 24–47.
