Krist V. Gernaey, Jakob K. Huusom and Rafiqul Gani (Eds.), 12th International Symposium on Process Systems Engineering and 25th European Symposium on Computer Aided Process Engineering. © 2015 Elsevier B.V. All rights reserved. 31 May - 4 June 2015, Copenhagen, Denmark.
Improving Data Reliability for Process Monitoring with Fuzzy Outlier Detection
Harakhun Tanatavikorn^a and Yoshiyuki Yamashita^a
^a Department of Chemical Engineering; Tokyo University of Agriculture and Technology; Tokyo, Japan
[email protected]
Abstract
To implement on-line process monitoring techniques that use principal component analysis (PCA) or partial least squares (PLS) models, it is important to construct the models from reliable data that represent normal process operation. In this paper, a novel flexible fuzzy treatment method is developed for the detection of outliers in process data. The method combines a fuzzy c-means clustering algorithm, which separates the data into clusters, with a fuzzy inference engine, which assigns a degree of outlier to each data point. The current iteration of the fuzzy inference engine evaluates the degree of outlier from two inputs: distance, measured in standard deviations, and fuzzy membership values. The method can therefore be considered a hybrid that incorporates statistical parameters into a fuzzy strategy. Decisions on how to handle the data can then be made based on the degree of outlier, which can be conveniently translated into a relative weight assigned to an outlier entering downstream data processes. The fuzzy treatment method was applied to benchmark penicillin production process data containing artificial outlier data points. The proposed method was able to detect the outliers in the process data, although data points in the transient region were prone to a high degree of outlier. Additionally, the fuzzy inference engine can be modified to use different criteria, such as data density or spread, to evaluate outlierness. The results are presented along with a discussion of the advantages of this method as a flexible treatment of process data. The methodology will be applied in future investigations of PCA-based process monitoring.
Keywords: Process Monitoring, Process Data, Outlier Detection, Fuzzy Logic, Batch, Fuzzy C-means
1. Introduction
Advances in sensors, computers, and informatics have greatly increased the amount of available process data. Large plants record process variables, product quality, production, and maintenance information on a frequent basis. Reliable process data are a vital component in the implementation of on-line process monitoring techniques that control these large facilities. Data reliability, in this context, refers to the accuracy and completeness of data, given the intended purpose for use. Reliability does not mean that the data are error-free; it means that any errors found in the data are within a tolerable range, that is, the associated risks have been assessed and the errors have been judged not significant enough to cast doubt on findings or conclusions based on the data. Given a set of process data, an outlier is an element that deviates significantly from the normal or
sometimes meaningful range. Outliers represent either errors or interesting observations, arising from causes such as mechanical faults, changes in system behavior, instrument error, or human error. The presence of outliers may lead to inaccurate process models and wrong analytical conclusions. Outlier detection is therefore used to detect and, when appropriate, remove anomalous entries from process data to improve their reliability for process monitoring.
Conventional outlier detection methods rely on statistical tests to screen a dataset. Classical tests are those of Grubbs (1950) and Dixon (1950); similar tests have also been formulated by Quesenberry and David (1961) and Ferguson (1961). In essence, all these tests provide a statistic to be compared with a critical value in order to evaluate whether a data point is an outlier. They differ mainly in the construction of the statistic and the selection of the critical value. Since outlier detection and analysis is ultimately a subjective exercise, classical methods are often rigid and sometimes face limitations posed by their statistical nature.
In this paper, a novel flexible fuzzy treatment method is developed for the detection of outliers in process data. The method combines a fuzzy c-means clustering algorithm, which separates the data into clusters, with a fuzzy inference engine, which assigns a degree of outlier to each data point. The main advantage of the method is the flexibility of the inference engine, in which different evaluation criteria can be implemented. This provides an adjustable subjective framework for evaluating large data sets.
The paper is organized as follows: Section 2 describes the components and methodology of the fuzzy treatment method. Section 3 presents selected results of applying the fuzzy treatment method to simulated process data. Finally, Section 4 concludes the paper.
2. Methodology
Figure 1: Fuzzy treatment method
2.1. Fuzzy c-means Clustering
The main purpose of fuzzy c-means clustering is to partition data into a collection of clusters, where each data point is assigned a membership value for each cluster. Fuzzy c-means clustering involves two processes: the calculation of cluster centers and the assignment of points to these centers using a form of Euclidean distance. These processes are repeated until the cluster centers stabilize. The algorithm is similar in several aspects to crisp clustering, such as k-means clustering, but incorporates the fuzzy set concept of partial membership by allowing data points to belong to more than one cluster. This can be observed in the form of overlapping clusters. Additionally, many crisp clustering techniques tend to have difficulties in handling extreme outliers, whereas fuzzy clustering algorithms tend to give them a small membership degree in the surrounding clusters (Looney, 1999). The algorithm also needs a fuzzification parameter m, which determines the degree of fuzziness in the clusters. As m approaches 1 the algorithm behaves like crisp clustering, while larger values of m increase the fuzzification of the clustering and thus allow a higher degree of cluster overlap.
Based on the clustering results, k different data groups are prepared from the initial data. The assignment of x_i to a data group depends on its membership value for each cluster, u_ik. x_i is assigned to data groups according to the following conditions:
• The highest value of u_ik determines the group k in which x_i is placed;
• If the difference between the highest u_ik and the next highest membership value is less than the threshold value δ, then x_i is placed in both groups;
• Additionally, if the highest value of u_ik is below 0.2, then x_i is placed in the three groups with the highest membership values.
It is important to note that the selection of the threshold value δ depends on the number of clusters n selected for the fuzzy clustering. In this work, a plot of the within-groups sum of squares against the number of clusters extracted is used to determine the appropriate number of clusters based on the bend in the plot (Everitt and Hothorn, 2006). This is often referred to as the "elbow" method.
2.2. Fuzzy Inference System
Fuzzy inference is the process of formulating the mapping from a given input to an output using fuzzy logic. The mapping then provides a basis from which decisions can be made or patterns discerned. The fuzzy inference system (FIS) selected for implementation in this paper was proposed by D'Errico and Murru (2012) and is used here with modifications to the FIS rule base and input variables. The modified FIS includes two inputs (distance and grouping) and one output (degree of outlier). The inference engine is a basic Mamdani model, constructed using the 'frbs' package of the R statistical software. The fuzzy distance is obtained after a fuzzification of the crisp distance input d = d(x_i, μ), according to:
• if d(x_i, μ) ≥ 4σ_max, then distance is very long;
• if 3σ_max ≤ d(x_i, μ) ≤ 5σ_max, then distance is long;
• if 2σ_max ≤ d(x_i, μ) ≤ 4σ_max, then distance is medium;
• if d(x_i, μ) ≤ 3σ_max, then distance is short;
where σ_max is the highest observed standard deviation among the chosen variables in the corresponding group. The fuzzy grouping is obtained after a fuzzification of the highest membership value g = g(x_i, u_ik) from the results of the fuzzy clustering, according to:
• if g(x_i, u_ik) ≥ 0.4σ, then grouping is grouped;
• if 0.25 ≤ g(x_i, u_ik) ≤ 0.45, then grouping is moderate;
• if 0.2 ≤ g(x_i, u_ik), then grouping is spread;
where u_ik is the membership value of data point i in data group k. As to outlierness ρ:
• if ρ ≥ 0.5σ, then outlierness is high;
• if 0.25 ≤ ρ ≤ 0.75, then outlierness is intermediate;
• if 0.5 ≤ ρ, then outlierness is low.
The inference system is based on the following 11 rules:
1. if (distance is short) and (grouping is grouped), then outlierness is (low)
2. if (distance is short) and (grouping is moderate), then outlierness is (intermediate)
3. if (distance is short) and (grouping is spread), then outlierness is (intermediate)
4. if (distance is medium) and (grouping is grouped), then outlierness is (low)
5. if (distance is medium) and (grouping is moderate), then outlierness is (intermediate)
6. if (distance is medium) and (grouping is spread), then outlierness is (high)
7. if (distance is long) and (grouping is grouped), then outlierness is (intermediate)
8. if (distance is long) and (grouping is moderate), then outlierness is (intermediate)
9. if (distance is long) and (grouping is spread), then outlierness is (high)
10. if (distance is very long), then outlierness is (high)
11. if (grouping is spread), then outlierness is (high)
This degree of outlier can be conveniently translated into a relative weight that allows a fuzzy outlier to still contribute, to a certain extent, to any subsequent processes.
2.3. Simulation Example: Fed-batch Penicillin Cultivation Process
Birol et al. (2002) developed simulation software based on a detailed unstructured model for penicillin production in a fed-batch fermenter. The model extends the mechanistic model of Bajpai and Reuß (1980) by adding input variables such as pH, temperature, aeration rate, agitation power, and feed flow rate of substrate, and by introducing the CO2 evolution term. The model is used here to generate process data for testing the fuzzy treatment method. A batch was simulated with an integration step size of 0.02 h and a sampling interval of 0.5 h. In the simulation, an initial batch culture is followed by a fed-batch operation, depending on the depletion of the carbon source. The process switches to the fed-batch mode of operation when the glucose concentration reaches 0.3 g/l. Fermenter temperature and pH have a strong influence on product quality. They are controlled by PID controllers whose settings in the simulation software were left at their default values for normal operation. The main variables extracted for fuzzy treatment were pH and fermenter temperature. Five artificial outlier points based on standard deviations were inserted into the simulated data:
1. Outlier [0.615σ_T, −2σ_pH] in the transient region of pH @ 20
2. Outlier [3σ_T, 3σ_pH] in temperature and pH @ 200
3. Outlier [3σ_T, −0.3925σ_pH] in temperature @ 400
4. Outlier [−0.217σ_T, −3σ_pH] in pH @ 600
5. Outlier [6σ_T, 6σ_pH] in temperature and pH @ 800
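How the artificial points were constructed is not specified beyond their being based on standard deviations. As a minimal sketch, the following assumes that an outlier is placed at the series mean plus a multiple of the sample standard deviation of the whole series. The function name `inject_outliers` and the toy pH values are hypothetical; they are not part of the simulation software.

```python
import statistics

def inject_outliers(series, multipliers):
    """Return a copy of `series` with selected samples replaced by outliers.

    ASSUMPTION (not stated in the text): the outlier at index i is set to
    mean + k * sigma, with mean and sample standard deviation computed over
    the whole series. `multipliers` maps index -> k.
    """
    mean = statistics.mean(series)
    sigma = statistics.stdev(series)
    corrupted = list(series)  # leave the input untouched
    for index, k in multipliers.items():
        corrupted[index] = mean + k * sigma
    return corrupted

# Toy pH-like series (hypothetical values, not simulator output).
ph = [5.00, 5.02, 4.98, 5.01, 4.99, 5.00, 5.03, 4.97]
# Place a -3 sigma point at index 3, analogous to the -3 sigma_pH outlier.
ph_with_outlier = inject_outliers(ph, {3: -3.0})
```

Under a different convention, for example an offset added to the simulated value at that index, only the assignment inside the loop would change.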
Figure 2: Fuzzy treatment method
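To illustrate how the rule base of Section 2.2 combines the two inputs, the sketch below performs only the rule-evaluation step of a Mamdani-style inference. It assumes the minimum operator for "and" and the maximum operator for aggregating rules with the same consequent; these are common Mamdani choices, but whether the 'frbs' model was configured this way is an assumption. Fuzzification and defuzzification are omitted, and the input membership degrees in the example are hypothetical.

```python
# Rule base of Section 2.2: (distance label, grouping label, outlierness label).
# None marks an input that the rule does not test (rules 10 and 11).
RULES = [
    ("short", "grouped", "low"),
    ("short", "moderate", "intermediate"),
    ("short", "spread", "intermediate"),
    ("medium", "grouped", "low"),
    ("medium", "moderate", "intermediate"),
    ("medium", "spread", "high"),
    ("long", "grouped", "intermediate"),
    ("long", "moderate", "intermediate"),
    ("long", "spread", "high"),
    ("very long", None, "high"),
    (None, "spread", "high"),
]

def fire_rules(distance_degrees, grouping_degrees):
    """Return the activation level of each outlierness label.

    ASSUMPTION: fuzzy AND is min, aggregation over rules is max.
    """
    activation = {"low": 0.0, "intermediate": 0.0, "high": 0.0}
    for distance, grouping, outlierness in RULES:
        strengths = []
        if distance is not None:
            strengths.append(distance_degrees.get(distance, 0.0))
        if grouping is not None:
            strengths.append(grouping_degrees.get(grouping, 0.0))
        strength = min(strengths)  # firing strength of this rule
        activation[outlierness] = max(activation[outlierness], strength)
    return activation

# Hypothetical membership degrees for a single data point.
example = fire_rules(
    {"short": 0.0, "medium": 0.3, "long": 0.7, "very long": 0.0},
    {"grouped": 0.2, "moderate": 0.8, "spread": 0.0},
)
# example == {"low": 0.2, "intermediate": 0.7, "high": 0.0}
```

A defuzzification step applied to the activated output sets would then yield the crisp degree of outlier; it is left out here because the shapes of the membership functions are not given above.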
3. Results and discussion
As previously mentioned, the elbow method was used to determine the number of clusters for the fuzzy c-means clustering. The number of clusters selected is 10. Additionally, a corresponding
threshold value of δ = 0.2 was chosen for the clustering. The group assignment and outlierness value for each artificial outlier point are presented in Table 1. The fuzzy treatment method is able to evaluate the artificial outlier points and assign them an appropriately high outlierness. It is also observed, as seen in Table 2, that the method assigns high outlierness to extreme values, such as those found in the transient region (approx. 0 ≤ index ≤ 150). This is expected because the two basic input criteria used to evaluate the outlierness of a data point, distance and fuzzy c-means membership values, are sensitive to extreme and/or isolated values. In particular, it is interesting to note that a low fuzzy c-means membership value, or a membership distribution spread evenly between the groups, signifies that a data point is relatively isolated from the data clusters. The transient region, where large and rapid fluctuations occur, remains a challenge for the fuzzy treatment method. It is speculated that changing or increasing the number of input parameters will improve performance in the transient region. Integration of pattern-based outlier detection methods, such as PLS or neural networks, may also improve performance in the transient region, although this is subject to further study.
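The group-assignment conditions of Section 2.1, with δ = 0.2 and the membership floor of 0.2, can be sketched as follows. The function name `assign_groups` and the example membership vectors are hypothetical, and cluster indices are 0-based here.

```python
def assign_groups(memberships, delta=0.2, low=0.2):
    """Return the sorted cluster indices (0-based) a data point is placed in.

    memberships: fuzzy c-means membership values of one data point,
                 one value per cluster.
    delta:       threshold on the gap between the two highest memberships.
    low:         if the highest membership is below this value, the point
                 is placed in the three highest-membership groups.
    """
    # Cluster indices ordered from highest to lowest membership.
    order = sorted(range(len(memberships)),
                   key=lambda k: memberships[k], reverse=True)
    groups = [order[0]]  # condition 1: highest membership
    if len(order) > 1 and memberships[order[0]] - memberships[order[1]] < delta:
        groups.append(order[1])  # condition 2: close runner-up
    if memberships[order[0]] < low:
        groups = order[:3]  # condition 3: weak, spread-out membership
    return sorted(groups)

# Hypothetical membership vectors.
clear_member = assign_groups([0.7, 0.2, 0.1])       # one group
borderline = assign_groups([0.45, 0.40, 0.15])      # two groups
isolated = assign_groups([0.15, 0.14, 0.13, 0.12,
                          0.12, 0.12, 0.11, 0.11])  # three groups
```

Under these conditions a point can only appear in three groups through the third condition, that is, when even its highest membership value is below 0.2.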
Table 1: Group assignment and outlierness of artificial outliers
Data Point Index   Variable with Artificial Outlier   Group Assignment   Outlierness
20                 pH                                 3, 5, 7            0.583, 0.583, 0.583
200                Temp. and pH                       8, 9, 10           0.75, 0.75, 0.75
400                Temp.                              5, 7               0.75, 0.75
600                pH                                 1, 2, 3            0.75, 0.75, 0.75
800                Temp. and pH                       8, 9, 10           0.75, 0.75, 0.75

Table 2: Group assignment and outlierness of selected data points
Data Point Index   Group Assignment   Outlierness         Remarks
70                 3, 5, 7            0.75, 0.75, 0.75    Transient Region
100                7, 8               0.583, 0.75         Transient Region
130                1, 2               0.516, 0.456
300                1                  0.25
700                7                  0.25
A summary and comparison of the fuzzy treatment method can be seen in Table 3. It is compared to the multivariate outlier detection method proposed by Filzmoser et al. (2005) and the Local Outlier Factor (LOF) proposed by Breunig et al. (2000). All three methods yield similar results and have difficulties in the transient region of the data, resulting in false positives or in an inability to distinguish the artificial outlier from extreme data points. It is nevertheless possible to integrate the outlier detection methods proposed by Filzmoser et al. (2005) and Breunig et al. (2000) into the fuzzy treatment methodology.
Table 3: Summary and comparison of fuzzy outlier treatment method
Data Point Index   Fuzzy Treatment Method (Outlierness)   Robust Mahalanobis distance based on the adjusted MCD estimator   Local Outlier Factor
20                 0.5833                                 Detected                                                          1.02261 (rank 595/1005)
200                0.75                                   Detected                                                          1.890222 (rank 43/1005)
400                0.75                                   Detected                                                          1.822154 (rank 49/1005)
600                0.75                                   Detected                                                          1.425353 (rank 107/1005)
800                0.75                                   Detected                                                          2.619494 (rank 21/1005)
4. Conclusion
The presence of suspected outlier values in process data has been a long-standing problem. The difficulty of the problem is mainly due to the subjective nature of outlier detection. A single criterion for distinguishing different types of outliers is generally insufficient and does not address this subjectivity. To overcome this issue, a novel fuzzy logic approach has been proposed and a system has been implemented. The system performance has been manually tuned; optimization of the input parameters, along with investigation into automatic tuning (self-learning), is envisaged for further development.
The fuzzy treatment method uses fuzzy c-means clustering to separate multivariate data into clusters. Data points can belong to more than a single cluster due to the nature of fuzzy clustering. The concept of a fuzzy outlier is then adapted from D'Errico and Murru (2012), with modifications to the FIS input parameters and rule base. Using the individual cluster standard deviation and the fuzzy c-means membership values as inputs, the outlierness of a data point is computed by a 2-input/1-output fuzzy inference system. The overall fuzzy treatment methodology is a generalized approach and can be modified to suit the application. It provides a flexible and highly subjective framework for outlier detection. From the results of the research conducted so far, the following conclusions can be drawn:
1. The fuzzy treatment method provides a subjective framework for integrating various outlier detection parameters.
2. The fuzzy c-means clustering membership values are correlated with extreme and isolated values. These values may potentially be outliers.
References
Bajpai, R. K., Reuß, M., 1980. A mechanistic model for penicillin production. Journal of Chemical Technology and Biotechnology 30 (1), 332–344.
Birol, G., Ündey, C., Çinar, A., 2002. A modular simulation package for fed-batch fermentation: penicillin production. Computers & Chemical Engineering 26 (11), 1553–1565.
Breunig, M. M., Kriegel, H., Ng, R. T., Sander, J., 2000. LOF: identifying density-based local outliers. In: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, May 16-18, 2000, Dallas, Texas, USA. pp. 93–104.
D'Errico, G. E., Murru, N., 2012. Fuzzy treatment of candidate outliers in measurements. Adv. Fuzzy Systems 2012.
Dixon, W. J., Dec. 1950. Analysis of extreme values. Ann. Math. Statist. 21 (4), 488–506.
Everitt, B., Hothorn, T., 2006. A Handbook of Statistical Analyses Using R. Chapman and Hall.
Ferguson, T. S., 1961. On the rejection of outliers. In: Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics. University of California Press, Berkeley, Calif., pp. 253–287.
Filzmoser, P., Garrett, R. G., Reimann, C., Jun. 2005. Multivariate outlier detection in exploration geochemistry. Comput. Geosci. 31 (5), 579–587.
Grubbs, F. E., 1950. Sample criteria for testing outlying observations. Ann. Math. Statist. 21, 27–58.
Looney, C., 1999. A fuzzy clustering and fuzzy merging algorithm. Tech. rep., Technical Report CS-UNR-101-1999.
Quesenberry, C. P., David, H. A., Dec. 1961. Some tests for outliers. Biometrika 48 (3/4), 379–390.