A Kernel Connectivity-based Outlier Factor Algorithm for Rare Data Detection in a Baking Process⁎



Proceedings, 10th IFAC International Symposium on Advanced Control of Chemical Processes
Shenyang, Liaoning, China, July 25-27, 2018

Available online at www.sciencedirect.com

ScienceDirect

IFAC PapersOnLine 51-18 (2018) 297–302

Yanxia Wang ∗, Kang Li ∗∗, Shaojun Gan ∗∗∗

∗ School of Electronics, Electrical Engineering and Computer Science, Queens University Belfast, Belfast, BT9 5AH, U.K. (e-mail: [email protected]).
∗∗ School of Electronics, Electrical Engineering and Computer Science, Queens University Belfast, Belfast, BT9 5AH, U.K. (e-mail: [email protected])
∗∗∗ School of Electronics, Electrical Engineering and Computer Science, Queens University Belfast, Belfast, BT9 5AH, U.K. (e-mail: [email protected])

⁎ The work presented in this paper is funded by EPSRC under grant EP/P004636/1.

Abstract: Due to strict legislation on greenhouse gas emission reduction, energy-intensive industries, including the bakery industry, are all under pressure to improve the energy efficiency of their manufacturing processes. In this paper, an energy monitoring system developed through the Point Energy Technology from the research group is first introduced for data collection in a local bakery company. The outliers in the collected data may include valuable information about the status of machines; however, they also affect the data quality and the accuracy of the subsequent data analysis. This paper discusses two algorithms for outlier detection, connectivity-based outlier factor (COF) and local outlier factor (LOF). For COF, the concept of a connectivity-based outlier factor is adopted to identify whether an object is an outlier. For LOF, the local outlier factor, based on a notion of local density, represents the degree to which an object is an outlier. Experiments are conducted on the data set from the oven in a production line to evaluate the effectiveness of three kernel functions, namely the Gaussian kernel, the Laplacian kernel and the polynomial kernel. The experimental results show that the Gaussian-COF and the Laplacian-COF are more effective for valid oven data detection, which is significant for further research work on energy management in the bakery company.

Keywords: Energy monitoring system, COF, LOF, kernel functions.

2405-8963 © 2018, IFAC (International Federation of Automatic Control) Hosting by Elsevier Ltd. All rights reserved. Peer review under responsibility of International Federation of Automatic Control. 10.1016/j.ifacol.2018.09.316

1. INTRODUCTION

The UK government has committed to reducing its greenhouse gas (GHG) emissions by 80% by 2050 compared to the 1990 level. Consuming 16% of total energy per year, the manufacturing industries are treating energy consumption optimization as a priority to meet the GHG reduction target [1]. The bakery industry, which produces fresh and frozen bread, cakes and other pastries to meet people's daily dietary demand, consumes a lot of energy from gas and electricity. It is therefore of significant importance to improve the energy efficiency of the baking processes.

Modern bakeries are often equipped with automatic production lines [2-3]. Most bakery products share a similar core manufacturing procedure based on flour, water, and yeast. Minor ingredients such as fruits and nuts are used to increase the diversity and abundance of bread. Fig. 1 illustrates a generic production process for bread manufacturing. There are several main processes: mixing, dividing, proofing, baking, cooling, and slicing/packaging. The first stage is to incorporate the flour, water, and other ingredients in a big mixer and then knead the dough with a motor. Once the dough is properly prepared, it is cut to the expected size. The divided pieces are then sent to a prover, where they rest for a period of time before being sent into the oven. The length of proofing varies with the size and type of dough. For baking, the doughs are sent into the oven by a conveyor belt. Once the bread is baked, it is removed automatically from the pans by a depanner. The bread needs time to cool so that the moisture and carbon dioxide inside can dissipate. The last step is to slice the loaves with a machine and package them for convenience.

Fig. 1. Flow chart of bread production process (stages shown: Flour, Mixer, Divider, Prover, Oven, Depanner, Cooler, Slicer/Flow wrap)

To improve the energy efficiency, the energy consumption of the manufacturing processes needs to be known correctly and precisely. However, there are always outliers in a data set, and it is necessary to investigate whether these outliers are caused by production error, measurement error or data recording error. In addition, the outliers may affect data quality and thus the quality of the follow-up data analysis procedure. Therefore, valid data detection with respect to the outliers is an important stage in energy management of the bakery industry.

As described in [4], an outlier is an object behaving differently from the expected ones. There are a number of approaches for detecting outliers in order to acquire the embedded information and minimize their influence on the subsequent statistical data analysis [5-6]. A distance-based method identifies an object as an outlier when the area within a given radius around it does not include enough neighbours [7-8]. The drawback of this method is that it only considers the distance to the neighbours and ignores the information of closer objects [9]. A density-based approach provides a density measurement to identify whether a data point is an outlier or not. This algorithm can work well on data sets with unbalanced density regions. A connectivity-based scheme is proposed in [10-11], which considers an object and other connected objects through the connectivity-based outlier factor (COF). The COF algorithm cannot discern the difference between objects in small clusters that include few objects. The approaches based on density can handle data sets with different density areas, where the local outlier factor (LOF) is used to identify the degree to which an object is an outlier [12-13]. LOF does not perform well when outliers lie in areas with variously distributed densities [14]. Kernel functions map the initial data into a high-dimensional feature space, which can better reflect the difference between objects [15].

This paper first introduces the point energy monitoring system developed by the research team (www.pointenergy.org), which is used by different industrial partners, including a local bakery company, to collect the voltage, current, power, power factor and frequency data from the baking process, one of the core production phases. Then we discuss data detection for the baking process with two algorithms, the kernel connectivity-based outlier factor algorithm and the kernel local outlier factor algorithm. Three kernel functions, namely the Gaussian function, the Laplacian function and the polynomial function, are used to assess the performance of the algorithms.
The rest of this paper is organized as follows. The related work is introduced in Section 2. Section 3 describes the data set and the experimental procedure in detail, while Section 4 presents the results and discussion. Finally, Section 5 concludes this paper.

2. RELATED WORK

2.1 Point energy technology for energy monitoring

This work is carried out with a local bakery company which is eager to know how much energy it uses daily and, more specifically, how much energy is consumed by each production line or even by each manufacturing process. Thus, an energy monitoring system, shown in Fig. 2, has been developed through the Point Energy Technology initiated by the research group (www.pointenergy.org). The system mainly contains two parts, an energy data acquisition part and an energy data analysis part. The energy data acquisition part is designed to be installed on site to obtain the energy usage details along the whole production line at a component level. For energy consumed from electricity, current transformers and intelligent power meters are deployed to measure the current and voltage of the production process. The active and reactive power are also calculated from the acquired signals. The power factor is regarded as the indicator of energy efficiency in industry and is directly related to the electricity tariff: a low power factor not only leads to a higher electricity tariff but also to a potentially large penalty. All these energy data are sent directly to a cloud server using Raspberry Pi boards, which are single-board computers. The cloud server bridges the on-site data acquisition and the remote data analysis; remote servers can access the energy data stored on the cloud server through the MQTT protocol. Further, a web server is developed to display the real-time energy consumption, and a data server is built to store the energy data locally in preparation for more detailed data analysis.

Fig. 2. Energy monitoring system (blocks shown: Sensor, Raspberry Pi, Wireless module, Wireless router, Cloud Server, Local Server, Web Server, Data Server; stages: Energy Data Acquisition, Data Analysis)

2.2 Kernel connectivity-based outlier factor algorithm

In this section, the kernel connectivity-based outlier factor (COF) algorithm is proposed. The core idea of this algorithm is to assign to each object a degree of being an outlier, which is called the connectivity-based outlier factor. The kernel COF algorithm can be formulated by the following steps:

(1) Map the initial data into a new feature space using a kernel function. The following steps are conducted in the new feature space.

(2) For each object $x$, find its $k$ nearest neighbours. The set with point $x$ and those neighbours is named $N_k(x)$.

(3) Define the set-based nearest (SBN) trail from data point $x$ as a sequence $\{x_1, \cdots, x_k\}$ such that, for all $1 \le i \le k-1$, $x_{i+1}$ is the nearest neighbour of the set $\{x_1, \cdots, x_i\}$ in the set $\{x_{i+1}, \cdots, x_k\}$.

(4) Let $e = \{e_1, \cdots, e_k\}$ be the sequence of edges relating to the SBN path, which constitutes the consecutive nearest neighbours from point $x$ in the set $N_k(x)$. Each $e_i$ is an edge and $\mathrm{dist}(e_i)$ is the distance between the two sets that the edge connects.

(5) Calculate the average chaining distance from $x$ to $N_k(x) - \{x\}$, denoted by $\mathrm{dist}_{N_k}(x)$ and defined as:

\[ \mathrm{dist}_{N_k}(x) = \sum_{i=1}^{k} \frac{2(k+1-i)}{k(k+1)}\, \mathrm{dist}(e_i) \tag{1} \]

$\mathrm{dist}_{N_k}(x)$ can be viewed as the weighted distance in the cost description for the SBN path from point $x$.

(6) Compute the connectivity-based outlier factor (COF) at the data point $x$ from its $k$ nearest neighbours using the following equation:

\[ \mathrm{COF}(x) = \frac{\mathrm{dist}_{N_k}(x)}{\dfrac{1}{k} \sum_{o \in N_k(x)} \mathrm{dist}_{N_k}(o)} \tag{2} \]

The COF of object $x$ is the ratio of the average chaining distance from $x$ to $N_k(x)$ to the mean average chaining distance of its neighbours. It follows that the chance of an object being an outlier increases as its COF increases.

2.3 Kernel local outlier factor algorithm

In this section, the kernel local outlier factor (LOF) algorithm is described. For this algorithm, the LOF of an object $p$ is the average of the ratios of the local reach-ability densities of $p$'s $k$-nearest neighbours to that of $p$. The kernel LOF algorithm can be described as follows:

(1) Map the data into a feature space of higher dimensions.

(2) For each object $p$, calculate the distance between $p$ and its $k$-th nearest neighbour, $\mathrm{dist}_k(p)$.

(3) Search all the objects in the $k$-distance neighbourhood of the object $p$, $N_k(p)$:

\[ N_k(p) = \{\, p' \mid \mathrm{dist}(p, p') \le \mathrm{dist}_k(p) \,\} \tag{3} \]

(4) Calculate the local reach-ability density of the object $p$:

\[ \mathrm{lrd}_k(p) = \frac{\|N_k(p)\|}{\sum_{p' \in N_k(p)} \mathrm{reachdist}_k(p' \leftarrow p)} \tag{4} \]

\[ \mathrm{reachdist}_k(p' \leftarrow p) = \max\{\mathrm{dist}_k(p'),\ \mathrm{dist}(p, p')\} \]

where $\|N_k(p)\|$ is the number of objects in $N_k(p)$.

(5) For every object $p$, calculate the local outlier factor (LOF):

\[ \mathrm{LOF}(p) = \frac{\sum_{p' \in N_k(p)} \dfrac{\mathrm{lrd}_k(p')}{\mathrm{lrd}_k(p)}}{\|N_k(p)\|} \tag{5} \]

(6) Sort the $\mathrm{LOF}(p)$ values of all objects. The larger the value of LOF, the higher the degree to which the object is an outlier.

In this paper, three kernel functions are used in the kernel COF and kernel LOF algorithms: the Gaussian kernel, the polynomial kernel and the Laplacian kernel.

The Gaussian function is a radial basis kernel, where $\alpha$ represents the width parameter.

\[ K(x, y) = \exp\left(-\|x - y\|^2 / \alpha^2\right), \quad \alpha > 0 \tag{6} \]

The polynomial function is a nonstationary kernel, where $d$ is the order of the polynomial.

\[ K(x, y) = (x \cdot y + 1)^d, \quad d > 0 \tag{7} \]

The Laplacian function is a radial basis kernel as well, with $\beta$ being the width parameter.

\[ K(x, y) = \exp\left(-\beta \|x - y\|\right), \quad \beta > 0 \tag{8} \]
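Steps (1)-(6) of the kernel COF algorithm in Section 2.2 can be illustrated with a short Python sketch. This is only a minimal illustration under stated assumptions, not the authors' MATLAB implementation. It assumes that the mapping of step (1) is realised through the kernel trick, so that the feature-space distance is $\sqrt{K(x,x) - 2K(x,y) + K(y,y)}$, that the Gaussian kernel of Eq. (6) is used, and that the average in Eq. (2) runs over the $k$ neighbours of $x$ excluding $x$ itself. The function names and the toy data are hypothetical.

```python
import numpy as np

def gaussian_feature_distances(X, alpha):
    # Assumption: kernel trick with the Gaussian kernel, where K(x, x) = 1,
    # so the squared feature-space distance is 2 - 2 K(x, y).
    diff = X[:, None, :] - X[None, :, :]
    K = np.exp(-np.sum(diff ** 2, axis=2) / alpha ** 2)
    return np.sqrt(np.maximum(2.0 - 2.0 * K, 0.0))

def knn_indices(D, x, k):
    # Step (2): indices of the k nearest neighbours of x (x itself excluded).
    order = np.argsort(D[x], kind="stable")
    return [int(j) for j in order if j != x][:k]

def average_chaining_distance(D, x, k):
    # Steps (3)-(5): grow the SBN path from x, always adding the remaining
    # neighbour that is closest to the set of points already on the path.
    path, remaining, edge_costs = [x], knn_indices(D, x, k), []
    while remaining:
        sub = D[np.ix_(path, remaining)]
        j = int(np.argmin(sub.min(axis=0)))
        edge_costs.append(float(sub[:, j].min()))   # dist(e_i)
        path.append(remaining.pop(j))
    # Weights of Eq. (1); earlier edges count more and the weights sum to 1.
    weights = [2.0 * (k + 1 - i) / (k * (k + 1)) for i in range(1, k + 1)]
    return float(np.dot(weights, edge_costs))

def cof_scores(D, k):
    # Step (6), Eq. (2): own chaining distance over the neighbours' mean.
    n = D.shape[0]
    ac = np.array([average_chaining_distance(D, x, k) for x in range(n)])
    return np.array([ac[x] / np.mean(ac[knn_indices(D, x, k)]) for x in range(n)])

# Toy data (hypothetical): a 5 x 5 grid of "good" points plus one distant point.
X = np.array([[0.1 * i, 0.1 * j] for i in range(5) for j in range(5)] + [[3.0, 3.0]])
scores = cof_scores(gaussian_feature_distances(X, alpha=1.0), k=3)
print(int(np.argmax(scores)))  # index of the object with the largest COF
```

Working on a precomputed distance matrix keeps the COF routine independent of the kernel, so the same functions could be reused with a distance matrix derived from the polynomial or Laplacian kernel.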

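The three kernels of Eqs. (6)-(8) and the kernel LOF steps of Section 2.3 can be sketched in the same way. Again this is a hedged illustration rather than the implementation used in the experiments: the feature-space distance via the kernel trick is an assumption, the reachability distance follows the standard LOF definition, and all function names and the toy data are hypothetical.

```python
import numpy as np

def _pairwise_norm(X, Y):
    # Euclidean distance between every row of X and every row of Y.
    diff = X[:, None, :] - Y[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=2))

def gaussian_kernel(X, Y, alpha):
    # Eq. (6): K(x, y) = exp(-||x - y||^2 / alpha^2)
    return np.exp(-_pairwise_norm(X, Y) ** 2 / alpha ** 2)

def polynomial_kernel(X, Y, d):
    # Eq. (7): K(x, y) = (x . y + 1)^d
    return (X @ Y.T + 1.0) ** d

def laplacian_kernel(X, Y, beta):
    # Eq. (8): K(x, y) = exp(-beta * ||x - y||)
    return np.exp(-beta * _pairwise_norm(X, Y))

def kernel_distance_matrix(X, kernel, param):
    # Assumption: feature-space distance through the kernel trick,
    # ||phi(x) - phi(y)||^2 = K(x, x) - 2 K(x, y) + K(y, y).
    K = kernel(X, X, param)
    diag = np.diag(K)
    sq = (diag[:, None] - 2.0 * K) + diag[None, :]
    return np.sqrt(np.maximum(sq, 0.0))

def lof_scores(D, k):
    D = D.astype(float).copy()
    np.fill_diagonal(D, np.inf)              # a point is not its own neighbour
    k_dist = np.sort(D, axis=1)[:, k - 1]    # step (2): dist_k(p)
    # Step (3), Eq. (3): the k-distance neighbourhood of every object.
    neigh = [np.flatnonzero(D[p] <= k_dist[p]) for p in range(len(D))]
    lrd = np.empty(len(D))
    for p, idx in enumerate(neigh):
        reach = np.maximum(k_dist[idx], D[p, idx])   # reachability distances
        lrd[p] = len(idx) / reach.sum()              # step (4), Eq. (4)
    # Step (5), Eq. (5): mean ratio of the neighbours' density to the own one.
    return np.array([np.mean(lrd[idx] / lrd[p]) for p, idx in enumerate(neigh)])

# Toy data (hypothetical): a 5 x 5 grid of "good" points plus one distant point.
X = np.array([[0.1 * i, 0.1 * j] for i in range(5) for j in range(5)] + [[3.0, 3.0]])
D = kernel_distance_matrix(X, gaussian_kernel, 1.0)
scores = lof_scores(D, k=3)
print(int(np.argmax(scores)))  # index of the object with the largest LOF
```

Sorting `scores` in descending order gives the ranking of step (6). Note that duplicate objects would produce zero reachability distances, which this sketch does not guard against.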
2.4 Evaluation criteria

In order to assess the performance of the two kernel algorithms, the precision, the recall and the rank power are used [16-17]. It is assumed that a data set $D = D_0 + D_n$, where $D_0$ denotes the set of all outliers and $D_n$ represents the set of non-outliers. $O_m$ is the set of outliers among the objects ranked in the top $m$ positions returned by the data detection algorithm, where $m \ge 1$. Let $|O_m|$ be the number of data objects in $O_m$, and $|D_0|$ be the number of data objects in $D_0$.

Precision represents the fraction of outliers among the top $m$ ranked samples returned by the algorithm and is defined as:

\[ \mathrm{Precision} = \frac{|O_m|}{m} \tag{9} \]

Recall is the ratio between the detected outliers and all outliers in the data set, which can be defined as:

\[ \mathrm{Recall} = \frac{|O_m|}{|D_0|} \tag{10} \]

To account for the positions of the detected outliers in the ranking, a rank power is introduced. It is assumed that $m$ samples at positions 1 to $m$ are returned by the data detection method, and that there are $N_r$ outliers among these $m$ objects. For $1 \le i \le N_r$, let $P_i$ denote the position of the $i$-th outlier; the rank power is then defined as:

\[ \mathrm{RankPower} = \frac{N_r (N_r + 1)}{2 \sum_{i=1}^{N_r} P_i} \tag{11} \]

It can be easily inferred that the rank power value is between 0 and 1, with 1 representing the best and 0 the worst performance. Therefore, the efficiency of a data detection algorithm can be judged by the above three measures. The performance is better as the values of precision and recall become larger. For the same precision and recall, the rank power should be used to judge the efficiency, and a larger value means a better efficiency.

3. EXPERIMENTAL SETUP

3.1 Data set

Since a large portion of the electricity in the bakery company, about 30-35%, is used by the oven to bake the bread, we select the initial energy usage data of the baking process from 00:00 to 24:00 on 02/02/2017. The following features are monitored at a 5-minute interval across all three phases: voltage, current, power, power factor, and frequency. The experiments are performed on the oven data set, which has 286 samples with 11 attributes and two classes, good and bad. According to the methods used in [18], bad data samples are randomly generated to acquire an unbalanced distribution. As shown in Table 1, the oven data set has 277 objects labelled as good and 9 objects labelled as bad.

Table 1. Class distribution of oven data

Data set  Case    Class label  Percentage of objects
Oven      Common  good         96.85
          Rare    bad          3.15

3.2 Experimental procedure

To compare the effectiveness of the COF and LOF algorithms, the experiment is conducted on the oven data set. Following the value used in reference [19], $k$ equals 5% of the number of all objects in the data set. Let $N_r$ be the number of rare objects detected. Moreover, $m$ represents the number of top-ranked objects returned by the method. For these two algorithms, the parameters of the Gaussian, polynomial, and Laplacian kernel functions are denoted by $\alpha$, $d$, and $\beta$ respectively. For the Gaussian function, the parameter is selected in the range [0.1, 3] with an interval of 0.1; for the polynomial function, the range of $d$ is [1, 30]; and for the Laplacian function, $\beta$ is defined from 0.01 to 0.05 with an interval of 0.005. The experiments are implemented in MATLAB 2017, and the computing environment is Windows 10 Education, version 1703, for x64-based systems.

4. RESULTS AND DISCUSSION

As shown in Fig. 3, experiments are conducted to select the kernel parameters for kernel COF and kernel LOF. Precision and recall are considered first; then, among the parameter values giving the maximum precision and recall, the kernel parameters with the best rank power are chosen. For the kernel COF algorithm, $\alpha$ is selected as 0.3, $d$ as 29, and $\beta$ as 0.16. For the kernel LOF algorithm, $\alpha$ is selected as 0.4, $d$ as 25, and $\beta$ as 0.04.

The experimental results of kernel COF and kernel LOF are listed in Table 2 and Table 3 respectively. As shown in Table 2, the Gaussian-COF and the Laplacian-COF detect more rare objects than the polynomial-COF for $m$ from 10 to 40, and their rank powers are generally higher as well. For all the kernel COF algorithms, as $m$ increases from 10 to 30, the number of rare objects detected increases. For example, when the top 30 ranked objects are returned by the algorithms, the Gaussian-COF identifies five records in the rare class and the Laplacian-COF detects six bad objects, while only four objects in the bad class are identified by the polynomial-COF. According to precision and recall, the Gaussian-COF and the Laplacian-COF have the same efficiency when the top 35 ranked records are returned.
However, the rank power of the Gaussian-COF is 0.28, which is 17.6% lower than that of the Laplacian-COF (0.34). For the polynomial-COF, the number of bad records detected reaches its maximum of four when the top 25 ranked objects are returned, and the corresponding recall value is 0.44, much lower than those of the other two algorithms. When $m$ varies from 25 to 40, the number of bad objects detected by the polynomial-COF remains unchanged.

For the kernel LOF algorithms, when $m$ is 10 and 15, the number of rare objects detected by the Laplacian-LOF is 4, larger than that of the Gaussian-LOF and the polynomial-LOF, while the Gaussian-LOF detects more bad objects than the other two algorithms when the top 20 ranked objects are returned. When $m$ increases from 25 to 40, the same number of rare objects is detected by all kernel LOF algorithms. For the Gaussian-LOF, the rank power is larger than that of the polynomial-LOF, but smaller than that of the Laplacian-LOF. Therefore, the relative effectiveness of the LOFs with different kernels changes with the value of $m$. However, the maximum number of bad objects detected by the kernel LOFs is 5, smaller than the maximum of 6 detected by the kernel COFs (Gaussian-COF and Laplacian-COF). Consequently, the experimental results show that the Gaussian and Laplacian COFs perform best on the oven data set.
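The evaluation criteria of Section 2.4 and some of the figures quoted above can be cross-checked with a few lines of Python. The sketch below is only illustrative: the function names are hypothetical, returning 0 for an empty list of positions is an assumption (Eq. (11) is undefined in that case), and the ranked positions passed to `rank_power` are made-up examples because the paper does not list the actual positions.

```python
def precision(n_detected, m):
    # Eq. (9): detected outliers among the top m ranked objects.
    return n_detected / m

def recall(n_detected, n_outliers):
    # Eq. (10): detected outliers over all outliers in the data set.
    return n_detected / n_outliers

def rank_power(positions):
    # Eq. (11): positions are the 1-based ranks P_i of the detected outliers.
    n_r = len(positions)
    if n_r == 0:
        return 0.0  # assumption: no detected outlier gives the worst score
    return n_r * (n_r + 1) / (2.0 * sum(positions))

# Gaussian-COF at m = 35 (Table 2): 6 of the 9 bad objects are detected.
print(round(precision(6, 35), 2), round(recall(6, 9), 2))

# Relative gap between the rank powers 0.28 (Gaussian-COF) and 0.34 (Laplacian-COF).
print(round((0.34 - 0.28) / 0.34 * 100, 1))

# Hypothetical positions: outliers ranked 1, 2, 3 give the best possible value.
print(rank_power([1, 2, 3]), rank_power([2, 4, 6]))
```

The first two printed lines should agree with the precision and recall reported for the Gaussian-COF at $m = 35$ and with the 17.6% difference stated in the text. A rank power of 1.00 with a single detected outlier, as reported for the polynomial-COF at $m = 10$, corresponds to that outlier being ranked first.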


[Figure 3 panels: (a) Gaussian COF, (b) Polynomial COF, (c) Laplacian COF, (d) Gaussian LOF, (e) Polynomial LOF, (f) Laplacian LOF. Each panel plots performance (precision, recall, and rank power) against the kernel parameter: the width parameter for the Gaussian kernel, the order parameter d for the polynomial kernel, and the scale parameter for the Laplacian kernel.]

Fig. 3. Parameter selection of kernel function

Table 2. Rare class detection by kernel COF (Nr: number of rare objects detected; RP: rank power)

      Gaussian-COF                  Polynomial-COF                Laplacian-COF
m     Nr  Precision  Recall  RP     Nr  Precision  Recall  RP     Nr  Precision  Recall  RP
10    3   0.30       0.33    0.55   1   0.10       0.11    1.00   4   0.40       0.44    0.77
15    4   0.27       0.44    0.45   3   0.20       0.33    0.23   4   0.27       0.44    0.77
20    5   0.25       0.56    0.36   3   0.15       0.33    0.23   4   0.20       0.44    0.77
25    5   0.20       0.56    0.36   4   0.16       0.44    0.20   5   0.20       0.56    0.42
30    5   0.17       0.56    0.36   4   0.13       0.44    0.20   6   0.20       0.67    0.34
35    6   0.17       0.67    0.28   4   0.11       0.44    0.20   6   0.17       0.67    0.34
40    6   0.15       0.67    0.28   4   0.10       0.44    0.20   6   0.15       0.67    0.34

Table 3. Rare class detection by kernel LOF (Nr: number of rare objects detected; RP: rank power)

      Gaussian-LOF                  Polynomial-LOF                Laplacian-LOF
m     Nr  Precision  Recall  RP     Nr  Precision  Recall  RP     Nr  Precision  Recall  RP
10    3   0.30       0.33    0.50   2   0.20       0.22    0.75   4   0.40       0.44    0.83
15    3   0.20       0.33    0.50   3   0.20       0.33    0.32   4   0.27       0.44    0.83
20    5   0.25       0.56    0.31   4   0.20       0.44    0.26   4   0.20       0.44    0.83
25    5   0.20       0.56    0.31   5   0.20       0.56    0.23   5   0.20       0.56    0.45
30    5   0.17       0.56    0.31   5   0.17       0.56    0.23   5   0.17       0.56    0.45
35    5   0.14       0.56    0.31   5   0.14       0.56    0.23   5   0.14       0.56    0.45
40    5   0.13       0.56    0.31   5   0.13       0.56    0.23   5   0.13       0.56    0.45
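The precision and recall columns of Tables 2 and 3 follow directly from Nr and m, and a short Python sketch makes the arithmetic explicit. Two points here are assumptions rather than statements from this section: the total of nine rare objects is inferred from the reported recall values (for example, 3/9 ≈ 0.33 and 6/9 ≈ 0.67), and the rank power is taken to be the normalised form n(n+1) / (2 ΣR_i), where R_i is the rank of the i-th detected rare object. The paper's own definitions appear earlier and are not reproduced here, and the function names are hypothetical.

```python
"""Illustrative evaluation metrics for a top-m outlier ranking (a sketch)."""


def precision_recall(n_detected, m, n_rare):
    """Precision = Nr / m, recall = Nr / (total number of rare objects)."""
    return n_detected / m, n_detected / n_rare


def rank_power(ranks):
    """Assumed normalised rank power: n(n+1) / (2 * sum of ranks).

    `ranks` holds the 1-based positions of the detected rare objects.
    The value is 1.0 when the rare objects occupy the top n positions.
    """
    n = len(ranks)
    if n == 0:
        return 0.0
    return n * (n + 1) / (2 * sum(ranks))


def evaluate_top_m(ranked_labels, m, n_rare):
    """Score the first m entries of a ranking.

    `ranked_labels` is a list of booleans ordered from most to least
    outlying; True marks an object of the rare class.
    Returns (Nr, precision, recall, rank power).
    """
    top = ranked_labels[:m]
    ranks = [i + 1 for i, is_rare in enumerate(top) if is_rare]
    precision, recall = precision_recall(len(ranks), m, n_rare)
    return len(ranks), precision, recall, rank_power(ranks)
```

For instance, `precision_recall(3, 10, 9)` gives 0.30 and 0.33 after rounding to two decimals, matching the Gaussian-COF row for m = 10. The individual ranks of the detected objects are not reported in the tables, so the rank power values can only be checked for consistency with the assumed formula, not reproduced exactly.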

5. CONCLUSION

To improve energy efficiency in the baking industry, an energy monitoring system developed by Point Energy Technology from the research group is first introduced to collect data from the production line. After data acquisition, the kernel connectivity-based outlier factor algorithm and the kernel local outlier factor algorithm are proposed to detect rare objects. In the experiments, an oven data set is used to verify the performance of the kernel COFs and kernel LOFs, and the results show that the Gaussian-COF and Laplacian-COF algorithms are effective for detecting rare data in the oven data set. Once outliers are identified in the data set, prior experience can be used to check whether a production error or a data collection error has occurred, which would guide energy management in bread manufacturing processes. Outlier detection is also useful for improving data quality and the accuracy of follow-up analysis. In future work, incremental approaches (such as clustering, modelling, and optimisation) will be investigated to improve energy management in the bakery company.