MiFI-Outlier: Minimal infrequent itemset-based outlier detection approach on uncertain data stream


Journal Pre-proof MiFI-Outlier: Minimal infrequent itemset-based outlier detection approach on uncertain data stream Saihua Cai, Sicong Li, Gang Yuan, Shangbo Hao, Ruizhi Sun

PII: S0950-7051(19)30571-4
DOI: https://doi.org/10.1016/j.knosys.2019.105268
Reference: KNOSYS 105268

To appear in: Knowledge-Based Systems

Received date: 13 April 2019
Revised date: 20 November 2019
Accepted date: 24 November 2019

Please cite this article as: S. Cai, S. Li, G. Yuan et al., MiFI-Outlier: Minimal infrequent itemset-based outlier detection approach on uncertain data stream, Knowledge-Based Systems (2019), doi: https://doi.org/10.1016/j.knosys.2019.105268.

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

© 2019 Published by Elsevier B.V.


Knowledge-Based Systems journal homepage: www.elsevier.com/locate/knosys


MiFI-Outlier: minimal infrequent itemset-based outlier detection approach on uncertain data stream

Saihua Cai (a), Sicong Li (a), Gang Yuan (a), Shangbo Hao (a), Ruizhi Sun (a,b,*)

(a) College of Information and Electrical Engineering, China Agricultural University, Beijing, 100083, China
(b) Scientific research base for Integrated Technologies of Precision Agriculture (animal husbandry), the Ministry of Agriculture, Beijing 100083, China

ARTICLE INFO

Keywords: Outlier detection; Minimal infrequent itemset mining; Uncertain data stream; Deviation indices; Data mining

ABSTRACT

Many outlier detection approaches have been proposed for static datasets over the past twenty years, and they have achieved good results. In real life, uncertain data stream is increasingly common, but most existing outlier detection approaches are not suitable for the uncertain data stream environment. In addition, many outlier detection approaches do not consider the appearing frequency of each element, so the detected outliers do not coincide with the definition of an outlier. Itemset-based outlier detection approaches provide a good solution to this problem, and they have attracted more attention in recent years. In this paper, a novel two-step minimal infrequent itemset-based outlier detection approach called MiFI-Outlier is proposed to effectively detect the outliers in uncertain data stream. In the itemset mining phase, a matrix-based method called MiFI-UDSM is proposed to mine the minimal infrequent itemsets (MiFIs) from uncertain data stream, and then an improved approach called MiFI-UDSM* is proposed to mine these minimal infrequent itemsets more efficiently using the ideas of “item cap” and “support cap”. In the outlier detection phase, based on the mined MiFIs, three deviation indices, namely the minimal infrequent itemset deviation index (MiFIDI), the similarity deviation index (SDI) and the transaction deviation index (TDI), are defined to measure the deviation degree of each transaction, and then MiFI-Outlier is used to identify the outliers in uncertain data stream. Several experimental studies are conducted on public datasets and synthetic datasets, and the results show that the proposed approaches outperform the compared approaches in both the infrequent itemset mining phase and the outlier detection phase.

1. Introduction

In recent years, the scale of data is increasing faster than ever in various application fields, thus, it is necessary to use more effective technology to analyze and manage these data so as to discover the implicit, previously unknown and potentially useful knowledge. As an important data processing technology, data mining [23] is widely used in traffic data analysis [37], meteorological data analysis [9], mobile data analysis [38] and so on, where frequent itemset mining [21,23] is an important technology in data mining.


As a main form of data, data stream is very common in real life. Compared with static data, data stream [30] is continuous, unbounded and not necessarily uniformly distributed. Because of these features, data stream must be processed faster, and multiple scans of data stream are impractical. Hence, processing data stream is more challenging than processing static datasets. In recent years, window-based technologies (such as sliding window [6,13], damped window [42] and landmark window [41]) have provided excellent mining solutions for incoming data stream, and most of these methods are based on the Apriori algorithm [1] and the FP-Growth algorithm [12], such as HAUPM [42], TWMINSWAP [24], TWMINSWAP-IS [24] and TDMCS [13]. To obtain accurate mining results, the accuracy of the original data stream is particularly important. However, in real life, due to measurement errors, equipment failures, human errors and other reasons, abnormal data (also called outliers, which have two attributes [15]: (1) rarely appearing, and (2) deviating much from most observations) are often present in data stream, and they seriously affect the reliability of data mining, data-based prediction and other operations. Therefore, it is necessary to detect these outliers as early as possible to improve the quality of data stream. Traditional outlier detection methods, such as clustering-based methods [18,19,31], distance-based methods [20,29] and density-based methods [2,33], determine whether the current transactions are outliers by calculating the distances between each subset in the transaction. Thus, when the number of subsets in the transaction is very large (that is, for high-dimensional data), the time cost of these three kinds of outlier detection methods increases exponentially, resulting in the curse of dimensionality. Fortunately, itemset-based outlier detection approaches [4,5,7,14,16,17,25] provide a good solution for this problem, and they divide the entire outlier detection process into an itemset mining phase and an outlier detection phase.

* Corresponding author. E-mail address: [email protected], [email protected]


The remainder of this paper is organized as follows. The related work is presented in Section 2. Some preliminaries and the existing problems are introduced in Section 3. The outlier detection framework, the original minimal infrequent itemset mining approach and an improved minimal infrequent itemset mining approach, and outlier detection approach are presented in Section 4. The empirical studies and experimental analysis are stated in Section 5. The conclusions and future work are discussed in Section 6.

2. Related work

In this section, some related work, including (1) outlier detection approaches and (2) infrequent itemset mining approaches, is reviewed.

2.1. Outlier detection approaches

Outlier detection is a series of processes for discovering the implicit abnormal data from static datasets and data stream. The outlier detection approaches are roughly divided into four categories: (1) clustering-based outlier detection approaches, (2) distance-based outlier detection approaches, (3) density-based outlier detection approaches, and (4) itemset-based outlier detection approaches.


For the entire process of itemset-based outlier detection on uncertain data stream, both the accuracy and the time cost of the outlier detection process are very important. (1) To improve the detection accuracy, more factors that may cause a transaction to be abnormal need to be considered. In addition, because infrequent itemsets are more consistent with the first attribute of outliers (“rarely appearing”), the use of infrequent itemsets in the outlier detection process can improve the detection accuracy. (2) To reduce the time cost of the entire process, both the time cost of the itemset mining stage and the time cost of the outlier detection stage need to be considered. In the itemset mining stage, the large volume of data stream makes the infrequent itemset mining process very expensive in both time cost and memory usage, and this problem is even more acute in the era of big data. Because the minimal infrequent itemsets [11] (the subsets of infrequent itemsets, see Definition 6) are the generators of infrequent itemsets and the number of minimal infrequent itemsets is relatively small, much time and memory can be saved in both the mining process and the detection process if infrequent itemset mining is translated into minimal infrequent itemset mining.

(4) We use several datasets, including a synthetic dataset and several public datasets, to test the detection accuracy and time efficiency of the proposed methods, and the experimental results confirm the effectiveness of the proposed methods.


Most itemset-based outlier detection methods are designed for precise datasets or precise data stream. However, in many cases, users are uncertain about the presence or absence of some items or events (for example, a physician may suspect that a patient has (i) an 80% likelihood of suffering from a flu and (ii) a 60% likelihood of suffering from a cold, regardless of having or not having the flu); that is, uncertain data stream is becoming more common. Compared with precise data stream, each itemset in uncertain data stream is associated with a probability value that represents the possibility that this itemset exists, and the presence of these probabilities makes the itemset-based outlier detection methods designed for precise data stream inapplicable to uncertain data stream. In addition, the existing itemset-based outlier detection methods, such as FIM-UDSOD [14], FindFPOF [16] and OODFP [25], consider only a few of the factors that make transactions abnormal in the design of their deviation indices, so their detection accuracy is not competitive.

MiFI-Outlier for effectively discovering the outliers from uncertain data stream.


For the clustering-based outlier detection approaches, the main idea is to cluster similar patterns into a class, and the elements that deviate from most elements are judged as outliers. COID [31] is a clustering-based iterative outlier detection approach that divides the whole outlier detection process into an initialization phase and an iteration phase. In the initialization phase, the center positions of the clusters and the positions of the abnormal points are discovered to support the iteration phase. In the iteration phase, the clusters and the set of abnormal points are refined gradually by exchanging the worst outliers with the data points on the worst cluster boundary. By adjusting the internal relations of the clusters as well as the relationship between clusters and outliers, the efficiency of outlier detection on multidimensional noisy datasets can be effectively improved. Based on the idea that abnormal clusters are smaller than normal clusters, Huang et al. [18] proposed an outlier detection method called ROCF to detect the implicit outliers. To speed up outlier detection, each point in the dataset is connected with its neighbor points in a constructed neighbor graph, and the outlier judging condition is changed from the parameter n or α to the cluster density of the neighbor graph. Although the clustering-based outlier detection approaches are unsupervised, their accuracy depends heavily on the selected clustering algorithm.

Based on the above ideas, this paper presents a minimal infrequent itemset-based outlier detection approach, namely MiFI-Outlier, to discover the outliers from uncertain data stream. The main contributions of this paper are summarized as follows:


(1) We propose an algorithm called MiFI-UDSM to mine the minimal infrequent itemsets from uncertain data stream (to the best of our knowledge, it is the first algorithm to mine the minimal infrequent itemsets from uncertain data stream). It uses a matrix structure to store the information of each itemset in the uncertain data stream, and then the minimal infrequent itemsets are mined by extending the frequent short itemsets using the “pattern extension” operation.
(2) We introduce the concepts of “item cap” and “support cap” to reduce the scale of potential extensible itemsets, thereby reducing meaningless “pattern extension” operations and support value calculations, and then propose the MiFI-UDSM* method for quickly mining the minimal infrequent itemsets.
(3) We design three deviation indices to measure the deviation degree of each transaction, and then propose a minimal infrequent itemset-based outlier detection method called

Distance-based outlier detection approaches need to calculate the distance of each itemset of the transactions to determine whether the transactions are outliers. In 2016, Kontaki et al. [20] proposed two continuous distance-based outlier detection algorithms, COD and ACOD, with the help of sliding window technology; both support flexible adjustment of the radius parameter R and the threshold parameter k. The algorithm COD was designed to detect the implicit outliers with several values of the threshold parameter k and a fixed radius parameter R.


computationally intensive method and is not suitable for processing high-dimensional datasets.


For itemset-based outlier detection approaches, He et al. [16] proposed an efficient frequent pattern-based outlier detection approach, namely FindFPOF, to discover the implicit outliers from large-scale precise datasets. FindFPOF judges outliers by the ratio of the support values of the frequent itemsets contained in a transaction to the total number of mined frequent itemsets. Although outliers can be identified by the FindFPOF method, its single, simple judgment condition makes its detection accuracy uncompetitive. In addition, the time cost of the outlier detection phase is very high, because the itemsets used in this phase are frequent itemsets and the number of frequent itemsets is very large. To address the high time cost of the outlier detection phase in FindFPOF, maximal frequent pattern-based outlier detection approaches, including OODFP [25] and MFPM-AD [4], have been proposed in recent years; OODFP discovers the outliers from high-dimensional time-series datasets, and MFPM-AD identifies the outliers from precise data stream. In 2015, Hemalatha et al. [17] first proposed the idea of using minimal infrequent patterns to detect the outliers in precise data stream, and then proposed the outlier detection approach MIFPOD based on this idea. In the MIFPOD approach, they first used the MIP-DS method to mine the minimal infrequent patterns from precise data stream to support outlier detection. In the outlier detection phase, they defined three deviation factors, named transaction weighting factor (TWF), minimal infrequent deviation factor (MIPDF) and minimal infrequent pattern based outlier factor (MIFPOF), to accurately determine whether the transactions are abnormal. Although the detection accuracy of MIFPOD is higher than that of FindFPOF, the time cost of the whole outlier detection process is relatively high because the minimal infrequent pattern mining is based on the original Apriori method. In addition, the MIFPOD approach is designed for precise data stream and is not suitable for uncertain data stream. For uncertain data stream, the FIM-UDSOD approach [14] was proposed to effectively identify the outliers. Similar to the MIFPOD approach, the MWIFIM-OD-UDS approach [5] designed three deviation indices to measure the deviation degree of each transaction, and then the outliers were effectively detected from uncertain weighted data stream. Because the itemset-based outlier detection approaches take the appearing frequency of the itemsets into consideration, the detected outliers coincide better with the definition of outliers given by Hawkins.
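The FindFPOF judgment described above can be illustrated with a minimal sketch. It assumes the frequent itemsets and their supports have already been mined; the function name `fpof` and the toy supports below are hypothetical and are not taken from [16]. Under this reading, a transaction that contains few frequent itemsets receives a low score and is a stronger outlier candidate.

```python
def fpof(transaction, frequent_itemsets):
    """Sum of supports of the frequent itemsets contained in the transaction,
    divided by the total number of mined frequent itemsets."""
    if not frequent_itemsets:
        return 0.0
    items = set(transaction)
    contained = sum(sup for fi, sup in frequent_itemsets.items() if fi <= items)
    return contained / len(frequent_itemsets)


# Hypothetical mined frequent itemsets with illustrative support values.
frequent = {
    frozenset("a"): 0.8,
    frozenset("b"): 0.6,
    frozenset("ab"): 0.5,
}

# Rank transactions by ascending score: lowest scores come first.
transactions = ["ab", "ac", "cd"]
ranked = sorted(transactions, key=lambda t: fpof(t, frequent))
print(ranked)
```

The sketch uses a single score per transaction, which is the "single simple judgment condition" the text criticizes.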


The algorithm ACOD was designed to detect the implicit outliers with multiple values of both R and k, and it also supported conducting the outlier detection in parallel to speed up the process. Based on the relation between “antihubs” and outliers in high- and low-dimensional settings, Radovanović et al. [29] explored two ways of using k-occurrence information to express the outlierness of points, and then proposed the AntiHub method for unsupervised outlier detection. Moreover, they also proposed a derived method that improves the discrimination between scores to further improve the outlier detection accuracy. Because the distance-based outlier detection approaches need to calculate the distance between each pair of points, they are time-consuming and not suitable for dense datasets. Outlier detection on uncertain data stream is a new direction in recent years; it was first proposed by Wang et al. [36] in 2010. Unlike the processing of precise datasets, the probability value of each pattern needs to be taken into the definition of a distance-based outlier, and an item was regarded as a distance-based outlier if the probability values of its neighbors were not larger than a predefined threshold value. To efficiently reduce the intermediate data in the sliding window, they designed a pruning method called PBA to prune the meaningless intermediate patterns; then, a dynamic programming algorithm called DPA was proposed to efficiently process each data item in linear time. Moreover, the detected data could be used to incrementally detect outliers in the sliding window.


For density-based outlier detection approaches, Bai et al. [2] proposed a distributed LOF computing method with parallel support, namely DLC, for identifying density-based outliers. Tang and He [33] first introduced the concept of RDOS (Relative Density-based Outlier Score) to measure the local outlierness of the detected objects, and then proposed an efficient density-based outlier detection approach based on local kernel density estimation (KDE), where the k nearest neighbors, reverse nearest neighbors and shared nearest neighbors were used to improve the detection accuracy. Cao et al. [8] proposed a continuous outlier detection approach called CUOD to discover the outliers from uncertain data stream. To reduce the time cost of outlier detection, they used a probability pruning approach to prune the infrequent patterns in the extending process. Then, a new method for parameter variable queries was proposed to enable the concurrent execution of different queries, and the results showed that the proposed approach can reduce the required storage and running time. Overall, the accuracy of density-based outlier detection approaches is relatively high, but it is also a

Table 1 Characteristics of outlier detection approaches

Approaches                  Type of data           Precise / Uncertain   Type of outlier detection
COID [31]                   static dataset         precise               clustering-based
ROCF [18]                   static dataset         precise               clustering-based
COD [20]                    data stream            precise               distance-based
ACOD [20]                   data stream            precise               distance-based
DLC [2]                     static dataset         precise               density-based
RDOS [33]                   static dataset         precise               density-based
FindFPOF [16]               static dataset         precise               itemset-based
OODFP [25]                  static dataset         precise               itemset-based
MIFPOD [17]                 data stream            precise               itemset-based
FIM-UDSOD [14]              data stream            uncertain             itemset-based
MWIFIM-OD-UDS [5]           weighted data stream   uncertain             itemset-based
DPA [36]                    data stream            uncertain             distance-based
CUOD [8]                    data stream            uncertain             density-based
MiFI-Outlier (our method)   data stream            uncertain             itemset-based

Table 1 shows the characteristics of the aforementioned outlier detection approaches, where the type of data, the uncertainty and the type of outlier detection are considered. Note that the itemset-based outlier detection approaches FindFPOF, OODFP, MIFPOD, FIM-UDSOD and MWIFIM-OD-UDS are compared with our proposed MiFI-Outlier approach.

the infrequent weighted itemsets and minimal infrequent weighted itemsets, including the items with most interest within each transaction. Like the FP-growth methods, the FP-growth-based infrequent itemset mining methods have two drawbacks: large memory usage and two scans of the whole dataset.

2.2. Infrequent itemset mining approaches

In this section, we first define some concepts related to this article, and then introduce some practical problems encountered in researching the minimal infrequent itemset-based outlier detection approach on uncertain data stream.

3.1. Preliminaries

Assume that I = {i1, i2, …, in} is a set of items and Is = {i1, i2, …, ik} with Is ⊆ I and k ∈ [1, n]; then Is is a k-itemset and k is the length of Is. Uncertain data stream UDS is composed of infinite transactions, that is, UDS = [T1, T2, …, Tm) (m → ∞), where Ti is composed of several items selected from I and their existential probabilities p(ik,Ti), that is, Ti = {i1:p(i1,Ti), i2:p(i2,Ti), …, ik:p(ik,Ti)} (k≤n, 0

For Apriori-based infrequent itemset mining approaches, Haglin and Manning proposed the MINIT approach [11] to recursively discover the minimal infrequent itemsets from the transactions in the datasets. In the MINIT approach, the itemsets were sorted in descending order of support value so that the most frequently appearing itemsets could be mined first; then, the minimal infrequent itemsets were mined from the search space, and each search space was discarded directly after the mining operation to improve the mining efficiency, where the search space referred to the transactions containing minimal infrequent itemsets. Szathmary et al. [32] proposed the Arima algorithm to mine the minimal infrequent itemsets by traversing the frequent search space level by level; that is, the mining process proceeded from 1-itemsets to longer itemsets, and the longer infrequent itemsets were generated by the “pattern extension” operation on the mined shorter minimal infrequent itemsets. Then, the MRG-Exp method [32] was proposed to further improve the mining efficiency; its main idea was to discover the frequent generators and use them to generate the minimal infrequent itemsets recursively. Troiano and Scibelli [34] proposed a breadth-first level-wise lattice-traversal algorithm, namely Rarity, for mining the infrequent itemsets. It first identified the longest rare itemsets in the database, and then moved downwards through the power set lattice to filter out the frequent itemsets, because any subset of a frequent itemset is also frequent. The Rarity approach was very suitable for sparse databases, but its mining efficiency is very low on dense datasets.


For the itemset-based outlier detection approaches, the itemset mining process is the basis of outlier detection. Over the past twenty years, most studies have focused on frequent itemset mining, but infrequent itemset mining is more meaningful for outlier detection because it aims at discovering the rarely appearing itemsets, which coincides better with the “rarely appearing” attribute of outliers. The infrequent itemset mining approaches can be roughly divided into Apriori-based approaches and FP-growth-based approaches.

3. Preliminaries and problem descriptions

Definition 1. subset, superset: For an itemset Ia={i1, i2,…, ia} and an itemset Ib={i1, i2,…, ib} (a

For FP-growth-based infrequent itemset mining approaches, Tsang et al. [35] designed the RP-Tree method to mine the infrequent itemsets from static precise datasets, where the support value of each 1-itemset was calculated in the first scan of the whole dataset, and then an FP-tree-like structure was constructed in the second scan of the dataset to recursively mine the infrequent itemsets. After the construction of the FP-tree-like structure, conditional trees were constructed to mine each subset of the infrequent itemsets. In 2014, Cagliero and Garza [3] proposed two FP-growth-based algorithms, called IWI-Miner and MIWI-Miner, to discover the relevant infrequent itemsets with different weights from transactional weighted datasets. The IWI-support-min measure was first defined relying on a minimal cost function to find the infrequent weighted itemsets and minimal infrequent weighted itemsets, including the items with least interest within each transaction. Then, the IWI-support-max measure was defined relying on a maximum cost function to find

Definition 2. probability: The probability of itemset Im in transaction Tj, where Im is formed by some items {i}, is denoted as p(Im,Tj), and it is defined as

$$p(I_m, T_j) = \prod_{\{i\} \subseteq \{I_m\}} p(i, T_j) \tag{1}$$

Definition 3. support: The appear frequency of itemset Im in UDS is denoted as sup(Im), and it is defined as

$$sup(I_m) = \sum_{j=1}^{|SW|} \prod_{\{i\} \subseteq \{I_m\}} p(i, T_j) \tag{2}$$

Definition 4. infrequent itemset (iFI): For an itemset Im, if its support value is less than min_sup, that is sup(Im)

Table 2 An example of UDS

TID   Transactions
T1    {a:0.8,d:0.2,e:0.4,f:0.2}
T2    {a:0.6,b:0.1,c:0.9,e:0.7,f:0.2}
T3    {a:0.3,b:0.3,c:0.4,e:0.2}
T4    {b:0.2,d:0.2,e:0.7,f:0.1}
T5    {a:0.4,b:0.2,c:0.5,d:0.1,f:0.3}
…     ……
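As a quick check of Definitions 2 and 3 on the Table 2 data, the following is a minimal sketch. It assumes the sliding window holds exactly the five transactions T1 to T5 and that an item absent from a transaction has probability 0; the names `WINDOW`, `probability` and `support` are illustrative only.

```python
from math import prod

# Transactions T1..T5 of Table 2: item -> existential probability.
WINDOW = [
    {"a": 0.8, "d": 0.2, "e": 0.4, "f": 0.2},            # T1
    {"a": 0.6, "b": 0.1, "c": 0.9, "e": 0.7, "f": 0.2},  # T2
    {"a": 0.3, "b": 0.3, "c": 0.4, "e": 0.2},            # T3
    {"b": 0.2, "d": 0.2, "e": 0.7, "f": 0.1},            # T4
    {"a": 0.4, "b": 0.2, "c": 0.5, "d": 0.1, "f": 0.3},  # T5
]


def probability(itemset, transaction):
    """Eq. (1): product of the item probabilities; an absent item counts as 0."""
    return prod(transaction.get(item, 0.0) for item in itemset)


def support(itemset, window):
    """Eq. (2): sum of the itemset probabilities over the sliding window."""
    return sum(probability(itemset, t) for t in window)


print(round(probability("ad", WINDOW[0]), 2))
print(round(support("a", WINDOW), 2), round(support("ab", WINDOW), 2))
```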

In this example, p(a,T1)=0.8 and p(ad,T1)=0.8*0.2=0.16. In the sliding window, sup(a)=0.8+0.6+0.3+0+0.4=2.1>0.6, so itemset {a} is a frequent itemset; sup(ab)=0.8*0+0.6*0.1+0.3*0.3+0*0.2+0.4*0.2=0.23<0.6, so itemset {ab} is an infrequent itemset.

3.2. Problem descriptions

In this subsection, some existing problems of itemset mining and outlier detection on uncertain data stream are pointed out.


In general, designing several accurate deviation indices based on the mined minimal infrequent itemsets to measure the deviation degree of the transactions in uncertain data stream, and thereby improve the detection accuracy, is very important for outlier detection. In addition, designing an efficient minimal infrequent itemset-based outlier detection method that accurately discovers the implicit outliers in uncertain data stream is also a tough challenge.

4. Minimal infrequent itemset-based outlier detection approach (MiFI-Outlier)

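[Figure omitted: lattice of the non-empty itemsets over the items a to f, from the 1-itemsets up to abcdef, with frequent and infrequent itemsets marked.]

Fig. 1. The specific information of potential itemsets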


For the traditional Apriori-based itemset mining methods, the uncertain data stream needs to be scanned several times to conduct the “pattern extension” operation and calculate the support values of the extended itemsets, thereby mining the frequent itemsets. This is time-consuming, and the time complexity is unacceptable when processing large-scale and high-dimensional datasets. The traditional FP-Growth-based itemset mining methods use a recursive idea to mine all frequent itemsets from uncertain data stream, and they are more efficient than the Apriori-based methods. However, the mining process needs to construct many conditional trees, which consumes much memory. Although a large number of optimized algorithms have been proposed to improve the mining efficiency of Apriori-based and FP-Growth-based methods, most of them target static precise datasets and precise data stream, and they are not suitable for uncertain data stream.


For the itemset-based outlier detection approaches, the appearing frequency of each item is considered an important factor in measuring the deviation degree of each transaction, so the outlier detection is more accurate. However, the existing itemset-based outlier detection methods are oriented to static precise datasets or precise data stream, and they are not suitable for uncertain data stream. In addition, the deviation indices designed in FindFPOF [16], OODFP [25] and FIM-UDSOD [14] are very simple, so their detection accuracy is not competitive enough.


3.2.2 Problem on outlier detection

For the clustering-based, distance-based and density-based outlier detection approaches, each item of the transactions in the current sliding window needs to be used to calculate the distances to the other items, so the computational complexity is very high when processing high-dimensional data stream. In addition, the appearing frequency of each item is not considered in these approaches; therefore, the detected outliers do not fit well with the “rarely appearing” part of the definition of an outlier.


3.2.1 Problem on itemset mining

Assume that the size of itemset {I} is m; it is known from [39] that the number of potential extensible itemsets is up to (2^m − 1). For the example shown in Table 2, the 1-itemsets are {{a}, {b}, {c}, {d}, {e}, {f}}, and the potential itemsets that can be mined are shown in Fig. 1; obviously, the number of potential itemsets is 63 (= 2^6 − 1). To mine all infrequent itemsets, the support values of all (2^m − 1) possible itemsets need to be calculated, which is unrealistic in the era of big data.
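The growth of the candidate space can be made concrete with a small sketch that enumerates every non-empty itemset over the six items of Table 2. The helper name `candidate_itemsets` is illustrative; the enumeration only demonstrates the counting argument and is not part of the proposed method.

```python
from itertools import combinations


def candidate_itemsets(items):
    """Yield every non-empty itemset over `items`, shortest itemsets first."""
    for k in range(1, len(items) + 1):
        yield from combinations(items, k)


items = ["a", "b", "c", "d", "e", "f"]
count = sum(1 for _ in candidate_itemsets(items))
print(count, 2 ** len(items) - 1)

# Each additional item doubles the candidate space (plus one).
for m in (10, 20, 30):
    print(m, 2 ** m - 1)
```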

memory usage and only one scan of the entire data stream is also a major challenge to solve.


parameters remain the same in all subsequent sections of this paper.

In general, designing an efficient strategy to reduce the scale of potential extensible itemsets, especially for high-dimensional data stream, and thereby reduce the time cost of the itemset mining phase, is very critical. In addition, how to use an efficient mining approach to mine the minimal infrequent itemsets from uncertain data stream with less time cost, less

Itemset-based outlier detection approaches are divided into two phases: (1) the itemset mining phase, and (2) the outlier detection phase. Similarly, the proposed minimal infrequent itemset-based outlier detection can be divided into: (1) the minimal infrequent itemset mining phase, and (2) the minimal infrequent itemset-based outlier detection phase. When new transactions flow into the sliding window, the minimal infrequent itemset mining operation is conducted to mine the minimal infrequent itemsets from the transactions in the current sliding window. As mentioned in 3.2.1, the main limitation of itemset mining is its mining speed, especially when processing high-dimensional data stream. To increase the speed of minimal infrequent itemset mining, we can draw on two main ideas. The first idea is to reduce the scale of potential extensible itemsets through an efficient downward closure property, and the second idea is to reduce the number of data stream scans through an efficient data structure. After all minimal infrequent itemsets are mined, the outlier detection operation is conducted to effectively discover the implicit outliers in uncertain data stream. To improve the detection accuracy, more factors that may influence the detection accuracy need to be considered in the design of the deviation indices. Finally, the transactions are sorted in descending order of the calculated deviation degree, and the transactions with a large deviation degree are judged as outliers. Based on the above analysis, we first describe an efficient minimal infrequent itemset mining approach, namely MiFI-UDSM, to mine the minimal infrequent itemsets from uncertain data stream. In subsection 4.2, we introduce two concepts to


4.1. Minimal infrequent itemset mining (MiFI-UDSM)

As the basis of the itemset-based outlier detection approach, the minimal infrequent itemsets need to be effectively mined using the “pattern extension” operation to support the outlier detection process.

After the (k+1)-itemsets are extended from the k-itemsets, their support values are calculated using the “vector multiply” operation to determine whether they are frequent itemsets. The “vector multiply” operation multiplies the probabilities of the (k+1) items in the (k+1)-itemset, and these probabilities are taken from the column vectors of the constructed matrix A. If the support value of the (k+1)-itemset is not less than the predefined min_sup, it is stored in FIL as a candidate itemset for the subsequent “pattern extension” operations. If the support value is less than min_sup, the “minimal infrequent check” operation is performed: it checks whether any subset of the (k+1)-itemset is present in MiFIL. If such a subset exists, the (k+1)-itemset is discarded directly; otherwise, it is saved to MiFIL. In particular, the extended 2-itemsets whose support values are less than min_sup are saved to MiFIL directly, because they are extended by frequent 1-itemsets. These operations are performed recursively until all minimal infrequent itemsets are mined from the transactions in the current sliding window.
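The loop described above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the names `WINDOW`, `expected_support` and `mine_minimal_infrequent` are hypothetical, the window holds the five transactions of the worked example in subsection 4.1.2, and a dictionary per transaction stands in for the rows of matrix A.

```python
from itertools import combinations

# Assumed sliding window (item -> existential probability), as in the worked example.
WINDOW = [
    {"a": 0.8, "d": 0.2, "e": 0.4, "f": 0.2},
    {"a": 0.6, "e": 0.7, "f": 0.2, "b": 0.1, "c": 0.9},
    {"a": 0.3, "e": 0.2, "b": 0.3, "c": 0.4},
    {"d": 0.2, "e": 0.7, "f": 0.1, "b": 0.2},
    {"a": 0.4, "d": 0.1, "f": 0.3, "b": 0.2, "c": 0.5},
]
MIN_SUP = 0.6


def expected_support(itemset, window):
    """'Vector multiply': sum over transactions of the product of item probabilities."""
    total = 0.0
    for transaction in window:
        prob = 1.0
        for item in itemset:
            prob *= transaction.get(item, 0.0)
        total += prob
    return total


def mine_minimal_infrequent(window, min_sup):
    """Return the minimal infrequent itemsets (MiFIL) as sorted tuples."""
    items = sorted({item for transaction in window for item in transaction})
    mifil, fil = [], []
    for item in items:
        target = fil if expected_support((item,), window) >= min_sup else mifil
        target.append((item,))
    while fil:
        # "Pattern extension": join frequent k-itemsets that share the same prefix.
        candidates = set()
        for x, y in combinations(fil, 2):
            if x[:-1] == y[:-1]:
                candidates.add(tuple(sorted(set(x) | set(y))))
        next_fil = []
        for cand in sorted(candidates):
            if expected_support(cand, window) >= min_sup:
                next_fil.append(cand)
            # "Minimal infrequent check": keep only if no subset is already in MiFIL.
            elif not any(set(m) <= set(cand) for m in mifil):
                mifil.append(cand)
        fil = next_fil
    return mifil


result = {frozenset(m) for m in mine_minimal_infrequent(WINDOW, MIN_SUP)}
```

With min_sup = 0.6, the sketch yields the same nine minimal infrequent itemsets as the worked example, {d}, {af}, {ab}, {ef}, {eb}, {fb}, {fc}, {bc} and {aec}.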


4.1.1. The main idea of MiFI-UDSM approach

Compared with Apriori-based approaches and FP-Growth-based approaches, the matrix structure can save much time cost and memory usage in the itemset mining process [5,10], because it requires scanning the uncertain data stream only once and supports mining the minimal infrequent itemsets without generating any conditional tree. The construction process of the matrix structure (denoted as matrix A) is similar to that of the MWIFIM-UDS approach [5], but the last row records the support value of each item. When a new transaction Ta flows into the sliding window, the probability of item {ib} is written to position Aa,b if it is in Ta; otherwise, 0 is written to the corresponding position. When the probabilities of all items in the current sliding window have been written into matrix A, the support value of each item is written to the corresponding position of the last row of matrix A.
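As a minimal sketch of this construction (an illustration under assumptions, not the authors' code), the matrix can be built as a list of rows with one extra support row at the end. The names `ITEMS`, `WINDOW` and `build_matrix` are hypothetical, and the data are the five transactions of the worked example in subsection 4.1.2.

```python
# Column order follows the worked example; the names here are illustrative only.
ITEMS = ["a", "d", "e", "f", "b", "c"]

# Assumed sliding window: each transaction maps an item to its existential probability.
WINDOW = [
    {"a": 0.8, "d": 0.2, "e": 0.4, "f": 0.2},
    {"a": 0.6, "e": 0.7, "f": 0.2, "b": 0.1, "c": 0.9},
    {"a": 0.3, "e": 0.2, "b": 0.3, "c": 0.4},
    {"d": 0.2, "e": 0.7, "f": 0.1, "b": 0.2},
    {"a": 0.4, "d": 0.1, "f": 0.3, "b": 0.2, "c": 0.5},
]


def build_matrix(window, items):
    """Return |SW|+1 rows: one row per transaction plus a final support row."""
    # An item that is absent from a transaction is recorded as 0.
    rows = [[transaction.get(item, 0.0) for item in items] for transaction in window]
    # The last row stores the support (column sum) of every item.
    support_row = [sum(column) for column in zip(*rows)]
    return rows + [support_row]


A = build_matrix(WINDOW, ITEMS)
supports = {item: round(value, 6) for item, value in zip(ITEMS, A[-1])}
```

Each transaction is read once while filling its row, and the 1-itemset supports are then available from the last row without a second pass over the window.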

itemset, thus, the “pattern extension” should also not be conducted on infrequent 1-itemsets, to avoid meaningless time cost. The frequent k-itemsets stored in FIL are extended into (k+1)-itemsets by the “pattern extension” operation. It is important to note that when extending frequent 1-itemsets to 2-itemsets, the “pattern extension” operation is conducted directly without judging their prefixes.


reduce the scale of potential extensible itemsets, and then propose an improved minimal infrequent itemset mining approach, namely MiFI-UDSM*. In subsection 4.3, we discuss the correctness and computing complexity of the proposed MiFI-UDSM approach and MiFI-UDSM* approach. In subsection 4.4, we introduce three deviation indices to measure the deviation degree of each transaction, and then the minimal infrequent itemset-based outlier detection approach, namely MiFI-Outlier, is proposed to accurately discover the outliers from uncertain data stream.


With the matrix structure, the uncertain data stream is scanned only once, which reduces the time cost of the itemset mining process to a certain extent. However, it can be seen from Fig. 1 that when the incoming data stream is high-dimensional, more distinct 1-items exist in it, so the scale of potential extensible itemsets is very large, which slows down itemset mining. To speed up the mining process, it is necessary to delete the infrequent 1-items in the high-dimensional data stream, thereby reducing the scale of potential extensible itemsets. Thus, the downward closure property should be adopted in the “pattern extension” process to delete the itemsets that are meaningless for the mining process.

With the above operations, the number of mined minimal infrequent itemsets is much less than that of all infrequent itemsets, which facilitates the outlier detection process. The detailed process of MiFI-UDSM is shown in Algorithm 1.

Theorem 1. Downward closure property: Any superset of an infrequent itemset is also infrequent.


Proof. Assume that {Xk} is an infrequent k-itemset and {Xk+1} is extended by itemset {Xk} and {Y}. Because the probability of itemset {Y} is not larger than 1, and itemset {Xk} and itemset {Y} do not always appear at the same time, sup(Xk+1)≤sup(Xk)*p(Y)≤sup(Xk)

Algorithm 1: MiFI-UDSM
Input: Uncertain data stream, min_sup
Output: MiFIs
01.MiFIL=Φ, FIL=Φ
02.construct matrix A
03.if A|SW|+1,k

Step 1. Construct matrix A. The probability of each itemset in transactions T1, T2, T3, T4 and T5 is scanned and written into matrix A successively. When transaction T1 is scanned, the probabilities of the existing itemsets ({a}, {d}, {e} and {f}) in T1 are written in the corresponding positions; the result is shown in Fig. 2(a). Fig. 2(b) shows the constructed matrix A after transaction T2 is scanned, and Fig. 2(c) shows the constructed matrix A after all five transactions in the sliding window are scanned. Then, the support value of each item is calculated and written in the corresponding position of row (|SW|+1) in matrix A; the specific result is shown in Fig. 2(d).

       a     d     e     f     b     c
T1     0.8   0.2   0.4   0.2   0     0
T2     0.6   0     0.7   0.2   0.1   0.9
(b)

       a     d     e     f     b     c
T1     0.8   0.2   0.4   0.2   0     0
T2     0.6   0     0.7   0.2   0.1   0.9
T3     0.3   0     0.2   0     0.3   0.4
T4     0     0.2   0.7   0.1   0.2   0
T5     0.4   0.1   0     0.3   0.2   0.5
(c)

       a     d     e     f     b     c
T1     0.8   0.2   0.4   0.2   0     0
T2     0.6   0     0.7   0.2   0.1   0.9
T3     0.3   0     0.2   0     0.3   0.4
T4     0     0.2   0.7   0.1   0.2   0
T5     0.4   0.1   0     0.3   0.2   0.5
sup    2.1   0.5   2     0.8   0.8   1.8
(d)

Fig.2. The creation process of matrix A

Although the MiFI-UDSM approach can exactly mine the minimal infrequent itemsets from uncertain data stream, the scale of potential extensible itemsets is still very large when processing high-dimensional data streams, which adds much time cost to the mining process, including the “pattern extension” operation and the support value calculation. To further reduce the scale of extensible k-itemsets, we design two concepts, “item cap” and “support cap”, to eliminate the extra “pattern extension” operations and support value calculations on meaningless k-itemsets; these k-itemsets can be deleted directly to reduce the dimensions, thereby reducing the overall time cost of itemset mining. Specifically, the “pattern extension” operation is conducted by considering an upper bound on the existential probability of each itemset in the current transaction, where the upper bound is the “item cap” defined in Definition 8.

Definition 8. item cap (pcap(im,Tj)): The “item cap” of item {im} in transaction Tj is denoted as pcap(im,Tj); it is the product of the probability p(im,Tj) and the maximal existential probability value (denoted as M) of the frequent items other than itself, taken (n-1) times. It is defined as

$$p^{cap}(i_m, T_j) = \begin{cases} p(i_m, T_j) \cdot M^{\,n-1}, & |T_j| > 1 \\ p(i_1, T_j), & |T_j| = 1 \end{cases}, \qquad M = \max_{m \in [1, |T_j|]} p(i_m, T_j) \tag{3}$$

Step 2. Search for minimal infrequent 1-itemsets. After the support values of all 1-itemsets are calculated, the minimal infrequent 1-itemsets are mined based on matrix A. Because sup(d)=0.5<0.6, {d} is an infrequent itemset and is saved to MiFIL. The frequent 1-itemsets {a}, {b}, {c}, {e} and {f} are saved to FIL.

Step 3. Extend the frequent 1-itemsets to 2-itemsets. The frequent 1-itemsets {a}, {e}, {f}, {b} and {c} stored in FIL are taken out in turn for the “pattern extension” operation, which extends them to the 2-itemsets {{ae}, {af}, {ab}, {ac}, {ef}, {eb}, {ec}, {fb}, {fc}, {bc}}.

Step 4. Search for minimal infrequent 2-itemsets. After the frequent 1-itemsets are extended to 2-itemsets, their support values are calculated to search for the minimal infrequent 2-itemsets, which are saved into MiFIL. For the extended 2-itemsets, the support values are {{ae}:0.8, {af}:0.4, {ab}:0.23, {ac}:0.86, {ef}:0.29, {eb}:0.27, {ec}:0.71, {fb}:0.1, {fc}:0.33, {bc}:0.31}; thus, itemsets {af}, {ab}, {ef}, {eb}, {fb}, {fc} and {bc} are minimal infrequent 2-itemsets and are saved to MiFIL, while {ae}, {ac} and {ec} are frequent itemsets and are saved to FIL.

Step 5. Extend the frequent 2-itemsets to 3-itemsets. The frequent 2-itemsets {ae} and {ac}, which share the prefix {a}, can be extended into {aec}. Because sup(aec)=0.402<0.6, it is an infrequent itemset. Because its subsets ({ae}, {ac}, {ec}) are all frequent itemsets, it is a minimal infrequent itemset and is saved to MiFIL. Because no frequent itemsets remain in this example, the “pattern extension” process ends. In summary, the mined minimal infrequent itemsets are {d}, {af}, {ab}, {ef}, {eb}, {fb}, {fc}, {bc} and {aec}.

Theorem 2. In the same transaction Tj, the existential probability of any k-itemset Im (k>1) that contains item {im} is not larger than pcap(im,Tj), that is, p(Im,Tj)≤pcap(im,Tj), where Im⊆Tj and {im}⊆Im.

Proof. Assume that itemset {Im}={i1,…,im} is a subset of transaction Tj; it can be known from definition 2 that


       a     d     e     f
T1     0.8   0.2   0.4   0.2
(a)

4.2. An improved minimal infrequent itemset mining approach (MiFI-UDSM*)


4.1.2 An example of MiFI-UDSM approach

In this subsection, an example is given to explain the MiFI-UDSM method more clearly. The uncertain data stream used in this example is shown in Table 2; the min_sup value and |SW| are the same as in subsection 3.1.

$$p(I_m, T_j) = \prod_{\{i\} \subseteq \{I_m\}} p(i, T_j) = p(i_m, T_j) \cdot \prod_{\{i\} \subseteq \{I_m - i_m\}} p(i, T_j) \tag{4}$$

Because the existential probability of each item {i} is not larger than 1 (that is, $0 < p(i, T_j) \le 1$), we have

$$\prod_{\{i\} \subseteq \{I_m - i_m\}} p(i, T_j) \le \max_{1 \le q \le m-1} p(i_q, T_j) \tag{5}$$

It can be derived from formulas (4) and (5) that:

$$p(I_m, T_j) = p(i_m, T_j) \cdot \prod_{\{i\} \subseteq \{I_m - i_m\}} p(i, T_j) \le p(i_m, T_j) \cdot \max_{1 \le q \le m-1} p(i_q, T_j) \le p^{cap}(i_m, T_j) \tag{6}$$

Thus, theorem 2 is correct.

Definition 9. support cap (supcap(Im)): The “support cap” of itemset Im is denoted as supcap(Im); it is the sum of all pcap(Im,Tj) values that appear in the current sliding window. It is defined as

$$sup^{cap}(I_m) = \sum_{j=1}^{|SW|} \left( p^{cap}(I_m, T_j) \mid I_m \subseteq T_j \right) \tag{7}$$

Theorem 3. For any k-itemset Im (k>1), the support value of Im is not larger than supcap(Im), that is, sup(Im)≤supcap(Im).


$$sup(I_m) = \sum_{j=1}^{|SW|} \prod_{\{i\} \subseteq \{I_m\}} p(i, T_j) = \sum_{j=1}^{|SW|} \Big( p(i_m, T_j) \cdot \prod_{\{i\} \subseteq \{I_m - i_m\}} p(i, T_j) \Big) \le \sum_{j=1}^{|SW|} p^{cap}(i_m, T_j) = sup^{cap}(I_m) \tag{8}$$

Therefore, theorem 3 is correct.

Definition 10. Safe infrequent itemset (SiFI): For an itemset Im, if its “support cap” value is less than min_sup, that is supcap(Im)

Algorithm 2: MiFI-UDSM*
Input: Uncertain data stream, min_sup
Output: MiFIs
01.MiFIL=Φ, FIL=Φ
02.construct matrix A
03.if A|SW|+1,k

Based on the “item cap” and “support cap” concepts, an improved version of the MiFI-UDSM approach, namely MiFI-UDSM*, is proposed to mine the minimal infrequent itemsets from uncertain data stream more quickly. Different from the MiFI-UDSM approach, the pcap(Im,Tj) values and the supcap(Im) values need to be calculated before each “pattern extension” operation from frequent k-itemsets to (k+1)-itemsets in order to discard the SiFIs, thereby reducing the dimensions of extensible patterns in high-dimensional data streams and avoiding meaningless time cost. In addition, another matrix structure needs to be constructed to record the pcap value and supcap value of each item, where the last row records the supcap(Im) values instead of the support values, so that the items that do not participate in the subsequent “pattern extension” operations can be excluded easily. If the support value of the current item is less than min_sup, all pcap values of this item in the matrix structure are recorded as 0, thereby reducing the time cost of the pcap and supcap value calculations. The specific implementation steps of the MiFI-UDSM* approach are as follows.

supcap(Im) value of each subset is not less than the predefined min_sup value, they are saved to FIL. Otherwise, the 2-itemsets are extended with the frequent 1-itemsets stored in FIL using the “pattern extension” operation to form the 3-itemsets; then, the “minimal infrequent check” operation is conducted to check whether any subset exists in MiFIL, and the extended 3-itemsets are saved to MiFIL directly if none of their subsets is in MiFIL. The operations of the second phase are then conducted recursively until all minimal infrequent itemsets in the transactions of the current sliding window are mined. The detailed process of the MiFI-UDSM* approach is shown in Algorithm 2.


Proof. Assume that Im={i1,…,im} is a subset of Tj, it can be known from definition 3 and theorem 2 that:


In the first phase, the matrix structure is constructed as in the MiFI-UDSM approach, and then the minimal infrequent 1-itemsets and frequent 1-itemsets are mined by calculating the support values. Before extending the frequent 1-itemsets to 2-itemsets, the supcap(Im) value of each frequent 1-itemset is calculated to find the safe infrequent 1-itemsets. The safe infrequent 1-itemsets are directly connected with the frequent 1-itemsets stored in FIL to form 2-itemsets, and these extended 2-itemsets are saved in MiFIL directly without calculating their support values, thereby reducing the time cost of support value calculation. Because any superset of a safe infrequent 1-itemset is also infrequent, it is not necessary to conduct further “pattern extension” operations on them, and they are moved out of FIL.

In the second phase, the frequent 1-itemsets that remain in FIL are extended to 2-itemsets by the “pattern extension” operation, and the support values of the extended 2-itemsets are calculated to further mine the minimal infrequent 2-itemsets and frequent 2-itemsets. For these frequent 2-itemsets, the supcap(Im) values are calculated to find the safe infrequent 2-itemsets. Specifically, in the extended frequent 2-itemsets, if the

4.2.1 An example of MiFI-UDSM* approach

In this subsection, we use the example shown in Table 2 to explain the proposed MiFI-UDSM* approach more clearly; the min_sup value and |SW| are also the same as in subsection 3.1.

Step 1. Construct matrix A. The construction of matrix A is the same as in the MiFI-UDSM approach.

Step 2. Search for minimal infrequent 1-itemsets. This step is the same as in the MiFI-UDSM approach, and the frequent 1-itemsets {a}, {b}, {c}, {e} and {f} are saved to FIL.


          a      d      e      f      b      c
T1        0.32   0      0.32   0.16   0      0
T2        0.54   0      0.63   0.18   0.09   0.63
T3        0.12   0      0.08   0      0.12   0.12
T4        0      0      0.14   0.07   0.14   0
T5        0.2    0      0      0.15   0.1    0.2
supcap    1.18   0      1.17   0.56   0.45   0.95

          a       d      e       f      b      c
T1        0.128   0      0.256   0      0      0
T2        0.486   0      0.567   0      0      0.441
T3        0.048   0      0.032   0      0      0.036
T4        0       0      0.028   0      0      0
T5        0.1     0      0       0      0      0.08
supcap    0.762   0      0.883   0      0      0.557

Fig.4. The pcap value and supcap value for each frequent 2-itemset

Step 8. Search for safe minimal infrequent 3-itemsets. After the supcap value of each frequent 2-itemset is calculated, the safe minimal infrequent 3-itemsets that are extended by the frequent 2-itemsets are mined and stored into MiFIL. Because supcap(c)=0.557<0.6, itemsets {ac} and {ec} are safe infrequent itemsets, that is, the 3-itemsets extended by them are minimal infrequent itemsets. They are then extended with the frequent 2-itemsets stored in FIL to form the minimal infrequent 3-itemsets. Because the frequent 2-itemsets in FIL are {ae}, {ac} and {ec}, the only minimal infrequent 3-itemset extended by them is {aec}, which is saved to MiFIL directly without calculating its support value.


Step 3. Calculate the pcap value and supcap value for each frequent 1-itemset. After matrix A is constructed, the pcap value and supcap value of each frequent 1-itemset are calculated to seek minimal infrequent 2-itemsets. For itemset {a} in transaction T1, the maximal existential probability value except for itself is 0.4; thus, pcap(a,T1)=0.8*0.4=0.32, which is written in the corresponding position of the matrix structure. The pcap values of the items in the other frequent 1-itemsets are calculated in the same way. In particular, the pcap value of the infrequent 1-itemset {d} is written as 0 to omit the calculation, because any superset of an infrequent itemset is also an infrequent itemset. When all pcap values are calculated and written into the matrix structure, the supcap value is calculated and written in the last row of the matrix, such as: supcap(a)=0.32+0.54+0.12+0+0.2=1.18. The detailed pcap and supcap values of each frequent 1-itemset are shown in Fig. 3.
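The calculation in this step can be sketched as follows. This is a hedged illustration rather than the authors' code: `item_cap` and `support_cap` are hypothetical helper names, M is taken as the largest probability among the other frequent items of the same transaction (as in the pcap(a,T1) example above), n=2 is assumed for the extension from 1-itemsets to 2-itemsets, and returning p itself when a transaction has no other frequent item is an assumption that mirrors the |Tj|=1 branch of the definition.

```python
# Assumed sliding window of the worked example (item -> existential probability).
WINDOW = [
    {"a": 0.8, "d": 0.2, "e": 0.4, "f": 0.2},
    {"a": 0.6, "e": 0.7, "f": 0.2, "b": 0.1, "c": 0.9},
    {"a": 0.3, "e": 0.2, "b": 0.3, "c": 0.4},
    {"d": 0.2, "e": 0.7, "f": 0.1, "b": 0.2},
    {"a": 0.4, "d": 0.1, "f": 0.3, "b": 0.2, "c": 0.5},
]
MIN_SUP = 0.6
FREQUENT = ["a", "e", "f", "b", "c"]  # frequent 1-itemsets; {d} is infrequent


def item_cap(item, transaction, frequent, n):
    """p(item, T) * M^(n-1), with M the largest probability of the other frequent items."""
    p = transaction.get(item, 0.0)
    if p == 0.0:
        return 0.0
    others = [transaction[i] for i in frequent if i != item and i in transaction]
    if not others:
        return p  # assumption: single-item case of the definition
    return p * max(others) ** (n - 1)


def support_cap(item, window, frequent, n):
    """Sum of the item caps over all transactions in the sliding window."""
    return sum(item_cap(item, transaction, frequent, n) for transaction in window)


caps = {item: round(support_cap(item, WINDOW, FREQUENT, n=2), 2) for item in FREQUENT}
safe_infrequent = [item for item in FREQUENT if caps[item] < MIN_SUP]
```

Under these assumptions the sketch reproduces the supcap row of Fig. 3 (1.18, 1.17, 0.56, 0.45 and 0.95 for a, e, f, b and c), and the items whose supcap value falls below min_sup are {f} and {b}.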

Because no frequent itemsets remain in this example, the “pattern extension” process ends. Finally, the minimal infrequent itemsets in MiFIL are {d}, {af}, {ab}, {ef}, {eb}, {fb}, {fc}, {bc} and {aec}.

Fig.3. The pcap value and supcap value for each frequent 1-itemset


Step 4. Search for safe minimal infrequent 2-itemsets. After the supcap value of each frequent 1-itemset is calculated, the safe minimal infrequent 2-itemsets, which are extended by the frequent 1-itemsets whose supcap value is less than the min_sup value, are mined and stored into MiFIL. Because supcap(f)=0.56<0.6 and supcap(b)=0.45<0.6, itemsets {f} and {b} are safe infrequent itemsets, that is, the 2-itemsets extended by them are minimal infrequent itemsets. They are then extended with the frequent 1-itemsets stored in FIL to form the minimal infrequent 2-itemsets. Because the frequent 1-itemsets in FIL are {a}, {b}, {c}, {e} and {f}, the minimal infrequent 2-itemsets extended by them are {af}, {ef}, {fb}, {fc}, {ab}, {eb} and {bc}, which are saved into MiFIL directly without calculating their support values.

4.3. The correctness and computing complexity of MiFI-UDSM and MiFI-UDSM*


In this subsection, the correctness and the computing complexity of the proposed two minimal infrequent itemset mining approaches including MiFI-UDSM and MiFI-UDSM* are analyzed.


Step 5. Extend the “frequent” 1-itemsets to 2-itemsets. After step 4, the only remaining “frequent” 1-itemsets are {a}, {e} and {c}, which can be extended to the 2-itemsets {{ae}, {ac}, {ec}}.

Step 6. Search for minimal infrequent 2-itemsets. After the 2-itemsets are extended from the “frequent” 1-itemsets, their support values are calculated to determine whether they are minimal infrequent itemsets. The support values of the extended 2-itemsets are {{ae}:0.8, {ac}:0.86, {ec}:0.71}; thus, itemsets {ae}, {ac} and {ec} are frequent 2-itemsets and are saved to FIL.


Step 7. Calculate the pcap value and supcap value for each frequent 2-itemset. Before extending the frequent 2-itemsets to 3-itemsets, the pcap and supcap values of the 1-items in the frequent 2-itemsets are calculated to determine whether they can be further extended. For {a}, pcap(a,T1)=0.8*0.4*0.4=0.128, which is written in the corresponding position of the matrix structure. The pcap values of the other items in the frequent 2-itemsets are calculated in the same way. When all pcap values are calculated and written into the matrix structure, the supcap value is calculated and written in the last row of the matrix, such as: supcap(a)=0.128+0.486+0.048+0+0.1=0.762. The detailed result is shown in Fig. 4.

For the MiFI-UDSM approach, the factor affecting the mining accuracy is the removal of infrequent itemsets during the “pattern extension” phase. It can be known from Theorem 1 that if {Xk} is an infrequent k-itemset and {Xk+1} is the superset of {Xk} that is extended by itemset {Xk} and itemset {Y} (sup(Y)≤1), then sup(Xk+1)≤sup(Xk)*p(Y)≤sup(Xk)2), the “minimal infrequent check” operation is conducted after the support value calculation; therefore, the itemsets saved in MiFIL are guaranteed to be minimal. Overall, the MiFIs mined by MiFI-UDSM are correct and no iFI is missed. Different from the MiFI-UDSM approach, the purpose of introducing the concepts of “item cap” and “support cap” in MiFI-UDSM* is to reduce the scale of potential extensible itemsets, thereby improving the mining efficiency. It can be known from Theorem 2 that if the itemset {X} is composed by itemset {xr} and supcap(xr)

To better quantify the abnormal degree of the detected transactions, the minimal infrequent itemset deviation index (MiFIDI), similarity deviation index (SDI) and transaction deviation index (TDI) are defined in the next paragraphs.

Definition 10. MiFIDI (Minimal infrequent Itemset Deviation Index): For each minimal infrequent itemset {X}, the length of {X} is len(X), the support value of {X} is sup(X), and the number of different items in the current sliding window is k. Then, MiFIDI is defined as

$$MiFIDI(X) = (min\_sup - sup(X)) \cdot 2^{\,k - len(X)} \tag{9}$$

It can be known from the definition of outlier proposed by Hawkins [15] that outliers have two major attributes: (1) they rarely appear, and (2) they deviate much from most observations. For the attribute of rarely appearing, the proposed MiFI-UDSM approach and MiFI-UDSM* approach can effectively mine the rarely appearing itemsets (i.e. minimal infrequent itemsets) from uncertain data stream. For the attribute of deviating much from most observations, we refer to the MWIFIM-OD-UDS approach [5] and the MIFPOD approach [17] to design three deviation indices based on the mined minimal infrequent itemsets, thereby measuring the deviation degree of each transaction in the current sliding window. The factors that influence the deviation degree mainly involve the following aspects.

(1) The support value of minimal infrequent itemsets. A smaller support value of a minimal infrequent itemset Im means that Im appears more rarely or that its appearing probability is smaller, which makes the detected object more abnormal. Thus, the support value is negatively correlated with the deviation degree.

(2) The length of minimal infrequent itemsets. For a minimal infrequent 1-itemset {ia} and a minimal infrequent 2-itemset {iaib}, the number of infrequent itemsets that can be extended by {ia} is $(2^{k-1}-1)$, and the number of infrequent itemsets that can be extended by {iaib} is $(2^{k-2}-1)$. That is, shorter minimal infrequent itemsets can generate more infrequent itemsets; therefore, short minimal infrequent itemsets make the detected transactions more abnormal. Thus, the length of a minimal infrequent itemset is negatively correlated with the deviation degree.

(3) The number of contained minimal infrequent itemsets. For a transaction Ti, if more minimal infrequent itemsets are

MiFIDI is the deviation index of minimal infrequent itemsets; a bigger MiFIDI(X) value means itemset {X} is more abnormal.

Definition 11. SDI (Similarity Deviation Index): For each transaction Ti in the sliding window, its length is len(Ti). Then, SDI is defined as

$$SDI(T_i) = \sum_{X \in MiFIL,\; Y = (T_i \cap X)} \frac{1}{sup(Y)} \cdot \frac{len(Y)}{len(T_i)} \tag{10}$$

SDI is an important deviation index of the transactions, the bigger SDI(Ti) value means transaction Ti is more abnormal.


4.4. Outlier detection approach

(4) The degree of similarity to the minimal infrequent itemsets in MiFIL. For a transaction Ti, the more similar it is to the itemsets stored in MiFIL, the more likely it is to be an outlier. Thus, the degree of similarity is positively correlated with the deviation degree.


The difference between the MiFI-UDSM* approach and the MiFI-UDSM approach is that the pcap and supcap values need to be calculated before each “pattern extension” operation; the computing complexity of this operation is O(2*k*n) in the worst case, where k is the maximal length of extensible itemsets. In this case, the computing complexity of the MiFI-UDSM* approach is $O((2n \cdot |SW|)+(2^n-1)+(2^n-1-(n^2+n)/2)+2kn)$. Because $2n \cdot |SW|-(n^2+n)/2-2+2kn$ is far less than $2^{n+1}$, the final computing complexity of the MiFI-UDSM* approach is also $O(2^{n+1})$ in the worst case.

contained in Ti, which indicates that many itemsets in Ti appear rarely, transaction Ti is more likely to be an outlier. Thus, the number of contained minimal infrequent itemsets is positively correlated with the deviation degree.


For the MiFI-UDSM approach, the computation can be decomposed into four steps, and the computing complexity of each step is as follows.

Step 1: Scan the item information in the sliding window and write the probabilities into the matrix structure; the computing complexity of this step is O(n*|SW|), where n is the number of distinct items.

Step 2: Calculate the support value of each 1-itemset and discover the minimal infrequent 1-itemsets; the computing complexity of this step is also O(n*|SW|).

Step 3: Extend the frequent 1-itemsets to longer itemsets and discover the infrequent itemsets. In the worst case, all extended itemsets are frequent, so $(2^n-1)$ itemsets can be extended from the n frequent 1-itemsets, and the computing complexity of this step is $O(2^n-1)$ in the worst case.

Step 4: Discover the minimal infrequent itemsets. The “minimal infrequent check” operation need not be conducted on the n 1-itemsets and the (n*(n-1)/2) 2-itemsets. In the worst case, $(2^n-1-(n^2+n)/2)$ itemsets need to be checked; therefore, the computing complexity of this step is $O(2^n-1-(n^2+n)/2)$ in the worst case.

In general, the computing complexity of the MiFI-UDSM approach is $O((2n \cdot |SW|)+(2^n-1)+(2^n-1-(n^2+n)/2))$. Because $2n \cdot |SW|-(n^2+n)/2-2$ is far less than $2^{n+1}$, the final computing complexity of the MiFI-UDSM approach is $O(2^{n+1})$.
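The worst-case counts in Steps 3 and 4 can be checked numerically with a short sketch. The function name `worst_case_counts` is hypothetical, and n=6 is chosen only because the worked example has six distinct items.

```python
from math import comb


def worst_case_counts(n):
    """Return (number of non-empty itemsets, number that need the minimal infrequent check)."""
    total = sum(comb(n, k) for k in range(1, n + 1))  # equals 2^n - 1
    unchecked = n + comb(n, 2)                        # 1-itemsets and 2-itemsets skip the check
    return total, total - unchecked


total, checked = worst_case_counts(6)
```

For n=6 this gives 63 itemsets in total, of which 42 would need the check, in agreement with $2^n-1$ and $2^n-1-(n^2+n)/2$.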

Definition 12. TDI (Transaction Deviation Index): For each transaction Ti in the sliding window, its length is len(Ti). Then, TDI is defined as

$$TDI(T_i) = \frac{\sum_{X \subseteq T_i,\; X \in MiFIL} MiFIDI(X) + SDI(T_i)}{len(T_i)} \tag{11}$$

TDI is the final deviation index of the transactions; a bigger TDI(Ti) value means transaction Ti is more likely to be an outlier. Based on the MiFIDI(X), SDI(Ti) and TDI(Ti) values, a Minimal inFrequent Itemset-based Outlier detection approach, namely MiFI-Outlier, is proposed to accurately detect the implicit outliers from uncertain data stream. The process of the MiFI-Outlier approach is roughly divided into: (1) mine the minimal infrequent itemsets from uncertain data stream using the proposed MiFI-UDSM method or MiFI-UDSM* method, (2) determine the minimal infrequent itemsets contained in the current transactions and calculate the deviation degree of each transaction, and (3) sort the transactions by decreasing TDI(Ti) values, thereby detecting the outliers. The top k transactions with the largest TDI(Ti) values are then judged as outliers, where the value k is specified by the users. The detailed process of the proposed MiFI-Outlier approach is shown in Algorithm 3.

Algorithm 3: MiFI-Outlier
Input: Uncertain data stream, min_sup, k
Output: Outlier Sets (OS)
01.call Algorithm 2 // mine minimal infrequent itemsets
02.MiFIDI(X)=0, SDI(Ti)=0, TDI(Ti)=0
03.foreach {X}∈MiFIL do
04.  MiFIDI(X)=(min_sup-sup(X))*2^(k-len(X))
05.end for
06.for i∈[1,|SW|] do
07.  foreach {X}⊆Ti and {X}∈MiFIL do
08.    $SDI(T_i) = \sum_{X \in MiFIL,\; Y = (T_i \cap X)} \frac{1}{sup(Y)} \cdot \frac{len(Y)}{len(T_i)}$
09.    calculate n(X) // n(X) is the number of MiFIs in Ti
10.  end for
11.  foreach {X}⊆Ti do
12.    $TDI(T_i) = \frac{\sum_{X \subseteq T_i,\; X \in MiFIL} MiFIDI(X) + SDI(T_i)}{len(T_i)}$
13.  end for
14.end for
15.sort transactions using decreasing TDI(Ti) values
16.OS←top k {Ti}
17.return outliers in OS

For transaction T5, the similar parts are {d}:0.1, {af}:0.12, {ab}:0.08, {f}:0.3, {b}:0.2, {fb}:0.06, {fc}:0.15, {bc}:0.1 and {ac}:0.2, thus, SDI(T5)=1/0.1+2/0.12+2/0.08+1/0.3+1/0.2+2/0.06+2/0.15+2/0.1+2/0.2=136.67.

Step 3. Determine the contained MiFIs for each transaction and calculate its TDI value.

For transaction T1, the contained MiFIs are {d}, {af} and {ef}, thus, TDI(T1)=(3.2+3.2+4.96)+62.5=73.86; For transaction T2, the contained MiFIs are {ab}, {af}, {bc}, {eb}, {fb}, {fc}, {ef} and {aec}, thus, TDI(T2)=(5.92+3.2+4.64+5.28+8+4.32+4.96+1.584)+234.13=272.034;

For transaction T3, the contained MiFIs are {ab}, {bc}, {eb} and {aec}, thus, TDI(T3)=(5.92+4.64+5.28+1.584)+211.39=228.814; For transaction T4, the contained MiFIs are {d}, {eb}, {fb} and {ef}, thus, TDI(T4)=(3.2+5.28+8+4.96)+179.29=200.73;

For itemset {d}, $MiFIDI(d) = (0.6-0.5) \cdot 2^{6-1} = 3.2$; For itemset {af}, $MiFIDI(af) = (0.6-0.4) \cdot 2^{6-2} = 3.2$;

Step 4. Sort the transactions by decreasing TDI values. After the TDI values are calculated, the transactions in decreasing order of their likelihood of being outliers are T2, T3, T4, T5, T1.
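The ranking in this example can be reproduced with the following sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, and the code follows the arithmetic of the worked example, which takes Y as the overlap between Ti and each itemset in MiFIL, divides len(Y) by the existential probability of Y in Ti, and adds the MiFIDI sum to SDI without the 1/len(Ti) factor.

```python
from math import prod

# Assumed sliding window of the worked example (item -> existential probability).
WINDOW = [
    {"a": 0.8, "d": 0.2, "e": 0.4, "f": 0.2},
    {"a": 0.6, "e": 0.7, "f": 0.2, "b": 0.1, "c": 0.9},
    {"a": 0.3, "e": 0.2, "b": 0.3, "c": 0.4},
    {"d": 0.2, "e": 0.7, "f": 0.1, "b": 0.2},
    {"a": 0.4, "d": 0.1, "f": 0.3, "b": 0.2, "c": 0.5},
]
MIN_SUP = 0.6
K = 6  # number of different items in the window

# Mined minimal infrequent itemsets and their support values.
MIFIL = {
    ("d",): 0.5, ("a", "f"): 0.4, ("a", "b"): 0.23, ("e", "f"): 0.29,
    ("e", "b"): 0.27, ("f", "b"): 0.1, ("f", "c"): 0.33, ("b", "c"): 0.31,
    ("a", "e", "c"): 0.402,
}


def mifidi(itemset):
    """(min_sup - sup(X)) * 2^(k - len(X))."""
    return (MIN_SUP - MIFIL[itemset]) * 2 ** (K - len(itemset))


def sdi(transaction):
    """Sum of len(Y) / p(Y, Ti) over the non-empty overlaps Y of Ti with each MiFI."""
    total = 0.0
    for itemset in MIFIL:
        overlap = [item for item in itemset if item in transaction]
        if overlap:
            total += len(overlap) / prod(transaction[item] for item in overlap)
    return total


def tdi(transaction):
    """MiFIDI summed over the MiFIs fully contained in Ti, plus SDI(Ti)."""
    contained = [x for x in MIFIL if all(item in transaction for item in x)]
    return sum(mifidi(x) for x in contained) + sdi(transaction)


scores = {f"T{j}": tdi(t) for j, t in enumerate(WINDOW, start=1)}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Sorting by decreasing score gives T2, T3, T4, T5, T1, the order stated in Step 4; choosing the top k entries of `ranking` corresponds to the last lines of Algorithm 3.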

5. Experiments and analysis

Extensive experiments are first conducted on a synthetic dataset [5] (the outliers are marked) and a public dataset¹ (namely lymphography, in which classes 1 and 4 are marked as outliers) to evaluate the detection accuracy of the proposed MiFI-Outlier approach, and then four public datasets² (namely mushroom, pumsb*, chess and kosarak) are used to evaluate the mining efficiency of the two proposed minimal infrequent itemset mining approaches, including time cost and memory usage. Because the data in the four public datasets do not provide probability values, we assign a randomly generated existential probability in the range (0.0,1.0] to each data item, as suggested by [22]. The characteristics of the used datasets are shown in Table 3.


For itemset {ab}, $MiFIDI(ab) = (0.6-0.23) \cdot 2^{6-2} = 5.92$;

For transaction T5, the contained MiFIs are {d}, {ab}, {af}, {bc}, {fb} and {fc}, thus, TDI(T5)=(3.2+5.92+3.2+4.64+8+ 4.32)+136.67=165.95.


4.4.1 An example of MiFI-Outlier approach

This subsection uses the example listed in Table 2 to illustrate the proposed MiFI-Outlier approach more clearly. The mined minimal infrequent itemsets and their support values in this example are {d}:0.5, {af}:0.4, {ab}:0.23, {ef}:0.29, {eb}:0.27, {fb}:0.1, {fc}:0.33, {bc}:0.31 and {aec}:0.402, the min_sup value is 0.6 and the number of different items (recorded as k) is 6.

Step 1. Calculate the MiFIDI value for each MiFI.


For itemset {ef}, $MiFIDI(ef) = (0.6-0.29) \cdot 2^{6-2} = 4.96$;

For itemset {eb}, $MiFIDI(eb) = (0.6-0.27) \cdot 2^{6-2} = 5.28$;

For itemset {fb}, $MiFIDI(fb) = (0.6-0.1) \cdot 2^{6-2} = 8$;

For itemset {fc}, $MiFIDI(fc) = (0.6-0.33) \cdot 2^{6-2} = 4.32$;

For itemset {bc}, $MiFIDI(bc) = (0.6-0.31) \cdot 2^{6-2} = 4.64$;

For itemset {aec}, $MiFIDI(aec) = (0.6-0.402) \cdot 2^{6-3} = 1.584$.

Step 2. Determine the similar parts for each transaction and calculate its SDI value. For transaction T1, the similar parts are {d}:0.2, {af}:0.16, {a}:0.8, {ef}:0.08, {e}:0.4, {f}:0.2, {f}:0.2 and {ae}:0.32, thus, SDI(T1)=1/0.2+2/0.16+1/0.8+2/0.08+1/0.4+1/0.2+1/0.2+2/0.32= 62.5;


For transaction T2, the similar parts are {af}:0.12, {ab}:0.06, {ef}:0.14, {eb}:0.07, {fb}:0.02, {fc}:0.18, {bc}:0.09 and {aec}:0.378, thus, SDI(T2)=2/0.12+2/0.06+2/0.14+2/0.07+2/0.02 +2/0.18+2/0.09+3/0.378=234.13; For transaction T3, the similar parts are {a}:0.3, {ab}:0.09, {e}:0.2, {eb}:0.06, {b}:0.3, {c}:0.4, {bc}:0.12 and {aec}:0.024, thus, SDI(T3)=1/0.3+2/0.09+1/0.2+2/0.06+1/0.3+1/0.4+2/0.12+3/ 0.024=211.39; For transaction T4, the similar parts are {d}:0.2, {f}:0.1, {b}:0.2, {ef}:0.07, {eb}:0.14, {fb}:0.02, {f}:0.1, {b}:0.2 and {e}:0.7, thus, SDI(T4)=1/0.2+1/0.1+1/0.2+2/0.07+2/0.14+2/0.02+ 1/0.1+1/0.2+1/0.7=179.29;

Table 3 Characteristics of used datasets

Datasets           Num. of trans.  Num. of items  Avg. trans. size  Data size
mushroom           8124            120            23                0.545MB
kosarak            990002          41270          8.1               31.4MB
pumsb*             49046           2113           74                15.9MB
chess              3196            75             37                901KB
lymphography       148             58             18                6KB
synthetic dataset  1200            76             7.5               55KB

To evaluate the proposed MiFI-Outlier approach, four itemset-based outlier detection methods, FindFPOF [16], MIFPOD [17], OODFP [25] and FIM-UDSOD [14], are used as the compared approaches. In addition, the distance-based method DPA [36] is also compared in this experiment. The experiment is conducted under different min_sup values and different sizes of sliding window (the ratio of the min_sup value to the size of sliding window is set to 33%, 34%, 35%, 36%, 37% and 38% respectively) to discover the implicit outliers from the uncertain data stream.
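The min_sup values used in the experiments follow directly from these ratios. The sketch below derives them and also splits a stream into consecutive windows. Both function names are hypothetical, and treating the windows as non-overlapping is an assumption, suggested by the 1200 synthetic transactions yielding thirty windows of size 40 in the reported results.

```python
def min_sup_grid(window_size, percentages=(33, 34, 35, 36, 37, 38)):
    """min_sup values obtained as a percentage of the sliding-window size."""
    return [pct * window_size / 100 for pct in percentages]


def consecutive_windows(stream, window_size):
    """Split a list of transactions into consecutive, non-overlapping windows."""
    return [stream[i:i + window_size] for i in range(0, len(stream), window_size)]


for size in (30, 40, 50):
    print(size, min_sup_grid(size))

# A stand-in stream with as many transactions as the synthetic dataset.
stream = list(range(1200))
print(len(consecutive_windows(stream, 40)))
```

For a window of 30 transactions this gives 9.9 to 11.4, for 40 it gives 13.2 to 15.2, and for 50 it gives 16.5 to 19.0, matching the values used in the tables and figures of this section.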

1 https://archive.ics.uci.edu/ml/datasets.html
2 http://fimi.cs.helsinki.fi/data/


To test the efficiency of minimal infrequent itemset mining, the time cost and memory usage of the proposed MiFI-UDSM and MiFI-UDSM* approaches are tested under different min_sup values and different sizes of sliding window. To demonstrate the effectiveness of the two proposed approaches, the Apriori-based method MRG-Exp [32], the matrix-based method MIP-DS [17], and the FP-Growth-based methods DSUF-min [40] and RP-Tree [13] are used as the compared methods. Among these, MRG-Exp and MIP-DS mine minimal infrequent itemsets from precise datasets, RP-Tree mines infrequent itemsets from precise datasets, and DSUF-min mines frequent itemsets from uncertain data streams. To compare the efficiency of the proposed approaches more directly, we have modified the four compared methods so that they accurately mine the minimal infrequent itemsets from uncertain data streams.

The experiments in this section were run on a machine running Windows 10 with an Intel dual-core i3-2020 3.30 GHz processor and 8 GB of RAM. All code used in the experiments was implemented in Python.

5.1. Detection accuracy of MiFI-Outlier approach on synthetic dataset

This subsection tests the detection accuracy of the proposed MiFI-Outlier approach on the synthetic dataset, where the experiments are conducted under different min_sup values and different sizes of sliding window. Firstly, the specific experimental results of the first three windows are shown in Table 4, where 'No. SW' is the index of the sliding window and '|outliers|' is the number of true outliers in that window. In each cell, the value in parentheses is the number of true outliers detected and the value outside the parentheses is the number of transactions selected by the MiFI-Outlier approach; the final entry for a window corresponds to the point at which all true outliers have been detected. The closer the number of selected transactions is to the number of outliers, the higher the detection accuracy; erroneous detections are marked in red and bold in the typeset table. Secondly, the experimental results of the six compared approaches under different min_sup values and different sizes of sliding window are shown in Fig. 5 to Fig. 7 respectively. In these figures, the "Detection accuracy (%)" index (on the y-axis) measures the detection accuracy of each approach (Detection accuracy = n1/n2, where n1 is the number of true outliers and n2 is the number of transactions selected when all true outliers have been detected), while the "No. of sliding window" (on the x-axis) gives the index of the sliding window.

Table 4 Detection accuracy of MiFI-Outlier approach on the synthetic dataset

Size of sliding window = 30
No. SW (|outliers|)  min_sup=9.9  min_sup=10.2  min_sup=10.5  min_sup=10.8  min_sup=11.1  min_sup=11.4
1(7)                 3(3)         3(3)          3(3)          3(3)          3(3)          3(3)
                     7(5)         7(6)          7(7)          7(7)          7(7)          7(7)
                     10(7)        9(7)
2(7)                 3(3)         3(3)          3(3)          3(3)          3(3)          3(3)
                     7(7)         7(7)          7(7)          7(7)          7(7)          7(7)
3(7)                 3(3)         3(3)          3(3)          3(3)          3(3)          3(3)
                     7(7)         7(7)          7(7)          7(7)          7(7)          7(7)

Size of sliding window = 40
No. SW (|outliers|)  min_sup=13.2  min_sup=13.6  min_sup=14.0  min_sup=14.4  min_sup=14.8  min_sup=15.2
1(11)                5(5)          5(5)          5(5)          5(5)          5(5)          5(5)
                     11(11)        11(11)        11(11)        11(11)        11(11)        11(11)
2(9)                 5(4)          5(5)          5(5)          5(5)          5(5)          5(5)
                     9(7)          9(9)          9(9)          9(9)          9(9)          9(9)
                     12(9)
3(7)                 3(3)          3(3)          3(3)          3(3)          3(3)          3(3)
                     7(7)          7(7)          7(7)          7(7)          7(7)          7(7)

Size of sliding window = 50
No. SW (|outliers|)  min_sup=16.5  min_sup=17.0  min_sup=17.5  min_sup=18.0  min_sup=18.5  min_sup=19.0
1(13)                6(6)          6(6)          6(6)          6(6)          6(6)          6(6)
                     13(13)        13(13)        13(13)        13(13)        13(13)        13(13)
2(10)                5(5)          5(5)          5(5)          5(5)          5(5)          5(5)
                     10(10)        10(10)        10(10)        10(10)        10(10)        10(10)
3(10)                5(5)          5(5)          5(5)          5(5)          5(5)          5(5)
                     10(10)        10(10)        10(10)        10(10)        10(10)        10(10)

It can be seen from Table 4 that when the size of sliding window is 30, erroneous detections appear in the first sliding window when the min_sup value is set to 9.9 and 10.2. Specifically, when the min_sup value is set to 9.9, ten transactions need to be selected before all seven outliers are detected, while nine transactions need to be selected when the min_sup value is set to 10.2. When the min_sup value is not less than 10.5, the detection accuracy of the MiFI-Outlier approach reaches 100% in the first window. In the second and third windows, the detection accuracy of the MiFI-Outlier approach reaches 100% regardless of the min_sup value. When the size of sliding window is 40, the detection accuracy of the MiFI-Outlier approach reaches 100% in the first and third windows regardless of the min_sup value, but in the second sliding window it does not reach 100% when the min_sup value is set to 13.2; that is, twelve transactions need to be selected before all nine outliers are detected. When the size of sliding window is 50, the detection accuracy of the MiFI-Outlier approach reaches 100% in all of the first three windows regardless of the min_sup value. Overall, the detection accuracy of the proposed MiFI-Outlier approach increases with the min_sup value: the approach can be used to detect the implicit outliers from an uncertain data stream when the min_sup value is set relatively large, but it is not suitable for discovering outliers when the min_sup value is set very small. The reason is that at small min_sup values the number of mined minimal infrequent itemsets is very limited, so the designed deviation indexes cannot play their full role in the outlier detection process.
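The accuracy measure used in Table 4 and in Fig. 5 to Fig. 7 (n1 divided by n2) can be sketched as below. The function names, transaction identifiers and ranking are hypothetical; the ranking stands in for transactions sorted by decreasing deviation index, which is an assumption about how the selection order is produced.

```python
def transactions_needed(ranked_ids, true_outliers):
    """Number of top-ranked transactions that must be selected to cover all true outliers (n2)."""
    remaining = set(true_outliers)
    for n, tid in enumerate(ranked_ids, start=1):
        remaining.discard(tid)
        if not remaining:
            return n
    raise ValueError("ranking does not contain every true outlier")


def detection_accuracy(n_true, n_selected):
    """Detection accuracy in percent: n1 / n2 * 100."""
    return 100.0 * n_true / n_selected


# Hypothetical window: five ranked transactions, three of which are true outliers.
ranked = ["t3", "t9", "t1", "t7", "t4"]
outliers = {"t3", "t1", "t4"}

n2 = transactions_needed(ranked, outliers)
print(n2, detection_accuracy(len(outliers), n2))

# First window of Table 4 at min_sup=9.9: seven outliers, ten transactions selected.
print(detection_accuracy(7, 10))
```

Under this measure a window reaches 100% only when every selected transaction is a true outlier, which is why the 10(7) entry in Table 4 counts as an erroneous detection.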

Fig. 5. Detection accuracy of MiFI-Outlier approach when |SW| is 30. Panels: (a) min_sup=9.9, (b) min_sup=10.2, (c) min_sup=10.5, (d) min_sup=10.8, (e) min_sup=11.1, (f) min_sup=11.4. x-axis: No. of sliding window; y-axis: Detection accuracy (%). Compared approaches: MiFI-Outlier, OODFP, FindFPOF, MIFPOD, FIM-UDSOD, DPA.

Fig. 6. Detection accuracy of MiFI-Outlier approach when |SW| is 40. Panels: (a) min_sup=13.2, (b) min_sup=13.6, (c) min_sup=14.0, (d) min_sup=14.4, (e) min_sup=14.8, (f) min_sup=15.2. x-axis: No. of sliding window; y-axis: Detection accuracy (%). Compared approaches: MiFI-Outlier, OODFP, FindFPOF, MIFPOD, FIM-UDSOD, DPA.

Fig. 7. Detection accuracy of MiFI-Outlier approach when |SW| is 50. Panels: (a) min_sup=16.5, (b) min_sup=17.0, (c) min_sup=17.5, (d) min_sup=18.0, (e) min_sup=18.5, (f) min_sup=19.0. x-axis: No. of sliding window; y-axis: Detection accuracy (%). Compared approaches: MiFI-Outlier, OODFP, FindFPOF, MIFPOD, FIM-UDSOD, DPA.

When the size of sliding window is set to 30 and the min_sup value is selected from {9.9, 10.2, 10.5, 10.8, 11.1, 11.4}, the detection accuracy of the proposed MiFI-Outlier approach against MIFPOD, FindFPOF, OODFP, FIM-UDSOD and DPA is shown in Fig. 5. When the min_sup value is set to 9.9, some erroneous detections appear in the proposed MiFI-Outlier approach, but its detection accuracy is always higher than that of the other five compared approaches. Among the compared approaches, the detection accuracy of the distance-based outlier detection approach DPA is higher than that of the FindFPOF and OODFP approaches, but it is slightly lower than that of the MIFPOD approach in a few windows; in addition, the detection accuracy of the DPA approach does not change with the min_sup value. When the min_sup value is set to 10.2, erroneous detections of the MiFI-Outlier approach appear in the 1st, 13th, 16th and 23rd sliding windows; when the min_sup value is set to 10.5, they appear only in the 13th and 23rd sliding windows; and when the min_sup value is set to 10.8, 11.1 and 11.4, the detection accuracy of the proposed MiFI-Outlier approach reaches 100% in all sliding windows. That is, the detection accuracy of the MiFI-Outlier approach increases with the min_sup value.

In addition, as the min_sup value increases, the detection accuracy of the MIFPOD approach also increases slightly, while the detection accuracy of the FindFPOF, OODFP and FIM-UDSOD approaches decreases. The reason is that at large min_sup values the scale of mined minimal infrequent itemsets for the MiFI-Outlier and MIFPOD approaches is much larger than at small min_sup values, so more itemsets can be used in the process of outlier detection, which improves the detection accuracy. These results indicate that the proposed MiFI-Outlier approach can accurately detect the implicit outliers from an uncertain data stream at relatively large min_sup values. In contrast, at large min_sup values the scale of mined frequent itemsets or maximal frequent itemsets for the FindFPOF, OODFP and FIM-UDSOD approaches is much smaller than at small min_sup values, so fewer itemsets can be used in the process of outlier detection, which reduces the detection accuracy of these approaches.

When the size of sliding window is set to 40, the experimental results of the six compared approaches are shown in Fig. 6(a) to Fig. 6(f). As can be seen from Fig. 6(a), when the min_sup value is set to 13.2, erroneous detections appear in six of the thirty sliding windows; when the min_sup value is set to 13.6, the number of windows with erroneous detections decreases to three; and when the min_sup value is set to 14.0, 14.4, 14.8 and 15.2, the detection accuracy of the MiFI-Outlier approach reaches 100% in all sliding windows. Similar to the case in which the size of sliding window is set to 30, the experimental results indicate that the detection accuracy of the proposed MiFI-Outlier approach gradually increases with the min_sup value. This is because, when the min_sup value is set relatively large, the scale of minimal infrequent itemsets used in the outlier detection phase changes significantly, which in turn influences the detection accuracy. For the other five compared approaches, the detection accuracy of the MIFPOD and DPA approaches is higher than that of the OODFP, FindFPOF and FIM-UDSOD approaches in most sliding windows, while the detection accuracy of the OODFP, FindFPOF and FIM-UDSOD approaches is very close regardless of the min_sup value.

Fig. 7(a) to Fig. 7(f) show the detection accuracy of the six compared approaches when the size of sliding window is set to 50, where the min_sup value is selected from {16.5, 17.0, 17.5, 18.0, 18.5, 19.0}. When the min_sup value is set to 16.5, erroneous detections appear in only two windows, and when the min_sup value is not less than 17.0, the detection accuracy of the proposed MiFI-Outlier approach reaches 100% in all sliding windows. Among the compared approaches, the stability of the detection accuracy of the MIFPOD approach is very poor, while the stability of the OODFP, FindFPOF and FIM-UDSOD approaches is much better than that of the MIFPOD approach. As the min_sup value increases, the detection accuracy of the MIFPOD approach improves noticeably, while the detection accuracy of the FindFPOF, OODFP and FIM-UDSOD approaches remains very stable.

Overall, the proposed MiFI-Outlier approach is more sensitive to the min_sup value than to the size of sliding window, and its detection accuracy is much higher at large min_sup values. When the ratio of the min_sup value to the size of sliding window is not less than 36%, the detection accuracy of the proposed MiFI-Outlier approach reaches 100% in almost all windows, and the outlier detection ability of the MiFI-Outlier approach is much stronger than that of the distance-based approach DPA and the existing itemset-based approaches FindFPOF, OODFP, MIFPOD and FIM-UDSOD.

5.2. Detection accuracy of MiFI-Outlier approach on public dataset

This subsection tests the detection accuracy of the six compared approaches on a public dataset. The experiment is again conducted under different min_sup values, and the experimental result is shown in Table 5, where the number outside the brackets is the number of transactions selected when all outliers are detected, and the number in the brackets is the detection accuracy.


Table 5 Detection accuracy of MiFI-Outlier approach on public dataset lymphography

|SW| (min_sup)  MiFI-Outlier  OODFP        FindFPOF     MIFPOD        FIM-UDSOD    DPA
148 (75.48)     11 (54.55%)   60 (10%)     60 (10%)     96 (6.25%)    60 (10%)     23 (26.09%)
148 (76.96)     11 (54.55%)   60 (10%)     60 (10%)     101 (5.94%)   60 (10%)     23 (26.09%)
148 (78.44)     9 (66.67%)    139 (4.32%)  139 (4.32%)  42 (14.29%)   139 (4.32%)  23 (26.09%)
148 (79.92)     8 (75%)       139 (4.32%)  139 (4.32%)  11 (54.55%)   139 (4.32%)  23 (26.09%)
148 (81.4)      8 (75%)       139 (4.32%)  139 (4.32%)  12 (50%)      139 (4.32%)  23 (26.09%)
148 (82.88)     7 (85.71%)    138 (4.35%)  138 (4.35%)  8 (75%)       138 (4.35%)  23 (26.09%)
148 (84.36)     7 (85.71%)    138 (4.35%)  138 (4.35%)  8 (75%)       138 (4.35%)  23 (26.09%)

It can be seen from Table 5 that the detection accuracy of the proposed MiFI-Outlier approach is much higher than that of the other five compared approaches, while the OODFP, FindFPOF and FIM-UDSOD approaches have the same detection accuracy as one another at every min_sup value (75.48, 76.96, 78.44, 79.92, 81.4, 82.88 and 84.36). As the min_sup value increases, the detection accuracy of the proposed MiFI-Outlier approach increases, because the scale of minimal infrequent itemsets is relatively large when the min_sup value is set large. For the distance-based outlier detection approach DPA, the detection accuracy is not influenced by the min_sup value; it is higher than that of the OODFP, FindFPOF and FIM-UDSOD approaches at every min_sup value, and higher than that of the MIFPOD approach at the three smallest min_sup values, but always lower than that of the proposed MiFI-Outlier approach. The experimental results show that the proposed MiFI-Outlier approach can accurately discover the implicit outliers from public datasets, and that a larger min_sup value improves the detection accuracy.

5.3. Time cost of MiFI-Outlier approach on synthetic dataset

In this subsection, the time cost of the six compared approaches on the synthetic dataset is evaluated under different min_sup values and different sizes of sliding window, and the experimental results are shown in Fig. 8 to Fig. 10.

It can be seen from Fig. 8 that when |SW| is set to 30, the time cost of the proposed MiFI-Outlier approach is always the lowest among the six compared approaches, while the time cost of the DPA approach is the highest regardless of the min_sup value. When the size of sliding window is constant, the time cost of the five itemset-based approaches (that is, all except DPA) decreases as the min_sup value increases; the decrease is very large for the FindFPOF approach from small to relatively large min_sup values, while the decrease for the MiFI-Outlier approach is very limited. The reason is that at small min_sup values the scale of frequent 1-itemsets is relatively large, so the time cost of the "pattern extension" process is also very long; as the min_sup value increases, the scale of frequent 1-itemsets becomes much smaller, so the time cost of the "pattern extension" process becomes shorter (the itemset mining process is the most time-consuming phase of the MiFI-Outlier approach). In addition, the time cost of the proposed MiFI-Outlier approach is very stable across sliding windows, and as the min_sup value increases, the time cost of the MIFPOD approach becomes the second lowest.

Fig. 8. Time cost of MiFI-Outlier approach when |SW| is 30. Panels: (a) min_sup=9.9, (b) min_sup=10.2, (c) min_sup=10.5, (d) min_sup=10.8, (e) min_sup=11.1, (f) min_sup=11.4. x-axis: No. of sliding window; y-axis: Time cost (Sec.). Compared approaches: MiFI-Outlier, OODFP, FindFPOF, MIFPOD, FIM-UDSOD, DPA.

It can be seen from Fig. 9 that when |SW| is set to 40, the time cost of the proposed MiFI-Outlier approach is also the lowest among the six compared approaches, and the time cost of the distance-based outlier detection approach DPA is much higher than that of the itemset-based outlier detection approaches (FindFPOF, OODFP, MIFPOD, FIM-UDSOD and MiFI-Outlier). For DPA, the time cost is not influenced by the min_sup value, but it is influenced by the size of sliding window: a larger sliding window contains more transactions, so calculating the distance of each transaction takes much longer. When the min_sup value is set slightly small, the time cost of the MIFPOD, OODFP and FIM-UDSOD approaches is very close, but as the min_sup value increases, the time cost of the MIFPOD approach becomes much lower than that of the other two approaches. The reason is that the matrix structure avoids much meaningless time cost in the "pattern extension" process, so the reduction in time cost is relatively larger. Similarly, when the size of sliding window is constant, the time cost of the five itemset-based approaches (all except DPA) decreases slightly as the min_sup value increases, and the reduction is largest for the FindFPOF approach.

Fig. 9. Time cost of MiFI-Outlier approach when |SW| is 40. Panels: (a) min_sup=13.2, (b) min_sup=13.6, (c) min_sup=14.0, (d) min_sup=14.4, (e) min_sup=14.8, (f) min_sup=15.2. x-axis: No. of sliding window; y-axis: Time cost (Sec.). Compared approaches: MiFI-Outlier, OODFP, FindFPOF, MIFPOD, FIM-UDSOD, DPA.

Fig. 10(a) to Fig. 10(f) show the time cost of the six compared outlier detection approaches when the min_sup value is set to 16.5, 17.0, 17.5, 18.0, 18.5 and 19.0 respectively, where the size of sliding window is set to 50. It can be seen from Fig. 10 that the time cost of the MiFI-Outlier approach is again much lower than that of the other five compared approaches, and the time cost of the OODFP, MIFPOD and FIM-UDSOD approaches is very close regardless of the min_sup value. With the use of the "item cap" and "support cap" concepts, the time cost of the minimal infrequent itemset mining operations is much reduced compared with the traditional Apriori-based and FP-Growth-based approaches; thus, the time cost of the MiFI-Outlier approach is the lowest.

Fig. 10. Time cost of MiFI-Outlier approach when |SW| is 50. Panels: (a) min_sup=16.5, (b) min_sup=17.0, (c) min_sup=17.5, (d) min_sup=18.0, (e) min_sup=18.5, (f) min_sup=19.0. x-axis: No. of sliding window; y-axis: Time cost (Sec.). Compared approaches: MiFI-Outlier, OODFP, FindFPOF, MIFPOD, FIM-UDSOD, DPA.

In general, the time cost of the proposed MiFI-Outlier approach is lower than that of the other four itemset-based outlier detection approaches, and its time efficiency is also much higher than that of the distance-based outlier detection approach DPA. The experimental results indicate that, owing to its high time efficiency, the MiFI-Outlier approach can be used to discover the implicit outliers from an uncertain data stream.

5.4. Time cost of minimal infrequent itemset mining approaches

For the proposed minimal infrequent itemset-based outlier detection approach, minimal infrequent itemset mining is the most time-consuming phase of the whole detection process. Thus, this subsection tests the time cost of our two proposed minimal infrequent itemset mining approaches, namely MiFI-UDSM and MiFI-UDSM*, thereby verifying the efficiency of the proposed outlier detection approach. In this experiment, four public datasets, comprising three dense datasets (mushroom, pumsb* and chess) and one sparse dataset (kosarak), are used as the target datasets. The experiment is conducted under different sizes of sliding window and different min_sup values, where the sizes of sliding window are set to 10, 20, 30, 40, 50, 60, 70, 80, 90 and 100 respectively, and the min_sup values are set to 3, 4 and 5. The experimental results are shown in Fig. 11 to Fig. 14.
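The experimental grid for the mining experiments (datasets, window sizes and min_sup values) can be enumerated as in the sketch below. The variable names are illustrative and the loop body is left as a placeholder, since the way each configuration is actually run is not described in this section.

```python
from itertools import product

# Values taken from the experimental setup described above.
datasets = ["mushroom", "pumsb*", "chess", "kosarak"]
window_sizes = list(range(10, 101, 10))
min_sup_values = [3, 4, 5]

# One configuration per (dataset, window size, min_sup) combination.
configurations = list(product(datasets, window_sizes, min_sup_values))

for dataset, window_size, min_sup in configurations[:3]:
    # Placeholder: a real run would mine each window of `dataset` here.
    print(dataset, window_size, min_sup)

print(len(configurations))
```

Each of the four time-cost figures then corresponds to one dataset, each panel to one min_sup value, and each point on the x-axis to one window size.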

Fig. 11. Time cost of minimal infrequent itemset mining phase on dataset mushroom. Panels: (a) min_sup=3, (b) min_sup=4, (c) min_sup=5. x-axis: Size of sliding window; y-axis: Time cost (Sec.). Compared approaches: MIP-DS, MiFI-UDSM, MiFI-UDSM*, DSUF-min, MRG-Exp, RP-Tree.

Fig. 12. Time cost of minimal infrequent itemset mining phase on dataset pumsb*. Panels: (a) min_sup=3, (b) min_sup=4, (c) min_sup=5. x-axis: Size of sliding window; y-axis: Time cost (Sec.). Compared approaches: MIP-DS, MiFI-UDSM, MiFI-UDSM*, DSUF-min, MRG-Exp, RP-Tree.

Fig. 13. Time cost of minimal infrequent itemset mining phase on dataset chess. Panels: (a) min_sup=3, (b) min_sup=4, (c) min_sup=5. x-axis: Size of sliding window; y-axis: Time cost (Sec.). Compared approaches: MIP-DS, MiFI-UDSM, MiFI-UDSM*, DSUF-min, MRG-Exp, RP-Tree.

Fig. 14. Time cost of minimal infrequent itemset mining phase on dataset kosarak. Panels: (a) min_sup=3, (b) min_sup=4, (c) min_sup=5. x-axis: Size of sliding window; y-axis: Time cost (Sec.). Compared approaches: MIP-DS, MiFI-UDSM, MiFI-UDSM*, DSUF-min, MRG-Exp, RP-Tree.

It can be seen from Fig. 11 to Fig. 13 that on the dense datasets mushroom, pumsb* and chess, the time cost of our proposed MiFI-UDSM and MiFI-UDSM* approaches is lower than that of the other four compared approaches under different sizes of sliding window and different min_sup values, while the time cost of the MRG-Exp approach is the highest in most windows.

Compared with the MiFI-UDSM approach, the time cost of the MiFI-UDSM* approach is much lower, because the use of "item cap" and "support cap" avoids much meaningless time cost in the "pattern extension" operations and the support value calculation operations; thus, the time cost of the minimal infrequent itemset mining operations is also reduced. For a constant size of sliding window, the time cost of the six compared approaches decreases as the min_sup value increases. The reason is that the scale of potential frequent itemsets at large min_sup values is relatively smaller than at small min_sup values, so the time cost of the "pattern extension" process is also reduced. For a constant min_sup value, the time cost of the six compared approaches increases with the size of sliding window. The reason is that in large windows the scale of potential frequent itemsets is larger than in small windows, so much more time is consumed by the "pattern extension" operations. Compared with the Apriori-based approach (MRG-Exp), the use of a matrix structure (in MIP-DS, MiFI-UDSM and MiFI-UDSM*) reduces the time cost of the minimal infrequent itemset mining operations, because the matrix allows the subsequent itemset mining operations to be conducted on the constructed matrix structure without scanning the dataset multiple times. The experimental results on dense datasets show that the mining efficiency of the proposed MiFI-UDSM* approach is very competitive with the MRG-Exp, RP-Tree, MIP-DS and DSUF-min approaches; thus, it can be used in outlier detection for detecting the implicit outliers from dense uncertain data streams.

It can be seen from Fig. 14 that on the sparse dataset kosarak, the time cost of the proposed MiFI-UDSM and MiFI-UDSM* approaches is lower than that of the other four compared approaches. In addition, the time cost of the MiFI-UDSM* approach is also much lower than that of the MiFI-UDSM approach. The reason is that, before the frequent 1-itemsets are extended to 2-itemsets, the use of "item cap" and "support cap" has deleted many safe infrequent 1-itemsets, so the number of frequent 1-itemsets that enter the "pattern extension" operations is much reduced, which saves considerable time in the whole itemset mining process. On the sparse dataset kosarak, the time cost of the FP-Growth-based RP-Tree approach is slightly higher than that of the MRG-Exp approach under different sizes of sliding window and different min_sup values. When the size of sliding window is constant, the time cost of the six compared approaches decreases as the min_sup value increases, but the decrease is not large. This is because, although a large sliding window contains more frequent 1-itemsets than a small one and therefore requires more time for the "pattern extension" operation, on sparse datasets the scale of frequent 1-itemsets is not very large in itself, so the number of frequent 1-itemsets changes only slightly as the min_sup value increases. When the min_sup value is constant, the time cost of the six compared approaches increases with the size of sliding window, because the scale of frequent 1-itemsets is much larger in large windows. Similarly, the use of "item cap" and "support cap" on the sparse dataset also has a positive impact on the minimal infrequent itemset mining process. Thus, the proposed MiFI-UDSM* approach can also be used in outlier detection for detecting the implicit outliers from sparse uncertain data streams.

5.5. Memory usage of minimal infrequent itemset mining approaches

In addition to the time cost, memory usage is another important index of the efficiency of minimal infrequent itemset mining. Thus, this subsection tests the memory usage of the proposed MiFI-UDSM and MiFI-UDSM* approaches under different sizes of sliding window and different min_sup values. The experimental results are shown in Fig. 15 to Fig. 18.

[Figure: three panels, (a) min_sup=3, (b) min_sup=4, (c) min_sup=5, each plotting memory usage (MB) against size of sliding window for MIP-DS, MiFI-UDSM, MiFI-UDSM*, DSUF-min, MRG-Exp and RP-Tree]
Fig. 15. Memory usage of minimal infrequent itemset mining phase on dataset mushroom

[Figure: three panels, (a) min_sup=3, (b) min_sup=4, (c) min_sup=5, each plotting memory usage (MB) against size of sliding window for MIP-DS, MiFI-UDSM, MiFI-UDSM*, DSUF-min, MRG-Exp and RP-Tree]
Fig. 16. Memory usage of minimal infrequent itemset mining phase on dataset pumsb*


[Figure: three panels, (a) min_sup=3, (b) min_sup=4, (c) min_sup=5, each plotting memory usage (MB) against size of sliding window for MIP-DS, MiFI-UDSM, MiFI-UDSM*, DSUF-min, MRG-Exp and RP-Tree]
Fig. 17. Memory usage of minimal infrequent itemset mining phase on dataset chess

[Figure: three panels, (a) min_sup=3, (b) min_sup=4, (c) min_sup=5, each plotting memory usage (MB) against size of sliding window for MIP-DS, MiFI-UDSM, MiFI-UDSM*, DSUF-min, MRG-Exp and RP-Tree]
Fig. 18. Memory usage of minimal infrequent itemset mining phase on dataset kosarak

It can be seen from Fig. 18 that on the sparse dataset kosarak, the memory usage of the proposed MiFI-UDSM approach is also the lowest among the six compared approaches, while the memory usage of the RP-Tree approach is slightly larger than that of the other five approaches in most sliding windows. The reason is that on sparse datasets the set of frequent 1-itemsets is relatively small unless the min_sup value is set particularly low, so the time cost of the infrequent itemset mining operations is also very limited. When min_sup is set to 3, the largest memory usage of the six compared approaches reaches 450 MB (Fig. 18(a)); when min_sup is set to 4, it reaches 430 MB (Fig. 18(b)); and when min_sup is set to 5, it reaches 400 MB (Fig. 18(c)). That is, on the sparse dataset kosarak, increasing min_sup reduces the memory usage of the six compared approaches only to a very limited extent, because the set of frequent 1-itemsets is already very small when min_sup is set to 3. When the size of sliding window is constant, the memory usage of the proposed MiFI-UDSM and MiFI-UDSM* approaches decreases slightly as min_sup increases; when the min_sup value is constant, the memory usage of the proposed approaches increases with the size of sliding window. The experimental results also verify that the use of “item cap” and “support cap” can reduce the memory usage when mining minimal infrequent itemsets from sparse datasets; thus, the proposed MiFI-UDSM* approach can be used to quickly and accurately detect implicit outliers in sparse uncertain data streams.


It can be seen from Fig. 15 that on the dataset mushroom, when min_sup is set to 3, the memory usage of the proposed MiFI-UDSM and MiFI-UDSM* approaches is slightly lower than that of the MRG-Exp and MIP-DS approaches and much lower than that of the DSUF-min and RP-Tree approaches. When min_sup is set to 4 or 5, the memory usage of the MiFI-UDSM, MRG-Exp, DSUF-min and MIP-DS approaches is very close, the memory usage of the MiFI-UDSM* approach is lower than that of these approaches, and the memory usage of RP-Tree is much higher. When the min_sup value is constant, the memory usage of the six compared approaches increases with the size of sliding window, but the rate of increase slows down. When the size of sliding window is constant, the memory usage of the six compared approaches decreases as min_sup increases, because the number of frequent itemsets decreases as min_sup increases.
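Both trends trace back to how many items clear the support threshold in a window. The following is a minimal sketch, not the authors' implementation: it assumes that the support of an item in an uncertain window is the sum of its existence probabilities (expected support), and the toy stream, the item alphabet and the function name `frequent_1_itemsets` are all hypothetical. It counts the frequent 1-itemsets, and the number of 2-itemset candidates they could be extended into, for several window sizes and min_sup values.

```python
import random
from math import comb

random.seed(0)  # fixed seed so the toy stream is reproducible
ITEMS = list("abcdefghij")  # hypothetical item alphabet


def random_transaction():
    """Hypothetical uncertain transaction: item -> existence probability."""
    chosen = random.sample(ITEMS, k=random.randint(2, 5))
    return {item: round(random.uniform(0.1, 1.0), 2) for item in chosen}


stream = [random_transaction() for _ in range(100)]


def frequent_1_itemsets(window, min_sup):
    """Items whose expected support (sum of probabilities) reaches min_sup."""
    totals = {}
    for transaction in window:
        for item, prob in transaction.items():
            totals[item] = totals.get(item, 0.0) + prob
    return [item for item, support in totals.items() if support >= min_sup]


for size in (20, 50, 100):
    window = stream[-size:]  # the most recent `size` transactions
    for min_sup in (3, 4, 5):
        n = len(frequent_1_itemsets(window, min_sup))
        # comb(n, 2) is the number of 2-itemsets that n frequent items can form
        print(f"window={size} min_sup={min_sup} frequent_1={n} candidates_2={comb(n, 2)}")
```

Under these assumptions the count can only shrink or stay equal as min_sup grows, and can only grow or stay equal as the window grows, which is the direction of the memory trends reported for the six approaches.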


It can be seen from Fig. 16 and Fig. 17 that on the dense datasets pumsb* and chess, the memory usage of the MiFI-UDSM* approach is the lowest and that of the MiFI-UDSM approach is the second lowest, while the memory usage of the RP-Tree and DSUF-min approaches is much higher than that of the other four compared approaches. When the min_sup value is constant, the memory usage of the six compared approaches increases with the size of sliding window. The reason is that small sliding windows contain fewer frequent 1-itemsets than large ones, so fewer itemsets are extended and the “pattern extension” process uses very little memory. When the size of sliding window is constant, the memory usage of the six compared approaches decreases as min_sup increases, because the number of frequent 1-itemsets decreases as min_sup increases. The experimental results indicate that the use of “item cap” and “support cap” can reduce the memory usage of the minimal infrequent itemset mining process, because the safe infrequent itemsets are discarded directly, which avoids meaningless “pattern extension” operations.
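To make the “pattern extension” reasoning concrete, the sketch below mines minimal infrequent itemsets level by level from a probability matrix that is built with a single scan of the window. It is an illustration under stated assumptions rather than MiFI-UDSM itself: support is taken to be expected support with independent item probabilities, the “item cap” and “support cap” pruning is not modelled, and the toy window, the threshold of 1.0 and all function names are hypothetical.

```python
from itertools import combinations
from math import prod

# Hypothetical sliding window of uncertain transactions: item -> existence probability.
window = [
    {"a": 0.9, "b": 0.8, "c": 0.5},
    {"a": 0.7, "b": 0.6},
    {"a": 0.8, "c": 0.4, "d": 0.9},
    {"b": 0.9, "d": 0.2},
]


def build_matrix(window):
    """Scan the window once; rows are transactions, columns are items."""
    items = sorted({item for transaction in window for item in transaction})
    matrix = [[transaction.get(item, 0.0) for item in items] for transaction in window]
    return items, matrix


def expected_support(itemset, items, matrix):
    """Sum over rows of the product of the itemset's column entries."""
    cols = [items.index(item) for item in itemset]
    return sum(prod(row[c] for c in cols) for row in matrix)


def minimal_infrequent(items, matrix, min_sup):
    """An infrequent itemset is minimal when all its proper subsets are frequent,
    so only frequent itemsets are ever extended to the next level."""
    frequent = {frozenset()}  # the empty itemset counts as frequent
    result = []
    for k in range(1, len(items) + 1):
        next_frequent = set()
        for cand in combinations(items, k):
            # Skip candidates that have an infrequent (k-1)-subset.
            if not all(frozenset(s) in frequent for s in combinations(cand, k - 1)):
                continue
            if expected_support(cand, items, matrix) >= min_sup:
                next_frequent.add(frozenset(cand))
            else:
                result.append(cand)
        if not next_frequent:
            break  # nothing left to extend
        frequent = next_frequent
    return result


items, matrix = build_matrix(window)
mii = minimal_infrequent(items, matrix, min_sup=1.0)
print(mii)
```

On this toy window the result is `[('c',), ('a', 'd'), ('b', 'd')]`. The number of candidates examined at each level depends directly on how many itemsets were frequent at the previous level, which is why fewer frequent 1-itemsets mean less extension work and less memory.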

5.6. Discussions

In the experiments, the detection accuracy and time cost of the proposed MiFI-Outlier approach are tested under different min_sup values and different sizes of sliding window, and the time cost and memory usage of the proposed MiFI-UDSM and MiFI-UDSM* approaches are tested under the same settings. This subsection therefore first discusses the relationships between the min_sup values, the sizes of sliding window and the detection accuracy of the proposed MiFI-Outlier approach, and then discusses the relationships between the min_sup values, the sizes of sliding window and the time cost and memory usage of the proposed MiFI-UDSM, MiFI-UDSM* and MiFI-Outlier approaches.

For the proposed MiFI-Outlier approach, when the size of sliding window is constant, the detection accuracy is positively correlated with the min_sup value; that is, the detection accuracy increases as min_sup increases. The reason is that the set of minimal infrequent itemsets is much larger under large min_sup values, so more itemsets are used in the detection. When the min_sup value is constant, the size of sliding window has very little influence on the detection accuracy of the MiFI-Outlier approach. In addition, when the size of sliding window is constant, the time cost of the proposed MiFI-Outlier approach decreases as min_sup increases, and when the min_sup value is constant, the time cost of the MiFI-Outlier approach increases with the size of sliding window.

For the proposed MiFI-UDSM and MiFI-UDSM* approaches, when the size of sliding window is constant, the time cost and memory usage decrease as min_sup increases. The reason is that the set of extensible frequent itemsets is much smaller under large min_sup values, so the time cost and memory usage of the “pattern extension” operation are also reduced. When the min_sup value is constant, the time cost and memory usage of the proposed MiFI-UDSM and MiFI-UDSM* approaches increase with the size of sliding window. The reason is that the set of extensible frequent itemsets is larger in large sliding windows, so the time cost and memory usage of the “pattern extension” operation also increase.

In general, the proposed MiFI-Outlier approach is more competitive when the min_sup value is set larger; under relatively large min_sup values, the MiFI-Outlier approach can discover implicit outliers in uncertain data streams faster and more accurately.

6. Conclusion

Outliers are a main factor affecting data-based prediction and analysis; therefore, outlier detection is urgently needed to improve the reliability of collected datasets. However, outlier detection on high-dimensional uncertain data streams is challenging, and the efficiency of distance-based and density-based outlier detection approaches is not competitive because of their high computational complexity. Based on the definition of an outlier, namely data that appears rarely and differs greatly from most normal data elements, this paper proposes an efficient minimal infrequent itemset-based outlier detection approach, namely MiFI-Outlier, for quickly and accurately discovering implicit outliers in uncertain data streams. MiFI-Outlier consists of a minimal infrequent itemset mining operation and an itemset-based outlier detection operation. In the minimal infrequent itemset mining phase, a matrix structure is first constructed to store the probability of each itemset existing in the current sliding window, and then the matrix-based infrequent itemset mining approach called MiFI-UDSM is proposed to mine the minimal infrequent itemsets from the uncertain data stream. To reduce meaningless “pattern extension” operations, the “item cap” and “support cap” concepts are proposed to reduce the scale of potential extensible frequent itemsets, and an improved version, namely MiFI-UDSM*, is then proposed to raise the mining efficiency. In the itemset-based outlier detection phase, three deviation indices, namely the minimal infrequent itemset deviation index (MiFIDI), the similarity deviation index (SI) and the transaction deviation index (TDI), are defined to judge the deviation degree of each transaction: MiFIDI is used to compute the deviation index of each mined minimal infrequent itemset, while SI and TDI are used to compute the deviation index of each transaction. Finally, the transactions are sorted in decreasing order of their TDI values, and the transactions with higher TDI values are determined to be outliers.

The performance of the proposed MiFI-Outlier approach is evaluated on a synthetic dataset and a public dataset, and the results show that the proposed MiFI-Outlier approach outperforms the compared FindFPOF, MIFPOD, OODFP, FIM-UDSOD and DPA approaches in both detection accuracy and time cost. In addition, the performance of the proposed MiFI-UDSM and MiFI-UDSM* approaches is evaluated on four public datasets, and the results show that on both dense and sparse datasets, the time cost and memory usage of the proposed MiFI-UDSM and MiFI-UDSM* approaches are more competitive than those of the compared MIP-DS, DSUF-min, MRG-Exp and RP-Tree approaches, although on the sparse datasets the improvement of the proposed MiFI-UDSM* approach in both time cost and memory usage is somewhat limited. Overall, the proposed MiFI-Outlier provides a good solution for outlier detection on uncertain data streams, and its detection accuracy is much higher under large min_sup values.

In real life, people tend to care about whether there are outliers in the small portion of the data that meets their constraints, rather than in the entire dataset. However, how to quickly and accurately discover the outliers in a data stream that satisfies such constraints is a new challenge. Thus, in the future, we will study infrequent-itemset-based outlier detection approaches for discovering implicit outliers in constrained uncertain data streams. In addition, concept drift [26-28] appears more and more frequently in data streams and causes great difficulty for outlier detection, so in future work concept drift detection and outlier detection should be considered together, thereby further improving the credibility of the data stream.
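As a rough illustration of the final ranking step of MiFI-Outlier summarized in this conclusion (transactions sorted by decreasing TDI, with the highest-ranked ones reported as outliers), consider the sketch below. It treats the TDI values as given inputs, since the formulas for MiFIDI, SI and TDI are not restated here; the transaction identifiers, the scores, the cutoff `k` and the function name `top_outliers` are all hypothetical.

```python
def top_outliers(tdi_by_transaction, k):
    """Return the ids of the k transactions with the largest TDI values."""
    ranked = sorted(tdi_by_transaction.items(), key=lambda pair: pair[1], reverse=True)
    return [transaction_id for transaction_id, _ in ranked[:k]]


# Hypothetical TDI values for four transactions in one sliding window.
tdi = {"T1": 0.12, "T2": 0.87, "T3": 0.45, "T4": 0.91}
print(top_outliers(tdi, 2))
```

With these made-up scores the two reported transactions are T4 and T2. How many top-ranked transactions to report is a design choice that this sketch leaves to the caller.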

Acknowledgments

This work was supported in part by the Chinese Universities Scientific Fund under Grant No. 2017XD001 and the Fundamental Research Funds for the Central Universities under Grant No. 2018XD004.

References

[1] R. Agrawal, R. Srikant, Fast algorithms for mining association rules, in: 20th International Conference on Very Large Data Bases, 1994, pp. 487-499.
[2] M. Bai, X. Wang, J. Xin, G.R. Wang, An efficient algorithm for distributed density-based outlier detection on big data, Neurocomputing 181 (2016) 19-28.
[3] L. Cagliero, P. Garza, Infrequent weighted itemset mining using frequent pattern growth, IEEE Transactions on Knowledge and Data Engineering 26(4) (2014) 903-915.
[4] S.H. Cai, R.Z. Sun, J.Y. Li, C. Deng, S.C. Li, Abnormal Detecting over Data Stream Based on Maximal Pattern Mining Technology, in: CCF Conference on Computer Supported Cooperative Work and Social Computing, 2018, pp. 371-385.
[5] S.H. Cai, R.Z. Sun, S.B. Hao, S.C. Li, G. Yuan, Minimal weighted infrequent itemset mining-based outlier detection approach on uncertain data stream, Neural Computing and Applications (2018). https://doi.org/10.1007/s00521-018-3876-4
[6] S.H. Cai, S.B. Hao, R.Z. Sun, G. Wu, Mining Recent Maximal Frequent Itemsets Over Data Streams with Sliding Window, International Arab Journal of Information Technology 16(6) (2019) 961-969.
[7] S.H. Cai, R.Z. Sun, S.B. Hao, S.C. Li, G. Yuan, An Efficient Outlier Detection Approach on Weighted Data Stream Based on Minimal Rare Pattern Mining, China Communications 16(10) (2019) 83-99.
[8] K.Y. Cao, G.R. Wang, D.H. Han, G.H. Ding, A.X. Wang, L.X. Shi, Continuous outlier monitoring on uncertain data streams, Journal of Computer Science and Technology 29(3) (2014) 436-448.
[9] W. Fang, V.S. Sheng, X.Z. Wen, W.B. Pan, Meteorological data analysis using MapReduce, The Scientific World Journal 96 (2014) 27-38.
[10] G.D. Fan, S.H. Yin, A frequent itemsets mining algorithm based on matrix in sliding window over data streams, in: 3rd International Conference on Intelligent System Design and Engineering Applications (ISDEA), 2013, pp. 66-69.
[11] D.J. Haglin, A.M. Manning, On Minimal Infrequent Itemset Mining, in: 7th International Conference on Data Mining, 2007, pp. 141-147.
[12] J. Han, J. Pei, Y. Yin, R. Mao, Mining frequent patterns without candidate generation: A frequent-pattern tree approach, Data Mining and Knowledge Discovery 8(1) (2004) 53-87.
[13] M. Han, J. Ding, J. Li, TDMCS: An Efficient Method for Mining Closed Frequent Patterns over Data Streams Based on Time Decay Model, International Arab Journal of Information Technology 14(6) (2017) 851-860.
[14] S.B. Hao, S.H. Cai, R.Z. Sun, S.C. Li, An Efficient Outlier Detection Approach Over Uncertain Data Stream Based on Frequent Itemset Mining, Journal of Information Technology and Control 48(1) (2019) 34-46.
[15] D.M. Hawkins, Identification of outliers, 1980, London: Chapman and Hall.
[16] Z.Y. He, X.F. Xu, J.Z. Huang, S.C. Deng, FP-Outlier: Frequent pattern based outlier detection, Computer Science and Information Systems 2(1) (2005) 103-118.
[17] C.S. Hemalatha, V. Vaidehi, R. Lakshmi, Minimal infrequent pattern based approach for mining outliers in data streams, Expert Systems with Applications 42(4) (2015) 1998-2012.
[18] J. Huang, Q. Zhu, L. Yang, D. Cheng, Q. Wu, A novel outlier cluster detection algorithm without top-n parameter, Knowledge-Based Systems 121 (2017) 32-40.
[19] F. Keller, E. Muller, K. Bohm, HiCS: High Contrast Subspaces for Density-Based Outlier Ranking, in: 28th International Conference on Data Engineering, 2012, pp. 1037-1048.
[20] M. Kontaki, A. Gounaris, A.N. Papadopoulos, K. Tsichlas, Y. Manolopoulos, Efficient and flexible algorithms for monitoring distance-based outliers over data streams, Information Systems 55 (2016) 37-53.
[21] G. Lee, U. Yun, A new efficient approach for mining uncertain frequent patterns using minimum data structure without false positives, Future Generation Computer Systems 68 (2017) 89-110.
[22] C.K.S. Leung, M.A. Mateo, D.A. Brajczuk, A tree-based approach for frequent pattern mining from uncertain data, in: Proceedings of the 12th Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2008, pp. 653-661.
[23] C.K.S. Leung, R.K. MacKinnon, F. Jiang, Finding efficiencies in frequent pattern mining from big uncertain data, World Wide Web 20(3) (2017) 571-594.
[24] Y. Lim, U. Kang, Time-weighted counting for recently frequent pattern mining in data streams, Knowledge and Information Systems 53(2) (2017) 391-422.
[25] F. Lin, W. Le, J. Bo, Research on maximal frequent pattern outlier factor for online high dimensional time-series outlier detection, Journal of Convergence Information Technology 5(10) (2010) 66-71.
[26] A.J. Liu, J. Lu, F. Liu, G.Q. Zhang, Accumulating regional density dissimilarity for concept drift detection in data streams, Pattern Recognition 76 (2018) 256-272.
[27] J. Lu, A.J. Liu, F. Dong, F. Gu, J. Gama, G.Q. Zhang, Learning under Concept Drift: A Review, IEEE Transactions on Knowledge and Data Engineering (2018). http://dx.doi.org/10.1109/TKDE.2018.2876857
[28] N. Lu, G.Q. Zhang, J. Lu, Concept drift detection via competence models, Artificial Intelligence 209 (2014) 11-28.
[29] M. Radovanović, A. Nanopoulos, M. Ivanović, Reverse nearest neighbors in unsupervised distance-based outlier detection, IEEE Transactions on Knowledge and Data Engineering 27(5) (2015) 1369-1382.
[30] S. Ramírez-Gallego, B. Krawczyk, S. García, M. Woźniak, F. Herrera, A survey on data preprocessing for data stream mining: current status and future directions, Neurocomputing 239 (2017) 39-57.
[31] Y. Shi, L. Zhang, COID: A cluster–outlier iterative detection approach to multi-dimensional data analysis, Knowledge and Information Systems 28(3) (2011) 709-733.
[32] L. Szathmary, A. Napoli, P. Valtchev, Towards rare itemset mining, in: International Conference on Tools with Artificial Intelligence (ICTAI), 2007, pp. 305-312.
[33] B. Tang, H. He, A local density-based approach for outlier detection, Neurocomputing 241 (2017) 171-180.
[34] L. Troiano, G. Scibelli, A time-efficient breadth-first level-wise lattice-traversal algorithm to discover rare itemsets, Data Mining and Knowledge Discovery 28(3) (2014) 773-807.
[35] S. Tsang, Y.S. Koh, G. Dobbie, RP-Tree: Rare Pattern Tree Mining, in: Proceedings of the 13th International Conference on Data Warehousing and Knowledge Discovery, 2011, pp. 277-288.
[36] B. Wang, X.C. Yang, G.R. Wang, G. Yu, Outlier detection over sliding windows for probabilistic data streams, Journal of Computer Science and Technology 25(3) (2010) 389-400.
[37] I.M. Wagner-Muns, I.G. Guardiola, V.A. Samaranayke, W.I. Kayani, A Functional Data Analysis Approach to Traffic Volume Forecasting, IEEE Transactions on Intelligent Transportation Systems 19(3) (2018) 878-888.
[38] K. Xu, K. Zou, Y. Huang, X. Yu, X.F. Zhang, Mining community and inferring friendship in mobile social networks, Neurocomputing 174 (2016) 605-616.
[39] G. Yang, The complexity of mining maximal frequent itemsets and maximal frequent patterns, in: Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004, pp. 344-353.
[40] Y. Yang, C. Yang, Y. Wei, Frequent pattern mining algorithm for uncertain data streams based on sliding window, in: Proceedings of the 8th IEEE International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC), 2016, pp. 265-268.
[41] J.X. Yu, Z. Chong, H. Lu, Z. Zhang, A. Zhou, A false negative approach to mining frequent itemsets from high speed transactional data streams, Information Sciences 176(14) (2006) 1986-2015.
[42] U. Yun, D. Kim, E. Yoon, H. Fujita, Damped Window based High Average Utility Pattern Mining over data streams, Knowledge-Based Systems 144 (2018) 188-205.

Author Contributions

Saihua Cai: Methodology, Experimental verification, Writing - original draft, Writing - review & editing. Sicong Li: Experimental verification, Writing - review & editing. Gang Yuan: Experimental verification, Formal analysis. Shangbo Hao: Experimental verification. Ruizhi Sun: Writing - review & editing.

Conflict of interest

We declare that we have no conflicts of interest to this work. We declare that we do not have any commercial or associative interest that represents a conflict of interest in connection with the work submitted.

Ruizhi Sun, on behalf of Saihua Cai, Shangbo Hao, Sicong Li and Gang Yuan
