Journal Pre-proof MiFI-Outlier: Minimal infrequent itemset-based outlier detection approach on uncertain data stream Saihua Cai, Sicong Li, Gang Yuan, Shangbo Hao, Ruizhi Sun
PII: S0950-7051(19)30571-4
DOI: https://doi.org/10.1016/j.knosys.2019.105268
Reference: KNOSYS 105268
To appear in:
Knowledge-Based Systems
Received date : 13 April 2019 Revised date : 20 November 2019 Accepted date : 24 November 2019 Please cite this article as: S. Cai, S. Li, G. Yuan et al., MiFI-Outlier: Minimal infrequent itemset-based outlier detection approach on uncertain data stream, Knowledge-Based Systems (2019), doi: https://doi.org/10.1016/j.knosys.2019.105268. This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
© 2019 Published by Elsevier B.V.
Knowledge-Based Systems journal homepage: www.elsevier.com /locate/knosys
MiFI-Outlier: minimal infrequent itemset-based outlier detection approach on uncertain data stream
Saihua Cai a, Sicong Li a, Gang Yuan a, Shangbo Hao a, Ruizhi Sun a,b,*
a College of Information and Electrical Engineering, China Agricultural University, Beijing, 100083, China
b Scientific research base for Integrated Technologies of Precision Agriculture (animal husbandry), the Ministry of Agriculture, Beijing 100083, China
ARTICLE INFO
Keywords: Outlier detection; Minimal infrequent itemset mining; Uncertain data stream; Deviation indices; Data mining
ABSTRACT
Many outlier detection approaches have been proposed for static datasets over the past twenty years, with good results. In real life, uncertain data stream is increasingly common, but most existing outlier detection approaches are not suitable for the uncertain data stream environment. In addition, many outlier detection approaches do not consider the appearing frequency of each element, so the detected outliers do not coincide with the definition of an outlier. Itemset-based outlier detection approaches provide a good solution to this problem and have received more attention in recent years. In this paper, a novel two-step minimal infrequent itemset-based outlier detection approach called MiFI-Outlier is proposed to effectively detect outliers from uncertain data stream. In the itemset mining phase, a matrix-based method called MiFI-UDSM is proposed to mine the minimal infrequent itemsets (MiFIs) from uncertain data stream, and then an improved approach called MiFI-UDSM* is proposed to mine these minimal infrequent itemsets more efficiently using the ideas of “item cap” and “support cap”. In the outlier detection phase, based on the mined MiFIs, three deviation indices, namely the minimal infrequent itemset deviation index (MiFIDI), the similarity deviation index (SDI) and the transaction deviation index (TDI), are defined to measure the deviation degree of each transaction, and MiFI-Outlier then uses them to identify the outliers in uncertain data stream. Several experimental studies are conducted on public datasets and synthetic datasets, and the results show that the proposed approaches outperform the compared approaches in both the infrequent itemset mining phase and the outlier detection phase.
1. Introduction
In recent years, the scale of data has been increasing faster than ever in various application fields; thus, it is necessary to use more effective technology to analyze and manage these data so as to discover implicit, previously unknown and potentially useful knowledge. As an important data processing technology, data mining [23] is widely used in traffic data analysis [37], meteorological data analysis [9], mobile data analysis [38] and so on, where frequent itemset mining [21,23] is an important technique.
As a main form of data, data stream is very common in real life. Compared with static data, data stream [30] is continuous, unbounded and not necessarily uniformly distributed. Because of these features, data stream must be processed quickly, and multiple scans of it are impractical. Hence, processing data stream is more challenging than processing static datasets. In recent years, window-based technologies (such as sliding window [6,13], damped window [42] and landmark window [41]) have provided excellent mining solutions for incoming data stream, and most of these methods are based on the Apriori algorithm [1] and the FP-Growth algorithm [12], such as HAUPM [42], TWMINSWAP [24], TWMINSWAP-IS [24] and TDMCS [13]. To obtain accurate mining results, the accuracy of the original data stream is particularly important. However, in real life, due to measurement errors, equipment failures, human errors and other reasons, abnormal data (also called outliers, with two attributes [15]: (1) rarely appearing, and (2) deviating much from most observations) are often present in data stream, which seriously affects the reliability of data mining, data-based prediction and other operations. Therefore, it is necessary to detect these outliers effectively and as soon as possible to improve the quality of data stream. The traditional outlier detection methods, such as clustering-based methods [18,19,31], distance-based methods [20,29] and density-based methods [2,33], determine whether the current transactions are outliers by calculating the distances between each subset in the transaction. Thus, when the number of subsets in the transaction is very large (that is, high-dimensional data), the time cost of these three kinds of outlier detection methods increases exponentially, resulting in the curse of dimensionality. Fortunately, itemset-based outlier detection approaches [4,5,7,14,16,17,25] provide a good solution to this problem, and they divide the entire outlier detection process into an itemset mining phase and an outlier detection phase.

* Corresponding author. E-mail address: [email protected], [email protected]
Most itemset-based outlier detection methods are designed for precise datasets or precise data stream. However, in many cases, users are uncertain about the presence or absence of some items or events (for example, a physician may suspect that a patient has (i) an 80% likelihood of suffering from a flu and (ii) a 60% likelihood of suffering from a cold, regardless of having or not having the flu); that is, uncertain data stream is becoming more common. Compared with precise data stream, each itemset in uncertain data stream is associated with a probability value that represents the possibility of the existence of this itemset, and the existence of these probabilities makes the itemset-based outlier detection methods designed for precise data stream inapplicable to uncertain data stream. In addition, in the existing itemset-based outlier detection methods, such as FIM-UDSOD [14], FindFPOF [16] and OODFP [25], only a few of the factors that make transactions abnormal are considered in the design of the deviation indices; thus, the detection accuracy is not competitive.

For the entire process of itemset-based outlier detection on uncertain data stream, both the accuracy and the time cost are very important. (1) To improve the detection accuracy, more factors that may cause a transaction to be abnormal need to be considered. In addition, because infrequent itemsets are more consistent with the first attribute of outliers (“rarely appearing”), using infrequent itemsets in the outlier detection process can improve the detection accuracy. (2) To reduce the time cost of the entire process, both the time cost of the itemset mining stage and that of the outlier detection stage need to be considered. In the itemset mining stage, the large volume of the data stream makes the time cost and memory usage of infrequent itemset mining very expensive, and this problem is even more acute in the era of big data. Because the minimal infrequent itemsets [11] (the subsets of infrequent itemsets, see Definition 6) are the generators of infrequent itemsets and their number is relatively small, much time and memory can be saved in both the mining process and the detection process if infrequent itemset mining is replaced by minimal infrequent itemset mining.

Based on the above ideas, this paper presents a minimal infrequent itemset-based outlier detection approach, namely MiFI-Outlier, to discover the outliers from uncertain data stream. The main contributions of this paper are summarized as follows:

(1) We propose an algorithm called MiFI-UDSM to mine the minimal infrequent itemsets from uncertain data stream (to the best of our knowledge, it is the first algorithm to mine the minimal infrequent itemsets from uncertain data stream). It uses a matrix structure to store the information of each itemset in the uncertain data stream, and then the minimal infrequent itemsets are mined by extending the frequent short-itemsets using the “pattern extension” operation.
(2) We introduce the concepts of “item cap” and “support cap” to reduce the scale of potential extensible itemsets, thereby reducing the meaningless “pattern extension” operations and support value calculations, and then propose the MiFI-UDSM* method for quickly mining the minimal infrequent itemsets.
(3) We design three deviation indices to measure the deviation degree of each transaction, and then propose a minimal infrequent itemset-based outlier detection method called MiFI-Outlier for effectively discovering the outliers from uncertain data stream.
(4) We use several datasets, including a synthetic dataset and several public datasets, to test the detection accuracy and time efficiency of the proposed methods, and the experimental results confirm the effectiveness of the proposed methods.

The remainder of this paper is organized as follows. The related work is presented in Section 2. Some preliminaries and the existing problems are introduced in Section 3. The outlier detection framework, the original minimal infrequent itemset mining approach, an improved minimal infrequent itemset mining approach and the outlier detection approach are presented in Section 4. The empirical studies and experimental analysis are stated in Section 5. The conclusions and future work are discussed in Section 6.

2. Related work

In this section, some related work, including (1) outlier detection approaches and (2) infrequent itemset mining approaches, is reviewed.

2.1. Outlier detection approaches

Outlier detection is a series of processes for discovering the implicit abnormal data in static datasets and data stream. The outlier detection approaches are roughly divided into four categories: (1) clustering-based outlier detection approaches, (2) distance-based outlier detection approaches, (3) density-based outlier detection approaches, and (4) itemset-based outlier detection approaches.

For the clustering-based outlier detection approaches, the main idea is to cluster similar patterns into a class, and the elements that deviate from most elements are judged as outliers. COID [31] is a clustering-based iterative outlier detection approach that divides the whole outlier detection process into an initialization phase and an iteration phase. In the initialization phase, the center positions of the clusters and the positions of the abnormal points are discovered to support the iteration phase. In the iteration phase, the clusters and the abnormal point sets are refined gradually by exchanging the worst outliers with the data points on the worst cluster boundary. By adjusting the internal relations of the clusters as well as the relationship between clusters and outliers, the efficiency of outlier detection on multidimensional noisy datasets can be effectively improved. Based on the idea that abnormal clusters are smaller than normal clusters, Huang et al. [18] proposed an outlier detection method called ROCF to detect the implicit outliers. To speed up outlier detection, each point in the dataset was connected with its neighbor points based on the constructed neighbor graph, and the outlier judging condition was changed from the parameter n or α to the cluster density of the neighbor graph. Although the clustering-based outlier detection approaches are unsupervised, their accuracy depends highly on the selected clustering algorithm.
Distance-based outlier detection approaches need to calculate the distance of each itemset of the transactions to determine whether the transactions are outliers. In 2016, Kontaki et al. [20] proposed two continuous distance-based outlier detection algorithms, COD and ACOD, with the help of sliding window technology, which support flexibly adjusting the radius parameter R and the threshold parameter k. The COD algorithm was designed to detect the implicit outliers with several values of the threshold parameter k and a fixed radius parameter R. The ACOD algorithm was designed to detect the implicit outliers with multiple values of both R and k, and it also supported conducting the outlier detection in parallel to speed up the process. Based on the relation between “antihubs” and outliers in high- and low-dimensional settings, Radovanović et al. [29] explored two ways of using k-occurrence information to express the outlierness of points, and then proposed the AntiHub method for unsupervised outlier detection. Moreover, they proposed a derived method that improves the discrimination between scores to further improve the outlier detection accuracy. Because the distance-based outlier detection approaches need to calculate the distance between each pair of points, they are time consuming and not suitable for dense datasets. Outlier detection on uncertain data stream is a new direction in recent years; it was first proposed by Wang et al. [36] in 2010. Different from processing precise datasets, the probability value of each pattern needs to be taken into the definition of a distance-based outlier, and an item was regarded as a distance-based outlier if the probability values of its neighbors were not larger than a predefined threshold value. To efficiently reduce the intermediate data in the sliding window, they designed a pruning method called PBA to prune the meaningless intermediate patterns, and then proposed a dynamic programming algorithm called DPA to efficiently process each data item in linear time. Moreover, the detected data could be used to incrementally detect outliers in the sliding window.

For density-based outlier detection approaches, Bai et al. [2] proposed a distributed LOF computing method that supports parallel processing, namely DLC, for identifying the density-based outliers. Tang and He [33] first introduced the concept of RDOS (Relative Density-based Outlier Score) to measure the local outlierness of the detected objects, and then proposed an efficient density-based outlier detection approach based on local kernel density estimation (KDE), where the k nearest neighbors, reverse nearest neighbors and shared nearest neighbors were used to improve the detection accuracy. Cao et al. [8] proposed a continuous outlier detection approach called CUOD to discover outliers from uncertain data stream. To reduce the time cost of outlier detection, they used a probability pruning approach to prune the infrequent patterns in the extending process. Then, a new method for parameter variable queries was proposed to enable the concurrent execution of different queries, and the results showed that the proposed approach can reduce the required storage and running time. Overall, the accuracy of density-based outlier detection approaches is relatively high, but they are computationally intensive and not suitable for processing high-dimensional datasets.

For itemset-based outlier detection approaches, He et al. [16] proposed an efficient frequent pattern-based outlier detection approach, namely FindFPOF, to discover the implicit outliers from large-scale precise datasets. The outlier judgment basis of FindFPOF was the proportion of the support value of the contained frequent itemsets to the total number of mined frequent itemsets. Although outliers can be identified by the FindFPOF method, its single simple judgment condition makes the detection accuracy uncompetitive. In addition, the time cost of the outlier detection phase is very high because the itemsets used in this phase are frequent itemsets, whose scale is very large. To address the problem that much time was spent in the outlier detection phase of FindFPOF, maximal frequent pattern-based outlier detection approaches, including OODFP [25] and MFPM-AD [4], have been proposed in recent years, where OODFP was used to discover outliers from high-dimensional time-series datasets and MFPM-AD was designed for identifying outliers from precise data stream. In 2015, Hemalatha et al. [17] first proposed the idea of using minimal infrequent patterns to detect outliers from precise data stream, and then proposed the outlier detection approach MIFPOD based on this idea. In the MIFPOD approach, they first used the MIP-DS method to mine the minimal infrequent patterns from precise data stream to support outlier detection. In the outlier detection phase, they defined three deviation factors, named transaction weighting factor (TWF), minimal infrequent deviation factor (MIPDF) and minimal infrequent pattern based outlier factor (MIFPOF), to accurately determine whether the transactions are abnormal. Although the detection accuracy of MIFPOD is higher than that of FindFPOF, the time cost of the whole outlier detection process is relatively high because the minimal infrequent pattern mining is based on the original Apriori method. In addition, the MIFPOD approach is designed for precise data stream and is not suitable for uncertain data stream. For uncertain data stream, the FIM-UDSOD approach [14] was proposed to effectively identify the outliers. Similar to the MIFPOD approach, the MWIFIM-OD-UDS approach [5] designed three deviation indices to measure the deviation degree of each transaction, and then the outliers were effectively detected from uncertain weighted data stream. Because the itemset-based outlier detection approaches take the appearing frequency of the itemsets into consideration, the detected outliers coincide better with the definition of outliers given by Hawkins.
Table 1 Characteristics of outlier detection approaches

| Approaches | Type of data | Precise / Uncertain | Type of outlier detection |
|---|---|---|---|
| COID [31] | static dataset | precise | clustering-based |
| ROCF [18] | static dataset | precise | clustering-based |
| COD [20] | data stream | precise | distance-based |
| ACOD [20] | data stream | precise | distance-based |
| DLC [2] | static dataset | precise | density-based |
| RDOS [33] | static dataset | precise | density-based |
| FindFPOF [16] | static dataset | precise | itemset-based |
| OODFP [25] | static dataset | precise | itemset-based |
| MIFPOD [17] | data stream | precise | itemset-based |
| FIM-UDSOD [14] | data stream | uncertain | itemset-based |
| MWIFIM-OD-UDS [5] | weighted data stream | uncertain | itemset-based |
| DPA [36] | data stream | uncertain | distance-based |
| CUOD [8] | data stream | uncertain | density-based |
| MiFI-Outlier (our method) | data stream | uncertain | itemset-based |
Table 1 shows the characteristics of the aforementioned outlier detection approaches in terms of the type of data, the uncertainty of the data and the type of outlier detection. Note that the itemset-based outlier detection approaches FindFPOF, OODFP, MIFPOD, FIM-UDSOD and MWIFIM-OD-UDS are the approaches compared with our proposed MiFI-Outlier approach.
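To make the FindFPOF judgment basis described in this subsection concrete, the following is a minimal Python sketch for precise transactions: the score of a transaction is the sum of the supports of the frequent itemsets it contains, divided by the number of mined frequent itemsets, so low scores indicate likely outliers. The function names `frequent_itemsets` and `fpof`, the toy transactions and the brute-force enumeration are illustrative assumptions and not the mining procedure used in [16].

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Brute-force frequent itemsets with relative support (toy-sized data only)."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    result = {}
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            sup = sum(1 for t in transactions if set(cand) <= t) / n
            if sup >= min_support:
                result[frozenset(cand)] = sup
    return result

def fpof(transaction, frequent):
    """Sum of supports of the frequent itemsets contained in the transaction,
    divided by the total number of frequent itemsets."""
    if not frequent:
        return 0.0
    contained = sum(s for x, s in frequent.items() if x <= transaction)
    return contained / len(frequent)

# Hypothetical toy data: the last transaction shares little with the others.
transactions = [{"a", "b"}, {"a", "b"}, {"a", "b", "c"}, {"c", "d"}]
frequent = frequent_itemsets(transactions, min_support=0.5)
scores = [fpof(t, frequent) for t in transactions]
```

Ranking the transactions by ascending score puts `{"c", "d"}` first in this toy example, which illustrates why a single condition of this kind is cheap to state but depends entirely on the set of frequent itemsets.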
2.2. Infrequent itemset mining approaches

For the itemset-based outlier detection approaches, the itemset mining process is the basis of outlier detection. Over the past twenty years, frequent itemset mining has been the focus of most studies, but infrequent itemset mining is more meaningful for outlier detection because it aims at discovering the rarely appearing itemsets, which coincides better with the attribute of “rarely appears”. The infrequent itemset mining approaches can be roughly divided into Apriori-based approaches and FP-growth-based approaches.

For Apriori-based infrequent itemset mining approaches, Haglin and Manning proposed the MINIT approach [11] to recursively discover the minimal infrequent itemsets from the transactions in the datasets. In the MINIT approach, the itemsets were sorted in decreasing order of support value so that the most frequently appearing itemsets could be mined first; then, the minimal infrequent itemsets were mined from the search space, and these search spaces were discarded directly after the mining operation to improve the mining efficiency, where the search space referred to the transactions containing minimal infrequent itemsets. Szathmary et al. [32] proposed the Arima algorithm to effectively mine the minimal infrequent itemsets by traversing the frequent search space level by level, that is, the mining process was conducted from 1-itemsets to longer itemsets, and the longer infrequent itemsets were generated by the “pattern extension” operation on the mined shorter minimal infrequent itemsets. Then, the MRG-Exp method [32] was proposed to further improve the mining efficiency, and its main idea was to discover the frequent generators in order to generate the minimal infrequent itemsets recursively. Troiano and Scibelli [34] proposed a breadth-first level-wise lattice-traversal algorithm, namely Rarity, for mining the infrequent itemsets; it first identified the longest rare itemsets in the database, and then the power set lattice was traversed downwards to filter the frequent itemsets because any subset of a frequent itemset is also frequent. The Rarity approach was very suitable for sparse databases, but its mining efficiency was very low when processing dense datasets.

For FP-growth-based infrequent itemset mining approaches, Tsang et al. [35] designed the RP-Tree method to mine the infrequent itemsets from static precise datasets, where the support value of each 1-itemset was calculated in the first scan of the whole dataset, and then an FP-tree-like structure was constructed in the second scan of the dataset to recursively mine the infrequent itemsets. After the construction of the FP-tree-like structure, the conditional trees were constructed to mine each subset of the infrequent itemsets. In 2014, Cagliero and Garza [3] proposed two FP-growth-based algorithms, called IWI-Miner and MIWI-Miner, to discover the relevant infrequent itemsets with different weights from transactional weighted datasets. The IWI-support-min measure was first defined relying on a minimal cost function to find the infrequent weighted itemsets and minimal infrequent weighted itemsets, including the items with least interest within each transaction. Then, the IWI-support-max measure was defined relying on a maximum cost function to find the infrequent weighted itemsets and minimal infrequent weighted itemsets, including the items with most interest within each transaction. Similar to FP-growth methods, the FP-growth-based infrequent mining methods also have drawbacks, including large memory usage and two scans of the whole dataset.

3. Preliminaries and problem descriptions

In this section, we first define some concepts related to this article, and then introduce some practical problems encountered in the process of researching the minimal infrequent itemset-based outlier detection approach on uncertain data stream.

3.1. Preliminaries

Assume that I={i1, i2,…, in} is a set of items and Is={i1, i2, …, ik}, Is ⊆ I and k ∈ [1,n], then Is is a k-itemset and k is the length of Is. Uncertain data stream UDS is composed of infinite transactions, that is UDS=[T1, T2, …, Tm) (m→∞), where Ti is composed of several items selected from I and their existential probabilities p(ik,Ti), that is Ti={i1:p(i1,Ti), i2:p(i2,Ti), …, ik:p(ik,Ti)} (k≤n, 0

Definition 1. subset, superset: For an itemset Ia={i1, i2,…, ia} and an itemset Ib={i1, i2,…, ib} (a

Definition 2. probability: The probability of itemset Im in transaction Tj, which is formed by some items {i}, is denoted as p(Im,Tj), and it is defined as

$$p(I_m, T_j) = \prod_{\{i\} \subseteq \{I_m\}} p(i, T_j) \tag{1}$$
Definition 3. support: The appearing frequency of itemset Im in UDS is denoted as sup(Im), and it is defined as

$$sup(I_m) = \sum_{j=1}^{|SW|} \prod_{\{i\} \subseteq \{I_m\}} p(i, T_j) \tag{2}$$
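As a minimal sketch of Eqs. (1) and (2), the Python fragment below computes the probability of an itemset in one transaction and its support over a window, using the five transactions T1 to T5 listed in Table 2. The names `WINDOW`, `itemset_probability` and `support` are illustrative, and treating an absent item as having probability 0 is an assumption made here so that the product vanishes for transactions that do not contain the whole itemset.

```python
from math import prod

# Transactions T1..T5 copied from Table 2 (item -> existential probability).
WINDOW = [
    {"a": 0.8, "d": 0.2, "e": 0.4, "f": 0.2},
    {"a": 0.6, "b": 0.1, "c": 0.9, "e": 0.7, "f": 0.2},
    {"a": 0.3, "b": 0.3, "c": 0.4, "e": 0.2},
    {"b": 0.2, "d": 0.2, "e": 0.7, "f": 0.1},
    {"a": 0.4, "b": 0.2, "c": 0.5, "d": 0.1, "f": 0.3},
]

def itemset_probability(itemset, transaction):
    """Eq. (1): product of the items' probabilities (assumed 0 for absent items)."""
    return prod(transaction.get(item, 0.0) for item in itemset)

def support(itemset, window):
    """Eq. (2): sum of Eq. (1) over all transactions in the sliding window."""
    return sum(itemset_probability(itemset, t) for t in window)

p_ad_t1 = itemset_probability(("a", "d"), WINDOW[0])
sup_a = support(("a",), WINDOW)
sup_ab = support(("a", "b"), WINDOW)
```

Up to floating-point rounding, these three values agree with the worked example that accompanies Table 2 (0.16, 2.1 and 0.23).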
Definition 4. infrequent itemset (iFI): For an itemset Im, if its support value is less than min_sup, that is sup(Im)
parameters remain the same in all subsequent sections of this paper.

Table 2 An example of UDS

| TID | Transactions |
|---|---|
| T1 | {a:0.8,d:0.2,e:0.4,f:0.2} |
| T2 | {a:0.6,b:0.1,c:0.9,e:0.7,f:0.2} |
| T3 | {a:0.3,b:0.3,c:0.4,e:0.2} |
| T4 | {b:0.2,d:0.2,e:0.7,f:0.1} |
| T5 | {a:0.4,b:0.2,c:0.5,d:0.1,f:0.3} |
| … | …… |

In this example, p(a,T1)=0.8 and p(ad,T1)=0.8*0.2=0.16. In the sliding window, sup(a)=0.8+0.6+0.3+0+0.4=2.1>0.6, so itemset {a} is a frequent itemset; sup(ab)=0.8*0+0.6*0.1+0.3*0.3+0*0.2+0.4*0.2=0.23<0.6, so itemset {ab} is an infrequent itemset.

3.2. Problem descriptions

In this subsection, some existing problems in itemset mining and outlier detection on uncertain data stream are pointed out.

3.2.1 Problem on itemset mining

Assume that the size of itemset {I} is m; it is known from [39] that the number of potential extensible itemsets is up to (2^m−1). For the example shown in Table 2, the 1-itemsets are {{a}, {b}, {c}, {d}, {e}, {f}}, and the potential itemsets that can be mined are shown in Fig. 1; obviously, the number of potential itemsets is 63 (=2^6−1). To mine all infrequent itemsets, the support values of all (2^m−1) possible itemsets need to be calculated, which is unrealistic in the era of big data.

[Fig. 1. The specific information of potential itemsets: the lattice of itemsets over the items a, b, c, d, e and f, from {} up to abcdef, each marked as frequent or infrequent.]

For the traditional Apriori-based itemset mining methods, the uncertain data stream needs to be scanned several times to conduct the “pattern extension” operation and calculate the support values of the extended itemsets, thereby mining the frequent itemsets. This is time consuming, and the time complexity is unacceptable when processing large-scale and high-dimensional datasets. The traditional FP-Growth-based itemset mining methods use a recursive idea to mine all frequent itemsets from uncertain data stream, and they are more efficient than the Apriori-based methods. However, the mining process needs to construct many conditional trees, which consumes much memory. Although a large number of optimized algorithms have been proposed to improve the mining efficiency of Apriori-based and FP-Growth-based methods, most of them are directed at static precise datasets and precise data stream, and they are not suitable for uncertain data stream.

In general, how to design an efficient strategy to reduce the scale of potential extensible itemsets, especially for high-dimensional data stream, and thereby reduce the time cost of the itemset mining phase is very critical. In addition, how to mine the minimal infrequent itemsets from uncertain data stream with less time cost, less memory usage and only one scan of the entire data stream is also a major challenge.

3.2.2 Problem on outlier detection

For the clustering-based, distance-based and density-based outlier detection approaches, each item of the transactions in the current sliding window needs to be used to calculate the distances to the other items; thus, the computational complexity is very high when processing high-dimensional data stream. In addition, the appearing frequency of each item is not considered in these approaches; therefore, the detected outliers do not fit well with the “appears rarely” part of the definition of an outlier.

For the itemset-based outlier detection approaches, the appearing frequency of each item is considered as an important factor to measure the deviation degree of each transaction; thus, the outlier detection is more accurate. However, the existing itemset-based outlier detection methods were oriented to static precise datasets or precise data stream, and they are not suitable for uncertain data stream. In addition, the deviation indices designed in FindFPOF [16], OODFP [25] and FIM-UDSOD [14] are very simple, which makes their detection accuracy insufficiently competitive.

In general, how to design several accurate deviation indices based on the mined minimal infrequent itemsets to measure the deviation degree of the transactions in uncertain data stream, and thereby improve the detection accuracy, is very important for outlier detection. In addition, how to design an efficient minimal infrequent itemset-based outlier detection method to accurately discover the implicit outliers from uncertain data stream is also a tough challenge.

4. Minimal infrequent itemset-based outlier detection approach (MiFI-Outlier)
Itemset-based outlier detection approaches are divided into two phases: (1) the itemset mining phase, and (2) the outlier detection phase. Similarly, the proposed minimal infrequent itemset-based outlier detection approach can be divided into: (1) the minimal infrequent itemset mining phase, and (2) the minimal infrequent itemset-based outlier detection phase. When new transactions flow into the sliding window, the minimal infrequent itemset mining operation is conducted to mine the minimal infrequent itemsets from the transactions in the current sliding window. As mentioned in 3.2.1, the main limitation of itemset mining is its mining speed, especially when processing high-dimensional data stream. To increase the speed of minimal infrequent itemset mining, we rely on two main ideas. The first idea is to reduce the scale of potential extensible itemsets through an efficient downward closure property, and the second idea is to reduce the number of data stream scans through an efficient data structure. After all minimal infrequent itemsets are mined, the outlier detection operation is conducted to effectively discover the implicit outliers from uncertain data stream. To improve the detection accuracy, more factors that may influence the detection accuracy need to be considered in the design of the deviation indices. Finally, the transactions are sorted in descending order of the calculated deviation degree, and the transactions with a large deviation degree are judged as outliers. Based on the above analysis, we first describe an efficient minimal infrequent itemset mining approach, namely MiFI-UDSM, to mine the minimal infrequent itemsets from uncertain data stream. In subsection 4.2, we introduce two concepts to
5 / 21
Journal Pre-proof
4.1. Minimal infrequent itemset mining (MiFI-UDSM) As the basis of itemset-based outlier detection approach, the minimal infrequent itemsets need to be effectively mined using the “pattern extension” operation to provide protection for the outlier detection process.
After getting the (k+1)-itemsets that extended by k-itemsets, their support values are calculated using “vector multiply” operation to determine whether they are frequent itemsets, in which the “vector multiply” operation refers to multiplying the probabilities of (k+1) items in the (k+1)-itemset, and the probabilities of the (k+1) items are gained from the column vectors in the constructed matrix A. If the support value of the (k+1)-itemset is not less than the predefined min_sup, it is stored in FIL as a candidate itemset for the subsequent “pattern extension” operations. If the support value of the (k+1)-itemset is less than the predefined min_sup, the “minimal infrequent check” operation is performed to discover the minimal infrequent itemsets, in which the “minimal infrequent check” operation refers to checking whether any subset of the (k+1)-itemset is present in MFIL, then the (k+1)-itemset is discarded directly if there is a subset of current (k+1)-itemset in MiFIL, otherwise, the (k+1)-itemset is saved to MiFIL. In particular, for the extended 2itemsets, if their support values are less than min_sup, they are saved to MiFIL because they are extended by frequent 1-itemsets. Recursively performing the above operations until all minimal infrequent itemsets are mined from the transactions in current sliding window.
re-
4.1.1. The main idea of MiFI-UDSM approach Compared with Apriori-based approaches and FP-Growthbased approaches, the use of matrix structure can save much time cost and memory usage in itemset mining process [5,10] because it supports scanning the entire uncertain data stream for only one time and supports mining the minimal infrequent itemsets without generating any conditional tree. The construction process of matrix structure (denoted as matrix A) is similar to the MWIFIM-UDS approach [5], but the last row records the support value of each item. When new transaction Ta flows into the sliding window, the probability of item {ib} is written to position of Aa,b if it is in Ta, otherwise, 0 is written to the corresponding position. When the probabilities of all items in current sliding window are written into matrix A, the support value of each item is written to the corresponding position of last row of matrix A.
itemset, thus, the “pattern extension” should also not conduct on infrequent 1-itemsets to reduce the meaningless time cost. For the frequent k-itemsets that stored in the FIL, conduct the “pattern extension” operation to extend them into (k+1)-itemset. It is important to note that in the process of extending frequent 1itemsets to 2-itemsets, the “pattern extension” operation is conducted directly without judging their prefixes.
pro of
reduce the scale of potential extensible itemsets, and then propose an improved minimal infrequent itemset mining approach, namely MiFI-UDSM*. In subsection 4.3, we discuss the correctness and computing complexity of the proposed MiFIUDSM approach and MiFI-UDSM* approach. In subsection 4.4, we introduce three deviation indices to measure the deviation degree of each transaction, and then the minimal infrequent itemset-based outlier detection approach, namely MiFI-Outlier, is proposed to accurately discover the outliers from uncertain data stream.
urn a
lP
With the use of matrix structure, the scan times of uncertain data stream is reduced into once, thus, the time cost of itemset mining process is reduced to a certain extent. However, it can be seen from Fig. 1 that when the incoming data stream is highdimensional data stream, due to more different 1-items are existing in the high-dimensional data stream, thus, the scale of potential extensible itemsets is very huge, which will slow down the speed of itemset mining. For speeding up the mining process, it is necessary to delete the infrequent 1-items in the highdimensional data stream, thereby reducing the scale of potential extensible itemsets. Thus, the downward closure property should be adopted in “pattern extension” process to delete the meaningless itemsets for the mining process.
With the above operations, the number of mined minimal infrequent itemsets is much less than that of all infrequent itemsets, which facilitates the outlier detection process. The detailed process of MiFI-UDSM is shown in Algorithm 1.
Theorem 1. Downward closure property: Any superset of infrequent itemset is sure infrequent.
Jo
Proof. Assume that {Xk} is an infrequent k-itemset and {Xk+1} is extended by itemset {Xk} and {Y}. Due to the probability of itemset {Y} is not large than 1, and itemset {Xk} and itemset {Y} are not always appearing at the same time, therefore, sup(Xk+1)≤ sup(Xk)*p(Y)≤sup(Xk)
Algorithm 1: MiFI-UDSM Input: Uncertain data stream, min_sup Output: MiFIs 01.MiFIL=Φ, FIL=Φ 02.construct matrix A 03.if A|SW|+1,k
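The mining procedure of this subsection can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `support` and `mine_mifis`, the representation of matrix A as a list of item-to-probability dictionaries, and the alphabetical item order used for the prefix join are all choices made for this example. The five transactions are those of the example in Fig. 2, with min_sup = 0.6.

```python
from itertools import combinations
from math import prod

def support(rows, itemset):
    """Expected support: sum over transactions of the product of item probabilities."""
    return sum(prod(row.get(item, 0.0) for item in itemset) for row in rows)

def mine_mifis(rows, min_sup):
    """Level-wise search for minimal infrequent itemsets (hypothetical helper)."""
    items = sorted({item for row in rows for item in row})
    mifil = [(i,) for i in items if support(rows, (i,)) < min_sup]
    frequent = [(i,) for i in items if support(rows, (i,)) >= min_sup]
    while frequent:
        # "pattern extension": join frequent k-itemsets that share a (k-1)-prefix
        candidates = [x + (y[-1],) for x, y in combinations(frequent, 2)
                      if x[:-1] == y[:-1]]
        frequent = []
        for cand in candidates:
            if support(rows, cand) >= min_sup:
                frequent.append(cand)
            # "minimal infrequent check": keep only if no subset is already in MiFIL
            elif not any(set(m) <= set(cand) for m in mifil):
                mifil.append(cand)
    return mifil

# Transactions T1..T5 of the running example (absent items have probability 0).
rows = [
    {"a": 0.8, "d": 0.2, "e": 0.4, "f": 0.2},
    {"a": 0.6, "e": 0.7, "f": 0.2, "b": 0.1, "c": 0.9},
    {"a": 0.3, "e": 0.2, "b": 0.3, "c": 0.4},
    {"d": 0.2, "e": 0.7, "f": 0.1, "b": 0.2},
    {"a": 0.4, "d": 0.1, "f": 0.3, "b": 0.2, "c": 0.5},
]
mifis = mine_mifis(rows, 0.6)
print(sorted("".join(m) for m in mifis))
```

On this input the sketch yields the same nine itemsets as the worked example: {d}, {af}, {ab}, {ef}, {eb}, {fb}, {fc}, {bc} and {aec}, up to the order of items inside each itemset.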
4.1.2 An example of MiFI-UDSM approach

In this subsection, an example is given to explain the MiFI-UDSM method more clearly. The uncertain data stream used in this example is shown in Table 2, and the min_sup value and |SW| are the same as in subsection 3.1.

Step 1. Construct matrix A. The probability of each item in transactions T1, T2, T3, T4 and T5 is scanned and written into matrix A successively. When transaction T1 is scanned, the probabilities of the items it contains ({a}, {d}, {e} and {f}) are written to the corresponding positions; the result is shown in Fig. 2(a). Fig. 2(b) shows matrix A after transaction T2 is scanned, and Fig. 2(c) shows matrix A after all five transactions in the sliding window are scanned. Then, the support value of each item is calculated and written to the corresponding position of row (|SW|+1) of matrix A; the result is shown in Fig. 2(d).

(a)        a     d     e     f
     T1    0.8   0.2   0.4   0.2

(b)        a     d     e     f     b     c
     T1    0.8   0.2   0.4   0.2   0     0
     T2    0.6   0     0.7   0.2   0.1   0.9

(c)        a     d     e     f     b     c
     T1    0.8   0.2   0.4   0.2   0     0
     T2    0.6   0     0.7   0.2   0.1   0.9
     T3    0.3   0     0.2   0     0.3   0.4
     T4    0     0.2   0.7   0.1   0.2   0
     T5    0.4   0.1   0     0.3   0.2   0.5

(d)        a     d     e     f     b     c
     T1    0.8   0.2   0.4   0.2   0     0
     T2    0.6   0     0.7   0.2   0.1   0.9
     T3    0.3   0     0.2   0     0.3   0.4
     T4    0     0.2   0.7   0.1   0.2   0
     T5    0.4   0.1   0     0.3   0.2   0.5
     sup   2.1   0.5   2     0.8   0.8   1.8

Fig. 2. The creation process of matrix A

Step 2. Search for minimal infrequent 1-itemsets. After the support values of all 1-itemsets are calculated, the minimal infrequent 1-itemsets are mined based on matrix A. Because sup(d)=0.5<0.6, {d} is an infrequent itemset and is saved to MiFIL. The frequent 1-itemsets {a}, {b}, {c}, {e} and {f} are saved to FIL.

Step 3. Extend the frequent 1-itemsets to 2-itemsets. The frequent 1-itemsets {a}, {e}, {f}, {b} and {c} stored in FIL are taken out in turn for the “pattern extension” operation, which extends them to the 2-itemsets {{ae}, {af}, {ab}, {ac}, {ef}, {eb}, {ec}, {fb}, {fc}, {bc}}.

Step 4. Search for minimal infrequent 2-itemsets. After the frequent 1-itemsets are extended to 2-itemsets, their support values are calculated to find the minimal infrequent 2-itemsets, which are saved to MiFIL. The support values of the extended 2-itemsets are {{ae}:0.8, {af}:0.4, {ab}:0.23, {ac}:0.86, {ef}:0.29, {eb}:0.27, {ec}:0.71, {fb}:0.1, {fc}:0.33, {bc}:0.31}. Thus, itemsets {af}, {ab}, {ef}, {eb}, {fb}, {fc} and {bc} are minimal infrequent 2-itemsets and are saved to MiFIL, while {ae}, {ac} and {ec} are frequent itemsets and are saved to FIL.

Step 5. Extend the frequent 2-itemsets to 3-itemsets. The frequent 2-itemsets {ae} and {ac} share the prefix {a}, so they can be extended into {aec}. Because sup(aec)=0.402<0.6, it is an infrequent itemset. Because the subsets of itemset {aec} ({ae}, {ac}, {ec}) are all frequent itemsets, it is a minimal infrequent itemset and is saved to MiFIL. Because no frequent itemsets remain in this example, the “pattern extension” process ends. In summary, the mined minimal infrequent itemsets are {d}, {af}, {ab}, {ef}, {eb}, {fb}, {fc}, {bc} and {aec}.

4.2. An improved minimal infrequent itemset mining approach (MiFI-UDSM*)

Although the MiFI-UDSM approach can exactly mine the minimal infrequent itemsets from uncertain data stream, the scale of potential extensible itemsets is still very large when processing a high-dimensional data stream, which adds much time cost to the mining process, including the “pattern extension” operation and the support value calculation. To further reduce the scale of extensible k-itemsets, we design two concepts, “item cap” and “support cap”, to eliminate the extra “pattern extension” operations and support value calculations on meaningless k-itemsets. These k-itemsets can be deleted directly to reduce the dimensions, thereby reducing the overall time cost of itemset mining. Specifically, the “pattern extension” operation is conducted by considering an upper bound of the existential probability of each itemset in the current transaction, where the upper bound is the “item cap” defined in Definition 8.

Definition 8. item cap (pcap(im,Tj)): The “item cap” of item {im} in transaction Tj, denoted as pcap(im,Tj), is the product of the probability p(im,Tj) and the maximal existential probability value (denoted as M) of the frequent items other than itself, multiplied (n-1) times. It is defined as

$$p^{cap}(i_m,T_j)=\begin{cases}p(i_m,T_j)\cdot M^{\,n-1}, & |T_j|>1\\ p(i_1,T_j), & |T_j|=1\end{cases},\qquad M=\max_{m\in[1,|T_j|]}p(i_m,T_j) \tag{3}$$

Theorem 2. In the same transaction Tj, the existential probability of any k-itemset Im (k>1) that contains item {im} is not larger than pcap(im,Tj), that is: p(Im,Tj)≤pcap(im,Tj), where Im⊆Tj and {im}⊆Im.

Proof. Assume that itemset {Im}={i1,…,im} is a subset of transaction Tj, it can be known from definition 2 that
$$p(I_m,T_j)=\prod_{\{i\}\subseteq\{I_m\}}p(i,T_j)=p(i_m,T_j)\cdot\prod_{\{i\}\subseteq\{I_m-i_m\}}p(i,T_j) \tag{4}$$
Because the existential probability of each item {i} is not larger than 1 (that is, 0<p(i,Tj)≤1),

$$\prod_{\{i\}\subseteq\{I_m-i_m\}}p(i,T_j)\le\max_{1\le q\le m-1}p(i_q,T_j) \tag{5}$$
It can be derived from formulas (4) and (5) that:

$$p(I_m,T_j)=p(i_m,T_j)\cdot\prod_{\{i\}\subseteq\{I_m-i_m\}}p(i,T_j)\le p(i_m,T_j)\cdot\max_{1\le q\le m-1}p(i_q,T_j)=p^{cap}(i_m,T_j) \tag{6}$$

Thus, theorem 2 is correct.

Definition 9. support cap (supcap(Im)): The “support cap” of itemset Im, denoted as supcap(Im), is the sum of all pcap(Im,Tj) values that appear in the current sliding window. It is defined as

$$sup^{cap}(I_m)=\sum_{j=1}^{|SW|}\left(p^{cap}(I_m,T_j)\mid I_m\subseteq T_j\right) \tag{7}$$
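As an illustration of Definitions 8 and 9, the following Python sketch computes the item cap and support cap for the five-transaction running example when frequent 1-itemsets are about to be extended to 2-itemsets (n = 2). The function names `item_cap` and `support_cap` and the dictionary representation of transactions are assumptions made for this example. The sketch also assumes that M ranges over the frequent items of the transaction other than the item itself, and that an item with no frequent companion keeps its own probability, which is one reading of the |Tj| = 1 case.

```python
def item_cap(row, item, frequent, n):
    """p^cap of `item` in one transaction when extending to n-itemsets (sketch)."""
    if item not in frequent or item not in row:
        return 0.0                      # infrequent or absent items are recorded as 0
    others = [p for other, p in row.items()
              if other != item and other in frequent]
    if not others:                      # assumed reading of the |Tj| = 1 case
        return row[item]
    return row[item] * max(others) ** (n - 1)

def support_cap(rows, item, frequent, n):
    """sup^cap: sum of the item caps over the sliding window."""
    return sum(item_cap(row, item, frequent, n) for row in rows)

# Transactions T1..T5 of the running example.
rows = [
    {"a": 0.8, "d": 0.2, "e": 0.4, "f": 0.2},
    {"a": 0.6, "e": 0.7, "f": 0.2, "b": 0.1, "c": 0.9},
    {"a": 0.3, "e": 0.2, "b": 0.3, "c": 0.4},
    {"d": 0.2, "e": 0.7, "f": 0.1, "b": 0.2},
    {"a": 0.4, "d": 0.1, "f": 0.3, "b": 0.2, "c": 0.5},
]
frequent = {"a", "b", "c", "e", "f"}    # {d} is infrequent (sup = 0.5 < 0.6)
caps = {i: support_cap(rows, i, frequent, 2) for i in sorted(frequent)}
safe_infrequent = sorted(i for i, c in caps.items() if c < 0.6)
print(caps, safe_infrequent)
```

Under these assumptions the support caps come out as 1.18 for {a}, 1.17 for {e}, 0.56 for {f}, 0.45 for {b} and 0.95 for {c}, so {f} and {b} fall below min_sup = 0.6 and can be excluded from further extension.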
Theorem 3. For any k-itemset Im (k>1), the support value of Im is not larger than supcap(Im), that is sup(Im)≤supcap(Im).

Proof. Assume that Im={i1,…,im} is a subset of Tj, it can be known from definition 3 and theorem 2 that:

$$sup(I_m)=\sum_{j=1}^{|SW|}\prod_{\{i\}\subseteq\{I_m\}}p(i,T_j)=\sum_{j=1}^{|SW|}\Big(p(i_m,T_j)\cdot\prod_{\{i\}\subseteq\{I_m-i_m\}}p(i,T_j)\Big)\le\sum_{j=1}^{|SW|}p^{cap}(i_m,T_j)=sup^{cap}(I_m) \tag{8}$$

Therefore, theorem 3 is correct.

Definition 10. Safe infrequent itemset (SiFI): For an itemset Im, if its “support cap” value is less than min_sup, that is supcap(Im)<min_sup, then Im is a safe infrequent itemset.

Based on the “item cap” and “support cap” concepts, an improved version of the MiFI-UDSM approach, namely MiFI-UDSM*, is proposed to mine the minimal infrequent itemsets from uncertain data stream more quickly. Different from the MiFI-UDSM approach, the pcap(Im,Tj) values and the supcap(Im) values need to be calculated before each “pattern extension” operation from frequent k-itemsets to (k+1)-itemsets in order to discard the SiFIs, thereby reducing the dimensions of extensible patterns in a high-dimensional data stream and avoiding meaningless time cost. In addition, another matrix structure needs to be constructed to record the pcap value and supcap value of each item, where the last row records the supcap(Im) values instead of the support values, so that the items that do not participate in the subsequent “pattern extension” operations are easily excluded. If the support value of the current item is less than min_sup, all pcap values of this item in the matrix structure are recorded as 0, thereby reducing the time cost of the pcap and supcap calculations. The specific implementation steps of the MiFI-UDSM* approach are as follows.

In the first phase, the matrix structure is constructed as in the MiFI-UDSM approach, and then the minimal infrequent 1-itemsets and frequent 1-itemsets are mined by calculating the support values. Before extending the frequent 1-itemsets to 2-itemsets, the supcap(Im) value of each frequent 1-itemset is calculated to find the safe infrequent 1-itemsets. The safe infrequent 1-itemsets are directly connected with the frequent 1-itemsets stored in the FIL to form 2-itemsets, and these extended 2-itemsets are saved in the MiFIL directly without calculating their support values, thereby reducing the time cost of support value calculation. Because any superset of a safe infrequent 1-itemset is also infrequent, it is not necessary to conduct the “pattern extension” operation for them, and they are removed from FIL.

In the second phase, the frequent 1-itemsets stored in FIL undergo the “pattern extension” operation to extend them to 2-itemsets, and the support values of the extended 2-itemsets are calculated to further mine the minimal infrequent 2-itemsets and frequent 2-itemsets. For these frequent 2-itemsets, the supcap(Im) values are calculated to find the safe infrequent 2-itemsets. Specifically, in the extended frequent 2-itemsets, if the supcap(Im) value of each subset is not less than the predefined min_sup value, they are saved to FIL. Otherwise, the 2-itemsets are extended with the frequent 1-itemsets stored in FIL using the “pattern extension” operation to form the 3-itemsets; then the “minimal infrequent check” operation is conducted to check whether any subset exists in MiFIL, and the extended 3-itemsets are saved to MiFIL directly if no subset of them is in MiFIL. Then, the operations of the second phase are conducted recursively until all minimal infrequent itemsets in the transactions of the current sliding window are mined. The detailed process of the MiFI-UDSM* approach is shown in Algorithm 2.

Algorithm 2: MiFI-UDSM*
Input: Uncertain data stream, min_sup
Output: MiFIs
MiFIL=Φ, FIL=Φ
construct matrix A
if A|SW|+1,k

4.2.1 An example of MiFI-UDSM* approach

In this subsection, we use the example shown in Table 2 to explain the proposed MiFI-UDSM* approach more clearly; the min_sup value and |SW| are also the same as in subsection 3.1.

Step 1. Construct matrix A. The construction of matrix A is the same as in the MiFI-UDSM approach.

Step 2. Search for minimal infrequent 1-itemsets. This step is the same as in the MiFI-UDSM approach, and the frequent 1-itemsets {a}, {b}, {c}, {e} and {f} are saved to FIL.
Step 3. Calculate the pcap value and supcap value for each frequent 1-itemset. After the construction of matrix A, the pcap value and supcap value of each frequent 1-itemset are calculated to seek minimal infrequent 2-itemsets. For itemset {a} in transaction T1, the maximal existential probability value other than its own is 0.4, thus pcap(a,T1)=0.8*0.4=0.32, which is written to the corresponding position of the matrix structure. The pcap values of the items in the other frequent 1-itemsets are calculated in the same way. In particular, the pcap value of the infrequent 1-itemset {d} is written as 0 to omit the calculation, because any superset of an infrequent itemset is also an infrequent itemset. When all pcap values are calculated and written into the matrix structure, the supcap values are calculated and written to the last row of the matrix, for example supcap(a)=0.32+0.54+0.12+0+0.2=1.18. The detailed pcap and supcap values of each frequent 1-itemset are shown in Fig. 3.

             a      d     e      f      b      c
     T1      0.32   0     0.32   0.16   0      0
     T2      0.54   0     0.63   0.18   0.09   0.63
     T3      0.12   0     0.08   0      0.12   0.12
     T4      0      0     0.14   0.07   0.14   0
     T5      0.2    0     0      0.15   0.1    0.2
     supcap  1.18   0     1.17   0.56   0.45   0.95

Fig. 3. The pcap value and supcap value for each frequent 1-itemset

Step 4. Search for safe minimal infrequent 2-itemsets. After the supcap value of each frequent 1-itemset is calculated, the safe minimal infrequent 2-itemsets, which are extended from the frequent 1-itemsets whose supcap value is less than min_sup, are mined and stored into MiFIL. Because supcap(f)=0.56<0.6 and supcap(b)=0.45<0.6, itemsets {f} and {b} are safe infrequent itemsets, that is, the 2-itemsets extended from them are minimal infrequent itemsets. They are then combined with the frequent 1-itemsets stored in FIL to form the minimal infrequent 2-itemsets. Because the frequent 1-itemsets in FIL are {a}, {b}, {c}, {e} and {f}, the minimal infrequent 2-itemsets extended from them are {af}, {ef}, {fb}, {fc}, {ab}, {eb} and {bc}; they are saved into MiFIL directly without calculating their support values.

Step 5. Extend the “frequent” 1-itemsets to 2-itemsets. After step 4, the remaining “frequent” 1-itemsets are only {a}, {e} and {c}, which can be extended to the 2-itemsets {{ae}, {ac}, {ec}}.

Step 6. Search for minimal infrequent 2-itemsets. After the 2-itemsets are extended from the “frequent” 1-itemsets, their support values are calculated to determine whether they are minimal infrequent itemsets. The support values of the extended 2-itemsets are {{ae}:0.8, {ac}:0.86, {ec}:0.71}; thus, itemsets {ae}, {ac} and {ec} are frequent 2-itemsets and are saved to FIL.

Step 7. Calculate the pcap value and supcap value for each frequent 2-itemset. Before extending the frequent 2-itemsets to 3-itemsets, the pcap and supcap values of the 1-items in the frequent 2-itemsets are calculated to determine whether they can be further extended. For {a}, pcap(a,T1)=0.8*0.4*0.4=0.128, which is written to the corresponding position of the matrix structure. The pcap values of the other items in the frequent 2-itemsets are calculated in the same way. When all pcap values are calculated and written into the matrix structure, the supcap values are calculated and written to the last row of the matrix, for example supcap(a)=0.128+0.486+0.048+0+0.1=0.762. The detailed result is shown in Fig. 4.

             a       d     e       f     b     c
     T1      0.128   0     0.256   0     0     0
     T2      0.486   0     0.567   0     0     0.441
     T3      0.048   0     0.032   0     0     0.036
     T4      0       0     0.028   0     0     0
     T5      0.1     0     0       0     0     0.08
     supcap  0.762   0     0.883   0     0     0.557

Fig. 4. The pcap value and supcap value for each frequent 2-itemset

Step 8. Search for safe minimal infrequent 3-itemsets. After the supcap value of each frequent 2-itemset is calculated, the safe minimal infrequent 3-itemsets extended from the frequent 2-itemsets are mined and stored into MiFIL. Because supcap(c)=0.557<0.6, itemsets {ac} and {ec} are safe infrequent itemsets, that is, the 3-itemsets extended from them are minimal infrequent itemsets. They are then combined with the frequent 2-itemsets stored in FIL to form the minimal infrequent 3-itemsets. Because the frequent 2-itemsets in FIL are {ae}, {ac} and {ec}, the only minimal infrequent 3-itemset extended from them is {aec}; it is saved to MiFIL directly without calculating its support value.

Because no frequent itemsets remain in this example, the “pattern extension” process ends. Finally, the minimal infrequent itemsets in MiFIL are {d}, {af}, {ab}, {ef}, {eb}, {fb}, {fc}, {bc} and {aec}.

4.3. The correctness and computing complexity of MiFI-UDSM and MiFI-UDSM*

In this subsection, the correctness and the computing complexity of the two proposed minimal infrequent itemset mining approaches, MiFI-UDSM and MiFI-UDSM*, are analyzed.
For the MiFI-UDSM approach, the factor affecting the mining accuracy is the removal of infrequent itemsets during the “pattern extension” phase. It can be known from Theorem 1 that if {Xk} is an infrequent k-itemset and {Xk+1} is a superset of {Xk} extended by itemset {Xk} and itemset {Y} (sup(Y)≤1), then sup(Xk+1)≤sup(Xk)*p(Y)≤sup(Xk)<min_sup. For the extended itemsets whose length is larger than 2, the “minimal infrequent check” operation is conducted after the support value calculation; therefore, the itemsets saved in MiFIL are guaranteed to be minimal. Overall, the MiFIs mined by MiFI-UDSM are correct and no MiFI is missed.

Different from the MiFI-UDSM approach, the purpose of introducing the concepts of “item cap” and “support cap” in MiFI-UDSM* is to reduce the scale of potential extensible itemsets, thereby improving the mining efficiency. It can be known from Theorem 2 that if the itemset {X} is composed by itemset {xr} and supcap(xr)<min_sup, then sup(X)≤supcap(xr)<min_sup, so {X} is an infrequent itemset.

The MiFI-UDSM approach can be decomposed into four steps, and the computing complexity of each step is as follows. Step 1: Scan the item information in the sliding window and write the probabilities into the matrix structure; the computing complexity of this step is O(n*|SW|), where n is the number of distinct items. Step 2: Calculate the support value of each 1-itemset and discover the minimal infrequent 1-itemsets; the computing complexity of this step is also O(n*|SW|). Step 3: Extend the frequent 1-itemsets to longer itemsets and discover the infrequent itemsets. In the worst case, all extended itemsets are frequent, so (2^n-1) itemsets can be extended from the n frequent 1-itemsets, and the computing complexity of this step is O(2^n-1) in the worst case. Step 4: Discover the minimal infrequent itemsets. The “minimal infrequent check” operation need not be conducted on the n 1-itemsets and the (n*(n-1)/2) 2-itemsets. In the worst case, (2^n-1-(n^2+n)/2) itemsets need to be checked; therefore, the computing complexity of this step is O(2^n-1-(n^2+n)/2) in the worst case. In general, the computing complexity of the MiFI-UDSM approach is O((2n*|SW|)+(2^n-1)+(2^n-1-(n^2+n)/2)). Because (2n*|SW|-(n^2+n)/2-2) is far less than 2^(n+1), the final computing complexity of the MiFI-UDSM approach is O(2^(n+1)).

The difference between the MiFI-UDSM* approach and the MiFI-UDSM approach is that the pcap and supcap values need to be calculated before each “pattern extension” operation. The computing complexity of this operation is O(2*k*n) in the worst case, where k is the maximal length of extensible itemsets. In this case, the computing complexity of the MiFI-UDSM* approach is O((2n*|SW|)+(2^n-1)+(2^n-1-(n^2+n)/2)+2*k*n). Because (2n*|SW|-(n^2+n)/2-2+2*k*n) is far less than 2^(n+1), the final computing complexity of the MiFI-UDSM* approach is also O(2^(n+1)) in the worst case.

4.4. Outlier detection approach

It can be known from the definition of outlier proposed by Hawkins [15] that outliers have two major attributes: (1) they rarely appear, and (2) they deviate much from most observations. For the attribute of rarely appearing, the proposed MiFI-UDSM and MiFI-UDSM* approaches can effectively mine the rarely appearing itemsets (i.e. minimal infrequent itemsets) from uncertain data stream. For the attribute of deviating much from most observations, we refer to the MWIFIM-OD-UDS approach [5] and the MIFPOD approach [17] to design three deviation indices based on the mined minimal infrequent itemsets, thereby measuring the deviation degree of each transaction in the current sliding window. The factors that influence the deviation degree mainly involve the following aspects.

(1) The support value of minimal infrequent itemsets. A smaller support value of minimal infrequent itemset Im means that Im appears more rarely or with smaller probability, which makes the detected object more abnormal. Thus, the support value is negatively correlated with the deviation degree.

(2) The length of minimal infrequent itemsets. For a minimal infrequent 1-itemset {ia} and a minimal infrequent 2-itemset {iaib}, the number of infrequent itemsets that can be extended from {ia} is (2^(k-1)-1), and the number of infrequent itemsets that can be extended from {iaib} is (2^(k-2)-1). That is, shorter minimal infrequent itemsets can generate more infrequent itemsets, so short minimal infrequent itemsets make the detected transactions more abnormal. Thus, the length of minimal infrequent itemset is negatively correlated with the deviation degree.

(3) The number of contained minimal infrequent itemsets. For a transaction Ti, if more minimal infrequent itemsets are contained in Ti, which indicates that many itemsets in Ti appear rarely, transaction Ti is more like an outlier. Thus, the number of contained minimal infrequent itemsets is positively correlated with the deviation degree.

(4) The similar degree to the minimal infrequent itemsets in MiFIL. For a transaction Ti, the more similar it is to the itemsets stored in MiFIL, the more it is like an outlier. Thus, the similar degree is positively correlated with the deviation degree.

To better quantify the abnormal degree of the detected transactions, the minimal infrequent itemset deviation index (MiFIDI), similarity deviation index (SDI) and transaction deviation index (TDI) are defined in the next paragraphs.

Definition 10. MiFIDI (Minimal infrequent Itemset Deviation Index): For each minimal infrequent itemset {X}, the length of {X} is len(X), the support value of {X} is sup(X), and the number of different items in the current sliding window is k. Then, MiFIDI is defined as

$$MiFIDI(X)=(min\_sup-sup(X))\cdot 2^{\,k-len(X)} \tag{9}$$

MiFIDI is the deviation index of minimal infrequent itemsets; a bigger MiFIDI(X) value means itemset {X} is more abnormal.

Definition 11. SDI (Similarity Deviation Index): For each transaction Ti in the sliding window, its length is len(Ti). Then, SDI is defined as

$$SDI(T_i)=\sum_{X\in MiFIL,\;Y=(T_i\cap X)}\frac{1}{sup(Y)}\cdot\frac{len(Y)}{len(T_i)} \tag{10}$$

SDI is an important deviation index of the transactions; a bigger SDI(Ti) value means transaction Ti is more abnormal.

Definition 12. TDI (Transaction Deviation Index): For each transaction Ti in the sliding window, its length is len(Ti). Then, TDI is defined as

$$TDI(T_i)=\frac{\sum_{X\subseteq T_i,\;X\in MiFIL}MiFIDI(X)+SDI(T_i)}{len(T_i)} \tag{11}$$
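The three indices can be illustrated with a short Python sketch on the five-transaction running example. It deliberately follows the arithmetic of the worked example in subsection 4.4.1 rather than asserting a definitive reading of the formulas: each similar part Y = Ti ∩ X contributes len(Y) divided by the probability of Y inside Ti, and the transaction score adds the MiFIDI values of the contained MiFIs to SDI(Ti) without normalizing by len(Ti). The function names and data layout are assumptions made for the example.

```python
from math import prod

# Transactions of the running example and the mined MiFIs with their supports.
rows = {
    "T1": {"a": 0.8, "d": 0.2, "e": 0.4, "f": 0.2},
    "T2": {"a": 0.6, "e": 0.7, "f": 0.2, "b": 0.1, "c": 0.9},
    "T3": {"a": 0.3, "e": 0.2, "b": 0.3, "c": 0.4},
    "T4": {"d": 0.2, "e": 0.7, "f": 0.1, "b": 0.2},
    "T5": {"a": 0.4, "d": 0.1, "f": 0.3, "b": 0.2, "c": 0.5},
}
mifil = {"d": 0.5, "af": 0.4, "ab": 0.23, "ef": 0.29, "eb": 0.27,
         "fb": 0.1, "fc": 0.33, "bc": 0.31, "aec": 0.402}
MIN_SUP, K = 0.6, 6        # K is the number of distinct items in the window

def mifidi(x):
    """(min_sup - sup(X)) * 2^(k - len(X))."""
    return (MIN_SUP - mifil[x]) * 2 ** (K - len(x))

def sdi(row):
    """Sum of len(Y) / p(Y, Ti) over the non-empty similar parts Y = Ti ∩ X."""
    total = 0.0
    for x in mifil:
        y = [item for item in x if item in row]
        if y:
            total += len(y) / prod(row[item] for item in y)
    return total

def tdi(row):
    """MiFIDI of every MiFI fully contained in the transaction, plus its SDI."""
    contained = [x for x in mifil if all(item in row for item in x)]
    return sum(mifidi(x) for x in contained) + sdi(row)

scores = {name: tdi(row) for name, row in rows.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

With these assumptions the sketch ranks the transactions as T2, T3, T4, T5, T1, and the score of T1 is 11.36 + 62.5 = 73.86, as in the worked example.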
TDI is the final deviation index of the transactions; a bigger TDI(Ti) value means the transaction Ti is more likely to be an outlier.

Based on the calculation of the MiFIDI(X), SDI(Ti) and TDI(Ti) values, a Minimal inFrequent Itemset-based Outlier detection approach, namely MiFI-Outlier, is proposed to accurately detect the implicit outliers from uncertain data stream. The process of the MiFI-Outlier approach is roughly divided into: (1) mine the minimal infrequent itemsets from uncertain data stream using the proposed MiFI-UDSM method or MiFI-UDSM* method, (2) determine the minimal infrequent itemsets contained in the current transactions and calculate the deviation degree of each transaction, and (3) sort the transactions by decreasing TDI(Ti) values, thereby detecting the outliers. Then, the top k transactions with the largest TDI(Ti) values are judged as outliers, where the value k is specified by the users. The detailed process of the proposed MiFI-Outlier approach is shown in Algorithm 3.

Algorithm 3: MiFI-Outlier
Input: Uncertain data stream, min_sup, k
Output: Outlier Sets (OS)
call Algorithm 2 // mine minimal infrequent itemsets
MiFIDI(X)=0, SDI(Ti)=0, TDI(Ti)=0
foreach {X}∈MiFIL do
    MiFIDI(X)=(min_sup-sup(X))*2^(k-len(X))
end for
for i∈[1,|SW|] do
    foreach {X}⊆Ti and {X}∈MiFIL do
        SDI(Ti)=Σ_{X∈MiFIL, Y=(Ti∩X)} (1/sup(Y))*(len(Y)/len(Ti))
        calculate n(X) // n(X) is the number of MiFIs in Ti
    end for
    foreach {X}⊆Ti do
        TDI(Ti)=(Σ_{X⊆Ti, X∈MiFIL} MiFIDI(X)+SDI(Ti))/len(Ti)
    end for
end for
sort transactions using decreasing TDI(Ti) values
OS←top k {Ti}
return outliers in OS

4.4.1 An example of MiFI-Outlier approach

This subsection uses the data listed in Table 2 to illustrate the proposed MiFI-Outlier approach more clearly. The mined minimal infrequent itemsets and their support values in this example are {d}:0.5, {af}:0.4, {ab}:0.23, {ef}:0.29, {eb}:0.27, {fb}:0.1, {fc}:0.33, {bc}:0.31 and {aec}:0.402, the min_sup value is 0.6 and the number of different items (recorded as k) is 6.

Step 1. Calculate the MiFIDI value for each MiFI.

For itemset {d}, MiFIDI(d)=(0.6-0.5)*2^(6-1)=3.2;
For itemset {af}, MiFIDI(af)=(0.6-0.4)*2^(6-2)=3.2;
For itemset {ab}, MiFIDI(ab)=(0.6-0.23)*2^(6-2)=5.92;
For itemset {ef}, MiFIDI(ef)=(0.6-0.29)*2^(6-2)=4.96;
For itemset {eb}, MiFIDI(eb)=(0.6-0.27)*2^(6-2)=5.28;
For itemset {fb}, MiFIDI(fb)=(0.6-0.1)*2^(6-2)=8;
For itemset {fc}, MiFIDI(fc)=(0.6-0.33)*2^(6-2)=4.32;
For itemset {bc}, MiFIDI(bc)=(0.6-0.31)*2^(6-2)=4.64;
For itemset {aec}, MiFIDI(aec)=(0.6-0.402)*2^(6-3)=1.584.

Step 2. Determine the similar parts for each transaction and calculate its SDI value.

For transaction T1, the similar parts are {d}:0.2, {af}:0.16, {a}:0.8, {ef}:0.08, {e}:0.4, {f}:0.2, {f}:0.2 and {ae}:0.32, thus, SDI(T1)=1/0.2+2/0.16+1/0.8+2/0.08+1/0.4+1/0.2+1/0.2+2/0.32=62.5;
For transaction T2, the similar parts are {af}:0.12, {ab}:0.06, {ef}:0.14, {eb}:0.07, {fb}:0.02, {fc}:0.18, {bc}:0.09 and {aec}:0.378, thus, SDI(T2)=2/0.12+2/0.06+2/0.14+2/0.07+2/0.02+2/0.18+2/0.09+3/0.378=234.13;
For transaction T3, the similar parts are {a}:0.3, {ab}:0.09, {e}:0.2, {eb}:0.06, {b}:0.3, {c}:0.4, {bc}:0.12 and {aec}:0.024, thus, SDI(T3)=1/0.3+2/0.09+1/0.2+2/0.06+1/0.3+1/0.4+2/0.12+3/0.024=211.39;
For transaction T4, the similar parts are {d}:0.2, {f}:0.1, {b}:0.2, {ef}:0.07, {eb}:0.14, {fb}:0.02, {f}:0.1, {b}:0.2 and {e}:0.7, thus, SDI(T4)=1/0.2+1/0.1+1/0.2+2/0.07+2/0.14+2/0.02+1/0.1+1/0.2+1/0.7=179.29;
For transaction T5, the similar parts are {d}:0.1, {af}:0.12, {ab}:0.08, {f}:0.3, {b}:0.2, {fb}:0.06, {fc}:0.15, {bc}:0.1 and {ac}:0.2, thus, SDI(T5)=1/0.1+2/0.12+2/0.08+1/0.3+1/0.2+2/0.06+2/0.15+2/0.1+2/0.2=136.67.

Step 3. Determine the contained MiFIs for each transaction and calculate its TDI value.

For transaction T1, the contained MiFIs are {d}, {af} and {ef}, thus, TDI(T1)=(3.2+3.2+4.96)+62.5=73.86;
For transaction T2, the contained MiFIs are {ab}, {af}, {bc}, {eb}, {fb}, {fc}, {ef} and {aec}, thus, TDI(T2)=(5.92+3.2+4.64+5.28+8+4.32+4.96+1.584)+234.13=272.034;
For transaction T3, the contained MiFIs are {ab}, {bc}, {eb} and {aec}, thus, TDI(T3)=(5.92+4.64+5.28+1.584)+211.39=228.814;
For transaction T4, the contained MiFIs are {d}, {eb}, {fb} and {ef}, thus, TDI(T4)=(3.2+5.28+8+4.96)+179.29=200.73;
For transaction T5, the contained MiFIs are {d}, {ab}, {af}, {bc}, {fb} and {fc}, thus, TDI(T5)=(3.2+5.92+3.2+4.64+8+4.32)+136.67=165.95.

Step 4. Sort the transactions by decreasing TDI values. After the calculation of the TDI values, the transactions in decreasing order of their probability of being outliers are T2, T3, T4, T5, T1.

5. Experiments and analysis

Extensive experiments are first conducted on a synthetic dataset [5] (the outliers are marked) and a public dataset¹ (namely lymphography, where classes 1 and 4 are marked as outliers) to evaluate the detection accuracy of the proposed MiFI-Outlier approach, and then four public datasets² (namely mushroom, pumsb*, chess and kosarak) are used to evaluate the mining efficiency of the two proposed minimal infrequent itemset mining approaches, including time cost and memory usage. Because the data in the four public datasets do not provide probability values, we assign a randomly generated existential probability in the range (0.0,1.0] to each data item as suggested by [22]. The characteristics of the used datasets are shown in Table 3.
Table 3 Characteristics of used datasets

Datasets            Num. of trans.   Num. of items   Avg. trans. size   Data size
mushroom            8124             120             23                 0.545MB
kosarak             990002           41270           8.1                31.4MB
pumsb*              49046            2113            74                 15.9MB
chess               3196             75              37                 901KB
lymphography        148              58              18                 6KB
synthetic dataset   1200             76              7.5                55KB
To evaluate the proposed MiFI-Outlier approach, the itemset-based outlier detection methods FindFPOF [16], MIFPOD [17], OODFP [25] and FIM-UDSOD [14] are used as the compared approaches. In addition, the distance-based method DPA [36] is also compared in this experiment. The experiment is conducted under different min_sup values and different sizes of sliding window (the ratio of the min_sup value to the size of the sliding window is set to 33%, 34%, 35%, 36%, 37% and 38%, respectively) to discover the implicit outliers from the uncertain data stream.
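The min_sup values used throughout the experiments follow directly from these ratios. As a small illustration, the hypothetical helper below derives the grid for a given sliding window size. Percentages are kept as integers so that the division yields the same floating-point values as the decimal literals.

```python
def min_sup_grid(window_size, percents=(33, 34, 35, 36, 37, 38)):
    """min_sup values for one sliding window size, given the ratios
    (in percent) of min_sup to the window size."""
    return [window_size * p / 100 for p in percents]


for size in (30, 40, 50):
    print(size, min_sup_grid(size))
```

For window sizes 30, 40 and 50 this gives the grids starting at 9.9, 13.2 and 16.5 that appear in the detection accuracy and time cost experiments.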
1 https://archive.ics.uci.edu/ml/datasets.html
2 http://fimi.cs.helsinki.fi/data/
To test the efficiency of minimal infrequent itemset mining, the time cost and memory usage of the proposed MiFI-UDSM and MiFI-UDSM* approaches are tested under different min_sup values and different sizes of sliding window. To demonstrate the effectiveness of the two proposed approaches, the Apriori-based method MRG-Exp [32], the matrix-based method MIP-DS [17], and the FP-Growth-based methods DSUF-min [40] and RP-Tree [13] are used as the compared methods. Among these, MRG-Exp and MIP-DS mine the minimal infrequent itemsets from precise datasets, RP-Tree mines infrequent itemsets from precise datasets, and DSUF-min mines the frequent itemsets from uncertain data streams. To compare the efficiency of the proposed approaches more directly, we have modified the four compared methods so that they accurately mine the minimal infrequent itemsets from uncertain data streams.
The experiments in this section are implemented on a machine running Windows 10 with an Intel dual-core i3-2020 3.30 GHz processor and 8 GB RAM, and all code used in the experiments is written in Python.
5.1. Detection accuracy of MiFI-Outlier approach on synthetic dataset
This subsection tests the detection accuracy of the proposed MiFI-Outlier approach on the synthetic dataset, where the experiments are conducted under different min_sup values and different sizes of sliding window. Firstly, the specific experimental results of the first three windows are shown in Table 4, where ‘No. SW’ means the number of the sliding window and ‘|outliers|’ means the number of true outliers; the values in the parentheses represent the number of true outliers detected, and the values outside the parentheses represent the number of transactions that have been selected by the MiFI-Outlier approach, with the last entry of each cell reached when all true outliers are detected. The closer the number of selected transactions is to the number of outliers, the higher the detection accuracy; error detections are marked in bold. Secondly, the experimental results of the six compared approaches under different min_sup values and different sizes of sliding window are shown in Fig. 5 to Fig. 7. In these figures, the “Detection accuracy (%)” index (on the y-axis) measures the detection accuracy of each approach (Detection accuracy = n1/n2, where n1 is the number of true outliers and n2 is the number of selected transactions when all true outliers have been detected), while the “No. of sliding window” (on the x-axis) indicates the number of the sliding window.

Table 4 Detection accuracy of MiFI-Outlier approach on the synthetic dataset

Size of sliding window=30

| No. SW (\|outliers\|) | min_sup=9.9 | min_sup=10.2 | min_sup=10.5 | min_sup=10.8 | min_sup=11.1 | min_sup=11.4 |
|---|---|---|---|---|---|---|
| 1(7) | 3(3), **7(5)**, **10(7)** | 3(3), **7(6)**, **9(7)** | 3(3), 7(7) | 3(3), 7(7) | 3(3), 7(7) | 3(3), 7(7) |
| 2(7) | 3(3), 7(7) | 3(3), 7(7) | 3(3), 7(7) | 3(3), 7(7) | 3(3), 7(7) | 3(3), 7(7) |
| 3(7) | 3(3), 7(7) | 3(3), 7(7) | 3(3), 7(7) | 3(3), 7(7) | 3(3), 7(7) | 3(3), 7(7) |

Size of sliding window=40

| No. SW (\|outliers\|) | min_sup=13.2 | min_sup=13.6 | min_sup=14.0 | min_sup=14.4 | min_sup=14.8 | min_sup=15.2 |
|---|---|---|---|---|---|---|
| 1(11) | 5(5), 11(11) | 5(5), 11(11) | 5(5), 11(11) | 5(5), 11(11) | 5(5), 11(11) | 5(5), 11(11) |
| 2(9) | **5(4)**, **9(7)**, **12(9)** | 5(5), 9(9) | 5(5), 9(9) | 5(5), 9(9) | 5(5), 9(9) | 5(5), 9(9) |
| 3(7) | 3(3), 7(7) | 3(3), 7(7) | 3(3), 7(7) | 3(3), 7(7) | 3(3), 7(7) | 3(3), 7(7) |

Size of sliding window=50

| No. SW (\|outliers\|) | min_sup=16.5 | min_sup=17.0 | min_sup=17.5 | min_sup=18.0 | min_sup=18.5 | min_sup=19.0 |
|---|---|---|---|---|---|---|
| 1(13) | 6(6), 13(13) | 6(6), 13(13) | 6(6), 13(13) | 6(6), 13(13) | 6(6), 13(13) | 6(6), 13(13) |
| 2(10) | 5(5), 10(10) | 5(5), 10(10) | 5(5), 10(10) | 5(5), 10(10) | 5(5), 10(10) | 5(5), 10(10) |
| 3(10) | 5(5), 10(10) | 5(5), 10(10) | 5(5), 10(10) | 5(5), 10(10) | 5(5), 10(10) | 5(5), 10(10) |

It can be seen from Table 4 that when the size of sliding window is 30, error detections appear in the first sliding window when the min_sup value is set to 9.9 and 10.2. Specifically, when the min_sup value is set to 9.9, ten transactions need to be selected before all seven outliers are detected, while nine transactions need to be selected when the min_sup value is set to 10.2. When the min_sup value is not less than 10.5, the detection accuracy of the MiFI-Outlier approach reaches 100% in the first window. In the second and third windows, the detection accuracy reaches 100% regardless of the min_sup value. When the size of sliding window is 40, the detection accuracy reaches 100% in the first and third windows regardless of the min_sup value, but in the second sliding window it does not reach 100% when the min_sup value is set to 13.2: twelve transactions need to be selected before all nine outliers are detected. When the size of sliding window is 50, the detection accuracy reaches 100% in all of the first three windows regardless of the min_sup value. Overall, the detection accuracy of the proposed MiFI-Outlier approach shows an increasing trend as the min_sup value increases; the approach can be used to detect the implicit outliers from an uncertain data stream when the min_sup value is set relatively large, but it is not suitable when the min_sup value is set very small. The reason is that with small min_sup values, the number of mined minimal infrequent itemsets is very limited, so the designed deviation indices cannot play their full role in the outlier detection process.
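The detection accuracy measure n1/n2 used in Table 4 and in Fig. 5 to Fig. 7 can be sketched as follows. The sketch assumes that transactions are ranked by decreasing deviation (for example by TDI) and selected from the top until every true outlier has been covered. The function name and the toy identifiers are hypothetical.

```python
def detection_accuracy(ranked_ids, true_outliers):
    """Return n1 / n2, where n1 is the number of true outliers and n2 is the
    number of top-ranked transactions selected when all true outliers have
    been detected."""
    targets = set(true_outliers)
    remaining = set(targets)
    for n2, tid in enumerate(ranked_ids, start=1):
        remaining.discard(tid)
        if not remaining:
            return len(targets) / n2
    raise ValueError("the ranking does not contain every true outlier")


# Toy window: 7 true outliers, the last of which is ranked 10th, so ten
# transactions must be selected before all seven are detected.
ranking = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
outliers = {1, 2, 3, 4, 5, 9, 10}
accuracy = detection_accuracy(ranking, outliers)
print(f"{accuracy:.0%}")
```

This toy case mirrors the pattern of the first window at min_sup=9.9 in Table 4, where five outliers are found among the first seven selected transactions and all seven only after ten selections.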
Fig.5. Detection accuracy of MiFI-Outlier approach when |SW| is 30. Panels: (a) min_sup=9.9, (b) min_sup=10.2, (c) min_sup=10.5, (d) min_sup=10.8, (e) min_sup=11.1, (f) min_sup=11.4. x-axis: No. of sliding window; y-axis: Detection accuracy (%). Compared approaches: MiFI-Outlier, OODFP, FindFPOF, MIFPOD, FIM-UDSOD, DPA.
Fig.6. Detection accuracy of MiFI-Outlier approach when |SW| is 40. Panels: (a) min_sup=13.2, (b) min_sup=13.6, (c) min_sup=14.0, (d) min_sup=14.4, (e) min_sup=14.8, (f) min_sup=15.2. x-axis: No. of sliding window; y-axis: Detection accuracy (%). Compared approaches: MiFI-Outlier, OODFP, FindFPOF, MIFPOD, FIM-UDSOD, DPA.
Fig.7. Detection accuracy of MiFI-Outlier approach when |SW| is 50. Panels: (a) min_sup=16.5, (b) min_sup=17.0, (c) min_sup=17.5, (d) min_sup=18.0, (e) min_sup=18.5, (f) min_sup=19.0. x-axis: No. of sliding window; y-axis: Detection accuracy (%). Compared approaches: MiFI-Outlier, OODFP, FindFPOF, MIFPOD, FIM-UDSOD, DPA.
When the size of sliding window is set to 30 and the min_sup value is selected from {9.9, 10.2, 10.5, 10.8, 11.1, 11.4}, the detection accuracy of the proposed MiFI-Outlier approach against MIFPOD, FindFPOF, OODFP, FIM-UDSOD and DPA is shown in Fig. 5. When the min_sup value is set to 9.9, some error detections appear in the proposed MiFI-Outlier approach, but its detection accuracy is always higher than that of the other five compared approaches. Among the compared approaches, the detection accuracy of the distance-based outlier detection approach DPA is higher than that of the FindFPOF and OODFP approaches, but slightly lower than that of the MIFPOD approach in a few windows; in addition, the detection accuracy of the DPA approach does not change as the min_sup value increases. When the min_sup value is set to 10.2, error detections of the MiFI-Outlier approach appear in the 1st, 13th, 16th and 23rd sliding windows; when it is set to 10.5, they appear only in the 13th and 23rd sliding windows; and when it is set to 10.8, 11.1 and 11.4, the detection accuracy of the proposed MiFI-Outlier approach reaches 100% in all sliding windows. That is, the detection accuracy of the MiFI-Outlier approach shows an increasing trend as the min_sup value increases.
In addition, with the increase of min_sup values, the detection accuracy of the MIFPOD approach also shows a slightly increasing trend, while the detection accuracy of the FindFPOF, OODFP and FIM-UDSOD approaches shows a decreasing trend. The reason is that with large min_sup values, the scale of mined minimal infrequent itemsets for the MiFI-Outlier and MIFPOD approaches is much larger than with small min_sup values; thus, more itemsets can be used in the outlier detection process, which improves the detection accuracy. The results indicate that the proposed MiFI-Outlier approach can accurately detect the implicit outliers from an uncertain data stream with relatively large min_sup values. In contrast, with large min_sup values, the scale of mined frequent itemsets or maximal frequent itemsets for the FindFPOF, OODFP and FIM-UDSOD approaches is much smaller than with small min_sup values; thus, fewer itemsets can be used in the outlier detection process, which reduces the detection accuracy of these approaches.
When the size of sliding window is set to 40, the experimental results of the six compared approaches are shown in Fig. 6(a) to Fig. 6(f). As can be seen from Fig. 6(a), when the min_sup value is set to 13.2, error detections appear in six of the thirty sliding windows; when the min_sup value is set to 13.6, the error detections decrease to three windows; and when the min_sup value is set to 14.0, 14.4, 14.8 and 15.2, the detection accuracy of the MiFI-Outlier approach reaches 100% in all sliding windows. Similar to the case when the size of sliding window is set to 30, the experimental results indicate that the detection accuracy of the proposed MiFI-Outlier approach gradually increases as the min_sup value increases. This is because when the min_sup value is relatively small, the scale of minimal infrequent itemsets used in the outlier detection phase is significantly reduced, and thus the detection accuracy is also affected. For the other five compared approaches, the detection accuracy of the MIFPOD and DPA approaches is higher than that of the OODFP, FindFPOF and FIM-UDSOD approaches in most sliding windows, while the detection accuracy of the OODFP, FindFPOF and FIM-UDSOD approaches is very close regardless of the min_sup value.
Fig. 7(a) to Fig. 7(f) show the detection accuracy of the six compared approaches when the size of sliding window is set to 50, where the min_sup value is selected from {16.5, 17.0, 17.5, 18.0, 18.5, 19.0}. When the min_sup value is set to 16.5, error detections appear in only two windows, while when the min_sup value is not less than 17.0, the detection accuracy of the proposed MiFI-Outlier approach reaches 100% in all sliding windows. Among the compared approaches, the stability of the detection accuracy of the MIFPOD approach is very poor, while the stability of the OODFP, FindFPOF and FIM-UDSOD approaches is much better. With the increase of min_sup values, the detection accuracy of the MIFPOD approach has a noticeable boost, while that of the FindFPOF, OODFP and FIM-UDSOD approaches remains very stable.
Overall, the proposed MiFI-Outlier approach is more sensitive to the min_sup value than to the size of sliding window, and its detection accuracy is much higher with large min_sup values. When the ratio of the min_sup value to the size of sliding window is not less than 36%, the detection accuracy of the proposed MiFI-Outlier approach reaches 100% in almost all windows, and its outlier detection ability is much stronger than that of the distance-based approach DPA and the existing itemset-based approaches FindFPOF, OODFP, MIFPOD and FIM-UDSOD.
5.2. Detection accuracy of MiFI-Outlier approach on public dataset
This subsection tests the detection accuracy of the six compared approaches on a public dataset. The experiment is also conducted under different min_sup values and different sizes of sliding window, and the experimental results are shown in Table 5, where the number outside the brackets is the number of transactions selected when all outliers are detected, and the number in the brackets is the detection accuracy.
Table 5 Detection accuracy of MiFI-Outlier approach on public dataset lymphography

| \|SW\| (min_sup) | MiFI-Outlier | OODFP | FindFPOF | MIFPOD | FIM-UDSOD | DPA |
|---|---|---|---|---|---|---|
| 148 (75.48) | 11 (54.55%) | 60 (10%) | 60 (10%) | 96 (6.25%) | 60 (10%) | 23 (26.09%) |
| 148 (76.96) | 11 (54.55%) | 60 (10%) | 60 (10%) | 101 (5.94%) | 60 (10%) | 23 (26.09%) |
| 148 (78.44) | 9 (66.67%) | 139 (4.32%) | 139 (4.32%) | 42 (14.29%) | 139 (4.32%) | 23 (26.09%) |
| 148 (79.92) | 8 (75%) | 139 (4.32%) | 139 (4.32%) | 11 (54.55%) | 139 (4.32%) | 23 (26.09%) |
| 148 (81.4) | 8 (75%) | 139 (4.32%) | 139 (4.32%) | 12 (50%) | 139 (4.32%) | 23 (26.09%) |
| 148 (82.88) | 7 (85.71%) | 138 (4.35%) | 138 (4.35%) | 8 (75%) | 138 (4.35%) | 23 (26.09%) |
| 148 (84.36) | 7 (85.71%) | 138 (4.35%) | 138 (4.35%) | 8 (75%) | 138 (4.35%) | 23 (26.09%) |

It can be seen from Table 5 that among the six compared approaches, the detection accuracy of the proposed MiFI-Outlier approach is higher than that of the other five compared approaches at every min_sup value, while the OODFP, FindFPOF and FIM-UDSOD approaches yield the same detection accuracy as one another whether the min_sup value is set to 75.48, 76.96, 78.44, 79.92, 81.4, 82.88 or 84.36. With the increase of min_sup values, the detection accuracy of the proposed MiFI-Outlier approach shows an increasing trend; the reason is that the scale of minimal infrequent itemsets is relatively large when the min_sup value is set large. For the distance-based outlier detection approach DPA, the detection accuracy is not influenced by the min_sup value; it is higher than that of the OODFP, FindFPOF and FIM-UDSOD approaches at every min_sup value, and higher than that of the MIFPOD approach when the min_sup value is at most 78.44. The experimental results show that the proposed MiFI-Outlier approach can accurately discover the implicit outliers from public datasets, and that the min_sup value should be set relatively large to improve the detection accuracy.
5.3. Time cost of MiFI-Outlier approach on synthetic dataset
In this subsection, the time cost of the six compared approaches on the synthetic dataset is evaluated under different min_sup values and different sizes of sliding window, and the experimental results are shown in Fig. 8 to Fig. 10.
It can be seen from Fig. 8 that when |SW| is set to 30, the time cost of the proposed MiFI-Outlier approach is always the lowest among the six compared approaches, while the time cost of the DPA approach is the highest regardless of the min_sup value. When the size of sliding window is constant, the time cost of the five itemset-based approaches (all except DPA) shows a decreasing trend as the min_sup value increases; the decrease is very large for the FindFPOF approach from small to relatively large min_sup values, while it is very limited for the MiFI-Outlier approach. The reason is that with small min_sup values, the scale of frequent 1-itemsets is relatively large, so the time cost of the “pattern extension” process is also very long; as the min_sup value increases, the scale of frequent 1-itemsets becomes much smaller, so the time cost of the “pattern extension” process becomes shorter (the itemset mining process is the most time-consuming phase of the MiFI-Outlier approach). In addition, the time cost of the proposed MiFI-Outlier approach is very stable in every sliding window, and with the increase of the min_sup value, the time cost of the MIFPOD approach becomes the second lowest.
It can be seen from Fig. 9 that when |SW| is set to 40, the time cost of the proposed MiFI-Outlier approach is also the lowest among the six compared approaches, and the time cost of the distance-based outlier detection approach DPA is much higher than that of the itemset-based outlier detection approaches (FindFPOF, OODFP, MIFPOD, FIM-UDSOD and MiFI-Outlier). For DPA, the time cost is not influenced by the min_sup value but by the size of sliding window: a bigger sliding window contains more transactions, so calculating the distance of each transaction takes much longer. When the min_sup value is set slightly small, the time cost of the MIFPOD, OODFP and FIM-UDSOD approaches is very close, but with the increase of min_sup values, the time cost of the MIFPOD approach is much lower
Fig.8. Time cost of MiFI-Outlier approach when |SW| is 30. Panels: (a) min_sup=9.9, (b) min_sup=10.2, (c) min_sup=10.5, (d) min_sup=10.8, (e) min_sup=11.1, (f) min_sup=11.4. x-axis: No. of sliding window; y-axis: Time cost (Sec.). Compared approaches: MiFI-Outlier, OODFP, FindFPOF, MIFPOD, FIM-UDSOD, DPA.
Fig.9. Time cost of MiFI-Outlier approach when |SW| is 40. Panels: (a) min_sup=13.2, (b) min_sup=13.6, (c) min_sup=14.0, (d) min_sup=14.4, (e) min_sup=14.8, (f) min_sup=15.2. x-axis: No. of sliding window; y-axis: Time cost (Sec.). Compared approaches: MiFI-Outlier, OODFP, FindFPOF, MIFPOD, FIM-UDSOD, DPA.
Fig.10. Time cost of MiFI-Outlier approach when |SW| is 50. Panels: (a) min_sup=16.5, (b) min_sup=17.0, (c) min_sup=17.5, (d) min_sup=18.0, (e) min_sup=18.5, (f) min_sup=19.0. x-axis: No. of sliding window; y-axis: Time cost (Sec.). Compared approaches: MiFI-Outlier, OODFP, FindFPOF, MIFPOD, FIM-UDSOD, DPA.
than that of the other two approaches. The reason is that the matrix structure removes much meaningless time cost in the “pattern extension” process, so the reduction in time cost is relatively larger. Similarly, when the size of sliding window is constant, the time cost of the five itemset-based approaches (all except DPA) decreases slightly as the min_sup value increases, and the reduction is the largest for the FindFPOF approach.
Fig. 10(a) to Fig. 10(f) show the time cost of the six compared outlier detection approaches when the min_sup value is set to 16.5, 17.0, 17.5, 18.0, 18.5 and 19.0 respectively, where the size of sliding window is set to 50. It can be seen from Fig. 10 that the time cost of the MiFI-Outlier approach is also much lower than that of the other five compared approaches, and the time cost of the OODFP, MIFPOD and FIM-UDSOD approaches is very close regardless of the min_sup value. With the use of the “item cap” and “support cap” concepts, the time cost of the minimal infrequent itemset mining operations is much reduced compared with the traditional Apriori-based and FP-Growth-based approaches; thus, the time cost of the MiFI-Outlier approach is the lowest.
In general, the time cost of the proposed MiFI-Outlier approach is more competitive than that of the other four itemset-based outlier detection approaches, and its time efficiency is also much higher than that of the distance-based outlier detection approach DPA. The experimental results indicate that the MiFI-Outlier approach can be used to discover the implicit outliers from an uncertain data stream owing to its high time efficiency.
5.4. Time cost of minimal infrequent itemset mining approaches
For the proposed minimal infrequent itemset-based outlier detection approach, minimal infrequent itemset mining is the most time-consuming phase of the whole detection process. Thus, this subsection mainly tests the time cost of our two proposed minimal infrequent itemset mining approaches, namely MiFI-UDSM and MiFI-UDSM*, thereby verifying the efficiency of the proposed outlier detection approach. In this experiment, four public datasets, including three dense datasets (mushroom, pumsb* and chess) and one sparse dataset (kosarak), are used as the target datasets. The experiment is also conducted under different sizes of sliding window and different min_sup values, where the sizes of sliding window are set to 10, 20, 30, 40, 50, 60, 70, 80, 90 and 100 respectively, and the min_sup values are set to 3, 4 and 5. The experimental results are shown in Fig. 11 to Fig. 14.
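To make the mining task of this subsection concrete, the sketch below enumerates minimal infrequent itemsets level by level on one window of uncertain transactions. It is a naive baseline for illustration only, not MiFI-UDSM or MiFI-UDSM*: it uses no matrix structure, no “item cap” and no “support cap”. It assumes the common expected-support model, in which the support of an itemset is the sum over transactions of the product of its item probabilities; all function names and the toy window are hypothetical.

```python
from itertools import combinations
from math import prod


def expected_support(itemset, transactions):
    """Sum, over the transactions containing every item of the itemset,
    of the product of the item probabilities (assumed support model)."""
    total = 0.0
    for t in transactions:
        if all(item in t for item in itemset):
            total += prod(t[item] for item in itemset)
    return total


def minimal_infrequent_itemsets(transactions, min_sup):
    """Return {itemset: support} for every itemset that is infrequent
    while all of its proper subsets are frequent."""
    items = sorted({item for t in transactions for item in t})
    level = [frozenset([item]) for item in items]
    size = 1
    mifis = {}
    while level:
        frequent = []
        for candidate in level:
            sup = expected_support(candidate, transactions)
            if sup >= min_sup:
                frequent.append(candidate)
            else:
                # Every subset of a candidate is frequent by construction,
                # so an infrequent candidate is minimal.
                mifis[candidate] = sup
        frequent_set = set(frequent)
        next_level = set()
        # "Pattern extension": join frequent itemsets of the current size and
        # keep only unions whose subsets of that size are all frequent.
        for a, b in combinations(frequent, 2):
            union = a | b
            if len(union) == size + 1 and all(
                frozenset(sub) in frequent_set
                for sub in combinations(union, size)
            ):
                next_level.add(union)
        level = list(next_level)
        size += 1
    return mifis


# Toy window of uncertain transactions: item -> existential probability.
window = [
    {"a": 1.0, "b": 1.0},
    {"a": 1.0, "c": 1.0},
    {"b": 1.0, "c": 1.0},
    {"a": 1.0, "b": 1.0, "c": 0.5},
]
result = minimal_infrequent_itemsets(window, min_sup=2.0)
for itemset, sup in result.items():
    print(sorted(itemset), sup)
```

In this toy window all single items and {a, b} are frequent, so the only minimal infrequent itemsets are {a, c} and {b, c}. The number of extension candidates grows with the number of frequent itemsets at each level, which is the cost that the discussion of “pattern extension” in this section refers to.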
Fig.11. Time cost of minimal infrequent itemset mining phase on dataset mushroom. Panels: (a) min_sup=3, (b) min_sup=4, (c) min_sup=5. x-axis: Size of sliding window; y-axis: Time cost (Sec.). Compared approaches: MIP-DS, MiFI-UDSM, MiFI-UDSM*, DSUF-min, MRG-Exp, RP-Tree.
Fig.12. Time cost of minimal infrequent itemset mining phase on dataset pumsb*. Panels: (a) min_sup=3, (b) min_sup=4, (c) min_sup=5. x-axis: Size of sliding window; y-axis: Time cost (Sec.). Compared approaches: MIP-DS, MiFI-UDSM, MiFI-UDSM*, DSUF-min, MRG-Exp, RP-Tree.
Fig.13. Time cost of minimal infrequent itemset mining phase on dataset chess. Panels: (a) min_sup=3, (b) min_sup=4, (c) min_sup=5. x-axis: Size of sliding window; y-axis: Time cost (Sec.). Compared approaches: MIP-DS, MiFI-UDSM, MiFI-UDSM*, DSUF-min, MRG-Exp, RP-Tree.
Fig.14. Time cost of minimal infrequent itemset mining phase on dataset kosarak. Panels: (a) min_sup=3, (b) min_sup=4, (c) min_sup=5. x-axis: Size of sliding window; y-axis: Time cost (Sec.). Compared approaches: MIP-DS, MiFI-UDSM, MiFI-UDSM*, DSUF-min, MRG-Exp, RP-Tree.
It can be seen from Fig. 11 to Fig. 13 that on the dense datasets mushroom, pumsb* and chess, the time cost of our proposed MiFI-UDSM and MiFI-UDSM* approaches is lower than that of the other four compared approaches under different sizes of sliding window and different min_sup values, while the time cost of the MRG-Exp approach is the highest in most windows. Compared with the MiFI-UDSM approach, the time cost of the MiFI-UDSM* approach is much lower, because the use of “item cap” and “support cap” removes much meaningless time cost in the “pattern extension” operations and support value calculation operations; thus, the time cost of the minimal infrequent itemset mining operations is also reduced. Under a constant size of sliding window, the time cost of the six compared approaches shows a decreasing trend as the min_sup value increases; the reason is that the scale of potential frequent itemsets with large min_sup values is relatively smaller than with small min_sup values, so the time cost of the “pattern extension” process is also reduced. Under a constant min_sup value, the time cost of the six compared approaches shows an increasing trend as the size of sliding window increases; the reason is that in big windows the scale of potential frequent itemsets is larger than in small windows, so much more time is consumed by the “pattern extension” operations. Compared with the Apriori-based approach (MRG-Exp), the use of the matrix structure (in MIP-DS, MiFI-UDSM and MiFI-UDSM*) reduces some time cost of the minimal infrequent itemset mining operations, because the matrix allows the subsequent itemset mining operations to be conducted on the constructed matrix structure without scanning the dataset multiple times. The experimental results on dense datasets show that the mining efficiency of the proposed MiFI-UDSM* approach is very competitive with that of the MRG-Exp, RP-Tree, MIP-DS and DSUF-min approaches; thus, it can be used in outlier detection for detecting the implicit outliers from dense uncertain data streams.
It can be seen from Fig. 14 that on the sparse dataset kosarak, the time cost of the proposed MiFI-UDSM and MiFI-UDSM* approaches is lower than that of the other four compared approaches. In addition, the time cost of the MiFI-UDSM* approach is also much lower than that of the MiFI-UDSM approach; the reason is that before the frequent 1-itemsets are extended to 2-itemsets, the use of “item cap” and “support cap” has deleted many safe infrequent 1-itemsets, so the number of frequent 1-itemsets that enter the “pattern extension” operations is much reduced, which saves much time in the whole itemset mining process. On the sparse dataset kosarak, the time cost of the FP-Growth-based RP-Tree approach is slightly higher than that of the MRG-Exp approach under different sizes of sliding window and different min_sup values. When the size of sliding window is constant, the time cost of the six compared approaches shows a decreasing trend as the min_sup value increases, but the decrease is not large. The reason is that with small min_sup values the number of frequent 1-itemsets is larger, so the time cost of the “pattern extension” operation is longer; on sparse datasets, however, the scale of frequent 1-itemsets is not very large in itself, so the number of frequent 1-itemsets changes only a little as the min_sup value increases. When the min_sup value is constant, the time cost of the six compared approaches shows an increasing trend as the size of sliding window increases, because the scale of frequent 1-itemsets is much larger in big windows. Similarly, the use of “item cap” and “support cap” on the sparse dataset also has a positive impact on the minimal infrequent itemset mining process. Thus, the proposed MiFI-UDSM* can also be used in outlier detection for detecting the implicit outliers from sparse uncertain data streams.
5.5. Memory usage of minimal infrequent itemset mining approaches
In addition to the time cost, the memory usage is another important index reflecting the efficiency of minimal infrequent itemset mining. Thus, this subsection tests the memory usage of the proposed MiFI-UDSM and MiFI-UDSM* approaches under different sizes of sliding window and different min_sup values. The experimental results are shown in Fig. 15 to Fig. 18.
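Since the experiments are implemented in Python, one possible way to collect the two efficiency indices of a mining routine (time cost in seconds and peak memory in MB) is sketched below with the standard-library modules `time` and `tracemalloc`. The measurement procedure actually used in the experiments is not described in this section, so this is only an assumed harness; `measure` and the stand-in workload are hypothetical.

```python
import time
import tracemalloc


def measure(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds, peak_megabytes).

    tracemalloc tracks only memory allocated by Python while tracing is on,
    and tracing itself slows execution down, so time and memory are better
    measured in separate runs when precise timings matter."""
    tracemalloc.start()
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak_bytes / (1024 * 1024)


def stand_in_miner(n):
    """Placeholder workload standing in for an itemset mining call."""
    return [i * i for i in range(n)]


output, seconds, peak_mb = measure(stand_in_miner, 100_000)
print(f"time cost: {seconds:.4f} s, peak memory: {peak_mb:.2f} MB")
```

Calling such a harness once per sliding window and per min_sup value would yield the kind of per-window series plotted in the time cost and memory usage figures.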
[Figure: three line plots of memory usage (MB) versus size of sliding window (10 to 100) for MIP-DS, MiFI-UDSM, MiFI-UDSM*, DSUF-min, MRG-Exp and RP-Tree; panels (a) min_sup=3, (b) min_sup=4, (c) min_sup=5.]
Fig. 15. Memory usage of minimal infrequent itemset mining phase on dataset mushroom
[Figure: three line plots of memory usage (MB) versus size of sliding window (10 to 100) for MIP-DS, MiFI-UDSM, MiFI-UDSM*, DSUF-min, MRG-Exp and RP-Tree; panels (a) min_sup=3, (b) min_sup=4, (c) min_sup=5.]
Fig. 16. Memory usage of minimal infrequent itemset mining phase on dataset pumsb*
[Figure: three line plots of memory usage (MB) versus size of sliding window (10 to 100) for MIP-DS, MiFI-UDSM, MiFI-UDSM*, DSUF-min, MRG-Exp and RP-Tree; panels (a) min_sup=3, (b) min_sup=4, (c) min_sup=5.]
Fig. 17. Memory usage of minimal infrequent itemset mining phase on dataset chess
[Figure: three line plots of memory usage (MB) versus size of sliding window (10 to 100) for MIP-DS, MiFI-UDSM, MiFI-UDSM*, DSUF-min, MRG-Exp and RP-Tree; panels (a) min_sup=3, (b) min_sup=4, (c) min_sup=5.]
Fig. 18. Memory usage of minimal infrequent itemset mining phase on dataset kosarak
It can be seen from Fig. 18 that on the sparse dataset kosarak, the memory usage of the proposed MiFI-UDSM approach is also the lowest among the six compared approaches, while the memory usage of the RP-Tree approach is slightly larger than that of the other five approaches for most sliding window sizes. The reason is that on sparse datasets the number of frequent 1-itemsets is relatively small unless the min_sup value is set particularly small, so the time cost of the infrequent itemset mining operations is also very limited. Fig. 18(a) shows that when the min_sup value is set to 3, the largest memory usage of the six compared approaches reaches 450 MB; when the min_sup value is set to 4, it reaches 430 MB (Fig. 18(b)); and when the min_sup value is set to 5, it reaches 400 MB (Fig. 18(c)). That is, on the sparse dataset kosarak, increasing the min_sup value reduces the memory usage of the six compared approaches only slightly, because the number of frequent 1-itemsets is already very small when the min_sup value is set to 3. When the sliding window size is fixed, the memory usage of the proposed MiFI-UDSM and MiFI-UDSM* approaches decreases slightly as the min_sup value increases; when the min_sup value is fixed, the memory usage of the proposed approaches increases with the sliding window size. The experimental results also verify that the “item cap” and “support cap” reduce memory usage when mining minimal infrequent itemsets from sparse datasets; thus, the proposed MiFI-UDSM* approach can be used to quickly and accurately detect implicit outliers in sparse uncertain data streams.
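The dependence on window size and min_sup described above can be illustrated with a toy sliding window that keeps the expected support of each single item up to date as transactions arrive and expire. This is only a sketch under the expected-support assumption; the class `SlidingWindow` and its methods are hypothetical and do not reproduce the data structures of MiFI-UDSM or MiFI-UDSM*.

```python
from collections import defaultdict, deque


class SlidingWindow:
    """Fixed-size window over an uncertain transaction stream (illustrative)."""

    def __init__(self, size):
        self.size = size
        self.transactions = deque()
        self.support = defaultdict(float)  # item -> expected support in window

    def push(self, txn):
        """Add a transaction (item -> probability), evicting the oldest if full."""
        if len(self.transactions) == self.size:
            old = self.transactions.popleft()
            for item, p in old.items():
                self.support[item] -= p
                if self.support[item] < 1e-12:  # item no longer in the window
                    del self.support[item]
        self.transactions.append(txn)
        for item, p in txn.items():
            self.support[item] += p

    def frequent_items(self, min_sup):
        """Items whose expected support in the current window is >= min_sup."""
        return sorted(i for i, s in self.support.items() if s >= min_sup)


win = SlidingWindow(size=3)
stream = [
    {"a": 0.9, "b": 0.6},
    {"a": 0.8},
    {"a": 0.7, "b": 0.5},
    {"c": 1.0},
]
for txn in stream:
    win.push(txn)
print(win.frequent_items(min_sup=1.0))
```

A larger `size` keeps more transactions (and usually more distinct items) in memory, while a larger `min_sup` leaves fewer frequent 1-itemsets to extend, which matches the direction of the trends reported for the compared approaches.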
It can be seen from Fig. 15 that on the dataset mushroom, when the min_sup value is set to 3, the memory usage of the proposed MiFI-UDSM and MiFI-UDSM* approaches is slightly lower than that of the MRG-Exp and MIP-DS approaches and much lower than that of the DSUF-min and RP-Tree approaches. When the min_sup value is set to 4 or 5, the memory usage of the MiFI-UDSM, MRG-Exp, DSUF-min and MIP-DS approaches is very close, the memory usage of the MiFI-UDSM* approach is lower than that of these approaches, and the memory usage of RP-Tree is much higher. When the min_sup value is fixed, the memory usage of the six compared approaches increases with the sliding window size, but the rate of increase slows down. When the sliding window size is fixed, the memory usage of the six compared approaches decreases as the min_sup value increases, because the number of frequent itemsets decreases as the min_sup value increases.
It can be seen from Fig. 16 and Fig. 17 that on the dense datasets pumsb* and chess, the memory usage of the MiFI-UDSM* approach is the lowest and that of the MiFI-UDSM approach is the second lowest, while the memory usage of the RP-Tree and DSUF-min approaches is much higher than that of the other four compared approaches. When the min_sup value is fixed, the memory usage of the six compared approaches increases with the sliding window size. The reason is that small sliding windows contain fewer frequent 1-itemsets than large ones, so the number of extended itemsets is also smaller and the “pattern extension” process needs little memory. When the sliding window size is fixed, the memory usage of the six compared approaches decreases as the min_sup value increases, because the number of frequent 1-itemsets decreases as the min_sup value increases. The experimental results indicate that the “item cap” and “support cap” reduce memory usage in the minimal infrequent itemset mining process, because the safe infrequent itemsets are discarded directly, which avoids meaningless “pattern extension” operations.
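The role of “pattern extension” in these comparisons can be made concrete with a small level-wise sketch of minimal infrequent itemset mining. It relies only on the standard definition (an itemset is minimal infrequent if it is infrequent and all of its proper subsets are frequent) and on the expected-support assumption; it is not the MiFI-UDSM* algorithm, and the “item cap” and “support cap” are not reproduced here. The function names and the toy window are hypothetical.

```python
from itertools import combinations
from math import prod


def expected_support(itemset, window):
    """Sum over transactions of the product of the items' probabilities."""
    return sum(prod(txn.get(i, 0.0) for i in itemset) for txn in window)


def minimal_infrequent_itemsets(window, min_sup):
    """Level-wise search: only frequent itemsets are extended further."""
    items = sorted({i for txn in window for i in txn})
    frequent, mifis, level = set(), [], []
    for i in items:  # level 1
        cand = (i,)
        if expected_support(cand, window) >= min_sup:
            level.append(cand)
            frequent.add(cand)
        else:
            mifis.append(cand)  # an infrequent single item is minimal
    k = 2
    while level:
        level_items = sorted({i for c in level for i in c})
        next_level = []
        for cand in combinations(level_items, k):
            # Skip candidates with an infrequent subset: they cannot be minimal.
            if not all(sub in frequent for sub in combinations(cand, k - 1)):
                continue
            if expected_support(cand, window) >= min_sup:
                next_level.append(cand)
                frequent.add(cand)
            else:
                mifis.append(cand)  # infrequent, all subsets frequent
        level = next_level
        k += 1
    return mifis


window = [
    {"a": 1.0, "b": 1.0, "c": 1.0},
    {"a": 1.0, "b": 1.0},
    {"a": 1.0, "c": 1.0},
    {"d": 0.5},
]
print(minimal_infrequent_itemsets(window, min_sup=2))
```

In this toy window the itemset (b, c) is infrequent, so (a, b, c) is never generated; discarding itemsets that cannot lead to new minimal infrequent itemsets is what keeps both the running time and the number of stored candidates down.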
5.6. Discussions
In the experiments, the detection accuracy and time cost of the proposed MiFI-Outlier approach were tested under different min_sup values and different sliding window sizes, and the time cost and memory usage of the proposed MiFI-UDSM and MiFI-UDSM* approaches were tested under the same settings. This subsection first discusses the relationships between the min_sup values, the sliding window sizes and the detection accuracy of the proposed MiFI-Outlier approach, and then discusses the relationships between the min_sup values, the sliding window sizes and the time cost and memory usage of the proposed MiFI-UDSM, MiFI-UDSM* and MiFI-Outlier approaches.
For the proposed MiFI-Outlier approach, when the sliding window size is fixed, the detection accuracy is positively correlated with the min_sup value; that is, the detection accuracy increases as the min_sup value increases. The reason is that the number of minimal infrequent itemsets is much larger under large min_sup values, so more itemsets are used in the detection. When the min_sup value is fixed, the detection accuracy of the MiFI-Outlier approach depends only weakly on the sliding window size. In addition, when the sliding window size is fixed, the time cost of the proposed MiFI-Outlier approach decreases as the min_sup value increases, and when the min_sup value is fixed, its time cost increases with the sliding window size.
For the proposed MiFI-UDSM and MiFI-UDSM* approaches, when the sliding window size is fixed, the time cost and memory usage decrease as the min_sup value increases. The reason is that the number of extensible frequent itemsets is much smaller under large min_sup values, so the time cost and memory usage of the “pattern extension” operation are also reduced. When the min_sup value is fixed, the time cost and memory usage of the proposed MiFI-UDSM and MiFI-UDSM* approaches increase with the sliding window size. The reason is that the number of extensible frequent itemsets is larger in large sliding windows, so the time cost and memory usage of the “pattern extension” operation also increase.
In general, the proposed MiFI-Outlier approach is more competitive when the min_sup value is set larger; under relatively large min_sup values, it can discover implicit outliers in uncertain data streams faster and more accurately.
6. Conclusion
Outliers are the main factor affecting data-based prediction and analysis; therefore, outlier detection is urgently needed to improve the reliability of collected datasets. However, outlier detection on high-dimensional uncertain data streams is challenging, and the efficiency of distance-based and density-based outlier detection approaches is not competitive because of their high computational complexity. Starting from the definition of an outlier, namely a data element that appears rarely and differs greatly from most normal data elements, this paper proposes an efficient minimal infrequent itemset-based outlier detection approach, namely MiFI-Outlier, for quickly and accurately discovering implicit outliers in uncertain data streams. MiFI-Outlier consists of a minimal infrequent itemset mining operation and an itemset-based outlier detection operation. In the minimal infrequent itemset mining phase, a matrix structure is first constructed to store the probability of each itemset existing in the current sliding window, and then the matrix-based infrequent itemset mining approach called MiFI-UDSM is proposed to mine the minimal infrequent itemsets from the uncertain data stream. To reduce meaningless “pattern extension” operations, the “item cap” and “support cap” concepts are proposed to reduce the scale of potential extensible frequent itemsets, and an improved approach, namely MiFI-UDSM*, is then proposed to raise the mining efficiency. In the itemset-based outlier detection phase, three deviation indices, namely the minimal infrequent itemset deviation index (MiFIDI), the similarity deviation index (SDI) and the transaction deviation index (TDI), are defined to judge the deviation degree of each transaction, where MiFIDI is used to compute the deviation index of each mined minimal infrequent itemset, and SDI and TDI are used to compute the deviation index of each transaction. Finally, the transactions are sorted in decreasing order of their TDI values, and the transactions with higher TDI values are determined to be outliers.
The performance of the proposed MiFI-Outlier approach is evaluated on a synthetic dataset and a public dataset, and the results show that it outperforms the compared FindFPOF, MIFPOD, OODFP, FIM-UDSOD and DPA approaches in both detection accuracy and time cost. In addition, the performance of the proposed MiFI-UDSM and MiFI-UDSM* approaches is evaluated on four public datasets, and the results show that on both dense and sparse datasets, their time cost and memory usage are more competitive than those of the compared MIP-DS, DSUF-min, MRG-Exp and RP-Tree approaches, although on the sparse datasets the improvement of the MiFI-UDSM* approach in time cost and memory usage is somewhat limited. Overall, the proposed MiFI-Outlier provides a good solution for outlier detection on uncertain data streams, and its detection accuracy is much higher under large min_sup values.
In real life, people tend to care about whether there are outliers in the small portion of the data that meets their constraints, rather than in the entire dataset. However, how to quickly and accurately discover the outliers in the part of a data stream that satisfies the constraints is a new challenge. Thus, in the future, we will study infrequent itemset-based outlier detection approaches for discovering implicit outliers in constrained uncertain data streams. In addition, concept drift [26-28] appears more and more frequently in data streams and causes great difficulty for outlier detection, so in future work concept drift detection and outlier detection should be considered together, thereby further improving the credibility of the data stream.
Acknowledgments
This work was supported in part by the Chinese Universities Scientific Fund under Grant No. 2017XD001 and the Fundamental Research Funds for the Central Universities under Grant No. 2018XD004.
References
[1] R. Agrawal, R. Srikant, Fast algorithms for mining association rules, in: 20th International Conference on Very Large Data Bases, 1994, pp. 487-499.
[2] M. Bai, X. Wang, J. Xin, G.R. Wang, An efficient algorithm for distributed density-based outlier detection on big data, Neurocomputing 181 (2016) 19-28.
[3] L. Cagliero, P. Garza, Infrequent weighted itemset mining using frequent pattern growth, IEEE Transactions on Knowledge and Data Engineering 26(4) (2014) 903-915.
[4] S.H. Cai, R.Z. Sun, J.Y. Li, C. Deng, S.C. Li, Abnormal Detecting over Data Stream Based on Maximal Pattern Mining Technology, in: CCF Conference on Computer Supported Cooperative Work and Social Computing, 2018, pp. 371-385.
[5] S.H. Cai, R.Z. Sun, S.B. Hao, S.C. Li, G. Yuan, Minimal weighted infrequent itemset mining-based outlier detection approach on uncertain data stream, Neural Computing and Applications (2018). https://doi.org/10.1007/s00521-018-3876-4
[6] S.H. Cai, S.B. Hao, R.Z. Sun, G. Wu, Mining Recent Maximal Frequent Itemsets Over Data Streams with Sliding Window, International Arab Journal of Information Technology 16(6) (2019) 961-969.
[7] S.H. Cai, R.Z. Sun, S.B. Hao, S.C. Li, G. Yuan, An Efficient Outlier Detection Approach on Weighted Data Stream Based on Minimal Rare Pattern Mining, China Communications 16(10) (2019) 83-99.
[8] K.Y. Cao, G.R. Wang, D.H. Han, G.H. Ding, A.X. Wang, L.X. Shi, Continuous outlier monitoring on uncertain data streams, Journal of Computer Science and Technology 29(3) (2014) 436-448.
[9] W. Fang, V.S. Sheng, X.Z. Wen, W.B. Pan, Meteorological data analysis using MapReduce, The Scientific World Journal 96 (2014) 27-38.
[10] G.D. Fan, S.H. Yin, A frequent itemsets mining algorithm based on matrix in sliding window over data streams, in: 3rd International Conference on Intelligent System Design and Engineering Applications (ISDEA), 2013, pp. 66-69.
[11] D.J. Haglin, A.M. Manning, On Minimal Infrequent Itemset Mining, in: 7th International Conference on Data Mining, 2007, pp. 141-147.
[12] J. Han, J. Pei, Y. Yin, R. Mao, Mining frequent patterns without candidate generation: A frequent-pattern tree approach, Data Mining and Knowledge Discovery 8(1) (2004) 53-87.
[13] M. Han, J. Ding, J. Li, TDMCS: An Efficient Method for Mining Closed Frequent Patterns over Data Streams Based on Time Decay Model, International Arab Journal of Information Technology 14(6) (2017) 851-860.
[14] S.B. Hao, S.H. Cai, R.Z. Sun, S.C. Li, An Efficient Outlier Detection Approach Over Uncertain Data Stream Based on Frequent Itemset Mining, Journal of Information Technology and Control 48(1) (2019) 34-46.
[15] D.M. Hawkins, Identification of outliers, 1980, London: Chapman and Hall.
[16] Z.Y. He, X.F. Xu, J.Z. Huang, S.C. Deng, FP-Outlier: Frequent pattern based outlier detection, Computer Science and Information Systems 2(1) (2005) 103-118.
[17] C.S. Hemalatha, V. Vaidehi, R. Lakshmi, Minimal infrequent pattern based approach for mining outliers in data streams, Expert Systems with Applications 42(4) (2015) 1998-2012.
[18] J. Huang, Q. Zhu, L. Yang, D. Cheng, Q. Wu, A novel outlier cluster detection algorithm without top-n parameter, Knowledge-Based Systems 121 (2017) 32-40.
[19] F. Keller, E. Muller, K. Bohm, HiCS: High Contrast Subspaces for Density-Based Outlier Ranking, in: 28th International Conference on Data Engineering, 2012, pp. 1037-1048.
[20] M. Kontaki, A. Gounaris, A.N. Papadopoulos, K. Tsichlas, Y. Manolopoulos, Efficient and flexible algorithms for monitoring distance-based outliers over data streams, Information Systems 55 (2016) 37-53.
[21] G. Lee, U. Yun, A new efficient approach for mining uncertain frequent patterns using minimum data structure without false positives, Future Generation Computer Systems 68 (2017) 89-110.
[22] C.K.S. Leung, M.A. Mateo, D.A. Brajczuk, A tree-based approach for frequent pattern mining from uncertain data, in: Proceeding of the 12th Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2008, pp. 653-661.
[23] C.K.S. Leung, R.K. MacKinnon, F. Jiang, Finding efficiencies in frequent pattern mining from big uncertain data, World Wide Web 20(3) (2017) 571-594.
[24] Y. Lim, U. Kang, Time-weighted counting for recently frequent pattern mining in data streams, Knowledge and Information Systems 53(2) (2017) 391-422.
[25] F. Lin, W. Le, J. Bo, Research on maximal frequent pattern outlier factor for online high dimensional time-series outlier detection, Journal of Convergence Information Technology 5(10) (2010) 66-71.
[26] A.J. Liu, J. Lu, F. Liu, G.Q. Zhang, Accumulating regional density dissimilarity for concept drift detection in data streams, Pattern Recognition 76 (2018) 256-272.
[27] J. Lu, A.J. Liu, F. Dong, F. Gu, J. Gama, G.Q. Zhang, Learning under Concept Drift: A Review, IEEE Transactions on Knowledge and Data Engineering (2018). http://dx.doi.org/10.1109/TKDE.2018.2876857
[28] N. Lu, G.Q. Zhang, J. Lu, Concept drift detection via competence models, Artificial Intelligence 209 (2014) 11-28.
[29] M. Radovanović, A. Nanopoulos, M. Ivanović, Reverse nearest neighbors in unsupervised distance-based outlier detection, IEEE Transactions on Knowledge and Data Engineering 27(5) (2015) 1369-1382.
[30] S. Ramírez-Gallego, B. Krawczyk, S. García, M. Woźniak, F. Herrera, A survey on data preprocessing for data stream mining: current status and future directions, Neurocomputing 239 (2017) 39-57.
[31] Y. Shi, L. Zhang, COID: A cluster–outlier iterative detection approach to multi-dimensional data analysis, Knowledge and Information Systems 28(3) (2011) 709-733.
[32] L. Szathmary, A. Napoli, P. Valtchev, Towards rare itemset mining, in: International Conference on Tools with Artificial Intelligence (ICTAI), 2007, pp. 305-312.
[33] B. Tang, H. He, A local density-based approach for outlier detection, Neurocomputing 241 (2017) 171-180.
[34] L. Troiano, G. Scibelli, A time-efficient breadth-first level-wise lattice-traversal algorithm to discover rare itemsets, Data Mining and Knowledge Discovery 28(3) (2014) 773-807.
[35] S. Tsang, Y.S. Koh, G. Dobbie, RP-Tree: Rare Pattern Tree Mining, in: Proceedings of the 13th International Conference on Data Warehousing and Knowledge Discovery, 2011, pp. 277-288.
[36] B. Wang, X.C. Yang, G.R. Wang, G. Yu, Outlier detection over sliding windows for probabilistic data streams, Journal of Computer Science and Technology 25(3) (2010) 389-400.
[37] I.M. Wagner-Muns, I.G. Guardiola, V.A. Samaranayke, W.I. Kayani, A Functional Data Analysis Approach to Traffic Volume Forecasting, IEEE Transactions on Intelligent Transportation Systems 19(3) (2018) 878-888.
[38] K. Xu, K. Zou, Y. Huang, X. Yu, X.F. Zhang, Mining community and inferring friendship in mobile social networks, Neurocomputing 174 (2016) 605-616.
[39] G. Yang, The complexity of mining maximal frequent itemsets and maximal frequent patterns, in: Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004, pp. 344-353.
[40] Y. Yang, C. Yang, Wei. Y, Frequent pattern mining algorithm for uncertain data streams based on sliding window, in: Proceeding of the 8th IEEE International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC), 2016, pp. 265-268.
[41] J.X. Yu, Z. Chong, H. Lu, Z. Zhang, A. Zhou, A false negative approach to mining frequent itemsets from high speed transactional data streams, Information Sciences 176(14) (2006) 1986-2015.
[42] U. Yun, D. Kim, E. Yoon, H. Fujita, Damped Window based High Average Utility Pattern Mining over data streams, Knowledge-Based Systems 144 (2018) 188-205.
*Author Contributions Section
Saihua Cai: Methodology, Experimental verification, Writing - original draft, Writing - review & editing. Sicong Li: Experimental verification, Writing - review & editing. Gang Yuan: Experimental verification, Formal analysis. Shangbo Hao: Experimental verification. Ruizhi Sun: Writing - review & editing.
*Conflict of Interest Form
Conflict of interest
We declare that we have no conflicts of interest in this work. We declare that we do not have any commercial or associative interest that represents a conflict of interest in connection with the work submitted.
Ruizhi Sun, on behalf of Saihua Cai, Shangbo Hao, Sicong Li and Gang Yuan